Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

dnadlinger · 2019-10-28T10:01:28Z

Bug Report

One-Line Summary

The core device (current master) crashes every once in a while with a Utf8Error.

Issue Details

This is a bit hard to reproduce, but our core devices currently crash fairly reliably after a while (maybe once every day or two) with a Utf8Error. The backtrace is as follows:

panic at src/libcore/result.rs:945:5called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 42245, error_len: Some(1) }
backtrace for software version 5.0.dev+1412.g645e8bc5;master:
0x40041a70
0x4001abfc
0x40003088
0x40002ac4
0x4001e47c
0x4001cec4

->

/home/bb/worker/kasli_hoa2_master/build/artiq/firmware/libunwind_backtrace/lib.rs:42
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/panicking.rs:92
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/macros.rs:26
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/result.rs:782
/home/bb/worker/kasli_hoa2_master/build/artiq/firmware/runtime/mgmt.rs:46
runtime0-d160d8e370f2c960423fd88aba1a6961.rs:?

This is in the core log forwarding handler. At this point, I'm not sure what the actual issue is: whether, say, memory corruption leads to invalid UTF-8 or reading beyond the initialised buffer, or e.g. a long backlog just gets truncated at a multi-byte boundary.

Clearly, crashing in the log forwarding code isn't very helpful, though, as it obscures the actual problem. I'd argue that we probably shouldn't be validating UTF-8 here at all for performance, but at the very least we should be inserting replacement characters (U+FFFD) for invalid UTF-8 rather than panicking (and thus eliminating any trace as to what could be going on).

Your System (omit irrelevant parts)

ARTIQ version: 5.0 dev (in sync with f2f7170)
Hardware involved: Kasli v1.1

The text was updated successfully, but these errors were encountered:

whitequark · 2019-10-28T12:31:37Z

It might be this line, but please decode the backtrace. The idea is that valid UTF-8 goes in, any partially overwritten characters are left out, valid UTF-8 goes out, so that unwrap should never trigger.

dnadlinger · 2019-10-28T13:21:00Z

but please decode the backtrace.

Ah, the second chunk of lines in the above is already the decoded backtrace.

whitequark · 2019-10-28T13:41:39Z

Right, OK. I'm not sure what causes this. It'd be necessary to apply a local patch and take a look at the data, unless you see any obvious bugs in the log_buffer code.

dnadlinger · 2019-12-17T19:51:00Z

This turns out to be easy to reproduce, or at least an adjacent bug is: Just print a non-ASCII character from the kernel, e.g.

from artiq.experiment import *

class UnicodeKernelLog(EnvExperiment):
	def build(self):
		self.setattr_device("core")

	@kernel
	def run(self):
		print("σ")

dnadlinger · 2019-12-17T19:52:18Z

(This lead to:)

panic at src/libcore/result.rs:945:5called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: None }

dnadlinger · 2019-12-17T20:14:37Z

The generated IR seems obviously wrong:

@S.nn = private unnamed_addr constant [3 x i8] c"n:n"
@S.sn = private unnamed_addr constant [3 x i8] c"s:n"
@S. = private unnamed_addr constant [2 x i8] c"\CF\83"
@typeinfo = local_unnamed_addr global [1 x %D.11*] zeroinitializer

[…]

define private fastcc void @_Z49artiq_run_unicode_kernel_log.UnicodeKernelLog.runzz() unnamed_addr #0 personality i32 (...)* @__artiq_personality !dbg !11 {
entry:
  %rpc.ret.alloc = alloca {}, align 4, !dbg !13
  %.9 = alloca { i8*, i32 }, align 4, !dbg !13
  %.9.repack = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %.9, i32 0, i32 0, !dbg !13
  store i8* getelementptr inbounds ([3 x i8], [3 x i8]* @S.sn, i32 0, i32 0), i8** %.9.repack, align 4, !dbg !13
  %.9.repack1 = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %.9, i32 0, i32 1, !dbg !13
  store i32 3, i32* %.9.repack1, align 4, !dbg !13
  %rpc.stack = call i8* @llvm.stacksave(), !dbg !13
  %rpc.args = alloca i8*, align 4, !dbg !13
  %rpc.arg0 = alloca { i8*, i32 }, align 4, !dbg !13
  %rpc.arg0.repack = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %rpc.arg0, i32 0, i32 0, !dbg !13
  store i8* getelementptr inbounds ([2 x i8], [2 x i8]* @S., i32 0, i32 0), i8** %rpc.arg0.repack, align 4, !dbg !13
  %rpc.arg0.repack2 = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %rpc.arg0, i32 0, i32 1, !dbg !13
  store i32 1, i32* %rpc.arg0.repack2, align 4, !dbg !13
  %0 = bitcast i8** %rpc.args to { i8*, i32 }**, !dbg !13
  store { i8*, i32 }* %rpc.arg0, { i8*, i32 }** %0, align 4, !dbg !13
  call void @rpc_send(i32 2, { i8*, i32 }* nonnull %.9, i8** nonnull %rpc.args), !dbg !13

σ should be U+03C3, and is ~~0xCE 0xA3~~0xCF 0x83 in UTF-8.

whitequark · 2019-12-17T20:44:42Z

σ should be U+03C3, and is 0xCE 0xA3 in UTF-8.

Python disagrees:

>>> b"\xcf\x83".decode('utf-8')
'σ'

dnadlinger · 2019-12-17T20:50:15Z

Ah, interesting; then just the length is wrong.

Edit: Fixed the above; I had accidentally used upper-case sigma (U+03A3). If I read the IR correctly, it's still a byte short, though.

dnadlinger · 2019-12-17T21:20:55Z

It still seems wrong that prints from the kernel CPU (invalid UTF-8 or not) can crash the comms CPU. Perhaps the Request::PullLog handler should just forward the bytes, and let the client code deal with the fallout? Inserting replacement characters at this stage is a little annoying to implement (string length changes, so we'd need to transcode directly into the socket buffer).

whitequark · 2019-12-17T21:24:17Z

Perhaps the Request::PullLog handler should just forward the bytes, and let the client code deal with the fallout?

Inserting invalid UTF-8 into a Rust &str is UB, so you'd have to significantly restructure the entire firmware somehow.

dnadlinger · 2019-12-17T23:06:11Z

Inserting invalid UTF-8 into a Rust &str is UB, so you'd have to significantly restructure the entire firmware somehow.

Hmm, wouldn't all that is required be to not needlessly convert user data that is just ferried between host and kernel from &[u8] to str (and variants)? In other words, treat it as untrusted?

Something along these lines (incomplete):

--- a/artiq/coredevice/comm_kernel.py
+++ b/artiq/coredevice/comm_kernel.py
@@ -161,7 +161,7 @@ class CommKernel:
         return self._read_chunk(self._read_int32())

     def _read_string(self):
-        return self._read_bytes().decode("utf-8")
+        return self._read_bytes().decode("utf-8", "replace")  # XXX: Actually handle here?

     #
     # Writer interface
diff --git a/artiq/firmware/libproto_artiq/rpc_proto.rs b/artiq/firmware/libproto_artiq/rpc_proto.rs
index c0bab177..e770d0e4 100644
--- a/artiq/firmware/libproto_artiq/rpc_proto.rs
+++ b/artiq/firmware/libproto_artiq/rpc_proto.rs
@@ -118,7 +118,7 @@ unsafe fn send_value<W>(writer: &mut W, tag: Tag, data: &mut *const ())
                 writer.write_u64(*ptr)),
         Tag::String =>
             consume_value!(CSlice<u8>, |ptr|
-                writer.write_string(str::from_utf8((*ptr).as_ref()).unwrap())),
+                writer.write_bytes((*ptr).as_ref())),
         Tag::Bytes | Tag::ByteArray =>
             consume_value!(CSlice<u8>, |ptr|
                 writer.write_bytes((*ptr).as_ref())),
--- a/artiq/firmware/liblogger_artiq/lib.rs
+++ b/artiq/firmware/liblogger_artiq/lib.rs
@@ -31,7 +31,7 @@ impl<'a> LogBufferRef<'a> {
         self.buffer.clear()
     }

-    pub fn extract(&mut self) -> &str {
+    pub fn extract(&mut self) -> &[u8] {
         self.buffer.extract()
     }
 }
diff --git a/artiq/firmware/libproto_artiq/mgmt_proto.rs b/artiq/firmware/libproto_artiq/mgmt_proto.rs
index ffa2a976..13cddca1 100644
--- a/artiq/firmware/libproto_artiq/mgmt_proto.rs
+++ b/artiq/firmware/libproto_artiq/mgmt_proto.rs
@@ -82,7 +82,7 @@ pub enum Reply<'a> {
     Error,
     Unavailable,

-    LogContent(&'a str),
+    LogContent(&'a [u8]),

     ConfigData(&'a [u8]),

@@ -166,7 +166,7 @@ impl<'a> Reply<'a> {

             Reply::LogContent(ref log) => {
                 writer.write_u8(2)?;
-                writer.write_string(log)?;
+                writer.write_bytes(log)?;
             }

             Reply::ConfigData(ref bytes) => {
diff --git a/artiq/firmware/runtime/mgmt.rs b/artiq/firmware/runtime/mgmt.rs
index c03682e8..7a691b6e 100644
--- a/artiq/firmware/runtime/mgmt.rs
+++ b/artiq/firmware/runtime/mgmt.rs
@@ -43,7 +43,7 @@ fn worker(io: &Io, stream: &mut TcpStream) -> Result<(), Error<SchedError>> {
                         let mut buffer = io.until_ok(|| logger.buffer())?;
                         if buffer.is_empty() { continue }

-                        stream.write_string(buffer.extract())?;
+                        stream.write_bytes(buffer.extract())?;

                         if log_level == LevelFilter::Trace {
                             // Hold exclusive access over the logger until we get positive

Will finish this up tomorrow…

dnadlinger · 2019-12-17T23:28:09Z

(NB: This issue now conflates three separate issues: Handling of kernel-generated strings on the comm CPU, the print() RPC literal argument issue, and whatever else (truncation/corruption?) the origin of the log forwarding crash might be.)

thomasfire · 2022-11-10T03:59:53Z

Is there a way to reproduce the origin log forwarding crash?

dnadlinger · 2022-11-10T19:11:51Z

Fixed the typo in the above message; I simply meant the origin of the log forwarding crashes we have been seeing.

To reproduce it, you'd need some way of generating an ARTIQ Python string that's not valid UTF-8, and then just log it. You could get it by running my above test case without your #1990, or perhaps generating an invalid string in a class attribute or RPC return value (though those might have some checks in place).

As for the actual root cause of the crashes we've been sporadically seeing (i.e. the reason for the data to be invalid UTF-8 to begin with, not why it causes crashes), I'm not sure. There are, unfortunately, several soundness bugs in the ARTIQ Python memory lifetime analysis, so it could e.g. be memory corruption bugs caused by any of these. For instance, IIRC of them was related to corruption of exception messages when catching/re-raising exceptions before the rewrite that made them statically fixed. Ideally, there should never be any invalid strings in ARTIQ Python, of course, but in practice, I'd argue that the runtime crashing without any (even partial) indication of the source of the invalid data is much worse than just passing it through to the host (which is faster as well!), and dealing with it there (e.g. by inserting replacement characters).

thomasfire · 2022-11-11T02:07:00Z

passing it through to the host (which is faster as well!), and dealing with it there

I don't think changing that behavior would be meaningful and possible, since logging relies on the third-party's log_buffer, which operates strings only.
Though, I agree there are some memory-related bugs here, that worth investigating.

dnadlinger added area:coredevice area:runtime labels Oct 28, 2019

dnadlinger changed the title ~~Core-log-related Utf8Error crashes~~ Unicode in kernel print()s panics comm CPU (Utf8Error) Dec 17, 2019

mbirtwell mentioned this issue Mar 11, 2022

Unsound parameter lifetime handling #1677

Open

sbourdeauducq assigned thomasfire Oct 10, 2022

thomasfire mentioned this issue Nov 8, 2022

Fix const string's length #1990

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

dnadlinger commented Oct 28, 2019 •

edited

whitequark commented Oct 28, 2019

dnadlinger commented Oct 28, 2019 •

edited

whitequark commented Oct 28, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 •

edited

whitequark commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 •

edited

dnadlinger commented Dec 17, 2019

whitequark commented Dec 17, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 •

edited

thomasfire commented Nov 10, 2022

dnadlinger commented Nov 10, 2022

thomasfire commented Nov 11, 2022

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

Comments

dnadlinger commented Oct 28, 2019 • edited

Bug Report

One-Line Summary

Issue Details

Your System (omit irrelevant parts)

whitequark commented Oct 28, 2019

dnadlinger commented Oct 28, 2019 • edited

whitequark commented Oct 28, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 • edited

whitequark commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 • edited

dnadlinger commented Dec 17, 2019

whitequark commented Dec 17, 2019

dnadlinger commented Dec 17, 2019

dnadlinger commented Dec 17, 2019 • edited

thomasfire commented Nov 10, 2022

dnadlinger commented Nov 10, 2022

thomasfire commented Nov 11, 2022

dnadlinger commented Oct 28, 2019 •

edited

dnadlinger commented Oct 28, 2019 •

edited

dnadlinger commented Dec 17, 2019 •

edited

dnadlinger commented Dec 17, 2019 •

edited

dnadlinger commented Dec 17, 2019 •

edited