Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

Open
dnadlinger opened this issue Oct 28, 2019 · 15 comments
Open

Unicode in kernel print()s panics comm CPU (Utf8Error) #1379

dnadlinger opened this issue Oct 28, 2019 · 15 comments

Comments

@dnadlinger
Copy link
Collaborator

dnadlinger commented Oct 28, 2019

Bug Report

One-Line Summary

The core device (current master) crashes every once in a while with a Utf8Error.

Issue Details

This is a bit hard to reproduce, but our core devices currently crash fairly reliably after a while (maybe once every day or two) with a Utf8Error. The backtrace is as follows:

panic at src/libcore/result.rs:945:5called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 42245, error_len: Some(1) }
backtrace for software version 5.0.dev+1412.g645e8bc5;master:
0x40041a70
0x4001abfc
0x40003088
0x40002ac4
0x4001e47c
0x4001cec4

->

/home/bb/worker/kasli_hoa2_master/build/artiq/firmware/libunwind_backtrace/lib.rs:42
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/panicking.rs:92
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/macros.rs:26
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1534014507178/work/src/libcore/result.rs:782
/home/bb/worker/kasli_hoa2_master/build/artiq/firmware/runtime/mgmt.rs:46
runtime0-d160d8e370f2c960423fd88aba1a6961.rs:?

This is in the core log forwarding handler. At this point, I'm not sure what the actual issue is: whether, say, memory corruption leads to invalid UTF-8 or reading beyond the initialised buffer, or e.g. a long backlog just gets truncated at a multi-byte boundary.

Clearly, crashing in the log forwarding code isn't very helpful, though, as it obscures the actual problem. I'd argue that we probably shouldn't be validating UTF-8 here at all for performance, but at the very least we should be inserting replacement characters (U+FFFD) for invalid UTF-8 rather than panicking (and thus eliminating any trace as to what could be going on).

Your System (omit irrelevant parts)

  • ARTIQ version: 5.0 dev (in sync with f2f7170)
  • Hardware involved: Kasli v1.1
@whitequark
Copy link
Contributor

It might be this line, but please decode the backtrace. The idea is that valid UTF-8 goes in, any partially overwritten characters are left out, valid UTF-8 goes out, so that unwrap should never trigger.

@dnadlinger
Copy link
Collaborator Author

dnadlinger commented Oct 28, 2019

but please decode the backtrace.

Ah, the second chunk of lines in the above is already the decoded backtrace.

@whitequark
Copy link
Contributor

Right, OK. I'm not sure what causes this. It'd be necessary to apply a local patch and take a look at the data, unless you see any obvious bugs in the log_buffer code.

@dnadlinger dnadlinger changed the title Core-log-related Utf8Error crashes Unicode in kernel print()s panics comm CPU (Utf8Error) Dec 17, 2019
@dnadlinger
Copy link
Collaborator Author

This turns out to be easy to reproduce, or at least an adjacent bug is: Just print a non-ASCII character from the kernel, e.g.

from artiq.experiment import *

class UnicodeKernelLog(EnvExperiment):
	def build(self):
		self.setattr_device("core")

	@kernel
	def run(self):
		print("σ")

@dnadlinger
Copy link
Collaborator Author

(This lead to:)

panic at src/libcore/result.rs:945:5called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: None }

@dnadlinger
Copy link
Collaborator Author

dnadlinger commented Dec 17, 2019

The generated IR seems obviously wrong:

@S.nn = private unnamed_addr constant [3 x i8] c"n:n"
@S.sn = private unnamed_addr constant [3 x i8] c"s:n"
@S. = private unnamed_addr constant [2 x i8] c"\CF\83"
@typeinfo = local_unnamed_addr global [1 x %D.11*] zeroinitializer

[…]

define private fastcc void @_Z49artiq_run_unicode_kernel_log.UnicodeKernelLog.runzz() unnamed_addr #0 personality i32 (...)* @__artiq_personality !dbg !11 {
entry:
  %rpc.ret.alloc = alloca {}, align 4, !dbg !13
  %.9 = alloca { i8*, i32 }, align 4, !dbg !13
  %.9.repack = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %.9, i32 0, i32 0, !dbg !13
  store i8* getelementptr inbounds ([3 x i8], [3 x i8]* @S.sn, i32 0, i32 0), i8** %.9.repack, align 4, !dbg !13
  %.9.repack1 = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %.9, i32 0, i32 1, !dbg !13
  store i32 3, i32* %.9.repack1, align 4, !dbg !13
  %rpc.stack = call i8* @llvm.stacksave(), !dbg !13
  %rpc.args = alloca i8*, align 4, !dbg !13
  %rpc.arg0 = alloca { i8*, i32 }, align 4, !dbg !13
  %rpc.arg0.repack = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %rpc.arg0, i32 0, i32 0, !dbg !13
  store i8* getelementptr inbounds ([2 x i8], [2 x i8]* @S., i32 0, i32 0), i8** %rpc.arg0.repack, align 4, !dbg !13
  %rpc.arg0.repack2 = getelementptr inbounds { i8*, i32 }, { i8*, i32 }* %rpc.arg0, i32 0, i32 1, !dbg !13
  store i32 1, i32* %rpc.arg0.repack2, align 4, !dbg !13
  %0 = bitcast i8** %rpc.args to { i8*, i32 }**, !dbg !13
  store { i8*, i32 }* %rpc.arg0, { i8*, i32 }** %0, align 4, !dbg !13
  call void @rpc_send(i32 2, { i8*, i32 }* nonnull %.9, i8** nonnull %rpc.args), !dbg !13

σ should be U+03C3, and is 0xCE 0xA30xCF 0x83 in UTF-8.

@whitequark
Copy link
Contributor

σ should be U+03C3, and is 0xCE 0xA3 in UTF-8.

Python disagrees:

>>> b"\xcf\x83".decode('utf-8')
'σ'

@dnadlinger
Copy link
Collaborator Author

dnadlinger commented Dec 17, 2019

Ah, interesting; then just the length is wrong.

Edit: Fixed the above; I had accidentally used upper-case sigma (U+03A3). If I read the IR correctly, it's still a byte short, though.

@dnadlinger
Copy link
Collaborator Author

It still seems wrong that prints from the kernel CPU (invalid UTF-8 or not) can crash the comms CPU. Perhaps the Request::PullLog handler should just forward the bytes, and let the client code deal with the fallout? Inserting replacement characters at this stage is a little annoying to implement (string length changes, so we'd need to transcode directly into the socket buffer).

@whitequark
Copy link
Contributor

Perhaps the Request::PullLog handler should just forward the bytes, and let the client code deal with the fallout?

Inserting invalid UTF-8 into a Rust &str is UB, so you'd have to significantly restructure the entire firmware somehow.

@dnadlinger
Copy link
Collaborator Author

Inserting invalid UTF-8 into a Rust &str is UB, so you'd have to significantly restructure the entire firmware somehow.

Hmm, wouldn't all that is required be to not needlessly convert user data that is just ferried between host and kernel from &[u8] to str (and variants)? In other words, treat it as untrusted?

Something along these lines (incomplete):

--- a/artiq/coredevice/comm_kernel.py
+++ b/artiq/coredevice/comm_kernel.py
@@ -161,7 +161,7 @@ class CommKernel:
         return self._read_chunk(self._read_int32())

     def _read_string(self):
-        return self._read_bytes().decode("utf-8")
+        return self._read_bytes().decode("utf-8", "replace")  # XXX: Actually handle here?

     #
     # Writer interface
diff --git a/artiq/firmware/libproto_artiq/rpc_proto.rs b/artiq/firmware/libproto_artiq/rpc_proto.rs
index c0bab177..e770d0e4 100644
--- a/artiq/firmware/libproto_artiq/rpc_proto.rs
+++ b/artiq/firmware/libproto_artiq/rpc_proto.rs
@@ -118,7 +118,7 @@ unsafe fn send_value<W>(writer: &mut W, tag: Tag, data: &mut *const ())
                 writer.write_u64(*ptr)),
         Tag::String =>
             consume_value!(CSlice<u8>, |ptr|
-                writer.write_string(str::from_utf8((*ptr).as_ref()).unwrap())),
+                writer.write_bytes((*ptr).as_ref())),
         Tag::Bytes | Tag::ByteArray =>
             consume_value!(CSlice<u8>, |ptr|
                 writer.write_bytes((*ptr).as_ref())),
--- a/artiq/firmware/liblogger_artiq/lib.rs
+++ b/artiq/firmware/liblogger_artiq/lib.rs
@@ -31,7 +31,7 @@ impl<'a> LogBufferRef<'a> {
         self.buffer.clear()
     }

-    pub fn extract(&mut self) -> &str {
+    pub fn extract(&mut self) -> &[u8] {
         self.buffer.extract()
     }
 }
diff --git a/artiq/firmware/libproto_artiq/mgmt_proto.rs b/artiq/firmware/libproto_artiq/mgmt_proto.rs
index ffa2a976..13cddca1 100644
--- a/artiq/firmware/libproto_artiq/mgmt_proto.rs
+++ b/artiq/firmware/libproto_artiq/mgmt_proto.rs
@@ -82,7 +82,7 @@ pub enum Reply<'a> {
     Error,
     Unavailable,

-    LogContent(&'a str),
+    LogContent(&'a [u8]),

     ConfigData(&'a [u8]),

@@ -166,7 +166,7 @@ impl<'a> Reply<'a> {

             Reply::LogContent(ref log) => {
                 writer.write_u8(2)?;
-                writer.write_string(log)?;
+                writer.write_bytes(log)?;
             }

             Reply::ConfigData(ref bytes) => {
diff --git a/artiq/firmware/runtime/mgmt.rs b/artiq/firmware/runtime/mgmt.rs
index c03682e8..7a691b6e 100644
--- a/artiq/firmware/runtime/mgmt.rs
+++ b/artiq/firmware/runtime/mgmt.rs
@@ -43,7 +43,7 @@ fn worker(io: &Io, stream: &mut TcpStream) -> Result<(), Error<SchedError>> {
                         let mut buffer = io.until_ok(|| logger.buffer())?;
                         if buffer.is_empty() { continue }

-                        stream.write_string(buffer.extract())?;
+                        stream.write_bytes(buffer.extract())?;

                         if log_level == LevelFilter::Trace {
                             // Hold exclusive access over the logger until we get positive

Will finish this up tomorrow…

@dnadlinger
Copy link
Collaborator Author

dnadlinger commented Dec 17, 2019

(NB: This issue now conflates three separate issues: Handling of kernel-generated strings on the comm CPU, the print() RPC literal argument issue, and whatever else (truncation/corruption?) the origin of the log forwarding crash might be.)

@thomasfire
Copy link
Contributor

Is there a way to reproduce the origin log forwarding crash?

@dnadlinger
Copy link
Collaborator Author

Fixed the typo in the above message; I simply meant the origin of the log forwarding crashes we have been seeing.

To reproduce it, you'd need some way of generating an ARTIQ Python string that's not valid UTF-8, and then just log it. You could get it by running my above test case without your #1990, or perhaps generating an invalid string in a class attribute or RPC return value (though those might have some checks in place).

As for the actual root cause of the crashes we've been sporadically seeing (i.e. the reason for the data to be invalid UTF-8 to begin with, not why it causes crashes), I'm not sure. There are, unfortunately, several soundness bugs in the ARTIQ Python memory lifetime analysis, so it could e.g. be memory corruption bugs caused by any of these. For instance, IIRC of them was related to corruption of exception messages when catching/re-raising exceptions before the rewrite that made them statically fixed. Ideally, there should never be any invalid strings in ARTIQ Python, of course, but in practice, I'd argue that the runtime crashing without any (even partial) indication of the source of the invalid data is much worse than just passing it through to the host (which is faster as well!), and dealing with it there (e.g. by inserting replacement characters).

@thomasfire
Copy link
Contributor

passing it through to the host (which is faster as well!), and dealing with it there

I don't think changing that behavior would be meaningful and possible, since logging relies on the third-party's log_buffer, which operates strings only.
Though, I agree there are some memory-related bugs here, that worth investigating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants