Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler accepts returning allocated types from kernel #1967

Closed
b-bondurant opened this issue Sep 16, 2022 · 10 comments · Fixed by #2470
Closed

Compiler accepts returning allocated types from kernel #1967

b-bondurant opened this issue Sep 16, 2022 · 10 comments · Fixed by #2470
Assignees

Comments

@b-bondurant
Copy link
Contributor

Bug Report

One-Line Summary

String RPC return values (and/or kernel return values?) seem to be getting corrupted, which can indirectly cause a kernel panic.

Issue Details

Steps to Reproduce

Run the following experiment:

from artiq.experiment import *


class BadString(EnvExperiment):
    def build(self):
        self.setattr_device('core')

    @rpc
    def get_string_rpc(self) -> TStr:
        return "a string"

    @kernel
    def get_string(self) -> TStr:
        self.core.reset()
        my_str = self.get_string_rpc()
        # seems to cause the kernel panic
        # print(my_str)
        return my_str

    def run(self):
        print(self.get_string())

Expected Behavior

Prints a string to the console.

Actual (undesired) Behavior

Prints ring to the console. And if you uncomment the print statement in the kernel it causes a kernel panic with the following message:

[ 64082.239740s]  INFO(kernel): panic at /nix/store/2w4h6z0k128aa52r1law0zysi68arp34-python3.9-artiq-7.8123.3038639/lib/python3.9/site-packages/artiq/firmware/libproto_artiq/rpc_proto.rs:209:69: called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: Some(1) }
[ 64082.264133s] ERROR(runtime::session): session aborted: unexpected request RunAborted from kernel CPU

And on the host side:

a string
Traceback (most recent call last):
  File "/nix/store/2w4h6z0k128aa52r1law0zysi68arp34-python3.9-artiq-7.8123.3038639/bin/.artiq_run-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 224, in main
    return run(with_file=True)
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 210, in run
    raise exn
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 203, in run
    exp_inst.run()
  File "bad_string.py", line 21, in run
    print(self.get_string())
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/language/core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/coredevice/core.py", line 140, in run
    self._run_compiled(kernel_library, embedding_map, symbolizer, demangler)
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/coredevice/core.py", line 130, in _run_compiled
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/coredevice/comm_kernel.py", line 706, in serve
    self._read_header()
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/coredevice/comm_kernel.py", line 249, in _read_header
    sync_byte = self._read(1)[0]
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/coredevice/comm_kernel.py", line 237, in _read
    raise ConnectionResetError("Core device connection closed unexpectedly")
ConnectionResetError: Core device connection closed unexpectedly

Strangely, it prints the correct string before panicking. It almost looks like the components work fine independently -- printing a string literal from the kernel works fine, as does returning a string literal from the kernel, and based on the above output it seems that the RPC return value is at least mostly correct when received in the kernel (possibly with some extra non-printable byte(s) that cause the panic?) -- but when used in conjunction this issue appears.

Your System (omit irrelevant parts)

  • Operating System: Ubuntu 22.04
  • ARTIQ version: 7.8123.3038639
  • Version of the gateware and runtime loaded in the core device: 7.8123.3038639
  • Hardware involved: Kasli v1.1 (full device db can be found here)
@b-bondurant
Copy link
Contributor Author

b-bondurant commented Sep 16, 2022

Also holds if I modify the types to bytes:

from artiq.experiment import *


class BadString(EnvExperiment):
    def build(self):
        self.setattr_device('core')

    @rpc
    def get_string_rpc(self) -> TBytes:
        return bytes('a string', encoding='utf-8')

    @kernel
    def get_string(self) -> TBytes:
        self.core.reset()
        my_str = self.get_string_rpc()
        # print(my_str)
        return my_str

    def run(self):
        print(self.get_string().decode('utf-8'))

Output: Bring

However if I uncomment the print statement I get something new:

bytearray(b'a string')
Traceback (most recent call last):
  File "/nix/store/2w4h6z0k128aa52r1law0zysi68arp34-python3.9-artiq-7.8123.3038639/bin/.artiq_run-wrapped", line 9, in <module>
    sys.exit(main())
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 224, in main
    return run(with_file=True)
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 210, in run
    raise exn
  File "/nix/store/7qs5shpgswzik9vsr7rkvz2l7znzmjz0-python3-3.9.13-env/lib/python3.9/site-packages/artiq/frontend/artiq_run.py", line 203, in run
    exp_inst.run()
  File "bad_string.py", line 21, in run
    print(self.get_string().decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 0: invalid start byte

Then, removing the .decode():

bytearray(b'a string')
bytearray(b'\xa8\x05\x06E\x00\x00\x00\x00')

@b-bondurant
Copy link
Contributor Author

And it gets weirder! (I might be having too much fun with this.) If I iterate over the bytes in-kernel and print them individually it changes the return value!

from artiq.experiment import *


class BadString(EnvExperiment):
    def build(self):
        self.setattr_device('core')

    @rpc
    def get_string_rpc(self) -> TBytes:
        return bytes('a string', encoding='utf-8')

    @kernel
    def get_string(self) -> TBytes:
        self.core.reset()
        my_str = self.get_string_rpc()
        print(my_str)
        for b in my_str:
            print(int(b))
        return my_str

    def run(self):
        my_str = self.get_string()
        print(my_str)
        for b in my_str:
            print(int(b))

output:

bytearray(b'a string')
97
32
115
116
114
105
110
103
bytearray(b'\x04\x00\x00D\xfc\x0f\x00\x00')
4
0
0
68
252
15
0
0

@sbourdeauducq
Copy link
Member

Most likely an allocation size isn't computed correctly somewhere (firmware and/or compiler) and receiving the string corrupts memory.

@pmldrmota
Copy link
Contributor

Maybe related to #1934?

@thomasfire thomasfire self-assigned this Oct 10, 2022
@thomasfire
Copy link
Contributor

After some research, found out rpc_send_async receives corrupted data. In some cases some parts (second 4 bytes, more often) may be correct. Also, can confirm there is no corruption with copied/owned data. Moreover, slicing [:] also generates normal output.
Still, not sure what causes such behavior.

@dnadlinger
Copy link
Collaborator

Given how the compiler is implemented currently, returning a string from a kernel can't work, and neither for any other "allocated" type (arrays/lists), as the backing allocation for the elements is on the stack frame of get_string(), while the code marshalling the value back to the host is in an implicit main-type function generated by the compiler. I'm not sure why this isn't caught by the escape analysis; my previous comments in #1497 indicate that it used to, though it might be a related issue. This is an "accepts invalid" bug – the code shouldn't compile, but does. We could think about changing the codegen such that the top-level function is special-cased to make the particular case of returning an array from a kernel work.

If you can make print(my_str) (or any other type of RPC) receive corrupted data without that, that's a separate bug, though; passing the data to an RPC while in get_string() should totally work, and is in fact the "intended" way of passing array data back to the host.

@dnadlinger dnadlinger changed the title RPC string corruption Compiler accepts returning allocated types from kernel Nov 24, 2022
@dnadlinger
Copy link
Collaborator

I can't seem to reproduce the print() crash on current master, though – unless you can, we can probably close the remaining issue as a duplicate of #1497/#1677.

@thomasfire
Copy link
Contributor

Checked latest master, and the crash is not being reproduced anymore. Though bytes/str corruption still persists.

This is an "accepts invalid" bug – the code shouldn't compile, but does.

As for me, it still looks like a legitimate code, I'm not sure that the compiler should not accept it.

@sbourdeauducq
Copy link
Member

Right, this doesn't actually have to do with RPC but with returning a string from a kernel function.

I'm not sure that the compiler should not accept it.

Remember that in kernels we don't have a heap.

@dnadlinger
Copy link
Collaborator

Also see #1298.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants