UnicodeDecodeError: 'utf-8' codec can't decode byte 0xaa in position 196607: invalid start byte #17
Yeah, there's a 2^32-byte limit on how much can be sent in a single message - I didn't think people would be shipping more than 4 GB (and if you are coming close to this limit, you may want to look for alternatives to doing that - ghidra_bridge is definitely not going to perform well under those conditions, in either memory or network speed). But! I don't think that's the problem here. The error message shows it having trouble decoding the received JSON, with the decode failing at position 196607 - nowhere near a 2^32 limit (but still a way big message - at least 191 KB :o ). Additionally, packing the message on the server would have failed - Tracking down invalid unicode always sucks - it'd help to be able to see the code for your "my_function". |
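For reference, a minimal sketch of why an oversized message would fail at packing time rather than at decode time. It assumes the size header is an unsigned 32-bit big-endian integer packed with `struct`; `fits_in_header` is a hypothetical helper, not part of jfx_bridge.

```python
import struct


def fits_in_header(size):
    """Return True if `size` fits in an unsigned 32-bit size header.

    Hypothetical helper: assumes the header is packed as ">I".
    """
    try:
        struct.pack(">I", size)
        return True
    except struct.error:
        # struct refuses values outside 0..2**32 - 1 instead of truncating
        return False
```

Under that assumption the sender would raise before anything reached the wire, so a too-large payload would not show up as a decode error on the receiving side.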
Indeed, that is a lot of data that gets serialized to JSON, and it is definitely not the most efficient way. What's actually being passed is the set of edges from the control flow graph of the program. However, the error only occurs once in a while, but it is always What's your recommendation on passing larger data back to python? Maybe a pipe or so? |
Honestly, if you're not working across the network, I'd probably suggest
writing it straight out to a file and reading it back from disk in the
python process. Take ghidra_bridge and any networking complexity right out
of the picture. Also means that when something goes a little off, you can
maybe resume from the file instead of having to collect the data from
scratch.
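A rough sketch of the file approach, assuming the edges are (source, destination) pairs of plain integers and that one JSON value per line is an acceptable format. `dump_edges` and `load_edges` are hypothetical names; the first would run on the Ghidra side and the second in the python process.

```python
import json


def dump_edges(edges, path):
    """Write one JSON-encoded edge per line so a partial file is still usable."""
    with open(path, "w") as f:
        for src, dst in edges:
            f.write(json.dumps([src, dst]) + "\n")


def load_edges(path):
    """Read the edges back as a list of (src, dst) tuples."""
    with open(path) as f:
        return [tuple(json.loads(line)) for line in f if line.strip()]
```

Writing line by line means nothing has to be held in memory as one giant JSON string, and a run that dies halfway still leaves the edges collected so far on disk.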
|
Additionally, if you want, you could try patching the jfx_bridge/bridge.py on the receiving side to log the message when it hits the decode issue. This could be helpful in tracking down the source of the issue even if you can't reproduce it deterministically. This would look something like replacing the line
|
I have printed the file, as you suggested. It really does contain non-Unicode characters. I have attached a shortened version for you. |
Hah! Just enough for me to see what's going on. Looks like halfway through a message being received, a second message is jumping in. The unicode error is being caused by the \x00\x00\x00\xaa bytes of the second message size, and there's 0xaa bytes of second message JSON before the initial message resumes. I'm going to have to go hunt through the network dispatching code to see why that can happen... |
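To illustrate, here is a small reproduction of that failure mode done entirely in memory. It assumes a 4-byte big-endian size header followed by UTF-8 JSON; `frame` and `read_frame` are hypothetical helpers, and the split point of 100 bytes is arbitrary.

```python
import json
import struct


def frame(obj):
    """Prefix the UTF-8 JSON payload with a 4-byte big-endian size header."""
    payload = json.dumps(obj).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def read_frame(stream):
    """Read one frame the naive way: trust the header, take that many bytes."""
    (size,) = struct.unpack(">I", stream[:4])
    return stream[4:4 + size]


first = frame({"edges": list(range(1000))})

# Build a second message whose payload is exactly 0xaa bytes long,
# so that its size header is b"\x00\x00\x00\xaa".
empty_len = len(json.dumps({"pad": ""}))
second = frame({"pad": "x" * (0xAA - empty_len)})

# The second message jumps in after the first 100 bytes of the first one.
wire = first[:100] + second + first[100:]

error = None
try:
    read_frame(wire).decode("utf-8")
except UnicodeDecodeError as err:
    error = err  # the 0xaa byte of the second size header is not valid UTF-8
```

The three zero bytes decode fine as NUL characters, so the decoder only trips on the final `0xaa` byte of the embedded header, which matches the "invalid start byte" message in the report.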
Transferred this issue over to jfx-bridge (the underlying comms beneath ghidra_bridge), because I'm pretty sure that's where the problem is. Here's a braindump of what I think has happened - you can skip down to the bottom for how to upgrade and hopefully fix the issue if you want, this is mostly for historical record.
The problem probably lies in the potential for messages being sent across the bridge to become interleaved. It wouldn't happen often, because most socket.send() calls will drop into native code and dispatch the message in one hit, but for very large messages, there's the potential for it to only send part of the message before it returns to python and loops around to send the rest. If there's another thread waiting with a message when that happens, and python decides to swap threads, the first message will be incomplete when the second message (including its size header) gets put on the wire. Eventually, when control returns to the first thread, it'll finish sending its message, but the damage is already done.
On the receiving end, it'll see the first message's size header and try to read that many bytes - which will include reading the second message's size header and data, and lose some of the end of the first message. When this gets fed into a unicode decode it'll probably fail with invalid bytes when it hits the binary size header. Even if it somehow didn't, the JSON structure would almost certainly be broken, so the json.loads() would fail in the next step.
I've addressed this by gating all the places where data gets written to the socket through a lock. However, I haven't been able to build a testcase that actually replicates the problem, so it's all a guess as to whether this actually fixes your issue. If you do end up with code that reliably replicates the problem, that'd be nice to have so I could try turning it into a testcase to avoid regressions.
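For the record, the idea of gating writes through a lock looks roughly like the sketch below. This is not the actual jfx-bridge code: `LockedSender`, `recv_exact` and `recv_message` are hypothetical names, and the framing (4-byte big-endian size, then UTF-8 JSON) is assumed.

```python
import json
import socket
import struct
import threading


class LockedSender:
    """Serialise whole frames onto one socket so they cannot interleave."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send(self, obj):
        payload = json.dumps(obj).encode("utf-8")
        data = struct.pack(">I", len(payload)) + payload
        # Hold the lock for the entire frame: sendall() may loop over several
        # partial sends, and no other thread may write in between them.
        with self._lock:
            self._sock.sendall(data)


def recv_exact(sock, n):
    """Read exactly n bytes, looping over short reads."""
    chunks = []
    while n > 0:
        chunk = sock.recv(min(n, 65536))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Read one size-prefixed JSON message."""
    (size,) = struct.unpack(">I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, size).decode("utf-8"))
```

The lock has to cover the header and the payload together. Locking each individual send call would not help, since the gap between two partial sends of one large message is exactly where the other thread slips in.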
TL;DR - I've released version 0.9.1 of jfx-bridge with a fix that I think might sort the problem. Upgrade with Please let me know if you think it's solved the issue, or if it keeps occurring. |
Wow, that was fast. For now, the error has not occurred again, so I guess it is fixed. Thanks a lot! |
Sweet! I'll close this now, but if it does reoccur, feel free to reopen. |
Hello,
I want to remotify a function 'my_function' and return a very large object from my_function.
When I do this, I get the following output:
In the bridge.py file in function read_size_and_data_from_socket, I think the size of 'data' is stored in 32 bits. Maybe there is a problem if 'my_function' returns an object that is larger than 2**32 bytes (~4 Gigabytes)?
Let me know if you need more info.