-
Notifications
You must be signed in to change notification settings - Fork 7.3k
Segfault on node 0.12.4, to do with unicode and buffers #25583
Comments
I've managed to construct a simple JS test case:
While I was working this out, the crash was unreliable: the same code would crash, or not crash. There was at one point a loop to repeatedly call a function that copied part of the buffer into a new buffer. The above code appears to crash 100% of the time for me each execution, but if it doesn't for you, you might try looping it and/or the buffer copy thing. |
@myndzi So I attempted to reproduce this with the following code and could not have it crash. 'use strict'; for (var j =0; j < 1000; j++) { var pattern = new Buffer([0xF0, 0x9F, 0x92, 0x95]); var data = new Buffer(8000); for (var i = 0; i < 8000; i += 4) { pattern.copy(data, i, 0, 4); } console.log(data.slice(0, 7833).toString()) } But no luck. Is this what you recommended to reproduce? Also I simply ran this by using node file.js with node 10.31. What node version are you running? |
Er, as it says in the title of the issue, this is for node 0.12. It probably crashes in 0.11.x and up -- I was on 0.11 when I first observed it and updated to latest to see if it had been fixed. When I had it looping, it was only the slice / composition part, so something like this:
...but I don't imagine on 0.10.31, if the other didn't crash, this will. |
@myndzi Well it is probably a Mac specific issue but I get Bus error 10 when using Node 0.12.5 as well as 0.12.4 |
http://stackoverflow.com/a/212585/2156140 It's surely the same cause, just a different effect on a different platform |
Does it crash if you use |
No. The problem appears to specifically be in the utf8->utf16 conversion path. Binary doesn't have such a thing as a 'partially complete encoded character' at the end of the line |
The first test case from @myndzi caused a segfault on my Linux box. |
Does seem not to crash on 0.10.x for me as well, FWIW. (I think there's a possibility that the resultant string is malformed, however, if you were to combine two strings that split over such a character break) |
@myndzi Just a note that after looking at the most recent code sample you sent, the first slice always crashes for me with the node version built off master. The for loop is not even needed. I am having trouble finding the code path slice triggers though. |
You'll notice the reproduction I arrived at doesn't include the loop; I only mentioned it because there was some random behavior in the versions leading up to it, where looping helped. The first code sample should be preferred. Slice isn't necessary to the crash, it just creates the bad data. The crash is caused by the .toString() method; the string is converted to utf-16, but since the last character of the buffer is the first byte of a four-byte sequence, the converted utf-16 becomes the first half of a utf-16 surrogate pair. The WriteUtf16 loop is called from the inlined unicode helpers and copies the bulk of the data, and then the WriteUtf16Slow function is called to 'finish' the copy. But it gets called with only one byte left (len = 1), and that byte is the first half of a surrogate pair; in the source, this causes the code to assume that there are at least two bytes left -- though there's an assertion that causes the crash when this assumption is not met. Some previous work decoding utf-8 appears to be necessary; I tried it with a buffer that ended in just the one character, but it crashed very intermittently. When I filled the buffer with plenty of copies so that some decoding went on before it got to the slow-copy part, it began to crash reliably, and that's how I arrived at the above code. |
For what it's worth, even if it doesn't crash, your data will get messed up if it cuts in the middle of a UTF-8 sequence, if you're stringifying each chunk. This could happen because people don't know how to work with buffers and are just using setEncoding, or they're piping to some streaming module that does the same (htmlparser2 operates with strings). I wrote this: https://www.npmjs.com/package/utf8-align-stream which should ensure that a Node stream always outputs whole UTF-8 sequences when they're available by conditionally buffering the last <= 5 bytes. This will make the strings compose correctly as well as prevent this crash. |
@trevnorris ... I assume this is resolved now? |
Ah yes. Sorry. |
I have a core dump now from a binary compiled with symbols, but I'm not sure what to do with it. I've managed to dump the data from the buffer being operated on to a file, but I cannot reproduce the crash in javascript code. I'm uncertain how exactly things got passed through and along. The backtrace looks like this:
(I've omitted the actual data snippits for readability)
As best as I can tell, the buffer being worked with is split on the first byte of a four-byte utf-8 sequence representing a > 0xFFFF value (U+1F495); it's part of an HTTP response. The error is obvious:
https://github.com/joyent/node/blob/v0.12.4/deps/v8/src/unicode.cc#L317-L321
An assumption is made here, that if there's a UTF-16 surrogate pair, both parts exist, but that's not what's happening. It seems assumed that such a case won't reach this code. I was unable to follow the code here well enough to come up with a way to reproduce the problem in JS code:
https://github.com/joyent/node/blob/v0.12.4/deps/v8/src/factory.cc#L231-L265
Please let me know what if anything I can do to help nail this down further!
The text was updated successfully, but these errors were encountered: