
Add option to allow reading invalid unicode #78

Merged
merged 2 commits into master on Apr 20, 2022
Conversation

pavelxdd
Contributor

I've been using this patch for months in my project, since I don't need the JSON reader to fail when there are invalid sequences in JSON strings.

Can we add a new option YYJSON_READ_ALLOW_INVALID_UNICODE (excluded when YYJSON_DISABLE_NON_STANDARD is defined) to allow parsing invalid unicode data?

@codecov

codecov bot commented Apr 20, 2022

Codecov Report

Merging #78 (545b8ab) into master (c2df748) will decrease coverage by 0.08%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master      #78      +/-   ##
==========================================
- Coverage   97.68%   97.59%   -0.09%     
==========================================
  Files           2        2              
  Lines        2158     2163       +5     
==========================================
+ Hits         2108     2111       +3     
- Misses         50       52       +2     
Impacted Files   Coverage           Δ
src/yyjson.h     93.28% <ø>         (ø)
src/yyjson.c     97.88% <83.33%>    (-0.10%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ibireme
Owner

ibireme commented Apr 20, 2022

Invalid UTF-8 strings are dangerous and can make your code vulnerable.

For example:

    // 3-byte buffer: a quoted string whose only content byte is 0xF0
    size_t size = 3;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = 0xF0; // lead byte of a 4-byte UTF-8 sequence, truncated
    buf[2] = '"';
    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
    
    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_UNICODE, NULL);

When the writer sees 0xF0, it assumes it is the lead byte of a 4-byte encoded unicode code point. It continues reading the following bytes, causing a heap-buffer-overflow error.

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

You're right, but the issue in your example is not with the reader, but with the writer.
The reader correctly parses the input string and stores a string of length 1 in the root.

The write_string function has a str_len argument, which is 1, and shouldn't try to access the buffer outside of the specified range.

IMO the writer should report unicode errors like the reader does now, and optionally (via a new option, maybe) allow writing these values. What do you think?

I don't want to check all my input data for valid UTF-8 myself (with other libraries, or how?); instead, yyjson should respect str_len inside write_string.

@TkTech
Contributor

TkTech commented Apr 20, 2022

@ibireme it is possible to do this safely, as @pavelxdd suggests, and simdjson does do this.

This can have benefits. Some language wrappers, like Python, will always do UTF-8 validation when creating the high-level str, so I end up validating twice when loading data.

@ibireme
Owner

ibireme commented Apr 20, 2022

Yes, the risk is not with the reader, but with the caller who uses these strings. We should add more comments to make users aware of this risk.

With the default options, strings are already validated by the reader, so there is no need for the writer to validate again; additional validation would degrade serialization performance. Perhaps a new option like "validate unicode" could be added later.

@TkTech This pull request still does validation, but continues on error, so performance will not change.

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

@ibireme my point was that the current writer behaviour is strange.
You don't usually expect a function with size constraints in its arguments to access memory outside of the specified region.
Imagine if the standard functions memcmp or memcpy, which take a size argument, accessed memory beyond that size; it would be a nightmare.

The issue with the writer has nothing to do with reader validation.
Yes, the reader currently validates, but if I need to construct a JSON document from some data, I would use yyjson_mut_strn, which takes a len argument that the writer apparently ignores, accessing memory beyond len.

From what I can see, write_string doesn't need to check the bounds on every iteration; it only needs to check the last 3 bytes of the string. There would be no performance loss.

Otherwise, do you expect users to manually validate the input data before every call to yyjson_mut_strn? There are no public APIs in yyjson to do that, either.
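For callers who do want to pre-validate data passed to yyjson_mut_strn, a bounds-aware check is straightforward to write. This is an illustrative sketch only: is_valid_utf8 is not a yyjson API, and for brevity it does not reject overlong encodings or surrogate code points:

```c
#include <stdbool.h>
#include <stddef.h>

/* Bounds-aware UTF-8 check: rejects truncated sequences and bad
   continuation bytes instead of reading past `len`. Simplified:
   overlong encodings and surrogates are not rejected here. */
static bool is_valid_utf8(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra; /* continuation bytes expected after the lead */
        if (c < 0x80)               { i++; continue; } /* ASCII */
        else if ((c & 0xE0) == 0xC0) extra = 1;        /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2;        /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3;        /* 11110xxx */
        else return false;                   /* invalid lead byte */
        if (i + extra >= len) return false;  /* truncated: would overrun */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                /* bad continuation byte */
        i += extra + 1;
    }
    return true;
}
```

This catches both failure modes discussed above: a lead byte whose claimed length runs past the buffer, and in-bounds bytes that are not valid continuations.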

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

I've added a warning to YYJSON_READ_ALLOW_INVALID_UNICODE about security issues when dealing with invalid unicode inside JSON.

Do you mind if I open a separate pull request with a new option YYJSON_WRITE_ALLOW_INVALID_UNICODE which will add an optional unicode validation in writer when YYJSON_WRITE_ESCAPE_UNICODE is enabled?

@ibireme
Owner

ibireme commented Apr 20, 2022

Checking an extra 3 bytes at the end of each string will certainly cause performance degradation.
And a length check alone is not enough.

Another example:

    // 6-byte buffer: a quoted string containing 0xF0 followed by "abc"
    size_t size = 6;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = 0xF0; // lead byte claiming a 4-byte sequence
    buf[2] = 'a';
    buf[3] = 'b';
    buf[4] = 'c';
    buf[5] = '"';
    
    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
    
    size_t len;
    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_SLASHES, &len);
    // "\uD846\uDCA3" -- 0xF0 'a' 'b' 'c' misread as one 4-byte sequence

If you don't check each byte, a wrong byte will break the UTF-8 sequences that follow it.

When creating a string value, the input is controlled by the user, so I think it's enough to have this hint: https://github.com/ibireme/yyjson/blob/master/src/yyjson.h#L1655

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

You are right about checking the last 3 bytes.

With the option YYJSON_WRITE_ALLOW_INVALID_UNICODE introduced in #79 the writer will correctly produce the output string \xF0abc.

Edit: did you mean YYJSON_WRITE_ESCAPE_UNICODE rather than YYJSON_WRITE_ESCAPE_SLASHES?
I agree, proper validation needs to be done on each iteration.

@ibireme
Owner

ibireme commented Apr 20, 2022

I tried adding full unicode validation to the writer and ran a benchmark.

M1 Pro, minify json:
github.json   (ASCII)        3.26GB/s -> 3.11GB/s (95%)
twitter.json  (partial CJK)  3.56GB/s -> 3.33GB/s (93%)
poem.json     (full CJK)     3.21GB/s -> 2.63GB/s (82%)

I think this performance drop is acceptable. I will add unicode validation to the writer later, so that the reader and writer have the same behavior.

@ibireme ibireme merged commit d0a9e48 into ibireme:master Apr 20, 2022
@pavelxdd
Contributor Author

Thanks. Should I close #79 then?

@pavelxdd pavelxdd deleted the utf8 branch April 20, 2022 23:58
@ibireme
Owner

ibireme commented Apr 21, 2022

> Thanks. Should I close #79 then?

Sure, I will re-implement YYJSON_WRITE_ALLOW_INVALID_UNICODE option later.
