
Add option to allow reading invalid unicode #78

Merged
merged 2 commits into master on Apr 20, 2022
Conversation

pavelxdd
Contributor

I've been using this patch for months in my project, since I don't need the JSON reader to fail when there are invalid sequences in JSON strings.

Can we add a new option YYJSON_READ_ALLOW_INVALID_UNICODE (excluded when YYJSON_DISABLE_NON_STANDARD is defined) to allow parsing invalid unicode data?

@codecov

codecov bot commented Apr 20, 2022

Codecov Report

Merging #78 (545b8ab) into master (c2df748) will decrease coverage by 0.08%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master      #78      +/-   ##
==========================================
- Coverage   97.68%   97.59%   -0.09%     
==========================================
  Files           2        2              
  Lines        2158     2163       +5     
==========================================
+ Hits         2108     2111       +3     
- Misses         50       52       +2     
Impacted Files   Coverage           Δ
src/yyjson.h     93.28% <ø>         (ø)
src/yyjson.c     97.88% <83.33%>    (-0.10%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ibireme
Owner

ibireme commented Apr 20, 2022

Invalid UTF-8 strings are dangerous and can make your code vulnerable.

For example:

    // 3-byte buffer: a quoted string whose only content byte is 0xF0
    size_t size = 3;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = 0xF0; // lead byte of a 4-byte UTF-8 sequence, truncated
    buf[2] = '"';
    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
    
    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_UNICODE, NULL);

When the writer sees 0xF0, it assumes it is the lead byte of a 4-byte encoded unicode code point. It continues reading the following bytes, causing a heap-buffer-overflow error.

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

You're right, but the issue in your example is not with the reader, but with the writer.
The reader correctly parses the input string and stores a string of length 1 in the root.

The write_string function has a str_len argument, which is 1, and shouldn't try to access the buffer outside of the specified range.

IMO the writer should report unicode errors like the reader does now, and optionally (via a new option, maybe) allow writing these values. What do you think?

I don't want to check all my input data for valid UTF-8 myself (with other libraries, or how?); instead, yyjson should respect str_len inside write_string.

@TkTech
Contributor

TkTech commented Apr 20, 2022

@ibireme it is possible to do this safely, as @pavelxdd suggests, and simdjson does do this.

This can have benefits. Some language wrappers, like Python, will always do UTF-8 validation when creating the high-level str, so I end up validating twice when loading data.

@ibireme
Owner

ibireme commented Apr 20, 2022

Yes, the risk is not with the reader, but with the caller who uses these strings. We should add more comments to make users aware of this risk.

With the default options, strings are already validated by the reader, so there is no need for the writer to validate again; additional validation would degrade serialization performance. Perhaps a new option like "validate unicode" could be added later.

@TkTech This pull request still does validation, but continues on error, so performance will not change.

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

@ibireme my point was that the current writer behaviour is strange.
You don't usually expect a function with size constraints in its arguments to access memory outside of the specified region.
Imagine if the standard functions memcmp or memcpy, which take a size argument, accessed memory beyond that size; it would be a nightmare.

The issue with the writer has nothing to do with reader validation.
Yes, the reader currently validates, but if I need to construct a JSON document from some data, I would use yyjson_mut_strn, which takes a len argument that the writer apparently ignores, accessing memory beyond len.

From what I can see, write_string doesn't need to check the bounds on every iteration; it only needs to check the last 3 bytes of the string. There would be no performance loss.

Otherwise, do you expect users to manually validate the input data before every call to yyjson_mut_strn? There are no public APIs in yyjson to do that, either.
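For callers who do want to pre-validate data passed to yyjson_mut_strn, a bounds-aware check is straightforward to write. This is an illustrative sketch only: is_valid_utf8 is not a yyjson API, and for brevity it does not reject overlong encodings or surrogate code points:

```c
#include <stdbool.h>
#include <stddef.h>

/* Bounds-aware UTF-8 check: rejects truncated sequences and bad
   continuation bytes instead of reading past `len`. Simplified:
   overlong encodings and surrogates are not rejected here. */
static bool is_valid_utf8(const unsigned char *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra; /* continuation bytes expected after the lead */
        if (c < 0x80)               { i++; continue; } /* ASCII */
        else if ((c & 0xE0) == 0xC0) extra = 1;        /* 110xxxxx */
        else if ((c & 0xF0) == 0xE0) extra = 2;        /* 1110xxxx */
        else if ((c & 0xF8) == 0xF0) extra = 3;        /* 11110xxx */
        else return false;                   /* invalid lead byte */
        if (i + extra >= len) return false;  /* truncated: would overrun */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;                /* bad continuation byte */
        i += extra + 1;
    }
    return true;
}
```

This catches both failure modes discussed above: a lead byte whose claimed length runs past the buffer, and in-bounds bytes that are not valid continuations.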

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

I've added a warning to YYJSON_READ_ALLOW_INVALID_UNICODE about security issues when dealing with invalid unicode inside JSON.

Do you mind if I open a separate pull request with a new option YYJSON_WRITE_ALLOW_INVALID_UNICODE which will add an optional unicode validation in writer when YYJSON_WRITE_ESCAPE_UNICODE is enabled?

@ibireme
Owner

ibireme commented Apr 20, 2022

Checking an extra 3 bytes at the end of each string will certainly cause performance degradation.
And a length check alone is not enough.

Another example:

    // 6-byte buffer: a quoted string containing 0xF0 followed by "abc"
    size_t size = 6;
    char *buf = malloc(size);
    buf[0] = '"';
    buf[1] = 0xF0; // lead byte claiming a 4-byte sequence
    buf[2] = 'a';
    buf[3] = 'b';
    buf[4] = 'c';
    buf[5] = '"';
    
    yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
    
    size_t len;
    char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_SLASHES, &len);
    // "\uD846\uDCA3" -- 0xF0 'a' 'b' 'c' misread as one 4-byte sequence

If you don't check each byte, a wrong byte will break the UTF-8 sequences that follow it.

When creating a string value, the input is controlled by the user, so I think it's enough to have this hint: https://github.com/ibireme/yyjson/blob/master/src/yyjson.h#L1655

@pavelxdd
Contributor Author

pavelxdd commented Apr 20, 2022

You are right about checking the last 3 bytes.

With the option YYJSON_WRITE_ALLOW_INVALID_UNICODE introduced in #79 the writer will correctly produce the output string \xF0abc.

Edit: did you mean YYJSON_WRITE_ESCAPE_UNICODE rather than YYJSON_WRITE_ESCAPE_SLASHES?
I agree, proper validation needs to be done on each iteration.

@ibireme
Owner

ibireme commented Apr 20, 2022

I tried adding full unicode validation to the writer and ran a benchmark.

M1 Pro, minify json:
github.json   (ASCII)        3.26GB/s -> 3.11GB/s (95%)
twitter.json  (partial CJK)  3.56GB/s -> 3.33GB/s (93%)
poem.json     (full CJK)     3.21GB/s -> 2.63GB/s (82%)

I think this performance drop is acceptable. I will add unicode validation to the writer later, so that the reader and writer have the same behavior.

@ibireme ibireme merged commit d0a9e48 into ibireme:master Apr 20, 2022
@pavelxdd
Contributor Author

Thanks. Should I close #79 then?

@pavelxdd pavelxdd deleted the utf8 branch April 20, 2022 23:58
@ibireme
Owner

ibireme commented Apr 21, 2022

> Thanks. Should I close #79 then?

Sure, I will re-implement YYJSON_WRITE_ALLOW_INVALID_UNICODE option later.
