Add option to allow reading invalid unicode #78
Codecov Report
@@ Coverage Diff @@
## master #78 +/- ##
==========================================
- Coverage 97.68% 97.59% -0.09%
==========================================
Files 2 2
Lines 2158 2163 +5
==========================================
+ Hits 2108 2111 +3
- Misses 50 52 +2
|
Incorrect UTF8 strings are dangerous and can make your code vulnerable. For example:
size_t size = 3;
char *buf = malloc(size);
buf[0] = '"';
buf[1] = 0xF0;
buf[2] = '"';
yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_UNICODE, NULL);
While the writer sees |
You're right, but the issue in your example is with the writer, not the reader. The IMO the writer should report Unicode errors like the reader does now, and optionally (perhaps via a new option) allow writing these values. What do you think? I don't want to check all my input data (with other libraries, or how?) to see whether it is valid UTF-8; instead yyjson should check the |
Yes, the risk is not with the reader, but with the caller who uses these strings. We should add more comments to make users aware of this risk. With the default option, the strings are already validated by the reader, so there is no need for the writer to validate them again; additional validation would degrade serialization performance. Perhaps a new option like "validate unicode" could be added later. @TkTech The pull request will still do validation, but continues on error, so performance will not change. |
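To make "still validates, but continues on error" concrete, here is a minimal sketch in plain C. It is not yyjson's code: `utf8_valid_len` and `scan_string` are hypothetical names, and the `allow_invalid` parameter only stands in for the effect of `YYJSON_READ_ALLOW_INVALID_UNICODE`. The same validation runs in both modes; the only difference is what happens when a byte fails it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: length of the well-formed UTF-8 sequence starting
 * at s[0], or 0 if the bytes are invalid or truncated. n is the number
 * of bytes available. */
static size_t utf8_valid_len(const unsigned char *s, size_t n) {
    if (n == 0) return 0;
    unsigned char b0 = s[0];
    if (b0 < 0x80) return 1;            /* ASCII */
    size_t need;
    unsigned char lo = 0x80, hi = 0xBF; /* allowed range of the 2nd byte */
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        need = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        need = 3;
        if (b0 == 0xE0) lo = 0xA0;      /* reject overlong encodings */
        if (b0 == 0xED) hi = 0x9F;      /* reject UTF-16 surrogates */
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        need = 4;
        if (b0 == 0xF0) lo = 0x90;      /* reject overlong encodings */
        if (b0 == 0xF4) hi = 0x8F;      /* reject values above U+10FFFF */
    } else {
        return 0;                       /* not a valid lead byte */
    }
    if (n < need) return 0;             /* sequence runs past the end */
    if (s[1] < lo || s[1] > hi) return 0;
    for (size_t i = 2; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
    }
    return need;
}

/* Scan a string body. In strict mode, stop at the first invalid byte.
 * In lenient mode, count the byte as invalid, step over it, and continue.
 * Returns true if the whole input was scanned. */
static bool scan_string(const unsigned char *s, size_t n,
                        bool allow_invalid, size_t *invalid_bytes) {
    assert(s != NULL && invalid_bytes != NULL);
    *invalid_bytes = 0;
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_valid_len(s + i, n - i);
        if (len == 0) {
            if (!allow_invalid) return false;
            (*invalid_bytes)++;
            len = 1;                    /* skip only the offending byte */
        }
        i += len;
    }
    return true;
}
```

With the bytes `0xF0 'a' 'b' 'c'`, the strict scan fails at the first byte, while the lenient scan completes and reports one invalid byte. Skipping a single byte, rather than the length the lead byte claims, keeps the following `abc` intact.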
@ibireme my point was that the current writer behaviour is strange. The issue with the writer doesn't have anything to do with the reader validation. From what I can see, Otherwise, do you expect the user to manually validate the input data before calling |
I've added a warning to Do you mind if I open a separate pull request with a new option |
Checking the last 3 bytes extra for
Another example:
size_t size = 6;
char *buf = malloc(size);
buf[0] = '"';
buf[1] = 0xF0;
buf[2] = 'a';
buf[3] = 'b';
buf[4] = 'c';
buf[5] = '"';
yyjson_doc *doc = yyjson_read(buf, size, YYJSON_READ_ALLOW_INVALID_UNICODE);
size_t len;
char *str = yyjson_write(doc, YYJSON_WRITE_ESCAPE_SLASHES, &len);
// "\uD846\uDCA3"
If you don't check each byte, the wrong byte will break the following utf-8 sequences. When creating a string value, the input is controlled by the user, so I think it's enough to have this hint: https://github.com/ibireme/yyjson/blob/master/src/yyjson.h#L1655 |
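The arithmetic behind that output can be reproduced with a short, self-contained sketch. This is an illustration, not yyjson's writer: `naive_decode4`, `to_surrogates`, and `is_valid_4byte` are hypothetical names, and the assumption is that an escaping writer trusts the lead byte `0xF0` and decodes the next three bytes as continuation bytes without checking them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Naive decode of a 4-byte sequence: trusts the lead byte and masks the
 * next three bytes without checking that they are continuation bytes. */
static uint32_t naive_decode4(const unsigned char *s) {
    assert(s != NULL);
    return ((uint32_t)(s[0] & 0x07) << 18) |
           ((uint32_t)(s[1] & 0x3F) << 12) |
           ((uint32_t)(s[2] & 0x3F) << 6) |
            (uint32_t)(s[3] & 0x3F);
}

/* Split a supplementary-plane code point into a UTF-16 surrogate pair,
 * as needed for a \uXXXX\uXXXX escape. */
static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
    assert(hi != NULL && lo != NULL);
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
}

/* Per-byte check that a naive decoder skips: every trailing byte must
 * match 10xxxxxx, and the second byte must stay in the legal range. */
static bool is_valid_4byte(const unsigned char *s) {
    assert(s != NULL);
    if (s[0] < 0xF0 || s[0] > 0xF4) return false;
    if ((s[1] & 0xC0) != 0x80) return false;
    if ((s[2] & 0xC0) != 0x80) return false;
    if ((s[3] & 0xC0) != 0x80) return false;
    if (s[0] == 0xF0 && s[1] < 0x90) return false; /* overlong */
    if (s[0] == 0xF4 && s[1] > 0x8F) return false; /* above U+10FFFF */
    return true;
}
```

For the bytes `0xF0 'a' 'b' 'c'`, `naive_decode4` yields `0x218A3`, and `to_surrogates` turns that into `0xD846` and `0xDCA3`, which matches the `"\uD846\uDCA3"` shown above: the three ASCII characters are swallowed by the bogus lead byte. `is_valid_4byte` rejects the same input because `'a'` is not a continuation byte.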
You are right about checking the last 3 bytes. With the option Edit: you mean |
I tried to add full unicode validation for writer and ran a benchmark.
I think this performance drop is acceptable. I will add the unicode validation for |
Thanks. Should I close #79 then? |
Sure, I will re-implement |
I've been using this patch for months in my project, since I don't need the JSON reader to fail when there are invalid sequences in JSON strings.
Can we add a new option YYJSON_READ_ALLOW_INVALID_UNICODE (optional via YYJSON_DISABLE_NON_STANDARD) to allow parsing invalid unicode data?