PYTHON-3048 Fixed issue with invalid UTF-8 string. #970
Conversation
Nice find! However, it might be a better idea to remove the encoding_helpers.c file entirely and rely on Python's built-in UTF-8 decoding behavior instead. It looks like we only use this logic when encoding regex fields.
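A minimal Python-level sketch of that idea, assuming the C change amounts to letting CPython's own decoder (PyUnicode_DecodeUTF8 on the C side) reject bad input instead of the hand-rolled checks in encoding_helpers.c; the function name below is illustrative, not an actual _cbsonmodule entry point:

    def decode_regex_pattern(pattern: bytes) -> str:
        """Return the decoded pattern, or raise if it is not valid UTF-8."""
        try:
            # CPython's built-in UTF-8 decoder does the validation for us.
            return pattern.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(f"regex pattern is not valid UTF-8: {exc}") from exc

    print(decode_regex_pattern(b"^abc$"))   # -> ^abc$
    # decode_regex_pattern(b"\xff\xfe") would raise ValueError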
bson/_cbsonmodule.c
Outdated
    long int_flags;
    char flags[FLAGS_SIZE];
    char check_utf8 = 0;
    const char* pattern_data;
    int pattern_length, flags_length;
    -result_t status;
    +//result_t status;
This line can be removed entirely
"regex patterns must not contain the NULL byte"); | ||
Py_DECREF(InvalidDocument); | ||
|
||
if (check_utf8) { |
As Bernie mentioned above, we need to keep the NULL byte checking behavior. For BSON keys we perform the same check here:
https://github.com/mongodb/mongo-python-driver/blob/be3008aa11/bson/_cbsonmodule.c#L1230-L1239
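For illustration only (the real check is the linked C code, not this Python): the rule being preserved is that BSON regex patterns are written as NUL-terminated cstrings, so an embedded 0x00 byte has to be rejected even when the bytes are otherwise valid UTF-8. A rough Python restatement:

    from bson.errors import InvalidDocument

    def check_no_null_byte(pattern: bytes) -> None:
        # BSON stores the pattern as a cstring, so an embedded 0x00 would truncate it.
        if b"\x00" in pattern:
            raise InvalidDocument("regex patterns must not contain the NULL byte")

    check_no_null_byte(b"^abc$")       # passes
    # check_no_null_byte(b"a\x00b")    # raises InvalidDocument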
LGTM!
It would be interesting to see if there's any performance cost to using PyUnicode_DecodeUTF8 instead of the encoding helpers. My guess is that the answer is no, except for docs with many regex fields ({..., 'r1': Regex(...), ..., 'r1000': Regex(...)}) or docs with a large regex field (e.g. a 1MB regex: {'r': Regex(b'1'*1024*1024)}). Could you spend some time investigating using timeit and post your findings?
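One way to run that experiment (a sketch, not the harness used for the numbers below; it assumes a PyMongo recent enough to provide bson.encode and bson.regex.Regex) is to time both document shapes from the comment with timeit:

    import timeit

    from bson import encode
    from bson.regex import Regex

    docs = {
        "one 1MB regex field": {"r": Regex(b"1" * 1024 * 1024)},
        "1000 small regex fields": {f"r{i}": Regex(b"abc") for i in range(1000)},
    }

    for name, doc in docs.items():
        # Average encode time over 100 runs, reported in microseconds.
        per_call = timeit.timeit(lambda: encode(doc), number=100) / 100
        print(f"{name}: {per_call * 1e6:.1f} µs per encode")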
We found that using the Python standard library significantly speeds things up: for 1 million bytes, the Unicode C implementation takes about 4.19 ms ± 2.89 µs, while the Python Unicode implementation takes about 145 µs ± 947 ns to run.
We found that using the Python standard library significantly speeds things up: for 1 million bytes, the Unicode C implementation takes about 4.19 ms ± 2.89 µs, while the Python Unicode implementation takes about 145 µs ± 947 ns to run.
NBD only a modest 30x speed up. LGTM!
I think there was a missing switch case in isLegalUTF8 for characters beginning with 0xED, based on what I saw in this implementation. There are some other apparent differences, but I think for those other cases they probably have the same behavior already.

Results:
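As a hedged aside on the 0xED point (separate from the results referenced above, which are omitted here): three-byte sequences starting with 0xED can encode the UTF-16 surrogate range U+D800–U+DFFF, which is why a UTF-8 validator needs a dedicated branch limiting the second byte to 0x80–0x9F there. CPython's decoder enforces this, as a quick check shows:

    for seq in (b"\xed\x9f\xbf",   # U+D7FF: legal, second byte <= 0x9F
                b"\xed\xa0\x80"):  # U+D800: a surrogate, illegal in UTF-8
        try:
            print(seq, "->", repr(seq.decode("utf-8")))
        except UnicodeDecodeError as exc:
            print(seq, "-> rejected:", exc.reason)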