Skip to content

PYTHON-3048 Fixed issue with invalid UTF-8 string. #970

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 13, 2022

Conversation

bcwarner
Copy link
Contributor

I think there was a missing switch case in isLegalUTF8 for characters beginning with 0xED, based on what I saw in this implementation. There are some other apparent differences, but I think for those other cases, they probably have the same behavior already.

Results:

In [1]: from bson import *

In [2]: encode({"a":Regex(b'\xed\xbc\xad','')})
---------------------------------------------------------------------------
InvalidStringData                         Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 encode({"a":Regex(b'\xed\xbc\xad','')})

File ~/opt/anaconda3/lib/python3.9/site-packages/pymongo-4.2.0.dev2-py3.9-macosx-10.9-x86_64.egg/bson/__init__.py:947, in encode(document, check_keys, codec_options)
    944 if not isinstance(codec_options, CodecOptions):
    945     raise _CODEC_OPTIONS_TYPE_ERROR
--> 947 return _dict_to_bson(document, check_keys, codec_options)

InvalidStringData: regex patterns must be valid UTF-8

@bcwarner bcwarner marked this pull request as ready for review June 13, 2022 17:12
Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find, however it might be a better idea to remove the encoding_helpers.c file entirely and rely on Python's builtin UTF-8 decoding behavior instead. It looks like we only use this logic when encoding regex fields.

long int_flags;
char flags[FLAGS_SIZE];
char check_utf8 = 0;
const char* pattern_data;
int pattern_length, flags_length;
result_t status;
//result_t status;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line can be removed entirely

"regex patterns must not contain the NULL byte");
Py_DECREF(InvalidDocument);

if (check_utf8) {
Copy link
Member

@ShaneHarvey ShaneHarvey Jun 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Bernie mentioned above we need to keep the NULL byte checking behavior. For BSON keys we perform the same check like this:
https://github.com/mongodb/mongo-python-driver/blob/be3008aa11/bson/_cbsonmodule.c#L1230-L1239

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to see if there's any performance cost of using PyUnicode_DecodeUTF8 instead of the encoding helpers. My guess is the answer is no except docs with many regex fields ({...'r1': Regex(...), ..., 'r1000': Regex(...)}) or docs with large regex fields (eg a 1MB regex: {'r': Regex(b'1'*1024*1024)}). Could you spend some time investigating using timeit and post your findings?

@bcwarner
Copy link
Contributor Author

We found that using the Python standard library significantly speeds up the results. For 1 million bytes, the Unicode C implementation used about 4.19 ms ± 2.89 µs, and the Python Unicode implementation takes about 145 µs ± 947 ns to run.

Copy link
Member

@ShaneHarvey ShaneHarvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We found that using the Python standard library significantly speeds up the results. For 1 million bytes, the Unicode C implementation used about 4.19 ms ± 2.89 µs, and the Python Unicode implementation takes about 145 µs ± 947 ns to run.

NBD only a modest 30x speed up. LGTM!

@bcwarner bcwarner merged commit 3f7231a into mongodb:master Jun 13, 2022
juliusgeo pushed a commit to juliusgeo/mongo-python-driver that referenced this pull request Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants