PYTHON-3048 Fixed issue with invalid UTF-8 string. #970
Conversation
Nice find! However, it might be a better idea to remove the encoding_helpers.c file entirely and rely on Python's built-in UTF-8 decoding behavior instead. It looks like we only use this logic when encoding regex fields.
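A minimal Python-level sketch of that idea, assuming the C change amounts to letting CPython's own decoder (PyUnicode_DecodeUTF8 on the C side) reject bad input instead of the hand-rolled checks in encoding_helpers.c; the function name below is illustrative, not an actual _cbsonmodule entry point:

    def decode_regex_pattern(pattern: bytes) -> str:
        """Return the decoded pattern, or raise if it is not valid UTF-8."""
        try:
            # CPython's built-in UTF-8 decoder does the validation for us.
            return pattern.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError(f"regex pattern is not valid UTF-8: {exc}") from exc

    print(decode_regex_pattern(b"^abc$"))   # -> ^abc$
    # decode_regex_pattern(b"\xff\xfe") would raise ValueError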
bson/_cbsonmodule.c
Outdated
    long int_flags;
    char flags[FLAGS_SIZE];
    char check_utf8 = 0;
    const char* pattern_data;
    int pattern_length, flags_length;
    -result_t status;
    +//result_t status;
This line can be removed entirely
"regex patterns must not contain the NULL byte"); | ||
Py_DECREF(InvalidDocument); | ||
|
||
if (check_utf8) { |
As Bernie mentioned above, we need to keep the NULL byte checking behavior. For BSON keys we perform the same check here:
https://github.com/mongodb/mongo-python-driver/blob/be3008aa11/bson/_cbsonmodule.c#L1230-L1239
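For illustration only (the real check is the linked C code, not this Python): the rule being preserved is that BSON regex patterns are written as NUL-terminated cstrings, so an embedded 0x00 byte has to be rejected even when the bytes are otherwise valid UTF-8. A rough Python restatement:

    from bson.errors import InvalidDocument

    def check_no_null_byte(pattern: bytes) -> None:
        # BSON stores the pattern as a cstring, so an embedded 0x00 would truncate it.
        if b"\x00" in pattern:
            raise InvalidDocument("regex patterns must not contain the NULL byte")

    check_no_null_byte(b"^abc$")       # passes
    # check_no_null_byte(b"a\x00b")    # raises InvalidDocument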
LGTM!
It would be interesting to see if there's any performance cost to using PyUnicode_DecodeUTF8 instead of the encoding helpers. My guess is that the answer is no, except for docs with many regex fields ({..., 'r1': Regex(...), ..., 'r1000': Regex(...)}) or docs with a large regex field (e.g. a 1MB regex: {'r': Regex(b'1'*1024*1024)}). Could you spend some time investigating using timeit and post your findings?
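One way to run that experiment (a sketch, not the harness used for the numbers below; it assumes a PyMongo recent enough to provide bson.encode and bson.regex.Regex) is to time both document shapes from the comment with timeit:

    import timeit

    from bson import encode
    from bson.regex import Regex

    docs = {
        "one 1MB regex field": {"r": Regex(b"1" * 1024 * 1024)},
        "1000 small regex fields": {f"r{i}": Regex(b"abc") for i in range(1000)},
    }

    for name, doc in docs.items():
        # Average encode time over 100 runs, reported in microseconds.
        per_call = timeit.timeit(lambda: encode(doc), number=100) / 100
        print(f"{name}: {per_call * 1e6:.1f} µs per encode")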
We found that using the Python standard library significantly speeds things up: for 1 million bytes, the Unicode C implementation takes about 4.19 ms ± 2.89 µs, while the Python Unicode implementation takes about 145 µs ± 947 ns to run.
We found that using the Python standard library significantly speeds things up: for 1 million bytes, the Unicode C implementation takes about 4.19 ms ± 2.89 µs, while the Python Unicode implementation takes about 145 µs ± 947 ns to run.
NBD only a modest 30x speed up. LGTM!
I think there was a missing switch case in isLegalUTF8 for characters beginning with 0xED, based on what I saw in this implementation. There are some other apparent differences, but I think for those other cases they probably have the same behavior already.

Results:
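As a hedged aside on the 0xED point (separate from the results referenced above, which are omitted here): three-byte sequences starting with 0xED can encode the UTF-16 surrogate range U+D800–U+DFFF, which is why a UTF-8 validator needs a dedicated branch limiting the second byte to 0x80–0x9F there. CPython's decoder enforces this, as a quick check shows:

    for seq in (b"\xed\x9f\xbf",   # U+D7FF: legal, second byte <= 0x9F
                b"\xed\xa0\x80"):  # U+D800: a surrogate, illegal in UTF-8
        try:
            print(seq, "->", repr(seq.decode("utf-8")))
        except UnicodeDecodeError as exc:
            print(seq, "-> rejected:", exc.reason)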