Bug fix for handling of variable length utf-8 chars #447

DerekParks · 2024-04-15T06:40:14Z

When trying to guess the doc type, langroid reads the first 1024 bytes and tries to decode them as UTF-8.

UTF-8 has variable-length characters (1 to 4 bytes each), so the 1024-byte boundary can easily fall in the middle of a character. When it does, decoding fails and we wrongly assume we aren't dealing with UTF-8.
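
To make the byte layout concrete, here is a small sketch (not part of this PR; `describe_bytes` is a hypothetical helper) that encodes a few characters and labels each byte as a start byte or a continuation byte. Continuation bytes always have the bit pattern 10xxxxxx, which is what a check like `(byte & 0xC0) == 0x80` tests.

```python
def describe_bytes(text):
    """Return (character, byte_count, kinds) for each character in text.

    Hypothetical helper, for illustration only.
    """
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        # A continuation byte has its top two bits set to 10.
        kinds = ["cont" if (byte & 0xC0) == 0x80 else "start" for byte in encoded]
        rows.append((ch, len(encoded), kinds))
    return rows


# One character each of 1, 2, 3 and 4 encoded bytes.
for ch, size, kinds in describe_bytes("a\u00e9\u20ac\U0001F926"):
    print(repr(ch), size, kinds)
```

Cutting a byte string anywhere after a "start" byte but before its last "cont" byte leaves an incomplete character at the end, and a strict decode of that slice raises `UnicodeDecodeError`.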

Here is an example that shows the problem:

my_str = "abc﷽🤦🏻‍♂️🤦🏻‍♂️🤦🏻‍♂️"
print(len(my_str)) # 19
b = my_str.encode('utf-8')
print(len(b)) # 57 bytes that represent 19 chars
content = b[:50]  # cut at 50 bytes for this example, which splits a multi-byte character

try:
    _ = content.decode("utf-8")
    print(True)
except UnicodeDecodeError:
    print(False) # prints False


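def find_last_full_char(str_to_test):
    # Scan backwards for the last byte that is not a UTF-8 continuation
    # byte (10xxxxxx), i.e. the first byte of the last character.
    for i in range(len(str_to_test) - 1, -1, -1):
        if (str_to_test[i] & 0xC0) != 0x80:
            return i
    return 0  # no character start found, so keep nothing

# Trim the truncated sample just before its last (possibly partial) character.
content = content[:find_last_full_char(content)]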

try:
    _ = content.decode("utf-8")
    print(True) # prints True
except UnicodeDecodeError:
    print(False)
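
For comparison, a different way to tolerate a character cut off at the end is the incremental decoder from Python's standard `codecs` module. The sketch below is an alternative technique, not the code in this PR, and `is_probably_utf8` is a hypothetical name. With `final=False` the decoder buffers an incomplete trailing sequence instead of raising, while bytes that are invalid anywhere else still raise `UnicodeDecodeError`.

```python
import codecs


def is_probably_utf8(content: bytes) -> bool:
    """Return True if content is valid UTF-8, ignoring a character cut off at the end.

    Hypothetical helper; not the code merged in this PR.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False signals that more bytes may follow, so an incomplete
        # trailing sequence is buffered rather than treated as an error.
        decoder.decode(content, final=False)
        return True
    except UnicodeDecodeError:
        return False


sample = "abc\u20ac".encode("utf-8")   # the euro sign is 3 bytes
print(is_probably_utf8(sample[:5]))     # slice ends inside the euro sign
print(is_probably_utf8(b"abc\xff"))     # 0xFF is never valid in UTF-8
```

Unlike trimming back to the last start byte, this approach does not discard a final character that happens to be complete.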

I also added a missing docstring and fixed a typo that pops up when trying to use non-OpenAI embeddings. Let me know if I should put those in a separate PR.

dparks1 added 2 commits April 14, 2024 23:29

Fix handling of variable length unicode chars

ff40563

docstring

491c82f

pchalasani merged commit dc9477a into langroid:main Apr 16, 2024
2 of 3 checks passed
