Bug fix for handling of variable length utf-8 chars #447

Merged: 2 commits, Apr 16, 2024

DerekParks
Contributor

When trying to guess the doc type, langroid reads the first 1024 bytes and tries to decode them as UTF-8.

UTF-8 characters are variable length (1 to 4 bytes each), so the 1024-byte boundary can easily fall in the middle of a character. Decoding then raises UnicodeDecodeError, and we wrongly assume we aren't dealing with UTF-8.
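To make the boundary problem concrete, here is a minimal sketch of how the high bits of each byte mark its role in a UTF-8 sequence. The helper name `classify_utf8_byte` is hypothetical and not part of this PR, and it only inspects bit patterns; it does not validate overlong or out-of-range encodings.

```python
def classify_utf8_byte(byte: int) -> str:
    """Describe the role of one byte in a UTF-8 stream (illustrative helper)."""
    if byte & 0x80 == 0x00:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if byte & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx: second or later byte of a character
    if byte & 0xE0 == 0xC0:
        return "lead-2"        # 110xxxxx: starts a two-byte character
    if byte & 0xF0 == 0xE0:
        return "lead-3"        # 1110xxxx: starts a three-byte character
    if byte & 0xF8 == 0xF0:
        return "lead-4"        # 11110xxx: starts a four-byte character
    return "invalid"


# "a", U+00E9, U+20AC, U+1F926: one character each of length 1, 2, 3, and 4 bytes.
data = "a\u00e9\u20ac\U0001F926".encode("utf-8")
roles = [classify_utf8_byte(b) for b in data]
print(roles)
```

Any cut that lands right before a "continuation" byte splits a character, which is the situation a fixed 1024-byte read can run into.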

Here is an example that shows the problem:

my_str = "abc﷽🤦🏻‍♂️🤦🏻‍♂️🤦🏻‍♂️"
print(len(my_str)) # 19
b = my_str.encode('utf-8')
print(len(b)) # 57 bytes that represent 19 chars
content = b[:50]  # cut at 50 bytes for this example; this splits a 3-byte character

try:
    _ = content.decode("utf-8")
    print(True)
except UnicodeDecodeError:
    print(False) # prints False


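def find_last_full_char(str_to_test):
    # Scan backwards for the last byte that is not a UTF-8 continuation
    # byte (10xxxxxx). Slicing there drops any partial trailing character.
    for i in range(len(str_to_test) - 1, 0, -1):
        if (str_to_test[i] & 0xC0) != 0x80:
            return i
    return 0  # no character start found after the first byte

# Trim the truncated 50-byte sample rather than the full byte string.
content = content[:find_last_full_char(content)]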

try:
    _ = content.decode("utf-8")
    print(True) # prints True
except UnicodeDecodeError:
    print(False)
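
As a point of comparison, and not what this PR implements, the standard library's incremental decoder can also tolerate a chunk that ends mid-character. This is a sketch under that assumption; `is_utf8_prefix` is a hypothetical name. With `final=False` the decoder buffers an incomplete trailing sequence instead of raising, while still rejecting bytes that are invalid anywhere in the chunk.

```python
import codecs


def is_utf8_prefix(chunk: bytes) -> bool:
    """Return True if chunk is valid UTF-8, allowing an incomplete last character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: an unfinished trailing sequence is held back, not an error.
        decoder.decode(chunk, final=False)
        return True
    except UnicodeDecodeError:
        return False


# "abc", U+FDFD, then one emoji sequence (5 code points), written with escapes.
sample = "abc\ufdfd\U0001F926\U0001F3FB\u200d\u2642\ufe0f".encode("utf-8")
truncated = sample[:21]  # ends after the first byte of the final 3-byte character

print(is_utf8_prefix(truncated))
print(is_utf8_prefix(b"\xff\xfeabc"))
```

Unlike the backward scan above, this approach does not discard a complete final character, at the cost of decoding the whole chunk.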

I also added a missing docstring and fixed a typo that pops up when trying to use non-OpenAI embeddings. Let me know if I should put those in a different PR.

@pchalasani merged commit dc9477a into langroid:main on Apr 16, 2024
2 of 3 checks passed