First line of the text file reads wrong #117

Closed
moran-trullion opened this issue Apr 4, 2022 · 4 comments

Comments

@moran-trullion

For the text file 'frequency_dictionary_en_82_765.txt'
the first line is "the 23135851162"
i.e. the word 'the' shows up 23135851162 times in the corpus.

Because of an encoding issue, the word 'the' is not loaded into the dictionary in symspell._words.
Instead, the word "\ufeffthe" is in symspell._words.

This happens only for the first line of the text file.
Hope that's clear.

Thanks!

@mammothb
Owner

mammothb commented Apr 9, 2022

Do you have a sample code snippet that reproduces the error? Also, how did you obtain the "frequency_dictionary_en_82_765.txt" file: did you download it directly, or copy/paste its contents into a new file?

I have trouble replicating the error you described. This code snippet downloads the file from GitHub:

from pathlib import Path

import requests

r = requests.get(
    "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/"
    "frequency_dictionary_en_82_765.txt"
)

path = Path.cwd() / "frequency_dictionary_en_82_765.txt"
with open(path, "wb") as outfile:
    outfile.write(r.content)

with open(path, "r") as infile:
    print(infile.readlines()[0])

Outputs:

the 23135851162
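One caveat when checking this by eye: a byte order mark decodes to U+FEFF, a zero-width character, so a printed first line typically looks the same with or without it. A minimal sketch of a byte-level check follows; `describe_first_line` is a hypothetical helper written for this illustration, and the two temporary files stand in for the real dictionary files.

```python
import codecs
import tempfile
from pathlib import Path


def describe_first_line(path):
    """Read the first line as raw bytes and report whether it starts with a UTF-8 BOM."""
    with open(path, "rb") as infile:
        first = infile.readline()
    return {"has_utf8_bom": first.startswith(codecs.BOM_UTF8), "raw": first}


# Stand-in files (assumption): same first line, one saved with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain.txt"
    with_bom = Path(tmp) / "with_bom.txt"
    plain.write_bytes(b"the 23135851162\n")
    with_bom.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")

    plain_info = describe_first_line(plain)
    bom_info = describe_first_line(with_bom)

print(plain_info)
print(bom_info)
```

Printing `repr()` of the decoded line instead of the line itself would expose the same difference.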

I also tried the sample code snippet from the documentation; it does not show the error either:

[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]

@crazoter

crazoter commented May 7, 2022

I have experienced a similar problem with symspellpy==6.7.6, but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt which I downloaded manually and then appended a few entries to by hand (there may be duplicate entries, but I don't think they are the cause).

The files appear identical, so I didn't really know what the issue was, but thought you might be interested.

Code snippet:

from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=6)
# https://symspellpy.readthedocs.io/en/latest/api/symspellpy.html#symspellpy.symspellpy.SymSpell.load_dictionary
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)
suggestions = sym_spell.lookup("the", Verbosity.CLOSEST, max_edit_distance=6)

frequency_dictionary_en_82_765.txt

I resolved this quite easily by adding a new line as the first line.
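A plausible reason this workaround helps is that the stray "\ufeff" character then lands on a line with no term/count pair, and a loader that skips malformed lines drops it. The sketch below illustrates that with `parse_lines`, a simplified stand-in parser written for this example; it is an assumption about the general behaviour, not symspellpy's actual loading code.

```python
def parse_lines(lines, term_index=0, count_index=1, separator=" "):
    """Simplified stand-in for a frequency dictionary loader (NOT symspellpy's code).

    Lines with fewer than two fields are skipped.
    """
    words = {}
    for line in lines:
        parts = line.rstrip().split(separator)
        if len(parts) < 2:
            continue  # malformed or empty line
        words[parts[term_index]] = int(parts[count_index])
    return words


BOM = "\ufeff"  # what a UTF-8 BOM becomes when decoded as plain UTF-8

# File as downloaded: the BOM is glued to the first term.
original = [BOM + "the 23135851162\n", "of 13151942776\n"]
# Same file with an empty line inserted at the top: the BOM sits alone on line 1.
patched = [BOM + "\n", "the 23135851162\n", "of 13151942776\n"]

original_words = parse_lines(original)
patched_words = parse_lines(patched)

print("the" in original_words, "the" in patched_words)
```

With the BOM isolated on its own line, the first real entry is keyed by "the" rather than "\ufeffthe".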

@mammothb
Owner

mammothb commented May 7, 2022

but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt

The dictionary file from the original SymSpell repository is saved with the UTF-8-BOM encoding, while load_dictionary() opens the file using UTF-8 encoding by default. Decoded as plain UTF-8, the byte order mark survives as the character "\ufeff" at the start of the first line, which would explain the extra character in the first word.
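The effect can be reproduced without symspellpy at all. This is a small sketch using only the standard library; the byte string is a hand-built stand-in for the first line of a BOM-prefixed dictionary file.

```python
import codecs

# First line of a dictionary file as stored on disk with a UTF-8 BOM (stand-in data).
raw = codecs.BOM_UTF8 + "the 23135851162\n".encode("utf-8")

# Plain UTF-8 keeps the BOM as a regular character (U+FEFF).
as_utf8 = raw.decode("utf-8")
# The "utf-8-sig" codec strips a leading BOM while decoding.
as_utf8_sig = raw.decode("utf-8-sig")

term_utf8 = as_utf8.split()[0]
term_utf8_sig = as_utf8_sig.split()[0]

print(repr(term_utf8), repr(term_utf8_sig))
```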

The dictionary file provided by this repository is in UTF-8 encoding (without a BOM) and should load properly.

Also, could you try using

sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1, encoding="utf_8_sig")

with the dictionary file from the original SymSpell repo (without any modifications) and see if it loads properly?

@crazoter

crazoter commented May 9, 2022

Interesting, I wouldn't have thought that the problem was related to the encoding until you mentioned it. Adding the encoding parameter indeed fixes the issue for the dictionary from the original SymSpell repo.
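For anyone unsure which variant of the file they have, the "utf_8_sig" codec also reads files that have no BOM, so it should be safe to pass in either case. A minimal sketch with Python's built-in open(); `first_term` is a hypothetical helper and the temporary files are stand-ins for the two dictionary variants.

```python
import codecs
import tempfile
from pathlib import Path


def first_term(path, encoding):
    """Return the first whitespace-separated token on the first line of the file."""
    with open(path, "r", encoding=encoding) as infile:
        return infile.readline().split()[0]


with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "plain.txt"
    with_bom = Path(tmp) / "with_bom.txt"
    plain.write_bytes(b"the 23135851162\n")
    with_bom.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")

    # Same codec on both files: the BOM is skipped if present, ignored if absent.
    term_plain = first_term(plain, "utf_8_sig")
    term_with_bom = first_term(with_bom, "utf_8_sig")
    # For comparison, plain UTF-8 on the BOM file keeps the stray character.
    term_with_bom_utf8 = first_term(with_bom, "utf_8")

print(repr(term_plain), repr(term_with_bom), repr(term_with_bom_utf8))
```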
