First line of the text file reads wrong #117
Comments
Do you have a sample code snippet that shows the error? Also, may I know how you obtained the "frequency_dictionary_en_82_765.txt" file, i.e., did you simply download it or copy/paste it into a new file? I am having trouble replicating the error you have described. This code snippet downloads the file from GitHub:
from pathlib import Path

import requests

# Download the dictionary file shipped with symspellpy
r = requests.get(
    "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/"
    "frequency_dictionary_en_82_765.txt"
)
path = Path.cwd() / "frequency_dictionary_en_82_765.txt"
# Write the raw bytes so the file's original encoding is preserved
with open(path, "wb") as outfile:
    outfile.write(r.content)
# Print the first line to check for stray leading characters
with open(path, "r") as infile:
    print(infile.readlines()[0])
Outputs:
I also tried the sample code snippet from the documentation; it does not show the error either:
I have similarly experienced this problem with symspellpy==6.7.6, but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt, which I downloaded manually before adding a few entries to the end of the file (there may be duplicate entries, but I don't think this is the cause). The files appear identical, so I didn't really know what the issue was, but thought you might be interested. Code snippet:
from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=6)
# https://symspellpy.readthedocs.io/en/latest/api/symspellpy.html#symspellpy.symspellpy.SymSpell.load_dictionary
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)
suggestions = sym_spell.lookup("the", Verbosity.CLOSEST, max_edit_distance=6)
frequency_dictionary_en_82_765.txt
I resolved this quite easily by adding a new line as the first line.
The dictionary file from the original SymSpell repository is saved with the UTF-8-BOM encoding, whereas the dictionary file provided by this repository is in UTF-8 encoding and should load properly. Also, could you try using
with the dictionary file from the original SymSpell repo (without any modifications) and see if it loads properly?
Interesting, I wouldn't have thought that the problem was related to the encoding until you mentioned it. Adding the encoding parameter indeed fixes the issue for the dictionary from the original SymSpell repository.
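For readers wondering why an encoding parameter helps: Python's "utf-8-sig" codec drops a leading byte order mark while decoding, whereas plain "utf-8" keeps it as the character U+FEFF. Below is a minimal standard-library sketch; the file name and its single line are made up for illustration, and the thread does not record the exact call that was suggested, so treating "utf-8-sig" as the value in question is an assumption.

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Illustrative file: one dictionary line preceded by a UTF-8 BOM
    path = Path(tmp) / "bom_sample.txt"
    path.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")

    # Plain UTF-8 decoding keeps the BOM as a leading U+FEFF character
    with open(path, "r", encoding="utf-8") as infile:
        plain_first = infile.readline()

    # The "utf-8-sig" codec strips the BOM if one is present
    with open(path, "r", encoding="utf-8-sig") as infile:
        sig_first = infile.readline()

print(repr(plain_first))
print(repr(sig_first))
```

With symspellpy, the analogous step would presumably be to pass encoding="utf-8-sig" to load_dictionary; check the library documentation for the exact parameter before relying on it.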
For the text file 'frequency_dictionary_en_82_765.txt',
the first line is "the 23135851162",
i.e. the word 'the' shows up 23135851162 times in the corpus.
Because of encoding issues, the word 'the' is not loaded into the dictionary in symspell._words;
instead, the word "\ufeffthe" is in symspell._words.
This happens only for the first line of the text file.
Hope I was clear.
Thanks!
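The symptom described above can be reproduced without symspellpy. The following is only a sketch: `parse_dictionary` is a hypothetical, simplified term/count loader (not symspellpy's implementation), and the second dictionary entry is made up. It shows that when a file starting with a UTF-8 byte order mark is decoded as plain UTF-8, only the first key picks up the "\ufeff" prefix.

```python
import codecs
import tempfile
from pathlib import Path


def parse_dictionary(path, encoding="utf-8"):
    """Hypothetical stand-in for a term/count loader (not symspellpy's code)."""
    words = {}
    with open(path, "r", encoding=encoding) as infile:
        for line in infile:
            parts = line.split()
            if len(parts) >= 2:
                words[parts[0]] = int(parts[1])
    return words


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "with_bom.txt"
    # First line matches the issue report; "example 42" is a made-up entry
    path.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\nexample 42\n")
    words = parse_dictionary(path)

print("the" in words)        # is the intended key present?
print("\ufeffthe" in words)  # is the BOM-prefixed key present?
print("example" in words)    # are later lines affected?
```

U+FEFF is not treated as whitespace by `str.split()`, which is why it stays attached to the first term instead of being discarded.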