First line of the text file reads wrong #117

moran-trullion · 2022-04-04T11:51:48Z

For the text file 'frequency_dictionary_en_82_765.txt'
the first line is "the 23135851162"
i.e. the word 'the' shows up 23135851162 times in the corpus.

Because of an encoding issue, the word 'the' is not loaded into the dictionary in symspell._words;
instead, the word "\ufeffthe" is in symspell._words.

This happens only for the first line of the text file.
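A minimal sketch of how a leading "\ufeff" can end up attached to the first word, using only the standard library. The two-line file and its name are hypothetical stand-ins for the real dictionary, and the parsing here is a plain `split()`, not symspellpy's own loader:

```python
import codecs
import tempfile
from pathlib import Path

# Hypothetical two-line dictionary saved as UTF-8 with a BOM (bytes EF BB BF).
data = codecs.BOM_UTF8 + b"the 23135851162\nof 13151942776\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bom_dictionary.txt"
    path.write_bytes(data)

    # Plain UTF-8 decodes the BOM as the character U+FEFF,
    # which stays glued to the first word of the first line.
    with open(path, "r", encoding="utf-8") as infile:
        words_utf8 = [line.split()[0] for line in infile]

    # "utf-8-sig" drops a leading BOM if one is present.
    with open(path, "r", encoding="utf-8-sig") as infile:
        words_sig = [line.split()[0] for line in infile]

print(words_utf8)  # ['\ufeffthe', 'of']
print(words_sig)   # ['the', 'of']
```

Only the first entry is affected, because a BOM can appear only at the very start of the file.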
Hope I was clear.

Thanks!

mammothb · 2022-04-09T06:43:27Z

Do you have a sample code snippet that shows the error? Also, may I know how you obtained the "frequency_dictionary_en_82_765.txt" file, i.e., did you simply download it or copy/paste its contents into a new file?

I have trouble replicating the error you have described. This code snippet downloads the file from GitHub:

from pathlib import Path

import requests

r = requests.get(
    "https://raw.githubusercontent.com/mammothb/symspellpy/master/symspellpy/"
    "frequency_dictionary_en_82_765.txt"
)

path = Path.cwd() / "frequency_dictionary_en_82_765.txt"
with open(path, "wb") as outfile:
    outfile.write(r.content)

with open(path, "r") as infile:
    print(infile.readlines()[0])

Outputs:

the 23135851162

I also tried the sample code snippet from the documentation; it does not show the error either:

[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]

crazoter · 2022-05-07T10:32:33Z

I have experienced a similar problem with symspellpy==6.7.6, but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt, which I downloaded manually and then appended a few entries to by hand (there may be duplicate entries, but I don't think that is the cause).

The files appear identical, so I didn't really know what the issue was, but I thought you might be interested.

Code snippet:

from symspellpy import SymSpell, Verbosity

sym_spell = SymSpell(max_dictionary_edit_distance=6)
# https://symspellpy.readthedocs.io/en/latest/api/symspellpy.html#symspellpy.symspellpy.SymSpell.load_dictionary
sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1)
suggestions = sym_spell.lookup("the", Verbosity.CLOSEST, max_edit_distance=6)

frequency_dictionary_en_82_765.txt

I resolved this quite easily by adding an empty line as the first line of the file.

mammothb · 2022-05-07T10:55:49Z

> but I was using https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt

The dictionary file from the original SymSpell repository is saved with the UTF-8-BOM encoding, whereas load_dictionary() opens the file using UTF-8 encoding by default. This could have resulted in the extra "\ufeff" character at the start of the first line.

The dictionary file provided by this repository is in UTF-8 encoding and should load properly.
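One way to tell the two variants of the file apart is to look at the first three bytes. This is a rough sketch using only the standard library; `starts_with_utf8_bom` and the temporary file names are hypothetical helpers made up for illustration, not part of symspellpy:

```python
import codecs
import tempfile
from pathlib import Path


def starts_with_utf8_bom(path):
    """Return True if the file begins with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as infile:
        return infile.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the two dictionary files: one saved with a BOM, one without.
    with_bom = Path(tmp) / "with_bom.txt"
    without_bom = Path(tmp) / "without_bom.txt"
    with_bom.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")
    without_bom.write_bytes(b"the 23135851162\n")

    has_bom = starts_with_utf8_bom(with_bom)
    no_bom = starts_with_utf8_bom(without_bom)

print(has_bom, no_bom)  # True False
```

Two files can look identical in an editor and still differ in this way, since most editors do not display the BOM.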

Also, could you try using

sym_spell.load_dictionary("frequency_dictionary_en_82_765.txt", 0, 1, encoding="utf_8_sig")

with the dictionary file from the original SymSpell repo (without any modifications) and see if it loads properly?

crazoter · 2022-05-09T02:08:24Z

Interesting, I wouldn't have thought that the problem was related to the encoding until you mentioned it. Adding the encoding parameter indeed fixes the issue for the dictionary from the original SymSpell repository.
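For anyone who would rather fix the file once instead of passing an encoding on every load, a possible alternative is to re-save it without the BOM. This is only a sketch under that assumption; `strip_utf8_bom` and the file names are hypothetical, and it works on raw bytes so the rest of the file is left untouched:

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(src, dst):
    """Copy src to dst, dropping a leading UTF-8 BOM if there is one."""
    data = Path(src).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    Path(dst).write_bytes(data)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a dictionary file saved as UTF-8 with a BOM.
    src = Path(tmp) / "dictionary_with_bom.txt"
    dst = Path(tmp) / "dictionary_no_bom.txt"
    src.write_bytes(codecs.BOM_UTF8 + b"the 23135851162\n")

    strip_utf8_bom(src, dst)
    cleaned = dst.read_bytes()

print(cleaned)  # b'the 23135851162\n'
```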

mammothb closed this as completed Nov 20, 2023
