Fix encoding to render Arabic text correctly. #29

mzeidhassan · 2019-02-13T21:20:48Z

Referencing issue https://github.com/mammothb/symspellpy/issues/28: I fixed the Arabic rendering problem by changing these two lines in the symspellpy.py file.
I hope this helps.

mammothb · 2019-02-13T22:56:04Z

It looks like you have just changed the default argument value to utf-8. Are you not able to achieve the same result by calling

create_dictionary(<path/to/corpus>, encoding="utf-8")

and

load_dictionary(<path/to/corpus>, <term_index>, <count_index>, encoding="utf-8")

instead? If not, could you please send me the corpus you used so I can debug the issue?

mzeidhassan · 2019-02-14T05:02:34Z

I tried a couple of options to enforce the encoding, but nothing worked. It only worked when I changed the encoding in the symspellpy.py file.

Attached is a sample file to test. Thanks for your support. Please feel free to reject this PR if you wish.

One last question:
Why should we use 'encoding=None'? Is this for compatibility reasons with older versions of Python?

sample_text_ar.txt
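
On the `encoding=None` question: for Python's built-in `open()`, `encoding=None` means "use the platform default". Outside UTF-8 mode that is the encoding reported by `locale.getpreferredencoding(False)`, which is often not UTF-8 on Windows. The sketch below is not taken from symspellpy and assumes the library simply forwards its `encoding` argument to `open()`. `round_trip` is a hypothetical helper, and `latin-1` only stands in for a non-UTF-8 default to show how the same bytes decode to different text.

```python
import locale
import os
import tempfile


def round_trip(text, read_encoding):
    """Write `text` as UTF-8 to a temp file, then read it back with `read_encoding`."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt")
        with open(path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
        # encoding=None here would defer to the platform default instead
        with open(path, encoding=read_encoding) as infile:
            return infile.read()


if __name__ == "__main__":
    word = "كيفية"
    # What open() falls back to when encoding=None (outside UTF-8 mode)
    print("platform default:", locale.getpreferredencoding(False))
    # Reading with the encoding the file was written in restores the word
    print("utf-8 round trip intact:", round_trip(word, "utf-8") == word)
    # A single-byte codec turns every UTF-8 byte into its own character
    print("latin-1 result:", ascii(round_trip(word, "latin-1")))
```

If the platform default already reports UTF-8, passing `encoding="utf-8"` explicitly changes nothing, which is one reason a library might keep `None` as the default and leave the choice to the caller.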

mammothb · 2019-02-14T08:04:10Z

I think this could instead be a problem with how the terminal displays Arabic text, due to the special features of the Arabic script mentioned here. The terminal doesn't display the text properly, while the text file does.

This script demonstrates that lookup still works properly.

import os

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

def main():
    sym_spell = SymSpell(83000, 2, 7)
    # load dictionary
    corpus_path = os.path.join(os.path.dirname(__file__),
                               "sample_text_ar.txt")
    dict_path = os.path.join(os.path.dirname(__file__), "dict.txt")
    corrected_path = os.path.join(os.path.dirname(__file__), "corrected.txt")
    with open(corpus_path, encoding="utf-8-sig") as infile:
        for line in infile:
            print(line)

Last portion of the sample_text_ar.txt

while the output in the console is

If you copy and paste this word into a text file, it is displayed as كيفية, which suggests that the file is read properly but displayed incorrectly.

    if not sym_spell.create_dictionary(corpus_path, encoding="utf-8-sig"):
        print("Corpus file not found")
        return

    with open(dict_path, "w", encoding="utf-8") as outfile:
        for key, count in sym_spell.words.items():
            print("{} {}".format(key, count))
            outfile.write("{} {}\n".format(key, count))

The console output is

while dict.txt shows

and we are able to search for الزعامة in the original text

demonstrating that the words stored in sym_spell have the correct spelling.

    results = sym_spell.lookup("كفية", Verbosity.TOP)
    with open(corrected_path, "w", encoding="utf-8") as outfile:
        for result in results:
            print(result)
            outfile.write(str(result))

Finally, we can correct a misspelled word from كفية to كيفية. Hopefully I didn't make a mistake here as I don't understand Arabic and am simply comparing the shape of the words.
Console output:
corrected.txt:

if __name__ == "__main__":
    main()
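
The correction above is a single-character fix: كيفية differs from كفية by one inserted letter, which is within the maximum edit distance of 2 passed to `SymSpell(83000, 2, 7)`. As a rough illustration, the sketch below computes a plain Levenshtein distance. This is a stand-in chosen for brevity, not symspellpy's own distance code, and `levenshtein` is a hypothetical helper.

```python
def levenshtein(source, target):
    """Edit distance counting insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`
    previous = list(range(len(target) + 1))
    for i, src_char in enumerate(source, start=1):
        current = [i]
        for j, tgt_char in enumerate(target, start=1):
            cost = 0 if src_char == tgt_char else 1
            current.append(min(
                previous[j] + 1,          # delete src_char
                current[j - 1] + 1,       # insert tgt_char
                previous[j - 1] + cost,   # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    misspelled = "كفية"
    corrected = "كيفية"
    print(levenshtein(misspelled, corrected))
```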

So I think you could look into an alternative terminal that supports displaying Arabic script, or write all your output to a text file for debugging.
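
Another way to debug without depending on how a terminal or IDE renders Arabic is to print the code points rather than the glyphs. The sketch below uses only the standard library; `describe` is a hypothetical helper, and the sample word is assumed to be the كيفية used earlier in this thread.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per character of `text`."""
    return [
        "U+{:04X} {}".format(ord(char), unicodedata.name(char, "<unnamed>"))
        for char in text
    ]


if __name__ == "__main__":
    # The output is plain ASCII, so it reads the same in any console
    for line in describe("كيفية"):
        print(line)
```

If two strings produce the same list, they hold the same characters, whatever the console shows.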

mzeidhassan · 2019-02-14T14:46:07Z

@mammothb Thank you so much for taking the time. Glad that it looks OK.

One more thing: I was not using a terminal. I modified the dictionary-creation script to write its output to a file. Before doing that, I was using PyCharm, and the Arabic text was still garbled in the PyCharm output window.

The code I used was this:

    with open("ar-freq-dict.txt", "w", encoding="utf-8") as f:
        for key, count in sym_spell.words.items():
            f.write("{} {}\n".format(key, count))

I will use your code above and test it.

Thanks a million for your support and for providing such an amazing port of Symspell. Thank you!

mzeidhassan · 2019-02-15T17:44:37Z

Thanks @mammothb! I tested the code and it now works fine on my end with Python 3.5.2. I will close this issue and the other GitHub issue. Thanks again for your support.

Update symspellpy.py

69ea26a

mzeidhassan mentioned this pull request Feb 13, 2019

Arabic output garbled from dictionary creation #28

Closed

mzeidhassan closed this Feb 15, 2019
