Fix encoding to render Arabic text correctly. #29

Closed
wants to merge 1 commit into from


mzeidhassan

Referencing issue https://github.com/mammothb/symspellpy/issues/28: the Arabic rendering problem was fixed by changing these 2 lines in the symspellpy.py file.
I hope this helps.

@mammothb
Owner

It looks like you have just changed the default argument value to utf-8. Are you not able to achieve the same result by calling

create_dictionary(<path/to/corpus>, encoding="utf-8")

and

load_dictionary(<path/to/corpus>, <term_index>, <count_index>, encoding="utf-8")

instead? If that doesn't work, could you please send me the corpus you used so I can debug the issue?

@mzeidhassan
Author

mzeidhassan commented Feb 14, 2019

I tried a couple of options to enforce the encoding, but nothing worked. It only worked when I changed the encoding in the symspellpy.py file.

Attached is a sample file to test. Thanks for your support. Please feel free to reject this PR if you wish.

One last question:
Why should we use 'encoding=None'? Is this for compatibility reasons with older versions of Python?

sample_text_ar.txt
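
On the `encoding=None` question: in Python 3, passing `encoding=None` to the built-in `open()` means "use the platform default", which is typically what `locale.getpreferredencoding(False)` reports, so it is not related to older Python versions. Assuming symspellpy simply forwards its `encoding` argument to `open()` (not confirmed in this thread), a minimal sketch of why a non-UTF-8 default can mis-decode UTF-8 text while an explicit `encoding="utf-8"` does not. The temporary file and the `latin-1` stand-in for a legacy default are illustrative choices only:

```python
import locale
import os
import tempfile

# What open(..., encoding=None) typically falls back to on this machine.
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

word = "كيفية"

# Write the word as UTF-8 bytes to a temporary file
# (a hypothetical stand-in for a corpus file).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(word.encode("utf-8"))
    tmp_path = tmp.name

try:
    # Explicit UTF-8: decodes back to the original word on any platform.
    with open(tmp_path, encoding="utf-8") as infile:
        read_explicit = infile.read()

    # Simulate a machine whose default is a legacy single-byte encoding
    # (latin-1 is chosen purely for illustration).
    with open(tmp_path, encoding="latin-1") as infile:
        read_legacy = infile.read()
finally:
    os.remove(tmp_path)

print("explicit utf-8 round-trips:", read_explicit == word)
print("legacy decoding round-trips:", read_legacy == word)
```

If the first line printed is already a UTF-8 variant, the default and the explicit argument behave the same for this file, which would point away from decoding as the cause.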

@mammothb
Owner

I think this could instead be a problem with how the terminal displays Arabic text, due to the special features of the Arabic script mentioned here. The terminal doesn't display the text properly, while the text file does.

This script demonstrates that lookup still works properly.

import os

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module

def main():
    sym_spell = SymSpell(83000, 2, 7)
    # load dictionary
    corpus_path = os.path.join(os.path.dirname(__file__),
                               "sample_text_ar.txt")
    dict_path = os.path.join(os.path.dirname(__file__), "dict.txt")
    corrected_path = os.path.join(os.path.dirname(__file__), "corrected.txt")
    with open(corpus_path, encoding="utf-8-sig") as infile:
        for line in infile:
            print(line)

Last portion of the sample_text_ar.txt
img
while the output in the console is
img
If you were to copy and paste this word img into a text file, it is displayed as كيفية, which suggests that the file is read properly but displayed incorrectly.

    if not sym_spell.create_dictionary(corpus_path, encoding="utf-8-sig"):
        print("Corpus file not found")
        return

    with open(dict_path, "w", encoding="utf-8") as outfile:
        for key, count in sym_spell.words.items():
            print("{} {}".format(key, count))
            outfile.write("{} {}\n".format(key, count))

The console output is
img
while dict.txt shows
img
and we are able to search for الزعامة in the original text
img
demonstrating that the words stored in sym_spell have the correct spelling.

    results = sym_spell.lookup("كفية", Verbosity.TOP)
    with open(corrected_path, "w", encoding="utf-8") as outfile:
        for result in results:
            print(result)
            outfile.write(str(result))

Finally, we can correct a misspelled word from كفية to كيفية. Hopefully I didn't make a mistake here as I don't understand Arabic and am simply comparing the shape of the words.
Console output: img
corrected.txt: img

if __name__ == "__main__":
    main()

So I think you could look into an alternative terminal that supports displaying Arabic script, or write all your output to a text file for debugging.
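
Another way to separate "read incorrectly" from "displayed incorrectly" is to print code points instead of the glyphs themselves, since ASCII output does not depend on how the console shapes or orders Arabic letters. This is only a sketch: `describe` is a hypothetical helper, not part of symspellpy, and the two words are the ones used in the lookup above.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [("U+{:04X}".format(ord(ch)), unicodedata.name(ch, "UNKNOWN"))
            for ch in text]


misspelled = "كفية"
corrected = "كيفية"

# The output is plain ASCII, so it reads the same in any terminal or IDE.
for label, word in (("misspelled", misspelled), ("corrected", corrected)):
    print(label)
    for code_point, name in describe(word):
        print("  {} {}".format(code_point, name))
```

If the names printed for a word read from the corpus are Arabic letter names, the data was decoded correctly and any garbling is a display issue. A list of Latin-1 or replacement characters would instead point to a decoding problem.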

@mzeidhassan
Author

mzeidhassan commented Feb 14, 2019

@mammothb Thank you so much for taking the time. Glad that it looks OK.

One more thing: I was not using a terminal. I modified the create dictionary script to print out to a file. Before doing that, I was using PyCharm, and the Arabic text was still garbled in the PyCharm output window.

The code I used was this:

    with open("ar-freq-dict.txt", "w", encoding="utf-8") as f:
        for key, count in sym_spell.words.items():
            f.write("{} {}\n".format(key, count))

I will use your code above and test it.

Thanks a million for your support and for providing such an amazing port of Symspell. Thank you!

@mzeidhassan
Author

Thanks @mammothb! I tested the code and it works fine on my end now with Python 3.5.2. I will close this issue and the other GitHub issue. Thanks again for your support.
