Error when trying to import Tokenizer #1
Comments
Interesting, Windows is reading it as cp1252 instead of utf8. Are you using Python 3 or 2?
3.6
Very interesting now! Looks like an upstream bug/feature in CPython vs Windows... https://stackoverflow.com/questions/42070668/python-3-default-encoding-cp1252 My suggestion is to set the proper locale before starting the Python interpreter. If you're on cygwin: https://stackoverflow.com/questions/24255407/permanently-set-python-path-for-anaconda-within-cygwin If natively and globally on Windows, see https://www.java.com/en/download/help/locale.xml
BTW, do you get the same when you
I have the same issue. But not when importing nltk
Are you using Windows too?
Yes, Windows 10. Python 3.6.5
A quick workaround is to state the encoding explicitly as utf-8 in corpus.py line 37, i.e. change
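A minimal sketch of this kind of workaround, not the actual line from corpus.py: passing `encoding="utf-8"` to `open()` so the read no longer depends on the Windows locale. The helper `read_lines` and the sample text are hypothetical, and the `UnicodeDecodeError` shown for cp1252 is only an illustration of how such a mismatch can fail, not the traceback reported in this issue.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Read a text file with an explicit encoding (hypothetical helper).

    Without the encoding argument, open() falls back to the locale's
    preferred encoding, which is often cp1252 on Windows.
    """
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical demo data: the right curly quote is E2 80 9D in UTF-8,
# and byte 0x9D has no mapping in Python's cp1252 codec.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
    f.write("\u201cquoted\u201d\n")

# Explicit utf-8 works on any platform.
utf8_lines = read_lines(demo_path)

# Simulate what a cp1252 locale default would do with the same file.
try:
    read_lines(demo_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.remove(demo_path)
```

Stating the encoding at the call site keeps the behaviour identical across platforms, which is why it works as a fix inside a library even when users cannot change their locale.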
@lukedorney Hmmm.. It's weird that in Python 3 the default encoding is already utf8 but Windows is doing something strange in the locale such that it's not the default.
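One way to see where the mismatch comes from is to print the different "default" encodings Python exposes. This is a small diagnostic sketch; `encoding_report` is a hypothetical helper name. In Python 3, `sys.getdefaultencoding()` is utf-8, but `open()` without an `encoding` argument uses `locale.getpreferredencoding(False)`, which follows the OS locale.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings that Python treats as defaults (hypothetical helper)."""
    return {
        # Used for str <-> bytes conversions such as "text".encode().
        "sys.getdefaultencoding": sys.getdefaultencoding(),
        # Used by open() when no encoding argument is given.
        "locale.getpreferredencoding": locale.getpreferredencoding(False),
        # Used for file names, not file contents.
        "sys.getfilesystemencoding": sys.getfilesystemencoding(),
    }


report = encoding_report()
for name, value in report.items():
    print(f"{name}: {value}")
```

If the second value is cp1252 while the first is utf-8, any file opened without an explicit encoding is decoded as cp1252.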
@lukedorney @janwendt @sleighsoft I've added the patch and updated the package. Please tell me if you still face the same problems after
Going to close this issue. If there's any error in Windows from encoding problems again, please feel free to reopen this issue.
After the import "from sacremoses import MosesTokenizer" the following error occurs: