Error when trying to import Tokenizer #1

Closed
janwendt opened this issue May 9, 2018 · 11 comments

@janwendt

janwendt commented May 9, 2018

After the import "from sacremoses import MosesTokenizer" the following error occurs:

Traceback (most recent call last):
 File "moses.py", line 1, in <module>
   from sacremoses import MosesTokenizer
 File "C:\Anaconda3\lib\site-packages\sacremoses\__init__.py", line 3, in <module>
   from sacremoses.tokenize import *
 File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 16, in <module>
   class MosesTokenizer:
 File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 22, in MosesTokenizer
   IsN = text_type(''.join(perluniprops.chars('IsN')))
 File "C:\Anaconda3\lib\site-packages\sacremoses\corpus.py", line 38, in chars
   for ch in fin.read().strip():
 File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
   return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 65: character maps to <undefined>
@alvations
Contributor

Interesting, Windows is reading it as cp1252 instead of utf8. Are you using Python 3 or 2?

@janwendt
Author

3.6

@alvations
Contributor

alvations commented May 10, 2018

Very interesting now!

Looks like an upstream bug/feature in CPython vs Windows... https://stackoverflow.com/questions/42070668/python-3-default-encoding-cp1252

My suggestion is to set the proper locale before starting the Python interpreter.

If you're on cygwin: https://stackoverflow.com/questions/24255407/permanently-set-python-path-for-anaconda-within-cygwin

If you're running natively on Windows and want to set it globally, see https://www.java.com/en/download/help/locale.xml
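
For readers who would rather not change the system locale, one possible alternative is Python's UTF-8 mode, enabled by setting the PYTHONUTF8 environment variable before the interpreter starts. This is an assumption beyond what the thread discusses, and it needs Python 3.7 or later, so it would not have applied to the 3.6 installs reported here. A minimal sketch that launches a child interpreter with the variable set and asks it which encoding it would use by default:

```python
import os
import subprocess
import sys

# Copy the current environment and turn on UTF-8 mode for the child process.
# PYTHONUTF8 is only honoured by Python 3.7+.
env = dict(os.environ, PYTHONUTF8="1")

# Ask the child interpreter which encoding it falls back to for text files.
child_code = "import locale; print(locale.getpreferredencoding(False))"
output = subprocess.check_output([sys.executable, "-c", child_code], env=env)
reported = output.decode("ascii").strip()

print("Default text encoding in UTF-8 mode:", reported)
```

The variable has to be set before the interpreter starts, which is why the sketch spawns a new process instead of changing os.environ in the running one.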

@alvations
Contributor

BTW, do you get the same when you import nltk?

@sleighsoft

I have the same issue. But not when importing nltk

@alvations
Contributor

Are you using Windows too?

@sleighsoft

Yes, Windows 10. Python 3.6.5

@lukedorney

A quick workaround is to state the encoding explicitly as UTF-8 in corpus.py, line 37, i.e. change
with open(self.datadir+category+'.txt') as fin:
to
with open(self.datadir+category+'.txt', encoding='utf-8') as fin:
Note that this only works with Python 3; with Python 2 you'd need to use codecs.
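
The effect of this workaround can be reproduced outside sacremoses. The sketch below is not the library's code: read_chars and the temporary file are hypothetical stand-ins for the data-file reader in corpus.py. It uses io.open, which accepts an encoding argument on both Python 2 and Python 3, as one possible alternative to codecs. The sample text contains a character whose UTF-8 encoding includes byte 0x81, the byte named in the traceback.

```python
import io
import os
import tempfile


def read_chars(path, encoding="utf-8"):
    """Read a text file with an explicit encoding and return its characters."""
    # io.open takes `encoding` on both Python 2 and Python 3.
    with io.open(path, encoding=encoding) as fin:
        return list(fin.read().strip())


# U+0101 encodes to bytes C4 81 in UTF-8, so the file contains byte 0x81.
sample = u"\u0101\u00e9"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as fout:
    fout.write(sample.encode("utf-8"))

# Explicit UTF-8 decodes the file correctly on any platform.
chars_utf8 = read_chars(demo_path)

# Forcing cp1252 mimics what happened on Windows: byte 0x81 is unmapped.
try:
    read_chars(demo_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.remove(demo_path)
```

Passing the encoding explicitly makes the result independent of the machine's locale, which is why the same file reads cleanly under UTF-8 and fails under cp1252.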

@alvations
Contributor

@lukedorney Hmmm.. It's weird that in Python3 the default encoding is already utf8 but Windows is doing something strange in the locale such that it's not the default.
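
A likely explanation, offered here as background and not confirmed anywhere in the thread: Python 3 has two different defaults. The str/bytes default is always UTF-8, but open() in text mode without an encoding argument falls back to the locale's preferred encoding, which on many Windows installs is cp1252. A small sketch for checking both values on a given machine:

```python
import locale
import sys

# Encoding used for str <-> bytes conversions when none is given;
# this is always "utf-8" on Python 3.
str_default = sys.getdefaultencoding()

# Encoding that open() falls back to in text mode when `encoding` is omitted;
# it comes from the locale and differs between platforms.
open_default = locale.getpreferredencoding(False)

print("str/bytes default:", str_default)
print("open() default:   ", open_default)
```

If the second value is not UTF-8, any open() call without an explicit encoding will decode UTF-8 data files with the wrong codec.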

@alvations
Contributor

@lukedorney @janwendt @sleighsoft I've added the patch and updated the package.

Please tell me if you still face the same problems after

pip install -U sacremoses

@alvations
Contributor

Going to close this issue. If encoding errors show up on Windows again, please feel free to reopen it.
