Error when trying to import Tokenizer #1

janwendt · 2018-05-09T13:15:28Z

After the import "from sacremoses import MosesTokenizer" the following error occurs:

Traceback (most recent call last):
  File "moses.py", line 1, in <module>
    from sacremoses import MosesTokenizer
  File "C:\Anaconda3\lib\site-packages\sacremoses\__init__.py", line 3, in <module>
    from sacremoses.tokenize import *
  File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 16, in <module>
    class MosesTokenizer:
  File "C:\Anaconda3\lib\site-packages\sacremoses\tokenize.py", line 22, in MosesTokenizer
    IsN = text_type(''.join(perluniprops.chars('IsN')))
  File "C:\Anaconda3\lib\site-packages\sacremoses\corpus.py", line 38, in chars
    for ch in fin.read().strip():
  File "C:\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 65: character maps to <undefined>

alvations · 2018-05-10T00:11:32Z

Interesting, Windows is reading it as cp1252 instead of utf8. Are you using Python 3 or 2?

janwendt · 2018-05-10T10:25:42Z

3.6

alvations · 2018-05-10T11:31:48Z

Very interesting now!

Looks like an upstream bug/feature in CPython vs Windows... https://stackoverflow.com/questions/42070668/python-3-default-encoding-cp1252

My suggestion is to set the proper locale before starting the Python interpreter.

If you're on cygwin: https://stackoverflow.com/questions/24255407/permanently-set-python-path-for-anaconda-within-cygwin

If you're running natively on Windows and want to set it globally, see https://www.java.com/en/download/help/locale.xml

alvations · 2018-05-10T11:33:14Z

BTW, do you get the same when you import nltk?

sleighsoft · 2018-06-06T08:56:59Z

I have the same issue, but not when importing nltk.

alvations · 2018-06-06T10:18:39Z

Are you using Windows too?

sleighsoft · 2018-06-07T14:35:49Z

Yes, Windows 10. Python 3.6.5

lukedorney · 2018-06-16T16:04:00Z

A quick workaround is to state the encoding explicitly as utf-8 in corpus.py, line 37, i.e. change
with open(self.datadir+category+'.txt') as fin:
to
with open(self.datadir+category+'.txt', encoding='utf-8') as fin:
although this only works with Python 3 (with Python 2 you'd need to use codecs).
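
A minimal sketch of the same idea that should run on both Python 2 and Python 3, assuming io.open is used instead of the built-in open (io.open accepts encoding= on both). The read_chars helper, the temp file and the sample characters are hypothetical and only there to reproduce the failure; this is not the actual sacremoses patch.

```python
import io
import os
import tempfile


def read_chars(path):
    """Read a data file as UTF-8 regardless of the platform locale."""
    # io.open takes an encoding argument on both Python 2 and Python 3.
    with io.open(path, encoding='utf-8') as fin:
        return fin.read().strip()


# Hypothetical sample data: U+0141 encodes to the bytes C5 81 in UTF-8,
# and byte 0x81 is undefined in Python's cp1252 codec.
sample = u'\u0141\u00e9'
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as fout:
    fout.write(sample)

# Reading the file as cp1252 reproduces the error from the traceback.
try:
    with io.open(path, encoding='cp1252') as fin:
        fin.read()
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Reading with an explicit UTF-8 encoding works on any platform.
utf8_text = read_chars(path)
os.remove(path)
```

Passing the encoding explicitly means the result no longer depends on how the machine's locale is configured.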

alvations · 2018-06-19T00:56:16Z

@lukedorney Hmmm.. It's weird that in Python3 the default encoding is already utf8 but Windows is doing something strange in the locale such that it's not the default.
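
For context, a small sketch of where the two defaults come from in Python 3, as far as I know: sys.getdefaultencoding() is the str/bytes default and is utf-8, while open() in text mode without an encoding argument falls back to locale.getpreferredencoding(False), which is typically cp1252 on a Western-European Windows install. The variable names are just for illustration.

```python
import locale
import sys

# Default used by str.encode() / bytes.decode(); 'utf-8' on Python 3.
str_default = sys.getdefaultencoding()

# Default used by open() in text mode when no encoding is passed.
# This depends on the OS locale, so it can differ between machines.
open_default = locale.getpreferredencoding(False)

print('str/bytes default:', str_default)
print('open() default:   ', open_default)
```

If the second value is not utf-8 on your machine, any open() call without an explicit encoding will decode UTF-8 data files with the locale codec instead.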

alvations · 2018-06-19T07:07:41Z

@lukedorney @janwendt @sleighsoft I've added the patch and updated the package.

Please tell me if you still face the same problems after

pip install -U sacremoses

alvations · 2018-12-17T16:48:02Z

Going to close this issue. If there's any error in Windows from encoding problems again, please feel free to reopen this issue.

alvations added the windows label May 10, 2018

alvations closed this as completed Dec 17, 2018
