
'cp950' codec error from load_dataset('xtreme', 'tydiqa') #347

Closed
jerryIsHere opened this issue Jul 7, 2020 · 10 comments
Labels
dataset bug A bug in a dataset script provided in the library

Comments

@jerryIsHere
Contributor

jerryIsHere commented Jul 7, 2020

[screenshot: error traceback]

I guess the error is related to a Python source-encoding issue: my PC is trying to decode the source code with the wrong codec. Perhaps relevant:
https://www.python.org/dev/peps/pep-0263/

I guess the error was triggered by the code " module = importlib.import_module(module_path)" at line 57 in the source code nlp/src/nlp/load.py (https://github.com/huggingface/nlp/blob/911d5596f9b500e39af8642fe3d1b891758999c7/src/nlp/load.py#L51)

Any ideas?

P.S. I tried the same code on Colab, where it runs perfectly.

@patpizio
Contributor

patpizio commented Jul 7, 2020

This is probably a Windows issue, we need to specify the encoding when load_dataset() reads the original CSV file.
Try to find the open() statement called by load_dataset() and add an encoding='utf-8' parameter.
See issues #242 and #307
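For context, the failure mode can be reproduced without the library at all: on a Traditional Chinese Windows install, open() defaults to the cp950 codec, which cannot decode the multi-byte UTF-8 sequences in the dataset files. A minimal sketch (the sample string is hypothetical, not taken from the actual dataset):

```python
# A UTF-8 encoded string containing a curly apostrophe (bytes e2 80 99).
raw = "it’s".encode("utf-8")

# Decoding with cp950 (the default codec on Traditional Chinese Windows)
# fails: 0xe2 starts a multi-byte cp950 sequence, but 0x80 is not a valid
# trail byte, producing the same UnicodeDecodeError reported above.
try:
    raw.decode("cp950")
except UnicodeDecodeError as e:
    print(e.encoding)  # cp950

# Decoding explicitly as UTF-8 works, which is why passing
# encoding='utf-8' to open() fixes the issue.
print(raw.decode("utf-8"))
```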

@lhoestq
Member

lhoestq commented Jul 9, 2020

It should be in xtreme.py:L755:

        if self.config.name == "tydiqa" or self.config.name.startswith("MLQA") or self.config.name == "SQuAD":
            with open(filepath) as f:
                data = json.load(f)

Could you try to add the encoding parameter:

open(filepath, encoding='utf-8')
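Applied to the block above, the patched read would look like this; sketched here against a temporary stand-in file with hypothetical content, since the real tydiqa download path isn't shown:

```python
import json
import os
import tempfile

# Stand-in for the downloaded tydiqa JSON file (hypothetical content);
# it is written as UTF-8, as the real dataset files are.
filepath = os.path.join(tempfile.mkdtemp(), "tydiqa-sample.json")
with open(filepath, "w", encoding="utf-8") as f:
    json.dump({"version": "1.1", "data": []}, f)

# The suggested fix: the explicit encoding makes the read independent
# of the platform's default code page (cp950 on the reporter's machine).
with open(filepath, encoding="utf-8") as f:
    data = json.load(f)
print(data["version"])  # 1.1
```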

@lhoestq lhoestq added the dataset bug A bug in a dataset script provided in the library label Jul 9, 2020
@lhoestq
Member

lhoestq commented Jul 13, 2020

Hello @jerryIsHere :) Did it work?
If so, we may change the dataset script to force the utf-8 encoding.

@jerryIsHere
Contributor Author

@lhoestq Sorry for being this late. I found 4 copies of xtreme.py and made the suggested change to all of them.
The problem is not solved.

@lhoestq
Member

lhoestq commented Jul 16, 2020

Could you provide the full error message, so that we can make sure it comes from the opening of tydiqa's JSON files?

@jerryIsHere
Contributor Author

@lhoestq
The error message is the same as before:
Exception has occurred: UnicodeDecodeError
'cp950' codec can't decode byte 0xe2 in position 111: illegal multibyte sequence
File "D:\python\test\test.py", line 3, in
dataset = load_dataset('xtreme', 'tydiqa')

[screenshot: error traceback]

As I said, I found 4 copies of xtreme.py and added the encoding='utf-8' parameter to the open() function calls.
These Python scripts were found under this directory:
C:\Users\USER\AppData\Local\Programs\Python\Python37\Lib\site-packages\nlp\datasets\xtreme

@ghazi-f
Contributor

ghazi-f commented Jul 21, 2020

Hi there!
I encountered the same issue with the IMDB dataset on Windows: it threw an error about charmap not being able to decode a symbol the first time I tried to download it. I checked on a remote Linux machine I have, and it can't be reproduced there.
I added encoding='UTF-8' to both lines that call open in imdb.py (108 and 114) and it worked for me.
Thank you!

@lhoestq
Member

lhoestq commented Jul 21, 2020

> Hi there!
> I encountered the same issue with the IMDB dataset on Windows: it threw an error about charmap not being able to decode a symbol the first time I tried to download it. I checked on a remote Linux machine I have, and it can't be reproduced there.
> I added encoding='UTF-8' to both lines that call open in imdb.py (108 and 114) and it worked for me.
> Thank you!

Hello!
Glad you managed to fix this issue on your side.
Do you mind opening a PR for IMDB?

@jerryIsHere
Contributor Author

> This is probably a Windows issue, we need to specify the encoding when load_dataset() reads the original CSV file.
> Try to find the open() statement called by load_dataset() and add an encoding='utf-8' parameter.
> See issues #242 and #307

Sorry for not responding for about a month.
I have just found that it is necessary to change / add the environment variable as described in #242.
Everything works after I added the new environment variable and restarted my PC.

I think the encoding issue on Windows isn't limited to the open() calls of a few specific datasets; it actually affects the entire library, depending on the machine / OS you use.
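The comment doesn't name the variable, but assuming it is the PYTHONUTF8 switch from PEP 540 (which forces Python into UTF-8 mode regardless of the locale code page), enabling it would look like:

```shell
# Assumption: the environment variable referenced in #242 is PYTHONUTF8.
# Setting it to 1 enables Python's UTF-8 mode (PEP 540, Python 3.7+),
# so open() defaults to UTF-8 instead of the locale code page (cp950).
export PYTHONUTF8=1            # bash; on Windows cmd: set PYTHONUTF8=1

# Verify that UTF-8 mode is active (prints 1 when enabled):
python -c "import sys; print(sys.flags.utf8_mode)"
```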

@lhoestq
Member

lhoestq commented Sep 7, 2020

Since #481 we shouldn't have other issues with encodings, as they are now set to "utf-8" by default.

Closing this one, but feel free to re-open if you have other questions.

@lhoestq lhoestq closed this as completed Sep 7, 2020
No branches or pull requests

4 participants