
'cp950' codec error from load_dataset('xtreme', 'tydiqa') #347

Closed
jerryIsHere opened this issue Jul 7, 2020 · 10 comments
Labels
dataset bug A bug in a dataset script provided in the library

Comments

@jerryIsHere
Contributor

jerryIsHere commented Jul 7, 2020

[screenshot: error traceback]

I guess the error is related to a Python source-encoding issue: my PC is trying to decode the source code with the wrong codec. Perhaps relevant:
https://www.python.org/dev/peps/pep-0263/

I guess the error was triggered by the code " module = importlib.import_module(module_path)" at line 57 in the source code nlp/src/nlp/load.py (https://github.com/huggingface/nlp/blob/911d5596f9b500e39af8642fe3d1b891758999c7/src/nlp/load.py#L51)

Any ideas?

P.S. I tried the same code on Colab, where it runs perfectly.

@patpizio
Contributor

patpizio commented Jul 7, 2020

This is probably a Windows issue, we need to specify the encoding when load_dataset() reads the original CSV file.
Try to find the open() statement called by load_dataset() and add an encoding='utf-8' parameter.
See issues #242 and #307
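For context, the failure mode can be reproduced without the library at all: on a Traditional Chinese Windows install, open() defaults to the cp950 codec, which cannot decode the multi-byte UTF-8 sequences in the dataset files. A minimal sketch (the sample string is hypothetical, not taken from the actual dataset):

```python
# A UTF-8 encoded string containing a curly apostrophe (bytes e2 80 99).
raw = "it’s".encode("utf-8")

# Decoding with cp950 (the default codec on Traditional Chinese Windows)
# fails: 0xe2 starts a multi-byte cp950 sequence, but 0x80 is not a valid
# trail byte, producing the same UnicodeDecodeError reported above.
try:
    raw.decode("cp950")
except UnicodeDecodeError as e:
    print(e.encoding)  # cp950

# Decoding explicitly as UTF-8 works, which is why passing
# encoding='utf-8' to open() fixes the issue.
print(raw.decode("utf-8"))
```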

@lhoestq
Member

lhoestq commented Jul 9, 2020

It should be in xtreme.py:L755:

        if self.config.name == "tydiqa" or self.config.name.startswith("MLQA") or self.config.name == "SQuAD":
            with open(filepath) as f:
                data = json.load(f)

Could you try to add the encoding parameter:

open(filepath, encoding='utf-8')
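Applied to the block above, the patched read would look like this; sketched here against a temporary stand-in file with hypothetical content, since the real tydiqa download path isn't shown:

```python
import json
import os
import tempfile

# Stand-in for the downloaded tydiqa JSON file (hypothetical content);
# it is written as UTF-8, as the real dataset files are.
filepath = os.path.join(tempfile.mkdtemp(), "tydiqa-sample.json")
with open(filepath, "w", encoding="utf-8") as f:
    json.dump({"version": "1.1", "data": []}, f)

# The suggested fix: the explicit encoding makes the read independent
# of the platform's default code page (cp950 on the reporter's machine).
with open(filepath, encoding="utf-8") as f:
    data = json.load(f)
print(data["version"])  # 1.1
```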

@lhoestq lhoestq added the dataset bug A bug in a dataset script provided in the library label Jul 9, 2020
@lhoestq
Member

lhoestq commented Jul 13, 2020

Hello @jerryIsHere :) Did it work?
If so, we may change the dataset script to force the utf-8 encoding.

@jerryIsHere
Contributor Author

@lhoestq Sorry for being this late. I found 4 copies of xtreme.py and made the suggested change to all of them.
The problem is not solved.

@lhoestq
Member

lhoestq commented Jul 16, 2020

Could you provide the full error message, so that we can make sure it comes from the opening of tydiqa's JSON files?

@jerryIsHere
Contributor Author

@lhoestq
The error message is the same as before:
Exception has occurred: UnicodeDecodeError
'cp950' codec can't decode byte 0xe2 in position 111: illegal multibyte sequence
File "D:\python\test\test.py", line 3, in
dataset = load_dataset('xtreme', 'tydiqa')

[screenshot: error traceback]

As I said, I found 4 copies of xtreme.py and added the encoding='utf-8' parameter to the open() function calls.
These Python scripts were found under this directory:
C:\Users\USER\AppData\Local\Programs\Python\Python37\Lib\site-packages\nlp\datasets\xtreme

@ghazi-f
Contributor

ghazi-f commented Jul 21, 2020

Hi there!
I encountered the same issue with the IMDB dataset on Windows: it threw an error about charmap not being able to decode a symbol the first time I tried to download it. I checked on a remote Linux machine I have, and it can't be reproduced there.
I added encoding='UTF-8' to both lines that call open in imdb.py (108 and 114) and it worked for me.
Thank you!

@lhoestq
Member

lhoestq commented Jul 21, 2020

> Hi there!
> I encountered the same issue with the IMDB dataset on Windows: it threw an error about charmap not being able to decode a symbol the first time I tried to download it. I checked on a remote Linux machine I have, and it can't be reproduced there.
> I added encoding='UTF-8' to both lines that call open in imdb.py (108 and 114) and it worked for me.
> Thank you!

Hello!
Glad you managed to fix this issue on your side.
Do you mind opening a PR for IMDB?

@jerryIsHere
Contributor Author

> This is probably a Windows issue, we need to specify the encoding when load_dataset() reads the original CSV file.
> Try to find the open() statement called by load_dataset() and add an encoding='utf-8' parameter.
> See issues #242 and #307

Sorry for not responding for about a month.
I have just found that it is necessary to change / add the environment variable as described in #242.
Everything works after I added the new environment variable and restarted my PC.

I think the encoding issue on Windows isn't limited to the open() calls of a few specific datasets; it actually affects the entire library, depending on the machine / OS you use.
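The comment doesn't name the variable, but assuming it is the PYTHONUTF8 switch from PEP 540 (which forces Python into UTF-8 mode regardless of the locale code page), enabling it would look like:

```shell
# Assumption: the environment variable referenced in #242 is PYTHONUTF8.
# Setting it to 1 enables Python's UTF-8 mode (PEP 540, Python 3.7+),
# so open() defaults to UTF-8 instead of the locale code page (cp950).
export PYTHONUTF8=1            # bash; on Windows cmd: set PYTHONUTF8=1

# Verify that UTF-8 mode is active (prints 1 when enabled):
python -c "import sys; print(sys.flags.utf8_mode)"
```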

@lhoestq
Member

lhoestq commented Sep 7, 2020

Since #481 we shouldn't have other issues with encodings, as they are now set to "utf-8" by default.

Closing this one, but feel free to re-open if you have other questions.

@lhoestq lhoestq closed this as completed Sep 7, 2020
No branches or pull requests

4 participants