'ascii' codec can't decode byte 0xd1 in position 2: ordinal not in range(128) #23444

Open
mau21mau opened this issue Nov 1, 2018 · 10 comments
Labels
Enhancement IO Excel read_excel, to_excel

Comments

@mau21mau

mau21mau commented Nov 1, 2018

Code Sample, a copy-pastable example if possible

dataframe = pd.read_excel(
    StringIO(self.file_stream), sheet,
    na_values=['undefined', 'NaN'], header=None, keep_default_na=False,
)

Problem description

I'm trying to read an .xls file with the read_excel() method, and it throws the error in the title. If I read the file with the xlrd library directly, I can fix the error by passing the encoding_override parameter with the file's encoding. I've seen some Stack Overflow answers, and all of them recommend using an encoding parameter, which doesn't exist. Why not implement an encoding parameter for read_excel() and simply pass it as encoding_override when reading the file with xlrd?
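For readers who want to see the error outside of Excel, here is a minimal sketch that reproduces the message in the title with plain bytes. The choice of cp1251 is only an assumption for illustration; the actual code page of the reporter's file is not known.

```python
# Three bytes where the third one (0xd1) lies outside the 7-bit ASCII range.
raw = b"ab\xd1"

try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)  # same wording as the issue title

# The same byte maps to different characters under different code pages,
# which is why the reader has to be told which one applies.
as_cp1251 = raw.decode("cp1251")    # assumed code page, for illustration only
as_latin1 = raw.decode("latin-1")   # another plausible guess, different result

print(message)
print(as_cp1251, as_latin1)
```

The byte 0xd1 decodes to a Cyrillic letter under cp1251 and to an accented Latin letter under latin-1, so the correct choice cannot be inferred from the bytes alone.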

@gfyoung
Member

gfyoung commented Nov 1, 2018

That seems reasonable to me. We have an encoding parameter in read_csv, so adding it to read_excel would be consistent.

@WillAyd
Member

WillAyd commented Nov 5, 2018

Is this only applicable to files created with Excel 95 and earlier?

https://xlrd.readthedocs.io/en/latest/unicode.html

If so, I am -1 here, as I can't imagine we explicitly support anything else of that age.

@gfyoung
Member

gfyoung commented Nov 5, 2018

@WillAyd : Consistency with read_csv is the main reason why I would support this parameter. Our data IO API is quite fragmented, so adding this parameter is a step in the right direction.

@WillAyd
Member

WillAyd commented Nov 5, 2018

Am I misreading it that all Excel files created in the past 21 years contain an encoding of utf-16-le though? If so while consistency is good the keyword would either be unused or actually confusing / counter-productive to almost every Excel file still out there in the wild.

@gfyoung
Member

gfyoung commented Nov 5, 2018

> Am I misreading it that all Excel files created in the past 21 years contain an encoding of utf-16-le though?

Uncertain.

> If so while consistency is good the keyword would either be unused or actually confusing / counter-productive to almost every Excel file still out there in the wild.

Confusing? Not if good documentation is written for it. It would be good to clarify the xlrd docs then. @mau21mau, we might also need more detail on the type of Excel file you were trying to read.

@WillAyd
Member

WillAyd commented Nov 5, 2018

My big pushback is on referring to this as encoding because I don't think it covers the same concept as other IO functions. I haven't stepped through the source code of xlrd but I am interpreting it as an intentional disambiguation that they chose the parameter name of encoding_override and not just encoding. Their docs suggest that this is only used in case of missing or incorrect code pages, and therefore may not explicitly determine encoding.

What if we just either changed the intention here to add encoding_override as a parameter or alternately allowed kwargs to go through to read_excel? I'd be fine with either of those, but don't want to mangle concepts with other IO functions
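Until one of those options exists, the workaround mentioned in the issue can be sketched roughly as follows. The helper name `read_xls_with_override` is hypothetical, and the sketch assumes that the installed pandas still accepts an already-opened `xlrd.Book` as its `io` argument and that xlrd is available as the engine.

```python
import codecs


def read_xls_with_override(raw_bytes, encoding, sheet=0, **read_excel_kwargs):
    """Hypothetical helper: open an .xls with xlrd's encoding_override,
    then hand the opened workbook to pandas."""
    # Fail early on a misspelled codec name (raises LookupError).
    codecs.lookup(encoding)

    # Imported lazily so the helper can be defined without the optional
    # dependencies installed.
    import pandas as pd
    import xlrd

    # encoding_override is xlrd's own parameter, as described in the issue.
    book = xlrd.open_workbook(file_contents=raw_bytes, encoding_override=encoding)

    # Assumption: pandas accepts an xlrd.Book here; remaining keyword
    # arguments (header, na_values, ...) are forwarded unchanged.
    return pd.read_excel(book, sheet_name=sheet, engine="xlrd", **read_excel_kwargs)
```

A call might look like `read_xls_with_override(data, "cp1251", header=None, keep_default_na=False)`, where `data` holds the raw bytes of the file and the code page is whatever applies to that file.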

@gfyoung
Member

gfyoung commented Nov 5, 2018

@WillAyd : I'm not sure I fully understand your argument. The word "encoding" seems to mean the same thing for xlrd as it does for read_csv, even if the determination / assumption of encoding seems predicated on the existence of a record.
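To make the point about the record concrete, here is a simplified model of the behaviour described in the linked xlrd documentation. It is not xlrd's actual code: the function name, the convention that version 80 stands for the Excel 97 format, and the `fallback` default are all assumptions made for illustration.

```python
def pick_encoding(biff_version, codepage=None, encoding_override=None, fallback="ascii"):
    """Simplified, hypothetical model of how the text encoding is chosen."""
    # Excel 97 and later store text as UTF-16LE, so nothing needs guessing.
    if biff_version >= 80:  # assumed convention: 80 means the Excel 97 format
        return "utf_16_le"
    # For older files an explicit override wins over whatever the file says.
    if encoding_override:
        return encoding_override
    # Otherwise trust the code page record if the file has one.
    if codepage is not None:
        return "cp%d" % codepage
    # No record and no override: fall back to a default (assumed value).
    return fallback


print(pick_encoding(80))                  # modern file
print(pick_encoding(50, codepage=1251))   # old file with a code page record
print(pick_encoding(50))                  # old file, record missing
```

In this model the override only matters for pre-97 files whose record is missing or wrong, which is the case both sides of the discussion are describing.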

@mau21mau
Author

mau21mau commented Nov 7, 2018

> My big pushback is on referring to this as encoding because I don't think it covers the same concept as other IO functions. I haven't stepped through the source code of xlrd but I am interpreting it as an intentional disambiguation that they chose the parameter name of encoding_override and not just encoding. Their docs suggest that this is only used in case of missing or incorrect code pages, and therefore may not explicitly determine encoding.
>
> What if we just either changed the intention here to add encoding_override as a parameter or alternately allowed kwargs to go through to read_excel? I'd be fine with either of those, but don't want to mangle concepts with other IO functions

I don't know exactly what kind of file it is, since it comes from one of our users (I don't know how he generated it). The thing is that xlrd does support it, so I thought that read_excel, since it uses xlrd, should be prepared to account for that scenario, whether through an encoding parameter or kwargs.

@sindhuprakasam

I am facing the same issue when I try to export my pandas DataFrame to an Excel file. Is the issue still open for that as well?

@gfyoung
Member

gfyoung commented Mar 22, 2019

@sindhusubha : Absolutely

4 participants