Pandas read_sas error: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128) #12809

Closed
randomgambit opened this issue Apr 5, 2016 · 22 comments

4 participants
@randomgambit

commented Apr 5, 2016

Hello everybody,

I am using pandas 0.18 to open a sas7bdat dataset.

I simply use:

import pandas as pd
df = pd.read_sas('P:/myfile.sas7bdat')

and I get the following error:

    buf[0:text_block_size].rstrip(b"\x00 ").decode())

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd8 in position 0: ordinal not in range(128)

If I use

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

I get

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd8 in position 0: invalid continuation byte

Other sas7bdat files in my folder are handled just fine by Pandas.

When I open the file in SAS I see that the column names are very long and span several lines, but otherwise the files look just fine.

There are not so many possible options in read_sas... what should I do? Is this a bug in read_sas?

Many thanks!

@TomAugspurger TomAugspurger added the IO SAS label Apr 6, 2016

@TomAugspurger

Contributor

commented Apr 6, 2016

Are you able to share that file, or a similar file with non-sensitive data that raises the same error?

@randomgambit

Author

commented Apr 6, 2016

Well, this is the problem: I can't. But I can do my best to run tests on my side, or do things in SAS, or whatever you need to sort out the problem.

@TomAugspurger

Contributor

commented Apr 6, 2016

You said the column names were long and span several lines. Can you make a dummy file with long names (just random strings like AAAAA.... might work) and a bit of fake data? (I don't have a copy of SAS.)

Actually, this might be a dupe of #12659. Can you try reading the file linked there and see if the same error is raised?

@randomgambit

Author

commented Apr 6, 2016

when I try to read that file, I get TypeError: read() takes at most 1 argument (2 given)

@TomAugspurger

Contributor

commented Apr 6, 2016

@randomgambit

Author

commented Apr 6, 2016

Yes, I downloaded the file test17.sas7bdat. Is this not the expected error?

@randomgambit

Author

commented Apr 6, 2016

I checked the details of the file in SAS. Apparently it is encoded in Latin-1 (Western).
So I tried read_sas('myfile.sas7bdat', encoding='latin-1') but I get the same error:

'ascii' codec can't decode byte etc.
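For reference, a minimal sketch of why the reported byte behaves differently under the three codecs mentioned so far. It assumes a Python 3 interpreter and looks only at the single byte 0xd8 from the traceback; the helper name `try_decode` is made up for illustration and is not part of pandas.

```python
raw = b"\xd8"  # the byte reported in the traceback


def try_decode(data, codec):
    """Return the decoded text, or the exception class name on failure."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# ASCII only covers bytes 0-127, and 0xd8 alone is not valid UTF-8,
# while Latin-1 maps every single byte to a character.
results = {codec: try_decode(raw, codec) for codec in ("ascii", "utf-8", "latin-1")}
print(results)
```

Under Latin-1 the byte maps to the character U+00D8, which is consistent with the file being Latin-1 encoded rather than corrupted.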

@TomAugspurger

Contributor

commented Apr 6, 2016

Sorry, I was mistaken about the error message. Looks like this is a different issue.

@randomgambit

Author

commented Apr 6, 2016

Something strange is that even if I specify an encoding, I still get an error related to the ascii codec. Could that be a cause of the error?

@randomgambit

Author

commented Apr 6, 2016

More precisely, the encoding of my SAS file is Latin-1 Western (ISO). It was created on Linux, but I use pandas on Windows.

@TomAugspurger

Contributor

commented Apr 6, 2016

Can you drop into the debugger after it raises the error? Use %debug if you're in IPython. Then you can see what's going on.

The docstring says encoding is just for decoding string columns (the actual values), so perhaps it isn't being applied to decoding the column names.

@randomgambit

Author

commented Apr 6, 2016

aha! ok lemme try the debugger

@randomgambit

Author

commented Apr 6, 2016


> c:\users\me\appdata\local\continuum\anaconda2\lib\site-packages\pandas\io\sas\sas7bdat.py(529)_process_columntext_subheader()
    527         buf = self._read_bytes(offset, text_block_size)
    528         self.column_names_strings.append(
--> 529             buf[0:text_block_size].rstrip(b"\x00 ").decode())
    530 
    531         if len(self.column_names_strings) == 1:

@randomgambit

Author

commented Apr 6, 2016

this is what I get. then the debugger seems to wait for instructions

@TomAugspurger

Contributor

commented Apr 6, 2016

Ahh, that looks promising though. Does buf[0:text_block_size].rstrip(b"\x00 ").decode('latin1') work there?

Although, that might not go well with the bit stripping there...
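The suggestion above can be tried outside the debugger with a small sketch. It assumes Python 3, where a bare `.decode()` falls back to UTF-8 rather than ASCII. The function `decode_column_text` and the sample column name are hypothetical stand-ins, not the pandas implementation; only the `rstrip(b"\x00 ")` followed by `decode` pattern is taken from the traceback.

```python
def decode_column_text(buf, encoding=None):
    """Hypothetical stand-in for the decoding step shown in the traceback."""
    # Strip trailing NUL and space padding bytes first, as the original line does.
    stripped = buf.rstrip(b"\x00 ")
    # With no encoding, fall back to the interpreter default (UTF-8 on Python 3).
    return stripped.decode() if encoding is None else stripped.decode(encoding)


# Made-up column text: a Latin-1 encoded name followed by padding bytes.
padded = "\u00d8rsted_id".encode("latin-1") + b"\x00\x00  "

try:
    decode_column_text(padded)
    default_ok = True
except UnicodeDecodeError:
    default_ok = False

name = decode_column_text(padded, "latin-1")
print(default_ok, name)
```

On the stripping concern: in a single-byte encoding such as Latin-1, removing trailing 0x00 and 0x20 bytes before decoding only drops NUL and space characters, so the order of the two steps does not change the result in this sketch.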

@randomgambit

Author

commented Apr 6, 2016

What do you mean? What should I do?
Sorry, I have never used the debugger.

@randomgambit

Author

commented Apr 6, 2016

OK Tom, I found a fix.

Just check the encoding of your sas file (right click, properties, details) and set the encoding.

import sys
reload(sys)
sys.setdefaultencoding("latin-1")

The question I have is thus: why does specifying the encoding in the read_sas function do nothing?
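A note for readers on newer interpreters: the `reload(sys)` and `sys.setdefaultencoding` workaround above is specific to Python 2 and changes decoding for the whole process, not just for `read_sas`. A quick check, assuming Python 3, shows that the hook does not exist there:

```python
import sys

# On Python 3 the process-wide default codec is fixed and has no setter.
default_codec = sys.getdefaultencoding()
has_setter = hasattr(sys, "setdefaultencoding")

print(default_codec, has_setter)
```

So on Python 3 the workaround is not available, and any fix has to come from passing an explicit encoding at the point where the bytes are decoded.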

@TomAugspurger

Contributor

commented Apr 6, 2016

I believe the encoding parameter is just used to decode text data in the actual DataFrame itself, and not the metadata like column headers. Does that sound correct @kshedden ?

@kshedden

Contributor

commented Apr 6, 2016

According to the docs below, depending on the setting of the VALIDVARNAME
option, variable names may be either restricted to ASCII, or may be
arbitrary bytes to be decoded somehow:

http://support.sas.com/documentation/cdl/en/lrcon/68089/HTML/default/viewer.htm#p18cdcs4v5wd2dn1q0x296d3qek6.htm

I'm not sure if this VALIDVARNAME (which I have never heard of before) is
in the file somewhere, or is an option that you specify within the
session. In any case, it appears that the column names may need to be
decoded.

Also relevant:

http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm


@randomgambit

Author

commented Apr 6, 2016

Yes, that makes sense, although I don't have any control over the creation of these SAS files.

@kshedden

Contributor

commented Apr 6, 2016

I'm working on a PR, #12656, and will try to work this into it.

I haven't had much time lately but will try to get to this next week.

Kerby


@kshedden

Contributor

commented Apr 11, 2016

@randomgambit, can you try this branch against your SAS file:

https://github.com/kshedden/pandas/tree/sas7bdat_perf

I hope it fixes your problem.

@jreback jreback added this to the 0.18.1 milestone Apr 17, 2016

@jreback jreback added the Unicode label Apr 17, 2016

@jreback jreback closed this in 33683cc Apr 22, 2016

nps added a commit to nps/pandas that referenced this issue May 17, 2016

Modest performance, address pandas-dev#12647
closes pandas-dev#12659
closes pandas-dev#12654
closes pandas-dev#12647
closes pandas-dev#12809

Major performance improvements through use of Cython
Bug fixes in read_sas

Author: Kerby Shedden <kshedden@umich.edu>

Closes pandas-dev#12656 from kshedden/sas7bdat_perf and squashes the following commits:

b3024ed [Kerby Shedden] Add missing test data files
af085f7 [Kerby Shedden] Add one more type
fe4731b [Kerby Shedden] Integrate jreback's cython improvements
b7de358 [Kerby Shedden] flake8 fixes
ea87a7f [Kerby Shedden] Fix encoding handling bug for py2
8b4b96d [Kerby Shedden] pep8 cleanup
1af73b3 [Kerby Shedden] Further encoding work
c26d22b [Kerby Shedden] added to whatsnew
873a877 [Kerby Shedden] Added option to not decode header text
11c2f31 [Kerby Shedden] Further cythonization
3ef626e [Kerby Shedden] Working on cython issues
7e156b7 [Kerby Shedden] Working on cython issues
dc330c5 [Kerby Shedden] Add two missing alignment constants
23bdf7a [Kerby Shedden] Decouple data decoding and decoding e.g. of column names
3bd1b35 [Kerby Shedden] Move more code to cython
bdc9a06 [Kerby Shedden] More cython for performance, refactored constants
ea2339f [Kerby Shedden] Use encoding when reading column headers
7d91d51 [Kerby Shedden] Add test data set from raderaj
a7df841 [Kerby Shedden] Modest performance, address pandas-dev#12647

fix up memoryview access on windows, installation issue