Read SAS (sas7bdat) #12654

Closed
ywhcuhk opened this issue Mar 17, 2016 · 11 comments
Labels
IO SAS (SAS: read_sas), Performance (Memory or execution speed performance)

@ywhcuhk

ywhcuhk commented Mar 17, 2016

Very excited to have this new feature in Pandas. I have a few comments to share:

  1. pd.read_sas() doesn't read SAS date variables correctly (this is noted in the docs). Dates are read as numpy.float64. Note that in SAS, dates are recorded as numbers relative to 1960-01-01. It would be helpful to have some sort of argument to parse date variables correctly.
  2. Moreover, SAS has some special missing values such as .B or .R. How are these cases treated?
  3. It is not nearly as fast as read_csv(). Reading a 700 MB SAS data file takes:
CPU times: user 1min 47s, sys: 955 ms, total: 1min 48s
Wall time: 1min 48s

The time for the same data as CSV (I converted the same file to CSV using SAS) is

CPU times: user 3.93 s, sys: 343 ms, total: 4.28 s
Wall time: 4.29 s
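
As a stopgap for point 1, the float columns can be converted after loading. This is only a sketch: it assumes the column holds SAS date values (days since 1960-01-01) or SAS datetime values (seconds since 1960-01-01), and the helper names and the `visit_date` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# SAS counts dates and datetimes from this epoch.
SAS_EPOCH = pd.Timestamp("1960-01-01")


def sas_date_to_datetime(days):
    """Convert SAS date values (days since 1960-01-01) to datetimes."""
    return pd.to_datetime(days, unit="D", origin=SAS_EPOCH)


def sas_datetime_to_datetime(seconds):
    """Convert SAS datetime values (seconds since 1960-01-01) to datetimes."""
    return pd.to_datetime(seconds, unit="s", origin=SAS_EPOCH)


# Stand-in for a frame returned by pd.read_sas(); the column name is hypothetical.
df = pd.DataFrame({"visit_date": [0.0, 366.0, np.nan]})
df["visit_date"] = sas_date_to_datetime(df["visit_date"])
print(df)
```

Missing values come out as NaT, so this does not distinguish special missing codes such as .B or .R.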
@benjello
Contributor

To improve performance, the path to follow might be the one indicated in #10517.

@jreback added the Performance (Memory or execution speed performance) and IO SAS (SAS: read_sas) labels Mar 17, 2016
@jreback added this to the Next Major Release milestone Mar 17, 2016
@jreback
Contributor

jreback commented Mar 17, 2016

cc @kshedden

This is mostly in Python at the moment. You always write for correctness first, then profile, and then, if necessary, use things like Cython in critical sections. Pull requests are welcome for an improved implementation.
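
For anyone who wants to help with the profiling step, here is a minimal sketch using the standard-library cProfile and pstats modules. The `profile_call` helper is hypothetical (not part of pandas), and the demo profiles a trivial built-in rather than a real SAS read.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


# Demo on a trivial function. For the reader itself the call would look like
# profile_call(pd.read_sas, "data.sas7bdat"), where the file name is made up.
result, report = profile_call(sum, range(1000))
print(report)
```

Note that cProfile only sees Python-level calls, so time spent inside compiled Cython code shows up as a single entry.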

@kshedden
Contributor

Thanks for the feedback. A few comments:

  • Performance is a major issue. The pandas version is currently not usable for my own use case (files with billions of rows). FWIW, I have a Go version that is about 20x faster (https://github.com/kshedden/datareader), which I use for big files. Most of the slowness in pandas.read_sas is in process_byte_array_with_data (already in Cython, but probably not optimally set up). I have a local version with improvements that is about 30% faster, but I was looking for a much bigger improvement.
  • The other open-source SAS7BDAT readers that I know of, including the one from Wizard, don't support compression properly or at all; see, e.g., WizardMac/ReadStat#21 ("Support binary compression in sas7bdat"). I put a lot of time into getting the compression support to work.
  • Date support needs some work, but this should be relatively easy to fix. There are several different date formats in SAS. We detect and autoconvert some but not all of them.
  • I don't have documentation for missing value codes. If you can point me to it (i.e. which float values correspond to which codes), I should be able to add support.
  • I just noticed that Wizard supports sas7bcat (categorical data format codes). I will look into porting this over at some point.
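
On the missing value codes: without documentation, one way to start is to look at the raw bit patterns of the NaN values in a column and see whether more than one pattern occurs. The sketch below only does that inspection. It does not assume any mapping from bit patterns to SAS codes, and whether a given reader preserves the original bits is also an assumption it does not verify. The function name is made up.

```python
import numpy as np


def nan_bit_patterns(values):
    """Return the distinct 64-bit patterns (as hex) of the NaNs in values."""
    arr = np.ascontiguousarray(values, dtype=np.float64)
    # Reinterpret the same memory as integers; no numeric conversion happens.
    bits = arr.view(np.uint64)
    nan_bits = bits[np.isnan(arr)]
    return sorted({f"{int(b):016x}" for b in nan_bits})


# Synthetic demo: two NaNs with different payloads plus the ordinary value 1.0.
raw = np.array(
    [0x7FF8000000000000, 0x7FF8000000000042, 0x3FF0000000000000],
    dtype=np.uint64,
)
values = raw.view(np.float64)
print(nan_bit_patterns(values))
```

Running this on a column from a file that is known to contain .B or .R would show whether the codes survive as distinct patterns.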

@marks

marks commented Sep 4, 2016

@kshedden any update on sas7bcat support by any chance? Thanks!!

@xappppp

xappppp commented Jun 16, 2018

Is there any update on sas7bdat? I have a 5 GB SAS data file to read into Python, but performance is still a problem. Thanks!
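
For a file of that size, reading in chunks at least keeps memory bounded, though it does not make parsing faster. A rough sketch using the `chunksize` argument of pd.read_sas follows. The helper names, the chunk size, and the column name are placeholders, and the demo runs on in-memory frames instead of a real file.

```python
import pandas as pd


def iter_sas_chunks(path, chunksize=100_000, encoding=None):
    """Yield DataFrames of up to chunksize rows from a SAS file."""
    # With chunksize set, read_sas returns a reader that can be iterated.
    reader = pd.read_sas(path, chunksize=chunksize, encoding=encoding)
    for chunk in reader:
        yield chunk


def mean_over_chunks(chunks, column):
    """Compute the mean of one column without holding all rows in memory."""
    total = 0.0
    count = 0
    for chunk in chunks:
        col = chunk[column].dropna()
        total += col.sum()
        count += len(col)
    return total / count if count else float("nan")


# Demo with in-memory frames standing in for chunks of a real file.
fake_chunks = [
    pd.DataFrame({"x": [1.0, 2.0]}),
    pd.DataFrame({"x": [3.0, None]}),
]
print(mean_over_chunks(fake_chunks, "x"))
```

With a real file, the call would be something like mean_over_chunks(iter_sas_chunks("big.sas7bdat"), "x"), where the file and column names are made up.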

@ofajardo

ofajardo commented Aug 22, 2018

@marks @xappppp @kshedden I have released a wrapper around the ReadStat C library called pyreadstat. Because most of the code is in C, it is faster than pandas. It can also read sas7bcat files, and it handles value labels and column labels. Support for tagged missing values will come soon (@kshedden, it would be nice if you could provide a sample file with tagged missing values).
https://github.com/Roche/pyreadstat

@kshedden
Contributor

kshedden commented Aug 22, 2018 via email

@ofajardo

ofajardo commented Aug 22, 2018

@kshedden Thanks a lot for your useful comments.

  • I'll repair the documentation page, thanks for reporting that.
  • Since I am wrapping ReadStat, and ReadStat currently has no support for compressed SAS files or incremental reading of data, I cannot offer those. I hope this is fixed in the future.
  • I'll check about pandas reading value labels and reading the header only. Probably I was not careful enough when reading the documentation. I will remove those statements if - as I guess - you are right.
  • It is true that ReadStat cannot guess encodings 100% of the time, but in my hands it always got them right except for one file (given your comment that the encoding is stored in the file itself, ReadStat is probably not guessing but just using that information). With pandas, however, I had to type in the encoding every time. But maybe I overlooked something again? I will remove that statement too if it is wrong.

What is true is that the main motivation was speed as we need to do some heavy lifting.

@ofajardo

ofajardo commented Aug 22, 2018

@kshedden

  • API documentation fixed (quick fix, have to work on that a bit more later).
  • Known limitation section added to the documentation listing what it cannot do at the moment.
  • You are right regarding value labels and reading the header only in pandas. However, as far as I can see, those features are not documented (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sas.html and https://github.com/pandas-dev/pandas/blob/master/pandas/io/sas/sas7bdat.py), or at least I was not able to find them. In my opinion it would be nice to advertise them more clearly. In any case, I will remove those statements from my documentation.
  • Regarding the encodings: if you can read the encoding from the file, why not use it instead of returning bytes? I found it a bit annoying to have to specify it every time. Of course it could fail from time to time, but then you can fall back to returning bytes. Well, just an opinion/suggestion.
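
To make the last suggestion concrete, here is a sketch of the "decode, else fall back to bytes" behaviour applied after loading. It is not how pandas works internally. The function name is made up, and it assumes the frame was read without an `encoding` argument, so text columns hold bytes objects.

```python
import pandas as pd


def decode_bytes_columns(df, encoding):
    """Decode bytes columns with encoding; leave a column as bytes on failure."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue  # numeric and datetime columns need no decoding
        try:
            out[name] = col.map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
        except UnicodeDecodeError:
            # Fallback: keep the raw bytes for this column.
            pass
    return out


# Demo: one column decodes cleanly, one contains bytes that are invalid UTF-8.
df = pd.DataFrame(
    {
        "name": [b"caf\xc3\xa9", b"abc"],
        "raw": [b"\xff\xfe", b"ok"],
        "n": [1, 2],
    }
)
decoded = decode_bytes_columns(df, "utf-8")
print(decoded)
```

The fallback is per column, so a single undecodable value leaves that whole column as bytes while the rest of the frame is decoded.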

@kshedden
Contributor

kshedden commented Aug 23, 2018 via email

@seemasinghh

Hi, I don't know anything about going from SAS to pandas. This is a new term for me, and I want to know more about it because I am now a student of SAS at CETPA, where I am learning SAS from basic to advanced.

8 participants