It would be really convenient to be able to at least import SAS tables into a pandas DataFrame. Is this planned? Are there insurmountable issues?
can you post a link to the format? and see if any converters have been written in python?
obviously the idea would be to read the native format file
so looks like simple binary read/write stuff....could be done....only other question is there any license issue with doing this?
not sure. @benjello want to contact SAS and ask them?
That technote is describing the XPORT format, which isn't the native binary format. I've never seen a published layout of the native format. Some people have partially reverse engineered it, but I've never seen a solution that could handle data sets with compression.
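For anyone unsure which of the two kinds of file they have, here is a minimal sketch of a check for the XPORT case. It assumes, as I understand the published transport layout, that the first 80-byte record begins with the library header marker below. The helper name `looks_like_xport` is made up for illustration, and a negative result says nothing about whether the file is a valid native SAS data set.

```python
# Assumption: a version 5 transport (XPORT) file starts with this marker,
# per the published layout. The native format is not covered here.
XPORT_MARKER = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"


def looks_like_xport(first_bytes):
    """Return True if the leading bytes of a file match the XPORT library header."""
    return first_bytes.startswith(XPORT_MARKER)
```

To use it, read the first 80 bytes of the file in binary mode (`open(path, "rb").read(80)`) and pass them in.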
@mtkni thanks had no idea.
aside from using SAS to actually export (e.g. csv or whatever), is there a format that one could save that provides some interoperability (and is openish)?
a 10 minute search suggests no but maybe someone else knows more.
I am not a specialist much more a potential user in heavy need of such a tool.
For now I have alternately used StatTransfer, which is not free software,
or, when importing to R, one of the methods described here, but you need to have SAS installed.
I am sorry for not being able to provide you with more information than the above.
probably could do
I don't think 1) is a good idea
export data in stata to xport format or csv format
ok. just throwing it out there. i don't like calling out to other programs either, but this seems like it's going to be tough. i can implement the R code above...if u think that's a good idea...but it basically forces users to use that particular version of the format and if it ever changes we won't know until it breaks.
i wouldn't be able to test to_sas though since i don't have sas
I would be glad to test everything that would do the job. I have sas.
I've spent some time on this in the past. These are my thoughts:
As much as I wish there was a good solution to this, and as much as I'd be willing to help build it, I just don't think there is. I've built read_sas using CSV as an intermediate format. Obviously, this requires a SAS license. It takes very few lines to implement, but most of those lines are specific to our SAS environment and are not well-portable.
I will study the performance of XPORT vs CSV next week. If it's dramatically faster, then it may be worth the effort to implement. Even then, I'm not sure it's worth taking that on as part of the Pandas project. I would be interested in comments from other SAS users on that.
Just my two cents.
nice to hear from someone who tried to do this. FWIW i think it might be tough to beat CSV for speed, most of it is written in C/Cython.
Agreed. The new, fast CSV was a game changer.
can sas export to HDF5?
No, it can't.
export in STATA format?
No, and if it did it would be an expensive add-on module. SAS is pretty good at reading from databases (http://www.sas.com/resources/factsheet/sas-access-factsheet.pdf), although each database platform is a separate license. I haven't found it good at all at writing to databases (it can, but it's slow). Other than that, interoperability doesn't seem to be part of their business model.
Oh wait, I may have spoken too soon. Apparently I can export to a stata file: http://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003102702.htm
Is STATA supported in Pandas? It would still require a SAS license, but I can benchmark that versus CSV.
There is a read_stata that will be available in the coming version; it is already available on GitHub.
BTW, @mtkni I would be happy to look at the read_sas you implemented if you would share it ...
Just to close the loop on this, exporting to STATA requires an add-on for which I'm not licensed, so I can't benchmark it.
FYI, the XPT or transport format is a non-proprietary format that has no licensing issues and is the only format currently accepted by the Food and Drug Administration (FDA) for clinical trial data. Most pharmaceutical companies submit XPT format to the FDA. It would be nice to have a way to read these files in just like a csv file.
@dramage1 Is "XPT" the same as "XPORT" above?
Heyo - there's at least one Python package for reading XPT files - https://pypi.python.org/pypi/xport/0.1.0
Just what I needed, much appreciated.
@dramage1 if you use this enough to want to write up a pandas wrapper for it, that could be a useful addition to pandas (depending on the stability of xport)
@jtratner I worked on the xport library before. I can refactor xport to give a better API for use in a pandas read_xpt or borrow some code to include directly in pandas.
@dramage1 let me know if the xport library is confusing or broken. I'll try to improve the docs and/or code.
As I said over email - I'm glad that you're interested in working on this.
Feel free to ping me if you have any pandas-related issues.
@benjello @selik any action on this?
Not yet... check back in a couple weeks :-0
I am sorry but I won't be qualified enough but I am willing to test any code
@benjello Could you give me a few test cases? I don't have SAS available to me. I'd like to get just a few tiny test files to make some unit tests.
@selik you are going to use the xport soln?
@jseabold you have thoughts on this?
Not really anything to add beyond what's here. I'm sure it will be useful if XPT format is used places (yikes that's a terrible data policy re: FDA). I've been lucky enough to avoid SAS beyond coursework which required it.
@jreback It makes sense to refactor the xport library to make it friendly as a dependency for a pandas.read_xpt().
@selik yes....prob best to simply incorporate it directly (with the licensing references / included) - see what we did with msgpack. Then you can modify and not introduce a dep.
I am not a license expert...but I think that the MIT license is compat with the pandas BSD 3-clause
(you basically just copy the LICENSE to the LICENSES dir) and are good 2 go
@jseabold Regarding the FDA data format: they are beginning to realize that it is time to move forward and have an XML format pilot project proposal http://goo.gl/1xNiv8. The SAS transport (XPT) format is not going away anytime soon - the FDA moves at a snail's pace implementing changes, so I am glad to see you are working on this. Unfortunately, I am a complete newbie at Python and can't help much. I could provide some sample data in XPT format if someone can explain how to upload it to Git.
@dramage1 You can email me the files if you'd like. I'm looking for a Rosetta stone for XPT and CSV. Or XPT and some other plain-text format. I think my email address is in my profile.
@selik Mike, I tried firstname.lastname@example.org and got an undeliverable message.
@dramage1 That's not good. I wonder who else is having trouble emailing me. Mind sharing your email in your profile?
I updated my profile
Even though it would be slower, would it be worthwhile to add pyodbc-based support for SAS?
@spearsem It wouldn't necessarily be slower if SAS has some secret awesome algorithm for reading XPT files. That's how R reads XPT. But if you already have SAS, the best thing to do is read the file in SAS and save as CSV, not to read it directly from Python.
BTW, I'm slowly moving along with xport. I think I'll have code ready for inclusion in pandas by end of April.
I'm thinking specifically when you don't already have SAS, just someone's old data files. It would be interesting to be able to connect to the data via some within-pandas wrapper on pyodbc for the SAS drivers and then perform some queries on it into pandas.
I don't follow. How would you have SAS drivers without SAS?
Well, you may be a person who knows absolutely nothing about SAS, but who can solve the problem very quickly in Pandas. This happened to me before with Stata. My company had plenty of Stata licenses, but no one who knew Stata had time to help explain to me what was going on with the script that generated some data. In the interim, I found a statsmodels function that would read .dta files (pandas didn't have that ability yet) and solved the problem with the data very quickly. It would have taken much longer for me to do it with Stata.
Ah, I see. Unfortunately, I can't help with that, as I don't have SAS.
Some progress on the front of reading SAS sas7bdat format in R: http://cran.r-project.org/web/packages/sas7bdat/index.html and it seems that a pythonista got inspired by it http://git.pyhacker.com/sas7bdat
Can anyone vouch for the R solution?
(I'm here by accident, and have no idea about anything.)
sas7bdat is on pypi https://pypi.python.org/pypi/sas7bdat MIT licensed
and sas2pd in
FWIW, there is apparently also support for reading compressed files in the Java-based "parso" open-source library, http://search.maven.org/remotecontent?filepath=com/ggasoftware/parso/1.2.1/parso-1.2.1-sources.jar
the sas7bdat Python library now supports compressed files too: https://pypi.python.org/pypi/sas7bdat
@jaredhobbs yeah! Thanks a lot for this! If only it supported Python3, it would be even more awesome :)
@gdementen After a few long nights, Python3 support has landed: https://pypi.python.org/pypi/sas7bdat/2.0.1
I also fixed some bugs I came across.
Many thanks, this expands my options for working with Clinical Trial data sets.
Date: Sun, 4 Jan 2015 01:11:26 -0800
Subject: Re: [pandas] ENH: read_sas, to_sas (#4052)
Hadley just released haven for reading SAS, SPSS, and Stata files. It wraps a C library, [ReadStat](https://github.com/WizardMac/ReadStat). It has an MIT license.
Did anyone ever solve the read_sas issue? Or is reading fixed width ascii files with accompanying dictionary still an issue?
see #9711 - going to be merged shortly -
But does that only deal with '.xpt' file types? Glancing at the code it doesn't seem to deal with the other two part SAS file format.
@tyler-abbot yes that is for xport type files. It would be straightforward to wrap the library mentioned above to extend this to the sas binary format. just need a volunteer - interested?
@jreback I'm actually working on transcribing the SAScii (http://cran.r-project.org/web/packages/SAScii/index.html) package from R to Python. I would be happy to share the results if it is ok with the author of that package. I haven't done much development, though, so don't know much about sop's. I'm also not sure how compatible it would be with the library you mentioned. Perhaps just an add on with the option of ...format="sas"... or something along those lines.
just saw this. https://pypi.python.org/pypi/sas7bdat/2.0.1. Even if this is pure-python (slower), that is ok to start. Better to have it able to read than not.
@jreback @kshedden I use extensively https://pypi.python.org/pypi/sas7bdat/2.0.1 It is slow but works well.
Since usually I transfer my data to HDF format once at the beginning of a study, it was worth it.
So, I don't think the sas7bdat package can read the type of sas files I'm talking about. I have finished writing a function to do it, but am going out of town for a few months. It is all contained in this package:
I have a few days during which I could work on incorporating the read_sas() function into pandas. I'm going to read through the documentation and do some more testing, but if anyone has suggestions that will help me move more quickly it would be greatly appreciated.
I am going to mark this for 0.17.1. The implementation for using sas7bdat is quite trivial. So should start with that.
I used sas7bdat the other day... here is what it took:
from sas7bdat import SAS7BDAT
with SAS7BDAT('/homes/abie/projects/2015/TICS/tics_07.sas7bdat') as f:
    df = f.to_data_frame()
There is some useful information hidden in the sas file that does not make it into the dataframe, though, such as the column labels.
agreed. all that is really needed are:
ENH: Support for reading SAS7BDAT files
In case you're curious, I revised the xport module and I think the code is a bit easier to read now. The xport.to_dataframe I wrote is occasionally 2x faster than pandas.read_sas but sometimes much slower. I expect it is dependent on the number of floats in the dataset, as I didn't vectorize the conversion from IBM to IEEE.
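For readers who haven't met the IBM-to-IEEE step, here is a minimal scalar sketch of the idea. This is not the code from the xport module or from pandas. The function name `ibm_to_ieee` is made up for illustration, and it ignores the special byte patterns that XPORT uses for missing values. It assumes the usual IBM hexadecimal double layout: 1 sign bit, a 7-bit base-16 exponent biased by 64, and a 56-bit fraction.

```python
import math
import struct


def ibm_to_ieee(raw):
    """Decode one 8-byte big-endian IBM hexadecimal double into a Python float."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    (bits,) = struct.unpack(">Q", raw)
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 56) & 0x7F        # base-16 exponent, biased by 64
    fraction = bits & 0x00FFFFFFFFFFFFFF  # 56-bit fraction, read as 0.f in base 16
    # value = sign * (fraction / 2**56) * 16**(exponent - 64)
    return sign * math.ldexp(fraction, 4 * (exponent - 64) - 56)


print(ibm_to_ieee(bytes.fromhex("4110000000000000")))  # 1.0
print(ibm_to_ieee(bytes.fromhex("C276A00000000000")))  # -118.625
```

A vectorized version would apply the same bit operations to a whole array at once, which is presumably where the speed difference on float-heavy data sets comes from.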
@selik well certainly welcome a PR to fixup the pandas versions!
note that this could be MUCH faster if it were cythonized, similar to how #12656 is done.
@jreback I'd want to change the API. So long as pandas.read_sas behaves the same, is there room to change the behavior of things like XportReader?
@selik what do you need to change? the user API is very simple actually, just pd.read_sas(..., format='....') and chunking. As long as you don't mess with that prob ok.
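To make that user-facing API concrete, here is a small sketch of reading a SAS file in chunks with `pd.read_sas`. The helper names `infer_sas_format` and `count_rows` are hypothetical and only for illustration. The `format` and `chunksize` parameters are the ones mentioned above, and the two format strings assume a pandas version that supports both file types.

```python
from pathlib import Path


def infer_sas_format(path):
    """Map a file extension to the format string passed to pd.read_sas."""
    suffix = Path(path).suffix.lower()
    if suffix == ".xpt":
        return "xport"
    if suffix == ".sas7bdat":
        return "sas7bdat"
    raise ValueError(f"cannot infer SAS format from {path!r}")


def count_rows(path, chunksize=10_000):
    """Count rows without loading the whole file, by iterating over chunks."""
    import pandas as pd  # imported here so infer_sas_format works without pandas

    reader = pd.read_sas(path, format=infer_sas_format(path), chunksize=chunksize)
    # With chunksize set, read_sas returns a reader that yields DataFrames.
    return sum(len(chunk) for chunk in reader)
```

Anything that keeps `pd.read_sas(path, format=..., chunksize=...)` behaving this way should leave room to rework the reader classes underneath.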
@jreback Sounds good. Not sure what I might need to change internally, but the effort is more pleasant if there's more freedom. I'd say I'll get around to it soon, but looking back I suddenly realize it took me 3+ years from the first time I told myself I'd revise the XPORT reader.
sure feel free to take a look around
Just FYI, I added dump and dumps to the xport module, in case anyone wants to take a look for writing pandas.to_sas.
The conversion from Python floats to IBM-mainframe 64-bit floats seems to be working quite well, very rarely losing precision. At least when I round-trip from IEEE to IBM and back to IEEE.
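For illustration, the opposite direction can be sketched as below. This is not the implementation in the xport module. The name `ieee_to_ibm` is made up, it rejects NaN and infinity rather than mapping them to SAS missing values, and it assumes the same IBM layout of 1 sign bit, a 7-bit base-16 exponent biased by 64, and a 56-bit fraction.

```python
import math
import struct


def ieee_to_ibm(value):
    """Encode a Python float as an 8-byte big-endian IBM hexadecimal double."""
    if math.isnan(value) or math.isinf(value):
        raise ValueError("NaN and infinity have no IBM float encoding")
    if value == 0.0:
        return bytes(8)
    sign = 0x80 if value < 0 else 0x00
    mantissa, exp2 = math.frexp(abs(value))  # abs(value) == mantissa * 2**exp2
    exp16 = -(-exp2 // 4)                    # ceil(exp2 / 4): base-16 exponent
    shift = 4 * exp16 - exp2                 # 0..3 leading zero bits in the fraction
    fraction = int(math.ldexp(mantissa, 56 - shift))
    biased = exp16 + 64
    if not 0 <= biased <= 127:
        raise OverflowError("value is outside the IBM double range")
    return struct.pack(">Q", ((sign | biased) << 56) | fraction)


print(ieee_to_ibm(1.0).hex())       # 4110000000000000
print(ieee_to_ibm(-118.625).hex())  # c276a00000000000
```

Because the base-16 exponent can leave up to three leading zero bits in the fraction, the IBM format has between 53 and 56 usable fraction bits, so precision is more likely to be lost going from IBM to IEEE than the other way.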