Add reader for SPSS (.sav) files #26537

Merged
merged 53 commits into from
Jun 16, 2019

Conversation

cbrnr
Contributor

@cbrnr cbrnr commented May 27, 2019

  • closes Read and write spss data format #5768 (at least the reading part, this PR does not cover writing SPSS files)
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I haven't added a test yet because I wanted to ask which test .sav file I should use (and where to put it). Also, there's no whatsnew entry yet - in which section should this entry go (and I assume it will be out for 0.25.0, so I'll have to change 0.24.3 to 0.25.0 in the docstring).

This PR adds the capability to load SPSS .sav files with df = pd.io.read_spss("spss_file.sav"). Currently, there are two optional arguments: usecols (should be self-explanatory; let me know if you don't want me to handle a simple str) and categorical, which maps to the apply_value_formats parameter in pyreadstat's read_sav. With categorical=True, a categorical column is created with the labels from the .sav file. With categorical=False, the numeric values are used instead.
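The wrapper described above can be sketched roughly as follows. This is an illustration, not the code from the PR: `read_spss_sketch` and `normalize_usecols` are hypothetical names, and the body assumes that pyreadstat's `read_sav` accepts `usecols` and `apply_value_formats` keyword arguments and returns a data frame together with a separate metadata object, as the description says.

```python
from pathlib import Path
from typing import List, Optional, Sequence, Union


def normalize_usecols(
    usecols: Optional[Union[str, Sequence[str]]]
) -> Optional[List[str]]:
    """Turn a single column name into a one-element list (hypothetical helper)."""
    if usecols is None:
        return None  # None means "read every column"
    if isinstance(usecols, str):
        return [usecols]
    return list(usecols)


def read_spss_sketch(
    path: Union[str, Path],
    usecols: Optional[Union[str, Sequence[str]]] = None,
    categorical: bool = True,
):
    """Load a .sav file into a DataFrame (illustrative, not the PR's code)."""
    # Imported lazily so this module still imports without the optional dependency.
    import pyreadstat

    df, _meta = pyreadstat.read_sav(
        str(path),
        usecols=normalize_usecols(usecols),
        apply_value_formats=categorical,
    )
    # The metadata is discarded here; only the data frame is returned.
    return df
```

With this sketch, `read_spss_sketch("spss_file.sav", usecols="age")` would read a single column (the column name `age` is made up), with value labels applied by default.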

A few open questions:

  • Which additional optional arguments should be made available? Pyreadstat has dates_as_pandas_datetime, encoding, and user_missing which I haven't mapped yet.
  • Should the function be called read_spss or read_sav? SPSS files have the extension sav, but the R haven package has a function read_spss (which is why I'd prefer read_spss).
  • Are there any additional meta information bits that could be used/integrated into the data frame? pyreadstat.read_sav returns a dataframe and meta-information separately, which I think we shouldn't do in pandas.

Contributor

@jreback jreback left a comment

pls add this to the CI in several (but not all) places; this should gracefully skip tests if not installed. you can commit the .sav files to pandas/tests/io/data/ (see how we do this for .dta). also update the install.rst
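A minimal sketch of skipping tests when an optional dependency is missing, using only the standard library. pandas' own test suite is based on pytest, where `pytest.importorskip` would be the more likely tool; `unittest` is swapped in here purely to keep the example self-contained, and the class and function names are hypothetical.

```python
import importlib.util
import unittest

# True only when the optional dependency can be imported.
HAS_PYREADSTAT = importlib.util.find_spec("pyreadstat") is not None


@unittest.skipUnless(HAS_PYREADSTAT, "pyreadstat is not installed")
class TestSpssReader(unittest.TestCase):  # hypothetical test class
    def test_reader_is_available(self):
        import pyreadstat

        self.assertTrue(callable(pyreadstat.read_sav))


def run_suite() -> unittest.TestResult:
    """Run the tests above and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSpssReader)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a machine without pyreadstat the test is reported as skipped rather than failed, so the run still counts as successful.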

@jreback jreback added the IO Data and IO Stata labels May 27, 2019
@codecov

codecov bot commented May 28, 2019

Codecov Report

Merging #26537 into master will decrease coverage by 50.06%.
The diff coverage is 25%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26537       +/-   ##
===========================================
- Coverage   91.76%    41.7%   -50.07%     
===========================================
  Files         174      175        +1     
  Lines       50629    50637        +8     
===========================================
- Hits        46462    21119    -25343     
- Misses       4167    29518    +25351
Flag Coverage Δ
#multiple ?
#single 41.7% <25%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/api.py 100% <100%> (ø) ⬆️
pandas/io/spss.py 14.28% <14.28%> (ø)
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a516c1...0407cc5. Read the comment docs.

@codecov

codecov bot commented May 28, 2019

Codecov Report

Merging #26537 into master will decrease coverage by 0.02%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26537      +/-   ##
==========================================
- Coverage   91.88%   91.86%   -0.03%     
==========================================
  Files         179      180       +1     
  Lines       50696    50710      +14     
==========================================
+ Hits        46581    46583       +2     
- Misses       4115     4127      +12
Flag Coverage Δ
#multiple 90.45% <50%> (-0.02%) ⬇️
#single 41.1% <50%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/api.py 100% <100%> (ø) ⬆️
pandas/io/spss.py 46.15% <46.15%> (ø)
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.84% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...e55b8c4. Read the comment docs.

@cbrnr
Contributor Author

cbrnr commented May 28, 2019

I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

One test file containing dates cannot be loaded because pyreadstat produces an error - this is reported in Roche/pyreadstat#26.

I've updated install.rst and I'm currently trying to see if a Travis job successfully completes when pyreadstat is installed.

@TomAugspurger
Contributor

TomAugspurger commented May 28, 2019

I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

Seems like haven has an MIT license, according to https://cran.r-project.org/web/packages/haven/index.html. Does that sound right @hadley?

If so, I think you can include Haven's license file in our licenses folder, and we'll be good. It'd be good to note the source of these test files in test_spss.py.

@hadley

hadley commented May 28, 2019

Yeah, that's fine with me. (I generally follow the US and consider data to be un-copyrightable, although in this case I guess it's the specific form that's important, not so much the data.)

@cbrnr
Contributor Author

cbrnr commented May 29, 2019

All tests pass now (currently re-running because of a wrong import order). Two open questions:

  1. I have currently added pyreadstat only to one Travis job (3.7). Where would you like this dependency to be added?
  2. The Haven license consists of two files (MIT and LICENSE). How and where should these files be added (e.g. how should we rename these files)?

Contributor

@jreback jreback left a comment

put licenses in pandas/LICENSES

@@ -22,3 +22,4 @@ dependencies:
- pip
- pip:
- moto
- pyreadstat
Contributor

is this only a wheel?

Contributor

add to one of the windows builds & the macosx build; is this supported on 3.5? any other requirements?

Contributor Author

Done.

Contributor Author

I don't know what you mean by "only a wheel" though. pyreadstat has binary wheels for Windows, Linux, and macOS for Python 3.5, 3.6, and 3.7: https://pypi.org/project/pyreadstat/#files

Contributor

meaning you are installing thru pip; rather use a conda package if it's available

Contributor

Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock

doc/source/whatsnew/v0.25.0.rst (outdated, resolved)
@@ -0,0 +1,27 @@
def read_spss(path, usecols=None, categorical=True):
Contributor

can you add typing

Contributor Author

Sure. This is the first time I've used typing so please double-check.

pandas/tests/io/test_spss.py (outdated, resolved)
@pep8speaks

pep8speaks commented May 29, 2019

Hello @cbrnr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-15 07:07:22 UTC

@cbrnr
Contributor Author

cbrnr commented May 29, 2019

@jreback I've implemented all changes.

pandas/io/spss.py (outdated, resolved)

Parameters
----------
path : string
Contributor

does this accept a pathlike? Union[str, Path] ?

Contributor Author

pathlib.Path should work everywhere a path string is expected, so IMO it is not really necessary to explicitly add this. But if you want I can of course add it (this requires an extra import pathlib though).

Contributor Author

I've implemented this change.
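
The point about path-likes in this thread can be illustrated with a small sketch. `as_filename` is a hypothetical helper, not part of the PR; it relies on `os.fspath`, which returns a str unchanged and converts a `pathlib.Path` to its string form.

```python
import os
from pathlib import Path
from typing import Union


def as_filename(path: Union[str, Path]) -> str:
    """Return the plain string form of a str or pathlib.Path (hypothetical helper)."""
    return os.fspath(path)
```

Annotating the parameter as `Union[str, Path]` documents that both forms are accepted, while the conversion keeps the underlying reader working with an ordinary string.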

@cbrnr
Contributor Author

cbrnr commented May 30, 2019

Yes, pyreadstat is only available via pip and not via conda.

@cbrnr
Contributor Author

cbrnr commented May 30, 2019

Fixed the formatting issues, hopefully this will come back green and then it should be ready to merge.

Member

@WillAyd WillAyd left a comment

Minor nit on annotation but otherwise this lgtm - nice change!

@WillAyd WillAyd added this to the 0.25.0 milestone May 30, 2019
@cbrnr
Contributor Author

cbrnr commented May 30, 2019

@WillAyd could you please elaborate?

@WillAyd
Member

WillAyd commented May 30, 2019

Not sure what happened to my comment, but I was asking to change Union[str, Sequence[str], None] to Optional[Union[str, Sequence[str]]]
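
As a quick check that this request is about readability rather than behavior, the two spellings can be compared directly. The alias names `Before`, `After`, and `SAME_TYPE` are made up for this sketch.

```python
from typing import Optional, Sequence, Union

# The spelling used before the review comment.
Before = Union[str, Sequence[str], None]
# The spelling requested in the review.
After = Optional[Union[str, Sequence[str]]]

# Optional[X] is shorthand for Union[X, None], and nested unions are flattened,
# so both aliases describe the same type.
SAME_TYPE = Before == After
```

The `Optional[...]` form simply makes the "may be None" case more visible to a reader.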

@cbrnr
Contributor Author

cbrnr commented May 30, 2019

@WillAyd done!

pandas/io/spss.py (outdated, resolved)
pandas/tests/io/test_spss.py (outdated, resolved)
@ofajardo

ofajardo commented May 31, 2019

@cbrnr Pyreadstat is available both with pip and conda. In the README that is explained in the How to install section

@cbrnr
Contributor Author

cbrnr commented May 31, 2019

I didn't see that you have conda-forge added, sorry about that. I've addressed all comments.

@cbrnr
Contributor Author

cbrnr commented May 31, 2019

No idea what's going on with Azure, could someone please restart?

@jreback
Contributor

jreback commented Jun 2, 2019

can you merge master (you have a conflict)

@cbrnr
Contributor Author

cbrnr commented Jun 3, 2019

OK, I've rebased.

@jreback
Contributor

jreback commented Jun 3, 2019

looks good. @jorisvandenbossche @TomAugspurger @bashtage ok with this?

@cbrnr
Contributor Author

cbrnr commented Jun 15, 2019

Finally all green!

@jreback jreback merged commit 21fe224 into pandas-dev:master Jun 16, 2019
@jreback
Contributor

jreback commented Jun 16, 2019

thanks @cbrnr

@cbrnr cbrnr deleted the pyreadstat branch June 17, 2019 05:35
@ofajardo

@cbrnr I am in the process of releasing a new version of pyreadstat with writing capabilities, in case that's of interest to you. It should be available later today or tomorrow.

@cbrnr
Contributor Author

cbrnr commented Jun 28, 2019

Nice! However, at the moment I don't think there's a need for exporting to a proprietary format directly from pandas. If someone really wants to do that they can use your package directly.

@isantolin
Contributor

Still missing documentation http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html

@TomAugspurger
Contributor

TomAugspurger commented Jul 19, 2019 via email

Labels
IO Data, IO Stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read and write spss data format
9 participants