Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reader for SPSS (.sav) files #26537

Merged
merged 53 commits into from Jun 16, 2019

Conversation

@cbrnr
Copy link
Contributor

commented May 27, 2019

  • closes #5768 (at least the reading part, this PR does not cover writing SPSS files)
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I haven't added a test yet because I wanted to ask which test .sav file I should use (and where to put it). Also, there's no whatsnew entry yet - in which section should this entry go (and I assume it will be out for 0.25.0, so I'll have to change 0.24.3 to 0.25.0 in the docstring).

This PR adds the capability to load SPSS .sav files with df = pd.io.read_spss("spss_file.sav"). Currently, there are two optional arguments: usecols (should be self-explanatory, let me know if you don't want me to handle a simple str) and categorical, which maps to the apply_value_formats parameter in read_sav. With categorical=True, a categorical columns is created with the labels from the .sav file. If False, numbers will be used.

A few open questions:

  • Which additional optional arguments should be made available? Pyreadstat has dates_as_pandas_datetime, encoding, and user_missing which I haven't mapped yet.
  • Should the function be called read_spss or read_sav? SPSS files have the extension sav, but the R haven package has a function read_spss (which is why I'd prefer read_spss).
  • Are there any additional meta information bits that could be used/integrated into the data frame? pyreadstat.read_sav returns a dataframe and meta-information separately, which I think we shouldn't do in pandas.
@jreback
Copy link
Contributor

left a comment

pls add this to the CI in several (but not all places); this should gracefully skip tests if not installed. you can commit the .sav files pandas/tests/io/data/ (see how we do this for .dta). also update the install.rst

@codecov

This comment has been minimized.

Copy link

commented May 28, 2019

Codecov Report

Merging #26537 into master will decrease coverage by 50.06%.
The diff coverage is 25%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #26537       +/-   ##
===========================================
- Coverage   91.76%    41.7%   -50.07%     
===========================================
  Files         174      175        +1     
  Lines       50629    50637        +8     
===========================================
- Hits        46462    21119    -25343     
- Misses       4167    29518    +25351
Flag Coverage Δ
#multiple ?
#single 41.7% <25%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/io/api.py 100% <100%> (ø) ⬆️
pandas/io/spss.py 14.28% <14.28%> (ø)
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.37%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.16%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.1%) ⬇️
... and 131 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a516c1...0407cc5. Read the comment docs.

@codecov

This comment has been minimized.

Copy link

commented May 28, 2019

Codecov Report

Merging #26537 into master will decrease coverage by 0.02%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26537      +/-   ##
==========================================
- Coverage   91.88%   91.86%   -0.03%     
==========================================
  Files         179      180       +1     
  Lines       50696    50710      +14     
==========================================
+ Hits        46581    46583       +2     
- Misses       4115     4127      +12
Flag Coverage Δ
#multiple 90.45% <50%> (-0.02%) ⬇️
#single 41.1% <50%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/io/api.py 100% <100%> (ø) ⬆️
pandas/io/spss.py 46.15% <46.15%> (ø)
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/frame.py 96.88% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.84% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...e55b8c4. Read the comment docs.

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 28, 2019

I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

One test file containing dates cannot be loaded because pyreadstat produces an error - this is reported in Roche/pyreadstat#26.

I've updated install.rst and I'm currently trying to see if a Travis job successfully completes when pyreadstat is installed.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented May 28, 2019

I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

Seems like haven has an MIT license, according to https://cran.r-project.org/web/packages/haven/index.html. Does that sound right @hadley?

If so, I think you can include include Haven's license file in our licenses folder, and we'll be good. It'd be good to note the source of these tests files in test_spss.py.

@hadley

This comment has been minimized.

Copy link

commented May 28, 2019

Yeah, that's fine with me. (I generally follow the US and consider data to be un-copyrightable, although in this case I guess it's the specific form that it's important not so much the data)

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2019

All tests pass now (currently re-running because of a wrong imports order). Two open questions:

  1. I have currently added pyreadstat only to one Travis job (3.7). Where would you like this dependency to be added?
  2. The Haven license consists of two files (MIT and LICENSE). How and where should these files be added (e.g. how should we rename these files)?
@jreback
Copy link
Contributor

left a comment

put licenses in pandas/LICENSES

@@ -22,3 +22,4 @@ dependencies:
- pip
- pip:
- moto
- pyreadstat

This comment has been minimized.

Copy link
@jreback

jreback May 29, 2019

Contributor

is this only a wheel?

This comment has been minimized.

Copy link
@jreback

jreback May 29, 2019

Contributor

add to one of the windows builds & the macosx build; is this support on 3.5? any other requirements?

This comment has been minimized.

Copy link
@cbrnr

cbrnr May 29, 2019

Author Contributor

Done.

This comment has been minimized.

Copy link
@cbrnr

cbrnr May 29, 2019

Author Contributor

I don't know what you mean by "only a wheel" though. pyreadstat has binary wheels for Windows, Linux, and macOS for Python 3.5, 3.6, and 3.7: https://pypi.org/project/pyreadstat/#files

This comment has been minimized.

Copy link
@jreback

jreback May 30, 2019

Contributor

meaning you are installing thru pip, rather use a conda package if its available

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger May 30, 2019

Contributor

Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock

Show resolved Hide resolved doc/source/whatsnew/v0.25.0.rst Outdated
@@ -0,0 +1,27 @@
def read_spss(path, usecols=None, categorical=True):

This comment has been minimized.

Copy link
@jreback

jreback May 29, 2019

Contributor

can you add typing

This comment has been minimized.

Copy link
@cbrnr

cbrnr May 29, 2019

Author Contributor

Sure. This is the first time I've used typing so please double-check.

Show resolved Hide resolved pandas/tests/io/test_spss.py Outdated
@pep8speaks

This comment has been minimized.

Copy link

commented May 29, 2019

Hello @cbrnr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-15 07:07:22 UTC
@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 29, 2019

@jreback I've implemented all changes.

@@ -22,3 +22,4 @@ dependencies:
- pip
- pip:
- moto
- pyreadstat

This comment has been minimized.

Copy link
@jreback

jreback May 30, 2019

Contributor

meaning you are installing thru pip, rather use a conda package if its available

Show resolved Hide resolved pandas/io/spss.py Outdated
Parameters
----------
path : string

This comment has been minimized.

Copy link
@jreback

jreback May 30, 2019

Contributor

does this accept a pathlike? Union[str, Path] ?

This comment has been minimized.

Copy link
@cbrnr

cbrnr May 30, 2019

Author Contributor

pathlib.Path should work everywhere a path string is expected, so IMO it is not really necessary to explicitly add this. But if you want I can of course add it (this requires an extra import pathlib though).

This comment has been minimized.

Copy link
@cbrnr

cbrnr May 30, 2019

Author Contributor

I've implemented this change.

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

Yes, pyreadstat is only available via pip and not via conda.

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

Fixed the formatting issues, hopefully this will come back green and then it should be ready to merge.

@WillAyd
Copy link
Member

left a comment

Minor nit on annotation but otherwise this lgtm - nice change!

@WillAyd WillAyd added this to the 0.25.0 milestone May 30, 2019

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

@WillAyd could you please elaborate?

@WillAyd

This comment has been minimized.

Copy link
Member

commented May 30, 2019

Not sure what happened to my comment but was asking to change Union[str, Sequence[str], None] to Optional[Union[str, Sequence[str]]

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 30, 2019

@WillAyd done!

@@ -22,3 +22,4 @@ dependencies:
- pip
- pip:
- moto
- pyreadstat

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger May 30, 2019

Contributor

Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock

Show resolved Hide resolved pandas/io/spss.py Outdated
Show resolved Hide resolved pandas/tests/io/test_spss.py Outdated
@ofajardo

This comment has been minimized.

Copy link

commented May 31, 2019

@cbrnr Pyreadstat is available both with pip and conda. In the README that is explained in the How to install section

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 31, 2019

I didn't see that you have conda-forge added, sorry about that. I've addressed all comments.

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented May 31, 2019

No idea what's going on with Azure, could someone please restart?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2019

can you merge master (you have a conflict)

@cbrnr cbrnr force-pushed the cbrnr:pyreadstat branch from e87db3b to 734c930 Jun 3, 2019

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 3, 2019

OK, I've rebased.

@jreback

jreback approved these changes Jun 3, 2019

@cbrnr cbrnr force-pushed the cbrnr:pyreadstat branch from 6e2349b to b232b61 Jun 14, 2019

cbrnr added some commits Jun 14, 2019

@cbrnr cbrnr force-pushed the cbrnr:pyreadstat branch from 5a1c3f6 to e55b8c4 Jun 15, 2019

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 15, 2019

Finally all green!

@jreback jreback merged commit 21fe224 into pandas-dev:master Jun 16, 2019

12 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20190615.9 succeeded
Details
pandas-dev.pandas (Checks) Checks succeeded
Details
pandas-dev.pandas (Docs) Docs succeeded
Details
pandas-dev.pandas (Linux py35_compat) Linux py35_compat succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow) Linux py36_locale_slow succeeded
Details
pandas-dev.pandas (Linux py36_locale_slow_old_np) Linux py36_locale_slow_old_np succeeded
Details
pandas-dev.pandas (Linux py37_locale) Linux py37_locale succeeded
Details
pandas-dev.pandas (Linux py37_np_dev) Linux py37_np_dev succeeded
Details
pandas-dev.pandas (Windows py36_np15) Windows py36_np15 succeeded
Details
pandas-dev.pandas (Windows py37_np141) Windows py37_np141 succeeded
Details
pandas-dev.pandas (macOS py35_macos) macOS py35_macos succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 16, 2019

thanks @cbrnr

@cbrnr cbrnr deleted the cbrnr:pyreadstat branch Jun 17, 2019

@ofajardo

This comment has been minimized.

Copy link

commented Jun 27, 2019

@cbrnr I am in the process of releasing a new version of pyreadstat with writing capabilities, in case that's of your interest. Should be there later today or tomorrow.

@cbrnr

This comment has been minimized.

Copy link
Contributor Author

commented Jun 28, 2019

Nice! However, at the moment I don't think there's a need for exporting to a proprietary format directly from pandas. If someone really wants to do that they can use your package directly.

@isantolin

This comment has been minimized.

Copy link

commented Jul 19, 2019

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jul 19, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants
You can’t perform that action at this time.