Add reader for SPSS (.sav) files #26537

cbrnr · 2019-05-27T11:49:42Z

- closes Read and write spss data format #5768 (at least the reading part; this PR does not cover writing SPSS files)
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry

I haven't added a test yet because I wanted to ask which test .sav file I should use (and where to put it). Also, there's no whatsnew entry yet. In which section should this entry go? I assume the feature will be released in 0.25.0, so I'll have to change 0.24.3 to 0.25.0 in the docstring.

This PR adds the capability to load SPSS .sav files with `df = pd.io.read_spss("spss_file.sav")`. Currently, there are two optional arguments: `usecols` (should be self-explanatory, let me know if you don't want me to handle a simple `str`) and `categorical`, which maps to the `apply_value_formats` parameter in pyreadstat's `read_sav`. With `categorical=True`, a categorical column is created with the labels from the .sav file. If `False`, numbers will be used.
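As a rough illustration of the mapping described above, here is a minimal sketch of such a wrapper. It is not the code from this PR: the name `read_spss_sketch` and the `reader` parameter are hypothetical (the latter exists only so the sketch can be exercised without pyreadstat installed), and the handling of a plain `str` for `usecols` is one possible choice among several.

```python
from typing import Callable, Optional, Sequence, Tuple, Union


def read_spss_sketch(
    path: str,
    usecols: Optional[Union[str, Sequence[str]]] = None,
    categorical: bool = True,
    reader: Optional[Callable[..., Tuple[object, object]]] = None,
):
    """Hypothetical wrapper around pyreadstat.read_sav (illustration only)."""
    if reader is None:
        # Imported lazily so this module loads even without pyreadstat.
        import pyreadstat

        reader = pyreadstat.read_sav
    if isinstance(usecols, str):
        # Assumption: a single column name is shorthand for a one-item list.
        usecols = [usecols]
    elif usecols is not None:
        usecols = list(usecols)
    # `categorical` is forwarded as pyreadstat's `apply_value_formats`.
    data, _meta = reader(path, usecols=usecols, apply_value_formats=categorical)
    # The metadata is dropped so that callers receive only the data frame.
    return data
```

With pyreadstat installed, `read_spss_sketch("spss_file.sav", usecols="var1")` would forward `usecols=["var1"]` and `apply_value_formats=True` to `read_sav`.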

A few open questions:

- Which additional optional arguments should be made available? Pyreadstat has `dates_as_pandas_datetime`, `encoding`, and `user_missing`, which I haven't mapped yet.
- Should the function be called `read_spss` or `read_sav`? SPSS files have the extension sav, but the R haven package has a function `read_spss` (which is why I'd prefer `read_spss`).
- Are there any additional meta information bits that could be used/integrated into the data frame? `pyreadstat.read_sav` returns a dataframe and meta-information separately, which I think we shouldn't do in pandas.

jreback

pls add this to the CI in several (but not all) places; this should gracefully skip tests if not installed. you can commit the .sav files to pandas/tests/io/data/ (see how we do this for .dta). also update the install.rst
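One way to make tests skip gracefully when an optional dependency is missing is sketched below using only the standard library. The helper `has_module` and the test class are hypothetical names, not taken from this PR; in a pytest-based suite, `pytest.importorskip("pyreadstat")` is the usual way to get the same effect.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found."""
    return importlib.util.find_spec(name) is not None


# The whole class is skipped (not failed) when pyreadstat is absent.
@unittest.skipUnless(has_module("pyreadstat"), "pyreadstat is not installed")
class TestReadSpss(unittest.TestCase):
    def test_reader_is_available(self):
        import pyreadstat  # only reached when the dependency is present

        self.assertTrue(hasattr(pyreadstat, "read_sav"))
```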

codecov · 2019-05-28T08:47:03Z

Codecov Report

Merging #26537 into master will decrease coverage by 50.06%.
The diff coverage is 25%.

@@             Coverage Diff             @@
##           master   #26537       +/-   ##
===========================================
- Coverage   91.76%    41.7%   -50.07%     
===========================================
  Files         174      175        +1     
  Lines       50629    50637        +8     
===========================================
- Hits        46462    21119    -25343     
- Misses       4167    29518    +25351

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `?` | |
| #single | `41.7% <25%> (-0.07%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/io/api.py | `100% <100%> (ø)` | ⬆️ |
| pandas/io/spss.py | `14.28% <14.28%> (ø)` | |
| pandas/io/formats/latex.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/io/sas/sas_constants.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/core/groupby/categorical.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/tseries/plotting.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/tseries/converter.py | `0% <0%> (-100%)` | ⬇️ |
| pandas/io/formats/html.py | `0% <0%> (-99.37%)` | ⬇️ |
| pandas/io/sas/sas7bdat.py | `0% <0%> (-91.16%)` | ⬇️ |
| pandas/io/sas/sas_xport.py | `0% <0%> (-90.1%)` | ⬇️ |
| ... and 131 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a516c1...0407cc5. Read the comment docs.

codecov · 2019-05-28T08:47:04Z

Codecov Report

Merging #26537 into master will decrease coverage by 0.02%.
The diff coverage is 50%.

@@            Coverage Diff             @@
##           master   #26537      +/-   ##
==========================================
- Coverage   91.88%   91.86%   -0.03%     
==========================================
  Files         179      180       +1     
  Lines       50696    50710      +14     
==========================================
+ Hits        46581    46583       +2     
- Misses       4115     4127      +12

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.45% <50%> (-0.02%)` | ⬇️ |
| #single | `41.1% <50%> (-0.08%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/io/api.py | `100% <100%> (ø)` | ⬆️ |
| pandas/io/spss.py | `46.15% <46.15%> (ø)` | |
| pandas/io/gbq.py | `88.88% <0%> (-11.12%)` | ⬇️ |
| pandas/core/frame.py | `96.88% <0%> (-0.12%)` | ⬇️ |
| pandas/util/testing.py | `90.84% <0%> (-0.11%)` | ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 430f0fd...e55b8c4. Read the comment docs.

cbrnr · 2019-05-28T09:02:05Z

I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

One test file containing dates cannot be loaded because pyreadstat produces an error - this is reported in Roche/pyreadstat#26.

I've updated install.rst and I'm currently trying to see if a Travis job successfully completes when pyreadstat is installed.

TomAugspurger · 2019-05-28T20:15:05Z

> I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files?

Seems like haven has an MIT license, according to https://cran.r-project.org/web/packages/haven/index.html. Does that sound right @hadley?

If so, I think you can include Haven's license file in our licenses folder, and we'll be good. It'd be good to note the source of these test files in test_spss.py.

hadley · 2019-05-28T20:34:54Z

Yeah, that's fine with me. (I generally follow the US and consider data to be un-copyrightable, although in this case I guess it's the specific form that's important, not so much the data.)

cbrnr · 2019-05-29T09:21:12Z

All tests pass now ~~(currently re-running because of a wrong imports order)~~. Two open questions:

- I have currently added pyreadstat only to one Travis job (3.7). Where would you like this dependency to be added?
- The Haven license consists of two files (MIT and LICENSE). How and where should these files be added (e.g. how should we rename these files)?

jreback

put licenses in pandas/LICENSES

jreback · 2019-05-29T12:23:50Z

ci/deps/travis-37.yaml

@@ -22,3 +22,4 @@ dependencies:
  - pip
  - pip:
    - moto
+    - pyreadstat


is this only a wheel?

add to one of the windows builds & the macosx build; is this supported on 3.5? any other requirements?

I don't know what you mean by "only a wheel" though. pyreadstat has binary wheels for Windows, Linux, and macOS for Python 3.5, 3.6, and 3.7: https://pypi.org/project/pyreadstat/#files

meaning you are installing through pip; rather use a conda package if it's available

Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock


jreback · 2019-05-29T12:25:06Z

pandas/io/spss.py

@@ -0,0 +1,27 @@
+def read_spss(path, usecols=None, categorical=True):


can you add typing

Sure. This is the first time I've used typing so please double-check.


pep8speaks · 2019-05-29T15:52:00Z

Hello @cbrnr! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-15 07:07:22 UTC

cbrnr · 2019-05-29T16:01:22Z

@jreback I've implemented all changes.

jreback · 2019-05-30T01:23:28Z

ci/deps/travis-37.yaml

@@ -22,3 +22,4 @@ dependencies:
  - pip
  - pip:
    - moto
+    - pyreadstat


meaning you are installing through pip; rather use a conda package if it's available


jreback · 2019-05-30T01:25:36Z

pandas/io/spss.py

+
+    Parameters
+    ----------
+    path : string


does this accept a pathlike? Union[str, Path] ?

pathlib.Path should work everywhere a path string is expected, so IMO it is not really necessary to explicitly add this. But if you want I can of course add it (this requires an extra import pathlib though).

I've implemented this change.
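For reference, a minimal sketch of accepting both `str` and `pathlib.Path` is shown below. The helper name `normalize_path` is hypothetical and this is not necessarily how the PR implements the change; whether the underlying reader needs a plain string at all is an assumption here.

```python
import os
from pathlib import Path
from typing import Union


def normalize_path(path: Union[str, Path]) -> str:
    """Return the file-system representation of `path` as a plain string."""
    # os.fspath returns str unchanged and converts Path objects to str;
    # it raises TypeError for objects that are not path-like.
    return os.fspath(path)
```

Both `normalize_path("spss_file.sav")` and `normalize_path(Path("spss_file.sav"))` yield the same string, so the rest of the function only has to deal with one type.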

cbrnr · 2019-05-30T07:59:46Z

Yes, pyreadstat is only available via pip and not via conda.

cbrnr · 2019-05-30T08:08:33Z

Fixed the formatting issues, hopefully this will come back green and then it should be ready to merge.

WillAyd

Minor nit on annotation but otherwise this lgtm - nice change!

cbrnr · 2019-05-30T18:33:58Z

@WillAyd could you please elaborate?

WillAyd · 2019-05-30T18:37:14Z

Not sure what happened to my comment but was asking to change `Union[str, Sequence[str], None]` to `Optional[Union[str, Sequence[str]]]`
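The two spellings describe the same type, so the request is purely stylistic. A short check, with the alias names `Verbose` and `Preferred` invented for illustration:

```python
from typing import Optional, Sequence, Union

# Spelling used before the review comment.
Verbose = Union[str, Sequence[str], None]

# Spelling requested in the review; Optional[X] is shorthand for Union[X, None].
Preferred = Optional[Union[str, Sequence[str]]]

# Nested unions are flattened, so both aliases compare equal.
same_type = Verbose == Preferred
```

Here `same_type` evaluates to `True`, so type checkers treat both annotations identically.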

cbrnr · 2019-05-30T18:42:54Z

@WillAyd done!

TomAugspurger · 2019-05-30T21:57:40Z

ci/deps/travis-37.yaml

@@ -22,3 +22,4 @@ dependencies:
  - pip
  - pip:
    - moto
+    - pyreadstat


Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock


ofajardo · 2019-05-31T07:51:53Z

@cbrnr Pyreadstat is available through both pip and conda. This is explained in the "How to install" section of the README.

cbrnr · 2019-05-31T07:54:58Z

I didn't see that you have conda-forge added, sorry about that. I've addressed all comments.

cbrnr · 2019-05-31T08:52:43Z

No idea what's going on with Azure, could someone please restart?

jreback · 2019-06-02T23:55:55Z

can you merge master (you have a conflict)

cbrnr · 2019-06-03T06:16:41Z

OK, I've rebased.

jreback · 2019-06-03T11:58:09Z

looks good. @jorisvandenbossche @TomAugspurger @bashtage ok with this?

cbrnr · 2019-06-15T10:43:49Z

Finally all green!

jreback · 2019-06-16T14:30:52Z

thanks @cbrnr

ofajardo · 2019-06-27T15:17:26Z

@cbrnr I am in the process of releasing a new version of pyreadstat with writing capabilities, in case that's of interest to you. It should be available later today or tomorrow.

cbrnr · 2019-06-28T06:32:18Z

Nice! However, at the moment I don't think there's a need for exporting to a proprietary format directly from pandas. If someone really wants to do that they can use your package directly.

isantolin · 2019-07-19T13:16:33Z

Still missing documentation http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html

TomAugspurger · 2019-07-19T13:19:30Z

Can you open a PR fixing that? Or a new issue if you don't plan to?


cbrnr mentioned this pull request May 27, 2019

macOS wheel cannot be installed without admin permissions Roche/pyreadstat#23

Closed

jreback requested changes May 27, 2019

jreback added the IO Data (IO issues that don't fit into a more specific label) and IO Stata (read_stata, to_stata) labels May 27, 2019

jreback requested changes May 29, 2019

jreback requested changes May 30, 2019

WillAyd requested changes May 30, 2019

WillAyd added this to the 0.25.0 milestone May 30, 2019

WillAyd approved these changes May 30, 2019

TomAugspurger reviewed May 30, 2019

jreback approved these changes Jun 3, 2019

cbrnr added 18 commits June 14, 2019 15:08

- Use is_list_like (2d7a256)
- Fix tests (bd56eee)
- Fix df is assigned but never used (f68b516)
- Sort imports (ee14f29)
- Update requirements-dev.txt (15d7c71)
- Remove isort (a05de6c)
- Improve docstring (748fe61)
- Revert indent (ced4866)
- Indent should be 2 spaces (a18e0f5)
- Add minimum version for pyreadstat (913989d)
- Add minimum version (0abcde8)
- Use import_optional_dependency (040af2b)
- Remove minimum version for now (53f5692)
- Correct import order (ceef885)
- Remove duplicate (b232b61)
- Don't need conda-forge here (b8b7fff)
- Remove blank line (90702f3)
- Fix order (e55b8c4)

jreback merged commit 21fe224 into pandas-dev:master Jun 16, 2019

cbrnr deleted the pyreadstat branch June 17, 2019 05:35

isantolin mentioned this pull request Jul 19, 2019

BUG: Missing documentation for read_spss() #27476

Closed

TomAugspurger mentioned this pull request Mar 16, 2020

copy license text from: tidyverse/haven #32756

Merged

ofajardo mentioned this pull request Dec 9, 2020

Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters #31243

Closed

allefeld mentioned this pull request May 17, 2022

Read and write spss data format #5768

Closed
