ENH Add check for inferred compression before `get_filepath_or_buffer` #11074

Merged
jreback merged 1 commit into pandas-dev:master from stephen-hoover:infer-s3-compression Sep 15, 2015


When reading CSVs, if compression='infer', check the input before calling get_filepath_or_buffer in the _read function. This way we can catch compression extensions on S3 files. Partially resolves issue #11070.

Checking for the file extension in the _read function should make the checks inside the parsers redundant. When I tried to remove them, however, I discovered that there are tests that assume the parsers can accept an "infer" compression, so I left those checks in place.

I also discovered that the URL-reading code has a test that reads a URL ending in "gz" whose content does not appear to be gzip encoded, so this PR attempts to preserve the URL reader's verdict in that case.
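
To illustrate the idea of checking the extension before the path is turned into a buffer, here is a minimal sketch. It is not the code from this PR: the function name `infer_compression`, the extension mapping, and the bucket name in the usage comment are assumptions made for the example.

```python
import os

# Hypothetical mapping for illustration only; not copied from pandas.
_EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2"}


def infer_compression(filepath_or_buffer, compression="infer"):
    """Return a concrete compression name, or None for no compression.

    Only acts when compression == "infer"; any explicit value is
    returned unchanged so the caller's choice always wins.
    """
    if compression != "infer":
        return compression
    # A file-like object has no name to inspect, so assume no compression.
    if not isinstance(filepath_or_buffer, str):
        return None
    # Look only at the final extension; this works for local paths and
    # for URL-style strings such as "s3://bucket/key.csv.gz" alike.
    extension = os.path.splitext(filepath_or_buffer)[1].lower()
    return _EXTENSION_TO_COMPRESSION.get(extension)
```

Because the check runs on the original string, an S3 key such as `s3://example-bucket/data.csv.gz` (a made-up name) can still be recognized even though the object that eventually reaches the parser is a buffer with no filename.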

Contributor

jreback commented Sep 12, 2015

pls change tests which are incorrect as well
iow this should cause them to fail

I made this PR so that it didn't break any tests. Are the parsers ever accessed outside of the _read function? Do they need to be able to infer the compression type on their own? If not, I can remove that code from the parsers and change the tests.

Contributor

jreback commented Sep 12, 2015

the infer param can be moved higher up in the stack (eg in the get_filepath_or_buffer) - makes the readers simpler in that respect

Found it. I actually didn't need to change any tests. Now the only check for file extensions happens in the _read function, instead of separately inside each of the two parsers. By the time the parsers get called, any compression inference has already taken place.
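
The resulting control flow can be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions rather than the pandas implementation: `resolve_compression`, `open_text`, and `read_lines` are hypothetical names, and `open_text` stands in for a parser that expects a concrete compression value.

```python
import bz2
import gzip
import os
import tempfile


def resolve_compression(path, compression):
    """Single place where 'infer' is turned into a concrete value."""
    if compression != "infer":
        return compression
    extension = os.path.splitext(path)[1].lower()
    return {".gz": "gzip", ".bz2": "bz2"}.get(extension)


def open_text(path, compression):
    """Stand-in for a parser: it never sees 'infer'."""
    if compression == "gzip":
        return gzip.open(path, "rt", encoding="utf-8")
    if compression == "bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    if compression is None:
        return open(path, "r", encoding="utf-8")
    raise ValueError("unresolved compression: %r" % (compression,))


def read_lines(path, compression="infer"):
    # Inference happens once, here, before any parser-like code runs.
    compression = resolve_compression(path, compression)
    with open_text(path, compression) as handle:
        return handle.read().splitlines()


# Usage: write a small gzipped CSV, then read it back without naming
# the compression explicitly.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("a,b\n1,2\n")
    rows = read_lines(demo_path)
```

Keeping the inference in one function means the lower-level readers only have to handle concrete values, which is the simplification suggested above.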

Contributor

jreback commented Sep 12, 2015

gr8

jreback added the Data IO label Sep 12, 2015

Added a test using the new files in s3://pandas-test/.

stephen-hoover referenced this pull request Sep 12, 2015

Closed

Improvements for read_csv from AWS S3 #11070

4 of 4 tasks complete

Should I do anything else for this PR?

jreback added this to the 0.17.0 milestone Sep 14, 2015

Contributor

jreback commented Sep 14, 2015

can you add a whatsnew note for this

Contributor

jreback commented Sep 14, 2015

pls rebase. ping when green.

stephen-hoover ENH Move check for inferred compression to before `get_filepath_or_buffer`

When reading CSVs, if `compression='infer'`, check the input before calling `get_filepath_or_buffer` in the `_read` function. This way we can catch compression extensions on S3 files.

We now attempt to infer compression from an input filename only in the `_read` function, instead of separately in each parser.
a49b2cd

@jreback, green! I found I had to tweak get_filepath_or_buffer with an extra check for 'infer' compression.

jreback added a commit that referenced this pull request Sep 15, 2015

jreback Merge pull request #11074 from stephen-hoover/infer-s3-compression
ENH Add check for inferred compression before `get_filepath_or_buffer`
da6ad3f

jreback merged commit da6ad3f into pandas-dev:master Sep 15, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

jreback commented Sep 15, 2015

thanks!

stephen-hoover deleted the stephen-hoover:infer-s3-compression branch Sep 15, 2015
