ENH Add check for inferred compression before `get_filepath_or_buffer` #11074

stephen-hoover · 2015-09-12T16:09:22Z

When reading CSVs, if `compression='infer'`, check the input before calling `get_filepath_or_buffer` in the `_read` function. This way we can catch compression extensions on S3 files. Partially resolves issue #11070.
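
A minimal sketch of the idea, not the code in this PR: resolve `compression='infer'` from the file name while it is still a string, because the extension is no longer visible once an S3 key has been turned into a buffer. The helper name `infer_compression`, the two extensions it recognizes, and the example paths are assumptions for illustration.

```python
def infer_compression(path, compression="infer"):
    """Return an explicit compression name (or None) for `path`.

    Hypothetical helper: only the '.gz' and '.bz2' extensions are checked.
    """
    if compression != "infer":
        # The caller gave an explicit value; pass it through untouched.
        return compression
    if not isinstance(path, str):
        # Buffers and other file-like objects carry no extension to inspect.
        return None
    if path.endswith(".gz"):
        return "gzip"
    if path.endswith(".bz2"):
        return "bz2"
    return None


# The check runs on the name, so it works the same for local and S3 paths.
print(infer_compression("s3://some-bucket/data.csv.gz"))
print(infer_compression("data.csv"))
```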

Checking for the file extension in the `_read` function should make the checks inside the parsers redundant. When I tried to remove them, however, I discovered that there are tests which assume the parsers can take an "infer" compression, so I left their checks in place.

I also discovered that the URL-reading code has a test which reads a URL ending in "gz" but which appears not to be gzip encoded, so this PR attempts to preserve its verdict in that case.

jreback · 2015-09-12T16:20:54Z

pls change tests which are incorrect as well
iow this should cause them to fail

stephen-hoover · 2015-09-12T16:26:58Z

I made this PR so that it didn't break any tests. Are the parsers ever accessed outside of the _read function? Do they need to be able to infer the compression type on their own? If not, I can remove that code from the parsers and change the tests.

jreback · 2015-09-12T16:48:24Z

the infer param can be moved higher up in the stack (eg in the get_filepath_or_buffer) - makes the readers simpler in that respect

stephen-hoover · 2015-09-12T17:28:44Z

Found it. I actually didn't need to change any tests. Now the only check for file extensions happens in the _read function, instead of separately inside each of the two parsers. By the time the parsers get called, any compression inference has already taken place.
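
A rough sketch of that arrangement, under stated assumptions: one entry point resolves `'infer'` a single time, and the parser-side code only ever sees an explicit value. The names `resolve_compression`, `open_decompressed`, and `read_text` are hypothetical stand-ins, not pandas functions.

```python
import bz2
import gzip
import io

# Hypothetical extension table used only by this sketch.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2"}


def resolve_compression(path, compression):
    """Turn 'infer' into an explicit value; done once, at the entry point."""
    if compression != "infer":
        return compression
    if isinstance(path, str):
        for ext, name in _EXTENSIONS.items():
            if path.endswith(ext):
                return name
    return None


def open_decompressed(raw, compression):
    """Parser-side helper: accepts explicit values only, never 'infer'."""
    if compression == "infer":
        raise ValueError("compression must be resolved before parsing")
    if compression == "gzip":
        return gzip.GzipFile(fileobj=raw)
    if compression == "bz2":
        return bz2.BZ2File(raw)
    return raw


def read_text(path, raw, compression="infer"):
    """Stand-in for the single entry point that owns the inference step."""
    compression = resolve_compression(path, compression)
    return open_decompressed(raw, compression).read().decode("utf-8")


payload = io.BytesIO(gzip.compress(b"a,b\n1,2\n"))
print(read_text("data.csv.gz", payload))
```

Keeping the inference in one place means a new extension or a new source type (such as S3) needs a change in only one function.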

jreback · 2015-09-12T17:29:59Z

gr8

stephen-hoover · 2015-09-12T20:54:46Z

Added a test using the new files in s3://pandas-test/.

stephen-hoover · 2015-09-14T13:03:25Z

Should I do anything else for this PR?

jreback · 2015-09-14T19:49:02Z

can you add a whatsnew note for this

jreback · 2015-09-14T22:26:16Z

pls rebase. ping when green.

…ffer` When reading CSVs, if `compression='infer'`, check the input before calling `get_filepath_or_buffer` in the `_read` function. This way we can catch compression extensions on S3 files. We now attempt to infer compression from an input filename only in the `_read` function, instead of separately in each parser.

stephen-hoover · 2015-09-15T12:59:34Z

@jreback, green! I found I had to tweak `get_filepath_or_buffer` with an extra check for 'infer' compression.

ENH Add check for inferred compression before `get_filepath_or_buffer`

jreback · 2015-09-15T13:23:45Z

thanks!

stephen-hoover force-pushed the infer-s3-compression branch from 528d4f8 to 4d8c0c6 Compare September 12, 2015 17:26

jreback added the IO Data label Sep 12, 2015

stephen-hoover force-pushed the infer-s3-compression branch from f217181 to 5f97c7b Compare September 12, 2015 20:53

stephen-hoover mentioned this pull request Sep 12, 2015

Improvements for read_csv from AWS S3 #11070

Closed

4 tasks

jreback added this to the 0.17.0 milestone Sep 14, 2015

stephen-hoover force-pushed the infer-s3-compression branch from 5f97c7b to 9aeb3b2 Compare September 14, 2015 20:08

stephen-hoover force-pushed the infer-s3-compression branch 2 times, most recently from f488d81 to 67f134b Compare September 14, 2015 22:30

stephen-hoover force-pushed the infer-s3-compression branch from 67f134b to a49b2cd Compare September 15, 2015 11:40

jreback added a commit that referenced this pull request Sep 15, 2015

Merge pull request #11074 from stephen-hoover/infer-s3-compression

da6ad3f

ENH Add check for inferred compression before `get_filepath_or_buffer`

jreback merged commit da6ad3f into pandas-dev:master Sep 15, 2015

stephen-hoover deleted the infer-s3-compression branch September 15, 2015 13:29
