Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Add check for inferred compression before get_filepath_or_buffer #11074

Merged
merged 1 commit into from
Sep 15, 2015

Conversation

stephen-hoover
Copy link
Contributor

When reading CSVs, if compression='infer', check the input before calling get_filepath_or_buffer in the _read function. This way we can catch compresion extensions on S3 files. Partially resolves issue #11070 .

Checking for the file extension in the _read function should make the checks inside the parsers redundant. When I tried to remove them, however, I discovered that there's tests which assume the parsers can take an "infer" compression, so I left their checks.

I also discovered that the URL-reading code has a test which reads a URL ending in "gz" but which appears not to be gzip encoded, so this PR attempts to preserve its verdict in that case.

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

pls change tests which are incorrect as well
iow this should cause them to fail

@stephen-hoover
Copy link
Contributor Author

I made this PR so that it didn't break any tests. Are the parsers ever accessed outside of the _read function? Do they need to be able to infer the compression type on their own? If not, I can remove that code from the parsers and change the tests.

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

the infer param can be moved higher up in the stack (eg in the get_filepath_or_buffer) - makes the readers simpler in that respect

@stephen-hoover
Copy link
Contributor Author

Found it. I actually didn't need to change any tests. Now the only check for file extensions happens in the _read function, instead of separately inside each of the two parsers. By the time the parsers get called, any compression inference has already taken place.

@jreback
Copy link
Contributor

jreback commented Sep 12, 2015

gr8

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Sep 12, 2015
@stephen-hoover
Copy link
Contributor Author

Added a test using the new files in s3://pandas-test/.

@stephen-hoover
Copy link
Contributor Author

Should I do anything else for this PR?

@jreback jreback added this to the 0.17.0 milestone Sep 14, 2015
@jreback
Copy link
Contributor

jreback commented Sep 14, 2015

can you add a whatsnew note for this

@jreback
Copy link
Contributor

jreback commented Sep 14, 2015

pls rebase. ping when green.

@stephen-hoover stephen-hoover force-pushed the infer-s3-compression branch 2 times, most recently from f488d81 to 67f134b Compare September 14, 2015 22:30
…ffer`

When reading CSVs, if `compression='infer'`, check the input before calling `get_filepath_or_buffer` in the `_read` function. This way we can catch compresion extensions on S3 files.

We now attempt to infer compression from an input filename only in the `_read` function, instead of separately in each parser.
@stephen-hoover
Copy link
Contributor Author

@jreback , green! I found had to tweak get_filepath_or_buffer with an extra check for 'infer' compression.

jreback added a commit that referenced this pull request Sep 15, 2015
ENH Add check for inferred compression before `get_filepath_or_buffer`
@jreback jreback merged commit da6ad3f into pandas-dev:master Sep 15, 2015
@jreback
Copy link
Contributor

jreback commented Sep 15, 2015

thanks!

@stephen-hoover stephen-hoover deleted the infer-s3-compression branch September 15, 2015 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants