
Refactor compression code to expand URL support #14576

Closed · wants to merge 2 commits

Conversation

@dhimmel (Contributor) commented Nov 3, 2016

Work in progress. Picks up where #13340 left off.

closes #12688
closes #13340
closes #14570

@jorisvandenbossche (Member) left a comment

Can you add tests for the new features you added? (the compressed sources from urls)

```diff
@@ -276,7 +276,7 @@ def file_path_to_url(path):
 # ZipFile is not a context manager for <= 2.6
 # must be tuple index here since 2.6 doesn't use namedtuple for version_info
-if sys.version_info[1] <= 6:
+if compat.PY2 and sys.version_info[1] <= 6:
```
Member:

We don't support Python 2.6 anymore, so I think this check can be removed.

Contributor Author:

👍 removed in 28f19f1.

```python
f = _get_handle(f, 'r', encoding=self.encoding,
                compression=self.compression,
                memory_map=self.memory_map)
self.handles.append(f)
```
Member:

This line (adding each f to handles) will give problems, as only the handles we opened ourselves should be added, not handles passed by the user (previously those were only added inside the if statements)

Member:

You can see the effect in the failure of the test_file_handles test

Contributor Author:

@jorisvandenbossche I attempted a fix in 2f43c39, but it's not very elegant -- logic is repeated in _get_handle and PythonParser.__init__.

What do you think is the best solution? In what situation do we not end up opening a handle ourselves? Is it a problem that in PY3 we pass path_or_buf through io.TextIOWrapper before adding it to handles?
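To make the ownership question concrete, here is a minimal sketch (hypothetical names, not pandas' actual implementation) of the pattern the review converges on: track only the handles pandas itself opened, so user-supplied buffers are never closed on the user's behalf.

```python
import gzip
import io


def get_handle(path_or_buf, mode="r", compression=None):
    """Return (handle, opened), where `opened` lists only handles we created."""
    opened = []
    if hasattr(path_or_buf, "read"):
        # Caller passed an already-open file object: not ours to close.
        handle = path_or_buf
    elif compression == "gzip":
        handle = gzip.open(path_or_buf, mode)
        opened.append(handle)
    else:
        handle = open(path_or_buf, mode)
        opened.append(handle)
    return handle, opened


# A user-supplied buffer is never added to the ownership list:
buf = io.StringIO("a,b\n1,2\n")
handle, opened = get_handle(buf)
assert handle is buf and opened == []
```

The caller then extends `self.handles` with `opened` only, keeping the close-on-exit logic limited to handles pandas opened itself.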

@jorisvandenbossche jorisvandenbossche added the IO Data IO issues that don't fit into a more specific label label Nov 3, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 3, 2016
```diff
@@ -285,53 +285,84 @@ def ZipFile(*args, **kwargs):
 ZipFile = zipfile.ZipFile


-def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
-    """Gets file handle for given path and mode.
+def _get_handle(source, mode, encoding=None, compression=None,
```
Contributor:

Call this path_or_buf, in line with the spelling used elsewhere.

Contributor Author:

Addressed in 673bde0

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 3, 2016
@dhimmel (Contributor Author) commented Nov 3, 2016

> Can you add tests for the new features you added? (the compressed sources from urls)

@jorisvandenbossche, do you want me to add xz/zip/bz2 compressed versions of salary.table.csv to pandas/io/tests/parser/data? Also should I add my tests to pandas/io/tests/parser/test_network.py#L17-L35?

@dhimmel (Contributor Author) commented Nov 3, 2016

Are we also going to have to do something to common.py#L239-L244?

@jreback (Contributor) commented Nov 3, 2016

@dhimmel yes that should be more general

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 4, 2016
Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather
than comma delimited. Remove the word table because it's implied by the
extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to, not strip, the original extension.

Created new files by running the following commands:

```sh
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 10, 2016
```python
url = self.base_url + '.zip'
url_table = read_table(url, compression="zip", engine="python")
tm.assert_frame_equal(url_table, self.local_table)
def test_compressed_urls(self):
```
Contributor Author:

@jorisvandenbossche, to try and reduce the repetition of the testing code, I made this commit (28373f1) which uses nose test generators. However, the nose doc states:

> Please note that method generators are not supported in unittest.TestCase subclasses.

As a result, the tests yielded by test_compressed_urls aren't actually running. I'm attempting to run the test using:

```sh
nosetests --verbosity=3 pandas/io/tests/parser/test_network.py:TestCompressedUrl.test_compressed_urls
```

Do you know the best way to proceed? I'm new to nose and pretty new to testing, so any advice will be appreciated.
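For reference, the yield pattern under discussion looks roughly like this (a sketch with placeholder logic, not the actual test_network.py code). nose only collects such generators from plain classes, which is why inheriting from unittest.TestCase silently skips the yielded tests:

```python
class TestCompressedUrl(object):  # plain class: must NOT subclass unittest.TestCase
    compressions = ['gz', 'bz2', 'xz', 'zip']

    def check_url(self, ext):
        # Placeholder for the real check, which would read_table() a URL
        # such as base_url + '.' + ext and compare against the local table.
        return 'salaries.tsv.' + ext

    def test_compressed_urls(self):
        for ext in self.compressions:
            # nose collects each yielded (callable, argument) pair as a test case
            yield self.check_url, ext


# Outside nose, the generator can be driven by hand:
t = TestCompressedUrl()
results = [func(arg) for func, arg in t.test_compressed_urls()]
assert results == ['salaries.tsv.gz', 'salaries.tsv.bz2',
                   'salaries.tsv.xz', 'salaries.tsv.zip']
```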

Contributor:

we run yielded tests in computation and io/pickle

will be much easier when we switch to pytest

Contributor:

see those for some examples

@dhimmel (Contributor Author) commented Nov 11, 2016:

Cool, I see the test generator examples in io/tests/test_pickle.py, computation/tests/test_compat.py, and computation/tests/test_eval.py. Those examples don't have a setUp method nor do they use the pandas.util.testing.network decorator. So I did my best in 272669b. It seems that one side effect is that the failure messages are less informative?

@codecov-io commented Nov 10, 2016

Current coverage is 85.29% (diff: 88.40%)

Merging #14576 into master will increase coverage by 0.02%

```
@@             master     #14576   diff @@
==========================================
  Files           144        144
  Lines         50980      50943    -37
  Methods           0          0
  Messages          0          0
  Branches          0          0
==========================================
- Hits          43470      43453    -17
+ Misses         7510       7490    -20
  Partials          0          0
```

Powered by Codecov. Last update 14e4815...b33a500

@dhimmel dhimmel force-pushed the compression branch 2 times, most recently from 272669b to 4a5d3e7 Compare November 11, 2016 15:21
dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 11, 2016
```python
from io import TextIOWrapper
return TextIOWrapper(path_or_buf, encoding=encoding)

elif compression:
```
Contributor:

Can this be 'infer' here? (IIRC we have an inference function based on the file extension.) Is this guaranteed to be a valid compression? (It's fine, though I think it's easy to handle here anyhow.)

Contributor Author:

Currently, this function (_get_handle) cannot consume compression='infer'. However, it would be an easy addition:

```python
if compression == 'infer':
    assert is_path
    compression = infer_compression(path_or_buf)
```

One issue I'm having is that _get_handle was undocumented before this PR. So it's a bit hard for me to know what to implement. @jreback or @jorisvandenbossche: can you comment on the tentative docstring and fill in the missing details, so I know exactly what to implement?

Member:

@dhimmel What is still missing in your current docstring?

Adding compression='infer' here only makes sense if it would be used, of course. Where does the compression inference happen now? Maybe it is already inferred before being passed to this function?

Contributor Author:

> Where does the compression inference happen now?

@jorisvandenbossche, inference currently happens in pandas/io/parsers.py.

I don't think compression == 'infer' will ever make it this far with the current design. @jreback does this answer your question?

Contributor:

@dhimmel that's fine. I just want to validate somewhere that compression in [None, ......] or raise a ValueError
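A minimal sketch of that up-front validation (hypothetical names; the set of accepted values mirrors the extensions discussed in this PR): check the value against a whitelist and raise a ValueError before any decompression is attempted.

```python
# Accepted compression values, per the extension map in this PR (assumption:
# 'infer' and None are also legal at this layer).
_VALID_COMPRESSION = {None, 'infer', 'gzip', 'bz2', 'zip', 'xz'}


def validate_compression(compression):
    """Raise ValueError for unrecognized compression values."""
    if compression not in _VALID_COMPRESSION:
        msg = 'Unrecognized compression type: {!r}'.format(compression)
        raise ValueError(msg)
    return compression


assert validate_compression('gzip') == 'gzip'
```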

```python
def test_url_gz_infer(self):
    url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz'
    url_table = read_table(url, compression="infer", engine="python")
class TestCompressedUrl:
```
Contributor:

inherit from object

@jreback (Contributor) commented Nov 12, 2016

Ideally, if you can use some of this code in the parser routines, that would be great (it may not be completely possible, but I want to consolidate some of that duplication).

@dhimmel (Contributor Author) commented Nov 21, 2016

Regarding common.py#L239-L244, I think there's a problem. Currently, when compression="infer" and the path_or_buf is a URL, get_filepath_or_buffer is using the Content-Encoding header of the request to infer the compression type. However, I don't actually think compressed files on the network will provide a Content-Encoding header of their compression type. Instead, I think we want to infer the compression of a URL from its name, just as we do with paths.

For example, https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/io/tests/parser/data/salaries.csv.bz2 would infer bz2 compression because it ends with .bz2. @jreback and @jorisvandenbossche, agree?
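A sketch of what extension-based inference over both paths and URLs looks like (illustrative names; the real helper was later moved to io/common.py as _infer_compression, per commit b34c2f6):

```python
# Map of recognized file extensions to pandas compression names.
_EXT_MAP = [('.gz', 'gzip'), ('.bz2', 'bz2'), ('.zip', 'zip'), ('.xz', 'xz')]


def infer_compression(path_or_url):
    """Infer compression from the extension of a path or URL, else None."""
    for ext, compression in _EXT_MAP:
        if path_or_url.lower().endswith(ext):
            return compression
    return None


url = ('https://raw.githubusercontent.com/pandas-dev/pandas/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
assert infer_compression(url) == 'bz2'
assert infer_compression('data/salaries.tsv') is None
```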

@jreback (Contributor) commented Nov 21, 2016

I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

```diff
-f = lzma.LZMAFile(path, mode)
+f = lzma.LZMAFile(path_or_buf, mode)

 # Unrecognized Compression
```
Contributor:

Ahh ok, we are doing the validation, nvm then. (I assume we have some tests that hit this path.)

@dhimmel (Contributor Author) commented Dec 4, 2016:

> assume we have some tests that hit this path

Actually, I don't think there were any tests for invalid compression (regardless of URL or not), so I added one in f2ce8f8.

```python
# in Python 3, convert BytesIO or fileobjects passed with an encoding
elif compat.PY3 and isinstance(f, compat.BytesIO):
    from io import TextIOWrapper
add_handle = (
```
@jreback (Contributor) commented Nov 21, 2016:

Maybe this could be a utility function in io/common (unless it's not used anywhere else).

Contributor Author:

I think add_handle is only used here -- I created it for this PR, so nowhere else depends on it.

I'm still a little unsure about the purpose of self.handles. Should it only contain file handles? Right now we're adding handles that aren't files (for example, in the case of compat.PY3 and isinstance(f, compat.BytesIO)).

Contributor:

self.handles is to allow things to be closed if WE (meaning pandas) opened them; e.g. when you wrap with TextIOWrapper, we NEED to close it. Otherwise we try to preserve open file handles (that were passed to us).

Member:

Another way to do this would be to handle this inside _get_handle and return from that function the handles that should be added. IMO that would be more maintainable; as it stands, this boolean expression depends on the content of the _get_handle function, so better to do it there.

```python
'gzip': '.gz',
'bz2': '.bz2',
'zip': '.zip',
'xz': '.xz',
```
Contributor:

is there a test for invalid compression specified?

Contributor Author:

Did you want a specific test for invalid compression specified for a URL? As of the latest commit (bff0f3e), compression inference has been consolidated and invalid compression raises an error before compression is attempted.

```python
content_encoding = req.headers.get('Content-Encoding', None)
if content_encoding == 'gzip':
    # Override compression based on Content-Encoding header
    compression = 'gzip'
```
@dhimmel (Contributor Author) commented Dec 3, 2016:

> I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

@jreback, Content-Encoding is not a versatile way to infer compression -- gzip is the only compression encoding that pandas and the Content-Encoding specification both support. I've modified it so inference on a URL uses the extension, just as a path would. However, if Content-Encoding is gzip, then compression is overridden. I don't expect this situation to arise very often. Does this make sense or should we just totally ignore Content-Encoding?

Contributor:

Yeah, I agree with your assessment. Since this was here, we can leave it. But I agree the filename inference is better / more standard.

@dhimmel (Contributor Author) commented Dec 4, 2016

I'm getting failing builds in AppVeyor in Python 2 due to ImportError: No module named backports. This occurs in compat/__init__.py at:

```python
def import_lzma():
    """ import the backported lzma library
    or raise ImportError if not available """
    from backports import lzma
    return lzma
```

Any idea why backports is not available? How should I proceed?

Update: I think I resolved this in af3c2bc.
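One general way to make such an optional dependency fail lazily (a sketch of the pattern, not the actual af3c2bc change): defer the import and raise a descriptive ImportError only when lzma is actually requested. On Python 3, lzma ships in the standard library, so the fallback branch only matters on Python 2.

```python
def import_lzma():
    """Return the lzma module, or raise a descriptive ImportError."""
    try:
        import lzma  # standard library on Python 3
        return lzma
    except ImportError:
        raise ImportError(
            "lzma not available; on Python 2, install the backports.lzma package")


lzma = import_lzma()
assert hasattr(lzma, 'LZMAFile')
```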

@dhimmel (Contributor Author) commented Dec 13, 2016

Assuming the tests pass for 640dc1b, I don't think there are any more unaddressed comments.

> can you rebase, problem with AppVeyor which I just fixed

@jreback rebased, with AppVeyor now passing.

> Can you open issues for those to keep track of them.

@jorisvandenbossche: I created #14874 and #14875.

> We can just squash your additional commits into one commit, and then commit those two commits of two authors to master.

@jorisvandenbossche do you want me to squash all of my commits? If so, what should go in the commit message? Should I add the issues we're closing to the commit message?

@dhimmel dhimmel changed the title WIP: Consolidate compression code Refactor compression code to expand URL support Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

@dhimmel you can squash, or it will get squashed on the merge. In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great. We could in this case cherry-pick them (rather than completely squashing) to preserve authorship (I think when we squash ALL of them, authorship goes to the last one). If we only have 2 commits, a merge is easy.

@dhimmel (Contributor Author) commented Dec 13, 2016

> In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great.

I'm going to squash all of my commits. Grouping them will be pretty burdensome and the intermediate commits were often not functional. I created a branch on dhimmel/pandas to preserve the individual commits for reference.

Just to list things here, `git log master.. --oneline` gives the following:

```
640dc1b Simplify TextIOWrapper usage
4774b75 Use list to store new handles
94aa0cb Track new handles in _get_handle #14576 (comment) #14576 (comment)
376dce3 Close bz2 file object in Python 2
b88d1b8 Remove compressed URL c-engine tests
b34c2f6 Move _infer_compression to io/common.py
d27e57d Fix "ImportError: No module named backports"
97a1c25 Test compressed URLs using c engine
ecee74f Revert to old error msg. Remove unused line
7c9fe8b Skip xz test if no lzma
492e615 Remove unused imports
bf9fe97 Add invalid compression test
7940ab7 Fix PEP8 errors
05332fd Update S3 get_filepath_or_buffer compression
24ea605 Fix get_filepath_or_buffer and _get_handle
2dc1f2f Clean up docstring
10652a0 Move compression inference to io/parsers
8d24bcf Delete unused function _wrap_compressed
83b2bc5 Infer compression from URL extension
c2e6e5b Improve error message, fix partial scope
9977da2 Fix nose test generator bug. Better error messages
7ff1247 Inheret from object for TestCompressedUrl
30e1caf Fix local salaries.csv path
6230dfb Update salaries.csv URL post #14587
6530f46 Enable nose execution of test generators
a441448 Broken: use method generators to DRY tests
25114c3 Tests for bz2, xz & zip URLs
89e45f4 Revert changes to zip error messages
29b7f74 Change error message in test_zip
3aba6cb Replicate original handles appending
101a344 Impove _get_handle docstring
58ffbce Rename source to path_or_buf and flake8
f0901fb Remove ZipFile support for Python 2.6
2c78178 Improve readability
592cf92 Fix ZipFile bug in Python 3.6+
74caf72 combine compression code into one
```

36 commits total in this PR. Wow!

@dhimmel (Contributor Author) commented Dec 13, 2016

> if we only have 2 commits a merge is easy.

Great, we're now at 2 commits. I think a merge makes sense, because it is nice to see the diff of both commits combined; alone, the changesets will be incomplete.

Let me know if there's anything more for me to do.

@jorisvandenbossche (Member):

@jreback You can use the github interface as well to "rebase and merge" instead of squashing, which will just apply the commits in this PR to master. And so the two commits now in the PR are perfect!

@jreback (Contributor) commented Dec 13, 2016

@jorisvandenbossche right, but this is the exception to the rule. I will have a look in a bit.

@dhimmel going to merge, and can followup on comments / issues in another PR.

jreback pushed a commit that referenced this pull request Dec 13, 2016
@jreback jreback closed this in 3761448 Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

thanks @dhimmel very nice!

In your xref PR, please add a whatsnew entry that covers all of the issues that were closed (for 0.20.0).

@jorisvandenbossche (Member):

@dhimmel thanks a lot!

dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 14, 2016
dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 14, 2016
dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 15, 2016
jreback pushed a commit that referenced this pull request Dec 17, 2016
…th C engine (GH14874)

Follow up on #14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry.

- [x] Closes #14874

Author: Daniel Himmelstein <daniel.himmelstein@gmail.com>

Closes #14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
4 participants