
Refactor compression code to expand URL support #14576

Closed · wants to merge 2 commits

Conversation

@dhimmel (Contributor) commented Nov 3, 2016

Work in progress. Picks up where #13340 left off.

closes #12688
closes #13340
closes #14570

@jorisvandenbossche (Member) left a comment

Can you add tests for the new features you added? (the compressed sources from urls)

```diff
@@ -276,7 +276,7 @@ def file_path_to_url(path):
 # ZipFile is not a context manager for <= 2.6
 # must be tuple index here since 2.6 doesn't use namedtuple for version_info
-if sys.version_info[1] <= 6:
+if compat.PY2 and sys.version_info[1] <= 6:
```
Member:

We don't support Python 2.6 anymore, so I think this check can be removed.

Contributor Author:

👍 removed in 28f19f1.

```python
f = _get_handle(f, 'r', encoding=self.encoding,
                compression=self.compression,
                memory_map=self.memory_map)
self.handles.append(f)
```
Member:

This line (adding each f to handles) will give problems, as only the handles we opened ourselves should be added, not handles passed by the user (previously those were only added inside the if statements)

Member:

You can see the effect in the failure of the test_file_handles test

Contributor Author:

@jorisvandenbossche I attempted a fix in 2f43c39, but it's not very elegant -- logic is repeated in _get_handle and PythonParser.__init__.

What do you think is the best solution? In what situation do we not end up opening a handle ourselves? Is it a problem that in PY3 we pass path_or_buf through io.TextIOWrapper before adding it to handles?
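To make the ownership question concrete, here is a minimal sketch (hypothetical names, not pandas' actual implementation) of the pattern the review converges on: track only the handles pandas itself opened, so user-supplied buffers are never closed on the user's behalf.

```python
import gzip
import io


def get_handle(path_or_buf, mode="r", compression=None):
    """Return (handle, opened), where `opened` lists only handles we created."""
    opened = []
    if hasattr(path_or_buf, "read"):
        # Caller passed an already-open file object: not ours to close.
        handle = path_or_buf
    elif compression == "gzip":
        handle = gzip.open(path_or_buf, mode)
        opened.append(handle)
    else:
        handle = open(path_or_buf, mode)
        opened.append(handle)
    return handle, opened


# A user-supplied buffer is never added to the ownership list:
buf = io.StringIO("a,b\n1,2\n")
handle, opened = get_handle(buf)
assert handle is buf and opened == []
```

The caller then extends `self.handles` with `opened` only, keeping the close-on-exit logic limited to handles pandas opened itself.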

@jorisvandenbossche jorisvandenbossche added the IO Data IO issues that don't fit into a more specific label label Nov 3, 2016
@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 3, 2016
```diff
@@ -285,53 +285,84 @@ def ZipFile(*args, **kwargs):
 ZipFile = zipfile.ZipFile


-def _get_handle(path, mode, encoding=None, compression=None, memory_map=False):
-    """Gets file handle for given path and mode.
+def _get_handle(source, mode, encoding=None, compression=None,
```
Contributor:

Call this path_or_buf, in line with the spelling used elsewhere.

Contributor Author:

Addressed in 673bde0

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 3, 2016
@dhimmel (Contributor Author) commented Nov 3, 2016

> Can you add tests for the new features you added? (the compressed sources from urls)

@jorisvandenbossche, do you want me to add xz/zip/bz2 compressed versions of salary.table.csv to pandas/io/tests/parser/data? Also should I add my tests to pandas/io/tests/parser/test_network.py#L17-L35?

@dhimmel (Contributor Author) commented Nov 3, 2016

Are we also going to have to do something to common.py#L239-L244?

@jreback (Contributor) commented Nov 3, 2016

@dhimmel yes that should be more general

dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 4, 2016
Create compressed versions of the salary dataset for testing pandas-dev#14576.

Rename `salary.table.csv` to `salaries.tsv` because the dataset is tab rather
than comma delimited. Remove the word table because it's implied by the
extension. Rename `salary.table.gz` to `salaries.tsv.gz`, since compressed files should append to, not strip, the original extension.

Created new files by running the following commands:

```sh
cd pandas/io/tests/parser/data
bzip2 --keep salaries.tsv
xz --keep salaries.tsv
zip salaries.tsv.zip salaries.tsv
```
dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 10, 2016
```python
url = self.base_url + '.zip'
url_table = read_table(url, compression="zip", engine="python")
tm.assert_frame_equal(url_table, self.local_table)
def test_compressed_urls(self):
```
Contributor Author:

@jorisvandenbossche, to try and reduce the repetition of the testing code, I made this commit (28373f1) which uses nose test generators. However, the nose doc states:

> Please note that method generators are not supported in unittest.TestCase subclasses.

As a result, the tests yielded by test_compressed_urls aren't actually running. I'm attempting to run the test using:

```sh
nosetests --verbosity=3 pandas/io/tests/parser/test_network.py:TestCompressedUrl.test_compressed_urls
```

Do you know the best way to proceed? I'm new to nose and pretty new to testing, so any advice will be appreciated.
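For reference, the yield pattern under discussion looks roughly like this (a sketch with placeholder logic, not the actual test_network.py code). nose only collects such generators from plain classes, which is why inheriting from unittest.TestCase silently skips the yielded tests:

```python
class TestCompressedUrl(object):  # plain class: must NOT subclass unittest.TestCase
    compressions = ['gz', 'bz2', 'xz', 'zip']

    def check_url(self, ext):
        # Placeholder for the real check, which would read_table() a URL
        # such as base_url + '.' + ext and compare against the local table.
        return 'salaries.tsv.' + ext

    def test_compressed_urls(self):
        for ext in self.compressions:
            # nose collects each yielded (callable, argument) pair as a test case
            yield self.check_url, ext


# Outside nose, the generator can be driven by hand:
t = TestCompressedUrl()
results = [func(arg) for func, arg in t.test_compressed_urls()]
assert results == ['salaries.tsv.gz', 'salaries.tsv.bz2',
                   'salaries.tsv.xz', 'salaries.tsv.zip']
```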

Contributor:

we run yielded tests in computation and io/pickle

will be much easier when we switch to pytest

Contributor:

see those for some examples

@dhimmel (Contributor Author) commented Nov 11, 2016:

Cool, I see the test generator examples in io/tests/test_pickle.py, computation/tests/test_compat.py, and computation/tests/test_eval.py. Those examples don't have a setUp method nor do they use the pandas.util.testing.network decorator. So I did my best in 272669b. It seems that one side effect is that the failure messages are less informative?

@codecov-io commented Nov 10, 2016

Current coverage is 85.29% (diff: 88.40%)

Merging #14576 into master will increase coverage by 0.02%

```
@@             master     #14576   diff @@
==========================================
  Files           144        144
  Lines         50980      50943    -37
  Methods           0          0
  Messages          0          0
  Branches          0          0
==========================================
- Hits          43470      43453    -17
+ Misses         7510       7490    -20
  Partials          0          0
```

Powered by Codecov. Last update 14e4815...b33a500

@dhimmel dhimmel force-pushed the compression branch 2 times, most recently from 272669b to 4a5d3e7 Compare November 11, 2016 15:21
dhimmel added a commit to dhimmel/pandas that referenced this pull request Nov 11, 2016
```python
from io import TextIOWrapper
return TextIOWrapper(path_or_buf, encoding=encoding)

elif compression:
```
Contributor:

Can this be 'infer' here? (IIRC we have an inference function based on the file extension.) Is this guaranteed to be a valid compression? (It's fine, though I think it's easy to handle here anyhow.)

Contributor Author:

Currently, this function (_get_handle) cannot consume compression='infer'. However, it would be an easy addition:

```python
if compression == 'infer':
    assert is_path
    compression = infer_compression(path_or_buf)
```

One issue I'm having is that _get_handle was undocumented before this PR. So it's a bit hard for me to know what to implement. @jreback or @jorisvandenbossche: can you comment on the tentative docstring and fill in the missing details, so I know exactly what to implement?

Member:

@dhimmel What is still missing in your current docstring?

Adding compression='infer' here only makes sense if it would be used, of course. Where does the compression inference happen now? Maybe it is already inferred before being passed to this function?

Contributor Author:

> Where does the compression inference happen now?

@jorisvandenbossche, inference currently happens in pandas/io/parsers.py.

I don't think compression == 'infer' will ever make it this far with the current design. @jreback does this answer your question?

Contributor:

@dhimmel that's fine. I just want to validate somewhere that compression in [None, ......] or raise a ValueError
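A minimal sketch of that up-front validation (hypothetical names; the set of accepted values mirrors the extensions discussed in this PR): check the value against a whitelist and raise a ValueError before any decompression is attempted.

```python
# Accepted compression values, per the extension map in this PR (assumption:
# 'infer' and None are also legal at this layer).
_VALID_COMPRESSION = {None, 'infer', 'gzip', 'bz2', 'zip', 'xz'}


def validate_compression(compression):
    """Raise ValueError for unrecognized compression values."""
    if compression not in _VALID_COMPRESSION:
        msg = 'Unrecognized compression type: {!r}'.format(compression)
        raise ValueError(msg)
    return compression


assert validate_compression('gzip') == 'gzip'
```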

```python
def test_url_gz_infer(self):
    url = 'https://s3.amazonaws.com/pandas-test/salary.table.gz'
    url_table = read_table(url, compression="infer", engine="python")
class TestCompressedUrl:
```
Contributor:

inherit from object

@jreback (Contributor) commented Nov 12, 2016

Ideally, if you can use some of this code in the parser routines, that would be great (it may not be completely possible, but I want to consolidate some of that duplication).

@dhimmel (Contributor Author) commented Nov 21, 2016

Regarding common.py#L239-L244, I think there's a problem. Currently, when compression="infer" and the path_or_buf is a URL, get_filepath_or_buffer is using the Content-Encoding header of the request to infer the compression type. However, I don't actually think compressed files on the network will provide a Content-Encoding header of their compression type. Instead, I think we want to infer the compression of a URL from its name, just as we do with paths.

For example, https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/io/tests/parser/data/salaries.csv.bz2 would infer bz2 compression because it ends with .bz2. @jreback and @jorisvandenbossche, agree?
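A sketch of what extension-based inference over both paths and URLs looks like (illustrative names; the real helper was later moved to io/common.py as _infer_compression, per commit b34c2f6):

```python
# Map of recognized file extensions to pandas compression names.
_EXT_MAP = [('.gz', 'gzip'), ('.bz2', 'bz2'), ('.zip', 'zip'), ('.xz', 'xz')]


def infer_compression(path_or_url):
    """Infer compression from the extension of a path or URL, else None."""
    for ext, compression in _EXT_MAP:
        if path_or_url.lower().endswith(ext):
            return compression
    return None


url = ('https://raw.githubusercontent.com/pandas-dev/pandas/master/'
       'pandas/io/tests/parser/data/salaries.csv.bz2')
assert infer_compression(url) == 'bz2'
assert infer_compression('data/salaries.tsv') is None
```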

@jreback (Contributor) commented Nov 21, 2016

I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

```diff
-f = lzma.LZMAFile(path, mode)
+f = lzma.LZMAFile(path_or_buf, mode)

 # Unrecognized Compression
```
Contributor:

Ahh ok, we are doing the validation, nvm then. (I assume we have some tests that hit this path.)

@dhimmel (Contributor Author) commented Dec 4, 2016:

> assume we have some tests that hit this path

Actually, I don't think there were any tests for invalid compression (regardless of URL or not), so I added one in f2ce8f8.

```python
# in Python 3, convert BytesIO or fileobjects passed with an encoding
elif compat.PY3 and isinstance(f, compat.BytesIO):
    from io import TextIOWrapper
add_handle = (
```
@jreback (Contributor) commented Nov 21, 2016:

Maybe this could be a utility function in io/common (unless it's not used anywhere else).

Contributor Author:

I think add_handle is only used here -- I created it for this PR, so nowhere else depends on it.

I'm still a little unsure about the purpose of self.handles. Should it only contain file handles? Right now we're adding handles that aren't files (for example, in the case of compat.PY3 and isinstance(f, compat.BytesIO)).

Contributor:

self.handles is to allow things to be closed if WE (meaning pandas) opened them; e.g. when you wrap with TextIOWrapper, we NEED to close it. Otherwise we try to preserve open file handles (that were passed to us).

Member:

Another way to do this would be to handle this inside _get_handle and return from that function the handles that should be added. IMO that would be more maintainable; as it stands, this boolean expression depends on the content of the _get_handle function, so better to do it there.

```python
'gzip': '.gz',
'bz2': '.bz2',
'zip': '.zip',
'xz': '.xz',
```
Contributor:

is there a test for invalid compression specified?

Contributor Author:

Did you want a specific test for invalid compression specified for a URL? As of the latest commit (bff0f3e), compression inference has been consolidated and invalid compression raises an error before compression is attempted.

```python
content_encoding = req.headers.get('Content-Encoding', None)
if content_encoding == 'gzip':
    # Override compression based on Content-Encoding header
    compression = 'gzip'
```
@dhimmel (Contributor Author) commented Dec 3, 2016:

> I think inferring based on the file name is usually enough, but you can also use the Content-Encoding if it's available.

@jreback, Content-Encoding is not a versatile way to infer compression -- gzip is the only compression encoding that pandas and the Content-Encoding specification both support. I've modified it so inference on a URL uses the extension, just as a path would. However, if Content-Encoding is gzip, then compression is overridden. I don't expect this situation to arise very often. Does this make sense or should we just totally ignore Content-Encoding?

Contributor:

Yeah, I agree with your assessment. Since this was here, we can leave it. But I agree the filename inference is better / more standard.

@dhimmel (Contributor Author) commented Dec 4, 2016

I'm getting failing builds in AppVeyor in Python 2 due to ImportError: No module named backports. This occurs in compat/__init__.py at:

```python
def import_lzma():
    """ import the backported lzma library
    or raise ImportError if not available """
    from backports import lzma
    return lzma
```

Any idea why backports is not available? How should I proceed?

Update: I think I resolved this in af3c2bc.
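One general way to make such an optional dependency fail lazily (a sketch of the pattern, not the actual af3c2bc change): defer the import and raise a descriptive ImportError only when lzma is actually requested. On Python 3, lzma ships in the standard library, so the fallback branch only matters on Python 2.

```python
def import_lzma():
    """Return the lzma module, or raise a descriptive ImportError."""
    try:
        import lzma  # standard library on Python 3
        return lzma
    except ImportError:
        raise ImportError(
            "lzma not available; on Python 2, install the backports.lzma package")


lzma = import_lzma()
assert hasattr(lzma, 'LZMAFile')
```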

@dhimmel (Contributor Author) commented Dec 13, 2016

Assuming the tests pass for 640dc1b, I don't think there are any more unaddressed comments.

> can you rebase, problem with AppVeyor which I just fixed

@jreback rebased, with AppVeyor now passing.

> Can you open issues for those to keep track of them.

@jorisvandenbossche: I created #14874 and #14875.

> We can just squash your additional commits into one commit, and then commit those two commits of two authors to master.

@jorisvandenbossche do you want me to squash all of my commits? If so, what should go in the commit message? Should I add the issues we're closing to the commit message?

@dhimmel dhimmel changed the title WIP: Consolidate compression code Refactor compression code to expand URL support Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

@dhimmel you can squash, or it will get squashed on the merge. In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great. We could in this case cherry-pick them (rather than completely squashing) to preserve authorship (I think when we squash ALL of them, authorship goes to the last one). If we only have 2 commits, a merge is easy.

@dhimmel (Contributor Author) commented Dec 13, 2016

> In this case, if you'd leave the original commit and squash all of your subsequent ones into a single (or possibly multiple, logically oriented) commits, that would be great.

I'm going to squash all of my commits. Grouping them will be pretty burdensome and the intermediate commits were often not functional. I created a branch on dhimmel/pandas to preserve the individual commits for reference.

Just to list things here, `git log master.. --oneline` gives the following:

```
640dc1b Simplify TextIOWrapper usage
4774b75 Use list to store new handles
94aa0cb Track new handles in _get_handle #14576 (comment) #14576 (comment)
376dce3 Close bz2 file object in Python 2
b88d1b8 Remove compressed URL c-engine tests
b34c2f6 Move _infer_compression to io/common.py
d27e57d Fix "ImportError: No module named backports"
97a1c25 Test compressed URLs using c engine
ecee74f Revert to old error msg. Remove unused line
7c9fe8b Skip xz test if no lzma
492e615 Remove unused imports
bf9fe97 Add invalid compression test
7940ab7 Fix PEP8 errors
05332fd Update S3 get_filepath_or_buffer compression
24ea605 Fix get_filepath_or_buffer and _get_handle
2dc1f2f Clean up docstring
10652a0 Move compression inference to io/parsers
8d24bcf Delete unused function _wrap_compressed
83b2bc5 Infer compression from URL extension
c2e6e5b Improve error message, fix partial scope
9977da2 Fix nose test generator bug. Better error messages
7ff1247 Inheret from object for TestCompressedUrl
30e1caf Fix local salaries.csv path
6230dfb Update salaries.csv URL post #14587
6530f46 Enable nose execution of test generators
a441448 Broken: use method generators to DRY tests
25114c3 Tests for bz2, xz & zip URLs
89e45f4 Revert changes to zip error messages
29b7f74 Change error message in test_zip
3aba6cb Replicate original handles appending
101a344 Impove _get_handle docstring
58ffbce Rename source to path_or_buf and flake8
f0901fb Remove ZipFile support for Python 2.6
2c78178 Improve readability
592cf92 Fix ZipFile bug in Python 3.6+
74caf72 combine compression code into one
```

36 commits total in this PR. Wow!

@dhimmel (Contributor Author) commented Dec 13, 2016

> if we only have 2 commits a merge is easy.

Great, we're now at 2 commits. I think a merge makes sense, because it is nice to see the diff of both commits combined; alone, the changesets will be incomplete.

Let me know if there's anything more for me to do.

@jorisvandenbossche (Member):

@jreback You can use the github interface as well to "rebase and merge" instead of squashing, which will just apply the commits in this PR to master. And so the two commits now in the PR are perfect!

@jreback (Contributor) commented Dec 13, 2016

@jorisvandenbossche right, but this is the exception to the rule. I will have a look in a bit.

@dhimmel going to merge, and can followup on comments / issues in another PR.

jreback pushed a commit that referenced this pull request Dec 13, 2016
@jreback jreback closed this in 3761448 Dec 13, 2016
@jreback (Contributor) commented Dec 13, 2016

thanks @dhimmel very nice!

In your xref PR, please add a whatsnew entry that covers all of the issues that were closed (for 0.20.0).

@jorisvandenbossche (Member):

@dhimmel thanks a lot!

dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 14, 2016
dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 14, 2016
dhimmel added a commit to dhimmel/pandas that referenced this pull request Dec 15, 2016
jreback pushed a commit that referenced this pull request Dec 17, 2016
…th C engine (GH14874)

Follow up on #14576, which refactored compression code to expand URL support. Fixes up some small remaining issues and adds a what's new entry.

- [x] Closes #14874

Author: Daniel Himmelstein <daniel.himmelstein@gmail.com>

Closes #14880 from dhimmel/whats-new and squashes the following commits:

e1b5d42 [Daniel Himmelstein] Address what's new review comments
8568aed [Daniel Himmelstein] TST: Read bz2 files from S3 in PY2
09dcbff [Daniel Himmelstein] DOC: Improve what's new
c4ea3d3 [Daniel Himmelstein] STY: PEP8 fixes
f8a7900 [Daniel Himmelstein] TST: check bz2 compression in PY2 c engine
0e0fa0a [Daniel Himmelstein] DOC: Reword get_filepath_or_buffer docstring
210fb20 [Daniel Himmelstein] DOC: What's New for refactored compression code
cb91007 [Daniel Himmelstein] TST: Read compressed URLs with c engine
85630ea [Daniel Himmelstein] ENH: Support bz2 compression in PY2 for c engine
a7960f6 [Daniel Himmelstein] DOC: Improve _infer_compression docstring
ischurov pushed a commit to ischurov/pandas that referenced this pull request Dec 19, 2016
ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
4 participants