Add ZIP file decompression and TestCompression. #12175

Closed
wants to merge 1 commit into from

5 participants
@lababidi
commented Jan 29, 2016

Closes #11413

@jreback

View changes

pandas/io/parsers.py Outdated
klass = FixedWidthFieldParser
else: #default to engine == 'python':

@jreback

jreback Jan 29, 2016

Contributor

why did you modify this?

@lababidi

lababidi Jan 29, 2016

Author

I modified this to default to the Python engine. If this is not the wanted functionality, I can remove it.

@jreback

jreback Jan 29, 2016

Contributor

just change code that is relevant.

@lababidi

lababidi Jan 29, 2016

Author

will do. Does this need to be a separate PR?

@jreback

jreback Jan 29, 2016

Contributor

not sure why you are changing this in the first place.

@lababidi

lababidi Jan 29, 2016

Author

For the use case that engine is neither 'python' nor 'python-fwf'.

Is it possible for this to happen?

@jreback

jreback Jan 29, 2016

Contributor

no, that would be an exception and should raise a ValueError (do this in another issue/PR)
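
A minimal sketch of what such a check could look like. The `pick_parser` helper, the lookup table, and the placeholder classes are hypothetical names used only for illustration; the real dispatch in pandas/io/parsers.py is structured differently.

```python
class PythonParser(object):
    """Placeholder standing in for the real python-engine parser."""


class FixedWidthFieldParser(object):
    """Placeholder standing in for the real fixed-width parser."""


# Hypothetical lookup table mapping engine names to parser classes.
_PARSERS = {
    'python': PythonParser,
    'python-fwf': FixedWidthFieldParser,
}


def pick_parser(engine):
    """Return the parser class for `engine`, or raise on an unknown name."""
    try:
        return _PARSERS[engine]
    except KeyError:
        # An unrecognised engine is an error, not a reason to fall back.
        raise ValueError('Unknown engine: {0!r}'.format(engine))
```

Raising here, rather than silently defaulting to the Python engine, means a typo in the engine name surfaces immediately instead of changing parser behavior.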

@lababidi

lababidi Jan 29, 2016

Author

ok, thanks for the clarification.

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated

from pandas.compat import parse_date
import pandas.lib as lib
from pandas import compat

@jreback

jreback Jan 29, 2016

Contributor

why did you change this?

@lababidi

lababidi Jan 29, 2016

Author

Just reorganized the imports to be legible. Notice how, after the reorganization, we can see some simplifications that could be made:

from pandas import DataFrame, Series, Index, MultiIndex, DatetimeIndex
from pandas import compat
from pandas.compat import(
    StringIO, BytesIO, PY3, range, long, lrange, lmap, u
)
from pandas.compat import parse_date
from pandas.io.common import DtypeWarning
from pandas.io.common import URLError

@jreback

jreback Jan 29, 2016

Contributor

ok then. pls run git diff master | flake8 --diff. Lint failures do not fail the Travis build yet, but will shortly.

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated

def test_zip(self):
try:
import zipfile

@jreback

jreback Jan 29, 2016

Contributor

instead of making this a direct Testing class, make it class Compression(object), then add this as a mixin to the TestCHighMemoryParser/LowMemory, TestPythonParser, and TestFixedWidth, so these routines are run for each type of parser (and don't define read_csv/read_table). That way you don't have to repeat tests and they test all engines.
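
The mixin pattern described here can be sketched with the standard library's unittest alone. The class names and the `read` method below are hypothetical stand-ins for the parser test classes and their `read_csv` wrappers; this is an illustration of the idea, not the code from this PR.

```python
import unittest


class CompressionTests(object):
    """Mixin: not a TestCase itself, so the loader never collects it alone."""

    def test_roundtrip(self):
        # `self.read` is supplied by whichever concrete class mixes this in.
        self.assertEqual(self.read('a,b'), ['a', 'b'])


class TestEngineA(CompressionTests, unittest.TestCase):
    def read(self, text):
        return text.split(',')


class TestEngineB(CompressionTests, unittest.TestCase):
    def read(self, text):
        return [field for field in text.split(',')]
```

Each concrete class inherits `test_roundtrip`, so the same test body runs once per engine without being written twice.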

@lababidi

Author

commented Jan 29, 2016

@jreback The idea of the Mixin makes sense. Took me a bit to wrap my head around it. I had to add self.engine to a couple of the tests because bzip decompression will not raise an exception and so the Compression Mixin needs to check what engine it is using to make sure the test runs correctly.

@lababidi

Author

commented Jan 29, 2016

@jreback I think this is ready for your review.

@max-sixty

View changes

pandas/io/common.py Outdated
f = zip_file.open(file_name)
else:
raise ValueError('ZIP file contains multiple files {}',
zip_file.filename)

@max-sixty

max-sixty Jan 29, 2016

Contributor

You need a .format here
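
To illustrate the point: passing the file name as a second argument to ValueError does not fill in the placeholder. The file name below is a made-up value for the example.

```python
filename = 'archive.zip'  # hypothetical file name for illustration

# Without .format, both values are stored as separate exception
# arguments and the {} placeholder is never filled in.
unformatted = ValueError('ZIP file contains multiple files {}', filename)

# With .format, the file name is interpolated into the message.
formatted = ValueError(
    'ZIP file contains multiple files {}'.format(filename))
```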

@max-sixty

View changes

pandas/io/tests/test_parsers.py Outdated
from pandas.lib import Timestamp
from pandas.tseries.index import date_range
import pandas.tseries.tools as tools
class Compression(object):

@max-sixty

max-sixty Jan 29, 2016

Contributor

Does this need to be called CompressionTest to get picked up? I know different test frameworks have different requirements.

@lababidi

lababidi Jan 29, 2016

Author

It's not actually a Test. It's just a Mixin that gets pulled into other Tests. Those Tests will call these methods within Compression

@jreback

jreback Jan 30, 2016

Contributor

move this after ParserTests, maybe call this CompressionTests to be more informative (and still avoid nose actually running it, as it's a mixin)

@tollycoast

tollycoast Jan 30, 2016

Thanks sir

@max-sixty

Contributor

commented Jan 29, 2016

Jeff is going to ask you to squash your commits into one, as per the contributing docs.
Nice job overall!

@TomAugspurger

Contributor

commented Jan 29, 2016

@MaximilianR no need for squashing anymore thanks to https://github.com/pydata/pandas/blob/master/scripts/merge-py.py ;)

EDIT: which reminds me, CONTRIBUTING.md needs to be updated. Will do this weekend unless someone beats me to it.

@lababidi

Author

commented Jan 29, 2016

@MaximilianR @TomAugspurger Thanks for the comments guys!

@max-sixty

Contributor

commented Jan 29, 2016

@jreback

Contributor

commented Jan 29, 2016

pls squash

even though I can do it - it's much cleaner from a future reader perspective on a smaller change

@lababidi lababidi force-pushed the lababidi:feature_zip branch Jan 29, 2016

@lababidi

Author

commented Jan 29, 2016

@jreback Squashed. Thanks.

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
result = self.read_csv(path, compression='zip')
tm.assert_frame_equal(result, expected)

result = self.read_csv(open(path, 'rb'), compression='zip')

@jreback

jreback Jan 30, 2016

Contributor

do this in a with block to make sure the file is closed

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
with tm.ensure_clean() as path:
file_names = ['test_file', 'second_file']
tmp = zipfile.ZipFile(path, mode='w')
for file_name in file_names:

@jreback

jreback Jan 30, 2016

Contributor

test on an empty zipfile as well
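
A rough sketch of the three cases (one member, none, several) using only the standard library. `read_single_member` and `make_zip` are hypothetical helpers and the error messages are illustrative; they are not the code or wording from pandas/io/common.py.

```python
import io
import zipfile


def read_single_member(zip_bytes):
    """Hypothetical helper: return the bytes of the only member of a ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
        if len(names) == 1:
            return zf.read(names[0])
        if not names:
            raise ValueError('Zero files found in ZIP archive')
        raise ValueError('Multiple files found in ZIP archive: {0}'
                         .format(names))


def make_zip(members):
    """Build an in-memory ZIP from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode='w') as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Calling `make_zip({})` produces a valid but empty archive, which is the case this comment asks to cover alongside the multiple-file case.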

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
result = self.read_csv(path, compression='gzip')
tm.assert_frame_equal(result, expected)

result = self.read_csv(open(path, 'rb'), compression='gzip')

@jreback

jreback Jan 30, 2016

Contributor

with block here

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
@@ -2623,7 +2732,9 @@ def test_eof_states(self):
StringIO(data), escapechar='\\')


-class TestPythonParser(ParserTests, tm.TestCase):
+class TestPythonParser(ParserTests, tm.TestCase, Compression):

@jreback

jreback Jan 30, 2016

Contributor

make tm.TestCase the last class
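
The thread does not state the reason for this ordering. One common motivation, shown here as an assumption and not as the reviewer's rationale, is that base-class order determines the method resolution order, so a mixin listed after the TestCase base can have its hooks shadowed. The class names are hypothetical.

```python
import unittest


class Mixin(object):
    def setUp(self):
        self.prepared = True
        super(Mixin, self).setUp()


class MixinFirst(Mixin, unittest.TestCase):
    def test_prepared(self):
        # Mixin.setUp ran, because Mixin precedes TestCase in the MRO.
        self.assertTrue(self.prepared)


class MixinLast(unittest.TestCase, Mixin):
    def test_prepared(self):
        # TestCase.setUp is found first and does not call super(),
        # so Mixin.setUp never runs.
        self.assertFalse(hasattr(self, 'prepared'))
```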

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
@@ -3442,17 +3553,19 @@ def test_buffer_rd_bytes(self):
except Exception as e:
pass

-class TestCParserHighMemory(CParserTests, tm.TestCase):

+class TestCParserHighMemory(CParserTests, tm.TestCase, Compression):

@jreback

jreback Jan 30, 2016

Contributor

same here

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
@@ -3753,18 +3866,20 @@ def test_single_char_leading_whitespace(self):
tm.assert_frame_equal(result, expected)


-class TestCParserLowMemory(CParserTests, tm.TestCase):
+class TestCParserLowMemory(CParserTests, tm.TestCase, Compression):

@jreback

jreback Jan 30, 2016

Contributor

same here

@jreback

Contributor

commented Jan 30, 2016

looks pretty good. just some minor stylistic comments.

@jreback

Contributor

commented Jan 30, 2016

======================================================================
ERROR: test_to_csv_compression_value_error (pandas.tests.frame.test_to_csv.TestDataFrameToCSV)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/pydata/pandas/pandas/tests/frame/test_to_csv.py", line 998, in test_to_csv_compression_value_error
    filename, compression="zip")
  File "/home/travis/build/pydata/pandas/pandas/util/testing.py", line 1952, in assertRaises
    _callable(*args, **kwargs)
  File "/home/travis/build/pydata/pandas/pandas/core/frame.py", line 1338, in to_csv
    formatter.save()
  File "/home/travis/build/pydata/pandas/pandas/core/format.py", line 1524, in save
    compression=self.compression)
  File "/home/travis/build/pydata/pandas/pandas/io/common.py", line 346, in _get_handle
    zip_file = zipfile.ZipFile(path)
  File "/home/travis/miniconda/envs/pandas/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/home/travis/miniconda/envs/pandas/lib/python2.7/zipfile.py", line 811, in _RealGetContents
    raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file

put a test in for this error as well (e.g. try to open a non-zipfile); I don't think you need any code changes though, you can just let it raise.
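
A small standard-library sketch of the non-zipfile case. `is_readable_zip` is a hypothetical helper used only to make the behavior visible. Note the spelling difference: Python 3 names the exception `BadZipFile`, while the Python 2.7 traceback above shows `BadZipfile`.

```python
import io
import zipfile


def is_readable_zip(data):
    """Hypothetical helper: True if `data` opens as a ZIP archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)):
            return True
    except zipfile.BadZipFile:
        # Raised by the zipfile module itself; no extra checking needed.
        return False


# Plain CSV bytes, not an archive.
csv_bytes = b'a,b,c\n1,2,3\n' * 10
```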

@lababidi

Author

commented Feb 1, 2016

@jreback Good ideas. I think I covered all your requests.

@lababidi lababidi force-pushed the lababidi:feature_zip branch Feb 1, 2016

@jreback

Contributor

commented Feb 1, 2016

@lababidi ok, this lgtm.

just need a whatsnew note (you can put kind of what you did in the addition to the doc-string). pls add, ping when green.

@jreback jreback added this to the 0.18.0 milestone Feb 1, 2016

@lababidi lababidi force-pushed the lababidi:feature_zip branch 2 times, most recently Feb 2, 2016

@lababidi lababidi force-pushed the lababidi:feature_zip branch Mar 18, 2016

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated

file_name = 'test_file.zip'
with tm.ensure_clean(file_name) as path:
tmp = zipfile.ZipFile(path, mode='w')

@jreback

jreback Mar 18, 2016

Contributor

you didn't change anything here

@lababidi

lababidi Mar 18, 2016

Author

@jreback what line are you referring to? zipfile.ZipFile?

@jreback

jreback Mar 18, 2016

Contributor

no, your context managers are nested, but don't need to be

 with open(self.csv1, 'rb') as data_file:
        data = data_file.read()

data can be used here
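
The un-nested form might look roughly like the following. This is a self-contained sketch: the temporary source file stands in for self.csv1, and an in-memory buffer stands in for the path from tm.ensure_clean; both substitutions are assumptions made so the example runs on its own.

```python
import io
import os
import tempfile
import zipfile

# Throwaway source file standing in for self.csv1.
src = tempfile.NamedTemporaryFile(suffix='.csv', delete=False)
src.write(b'a,b\n1,2\n')
src.close()

# First block: read the bytes, then let the file close.
with open(src.name, 'rb') as data_file:
    data = data_file.read()

# Second block, not nested: `data` is still available here.
archive = io.BytesIO()
with zipfile.ZipFile(archive, mode='w') as tmp:
    tmp.writestr('test_file', data)

os.remove(src.name)
```

Because `data` is an ordinary bytes object, it outlives the first `with` block, so the second block does not need to sit inside it.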

@lababidi

lababidi Mar 18, 2016

Author

@jreback thanks for clarifying. I was using the previous convention. I'll clean this up now.

@lababidi lababidi force-pushed the lababidi:feature_zip branch Mar 18, 2016

@jreback jreback referenced this pull request Mar 18, 2016

Closed

ENH: xz compression in to_csv() resolves #11852 #12668

4 of 4 tasks complete

@lababidi lababidi force-pushed the lababidi:feature_zip branch Mar 18, 2016

@jreback

View changes

pandas/io/parsers.py Outdated
compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2 or zip if filepath_or_buffer is a string ending in '.gz', '.bz2' or
'.zip', respectively, and no decompression otherwise. New in 0.18.0: ZIP

@jreback

jreback Mar 18, 2016

Contributor

so add in a versionadded tag and put the 0.18.0 stuff there
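
For reference, a Sphinx versionadded directive inside a numpydoc-style parameter description could be laid out as below. `read_example` is a hypothetical function that exists only to carry the docstring, and the version number is the 0.18.0 figure from the docstring quoted in this comment; the exact wording that was merged may differ.

```python
def read_example(filepath_or_buffer, compression='infer'):
    """
    Hypothetical reader used only to show the docstring layout.

    Parameters
    ----------
    compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
        For on-the-fly decompression of on-disk data.

        .. versionadded:: 0.18.0 support for 'zip' compression.
    """
    return filepath_or_buffer, compression
```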

@lababidi lababidi force-pushed the lababidi:feature_zip branch Mar 18, 2016

@lababidi

Author

commented Mar 18, 2016

@jreback previous tests passed. The most recent push only changed the version in the docstring
https://travis-ci.org/pydata/pandas/builds/117001384

compression : {'gzip', 'bz2', 'zip', 'infer', None}, default 'infer'
For on-the-fly decompression of on-disk data. If 'infer', then use gzip,
bz2 or zip if filepath_or_buffer is a string ending in '.gz', '.bz2' or
'.zip', respectively, and no decompression otherwise. New in 0.18.1: ZIP

@jreback

jreback Mar 18, 2016

Contributor

need a versionadded tag here

result = self.read_csv(path, compression='infer')
tm.assert_frame_equal(result, expected)

if self.engine is not 'python':

@jreback

jreback Mar 18, 2016

Contributor

why is this check here?

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
expected = self.read_csv(self.csv1)

file_name = 'test_file.zip'
with tm.ensure_clean(file_name) as path:

@jreback

jreback Mar 18, 2016

Contributor

you can put the file_name directly in (like you do below)

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
tmp.writestr(file_name, data)
tmp.close()

self.assertRaises(ValueError, self.read_csv,

@jreback

jreback Mar 18, 2016

Contributor

can you use assertRaisesRegex here (to check that the Multiple Files is raised)

@jreback

View changes

pandas/io/tests/test_parsers.py Outdated
tmp = zipfile.ZipFile(path, mode='w')
tmp.close()

self.assertRaises(ValueError, self.read_csv,

@jreback

jreback Mar 18, 2016

Contributor

here make sure the correct ValueError is raised (use assertRaisesRegex)

@jreback

Contributor

commented Mar 18, 2016

@lababidi ok looks pretty good. only changes are to use assertRaisesRegexp when you assert for zip in order to make sure the correct messages are raised (as there are multiple possibilities).
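
A sketch of why matching on the message helps when one function can raise ValueError for more than one reason. `check_member_count` and its messages are hypothetical. The example uses the Python 3 method name `assertRaisesRegex`; under Python 2.7 the same unittest method is spelled `assertRaisesRegexp`.

```python
import unittest


def check_member_count(names):
    """Hypothetical validator; the messages are illustrative only."""
    if not names:
        raise ValueError('Zero files found in ZIP file')
    if len(names) > 1:
        raise ValueError('Multiple files found in ZIP file')
    return names[0]


class TestMemberCount(unittest.TestCase):
    def test_empty(self):
        # Matching on the message pins down which ValueError was raised.
        with self.assertRaisesRegex(ValueError, 'Zero files'):
            check_member_count([])

    def test_multiple(self):
        with self.assertRaisesRegex(ValueError, 'Multiple files'):
            check_member_count(['one', 'two'])
```

A plain `assertRaises(ValueError, ...)` would pass for either failure, so a regression that swapped the two error paths would go unnoticed.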

pls make that and ping when green.

@lababidi lababidi force-pushed the lababidi:feature_zip branch 2 times, most recently Mar 21, 2016

@lababidi

Author

commented Mar 21, 2016

@jreback could you help me? the test only failed on the following:

--------------------------------------------------------------------------------------------------------------
#176 nose.failure.Failure.runTest: direct creation of extension dtype datetime64[ns, UTC] is not supported ATM
@jreback

Contributor

commented Mar 21, 2016

not sure where you are seeing this. your 2.7 test is failing because of a linting issue (line too long)

git diff master | flake8 --diff

@lababidi

Author

commented Mar 21, 2016

@jreback it's in the Travis results

@jreback

Contributor

commented Mar 21, 2016

I restarted that job, though you need to repush anyhow (lint error). Never saw that one before; I think it's a crash in something else, so let's see if it recurs.

@jreback

Contributor

commented Mar 22, 2016

@lababidi ok this passed, except for the lint check. pls fix and repush, ping when green.

Mahmoud Lababidi
Add ZIP file decompression and TestCompression.
Fix PEP8 issues. Change Compression to be a Mixin. Add Compression Mixin correctly with current Tests. Add .format, Rename Compression, with-block, empty zip, bad-zip
@lababidi

Author

commented Mar 22, 2016

Thank you @jreback for your help and patience with this. I'll help out on the other issues soon.

@jreback

Contributor

commented Mar 22, 2016

@lababidi no thank you!
