BUG: set keyword argument so zipfile actually compresses #21144

minggli · 2018-05-20T16:16:35Z

closes DataFrame.to_pickle() fails for .zip format on MacOS and pandas 0.20.3 #17778
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

zipfile.ZipFile defaults to compression mode zipfile.ZIP_STORED, which creates an uncompressed archive member. Whilst this doesn't cause an error, it is a strange default, given that users asking for zip output would want their files compressed.

So that zip compression actually reduces file size, the keyword argument compression=zipfile.ZIP_DEFLATED is added.
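To make the difference between the two modes concrete, here is a minimal standard-library sketch. The helper name `zipped_size`, the member name `data.csv`, and the payload are made up for illustration and are not part of this PR.

```python
import io
import zipfile


def zipped_size(payload: bytes, compression: int) -> int:
    """Return the size in bytes of an in-memory zip archive holding payload."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=compression) as zf:
        zf.writestr("data.csv", payload)  # hypothetical member name
    return len(buf.getvalue())


# Highly repetitive data, so DEFLATE has something to work with.
payload = b"0.123456,0.234567,0.567567\n" * 1000

stored = zipped_size(payload, zipfile.ZIP_STORED)      # zipfile's default
deflated = zipped_size(payload, zipfile.ZIP_DEFLATED)  # what this PR passes

print(len(payload), stored, deflated)
```

With ZIP_STORED the archive is slightly larger than the raw payload because of the zip headers. With ZIP_DEFLATED it should be much smaller for data like this.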

codecov · 2018-05-20T17:13:38Z

Codecov Report

Merging #21144 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #21144   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49505    49505           
=======================================
  Hits        45466    45466           
  Misses       4039     4039

Flag	Coverage Δ
#multiple	`90.23% <100%> (ø)`	⬆️
#single	`41.88% <75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/io/common.py	`70.04% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c2844a...974b063.

gfyoung · 2018-05-21T06:50:13Z

pandas/io/common.py

        if mode in ['wb', 'rb']:
            mode = mode.replace('b', '')
-        super(BytesZipFile, self).__init__(file, mode, **kwargs)
+        super(BytesZipFile, self).__init__(file, mode, compression, **kwargs)


Can we add tests and a whatsnew for this?

Also, because you are modifying the default behavior, I'm not sure if we need a deprecation cycle for this (to be safe, we should I would imagine).

no this is a bug

Fair enough, though tests and whatsnew are still needed (just to be clear).

thanks. added whatsnew and tests.
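
For readers following the hunk above, the pattern can be sketched outside pandas roughly as follows. `DeflatingZipFile` is a hypothetical name used only for illustration. It is not the pandas class, and the sketch assumes the only goal is to make ZIP_DEFLATED the default while still letting callers override it.

```python
import io
import zipfile


class DeflatingZipFile(zipfile.ZipFile):
    """Hypothetical ZipFile subclass that compresses by default."""

    def __init__(self, file, mode="r", compression=zipfile.ZIP_DEFLATED,
                 **kwargs):
        # zipfile only understands 'r', 'w', 'x', 'a', so drop the 'b'.
        if mode in ("wb", "rb"):
            mode = mode.replace("b", "")
        # compression is the third positional parameter of ZipFile.__init__.
        super().__init__(file, mode, compression, **kwargs)


buf = io.BytesIO()
with DeflatingZipFile(buf, "wb") as zf:
    zf.writestr("frame.csv", "a,b,c\n" * 500)  # hypothetical member

with DeflatingZipFile(io.BytesIO(buf.getvalue()), "rb") as zf:
    info = zf.getinfo("frame.csv")

print(info.compress_type == zipfile.ZIP_DEFLATED, info.compress_size,
      info.file_size)
```

Because `compression` stays an ordinary keyword parameter, a caller who really wants an uncompressed archive can still pass `compression=zipfile.ZIP_STORED`.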

jreback

pls add a whatsnew & a test (not sure what that should look like)

WillAyd · 2018-05-26T23:27:12Z

pandas/tests/frame/test_to_csv.py

@@ -943,6 +943,22 @@ def test_to_csv_compression(self, compression):
            with tm.decompress_file(filename, compression) as fh:
                assert_frame_equal(df, read_csv(fh, index_col=0))

+    def test_to_csv_compression_size(self, compression):


Since these are all the same tests I think it makes more sense to put in test_common and parametrize for the different writers, rather than splitting out across the various modules

Makes sense. Done.

minggli · 2018-05-27T14:35:10Z

@WillAyd @jreback @gfyoung changes implemented, comments welcome. 👍

WillAyd · 2018-05-27T19:09:49Z

pandas/tests/test_common.py

+    s = df.iloc[:, 0]
+
+    with tm.ensure_clean() as filename:
+        for obj in [df, s]:


Can even parametrize the Series and Frame instead of a loop

WillAyd · 2018-05-27T19:10:13Z

pandas/tests/test_common.py

+            file_size = os.path.getsize(filename)
+            getattr(obj, method)(filename, compression=None)
+            uncompressed_file_size = os.path.getsize(filename)
+            if compression:


Shouldn't need this conditional

OK, though I had to skip or xfail when compression is None. The fixture is shared with other tests, so there is no need to change the fixture.

WillAyd · 2018-05-27T19:52:08Z

pandas/tests/test_common.py

+
+
+@pytest.mark.parametrize('frame', [
+    pd.concat(100 * [DataFrame([[0.123456, 0.234567, 0.567567],


Instead of using concat just multiply your list of lists by 100 within the constructor - will be more performant and idiomatic

good point.
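
A small sketch of the two constructions discussed in this thread. The second data row and the column names are assumptions, since the hunk above is truncated.

```python
import pandas as pd

# First row is from the hunk above; second row and column names are assumed.
rows = [[0.123456, 0.234567, 0.567567],
        [12.32112, 123123.2, 321321.2]]
columns = ["X", "Y", "Z"]

# Original approach: build 100 small frames, then concatenate them.
via_concat = pd.concat(100 * [pd.DataFrame(rows, columns=columns)])

# Suggested approach: repeat the list of lists and build one frame.
via_constructor = pd.DataFrame(100 * rows, columns=columns)

print(via_concat.shape, via_constructor.shape)
```

Both frames hold the same 200 rows of values. One visible difference is the index: the concatenated frame repeats 0, 1, 0, 1, ... while the single constructor call yields a fresh 0..199 range.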

WillAyd · 2018-05-27T19:52:54Z

pandas/tests/test_common.py

@@ -222,3 +224,22 @@ def test_standardize_mapping():

    dd = collections.defaultdict(list)
    assert isinstance(com.standardize_mapping(dd), partial)
+
+
+@pytest.mark.parametrize('frame', [


Since this is either a frame or a series, don't use frame as the variable name, as that's slightly confusing when a series is passed. obj should be fine.
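
Putting the review comments together, the size check could look roughly like the sketch below. This is not the test that was merged. `compressed_is_smaller` is a hypothetical helper, the data and column names are assumed, and a plain list of cases stands in for what would be `pytest.mark.parametrize` arguments (over `obj` and `method`) in the real test module.

```python
import os
import tempfile

import pandas as pd


def compressed_is_smaller(obj, method, compression):
    """Write obj with and without compression and compare file sizes."""
    with tempfile.TemporaryDirectory() as tmp:
        compressed_path = os.path.join(tmp, "compressed")
        plain_path = os.path.join(tmp, "plain")
        getattr(obj, method)(compressed_path, compression=compression)
        getattr(obj, method)(plain_path, compression=None)
        return os.path.getsize(compressed_path) < os.path.getsize(plain_path)


# Assumed data; repetitive enough that compression should shrink it.
df = pd.DataFrame(100 * [[0.123456, 0.234567, 0.567567],
                         [12.32112, 123123.2, 321321.2]],
                  columns=["X", "Y", "Z"])
s = df.iloc[:, 0]

# Each (obj, method) pair would be one parametrized case under pytest.
cases = [(obj, method)
         for obj in (df, s)
         for method in ("to_pickle", "to_json", "to_csv")]

results = [compressed_is_smaller(obj, method, "zip") for obj, method in cases]
print(results)
```

The uncompressed case (compression=None) is left out of `cases` on purpose, mirroring the skip discussed above: comparing a file with itself cannot show a size reduction.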

WillAyd

lgtm - will see what others say but thanks for the PR!

jreback · 2018-05-29T01:39:23Z

doc/source/whatsnew/v0.23.1.txt

@@ -83,6 +83,7 @@ Indexing
 I/O
 ^^^

+- Bug in :class:`pandas.io.common.BytesZipFile` where zip compression produces uncompressed zip archive (:issue:`17778`)


can you reference 21144 (as well is fine) as that other issue was closed for 0.23.0

Hi @jreback @WillAyd, I tested 21144 on an older version of pandas (0.20) on Windows and the same issue occurred. I think it's an existing issue unrelated to this PR. This PR only addresses zip compression and doesn't touch gzip at all.

jreback · 2018-05-29T01:39:46Z

pandas/tests/test_common.py

+        pytest.skip("only test compression case.")
+
+    with tm.ensure_clean() as filename:
+        getattr(obj, method)(filename, compression=compression)


does this fail under 0.23.0? (and works here clearly)

jreback · 2018-05-29T10:41:32Z

thanks @minggli !

minggli · 2018-05-29T10:43:51Z

happy to help!

minggli changed the title ~~set keyword argument so zipfile actually compresses~~ BUG: set keyword argument so zipfile actually compresses May 20, 2018

set keyword argument so zipfile actually compresses

86d2f72

minggli force-pushed the bugfix/zip branch from 06be493 to 86d2f72 Compare May 20, 2018 16:27

gfyoung added the Bug and IO Data labels May 21, 2018

gfyoung reviewed May 21, 2018

jreback requested changes May 21, 2018

minggli and others added 7 commits May 25, 2018 23:58

add compression size test case

498451d

add compression size test case

012383f

add compression size test case

a790148

add compression size test case

42f5c32

update whatsnew

74b8c34

Merge branch 'master' into bugfix/zip

c46fbe0

E302 expect 2 blank lines

f31fc3d

minggli closed this May 26, 2018

minggli reopened this May 26, 2018

minor refactor of tests

3a29ab3

WillAyd requested changes May 26, 2018

refactor tests

4775cac

minggli force-pushed the bugfix/zip branch from f12a4e3 to 4775cac Compare May 27, 2018 13:23

WillAyd requested changes May 27, 2018

parameterize objects

fa6c433

minggli force-pushed the bugfix/zip branch from 7181828 to fa6c433 Compare May 27, 2018 19:35

WillAyd requested changes May 27, 2018

simplify construction

adb6fd6

minggli force-pushed the bugfix/zip branch from d991d4c to adb6fd6 Compare May 27, 2018 19:58

WillAyd approved these changes May 27, 2018

jreback reviewed May 29, 2018

jreback added this to the 0.23.1 milestone May 29, 2018

WillAyd mentioned this pull request May 29, 2018

df.to_csv ignores compression when provided with a file handle #21227

Closed

update whatsnew

974b063

jreback approved these changes May 29, 2018

jreback added the Needs Backport label May 29, 2018

jreback merged commit c85ab08 into pandas-dev:master May 29, 2018

minggli deleted the bugfix/zip branch May 29, 2018 10:44

jorisvandenbossche removed the Needs Backport label Jun 8, 2018

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018

BUG: set keyword argument so zipfile actually compresses (pandas-dev#…

ea4e49d

…21144) (cherry picked from commit c85ab08)

jorisvandenbossche pushed a commit that referenced this pull request Jun 9, 2018

BUG: set keyword argument so zipfile actually compresses (#21144)

1d5e535

(cherry picked from commit c85ab08)

david-liu-brattle-1 pushed a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

BUG: set keyword argument so zipfile actually compresses (pandas-dev#…

dec1e54

…21144)

TomAugspurger mentioned this pull request Jun 28, 2018

Permission Denied when writing to HDFS dask/knit#132

Open


