New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: set keyword argument so zipfile actually compresses #21144

Merged
merged 13 commits into from May 29, 2018

Conversation

Projects
None yet
5 participants
@minggli
Contributor

minggli commented May 20, 2018

  • closes #17778
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

zipfile.ZipFile has default compression mode zipfile.ZIP_STORED. It creates an uncompressed archive member. Whilst it doesn't cause issue, it is a strange default to have given users would want to compress files.

In order for zip compression to actually reduce file size, keyword argument compression=zipfile.ZIP_DEFLATED is added.

@minggli minggli changed the title from set keyword argument so zipfile actually compresses to BUG: set keyword argument so zipfile actually compresses May 20, 2018

@codecov

This comment has been minimized.

codecov bot commented May 20, 2018

Codecov Report

Merging #21144 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #21144   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49505    49505           
=======================================
  Hits        45466    45466           
  Misses       4039     4039
Flag Coverage Δ
#multiple 90.23% <100%> (ø) ⬆️
#single 41.88% <75%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/common.py 70.04% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c2844a...974b063. Read the comment docs.

if mode in ['wb', 'rb']:
mode = mode.replace('b', '')
super(BytesZipFile, self).__init__(file, mode, **kwargs)
super(BytesZipFile, self).__init__(file, mode, compression, **kwargs)

This comment has been minimized.

@gfyoung

gfyoung May 21, 2018

Member

Can we add tests and a whatsnew for this?

This comment has been minimized.

@gfyoung

gfyoung May 21, 2018

Member

Also, because you are modifying the default behavior, I'm not sure if we need a deprecation cycle for this (to be safe, we should I would imagine).

This comment has been minimized.

@jreback

jreback May 21, 2018

Contributor

no this is a bug

This comment has been minimized.

@gfyoung

gfyoung May 21, 2018

Member

Fair enough, though tests and whatsnew are still needed (just to be clear).

This comment has been minimized.

@minggli

minggli May 25, 2018

Contributor

thanks. added whatsnew and tests.

@jreback

pls add a whatsnew & a test (not sure what that should look like)

minggli added some commits May 25, 2018

@minggli minggli closed this May 26, 2018

@minggli minggli reopened this May 26, 2018

@@ -943,6 +943,22 @@ def test_to_csv_compression(self, compression):
with tm.decompress_file(filename, compression) as fh:
assert_frame_equal(df, read_csv(fh, index_col=0))
def test_to_csv_compression_size(self, compression):

This comment has been minimized.

@WillAyd

WillAyd May 26, 2018

Member

Since these are all the same tests I think it makes more sense to put in test_common and parametrize for the different writers, rather than splitting out across the various modules

This comment has been minimized.

@minggli

minggli May 27, 2018

Contributor

make sense. done.

@minggli

This comment has been minimized.

Contributor

minggli commented May 27, 2018

@WillAyd @jreback @gfyoung changes implemented, comments welcome. 👍

s = df.iloc[:, 0]
with tm.ensure_clean() as filename:
for obj in [df, s]:

This comment has been minimized.

@WillAyd

WillAyd May 27, 2018

Member

Can even parametrize the Series and Frame instead of a loop

This comment has been minimized.

@minggli

minggli May 27, 2018

Contributor

done.

file_size = os.path.getsize(filename)
getattr(obj, method)(filename, compression=None)
uncompressed_file_size = os.path.getsize(filename)
if compression:

This comment has been minimized.

@WillAyd

WillAyd May 27, 2018

Member

Shouldn't need this conditional

This comment has been minimized.

@minggli

minggli May 27, 2018

Contributor

ok. though had to skip or xfail when compression==None. the fixture is shared across other tests so no need to change fixture.

@pytest.mark.parametrize('frame', [
pd.concat(100 * [DataFrame([[0.123456, 0.234567, 0.567567],

This comment has been minimized.

@WillAyd

WillAyd May 27, 2018

Member

Instead of using concat just multiply your list of lists by 100 within the constructor - will be more performant and idiomatic

This comment has been minimized.

@minggli

minggli May 27, 2018

Contributor

good point.

@@ -222,3 +224,22 @@ def test_standardize_mapping():
dd = collections.defaultdict(list)
assert isinstance(com.standardize_mapping(dd), partial)
@pytest.mark.parametrize('frame', [

This comment has been minimized.

@WillAyd

WillAyd May 27, 2018

Member

Since this is either a frame or a series don't use frame as the variable name, as that's slightly confusing in case of the former being passed. obj should be fine

@WillAyd

lgtm - will see what others say but thanks for the PR!

@@ -83,6 +83,7 @@ Indexing
I/O
^^^
- Bug in :class:`pandas.io.common.BytesZipFile` where zip compression produces uncompressed zip archive (:issue:`17778`)

This comment has been minimized.

@jreback

jreback May 29, 2018

Contributor

can you reference 21144 (as well is fine) as that other issue was closed for 0.23.0

This comment has been minimized.

@minggli

minggli May 29, 2018

Contributor

Hi @jreback @WillAyd , tested 21144 on older version of pandas 0.20 on windows and same issue occurred. I think it's an existing issue unrelated to this PR. This PR only address zip compression and doesn't touch gzip at all.

pytest.skip("only test compression case.")
with tm.ensure_clean() as filename:
getattr(obj, method)(filename, compression=compression)

This comment has been minimized.

@jreback

jreback May 29, 2018

Contributor

does this fail under 0.23.0? (and works here clearly)

@jreback jreback merged commit c85ab08 into pandas-dev:master May 29, 2018

0 of 3 checks passed

ci/circleci Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@jreback

This comment has been minimized.

Contributor

jreback commented May 29, 2018

thanks @minggli !

@minggli

This comment has been minimized.

Contributor

minggli commented May 29, 2018

happy to help!

@minggli minggli deleted the minggli:bugfix/zip branch May 29, 2018

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jun 8, 2018

jorisvandenbossche added a commit that referenced this pull request Jun 9, 2018

david-liu-brattle-1 added a commit to david-liu-brattle-1/pandas that referenced this pull request Jun 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment