ENH: DataFrame.to_csv support for "compression='gzip'" #7615

Closed
francescomalandrino opened this Issue Jun 30, 2014 · 15 comments

francescomalandrino commented Jun 30, 2014

The DataFrame.to_csv method seems to accept a "compression" named parameter:

import numpy as np
import pandas as pd

data = np.arange(10).reshape(5, 2)
df = pd.DataFrame(data, columns=['a', 'b'])
df.to_csv('test.csv.gz', compression='gzip')

However, the file it creates is not compressed at all:
francesco@i3 ~/Desktop $ cat test.csv.gz
,a,b
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9

How about either (i) actually implementing compression, or at least (ii) raising an error? The current behavior is confusing.

Contributor

jreback commented Jun 30, 2014

to_csv allows **kwds, so arbitrary additional arguments are 'accepted' (IIRC this is mainly for compatibility with some of the other to_* functions, which allow this) but ignored. I suppose that could be removed (not sure why it was there in the first place). That said, only arguments in the docstring are public.

Would accept a pull-request to limit this.
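
To illustrate the **kwds point, here is a minimal sketch with two hypothetical functions (they are not pandas code, and the names write_lenient and write_strict are made up for this example). It shows how a catch-all keyword parameter silently swallows an unsupported argument, while a strict signature surfaces it as a TypeError.

```python
def write_lenient(path, sep=",", **kwds):
    """Hypothetical writer: unknown keyword arguments are accepted and ignored."""
    return {"path": path, "sep": sep}


def write_strict(path, sep=","):
    """Hypothetical writer: only the documented keyword arguments are allowed."""
    return {"path": path, "sep": sep}


# The unsupported argument is silently dropped; nothing tells the caller.
lenient_result = write_lenient("test.csv.gz", compression="gzip")

# The same call against the strict signature fails loudly.
try:
    write_strict("test.csv.gz", compression="gzip")
    strict_error = None
except TypeError as exc:
    strict_error = exc
```

Limiting the accepted keywords, as suggested above, would move to_csv from the first behavior to the second.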

@jreback jreback added CSV labels Jun 30, 2014

@jreback jreback added this to the Someday milestone Jun 30, 2014

mmautner commented Jun 30, 2014

I think this is out of scope for pandas; just use the gzip module: https://docs.python.org/2/library/gzip.html

please close

francescomalandrino commented Jun 30, 2014

But compression='gzip' is accepted (and enacted) in pd.read_csv, which is why I was assuming to_csv behaves the same.

mmautner commented Jun 30, 2014

The way you initially phrased the issue suggested that you were just guessing at keyword arguments--'compression' isn't a documented argument so I don't think your confusion is shared by many. You're welcome to submit a pull-request, I don't feel religious about this at all

francescomalandrino commented Jun 30, 2014

Sorry, what I meant is:

  1. compression is documented and working for read_csv:
    http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
  2. compression is not documented and not working for to_csv.

There is no error in the documentation and both (1) and (2) make sense to me.
It's (1) and (2) together, i.e., the fact that to_csv behaves differently from read_csv without telling the user, that seemed a bit inconsistent to me.

Closing on the grounds that I won't be fixing it myself, and probably it's not a proper bug.

mmautner commented Jun 30, 2014

Thanks! I definitely didn't mean to antagonize you--agreed that it's an unfortunate inconsistency

@jorisvandenbossche jorisvandenbossche changed the title from DataFrame.to_csv accepts but does not enact "compression='gzip'" to ENH: DataFrame.to_csv support for "compression='gzip'" Jul 1, 2014

Member

jorisvandenbossche commented Jul 1, 2014

Would we want this feature if someone were to implement it? If so, we can leave it open, marked as an enhancement proposal.

msevrens commented Aug 22, 2014

I would also like to_csv to have the same functionality as from_csv.

Contributor

dhimmel commented May 26, 2015

+1, a compression argument for DataFrame.to_csv would spare many user headaches.

In Python 3.4, I use the following workaround:

import gzip

# Open the target in text mode ('wt') so to_csv can write str data to it.
with gzip.open('path_to_file', 'wt') as write_file:
    data_frame.to_csv(write_file)

Member

shoyer commented May 26, 2015

@dhimmel If you're interested in putting in the work, I think we're still open to a PR to add this feature.

Contributor

dhimmel commented May 28, 2015

@shoyer, okay I will keep this in mind. I have a bit to learn first.

rbrito commented Oct 20, 2015

Thank you so much for implementing this! Besides the aesthetics POV and fixing the asymmetry between read/write, this is a huge improvement to some people like me.

jsmedmar commented Jun 30, 2016

This did not work for me; the output file isn't compressed. I'm using pandas 0.18.1.

Contributor

TomAugspurger commented Jun 30, 2016

@jsmedmar could you open a new issue demonstrating the problem? Thanks.

indera commented Oct 18, 2016

@jsmedmar I see the "compression" argument is properly documented and it is working
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.DataFrame.to_csv.html

One confusing thing is that if you run the following code

import numpy as np
import pandas as pd

data = np.arange(10).reshape(5, 2)
df = pd.DataFrame(data, columns=['a', 'b'])
print(df)
df.to_csv('test.csv.gz', compression='gzip')
"""
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
"""

you get a compressed file, but opening it in vim automatically decompresses it, so to verify that compression happened use the "head" command:

$ head test.csv.gz
D5X�test.csv�70
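
A check that does not depend on how a viewer renders binary data is to inspect the first two bytes of the file: every gzip stream starts with the magic bytes 0x1f 0x8b. The sketch below uses only the Python standard library and writes its own small files instead of calling pandas. The helper name is_gzip_file and the file names are assumptions made for this illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


csv_text = ",a,b\n0,0,1\n1,2,3\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "plain.csv.gz")  # misleading name, plain text
    gzip_path = os.path.join(tmp_dir, "real.csv.gz")    # actually compressed

    with open(plain_path, "w") as fh:
        fh.write(csv_text)
    with gzip.open(gzip_path, "wt") as fh:
        fh.write(csv_text)

    plain_is_gzip = is_gzip_file(plain_path)
    real_is_gzip = is_gzip_file(gzip_path)

    # Decompressing should give back the original text.
    with gzip.open(gzip_path, "rt") as fh:
        round_trip = fh.read()
```

Pointing is_gzip_file at the test.csv.gz produced by to_csv would distinguish the uncompressed output reported at the top of this issue from a properly compressed file.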
