Defaulting to_csv to infer compression #22004

Closed
dhimmel opened this issue Jul 20, 2018 · 2 comments

@dhimmel
Contributor

commented Jul 20, 2018

This issue follows up on #17900 by @Dobatymo and @gfyoung, with review from @jreback (thanks!). #17900 added an 'infer' option to compression in _get_handle. The main user-facing benefit is that df.to_csv will be able to infer compression just like pandas.read_csv. However, unlike read_csv, the default value for compression is None rather than 'infer'.

Unfortunately, much of the convenience of compression='infer' is lost if you have to specify it explicitly. In summary, I think it would be a major convenience for the following command to work and automatically perform gzip compression:

df.to_csv('path.csv.gz')
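
To make the proposed behavior concrete, here is a minimal standard-library sketch of extension-based inference. It is not the pandas implementation: EXTENSION_TO_COMPRESSION, infer_compression, and write_rows are hypothetical names used only for illustration, and only gzip is actually written.

```python
import csv
import gzip
import os

# Hypothetical extension table for illustration; not pandas' internal mapping.
EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zip": "zip"}


def infer_compression(path, compression="infer"):
    """Return the compression method to use for `path`.

    An explicit `compression` value (including None) is returned unchanged;
    only 'infer' falls back to the file extension.
    """
    if compression != "infer":
        return compression
    _, ext = os.path.splitext(str(path))
    return EXTENSION_TO_COMPRESSION.get(ext.lower())


def write_rows(path, rows, compression="infer"):
    """Write rows as CSV, gzip-compressing when the resolved method is 'gzip'."""
    method = infer_compression(path, compression)
    if method not in (None, "gzip"):
        # Other methods are left out to keep the sketch short.
        raise NotImplementedError(method)
    opener = gzip.open if method == "gzip" else open
    with opener(path, "wt", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

With 'infer' as the default, write_rows('path.csv.gz', rows) produces a gzip file, while passing compression=None keeps the current uncompressed behavior.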

Compatibility assessment

Defaulting to 'infer' would only affect users who currently write to paths with compression extensions without actually compressing. That's pretty bad practice IMO. Hence, I'm in favor of breaking backwards compatibility and changing the default for compression to 'infer'. It looks like this would go into the 0.24 major release?
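
For anyone who wants to check whether they would be affected, the following is a rough sketch that flags files named .gz whose contents are not gzip data. It assumes a check of the two-byte gzip magic number is good enough; is_mislabeled_gzip is a hypothetical helper, not part of pandas.

```python
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def is_mislabeled_gzip(path):
    """Return True if `path` ends in .gz but does not start with the gzip magic bytes."""
    path = Path(path)
    if path.suffix.lower() != ".gz":
        return False
    with path.open("rb") as handle:
        return handle.read(2) != GZIP_MAGIC
```

Files flagged by this check are the ones whose output would change from plain text to gzip under the new default.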

@WillAyd

Member

commented Jul 20, 2018

I agree conceptually. We probably need to handle cases where inference would conflict with an explicitly passed compression argument. PRs welcome.

@dhimmel

Contributor Author

commented Jul 21, 2018

I am happy to open a PR. I think the solution will be as simple as changing the compression default to infer in:

pandas/pandas/core/frame.py

Lines 1714 to 1716 in 322dbf4

def to_csv(self, path_or_buf=None, sep=",", na_rep='', float_format=None,
           columns=None, header=True, index=True, index_label=None,
           mode='w', encoding=None, compression=None, quoting=None,

Looks like to_pickle already defaults to infer:

def to_pickle(obj, path, compression='infer', protocol=pkl.HIGHEST_PROTOCOL):

to_json should also probably be switched to default to infer:

def to_json(path_or_buf, obj, orient=None, date_format='epoch',
            double_precision=10, force_ascii=True, date_unit='ms',
            default_handler=None, lines=False, compression=None,
            index=True):

I don't think the other to_* methods have a compression argument, but I should double-check.
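
One way to double-check is to inspect the method signatures rather than read the source by hand. The helper below is a hypothetical sketch (writers_with_parameter is not a pandas function) that lists the to_* callables on a class that accept a given parameter.

```python
import inspect


def writers_with_parameter(cls, parameter, prefix="to_"):
    """Return sorted names of callables on `cls` that start with `prefix`
    and accept a parameter called `parameter`."""
    names = []
    for name, member in inspect.getmembers(cls, callable):
        if not name.startswith(prefix):
            continue
        try:
            signature = inspect.signature(member)
        except (TypeError, ValueError):
            # Some callables do not expose an introspectable signature.
            continue
        if parameter in signature.parameters:
            names.append(name)
    return names
```

Calling writers_with_parameter(pandas.DataFrame, "compression") should then list the relevant methods; the result depends on the installed pandas version.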
