Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conform Series.to_csv to DataFrame.to_csv #19715

Closed
dahlbaek opened this issue Feb 15, 2018 · 8 comments

Comments

Projects
None yet
5 participants
@dahlbaek
Copy link
Contributor

commented Feb 15, 2018

Thank you for this awesome package!

Problem description

When indexing/selecting/slicing a DataFrame, pandas may return a Series. Both classes support the to_csv method, but the methods have subtle differences. These subtle differences may break your code, or demand a call to to_frame. Perhaps it would be beneficial to conform the Series.to_csv method to the DataFrame.to_csv method?

At least, the following two examples seem relevant (and are the reason I came here):

  • DataFrame.to_csv accepts the keyword parameter path_or_buf. This seems to be completely analogous to the differently named keyword parameter path of Series.to_csv.
  • DataFrame.to_csv accepts the keyword parameter line_terminator. There does not seem to be an analogous keyword parameter of Series.to_csv.

Addendum: Other differences

The following two keyword parameters are implemented for DataFrame.to_csv, but not for Series.to_csv: compression, chunksize. I suppose these are mostly relevant if you have very large dataframes, but I do not see any reason why they should not have analogous methods for Series.

Some of the keyword parameters of DataFrame.to_csv are specific to multi-column data and may therefore not make immediate sense for Series.to_csv. On the other hand, a user may wish to use quoting in a single-column file in order to maintain conformity with other parts of a bigger project. In this relation, the following keyword parameters are implemented for DataFrame.to_csv but do not for Series.to_csv: quoting, quotechar,doublequote, escapechar.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-32-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL:
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 15, 2018

@dahlbaek Thanks for opening this issue. I think this all sounds very sensible, and, should also not be too hard to solve.
As currently Series.to_csv just creates a dataframe and then does DataFrame.to_csv:

pandas/pandas/core/series.py

Lines 2924 to 2932 in 2fdf1e2

df = DataFrame(self)
# result is only a string if no path provided, otherwise None
result = df.to_csv(path, index=index, sep=sep, na_rep=na_rep,
float_format=float_format, header=header,
index_label=index_label, mode=mode,
encoding=encoding, compression=compression,
date_format=date_format, decimal=decimal)
if path is None:
return result

So I think we could simply pass through **kwargs to DataFrame.to_csv

@dahlbaek

This comment has been minimized.

Copy link
Contributor Author

commented Feb 15, 2018

I see! If you would like, I would be happy to try my hand at a pull request.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Feb 15, 2018

Sure!

@gfyoung

This comment has been minimized.

Copy link
Member

commented Feb 15, 2018

xref #18958 : I 100% agree that this should be done.

@gfyoung gfyoung added the Compat label Feb 15, 2018

@dahlbaek dahlbaek referenced this issue Feb 18, 2018

Closed

Conform Series.to_csv to DataFrame.to_csv #19745

3 of 4 tasks complete

@jreback jreback added this to the 0.23.0 milestone Feb 18, 2018

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.23.0, Next Major Release Mar 29, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018

@toobaz toobaz referenced this issue Jul 13, 2018

Merged

DEPR: Deprecate Series.to_csv signature #21896

4 of 4 tasks complete
@gfyoung

This comment has been minimized.

Copy link
Member

commented Jul 13, 2018

@jreback @jorisvandenbossche : Three PR's for resolving this issue:

#19745 (@dahlbaek)
#21868 (@gfyoung)
#21896 (@toobaz)

All of them have different implementations. Thoughts?

toobaz added a commit to toobaz/pandas that referenced this issue Jul 13, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jul 14, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Jul 14, 2018

@dahlbaek

This comment has been minimized.

Copy link
Contributor Author

commented Jul 14, 2018

As far as I understand, the solution by @gfyoung (#21868) takes care of everything except the reordering of positional arguments. As pointed out by @toobaz here (see #21896 for proof of concept), the ordering problem can be solved by noting that index should be True or False while sep must be a one-character string. That is, one can determine whether the user is using the old or new signature by checking the type of sep (which is the first positional argument following path_or_buf in the new signature).

However, as of right now, index also accepts one-character strings (evaluating to True), which means there may be code out there passing one-character strings to index. I wonder if it would be possible to pave the way for the solution to this problem by first having a release which forces users to pass in boolean valued arguments to index?

If one were to first implement something like

if not isinstance(index, bool):
    raise ValueError(
        "{0} is not a valid value for index. "
        "Use True, False or bool({0}) instead."
        .format(repr(index))
    )

in Series.to_csv, then the solution of @toobaz would be ensured to correctly classify which signature the user had in mind. But maybe this is overthinking it.

@toobaz

This comment has been minimized.

Copy link
Member

commented Jul 14, 2018

then the solution of @toobaz would be ensured to correctly classify which signature the user had in mind

Doesn't it already? (see last rebase, and my reply to your comment)

@toobaz

This comment has been minimized.

Copy link
Member

commented Jul 14, 2018

Doesn't it already? (see last rebase, and my reply to your comment)

No, it does not, sorry, the ambiguity is unavoidable.

If one were to first implement something like

No, I really don't think we want to do this - not worth the effort (the "y" behavior is undocumented).

toobaz added a commit to toobaz/pandas that referenced this issue Jul 25, 2018

@dhimmel dhimmel referenced this issue Jul 27, 2018

Merged

Default to_* methods to compression='infer' #22011

4 of 4 tasks complete

toobaz added a commit to toobaz/pandas that referenced this issue Aug 2, 2018

toobaz added a commit to toobaz/pandas that referenced this issue Aug 2, 2018

jorisvandenbossche added a commit that referenced this issue Aug 13, 2018

Sup3rGeo added a commit to Sup3rGeo/pandas that referenced this issue Oct 1, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.