Add comment char parameter to to_csv method #27637

Open
mthaak opened this issue Jul 29, 2019 · 6 comments
Labels
Docs IO CSV read_csv, to_csv

Comments

@mthaak

mthaak commented Jul 29, 2019

Code Sample

import pandas as pd
pd.DataFrame( ["a'", 'a"', "a,", "a#"], columns=["Column"]).to_csv("my.csv", index=False)

results in my.csv:

Column
a'
"a"""
"a,"
a#

(a# is not quoted)

Problem description

We would like to use the "#" character as a comment indicator, so that lines starting with it are automatically ignored. However, when a field contains "#" and the CSV is read with pd.read_csv("my.csv", comment="#"), that field is read as nan. When such fields are quoted, they are read as literal strings (which is the behavior we want). So we want to_csv to automatically quote fields containing the comment character "#".

(The workaround we have now is to set quoting=2 (non-numeric), so that all strings are quoted by default.)
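
The workaround above can be sketched as a round trip. This is a minimal illustration, assuming a pandas version where to_csv accepts the csv module's quoting constants (csv.QUOTE_NONNUMERIC has the value 2); the in-memory buffer is a stand-in for my.csv and is not part of the original report.

```python
import csv
import io

import pandas as pd

df = pd.DataFrame(["a'", 'a"', "a,", "a#"], columns=["Column"])

# Write with every non-numeric field quoted, so the "a#" field is
# written inside double quotes. The buffer stands in for my.csv.
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)

# Read back with "#" as the comment character; a "#" inside a quoted
# field is kept as data instead of starting a comment.
buffer.seek(0)
roundtrip = pd.read_csv(buffer, comment="#")
print(roundtrip["Column"].tolist())
```

The cost of this approach is that every string field is quoted, not only the ones containing "#".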

Expected Output

import pandas as pd
pd.DataFrame( ["a'", 'a"', "a,", "a#"], columns=["Column"]).to_csv("my.csv", index=False, commentchar="#")

results in my.csv:

Column
a'
"a"""
"a,"
"a#"

A more universal solution could be to allow passing a list of characters to quote for the quoting parameter, e.g. df.to_csv("my.csv", quoting=['"', '#', ","]).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-25-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.11
numpy: 1.16.4
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.6.0
sphinx: 2.1.1
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: 0.9.3
psycopg2: 2.8.3 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger
Contributor

> So we want to automatically quote fields containing the comment character "#" in to_csv.

I'm not sure we would want to do this automatically. As you say, quoting is how you control this.

> A more universal solution could be to allow passing a list of characters to quote for the quoting parameter, e.g. df.to_csv("my.csv", quoting=['"', '#', ","]).

I'm not sure we would want to deviate from the stdlib here. cc @gfyoung if you have thoughts.

@jbrockmendel added the IO CSV (read_csv, to_csv) label Jul 29, 2019
@gfyoung
Member

gfyoung commented Jul 29, 2019

> However when fields contain the "#" character and the pd.read_csv("my.csv", comment="#") is used to read the CSV, then those fields are read as nan.

To clarify, when you say "nan", I think you mean the "#" doesn't show up, right?

Have you considered escaping those "#" ? Then you can pass in an escapechar parameter.

@mthaak
Author

mthaak commented Jul 30, 2019

> To clarify, when you say "nan", I think you mean the "#" doesn't show up, right?

As soon as a "#" character is encountered, the rest of the line is ignored because it's interpreted as comment. But if the # is in a quoted string, this doesn't happen.
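
A small sketch of that difference, reading the same "#" once unquoted and once quoted. The sample rows are made up for illustration and are not from the original my.csv.

```python
import io

import pandas as pd

# Two data rows: an unquoted field containing "#" and a quoted one.
raw = 'Column\na#\n"b#"\n'

df = pd.read_csv(io.StringIO(raw), comment="#")
values = df["Column"].tolist()
print(values)
```

In the unquoted row, everything from the "#" onward is dropped; in the quoted row, the "#" survives as part of the string.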

> Have you considered escaping those "#" ? Then you can pass in an escapechar parameter.

That's actually a good solution as well. Except that you have to pre-process the columns beforehand. Might go for this one though. 👍
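
A sketch of the escaping approach, under a few assumptions not stated in the thread: the backslash is chosen as the escape character, the pre-processing is done with Series.str.replace, and no field already contains a backslash.

```python
import io

import pandas as pd

df = pd.DataFrame(["a'", 'a"', "a,", "a#"], columns=["Column"])

# Pre-process: put a backslash (the assumed escape character) in front
# of every "#" before writing. regex=False makes this a literal replace.
escaped = df.assign(Column=df["Column"].str.replace("#", "\\#", regex=False))

# Default quoting is used, so the escaped field is written unquoted.
buffer = io.StringIO()
escaped.to_csv(buffer, index=False)

# Read back, telling the parser that a backslash escapes the next character.
buffer.seek(0)
roundtrip = pd.read_csv(buffer, comment="#", escapechar="\\")
print(roundtrip["Column"].tolist())
```

Note that escapechar is passed only to read_csv here; fields that legitimately contain backslashes would need extra handling.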

I have to admit this can be a feature request for quite a specific use case. Can imagine it's perhaps not worth the work.

@TomAugspurger
Contributor

IMO, this isn't generally useful enough to warrant a new keyword.

@gfyoung
Member

gfyoung commented Aug 1, 2019

> IMO, this isn't generally useful enough to warrant a new keyword.

@mthaak : I unfortunately would have to agree here with @TomAugspurger.

That being said: give the escaping attempt a shot. If it works for you, we could consider adding this to our cookbook, as it's not entirely unreasonable that people would want to do this.

@mthaak
Author

mthaak commented Aug 2, 2019

In case you would like to add it to the cookbook: escaping the comment characters and passing escapechar works, but the more general solution is setting quoting=2 (non-numeric) in the to_csv method, so that all text strings are quoted.
