Add comment char parameter to to_csv method #27637

mthaak · 2019-07-29T11:17:50Z

Code Sample

import pandas as pd
pd.DataFrame( ["a'", 'a"', "a,", "a#"], columns=["Column"]).to_csv("my.csv", index=False)

results in my.csv:

Column
a'
"a"""
"a,"
a#

(a# is not quoted)

Problem description

We would like to use "#" as a comment indicator, so that lines starting with that character are ignored automatically. However, when a field contains "#" and the CSV is read with pd.read_csv("my.csv", comment="#"), that field is read as nan. When the field is quoted, it is read as a literal string (which is the behavior we want). So we want to_csv to automatically quote fields containing the comment character "#".

(The workaround we use now is to set quoting=2 (csv.QUOTE_NONNUMERIC), so that all strings are quoted by default.)
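The difference between the default quoting and the workaround can be inspected without writing a file, since to_csv returns the CSV text as a string when no path is given. This is a minimal sketch, assuming pandas is installed; the variable names are illustrative only.

```python
import csv

import pandas as pd

df = pd.DataFrame(["a'", 'a"', "a,", "a#"], columns=["Column"])

# Default quoting (csv.QUOTE_MINIMAL): "#" is not special to the writer,
# so the field a# is written without quotes.
minimal_lines = df.to_csv(index=False).splitlines()

# csv.QUOTE_NONNUMERIC is the named constant behind quoting=2: every
# non-numeric field is quoted, including a#.
nonnumeric_lines = df.to_csv(
    index=False, quoting=csv.QUOTE_NONNUMERIC
).splitlines()

print(minimal_lines)
print(nonnumeric_lines)
```

Using the named constant instead of the literal 2 keeps the intent readable at the call site.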

Expected Output

import pandas as pd
pd.DataFrame( ["a'", 'a"', "a,", "a#"], columns=["Column"]).to_csv("my.csv", index=False, commentchar="#")

results in my.csv:

Column
a'
"a"""
"a,"
"a#"

A more universal solution could be to allow passing a list of characters to quote for the quoting parameter, e.g. df.to_csv("my.csv", quoting=['"', '#', ","]).

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-25-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.11
numpy: 1.16.4
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.6.0
sphinx: 2.1.1
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: 0.9.3
psycopg2: 2.8.3 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


TomAugspurger · 2019-07-29T15:02:10Z

> So we want to automatically quote fields containing the comment character "#" in to_csv.

I'm not sure we would want to do this automatically. As you say, quoting is how you control this.

> A more universal solution could be to allow passing a list of characters to quote for the quoting parameter. E.g. df.to_csv("my.csv", quoting=['"', '#', ","]).

I'm not sure if we would want to deviate from the stdlib here. cc @gfyoung if you have thoughts.
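For context on the stdlib behavior referred to here: the `quoting` argument of Python's csv.writer takes one of the csv.QUOTE_* constants rather than a list of characters, and csv.QUOTE_MINIMAL only quotes fields that contain the delimiter, the quote character, or a line-terminator character. The sketch below uses only the standard library; `write_rows` is a hypothetical helper written for this illustration.

```python
import csv
import io


def write_rows(values, quoting):
    """Write one value per row with the stdlib writer and return the lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    for value in values:
        writer.writerow([value])
    return buffer.getvalue().splitlines()


values = ["a'", 'a"', "a,", "a#"]

# QUOTE_MINIMAL leaves a# unquoted because "#" has no meaning to the writer.
minimal = write_rows(values, csv.QUOTE_MINIMAL)

# QUOTE_ALL quotes every field regardless of content.
quote_all = write_rows(values, csv.QUOTE_ALL)

print(minimal)    # ["a'", '"a"""', '"a,"', 'a#']
print(quote_all)  # ['"a\'"', '"a"""', '"a,"', '"a#"']
```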

gfyoung · 2019-07-29T17:08:39Z

> However when fields contain the "#" character and the pd.read_csv("my.csv", comment="#") is used to read the CSV, then those fields are read as nan.

To clarify, when you say "nan", I think you mean the "#" doesn't show up, right?

Have you considered escaping those "#" ? Then you can pass in an escapechar parameter.

mthaak · 2019-07-30T15:04:11Z

> To clarify, when you say "nan", I think you mean the "#" doesn't show up, right?

As soon as a "#" character is encountered, the rest of the line is ignored because it's interpreted as comment. But if the # is in a quoted string, this doesn't happen.
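The read-side behavior described above can be reproduced from in-memory CSV text. This is a small sketch, assuming pandas is installed; the two CSV strings are made up for the illustration.

```python
import io

import pandas as pd

# The same value, once unquoted and once quoted.
unquoted_csv = 'Column\na#\n'
quoted_csv = 'Column\n"a#"\n'

# Unquoted: the parser treats "#" as the start of a comment, so the
# field does not come back as the original string.
unquoted = pd.read_csv(io.StringIO(unquoted_csv), comment="#")["Column"].tolist()

# Quoted: the "#" is part of the string and survives.
quoted = pd.read_csv(io.StringIO(quoted_csv), comment="#")["Column"].tolist()

print(unquoted)
print(quoted)
```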

> Have you considered escaping those "#" ? Then you can pass in an escapechar parameter.

That's actually a good solution as well. Except that you have to pre-process the columns beforehand. Might go for this one though. 👍
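A rough sketch of what that pre-processing could look like, assuming pandas is installed, a backslash as the escape character, and data that contains no backslashes of its own (those would need escaping as well). The constant names are illustrative, not part of any pandas API.

```python
import io

import pandas as pd

COMMENT_CHAR = "#"
ESCAPE_CHAR = "\\"

values = ["a'", 'a"', "a,", "a#"]
df = pd.DataFrame(values, columns=["Column"])

# Pre-processing step: put the escape character in front of every "#".
escaped = df.copy()
escaped["Column"] = escaped["Column"].str.replace(
    COMMENT_CHAR, ESCAPE_CHAR + COMMENT_CHAR, regex=False
)

# Write with default quoting; the escaped field goes out as a\#.
csv_text = escaped.to_csv(index=False)

# Read back, telling the parser about both the comment and escape characters.
restored = pd.read_csv(
    io.StringIO(csv_text), comment=COMMENT_CHAR, escapechar=ESCAPE_CHAR
)

print(restored["Column"].tolist())
```

The escaping has to be repeated for every string column that may contain the comment character, which is the extra work mentioned above.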

I have to admit this can be a feature request for quite a specific use case. Can imagine it's perhaps not worth the work.

TomAugspurger · 2019-08-01T21:40:38Z

IMO, this isn't generally useful enough to warrant a new keyword.

gfyoung · 2019-08-01T22:56:55Z

> IMO, this isn't generally useful enough to warrant a new keyword.

@mthaak : I unfortunately would have to agree here with @TomAugspurger.

That being said: give the escaping attempt a shot. If it works for you, we could consider adding this to our cookbook, as it's not entirely unreasonable that people would want to do this.

mthaak · 2019-08-02T13:03:23Z

In case you would like to add it to the cookbook: escaping the comment characters and passing escapechar works, but the more general solution is to set quoting=2 (non-numeric) in the to_csv method, so that all text strings are quoted.

jbrockmendel added the IO CSV read_csv, to_csv label Jul 29, 2019

gfyoung added the Enhancement label Aug 1, 2019

mroeschke added Docs and removed Enhancement labels Jul 10, 2021
