Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saving CSV with backslashed-escaping is not idempotent. #14122

Open
deads opened this issue Aug 31, 2016 · 19 comments
Open

Saving CSV with backslashed-escaping is not idempotent. #14122

deads opened this issue Aug 31, 2016 · 19 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@deads
Copy link

deads commented Aug 31, 2016

@pdbaines and I noticed this bug.

I want Pandas to write a CSV file so that all field data is backslash escaped if the character has a special interpretation (e.g. quotes or backslashes themselves). If a quote is backslashed, it is treated as field data, rather than a special character. This is not the behavior that I am seeing.

Consider the following data frame:

df = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""], "zoo": ["1"]})
df.to_csv("out.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

When written to a file, it looks something like this:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\","1"

The quotes are properly escaped in Please "help" me, but oddly, the end-quote of the field is backslashed, but the start-quote of the field is not back-slashed.

If I read the data frame in again using exactly the same parameters,

df2 = pd.read_csv("out.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I get a data frame with both fields concatenated into the first field and the second field is NaN.

$ print(df2)
                                                text  zoo
0  Hello! Please "help" me. I cannot quote a csv....  NaN

If I instead, do the following:

df3 = pd.DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\\""""], "zoo": ["1"]})
df3.to_csv("outB.csv", index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)
df4 = pd.read_csv("outB.csv", quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8", escapechar='\\', doublequote=False)

I instead get a file with an odd-number of unescaped quote characters:

"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.\\"","1"

and some unescaped quote characters are treated as data.

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
@deads deads changed the title Saving CSV with backslashed-escaping is not idempotent. Some backslash-escaped characters treated like special characters, not data. Saving CSV with backslashed-escaping is not idempotent. Aug 31, 2016
@jreback jreback added the IO CSV read_csv, to_csv label Aug 31, 2016
@jreback
Copy link
Contributor

jreback commented Aug 31, 2016

I think this is a dupe of #12922 and likely fixed here: d814f43

@gfyoung

@gfyoung
Copy link
Member

gfyoung commented Sep 1, 2016

>>> import csv
>>> from pandas import DataFrame
>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv.\\"""],
                    "zoo": ["1"]})
>>> df
                                              text zoo
0  Hello! Please "help" me. I cannot quote a csv.\   1

Why did you put an backslash character there at the end? Let's remove it:

>>> df = DataFrame({"text": ["""Hello! Please "help" me. I cannot quote a csv."""],
                    "zoo": ["1"]})
>>> df
                                              text zoo
0  Hello! Please "help" me. I cannot quote a csv.   1
>>> print(df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC, encoding="utf-8",
                    escapechar='\\', doublequote=False)))
"text","zoo"
"Hello! Please \"help\" me. I cannot quote a csv.","1"

You get that backslash at the end because you put it there.

@jreback : This is not a bug and can be closed.

@deads
Copy link
Author

deads commented Sep 1, 2016

@gfyoung: I put a backslash there to show that backslash escaping does not work for all possible inputs. It seems very reasonable for an implementation to simply backslash the backslash as is the case with C, Java, Python shell, JSON, and string literals as well as the output of Python str.__repr__.

A lack of idempotency could be a security concern as it could affect the availability and integrity of an application.

@gfyoung
Copy link
Member

gfyoung commented Sep 2, 2016

If a quote is backslashed, it is treated as field data, rather than a special character.

The // is interpreted as field data. It is not escaping that last quotation mark. My point is that you are misinterpreting your output as bugged when in fact it is behaving as is.

@deads
Copy link
Author

deads commented Sep 2, 2016

@gfyoung
Please explain why it is desired behavior to not be able to save arbitrary data to a Pandas DataFrame cell and read it back in the same.

@jreback
Copy link
Contributor

jreback commented Sep 2, 2016

@deads I am not convinced that your example should be lossless. csv is a pretty lossy format, esp with all of the options you have selected. Can you do this example with the python csv reader losslessly?

@wesm
Copy link
Member

wesm commented Sep 5, 2016

This behavior is present in the csv module https://gist.github.com/wesm/7763d396ae25c9fd5b27588da27015e4 . From first principles seems like the offending backslash should be escaped. If I manually edit the file to be

"a"
"Hello! Please \"help\" me. I cannot quote a csv.\\"

then read_csv returns the original input

I fiddled with R and it doesn't seem to do much better

> df <- data.frame(a=c("Hello! Please \"help\" me. I cannot quote a csv.\\"))> write.table(df, sep=',', qmethod='e', row.names=F)
"a"
"Hello! Please \"help\" me. I cannot quote a csv.\"

Another example of CSV not being a high fidelity data interchange tool =|

Also, I do not think it is fair to say

You get that backslash at the end because you put it there.
@jreback : This is not a bug and can be closed.

to someone reporting behavior that looks like a bug. This would seem buggy to me if I ran across it in production (presuming this came out of some kind of real world use). "Just change your input" is easy to say until the data in question is machine-generated (and may contain backslashes).

@gfyoung
Copy link
Member

gfyoung commented Sep 6, 2016

@wesm : Your comment is a little presumptive because I did not realize that that was his point with the extraneous backslash. Before jumping to conclusions as you did about me "brushing off" this problem as a cop-out, I would suggest that you read the original issue.

With regards to your point about the bug, I am not surprised that this issue persists with csv because the entire to_csv method is essentially a giant wrapper around Python's csv module. I suspect then that the issue in pandas is really a manifestation of a bug in the csv module that should be pursued with the Python developers first.

@black-snow
Copy link

I seem to have problems with quoting and escaping, too. Has anything happened since 2k16?

@gfyoung
Copy link
Member

gfyoung commented Dec 11, 2017

@black-snow : I don't believe anything has changed with this issue unfortunately. Given that the escaping and writing is handled by Python csv at the very end, if we were to work around this idempotent issue, I think we would have to do some hacky data adjustment before writing.

Have a look at the examples and see if they still persist today. Then also post your example code, and we can have a look.

@black-snow
Copy link

Thanks for the quick reply @gfyoung ! I've already fixed my issue. Apparently pandas needs to be told that quotes inside a quoted field are escaped with a backslash.

@gfyoung
Copy link
Member

gfyoung commented Dec 11, 2017

Awesome! Mind sharing your code-sample for reference?

@black-snow
Copy link

Sure.

There's no real csv standard but I'm used to certain defaults i. e. delimiter is ,, quote char is " and escape char is \ (e. g. from PHP). Pandas doesn't seem to use the backslash as the escape character by default so I had to add it.

X = pd.read_csv(args.file, header=None, index_col=False, escapechar='\\').as_matrix()

Thought this issue would be related but apparently it's not.

@gfyoung
Copy link
Member

gfyoung commented Dec 12, 2017

@black-snow : You'll need to provide us the file (if possible). Can't run that code-sample as is. 😄

@black-snow
Copy link

I cannot, sadly, it's business internal stuff. But without the escapechar pandas should already fail on this (not tested):

"first column","second column"
"first columne with \"escaped\" quotes","second column"

@gfyoung
Copy link
Member

gfyoung commented Dec 12, 2017

But without the escapechar pandas should already fail on this (not tested)

Confirmation would be nice.

@dmitriyshashkin
Copy link

I'm also having this problem. Setting escapechar to backsplash doesn't fix it.

pd.DataFrame({'bar': ['test test \\', 'test'], 'foo': ['aa', 'bb']}).to_csv('~/test.csv', quoting=csv.QUOTE_NONNUMERIC, doublequote=False, escapechar="\\")

This results in the following code in the file:

0,"test test \","aa"
1,"test","bb"

Now, if I'm trying to read the same file using backslash as escape char I get erroneous result:
pd.read_csv('~/test.csv', escapechar='\\').iloc[0]

bar = test test ",aa"
foo = NaN

I believe setting "" as escape character should result in "" being escaped by ""

@dmitriyshashkin
Copy link

I guess it's related to this bug https://bugs.python.org/issue12178 Opened in 2011

@mroeschke mroeschke added the Bug label May 1, 2021
@Lauler
Copy link

Lauler commented Jul 3, 2021

I guess it's related to this bug https://bugs.python.org/issue12178 Opened in 2011

Looks like this is finally getting fixed in python 3.10?

Date: 2020-09-20 06:40 Author: Tal Einat: After a great deal of delay, a fix for this has finally been merged, and will be available in Python 3.10. Thanks to everyone involved!

R's read.csv seems to handle things properly already. This issue affects me a lot when writing large amounts of machine generated OCR output to csv and trying to read back the same file. R handles the reading/writing properly but unfortunately pd.read_csv doesn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

8 participants