to_csv does not always handle line_terminator correctly #17365

Closed
kevinsa5 opened this issue Aug 28, 2017 · 12 comments

Labels
IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string Windows Windows OS

Comments

@kevinsa5

Code Sample, a copy-pastable example if possible

def hex_print(content):
    print ' '.join(['{0:02x}'.format(ord(i)) for i in content])
    print ' '.join(['{:>2}'.format(repr(i).replace("'", '')) for i in content])
    print ' '

import pandas as pd
import tempfile

filename = tempfile.NamedTemporaryFile(delete = False).name

df = pd.DataFrame({'x':[1]})
for sep in ['\n', '\r', '\r\n', 'F']:
    print 'with separator: {} ~~~~~~~~~~~~~~~~~~~~~~~~'.format(repr(sep))
    df.to_csv(filename, line_terminator = sep)        
    with open(filename, 'rb') as f:
        content = f.read()
    print 'file method:'
    hex_print(content)
    print 'string method:'
    hex_print(df.to_csv(line_terminator = sep))

Problem description

It seems that to_csv does not always handle the line_terminator argument correctly. The above code prints out the hexified CSV data produced from several different calls to to_csv. In particular, passing \n in fact produces \r\n, and \r\n becomes \r\r\n. Note also that this only happens when writing to a file, not when the CSV data is returned directly as a string.

However, this seems to be OS-dependent as well -- I have reproduced it on several machines running Windows 7, Python 2.7, and various versions of pandas (including 0.20.1), but on a Linux VM, it works as expected.

Output of above code:

with separator: '\n' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 0a 30 2c 31 0d 0a
 ,  x \r \n  0  ,  1 \r \n

string method:
2c 78 0a 30 2c 31 0a
 ,  x \n  0  ,  1 \n

with separator: '\r' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 30 2c 31 0d
 ,  x \r  0  ,  1 \r

string method:
2c 78 0d 30 2c 31 0d
 ,  x \r  0  ,  1 \r

with separator: '\r\n' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 0d 0d 0a 30 2c 31 0d 0d 0a
 ,  x \r \r \n  0  ,  1 \r \r \n

string method:
2c 78 0d 0a 30 2c 31 0d 0a
 ,  x \r \n  0  ,  1 \r \n

with separator: 'F' ~~~~~~~~~~~~~~~~~~~~~~~~
file method:
2c 78 46 30 2c 31 46
 ,  x  F  0  ,  1  F

string method:
2c 78 46 30 2c 31 46
 ,  x  F  0  ,  1  F
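
For anyone re-running this comparison on Python 3, where a file read in 'rb' mode yields bytes rather than str, the hex_print helper above needs a small adaptation. The following is only a sketch of that adaptation; the name hex_dump is hypothetical and not part of the original report.

```python
def hex_dump(data: bytes) -> str:
    """Return two aligned lines: hex byte values, then the escaped characters."""
    # Iterating over bytes yields ints on Python 3, so no ord() is needed.
    hex_line = ' '.join('{:02x}'.format(b) for b in data)
    # repr() turns control characters into visible escapes such as \r and \n.
    char_line = ' '.join(
        '{:>2}'.format(repr(chr(b)).replace("'", '')) for b in data
    )
    return hex_line + '\n' + char_line


print(hex_dump(b',x\r\n0,1\r\n'))
```

Passing the bytes read back from the file, or the encoded result of the string method, should give output in the same two-line layout as shown above.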

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 35.0.1
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.5
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.1.7
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added IO CSV read_csv, to_csv Output-Formatting __repr__ of pandas objects, to_string labels Aug 28, 2017
@gfyoung
Member

gfyoung commented Aug 28, 2017

@kevinsa5 : Thanks for reporting this! The fact that you only see this on Windows is, I think, telling, and could indicate that it is not a pandas issue. Here is a link that might clear this up for you:

https://stackoverflow.com/questions/7013034/does-windows-carriage-return-r-n-consist-of-two-characters-or-one-character

If you can find a version of pandas where this doesn't occur, then that's a regression, but as it stands, I don't believe it is a problem on our end.

@kevinsa5
Author

@gfyoung : Thanks for the quick reply! I am familiar with how Windows uses \r\n as its newline character combo, while other OSs use only \n. However, I do not believe that is the issue here.

If you take a close look at the script output I posted above, for some values of line_terminator, to_csv will produce different output depending on if you're sending to a file or not. From this alone, I believe it to be a bug in pandas, regardless of OS linesep convention -- I would not expect the destination of the CSV data to affect the data itself.

I should mention: I am immensely grateful for all the work the pandas devs have put into this library. Thank you very much for your continued effort.

@kevinsa5
Author

@gfyoung After further reading, I think you may be correct. Is it true that on Windows, if you write the character '\n' to a file, the OS may actually insert '\r\n' into the file? Does this depend on how the file is opened?

Maybe I've been spoilt by Linux, but it seems unacceptable to me for an OS to silently change bytes that you send to disk.
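
On the question of whether this depends on how the file is opened: it does. The translation is applied by the text-mode layer of the file object, not to bytes written in binary mode. Below is a minimal Python 3 sketch (the thread itself is about Python 2.7, so treat this as an illustration rather than a reproduction). It uses the newline argument of the built-in open to force, on any OS, the same '\n' to '\r\n' translation that Windows text mode applies by default. The helper name written_bytes is hypothetical.

```python
import os
import tempfile


def written_bytes(text, **open_kwargs):
    """Write text to a temp file in text mode and return the raw bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w', **open_kwargs) as f:
            f.write(text)
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)


# newline='\r\n' translates every '\n' written into '\r\n'.
print(written_bytes(',x\n0,1\n', newline='\r\n'))
# A terminator that is already '\r\n' therefore ends up as '\r\r\n'.
print(written_bytes(',x\r\n0,1\r\n', newline='\r\n'))
# newline='' disables translation, so the text is written as given.
print(written_bytes(',x\r\n0,1\r\n', newline=''))
```

The second call mirrors the \r\r\n seen in the file method output for line_terminator='\r\n'.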

@gfyoung
Member

gfyoung commented Aug 28, 2017

Does this depend on how the file is opened?

Not entirely sure, to be honest. However, as someone who has worked on a Windows computer from a Linux-based repository, I can say for certain that I have seen these carriage returns sneak into diffs merely from cloning the repository.

Maybe I've been spoilt by Linux, but it seems unacceptable to me for an OS to silently change bytes that you send to disk.

As somebody who has worked in both the Linux and Windows worlds, I'd say you are perfectly entitled to expect that bytes not be "corrupted" like this. There's a reason why Linux is generally preferred for developers 😄

@chris-b1
Contributor

https://stackoverflow.com/questions/3191528/csv-in-python-adding-an-extra-carriage-return

On Python 2, it looks like we should be opening the file in mode 'b' when passing it to csv. A PR to fix this is welcome.

@chris-b1 chris-b1 added this to the Next Major Release milestone Aug 29, 2017
@kevinsa5
Author

Indeed, changing the above code to df.to_csv(filename, line_terminator = sep, mode='wb') does solve the issue on my end. Whether a PR should be as simple as changing the default value or not, I'm not qualified to say:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L1448

Thank you both. Pandas is a fantastic piece of software.

@jreback
Contributor

jreback commented Aug 29, 2017

Actually, if you look, the other highly starred answer in the SO post might work better, e.g. passing line_terminator to the csv.writer itself.

@kevinsa5
Author

Note that the argument to to_csv does get passed through to the csv.writer:

https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1586
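
That the terminator reaches csv.writer and the output is still doubled is consistent with the translation happening in the text layer underneath the writer. Here is a minimal Python 3 sketch of that interaction, using only the standard library (the thread concerns Python 2.7, so this is an analogy, not the pandas code path). Note that csv.writer spells its keyword lineterminator, without the underscore. The helper csv_bytes is hypothetical.

```python
import csv
import io


def csv_bytes(rows, lineterminator, newline):
    """Write rows with csv.writer through a text layer; return the encoded bytes."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='utf-8', newline=newline)
    csv.writer(text, lineterminator=lineterminator).writerows(rows)
    text.flush()
    data = raw.getvalue()
    text.detach()  # keep the wrapper from closing raw when it is collected
    return data


rows = [['', 'x'], [0, 1]]
# Text layer translating '\n' to '\r\n' (the Windows default): doubled '\r'.
print(csv_bytes(rows, lineterminator='\r\n', newline='\r\n'))
# Text layer with translation disabled: the terminator is written as given.
print(csv_bytes(rows, lineterminator='\r\n', newline=''))
```

This is why the csv module documentation recommends opening files with newline='' on Python 3, and why binary mode was the suggestion for Python 2.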

@jreback jreback changed the title to_csv does not always handle line_terminator correctly to_csv does not always handle line_terminator correctly Aug 29, 2017
@jreback jreback added the Windows Windows OS label Aug 29, 2017
@jreback
Contributor

jreback commented Aug 29, 2017

@kevinsa5 if you want, submit a PR with the change (opening the file in binary mode on Windows when line_terminator is specified) and see if it passes tests.

@ingmars

ingmars commented Feb 15, 2018

Hi everyone, I'm experiencing the same problem with pandas 0.20.3 on Windows 7. However, mode='wb' might be a dangerous fix, as it then crashes when an encoding is set, such as encoding='utf-8', saying: "ValueError: binary mode doesn't take an encoding argument".

It would be nice if there was a workaround making both line_terminator and encoding work at the same time.
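
One possible workaround, sketched here under assumptions rather than taken from pandas itself: since the string returned by to_csv was shown earlier in this thread to carry the requested terminator correctly, you can encode that string yourself and write the bytes in binary mode. The helper write_exact and the sample text are hypothetical; the sample stands in for the result of df.to_csv(line_terminator='\r\n').

```python
import os
import tempfile


def write_exact(path, text, encoding='utf-8'):
    """Encode text ourselves and write the bytes in binary mode, untranslated."""
    with open(path, 'wb') as f:
        f.write(text.encode(encoding))


# Stand-in for csv_text = df.to_csv(line_terminator='\r\n')
csv_text = ',x\r\n0,\u00e9\r\n'

fd, path = tempfile.mkstemp()
os.close(fd)
write_exact(path, csv_text, encoding='utf-8')
with open(path, 'rb') as f:
    data = f.read()
os.remove(path)
print(data)
```

Because the encoding is applied before the file is opened, binary mode never sees an encoding argument, and no newline translation takes place. On Python 3, another option may be to open the file yourself with open(path, 'w', encoding='utf-8', newline='') and pass the handle to to_csv; that variant is not tested here.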

@redx177

redx177 commented Apr 18, 2018

I am experiencing the same issue.

With line_terminator='\r\n', mode='w', I am getting the line endings \r\r\n

When I am using line_terminator='\r\n', mode='wb', I am getting the error:

File "C:[...]\site-packages\pandas\io\common.py", line 332, in _get_handle
f = open(path, mode, errors='replace')
ValueError: binary mode doesn't take an errors argument

And the same as @ingmars when I try to set an encoding with line_terminator='\r\n', mode='wb', encoding='utf-8'.

In the end, I settled on not specifying any line_terminator. I'm not happy with that, but I will fix the line endings with other tools after the file has been written.

All of this is on Windows 10, Python 3.5, pandas 0.19.2.
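
If the file has already been written with doubled carriage returns, the after-the-fact fix mentioned above could look roughly like this in Python 3. It is a blunt sketch: it rewrites every \r\r\n byte sequence, so it assumes no field legitimately contains that sequence. The function names are hypothetical.

```python
import os
import tempfile


def collapse_doubled_cr(data: bytes) -> bytes:
    """Turn each '\\r\\r\\n' sequence back into '\\r\\n'."""
    return data.replace(b'\r\r\n', b'\r\n')


def fix_file(path):
    """Rewrite a file in place, using binary mode so nothing is translated again."""
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(collapse_doubled_cr(data))


# Demonstration on a temp file containing the faulty endings from this thread.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, 'wb') as f:
    f.write(b',x\r\r\n0,1\r\r\n')
fix_file(demo_path)
with open(demo_path, 'rb') as f:
    fixed = f.read()
os.remove(demo_path)
print(fixed)
```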

@gfyoung
Member

gfyoung commented Jun 3, 2018

Per #20353 (comment) from @jreback , let's move the discussion to #20353.

@gfyoung gfyoung closed this as completed Jun 3, 2018