DataFrame.to_csv not using correct line terminator value #20353

Closed
bbibentyo opened this issue Mar 14, 2018 · 15 comments · Fixed by #21406
Labels
IO CSV read_csv, to_csv Windows Windows OS
Milestone

Comments

@bbibentyo

bbibentyo commented Mar 14, 2018

Method 1:

        data_frame.to_csv(file_path, sep=self.delimiter, float_format='%.2f',
                                  index=False, line_terminator='\n')

Method 2:

        with open(file_path, mode='w', newline='\n') as f:
            data_frame.to_csv(f, sep=self.delimiter, float_format='%.2f',
                              index=False)

Problem description

I noticed strange behavior when using the pandas.DataFrame.to_csv method on Windows (pandas version 0.20.3). When the method is called with a file path, as in Method 1, it creates a new file using the \r line terminator. I had to use Method 2 to make it work.

@gfyoung gfyoung added IO CSV read_csv, to_csv Windows Windows OS labels Mar 18, 2018
@gfyoung
Member

gfyoung commented Mar 18, 2018

Can you provide a reproducible example? I suspect that this is a Windows problem of unnecessarily inserting carriage returns, but I can't confirm that yet.

@lautjy

lautjy commented Apr 12, 2018

Same problem for me on Windows 10,
Python 3.6.3 :: Anaconda custom (64-bit),
Pandas: 0.20.3

  • Method 1 does NOT respect line_terminator='\n' . Resulting CSV file has Windows line-endings \r\n.
  • Method 2 works. Resulting CSV file has Unix line-endings. (thanks @bibenb1 )
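
To check which line endings a file actually contains, it helps to read it in binary mode so that Python does no newline translation on the way in. The following is a minimal sketch; `count_line_endings` is a hypothetical helper name and is not part of pandas.

```python
def count_line_endings(raw: bytes) -> dict:
    """Count CRLF pairs, bare LF bytes, and bare CR bytes in raw file contents."""
    crlf = raw.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": raw.count(b"\n") - crlf,  # LF bytes not preceded by CR
        "CR": raw.count(b"\r") - crlf,  # CR bytes not followed by LF
    }


# Two Windows-style endings followed by one Unix-style ending.
sample = b"a,b\r\n1,2\r\n3,4\n"
print(count_line_endings(sample))  # {'CRLF': 2, 'LF': 1, 'CR': 0}
```

For a real file, pass in the result of `open(path, "rb").read()`; opening in text mode would hide the difference.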

@deflatSOCO
Contributor

deflatSOCO commented May 27, 2018

I found that this problem is still in v0.23.
Here is some information from my Windows environment.

Code Sample, a copy-pastable example if possible

initialization
  • code
import pandas as pd
data = pd.DataFrame({
    "integer":[1,2,3],
    "string_with_lf":["abc","d\nef","g\nh\n\ni"],
    "char":["X","Y","Z"]
})
print(data)
  • output
   integer string_with_lf char
0        1            abc    X
1        2          d\nef    Y
2        3      g\nh\n\ni    Z
Method 1
  • code
data.to_csv("test.csv", sep=",", float_format='%.2f',index=False, line_terminator='\n',encoding='utf-8')
print(pd.read_csv("test.csv"))
print("-------")
with open("test.csv", mode='rb') as f:
    print(f.read())
print(data)
  • output
   integer   string_with_lf char
0        1              abc    X
1        2          d\r\nef    Y
2        3  g\r\nh\r\n\r\ni    Z
-------
b'integer,string_with_lf,char\r\n1,abc,X\r\n2,"d\r\nef",Y\r\n3,"g\r\nh\r\n\r\ni",Z\r\n'
Method 2
  • code
with open("test2.csv", mode='w', newline='\n') as f:
    data.to_csv(f, sep=",", float_format='%.2f',index=False, line_terminator='\n',encoding='utf-8')
 
print(pd.read_csv("test2.csv"))
print("-------")
with open("test2.csv", mode='rb') as f:
    print(f.read())
  • output
   integer string_with_lf char
0        1            abc    X
1        2          d\nef    Y
2        3      g\nh\n\ni    Z
-------
b'integer,string_with_lf,char\n1,abc,X\n2,"d\nef",Y\n3,"g\nh\n\ni",Z\n'

Problem description

As seen in the "Method 1" sample, when to_csv() is called directly with a file path, every \n (both inside the elements and as the line terminator) is converted to \r\n, even though line_terminator='\n' is set.

Expected Output

Expect "Method 1" to output the csv described in "Method 2" sample.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.3.2
pip: 10.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Related issues

@gfyoung
Member

gfyoung commented May 27, 2018

So this is definitely a Windows thing. I ran your samples on a Linux machine, and they produce the same output as your Method 2 in both cases. Thus, it would seem that Windows is quietly inserting carriage returns in the file.

@jreback
Contributor

jreback commented May 27, 2018

this is a duplicate of #17365 - though these examples are slightly better so can close the other issue

further i suspect we have this issue recorders elsewhere - can u search

@deflatSOCO
Contributor

Sorry for the late reply.
I'm new to GitHub and this community, so I don't know what "issue recorder" means.

@gfyoung
Member

gfyoung commented Jun 9, 2018

I'm new to GitHub and this community, so I don't know what "issue recorder" means.

@deflatSOCO : Don't worry about that. That was more addressed to me. Actually, he most likely meant "recorded" instead of "recorders".

@jreback : There are similar-sounding issues, but they are all either closed or insufficiently overlapping.

@deflatSOCO
Contributor

@gfyoung Thanks for the reply.
Can I submit a PR for this one?

@gfyoung
Member

gfyoung commented Jun 10, 2018

@deflatSOCO : Go for it!

deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Jun 10, 2018
deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Jun 10, 2018
@WillAyd
Member

WillAyd commented Jun 11, 2018

Late to comment but why is this considered an issue? From a Python perspective isn't \n always expected to map to the platform's required line ending unless universal newline mode is specifically disabled?

Method 1 is using universal line mode so \n should become \r\n on Windows. Method 2 disables universal line mode hence why it's only \n. Both seem correct?

@gfyoung
Member

gfyoung commented Jun 11, 2018

@WillAyd : The issue is that if you specify line_terminator (as we do by default to be \n), the line terminator should arguably be \n, independent of the OS.

@WillAyd
Member

WillAyd commented Jun 11, 2018

Hmm OK. So is the expectation that line_terminator also disables universal newline support? Just want to be clear on expectations as if you take pandas out of the picture the below would be equivalent to the original issue (correct me if I am wrong):

# Matches method 1, i.e. would still write \r\n
with open('somefile.txt', 'w') as fn:
    fn.write("foo" + "\n")

# Matches method 2
with open('somefile.txt', 'w', newline="\n") as fn:
    fn.write("foo" + "\n")

Just wondering if it's not preferable to add newline as a keyword and pass through to match the Python API rather than having line_terminator take that on implicitly

@gfyoung
Member

gfyoung commented Jun 11, 2018

line_terminator refers to the newline character, so the answer would be no in light of those semantics. In addition, I'm pretty wary of adding more keyword arguments to this already large signature unless absolutely necessary.

@jreback jreback added this to the 0.24.0 milestone Jun 13, 2018
@deflatSOCO
Contributor

deflatSOCO commented Jun 19, 2018

The current flow of to_csv() seems to be

  1. Dump data to a string buffer using csv.writer with the option lineterminator=self.line_terminator
  2. Open output file with universal newline support
  3. Save the contents of string buffer to the output file

I observed that the string buffer in step 1 uses '\n' as expected, but each '\n' is changed to '\r\n' after steps 2 and 3, since universal newline support always converts '\n' to '\r\n' when writing on Windows.
From that, I suppose that to make line_terminator='\n' work on Windows when path_or_buf is a file name string (as in Method 1), we cannot disable the universal newline support of open() unless we change the flow of to_csv() itself.
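
The three steps above can be imitated with the standard library alone. This is only a sketch of the described flow, not the actual pandas code: `dump_rows` is a hypothetical helper, and `newline="\r\n"` is used to stand in for what the default `newline=None` does when writing on Windows, so that the effect is visible on any platform.

```python
import csv
import io
import os
import tempfile


def dump_rows(rows, file_newline):
    """Serialize rows to a buffer, copy the buffer to a file, return (text, bytes)."""
    # Step 1: dump the data to a string buffer with '\n' as the line terminator.
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    text = buffer.getvalue()

    # Steps 2 and 3: open the output file in text mode and save the buffer to it.
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline=file_newline) as f:
            f.write(text)
        with open(path, "rb") as f:  # binary mode: no translation when reading
            return text, f.read()
    finally:
        os.remove(path)


rows = [["integer", "char"], [1, "X"], [2, "Y"]]

# Assumption: newline="\r\n" mimics the translation done by newline=None on Windows.
text, translated = dump_rows(rows, "\r\n")
# newline="" turns translation off, so the buffer is written unchanged.
_, untouched = dump_rows(rows, "")

print(repr(text))
print(translated)
print(untouched)
```

The buffer holds '\n' in both runs; only the `newline` argument passed to `open()` decides whether the bytes on disk keep it.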

Again, I'm new to this community, so please correct me if I have misunderstood anything.

If we need more discussion, I suppose I should close my PR until we decide how to resolve this issue. Is that OK?

@gfyoung
Member

gfyoung commented Jun 19, 2018

@deflatSOCO : No need to close. We're just busy people, so sometimes issues / PRs go dark for a little while. Thanks for pinging us again!

deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Jun 24, 2018
* re-defined testcases that suits conversations in PR pandas-dev#21406
* changed default value of line_terminator to os.linesep
* changed API document of DataFrame.to_csv
* changed "newline" value of "open()" from '\n' to ''
* Updated whatsnew document

related pages:
* Issue pandas-dev#20353
* PR pandas-dev#21406
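
The two behavioral changes listed in this commit message (default to os.linesep, open the file with newline='') can be illustrated outside pandas. What follows is a rough sketch under that assumption, not the code from the PR; `write_rows` and `written_bytes` are hypothetical names.

```python
import csv
import os
import tempfile


def write_rows(path, rows, line_terminator=None):
    """Write rows as CSV, honoring line_terminator exactly as given."""
    if line_terminator is None:
        # No terminator requested: fall back to the platform's own line ending.
        line_terminator = os.linesep
    # newline="" stops open() from translating '\n', so csv.writer has full control.
    with open(path, "w", newline="") as f:
        csv.writer(f, lineterminator=line_terminator).writerows(rows)


def written_bytes(rows, line_terminator=None):
    """Round-trip helper: write to a temporary file and return its raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        write_rows(path, rows, line_terminator)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


rows = [["integer", "string_with_lf"], [2, "d\nef"]]
print(written_bytes(rows, line_terminator="\n"))
```

With an explicit '\n', both the row terminators and the newline embedded in the quoted field should come out as a bare '\n' on any OS, which is the result "Method 1" was expected to give.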
deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Jun 26, 2018
* Updates:
  * Updated expected values for some tests about 'to_csv()' method, to 
deal with new default value of 'line_terminator' arg.

* Related Issue:
  * Issue pandas-dev#20353
  * PR pandas-dev#21406
deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Jul 4, 2018
deflatSOCO added a commit to deflatSOCO/pandas that referenced this issue Aug 11, 2018
* Added test for new test util `convert_rows_list_to_csv_str`
* Edited what's new

Related Issue: pandas-dev#20353
gfyoung added a commit to deflatSOCO/pandas that referenced this issue Oct 18, 2018
* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes pandas-devgh-20353.
gfyoung added a commit to deflatSOCO/pandas that referenced this issue Oct 19, 2018
* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes pandas-devgh-20353.
jreback pushed a commit that referenced this issue Oct 19, 2018
* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes gh-20353.
tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018
* Use OS line terminator if none is provided
* Enforce line terminator selection if one is

Originally authored by @deflatSOCO, but reapplied
by @gfyoung due to enormous merge conflicts.

Closes pandas-devgh-20353.

6 participants