DataFrame to_csv line_terminator inconsistency when using compression #25311
Code Sample, a copy-pastable example if possible
df.to_csv('uncompressed.csv')
df.to_csv('compressed-wrong-line-terminator.csv.gz')
df.to_csv('compressed-good-line-terminator.csv.gz', line_terminator='\n')
The default line_terminator differs depending on whether compression is used (Windows OS, pandas 0.24.1).
After decompressing the gzip file created with the default line_terminator, the result clearly differs from the uncompressed output (compressed-wrong-line-terminator.csv vs. uncompressed.csv). Only with an explicit line_terminator='\n' is the decompressed file identical to the uncompressed one (compressed-good-line-terminator.csv.gz vs. uncompressed.csv).
Note that passing an explicit line_terminator='\n' for a non-compressed file produces output that differs from the file created without it, so the user is forced to specify line_terminator explicitly only for compressed files.
This behavior is problematic, especially in the latest pandas version, where compression is inferred from the file extension; one would expect the line_terminator to be inferred in the same way.
As stated above, the expectation is that the command on line 2 of the code sample (after decompressing its output) produces the same file as the command on line 1.
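The comparison described above can be checked at the byte level. The following is a minimal sketch using only the standard library; the helper name decompressed_matches and the sample bytes are made up for illustration and are not taken from the report, which does not show the actual file contents.

```python
import gzip
import os
import tempfile


def decompressed_matches(plain_path, gz_path):
    """Return True if gz_path decompresses to exactly the bytes of plain_path."""
    with open(plain_path, 'rb') as plain, gzip.open(gz_path, 'rb') as packed:
        return plain.read() == packed.read()


# Demonstration with hand-written bytes (hypothetical data, no pandas involved).
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, 'uncompressed.csv')
    good_gz = os.path.join(tmp, 'good.csv.gz')
    bad_gz = os.path.join(tmp, 'bad.csv.gz')

    with open(plain_path, 'wb') as fh:
        fh.write(b'a,b\r\n1,2\r\n')
    with gzip.open(good_gz, 'wb') as fh:
        fh.write(b'a,b\r\n1,2\r\n')      # same terminators as the plain file
    with gzip.open(bad_gz, 'wb') as fh:
        fh.write(b'a,b\r\r\n1,2\r\r\n')  # assumed shape of a mismatching file

    good_result = decompressed_matches(plain_path, good_gz)
    bad_result = decompressed_matches(plain_path, bad_gz)

print(good_result, bad_result)
```

Opening both files in binary mode matters here: text mode would translate newlines on read and could hide the very difference being tested.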
I've the same problem.
import pandas as pd
import numpy as np
d = pd.DataFrame(np.random.randint(1,10,size=(10,10)), columns=list('qwertyuiop'))
d.to_csv('foo.csv.gz', index=False)
The saved file has two line terminators in each line.
The saved file in the example:
Saving without compression works as expected.
Setting the line_terminator explicitly resolves it.
It seems to be limited to Windows: I've tried on Linux with the same pandas version and it does not have the same problem.
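To see which terminators a saved file actually contains, the raw bytes can be inspected directly. This is a rough sketch with the standard library only; the helper names read_raw and count_line_endings are hypothetical, and treating "two line terminators" as the byte sequence \r\r\n is an assumption, since the file contents are not shown in this thread.

```python
import gzip


def read_raw(path):
    """Read a file as bytes, decompressing it first if the name ends in .gz."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as fh:
        return fh.read()


def count_line_endings(data):
    """Count doubled (CR CR LF), Windows (CR LF) and Unix (LF) endings in bytes."""
    crcrlf = data.count(b'\r\r\n')
    # Every CR CR LF also contains one CR LF, so subtract those.
    crlf = data.count(b'\r\n') - crcrlf
    # Every LF that belongs to one of the sequences above is subtracted too.
    lf = data.count(b'\n') - crlf - crcrlf
    return {'CRCRLF': crcrlf, 'CRLF': crlf, 'LF': lf}
```

For the file from the example this would be called as count_line_endings(read_raw('foo.csv.gz')); a non-zero CRCRLF count would point to doubled carriage returns.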
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
After carefully reading the details of #25048, it seems that that issue refers to a scenario where a file handler is passed to pandas.to_csv().
I believe that we create a file handler internally when we detect a compression scheme, and so go down the same path as the other issue. Is that right?…
On Wed, Mar 6, 2019 at 9:52 AM jointfull wrote:
After carefully reading the details of #25048, it seems that that issue refers to a scenario where a file handler is passed to pandas.to_csv(). However, in my case, the call to pandas.to_csv() is with a filename (and not a file handler), and it behaves differently when given a filename that ends with .gz (inferring a request for a compressed file). I believe we are talking about two different problems here (unless it is proven that they originate from the same bug and that fixing one fixes the other too).
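Whether the two issues share a cause is not settled in this thread. As a general illustration of how a text-mode file handle can double a carriage return, here is a small standard-library sketch. It does not reproduce pandas internals; the function name write_through_text_layer is made up, and passing newline='\r\n' is an assumption used to emulate Windows-style newline translation on any platform.

```python
import io


def write_through_text_layer(text, newline):
    """Write text through a TextIOWrapper and return the bytes that reach the raw stream."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding='utf-8', newline=newline)
    wrapper.write(text)
    wrapper.flush()
    return raw.getvalue()


# A row that already ends in '\r\n' gets its '\n' translated again
# when the text layer is configured to write '\r\n' for every '\n'.
doubled = write_through_text_layer('1,2\r\n', newline='\r\n')

# With newline='' the text layer performs no translation on write.
untouched = write_through_text_layer('1,2\r\n', newline='')

print(doubled, untouched)
```

If the compressed path does write already-terminated rows through a translating handle, this is the kind of output one would expect to see, but that remains a hypothesis here.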