
to_csv regression in 0.23.1 #21471

Closed
francois-a opened this Issue Jun 14, 2018 · 12 comments

@francois-a

francois-a commented Jun 14, 2018

Writing to gzip no longer works with 0.23.1:

with gzip.open('test.txt.gz', 'wt') as f:
    pd.DataFrame([0,1],index=['a','b'], columns=['c']).to_csv(f, sep='\t')

produces corrupted output. This works fine in 0.23.0.

Presumably this is related to #21241 and #21118.

@WillAyd WillAyd added the Needs Info label Jun 14, 2018

@Liam3851


Contributor

Liam3851 commented Jun 14, 2018

@WillAyd Francois's example is reproducible for me on Windows 7 using master. The output file test.txt.gz is empty instead of containing data.

If I let pandas do the compression it appears to work fine:

df = pd.DataFrame([0,1],index=['a','b'], columns=['c'])
df.to_csv('C:/temp/test.txt.gz', sep='\t', compression='gzip')
@saidie


saidie commented Jun 14, 2018

Hi,
I also ran into a to_csv problem on 0.23.1, although my case is different from the others:

import sys
import pandas as pd
df = pd.DataFrame([0,1])
df.to_csv(sys.stdout)

This code writes the dataframe to a file literally named "<stdout>" instead of printing it to standard output.

@wildraid


wildraid commented Jun 14, 2018

I also have a problem with "to_csv" specifically on 0.23.1.

It looks like "_get_handle()" returns "f" as a file descriptor number (an int) instead of a file object.

            # GH 17778 handles zip compression for byte strings separately.
            buf = f.getvalue()
            if path_or_buf:
                f, handles = _get_handle(path_or_buf, self.mode,
                                         encoding=encoding,
                                         compression=self.compression)
                f.write(buf)
                f.close()

Error text:

  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 168, in save
    f.write(buf)
AttributeError: 'int' object has no attribute 'write'
@WillAyd


Member

WillAyd commented Jun 14, 2018

@Liam3851 thanks - I misread the original post so I see the point now.

@saidie and @wildraid please do not add distinct issues to this. If you feel you have a different issue please open it separately

@WillAyd WillAyd added Regression IO CSV and removed Needs Info labels Jun 14, 2018

@wildraid


wildraid commented Jun 14, 2018

@WillAyd, I did some quick research.

It seems that all file-like objects which cannot be converted to string file paths are affected. Gzip wrappers, stdout, file descriptors - all of these problems have the same origin.

Example with FD:

import pandas
import os

with os.fdopen(3, 'w') as f:
    print(f)
    pandas.DataFrame([0, 1]).to_csv(f)

Output:

<_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
Traceback (most recent call last):
  File "gg.py", line 6, in <module>
    pandas.DataFrame([0, 1]).to_csv(f)
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
    f.write(buf)
AttributeError: 'int' object has no attribute 'write'

I guess the integer comes from the "name" attribute of the TextIOWrapper. For stdout it will be "<stdout>", etc.
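That guess is easy to check outside of pandas: a file object created from a raw file descriptor reports the fd number, not a path, as its "name" attribute (a standalone sketch, not pandas code):

```python
import os

# A TextIOWrapper created from a raw file descriptor exposes the fd number
# (an int) as its "name" attribute, which is what pandas then tries to
# treat as a file path.
r, w = os.pipe()
f = os.fdopen(w, 'w')
print(type(f.name), f.name)  # <class 'int'> and the fd number
f.close()
os.close(r)
```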

@minggli


Contributor

minggli commented Jun 14, 2018

I think the issue is caused by #21249, in response to #21227.

The correct usage when passing a file handle and expecting compression is francois-a's case: i.e. passing a gzip file handle or another compressed archive's file handle.

@dweawyn


dweawyn commented Jun 14, 2018

Writing to TemporaryFile fails as well. The file remains empty:

import tempfile
import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])
with tempfile.TemporaryFile() as f:
    df.to_csv(f)
    f.seek(0)
    print(f.read())

@TomAugspurger TomAugspurger added this to the 0.23.2 milestone Jun 14, 2018

@ernoc


ernoc commented Jun 14, 2018

Hi, here are some additional examples of the changes in the behaviour of to_csv.

A common use case is to write a file header once and then write many dataframes' data to that file. Our implementation looks like this:

df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [1.0, 2.0, 3.0],
})
df2 = ... 

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    df.to_csv(f, index=False, header=False)
    ...
    df2.to_csv(f, ...
    df3.to_csv(f, ...

This works in 0.23.0 but in 0.23.1 it produces a file that looks like this:

col1,col2
0
3,3.0

What happened here is that pandas has opened a second handle to the same file path in write mode, and our f.write line was flushed last, overwriting some of what pandas wrote.

Flushing alone would not help because now pandas will overwrite our data:

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False)

produces:

1,1.0
2,2.0
3,3.0

One workaround is both flushing manually AND telling pandas to append (mode='a'):

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False, mode='a')

IMO this is not expected behaviour: if we give pandas an open file handle, we don't expect pandas to find out what the original path was, and open it again on a second file handle.

This is the bit of code where the re-opening is decided: https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/csvs.py#L139 . This gives the "<stdout>" behaviour pointed out by @saidie. Data is written to a StringIO first; finally, the file is opened again by path and the data in the StringIO is written to it.
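The re-open-by-path mechanism described above can be sketched roughly like this (a simplified illustration, not the actual pandas code; save_like_0231 is a made-up name):

```python
import io


def save_like_0231(csv_text, path_or_buf):
    # (hypothetical sketch) data is first rendered into an in-memory buffer...
    buf = io.StringIO()
    buf.write(csv_text)
    # ...then the target is re-opened *by name* in write mode, truncating the
    # file and clobbering anything the caller wrote through their own handle
    target = getattr(path_or_buf, 'name', path_or_buf)
    with open(target, 'w') as out:
        out.write(buf.getvalue())


with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    save_like_0231('1,1.0\n2,2.0\n3,3.0\n', f)
# the file now contains only the data rows; the flushed header was overwritten
```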

@jorisvandenbossche


Member

jorisvandenbossche commented Jun 14, 2018

Thanks all for the reports!
There is a PR now that tries to fix this: #21478. Trying it out or reviewing it is certainly welcome.

@minggli


Contributor

minggli commented Jun 14, 2018

Hello, I raised a PR to remedy this issue; testing and review are welcome. For the reports from @francois-a and @saidie, and the other reproducible cases, this patch should fix it.

For now, a workaround is to use a file path or a StringIO.
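For the gzip case from the original report, the StringIO workaround looks something like this (a sketch of the suggested workaround, not an official recipe):

```python
import gzip
import io

import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])

# render the CSV into an in-memory buffer first, then compress it ourselves,
# so to_csv never sees the gzip file handle
buf = io.StringIO()
df.to_csv(buf, sep='\t')
with gzip.open('test.txt.gz', 'wt') as f:
    f.write(buf.getvalue())
```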

@WillAyd


Member

WillAyd commented Jun 19, 2018

Closed via #21478
