File mode in to_csv is ignored, when passing a file object instead of a path #19827

Closed
colobas opened this issue Feb 21, 2018 · 21 comments · Fixed by #35129

Comments

@colobas colobas commented Feb 21, 2018

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> df = pd.read_csv("example.csv")
>>> df.head()
   just  a  file
0     1  2     3
1     4  5     6
2     7  8     9
>>> with open("someother.csv", "wb") as f:
...     df.to_csv(f, mode="wb")
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python3.6/site-packages/pandas/core/frame.py", line 1524, in to_csv
    formatter.save()
  File "/usr/lib/python3.6/site-packages/pandas/io/formats/format.py", line 1652, in save
    self._save()
  File "/usr/lib/python3.6/site-packages/pandas/io/formats/format.py", line 1740, in _save
    self._save_header()
  File "/usr/lib/python3.6/site-packages/pandas/io/formats/format.py", line 1708, in _save_header
    writer.writerow(encoded_labels)
TypeError: a bytes-like object is required, not 'str'

Problem description

When passing a file opened in binary mode to df.to_csv and also passing mode='wb', this mode is ignored. I think it's because of these lines: https://github.com/pandas-dev/pandas/blob/master/pandas/io/common.py#L407-L411 and these ones: https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/format.py#L1660-L1662

It seems that is_text isn't passed, and so it assumes the default value of True

Expected Output

A file should just be written.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.20-1-lts
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.2.3
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: 0.1.3
pandas_gbq: None
pandas_datareader: None

Contributor

@chris-b1 chris-b1 commented Feb 21, 2018

Isn't csv sort of fundamentally a text format? The python csv writer, which we call out to, is ultimately what's choking. Certainly could raise a better error.

In [36]: import csv

In [37]: writer = csv.writer(open('test.csv', mode='wb'))

In [38]: writer.writerow(['a', 'b', 'c'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-bb2ef92b0247> in <module>()
----> 1 writer.writerow(['a', 'b', 'c'])

TypeError: a bytes-like object is required, not 'str'

Author

@colobas colobas commented Feb 21, 2018

It is, but sometimes you have a filesystem-like interface (such as the one for Azure's Data Lake Store) that requires files to be written in binary mode, regardless of the format.

Contributor

@TomAugspurger TomAugspurger commented Feb 22, 2018

Does passing an ADL file-like object to to_csv, without the mode argument, work? I know you can pass an s3fs.S3File opened in binary mode to to_csv and everything works fine.

Author

@colobas colobas commented Feb 22, 2018

Hey @TomAugspurger, just tried it. Same error:

with adlfs_client.open("/dummy.csv", "wb") as f:
    dummy.to_csv(f)

yields

TypeError                                 Traceback (most recent call last)
<ipython-input-186-bbe6f623ffa5> in <module>()
      1 with adlfs_client.open("/dummy.csv", "wb") as f:
----> 2     dummy.to_csv(f)

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1522                                      doublequote=doublequote,
   1523                                      escapechar=escapechar, decimal=decimal)
-> 1524         formatter.save()
   1525 
   1526         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/format.py in save(self)
   1650                 self.writer = UnicodeWriter(f, **writer_kwargs)
   1651 
-> 1652             self._save()
   1653 
   1654         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/format.py in _save(self)
   1738     def _save(self):
   1739 
-> 1740         self._save_header()
   1741 
   1742         nrows = len(self.data_index)

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/format.py in _save_header(self)
   1706         if not has_mi_columns or has_aliases:
   1707             encoded_labels += list(write_cols)
-> 1708             writer.writerow(encoded_labels)
   1709         else:
   1710             # write out the mi

/opt/conda/lib/python3.6/site-packages/azure/datalake/store/core.py in write(self, data)
    849             raise ValueError('I/O operation on closed file.')
    850 
--> 851         out = self.buffer.write(ensure_writable(data))
    852         self.loc += out
    853         self.flush(syncFlag='DATA')

TypeError: a bytes-like object is required, not 'str'

For now the workaround is to create a temporary file and upload it explicitly, so it's not like I'm stuck, but it's just ugly.

Contributor

@TomAugspurger TomAugspurger commented Feb 22, 2018

I think it works for s3fs because of this line:

need_text_wrapping = (BytesIO, S3File)

Perhaps you can wrap your f in an io.TextIOWrapper?

I'm not sure what the best way to solve this is generically.

Contributor

@TomAugspurger TomAugspurger commented Feb 22, 2018

Perhaps we could just check the buffer for a mode, and if it's binary, we know we need to wrap it in a TextIOWrapper?
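
The mode check suggested here could be sketched roughly as follows. This is not how pandas does it: `as_text_handle` is a hypothetical helper, UTF-8 is an assumed encoding, and the standard-library csv.writer (which to_csv calls out to, per the earlier comment) stands in for to_csv so the snippet runs without pandas.

```python
import csv
import io
import os
import tempfile


def as_text_handle(buf, encoding="utf-8"):
    """Hypothetical helper: wrap buf in a TextIOWrapper if its mode is binary."""
    mode = getattr(buf, "mode", "")
    # Some file-like objects expose a non-string mode, so check the type first.
    if isinstance(mode, str) and "b" in mode:
        return io.TextIOWrapper(buf, encoding=encoding, newline="")
    return buf


path = os.path.join(tempfile.mkdtemp(), "dummy.csv")
with open(path, "wb") as f:
    text = as_text_handle(f)
    # Stand-in for dummy.to_csv(text)
    csv.writer(text, lineterminator="\n").writerow(["just", "a", "file"])
    text.flush()   # push buffered text down to the binary handle
    text.detach()  # leave f open for the enclosing with block to close

with open(path, "rb") as f:
    written = f.read()
```

Objects without a `mode` attribute (io.BytesIO, for example) pass through unchanged, so a check like this would not cover them on its own.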

Author

@colobas colobas commented Feb 22, 2018

I can confirm that this does work:

from io import TextIOWrapper

with adlfs_client.open("/dummy.csv", "wb") as f:
    buf = TextIOWrapper(f)
    dummy.to_csv(buf)

Thanks for the suggestion
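
A slightly more defensive variant of this workaround is sketched below, assuming the binary handle should stay open and usable after writing. `text_view` is a hypothetical helper, UTF-8 is an assumed encoding, io.BytesIO stands in for the ADL handle, and csv.writer stands in for to_csv so the snippet runs without pandas.

```python
import contextlib
import csv
import io


@contextlib.contextmanager
def text_view(binary_handle, encoding="utf-8"):
    """Hypothetical helper: yield a text wrapper around a binary handle.

    On exit the wrapper is flushed and detached, so the caller's binary
    handle stays open and has received everything that was written.
    """
    wrapper = io.TextIOWrapper(binary_handle, encoding=encoding, newline="")
    try:
        yield wrapper
    finally:
        wrapper.flush()
        wrapper.detach()


raw = io.BytesIO()  # stand-in for adlfs_client.open("/dummy.csv", "wb")
with text_view(raw) as buf:
    # With pandas this line would be: dummy.to_csv(buf)
    csv.writer(buf, lineterminator="\n").writerows([["a", "b"], [1, 2]])

payload = raw.getvalue()
```

Detaching matters because closing a TextIOWrapper also closes the handle it wraps.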

Contributor

@jreback jreback commented Feb 23, 2018

I suppose we could add a mini-example like this to the docs (io.rst). It is a useful case I think.
@colobas would you do a PR?

@jreback jreback added this to the Next Major Release milestone Feb 23, 2018
Contributor

@TomAugspurger TomAugspurger commented Feb 23, 2018

Author

@colobas colobas commented Feb 23, 2018

@TomAugspurger doesn't the mode parameter in to_csv stop having a purpose in that case?

EDIT: I had typed read_csv instead of to_csv

Contributor

@TomAugspurger TomAugspurger commented Feb 23, 2018

Author

@colobas colobas commented Feb 23, 2018

No I think there would be no harm. I can make a PR during the weekend

Contributor

@cbrnr cbrnr commented Jun 20, 2018

Appending to a file doesn't work for text mode either:

import pandas as pd


with open("test.csv", "w") as f:
    f.write("A,B,C\n")


df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [5, 6, 7, 8],
                   "C": [9, 10, 11, 12]})

with open("test.csv", "a") as f:
    df.to_csv(f, header=False, index=False)

Even though f is open in append mode, the file gets overwritten by to_csv.

However, appending does work when a file name is passed instead of a file handle:

df.to_csv("test.csv", header=False, index=False, mode="a")

I think this issue came up only recently, because I think this worked with a previous pandas version (although I didn't check).
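
For reference, the result one would expect from writing through a handle opened in append mode can be sketched with the standard-library csv.writer. This only illustrates the expected file contents; it does not call to_csv, so it does not reproduce the overwrite described above. The temporary path and the "\n" line terminator are assumptions made for the sketch.

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.csv")

# Write the header first, as in the example above.
with open(path, "w", newline="") as f:
    f.write("A,B,C\n")

# Same values as the DataFrame columns A, B and C, arranged row by row.
rows = zip([1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12])

# A handle opened with "a" positions every write at the end of the file.
with open(path, "a", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)

with open(path, newline="") as f:
    content = f.read()
```

The header line survives and the four data rows follow it, which is the output the file-handle call to to_csv was expected to give.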

@jmerkin jmerkin commented Jun 27, 2018

I updated my conda install after probably 6-12 months and this issue crept up for me, so I can confirm that it arose recently.

Roughly, I was doing the following (essentially as cbrnr describes):

with open(filename, 'w') as fout:
    for input in inputs:
        result = some_function(input)
        result.to_csv(fout)

Looking through the installed versions in anaconda2/pkgs/, I see pandas-0.20.1 and pandas-0.23.1. Previously, the code produced a single file containing all the results. Now, the resulting file contains only the results from the final iteration. The above code works fine with 0.20, but now misbehaves as cbrnr describes.

@mariusvniekerk mariusvniekerk commented Jul 2, 2018

Setting mode='a' does clear it in the second example. This seems to have changed between 0.22.0 and 0.23.1

@Huite Huite commented Jul 3, 2018

I've downgraded to 0.23.0 to check, and 0.23.0 works as expected. It appears to be specific to 0.23.1.

Contributor

@yrhooke yrhooke commented Jul 6, 2018

Testing on current dev version (pandas: 0.24.0.dev0+232.g04caa569e) on my machine:

append works fine.

Wrapping the file with TextIOWrapper inside the with open() block as suggested above doesn't raise an error, but doesn't write to file either.

The binary write mode problem appears to be an issue with the Python 3 csv module (see @chris-b1's comment).

Is there a particular fix that needs to happen? I'd be happy to work on it.
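
One possible explanation for the empty file (an assumption, not verified against that dev build) is that TextIOWrapper keeps its own buffer: if the underlying file is closed before the wrapper is flushed, the buffered text never reaches it. A minimal sketch of that buffering, with io.BytesIO standing in for the file opened in binary mode:

```python
import io

raw = io.BytesIO()  # stand-in for a file opened with mode "wb"
wrapper = io.TextIOWrapper(raw, encoding="utf-8", newline="")

wrapper.write("a,b,c\n")
before = raw.getvalue()  # small writes are still held in the wrapper's buffer

wrapper.flush()
after = raw.getvalue()   # the encoded bytes have now reached the binary handle

wrapper.detach()         # keep raw open after the wrapper goes away
```

If this is the cause, calling flush() on the wrapper before leaving the with open() block should make the data appear in the file.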

@harshit-ag harshit-ag commented Apr 6, 2019

Can I work on this issue?

@harshit-ag harshit-ag commented Apr 7, 2019

import io
import pandas as pd

towrite = io.BytesIO()
df = pd.read_csv("example.csv")

df.to_excel(towrite)
towrite.seek(0)

with open("someother.csv", "wb") as f:
    f.write(towrite.getvalue())

This could be a possible solution: check the mode and convert the output to a bytes-like object. I have tried to make the changes but could not figure out the flow of the code. @TomAugspurger, can you help me with that?

@Enteee Enteee commented May 13, 2020

Hi folks, I wrote an article on my blog on how to Support Binary File Objects with pandas.DataFrame.to_csv. At the end of the article I added a monkey patch I think can also be used as a work around for this problem. Hope this helps until this is resolved in pandas.

@remram44 remram44 commented Jun 8, 2020

I'm very confused about this, because passing an io.BytesIO to pandas will not work (so whatever detection you do isn't very good), but passing a simple file-like object with a write() method will make to_csv() write bytes to it for some reason. Why to_csv() puts bytes into anything is beyond me.
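
Why a check based on a `mode` attribute would miss io.BytesIO can be seen in a short sketch. Whether pandas actually detects binary handles this way is not stated in this thread; `is_binary_handle` is a hypothetical helper, not pandas API.

```python
import io


def is_binary_handle(buf):
    """Hypothetical check: guess whether buf expects bytes rather than str."""
    if isinstance(buf, io.TextIOBase):
        return False
    if isinstance(buf, (io.RawIOBase, io.BufferedIOBase)):
        return True
    # Fall back to the mode string for objects outside the io hierarchy.
    mode = getattr(buf, "mode", "")
    return isinstance(mode, str) and "b" in mode


# io.BytesIO has no mode attribute, so a mode-only check cannot see it.
has_mode = hasattr(io.BytesIO(), "mode")

bytes_is_binary = is_binary_handle(io.BytesIO())
string_is_binary = is_binary_handle(io.StringIO())
```

A bare object that only defines write() and sits outside the io class hierarchy still cannot be classified this way, so some guessing remains for third-party file-like objects.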
