read_csv incompatible with newstr and future #14477

Closed
larssono opened this Issue Oct 23, 2016 · 7 comments

larssono commented Oct 23, 2016 edited by jorisvandenbossche

After upgrading to pandas 0.19, several tests are failing in a package I maintain. The package uses several imports from future to work with both py2 and py3. The issue seems to be triggered by from __future__ import unicode_literals.

A small, complete example of the issue

import pandas as pd
pd.read_csv('simple.txt', quotechar='"')
from __future__ import unicode_literals
pd.read_csv('simple.txt', quotechar='"')

The first read works; the second fails with the stack trace included below (TypeError: "quotechar" must be string, not unicode).
The example file
simple.txt
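
A possible interim workaround on an affected version, sketched here under the assumption that the calling code controls the quotechar argument: coerce the value to the interpreter's native str type before handing it to read_csv. The helper name native_quotechar is hypothetical and not part of pandas.

```python
import sys


def native_quotechar(char):
    """Return `char` as the native `str` type (hypothetical helper, not a pandas API)."""
    if sys.version_info[0] == 2 and not isinstance(char, str):
        # On Python 2, `str` is the byte-string type, so a unicode literal
        # (what unicode_literals produces) has to be encoded first.
        return char.encode("ascii")
    # On Python 3, text literals are already `str`; nothing to do.
    return char


# u'"' stands in for a literal written under `unicode_literals` on Python 2.
quotechar = native_quotechar(u'"')
# The result could then be passed along, e.g.:
#     pd.read_csv('simple.txt', quotechar=quotechar)
```

For a literal, writing quotechar=str('"') at the call site has the same effect on both Python 2 and Python 3.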

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 26.0.0
Cython: None
numpy: 1.11.2
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

TypeError                                 Traceback (most recent call last)
<ipython-input-2-6e275a5a7598> in <module>()
      1 from __future__ import unicode_literals
----> 2 pd.read_csv('/Users/lom/simple.csv', quotechar='"')

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    643                     skip_blank_lines=skip_blank_lines)
    644 
--> 645         return _read(filepath_or_buffer, kwds)
    646 
    647     parser_f.__name__ = name

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    386 
    387     # Create the parser.
--> 388     parser = TextFileReader(filepath_or_buffer, **kwds)
    389 
    390     if (nrows is not None) and (chunksize is not None):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    727             self.options['has_index_names'] = kwds['has_index_names']
    728 
--> 729         self._make_engine(self.engine)
    730 
    731     def close(self):

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    920     def _make_engine(self, engine='c'):
    921         if engine == 'c':
--> 922             self._engine = CParserWrapper(self.f, **self.options)
    923         else:
    924             if engine == 'python':

/usr/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1387         kwds['allow_leading_cols'] = self.index_col is not False
   1388 
-> 1389         self._reader = _parser.TextReader(src, **kwds)
   1390 
   1391         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4411)()

pandas/parser.pyx in pandas.parser.TextReader._set_quoting (pandas/parser.c:6535)()

TypeError: "quotechar" must be string, not unicode

jorisvandenbossche added this to the 0.19.1 milestone Oct 24, 2016

@larssono Thanks for the report!

cc @gfyoung

Member

gfyoung commented Oct 24, 2016 edited

@jorisvandenbossche : Might it be best to just add a unicode class to pandas.compat? I think that should patch this issue, IINM, i.e.:

try:
    unicode
except NameError:
    unicode = str
Member

gfyoung commented Oct 24, 2016 edited

FYI, for future reference, here's a slightly easier way to reproduce (Note: Python 2.x required):

>>> from pandas import read_csv
>>> from pandas.compat import StringIO, u
>>>
>>> data = 'a\n1'
>>> read_csv(StringIO(data), quotechar=u('"'))
...
TypeError: "quotechar" must be string, not unicode
Contributor

jreback commented Oct 24, 2016

@gfyoung unicode needs to be very explicit

Member

gfyoung commented Oct 24, 2016

@jreback : Right...but what do you think of the patch I proposed above, and we can then add the class to the allowed string types in parser.pyx?

Contributor

jreback commented Oct 24, 2016

well it's not explicit
so -1

Member

gfyoung commented Oct 24, 2016 edited

In pandas.compat:

try:
    unicode
except NameError:
    unicode = str
...

In parser.pyx:

if not isinstance(quote_char, (str, bytes, compat.unicode)) and quote_char is not None:
...
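
Put together, the two fragments above amount to roughly the following pure-Python sketch. The function name validate_quotechar is hypothetical, the real check lives in Cython in parser.pyx, and the error message is modeled on the one in the traceback from the report.

```python
# Compatibility alias as proposed above: `unicode` only exists on Python 2.
try:
    unicode
except NameError:
    unicode = str


def validate_quotechar(quote_char):
    """Hypothetical stand-in for the type check done in parser.pyx."""
    allowed = (str, bytes, unicode)
    if quote_char is not None and not isinstance(quote_char, allowed):
        raise TypeError('"quotechar" must be string, not {typ}'.format(
            typ=type(quote_char).__name__))
    return quote_char


# Both native and unicode literals are accepted; None is passed through.
validate_quotechar('"')
validate_quotechar(u'"')
validate_quotechar(None)
```

With the alias in the allowed tuple, a unicode quotechar passes the check on Python 2, while on Python 3 the alias is simply str and changes nothing.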

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 25, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
9a31321

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 25, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
814746b

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 25, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
1d3a3d7

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 25, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
6a47510

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 26, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
523412b

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 26, 2016

@gfyoung gfyoung BUG: Accept unicode quotechars again in pd.read_csv
Closes gh-14477.
ec9f59a

jreback closed this in 6130e77 Oct 26, 2016

@jorisvandenbossche jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

@gfyoung @jorisvandenbossche gfyoung + jorisvandenbossche [Backport #14492] BUG: Accept unicode quotechars again in pd.read_csv
Title is self-explanatory.  Affects Python 2.x only.  Closes #14477.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14492 from gfyoung/quotechar-unicode-2.x and squashes the following commits:

ec9f59a [gfyoung] BUG: Accept unicode quotechars again in pd.read_csv

(cherry picked from commit 6130e77)
6440067

@amolkahat amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

@gfyoung @amolkahat gfyoung + amolkahat BUG: Accept unicode quotechars again in pd.read_csv
Title is self-explanatory.  Affects Python 2.x only.  Closes #14477.

Author: gfyoung <gfyoung17@gmail.com>

Closes #14492 from gfyoung/quotechar-unicode-2.x and squashes the following commits:

ec9f59a [gfyoung] BUG: Accept unicode quotechars again in pd.read_csv
01e2818