TypeError when using 'comment=...' in read_csv from a file #31396

Closed
cddf opened this issue Jan 28, 2020 · 2 comments · Fixed by #31667
Labels
Bug IO CSV read_csv, to_csv
Comments


cddf commented Jan 28, 2020

Code Sample

Given a data file data.csv with a line that is commented out:

+1.280000e+002,-4.078996e+001
+2.560000e+002,-5.155923e+001
# +3.840000e+002,-7.221378e+001
+5.120000e+002,-7.918677e+001
+6.400000e+002,-7.919656e+001

import pandas as pd
pd.read_csv('data.csv', sep=None, index_col=0, header=None, engine="python", comment='#')

Problem description

It raises a TypeError when using the comment parameter:

TypeError                                 Traceback (most recent call last)
<ipython-input-17-c89b3c3e691f> in <module>
----> 1 pd.read_csv('data.csv', sep=None, comment='#')

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1145                     ' "python-fwf")'.format(engine=engine)
   1146                 )
-> 1147             self._engine = klass(self.f, **self.options)
   1148 
   1149     def _failover_to_python(self):

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, **kwds)
   2297         # Set self.data to something that can read lines.
   2298         if hasattr(f, "readline"):
-> 2299             self._make_reader(f)
   2300         else:
   2301             self.data = f

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _make_reader(self, f)
   2427                 self.pos += 1
   2428                 self.line_pos += 1
-> 2429                 sniffed = csv.Sniffer().sniff(line)
   2430                 dia.delimiter = sniffed.delimiter
   2431                 if self.encoding is not None:

/usr/lib64/python3.8/csv.py in sniff(self, sample, delimiters)
    179 
    180         quotechar, doublequote, delimiter, skipinitialspace = \
--> 181                    self._guess_quote_and_delimiter(sample, delimiters)
    182         if not delimiter:
    183             delimiter, skipinitialspace = self._guess_delimiter(sample,

/usr/lib64/python3.8/csv.py in _guess_quote_and_delimiter(self, data, delimiters)
    220                       r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    221             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 222             matches = regexp.findall(data)
    223             if matches:
    224                 break

TypeError: expected string or bytes-like object

Without the commented line in the data file and without the parameter comment='#', everything works as expected.

It seems that sep=None is the problem here: with sep=',' it works. But in our case the import is part of a general importer that has to accept a variety of different files, so we must use sep=None.
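For anyone hitting this before a fix lands, a minimal self-contained reproduction of the working path (an explicit sep=',' bypasses the delimiter sniffer; io.StringIO stands in for the file on disk):

```python
import io
import pandas as pd

data = (
    "+1.280000e+002,-4.078996e+001\n"
    "+2.560000e+002,-5.155923e+001\n"
    "# +3.840000e+002,-7.221378e+001\n"
    "+5.120000e+002,-7.918677e+001\n"
    "+6.400000e+002,-7.919656e+001\n"
)

# With an explicit separator the sniffer (csv.Sniffer) is never invoked,
# so the comment handling works as documented: the '#' line is skipped.
df = pd.read_csv(io.StringIO(data), sep=",", index_col=0, header=None,
                 engine="python", comment="#")
print(df.shape)  # -> (4, 1)
```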

Expected Output

I would expect the following output:

Out[18]: 
                 1
0                 
128.0    -40.78996
256.0    -51.55923
512.0    -79.18677
640.0    -79.19656

[4 rows x 1 columns]

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.13-201.fc31.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


s-scherrer commented Jan 28, 2020

I could reproduce your issue with the same versions of Python and pandas. The problem also occurs without the commented line in the file, as long as comment='#' and sep=None are used.

It seems that the variable line at line 2396 of pandas/io/parsers.py (current master), which is passed into csv.Sniffer().sniff, is a list of characters instead of a string.

The root cause is probably line 2392:

line = self._check_comments([line])[0]

The argument passed to self._check_comments is a list containing a single string, e.g. ['+1.280000e+002,-4.078996e+001\n'], instead of a list of lists of strings like [['+1.280000e+002,-4.078996e+001\n']].

I assume changing the line to

line = self._check_comments([[line]])[0]

should solve the issue.
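To illustrate why the extra level of nesting matters, here is a simplified sketch of what _check_comments does (this is a stand-in, not pandas' actual implementation): it expects a list of rows, where each row is itself a list of strings, and iterating over a bare string yields its individual characters instead of whole lines.

```python
def check_comments_sketch(rows, comment="#"):
    """Simplified stand-in for PythonParser._check_comments: for each
    row (a list of line/cell strings), drop everything from the comment
    character onward, and drop items that become empty."""
    out = []
    for row in rows:
        kept = []
        for item in row:
            pos = item.find(comment)
            if pos == -1:
                kept.append(item)
            elif pos > 0:
                kept.append(item[:pos])
        out.append(kept)
    return out

line = "+1.280000e+002,-4.078996e+001\n"
# Correct call shape: the row is wrapped in its own list, so the inner
# loop sees one whole line -> the row comes back unchanged.
print(check_comments_sketch([[line]])[0])
# Buggy call shape: [line] makes the *string* the row, so the inner
# loop iterates over single characters and returns a list of chars.
print(check_comments_sketch([line])[0])
```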

It would probably be good to add a test for this parameter combination. I can do that, but I'm unsure where the test should go. Should I just add a parameter to test_sniff_delimiter in tests/io/parser/test_python_parser_only.py, or would a separate test be better?
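For reference, a minimal version of such a test might look like this (name and placement are assumptions; the actual test added to pandas may differ):

```python
import io
import pandas as pd

def test_sniff_delimiter_with_comment():
    # sep=None forces delimiter sniffing; with the fix, comment lines
    # are stripped correctly before the data reaches csv.Sniffer.
    data = "1,2\n# a comment\n3,4\n"
    result = pd.read_csv(io.StringIO(data), sep=None, engine="python",
                         comment="#", header=None)
    assert result.shape == (2, 2)

test_sniff_delimiter_with_comment()
```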


cddf commented Jan 29, 2020

Nice! Thank you for your fast investigation 👍

I think in principle it's good practice to add a separate test, but I'm not active in pandas development.

s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
Added a test case to io/parser/test_python_parser_only.py in order to
reproduce pandas-dev#31396.
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
This makes it possible to use read_csv with sep=None and comment set to
a non-None value.

Fixes pandas-dev#31396.
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
This makes read_csv work when sep=None and comment is set to a value.
Fixes pandas-dev#31396.
jreback added the Bug and IO CSV (read_csv, to_csv) labels Feb 5, 2020
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 5, 2020
Added a note in whatsnew/v1.0.0.rst and moved test for pandas-dev#31396 to the end
of tests/io/parser/test_python_parser_only.py.
jreback added this to the 1.1 milestone Mar 3, 2020