TypeError when using 'comment=...' in read_csv from a file #31396

Closed
cddf opened this issue Jan 28, 2020 · 2 comments · Fixed by #31667
Labels
Bug IO CSV read_csv, to_csv
Comments


cddf commented Jan 28, 2020

Code Sample

Given a data file data.csv with a line that is commented out:

+1.280000e+002,-4.078996e+001
+2.560000e+002,-5.155923e+001
# +3.840000e+002,-7.221378e+001
+5.120000e+002,-7.918677e+001
+6.400000e+002,-7.919656e+001

import pandas as pd
pd.read_csv('data.csv', sep=None, index_col=0, header=None, engine="python", comment='#')

Problem description

It raises a TypeError when using the comment parameter:

TypeError                                 Traceback (most recent call last)
<ipython-input-17-c89b3c3e691f> in <module>
----> 1 pd.read_csv('data.csv', sep=None, comment='#')

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683         )
    684 
--> 685         return _read(filepath_or_buffer, kwds)
    686 
    687     parser_f.__name__ = name

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455 
    456     # Create the parser.
--> 457     parser = TextFileReader(fp_or_buf, **kwds)
    458 
    459     if chunksize or iterator:

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893             self.options["has_index_names"] = kwds["has_index_names"]
    894 
--> 895         self._make_engine(self.engine)
    896 
    897     def close(self):

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1145                     ' "python-fwf")'.format(engine=engine)
   1146                 )
-> 1147             self._engine = klass(self.f, **self.options)
   1148 
   1149     def _failover_to_python(self):

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in __init__(self, f, **kwds)
   2297         # Set self.data to something that can read lines.
   2298         if hasattr(f, "readline"):
-> 2299             self._make_reader(f)
   2300         else:
   2301             self.data = f

~/.local/share/virtualenvs/openqlab/lib/python3.8/site-packages/pandas/io/parsers.py in _make_reader(self, f)
   2427                 self.pos += 1
   2428                 self.line_pos += 1
-> 2429                 sniffed = csv.Sniffer().sniff(line)
   2430                 dia.delimiter = sniffed.delimiter
   2431                 if self.encoding is not None:

/usr/lib64/python3.8/csv.py in sniff(self, sample, delimiters)
    179 
    180         quotechar, doublequote, delimiter, skipinitialspace = \
--> 181                    self._guess_quote_and_delimiter(sample, delimiters)
    182         if not delimiter:
    183             delimiter, skipinitialspace = self._guess_delimiter(sample,

/usr/lib64/python3.8/csv.py in _guess_quote_and_delimiter(self, data, delimiters)
    220                       r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    221             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 222             matches = regexp.findall(data)
    223             if matches:
    224                 break

TypeError: expected string or bytes-like object

Without the commented line in the data file and without the parameter comment='#', everything works as expected.

It seems that sep=None is the problem here: with sep=',' it works. But in our case the import is part of a general importer that has to accept a variety of different files, so we must use sep=None.
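For anyone hitting this before a fix lands, a minimal self-contained reproduction of the working path (an explicit sep=',' bypasses the delimiter sniffer; io.StringIO stands in for the file on disk):

```python
import io
import pandas as pd

data = (
    "+1.280000e+002,-4.078996e+001\n"
    "+2.560000e+002,-5.155923e+001\n"
    "# +3.840000e+002,-7.221378e+001\n"
    "+5.120000e+002,-7.918677e+001\n"
    "+6.400000e+002,-7.919656e+001\n"
)

# With an explicit separator the sniffer (csv.Sniffer) is never invoked,
# so the comment handling works as documented: the '#' line is skipped.
df = pd.read_csv(io.StringIO(data), sep=",", index_col=0, header=None,
                 engine="python", comment="#")
print(df.shape)  # -> (4, 1)
```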

Expected Output

I would expect the following output:

Out[18]: 
                 1
0                 
128.0    -40.78996
256.0    -51.55923
512.0    -79.18677
640.0    -79.19656

[4 rows x 1 columns]

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.13-201.fc31.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 0.25.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


s-scherrer commented Jan 28, 2020

I could reproduce your issue with the same versions of Python and pandas. The problem also occurs without the commented line in the file, as long as comment='#' and sep=None are used.

It seems that the variable line at line 2396 of pandas/io/parsers.py (current master), which is passed into csv.Sniffer().sniff, is a list of characters instead of a string.

The root cause is probably line 2392:

line = self._check_comments([line])[0]

The argument passed to self._check_comments is a list containing a single string, e.g. ['+1.280000e+002,-4.078996e+001\n'], instead of a list of lists of strings like [['+1.280000e+002,-4.078996e+001\n']].

I assume changing the line to

line = self._check_comments([[line]])[0]

should solve the issue.
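To illustrate why the extra level of nesting matters, here is a simplified sketch of what _check_comments does (this is a stand-in, not pandas' actual implementation): it expects a list of rows, where each row is itself a list of strings, and iterating over a bare string yields its individual characters instead of whole lines.

```python
def check_comments_sketch(rows, comment="#"):
    """Simplified stand-in for PythonParser._check_comments: for each
    row (a list of line/cell strings), drop everything from the comment
    character onward, and drop items that become empty."""
    out = []
    for row in rows:
        kept = []
        for item in row:
            pos = item.find(comment)
            if pos == -1:
                kept.append(item)
            elif pos > 0:
                kept.append(item[:pos])
        out.append(kept)
    return out

line = "+1.280000e+002,-4.078996e+001\n"
# Correct call shape: the row is wrapped in its own list, so the inner
# loop sees one whole line -> the row comes back unchanged.
print(check_comments_sketch([[line]])[0])
# Buggy call shape: [line] makes the *string* the row, so the inner
# loop iterates over single characters and returns a list of chars.
print(check_comments_sketch([line])[0])
```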

It would probably be good to add a test for this parameter combination. I can do that, but I'm unsure where the test should go. Should I just add a parameter to test_sniff_delimiter in tests/io/parser/test_python_parser_only.py, or would a separate test be better?
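For reference, a minimal version of such a test might look like this (name and placement are assumptions; the actual test added to pandas may differ):

```python
import io
import pandas as pd

def test_sniff_delimiter_with_comment():
    # sep=None forces delimiter sniffing; with the fix, comment lines
    # are stripped correctly before the data reaches csv.Sniffer.
    data = "1,2\n# a comment\n3,4\n"
    result = pd.read_csv(io.StringIO(data), sep=None, engine="python",
                         comment="#", header=None)
    assert result.shape == (2, 2)

test_sniff_delimiter_with_comment()
```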


cddf commented Jan 29, 2020

Nice! Thank you for your fast investigation 👍

I think in principle it's good practice to add a separate test, but I'm not active in pandas development.

s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
Added a test case to io/parser/test_python_parser_only.py in order to
reproduce pandas-dev#31396.
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
This makes it possible to use read_csv with sep=None and comment set to
a non-None value.

Fixes pandas-dev#31396.
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 4, 2020
This makes read_csv work when sep=None and comment is set to a value.
Fixes pandas-dev#31396.
jreback added the Bug and IO CSV (read_csv, to_csv) labels Feb 5, 2020
s-scherrer added a commit to s-scherrer/pandas that referenced this issue Feb 5, 2020
Added a note in whatsnew/v1.0.0.rst and moved test for pandas-dev#31396 to the end
of tests/io/parser/test_python_parser_only.py.
jreback added this to the 1.1 milestone Mar 3, 2020