read_csv return wrong dataframe when setting skiprows. #12775

strnam · 2016-04-02T07:46:34Z

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> from StringIO import StringIO
>>> data = """id,text,num_lines
1,"line 11
line 12",2
2,"line 21
line 22",2
3,"line 31",1"""

>>> pd.read_csv(StringIO(data))
Out[2]: 
   id              text  num_lines
0   1  'line 11\nline 12'          2
1   2  'line 21\nline 22'          2
2   3           'line 31'          1

>>> pd.read_csv(StringIO(data), skiprows=[1])
Out[3]: 
         id              text  num_lines
0  'line 12"'                 2        NaN
1         2  'line 21\nline 22'        2.0
2         3           'line 31'        1.0
...

Expected Output

>>> pd.read_csv(StringIO(data), skiprows=[1])
Out[3]: 
   id              text  num_lines
0   2  'line 21\nline 22'          2
1   3           'line 31'          1
...

It should skip the whole record '1,"line 11\nline 12",2', but instead it skips only the first physical line, '1,"line 11'.

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.3-300.fc23.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 18.0.1
Cython: None
numpy: 1.11.0
scipy: 0.14.1
statsmodels: 0.6.1
xarray: None
IPython: 3.2.1
sphinx: 1.2.3
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 0.6.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

jreback · 2016-04-02T15:21:58Z

Very tricky, as you have embedded newlines within quoted fields. I guess the skip_lines logic is not accounting for the quoted fields (it ignores the quoting).
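
For readers following along, here is a minimal sketch of the distinction described above, using only the standard-library `csv` module rather than pandas internals. It contrasts physical lines (split on every newline) with logical records (quoted newlines kept inside the field). The helper `drop_records` is a hypothetical name introduced for illustration only; this is not how the C engine is implemented.

```python
import csv
from io import StringIO

# Same sample data as in the report.
data = """id,text,num_lines
1,"line 11
line 12",2
2,"line 21
line 22",2
3,"line 31",1"""

# Physical lines: split on every newline, ignoring quotes.
physical_lines = data.split("\n")

# Logical records: csv.reader keeps a quoted newline inside its field.
records = list(csv.reader(StringIO(data)))

print(len(physical_lines))  # 6 physical lines
print(len(records))         # 4 logical records (header + 3 rows)

# Dropping only physical line 1 leaves this fragment as the next line,
# which matches the broken first row in the report.
print(physical_lines[2])


def drop_records(rows, skip):
    """Hypothetical helper: drop logical records whose index is in `skip`."""
    return [row for i, row in enumerate(rows) if i not in skip]


# Skipping logical record 1 removes the whole two-line record.
kept = drop_records(records, {1})
print(kept)
```

Skipping by logical record yields the header plus the rows for ids 2 and 3, which is the output the report expects.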

jreback · 2016-04-02T15:23:13Z

this is related to #10911

jreback · 2016-04-02T15:25:40Z

cc @mdmueller
cc @selasley
cc @evanpw

strnam · 2016-04-03T11:20:24Z

Thank you. This problem appeared in my real project when other people gave me a large CSV file containing a text column (text from a news website). Because the file is big, I just want to read part of it using the skiprows parameter, but it doesn't work as I expect.
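
One possible way to read only part of such a file without depending on how `skiprows` treats quoted newlines is to stream logical records with the standard-library `csv` module. The sketch below is an assumption-laden illustration, not a pandas API: `read_record_window` is a hypothetical helper name, and the sample data stands in for the real file.

```python
import csv
from io import StringIO
from itertools import islice


def read_record_window(fileobj, start, count):
    """Hypothetical helper: return (header, rows) for `count` data records,
    beginning at 0-based data record `start`.

    Records are consumed lazily, so the whole file is never held in memory.
    """
    reader = csv.reader(fileobj)
    header = next(reader)  # first logical record is the header
    rows = list(islice(reader, start, start + count))
    return header, rows


# Stand-in for a large file; a real file would be opened with
# open(path, newline="") as the csv module documentation recommends.
sample = StringIO(
    'id,text,num_lines\n'
    '1,"line 11\nline 12",2\n'
    '2,"line 21\nline 22",2\n'
    '3,"line 31",1'
)

header, rows = read_record_window(sample, start=1, count=2)
print(header)
print(rows)
```

The resulting `rows` and `header` could then be handed to `pd.DataFrame(rows, columns=header)`; note that every value would still be a string at that point, so numeric columns would need converting.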

selasley · 2016-04-03T19:47:48Z

Using the Python engine instead of the faster C engine works for the data given above:

In [4]: pd.read_csv(StringIO(data), skiprows=[1], engine='python')
Out[4]: 
   id              text  num_lines
0   2  line 21\nline 22          2
1   3           line 31          1

jreback added the Bug, IO CSV (read_csv, to_csv), and Difficulty Intermediate labels Apr 2, 2016

jreback added this to the Next Major Release milestone Apr 2, 2016

gfyoung mentioned this issue Apr 14, 2016

Allow parsing in skipped row for C engine #12900

Closed

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 22, 2016

Patch handling of quotes in skipped rows

858c673

Patches bug in C engine CSV parser in which quotation marks were not being respected in skipped rows. Closes pandas-devgh-10911. Closes pandas-devgh-12775.

jreback modified the milestones: 0.18.1, Next Major Release Apr 22, 2016

jreback closed this as completed in 5688d27 Apr 22, 2016

gfyoung mentioned this issue Jun 5, 2016

BUG: Python parser breaks with quotes and multi-char sep #13374

Closed
