read_csv returns wrong DataFrame when setting skiprows. #12775

Closed
strnam opened this issue Apr 2, 2016 · 5 comments
Labels
Bug IO CSV read_csv, to_csv
Comments

@strnam

strnam commented Apr 2, 2016

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> from StringIO import StringIO
>>> data = """id,text,num_lines
1,"line 11
line 12",2
2,"line 21
line 22",2
3,"line 31",1"""

>>> pd.read_csv(StringIO(data))
Out[2]: 
   id              text  num_lines
0   1  'line 11\nline 12'          2
1   2  'line 21\nline 22'          2
2   3           'line 31'          1

>>> pd.read_csv(StringIO(data), skiprows=[1])
Out[3]: 
         id              text  num_lines
0  'line 12"'                 2        NaN
1         2  'line 21\nline 22'        2.0
2         3           'line 31'        1.0
...

Expected Output

>>> pd.read_csv(StringIO(data), skiprows=[1])
Out[3]: 
   id              text  num_lines
0   2  'line 21\nline 22'          2
1   3           'line 31'          1
...

It should skip the whole record '1,"line 11\nline 12",2' instead of skipping only '1,"line 11'.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.3-300.fc23.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 18.0.1
Cython: None
numpy: 1.11.0
scipy: 0.14.1
statsmodels: 0.6.1
xarray: None
IPython: 3.2.1
sphinx: 1.2.3
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 0.6.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None

jreback added this to the Next Major Release milestone Apr 2, 2016
@jreback
Contributor

jreback commented Apr 2, 2016

This is very tricky, as you have embedded newlines within quoted fields. I guess the skip_lines is not accounting for the quoted fields (and is ignoring them).
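
To make the difference between physical lines and logical records concrete, here is a minimal sketch that uses only Python's standard `csv` module, not the pandas C parser. It assumes the sample data from the report; `drop_physical_line` is a hypothetical helper written for this illustration. Removing physical line 1 without regard to quoting leaves the tail of the quoted field behind, which mirrors the malformed first row in the reported output.

```python
import csv
import io

data = (
    'id,text,num_lines\n'
    '1,"line 11\nline 12",2\n'
    '2,"line 21\nline 22",2\n'
    '3,"line 31",1'
)

# Physical lines: split on every newline, including those inside quotes.
physical_lines = data.splitlines(keepends=True)

# Logical records: the csv module keeps a quoted newline inside one field.
logical_records = list(csv.reader(io.StringIO(data)))


def drop_physical_line(text, line_number):
    """Hypothetical helper: remove one physical line, ignoring quoting."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[:line_number] + lines[line_number + 1:])


# Parsing after a quote-unaware skip keeps the orphaned 'line 12",2' fragment.
misparsed = list(csv.reader(io.StringIO(drop_physical_line(data, 1))))

print(len(physical_lines), len(logical_records))
print(misparsed[1])
```

The header plus three records occupy six physical lines, so "row 1" means something different depending on which unit is counted.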

@jreback
Contributor

jreback commented Apr 2, 2016

This is related to #10911.

@jreback
Contributor

jreback commented Apr 2, 2016

cc @mdmueller
cc @selasley
cc @evanpw

@strnam
Author

strnam commented Apr 3, 2016

Thank you. This problem appeared in my real project, when other people gave me a large CSV file containing a text column (text from a news website). Because the file is big, I just want to read part of it using the skiprows parameter, but it doesn't work as I expect.
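
For the use case described here, reading only part of a large file, one possible pandas-free sketch is to slice logical records with the standard `csv` module and `itertools.islice`. The helper name `read_record_window` and its parameters are assumptions made for this illustration, not part of pandas.

```python
import csv
import io
from itertools import islice


def read_record_window(fileobj, start, count):
    """Return the header and `count` logical records, starting at
    data record `start` (0-based, counted after the header)."""
    reader = csv.reader(fileobj)
    header = next(reader)
    # islice counts csv records, so quoted newlines do not shift the window.
    rows = list(islice(reader, start, start + count))
    return header, rows


sample = (
    'id,text,num_lines\n'
    '1,"line 11\nline 12",2\n'
    '2,"line 21\nline 22",2\n'
    '3,"line 31",1'
)

header, rows = read_record_window(io.StringIO(sample), start=1, count=2)
print(header)
print(rows)
```

For a real file, the `csv` documentation recommends opening it with `newline=''` so that newlines embedded in quoted fields are passed through unchanged.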

@selasley
Contributor

selasley commented Apr 3, 2016

Using the python engine instead of the faster c engine works for the data given above:

In [4]: pd.read_csv(StringIO(data), skiprows=[1], engine='python')
Out[4]: 
   id              text  num_lines
0   2  line 21\nline 22          2
1   3           line 31          1
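
The session above runs on Python 2. A roughly equivalent Python 3 form of the same workaround might look like the sketch below, assuming pandas is installed; the only change is importing `StringIO` from `io`.

```python
import io

import pandas as pd

data = (
    'id,text,num_lines\n'
    '1,"line 11\nline 12",2\n'
    '2,"line 21\nline 22",2\n'
    '3,"line 31",1'
)

# engine="python" selects the slower pure-Python parser, which skipped the
# whole quoted record in the session above.
df = pd.read_csv(io.StringIO(data), skiprows=[1], engine="python")
print(df)
```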

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 22, 2016
Patches bug in C engine CSV parser in
which quotation marks were not being
respected in skipped rows.

Closes gh-10911.
Closes gh-12775.
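
The idea of respecting quotation marks while skipping rows can be sketched in pure Python as follows. This is an illustration of the technique only, not the actual C patch; `skip_rows_quote_aware` is a hypothetical name, and the sketch assumes quotes appear only as field delimiters or doubled inside quoted fields, with no escape character.

```python
def skip_rows_quote_aware(text, rows_to_skip, quotechar='"'):
    """Drop the given 0-based rows, where a row ends only at a newline
    that sits outside a quoted field."""
    kept = []
    row = 0
    in_quotes = False
    for ch in text:
        if row not in rows_to_skip:
            kept.append(ch)
        if ch == quotechar:
            # Toggle quote state; a doubled quote toggles twice and cancels out.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            # Only an unquoted newline ends the current row.
            row += 1
    return "".join(kept)


data = (
    'id,text,num_lines\n'
    '1,"line 11\nline 12",2\n'
    '2,"line 21\nline 22",2\n'
    '3,"line 31",1'
)

cleaned = skip_rows_quote_aware(data, {1})
print(cleaned)
```

Because the quote state is still tracked while characters are being discarded, the newline inside "line 11\nline 12" does not end the skipped row early.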
jreback modified the milestones: 0.18.1, Next Major Release Apr 22, 2016