Conflict b/w skiprows and default quotechar kwargs to pandas.read_table #14459

Closed
rahulporuri opened this Issue Oct 20, 2016 · 14 comments

Comments

Projects
None yet
4 participants

rahulporuri commented Oct 20, 2016 edited

A small, complete example of the issue

while trying to open a data file similar to

a
b"
c
d
e"
a,b,c
1,2,3
4,5,6
7,8,9

i expect the following code

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, sep=','); print df"

Expected Output

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Observed Output

Empty DataFrame
Columns: [7,8,9]
Index: []

Further Insight

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=2, sep=','); print df"

surprisingly works. also,

python -c "import pandas; df = pandas.read_table('quoted_data.csv', skiprows=5, quotechar="?", sep=','); print df"

works

The behavior changed between pandas 0.18.0 and 0.18.1. we suspect changes made in #12900 to be causing this.

Note that the difference in skiprows values that works (2) and that doesn't (5) is the same as the number of lines in the file between quote chars.

Apologies for the noise if this has already been reported or is being addressed.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Contributor

pankajp commented Oct 20, 2016 edited

Some additional notes:

  • The python engine works as expected:
$ python -c "import pandas; df = pandas.read_table('quoted_data.csv', engine='python', skiprows=5, sep=','); print df"
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
  • The commit 5688d27 is the one which changed this behavior for the c engine, as found out by git bisect

The difference in behaviour between python and c engine is not good. But, the question is a bit which of both you want.

cc @gfyoung

I suppose normally newlines in quotes should only be regarded as part of the string if the quotes are 'valid'. I mean, b" is not a valid quoted string, and if that occurs in a normal row this is just interpreted as the string 'b"', not as the start of a multiline string

Member

gfyoung commented Oct 21, 2016 edited

@rahulporuri : Thanks for bringing up this issue! This is not bugged behaviour but rather expected. The reason why you get an empty DataFrame is because that multi-line quote is considered to be a single field value. Thus, the five rows you are skipping are 'a', 'b\nc\nd\ne', a,b,c, 1,2,3, and 4,5,6. Once you are inside the quotation marks, anything goes.

Your "surprising" results behave as expected too. The first two rows are 'a' and 'b\nc\nd\ne', and those are skipped leaving the DataFrame you wanted. The reason why your next example "behaves" is because you specified the quote character to be '?'. Now 'b\nc\nd\ne' is no longer interpreted as a single field but rather as multiple lines because " is no longer a quotation mark character.

The Python behaviour is out of our control because the csv library in Python is the foundation of that parser, and when you read the data you provided, it parses incorrectly IMO by "ignoring" quotation marks (i.e. it treats the \n inside the quotation marks as line terminators). While it is good that you bring up this inconsistency, I wonder in the long-term whether this is worthwhile to "rectify," if there is any good way to do so. I think we should be deprecating the Python parser eventually because I think the C parser is objectively superior both in performance and general correctness.

@jorisvandenbossche : While I don't believe there is a real issue to fix, not entirely sure what would be best to do given my explanation above.

Contributor

pankajp commented Oct 25, 2016

I think @jorisvandenbossche 's suggestion is reasonable and expected, quoted field should have quoting in the beginning and end of the field. The example here is artificial but has use in real world data with ' and " being used in quite a few places to indicate feet and inches.
The python engine's behavior seems appropriate.

@gfyoung to illustrate the difference in how quotes are interpreted in skipped rows vs the data rows:

newline in quoted strings ("a\nb") -> regarded as one row

s1 = """a,b,c
1,"a
b",3
2,"c",4"""

In [2]: pd.read_csv(StringIO(s1))
Out[2]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

if you have a similar case in to skip header rows ("further\nskip"") -> also interpreted as one row -> skiprows=2 instead of 3

s2 = """toskip
"further
skip"
a,b,c
1,"a
b",3
2,"c",4"""

In [4]: pd.read_csv(StringIO(s2), skiprows=2)
Out[4]: 
   a     b  c
0  1  a\nb  3
1  2     c  4

but if the quote is not 'valid' (a"\nb"), it is not interpreted as one row

s3 = """a,b,c
1,a"
b",3
2,"c",4"""

In [7]: pd.read_csv(StringIO(s3))
Out[7]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

if you have a similar construct in the to skip header rows (further"\nskip"), it is not regarded as separate rows as I still have to use skiprows=2 instead of skiprows=3:

s4 = """toskip
further"
skip"
a,b,c
1,a"
b",3
2,c,4"""

In [9]: pd.read_csv(StringIO(s4), skiprows=2)
Out[9]: 
    a   b    c
0   1  a"  NaN
1  b"   3  NaN
2   2   c  4.0

I am not sure if we have some kind of definition of what a 'valid' quote is, but in any case there is some inconsistency here, and which caused possibly unintended change in the skiprows behaviour.

Member

gfyoung commented Oct 25, 2016 edited

@jorisvandenbossche : Hmmm...so I think that s3 is a bug in your examples. The reason is that we do not go back to parsing as a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field). It is not a thing of valid and invalid quotation marks. That can be patched.

@rahulporuri : Imagine your field value is a multi-line quote. Would you want Python to butcher it?

@gfyoung If s3 would be a bug, what would you expect from this:

s = """a,b
1,a"d
2,d"d"d"""

In [14]: pd.read_csv(StringIO(s))
Out[14]: 
   a      b
0  1    a"d
1  2  d"d"d

as this are both example of where quotation marks are not interpreted as starting quotes

Owner

jorisvandenbossche commented Oct 25, 2016 edited

The reason is that we do not go back to parsing as a quoted field once we are inside the field (i.e. your 'a' character means that the quote becomes in-field).

Yes (that's what I meant with 'invalid' quote, but maybe not a good name), so indeed because the field is already started, the quotation mark is not regarded as the start of a quote. But I don't understand why you say this would be a bug, as you also explain that we deliberately do not go as a quoted field one we are inside the field.

So why not follow the same reasoning for the header lines? If the line does not start with a quotation mark, you already are 'in-field'

Member

gfyoung commented Oct 25, 2016

@jorisvandenbossche : Fair point. Now that I think about it, we could go either way on this:

  1. We consider inline quotation marks as quotation marks to parse.
  2. We consider inline quotation marks as data.

Which one do you think has more use-cases?

Given that it is the current behaviour of both the python and c engine, I would go with option 2.
And apart from that, option 2 also makes the most sense to me I think.

But if we think option 2 is the right way, that means that the s3 example from above is not a bug (and that the current behaviour for data rows is the desired one).
But that leaves the question of how to interpret quotes on skipped lines. I would say that we should treat then the same as other rows.

Member

gfyoung commented Oct 25, 2016 edited

@jorisvandenbossche : Okay, but I suspect we're going to take a major performance hit if we have to differentiate between "quoted fields" and "in-field quotes". For example, what happens if your skipped row has multiple quoted fields in a single row?

I tackled this issue before with the C parser when I implemented quotation mark parsing in skipped rows. Right now, whenever we see a quotation mark, we just let anything and everything pass through.

But didn't we get the performance hit already when we started parsing the skipped rows?
Or is there still a large difference between parsing the skipped rows just for quotes and parsing the skipped rows for both delimiter and quotes ?

Member

gfyoung commented Oct 26, 2016

@jorisvandenbossche : Yes, we did. Now that I think about it, as long as we just check for delimiters (and no other parsing), we could be okay. We might need a couple of other states I think in tokenize_bytes.

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 27, 2016

@gfyoung gfyoung BUG: Don't parse inline quotes in skipped lines
Closes gh-14459.
5a0556e

jorisvandenbossche added this to the 0.19.1 milestone Oct 27, 2016

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 27, 2016

@gfyoung gfyoung BUG: Don't parse inline quotes in skipped lines
Closes gh-14459.
4fb55f6

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 28, 2016

@gfyoung gfyoung BUG: Don't parse inline quotes in skipped lines
Closes gh-14459.
1662b8f

@gfyoung gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 30, 2016

@gfyoung gfyoung BUG: Don't parse inline quotes in skipped lines
Closes gh-14459.
2e41dab

@jorisvandenbossche jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016

@gfyoung @jorisvandenbossche gfyoung + jorisvandenbossche [Backport #14514] BUG: Don't parse inline quotes in skipped lines (#1…
…4514)


Closes gh-14459.
(cherry picked from commit b088112)
e8a71a3

@amolkahat amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016

@gfyoung @amolkahat gfyoung + amolkahat BUG: Don't parse inline quotes in skipped lines (#14514)
Closes gh-14459.
2035b72
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment