Conflict b/w skiprows and default quotechar kwargs to pandas.read_table #14459
Comments
|
Some additional notes:
|
jorisvandenbossche added the IO CSV label Oct 20, 2016
|
The difference in behaviour between the python and c engine is not good. But the question is which of the two you want. cc @gfyoung I suppose normally newlines in quotes should only be regarded as part of the string if the quotes are 'valid'. I mean, |
jorisvandenbossche added the Bug label Oct 20, 2016
|
@rahulporuri : Thanks for bringing up this issue! This is not a bug but rather expected behaviour. You get an empty DataFrame because that multi-line quote is considered to be a single field value. Thus, the five rows you are skipping are
Your "surprising" results behave as expected too. The first two rows are
The Python behaviour is out of our control because the
@jorisvandenbossche : While I don't believe there is a real issue to fix, I am not entirely sure what would be best to do given my explanation above. |
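To make the "multi-line quote is a single field value" point concrete, here is a minimal sketch using Python's standard `csv` module rather than pandas. The sample text is made up for illustration and is not the data file from the original report; it only shows that a parser which honours quoted newlines sees fewer rows than there are physical lines.

```python
import csv
import io

# Hypothetical sample: the quoted field on the first row spans
# three physical lines.
text = 'id,"first\nsecond\nthird"\n1,2\n3,4\n'

# Rows as a plain line splitter sees them.
physical_lines = text.splitlines()

# Rows as a quote-aware parser sees them.
logical_rows = list(csv.reader(io.StringIO(text)))

print(len(physical_lines))
print(len(logical_rows))
print(logical_rows[0])
```

So "skip N rows" can mean quite different things depending on whether rows are counted before or after quote handling.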
|
I think @jorisvandenbossche 's suggestion is reasonable and expected: a quoted field should have quotes at the beginning and end of the field. The example here is artificial but has use in real world data with |
|
@gfyoung to illustrate the difference in how quotes are interpreted in skipped rows vs the data rows: newline in quoted strings (
if you have a similar case in to skip header rows (
but if the quote is not 'valid' (
if you have a similar construct in the to skip header rows (
I am not sure if we have some kind of definition of what a 'valid' quote is, but in any case there is some inconsistency here, and it possibly caused an unintended change in the |
|
@jorisvandenbossche : Hmmm...so I think that @rahulporuri : Imagine your field value is a multi-line quote. Would you want Python to butcher it? |
|
@gfyoung If
as these are both examples of where quotation marks are not interpreted as starting quotes |
Yes (that's what I meant by an 'invalid' quote, though maybe that's not a good name), so indeed, because the field has already started, the quotation mark is not regarded as the start of a quote. But I don't understand why you say this would be a bug, as you also explain that we deliberately do not treat it as a quoted field once we are inside the field. So why not follow the same reasoning for the header lines? If the line does not start with a quotation mark, you are already 'in-field' |
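For reference, the distinction between a quote that opens a field and a quote that appears once the field has already started can be seen with Python's standard `csv` module. This is only a sketch with made-up input; it shows the stdlib reader's behaviour, not the pandas C parser's.

```python
import csv
import io


def parse(text):
    """Parse `text` with the default csv dialect and return all rows."""
    return list(csv.reader(io.StringIO(text)))


# The quote is the first character of the field, so it opens a quoted
# field and the newline inside it is data.
quoted = parse('"a\nb",c\n')

# The quote appears after the field has started, so it is kept as a
# literal character and the newline ends the row.
inline = parse('x"a\nb",c\n')

print(quoted)
print(inline)
```

The first input comes back as one row, the second as two, even though both contain a pair of quotation marks around a newline.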
|
@jorisvandenbossche : Fair point. Now that I think about it, we could go either way on this:
Which one do you think has more use-cases? |
|
Given that it is the current behaviour of both the python and c engine, I would go with option 2. |
|
But if we think option 2 is the right way, that means that the |
|
@jorisvandenbossche : Okay, but I suspect we're going to take a major performance hit if we have to differentiate between "quoted fields" and "in-field quotes". For example, what happens if your skipped row has multiple quoted fields in a single row? I tackled this issue before with the C parser when I implemented quotation mark parsing in skipped rows. Right now, whenever we see a quotation mark, we just let anything and everything pass through. |
|
But didn't we get the performance hit already when we started parsing the skipped rows? |
|
@jorisvandenbossche : Yes, we did. Now that I think about it, as long as we just check for delimiters (and no other parsing), we could be okay. We might need a couple of other states I think in |
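A rough sketch of what "just check for delimiters" could look like when skipping rows is given below. The function name `skip_rows` and the three states are hypothetical and chosen for illustration; this is not the code from the pandas tokenizer, and it deliberately ignores doubled quotes, escape characters and `\r\n` line endings.

```python
def skip_rows(text, n, sep=",", quotechar='"'):
    """Return `text` with the first `n` logical rows removed.

    A quote opens a quoted field only when it is the first character of
    a field. A quote seen after the field has started is ordinary data.
    """
    START_FIELD, IN_FIELD, IN_QUOTED = range(3)
    state = START_FIELD
    skipped = 0
    for i, ch in enumerate(text):
        if skipped == n:
            return text[i:]
        if state == IN_QUOTED:
            # Inside a quoted field everything passes through,
            # including newlines, until the closing quote.
            if ch == quotechar:
                state = IN_FIELD
            continue
        if ch == "\n":
            skipped += 1
            state = START_FIELD
        elif ch == sep:
            state = START_FIELD
        elif ch == quotechar and state == START_FIELD:
            state = IN_QUOTED
        else:
            state = IN_FIELD
    return ""


# Inline quotes: each physical line counts as one skipped row.
print(repr(skip_rows('x"a\nb"y\nh1,h2\n1,2\n', 2)))
# Quote at field start: the two physical lines count as one row.
print(repr(skip_rows('"a\nb",c\nh1,h2\n', 1)))
```

The only extra work compared with letting everything after a quotation mark pass through is tracking whether the current character is at the start of a field, which needs the delimiter check and nothing else.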
gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 27, 2016: 5a0556e (gfyoung)
gfyoung referenced this issue Oct 27, 2016: Merged, BUG: Don't parse inline quotes in skipped lines #14514
jorisvandenbossche added this to the 0.19.1 milestone Oct 27, 2016
jorisvandenbossche added the Regression label Oct 27, 2016
gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 27, 2016: 4fb55f6 (gfyoung)
gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 28, 2016: 1662b8f (gfyoung)
gfyoung added a commit to gfyoung/pandas that referenced this issue Oct 30, 2016: 2e41dab (gfyoung)
jorisvandenbossche closed this in #14514 Oct 31, 2016
jorisvandenbossche added a commit that referenced this issue Oct 31, 2016: b088112 (gfyoung + jorisvandenbossche)
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Nov 2, 2016: e8a71a3 (gfyoung + jorisvandenbossche)
amolkahat added a commit to amolkahat/pandas that referenced this issue Nov 26, 2016: 2035b72 (gfyoung + amolkahat)
rahulporuri commented Oct 20, 2016
A small, complete example of the issue
while trying to open a data file similar to
i expect the following code
Expected Output
Observed Output
Further Insight
surprisingly works. also,
works
The behavior changed between pandas 0.18.0 and 0.18.1. We suspect changes made in #12900 to be causing this.
Note that the difference between the skiprows value that works (2) and the one that doesn't (5) is the same as the number of lines in the file between the quote chars.
Apologies for the noise if this has already been reported or is being addressed.
Output of pd.show_versions()
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None