Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582
Conversation
codecov-io
commented
Nov 4, 2016
•
Current coverage is 85.22% (diff: 100%)@@ master #14582 diff @@
==========================================
Files 143 143
Lines 50804 50807 +3
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 43292 43298 +6
+ Misses 7512 7509 -3
Partials 0 0
|
sinhrks
added Bug IO CSV
labels
Nov 7, 2016
|
Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being. |
|
@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns. |
| @@ -836,7 +836,8 @@ def test_integer_overflow_bug(self): | ||
| result = self.read_csv(StringIO(data), header=None, sep=' ') | ||
| self.assertTrue(result[0].dtype == np.float64) | ||
| - result = self.read_csv(StringIO(data), header=None, sep='\s+') | ||
| + result = self.read_csv(StringIO(data), header=None, |
jreback
Nov 15, 2016
Contributor
why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change
gfyoung
Nov 15, 2016
•
Member
But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+" as you can see here. That's why all of these tests have to change.
gfyoung
Nov 15, 2016
•
Member
My change causes an error to be raised whenever the sep parameter is regex and quoting != csv.QUOTE_NONE for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE.
| @@ -801,6 +803,15 @@ def _clean_options(self, options, engine): | ||
| " different from '\s+' are"\ | ||
| " interpreted as regex)" | ||
| engine = 'python' | ||
| + elif quoting != csv.QUOTE_NONE: |
jreback
Nov 15, 2016
Contributor
\s+ IS delim_whitespace, you shouldn't have to change all of these tests
jreback
Nov 15, 2016
Contributor
my point is that \s+ is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace kw. So quoting is properly handled. Not sure this should be different in the python parser.
gfyoung
Nov 15, 2016
•
Member
Unfortunately, it is very different. Python's native csv library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True just translates to delimiter="\s+" for the Python parser.
|
I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes? Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields. |
|
@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue. There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. UPDATE: Not feasible because of what I provided above. |
|
@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-)) As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):
this will raise an error now (so breaking code). To resolve this error, the user needs to add In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix). |
|
@jorisvandenbossche : Fair enough. Updated my description. |
|
@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be) |
|
@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts? |
|
@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error? |
jorisvandenbossche
added this to the
0.20.0
milestone
Nov 23, 2016
|
@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT. |
so maybe just say that in the error message. (IOW say you can pass I think forcing people to pass |
|
@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in |
|
Hmm, we should try to make a decision here. Trying to summarize the options:
@gfyoung @jreback Thoughts? |
gfyoung
changed the title from
BUG: Disable multichar/regex sep for Python engine in read_csv to BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine
Nov 25, 2016
|
@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion. |
| @@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content): | ||
| msg = ('Expected %d fields in line %d, saw %d' % | ||
| (col_len, row_num + 1, zip_len)) | ||
| + if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE: |
jorisvandenbossche
Nov 25, 2016
Owner
Is the second part of the check needed (self.quoting != csv.QUOTE_NONE). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.
gfyoung
Nov 25, 2016
•
Member
@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.
jorisvandenbossche
Nov 25, 2016
Owner
Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK
|
lgtm. |
jorisvandenbossche
merged commit d8e427b
into pandas-dev:master
Nov 25, 2016
|
@gfyoung thanks! |
gfyoung
deleted the
gfyoung:multichar-regex-sep-python branch
Nov 25, 2016
amolkahat
added a commit
to amolkahat/pandas
that referenced
this pull request
Nov 26, 2016
|
|
gfyoung + amolkahat |
b6fc56a
|
jorisvandenbossche
modified the milestone: 0.19.2, 0.20.0
Dec 14, 2016
jorisvandenbossche
added a commit
that referenced
this pull request
Dec 15, 2016
|
|
gfyoung + jorisvandenbossche |
4a4bbac
|
gfyoung commentedNov 4, 2016
•
edited
If there is a field counts mismatch, check whether a multi-char separator was used in conjunction with quotes. Currently, that setup is not respected for the Python engine and can result in improper line breaks. This PR appends that to the error message if that is a possible explanation.
Closes #13374.