New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine #14582

Merged
merged 1 commit into from Nov 25, 2016

Conversation

Projects
None yet
5 participants
@gfyoung
Member

gfyoung commented Nov 4, 2016

If there is a field counts mismatch, check whether a multi-char separator was used in conjunction with quotes. Currently, that setup is not respected for the Python engine and can result in improper line breaks. This PR appends that to the error message if that is a possible explanation.

Closes #13374.

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Nov 4, 2016

Current coverage is 85.22% (diff: 100%)

Merging #14582 into master will increase coverage by <.01%

@@             master     #14582   diff @@
==========================================
  Files           143        143          
  Lines         50804      50807     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43292      43298     +6   
+ Misses         7512       7509     -3   
  Partials          0          0          

Powered by Codecov. Last update 75b606a...dda588b

codecov-io commented Nov 4, 2016

Current coverage is 85.22% (diff: 100%)

Merging #14582 into master will increase coverage by <.01%

@@             master     #14582   diff @@
==========================================
  Files           143        143          
  Lines         50804      50807     +3   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43292      43298     +6   
+ Misses         7512       7509     -3   
  Partials          0          0          

Powered by Codecov. Last update 75b606a...dda588b

@sinhrks sinhrks added Bug IO CSV labels Nov 7, 2016

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 10, 2016

Member

Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being.

Member

gfyoung commented Nov 10, 2016

Getting some weird segfaults that seem unrelated to this PR (also in my other one)...disabling testing for the time being.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 10, 2016

Member

@gfyoung yes, the segfaults are not your fault: #14621

Member

jorisvandenbossche commented Nov 10, 2016

@gfyoung yes, the segfaults are not your fault: #14621

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 13, 2016

Member

@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns.

Member

gfyoung commented Nov 13, 2016

@jorisvandenbossche , @jreback : No more segfaulting, and everything looks green. Ready to merge if there are no concerns.

Show outdated Hide outdated pandas/io/tests/parser/common.py
@@ -836,7 +836,8 @@ def test_integer_overflow_bug(self):
result = self.read_csv(StringIO(data), header=None, sep=' ')
self.assertTrue(result[0].dtype == np.float64)
result = self.read_csv(StringIO(data), header=None, sep='\s+')
result = self.read_csv(StringIO(data), header=None,

This comment has been minimized.

@jreback

jreback Nov 15, 2016

Contributor

why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change

@jreback

jreback Nov 15, 2016

Contributor

why do you need to change these tests? this is translated to the c-engine with delim whitespace=True so it shouldn't hit your change

This comment has been minimized.

@gfyoung

gfyoung Nov 15, 2016

Member

But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+" as you can see here. That's why all of these tests have to change.

@gfyoung

gfyoung Nov 15, 2016

Member

But this has nothing to do with the C engine. This has to do with the Python engine. You still do a buggy regex split even with delimiter="\s+" as you can see here. That's why all of these tests have to change.

This comment has been minimized.

@jreback

jreback Nov 15, 2016

Contributor

and why the difference? that is a major api change.

@jreback

jreback Nov 15, 2016

Contributor

and why the difference? that is a major api change.

This comment has been minimized.

@gfyoung

gfyoung Nov 15, 2016

Member

My change causes an error to be raised whenever the sep parameter is regex and quoting != csv.QUOTE_NONE for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE.

@gfyoung

gfyoung Nov 15, 2016

Member

My change causes an error to be raised whenever the sep parameter is regex and quoting != csv.QUOTE_NONE for the Python engine. This test will break as a result for the Python engine without specifying that quoting=csv.QUOTE_NONE.

Show outdated Hide outdated pandas/io/parsers.py
@@ -801,6 +803,15 @@ def _clean_options(self, options, engine):
" different from '\s+' are"\
" interpreted as regex)"
engine = 'python'
elif quoting != csv.QUOTE_NONE:

This comment has been minimized.

@jreback

jreback Nov 15, 2016

Contributor

\s+ IS delim_whitespace, you shouldn't have to change all of these tests

@jreback

jreback Nov 15, 2016

Contributor

\s+ IS delim_whitespace, you shouldn't have to change all of these tests

This comment has been minimized.

@gfyoung

gfyoung Nov 15, 2016

Member

Not sure why you commented this.

@gfyoung

gfyoung Nov 15, 2016

Member

Not sure why you commented this.

This comment has been minimized.

@jreback

jreback Nov 15, 2016

Contributor

my point is that \s+ is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace kw. So quoting is properly handled. Not sure this should be different in the python parser.

@jreback

jreback Nov 15, 2016

Contributor

my point is that \s+ is NOT a multi-char separator to the c-engine, it is instead passed as the delim_whitespace kw. So quoting is properly handled. Not sure this should be different in the python parser.

This comment has been minimized.

@gfyoung

gfyoung Nov 15, 2016

Member

Unfortunately, it is very different. Python's native csv library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True just translates to delimiter="\s+" for the Python parser.

@gfyoung

gfyoung Nov 15, 2016

Member

Unfortunately, it is very different. Python's native csv library does not accept regex, so we have to do that buggy regex split. delim_whitespace=True just translates to delimiter="\s+" for the Python parser.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 15, 2016

Member

I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes?

Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields.

Member

jorisvandenbossche commented Nov 15, 2016

I am not sure I am convinced this change is needed. @gfyoung Can you provide a bit more rationale in the top post and examples of how you to change your code or how the result changes?

Reason that I am not fully convinced is that you are disallowing it for all cases where there are quotes in the csv file, while there is only a problem when the delimiter is present in the quoted fields.

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 15, 2016

Member

@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue.

There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. All tests that I modified perhaps could be left unchanged if I change some of the logic inside the parser but will need to check.

UPDATE: Not feasible because of what I provided above.

Member

gfyoung commented Nov 15, 2016

@jorisvandenbossche , @jreback : Maybe the delimiter is not be present in the quoted fields, but how are we to check this without compromising performance in Python? It is easier to insist upon passing in a parameter without raising an error than doing such a check. This change is necessary then to patch the bug I showed in the related issue.

There is no real behaviour change, except for an error being raised depending on what separator is being passed in with the Python engine. All tests that I modified perhaps could be left unchanged if I change some of the logic inside the parser but will need to check.

UPDATE: Not feasible because of what I provided above.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 15, 2016

Member

@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-))

As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):

pd.read_csv(file, sep='\s+', engine='python')

this will raise an error now (so breaking code). To resolve this error, the user needs to add quoting=csv.QUOTE_NONE (as this is not the default).
Is this correct?

In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix).

Member

jorisvandenbossche commented Nov 15, 2016

@gfyoung It's quite possible your proposed change is the sensible thing to do, but I think you need to try to explain your changes and its consequences a bit better (in this case "Title is self-explanatory" is not really true .. :-))

As I understand it now, the consequence of your proposed change is that when somebody now does (on a file that does not necessarily contain quotes):

pd.read_csv(file, sep='\s+', engine='python')

this will raise an error now (so breaking code). To resolve this error, the user needs to add quoting=csv.QUOTE_NONE (as this is not the default).
Is this correct?

In this case, for me this is a trade-off between having to provide an extra (unneeded) keyword in the simple case (no quotes in file) vs having a wrong error message in case of quotes + sep present in quotes (current behaviour that you are trying to fix).

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 18, 2016

Member

@jorisvandenbossche : Fair enough. Updated my description.

Member

gfyoung commented Nov 18, 2016

@jorisvandenbossche : Fair enough. Updated my description.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 18, 2016

Member

@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be)
That will already solve the trade-off for a lot of usecases I think

Member

jorisvandenbossche commented Nov 18, 2016

@gfyoung Would it be possible to only raise the error if there are actually quotes in the data? (not sure how performant such a check would be)
That will already solve the trade-off for a lot of usecases I think

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 18, 2016

Member

@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts?

Member

gfyoung commented Nov 18, 2016

@jorisvandenbossche : That's going to be tricky because we wrap all IO objects as a stream. Unless you want to check every line when reading, I don't know of any other way to do it at the moment. Thoughts?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error?

Contributor

jreback commented Nov 23, 2016

@gfyoung why can't you just happily parse, then if the error happens, check the quoting paramater and raise the correct error?

@jorisvandenbossche jorisvandenbossche added this to the 0.20.0 milestone Nov 23, 2016

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

Member

gfyoung commented Nov 23, 2016

@jreback : Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 23, 2016

Contributor

@gfyoung

Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

so maybe just say that in the error message. (IOW say you can pass quote.NONE to verify if no error.

I think forcing people to pass quote.NONE is pretty painful just to have a more general error message.

Contributor

jreback commented Nov 23, 2016

@gfyoung

Because how can you tell if it's a legitimate error in the CSV file or incorrect parsing of quotation marks? There isn't really a way to differentiate AFAICT.

so maybe just say that in the error message. (IOW say you can pass quote.NONE to verify if no error.

I think forcing people to pass quote.NONE is pretty painful just to have a more general error message.

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 23, 2016

Member

@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in csv.None if you're separator is only one character, so it shouldn't inconvenience people in most cases.

Member

gfyoung commented Nov 23, 2016

@jreback : Well, parsing regex separators is painful on our end. You don't have to pass in csv.None if you're separator is only one character, so it shouldn't inconvenience people in most cases.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Nov 24, 2016

Member

Hmm, we should try to make a decision here. Trying to summarize the options:

  • Current situation: you get an error when having sep inside quoted field. From the error message this is not directly clear
    • Could we possibly do a check when such an error message occurs if there are quotes, and then add to the message: "possibly this is due to quotes that are ignored when having a multi-char sep"? (which is what @jreback proposed?) Or just adding it to the message without checking.
  • PR: force people to use quoting=csv.QUOTE_NONE with the idea of making them aware of the possible problem with quotes.
    • The reason I am not really a fan: a) this also forces the user to use this when there is no potential problem (eg no quotes, or no sep chars inside the quotes), ànd b) this will still give you the same error message when you have sep chars inside quotes but do pass csv.QUOTE_NONE as the error message says to do
  • Checking if there are quotes in the line, and only raising the error in that case
    • Disadvantages: possibly expensive performance wise + still forces users to use quoting when they have quotes without sep chars inside + makes implementation more complex for something that is probably not worth it.
  • ...

@gfyoung @jreback Thoughts?
Personally I would go for the first: leaving the current situation but improving the error message.

Member

jorisvandenbossche commented Nov 24, 2016

Hmm, we should try to make a decision here. Trying to summarize the options:

  • Current situation: you get an error when having sep inside quoted field. From the error message this is not directly clear
    • Could we possibly do a check when such an error message occurs if there are quotes, and then add to the message: "possibly this is due to quotes that are ignored when having a multi-char sep"? (which is what @jreback proposed?) Or just adding it to the message without checking.
  • PR: force people to use quoting=csv.QUOTE_NONE with the idea of making them aware of the possible problem with quotes.
    • The reason I am not really a fan: a) this also forces the user to use this when there is no potential problem (eg no quotes, or no sep chars inside the quotes), ànd b) this will still give you the same error message when you have sep chars inside quotes but do pass csv.QUOTE_NONE as the error message says to do
  • Checking if there are quotes in the line, and only raising the error in that case
    • Disadvantages: possibly expensive performance wise + still forces users to use quoting when they have quotes without sep chars inside + makes implementation more complex for something that is probably not worth it.
  • ...

@gfyoung @jreback Thoughts?
Personally I would go for the first: leaving the current situation but improving the error message.

@gfyoung gfyoung changed the title from BUG: Disable multichar/regex sep for Python engine in read_csv to BUG: Improve Error Message for Multi-Char Sep + Quotes in Python Engine Nov 25, 2016

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 25, 2016

Member

@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion.

Member

gfyoung commented Nov 25, 2016

@jorisvandenbossche : You do have a point that the error message clarifies a lot what the issue is without necessarily inconveniencing the user. Changed PR to follow the first suggestion.

BUG: Improve error message for multi-char sep and quotes in Python en…
…gine

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.
@@ -2515,6 +2515,11 @@ def _rows_to_cols(self, content):
msg = ('Expected %d fields in line %d, saw %d' %
(col_len, row_num + 1, zip_len))
if len(self.delimiter) > 1 and self.quoting != csv.QUOTE_NONE:

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Nov 25, 2016

Member

Is the second part of the check needed (self.quoting != csv.QUOTE_NONE). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.

@jorisvandenbossche

jorisvandenbossche Nov 25, 2016

Member

Is the second part of the check needed (self.quoting != csv.QUOTE_NONE). Because AFAIU also when you do pass this, quotes are still be ignored by the regex expression to split the line, and you can still have this problem.

This comment has been minimized.

@gfyoung

gfyoung Nov 25, 2016

Member

@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.

@gfyoung

gfyoung Nov 25, 2016

Member

@jorisvandenbossche : If quoting=csv.QUOTE_NONE, all quotation marks are treated as data, so that's the user's fault, not ours. That's why the check is necessary.

This comment has been minimized.

@jorisvandenbossche

jorisvandenbossche Nov 25, 2016

Member

Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK

@jorisvandenbossche

jorisvandenbossche Nov 25, 2016

Member

Although it's a user error, you can still run into this problem so the message can still be useful I think. But I do see your point, so OK

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 25, 2016

Contributor

lgtm.

Contributor

jreback commented Nov 25, 2016

lgtm.

@jorisvandenbossche jorisvandenbossche merged commit d8e427b into pandas-dev:master Nov 25, 2016

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche
Member

jorisvandenbossche commented Nov 25, 2016

@gfyoung thanks!

@gfyoung gfyoung deleted the gfyoung:multichar-regex-sep-python branch Nov 25, 2016

amolkahat added a commit to amolkahat/pandas that referenced this pull request Nov 26, 2016

BUG: Improve error message for multi-char sep and quotes in Python en…
…gine (#14582)

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.2, 0.20.0 Dec 14, 2016

jorisvandenbossche added a commit that referenced this pull request Dec 15, 2016

BUG: Improve error message for multi-char sep and quotes in Python en…
…gine (#14582)

If there is a field counts mismatch, check whether
a multi-char sep was used in conjunction with quotes.
Currently, that setup is not respected and can result
in improper line breaks.

Closes gh-13374.
(cherry picked from commit d8e427b)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment