raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674

GGordonGordon · 2016-11-17T01:44:37Z

jreback · 2016-11-17T12:34:03Z

needs tests
add whatsnew note in 0.19.2

- Updated tests
- Updated whatsnew 0.19.2 note
- Added new parameter file_header for CParserWrapper to contain the original header read from the file for comparison

codecov-io · 2016-11-18T19:03:56Z

Current coverage is 85.29% (diff: 100%)

Merging #14674 into master will increase coverage by 0.51%

@@             master     #14674   diff @@
==========================================
  Files           145        140     -5   
  Lines         51090      50723   -367   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43315      43263    -52   
+ Misses         7775       7460   -315   
  Partials          0          0

Powered by Codecov. Last update 0d3ecfa...0b15cd1

jreback · 2016-12-16T23:50:54Z

can you rebase

jreback · 2016-12-26T21:51:45Z

can you rebase

jorisvandenbossche

As I noted in the issue, this PR is certainly still welcome!
Added some comments.

jorisvandenbossche · 2017-01-02T09:42:52Z

doc/source/whatsnew/v0.19.2.txt

@@ -32,6 +32,7 @@ Bug Fixes



+- Bug in pd.read_csv - catch missing columns if usecols and header lengths match (:issue:`14671`)


Can you move this to 0.20.0.txt?

pd.read_csv() instead of pd.read_csv

Let's generalize this a little. This PR is not actually about handling when header and usecols length match. It's about properly handling situations when usecols provides non-existent columns.

jorisvandenbossche · 2017-01-02T11:27:37Z

pandas/io/parsers.py

+                usecol_len = len(set(self.usecols) - set(h))
+                usecoli_len = len(set(self.usecols) - set(range(0, len(h))))
+                if usecol_len > 0 and usecoli_len > 0:
+                    raise ValueError("Usecols do not match names.")


This code block seems a bit convoluted. Can it be simplified?
Is there a reason you need _reader.file_header? Aren't those already in names?

Possibly this would be easier if this check is performed inside _filter_usecols.
Then we could check at the same time for out of bounds integer usecols (eg usecols=[0,2] if there are only 2 columns)
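One way the simplification suggested here could look, as a minimal sketch and not the pandas implementation: a single check that handles non-existent labels and out-of-bounds integer positions together. The helper name check_usecols and its error wording are hypothetical, and names stands for the final list of column names.

```python
def check_usecols(usecols, names):
    """Raise ValueError if usecols refers to columns that do not exist.

    Hypothetical helper for illustration only; not part of pandas.
    """
    usecols = list(usecols)
    if all(isinstance(c, int) for c in usecols):
        # Positional selection: every index must fall inside the column range.
        missing = sorted(c for c in usecols if not 0 <= c < len(names))
    else:
        # Label selection: every label must be one of the known column names.
        known = set(names)
        missing = [c for c in usecols if c not in known]
    if missing:
        raise ValueError(f"Usecols do not match names: {missing}")


check_usecols(["a", "b"], ["a", "b"])  # passes silently
check_usecols([0, 1], ["a", "b"])      # passes silently
```

With this shape, usecols=['a', 'c'] against names ['a', 'b'] and usecols=[0, 2] against two columns would both raise, which covers the same-length case from gh-14671 without comparing lengths at all.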

From what I can tell, the C parser populates the names from the names value passed into the TextReader class. So if I wanted to read columns 'A' and 'C' from a CSV but rename them to 'Category' and 'Total' (via the names parameter), usecols would be compared against the names parameter.

Another example (which happens with the pandas.io.tests.parser.test_parsers.TestCParserLowMemory test):

s = """0,1,20140101,0900,4
0,1,20140102,1000,4"""
parse_dates = [[1, 2]]
names = list('acd')
df = self.read_csv(StringIO(s), names=names, usecols=[0, 2, 3], parse_dates=parse_dates)

This fails because index 3 doesn't exist in the names list passed in (its indices are only 0-2).
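The mismatch can be reproduced with plain set arithmetic, independent of pandas. This is only an illustration of the comparison, with variable names chosen here for the example: the same usecols is out of range against the three supplied names but in range against the five fields in each data row.

```python
names = list("acd")   # 3 names supplied by the caller
usecols = [0, 2, 3]   # positions into the 5-field data rows
n_file_columns = len("0,1,20140101,0900,4".split(","))

# Positions that have no counterpart in each candidate reference list.
out_of_range_for_names = sorted(set(usecols) - set(range(len(names))))
out_of_range_for_file = sorted(set(usecols) - set(range(n_file_columns)))

print(out_of_range_for_names)  # [3]
print(out_of_range_for_file)   # []
```

Which of the two reference lists an integer usecols should be validated against is exactly the open question in this thread.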

cc @gfyoung

To clarify, my concern is how usecols should function when the names parameter is used. The documentation states:
All elements in this array must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s)

Does that mean that if the names parameter is passed in, usecols filters on those names, or does it mean that I should be able to pass in values from either the headers or the names parameter? If it's the former, then perhaps I can update the documentation to state "name (if supplied)" under the usecols argument. If it's the latter, then how are we to know which list to filter on for an integer index: the header or the names?

The other question is whether usecols should be applied (filtered) first, before the names are applied to the columns.

jorisvandenbossche · 2017-01-02T11:31:58Z

pandas/io/tests/parser/usecols.py

@@ -54,6 +54,10 @@ def test_usecols(self):
        expected.columns = ['foo', 'bar']
        tm.assert_frame_equal(result, expected)

+        # same length but usecols column doesn't exist - see gh-14671


Can you put this in a separate test (eg call it test_usecols_non_existing)? And also add an explicit test case for the 'normal' case where the length does not match the number of columns (which already raises), and tests for both cases but using integers.

jorisvandenbossche · 2017-01-02T11:32:23Z

pandas/parser.pyx

+
+            #if self.usecols is not None:
+            #    if len(set(self.usecols) - set(header[0])) > 0 and len(set(self.usecols) - set(range(0,field_count))) > 0:
+            #        raise ValueError("Usecols do not match names.")


Can you clean this up?

gfyoung · 2017-01-04T19:09:15Z

@GGordonGordon :

Rebase onto master - your patch will need to be changed for this to be mergeable.
I agree with @jorisvandenbossche that you do not need self._reader.header. You can use self.names AFAICT, so apply your current logic with self.names. The reason is that self.names should IINM contain the column names of the final output. Thus, you should be able to use that regardless of whether you read the header OR provide names.
Note that the Python parser does not raise if you provide out of bounds indices:

>>> data = 'a,b\n1,2'
>>> read_csv(StringIO(data), usecols=[0, 2], engine='python')
   a
0  1

Given the scope of your patch (which applies to both out-of-bounds indices and non-existent names), you will have to also make sure the Python engine raises similarly in this situation for consistency.

jreback · 2017-03-20T13:41:25Z

pls comment if you'd like to update / rebase and continue

GGordonGordon mentioned this pull request Nov 17, 2016

ERR: usecols fails to raise error if column doesn't exist but is the same length as headers #14671

Closed

jreback added the Error Reporting (Incorrect or improved errors from pandas) and IO CSV (read_csv, to_csv) labels Nov 17, 2016

GH14671 - ERR: Raise ValueError if usecol doesn't exist with same len

a985129

GGordonGordon force-pushed the fix/14671 branch from 0b15cd1 to a985129 Compare December 30, 2016 01:52

jorisvandenbossche reviewed Jan 2, 2017

jreback closed this Mar 20, 2017
