Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674

Closed
wants to merge 1 commit into from

Conversation

GGordonGordon
Copy link
Contributor

fixes #14671

@jreback
Copy link
Contributor

jreback commented Nov 17, 2016

  • needs tests
  • add whatsnew note in 0.19.2

- Updated tests
- Updated whatsnew 0.19.2 note
- Added new parameter file_header for CParserWrapper to contain
the original header read from the file for comparison
@codecov-io
Copy link

codecov-io commented Nov 18, 2016

Current coverage is 85.29% (diff: 100%)

Merging #14674 into master will increase coverage by 0.51%

@@             master     #14674   diff @@
==========================================
  Files           145        140     -5   
  Lines         51090      50723   -367   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43315      43263    -52   
+ Misses         7775       7460   -315   
  Partials          0          0          

Powered by Codecov. Last update 0d3ecfa...0b15cd1

@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

can you rebase

1 similar comment
@jreback
Copy link
Contributor

jreback commented Dec 26, 2016

can you rebase

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted in the issue, this PR is certainly still welcome!
Added some comments.

@@ -32,6 +32,7 @@ Bug Fixes



- Bug in pd.read_csv - catch missing columns if usecols and header lengths match (:issue:`14671`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move this to 0.20.0.txt?

Copy link
Member

@gfyoung gfyoung Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. pd.read_csv() instead of pd.read_csv
  2. Let's generalize this a little. This PR is not actually about handling when header and usecols length match. It's about properly handling situations when usecols provides non-existent columns.

usecol_len = len(set(self.usecols) - set(h))
usecoli_len = len(set(self.usecols) - set(range(0, len(h))))
if usecol_len > 0 and usecoli_len > 0:
raise ValueError("Usecols do not match names.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those code block seems a bit convoluted. Can this be simplified?
Is there a reason you need _reader.file_header. Aren't those already in names?

Possibly this would be easier if this check is performed inside _filter_usecols.
Then we could check at the same time for out of bounds integer usecols (eg usecols=[0,2] if there are only 2 columns)

Copy link
Contributor Author

@GGordonGordon GGordonGordon Jan 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell the way the C parser works is that it will populate the names from the names value passed into the TextReader class. So if I wanted to read columns 'A' and 'C' from a CSV but rename them as 'Category' and 'Total' (names parameter) usecols compares against the names parameter

Another example (which happens with the pandas.io.tests.parser.test_parsers.TestCParserLowMemory test):

        s = """0,1,20140101,0900,4
        0,1,20140102,1000,4"""
        parse_dates = [[1, 2]]
        names = list('acd')

        df = self.read_csv(StringIO(s), names=names,
                           usecols=[0, 2, 3],
                           parse_dates=parse_dates)

Fails because index 3 doesn't exist in the names list passed in (it's only 0-2)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify my concern here is how should the usecols function if the names parameter is used? The documentation states:
All elements in this array must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s)

Is that to mean that if the names parameter is passed in then usecols filters on those names or does it mean that I should be able to pass in values from either the headers or the names parameter? If it's the former then perhaps I can update the documentation to state "name (if supplied)" under the usecols argument. If it's the latter then how are to know which list to filter based on an integer index? The header or the names?

The other question is should the usecols be applied / filtered first before the names are applied to the columns?

@@ -54,6 +54,10 @@ def test_usecols(self):
expected.columns = ['foo', 'bar']
tm.assert_frame_equal(result, expected)

# same length but usecols column doesn't exist - see gh-14671
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this in a separate test (eg call it test_usecols_non_existing)? And also add an explicit test case for the 'normal' case where the length does not match the number of columns (which already raises), and tests for both cases but using integers.


#if self.usecols is not None:
# if len(set(self.usecols) - set(header[0])) > 0 and len(set(self.usecols) - set(range(0,field_count))) > 0:
# raise ValueError("Usecols do not match names.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clean this up?

@gfyoung
Copy link
Member

gfyoung commented Jan 4, 2017

@GGordonGordon :

  1. Rebase onto master - your patch will need to be changed for this be merge-able.

  2. I agree with @jorisvandenbossche that you do not need self._reader.header. You can use self.names AFAICT, so apply your current logic with self.names. The reason is that self.names should IINM contain the column names of the final output. Thus, you should be able to use that regardless of whether you read the header OR provide names.

  3. Note that the Python parser does not raise if you provide out of bounds indices:

>>> data = 'a,b\n1,2'
>>> read_csv(StringIO(data), usecols=[0, 2], engine='python')
  a
0 1

Given the scope of your patch (which applies to both out-of-bounds indices and non-existent names), you will have to also make sure the Python engine raises similarly in this situation for consistency.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

pls comment if youd like to update / rebase and continue

@jreback jreback closed this Mar 20, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ERR: usecols fails to raise error if column doesn't exist but is the same length as headers
5 participants