ERR: usecols fails to raise error if column doesn't exist but is the same length as headers #14671

Closed
GGordonGordon opened this Issue Nov 16, 2016 · 17 comments

Comments

Projects
None yet
7 participants
@GGordonGordon
Contributor

GGordonGordon commented Nov 16, 2016

A small, complete example of the issue

read_csv doesn't raise a value error if the use cols contains the same number of values as the headers

simple test_csv.txt file contains:
A,B
1,2
3,4

It currently will set the other column as NaN values:
A B
0 1 NaN
1 3 NaN

import pandas as pd
file = pd.read_csv('data/test.csv',usecols=['A','C']) # Column C doesn't exist
file

Expected Output

ValueError("Usecols do not match names.") exception

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.2.final.0 python-bits: 64 OS: Darwin OS-release: 16.1.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1

@jreback jreback added this to the Next Major Release milestone Nov 16, 2016

@jreback jreback changed the title from usecols fails to raise error if column doesn't exist but is the same length as headers to ERR: usecols fails to raise error if column doesn't exist but is the same length as headers Nov 16, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 16, 2016

Contributor

yeah this seems reasonable.

Contributor

jreback commented Nov 16, 2016

yeah this seems reasonable.

@aileronajay

This comment has been minimized.

Show comment
Hide comment
@aileronajay

aileronajay Nov 16, 2016

Contributor

@GGordonGordon will you be creating a PR for this?

Contributor

aileronajay commented Nov 16, 2016

@GGordonGordon will you be creating a PR for this?

@GGordonGordon

This comment has been minimized.

Show comment
Hide comment
@GGordonGordon

GGordonGordon Nov 17, 2016

Contributor

@aileronajay yeah fairly soon

Contributor

GGordonGordon commented Nov 17, 2016

@aileronajay yeah fairly soon

@GGordonGordon

This comment has been minimized.

Show comment
Hide comment
@GGordonGordon

GGordonGordon Nov 17, 2016

Contributor

Pull request: #14674

Contributor

GGordonGordon commented Nov 17, 2016

Pull request: #14674

@GGordonGordon

This comment has been minimized.

Show comment
Hide comment
@GGordonGordon

GGordonGordon Nov 17, 2016

Contributor

For clarification should if usecols is a list of integers should it refer to the columns within the original data table (appears this is what the current documentation states) or should they refer as an index to the names list (appears to be how the code currently operates)?

Contributor

GGordonGordon commented Nov 17, 2016

For clarification should if usecols is a list of integers should it refer to the columns within the original data table (appears this is what the current documentation states) or should they refer as an index to the names list (appears to be how the code currently operates)?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Nov 17, 2016

Contributor

@GGordonGordon the documentation is correctly. you would have to show an example where this is not true.

Contributor

jreback commented Nov 17, 2016

@GGordonGordon the documentation is correctly. you would have to show an example where this is not true.

@GGordonGordon

This comment has been minimized.

Show comment
Hide comment
@GGordonGordon

GGordonGordon Jan 2, 2017

Contributor

This appears to be resolved by #14234 and this is no longer needed.

Contributor

GGordonGordon commented Jan 2, 2017

This appears to be resolved by #14234 and this is no longer needed.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Jan 2, 2017

Member

Using master (so with #14234 merged), I still see the buggy behaviour:

In [12]: s = """A,B
    ...: 1,2
    ...: 3,4"""

In [13]: pd.read_csv(StringIO(s))
Out[13]: 
   A  B
0  1  2
1  3  4

In [15]: pd.read_csv(StringIO(s), usecols=["A", "C"])
Out[15]: 
   A    B
0  1  NaN
1  3  NaN

So it would be welcome if you could update the PR.

Member

jorisvandenbossche commented Jan 2, 2017

Using master (so with #14234 merged), I still see the buggy behaviour:

In [12]: s = """A,B
    ...: 1,2
    ...: 3,4"""

In [13]: pd.read_csv(StringIO(s))
Out[13]: 
   A  B
0  1  2
1  3  4

In [15]: pd.read_csv(StringIO(s), usecols=["A", "C"])
Out[15]: 
   A    B
0  1  NaN
1  3  NaN

So it would be welcome if you could update the PR.

@bpraggastis

This comment has been minimized.

Show comment
Hide comment
@bpraggastis

bpraggastis May 22, 2017

Contributor

Is anyone working on this now. Just picked it up at #pandas sprint at #PyCon2017 ?

Contributor

bpraggastis commented May 22, 2017

Is anyone working on this now. Just picked it up at #pandas sprint at #PyCon2017 ?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche May 22, 2017

Member

@bpraggastis Note there is a closed PR #14674 that started this work, but never was finished. You can have a look at that, and see whether it would be useful to start from that work (use the changes in that PR)

Member

jorisvandenbossche commented May 22, 2017

@bpraggastis Note there is a closed PR #14674 that started this work, but never was finished. You can have a look at that, and see whether it would be useful to start from that work (use the changes in that PR)

bpraggastis pushed a commit to bpraggastis/pandas that referenced this issue May 23, 2017

bpraggastis pushed a commit to bpraggastis/pandas that referenced this issue May 23, 2017

tests added for gh-14671, expected behavior of simultaneous use of us…
…ecols and names unclear so these tests are commented out

bpraggastis pushed a commit to bpraggastis/pandas that referenced this issue May 23, 2017

BUG code and tests added for gh-14671, expected behavior of simultane…
…ous use of usecols and names unclear so some tests are commented out
@bpraggastis

This comment has been minimized.

Show comment
Hide comment
@bpraggastis

bpraggastis May 23, 2017

Contributor

The approach originally taken (#14674) caused the test suite to fail. I used a different approach focusing only on the case when usecols_dtype == 'string'. In this case it was sufficient to check that usecols is a subset of names. I commited to branch gh-14671.

There is another case that is not covered that I would like clarification on. If header=0 and names are provided to replace the column headers in the data, then usecols should be able to reference individual columns by index or by the name assigned in names. I verified this works in practice:
image

When I wrote tests for this the tests failed due to a conditional in parsers.py: _infer_columns(self): on code lines:

if ((self.usecols is not None and
                     len(names) != len(self.usecols)) or
                    (self.usecols is None and
                     len(names) != len(columns[0]))):
                    raise ValueError('Number of passed names did not match '
                                     'number of header fields in the file')

The implication from this code is that if usecols is provided then its length must match len(names).
I am not sure if my BUG fix is adequate as the additional tests I wrote won't pass.

Contributor

bpraggastis commented May 23, 2017

The approach originally taken (#14674) caused the test suite to fail. I used a different approach focusing only on the case when usecols_dtype == 'string'. In this case it was sufficient to check that usecols is a subset of names. I commited to branch gh-14671.

There is another case that is not covered that I would like clarification on. If header=0 and names are provided to replace the column headers in the data, then usecols should be able to reference individual columns by index or by the name assigned in names. I verified this works in practice:
image

When I wrote tests for this the tests failed due to a conditional in parsers.py: _infer_columns(self): on code lines:

if ((self.usecols is not None and
                     len(names) != len(self.usecols)) or
                    (self.usecols is None and
                     len(names) != len(columns[0]))):
                    raise ValueError('Number of passed names did not match '
                                     'number of header fields in the file')

The implication from this code is that if usecols is provided then its length must match len(names).
I am not sure if my BUG fix is adequate as the additional tests I wrote won't pass.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung May 23, 2017

Member

@jorisvandenbossche : I too didn't realize this issue wasn't handled in full. Thanks for the CC!

@bpraggastis : Mind opening up a PR for this? Would make it a little easier to view changes and see whether they are merge-able.

Also, FYI, whenever you write "gh-xxxx" in the comment box, GitHub creates a link to the corresponding issue or PR in the repository. If you meant to point to your branch, then you will need to do:

<a href="URL-to-your-branch">gh-xxxx</a>
Member

gfyoung commented May 23, 2017

@jorisvandenbossche : I too didn't realize this issue wasn't handled in full. Thanks for the CC!

@bpraggastis : Mind opening up a PR for this? Would make it a little easier to view changes and see whether they are merge-able.

Also, FYI, whenever you write "gh-xxxx" in the comment box, GitHub creates a link to the corresponding issue or PR in the repository. If you meant to point to your branch, then you will need to do:

<a href="URL-to-your-branch">gh-xxxx</a>
@bpraggastis

This comment has been minimized.

Show comment
Hide comment
@bpraggastis

bpraggastis May 23, 2017

Contributor

@jorisvandenbossche @gfyoung : I created a PR and left the tests referencing the problem I mentioned commented out for future use (not sure if that is the correct protocol). Leaving PyCon now but would like to continue working on this if you open another issue addressing the new bug.

Contributor

bpraggastis commented May 23, 2017

@jorisvandenbossche @gfyoung : I created a PR and left the tests referencing the problem I mentioned commented out for future use (not sure if that is the correct protocol). Leaving PyCon now but would like to continue working on this if you open another issue addressing the new bug.

@TomAugspurger

This comment has been minimized.

Show comment
Hide comment
@TomAugspurger

TomAugspurger May 23, 2017

Contributor

I mentioned commented out for future use (not sure if that is the correct protocol).

we typically mark it with an xfail for expected failure, so that it gets reported

@pytest.mark.xfail('this will fail')
def test_fail():
    assert False
Contributor

TomAugspurger commented May 23, 2017

I mentioned commented out for future use (not sure if that is the correct protocol).

we typically mark it with an xfail for expected failure, so that it gets reported

@pytest.mark.xfail('this will fail')
def test_fail():
    assert False
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung May 23, 2017

Member

we typically mark it with an xfail for expected failure, so that it gets reported

Right, but let's see if we can patch them first before marking. What you just found @bpraggastis is a bug with the Python engine (what you were demonstrating was with the C engine). So let's patch that first before addressing your PR.

Member

gfyoung commented May 23, 2017

we typically mark it with an xfail for expected failure, so that it gets reported

Right, but let's see if we can patch them first before marking. What you just found @bpraggastis is a bug with the Python engine (what you were demonstrating was with the C engine). So let's patch that first before addressing your PR.

gfyoung added a commit to gfyoung/pandas that referenced this issue May 23, 2017

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung May 24, 2017

Member

Actually, this problem is bigger than I thought. Filing issue for this.

Member

gfyoung commented May 24, 2017

Actually, this problem is bigger than I thought. Filing issue for this.

gfyoung added a commit to gfyoung/pandas that referenced this issue May 24, 2017

@jreback jreback modified the milestones: 0.20.2, Next Major Release May 24, 2017

gfyoung added a commit to gfyoung/pandas that referenced this issue May 24, 2017

TomAugspurger added a commit to bpraggastis/pandas that referenced this issue Jun 4, 2017

TomAugspurger added a commit to bpraggastis/pandas that referenced this issue Jun 4, 2017

tests added for gh-14671, expected behavior of simultaneous use of us…
…ecols and names unclear so these tests are commented out

TomAugspurger added a commit that referenced this issue Jun 4, 2017

ERRR: Raise error in usecols when column doesn't exist but length mat…
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jun 4, 2017

ERRR: Raise error in usecols when column doesn't exist but length mat…
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments

(cherry picked from commit 50a62c1)

TomAugspurger added a commit that referenced this issue Jun 4, 2017

ERRR: Raise error in usecols when column doesn't exist but length mat…
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments

(cherry picked from commit 50a62c1)

Kiv added a commit to Kiv/pandas that referenced this issue Jun 11, 2017

ERRR: Raise error in usecols when column doesn't exist but length mat…
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

ERRR: Raise error in usecols when column doesn't exist but length mat…
…ches (#16460)

* gh-14671 Check if usecols with type string contains a subset of names, if not throws an error

* tests added for gh-14671, expected behavior of simultaneous use of usecols and names unclear so these tests are commented out

* Review comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment