ERR: usecols fails to raise error if column doesn't exist but is the same length as headers #14671
Comments
jreback added the IO CSV, Error Reporting, Difficulty Novice, and Effort Low labels on Nov 16, 2016
jreback added this to the Next Major Release milestone on Nov 16, 2016
jreback changed the title from "usecols fails to raise error if column doesn't exist but is the same length as headers" to "ERR: usecols fails to raise error if column doesn't exist but is the same length as headers" on Nov 16, 2016
yeah this seems reasonable.
@GGordonGordon will you be creating a PR for this?
GGordonGordon commented Nov 17, 2016
@aileronajay yeah fairly soon
GGordonGordon added a commit to GGordonGordon/pandas that referenced this issue on Nov 17, 2016: a5d344b (GGordonGordon)
GGordonGordon referenced this issue on Nov 17, 2016: raise ValueError if usecols has same size as but doesn't exist in headers (#14671) #14674 (Closed)
GGordonGordon commented Nov 17, 2016
Pull request: #14674
GGordonGordon commented Nov 17, 2016
For clarification: if usecols is a list of integers, should it refer to the columns within the original data table (this appears to be what the current documentation states), or should the integers index into the names list (this appears to be how the code currently operates)?
@GGordonGordon the documentation is correct. You would have to show an example where this is not true.
GGordonGordon commented Jan 2, 2017
This appears to be resolved by #14234, so this is no longer needed.
Using master (so with #14234 merged), I still see the buggy behaviour:
So it would be welcome if you could update the PR.
Is anyone working on this now? I just picked it up at the #pandas sprint at #PyCon2017.
@bpraggastis Note there is a closed PR #14674 that started this work but was never finished. You can have a look at it and see whether it would be useful to start from that work (use the changes in that PR).
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on May 23, 2017: 5e4966c (bpraggastis)
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on May 23, 2017: ce0497d (bpraggastis)
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on May 23, 2017: 68dcc29 (bpraggastis)
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on May 23, 2017: a526421 (bpraggastis)
The approach originally taken (#14674) caused the test suite to fail. I used a different approach that focuses only on the case when usecols_dtype == 'string'. In this case it was sufficient to check that usecols is a subset of names. I committed to branch gh-14671.
There is another case that is not covered, and I would like clarification on it. If header=0 and names are provided to replace the column headers in the data, then usecols should be able to reference individual columns by index or by the name assigned in names. I verified this works in practice:
When I wrote tests for this, the tests failed due to a conditional in parsers.py, in _infer_columns(self), on these lines:
    if ((self.usecols is not None and
         len(names) != len(self.usecols)) or
        (self.usecols is None and
         len(names) != len(columns[0]))):
        raise ValueError('Number of passed names did not match '
                         'number of header fields in the file')
The implication from this code is that if usecols is provided then its length must match len(names).
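The subset check described in this comment can be sketched in plain Python. This is only an illustration of the idea, not the code that was committed to pandas: the helper name validate_usecols_names and its error message are hypothetical, and the real parser does this work inside its own classes.

```python
def validate_usecols_names(usecols, names):
    """Hypothetical helper: require every string in usecols to be in names.

    A length comparison alone cannot catch usecols=['A', 'C'] against
    names=['A', 'B'], because both lists have two entries.
    """
    missing = [col for col in usecols if col not in names]
    if missing:
        raise ValueError(
            "Usecols do not match names; not found: {}".format(missing)
        )
    return usecols


# Same length as the header, but 'C' is not a header field.
header = ["A", "B"]
try:
    validate_usecols_names(["A", "C"], header)
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None
```

Checking membership instead of length is what makes the same-length case detectable; it says nothing about integer usecols, which the comment above leaves as an open question.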
cc @gfyoung
@jorisvandenbossche : I too didn't realize this issue wasn't handled in full. Thanks for the CC!
@bpraggastis : Mind opening up a PR for this? Would make it a little easier to view changes and see whether they are merge-able. Also, FYI, whenever you write "gh-xxxx" in the comment box, GitHub creates a link to the corresponding issue or PR in the repository. If you meant to point to your branch, then you will need to do:
bpraggastis referenced this issue on May 23, 2017: Raise error in usecols when column doesn't exist but length matches #16460 (Merged)
@jorisvandenbossche @gfyoung : I created a PR and left the tests referencing the problem I mentioned commented out for future use (not sure if that is the correct protocol). Leaving PyCon now but would like to continue working on this if you open another issue addressing the new bug.
we typically mark it with an xfail marker:
    @pytest.mark.xfail(reason='this will fail')
    def test_fail():
        assert False
Right, but let's see if we can patch them first before marking. What you just found, @bpraggastis, is a bug with the Python engine (what you were demonstrating was with the C engine). So let's patch that first before addressing your PR.
gfyoung added a commit to gfyoung/pandas that referenced this issue on May 23, 2017: f101f3c (bpraggastis + gfyoung)
Actually, this problem is bigger than I thought. Filing issue for this.
gfyoung referenced this issue on May 24, 2017: API/DOC: Specification for `names` parameter in read_csv #16469 (Open)
gfyoung added a commit to gfyoung/pandas that referenced this issue on May 24, 2017: 9d02d51 (bpraggastis + gfyoung)
jreback modified the milestone: 0.20.2, Next Major Release on May 24, 2017
gfyoung added a commit to gfyoung/pandas that referenced this issue on May 24, 2017: 4af9e45 (bpraggastis + gfyoung)
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on Jun 3, 2017: 2246e96 (bpraggastis)
bpraggastis added a commit to bpraggastis/pandas that referenced this issue on Jun 3, 2017: 972d72b (bpraggastis)
TomAugspurger added a commit to bpraggastis/pandas that referenced this issue on Jun 4, 2017: 812f928 (bpraggastis + TomAugspurger)
TomAugspurger added a commit to bpraggastis/pandas that referenced this issue on Jun 4, 2017: 1968a70 (bpraggastis + TomAugspurger)
TomAugspurger closed this in #16460 on Jun 4, 2017
TomAugspurger added a commit that referenced this issue on Jun 4, 2017: 50a62c1 (bpraggastis + TomAugspurger)
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue on Jun 4, 2017: 28b1c8b (bpraggastis + TomAugspurger)
TomAugspurger added a commit that referenced this issue on Jun 4, 2017: 2c43917 (bpraggastis + TomAugspurger)
Kiv added a commit to Kiv/pandas that referenced this issue on Jun 11, 2017: 9771514 (bpraggastis + Kiv)
stangirala added a commit to stangirala/pandas that referenced this issue on Jun 11, 2017: 2c2f77f (bpraggastis + stangirala)
guillemborrell added a commit to guillemborrell/pandas that referenced this issue on Jul 7, 2017: a4fb0be (bpraggastis + guillemborrell)

GGordonGordon commented Nov 16, 2016
A small, complete example of the issue
read_csv doesn't raise a ValueError if usecols contains the same number of values as the headers but names a column that doesn't exist.
A simple test_csv.txt file contains:
A,B
1,2
3,4
It currently sets the other column to NaN values:
   A   B
0  1 NaN
1  3 NaN
Expected Output
ValueError("Usecols do not match names.") exception
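The reporter's exact read_csv call did not survive in this copy of the thread, so the following is only a guess at a reproduction: it assumes the call passed usecols=['A', 'C'] against the two-column file above, and it reads the same contents from an in-memory buffer instead of test_csv.txt. On a pandas release that includes the fix from #16460, this should end in a ValueError; on 0.18.1, per the report, no error was raised.

```python
from io import StringIO

import pandas as pd

# Same contents as the test_csv.txt file shown in the report.
csv_text = "A,B\n1,2\n3,4\n"

try:
    # Assumed call: 'C' is not a column, but usecols has the same
    # length (2) as the header row.
    pd.read_csv(StringIO(csv_text), usecols=["A", "C"])
except ValueError as exc:
    outcome = "ValueError: {}".format(exc)
else:
    outcome = "no error raised"

print(outcome)
```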
Output of pd.show_versions()
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1