ERR: don't allow ambiguous usecols #12678

Closed
jreback opened this issue Mar 20, 2016 · 14 comments
Labels: Error Reporting, IO CSV
Milestone: 0.18.1

Comments

@jreback
Contributor

jreback commented Mar 20, 2016

See #12551 and this comment: #12512 (comment)

usecols=['a',1] should raise

jreback added the Error Reporting and IO CSV labels Mar 20, 2016
jreback added this to the 0.18.1 milestone Mar 20, 2016
@gfyoung
Member

gfyoung commented Mar 20, 2016

@jreback : I'll tackle this one as a follow-up in terms of fixing documentation and enforcement. In the meantime, I'll remove the tests I added that had the ambiguous columns.

@jorisvandenbossche
Member

Shouldn't we also just raise on duplicates in usecols? E.g. ['a', 'a']

@gfyoung
Member

gfyoung commented Mar 20, 2016

@jorisvandenbossche : usecols is a set, so you won't ever have such a problem.
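
As a small illustration of the point above, here is a plain-Python sketch (not pandas internals) that assumes usecols is stored as a set, as described in this comment. The variable names are hypothetical.

```python
# Hypothetical value a caller might pass; not taken from pandas itself.
usecols = ['a', 'a']

# Converting to a set discards repeated entries, so a duplicate
# such as ['a', 'a'] cannot survive the conversion.
as_set = set(usecols)

print(as_set)       # {'a'}
print(len(as_set))  # 1
```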

@jreback
Contributor Author

jreback commented Mar 20, 2016

#11822 and #11823 convert it to a list.

though if we only allow non-duplicated values, we could make it a set operation against names...

though the current docstring says it's array-like

@jorisvandenbossche
Member

@gfyoung aha, I didn't know that. But then adding tests for the behaviour of duplicate values in usecols, as we are doing in #11882, makes no sense?

@jreback
Contributor Author

jreback commented Mar 20, 2016

what we could do is:

  • don't allow mixed-integers (e.g. 'a', 1) as these are ambiguous
  • still use a set for usecols, so duplicates are gone, but handle this as a set-selection operation against .names/columns (IOW, whether names is passed or a header is read in). If there are duplicates there, it's OK: usecols just sub-selects. Duplicates will be handled solely in names/columns.

@gfyoung
Member

gfyoung commented Mar 20, 2016

@jreback : What do you mean by "duplicate" handling? Is that what #11823 will be doing?

@jreback
Contributor Author

jreback commented Mar 20, 2016

This is on master.

In [12]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
   ....:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[12]: 
   a  a
0  1  1

In [13]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
   ....:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[13]: 
   a
0  1

but I think that these should BOTH output

   a   a
0  1   3

IOW, usecols is just a filter applied as a set (so it's just 'a' in this case).

Then names takes over and you get the 0th and 2nd columns (the ones named 'a').
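
The proposed semantics can be sketched in plain Python. This is only an illustration of the idea in this comment, not how either pandas engine is implemented; `select_by_name_set` is a hypothetical helper, and the standard-library `csv` module stands in for the parser.

```python
import csv
from io import StringIO


def select_by_name_set(names, usecols):
    """Return the positions of every column whose name is in set(usecols)."""
    wanted = set(usecols)  # duplicates in usecols collapse here
    return [i for i, name in enumerate(names) if name in wanted]


names = ['a', 'b', 'a']
positions = select_by_name_set(names, ['a', 'a'])

# Parse the single data row from the example and keep the selected columns.
row = next(csv.reader(StringIO("1,2,3")))
selected = [(names[i], int(row[i])) for i in positions]

print(positions)  # [0, 2]
print(selected)   # [('a', 1), ('a', 3)]
```

Under this reading, duplicates in usecols have no effect, and duplicated names decide how many columns come back.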

@jreback
Contributor Author

jreback commented Mar 20, 2016

cc @sxwang

@gfyoung
Member

gfyoung commented Mar 20, 2016

@jreback : AFAICT, such behaviour will be fixed in #11882 right?

@jorisvandenbossche
Member

@jreback I agree

@gfyoung Well, the PR for that issue is #11882, and it does not yet implement this behaviour (it is still under review)

@gfyoung
Member

gfyoung commented Mar 20, 2016

Okay, but in terms of allocation, that issue should be tackled there, and I could just handle the enforcing non-ambiguous usecols, right?

@jorisvandenbossche
Member

@gfyoung yes, that's right!

@jreback
Contributor Author

jreback commented Mar 20, 2016

yep that sounds right. let's restrict this issue to ambiguous errors

gfyoung added a commit to forking-repos/pandas that referenced this issue Apr 6, 2016
Enforces the fact that 'usecols' must either
be all integers (indexing) or strings (column
names), as mixtures of the two are ambiguous.

Closes gh-12678.
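
The rule stated in this commit message can be sketched as a standalone check. This is a minimal illustration, not the code from the commit: `validate_usecols` and its error message are hypothetical, and treating `bool` as a non-integer is an assumption made here.

```python
def validate_usecols(usecols):
    """Raise ValueError unless usecols is all integers or all strings."""
    values = list(usecols)
    # bool is a subclass of int in Python; excluding it is an assumption.
    all_ints = all(
        isinstance(v, int) and not isinstance(v, bool) for v in values
    )
    all_strs = all(isinstance(v, str) for v in values)
    if not (all_ints or all_strs):
        # Hypothetical message, not the wording pandas uses.
        raise ValueError("usecols must be all integers or all strings")
    return values


validate_usecols([0, 2])      # positions only: accepted
validate_usecols(['a', 'c'])  # names only: accepted

try:
    validate_usecols(['a', 1])  # mixture: ambiguous, rejected
except ValueError as exc:
    print(exc)
```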
jreback closed this as completed in c6c201e on Apr 6, 2016