New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERR: dont' allow ambiguous usecols #12678

Closed
jreback opened this Issue Mar 20, 2016 · 14 comments

Comments

Projects
None yet
3 participants
@jreback
Contributor

jreback commented Mar 20, 2016

#12551 and comment: #12512 (comment)

usecols=['a',1] should raise

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Mar 20, 2016

Member

@jreback : I'll tackle this one as a follow-up in terms of fixing documentation and enforcement. In the meantime, I'll remove the tests I added that had the ambiguous columns.

Member

gfyoung commented Mar 20, 2016

@jreback : I'll tackle this one as a follow-up in terms of fixing documentation and enforcement. In the meantime, I'll remove the tests I added that had the ambiguous columns.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Mar 20, 2016

Member

Shouldn't we also just raise on duplicates in usecols ? eg ['a', 'a']

Member

jorisvandenbossche commented Mar 20, 2016

Shouldn't we also just raise on duplicates in usecols ? eg ['a', 'a']

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Mar 20, 2016

Member

@jorisvandenbossche : usecols is a set, so you won't ever have such a problem.

Member

gfyoung commented Mar 20, 2016

@jorisvandenbossche : usecols is a set, so you won't ever have such a problem.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 20, 2016

Contributor

#11822 and #11823 converts it to a list.

though if we allow only non-duplicated and make it a set operation against names...

though current doc-string say its array-like

Contributor

jreback commented Mar 20, 2016

#11822 and #11823 converts it to a list.

though if we allow only non-duplicated and make it a set operation against names...

though current doc-string say its array-like

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Mar 20, 2016

Member

@gfyoung aha, I didn't know that, but then adding tests for the behaviour for duplicates values in usecols as we are doing in #11882 makes no sense?

Member

jorisvandenbossche commented Mar 20, 2016

@gfyoung aha, I didn't know that, but then adding tests for the behaviour for duplicates values in usecols as we are doing in #11882 makes no sense?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 20, 2016

Contributor

what we could do is:

  • don't allow mixed-integers (e.g. 'a', 1) as these are ambiguous
  • still use a set for usecols, so duplicates are gone, but handle this as a set-selection operation against .names/columns (IOW if names is passed or a header is read in). If there are duplicates there its ok, this usecols just sub-selects. Duplicate handling will be solely there.
Contributor

jreback commented Mar 20, 2016

what we could do is:

  • don't allow mixed-integers (e.g. 'a', 1) as these are ambiguous
  • still use a set for usecols, so duplicates are gone, but handle this as a set-selection operation against .names/columns (IOW if names is passed or a header is read in). If there are duplicates there its ok, this usecols just sub-selects. Duplicate handling will be solely there.
@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Mar 20, 2016

Member

@jreback : What do you mean by "duplicate" handling? Is that what #11823 will be doing?

Member

gfyoung commented Mar 20, 2016

@jreback : What do you mean by "duplicate" handling? Is that what #11823 will be doing?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 20, 2016

Contributor

This is on master.

In [12]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
   ....:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[12]: 
   a  a
0  1  1

In [13]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out[13]: 
   a
0  1

but I think that these should BOTH output

   a   a
0  1   3

IOW, the usecols is just a filter as a set (so its 'a' in this case).

Then names takes over and you get the 0th and 2nd columns (that are named 'a')

Contributor

jreback commented Mar 20, 2016

This is on master.

In [12]: pd.read_csv(StringIO("""1,2,3"""), engine='c', header=None, 
   ....:             names=['a', 'b', 'a'], usecols=['a','a'])
Out[12]: 
   a  a
0  1  1

In [13]: pd.read_csv(StringIO("""1,2,3"""), engine='python', header=None, 
            names=['a', 'b', 'a'], usecols=['a','a'])
Out[13]: 
   a
0  1

but I think that these should BOTH output

   a   a
0  1   3

IOW, the usecols is just a filter as a set (so its 'a' in this case).

Then names takes over and you get the 0th and 2nd columns (that are named 'a')

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 20, 2016

Contributor

cc @sxwang

Contributor

jreback commented Mar 20, 2016

cc @sxwang

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Mar 20, 2016

Member

@jreback : AFAICT, such behaviour will be fixed in #11882 right?

Member

gfyoung commented Mar 20, 2016

@jreback : AFAICT, such behaviour will be fixed in #11882 right?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Mar 20, 2016

Member

@jreback I agree

@gfyoung Well the PR for that issue is #11882, and currently there it is not yet this behaviour (but still in reviewing phase)

Member

jorisvandenbossche commented Mar 20, 2016

@jreback I agree

@gfyoung Well the PR for that issue is #11882, and currently there it is not yet this behaviour (but still in reviewing phase)

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Mar 20, 2016

Member

Okay, but in terms of allocation, that issue should be tackled there, and I could just handle the enforcing non-ambiguous usecols, right?

Member

gfyoung commented Mar 20, 2016

Okay, but in terms of allocation, that issue should be tackled there, and I could just handle the enforcing non-ambiguous usecols, right?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Mar 20, 2016

Member

@gfyoung yes, that's right!

Member

jorisvandenbossche commented Mar 20, 2016

@gfyoung yes, that's right!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 20, 2016

Contributor

yep that sounds right. let's restrict this issue to ambiguous errors

Contributor

jreback commented Mar 20, 2016

yep that sounds right. let's restrict this issue to ambiguous errors

gfyoung added a commit to gfyoung/pandas that referenced this issue Apr 6, 2016

BUG: Prevent mixed-typed usecols
Enforces the fact that 'usecols' must either
be all integers (indexing) or strings (column
names), as mixtures of the two are ambiguous.

Closes pandas-devgh-12678.

@jreback jreback closed this in c6c201e Apr 6, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment