ENH: add as_index to value_counts and pivot_table #49069

Open
1 of 3 tasks
MarcoGorelli opened this issue Oct 13, 2022 · 11 comments
Labels
Enhancement, Needs Discussion (Requires discussion from core team before further action)

Comments

@MarcoGorelli
Member

MarcoGorelli commented Oct 13, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Some methods (e.g. groupby) have an option (as_index) to keep the grouping column as a regular column instead of moving it into the index.

The same option could be added to value_counts and pivot_table.

Feature Description

Add an as_index argument such that

df.value_counts().reset_index()

and

df.value_counts(as_index=False)

return the same result.

The default as_index would still be True, but it could default to False when a global option is set.
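For illustration, here is a minimal sketch of what callers have to write today to get the result the proposed as_index=False would return. The toy DataFrame and the variable names are assumptions made up for this example; as_index is not an existing parameter of value_counts or pivot_table, so the sketch only uses reset_index.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# Current behaviour: the distinct (a, b) combinations become a MultiIndex
# and the result is a Series of counts.
counts = df.value_counts()

# Today's spelling of the proposed as_index=False: move the index levels
# back into ordinary columns. Passing name= keeps the label of the count
# column the same across pandas versions.
flat_counts = counts.reset_index(name="count")

# pivot_table likewise moves the `index=` keys into the row index,
# and reset_index() turns them back into a column.
pivot = df.pivot_table(index="b", values="a", aggfunc="sum")
flat_pivot = pivot.reset_index()

print(flat_counts)
print(flat_pivot)
```

Under the proposal, df.value_counts(as_index=False) would be shorthand for the reset_index step shown here.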

Additional Context

No response

MarcoGorelli added the Enhancement and Needs Triage labels on Oct 13, 2022
@phofl
Member

phofl commented Oct 13, 2022

Could you try to collect all methods where this would be necessary?

MarcoGorelli added the Needs Discussion label and removed the Needs Triage label on Oct 19, 2022
@MarcoGorelli
Member Author

wait sorry @lucaslarios, still not 100% sure we want this in pandas, I should've added the "needs discussion" label - I've added it now

@lucaslarios

ok instead of that I will take another issue.

@wany-oh
Contributor

wany-oh commented Oct 21, 2022

I think this is a big change: the return value of value_counts() would change from a Series to a DataFrame.

@MarcoGorelli
Member Author

true, but the same thing already happens with e.g. df.groupby('col1', as_index=False)['col2'].sum()
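A small sketch of the existing groupby behaviour referred to here, using made-up data: the same aggregation returns a Series with the default as_index=True and a DataFrame with as_index=False.

```python
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({"col1": ["x", "x", "y"], "col2": [1, 2, 3]})

# Default (as_index=True): col1 becomes the index, the result is a Series.
as_series = df.groupby("col1")["col2"].sum()

# as_index=False: col1 stays a regular column, the result is a DataFrame.
as_frame = df.groupby("col1", as_index=False)["col2"].sum()

print(type(as_series).__name__)
print(type(as_frame).__name__)
```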

@rhshadrach
Member

rhshadrach commented Oct 31, 2022

What is the benefit of adding as_index to these methods? It seems to me this increases the surface area of the pandas API (and doubles the amount we need to test with these methods) without adding much value.

Compare with groupby: there, as_index is an argument of the groupby call itself, which produces a DataFrameGroupBy/SeriesGroupBy object that can be reused for several aggregations. E.g.

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
# as_index=False is set once here and applies to every aggregation below
gb = df.groupby('a', as_index=False)
print(gb.sum())
print(gb.mean())
print(gb.prod())

Without as_index=False, you would have to add .reset_index() to each of these calls. This reuse case does not exist for value_counts or pivot_table. And even so, some also think we should remove as_index from groupby: #35860
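As a quick check of the reuse argument, the following sketch (same toy data as in the comment; the loop and variable names are added for illustration) compares one GroupBy object created with as_index=False against the default GroupBy followed by .reset_index() on every aggregation.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# as_index=False is passed once and reused for every aggregation.
gb_flat = df.groupby("a", as_index=False)

# Default GroupBy: each result needs its own .reset_index().
gb_indexed = df.groupby("a")

for name in ("sum", "mean", "prod"):
    flat = getattr(gb_flat, name)()
    indexed = getattr(gb_indexed, name)().reset_index()
    # Raises AssertionError if the two spellings ever disagree.
    pd.testing.assert_frame_equal(flat, indexed)
```

value_counts and pivot_table return their result directly, so there is no intermediate object on which such a setting could be stored and reused.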

@MarcoGorelli
Member Author

The benefit would be within the larger scope of having an option to make Index opt-in, as it would help fulfill the point of "nobody would get an Index unless they ask for one".

This wouldn't be the default pandas behaviour, it'd be behind an option, but it'd mean that someone could opt in to an "I don't want to think about indices" mode.

@rhshadrach
Member

I see - thanks, I missed the idea of making as_index=False the default in the OP. Echoing #48880 (comment):

> But in general I think that pandas users are better off finding natural row labels for their data than thinking about how to get rid of it.

I'll go even further and say that, in my opinion, natural row labels are what make pandas great. As such I feel moving toward "nobody would get an Index unless they ask for one" is moving in the wrong direction.

@MarcoGorelli
Member Author

MarcoGorelli commented Oct 31, 2022

Thanks, I've reworded it a bit now - rather than eventually making as_index=False the default (to which there's too much pushback), this would just enable creation of an option where that would be the default

I also love row labels (I especially like having time series where the index is time, and each column represents a numeric quantity), but I do think there's a case to be made of a "no index by default" mode - I've tried putting some points together here https://hackmd.io/JPWJqwc1SZKz_Zaxe9MZRQ#Why (you've seen the doc before, this is just an updated version, expanding on the "why?" part)

@rhshadrach
Member

rhshadrach commented Oct 31, 2022

> but I do think there's a case to be made of a "no index by default" mode

My understanding is the reason for this issue would be to move toward that, but there is no consensus yet as to whether we should move toward that (e.g. #48880). Is this accurate?

@MarcoGorelli
Member Author

yup, totally accurate. I'll put it on the agenda for the next call


5 participants