Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: .groupby on Series' values without reindexing #15340

Closed
wants to merge 1 commit into from

Conversation

kernc
Copy link
Contributor

@kernc kernc commented Feb 8, 2017

@jreback
Copy link
Contributor

jreback commented Feb 8, 2017

see my comments on the issue.

@codecov-io
Copy link

codecov-io commented Feb 8, 2017

Codecov Report

Merging #15340 into master will not impact coverage.

@@           Coverage Diff           @@
##           master   #15340   +/-   ##
=======================================
  Coverage   86.32%   86.32%           
=======================================
  Files         141      141           
  Lines       51165    51165           
=======================================
  Hits        44169    44169           
  Misses       6996     6996
Impacted Files Coverage Δ
pandas/core/groupby.py 95.15% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 542c916...87d6170. Read the comment docs.

@jreback jreback added the Groupby label Feb 8, 2017
@jreback
Copy link
Contributor

jreback commented Feb 8, 2017

so this changes the actual behavior. why is this warranted?

@kernc
Copy link
Contributor Author

kernc commented Feb 9, 2017

I don't know if it is. The docs say in case of series, the values will be used as grouping keys. I guess in this instance I don't see series as much different from any other series of values, like lists or numpy arrays. I don't know why aligning is necessary, particularly if the series of keys comes from another source (if not, why would it be passed as a series explicitly instead of by a column reference), in which case the index can be arbitrary and the result of alignment just as so.

In other words, if the grouping series is from the same source (i.e. the same frame) as the grouped object, then it already shares the same index, hence aligning is not necessary.
If the series is from a different source, then automatically aligning might be dangerous. I highly appreciate the fact that an indexed series makes a great random-access key-value store, but I assume most users don't treat it as such in preference to somewhat simpler ordered array of values which happens to have an additional index.

Ideally, I'd have .groupby() fail if the passed grouping is a Sequence and its length doesn't match. Otherwise, it should behave as it does when passed a list.

Of course, Series is not a Sequence?? 😳

@kernc
Copy link
Contributor Author

kernc commented Feb 9, 2017

so this changes the actual behavior. why is this warranted?

It's wholly compatible with the docs, and it doesn't break any previous tests. 😆

@jreback
Copy link
Contributor

jreback commented Feb 9, 2017

@kernc this certainly changes behavior. The entire point is to align. This is actually very confusing what you changed. Virtually every operation in pandas aligns. Why would this not?

@kernc
Copy link
Contributor Author

kernc commented Feb 10, 2017

Thanks for treating me like a pandas expert I'm not. ❤️

In that case, I agree a remark in the docs (docstring) would be welcome.

@kernc kernc closed this Feb 10, 2017
@jreback
Copy link
Contributor

jreback commented Feb 10, 2017

@kernc hahh, np.

if you'd like to do a PR for the doc-string would be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.groupby by should indicate it aligns the passed in Series
3 participants