The unique and nunique attributes are very useful in conjunction with series groupby operations. I used these extensively in previous versions of Pandas whenever I needed to get a list of unique values for each subgroup (or the number of unique values). This can be used, for example, to count the number of subjects in each treatment group (or get a list of the subject IDs for reporting):
import pandas

data = pandas.DataFrame({
    'subject_id': ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'),
    'treatment': (0, 0, 0, 0, 0, 1, 1, 1, 0, 0),
})
# Number of unique subjects in each treatment group
print(data.groupby('treatment').subject_id.apply(lambda x: x.nunique()))
# Unique subject IDs in each treatment group
print(data.groupby('treatment').subject_id.apply(lambda x: x.unique()))
We'd get the following output:
treatment
0    7
1    3
dtype: int64
treatment
0    [A, B, C, D, E, I, J]
1                [F, G, H]
dtype: object
This is very useful for generating summary statistics (e.g. N's) and for debugging (e.g. tracking down which subjects are in which groups). In previous versions of Pandas, we could simply do:
print(data.groupby('treatment').subject_id.nunique())
print(data.groupby('treatment').subject_id.unique())
It would be nice to continue this. Is there a reason why nunique and unique can't be added to the whitelist?
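In the meantime, here is a minimal sketch of a stopgap, assuming that a non-whitelisted method raises AttributeError when accessed on the groupby object. The helper name per_group is hypothetical and not part of pandas. It tries the direct call first and only falls back to apply when the method is not exposed:

```python
import pandas as pd

data = pd.DataFrame({
    'subject_id': ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'),
    'treatment': (0, 0, 0, 0, 0, 1, 1, 1, 0, 0),
})


def per_group(grouped, method_name):
    """Call a no-argument Series method on each group.

    Hypothetical helper, not part of pandas. Assumes that a method
    missing from the groupby object raises AttributeError on access.
    """
    try:
        # Direct form, e.g. grouped.nunique(), when the method is exposed.
        method = getattr(grouped, method_name)
    except AttributeError:
        # Fallback: run the Series method on each group via apply().
        return grouped.apply(lambda x: getattr(x, method_name)())
    return method()


grouped = data.groupby('treatment')['subject_id']
counts = per_group(grouped, 'nunique')   # subjects per treatment group
members = per_group(grouped, 'unique')   # subject IDs per treatment group
print(counts)
print(members)
```

With the example data above, counts should hold 7 for treatment 0 and 3 for treatment 1, matching the apply-based output. The same call sites keep working whether or not the methods end up on the whitelist.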