# 10 - Data Aggregation and Group Operations



In [1]:
import pandas as pd
import numpy as np

## 10.1 - How to think about group operations


### Introduction

Groupby operations are usually described as split-apply-combine:
the data is split into groups by one or more keys, a function is
applied to each group, and the results are combined into a single
object. We'll get started with this tabular dataset:

In [2]:
df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
                    "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"), 
                    "data1" : np.random.standard_normal(7),  
                    "data2" : np.random.standard_normal(7)})
df

,key1,key2,data1,data2
0,a,1.0,1.199253,0.623657
1,a,2.0,1.000606,0.145123
2,,1.0,0.5989,0.027154
3,b,2.0,1.837285,1.084548
4,b,1.0,-0.124856,-1.417453
5,a,,-0.735754,0.485362
6,,1.0,2.259049,0.780678


Suppose we want to compute some statistics over the `data1`
column for each `key1` group. One way to do so is to select that
column and call its `groupby()` method with the `key1` column as the key:

In [3]:
grouped = df['data1'].groupby(df['key1'])
grouped

<pandas.api.typing.SeriesGroupBy object at 0x7027be6a59d0>

This `SeriesGroupBy` object can now be used to 
calculate some statistics, such as the mean:

In [4]:
grouped.mean()

key1
a    0.488035
b    0.856214
Name: data1, dtype: float64
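
To make the split, apply, and combine steps concrete, here is a minimal sketch that reproduces a group mean by hand and compares it with `groupby`. The `toy` DataFrame and its column names are made up for illustration and are not part of the chapter's data.

```python
import pandas as pd

# Hypothetical toy data, small enough to check by hand.
toy = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Split: select the values belonging to each distinct key.
pieces = {k: toy.loc[toy["key"] == k, "value"] for k in toy["key"].unique()}

# Apply: reduce each piece to a single number.
reduced = {k: piece.mean() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(reduced).sort_index()

# The same computation expressed with groupby.
via_groupby = toy.groupby("key")["value"].mean()
```

Both `manual` and `via_groupby` hold one mean per key; `groupby` simply performs the three steps for us, and much faster.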

If we instead pass multiple keys to `groupby()`, we'll
end up with a hierarchical index:

In [5]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     1       1.199253
      2       1.000606
b     1      -0.124856
      2       1.837285
Name: data1, dtype: float64

In [6]:
means.unstack()

key2,1,2
key1,,
a,1.199253,1.000606
b,-0.124856,1.837285


The key arrays here are Series, but they need not be. We
could use NumPy arrays, lists, etc., as long as they have the
same length as the axis being grouped.
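
As a small sketch of that point, the keys below are a NumPy array and a plain Python list rather than Series. The names `values`, `states`, and `years` are illustrative assumptions, not objects from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four observations and two external key arrays.
values = pd.Series([10, 20, 30, 40])
states = np.array(["OH", "CA", "CA", "OH"])  # NumPy array as a key
years = [2005, 2005, 2006, 2005]             # plain list as a key

# Both keys have the same length as `values`, so they can be mixed freely.
totals = values.groupby([states, years]).sum()
```

The result is indexed by the `(state, year)` pairs that actually occur in the data.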

If the grouping keys are columns of the DataFrame we're working on,
we can pass their column names as groupby keys:

In [11]:
df.groupby('key1').mean()

key1,key2,data1,data2
a,1.5,0.488035,0.418047
b,1.5,0.856214,-0.166453


In [15]:
df.groupby('key2').mean(numeric_only=True)

key2,data1,data2
1,0.983087,0.003509
2,1.418946,0.614835


In [16]:
df.groupby(['key1', 'key2']).mean()

key1,key2,data1,data2
a,1,1.199253,0.623657
a,2,1.000606,0.145123
b,1,-0.124856,-1.417453
b,2,1.837285,1.084548


One useful groupby method is `size()`, which returns the
size of each group:

In [17]:
df.groupby(['key1', 'key2']).size()

key1  key2
a     1       1
      2       1
b     1       1
      2       1
dtype: int64

Note that rows with NA in any group key are dropped by default, but they can be included with `dropna=False`:

In [18]:
df.groupby(['key1', 'key2'], dropna=False).size()

key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

A similar method is `count()`, but it counts only the
non-null values in each column of each group:

In [19]:
df.groupby('key1', dropna=False).count()

key1,key2,data1,data2
a,2,3,3
b,2,2,2
,2,2,2
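
The difference between `size()` and `count()` is easiest to see on data with missing values in both the key and a data column. This is a sketch on a made-up DataFrame (`obs` and its columns are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing key and one missing value.
obs = pd.DataFrame({"key": ["a", "a", "b", None],
                    "x": [1.0, np.nan, 3.0, 4.0]})

# size(): number of rows per group; rows with an NA key are dropped.
sizes = obs.groupby("key").size()

# dropna=False keeps the NA key as its own group.
sizes_with_na = obs.groupby("key", dropna=False).size()

# count(): number of non-null values of x per group.
counts = obs.groupby("key")["x"].count()
```

Group `a` has two rows but only one non-null `x`, so `size()` and `count()` disagree for it.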


### Iterating over groups

We can iterate over the object returned by groupby.
Each iteration yields a 2-tuple with the group
name and the chunk of data:

In [24]:
for name, group in df.groupby('key1'):
  print(name)
  print(group)
  print('\n')

a
  key1  key2     data1     data2
0    a     1  1.199253  0.623657
1    a     2  1.000606  0.145123
5    a  <NA> -0.735754  0.485362


b
  key1  key2     data1     data2
3    b     2  1.837285  1.084548
4    b     1 -0.124856 -1.417453




In the case of multiple keys, the first element
in each tuple is itself a tuple of key values:

In [25]:
for keys, group in df.groupby(['key1', 'key2']):
  print(keys)
  print(group)
  print('\n')

('a', 1)
  key1  key2     data1     data2
0    a     1  1.199253  0.623657


('a', 2)
  key1  key2     data1     data2
1    a     2  1.000606  0.145123


('b', 1)
  key1  key2     data1     data2
4    b     1 -0.124856 -1.417453


('b', 2)
  key1  key2     data1     data2
3    b     2  1.837285  1.084548




One useful recipe here is to build a dictionary mapping each
group label to its chunk of data with a one-liner:

In [26]:
data_dict = {name: data for name, data in df.groupby('key1')}
data_dict['b']

,key1,key2,data1,data2
3,b,2,1.837285,1.084548
4,b,1,-0.124856,-1.417453


To group along the columns axis, it is standard practice to transpose
the DataFrame first. Here we group with a dictionary that maps the
columns starting with 'key' to one group and the columns starting
with 'data' to another:

In [32]:
grouped = df.T.groupby({'key1':'key', 'key2':'key',
                      'data1':'data', 'data2': 'data'})
                     
for key, value in grouped:
  print(key)
  print(value.T)

data
      data1     data2
0  1.199253  0.623657
1  1.000606  0.145123
2    0.5989  0.027154
3  1.837285  1.084548
4 -0.124856 -1.417453
5 -0.735754  0.485362
6  2.259049  0.780678
key
  key1  key2
0    a     1
1    a     2
2  NaN     1
3    b     2
4    b     1
5    a  <NA>
6  NaN     1


### Selecting a Column or a Subset of Columns

We can select only a subset of columns to
aggregate and do groupby operations. It is 
common to do so by indexing the groupby object:

In [41]:
df.groupby('key1')[['data1', 'data2']].mean()

key1,data1,data2
a,0.488035,0.418047
b,0.856214,-0.166453


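Indexing with a single column name gives a `SeriesGroupBy` object,
so the aggregated result is a Series. Indexing with a list of column
names (even a list of length 1) gives a `DataFrameGroupBy` object,
so the aggregated result is a DataFrame.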
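
A short sketch of that distinction, using a made-up DataFrame (`toy` and its columns are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"key": ["a", "a", "b"],
                    "x": [1.0, 3.0, 5.0],
                    "y": [2.0, 4.0, 6.0]})
grouped = toy.groupby("key")

as_series = grouped["x"].mean()    # single label: result is a Series
as_frame = grouped[["x"]].mean()   # list of labels: result is a DataFrame
```

The numbers are the same in both cases; only the container type differs.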

### Grouping with Dictionaries and Series

Consider the following DataFrame:

In [46]:
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])
people.iloc[2, [1,2]] = np.nan
people

,a,b,c,d,e
Joe,0.132154,0.263457,-0.240705,-1.357358,0.643108
Steve,-0.18953,-1.344791,1.398498,0.49188,0.362259
Wanda,1.190825,,,-0.086028,-0.904074
Jill,0.760951,-1.060206,0.369945,0.278894,0.898692
Trey,0.305667,0.665185,-0.149211,1.317127,-0.563092


Suppose each column corresponds to a color, red or blue, and we
want to compute statistics for the red and blue groups. We could
store the correspondence in a dictionary and pass it to groupby.
Mapping keys that match no column, such as 'f' here, are simply ignored:

In [47]:
mapping = {'a':'red', 'b':'red', 'c':'blue', 'd':'blue',
           'e':'red', 'f':'orange'} 

by_column = people.T.groupby(mapping)

by_column.sum().T

,blue,red
Joe,-1.598064,1.038719
Steve,1.890379,-1.172063
Wanda,-0.086028,0.286752
Jill,0.648839,0.599437
Trey,1.167916,0.40776


We can also do that with a Series, which can
be seen as a fixed-size mapping:

In [48]:
map_series = pd.Series(mapping)
map_series

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: str

In [49]:
people.T.groupby(map_series).count().T 

,blue,red
Joe,2,3
Steve,2,3
Wanda,1,2
Jill,2,3
Trey,2,3


### Grouping with Functions

We can also pass a function to groupby. It is called once
on each index value (the row labels here), and its return
values are used as the group keys. Suppose we want to compute
the minimum value for each first-name initial:

In [50]:
people.groupby(lambda x: x[0]).min()

,a,b,c,d,e
J,0.132154,-1.060206,-0.240705,-1.357358,0.643108
S,-0.18953,-1.344791,1.398498,0.49188,0.362259
T,0.305667,0.665185,-0.149211,1.317127,-0.563092
W,1.190825,,,-0.086028,-0.904074
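
One way to check what a function key does is to compare it with mapping the index by hand. This is a sketch on a made-up Series (`scores` and its labels are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data indexed by first names.
scores = pd.Series([1, 2, 3, 4], index=["Joe", "Jill", "Steve", "Sam"])

# The function receives each index label and returns its group key.
by_initial = scores.groupby(lambda name: name[0]).sum()

# Equivalent: compute the keys explicitly, then group by them.
initials = scores.index.map(lambda name: name[0])
by_mapped = scores.groupby(initials).sum()
```

Both results sum the values of the names starting with "J" and with "S".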


As everything is converted to arrays internally, it's
okay to mix functions with arrays, Series, or dicts
in the list of keys:

In [51]:
group_number = ['one', 'one', 'two', 'two', 'one']
people.groupby([lambda x: x[0], group_number]).min()

,,a,b,c,d,e
J,one,0.132154,0.263457,-0.240705,-1.357358,0.643108
J,two,0.760951,-1.060206,0.369945,0.278894,0.898692
S,one,-0.18953,-1.344791,1.398498,0.49188,0.362259
T,one,0.305667,0.665185,-0.149211,1.317127,-0.563092
W,two,1.190825,,,-0.086028,-0.904074


### Grouping by Index Levels

In hierarchically indexed objects, we can group by a 
certain index level. To do so, we pass the `level=`
argument with the level name or number. Consider the DataFrame:

In [52]:
columns = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"], 
                                     [1, 3, 5, 1, 3]],  
                                     names=["cty", "tenor"])
hier_df = pd.DataFrame(np.random.standard_normal((4, 5)),
                       columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.295465,-0.974724,0.863463,0.365192,0.120075
1,0.549945,-1.668698,0.206943,-1.147478,-0.367363
2,0.455041,0.68534,1.592419,-1.22744,0.133776
3,2.074045,0.031908,0.794314,-0.451135,1.14956


In [53]:
hier_df.T.groupby(level='cty').count().T

cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
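
The same `level=` argument works directly on a hierarchical row index, without transposing. The sketch below uses a small made-up Series (`rates` is a hypothetical name) that reuses the `cty` and `tenor` level names.

```python
import pandas as pd

# Hypothetical Series with a two-level row index.
index = pd.MultiIndex.from_arrays([["US", "US", "JP", "JP"], [1, 3, 1, 3]],
                                  names=["cty", "tenor"])
rates = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

by_cty = rates.groupby(level="cty").sum()   # level given by name
by_tenor = rates.groupby(level=1).mean()    # level given by position
```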


### Partial Summary

- `df.groupby(keys)` returns a groupby object that can compute statistics for each group
    - To compute the mean when non-numeric columns are present, pass `numeric_only=True`
    - Rows with NA group keys are dropped by default (`dropna=False` keeps them)
- We can iterate over groups. Each iteration yields a tuple with the group name and its data.
- Selecting a column or subset of columns is done by indexing the `groupby` object.
- We can group by a `dict` or `Series` that maps labels to group names.
- Functions also work: they are called on each index label and must return the group key.
- We can also group by index level with `groupby(level=levelname)`

## 10.2 - Data Aggregation

### Introduction 

Data aggregation refers simply to the process of producing
a scalar value from an array of values, as when we used
`mean()`, `sum()`, and many others.

Those used before and some other aggregations are optimized
for groupby operations, but we can also use any other method
supported by the grouped object, even though it will be unoptimized.
For example, the `nsmallest(n)` method of Series can be called on
each groupby piece:

In [54]:
df

,key1,key2,data1,data2
0,a,1.0,1.199253,0.623657
1,a,2.0,1.000606,0.145123
2,,1.0,0.5989,0.027154
3,b,2.0,1.837285,1.084548
4,b,1.0,-0.124856,-1.417453
5,a,,-0.735754,0.485362
6,,1.0,2.259049,0.780678


In [55]:
grouped = df.groupby('key1')
grouped['data1'].nsmallest(2)

key1   
a     5   -0.735754
      1    1.000606
b     4   -0.124856
      3    1.837285
Name: data1, dtype: float64

We can pass any custom function that reduces an array to a single
value as a groupby aggregation through the `.agg(function)` method:

In [59]:
def square_sum(array):
  # Sum of squared values; `total` avoids shadowing the built-in sum()
  total = 0
  for item in array:
    total += item**2
  return total
grouped['data2'].agg(square_sum)

key1
a    0.645586
b    3.185417
Name: data2, dtype: float64
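
Custom Python functions passed to `agg` are generally slower than the built-in aggregations. When the custom step is elementwise, one alternative is to do it before grouping and then use an optimized aggregation. This is a sketch on made-up data (`toy` is an assumption), not a benchmark.

```python
import pandas as pd

# Hypothetical data for illustration.
toy = pd.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})

def square_sum(values):
    # Reduce a group's values to the sum of their squares.
    return (values ** 2).sum()

# Custom aggregation: square_sum is called once per group.
via_agg = toy.groupby("key")["x"].agg(square_sum)

# Alternative: square every value first, then use the built-in sum.
via_vectorized = (toy["x"] ** 2).groupby(toy["key"]).sum()
```

Both give the same numbers; the second form avoids calling a Python function for each group.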

Some methods work even though they aren't technically aggregations,
like `describe()`:

In [60]:
grouped.describe()

,key2,key2,key2,key2,key2,key2,key2,key2,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
key1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
a,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,3.0,0.488035,...,1.09993,1.199253,3.0,0.418047,0.246267,0.145123,0.315242,0.485362,0.55451,0.623657
b,2.0,1.5,0.707107,1.0,1.25,1.5,1.75,2.0,2.0,0.856214,...,1.346749,1.837285,2.0,-0.166453,1.769182,-1.417453,-0.791953,-0.166453,0.459048,1.084548


### Column-Wise and Multiple Function Application

Let's bring back the tipping dataset and add that 
`tip_pct` column from earlier:

In [61]:
tips = pd.read_csv('../pydata-book/examples/tips.csv')
tips.head(3)

,total_bill,tip,smoker,day,time,size
0,16.99,1.01,No,Sun,Dinner,2
1,10.34,1.66,No,Sun,Dinner,3
2,21.01,3.5,No,Sun,Dinner,3


In [62]:
tips['tip_pct'] = tips['tip']/tips['total_bill'] 

The lesson here is that we can use a different
aggregation for each column, or multiple aggregation
functions at once. The following examples illustrate this.

First, we'll group by 'day' and 'smoker':

In [65]:
grouped = tips.groupby(['day', 'smoker'])

Then, we'll select the tip_pct column:

In [66]:
grouped_pct = grouped['tip_pct']

We can call a standard aggregation through its method
or pass its name as a string to `.agg()`:

In [69]:
grouped_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

If we pass a list of functions or function names, we'll
instead get a DataFrame with column names taken from the
functions:

In [72]:
grouped_pct.agg(['mean', 'std', 'max', square_sum])

day,smoker,mean,std,max,square_sum
Fri,No,0.15165,0.028123,0.187735,0.094364
Fri,Yes,0.174783,0.051293,0.26348,0.49507
Sat,No,0.158048,0.039767,0.29199,1.193641
Sat,Yes,0.147906,0.061375,0.325733,1.073243
Sun,No,0.160113,0.042347,0.252672,1.561685
Sun,Yes,0.18725,0.154134,0.710345,1.093824
Thur,No,0.160298,0.038774,0.266312,1.222448
Thur,Yes,0.163863,0.039389,0.241255,0.481294


If instead we pass tuples of `(name, function)`, the
name will be used for the column names:

In [73]:
grouped_pct.agg([('average', 'mean'), ('sdeviation', 'std')])

day,smoker,average,sdeviation
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389


When working with a DataFrame we have more options. First,
we'll apply the same list of functions to several
columns:

In [77]:
functions = ['mean', 'count', 'max']

result = grouped[['total_bill', 'tip_pct']].agg(functions)
result

,,total_bill,total_bill,total_bill,tip_pct,tip_pct,tip_pct
day,smoker,mean,count,max,mean,count,max
Fri,No,18.42,4,22.75,0.15165,4,0.187735
Fri,Yes,16.813333,15,40.17,0.174783,15,0.26348
Sat,No,19.661778,45,48.33,0.158048,45,0.29199
Sat,Yes,21.276667,42,50.81,0.147906,42,0.325733
Sun,No,20.506667,57,48.17,0.160113,57,0.252672
Sun,Yes,24.12,19,45.35,0.18725,19,0.710345
Thur,No,17.113111,45,41.19,0.160298,45,0.266312
Thur,Yes,19.190588,17,43.11,0.163863,17,0.241255


To apply different functions to each column, we pass a dict
mapping each column name to a function or a list of functions:

In [78]:
grouped.agg({'total_bill': 'mean', 'tip': 'sum'})

day,smoker,total_bill,tip
Fri,No,18.42,11.25
Fri,Yes,16.813333,40.71
Sat,No,19.661778,139.63
Sat,Yes,21.276667,120.77
Sun,No,20.506667,180.57
Sun,Yes,24.12,66.82
Thur,No,17.113111,120.32
Thur,Yes,19.190588,51.51


In [79]:
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'mean'})

,,tip_pct,tip_pct,tip_pct,tip_pct,size
day,smoker,min,max,mean,std,mean
Fri,No,0.120385,0.187735,0.15165,0.028123,2.25
Fri,Yes,0.103555,0.26348,0.174783,0.051293,2.066667
Sat,No,0.056797,0.29199,0.158048,0.039767,2.555556
Sat,Yes,0.035638,0.325733,0.147906,0.061375,2.47619
Sun,No,0.059447,0.252672,0.160113,0.042347,2.929825
Sun,Yes,0.06566,0.710345,0.18725,0.154134,2.578947
Thur,No,0.072961,0.266312,0.160298,0.038774,2.488889
Thur,Yes,0.090014,0.241255,0.163863,0.039389,2.352941


Hierarchical columns are created only if at least one column
has more than one aggregation function.
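
If hierarchical columns are not wanted, pandas also accepts keyword arguments of the form `new_name=(column, function)` in `agg`, often called named aggregation. These notes do not use it, so the sketch below is an aside on made-up data (`bills` and the output names are assumptions).

```python
import pandas as pd

# Hypothetical data shaped like a tiny slice of the tips dataset.
bills = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                      "total_bill": [10.0, 20.0, 30.0, 50.0],
                      "tip": [1.0, 3.0, 3.0, 5.0]})

# Each keyword names an output column; the tuple is (input column, function).
summary = bills.groupby("day").agg(
    avg_bill=("total_bill", "mean"),
    total_tip=("tip", "sum"),
    max_tip=("tip", "max"),
)
```

The result has flat column names even though `tip` is aggregated twice.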

### Returning Aggregated Data Without Row Indexes

If we want the group keys as regular columns instead of a row
index, we can pass `as_index=False` to `groupby()`. This gives the
same result as calling `reset_index()` afterwards while avoiding
that extra computation:

In [82]:
tips.groupby(['day', 'smoker'], as_index=False).mean(numeric_only=True)

,day,smoker,total_bill,tip,size,tip_pct
0,Fri,No,18.42,2.8125,2.25,0.15165
1,Fri,Yes,16.813333,2.714,2.066667,0.174783
2,Sat,No,19.661778,3.102889,2.555556,0.158048
3,Sat,Yes,21.276667,2.875476,2.47619,0.147906
4,Sun,No,20.506667,3.167895,2.929825,0.160113
5,Sun,Yes,24.12,3.516842,2.578947,0.18725
6,Thur,No,17.113111,2.673778,2.488889,0.160298
7,Thur,Yes,19.190588,3.03,2.352941,0.163863
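
The relationship between `as_index=False` and `reset_index()` can be checked on a small example. The `bills` DataFrame here is a made-up stand-in for the tips data.

```python
import pandas as pd

# Hypothetical data for illustration.
bills = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                      "tip": [1.0, 3.0, 3.0, 5.0]})

# Group keys returned as an ordinary column.
flat = bills.groupby("day", as_index=False)["tip"].mean()

# Same content, obtained by aggregating first and resetting the index.
reset = bills.groupby("day")["tip"].mean().reset_index()
```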


### Partial Summary

- Any function that reduces an array to a scalar can be used on `groupby` objects through `.agg(function)`
    - We can aggregate with a list of functions or a dict `{'column_name': [functions]}`
    - We can use custom column names by passing `('name', function)` tuples
    - To get the group keys as columns rather than a row index, at lower cost than `reset_index()`, we can pass `as_index=False` to `groupby()`

## 10.3 - Apply: General split-apply-combine

This section concerns itself with the `apply()` method,
considered the most general one of the `groupby` object.
It splits the object being handled into pieces, invokes
the passed function on each piece, and then attempts to
concatenate the pieces.

As an example, we'll return to the tipping dataset from before
and select the rows with the five largest `tip_pct` values in each group.

### Apply Presentation

First, we'll write a function that selects the largest values
in a given column:

In [84]:
def top(df, n=5, column='tip_pct'):
  return df.sort_values(column, ascending=False)[:n]

top(tips)

,total_bill,tip,smoker,day,time,size,tip_pct
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
232,11.61,3.39,No,Sat,Dinner,2,0.29199
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535


Now if we group by smoker and call apply with this function:

In [85]:
tips.groupby('smoker').apply(top)

smoker,,total_bill,tip,day,time,size,tip_pct
No,232,11.61,3.39,Sat,Dinner,2,0.29199
No,149,7.51,2.0,Thur,Lunch,2,0.266312
No,51,10.29,2.6,Sun,Dinner,2,0.252672
No,185,20.69,5.0,Sun,Dinner,5,0.241663
No,88,24.71,5.85,Thur,Lunch,2,0.236746
Yes,172,7.25,5.15,Sun,Dinner,2,0.710345
Yes,178,9.6,4.0,Sun,Dinner,2,0.416667
Yes,67,3.07,1.0,Sat,Dinner,1,0.325733
Yes,183,23.17,6.5,Sun,Dinner,4,0.280535
Yes,109,14.31,4.0,Sat,Dinner,2,0.279525


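`apply` calls `top` on each piece of the DataFrame and then
concatenates the results, labeling each piece with its group name.
The result therefore has a hierarchical index whose inner level
contains the index values from the original DataFrame.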

If our function takes more parameters, we can pass them to
`apply` as positional or keyword arguments after the function:

In [86]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')

smoker,day,,total_bill,tip,time,size,tip_pct
No,Fri,94,22.75,3.25,Dinner,2,0.142857
No,Sat,212,48.33,9.0,Dinner,4,0.18622
No,Sun,156,48.17,5.0,Dinner,6,0.103799
No,Thur,142,41.19,5.0,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Lunch,4,0.115982


What we can do with `apply` is pretty diverse and limited only by
our creativity. The only requirement is that the function returns
a scalar or a pandas object.

The rest of the chapter will consist mainly of examples of
problems solved by groupby:

### Suppressing the Group Keys

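On a quick note, the results of `apply` above carry the group keys
as the outer levels of their index. We can suppress those levels by
passing `group_keys=False` to `groupby()`, which leaves only the
original index.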
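
A minimal sketch of the effect of `group_keys`, using a made-up Series and a simplified `top` helper (both are assumptions for illustration, not the chapter's objects):

```python
import pandas as pd

# Hypothetical data: five values in two groups.
values = pd.Series([3, 1, 2, 5, 4], index=list("abcde"))
keys = ["x", "x", "x", "y", "y"]

def top(group, n=1):
    # Keep the n largest values of the group.
    return group.sort_values(ascending=False).head(n)

# Default: group keys become the outer index level.
with_keys = values.groupby(keys).apply(top)

# group_keys=False: only the original labels remain.
without_keys = values.groupby(keys, group_keys=False).apply(top)
```

The selected values are the same in both cases; only the index differs.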

### Quantile and Bucket Analysis

Using pandas `cut` and `qcut` with groupby makes it convenient
to do bucket analysis on different categories of a dataset.
Consider a simple random dataset and an equal-length bucket 
categorization using pandas.cut:

In [93]:
frame = pd.DataFrame({'data1': np.random.standard_normal(1000),
                      'data2': np.random.standard_normal(1000)})

frame.head()

,data1,data2
0,-0.547242,0.766776
1,0.307368,0.545344
2,-0.692081,-0.302757
3,-0.542846,-2.34541
4,1.143889,-0.532546


In [97]:
quartiles = pd.cut(frame['data1'], 4)
quartiles.head(10)

0    (-1.789, -0.29]
1     (-0.29, 1.209]
2    (-1.789, -0.29]
3    (-1.789, -0.29]
4     (-0.29, 1.209]
5    (-1.789, -0.29]
6    (-1.789, -0.29]
7     (1.209, 2.708]
8     (-0.29, 1.209]
9     (-0.29, 1.209]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.294, -1.789] < (-1.789, -0.29] < (-0.29, 1.209] < (1.209, 2.708]]

The resulting Categorical object can be passed directly to groupby.
Once we've grouped by bucket, we can apply functions to each group:

In [98]:
grouped = frame.groupby(quartiles)

def get_stats(group):
  return pd.DataFrame(
    {'min': group.min(), 'max': group.max(),
     'count': group.count(), 'mean': group.mean()}
  )

grouped.apply(get_stats)

data1,,min,max,count,mean
"(-3.294, -1.789]",data1,-3.287902,-1.794877,39,-2.170917
"(-3.294, -1.789]",data2,-2.380135,1.772233,39,-0.008726
"(-1.789, -0.29]",data1,-1.761778,-0.292186,338,-0.866776
"(-1.789, -0.29]",data2,-2.796491,3.419199,338,-0.033755
"(-0.29, 1.209]",data1,-0.288733,1.208798,515,0.385388
"(-0.29, 1.209]",data2,-3.115294,2.660721,515,0.069993
"(1.209, 2.708]",data1,1.213015,2.708454,108,1.670963
"(1.209, 2.708]",data2,-3.067722,3.194849,108,-0.052062


To get equal-sized buckets based on sample quantiles, we use `qcut`.
Passing `labels=False` makes it return the quantile numbers instead
of the intervals, which can be cleaner:

In [99]:
quartiles_samp = pd.qcut(frame['data1'], 4, labels=False)

frame.groupby(quartiles_samp).apply(get_stats)

data1,,min,max,count,mean
0,data1,-3.287902,-0.662581,250,-1.269778
0,data2,-2.796491,3.390453,250,-0.040651
1,data1,-0.661263,0.011186,250,-0.307645
1,data2,-2.454271,3.419199,250,0.024464
2,data1,0.012088,0.6796,250,0.333558
2,data2,-3.115294,2.652923,250,0.10108
3,data1,0.6865,2.708454,250,1.249074
3,data2,-3.067722,3.194849,250,-0.010197
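
The contrast between `cut` (equal-width buckets) and `qcut` (equal-count buckets) shows up clearly on skewed data. This sketch uses a deterministic made-up Series (`values`) so the bucket sizes can be checked by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: the squares 1, 4, 9, ..., 10000.
values = pd.Series(np.arange(1, 101) ** 2)

equal_width = pd.cut(values, 4)                  # four intervals of equal length
equal_count = pd.qcut(values, 4, labels=False)   # quartile numbers 0..3

# observed=True counts only the buckets that actually occur.
width_sizes = values.groupby(equal_width, observed=True).size()
count_sizes = values.groupby(equal_count).size()
```

With `cut`, most of the small values fall into the first interval, while `qcut` puts 25 values in every bucket.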


### Example: Filling Missing Values with Group-Specific Values

When cleaning missing data, if we wish to fill NA
values, `fillna()` is the method to use. But we may
want to use a different value or function with `fillna`
depending on the group we're filling for. Consider the
DataFrame:

In [102]:
states = ["Ohio", "New York", "Vermont", "Florida", 
          "Oregon", "Nevada", "California", "Idaho"] 

group_key = ["East", "East", "East", "East", 
            "West", "West", "West", "West"] 

data = pd.Series(np.random.standard_normal(8), index=states)

data

Ohio          2.568425
New York     -0.769078
Vermont      -2.145546
Florida      -0.035402
Oregon       -0.710139
Nevada       -0.877352
California    0.201258
Idaho         0.399675
dtype: float64

In [103]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Ohio          2.568425
New York     -0.769078
Vermont            NaN
Florida      -0.035402
Oregon       -0.710139
Nevada             NaN
California    0.201258
Idaho              NaN
dtype: float64

In [104]:
data.groupby(group_key).size()

East    4
West    4
dtype: int64

Suppose we want to fill NA values with 0 for the East
and with 1 for the West. We could store those values in a dict
keyed by group name and look them up through the `name`
attribute that `apply` sets on each group:

In [109]:
fill_values = {'East': 0, 'West': 1}

def fill_group(group, fill_values):
  # group.name holds the key of the group being processed
  return group.fillna(fill_values[group.name])

data.groupby(group_key).apply(fill_group, fill_values)

East  Ohio          2.568425
      New York     -0.769078
      Vermont       0.000000
      Florida      -0.035402
West  Oregon       -0.710139
      Nevada        1.000000
      California    0.201258
      Idaho         1.000000
dtype: float64
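
A common variation is to fill each gap with a statistic of its own group, such as the group mean. One way to sketch this, on made-up data (`data` and `region` below are illustrative and unrelated to the states example), is to broadcast the group means with `transform` and pass them to `fillna`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per region.
data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan], index=list("abcde"))
region = ["East", "East", "East", "West", "West"]

# transform("mean") returns a Series aligned with `data`,
# holding each row's group mean (NaN values are skipped).
group_means = data.groupby(region).transform("mean")

# Fill each missing value with the mean of its own group.
filled = data.fillna(group_means)
```

Here the East gap becomes 2.0 and the West gap becomes 10.0.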

### Example: Random Sampling and Permutation

Suppose we want to draw a random sample from a large
dataset. There are many ways to perform the draws;
this example uses the `sample` method of Series.

First, we'll construct a deck of playing cards:

In [110]:
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'Q', 'K']
cards = []
for suit in suits:
  cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)

This results in a Series of card values indexed by card name:

In [111]:
deck.head(13)

AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
QH     10
KH     10
dtype: int64

Drawing a hand of 5 cards is as simple as `deck.sample(5)`:

In [112]:
deck.sample(5)

JH    10
8S     8
4S     4
4D     4
6H     6
dtype: int64

Now suppose we want to draw two random cards from each suit.
We'll wrap `sample` in a function that takes the number of cards
`n`, and group by the suit, which is the last character of each
card name:

In [114]:
def draw(deck, n=5):
  return deck.sample(n)

def get_suit(card):
  return card[-1]

deck.groupby(get_suit, group_keys=False).apply(draw, n=2)

7C     7
QC    10
3D     3
4D     4
JH    10
AH     1
9S     9
7S     7
dtype: int64
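
Groupby objects also have their own `sample` method, which draws from each group without a helper function. The sketch below rebuilds a deck like the one above; the fixed `random_state` is an assumption added only to make the draw repeatable.

```python
import pandas as pd

# Rebuild a 52-card deck: values indexed by names such as "AH" or "10S".
suits = ["H", "S", "C", "D"]
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
card_val = (list(range(1, 11)) + [10] * 3) * 4
deck = pd.Series(card_val, index=[rank + suit for suit in suits for rank in ranks])

# Group by the suit (last character of the label) and draw two cards per suit.
hand = deck.groupby(lambda card: card[-1]).sample(n=2, random_state=0)

# Count how many cards were drawn from each suit.
per_suit = pd.Series(hand.index.str[-1]).value_counts()
```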

### Example: Group Weighted Average and Correlation

Under the split-apply-combine paradigm of groupby,
operations between two different columns or arrays are
possible. Consider the weighted average problem, where
we have an array of weights and data:

In [115]:
df = pd.DataFrame({"category": ["a", "a", "a", "a",
                                "b", "b", "b", "b"],
                   "data": np.random.standard_normal(8),
                   "weights": np.random.uniform(size=8)})
df

,category,data,weights
0,a,-0.244945,0.577941
1,a,1.650463,0.790817
2,a,2.114479,0.256223
3,a,-1.343586,0.679175
4,b,0.238621,0.967136
5,b,1.338831,0.22594
6,b,-1.688148,0.179038
7,b,-0.876609,0.141842


In [117]:
def get_wavg(frame):
  return np.average(frame['data'], weights=frame['weights'])

df.groupby('category').apply(get_wavg)

category
a    0.344116
b    0.070472
dtype: float64
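
The weighted average of a group is the sum of `data * weights` divided by the sum of `weights`, so it can also be computed with two built-in group sums. This is a sketch on made-up numbers (`toy` is an assumption) that compares both routes.

```python
import numpy as np
import pandas as pd

# Hypothetical data chosen so the result is easy to verify by hand.
toy = pd.DataFrame({"category": ["a", "a", "b", "b"],
                    "data": [1.0, 3.0, 2.0, 4.0],
                    "weights": [1.0, 3.0, 1.0, 1.0]})

# Route 1: apply np.average to each group's data and weights.
via_apply = toy.groupby("category")[["data", "weights"]].apply(
    lambda g: np.average(g["data"], weights=g["weights"])
)

# Route 2: sum(data * weights) / sum(weights), using built-in sums.
weighted_sum = (toy["data"] * toy["weights"]).groupby(toy["category"]).sum()
weight_total = toy["weights"].groupby(toy["category"]).sum()
via_sums = weighted_sum / weight_total
```

For group `a` this is (1*1 + 3*3) / (1 + 3) = 2.5; for group `b`, with equal weights, it is the plain mean 3.0.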

As another example, the author uses a financial dataset
obtained from Yahoo! Finance:

In [125]:
close_px = pd.read_csv('../pydata-book/examples/stock_px.csv', parse_dates=True,
                       index_col=0)
close_px.tail(4)

,AAPL,MSFT,XOM,SPX
2011-10-11,400.29,27.0,76.27,1195.54
2011-10-12,402.19,26.96,77.16,1207.25
2011-10-13,408.43,27.18,76.37,1203.66
2011-10-14,422.0,27.27,78.11,1224.58


First it defines a function that computes the pairwise
correlation of each column with the `SPX` column:

In [126]:
def spx_corr(group):
  return group.corrwith(group['SPX'])

Then it computes the percent changes of `close_px`:

In [127]:
rets = close_px.pct_change().dropna()

Then it defines a function to extract the year
from the datetime label, so it can group per year:

In [134]:
def get_year(x):
  return x.year

by_year = rets.groupby(get_year)

by_year.apply(spx_corr)

,AAPL,MSFT,XOM,SPX
2003,0.541124,0.745174,0.661265,1.0
2004,0.374283,0.588531,0.557742,1.0
2005,0.46754,0.562374,0.63101,1.0
2006,0.428267,0.406126,0.518514,1.0
2007,0.508118,0.65877,0.786264,1.0
2008,0.681434,0.804626,0.828303,1.0
2009,0.707103,0.654902,0.797921,1.0
2010,0.710105,0.730118,0.839057,1.0
2011,0.691931,0.800996,0.859975,1.0


It also computes the correlation between two columns, here the yearly correlation between AAPL and MSFT:

In [135]:
def corr_apple_msft(group):
  return group['AAPL'].corr(group['MSFT'])

by_year.apply(corr_apple_msft)

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

In [133]:
close_px.info()

<class 'pandas.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB


### Example: Group-Wise Linear Regression

Here the author uses `statsmodels.api` to define a function
that runs an ordinary least squares (OLS) regression on each
group and returns the fitted slope (beta) and intercept:

In [136]:
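import statsmodels.api as sm 
def regress(data, yvar=None, xvars=None):
  Y = data[yvar]            # dependent variable
  X = data[xvars].copy()    # copy so the group itself is not modified
  X["intercept"] = 1.       # constant column for the intercept term
  result = sm.OLS(Y, X).fit()
  return result.params      # fitted coefficients as a Series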

In [137]:
by_year.apply(regress, yvar='AAPL', xvars=['SPX'])

,SPX,intercept
2003,1.195406,0.00071
2004,1.363463,0.004201
2005,1.766415,0.003246
2006,1.645496,8e-05
2007,1.198761,0.003438
2008,0.968016,-0.00111
2009,0.879103,0.002954
2010,1.052608,0.001261
2011,0.806605,0.001514
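
When `statsmodels` is not available, the same group-wise pattern can be sketched with NumPy's least-squares solver. This is a swapped-in technique, not the author's code: `regress_np`, `panel`, and the column names are hypothetical, and the data is built so that the true slope and intercept are known.

```python
import numpy as np
import pandas as pd

def regress_np(group, yvar, xvars):
    # Design matrix: the regressors plus a column of ones for the intercept.
    X = np.column_stack([group[xvars].to_numpy(dtype=float),
                         np.ones(len(group))])
    y = group[yvar].to_numpy(dtype=float)
    # Least-squares solution of X @ coef = y.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=list(xvars) + ["intercept"])

# Hypothetical data: y = 2x + 1 in 2003 and y = -x + 2 in 2004.
panel = pd.DataFrame({"year": [2003] * 3 + [2004] * 3,
                      "x": [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
                      "y": [1.0, 3.0, 5.0, 2.0, 1.0, 0.0]})

params = panel.groupby("year")[["x", "y"]].apply(regress_np, yvar="y", xvars=["x"])
```

Each row of `params` holds the slope and intercept fitted for one year. Unlike `sm.OLS`, this returns only the coefficients, with no standard errors or other diagnostics.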


### Summary

- `apply` is a generalization of `agg`: it calls any function, built-in or our own, on each group and then concatenates the results. With it we can:
    - return any number of values for each group
    - pass extra function arguments after the function name
- We can group by `Categoricals`, such as those returned by `pd.cut` and `pd.qcut`
- We can fill with different values for different groups by defining a function that uses a dict mapping
- We can sample by group and return the sampled values, as `apply` will concatenate them
- We can compute statistics across different columns; all we need is to write the function