# Week 8. Data Aggregation and Group Operations

In [1]:
import numpy as np
import pandas as pd
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.random.seed(12345)
np.set_printoptions(precision=4, suppress=True)

One reason for the popularity of relational databases and SQL is the ease with which data can be joined, filtered, transformed, and aggregated. 

However, query languages like SQL are somewhat constrained in the group operations that can be performed. 

As you will see, within Python and pandas, we can perform quite complex **group operations**. 

In this section, you will learn how to:
- (1) Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names);
- (2) Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function;
- (3) Apply within-group transformations or other manipulations, like normalization, linear regression, etc.

---

## 8.3 GroupBy Mechanics

Understanding **split-apply-combine**

* In the first stage, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. 
  * The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (```axis=0```) or its columns (```axis=1```). 
* Once this is done, a function is applied to each group, producing a new value. 
* Finally, the results of all those function applications are combined into a result object. 

In [2]:
data = pd.DataFrame({'key':['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'], 
                     'data':[0, 5, 10, 5, 10, 15, 10, 15, 20]})
data

Unnamed: 0,key,data
0,A,0
1,B,5
2,C,10
3,A,5
4,B,10
5,C,15
6,A,10
7,B,15
8,C,20


In [3]:
data.groupby('key').sum()

Unnamed: 0_level_0,data
key,Unnamed: 1_level_1
A,15
B,30
C,45
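
The three stages can also be written out by hand. The following is a minimal sketch, assuming the same small `data` frame as above; the names `pieces`, `sums`, `manual`, and `builtin` are illustrative only and not part of the original notebook.

```python
import pandas as pd

data = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                     'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Split: one sub-DataFrame per distinct key value
pieces = {key: data[data['key'] == key] for key in data['key'].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece['data'].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums, name='data').sort_index()
manual.index.name = 'key'

# The built-in groupby performs all three stages in one call
builtin = data.groupby('key')['data'].sum()
print(manual)
print(builtin)
```

In practice `groupby` is preferable, since it avoids the explicit Python loop over groups.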


![Screenshot%202023-08-18%20at%202.45.07%20PM.png](attachment:Screenshot%202023-08-18%20at%202.45.07%20PM.png)

#### Each grouping key can take many forms
* A list or array of values that is the same length as the axis being grouped
* A value indicating a column name in a DataFrame
* A ```dict``` or ```Series``` giving a correspondence between the values on the axis being grouped and the group names
* A function to be invoked on the axis index or the individual labels in the index
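
As a rough illustration of the four key forms listed above, the sketch below groups the same values four different ways and should arrive at the same sums each time. The frame `frame`, its labels, and the variable names are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'team': ['x', 'y', 'x', 'y'], 'score': [1, 2, 3, 4]},
                     index=['r0', 'r1', 'r2', 'r3'])

# 1. A column name in the DataFrame
by_column = frame.groupby('team')['score'].sum()

# 2. An array with the same length as the axis being grouped
by_array = frame['score'].groupby(np.array(['x', 'y', 'x', 'y'])).sum()

# 3. A dict mapping index labels to group names
by_dict = frame['score'].groupby({'r0': 'x', 'r1': 'y', 'r2': 'x', 'r3': 'y'}).sum()

# 4. A function called once per index label
by_func = frame['score'].groupby(lambda label: 'x' if label in ('r0', 'r2') else 'y').sum()

for result in (by_column, by_array, by_dict, by_func):
    print(result.to_dict())
```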

In [4]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.204708,1.393406
1,a,two,0.478943,0.092908
2,b,one,-0.519439,0.281746
3,b,two,-0.55573,0.769023
4,a,one,1.965781,1.246435


In [5]:
grouped = df['data1'].groupby(df['key1'])
grouped   

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002B27F472290>

* The above ```grouped``` variable is now a *GroupBy* object.   <br>
<br>
* It has not actually computed anything yet except for some intermediate data about the group key ```df['key1']```.  <br>
<br>
* The idea is that this object has all of the information needed to then apply some operation to each of the groups. 
  * For example, to compute group means we can call the GroupBy’s ```mean``` method:

In [6]:
grouped.mean()

key1
a    0.746672
b   -0.537585
Name: data1, dtype: float64
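
Because the GroupBy object only stores the grouping information, it can be inspected and reused for several aggregations. The sketch below assumes a small frame with fixed values (not the random data used above) and uses the `ngroups` and `groups` attributes to look at the stored grouping.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0]})
grouped = df['data1'].groupby(df['key1'])

# Grouping information is available before any aggregation runs
print(grouped.ngroups)   # number of distinct groups
print(grouped.groups)    # mapping of group name -> row labels

# The same GroupBy object can feed several different aggregations
means = grouped.mean()
totals = grouped.sum()
print(means)
print(totals)
```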

If we instead group the data using two keys, the resulting Series has a **hierarchical index** consisting of the unique pairs of keys observed:

In [7]:
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means

key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64

In [8]:
means.unstack()

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,0.880536,0.478943
b,-0.519439,-0.55573


The grouping keys can also be any lists or arrays of values that have the same length as the axis being grouped:

In [9]:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()

California  2005    0.478943
            2006   -0.519439
Ohio        2005   -0.380219
            2006    1.965781
Name: data1, dtype: float64

In [10]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.204708,1.393406
1,a,two,0.478943,0.092908
2,b,one,-0.519439,0.281746
3,b,two,-0.55573,0.769023
4,a,one,1.965781,1.246435


In [11]:
df.groupby(['key1', 'key2']).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,one,0.880536,1.31992
a,two,0.478943,0.092908
b,one,-0.519439,0.281746
b,two,-0.55573,0.769023


In [12]:
# Another generally useful GroupBy method is size, which returns a Series containing group sizes:
df.groupby(['key1', 'key2']).size()

key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
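
It may help to contrast `size` with `count`: `size` reports the number of rows in each group, while `count` reports the number of non-null values. A minimal sketch, assuming a hypothetical frame `df_na` with one missing value:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'key': ['a', 'a', 'b'],
                      'val': [1.0, np.nan, 3.0]})

# size counts rows, including rows whose value is NaN
sizes = df_na.groupby('key').size()

# count only counts the non-null values in the selected column
counts = df_na.groupby('key')['val'].count()

print(sizes)
print(counts)
```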

### 8.3.1 Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing:
* the group name
* the chunk of data

In [13]:
for name, group in df.groupby('key1'):
    print('-------------------------------')
    print(name)
    print(group)

-------------------------------
a
  key1 key2     data1     data2
0    a  one -0.204708  1.393406
1    a  two  0.478943  0.092908
4    a  one  1.965781  1.246435
-------------------------------
b
  key1 key2     data1     data2
2    b  one -0.519439  0.281746
3    b  two -0.555730  0.769023


In the case of multiple keys, the first element in the tuple will be a **tuple of key values**:

In [14]:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('-------------------------------')
    print((k1, k2))
    print(group)

-------------------------------
('a', 'one')
  key1 key2     data1     data2
0    a  one -0.204708  1.393406
4    a  one  1.965781  1.246435
-------------------------------
('a', 'two')
  key1 key2     data1     data2
1    a  two  0.478943  0.092908
-------------------------------
('b', 'one')
  key1 key2     data1     data2
2    b  one -0.519439  0.281746
-------------------------------
('b', 'two')
  key1 key2    data1     data2
3    b  two -0.55573  0.769023
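
Since iteration yields (name, chunk) pairs, one possible use is to collect the chunks into a dict for later lookup. The sketch below assumes a small frame with fixed values; `pieces` is an illustrative name, and `get_group` is shown as an alternative for fetching a single group.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Build a dict of group name -> sub-DataFrame from the iteration pairs
pieces = {name: group for name, group in df.groupby('key1')}
print(pieces['b'])

# get_group retrieves one group without iterating over all of them
group_b = df.groupby('key1').get_group('b')
print(group_b)
```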


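By default ```groupby``` groups on ```axis=0```, but you can also group on the columns (```axis=1```). Note that ```axis=1``` in ```groupby``` is deprecated as of pandas 2.1, so recent versions emit a ```FutureWarning``` when it is used.
  * For example, we could group the columns of our example ```df``` here by ```dtype``` like so: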

In [15]:
df.dtypes

key1      object
key2      object
data1    float64
data2    float64
dtype: object

In [16]:
grouped = df.groupby(df.dtypes, axis=1)


In [17]:
for dtype, group in grouped:
    print(dtype)
    print(group)

float64
      data1     data2
0 -0.204708  1.393406
1  0.478943  0.092908
2 -0.519439  0.281746
3 -0.555730  0.769023
4  1.965781  1.246435
object
  key1 key2
0    a  one
1    a  two
2    b  one
3    b  two
4    a  one
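
For pandas versions where grouping with `axis=1` is deprecated, the same column blocks can be built without `groupby`. This is only a sketch of one alternative, using a small frame with fixed values; `columns_by_dtype` and `pieces` are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b'],
                   'key2': ['one', 'two', 'one'],
                   'data1': [0.1, 0.2, 0.3],
                   'data2': [1.0, 2.0, 3.0]})

# Collect column names under the string form of their dtype
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

# Select each block of columns, mirroring the (dtype, group) pairs above
pieces = {dtype: df[cols] for dtype, cols in columns_by_dtype.items()}
for dtype, piece in pieces.items():
    print(dtype)
    print(piece)
```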


### 8.3.2 Selecting a Column or Subset of Columns

* Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation.  <br>
<br>
* ```df[['data2']]``` differs from ```df['data2']```: the former is a DataFrame, whereas the latter is a Series. The same distinction applies when indexing a GroupBy object.

In [18]:
df.groupby(['key1', 'key2'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
key1,key2,Unnamed: 2_level_1
a,one,1.31992
a,two,0.092908
b,one,0.281746
b,two,0.769023


In [19]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002B203200F50>

In [20]:
s_grouped.mean()    # of type pd.Series

key1  key2
a     one     1.319920
      two     0.092908
b     one     0.281746
      two     0.769023
Name: data2, dtype: float64
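
Indexing a GroupBy object with a column name can be read as shorthand for selecting the column first and grouping it by the key column. A minimal sketch with fixed values (the names `sugar` and `explicit` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 6.0],
                   'data2': [10.0, 20.0, 30.0, 40.0, 60.0]})

# Column selection on the GroupBy object ...
sugar = df.groupby('key1')['data1'].mean()

# ... gives the same result as grouping the selected Series by the key column
explicit = df['data1'].groupby(df['key1']).mean()

print(sugar)
print(explicit)
```

Selecting only the needed columns before aggregating also avoids computing statistics for columns that are then discarded.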

### 8.3.3 Grouping with Dicts and Series

In [21]:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan   # Add a few NA values
people

Unnamed: 0,a,b,c,d,e
Joe,1.007189,-1.296221,0.274992,0.228913,1.352917
Steve,0.886429,-2.001637,-0.371843,1.669025,-0.43857
Wes,-0.539741,,,-1.021228,-0.577087
Jim,0.124121,0.302614,0.523772,0.00094,1.34381
Travis,-0.713544,-0.831154,-2.370232,-1.860761,-0.860757


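Now, suppose you have a group correspondence for the columns, stored in a dict, and want to sum the columns by group. You could construct an array from this dict to pass to ```groupby```, but instead we can just pass the dict itself.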
 * I included the key ```'f'``` to highlight that unused grouping keys are OK. 

In [22]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}  # Note that we don't have 'f' in 'people'
by_column = people.groupby(mapping, axis=1)   # 'axis=1' means group the columns
by_column.sum()


Unnamed: 0,blue,red
Joe,0.503905,1.063885
Steve,1.297183,-1.553778
Wes,-1.021228,-1.116829
Jim,0.524712,1.770545
Travis,-4.230992,-2.405455
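
On pandas versions that warn about `axis=1`, one alternative is to transpose, group the rows, and transpose back. The sketch below assumes a small `people` frame with fixed values rather than the random data above.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame(np.arange(10.0).reshape(2, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}   # 'f' is unused, as before

# After transposing, the column labels become row labels, so the dict
# can be used as an ordinary row-wise grouping key
by_column = people.T.groupby(mapping).sum().T
print(by_column)
```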


### 8.3.4 Grouping with Functions

* Using Python functions is a more generic way of defining a group mapping compared with a dict or Series. <br>
<br>
* Any function passed as a group key will be called once **per index value**, with the return values being used as the group names.  <br>
<br>
* Consider the example DataFrame from the previous section, which has people’s first names as index values. **Suppose you wanted to group by the length of the names**.  

In [23]:
people

Unnamed: 0,a,b,c,d,e
Joe,1.007189,-1.296221,0.274992,0.228913,1.352917
Steve,0.886429,-2.001637,-0.371843,1.669025,-0.43857
Wes,-0.539741,,,-1.021228,-0.577087
Jim,0.124121,0.302614,0.523772,0.00094,1.34381
Travis,-0.713544,-0.831154,-2.370232,-1.860761,-0.860757


In [24]:
people.groupby(len).sum()

Unnamed: 0,a,b,c,d,e
3,0.591569,-0.993608,0.798764,-0.791374,2.119639
5,0.886429,-2.001637,-0.371843,1.669025,-0.43857
6,-0.713544,-0.831154,-2.370232,-1.860761,-0.860757
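
To make the "called once per index value" behavior concrete, the sketch below computes the same keys explicitly and passes them as a list. It assumes a small frame with fixed values; `name_lengths` is an illustrative name.

```python
import pandas as pd

people = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0]},
                      index=['Joe', 'Steve', 'Wes', 'Travis'])

# Passing len lets pandas call it on every index label
by_func = people.groupby(len).sum()

# Equivalent: compute the key for each label ourselves and pass the list
name_lengths = [len(name) for name in people.index]
by_list = people.groupby(name_lengths).sum()

print(by_func)
print(by_list)
```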


You can even mix functions with arrays, dicts, or Series: 

In [25]:
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()

Unnamed: 0,Unnamed: 1,a,b,c,d,e
3,one,-0.539741,-1.296221,0.274992,-1.021228,-0.577087
3,two,0.124121,0.302614,0.523772,0.00094,1.34381
5,one,0.886429,-2.001637,-0.371843,1.669025,-0.43857
6,two,-0.713544,-0.831154,-2.370232,-1.860761,-0.860757


### 8.3.5 Grouping by Index Levels

A final convenience for hierarchically indexed datasets is the ability to **aggregate using one of the levels of an axis index**.

In [26]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df

cty,US,US,US,JP,JP
tenor,1,3,5,1,3
0,0.560145,-1.265934,0.119827,-1.063512,0.332883
1,-2.359419,-0.199543,-1.541996,-0.970736,-1.30703
2,0.28635,0.377984,-0.753887,0.331286,1.349742
3,0.069877,0.246674,-0.011862,1.004812,1.327195


In [27]:
hier_df.groupby(level='cty', axis=1).count()


cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3


In [28]:
hier_df.groupby(level=0, axis=1).count()


cty,JP,US
0,2,3
1,2,3
2,2,3
3,2,3
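
The same count can be obtained without `axis=1` by transposing so that the hierarchical levels sit on the row index. A minimal sketch, assuming fixed values in place of the random data:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transpose, group the rows by the 'cty' level, then transpose back
counts = hier_df.T.groupby(level='cty').count().T
print(counts)
```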


---

## 8.4 Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including ```mean```, ```count```, ```min```, and ```sum```.

In [29]:
df

Unnamed: 0,key1,key2,data1,data2
0,a,one,-0.204708,1.393406
1,a,two,0.478943,0.092908
2,b,one,-0.519439,0.281746
3,b,two,-0.55573,0.769023
4,a,one,1.965781,1.246435


In [30]:
grouped = df[['key1','data1','data2']].groupby('key1')
grouped['data1'].quantile(0.9)

key1
a    1.668413
b   -0.523068
Name: data1, dtype: float64

To use **your own** aggregation functions, pass any function that aggregates an array to the ```aggregate``` or ```agg``` method:

In [31]:
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
a,2.170488,1.300498
b,0.036292,0.487276
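
Custom functions passed to `agg` are called once per group and are generally slower than the optimized built-in methods. When a statistic can be composed from built-ins, that route may be worth considering. The sketch below, on a small frame with fixed values, compares the two approaches for the peak-to-peak range.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 4.0, 2.0, 5.0, 9.0]})
grouped = df.groupby('key1')

def peak_to_peak(arr):
    return arr.max() - arr.min()

# Custom aggregation: the function runs separately on every group
custom = grouped.agg(peak_to_peak)

# Same statistic composed from two built-in aggregations
builtin = grouped.max() - grouped.min()

print(custom)
print(builtin)
```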


In [32]:
grouped.describe()

Unnamed: 0_level_0,data1,data1,data1,data1,data1,data1,data1,data1,data2,data2,data2,data2,data2,data2,data2,data2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2
a,3.0,0.746672,1.109736,-0.204708,0.137118,0.478943,1.222362,1.965781,3.0,0.910916,0.712217,0.092908,0.669671,1.246435,1.31992,1.393406
b,2.0,-0.537585,0.025662,-0.55573,-0.546657,-0.537585,-0.528512,-0.519439,2.0,0.525384,0.344556,0.281746,0.403565,0.525384,0.647203,0.769023


### 8.4.1 Column-Wise and Multiple Function Application

In [34]:
tips = pd.read_csv('../data/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
0,16.99,1.01,No,Sun,Dinner,2,0.059447
1,10.34,1.66,No,Sun,Dinner,3,0.160542
2,21.01,3.5,No,Sun,Dinner,3,0.166587
3,23.68,3.31,No,Sun,Dinner,2,0.13978
4,24.59,3.61,No,Sun,Dinner,4,0.146808
5,25.29,4.71,No,Sun,Dinner,4,0.18624


Aggregating a Series or all of the columns of a DataFrame is a matter of using ```aggregate``` with the desired function or calling a method like ```mean```.

In [35]:
grouped = tips.groupby(['day', 'smoker'])

In [36]:
grouped_pct = grouped['tip_pct']
grouped_pct.agg('mean')

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

If you pass **a list of functions or function names**, you get back a DataFrame with column names taken from the functions:

In [37]:
grouped_pct.agg(['mean', 'std', peak_to_peak])  # we passed a list of aggregation functions to agg to 
                                                # evaluate independently on the data groups.

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,peak_to_peak
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fri,No,0.15165,0.028123,0.067349
Fri,Yes,0.174783,0.051293,0.159925
Sat,No,0.158048,0.039767,0.235193
Sat,Yes,0.147906,0.061375,0.290095
Sun,No,0.160113,0.042347,0.193226
Sun,Yes,0.18725,0.154134,0.644685
Thur,No,0.160298,0.038774,0.19335
Thur,Yes,0.163863,0.039389,0.15124


If you pass **a list of (name, function) tuples**, the first element of each tuple will be used as the DataFrame column name. 

In [38]:
grouped_pct.agg([('foo', 'mean'), ('bar', 'std')])   # the string 'std' avoids the warning recent pandas raises for np.std


Unnamed: 0_level_0,Unnamed: 1_level_0,foo,bar
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,0.15165,0.028123
Fri,Yes,0.174783,0.051293
Sat,No,0.158048,0.039767
Sat,Yes,0.147906,0.061375
Sun,No,0.160113,0.042347
Sun,Yes,0.18725,0.154134
Thur,No,0.160298,0.038774
Thur,Yes,0.163863,0.039389


* With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. <br>
<br>
* To start, suppose we wanted to compute the same three statistics for the ```tip_pct``` and ```total_bill``` columns.

In [39]:
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,max,count,mean,max
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Fri,No,4,0.15165,0.187735,4,18.42,22.75
Fri,Yes,15,0.174783,0.26348,15,16.813333,40.17
Sat,No,45,0.158048,0.29199,45,19.661778,48.33
Sat,Yes,42,0.147906,0.325733,42,21.276667,50.81
Sun,No,57,0.160113,0.252672,57,20.506667,48.17
Sun,Yes,19,0.18725,0.710345,19,24.12,45.35
Thur,No,45,0.160298,0.266312,45,17.113111,41.19
Thur,Yes,17,0.163863,0.241255,17,19.190588,43.11


As before, a list of tuples with **custom names** can be passed:

In [40]:
# The column labels are German for 'average' and 'deviation'
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', 'var')]
grouped[['tip_pct', 'total_bill']].agg(ftuples)


Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,Durchschnitt,Abweichung,Durchschnitt,Abweichung
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Fri,No,0.15165,0.000791,18.42,25.596333
Fri,Yes,0.174783,0.002631,16.813333,82.562438
Sat,No,0.158048,0.001581,19.661778,79.908965
Sat,Yes,0.147906,0.003767,21.276667,101.387535
Sun,No,0.160113,0.001793,20.506667,66.09998
Sun,Yes,0.18725,0.023757,24.12,109.046044
Thur,No,0.160298,0.001503,17.113111,59.625081
Thur,Yes,0.163863,0.001551,19.190588,69.808518


Now, suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dict to ```agg``` that contains a mapping of column names to any of the function specifications listed so far.

In [41]:
grouped.agg({'tip' : 'max', 'size' : 'sum'})


Unnamed: 0_level_0,Unnamed: 1_level_0,tip,size
day,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Fri,No,3.5,9
Fri,Yes,4.73,31
Sat,No,9.0,115
Sat,Yes,10.0,104
Sun,No,6.0,167
Sun,Yes,6.5,49
Thur,No,6.7,112
Thur,Yes,5.0,40


In [42]:
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,tip_pct,size
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean,std,sum
day,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Fri,No,0.120385,0.187735,0.15165,0.028123,9
Fri,Yes,0.103555,0.26348,0.174783,0.051293,31
Sat,No,0.056797,0.29199,0.158048,0.039767,115
Sat,Yes,0.035638,0.325733,0.147906,0.061375,104
Sun,No,0.059447,0.252672,0.160113,0.042347,167
Sun,Yes,0.06566,0.710345,0.18725,0.154134,49
Thur,No,0.072961,0.266312,0.160298,0.038774,112
Thur,Yes,0.090014,0.241255,0.163863,0.039389,40
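
A dict with lists of functions produces hierarchical columns. If flat, custom-named columns are preferred, pandas (0.25 and later) also accepts keyword arguments of the form `new_name=(column, function)`. The sketch below uses a tiny hypothetical stand-in for the tips data, since the CSV file is not bundled here.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data
tips_small = pd.DataFrame({'day': ['Fri', 'Fri', 'Sat', 'Sat'],
                           'tip': [1.0, 3.0, 2.0, 5.0],
                           'size': [2, 4, 3, 1]})

# Each keyword names an output column; the tuple gives (input column, function)
summary = tips_small.groupby('day').agg(max_tip=('tip', 'max'),
                                        total_size=('size', 'sum'))
print(summary)
```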


---

## 8.5 Apply: General split-apply-combine

The most general-purpose GroupBy method is ```apply```. 

Returning to the tipping dataset from before, suppose you wanted to select **the top five ```tip_pct``` values** by group. First, write a function that selects the rows with the largest values in a particular column:

In [43]:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

top(tips, n=6)

Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
232,11.61,3.39,No,Sat,Dinner,2,0.29199
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


Now, if we group by ```smoker```, say, and call ```apply``` with this function, we get the following:

In [44]:
tips.groupby('smoker').apply(top)


Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,tip,smoker,day,time,size,tip_pct
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,88,24.71,5.85,No,Thur,Lunch,2,0.236746
No,185,20.69,5.0,No,Sun,Dinner,5,0.241663
No,51,10.29,2.6,No,Sun,Dinner,2,0.252672
No,149,7.51,2.0,No,Thur,Lunch,2,0.266312
No,232,11.61,3.39,No,Sat,Dinner,2,0.29199
Yes,109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
Yes,183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
Yes,67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
Yes,178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
Yes,172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


What has happened here? 

The ```top``` function is called on each group of rows from the DataFrame, and then the results are glued together using ```pandas.concat```, labeling the pieces with the group names. The result therefore has a hierarchical index whose inner level contains index values from the original DataFrame.

If you pass a function to ```apply``` that takes other arguments or keywords, you can pass these after the function:

In [45]:
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,total_bill,tip,smoker,day,time,size,tip_pct
smoker,day,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
No,Fri,94,22.75,3.25,No,Fri,Dinner,2,0.142857
No,Sat,212,48.33,9.0,No,Sat,Dinner,4,0.18622
No,Sun,156,48.17,5.0,No,Sun,Dinner,6,0.103799
No,Thur,142,41.19,5.0,No,Thur,Lunch,5,0.121389
Yes,Fri,95,40.17,4.73,Yes,Fri,Dinner,4,0.11775
Yes,Sat,170,50.81,10.0,Yes,Sat,Dinner,3,0.196812
Yes,Sun,182,45.35,3.5,Yes,Sun,Dinner,3,0.077178
Yes,Thur,197,43.11,5.0,Yes,Thur,Lunch,4,0.115982
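
For the specific task of picking the largest values per group, a built-in route exists as well: `nlargest` on a grouped Series. This is a sketch on a small hypothetical frame, not a replacement for the general `apply` pattern shown above.

```python
import pandas as pd

# Hypothetical stand-in for the tips data
tips_small = pd.DataFrame({'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
                           'tip_pct': [0.10, 0.30, 0.20, 0.15, 0.05, 0.25]})

# Two largest tip_pct values within each smoker group; the result is a
# Series indexed by (group key, original row label)
top2 = tips_small.groupby('smoker')['tip_pct'].nlargest(2)
print(top2)

# The inner index level can be used to recover the full original rows
top_rows = tips_small.loc[top2.index.get_level_values(1)]
print(top_rows)
```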


In [46]:
result = tips.groupby('smoker')['tip_pct'].describe()
result

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
smoker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
No,151.0,0.159328,0.03991,0.056797,0.136906,0.155625,0.185014,0.29199
Yes,93.0,0.163196,0.085119,0.035638,0.106771,0.153846,0.195059,0.710345


In [47]:
result.unstack('smoker')

       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64

### 8.5.1 Suppressing the Group Keys

In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. 

You can disable this by passing ```group_keys=False``` to ```groupby```:

In [48]:
tips.groupby('smoker', group_keys=False).apply(top)


Unnamed: 0,total_bill,tip,smoker,day,time,size,tip_pct
88,24.71,5.85,No,Thur,Lunch,2,0.236746
185,20.69,5.0,No,Sun,Dinner,5,0.241663
51,10.29,2.6,No,Sun,Dinner,2,0.252672
149,7.51,2.0,No,Thur,Lunch,2,0.266312
232,11.61,3.39,No,Sat,Dinner,2,0.29199
109,14.31,4.0,Yes,Sat,Dinner,2,0.279525
183,23.17,6.5,Yes,Sun,Dinner,4,0.280535
67,3.07,1.0,Yes,Sat,Dinner,1,0.325733
178,9.6,4.0,Yes,Sun,Dinner,2,0.416667
172,7.25,5.15,Yes,Sun,Dinner,2,0.710345


### 8.5.2 Quantile and Bucket Analysis (```cut``` and ```qcut```)

In [49]:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]     # The Categorical object returned by cut can be passed directly to groupby

0     (-1.23, 0.489]
1    (-2.956, -1.23]
2     (-1.23, 0.489]
3     (0.489, 2.208]
4     (-1.23, 0.489]
5     (0.489, 2.208]
6     (-1.23, 0.489]
7     (-1.23, 0.489]
8     (0.489, 2.208]
9     (0.489, 2.208]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]

The ```Categorical``` object returned by ```cut``` can be passed directly to ```groupby```.
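
Note that `cut` with an integer produces equal-width buckets, so the number of observations per bucket usually differs, whereas `qcut` chooses bucket edges from sample quantiles so that bucket sizes come out equal. The sketch below illustrates the difference on a deliberately skewed, deterministic series; all variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Squares of 0..99: many small values, few large ones
values = pd.Series(np.arange(100) ** 2)

# cut: four buckets of equal width over the value range
equal_width = pd.cut(values, 4)
cut_sizes = values.groupby(equal_width, observed=True).size()

# qcut: four buckets holding an equal number of observations
equal_count = pd.qcut(values, 4)
qcut_sizes = values.groupby(equal_count, observed=True).size()

# labels=False returns integer bucket numbers instead of interval labels
bucket_numbers = pd.qcut(values, 4, labels=False)

print(cut_sizes)
print(qcut_sizes)
print(bucket_numbers.head())
```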

In [50]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}


In [51]:
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats)


data1                 
(-2.956, -1.23]  min       -3.399312
                 max        1.670835
                 count     95.000000
                 mean      -0.039521
(-1.23, 0.489]   min       -2.989741
                 max        3.260383
                 count    598.000000
                 mean      -0.002051
(0.489, 2.208]   min       -3.745356
                 max        2.954439
                 count    297.000000
                 mean       0.081822
(2.208, 3.928]   min       -1.929776
                 max        1.765640
                 count     10.000000
                 mean       0.024750
Name: data2, dtype: float64

In [52]:
grouped.apply(get_stats).unstack()

Unnamed: 0_level_0,min,max,count,mean
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"(-2.956, -1.23]",-3.399312,1.670835,95.0,-0.039521
"(-1.23, 0.489]",-2.989741,3.260383,598.0,-0.002051
"(0.489, 2.208]",-3.745356,2.954439,297.0,0.081822
"(2.208, 3.928]",-1.929776,1.76564,10.0,0.02475


In [53]:
# Cut into 10 equal-size buckets; pass labels=False to get quantile numbers instead of interval labels
grouping = pd.qcut(frame.data1, 10, labels=None)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()


Unnamed: 0_level_0,min,max,count,mean
data1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
"(-2.9499999999999997, -1.191]",-3.399312,1.670835,100.0,-0.049902
"(-1.191, -0.881]",-1.950098,2.628441,100.0,0.030989
"(-0.881, -0.553]",-2.925113,2.527939,100.0,-0.067179
"(-0.553, -0.303]",-2.315555,3.260383,100.0,0.065713
"(-0.303, -0.029]",-2.047939,2.074345,100.0,-0.111653
"(-0.029, 0.213]",-2.989741,2.18481,100.0,0.05213
"(0.213, 0.503]",-2.223506,2.458842,100.0,-0.021489
"(0.503, 0.802]",-3.05699,2.954439,100.0,-0.026459
"(0.802, 1.286]",-3.745356,2.735527,100.0,0.103406
"(1.286, 3.928]",-2.064111,2.37702,100.0,0.220122


### Example: Filling Missing Values with Group-Specific Values

In [54]:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s

0         NaN
1   -0.125921
2         NaN
3   -0.884475
4         NaN
5    0.227290
dtype: float64

In [55]:
s.fillna(s.mean())

0   -0.261035
1   -0.125921
2   -0.261035
3   -0.884475
4   -0.261035
5    0.227290
dtype: float64

In [56]:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data

Ohio          0.922264
New York     -2.153545
Vermont      -0.365757
Florida      -0.375842
Oregon        0.329939
Nevada        0.981994
California    1.105913
Idaho        -1.613716
dtype: float64

In [57]:
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data

Ohio          0.922264
New York     -2.153545
Vermont            NaN
Florida      -0.375842
Oregon        0.329939
Nevada             NaN
California    1.105913
Idaho              NaN
dtype: float64

We can fill the NA values using the group means like so:

In [58]:
data.groupby(group_key).mean()

East   -0.535707
West    0.717926
dtype: float64

In [59]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)

East  Ohio          0.922264
      New York     -2.153545
      Vermont      -0.535707
      Florida      -0.375842
West  Oregon        0.329939
      Nevada        0.717926
      California    1.105913
      Idaho         0.717926
dtype: float64
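
If the original flat index should be kept instead of the hierarchical one produced by `apply`, one option is to broadcast the group means with `transform` and use them as fill values. A minimal sketch, assuming a small Series with fixed values and hypothetical labels:

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
                 index=['a', 'b', 'c', 'd', 'e', 'f'])
group_key = ['East'] * 3 + ['West'] * 3

# transform returns a Series aligned with `data`, holding each row's group mean
group_means = data.groupby(group_key).transform('mean')

# fillna with a Series fills each missing value from the matching index label
filled = data.fillna(group_means)
print(filled)
```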

In another case, you might have predefined fill values in your code that vary by group.

In [60]:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)

East  Ohio          0.922264
      New York     -2.153545
      Vermont       0.500000
      Florida      -0.375842
West  Oregon        0.329939
      Nevada       -1.000000
      California    1.105913
      Idaho        -1.000000
dtype: float64

### Example: Group Weighted Average and Correlation

In [61]:
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a',
                                'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
df

Unnamed: 0,category,data,weights
0,a,1.561587,0.383597
1,a,0.40651,0.061095
2,a,0.359244,0.824554
3,a,-0.614436,0.381261
4,b,-1.691656,0.132297
5,b,0.758701,0.876379
6,b,-0.682273,0.510716
7,b,-1.038534,0.324901


In [62]:
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)


category
a    0.415516
b   -0.132713
dtype: float64
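
The weighted average computed by `np.average` is the sum of `data * weights` divided by the sum of `weights` within each group. The sketch below checks this on a small frame with fixed values. Selecting the two value columns before calling `apply` is a choice made here so that the function never receives the grouping column; it is not taken from the notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'category': ['a', 'a', 'b', 'b'],
                   'data': [1.0, 3.0, 2.0, 4.0],
                   'weights': [1.0, 3.0, 1.0, 1.0]})
grouped = df.groupby('category')[['data', 'weights']]

# Weighted average via NumPy
wavg = grouped.apply(lambda g: np.average(g['data'], weights=g['weights']))

# Same quantity written out explicitly
manual = grouped.apply(lambda g: (g['data'] * g['weights']).sum() / g['weights'].sum())

print(wavg)
print(manual)
```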

In [64]:
close_px = pd.read_csv('../data/stock_px_2.csv', parse_dates=True, index_col=0)
close_px.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB


In [65]:
close_px[-4:]

Unnamed: 0,AAPL,MSFT,XOM,SPX
2011-10-11,400.29,27.0,76.27,1195.54
2011-10-12,402.19,26.96,77.16,1207.25
2011-10-13,408.43,27.18,76.37,1203.66
2011-10-14,422.0,27.27,78.11,1224.58


In [66]:
spx_corr = lambda x: x.corrwith(x['SPX'])

In [67]:
rets = close_px.pct_change().dropna()   # compute percent change on close_px using pct_change

Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the ```year``` attribute of each ```datetime``` label:

In [68]:
get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)

Unnamed: 0,AAPL,MSFT,XOM,SPX
2003,0.541124,0.745174,0.661265,1.0
2004,0.374283,0.588531,0.557742,1.0
2005,0.46754,0.562374,0.63101,1.0
2006,0.428267,0.406126,0.518514,1.0
2007,0.508118,0.65877,0.786264,1.0
2008,0.681434,0.804626,0.828303,1.0
2009,0.707103,0.654902,0.797921,1.0
2010,0.710105,0.730118,0.839057,1.0
2011,0.691931,0.800996,0.859975,1.0


You could also compute **inter-column correlations**. Here we compute the annual correlation between Apple and Microsoft:

In [69]:
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))

2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64

### Example: Group-Wise Linear Regression

I can define the following regress function (using the ```statsmodels``` econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:

In [72]:
import statsmodels.api as sm

In [None]:
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()   # copy so that adding a column does not touch a view of `data`
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params

Now, to run a yearly linear regression of AAPL on SPX returns, execute:

In [81]:
by_year.apply(regress, 'AAPL', ['SPX'])

Unnamed: 0,SPX,intercept
2003,1.195406,0.00071
2004,1.363463,0.004201
2005,1.766415,0.003246
2006,1.645496,8e-05
2007,1.198761,0.003438
2008,0.968016,-0.00111
2009,0.879103,0.002954
2010,1.052608,0.001261
2011,0.806605,0.001514
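
To see what the group-wise regression computes without depending on `statsmodels`, the sketch below swaps in `numpy.linalg.lstsq` and fits a slope and intercept per group on synthetic, noise-free data. The frame, the column names `x` and `y`, and the helper `fit_line` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic data: y = 2x + 1 in 2003 and y = -x + 0.5 in 2004
frame = pd.DataFrame({'year': [2003] * 4 + [2004] * 4,
                      'x': [0.0, 1.0, 2.0, 3.0] * 2})
frame['y'] = np.where(frame['year'] == 2003,
                      2.0 * frame['x'] + 1.0,
                      -1.0 * frame['x'] + 0.5)

def fit_line(group):
    # Design matrix with the regressor and a constant column for the intercept
    X = np.column_stack([group['x'].to_numpy(), np.ones(len(group))])
    coef, *_ = np.linalg.lstsq(X, group['y'].to_numpy(), rcond=None)
    return pd.Series(coef, index=['x', 'intercept'])

# One least-squares fit per year; each returned Series becomes a row
params = frame.groupby('year')[['x', 'y']].apply(fit_line)
print(params)
```

With noise-free inputs the fitted coefficients should recover the slopes and intercepts used to build the data, up to floating-point error.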


In [75]:
get_ym = lambda x: x.year*100+x.month
by_ym = rets.groupby(get_ym)

In [76]:
for ym, chunk in by_ym:
    print(ym)
    print(chunk)

200301
                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386
2003-01-10  0.002725  0.001824 -0.013927  0.000000
2003-01-13 -0.005435  0.008648 -0.004134 -0.001412
2003-01-14 -0.002732  0.010379  0.008993  0.005830
2003-01-15 -0.010959 -0.012506 -0.013713 -0.014426
2003-01-16  0.012465 -0.016282  0.004519 -0.003942
2003-01-17 -0.035568 -0.070345 -0.010381 -0.014017
2003-01-21 -0.005674 -0.002473 -0.023077 -0.015702
2003-01-22 -0.009986 -0.006445 -0.012885 -0.010432
2003-01-23  0.021614  0.024950 -0.002175  0.010224
2003-01-24 -0.026798 -0.046251 -0.021439 -0.029233
2003-01-27  0.024638 -0.013783 -0.026736 -0.016160
2003-01-28  0.031117 -0.007246  0.026326  0.013050
2003-01-29  0.024691  0.022419  0.036431  0.006779
2003-01-30 -0.041499 -0.

In [77]:
by_ym.apply(regress, 'AAPL', ['SPX'])

Unnamed: 0,SPX,intercept
200301,0.857212,0.001180
200302,1.272210,0.003604
200303,0.788352,-0.003169
200304,1.261566,-0.004209
200305,0.782133,0.009824
...,...,...
201106,0.819463,-0.000872
201107,0.783452,0.008492
201108,0.862569,0.001552
201109,0.429379,0.001118


---

# END