# Split-Apply-Combine Method

Categorizing a dataset and applying a function to each group, whether an aggregation or a transformation, is often a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. Pandas provides a flexible groupby interface, enabling you to slice, dice, and summarize datasets in a natural way.

Probably one of the reasons for the popularity of SQL and relational databases is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL are somewhat constrained in the kinds of group operations that can be performed. This is where Python's expressiveness stands out: we can perform quite complex group operations by utilizing any function that accepts a pandas object or NumPy array.

Hadley Wickham, an author of many popular R packages, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more criteria (or keys) that you provide.

Splitting is performed along a particular axis of the object: a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of those function applications are combined into a result object. The form of the resulting object will usually depend on what's being done to the data.

<img src="assets/images/general-split-apply-combine.svg" width="700" />
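To make the three stages concrete, here is a minimal sketch that performs split, apply, and combine by hand and then compares the result with a single `groupby` call. The toy DataFrame and its `key` and `value` columns are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: one key column and one value column.
df = pd.DataFrame({
    'key': ['A', 'B', 'C', 'A', 'B', 'C'],
    'value': [0, 5, 10, 5, 10, 15],
})

# Split: one sub-Series of values per distinct key.
pieces = {key: df.loc[df['key'] == key, 'value'] for key in df['key'].unique()}

# Apply: reduce each piece to a single number.
sums = {key: piece.sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='value').sort_index()

# groupby performs the same three steps in one call.
with_groupby = df.groupby('key')['value'].sum()

print(manual)
print(with_groupby)
```

Both Series hold one total per key; the `groupby` version simply hides the bookkeeping.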

Each grouping key can take many forms, and the keys do not have to be all of the same type:

- a list or array of values that is the same length as the axis being grouped
- a value indicating a column name in a DataFrame
- a dict or Series giving a correspondence between values on the axis being grouped and group names
- a function to be invoked on the axis index or the individual labels in the index
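The four kinds of key listed above can be sketched side by side. The small DataFrame, its labels, and the group names below are hypothetical and chosen only so that each result is easy to check by eye.

```python
import pandas as pd

# Hypothetical data: four people, each with a team and a score.
df = pd.DataFrame(
    {'team': ['x', 'y', 'x', 'y'], 'score': [1, 2, 3, 4]},
    index=['ann', 'bob', 'cy', 'dee'],
)

# 1. A list with one entry per row.
by_list = df['score'].groupby(['odd', 'even', 'odd', 'even']).sum()

# 2. A column name of the DataFrame.
by_column = df.groupby('team')['score'].sum()

# 3. A dict mapping index labels to group names.
by_dict = df['score'].groupby(
    {'ann': 'g1', 'bob': 'g1', 'cy': 'g2', 'dee': 'g2'}
).sum()

# 4. A function called once per index label (here: the label's length).
by_func = df['score'].groupby(len).sum()

for result in (by_list, by_column, by_dict, by_func):
    print(result, end='\n\n')
```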

In [1]:
import pandas as pd
import numpy as np

## Group By

In [2]:
product_sheet = pd.DataFrame({
    'product': np.random.choice(['shirt','shoes', 'pants'], 50),
    'color': np.random.choice(['black', 'red', 'green', 'yellow'], 50),
    'manufacturer': np.random.choice(['company 1', 'company 2', 'company 3'], 50),
    'price': np.random.randint(30, 100, 50),
    'discount': np.random.randint(10, 25, 50)
})

In [3]:
product_sheet.head()

,product,color,manufacturer,price,discount
0,pants,yellow,company 1,91,12
1,shoes,yellow,company 2,78,22
2,pants,green,company 1,58,22
3,shirt,black,company 2,74,11
4,shoes,yellow,company 1,41,20


Now, suppose you want to compute the average price of each product per manufacturer. One way to do it is to call groupby on `manufacturer` and `product`, which returns a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group keys. The idea is that this object has all the information needed to then apply some operation to each group.

So, for example, we can apply the `mean()` function to the groups to get the average price per product per manufacturer:

In [4]:
grouped = product_sheet.groupby(['manufacturer', 'product'])
# Select price before averaging so the non-numeric color column is not aggregated
grouped['price'].mean()

manufacturer  product
company 1     pants      65.142857
              shirt      49.400000
              shoes      64.833333
company 2     pants      62.428571
              shirt      74.666667
              shoes      62.000000
company 3     pants      52.500000
              shirt      55.285714
              shoes      62.375000
Name: price, dtype: float64
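As a rough illustration of the point that a GroupBy object holds only grouping information until an operation is requested, the sketch below inspects one before any aggregation runs. The three-row DataFrame is a hypothetical stand-in for `product_sheet`.

```python
import pandas as pd

# Hypothetical stand-in for the product sheet.
df = pd.DataFrame({
    'manufacturer': ['company 1', 'company 1', 'company 2'],
    'product': ['shirt', 'shirt', 'shoes'],
    'price': [40, 60, 80],
})
grouped = df.groupby(['manufacturer', 'product'])

print(type(grouped).__name__)  # a GroupBy object, not a result table
print(grouped.ngroups)         # number of distinct (manufacturer, product) pairs
print(grouped.size())          # number of rows that fall into each group
```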

We can also display the result in a more intuitive form by calling `unstack()`.

The takeaway is that the data (a Series) has been aggregated according to the group keys, producing a new Series that is now indexed by the unique values used to group the original dataset, in this case (manufacturer, product).

In [5]:
grouped['price'].mean().unstack()

product,pants,shirt,shoes
manufacturer,,,
company 1,65.142857,49.4,64.833333
company 2,62.428571,74.666667,62.0
company 3,52.5,55.285714,62.375
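The introduction mentions pivot tables; the grouped mean followed by `unstack()` is essentially what `pivot_table` produces. The following sketch, built on a small hypothetical `sales` DataFrame, compares the two routes.

```python
import pandas as pd

# Hypothetical sales data with the same column names as the product sheet.
sales = pd.DataFrame({
    'manufacturer': ['company 1', 'company 1', 'company 2', 'company 2', 'company 2'],
    'product': ['shirt', 'shoes', 'shirt', 'shirt', 'shoes'],
    'price': [40, 60, 50, 70, 80],
})

# Route 1: group, aggregate, then move the inner index level to the columns.
via_groupby = sales.groupby(['manufacturer', 'product'])['price'].mean().unstack()

# Route 2: ask for the same layout directly.
via_pivot = sales.pivot_table(
    index='manufacturer', columns='product', values='price', aggfunc='mean'
)

print(via_groupby)
print(via_pivot)
```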


### Iterating over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

For example, our data was grouped by manufacturer and product, so each key is a tuple with the following values:

In [6]:
for key, group in grouped:
  print(key)

('company 1', 'pants')
('company 1', 'shirt')
('company 1', 'shoes')
('company 2', 'pants')
('company 2', 'shirt')
('company 2', 'shoes')
('company 3', 'pants')
('company 3', 'shirt')
('company 3', 'shoes')
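Because iteration yields (key, chunk) pairs, one convenient pattern is to collect the chunks in a dict so that a single group can be looked up by name. This is a sketch on hypothetical data; `get_group` is shown as the built-in way to fetch one chunk.

```python
import pandas as pd

# Hypothetical data grouped by a single column, so each key is a plain string.
df = pd.DataFrame({
    'manufacturer': ['company 1', 'company 2', 'company 1'],
    'price': [30, 45, 60],
})

# Collect every chunk under its group key.
pieces = {key: group for key, group in df.groupby('manufacturer')}
company_1 = pieces['company 1']

# get_group retrieves one chunk without iterating over all of them.
same_piece = df.groupby('manufacturer').get_group('company 1')

print(company_1)
```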


### Grouping by Dict and Series
Grouping information may exist in a form other than an array. Consider the following DataFrame:

In [7]:
people = pd.DataFrame(
  np.random.randn(5,5), 
  columns=['a', 'b', 'c', 'd', 'e'], 
  index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
)
people.iloc[2:3, [1,2]] = np.nan
people

,a,b,c,d,e
Joe,-0.122306,-0.4751,-0.213892,0.397373,0.181296
Steve,0.958647,-0.171377,-1.818325,0.20167,1.932927
Wes,-0.659468,,,-0.27231,-0.565033
Jim,0.632094,0.936541,-1.206588,1.141811,0.609661
Travis,-1.084104,0.04226,-0.20318,0.102754,-0.355167


Now, suppose we have a group correspondence for the columns and want to sum the columns together by group:


In [8]:
mapping = {
  'a': 'red', 
  'b': 'red', 
  'c': 'blue', 
  'd': 'blue', 
  'e': 'red', 
  'f': 'orange' 
}

We could construct an array from this dict to pass to groupby, but instead we can pass the dict like so:

In [9]:
by_column = people.groupby(mapping, axis=1)
by_column.sum()

,blue,red
Joe,0.183481,-0.41611
Steve,-1.616654,2.720198
Wes,-0.27231,-1.2245
Jim,-0.064777,2.178295
Travis,-0.100426,-1.397011


Unused grouping keys will be ignored, in this case `f`.
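Note that recent pandas releases (2.1 and later, as far as I am aware) deprecate the `axis=1` argument of `groupby`. A sketch of an equivalent that avoids it is to transpose, group the rows, and transpose back. The small `people` frame below uses made-up values so the sums are easy to verify.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row version of the people DataFrame with easy numbers.
people = pd.DataFrame(
    np.arange(10, dtype=float).reshape(2, 5),
    columns=['a', 'b', 'c', 'd', 'e'],
    index=['Joe', 'Steve'],
)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}

# Transpose so the columns become rows, group those rows, then transpose back.
by_column = people.T.groupby(mapping).sum().T

print(by_column)
```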

The same functionality holds for a Series, which can be viewed as a fixed-size mapping:

In [10]:
pd.Series(mapping)

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [11]:
people.groupby(pd.Series(mapping), axis=1).sum()

,blue,red
Joe,0.183481,-0.41611
Steve,-1.616654,2.720198
Wes,-0.27231,-1.2245
Jim,-0.064777,2.178295
Travis,-0.100426,-1.397011


### Grouping with Functions

Mapping with a dict or a series is very easy. However, it operates under the assumption that you have a fixed map for your data. Sometimes, this isn't the case.

Suppose you wanted to group by the length of the names; while you could compute an array of string lengths, it's simpler to just pass the `len` function.

In [12]:
for length, group in people.groupby(len):
  print(f'Name length: {length}')
  display(group)

Name length: 3


,a,b,c,d,e
Joe,-0.122306,-0.4751,-0.213892,0.397373,0.181296
Wes,-0.659468,,,-0.27231,-0.565033
Jim,0.632094,0.936541,-1.206588,1.141811,0.609661


Name length: 5


,a,b,c,d,e
Steve,0.958647,-0.171377,-1.818325,0.20167,1.932927


Name length: 6


,a,b,c,d,e
Travis,-1.084104,0.04226,-0.20318,0.102754,-0.355167
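Any callable that accepts a single index label can serve as the key, not only built-ins such as `len`. As a small sketch, the hypothetical helper `first_letter` below groups a Series by the initial of each name.

```python
import pandas as pd

# Hypothetical scores indexed by name.
scores = pd.Series([10, 20, 30, 40], index=['Joe', 'Jim', 'Steve', 'Sam'])

def first_letter(label):
    # Called once per index label; the return value becomes the group key.
    return label[0]

totals = scores.groupby(first_letter).sum()
print(totals)
```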


Further, mixing functions with lists or dicts is not a problem, as everything is converted into arrays internally:

In [13]:
key_team = ['Team 1', 'Team 1', 'Team 2', 'Team 2', 'Team 1']

for key, group in people.groupby([len, key_team]):
  length, team = key
  print(f'Name length: {length}, Team: {team}')
  display(group)

Name length: 3, Team: Team 1


,a,b,c,d,e
Joe,-0.122306,-0.4751,-0.213892,0.397373,0.181296


Name length: 3, Team: Team 2


,a,b,c,d,e
Wes,-0.659468,,,-0.27231,-0.565033
Jim,0.632094,0.936541,-1.206588,1.141811,0.609661


Name length: 5, Team: Team 1


,a,b,c,d,e
Steve,0.958647,-0.171377,-1.818325,0.20167,1.932927


Name length: 6, Team: Team 1


,a,b,c,d,e
Travis,-1.084104,0.04226,-0.20318,0.102754,-0.355167


In [14]:
key_school = {
    'Joe': 'University C', 
    'Steve': 'University B',
    'Wes': 'University C',
    'Jim': 'University B',
    'Travis': 'University A',
}

for key, group in people.groupby([len, key_school]):
  length, team = key
  print(f'Name length: {length}, Team: {team}')
  display(group)

Name length: 3, Team: University B


,a,b,c,d,e
Jim,0.632094,0.936541,-1.206588,1.141811,0.609661


Name length: 3, Team: University C


,a,b,c,d,e
Joe,-0.122306,-0.4751,-0.213892,0.397373,0.181296
Wes,-0.659468,,,-0.27231,-0.565033


Name length: 5, Team: University B


,a,b,c,d,e
Steve,0.958647,-0.171377,-1.818325,0.20167,1.932927


Name length: 6, Team: University A


,a,b,c,d,e
Travis,-1.084104,0.04226,-0.20318,0.102754,-0.355167


## Data Aggregation

Aggregation refers to any data transformation that produces scalar values from arrays, such as mean, count, min, or sum. However, we are not limited to these methods.

Here is a list of the optimized GroupBy methods:

- count: Number of non-NA values in group
- sum: Sum of all non-NA values
- mean: Mean of all non-NA values
- median: Arithmetic median of all non-NA values
- std, var: Unbiased (n - 1 denominator) standard deviation and variance
- min, max: Minimum and maximum of non-NA values
- prod: Product of non-NA values
- first, last: First and last non-NA values
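Several entries in the list above mention non-NA values. The sketch below, on hypothetical data with one missing value, shows what that means in practice by contrasting `count` with `size` and checking that `mean` skips the gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' contains one missing value.
df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 3.0, 5.0],
})
grouped = df.groupby('key')['value']

counts = grouped.count()  # non-NA values only
sizes = grouped.size()    # every row, NA included
means = grouped.mean()    # the NaN is skipped, not treated as zero

print(pd.DataFrame({'count': counts, 'size': sizes, 'mean': means}))
```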

You can also use your own custom aggregate functions and additionally call any method that is also defined on the grouped object. For example, quantile computes the sample quantiles of a Series or a DataFrame's columns. While quantile is not in the list above, it is a Series method and thus available for use.
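As a brief sketch of calling a Series method through the grouped object, the example below computes the median of each group with `quantile(0.5)`. The data is hypothetical.

```python
import pandas as pd

# Hypothetical data: three values per group.
df = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1, 2, 3, 10, 20, 30],
})

# quantile is evaluated separately within each group.
medians = df.groupby('key')['value'].quantile(0.5)
print(medians)
```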

To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:

In [15]:
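def peak_to_peak(arr):
  # Range of the group: largest value minus smallest value
  return arr.max() - arr.min()

# Restrict to the numeric columns; subtraction is undefined for the text column `color`
grouped[['discount', 'price']].agg(peak_to_peak)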

manufacturer,product,discount,price
company 1,pants,11,48
company 1,shirt,13,48
company 1,shoes,13,47
company 2,pants,10,57
company 2,shirt,10,42
company 2,shoes,8,38
company 3,pants,1,31
company 3,shirt,13,38
company 3,shoes,13,43


### Column-wise and Multiple Function Application

Aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate with the desired function or calling a method like mean or std. However, you may want to aggregate using a different function depending on the column, or multiple functions at once. Fortunately, this is very easy to do.

In [16]:
import seaborn as sns

restaurant = sns.load_dataset('tips')

In [17]:
restaurant['tip_pct'] = restaurant['tip'] / restaurant['total_bill']

In [18]:
grouped = restaurant.groupby(['day', 'smoker'])
grouped["tip_pct"].agg('mean')

day   smoker
Thur  Yes       0.163863
      No        0.160298
Fri   Yes       0.174783
      No        0.151650
Sat   Yes       0.147906
      No        0.158048
Sun   Yes       0.187250
      No        0.160113
Name: tip_pct, dtype: float64

In [19]:
grouped["tip_pct"].agg(['mean', 'std', peak_to_peak])

day,smoker,mean,std,peak_to_peak
Thur,Yes,0.163863,0.039389,0.15124
Thur,No,0.160298,0.038774,0.19335
Fri,Yes,0.174783,0.051293,0.159925
Fri,No,0.15165,0.028123,0.067349
Sat,Yes,0.147906,0.061375,0.290095
Sat,No,0.158048,0.039767,0.235193
Sun,Yes,0.18725,0.154134,0.644685
Sun,No,0.160113,0.042347,0.193226


You don't need to accept the names that GroupBy gives to the columns; notably, lambda functions have the name '<lambda>', which makes them hard to identify. If you pass a list of (name, function) tuples instead, the first element of each tuple will be used as the DataFrame column name.

In [20]:
grouped["tip_pct"].agg([('func1', 'mean'), ('func2', 'std'), ('func3', peak_to_peak)])

day,smoker,func1,func2,func3
Thur,Yes,0.163863,0.039389,0.15124
Thur,No,0.160298,0.038774,0.19335
Fri,Yes,0.174783,0.051293,0.159925
Fri,No,0.15165,0.028123,0.067349
Sat,Yes,0.147906,0.061375,0.290095
Sat,No,0.158048,0.039767,0.235193
Sun,Yes,0.18725,0.154134,0.644685
Sun,No,0.160113,0.042347,0.193226


With a DataFrame, you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the `tip_pct` and `total_bill` columns:

In [21]:
# Select several columns with a list; a bare tuple of names is rejected by recent pandas
result = grouped[["tip_pct", "total_bill"]].agg(['mean', 'std', peak_to_peak])
result

,,tip_pct,tip_pct,tip_pct,total_bill,total_bill,total_bill
day,smoker,mean,std,peak_to_peak,mean,std,peak_to_peak
Thur,Yes,0.163863,0.039389,0.15124,19.190588,8.355149,32.77
Thur,No,0.160298,0.038774,0.19335,17.113111,7.721728,33.68
Fri,Yes,0.174783,0.051293,0.159925,16.813333,9.086388,34.42
Fri,No,0.15165,0.028123,0.067349,18.42,5.059282,10.29
Sat,Yes,0.147906,0.061375,0.290095,21.276667,10.069138,47.74
Sat,No,0.158048,0.039767,0.235193,19.661778,8.939181,41.08
Sun,Yes,0.18725,0.154134,0.644685,24.12,10.442511,38.1
Sun,No,0.160113,0.042347,0.193226,20.506667,8.130189,39.4


As you can see, the resulting DataFrame has hierarchical columns, the same as you would get by aggregating each column separately and using concat to glue the results together with the column names as the keys argument:

In [22]:
result["tip_pct"]

day,smoker,mean,std,peak_to_peak
Thur,Yes,0.163863,0.039389,0.15124
Thur,No,0.160298,0.038774,0.19335
Fri,Yes,0.174783,0.051293,0.159925
Fri,No,0.15165,0.028123,0.067349
Sat,Yes,0.147906,0.061375,0.290095
Sat,No,0.158048,0.039767,0.235193
Sun,Yes,0.18725,0.154134,0.644685
Sun,No,0.160113,0.042347,0.193226


In [23]:
result["total_bill"]

day,smoker,mean,std,peak_to_peak
Thur,Yes,19.190588,8.355149,32.77
Thur,No,17.113111,7.721728,33.68
Fri,Yes,16.813333,9.086388,34.42
Fri,No,18.42,5.059282,10.29
Sat,Yes,21.276667,10.069138,47.74
Sat,No,19.661778,8.939181,41.08
Sun,Yes,24.12,10.442511,38.1
Sun,No,20.506667,8.130189,39.4
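The equivalence with `concat` described earlier can be sketched as follows. The small `tips` frame and the choice of `mean` and `max` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the restaurant tips data.
tips = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun', 'Sun'],
    'tip_pct': [0.10, 0.20, 0.15, 0.25],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
})
grouped = tips.groupby('day')
functions = ['mean', 'max']
columns = ['tip_pct', 'total_bill']

# Aggregate both columns in one call: the result has hierarchical columns.
all_at_once = grouped[columns].agg(functions)

# Aggregate each column separately, then glue the pieces side by side,
# using the column names as the outer level of the column index.
pieces = [grouped[column].agg(functions) for column in columns]
glued = pd.concat(pieces, axis=1, keys=columns)

print(all_at_once)
print(glued)
```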


Now, suppose you want to apply potentially different functions to one or more of the columns. To do this, pass a dict to agg that contains a mapping of column names to any of the function specifications listed so far:

In [24]:
grouped.agg({ 'tip': np.max, 'size': 'sum', 'tip_pct': ['min', 'max', 'mean', 'std'] })

,,tip,size,tip_pct,tip_pct,tip_pct,tip_pct
day,smoker,amax,sum,min,max,mean,std
Thur,Yes,5.0,40,0.090014,0.241255,0.163863,0.039389
Thur,No,6.7,112,0.072961,0.266312,0.160298,0.038774
Fri,Yes,4.73,31,0.103555,0.26348,0.174783,0.051293
Fri,No,3.5,9,0.120385,0.187735,0.15165,0.028123
Sat,Yes,10.0,104,0.035638,0.325733,0.147906,0.061375
Sat,No,9.0,115,0.056797,0.29199,0.158048,0.039767
Sun,Yes,6.5,49,0.06566,0.710345,0.18725,0.154134
Sun,No,6.0,167,0.059447,0.252672,0.160113,0.042347


In [25]:
restaurant.groupby(['day', 'smoker'], as_index=False).agg({ 'tip': np.max, 'size': 'sum', 'tip_pct': ['min', 'max', 'mean', 'std'] })

,day,smoker,tip,size,tip_pct,tip_pct,tip_pct,tip_pct
,,,amax,sum,min,max,mean,std
0,Thur,Yes,5.0,40,0.090014,0.241255,0.163863,0.039389
1,Thur,No,6.7,112,0.072961,0.266312,0.160298,0.038774
2,Fri,Yes,4.73,31,0.103555,0.26348,0.174783,0.051293
3,Fri,No,3.5,9,0.120385,0.187735,0.15165,0.028123
4,Sat,Yes,10.0,104,0.035638,0.325733,0.147906,0.061375
5,Sat,No,9.0,115,0.056797,0.29199,0.158048,0.039767
6,Sun,Yes,6.5,49,0.06566,0.710345,0.18725,0.154134
7,Sun,No,6.0,167,0.059447,0.252672,0.160113,0.042347
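Passing `as_index=False`, as in the last cell, keeps the group keys as ordinary columns instead of moving them into the index. A rough sketch on hypothetical data shows that calling `reset_index()` on the indexed result gives the same layout.

```python
import pandas as pd

# Hypothetical stand-in for the restaurant tips data.
tips = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sun'],
    'smoker': ['Yes', 'No', 'No'],
    'tip': [1.0, 2.0, 3.0],
})

# Group keys stay as regular columns.
flat = tips.groupby(['day', 'smoker'], as_index=False)['tip'].max()

# Group keys become a hierarchical index, which reset_index turns back into columns.
indexed = tips.groupby(['day', 'smoker'])['tip'].max()
also_flat = indexed.reset_index()

print(flat)
print(also_flat)
```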


## General Split-Apply-Combine

In [26]:
def top(df, n=5, column='tip_pct'):
  # Return the n rows with the largest values in `column`, in ascending order
  return df.sort_values(by=column)[-n:]