## Unapply apply: replacing a custom aggregating function in pandas

- Goal: optimize a custom aggregating function used when grouping.
- Groupby apply can be one of the slowest operations in pandas.
- Switching to the built-in aggregation functions can give a huge performance gain.
- This is a simple, contrived example that shows how to efficiently rewrite a groupby apply that is passed a custom function.

In [1]:
import pandas as pd
import numpy as np

In [2]:
n = 1_000_000   ## 1000000 rows
n

1000000

In [3]:
df = pd.DataFrame({'group': np.random.randint(0, 1000, n),   ## 1000 groups 
                  'value': np.random.rand(n)})
df.head()

|   | group | value    |
|---|-------|----------|
| 0 | 45    | 0.386989 |
| 1 | 71    | 0.357759 |
| 2 | 958   | 0.188228 |
| 3 | 485   | 0.761128 |
| 4 | 561   | 0.613681 |


## Problem statement
For each group, calculate the sum, mean, median, and the difference between the mean and median of the __value__ column, but only for values greater than 0.5.

- This could be done easily by filtering the rows to values greater than 0.5 before grouping.
- Here we want to do it with a custom grouping function first, and then see how to unapply apply.

E.g., sum, mean and median are built-in functions. There are 15 to 20 built-in aggregation functions, including:

| __Aggregation__   | __Description__                 |
|-------------------|---------------------------------|
| count()           | Total number of items           |
| first(),  last()  | First and last item             |
| mean(),  median() | Mean and median                 |
| min(),  max()     | Minimum and maximum             |
| std(),  var()     | Standard deviation and variance |
| mad()             | Mean absolute deviation (removed in pandas 2.0) |
| prod()            | Product of all items            |
| sum()             | Sum of all items                |



In [4]:
df.groupby('group')['value'].agg(['sum', 'mean', 'median']).head()

| group | sum        | mean     | median   |
|-------|------------|----------|----------|
| 0     | 501.266413 | 0.510455 | 0.52566  |
| 1     | 525.02044  | 0.49437  | 0.500109 |
| 2     | 481.266696 | 0.498722 | 0.501171 |
| 3     | 515.323021 | 0.507208 | 0.505947 |
| 4     | 479.473274 | 0.499972 | 0.49846  |


We are forced to write a custom aggregating function because no built-in aggregation accepts a condition such as values > 0.5.

### Function Explanation
The custom aggregating function is passed each group as a DataFrame, so here `x` is a DataFrame. The function first has to keep only the values greater than 0.5.

In [5]:
def f(x):
    filt = x['value'] > .5   # group-independent operation
    high_values = x.loc[filt, 'value']   # Series of this group's values above 0.5
    sum_ = high_values.sum()  ## named sum_ to avoid shadowing the built-in sum(); the result is a scalar
    mean_ = high_values.mean()
    median_ = high_values.median()
    diff = mean_ - median_
    ## Returning a Series creates one new column per aggregation;
    ## the dictionary keys become the new column names.
    return (pd.Series({'sum': sum_,
                      'mean': mean_,
                      'median': median_,
                      'diff': diff}))

In [6]:
type(df.groupby('group').apply(f))

pandas.core.frame.DataFrame

In [7]:
df.groupby('group').apply(f).head()

| group | sum        | mean     | median   | diff      |
|-------|------------|----------|----------|-----------|
| 0     | 385.825433 | 0.750633 | 0.74622  | 0.004413  |
| 1     | 392.238185 | 0.73729  | 0.73758  | -0.000291 |
| 2     | 364.483918 | 0.753066 | 0.747955 | 0.005111  |
| 3     | 388.43339  | 0.751322 | 0.756156 | -0.004834 |
| 4     | 357.017243 | 0.746898 | 0.739531 | 0.007367  |


Note that this takes much longer to run than the built-in aggregations. It is very slow.

## Correct result but slow
- The custom function is run once for each group.
- The custom function is not optimized like the built-in functions.
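To make the first point concrete, here is a minimal sketch that counts how often a custom function gets invoked. The names `small`, `calls` and `counting_mean` are made up for illustration, and the data is a small random frame rather than the `df` above. The `[["value"]]` selection is an assumption added so that the sketch does not depend on how a given pandas version treats the grouping column inside `apply`.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame: 5,000 rows spread over at most 50 groups
rng = np.random.default_rng(0)
small = pd.DataFrame({"group": rng.integers(0, 50, 5_000),
                      "value": rng.random(5_000)})

calls = []  # one entry is appended per invocation of the custom function

def counting_mean(x):
    calls.append(1)            # record that Python-level code ran for this group
    return x["value"].mean()

result = small.groupby("group")[["value"]].apply(counting_mean)
n_groups = small["group"].nunique()

print(f"groups: {n_groups}, custom function calls: {len(calls)}")
```

With 1,000 groups, as in the notebook, the Python function is entered on the order of a thousand times, whereas a built-in aggregation handles all groups in a single optimized pass.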

So we are not going to rely on apply or custom-aggregate functions.

## Steps to unapply apply
1. Compute group-independent operations before the groupby. These calculations are applied to the DataFrame as a whole.
2. Create new columns in the DataFrame that contain the results of the calculations from step 1.
3. Use the built-in groupby aggregation methods. Do not use custom functions.
4. After grouping, calculate the columns that depend on the aggregation results.

## Optimize

In [8]:
# calculate new column with filter first
filt = df['value'] > 0.5

Now we want to keep only these values within each group. Use the `where()` method: it retains the values where the condition is True and puts NaN where it is False.

[Pandas where](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html)

In [9]:
df['value'].where(filt).head(8)

0         NaN
1         NaN
2         NaN
3    0.761128
4    0.613681
5         NaN
6    0.763550
7    0.593056
Name: value, dtype: float64

In [10]:
df['value_large'] = df['value'].where(filt)  ## the number of rows in df is retained
df.head(9)

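|   | group | value    | value_large |
|---|-------|----------|-------------|
| 0 | 45    | 0.386989 | NaN         |
| 1 | 71    | 0.357759 | NaN         |
| 2 | 958   | 0.188228 | NaN         |
| 3 | 485   | 0.761128 | 0.761128    |
| 4 | 561   | 0.613681 | 0.613681    |
| 5 | 500   | 0.332465 | NaN         |
| 6 | 649   | 0.76355  | 0.76355     |
| 7 | 982   | 0.593056 | 0.593056    |
| 8 | 220   | 0.159911 | NaN         |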
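
Masking with NaN works because the built-in aggregations skip NaN by default, so aggregating the masked column should give the same numbers as aggregating only the rows above 0.5. A minimal sketch on a hypothetical five-row frame (`toy`, `masked` and `filtered` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" has two values above 0.5, group "b" has one
toy = pd.DataFrame({"group": ["a", "a", "a", "b", "b"],
                    "value": [0.2, 0.6, 0.8, 0.9, 0.1]})

# Mask instead of filter: rows are kept, small values become NaN
toy["value_large"] = toy["value"].where(toy["value"] > 0.5)
masked = toy.groupby("group")["value_large"].agg(["count", "sum", "mean"])

# Reference: drop the small rows first, then aggregate the original column
filtered = toy[toy["value"] > 0.5].groupby("group")["value"].agg(["count", "sum", "mean"])

print(masked)
print(filtered)
```

One difference worth knowing: a group with no value above 0.5 still appears in the masked result (with NaN for the mean), whereas filtering the rows first drops that group entirely.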


In [11]:
## only use built-in group by methods
df2 = df.groupby('group')['value_large'].agg(['sum', 'mean', 'median'])
df2.head()

| group | sum        | mean     | median   |
|-------|------------|----------|----------|
| 0     | 385.825433 | 0.750633 | 0.74622  |
| 1     | 392.238185 | 0.73729  | 0.73758  |
| 2     | 364.483918 | 0.753066 | 0.747955 |
| 3     | 388.43339  | 0.751322 | 0.756156 |
| 4     | 357.017243 | 0.746898 | 0.739531 |


In [12]:
## finally calculate the columns that depend on aggregated result
df2["diff"] = df2['mean'] - df2['median']
df2.head()

| group | sum        | mean     | median   | diff      |
|-------|------------|----------|----------|-----------|
| 0     | 385.825433 | 0.750633 | 0.74622  | 0.004413  |
| 1     | 392.238185 | 0.73729  | 0.73758  | -0.000291 |
| 2     | 364.483918 | 0.753066 | 0.747955 | 0.005111  |
| 3     | 388.43339  | 0.751322 | 0.756156 | -0.004834 |
| 4     | 357.017243 | 0.746898 | 0.739531 | 0.007367  |


In [13]:
%%timeit 
## putting all these steps together to timeit
filt = df['value'] > 0.5
df['value_large'] = df['value'].where(filt)
df2 = df.groupby('group')['value_large'].agg(['sum', 'mean', 'median'])
df2["diff"] = df2['mean'] - df2['median']
df2.head()

131 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
%timeit df.groupby('group').apply(f)

1.23 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


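So try to do as much of the calculation as possible outside of the groupby, and rely only on the built-in aggregation functions. Here the rewrite is roughly 9 times faster (131 ms versus 1.23 s).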
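
Before trusting a rewrite like this, it helps to confirm that both approaches agree. The following is a minimal sketch on a smaller, hypothetical frame (`demo`, `slow`, `fast` and `same` are illustrative names). It re-creates a condensed version of `f` so that the block is self-contained, and it selects `[["value"]]` before `apply` as an assumption to keep behaviour consistent across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical smaller data set: 2,000 rows, at most 20 groups
rng = np.random.default_rng(42)
demo = pd.DataFrame({"group": rng.integers(0, 20, 2_000),
                     "value": rng.random(2_000)})

def f(x):
    # Condensed version of the custom aggregating function
    high_values = x.loc[x["value"] > 0.5, "value"]
    mean_, median_ = high_values.mean(), high_values.median()
    return pd.Series({"sum": high_values.sum(), "mean": mean_,
                      "median": median_, "diff": mean_ - median_})

# Slow path: custom function through apply
slow = demo.groupby("group")[["value"]].apply(f)

# Fast path: mask first, aggregate with built-ins, derive diff afterwards
demo["value_large"] = demo["value"].where(demo["value"] > 0.5)
fast = demo.groupby("group")["value_large"].agg(["sum", "mean", "median"])
fast["diff"] = fast["mean"] - fast["median"]

# Compare the numbers column by column, treating NaN as equal to NaN
cols = ["sum", "mean", "median", "diff"]
same = np.allclose(slow[cols].to_numpy(), fast[cols].to_numpy(), equal_nan=True)
print("results match:", same)
```

The comparison uses `np.allclose` rather than exact equality because the two paths may sum floating-point values in a different order.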

## References

-  [Python Data Science handbook - aggregation-and-grouping](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)