# Apply vs. Agg
_Notebook prepared by: Jessa Rili-Migriño ([LinkedIn](https://www.linkedin.com/in/jessa-rili-migrino))_

The objective for this notebook is to discern the difference between `.apply()` and `.agg()` through examples.
Comparisons will be made in terms of performance, implementation complexity, and flexibility.

# TL;DR
| Aspect              | Apply            | Agg              |
| ------------------- | ---------------- | ---------------- |
| Performance<br>(Speed and Optimization) | ❌Generally slower | ✅Generally faster |
| Application of multiple functions per column | ✅Allowed only on DataFrame,<br>❌Not allowed on GroupBy | ✅Allowed |
| Function return value | ✅Accepts functions with return types of any shape (scalar, DataFrame, Series) | ❌Only functions that return scalar values are allowed |
| Custom logic operating on multiple columns | ✅Intuitive implementation<br>❌But slow | ❌Relatively complex implementation<br>✅But faster | 

First, do the necessary imports and define a helper function `range()` that we can pass to both `agg` and `apply`. Note that this name shadows Python's built-in `range` for the rest of the notebook.

In [160]:
import pandas as pd

In [161]:
def range(column):
    return column.max() - column.min()
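As a quick sanity check, the helper can be called directly on a single column. The sketch below uses a hypothetical name, `value_range`, so that it does not shadow the built-in `range`; the small `quantities` Series is made up for illustration only.

```python
import pandas as pd

def value_range(column):
    """Spread of a column: largest value minus smallest value."""
    return column.max() - column.min()

# Illustrative data, not the notebook's full dataset
quantities = pd.Series([7, 4, 6], name="quantity")

print(value_range(quantities))  # 7 - 4 = 3
```

The behavior is the same as the notebook's `range()`; only the name differs.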

# When applied to a DataFrame
Docs:
* [pandas.DataFrame.agg()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html)
* [pandas.DataFrame.apply()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)

First let's define a `sales` dataframe that will be used for examples:

In [None]:
sales = pd.DataFrame([
    # Health and beauty
    {"date": "2018-01-15", "product_line": "Health and beauty", "product": "Shampoo", "unit_price": 6.99, "quantity": 7},
    {"date": "2018-01-18", "product_line": "Health and beauty", "product": "Conditioner", "unit_price": 8.99, "quantity": 4},
    {"date": "2018-01-20", "product_line": "Health and beauty", "product": "Body Wash", "unit_price": 9.50, "quantity": 6},

    # Electronic accessories
    {"date": "2018-01-15", "product_line": "Electronic accessories", "product": "Headphones", "unit_price": 25.28, "quantity": 5},
    {"date": "2018-01-19", "product_line": "Electronic accessories", "product": "Charger", "unit_price": 15.75, "quantity": 8},

    # Home and lifestyle
    {"date": "2018-01-16", "product_line": "Home and lifestyle", "product": "Lamp", "unit_price": 46.33, "quantity": 3},
    {"date": "2018-01-19", "product_line": "Home and lifestyle", "product": "Curtains", "unit_price": 22.00, "quantity": 5},

    # Sports
    {"date": "2018-01-16", "product_line": "Sports", "product": "Yoga mat", "unit_price": 39.99, "quantity": 5},
    {"date": "2018-01-18", "product_line": "Sports", "product": "Dumbbells", "unit_price": 30.00, "quantity": 4},

    # Food and beverages
    {"date": "2018-01-17", "product_line": "Food and beverages", "product": "Milk", "unit_price": 5.99, "quantity": 8},
    {"date": "2018-01-20", "product_line": "Food and beverages", "product": "Bread", "unit_price": 3.49, "quantity": 10}
])

sales

Unnamed: 0,date,product_line,product,unit_price,quantity
0,2018-01-15,Health and beauty,Shampoo,6.99,7
1,2018-01-18,Health and beauty,Conditioner,8.99,4
2,2018-01-20,Health and beauty,Body Wash,9.5,6
3,2018-01-15,Electronic accessories,Headphones,25.28,5
4,2018-01-19,Electronic accessories,Charger,15.75,8
5,2018-01-16,Home and lifestyle,Lamp,46.33,3
6,2018-01-19,Home and lifestyle,Curtains,22.0,5
7,2018-01-16,Sports,Yoga mat,39.99,5
8,2018-01-18,Sports,Dumbbells,30.0,4
9,2018-01-17,Food and beverages,Milk,5.99,8
10,2018-01-20,Food and beverages,Bread,3.49,10


## Single Function
Let's first try passing the `range` function to both `apply` and `agg`:

In [163]:
df = sales[['unit_price', 'quantity']].apply(range)

print(type(df))
df

<class 'pandas.core.series.Series'>


unit_price    42.84
quantity       7.00
dtype: float64

In [164]:
df = sales[['unit_price', 'quantity']].agg(range)

print(type(df))
df

<class 'pandas.core.series.Series'>


unit_price    42.84
quantity       7.00
dtype: float64

When applied directly to a dataframe's columns, the output is the same. But what about speed? Let's use `%timeit` magic below:

In [165]:
%timeit sales[['unit_price', 'quantity']].apply(range)

750 μs ± 14.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [166]:
%timeit sales[['unit_price', 'quantity']].agg(range)

740 μs ± 12.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


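We see that `agg()` ran marginally faster than `apply()` here (740 μs vs. 750 μs), although the gap is within the reported standard deviation, so the two are effectively tied for a single function on a DataFrame.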
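Outside a notebook, where the `%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. This is only a rough sketch: the small DataFrame, the helper name `value_range`, and the call counts are assumptions chosen for illustration, and the timings will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Illustrative subset of the sales data
df = pd.DataFrame({
    "unit_price": [6.99, 8.99, 9.50, 25.28],
    "quantity": [7, 4, 6, 5],
})

def value_range(column):
    return column.max() - column.min()

# Both calls should produce the same Series
apply_result = df.apply(value_range)
agg_result = df.agg(value_range)

# timeit.repeat returns the total seconds for each run of `number` calls;
# take the best run and divide to get seconds per call.
n_calls = 100
apply_time = min(timeit.repeat(lambda: df.apply(value_range), number=n_calls, repeat=3)) / n_calls
agg_time = min(timeit.repeat(lambda: df.agg(value_range), number=n_calls, repeat=3)) / n_calls

print(f"apply: {apply_time * 1e6:.0f} µs per call")
print(f"agg:   {agg_time * 1e6:.0f} µs per call")
```

On data this small, most of the measured time is fixed pandas overhead, so small differences between the two should be read with caution.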

## Multiple Functions

How about using multiple callable functions?
The [pandas.DataFrame.agg](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html#pandas-dataframe-agg) docs explicitly state that we can input a list of callable functions to do this:

In [167]:
df = sales[['unit_price', 'quantity']].agg([range, 'min', 'max'])

print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,unit_price,quantity
range,42.84,7
min,3.49,3
max,46.33,10


According to the [pandas.DataFrame.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas-dataframe-apply) docs (pandas 2.2), `apply()` can only accept a single function.
Let's try:

In [168]:
df = sales[['unit_price', 'quantity']].apply([range, 'min', 'max'])

print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,unit_price,quantity
range,42.84,7
min,3.49,3
max,46.33,10


Using multiple functions worked with `apply` (as of pandas 2.2.3) and has the same output as `agg`.
How about processing time?

In [178]:
%timeit sales[['unit_price', 'quantity']].apply([range, 'min', 'max'])

2.12 ms ± 257 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [179]:
%timeit sales[['unit_price', 'quantity']].agg([range, 'min', 'max'])

1.87 ms ± 7.58 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


We see that, with multiple functions on a DataFrame, `agg()` ran slightly faster than `apply()` (1.87 ms vs. 2.12 ms) while producing the same output.
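Besides a flat list, `DataFrame.agg` also accepts a dict that maps each column to its own list of functions. The sketch below, on a small illustrative DataFrame, shows what that looks like; function/column combinations that were not requested come back as `NaN`.

```python
import pandas as pd

# Illustrative subset of the sales data
df = pd.DataFrame({
    "unit_price": [6.99, 8.99, 9.50, 3.49],
    "quantity": [7, 4, 6, 10],
})

# A different set of functions for each column
summary = df.agg({
    "unit_price": ["min", "max"],
    "quantity": ["max", "sum"],
})

print(summary)
```

This can be handy when a statistic only makes sense for some columns, such as summing quantities but not unit prices.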

# When applied to a GroupBy object

Docs:
* [pandas.groupby.DataFrameGroupBy.apply](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.apply.html#pandas-core-groupby-dataframegroupby-apply)
* [pandas.groupby.DataFrameGroupBy.agg](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html#pandas-core-groupby-dataframegroupby-agg)

Let's try using `apply` and `agg` on GroupBy objects next.

## Single Function

In [171]:
df = sales.groupby(by='product_line')[['unit_price', 'quantity']].apply('sum')

print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,unit_price,quantity
product_line,Unnamed: 1_level_1,Unnamed: 2_level_1
Electronic accessories,41.03,13
Food and beverages,9.48,18
Health and beauty,25.48,17
Home and lifestyle,68.33,8
Sports,69.99,9


In [172]:
df = sales.groupby(by='product_line')[['unit_price', 'quantity']].agg('sum')

print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,unit_price,quantity
product_line,Unnamed: 1_level_1,Unnamed: 2_level_1
Electronic accessories,41.03,13
Food and beverages,9.48,18
Health and beauty,25.48,17
Home and lifestyle,68.33,8
Sports,69.99,9


Checking the processing time again:

In [182]:
%timeit df = sales.groupby(by='product_line')[['unit_price', 'quantity']].apply('sum')

811 μs ± 28.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [183]:
%timeit df = sales.groupby(by='product_line')[['unit_price', 'quantity']].agg('sum')

863 μs ± 9.86 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


We see that, on a GroupBy object, `apply` runs slightly faster than `agg` when applying a single function (811 μs vs. 863 μs).

## Multiple Functions

If we try to use multiple functions with `apply()` on a GroupBy object,
we see that it results in an error because `apply()` does not accept a list of functions there.

In [175]:
sales.groupby(by='product_line')[['unit_price', 'quantity']].apply([range, 'min', 'max', 'sum'])

TypeError: unhashable type: 'list'
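If several statistics per group are still wanted through `apply`, one possible workaround is to pass a single function that itself calls `DataFrame.agg` on each group. This is a minimal sketch on a small illustrative DataFrame (`sales_small` is a made-up name), not a recommendation over plain `agg`.

```python
import pandas as pd

# Illustrative subset of the sales data
sales_small = pd.DataFrame({
    "product_line": ["Sports", "Sports", "Food and beverages", "Food and beverages"],
    "unit_price": [39.99, 30.00, 5.99, 3.49],
    "quantity": [5, 4, 8, 10],
})

# apply() receives each group as a DataFrame; the lambda returns a small
# DataFrame of statistics, and pandas stacks these under the group key.
per_group = (
    sales_small.groupby("product_line")[["unit_price", "quantity"]]
    .apply(lambda group: group.agg(["min", "max", "sum"]))
)

print(per_group)
```

The result is indexed by `(product_line, statistic)` rows rather than by multi-level columns, so its layout differs from what `agg` returns.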

Meanwhile, `agg` is able to produce a multi-level dataframe output like so:

In [184]:
df = sales.groupby(by='product_line')[['unit_price', 'quantity']].agg([range, 'min', 'max', 'sum'])

print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0_level_0,unit_price,unit_price,unit_price,unit_price,quantity,quantity,quantity,quantity
Unnamed: 0_level_1,range,min,max,sum,range,min,max,sum
product_line,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Electronic accessories,9.53,15.75,25.28,41.03,3,5,8,13
Food and beverages,2.5,3.49,5.99,9.48,2,8,10,18
Health and beauty,2.51,6.99,9.5,25.48,3,4,7,17
Home and lifestyle,24.33,22.0,46.33,68.33,2,3,5,8
Sports,9.99,30.0,39.99,69.99,1,4,5,9


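Note that the output has two-level column headers: the first level is the aggregated column and the second level is the aggregation function. Because each `product_line` contains two or three rows, the `range`, `min`, `max`, and `sum` values differ within each group.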
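When flat column names are preferred over multi-level headers, `GroupBy.agg` also supports named aggregation, where each keyword argument is an output column given as a `(column, function)` pair. The sketch below uses a small illustrative DataFrame and a helper with the hypothetical name `value_range`.

```python
import pandas as pd

# Illustrative subset of the sales data
sales_small = pd.DataFrame({
    "product_line": ["Sports", "Sports", "Food and beverages", "Food and beverages"],
    "unit_price": [39.99, 30.00, 5.99, 3.49],
    "quantity": [5, 4, 8, 10],
})

def value_range(column):
    return column.max() - column.min()

# Each keyword names an output column: (input column, function)
flat = sales_small.groupby("product_line").agg(
    price_range=("unit_price", value_range),
    min_price=("unit_price", "min"),
    total_quantity=("quantity", "sum"),
)

print(flat)
```

The output columns appear in the order the keywords were given, which makes the result easier to reference downstream than a MultiIndex.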

## Custom Group Summary
Let's say we need a custom group summary that contains the following information:
* a cross-column summary, like `sales` = `unit_price` * `quantity`
* a summary that depends on a cross-column summary, like the `product` with the biggest `sales`

In [None]:
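def custom_summary(group):
    # Compute per-row sales in a local Series instead of adding a column,
    # so the group passed in by groupby is not modified
    sales_per_row = group['unit_price'] * group['quantity']
    total_sales = sales_per_row.sum()
    # Label of the row with the highest sales, used to look up its product
    top_product = group.loc[sales_per_row.idxmax(), 'product']
    return pd.Series({
        'total_sales': total_sales,
        'best_seller': top_product
    })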


Using `apply` is intuitive and easy here:

In [None]:
df = sales.groupby(by='product_line')[['unit_price', 'quantity', 'product']].apply(custom_summary)

print(type(df))
df

However, if we try to pass `custom_summary` into `agg`, we encounter an error. This is because `agg`:
* can only apply functions column-by-column (no cross-column logic)
* can only return scalar values per column
* can't create new columns from cross-column logic

In [None]:
df = sales.groupby(by='product_line')[['unit_price', 'quantity', 'product']].agg(custom_summary)

print(type(df))
df
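The column-by-column behavior can be made visible with a small "spy" function that records the type of object it receives from `agg`. This is a sketch on illustrative data; `spy` and `received` are hypothetical names used only here.

```python
import pandas as pd

# Illustrative subset of the sales data
sales_small = pd.DataFrame({
    "product_line": ["Sports", "Sports", "Food and beverages", "Food and beverages"],
    "unit_price": [39.99, 30.00, 5.99, 3.49],
    "quantity": [5, 4, 8, 10],
})

received = []

def spy(values):
    # Record what kind of object agg hands over, then return a scalar
    received.append(type(values).__name__)
    return values.max()

result = sales_small.groupby("product_line")[["unit_price", "quantity"]].agg(spy)

print(set(received))
print(result)
```

Since the function only ever sees one column of one group at a time, it has no way to combine `unit_price` with `quantity`, which is why `custom_summary` cannot work inside `agg`.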

We can achieve the same output, however, by performing a separate `groupby().agg()` to get the `total_sales`, and a `groupby().idxmax()` to get the `best_seller` product, like so:

In [None]:
sales_new = sales.copy()
sales_new['total_sales'] = sales_new['unit_price'] * sales_new['quantity']

# Get the total sales for each product_line
df = sales_new.groupby(by='product_line')[['total_sales']].agg('sum')

# Get the row with max sales for each product_line
idx = sales_new.groupby('product_line')['total_sales'].idxmax()
df['best_seller'] = sales_new.loc[idx, 'product'].values

print(type(df))
df
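A possible variation, sketched below on illustrative data, folds the two `groupby` calls into a single named aggregation that computes both the sum and the `idxmax` row label, and then looks up the product name. The names `sales_small` and `best_row` are assumptions made for this example.

```python
import pandas as pd

# Illustrative subset of the sales data
sales_small = pd.DataFrame({
    "product_line": ["Health and beauty"] * 3 + ["Sports"] * 2,
    "product": ["Shampoo", "Conditioner", "Body Wash", "Yoga mat", "Dumbbells"],
    "unit_price": [6.99, 8.99, 9.50, 39.99, 30.00],
    "quantity": [7, 4, 6, 5, 4],
})
sales_small["total_sales"] = sales_small["unit_price"] * sales_small["quantity"]

# One pass: total sales per group plus the row label of the best-selling row
summary = sales_small.groupby("product_line").agg(
    total_sales=("total_sales", "sum"),
    best_row=("total_sales", "idxmax"),
)

# Translate the row labels into product names, then drop the helper column
summary["best_seller"] = sales_small.loc[summary["best_row"], "product"].values
summary = summary.drop(columns="best_row")

print(summary)
```

Whether this is any faster than the two-step version is not measured here; the main difference is that the grouping is written only once.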

Let's compare the processing time of these 2 implementations:

In [None]:
%%timeit
df = sales.groupby(by='product_line')[['unit_price', 'quantity', 'product']].apply(custom_summary)

In [None]:
%%timeit
sales_new = sales.copy()
sales_new['total_sales'] = sales_new['unit_price'] * sales_new['quantity']

# Get the total sales for each product_line
df = sales_new.groupby(by='product_line')[['total_sales']].agg('sum')

# Get the row with max sales for each product_line
idx = sales_new.groupby('product_line')['total_sales'].idxmax()
df['best_seller'] = sales_new.loc[idx, 'product'].values

Amazingly, the `agg` and `idxmax` implementation performed faster than `apply` to return the custom summary data.
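With only 11 rows, timings are dominated by fixed overhead, so it can be worth repeating the comparison on a larger table. The sketch below builds a synthetic dataset (the sizes, the random seed, and the function names `with_apply` and `with_agg` are all assumptions for illustration) and times both approaches with the standard-library `timeit` module. No timings are quoted here because they depend on the machine and pandas version.

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic data: 2,000 rows spread over 50 groups (arbitrary sizes)
rng = np.random.default_rng(0)
n_rows, n_groups = 2_000, 50
big = pd.DataFrame({
    "product_line": rng.integers(0, n_groups, n_rows).astype(str),
    "product": np.arange(n_rows).astype(str),
    "unit_price": rng.uniform(1, 100, n_rows).round(2),
    "quantity": rng.integers(1, 11, n_rows),
})

def with_apply(df):
    def summarize(group):
        sales = group["unit_price"] * group["quantity"]
        return pd.Series({
            "total_sales": sales.sum(),
            "best_seller": group.loc[sales.idxmax(), "product"],
        })
    return df.groupby("product_line")[["unit_price", "quantity", "product"]].apply(summarize)

def with_agg(df):
    tmp = df.assign(total_sales=df["unit_price"] * df["quantity"])
    out = tmp.groupby("product_line")[["total_sales"]].agg("sum")
    idx = tmp.groupby("product_line")["total_sales"].idxmax()
    out["best_seller"] = tmp.loc[idx, "product"].values
    return out

# Average seconds per call over a few calls
n_calls = 3
apply_time = timeit.timeit(lambda: with_apply(big), number=n_calls) / n_calls
agg_time = timeit.timeit(lambda: with_agg(big), number=n_calls) / n_calls

print(f"apply:        {apply_time * 1e3:.2f} ms per call")
print(f"agg + idxmax: {agg_time * 1e3:.2f} ms per call")
```

Increasing `n_groups` is a simple way to see how each approach scales, since `apply` calls the Python function once per group.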

# Summary
Summarizing the results at this point,

| Application                       | Apply  | Agg     |
| --------------------------------- | ------ | ------- |
| Single function on a DataFrame    | ❌Marginally slower (within measurement noise) | ✅Marginally faster (within measurement noise) |
| Multiple functions on a DataFrame | Allowed/possible (as of Pandas 2.2.3)<br>❌But slower | ✅Faster  |
| Single function on a GroupBy object | ✅Slightly faster | ❌Slightly slower  |
| Multiple functions on a GroupBy object | ❌Not allowed/possible<br>(as of Pandas 2.2.3) | ✅Allowed  |
| Custom cross-column summary statistics | ✅Intuitive<br>❌But slow | ❌Non-intuitive: Needs slightly complex work-around<br>✅But faster |

# Conclusion

In general, we conclude the following:

| Aspect              | Apply            | Agg              |
| ------------------- | ---------------- | ---------------- |
| Performance<br>(Speed and Optimization) | ❌Generally slower | ✅Generally faster |
| Application of multiple functions per column | ✅Allowed only on DataFrame,<br>❌Not allowed on GroupBy | ✅Allowed |
| Function return value | ✅Accepts functions with return types of any shape (scalar, DataFrame, Series) | ❌Only functions that return scalar values are allowed |
| Custom logic operating on multiple columns | ✅Intuitive implementation<br>❌But slow | ❌Relatively complex implementation<br>✅But faster | 


# Credits
_Notebook prepared by: Jessa Rili-Migriño ([LinkedIn](https://www.linkedin.com/in/jessa-rili-migrino))_

* DataCamp, for the [Data Manipulation Practice Problem](https://practice.datacamp.com/p/513) that brought on the questions that needed to be answered through this notebook