In [1]:
import pandas as pd

# Investigating `transform()`

Last time, we saw that `transform()` is kind of like `apply()` but for groups. Let's expand on that a bit.

For this investigation, let's go back to the camping data. As a refresher, here's what that data looks like.

In [2]:
camping = pd.read_csv('camping.csv', index_col=0)
camping

| Item | Category | Quantity | UnitWeight |
|---|---|---|---|
| Pack | Pack | 1 | 33.0 |
| Tent | Shelter | 1 | 80.0 |
| Sleeping Pad | Sleep | 0 | 27.0 |
| Sleeping Bag | Sleep | 2 | 20.0 |
| Toothbrush/Toothpaste | Health | 1 | 2.0 |
| Sunscreen | Health | 1 | 5.0 |
| Medical Kit | Health | 1 | 3.7 |
| Spoon | Kitchen | 5 | 0.7 |
| Stove | Kitchen | 1 | 20.0 |
| Water Filter | Kitchen | 2 | 1.8 |


First off, keep one distinction in mind: in this investigation we call `apply()` on both DataFrames and GroupBy objects, but we call `transform()` _only_ on GroupBy objects. (pandas does also define `transform()` on DataFrames and Series, but there it has no groups to work with, so it isn't our focus here.)

## Simple function

To understand the nuances of how `transform()` works, let's first look at how the simple `sum` function operates on _(1)_ one column, _(2)_ multiple columns, and _(3)_ an entire DataFrame (or GroupBy object). For each of these, we'll build our way up to `transform()` in three steps:

_a)_ Use `apply()` on a DataFrame
  
_b)_ Use `apply()` on a GroupBy object

_c)_ Use `transform()` on a GroupBy object

To keep things as clean as possible, let's start by grouping our camping data by `Category`; this way, we can refer to the GroupBy object directly, rather than having to call the `groupby()` method every time.

In [4]:
categories = camping.groupby('Category')
categories.groups

{'Clothing': Index(['Rain Poncho', 'Shoes', 'Hat'], dtype='object', name='Item'),
 'Health': Index(['Toothbrush/Toothpaste', 'Sunscreen', 'Medical Kit'], dtype='object', name='Item'),
 'Kitchen': Index(['Spoon', 'Stove', 'Water Filter', 'Water Bottles'], dtype='object', name='Item'),
 'Pack': Index(['Pack'], dtype='object', name='Item'),
 'Shelter': Index(['Tent'], dtype='object', name='Item'),
 'Sleep': Index(['Sleeping Pad', 'Sleeping Bag'], dtype='object', name='Item'),
 'Utility': Index(['Pack Liner', 'Stuff Sack', 'Trekking Poles'], dtype='object', name='Item')}

### 1) Operating on a single column

For each of the following queries, we'll compare the results of steps _a_, _b_, and _c_ on a single column (`Quantity`).

In [5]:
# 1a) apply() with a single DataFrame column (a Series)
camping['Quantity'].apply('sum')

26

In [6]:
# 1b) apply() with GroupBy object
categories['Quantity'].apply(sum)

Category
Clothing     4
Health       3
Kitchen     10
Pack         1
Shelter      1
Sleep        2
Utility      5
Name: Quantity, dtype: int64

In [7]:
# 1c) transform() with GroupBy object
categories['Quantity'].transform('sum')

Item
Pack                      1
Tent                      1
Sleeping Pad              2
Sleeping Bag              2
Toothbrush/Toothpaste     3
Sunscreen                 3
Medical Kit               3
Spoon                    10
Stove                    10
Water Filter             10
Water Bottles            10
Pack Liner                5
Stuff Sack                5
Trekking Poles            5
Rain Poncho               4
Shoes                     4
Hat                       4
Name: Quantity, dtype: int64
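To make the difference in output shape concrete, here is a minimal sketch on a five-row subset of the camping data (just the `Sleep` and `Health` rows shown earlier). The names `subset`, `per_group`, and `per_row` are made up for this illustration, and `.sum()` stands in for the `apply(sum)` call from step 1b.

```python
import pandas as pd

# Five rows copied from the camping table above (Sleep and Health only).
subset = pd.DataFrame(
    {
        "Category": ["Sleep", "Sleep", "Health", "Health", "Health"],
        "Quantity": [0, 2, 1, 1, 1],
    },
    index=pd.Index(
        ["Sleeping Pad", "Sleeping Bag", "Toothbrush/Toothpaste",
         "Sunscreen", "Medical Kit"],
        name="Item",
    ),
)
groups = subset.groupby("Category")["Quantity"]

per_group = groups.sum()           # one value per category (like step 1b)
per_row = groups.transform("sum")  # one value per original row (like step 1c)

print(len(per_group), len(per_row))
print(per_row)
```

The aggregated result is indexed by `Category`, while the transformed result keeps the original `Item` index, which is what lets us line it up against the original rows later on.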

### 2) Operating on multiple columns

For each of the following queries, we'll compare the results of steps _a_, _b_, and _c_ on two columns (`Quantity` and `UnitWeight`).

In [8]:
# 2a) apply() with DataFrame
camping[['Quantity', 'UnitWeight']].apply('sum')

Quantity       26.0
UnitWeight    256.7
dtype: float64

In [13]:
# 2b) apply() with GroupBy object
categories[['Quantity', 'UnitWeight']].apply(sum)

| Category | Quantity | UnitWeight |
|---|---|---|
| Clothing | 4.0 | 20.5 |
| Health | 3.0 | 10.7 |
| Kitchen | 10.0 | 47.5 |
| Pack | 1.0 | 33.0 |
| Shelter | 1.0 | 80.0 |
| Sleep | 2.0 | 47.0 |
| Utility | 5.0 | 18.0 |


In [14]:
# 2c) transform() with GroupBy object
categories[['Quantity', 'UnitWeight']].transform(sum)

| Item | Quantity | UnitWeight |
|---|---|---|
| Pack | 1 | 33.0 |
| Tent | 1 | 80.0 |
| Sleeping Pad | 2 | 47.0 |
| Sleeping Bag | 2 | 47.0 |
| Toothbrush/Toothpaste | 3 | 10.7 |
| Sunscreen | 3 | 10.7 |
| Medical Kit | 3 | 10.7 |
| Spoon | 10 | 47.5 |
| Stove | 10 | 47.5 |
| Water Filter | 10 | 47.5 |


### 3) Operating on an entire DataFrame OR GroupBy object

For each of the following queries, we'll compare the results of steps _a_, _b_, and _c_ on an entire DataFrame/GroupBy object. In other words, we'll see what happens when we don't first query out specific columns.

In [15]:
# 3a) apply() with DataFrame
camping.apply('sum')

Category      PackShelterSleepSleepHealthHealthHealthKitchen...
Quantity                                                     26
UnitWeight                                                256.7
dtype: object

In [16]:
# 3b) apply() with GroupBy object
categories.apply(sum)

| Category | Category | Quantity | UnitWeight |
|---|---|---|---|
| Clothing | ClothingClothingClothing | 4 | 20.5 |
| Health | HealthHealthHealth | 3 | 10.7 |
| Kitchen | KitchenKitchenKitchenKitchen | 10 | 47.5 |
| Pack | Pack | 1 | 33.0 |
| Shelter | Shelter | 1 | 80.0 |
| Sleep | SleepSleep | 2 | 47.0 |
| Utility | UtilityUtilityUtility | 5 | 18.0 |


In [17]:
# 3c) transform() with GroupBy object
categories.transform('sum')

| Item | Quantity | UnitWeight |
|---|---|---|
| Pack | 1 | 33.0 |
| Tent | 1 | 80.0 |
| Sleeping Pad | 2 | 47.0 |
| Sleeping Bag | 2 | 47.0 |
| Toothbrush/Toothpaste | 3 | 10.7 |
| Sunscreen | 3 | 10.7 |
| Medical Kit | 3 | 10.7 |
| Spoon | 10 | 47.5 |
| Stove | 10 | 47.5 |
| Water Filter | 10 | 47.5 |


## More complex function

So it looks like `transform()` is pretty much equivalent to doing an `apply()` by group, and then just copying each group's result to every row in the group. But it's just a _little_ bit more nuanced than that.
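As a quick check of that "compute per group, then copy to every row" picture, the sketch below builds the copied-back column by hand with `map()` and compares it to `transform('sum')`. It again uses the five `Sleep` and `Health` rows from the camping table; the variable names are only for illustration.

```python
import pandas as pd

# Five rows copied from the camping table above (Sleep and Health only).
subset = pd.DataFrame(
    {
        "Category": ["Sleep", "Sleep", "Health", "Health", "Health"],
        "Quantity": [0, 2, 1, 1, 1],
    },
    index=pd.Index(
        ["Sleeping Pad", "Sleeping Bag", "Toothbrush/Toothpaste",
         "Sunscreen", "Medical Kit"],
        name="Item",
    ),
)

# Step 1: one total per category.
group_sums = subset.groupby("Category")["Quantity"].sum()

# Step 2: copy each category's total onto every row of that category.
broadcast = subset["Category"].map(group_sums)

# The same thing in a single call.
transformed = subset.groupby("Category")["Quantity"].transform("sum")

print((broadcast == transformed).all())
```

For a simple aggregation like `sum`, the two routes agree row for row.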

To reveal that nuance, let's repeat _a_, _b_, and _c_ from the previous section once more, but this time we'll use our slightly more complex `pct` function that we wrote in class.

Recall that the `pct` function looks like this. Given a column `x`, the function finds the sum of the entire column, and then uses that sum to calculate the percentage of each individual value within the column.

In [18]:
def pct(x):
    return x / sum(x) * 100
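To see `pct` at work on its own, here is a small sketch that feeds it the two `Sleep` unit weights from the camping table (27.0 and 20.0). The Series name `weights` is just for this example.

```python
import pandas as pd

def pct(x):
    return x / sum(x) * 100

# The UnitWeight values of the two Sleep items from the camping table.
weights = pd.Series([27.0, 20.0], index=["Sleeping Pad", "Sleeping Bag"])

shares = pct(weights)
print(shares.round(2))
```

Each value is divided by the column total (47.0 here), so the resulting percentages add up to 100.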

Now, let's use this function to complete operations _a_, _b_, and _c_ on multiple columns.

In [20]:
# a) apply() on a DataFrame
camping[['Quantity', 'UnitWeight']].apply(pct)

| Item | Quantity | UnitWeight |
|---|---|---|
| Pack | 3.846154 | 12.855473 |
| Tent | 3.846154 | 31.164784 |
| Sleeping Pad | 0.0 | 10.518115 |
| Sleeping Bag | 7.692308 | 7.791196 |
| Toothbrush/Toothpaste | 3.846154 | 0.77912 |
| Sunscreen | 3.846154 | 1.947799 |
| Medical Kit | 3.846154 | 1.441371 |
| Spoon | 19.230769 | 0.272692 |
| Stove | 3.846154 | 7.791196 |
| Water Filter | 7.692308 | 0.701208 |


When we use `apply()` on a DataFrame, it passes each column, in its entirety, to the applied function. So with our `pct` function, each whole column gets passed in as the value of `x`, and every percentage is relative to the total across all items.

In [21]:
# b) apply() on a GroupBy object
categories[['Quantity', 'UnitWeight']].apply(pct)

# ERROR: On a GroupBy object, apply() passes each group to the function as a
# whole sub-DataFrame rather than one column at a time. Inside pct, the
# built-in sum() then iterates over that DataFrame's column labels (strings)
# instead of its values, so Python ends up trying to add 0 + 'Quantity'.

TypeError: unsupported operand type(s) for +: 'int' and 'str'
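One likely way to reproduce a `TypeError` with exactly this message is to call Python's built-in `sum()` on a whole DataFrame. The sketch below uses a tiny two-row frame with the `Sleep` values from the camping table; the names `frame`, `labels`, and `column_totals` are only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"Quantity": [0, 2], "UnitWeight": [27.0, 20.0]})

# Iterating over a DataFrame yields its column labels, not its values.
labels = list(frame)

try:
    sum(frame)  # effectively 0 + "Quantity" + "UnitWeight"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# The DataFrame's own .sum() method adds up each column's values instead.
column_totals = frame.sum()

print(labels)
print(error_message)
print(column_totals)
```

By contrast, the built-in `sum()` on a single column (a Series) iterates over the values, which is why `pct` behaves as intended whenever it receives one column at a time.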

In [22]:
# c) transform() on a GroupBy object
categories[['Quantity', 'UnitWeight']].transform(pct)

| Item | Quantity | UnitWeight |
|---|---|---|
| Pack | 100.0 | 100.0 |
| Tent | 100.0 | 100.0 |
| Sleeping Pad | 0.0 | 57.446809 |
| Sleeping Bag | 100.0 | 42.553191 |
| Toothbrush/Toothpaste | 33.333333 | 18.691589 |
| Sunscreen | 33.333333 | 46.728972 |
| Medical Kit | 33.333333 | 34.579439 |
| Spoon | 50.0 | 1.473684 |
| Stove | 10.0 | 42.105263 |
| Water Filter | 20.0 | 3.789474 |


When we use `transform()` on a GroupBy object, it applies the function separately to each sub-DataFrame within the GroupBy, one column at a time. We can essentially think of it as splitting the GroupBy apart into sub-DataFrames, using `apply()` on each of those sub-DataFrames, and then reassembling the results in the original row order.
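That mental model can be sketched by hand: loop over the groups, call `apply(pct)` on each sub-DataFrame, and stitch the pieces back together. This is only an illustration of the idea on five rows from the camping table, not a description of how pandas implements `transform()` internally; `pieces`, `manual`, and `via_transform` are made-up names.

```python
import pandas as pd

def pct(x):
    return x / sum(x) * 100

# Five rows copied from the camping table above (Sleep and Health only).
subset = pd.DataFrame(
    {
        "Category": ["Sleep", "Sleep", "Health", "Health", "Health"],
        "Quantity": [0, 2, 1, 1, 1],
        "UnitWeight": [27.0, 20.0, 2.0, 5.0, 3.7],
    },
    index=pd.Index(
        ["Sleeping Pad", "Sleeping Bag", "Toothbrush/Toothpaste",
         "Sunscreen", "Medical Kit"],
        name="Item",
    ),
)
columns = ["Quantity", "UnitWeight"]

# Split into sub-DataFrames and apply pct column by column within each one.
pieces = []
for _, group in subset.groupby("Category"):
    pieces.append(group[columns].apply(pct))

# Recombine and restore the original row order.
manual = pd.concat(pieces).reindex(subset.index)

# The same computation in a single call.
via_transform = subset.groupby("Category")[columns].transform(pct)

print(manual.round(2))
print(via_transform.round(2))
```

Both results have one row per item, in the original order, with each percentage taken relative to the item's own category.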

Thus, the above result shows us the percentage each item contributes to _its group's total_. However, it's a bit difficult to check whether our answer is correct, since the output is indexed by item name and doesn't show which category each item belongs to. So to make this easier to read, let's add these columns to the original DataFrame with the column names `%Quantity` and `%UnitWeight`.

In [25]:
camping[['%Quantity', '%UnitWeight']] = categories[['Quantity', 'UnitWeight']].transform(pct)
camping

| Item | Category | Quantity | UnitWeight | %Quantity | %UnitWeight |
|---|---|---|---|---|---|
| Pack | Pack | 1 | 33.0 | 100.0 | 100.0 |
| Tent | Shelter | 1 | 80.0 | 100.0 | 100.0 |
| Sleeping Pad | Sleep | 0 | 27.0 | 0.0 | 57.446809 |
| Sleeping Bag | Sleep | 2 | 20.0 | 100.0 | 42.553191 |
| Toothbrush/Toothpaste | Health | 1 | 2.0 | 33.333333 | 18.691589 |
| Sunscreen | Health | 1 | 5.0 | 33.333333 | 46.728972 |
| Medical Kit | Health | 1 | 3.7 | 33.333333 | 34.579439 |
| Spoon | Kitchen | 5 | 0.7 | 50.0 | 1.473684 |
| Stove | Kitchen | 1 | 20.0 | 10.0 | 42.105263 |
| Water Filter | Kitchen | 2 | 1.8 | 20.0 | 3.789474 |


Now, we can see that our calculations worked perfectly. For example, take the `Utility` category. The sum of the utility `Quantity` values is $1 + 0 + 4 = 5$.

Then, of the utility items, trekking poles comprise 4 of the 5 total items. Thus, the value of `%Quantity` for `Trekking Poles` is $\frac{4}{5} \times 100 = \boxed{80}$.

Similarly, the pack liner comprises 1 of the 5 total items. Thus, the value of `%Quantity` for `Pack Liner` is $\frac{1}{5} \times 100 = \boxed{20}$.

The same holds for `Stuff Sack` within the `Utility` category, as well as for every other item within their respective categories.
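The `Utility` arithmetic above can be double-checked with a few lines of pandas. The quantities for `Pack Liner` (1) and `Trekking Poles` (4) come straight from the text; the 0 for `Stuff Sack` is inferred from the sum $1 + 0 + 4 = 5$, so treat it as an assumption.

```python
import pandas as pd

# Utility quantities: 1 and 4 are stated in the text above; the 0 for
# Stuff Sack is inferred from the sum 1 + 0 + 4 = 5.
utility = pd.Series({"Pack Liner": 1, "Stuff Sack": 0, "Trekking Poles": 4})

# Same formula as pct: each value as a percentage of the group total.
shares = utility / utility.sum() * 100
print(shares)
```

The three percentages add up to 100, as they should for any single category.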

## In summary

So now we see that `transform()` does a little more than we previously thought. When we tell pandas to `transform` a GroupBy object using a given function, it essentially executes a group-based `apply`: it treats each group as a separate DataFrame, applies the function separately to each sub-DataFrame, and returns a result with the same index as the original data.