In [1]:
import pandas as pd

Last time, we used the following data set describing a stock of camping equipment. We're going to continue using that data today.

In [2]:
camping_df = pd.read_csv('camping.csv')
camping_df

,Item,Category,Quantity,UnitWeight
0,Pack,Pack,1,33.0
1,Tent,Shelter,1,80.0
2,Sleeping Pad,Sleep,0,27.0
3,Sleeping Bag,Sleep,2,20.0
4,Toothbrush/Toothpaste,Health,1,2.0
5,Sunscreen,Health,1,5.0
6,Medical Kit,Health,1,3.7
7,Spoon,Kitchen,5,0.7
8,Stove,Kitchen,1,20.0
9,Water Filter,Kitchen,2,1.8


# Review: Apply and GroupBy

## `apply()`

First off, `apply()` applies a given function to one or more columns of a DataFrame, passing each column to the function as a whole.

In [3]:
# Apply the sum function to a single column
camping_df['Quantity'].apply('sum')

26

In [4]:
# Apply the sum function to the entire DataFrame
camping_df.apply('sum')

Item          PackTentSleeping PadSleeping BagToothbrush/Too...
Category      PackShelterSleepSleepHealthHealthHealthKitchen...
Quantity                                                     26
UnitWeight                                                256.7
dtype: object

In [5]:
# Apply the sum function to selected columns
camping_df[['Quantity', 'UnitWeight']].apply('sum')

Quantity       26.0
UnitWeight    256.7
dtype: float64

A full list of the available alternatives to `'sum'` can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."
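As a small sketch of swapping in other function names, the snippet below applies `'mean'` and `'max'` the same way we applied `'sum'`. The `demo_df` DataFrame is a made-up miniature of the camping data, used only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Quantity': [1, 1, 0, 2],
    'UnitWeight': [33.0, 80.0, 27.0, 20.0],
})

# A string name is resolved to the pandas method of the same name
col_means = demo_df.apply('mean')
col_maxes = demo_df.apply('max')

print(col_means)
print(col_maxes)
```

Each call returns a Series with one entry per column, just like the `'sum'` examples above.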

We can also create our own functions and apply them to a DataFrame. The following code creates a function that divides its input by 2, and applies it to all of the numerical columns in the DataFrame.

In [6]:
def divide_by_2(x):
    return x / 2

# Apply divide_by_2 to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(divide_by_2)

,Quantity,UnitWeight
0,0.5,16.5
1,0.5,40.0
2,0.0,13.5
3,1.0,10.0
4,0.5,1.0
5,0.5,2.5
6,0.5,1.85
7,2.5,0.35
8,0.5,10.0
9,1.0,0.9
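
It can help to see what `apply()` actually hands to our function. The sketch below uses a hypothetical logging variant, `divide_by_2_logged`, and a made-up `demo_df`; it records the type of each argument it receives. Under these assumptions, the function is given each column as a whole Series, not one value at a time.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Quantity': [1, 0, 2],
    'UnitWeight': [33.0, 27.0, 20.0],
})

seen_types = []  # type names of the arguments passed in by apply()

def divide_by_2_logged(x):
    # Same arithmetic as divide_by_2, plus a record of what x is
    seen_types.append(type(x).__name__)
    return x / 2

halved = demo_df.apply(divide_by_2_logged)

print(set(seen_types))
print(halved)
```

Because `x` is a whole column, `x / 2` is a single vectorized division over that column.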


Here's another example. We apply the `pct` function to the `Quantity` and `UnitWeight` columns; for each column, it divides each value in the column by the sum of the entire column, and then multiplies the result by 100. In other words, it returns the _percentage_ of a particular value in relation to its entire column.

In [7]:
def pct(x):
    return x / sum(x) * 100

# Apply pct to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(pct)

,Quantity,UnitWeight
0,3.846154,12.855473
1,3.846154,31.164784
2,0.0,10.518115
3,7.692308,7.791196
4,3.846154,0.77912
5,3.846154,1.947799
6,3.846154,1.441371
7,19.230769,0.272692
8,3.846154,7.791196
9,7.692308,0.701208


(See Advanced Queries Part 1 for a comparison between `apply()` and generalized broadcasting.)
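
To check the arithmetic behind `pct` by hand, here is a minimal sketch on a three-value Series. The values are made up for illustration. It also shows one edge case worth knowing about: if a column sums to zero, the division yields `NaN` rather than a percentage.

```python
import pandas as pd

def pct(x):
    return x / sum(x) * 100

# Made-up quantities: the total is 4, so the shares are 1/4, 1/4, and 2/4
quantities = pd.Series([1, 1, 2])
shares = pct(quantities)
print(shares)

# Edge case: a column that sums to zero produces NaN (0 divided by 0)
all_zero = pct(pd.Series([0, 0]))
print(all_zero)
```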

---

## `groupby()`

The `groupby()` function allows us to split a DataFrame into a collection of sub-DataFrames, stored in a DataFrameGroupBy object.

In [8]:
# Store the DataFrameGroupBy object in a variable
categories = camping_df.groupby('Category')
categories

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f80bd0ad9a0>

This DataFrameGroupBy object is essentially a dictionary where each key is a particular unique `Category`, and each corresponding value is a DataFrame of all the rows belonging to that particular category.
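
The dictionary analogy can be made concrete. The sketch below builds an actual Python dictionary from a grouped DataFrame; `demo_df` and `groups_as_dict` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Item': ['Tent', 'Sleeping Pad', 'Sleeping Bag', 'Spoon'],
    'Category': ['Shelter', 'Sleep', 'Sleep', 'Kitchen'],
    'Quantity': [1, 0, 2, 5],
})

# Build a real dictionary: group name -> sub-DataFrame of that group's rows
groups_as_dict = {name: group for name, group in demo_df.groupby('Category')}

print(sorted(groups_as_dict))
print(groups_as_dict['Sleep'])
```

In practice we rarely need to build this dictionary ourselves, since the DataFrameGroupBy object offers the same lookups directly.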

We can use the `groups` attribute to get a summary of each Category and the corresponding rows it includes.

In [9]:
# Ask for a listing of each group
categories.groups

{'Clothing': Int64Index([14, 15, 16], dtype='int64'),
 'Health': Int64Index([4, 5, 6], dtype='int64'),
 'Kitchen': Int64Index([7, 8, 9, 10], dtype='int64'),
 'Pack': Int64Index([0], dtype='int64'),
 'Shelter': Int64Index([1], dtype='int64'),
 'Sleep': Int64Index([2, 3], dtype='int64'),
 'Utility': Int64Index([11, 12, 13], dtype='int64')}

We can retrieve a particular group using the `get_group()` method, passing in the name of the group we want.

In [10]:
categories.get_group('Health')

# Technically equivalent to this query:
# camping_df[camping_df['Category'] == 'Health']

,Item,Category,Quantity,UnitWeight
4,Toothbrush/Toothpaste,Health,1,2.0
5,Sunscreen,Health,1,5.0
6,Medical Kit,Health,1,3.7
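
The equivalence mentioned in the comment above can be checked directly. This is a small sketch on a made-up `demo_df`; it compares the rows returned by `get_group()` with the rows selected by the boolean query.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Item': ['Tent', 'Sunscreen', 'Medical Kit', 'Spoon'],
    'Category': ['Shelter', 'Health', 'Health', 'Kitchen'],
    'Quantity': [1, 1, 1, 5],
})

by_group = demo_df.groupby('Category').get_group('Health')
by_mask = demo_df[demo_df['Category'] == 'Health']

# Both approaches pick out the same rows, in the same order
same_rows = by_group.index.equals(by_mask.index)
print(same_rows)
print(by_group['Item'].tolist())
```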


We can also ask for the size of each group using the `size()` method, which returns a Series indexed by group name, where each value is the number of rows in that group.

In [11]:
categories.size()

Category
Clothing    3
Health      3
Kitchen     4
Pack        1
Shelter     1
Sleep       2
Utility     3
dtype: int64

Perhaps most importantly, we can loop through a DataFrameGroupBy object much like how we would loop through the `items()` of a dictionary: each iteration yields a group name together with that group's DataFrame.

In [12]:
for name, group in categories:
    print(name)
    print(group)
    print('-'*50)

Clothing
           Item  Category  Quantity  UnitWeight
14  Rain Poncho  Clothing         0         6.0
15        Shoes  Clothing         1        12.0
16          Hat  Clothing         3         2.5
--------------------------------------------------
Health
                    Item Category  Quantity  UnitWeight
4  Toothbrush/Toothpaste   Health         1         2.0
5              Sunscreen   Health         1         5.0
6            Medical Kit   Health         1         3.7
--------------------------------------------------
Kitchen
             Item Category  Quantity  UnitWeight
7           Spoon  Kitchen         5         0.7
8           Stove  Kitchen         1        20.0
9    Water Filter  Kitchen         2         1.8
10  Water Bottles  Kitchen         2        25.0
--------------------------------------------------
Pack
   Item Category  Quantity  UnitWeight
0  Pack     Pack         1        33.0
--------------------------------------------------
Shelter
   Item Category  Qu

---

# Advanced Queries (Part 2): Aggregate and Transform

## `agg()` / `aggregate()`

_Note that `agg` and `aggregate` are interchangeable; thus, I will exclusively use `agg` since it's shorter._
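
As a quick sanity check of that note, the sketch below calls both spellings on a made-up `demo_df` and compares the results.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Quantity': [1, 1, 0, 2],
    'UnitWeight': [33.0, 80.0, 27.0, 20.0],
})

via_agg = demo_df.agg(['sum', 'mean'])
via_aggregate = demo_df.aggregate(['sum', 'mean'])

# The two results should be identical
identical = via_agg.equals(via_aggregate)
print(identical)
print(via_agg)
```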

The `agg` function allows us to apply multiple functions at once. In the example below, we query first for the numerical columns from the camping data, and then aggregate both the `sum` and the `mean` of those two columns.

In [13]:
camping_df[['Quantity', 'UnitWeight']].agg(['sum', 'mean'])

,Quantity,UnitWeight
sum,26.0,256.7
mean,1.529412,15.1


The `agg` function is very often combined with `groupby` in order to query for group statistics. In the example below, we first group the data by `Category`, and then we aggregate both the `sum` and `mean` for each group.

In [21]:
categories = camping_df.groupby('Category')
categories.agg(['sum', 'mean'])

Category,Quantity,Quantity,UnitWeight,UnitWeight
,sum,mean,sum,mean
Clothing,4,1.333333,20.5,6.833333
Health,3,1.0,10.7,3.566667
Kitchen,10,2.5,47.5,11.875
Pack,1,1.0,33.0,33.0
Shelter,1,1.0,80.0,80.0
Sleep,2,1.0,47.0,23.5
Utility,5,1.666667,18.0,6.0
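
Notice that the result has two levels of column headers: the original column on top, and the function name underneath. Here is a minimal sketch of how we might pull individual pieces out of such a result, using a made-up `demo_df`; the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Category': ['Sleep', 'Sleep', 'Health', 'Health', 'Health'],
    'Quantity': [0, 2, 1, 1, 1],
    'UnitWeight': [27.0, 20.0, 2.0, 5.0, 3.7],
})

stats = demo_df.groupby('Category').agg(['sum', 'mean'])

# Each column label is a (column, function) pair
print(list(stats.columns))

# Selecting the top level gives a DataFrame with 'sum' and 'mean' columns
quantity_stats = stats['Quantity']

# Selecting a full (column, function) pair gives a Series indexed by Category
weight_mean = stats[('UnitWeight', 'mean')]

print(quantity_stats)
print(weight_mean)
```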


In [23]:
categories.agg('sum')

Category,Quantity,UnitWeight
Clothing,4,20.5
Health,3,10.7
Kitchen,10,47.5
Pack,1,33.0
Shelter,1,80.0
Sleep,2,47.0
Utility,5,18.0


A more flexible way of aggregating is to assign particular functions to particular columns. Take a close look at the code below, along with its result.

In [22]:
# Find the sum of the Quantity column, and both sum and mean for UnitWeight
categories.agg({
    'Quantity': 'sum',
    'UnitWeight': ['sum', 'mean']
})

Category,Quantity,UnitWeight,UnitWeight
,sum,sum,mean
Clothing,4,20.5,6.833333
Health,3,10.7,3.566667
Kitchen,10,47.5,11.875
Pack,1,33.0,33.0
Shelter,1,80.0,80.0
Sleep,2,47.0,23.5
Utility,5,18.0,6.0


Here, we passed a dictionary to the `agg` function, where the keys specify columns to operate on, and the corresponding values specify which function(s) to apply to the given column.

Note that each column can be given either a single function to apply, or a list of functions. In this example, the `Quantity` column was given just the `sum` function, but the `UnitWeight` column was given both `sum` and `mean`.
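
As an aside, pandas also supports "named aggregation", which is not used elsewhere in this notebook but produces flat, self-describing column names instead of two header levels. The sketch below is one possible way to express the same query; `demo_df` and the output names `total_quantity`, `total_weight`, and `mean_weight` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Category': ['Sleep', 'Sleep', 'Health', 'Health', 'Health'],
    'Quantity': [0, 2, 1, 1, 1],
    'UnitWeight': [27.0, 20.0, 2.0, 5.0, 3.7],
})

# Each keyword is an output column name; each value is (input column, function)
summary = demo_df.groupby('Category').agg(
    total_quantity=('Quantity', 'sum'),
    total_weight=('UnitWeight', 'sum'),
    mean_weight=('UnitWeight', 'mean'),
)

print(summary)
```

Whether to prefer this over the dictionary form is mostly a matter of taste; the dictionary form is more compact, while named aggregation avoids the multi-level header.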

_Once again, a full list of basic statistics functions provided by Pandas can be found [on this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."_

---

## `transform()`

Unlike `agg()`, which returns a reduced version of its input, `transform()` returns a DataFrame that's the same size as its input, but edited (i.e. *transformed*) somehow.
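
The difference in output size is easy to see side by side. The following sketch runs `agg('sum')` and `transform('sum')` on the same grouped data, using a made-up `demo_df`, and compares the shapes of the results.

```python
import pandas as pd

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Category': ['Sleep', 'Sleep', 'Health', 'Health', 'Health'],
    'Quantity': [0, 2, 1, 1, 1],
    'UnitWeight': [27.0, 20.0, 2.0, 5.0, 3.7],
})

grouped = demo_df.groupby('Category')[['Quantity', 'UnitWeight']]

reduced = grouped.agg('sum')          # one row per group
same_size = grouped.transform('sum')  # one row per original row

print(reduced.shape)
print(same_size.shape)
print(same_size)
```

The transformed result keeps the original row index, which is what makes it possible to line it up against the original DataFrame.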

Here's a use of `transform()` that simply replaces each column value with the sum of its entire group.

In [24]:
categories.transform('sum')

,Item,Quantity,UnitWeight
0,Pack,1,33.0
1,Tent,1,80.0
2,Sleeping PadSleeping Bag,2,47.0
3,Sleeping PadSleeping Bag,2,47.0
4,Toothbrush/ToothpasteSunscreenMedical Kit,3,10.7
5,Toothbrush/ToothpasteSunscreenMedical Kit,3,10.7
6,Toothbrush/ToothpasteSunscreenMedical Kit,3,10.7
7,SpoonStoveWater FilterWater Bottles,10,47.5
8,SpoonStoveWater FilterWater Bottles,10,47.5
9,SpoonStoveWater FilterWater Bottles,10,47.5


Perhaps we don't want to include the `Item` column, since it contains string values. We can omit it by first querying for only the columns that we want.

In [25]:
categories[['Quantity', 'UnitWeight']].transform('sum')

,Quantity,UnitWeight
0,1,33.0
1,1,80.0
2,2,47.0
3,2,47.0
4,3,10.7
5,3,10.7
6,3,10.7
7,10,47.5
8,10,47.5
9,10,47.5


As a slightly more useful example, let's apply the `pct` function that we created earlier. The result is a DataFrame that tells us the percentage of quantity and weight that each item contributes, _with respect to its group_.

In [26]:
categories[['Quantity', 'UnitWeight']].transform(pct)

,Quantity,UnitWeight
0,100.0,100.0
1,100.0,100.0
2,0.0,57.446809
3,100.0,42.553191
4,33.333333,18.691589
5,33.333333,46.728972
6,33.333333,34.579439
7,50.0,1.473684
8,10.0,42.105263
9,20.0,3.789474


Of course, these numbers are not particularly helpful unless we can compare them directly to the items that they belong to. So a very common use of `transform()` is to re-assign the results to new columns in the original DataFrame. Let's take the percentage results from the query above and re-assign them to our original `camping_df` as new columns.

In [27]:
camping_df[['%Quantity', '%UnitWeight']] = categories[['Quantity', 'UnitWeight']].transform(pct)
camping_df

,Item,Category,Quantity,UnitWeight,%Quantity,%UnitWeight
0,Pack,Pack,1,33.0,100.0,100.0
1,Tent,Shelter,1,80.0,100.0,100.0
2,Sleeping Pad,Sleep,0,27.0,0.0,57.446809
3,Sleeping Bag,Sleep,2,20.0,100.0,42.553191
4,Toothbrush/Toothpaste,Health,1,2.0,33.333333,18.691589
5,Sunscreen,Health,1,5.0,33.333333,46.728972
6,Medical Kit,Health,1,3.7,33.333333,34.579439
7,Spoon,Kitchen,5,0.7,50.0,1.473684
8,Stove,Kitchen,1,20.0,10.0,42.105263
9,Water Filter,Kitchen,2,1.8,20.0,3.789474
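
For comparison, the same per-group percentage can also be computed without a custom function, by dividing each value by its group total from `transform('sum')`. This is only a sketch of an alternative, run on a made-up `demo_df`; the name `alt_pct` is hypothetical.

```python
import pandas as pd

def pct(x):
    return x / sum(x) * 100

# Hypothetical miniature of the camping data (illustration only)
demo_df = pd.DataFrame({
    'Category': ['Sleep', 'Sleep', 'Health', 'Health', 'Health'],
    'Quantity': [0, 2, 1, 1, 1],
})

grouped = demo_df.groupby('Category')

# Approach used above: transform with our own function
demo_df['%Quantity'] = grouped['Quantity'].transform(pct)

# Alternative: divide each value by the total of its own group
alt_pct = demo_df['Quantity'] / grouped['Quantity'].transform('sum') * 100

# Largest difference between the two approaches (should be zero or negligible)
max_diff = (alt_pct - demo_df['%Quantity']).abs().max()
print(demo_df)
print(max_diff)
```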
