In [1]:
import pandas as pd

# Advanced Queries (Part 1): Apply and GroupBy

In today's class, we'll be learning how to use some powerful functions that allow us to make more complex queries.

To learn these functions, we'll be using the following data set describing a stock of camping equipment.

In [2]:
camping_df = pd.read_csv('camping.csv')
camping_df

Unnamed: 0,Item,Category,Quantity,UnitWeight
0,Pack,Pack,1,33.0
1,Tent,Shelter,1,80.0
2,Sleeping Pad,Sleep,0,27.0
3,Sleeping Bag,Sleep,2,20.0
4,Toothbrush/Toothpaste,Health,1,2.0
5,Sunscreen,Health,1,5.0
6,Medical Kit,Health,1,3.7
7,Spoon,Kitchen,5,0.7
8,Stove,Kitchen,1,20.0
9,Water Filter,Kitchen,2,1.8


## `apply()`

Our first function is `apply()`, which allows us to _apply_ a function (through broadcasting) to one or more columns of a DataFrame.

In [6]:
# Apply the sum function to a single column
camping_df['Quantity'].apply('sum')

26

In [7]:
# Apply the sum function to the entire DataFrame
camping_df.apply('sum')

Item          PackTentSleeping PadSleeping BagToothbrush/Too...
Category      PackShelterSleepSleepHealthHealthHealthKitchen...
Quantity                                                     26
UnitWeight                                                256.7
dtype: object

We can query for specific columns first, and then apply a function to that subset of the original DataFrame.

In [8]:
camping_df[['Quantity', 'UnitWeight']].apply('sum')

Quantity       26.0
UnitWeight    256.7
dtype: float64

You may notice that the above queries tell Pandas to apply a function called `'sum'`, but we never actually defined such a function. This is because Pandas has a number of basic statistics functions available for use; a full list of the available functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."
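
As a small illustration of the point above, the sketch below checks that passing the string `'sum'` to `apply()` gives the same result as calling the DataFrame's own `sum()` method, and that other built-in names such as `'mean'` work the same way. The DataFrame `df` is a hypothetical stand-in that copies a few rows from the camping data; it is not the full data set.

```python
import pandas as pd

# Hypothetical stand-in for camping_df (values copied from the first rows above)
df = pd.DataFrame({
    'Quantity': [1, 1, 0, 2],
    'UnitWeight': [33.0, 80.0, 27.0, 20.0],
})

# Passing the name of a built-in statistic as a string...
by_name = df.apply('sum')
# ...gives the same answer as calling the corresponding method directly
by_method = df.sum()

# Other built-in statistics can be requested by name in the same way
means = df.apply('mean')

print(by_name)
print(means)
```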

In addition to the default statistics functions, we can create our own functions and apply them to a DataFrame. The following code defines a function that divides its input by 2, and applies it to all of the numerical columns in the DataFrame.

In [14]:
# Define a function called divide_by_2 that takes a parameter and returns its value divided by 2
def divide_by_2(x):
    return x / 2

# Apply divide_by_2 to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(divide_by_2)

Unnamed: 0,Quantity,UnitWeight
0,0.5,16.5
1,0.5,40.0
2,0.0,13.5
3,1.0,10.0
4,0.5,1.0
5,0.5,2.5
6,0.5,1.85
7,2.5,0.35
8,0.5,10.0
9,1.0,0.9


Note that before applying `divide_by_2`, we had to query for just the numerical columns. Without this first step, Pandas would try to apply the function to the `Item` and `Category` columns as well, which would raise an error because string data cannot be divided.

In [15]:
# ERROR
camping_df.apply(divide_by_2)

TypeError: unsupported operand type(s) for /: 'str' and 'int'
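
One way to avoid listing the numerical columns by hand is to let Pandas pick them by data type. The sketch below uses `select_dtypes(include='number')` for this; it is an alternative to the explicit column list used above, and the DataFrame `df` is a hypothetical stand-in built from a few rows of the camping data.

```python
import pandas as pd

# Hypothetical stand-in for camping_df (a few rows copied from above)
df = pd.DataFrame({
    'Item': ['Pack', 'Tent', 'Sleeping Pad'],
    'Category': ['Pack', 'Shelter', 'Sleep'],
    'Quantity': [1, 1, 0],
    'UnitWeight': [33.0, 80.0, 27.0],
})

def divide_by_2(x):
    return x / 2

# Keep only the columns with a numeric dtype (drops Item and Category)
numeric = df.select_dtypes(include='number')

# Now the function can be applied without hitting the string columns
halved = numeric.apply(divide_by_2)
print(halved)
```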

You may notice that this simple application of `apply()` is functionally identical to a broadcasted operation.

In [16]:
camping_df[['Quantity', 'UnitWeight']] / 2

Unnamed: 0,Quantity,UnitWeight
0,0.5,16.5
1,0.5,40.0
2,0.0,13.5
3,1.0,10.0
4,0.5,1.0
5,0.5,2.5
6,0.5,1.85
7,2.5,0.35
8,0.5,10.0
9,1.0,0.9


Given that we can accomplish the same thing by broadcasting, it may seem silly to have `apply()` at all. But once an operation becomes more complex than a single division, the broadcasted version gets messy very quickly. In those cases, it's better to write the operation as its own function, and then use it on our data with `apply()`.

Here's an example. We apply the `pct` function to the `Quantity` and `UnitWeight` columns; for each column, it divides each value in the column by the sum of the entire column, and then multiplies the result by 100. In other words, it returns the _percentage_ of a particular value in relation to its entire column.

In [18]:
# Define a function called pct that divides a column by the sum of the column, and multiplies by 100
def pct(col):
    return col / col.sum() * 100

# Apply pct to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(pct)

Unnamed: 0,Quantity,UnitWeight
0,3.846154,12.855473
1,3.846154,31.164784
2,0.0,10.518115
3,7.692308,7.791196
4,3.846154,0.77912
5,3.846154,1.947799
6,3.846154,1.441371
7,19.230769,0.272692
8,3.846154,7.791196
9,7.692308,0.701208


The broadcasting equivalent of this function is significantly messier.

In [20]:
camping_df[['Quantity', 'UnitWeight']] / camping_df[['Quantity', 'UnitWeight']].sum() * 100

Unnamed: 0,Quantity,UnitWeight
0,3.846154,12.855473
1,3.846154,31.164784
2,0.0,10.518115
3,7.692308,7.791196
4,3.846154,0.77912
5,3.846154,1.947799
6,3.846154,1.441371
7,19.230769,0.272692
8,3.846154,7.791196
9,7.692308,0.701208
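
As a quick check that the two forms really do agree, the sketch below computes the percentages both ways on a small stand-in DataFrame and compares the results. The names `df`, `via_apply`, and `via_broadcast` are hypothetical, and `pct` is written here with the column's own `sum()` method.

```python
import pandas as pd

# Hypothetical stand-in for the numerical columns of camping_df
df = pd.DataFrame({
    'Quantity': [1, 1, 0, 2],
    'UnitWeight': [33.0, 80.0, 27.0, 20.0],
})

def pct(col):
    # Each value as a percentage of its column total
    return col / col.sum() * 100

# Version 1: hand the function to apply(), which calls it once per column
via_apply = df.apply(pct)

# Version 2: the same arithmetic written as a broadcasted expression
via_broadcast = df / df.sum() * 100

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(via_apply, via_broadcast)
print(via_apply)
```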


So `apply` helps us keep things cleaner when broadcasting complex functions. But even more importantly, the idea of _applying_ functions is fundamental to Pandas, to the extent that it is the underlying concept upon which several other functions are built.

---

## `groupby()`

The `groupby` function is not built on top of `apply`, but it is almost as fundamental, so we will cover it next. This function is used to split a DataFrame into sub-DataFrames, one for each distinct value in a specified "group by" column.

The following line of code splits the camping data according to which `Category` each row falls into.

In [22]:
camping_df.groupby('Category')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fee1bc63490>

The result of a `groupby` query is what we call a `DataFrameGroupBy` object, which is useless to look at by itself (as we see above). But such an object comes with a whole bunch of extra functionality that makes it super useful.

First, we have the `groups` attribute, which lists out each group according to the indexes of its corresponding rows.

In [23]:
# Store the DataFrameGroupBy object in a variable
categories = camping_df.groupby('Category')
# Ask for a listing of each group
categories.groups

{'Clothing': Int64Index([14, 15, 16], dtype='int64'),
 'Health': Int64Index([4, 5, 6], dtype='int64'),
 'Kitchen': Int64Index([7, 8, 9, 10], dtype='int64'),
 'Pack': Int64Index([0], dtype='int64'),
 'Shelter': Int64Index([1], dtype='int64'),
 'Sleep': Int64Index([2, 3], dtype='int64'),
 'Utility': Int64Index([11, 12, 13], dtype='int64')}

Once we know what the groups are, we can use the `get_group` function to request a specific group. Note that the group is returned as its own DataFrame, which means that you can more or less think of a DataFrameGroupBy object as a list of small DataFrames, each of which is a subset of the original.

In [24]:
categories.get_group('Health')

Unnamed: 0,Item,Category,Quantity,UnitWeight
4,Toothbrush/Toothpaste,Health,1,2.0
5,Sunscreen,Health,1,5.0
6,Medical Kit,Health,1,3.7


We can ask for the size of each group using the `size` function. Pretty straightforward.

In [25]:
categories.size()

Category
Clothing    3
Health      3
Kitchen     4
Pack        1
Shelter     1
Sleep       2
Utility     3
dtype: int64

We can loop through a DataFrameGroupBy object, pulling out the name and sub-DataFrame for each group in the list.

In [29]:
for name, group in categories:
    print(name)
    print(group)
    print('-' * 50)

Clothing
           Item  Category  Quantity  UnitWeight
14  Rain Poncho  Clothing         0         6.0
15        Shoes  Clothing         1        12.0
16          Hat  Clothing         3         2.5
--------------------------------------------------
Health
                    Item Category  Quantity  UnitWeight
4  Toothbrush/Toothpaste   Health         1         2.0
5              Sunscreen   Health         1         5.0
6            Medical Kit   Health         1         3.7
--------------------------------------------------
Kitchen
             Item Category  Quantity  UnitWeight
7           Spoon  Kitchen         5         0.7
8           Stove  Kitchen         1        20.0
9    Water Filter  Kitchen         2         1.8
10  Water Bottles  Kitchen         2        25.0
--------------------------------------------------
Pack
   Item Category  Quantity  UnitWeight
0  Pack     Pack         1        33.0
--------------------------------------------------
Shelter
   Item Category  Qu
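
To make the "list of small DataFrames" picture concrete, the sketch below collects the groups into a plain dictionary keyed by group name and confirms that each entry matches what `get_group` returns. The DataFrame `df` and the name `sub_frames` are hypothetical stand-ins; only a few rows from the camping data are used.

```python
import pandas as pd

# Hypothetical stand-in for camping_df (a few rows copied from above)
df = pd.DataFrame({
    'Item': ['Tent', 'Sleeping Pad', 'Sleeping Bag', 'Sunscreen', 'Medical Kit'],
    'Category': ['Shelter', 'Sleep', 'Sleep', 'Health', 'Health'],
    'Quantity': [1, 0, 2, 1, 1],
})

grouped = df.groupby('Category')

# Looping yields (name, sub-DataFrame) pairs, so we can store them in a dict
sub_frames = {name: group for name, group in grouped}

# Each dictionary entry is the same DataFrame that get_group would return
health = grouped.get_group('Health')
print(sorted(sub_frames))
print(sub_frames['Health'].equals(health))
```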

Suppose we still want to group by `Category`, but we only care about the `UnitWeight` column of each item. We can query the DataFrameGroupBy object for `UnitWeight`, and then loop through that instead.

In [34]:
weights = categories['UnitWeight']
for name, group in weights:
    print(name)
    print(group)
    print('-' * 50)

('Clothing', 14     6.0
15    12.0
16     2.5
Name: UnitWeight, dtype: float64)
('Health', 4    2.0
5    5.0
6    3.7
Name: UnitWeight, dtype: float64)
('Kitchen', 7      0.7
8     20.0
9      1.8
10    25.0
Name: UnitWeight, dtype: float64)
('Pack', 0    33.0
Name: UnitWeight, dtype: float64)
('Shelter', 1    80.0
Name: UnitWeight, dtype: float64)
('Sleep', 2    27.0
3    20.0
Name: UnitWeight, dtype: float64)
('Utility', 11     1.0
12     1.0
13    16.0
Name: UnitWeight, dtype: float64)


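So to summarize, `groupby` allows us to split a DataFrame into a DataFrameGroupBy object, which is essentially a list of sub-DataFrames.

Because each group behaves like its own small DataFrame (or Series, once we select a single column), we can also apply a function to each group separately. The following query applies `pct` to the `Quantity` column group by group, so each percentage is relative to the item's category rather than to the whole column.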

In [53]:
# Compute Quantity percentage of each item in relation to its category
categories['Quantity'].apply(pct)

0     100.000000
1     100.000000
2       0.000000
3     100.000000
4      33.333333
5      33.333333
6      33.333333
7      50.000000
8      10.000000
9      20.000000
10     20.000000
11     20.000000
12      0.000000
13     80.000000
14      0.000000
15     25.000000
16     75.000000
Name: Quantity, dtype: float64
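
The per-category percentages can also be computed with `transform`, which is intended for functions that return one value per input row and keeps the original row index. This is an alternative to the `apply` call used in the cell above, sketched on a hypothetical stand-in DataFrame `df` that copies the Kitchen and Sleep rows of the camping data.

```python
import pandas as pd

# Hypothetical stand-in for camping_df (Kitchen and Sleep rows copied from above)
df = pd.DataFrame({
    'Item': ['Spoon', 'Stove', 'Water Filter', 'Water Bottles',
             'Sleeping Pad', 'Sleeping Bag'],
    'Category': ['Kitchen'] * 4 + ['Sleep'] * 2,
    'Quantity': [5, 1, 2, 2, 0, 2],
})

def pct(col):
    # Each value as a percentage of the total of the values it is given
    return col / col.sum() * 100

# pct is called once per category, so each total is a category total
share = df.groupby('Category')['Quantity'].transform(pct)

# Within each category, the percentages should add up to 100
totals = share.groupby(df['Category']).sum()
print(share)
print(totals)
```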