In [None]:
import pandas as pd

# Advanced Queries (Part 1): Apply and GroupBy

In today's class, we'll be learning how to use some powerful functions that allow us to make more complex queries.

To learn these functions, we'll be using the following data set describing a stock of camping equipment.

In [None]:
camping_df = pd.read_csv('camping.csv')
camping_df

## `apply()`

Our first function is `apply()`, which allows us to _apply_ a function (through broadcasting) to one or more columns of a DataFrame.

In [None]:
# Apply the sum function to a single column
camping_df['Quantity'].apply('sum')

In [None]:
# Apply the sum function to the entire DataFrame
camping_df.apply('sum')

We can query for specific columns first, and then apply a function to that subset of the original DataFrame.

In [None]:
camping_df[['Quantity', 'UnitWeight']].apply('sum')

You may notice that the above queries tell Pandas to apply a function called `'sum'`, but we never actually defined such a function. This is because Pandas has a number of basic statistics functions available for use; a full list of the available functions can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."
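
As a quick sanity check on how named functions behave, here is a minimal sketch on a small made-up table. The `toy_df` name, its rows, and the `Item` column are assumptions standing in for `camping.csv`, whose contents are not shown here. As far as we can tell, passing a string to `apply()` looks up the method of that name, so `apply('sum')` should agree with calling `.sum()` directly.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Item': ['Tent', 'Bandages', 'Stove', 'Sunscreen'],
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

numeric = toy_df[['Quantity', 'UnitWeight']]

# Applying by name and calling the method directly give the same totals
by_name = numeric.apply('sum')
by_method = numeric.sum()

# Other descriptive statistics can be requested by name in the same way
means = numeric.apply('mean')
maxima = numeric.apply('max')

print(by_name)
print(means)
print(maxima)
```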

In addition to the default statistics functions, we can create our own functions and apply them to a DataFrame. The following code defines a function that divides its input by 2, and applies it to all of the numerical columns in the DataFrame.

In [None]:
# Write a function called divide_by_2 that takes a parameter and returns its value divided by 2
def divide_by_2(x):
    return x / 2

# Apply divide_by_2 to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(divide_by_2)

Note that before applying `divide_by_2`, we had to query for just the numerical columns. Without this first step, Pandas would try to apply the function to the `Category` column as well, which would raise an error because string data cannot be divided.
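
If you would rather not list the numerical columns by hand, one option is `select_dtypes`, which keeps only the columns of a given type. This is a sketch on a made-up table (`toy_df`, its rows, and the `Item` column are assumptions, not the real camping data), and it is an alternative to the explicit column list used in this lesson, not a replacement for it.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Item': ['Tent', 'Bandages', 'Stove', 'Sunscreen'],
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

def divide_by_2(x):
    return x / 2

# Keep only the numeric columns, then apply the function to each of them
numeric = toy_df.select_dtypes(include='number')
halved = numeric.apply(divide_by_2)

print(halved)
```

The string columns (`Item` and `Category` in this toy table) are dropped before `divide_by_2` ever sees them.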

In [None]:
# ERROR
camping_df.apply(divide_by_2)

You may notice that this simple application of `apply()` is functionally identical to a broadcasted operation.

In [None]:
camping_df[['Quantity', 'UnitWeight']] / 2

Given that we can accomplish the same thing by broadcasting, it may seem silly to have `apply()` at all. But once an operation becomes more complex than a simple division, the broadcasted expression quickly gets messy. It is then cleaner to write the operation as its own function, and use it on our data with `apply()`.

Here's an example. We apply the `pct` function to the `Quantity` and `UnitWeight` columns; for each column, it divides each value in the column by the sum of the entire column, and then multiplies the result by 100. In other words, it returns each value as a _percentage_ of its column's total.

In [None]:
# Write a function called pct that divides a column by the sum of the column, and multiplies by 100
def pct(x):
    return x / x.sum() * 100

# Apply pct to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(pct)

The broadcasting equivalent of this function is significantly messier.

In [None]:
camping_df[['Quantity', 'UnitWeight']] / camping_df[['Quantity', 'UnitWeight']].sum() * 100
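
To check that the two versions really do the same thing, here is a minimal sketch on a made-up table (`toy_df` and its rows are assumptions standing in for the camping data). It computes the percentages both ways and compares them. `pct` is written here with the Series `.sum()` method.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

def pct(x):
    # x is one column (a Series); express each value as a percentage of the column total
    return x / x.sum() * 100

numeric = toy_df[['Quantity', 'UnitWeight']]

# Version 1: hand the function to apply(), which calls it once per column
via_apply = numeric.apply(pct)

# Version 2: the same arithmetic written as one broadcasted expression
via_broadcast = numeric / numeric.sum() * 100

# Largest absolute difference between the two results
max_difference = (via_apply - via_broadcast).abs().max().max()
print(max_difference)
```

Each column of `via_apply` adds up to 100, which is a handy check that the function is doing what we expect.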

So `apply` helps us keep things cleaner when broadcasting complex functions. But even more importantly, the idea of _applying_ functions is fundamental to Pandas: it is the underlying concept upon which several other functions are built.

---

## `groupby()`

The `groupby` function is not built on top of `apply`, but it is almost as fundamental, so we will cover it next. This function is used to split a DataFrame into sub-DataFrames, according to the values in a specified "group by" column.

The following line of code splits the camping data according to which `Category` each row falls into.

In [None]:
camping_df.groupby('Category')

The result of a `groupby` query is what we call a `DataFrameGroupBy` object, which is useless to look at by itself (as we see above). But such an object comes with a whole bunch of extra functionality that makes it super useful.

First, we have the `groups` attribute, which lists out each group according to the index labels of its corresponding rows.

In [None]:
# Store the DataFrameGroupBy object in a variable
categories = camping_df.groupby('Category')
# Ask for a listing of each group
categories.groups

Once we know what the groups are, we can use the `get_group` function to request a specific group. Note that the group is returned as its own DataFrame, which means that you can more or less think of a DataFrameGroupBy object as a list of small DataFrames, each of which is a subset of the original.

In [None]:
categories.get_group('Health')
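
To make `groups` and `get_group` concrete, here is a minimal sketch on a made-up table (`toy_df`, its rows, and the `Item` column are assumptions, not the real camping data). It also shows that, for a single group, `get_group` picks out the same rows as an ordinary boolean filter.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Item': ['Tent', 'Bandages', 'Stove', 'Sunscreen'],
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

categories = toy_df.groupby('Category')

# groups maps each group name to the index labels of the rows in that group
group_labels = {name: list(labels) for name, labels in categories.groups.items()}
print(group_labels)

# get_group returns a regular DataFrame containing only that group's rows
health = categories.get_group('Health')
print(health)

# The same rows, selected with a boolean mask on the original DataFrame
health_by_mask = toy_df[toy_df['Category'] == 'Health']
```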

We can ask for the size of each group using the `size` function. Pretty straightforward.

In [None]:
categories.size()
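
Here is what `size` looks like on a small made-up table (`toy_df` and its rows are assumptions standing in for the camping data). The result is a Series with one entry per group, and the entries add up to the number of rows in the original DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

# One count per group, indexed by the group name
sizes = toy_df.groupby('Category').size()
print(sizes)

# Every row belongs to exactly one group, so the counts cover the whole table
total_rows = sizes.sum()
```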

We can loop through a DataFrameGroupBy object, pulling out the name and sub-DataFrame for each group in the list.

In [None]:
for name, group in categories:
    print(name)
    print(group)
    print('-' * 51)   # Separators between groups, just for clarity

Suppose we still want to group by `Category`, but we only care about the `UnitWeight` column of each item. We can query the DataFrameGroupBy object for `UnitWeight`, and then loop through that instead.

In [None]:
weights = categories['UnitWeight']
for name, group in weights:
    print(name)
    print(group)
    print('-'*28)
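
One detail worth seeing for yourself: once we select a single column from the grouped object, each piece we loop over is a Series, not a DataFrame. The sketch below uses a made-up table (`toy_df` and its rows are assumptions, not the real camping data) and collects the pieces into a dictionary so we can inspect them.

```python
import pandas as pd

# Hypothetical stand-in for camping.csv; every row here is made up
toy_df = pd.DataFrame({
    'Category': ['Shelter', 'Health', 'Cooking', 'Health'],
    'Quantity': [2, 10, 4, 6],
    'UnitWeight': [5.0, 0.1, 1.5, 0.3],
})

# Selecting one column from the grouped object keeps the grouping
weights = toy_df.groupby('Category')['UnitWeight']

# Collect each group's piece under its group name
pieces = {name: group for name, group in weights}

for name, group in pieces.items():
    print(name, type(group).__name__, group.tolist())
```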

So to summarize, `groupby` allows us to split a DataFrame into a DataFrameGroupBy object, which is essentially a list of sub-DataFrames.