In [None]:
import pandas as pd

Last time, we used the following data set describing a stock of camping equipment. We're going to continue using that data today.

In [None]:
camping_df = pd.read_csv('camping.csv')
camping_df

# Review: Apply and GroupBy

## `apply()`

First off, `apply()` broadcasts a given function to one or more columns of a DataFrame.

In [None]:
# Apply the sum function to a single column
camping_df['Quantity'].apply('sum')

In [None]:
# Apply the sum function to the entire DataFrame
camping_df.apply('sum')

In [None]:
# Apply the sum function to select columns
camping_df[['Quantity', 'UnitWeight']].apply('sum')

A full list of the available alternatives to `'sum'` can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."

We can also create our own functions and apply them to a DataFrame. The following code creates a function that divides its input by 2, and applies it to all of the numerical columns in the DataFrame.

In [None]:
def divide_by_2(x):
    return x / 2

# Apply divide_by_2 to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(divide_by_2)

Here's another example. We apply the `pct` function to the `Quantity` and `UnitWeight` columns; for each column, it divides each value in the column by the sum of the entire column, and then multiplies the result by 100. In other words, it returns each value as a _percentage_ of its column's total.

In [None]:
def pct(x):
    return x / sum(x) * 100

# Apply pct to the numerical columns
camping_df[['Quantity', 'UnitWeight']].apply(pct)
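To see what `pct` does without depending on `camping.csv`, here is a minimal sketch on a small made-up DataFrame. The name `toy_df` and its values are hypothetical and are not taken from the camping data. Each column of the result should add up to 100.

```python
import pandas as pd

# Hypothetical data for illustration only (not the contents of camping.csv)
toy_df = pd.DataFrame({'Quantity': [1, 3, 4],
                       'UnitWeight': [2.0, 2.0, 4.0]})

def pct(x):
    return x / sum(x) * 100

# apply() hands each column to pct as a Series
toy_pct = toy_df.apply(pct)
print(toy_pct)
```

Because `pct` receives a whole column at a time, `sum(x)` is the column total, so each value is scaled against its own column only.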

(See Advanced Queries Part 1 for a comparison between `apply()` and generalized broadcasting.)

---

## `groupby()`

The `groupby()` function allows us to split a DataFrame into a collection of sub-DataFrames, stored in a DataFrameGroupBy object.

In [None]:
# Store the DataFrameGroupBy object in a variable
categories = camping_df.groupby('Category')
categories

This DataFrameGroupBy object is essentially a dictionary where each key is a particular unique `Category`, and each corresponding value is a DataFrame of all the rows belonging to that particular category.

We can use the `groups` attribute (note: no parentheses) to get a summary of each Category and the row labels it includes.

In [None]:
# Ask for a listing of each group
categories.groups
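As a rough illustration of what `groups` holds, the sketch below builds a tiny made-up DataFrame (`toy_df`, with hypothetical items and categories) and converts the result into a plain dictionary of lists. Each key is a group name and each value lists the index labels of the rows in that group.

```python
import pandas as pd

# Hypothetical data for illustration only (not the contents of camping.csv)
toy_df = pd.DataFrame({'Item': ['Tent', 'Bandages', 'Stove', 'Aspirin'],
                       'Category': ['Shelter', 'Health', 'Cooking', 'Health']})

toy_groups = toy_df.groupby('Category').groups

# Convert each Index of row labels to a plain list for easier reading
readable = {name: list(labels) for name, labels in toy_groups.items()}
print(readable)
```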

We can retrieve a particular group using the `get_group()` method, passing in the name of the group we want.

In [None]:
categories.get_group('Health')

# Technically equivalent to this query:
# camping_df[camping_df['Category'] == 'Health']

We can also ask for the size of each group using the `size()` method, which returns a Series mapping each group name to its number of rows.

In [None]:
categories.size()

Perhaps most importantly, we can loop through a DataFrameGroupBy object much like how we would loop through a dictionary.

In [None]:
for name, group in categories:
    print(name)
    print(group)
    print('-'*50)

---

# Advanced Queries (Part 2): Aggregate and Transform

## `agg()` / `aggregate()`

_Note that `agg` and `aggregate` are interchangeable; thus, I will exclusively use `agg` since it's shorter._

The `agg` function allows us to apply multiple functions at once. In the example below, we query first for the numerical columns from the camping data, and then aggregate both the `sum` and the `mean` of those two columns.

In [None]:
camping_df[['Quantity', 'UnitWeight']].agg(['sum', 'mean'])

The `agg` function is very often combined with `groupby` in order to query for group statistics. In the example below, we first group the data by `Category`, and then we aggregate both the `sum` and `mean` for each group.

In [None]:
categories = camping_df.groupby('Category')
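# Select the numerical columns first, since 'mean' cannot be computed on string columns such as Item
categories[['Quantity', 'UnitWeight']].agg(['sum', 'mean'])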

A more complex method of aggregating is to specify specific functions to apply to specific columns. Take a close look at the code below, along with its result.

In [None]:
# Find the sum of the Quantity column, and both sum and mean for UnitWeight
categories.agg({'Quantity': 'sum',
                'UnitWeight': ['sum', 'mean']})

Here, we passed a dictionary to the `agg` function, where the keys specify columns to operate on, and the corresponding values specify which function(s) to apply to the given column.

Note that each column can be given either a single function to apply, or a list of functions. In this example, the `Quantity` column was given just the `sum` function, but the `UnitWeight` column was given both `sum` and `mean`.
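The result of a dictionary-style `agg` has two-level column labels, which can be surprising when you try to pull a value back out. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical) to show the column labels and how to look up a single statistic with a `(column, function)` tuple.

```python
import pandas as pd

# Hypothetical data for illustration only (not the contents of camping.csv)
toy_df = pd.DataFrame({'Category': ['Health', 'Health', 'Shelter'],
                       'Quantity': [2, 4, 1],
                       'UnitWeight': [1.0, 3.0, 10.0]})

toy_stats = toy_df.groupby('Category').agg({'Quantity': 'sum',
                                            'UnitWeight': ['sum', 'mean']})

# The columns are labeled by (column, function) pairs
print(toy_stats.columns.tolist())

# Look up one statistic for one group using such a pair
health_mean_weight = toy_stats.loc['Health', ('UnitWeight', 'mean')]
print(health_mean_weight)
```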

_Once again, a full list of basic statistics functions provided by Pandas can be found [on this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#descriptive-statistics), under the section titled "Descriptive statistics."_

---

## `transform()`

Unlike `agg()`, which returns a reduced version of its input, `transform()` returns a DataFrame that's the same size as its input, but edited (i.e. *transformed*) somehow.
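The difference in output shape is easiest to see side by side. This is a minimal sketch on a made-up DataFrame (`toy_df` is hypothetical): `agg` returns one row per group, while `transform` returns one row per original row, aligned to the original index.

```python
import pandas as pd

# Hypothetical data for illustration only (not the contents of camping.csv)
toy_df = pd.DataFrame({'Category': ['Health', 'Health', 'Shelter'],
                       'Quantity': [2, 4, 1]})
toy_groups = toy_df.groupby('Category')

# agg() reduces: one row per group
reduced = toy_groups[['Quantity']].agg('sum')

# transform() keeps the original shape: one row per original row
same_size = toy_groups[['Quantity']].transform('sum')

print(reduced)
print(same_size)
```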

Here's a use of `transform()` that simply replaces each column value with the sum of its entire group.

In [None]:
categories.transform('sum')

Perhaps we don't want to include the `Item` column, since it contains string values. We can omit it by first querying for only the columns that we want.

In [None]:
categories[['Quantity', 'UnitWeight']].transform('sum')

As a slightly more useful example, let's apply the `pct` function that we created earlier. The result is a DataFrame that tells us what percentage of the quantity and unit weight each item contributes, _with respect to its group_.

In [None]:
categories[['Quantity', 'UnitWeight']].transform(pct)

Of course, these numbers are not particularly helpful unless we can compare them directly to the items that they belong to. So a very common use of `transform()` is to re-assign the results to new columns in the original DataFrame. Let's take the percentage results from the query above and re-assign them to our original `camping_df` as new columns.

In [None]:
camping_df[['%Quantity', '%UnitWeight']] = categories[['Quantity', 'UnitWeight']].transform(pct)
camping_df
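One way to sanity-check this kind of assignment is to confirm that the new percentages add up to 100 within each group. The sketch below does this on a small made-up DataFrame. The names `toy_df` and `check` and all the values are hypothetical, and `pct` is redefined so the example runs on its own.

```python
import pandas as pd

# Hypothetical data for illustration only (not the contents of camping.csv)
toy_df = pd.DataFrame({'Item': ['Bandages', 'Aspirin', 'Tent'],
                       'Category': ['Health', 'Health', 'Shelter'],
                       'Quantity': [1, 3, 5]})

def pct(x):
    return x / sum(x) * 100

# transform() returns one value per original row, so it can be assigned directly
toy_df['%Quantity'] = toy_df.groupby('Category')['Quantity'].transform(pct)

# Within each category, the percentages should add up to 100
check = toy_df.groupby('Category')['%Quantity'].sum()
print(toy_df)
print(check)
```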