# __Aggregating DataFrames__

# Outline
- [1 Summary Statistics](#sum-stats)
- [&nbsp;&nbsp;1.1 Mean and Median](#mean-med)
- [&nbsp;&nbsp;1.2 Summarizing Dates](#sum-dates)
- [&nbsp;&nbsp;1.3 Efficient Summaries](#sum-eff)
- [&nbsp;&nbsp;1.4 Cumulative Statistics](#cum-stats)
- [2 Counting](#count)
- [&nbsp;&nbsp;2.1 Dropping Duplicates](#drop-dup)
- [&nbsp;&nbsp;2.2 Counting Categorical Variables](#count-categ)
- [3 Grouped Summary Statistics](#grp-stat)
- [&nbsp;&nbsp;3.1 What percent of sales occurred at each store type?](#percent)
- [&nbsp;&nbsp;3.2 Calculations with .groupby()](#grp)
- [&nbsp;&nbsp;3.3 Multiple grouped summaries](#multi)




Importing `pandas` & loading data into `sales`

In [None]:
import pandas as pd
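# Parse "date" as datetime64 so the date summaries below compare chronologically
sales = pd.read_csv("./../../data/sales_subset.csv", index_col=0, parse_dates=["date"])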

<a id="sum-stats"></a>
# 1 __Summary Statistics__
<a id="mean-med"></a>
## 1.1 Mean and median
Summary statistics are exactly what they sound like: they summarize many numbers in a single value. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics gives you a better sense of your data, even if there's a lot of it.

Start exploring by printing the first few rows of the `sales` DataFrame.

In [None]:
sales.head()

Print information about the columns in sales.

In [None]:
sales.info()

Print the mean of the weekly_sales column.

In [None]:
sales["weekly_sales"].mean()

Print the median of the weekly_sales column.

In [None]:
sales["weekly_sales"].median()

<a id="sum-dates"></a>
## 1.2 Summarizing dates
Summary statistics can also be calculated on date columns whose values have the data type `datetime64`. Some summary statistics, like the mean, are rarely informative on dates, but others are very helpful: the minimum and maximum show what time range your data covers.

Print the maximum of the date column.

In [None]:
sales["date"].max()

Print the minimum of the date column.

In [None]:
sales["date"].min()
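
By default, `pd.read_csv` loads dates as plain strings unless told otherwise, so it can help to convert a date column before summarizing it. The following is a minimal sketch on made-up data; the `toy` DataFrame and its values are hypothetical and not part of `sales`.

```python
import pandas as pd

# Hypothetical toy data: dates stored as strings, as read_csv loads them by default
toy = pd.DataFrame({"date": ["2010-02-05", "2010-12-31", "2011-01-07"]})

# Convert to datetime64 so min/max compare chronologically
toy["date"] = pd.to_datetime(toy["date"])

first_date = toy["date"].min()
last_date = toy["date"].max()

# Subtracting two timestamps gives a Timedelta covering the data's time range
date_span = last_date - first_date
print(first_date, last_date, date_span)
```

Subtracting the minimum from the maximum is one simple way to see how long a period the data covers.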

<a id="sum-eff"></a>
## 1.3 Efficient summaries
pandas and NumPy provide many built-in summary functions, but sometimes you need one they don't offer.

The `.agg()` method lets you apply your own custom functions to a DataFrame, and apply functions to more than one column at once, which keeps your aggregation code concise. For example,<br>
`df['column'].agg(function)` <br>
In the custom function for this exercise, "IQR" is short for interquartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

In [None]:
import numpy as np

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

Use the custom `iqr` function defined for you along with `.agg()` to print the IQR of the `temperature_c` column of `sales`.

In [None]:
sales["temperature_c"].agg(iqr)

Update the column selection to use the custom `iqr` function with `.agg()` to print the IQR of `temperature_c`, `fuel_price_usd_per_l`, and `unemployment`, in that order.

In [None]:
sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr)

Update the aggregation functions called by `.agg()`: include `iqr` and `np.median` in that order.

In [None]:
sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median])
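
To see what shape `.agg()` returns in each case, here is a small sketch on made-up data whose quartiles are easy to check by hand. The `toy` DataFrame is hypothetical, and the string `"median"` is used here in place of `np.median` as an equivalent way to name a built-in aggregation.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical toy data, chosen so the quartiles are easy to verify
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

# One function: the result is a Series with one value per column
single = toy[["a", "b"]].agg(iqr)

# A list of functions: the result is a DataFrame with one row per function
multi = toy[["a", "b"]].agg([iqr, "median"])

print(single)
print(multi)
```

With a list of functions, the row labels of the result are the function names, so individual values can be looked up with `.loc`, for example `multi.loc["iqr", "a"]`.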

<a id="cum-stats"></a>
## 1.4 Cumulative statistics
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

The cell below creates a DataFrame called `sales_1_1`, which contains the sales data for department 1 of store 1. `pandas` is loaded as `pd`.

In [None]:
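# .copy() makes an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
sales_1_1 = sales[(sales["department"] == 1) & (sales["store"] == 1)].copy()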

Sort the rows of `sales_1_1` by the `date` column in ascending order.

In [None]:
# sort_values returns a new DataFrame, so assign the result back
sales_1_1 = sales_1_1.sort_values("date", ascending=True)

Get the cumulative sum of `weekly_sales` and add it as a new column of `sales_1_1` called `cum_weekly_sales`.

In [None]:
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

Get the cumulative maximum of `weekly_sales`, and add it as a column called `cum_max_sales`.

In [None]:
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

Print the `date`, `weekly_sales`, `cum_weekly_sales`, and `cum_max_sales` columns.

In [None]:
sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]]
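
Cumulative statistics depend on row order, so the sort has to take effect before `.cumsum()` and `.cummax()` are called. A minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical) illustrates this:

```python
import pandas as pd

# Hypothetical toy data, deliberately out of date order
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-19", "2010-02-05", "2010-02-12"]),
    "weekly_sales": [30.0, 10.0, 20.0],
})

# sort_values returns a new DataFrame, so the result is assigned back
toy = toy.sort_values("date")

# Running total and running maximum, computed in date order
toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()
toy["cum_max_sales"] = toy["weekly_sales"].cummax()

print(toy)
```

Without the reassignment, the cumulative columns would follow the original row order rather than the date order.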

<a id="count"></a>
# 2 __Counting__
<a id="drop-dup"></a>
## 2.1 Dropping duplicates
Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from `sales`.

Remove rows of `sales` with duplicate pairs of `store` and `type`, save the result as `store_types`, and print the head.

In [None]:
store_types = sales.drop_duplicates(["store", "type"])
store_types.head()

Remove rows of `sales` with duplicate pairs of `store` and `department`, save the result as `store_depts`, and print the head.

In [None]:
store_depts = sales.drop_duplicates(["store", "department"])
store_depts.head()

Subset the rows that are holiday weeks using the `is_holiday` column, then drop rows with duplicate `date` values, saving the result as `holiday_dates`.

In [None]:
holiday_dates = sales[sales["is_holiday"]].drop_duplicates("date")

Select the `date` column of `holiday_dates`, and print.

In [None]:
print(holiday_dates["date"])
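
To make the effect of the column list concrete, here is a small sketch on made-up data (the `toy` DataFrame is hypothetical). It shows that `.drop_duplicates()` keeps the first row for each distinct combination of the listed columns and preserves the original index labels.

```python
import pandas as pd

# Hypothetical toy data: two stores, with repeated store/department rows
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "type": ["A", "A", "A", "B", "B"],
    "department": [1, 1, 2, 1, 1],
})

# One row per distinct (store, type) pair
store_types_toy = toy.drop_duplicates(subset=["store", "type"])

# One row per distinct (store, department) pair
store_depts_toy = toy.drop_duplicates(subset=["store", "department"])

print(store_types_toy)
print(store_depts_toy)
```

The same table produces a different number of rows depending on which columns define a duplicate, which is why the choice of columns matters for accurate counts.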

<a id="count-categ"></a>
## 2.2 Counting categorical variables
Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

\# Drop duplicate store/type combinations <br>
`store_types = sales.drop_duplicates(subset=["store", "type"])`<br><br>
\# Drop duplicate store/department combinations<br>
`store_depts = sales.drop_duplicates(subset=["store", "department"])`

Count the number of stores of each store `type` in `store_types`.

In [None]:
store_counts = store_types["type"].value_counts()
store_counts

Count the proportion of stores of each store `type` in `store_types`.

In [None]:
store_props = store_types["type"].value_counts(normalize=True)
store_props

Count the number of different `department`s in `store_depts`, sorting the counts in descending order.

In [None]:
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
dept_counts_sorted

Count the proportion of different `department`s in `store_depts`, sorting the proportions in descending order.

In [None]:
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
dept_props_sorted
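
As a quick check of how counts and proportions relate, here is a minimal sketch on a made-up Series (`toy_types` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: the store type of four deduplicated stores
toy_types = pd.Series(["A", "A", "A", "B"], name="type")

# Counts per category, most frequent first (sort=True is the default)
counts = toy_types.value_counts()

# normalize=True divides each count by the total number of values
props = toy_types.value_counts(normalize=True)

print(counts)
print(props)
```

Because `sort=True` is the default, passing it explicitly only documents the intent; the proportions always sum to 1.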

<a id="grp-stat"></a>
# 3 __Grouped Summary Statistics__
<a id="percent"></a>
## 3.1 What percent of sales occurred at each store type?
While .groupby() is useful, you can calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

Calculate the total weekly_sales over the whole dataset.

In [None]:
# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

Subset for type "A" stores, and calculate their total weekly sales.

In [None]:
# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

Do the same for type "B" and type "C" stores.

In [None]:
# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.

In [None]:
# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

<a id="grp"></a>
## 3.2 Calculations with .groupby()
The .groupby() method makes life much easier. In this exercise, you'll perform the same calculations as last time, except you'll use the .groupby() method. You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.

sales is available and pandas is loaded as pd.
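
As a rough sketch of the idea, the example below repeats the proportion calculation with `.groupby()` on made-up data, then groups by two variables. The `toy` DataFrame and its values are hypothetical; only the column names `type`, `is_holiday`, and `weekly_sales` are taken from `sales`.

```python
import pandas as pd

# Hypothetical toy data using the same column names as sales
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B", "C"],
    "is_holiday": [False, True, False, False, True],
    "weekly_sales": [100.0, 50.0, 30.0, 20.0, 50.0],
})

# Total sales per store type, replacing the three manual subsets
sales_by_type = toy.groupby("type")["weekly_sales"].sum()

# Proportion of total sales made at each store type
propn_by_type = sales_by_type / sales_by_type.sum()

# Grouping by two variables gives one total per (type, is_holiday) pair
sales_by_type_holiday = toy.groupby(["type", "is_holiday"])["weekly_sales"].sum()

print(propn_by_type)
print(sales_by_type_holiday)
```

One `.groupby()` call covers every store type at once, so nothing has to change if a new type appears in the data. Only combinations that occur in the data show up in the two-variable result.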