# Aggregating DataFrames

# Outline
- [1 Summary Statistics](#sum-stat)
- [&nbsp;&nbsp;1.1 Mean and Median](#mean-med)
- [&nbsp;&nbsp;1.2 Summarizing Dates](#sum-dates)
- [&nbsp;&nbsp;1.3 Efficient Summaries](#sum-eff)
- [&nbsp;&nbsp;1.4 Cumulative Statistics](#cum-stats)
- [2 Counting](#count)
- [&nbsp;&nbsp;2.1 Dropping Duplicates](#drop-dup)
- [&nbsp;&nbsp;2.2 Counting Categorical Variables](#count-categ)

Importing `pandas` & loading data into `sales`

In [None]:
import pandas as pd
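# parse_dates converts the date column to datetime64 instead of leaving it as strings
sales = pd.read_csv("./../../data/sales_subset.csv", index_col=0, parse_dates=["date"])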

<a id="sum-stat"></a>
# __1 Summary Statistics__
<a id="mean-med"></a>
#### __1.1 Mean and median__
Summary statistics are exactly what they sound like: they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics gives you a better sense of your data, even if there's a lot of it.

Start exploring by printing the first few rows of the `sales` DataFrame.

In [None]:
sales.head()

Print information about the columns in `sales`.

In [None]:
sales.info()

Print the mean of the `weekly_sales` column.

In [None]:
sales["weekly_sales"].mean()

Print the median of the `weekly_sales` column.

In [None]:
sales["weekly_sales"].median()
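To see why it helps to look at both statistics, here is a minimal sketch on made-up numbers (the `toy_sales` values are hypothetical and not taken from `sales`). A single large value pulls the mean upward, while the median stays at the middle value.

```python
import pandas as pd

# Hypothetical weekly sales figures; the last value is a deliberate outlier.
toy_sales = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0])

toy_mean = toy_sales.mean()      # pulled upward by the outlier
toy_median = toy_sales.median()  # the middle value, unaffected by the outlier

print("mean:  ", toy_mean)
print("median:", toy_median)
```

When the two differ a lot, as they do here, that is often a hint that the column is skewed or contains outliers.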

<a id="sum-dates"></a>
#### __1.2 Summarizing dates__
Summary statistics can also be calculated on date columns whose values have the data type `datetime64`. Some summary statistics, like the mean, are rarely meaningful for dates, but others are very helpful: the minimum and maximum show what time range your data covers.

Print the maximum of the `date` column.

In [None]:
sales["date"].max()

Print the minimum of the `date` column.

In [None]:
sales["date"].min()
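The following sketch shows the same idea on a small made-up DataFrame (the `toy` dates are hypothetical). It assumes the dates start out as strings, converts them with `pd.to_datetime`, and then subtracts the minimum from the maximum to get the length of the covered range.

```python
import pandas as pd

# Hypothetical dates stored as strings, as they often are right after loading a CSV.
toy = pd.DataFrame({"date": ["2010-02-05", "2010-03-12", "2010-01-08"]})

# Convert the strings to datetime64 values.
toy["date"] = pd.to_datetime(toy["date"])

first_day = toy["date"].min()
last_day = toy["date"].max()
span = last_day - first_day  # subtracting two timestamps gives a Timedelta

print("from", first_day.date(), "to", last_day.date())
print("days covered:", span.days)
```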

<a id="sum-eff"></a>
#### __1.3 Efficient summaries__
While pandas and NumPy have tons of functions, sometimes you may need a different function to summarize your data.

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, which keeps your aggregations concise. For example:<br>
`df['column'].agg(function)` <br>
In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

In [None]:
import numpy as np

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

Use the custom `iqr` function defined for you along with `.agg()` to print the IQR of the `temperature_c` column of `sales`.

In [None]:
sales["temperature_c"].agg(iqr)

Update the column selection to use the custom `iqr` function with `.agg()` to print the IQR of `temperature_c`, `fuel_price_usd_per_l`, and `unemployment`, in that order.

In [None]:
sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr)

Update the aggregation functions called by `.agg()`: include `iqr` and `np.median` in that order.

In [None]:
sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median])
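As a rough illustration of why the IQR is useful when data contains outliers, the sketch below compares it with the standard deviation on two made-up columns (the `toy` DataFrame and its column names are hypothetical). The columns are identical except for one extreme value.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data: the two columns differ only in their last value.
toy = pd.DataFrame({
    "clean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "with_outlier": [1.0, 2.0, 3.0, 4.0, 500.0],
})

# A list passed to .agg() can mix custom functions and built-in names.
summary = toy.agg([iqr, "std"])
print(summary)
```

In this toy case the `iqr` row is the same for both columns, while the `std` row is far larger for the column with the outlier.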

<a id="cum-stats"></a>
#### __1.4 Cumulative statistics__
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

The cell below creates a DataFrame called `sales_1_1`, which contains the sales data for department 1 of store 1. `pandas` is loaded as `pd`.

In [None]:
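# .copy() makes an independent DataFrame, so new columns can be added without a SettingWithCopyWarning
sales_1_1 = sales[(sales["department"] == 1) & (sales["store"] == 1)].copy()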

Sort the rows of `sales_1_1` by the `date` column in ascending order.

In [None]:
# sort_values returns a new DataFrame, so assign the result back
sales_1_1 = sales_1_1.sort_values("date", ascending=True)

Get the cumulative sum of `weekly_sales` and add it as a new column of `sales_1_1` called `cum_weekly_sales`.

In [None]:
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

Get the cumulative maximum of `weekly_sales`, and add it as a column called `cum_max_sales`.

In [None]:
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

Print the `date`, `weekly_sales`, `cum_weekly_sales`, and `cum_max_sales` columns.

In [None]:
sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]]
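Cumulative statistics depend on row order, so sorting by date first matters. The sketch below uses a small made-up DataFrame (the `toy` values are hypothetical) whose rows start out of order, then applies the same steps: sort, cumulative sum, cumulative max.

```python
import pandas as pd

# Hypothetical weekly sales, deliberately listed out of date order.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-02-19", "2010-02-05", "2010-02-12", "2010-02-26"]),
    "weekly_sales": [30.0, 10.0, 50.0, 20.0],
})

# Sort first and keep the result, so "so far" means "up to this date".
toy = toy.sort_values("date")

toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()  # running total
toy["cum_max_sales"] = toy["weekly_sales"].cummax()     # highest value seen so far

print(toy)
```

The last value of `cum_weekly_sales` equals the overall total, and `cum_max_sales` never decreases from one row to the next.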

<a id="count"></a>
# __2 Counting__
<a id="drop-dup"></a>
#### __2.1 Dropping duplicates__
Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from `sales`.

Remove rows of `sales` with duplicate pairs of `store` and `type`, save the result as `store_types`, and print the head.

In [None]:
store_types = sales.drop_duplicates(["store", "type"])
store_types.head()

Remove rows of `sales` with duplicate pairs of `store` and `department`, save the result as `store_depts`, and print the head.

In [49]:
store_depts = sales.drop_duplicates(["store", "department"])
store_depts.head()

| | store | type | department | date | weekly_sales | is_holiday | temperature_c | fuel_price_usd_per_l | unemployment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A | 1 | 2010-02-05 | 24924.5 | False | 5.727778 | 0.679451 | 8.106 |
| 12 | 1 | A | 2 | 2010-02-05 | 50605.27 | False | 5.727778 | 0.679451 | 8.106 |
| 24 | 1 | A | 3 | 2010-02-05 | 13740.12 | False | 5.727778 | 0.679451 | 8.106 |
| 36 | 1 | A | 4 | 2010-02-05 | 39954.04 | False | 5.727778 | 0.679451 | 8.106 |
| 48 | 1 | A | 5 | 2010-02-05 | 32229.38 | False | 5.727778 | 0.679451 | 8.106 |


Subset the rows that are holiday weeks using the `is_holiday` column, and drop the duplicate `date`s, saving as `holiday_dates`.

In [None]:
holiday_dates = sales[sales["is_holiday"]].drop_duplicates("date")

Select the `date` column of `holiday_dates`, and print.

In [None]:
print(holiday_dates["date"])
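To make the effect of the `subset` columns concrete, here is a minimal sketch on a made-up DataFrame (the `toy` rows are hypothetical). The same rows give a different result depending on which columns define a duplicate, and the first occurrence of each combination is the one kept by default.

```python
import pandas as pd

# Hypothetical rows with repeated store/type and store/department pairs.
toy = pd.DataFrame({
    "store": [1, 1, 1, 2, 2],
    "type": ["A", "A", "A", "B", "B"],
    "department": [1, 1, 2, 1, 1],
})

# One row per unique (store, type) pair.
by_store_type = toy.drop_duplicates(subset=["store", "type"])

# One row per unique (store, department) pair.
by_store_dept = toy.drop_duplicates(subset=["store", "department"])

print(by_store_type)
print(by_store_dept)
```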

<a id="count-categ"></a>
#### __2.2 Counting categorical variables__
Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

\# Drop duplicate store/type combinations <br>
`store_types = sales.drop_duplicates(subset=["store", "type"])`<br><br>
\# Drop duplicate store/department combinations<br>
`store_depts = sales.drop_duplicates(subset=["store", "department"])`

Count the number of stores of each store `type` in `store_types`.

In [None]:
store_counts = store_types["type"].value_counts()
store_counts

Count the proportion of stores of each store `type` in `store_types`.

In [None]:
store_props = store_types["type"].value_counts(normalize=True)
store_props
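The difference between counts and proportions can be sketched on a tiny made-up Series (the `toy_types` values are hypothetical). With `normalize=True`, each count is divided by the total number of values, so the proportions sum to 1.

```python
import pandas as pd

# Hypothetical store types, one entry per store.
toy_types = pd.Series(["A", "A", "B", "A"], name="type")

counts = toy_types.value_counts()               # number of stores per type
props = toy_types.value_counts(normalize=True)  # share of stores per type

print(counts)
print(props)
```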

Count the number of different `department`s in `store_depts`, sorting the counts in descending order.

In [54]:
dept_counts_sorted = store_depts["department"].value_counts().sort_values(ascending=False)
dept_counts_sorted

department
1     12
3     12
5     12
6     12
7     12
      ..
37    10
48     8
50     6
39     4
43     2
Name: count, Length: 80, dtype: int64

Count the proportion of different `department`s in `store_depts`, sorting the proportions in descending order.

In [65]:
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
dept_props_sorted

department
1     0.012917
55    0.012917
72    0.012917
71    0.012917
67    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: proportion, Length: 80, dtype: float64