<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/004-aggregating_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Walmart Dataset Overview  

The dataset contains weekly sales data for Walmart stores, including:  

- Store ID and type  
- Weekly sales and department IDs  
- Holiday week indicator  
- Average temperature and fuel prices  
- National unemployment rate  

The dataset can be downloaded here: [sales_subset.csv](@site/assets/datasets/pandas-aggregating-data/sales_subset.csv)

Sample rows:  

| Store | Sales    | Dept | Holiday | Temp (°C) | Fuel Price (USD/L) | Unemployment |
|-------|----------|------|---------|-----------|---------------------|--------------|
|   1   | 24924.50 |   1  |    0    |   5.73    |         0.68        |     8.11     |
|   1   | 21827.90 |   1  |    0    |   8.06    |         0.69        |     8.11     |
|   1   | 57258.43 |   1  |    0    |  16.82    |         0.72        |     7.81     |
|   1   | 17413.94 |   1  |    0    |  22.53    |         0.75        |     7.81     |
|   1   | 17558.09 |   1  |    0    |  27.05    |         0.71        |     7.81     |
|   1   | 16333.14 |   1  |    0    |  27.17    |         0.71        |     7.79     | 

# Mean and Median

Print the first few rows of `sales` using `head()`.

In [None]:
import pandas as pd

# Load from a URL so the notebook can be run from Google Colab without downloading the CSV file
url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/aggregating-data-using-pandas/sales_subset.csv'
sales = pd.read_csv(url)

print(sales.head())

Print information about the columns in sales using `info()`

In [None]:
# info() prints its summary directly and returns None, so no print() is needed
sales.info()

Print the mean of the `weekly_sales` column.

In [None]:
print(sales["weekly_sales"].mean())

Print the median of the `weekly_sales` column.

In [None]:
print(sales["weekly_sales"].median())
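
The mean and median can differ noticeably when a few weeks have unusually high sales. A minimal sketch with made-up numbers (the `toy_sales` Series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical weekly sales figures; the last value is a deliberate outlier
toy_sales = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0])

toy_mean = toy_sales.mean()      # pulled upward by the outlier
toy_median = toy_sales.median()  # middle value, unaffected by the outlier

print(toy_mean)
print(toy_median)
```

If the two values on the real `weekly_sales` column are far apart, that is a hint the distribution is skewed.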

# Dates 

Print the maximum of the date column

In [None]:
print(sales["date"].max())

Print the minimum of the date column.

In [None]:
print(sales["date"].min())
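
When a CSV is read without date parsing, the `date` column holds strings, so `min()` and `max()` compare them as text. That gives the right answer for ISO-formatted dates (`YYYY-MM-DD`) but not necessarily for other formats. A hedged sketch, using a hypothetical toy frame with day-first dates rather than the actual dataset, that converts with `pd.to_datetime` before comparing:

```python
import pandas as pd

# Hypothetical data: day-first date strings, which do not sort correctly as text
toy = pd.DataFrame({"date": ["05/02/2012", "12/01/2010", "26/03/2011"]})

# Comparing as strings picks the lexicographically largest value
text_max = toy["date"].max()

# Parsing first makes min/max compare real timestamps
parsed = pd.to_datetime(toy["date"], format="%d/%m/%Y")

print(text_max)
print(parsed.min(), parsed.max())
```

Checking `sales["date"].dtype` shows whether the column is already a datetime type or still needs converting.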

# Efficient Summaries  

Sometimes, you need custom functions to summarize your data beyond what pandas and NumPy offer.  
The `.agg()` method lets you apply custom functions to a DataFrame, even across multiple columns at once, for efficient aggregation.  

Example: 

```python
df['column'].agg(function)
```  

In this exercise, the custom function calculates the "IQR" (inter-quartile range), which is the difference between the 75th and 25th percentiles. Using this function, print the IQR of the `temperature_c` column of sales.

In [None]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales["temperature_c"].agg(iqr))

Update the column selection so that `.agg()` applies the custom `iqr` function to these columns, in order:

- `temperature_c`
- `fuel_price_usd_per_l`
- `unemployment`

In [None]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg(iqr))

Update the call to `.agg()` so that it applies these functions, in order:

- `iqr`
- `np.median`

In [None]:
import numpy as np

def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg([iqr, np.median]))
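
To see the shape of the result when `.agg()` receives a list of functions, here is a small sketch on a hypothetical toy DataFrame (the `toy` frame and its columns `a` and `b` are made up). It passes the string `"median"`, which pandas accepts as the name of a built-in aggregation, alongside the custom `iqr` function:

```python
import pandas as pd

def iqr(column):
    # Difference between the 75th and 25th percentiles
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data with two numeric columns
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# One row per aggregation function, one column per original column
summary = toy.agg([iqr, "median"])
print(summary)
```

Each row of `summary` is labeled with the function name, so a single value can be looked up with `summary.loc["iqr", "a"]`.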

# Cumulative Statistics  

Cumulative statistics help track data over time. In this exercise, you'll calculate the cumulative sum and maximum of a department's weekly sales to find the total sales so far and the highest weekly sales so far.  
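
As a rough illustration of the idea, the sketch below applies `cumsum()` and `cummax()` to a hypothetical toy frame. The values and the new column names `cum_weekly_sales` and `cum_max_sales` are assumptions made for this example, not part of the dataset:

```python
import pandas as pd

# Hypothetical weekly sales for a single department, already in date order
toy = pd.DataFrame({
    "date": ["2010-02-05", "2010-02-12", "2010-02-19", "2010-02-26"],
    "weekly_sales": [100.0, 50.0, 200.0, 150.0],
})

# Running total of sales up to and including each week
toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()

# Highest single-week sales seen up to and including each week
toy["cum_max_sales"] = toy["weekly_sales"].cummax()

print(toy)
```

Both methods work row by row in the current order, so the data should be sorted by date (for example with `sort_values("date")`) before they are applied.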

Create a DataFrame `sales_1_1` with sales data for department 1 of store 1.

In [None]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/aggregating-data-using-pandas/sales_subset_store_1.csv'
sales = pd.read_csv(url)