<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/001-Sample-Notebooks/004-aggregating_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Walmart Dataset Overview  

The dataset contains weekly sales data for Walmart stores, including:  

- Store ID and type  
- Weekly sales and department IDs  
- Holiday week indicator  
- Average temperature and fuel prices  
- National unemployment rate  

The dataset can be downloaded here: [sales_subset.csv](@site/assets/datasets/pandas-aggregating-data/sales_subset.csv)

Sample rows:  

| Store | Sales    | Dept | Holiday | Temp (°C) | Fuel Price (USD/L) | Unemployment |
|-------|----------|------|---------|-----------|---------------------|--------------|
|   1   | 24924.50 |   1  |    0    |   5.73    |         0.68        |     8.11     |
|   1   | 21827.90 |   1  |    0    |   8.06    |         0.69        |     8.11     |
|   1   | 57258.43 |   1  |    0    |  16.82    |         0.72        |     7.81     |
|   1   | 17413.94 |   1  |    0    |  22.53    |         0.75        |     7.81     |
|   1   | 17558.09 |   1  |    0    |  27.05    |         0.71        |     7.81     |
|   1   | 16333.14 |   1  |    0    |  27.17    |         0.71        |     7.79     | 

# Mean and Median

Print the first few rows of `sales` using `head()`.

In [69]:
import pandas as pd

# Load from a URL so the notebook can be run in Google Colab without downloading the CSV file
url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/aggregating-data-using-pandas/sales_subset.csv'
sales = pd.read_csv(url)

print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  


Print information about the columns in `sales` using `info()`.

In [70]:
print(sales.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10774 non-null  int64  
 1   store                 10774 non-null  int64  
 2   type                  10774 non-null  object 
 3   department            10774 non-null  int64  
 4   date                  10774 non-null  object 
 5   weekly_sales          10774 non-null  float64
 6   is_holiday            10774 non-null  bool   
 7   temperature_c         10774 non-null  float64
 8   fuel_price_usd_per_l  10774 non-null  float64
 9   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(3), object(2)
memory usage: 768.2+ KB
None


Print the mean of the `weekly_sales` column.

In [71]:
print(sales["weekly_sales"].mean())

23843.95014850566


Print the median of the `weekly_sales` column.

In [72]:
print(sales["weekly_sales"].median())

12049.064999999999
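
The mean above is roughly twice the median, which usually points to a few very large values pulling the mean upward. A minimal sketch of that effect, using made-up numbers (`toy_sales` is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical values: four typical weeks and one unusually large week
toy_sales = pd.Series([10.0, 12.0, 11.0, 13.0, 200.0])

toy_mean = toy_sales.mean()      # pulled upward by the single large value
toy_median = toy_sales.median()  # middle value, unaffected by the outlier

print(toy_mean, toy_median)
```

Here the median stays close to a typical week while the mean does not, which is why it helps to look at both.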


# Dates 

Print the maximum of the `date` column.

In [73]:
print(sales["date"].max())


Print the minimum of the date column.

In [74]:
print(sales["date"].min())

2010-02-05
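
The `info()` output earlier shows that `date` is stored as `object`, meaning plain strings. Strings in `YYYY-MM-DD` format happen to sort chronologically, so `.min()` and `.max()` still work, but parsing them as datetimes is generally more robust. A small sketch, assuming a hypothetical `demo` frame that mirrors the first few dates:

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the `date` column
demo = pd.DataFrame({"date": ["2010-02-05", "2010-03-05", "2010-04-02"]})

# Parse the strings into datetime64 values so comparisons are chronological
demo["date"] = pd.to_datetime(demo["date"])

earliest = demo["date"].min()
latest = demo["date"].max()
print(earliest, latest)
```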


# Efficient Summaries  

Sometimes you need custom functions to summarize your data beyond what pandas and NumPy offer.  
The `.agg()` method applies custom functions to a DataFrame, even across multiple columns at once, for efficient aggregation.  

Example: 

```python
df['column'].agg(function)
```  
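
As a rough illustration of that pattern, the sketch below applies a custom function to one column and then to several. The `demo` frame and the `value_range` function are hypothetical and exist only for this example:

```python
import pandas as pd

# Hypothetical toy data, not taken from the sales dataset
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

def value_range(column):
    """Custom summary: difference between the largest and smallest value."""
    return column.max() - column.min()

single = demo["a"].agg(value_range)        # one column gives a scalar
multi = demo[["a", "b"]].agg(value_range)  # several columns give a Series

print(single)
print(multi)
```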

In this exercise, the custom function calculates the IQR (interquartile range), which is the difference between the 75th and 25th percentiles. Using this function, print the IQR of the `temperature_c` column of `sales`.

In [75]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales["temperature_c"].agg(iqr))

16.583333333333336


Update the column selection so that `.agg()` applies the custom `iqr` function to these columns, in order:

- `temperature_c`
- `fuel_price_usd_per_l`
- `unemployment`

In [76]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


Pass a list of aggregation functions to `.agg()`, in this order:

- `iqr`
- `np.median`

In [77]:
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099
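
Depending on the pandas version, passing a NumPy function such as `np.median` to `.agg()` may emit a `FutureWarning`. One way to sidestep that, sketched here on a hypothetical single-column `demo` frame, is to pass the built-in aggregation by its string name alongside the custom function:

```python
import pandas as pd

# Hypothetical toy data, not taken from the sales dataset
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

def iqr(column):
    # Difference between the 75th and 25th percentiles
    return column.quantile(0.75) - column.quantile(0.25)

# Mix a custom function with a built-in aggregation given as a string
summary = demo[["x"]].agg([iqr, "median"])
print(summary)
```

Each function becomes a row in the result, labeled by its name, just as in the output above.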




# Cumulative Statistics  

Cumulative statistics help track data over time. In this exercise, you'll calculate the cumulative sum and maximum of a department's weekly sales to find the total sales so far and the highest weekly sales so far.  
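
To make the idea concrete before working with the real data, here is a minimal sketch on a hypothetical series of four weekly values (`weekly` is made up for illustration):

```python
import pandas as pd

# Hypothetical weekly sales, already in date order
weekly = pd.Series([100.0, 50.0, 150.0, 120.0])

running_total = weekly.cumsum()  # total sales so far at each week
running_max = weekly.cummax()    # highest weekly sales seen so far

print(pd.DataFrame({
    "weekly": weekly,
    "running_total": running_total,
    "running_max": running_max,
}))
```

Both results depend on row order, which is why the data should be sorted by date first.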

Create a DataFrame `sales_1_1` with sales data for department 1 of store 1.

In [36]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/aggregating-data-using-pandas/sales_subset_store_1.csv'
sales_1_1 = pd.read_csv(url)

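Sort `sales_1_1` by date. Then get the cumulative sum of `weekly_sales` and add it as a new column called `cum_weekly_sales`.

In [None]:
# Sort chronologically so the cumulative values accumulate in date order
sales_1_1 = sales_1_1.sort_values("date")
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
print(sales_1_1.head())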

Get the cumulative max of `weekly_sales`, then add it as a new column called `cum_max_sales`.

In [None]:
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
print(sales_1_1.head())

View the date and weekly sales alongside the two new cumulative columns.

In [None]:
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])