<a href="https://colab.research.google.com/github/joseeden/joeden/blob/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/001-Sample-Notebooks/004-aggregating_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Walmart Dataset Overview  

The dataset contains weekly sales data for Walmart stores, including:  

- Store ID and type  
- Weekly sales and department IDs  
- Holiday week indicator  
- Average temperature and fuel prices  
- National unemployment rate  

The dataset can be downloaded here: [sales_subset.csv](@site/assets/datasets/pandas-aggregating-data/sales_subset.csv)

Sample rows:  

| Store | Sales    | Dept | Holiday | Temp (°C) | Fuel Price (USD/L) | Unemployment |
|-------|----------|------|---------|-----------|---------------------|--------------|
|   1   | 24924.50 |   1  |    0    |   5.73    |         0.68        |     8.11     |
|   1   | 21827.90 |   1  |    0    |   8.06    |         0.69        |     8.11     |
|   1   | 57258.43 |   1  |    0    |  16.82    |         0.72        |     7.81     |
|   1   | 17413.94 |   1  |    0    |  22.53    |         0.75        |     7.81     |
|   1   | 17558.09 |   1  |    0    |  27.05    |         0.71        |     7.81     |
|   1   | 16333.14 |   1  |    0    |  27.17    |         0.71        |     7.79     | 

# Mean and Median

Print the first few rows of `sales` using `head()`.

In [125]:
import pandas as pd

# Load from a URL so the notebook can be run in Google Colab without downloading the CSV file
url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/data-manipulation-using-pandas/sales_subset.csv'
sales = pd.read_csv(url)

print(sales.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  


Print information about the columns in sales using `info()`

In [126]:
# info() prints its report directly and returns None, so no print() is needed
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10774 non-null  int64  
 1   store                 10774 non-null  int64  
 2   type                  10774 non-null  object 
 3   department            10774 non-null  int64  
 4   date                  10774 non-null  object 
 5   weekly_sales          10774 non-null  float64
 6   is_holiday            10774 non-null  bool   
 7   temperature_c         10774 non-null  float64
 8   fuel_price_usd_per_l  10774 non-null  float64
 9   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(3), object(2)
memory usage: 768.2+ KB


Print the mean of the `weekly_sales` column.

In [127]:
print(sales["weekly_sales"].mean())

23843.95014850566


Print the median of the `weekly_sales` column.

In [128]:
print(sales["weekly_sales"].median())

12049.064999999999
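
The mean above is roughly double the median, which is typical when a few very large values pull the average upward. The sketch below illustrates that effect on a small, made-up series (the values and the name `toy_sales` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy values: most weeks are small, one week is very large.
toy_sales = pd.Series([100.0, 120.0, 130.0, 150.0, 5000.0])

toy_mean = toy_sales.mean()      # pulled upward by the single large value
toy_median = toy_sales.median()  # middle value, unaffected by the outlier

print(toy_mean, toy_median)
```

When the two numbers differ this much, the median is usually the better description of a "typical" week.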


# Dates 

Print the maximum of the `date` column.

In [129]:
print(sales["date"].max())



Print the minimum of the date column.

In [130]:
print(sales["date"].min())

2010-02-05
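
The `date` column is stored as strings (`object` dtype in the `info()` output). `min()` and `max()` still give sensible results because ISO-formatted dates sort correctly as text, but converting to real datetimes allows date arithmetic as well. A minimal sketch on made-up rows (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with ISO-formatted date strings, like the `date` column.
toy = pd.DataFrame({"date": ["2010-02-05", "2012-10-05", "2011-11-25"]})

# Convert the strings to datetime64 so comparisons use real dates.
toy["date"] = pd.to_datetime(toy["date"])

earliest = toy["date"].min()
latest = toy["date"].max()

print(earliest.date(), latest.date())
print((latest - earliest).days)  # subtraction only works on datetimes, not strings
```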


# Efficient Summaries  

Sometimes you need custom functions to summarize your data beyond what pandas and NumPy offer.  
The `.agg()` method lets you apply your own functions to a DataFrame or Series, and it can apply several functions to several columns in a single call.  

Example: 

```python
df['column'].agg(function)
```  

In this exercise, the custom function calculates the "IQR" (inter-quartile range), which is the difference between the 75th and 25th percentiles. Using this function, print the IQR of the `temperature_c` column of sales.

In [131]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales["temperature_c"].agg(iqr))

16.583333333333336


Update the column selection to use the custom `iqr` function with `.agg()` to print the IQR of these in order:

- `temperature_c`
- `fuel_price_usd_per_l`
- `unemployment`

In [132]:
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


Update the aggregation functions called by `.agg()` in order:

- `iqr`
- `np.median`

In [133]:
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

print(sales[[
    "temperature_c", 
    "fuel_price_usd_per_l",
    "unemployment"
]].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099
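
As a small variation, built-in aggregations can also be passed to `.agg()` by name as strings, mixed with custom functions. The sketch below uses made-up data (the frame `toy` is hypothetical) so the results are easy to check by hand:

```python
import pandas as pd

def iqr(column):
    # Inter-quartile range: 75th percentile minus 25th percentile.
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical toy data standing in for the sales columns.
toy = pd.DataFrame({
    "temperature_c": [0.0, 10.0, 20.0, 30.0, 40.0],
    "unemployment": [7.0, 7.5, 8.0, 8.5, 9.0],
})

# A list can mix a custom function with a built-in named by string.
summary = toy.agg([iqr, "median"])
print(summary)
```

Each function becomes one row of the result, labelled with the function's name, and each selected column stays a column.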




# Cumulative Statistics  

Cumulative statistics help track data over time. In this exercise, you'll calculate the cumulative sum and maximum of a department's weekly sales to find the total sales so far and the highest weekly sales so far.  

Create a DataFrame `sales_1_1` from `sales_subset_store_1.csv`, which holds the sales data for store 1.

In [134]:
import pandas as pd 

url = 'https://raw.githubusercontent.com/joseeden/joeden/refs/heads/master/docs/021-Software-Engineering/021-Jupyter-Notebooks/000-Sample-Datasets/data-manipulation-using-pandas/sales_subset_store_1.csv'
sales_1_1 = pd.read_csv(url)

Get the cumulative sum of `weekly_sales` and add it as a new column called `cum_weekly_sales`. The rows are used in their existing file order here; call `.sort_values("date")` first if a strict date ordering is needed.

In [135]:
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
print(sales_1_1.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  cum_weekly_sales  
0       5.727778              0.679451         8.106          24924.50  
1       8.055556              0.693452         8.106          46752.40  
2      16.816667              0.718284         7.808         104010.83  
3      22.527778              0.748928         7.808         121424.77  
4      27.050000              0.714586         7.808         138982.86  


Get the cumulative max of `weekly_sales`, then add as a new column called `cum_max_sales`.

In [136]:
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
print(sales_1_1.head())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  cum_weekly_sales  \
0       5.727778              0.679451         8.106          24924.50   
1       8.055556              0.693452         8.106          46752.40   
2      16.816667              0.718284         7.808         104010.83   
3      22.527778              0.748928         7.808         121424.77   
4      27.050000              0.714586         7.808         138982.86   

   cum_max_sales  
0       24924.50  
1       24924.50  
2       57258.43  
3       57258.43  
4       57258.43  


View the two new columns alongside `date` and `weekly_sales`.

In [137]:
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

           date  weekly_sales  cum_weekly_sales  cum_max_sales
0    2010-02-05      24924.50          24924.50       24924.50
1    2010-03-05      21827.90          46752.40       24924.50
2    2010-04-02      57258.43         104010.83       57258.43
3    2010-05-07      17413.94         121424.77       57258.43
4    2010-06-04      17558.09         138982.86       57258.43
..          ...           ...               ...            ...
896  2011-11-25       2400.00       18826604.55      140504.41
897  2011-12-09        435.00       18827039.55      140504.41
898  2012-02-03        330.00       18827369.55      140504.41
899  2012-04-06        140.00       18827509.55      140504.41
900  2012-10-05        635.00       18828144.55      140504.41

[901 rows x 4 columns]
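
Cumulative statistics depend on row order, so sorting matters whenever "so far" should mean "up to this date". The following sketch, on made-up rows that are deliberately out of order (the frame `toy` is hypothetical), sorts first and then accumulates:

```python
import pandas as pd

# Hypothetical toy rows, deliberately out of date order.
toy = pd.DataFrame({
    "date": ["2010-03-05", "2010-02-05", "2010-04-02"],
    "weekly_sales": [200.0, 100.0, 50.0],
})

# Sort first so the running totals follow calendar order, then accumulate.
toy = toy.sort_values("date").reset_index(drop=True)
toy["cum_weekly_sales"] = toy["weekly_sales"].cumsum()  # running total
toy["cum_max_sales"] = toy["weekly_sales"].cummax()     # highest value seen so far

print(toy)
```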


# Dropping duplicates

Removing duplicates is important for accurate counts, because you usually don't want to count the same thing more than once. 

Using the same `sales` DataFrame, remove rows with duplicate pairs of `store` and `type`, save the result as `store_types`, and print the head.

In [138]:
store_types = sales.drop_duplicates(subset=['store','type'])
print(store_types.head())

      Unnamed: 0  store type  department        date  weekly_sales  \
0              0      1    A           1  2010-02-05      24924.50   
901          901      2    A           1  2010-02-05      35034.06   
1798        1798      4    A           1  2010-02-05      38724.42   
2699        2699      6    A           1  2010-02-05      25619.00   
3593        3593     10    B           1  2010-02-05      40212.84   

      is_holiday  temperature_c  fuel_price_usd_per_l  unemployment  
0          False       5.727778              0.679451         8.106  
901        False       4.550000              0.679451         8.324  
1798       False       6.533333              0.686319         8.623  
2699       False       4.683333              0.679451         7.259  
3593       False      12.411111              0.782478         9.765  


Remove rows of `sales` with duplicate pairs of `store` and `department`, save the result as `store_depts`, and print the head.

In [139]:
store_depts = sales.drop_duplicates(subset=['store','department'])
print(store_depts.head())

    Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0            0      1    A           1  2010-02-05      24924.50       False   
12          12      1    A           2  2010-02-05      50605.27       False   
24          24      1    A           3  2010-02-05      13740.12       False   
36          36      1    A           4  2010-02-05      39954.04       False   
48          48      1    A           5  2010-02-05      32229.38       False   

    temperature_c  fuel_price_usd_per_l  unemployment  
0        5.727778              0.679451         8.106  
12       5.727778              0.679451         8.106  
24       5.727778              0.679451         8.106  
36       5.727778              0.679451         8.106  
48       5.727778              0.679451         8.106  


Subset the rows that are holiday weeks using the `is_holiday` column, then drop rows with duplicate `date` values, saving the result as `holiday_dates`. Print the holiday dates.

In [140]:
# is_holiday is already boolean, so it can be used directly as a row filter
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
print(holiday_dates["date"])

498     2010-09-10
691     2011-11-25
2315    2010-02-12
6735    2012-09-07
6810    2010-12-31
6815    2012-02-10
6820    2011-09-09
Name: date, dtype: object
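
To see what `subset` does in isolation, here is a minimal sketch on made-up rows (the frame `toy` and the `*_toy` names are hypothetical). Only the listed columns are compared, and by default the first matching row is the one kept:

```python
import pandas as pd

# Hypothetical toy data: two stores, with one repeated (store, department) pair.
toy = pd.DataFrame({
    "store": [1, 1, 2, 2, 2],
    "type": ["A", "A", "A", "A", "A"],
    "department": [1, 2, 1, 1, 2],
})

# One row per (store, type) pair; the first occurrence is kept by default.
store_types_toy = toy.drop_duplicates(subset=["store", "type"])

# One row per (store, department) pair.
store_depts_toy = toy.drop_duplicates(subset=["store", "department"])

print(len(store_types_toy), len(store_depts_toy))
```

The original row labels are preserved, which is why the index in the outputs above jumps (0, 901, 1798, ...).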


# Counting 

Counting provides a quick overview of your data and helps uncover interesting patterns or anomalies you might otherwise miss. 

Using the `store_types` and `store_depts` DataFrames from the previous section, count the number of stores of each store `type` in `store_types`.

In [141]:
store_counts = store_types["type"].value_counts()
print(store_counts)

type
A    11
B     1
Name: count, dtype: int64


Count the proportion of stores of each store `type` in `store_types`.

In [142]:
store_props = store_types["type"].value_counts(sort=True,normalize=True)
print(store_props)

type
A    0.916667
B    0.083333
Name: proportion, dtype: float64


Count the number of stores that have each `department` in `store_depts`, sorting the counts in **descending** order (the default for `value_counts()`).

In [143]:
dept_counts_sorted = store_depts["department"].value_counts()
print(dept_counts_sorted)

department
1     12
2     12
3     12
4     12
5     12
      ..
37    10
48     8
50     6
39     4
43     2
Name: count, Length: 80, dtype: int64


Compute each `department`'s share of the rows in `store_depts`, sorting the proportions in **descending** order.

In [144]:
dept_props_sorted = store_depts["department"].value_counts(sort=True,normalize=True)
print(dept_props_sorted)

department
1     0.012917
2     0.012917
3     0.012917
4     0.012917
5     0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: proportion, Length: 80, dtype: float64
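
The relationship between counts and proportions can be checked on a tiny made-up series (the name `toy_types` and its values are hypothetical): `normalize=True` simply divides each count by the total number of rows.

```python
import pandas as pd

# Hypothetical toy column of store types.
toy_types = pd.Series(["A", "A", "A", "B"], name="type")

counts = toy_types.value_counts()               # sorted in descending order by default
props = toy_types.value_counts(normalize=True)  # each count divided by the total

print(counts)
print(props)
```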


# Grouped Statistics

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this step, you calculate the total sales made at each store type. The resulting numbers can then be used to see what proportion of Walmart's total sales was made at each type.

Calculate the total `weekly_sales` over the whole dataset.

In [145]:
sales_all = sales["weekly_sales"].sum()
print(sales_all)

256894718.89999998


Subset for type "A" stores, and calculate their total weekly sales.
Do the same for type "B" and type "C" stores.

In [146]:
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.

In [147]:
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

[0.9097747 0.0902253 0.       ]


Computing statistics for a few groups is straightforward: run the same command once per group. However, this becomes tedious and error-prone as the number of groups grows. The `.groupby()` method simplifies the process by performing the calculation for all groups in a single call.

Group sales by "type", take the sum of "weekly_sales", and then store as `sales_by_type`.

In [149]:
sales_by_type = sales.groupby("type")["weekly_sales"].sum()
print(sales_by_type)



Calculate the proportion of sales at each store type by dividing by the sum of `sales_by_type`. Assign to `sales_propn_by_type`.

In [150]:
sales_propn_by_type = sales_by_type / sum(sales_by_type)
print(sales_propn_by_type)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64
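
To make the equivalence between the two approaches concrete, the sketch below computes group totals both ways on made-up rows (the frame `toy` and the `manual_*` names are hypothetical) and then derives the proportions:

```python
import pandas as pd

# Hypothetical toy data with two store types.
toy = pd.DataFrame({
    "type": ["A", "A", "B", "A", "B"],
    "weekly_sales": [100.0, 200.0, 50.0, 100.0, 50.0],
})

# Manual approach: one boolean filter per group.
manual_A = toy[toy["type"] == "A"]["weekly_sales"].sum()
manual_B = toy[toy["type"] == "B"]["weekly_sales"].sum()

# groupby approach: every group handled in a single call.
by_type = toy.groupby("type")["weekly_sales"].sum()
propn = by_type / by_type.sum()

print(propn)
```

Note that `.groupby()` only returns groups that actually occur in the data, which is why type "C" shows up as `0` in the manual result above but is absent from the grouped one.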


Group sales by "type" and "is_holiday", take the sum of `weekly_sales`, and store as `sales_by_type_is_holiday`.

In [151]:
sales_by_type_is_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)

type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64


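Get the min, max, mean, and median of `weekly_sales` for each store type using `.groupby()` and `.agg()`. Store this as `sales_stats`. Then do the same for `unemployment` and `fuel_price_usd_per_l`, storing the result as `unemp_fuel_stats`.

In [152]:
# Passing the aggregation names as strings gives the same result as the
# NumPy functions and avoids pandas' FutureWarning for np.min, np.max, etc.
stats = ["min", "max", "mean", "median"]

sales_stats = sales.groupby("type")["weekly_sales"].agg(stats)
print(sales_stats)

unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg(stats)
print(unemp_fuel_stats)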

         min        max          mean    median
type                                           
A    -1098.0  293966.05  23674.667242  11943.92
B     -798.0  232558.51  25696.678370  13336.08
     unemployment                         fuel_price_usd_per_l            \
              min    max      mean median                  min       max   
type                                                                       
A           3.879  8.992  7.972611  8.067             0.664129  1.107410   
B           7.170  9.765  9.279323  9.199             0.760023  1.107674   

                          
          mean    median  
type                      
A     0.744619  0.735455  
B     0.805858  0.803348  



# Pivot Tables 

In pandas, pivot tables are essentially another way of performing grouped calculations. That is, the `.pivot_table()` method is an alternative to `.groupby()`.

Get the mean `weekly_sales` by type using `.pivot_table()` and store as `mean_sales_by_type`.

In [153]:
mean_sales_by_type = sales.pivot_table(
    values="weekly_sales",
    index="type"
)

print(mean_sales_by_type)

      weekly_sales
type              
A     23674.667242
B     25696.678370
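
Since `.pivot_table()` aggregates with the mean by default, the table above should match a grouped mean. A minimal sketch of that correspondence on made-up rows (the frame `toy` and the `via_*` names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two store types.
toy = pd.DataFrame({
    "type": ["A", "A", "A", "B", "B"],
    "weekly_sales": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# pivot_table returns a DataFrame and uses the mean unless aggfunc says otherwise.
via_pivot = toy.pivot_table(values="weekly_sales", index="type")

# The same numbers from groupby, returned as a Series.
via_groupby = toy.groupby("type")["weekly_sales"].mean()

print(via_pivot)
print(via_groupby)
```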


Get the mean and median (using NumPy functions) of `weekly_sales` by `type` and store as `mean_med_sales_by_type`.

In [155]:
import numpy as np

mean_med_sales_by_type = sales.pivot_table(
    values="weekly_sales",
    index="type",
    aggfunc=[np.mean, np.median]
)

print(mean_med_sales_by_type)

              mean       median
      weekly_sales weekly_sales
type                           
A     23674.667242     11943.92
B     25696.678370     13336.08




Get the mean of `weekly_sales` by `type` and `is_holiday` and store as `mean_sales_by_type_holiday`.

In [156]:
mean_sales_by_type_holiday = sales.pivot_table(
    values="weekly_sales",
    index="type",
    columns="is_holiday"
)

print(mean_sales_by_type_holiday)

is_holiday         False      True 
type                               
A           23768.583523  590.04525
B           25751.980533  810.70500


Print the mean `weekly_sales` by `department` and `type`, filling in any missing values with 0.

In [157]:
by_dept = sales.pivot_table(
    values="weekly_sales",
    index="type",
    columns="department",
    fill_value=0
)

print(by_dept)

department            1              2             3             4   \
type                                                                  
A           30961.725379   67600.158788  17160.002955  44285.399091   
B           44050.626667  112958.526667  30580.655000  51219.654167   

department            5             6             7             8   \
type                                                                 
A           34821.011364   7136.292652  38454.336818  48583.475303   
B           63236.875000  10717.297500  52909.653333  90733.753333   

department            9             10  ...            90            91  \
type                                    ...                               
A           30120.449924  30930.456364  ...  85776.905909  70423.165227   
B           66679.301667  48595.126667  ...  14780.210000  13199.602500   


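Print the mean `weekly_sales` by `department` and `type`, filling in any missing values with 0 and adding margins: an `All` row and an `All` column that summarize each column and each row. Because the aggregation here is the default mean, the margins hold means rather than sums.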

In [None]:
by_dept = sales.pivot_table(
    values="weekly_sales",
    index="type",
    columns="department",
    fill_value=0,
    margins=True
)

print(by_dept)
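
To check what the margins contain, the sketch below runs the same kind of call on three made-up rows (the frame `toy` is hypothetical), small enough to verify by hand:

```python
import pandas as pd

# Hypothetical toy data: type "B" has no rows for department 2.
toy = pd.DataFrame({
    "type": ["A", "A", "B"],
    "department": [1, 2, 1],
    "weekly_sales": [100.0, 300.0, 200.0],
})

with_margins = toy.pivot_table(
    values="weekly_sales",
    index="type",
    columns="department",
    fill_value=0,   # the empty (B, 2) cell becomes 0
    margins=True,   # adds an "All" row and an "All" column
)

print(with_margins)
```

The `All` entries are means of the underlying rows, so the bottom-right cell is the overall mean of `weekly_sales`, not a grand total.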