# Summary statistics with pandas
This notebook loads the sales dataset and demonstrates quick dataset inspection, basic summary statistics, and a custom aggregation function.


In [2]:
import pandas as pd

sales = pd.read_csv('data/sales.csv')
sales.head()

   store type  department        date  weekly_sales  is_holiday  temperature_c  fuel_price_usd_per_l  unemployment
0      1    A           6  2009-07-11      37246.23       False         26.312                 0.655         7.704
1      1    A           7  2010-04-20      25857.08       False         11.095                 0.736         7.951
2      1    A           5  2010-10-28      13351.08       False         21.461                 0.683         7.963
3      1    A           7  2010-06-26      15998.72       False         28.705                 0.707         7.807
4      1    A           4  2010-04-29      32233.11       False         21.988                 0.773         7.810


## Basic dataset inspection and summary stats
Use `info()` to review the columns, non-null counts, and dtypes, then compute the mean and median of `weekly_sales` and find the earliest and latest values in `date`.


In [3]:
# Print a summary of the sales DataFrame
# (info() prints its report directly and returns None, so it is not wrapped in print())
sales.info()

# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

# Print the median of weekly_sales
print(sales['weekly_sales'].median())

# Print the latest date (the column holds ISO-format strings, so max() compares them as text)
print(f"Maximum date: {sales['date'].max()}")

# Print the earliest date
print(f"Minimum date: {sales['date'].min()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   store                 400 non-null    int64
 1   type                  400 non-null    object
 2   department            400 non-null    int64
 3   date                  400 non-null    object
 4   weekly_sales          400 non-null    float64
 5   is_holiday            400 non-null    bool
 6   temperature_c         400 non-null    float64
 7   fuel_price_usd_per_l  400 non-null    float64
 8   unemployment          400 non-null    float64
dtypes: bool(1), float64(4), int64(2), object(2)
memory usage: 25.5+ KB
24959.1387
24616.83
Maximum date: 2011-12-16
Minimum date: 2009-02-18
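
The `info()` output lists `date` with dtype `object`, so `min()` and `max()` compare the values as strings. That gives the right answer here only because the dates are written as `YYYY-MM-DD`. The following is a minimal sketch, using a small made-up frame named `demo` rather than the sales data, of converting the column with `pd.to_datetime` so that date arithmetic also works:

```python
import pandas as pd

# Hypothetical stand-in for the sales data: dates stored as ISO-format strings
demo = pd.DataFrame({"date": ["2010-04-20", "2009-07-11", "2010-10-28"]})

# Parse the strings into datetime64 values
demo["date"] = pd.to_datetime(demo["date"])

earliest = demo["date"].min()
latest = demo["date"].max()

# With real datetimes, subtracting two dates yields a Timedelta
span_days = (latest - earliest).days

print(earliest.date(), latest.date(), span_days)
```

The same conversion could be applied to `sales['date']`, assuming every value in that column parses as a date.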


## Custom aggregation with `agg`
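Define an interquartile range (IQR) function, the 75th percentile minus the 25th percentile, and pass it to `agg` to summarize a single column. Passing a list of functions to `agg` on several columns returns one row per function.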


In [4]:
# Pass a custom function to agg()
# Custom IQR function: 75th percentile minus 25th percentile
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))

11.3375
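
As a sanity check on the custom function, the sketch below compares it with an IQR computed from NumPy percentiles. The series `temps` is made-up data, not the sales dataset. Both libraries use linear interpolation by default, so the two results should agree.

```python
import numpy as np
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical temperatures, used only for illustration
temps = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0])

# IQR via pandas, passing the function to agg as in the cell above
iqr_pandas = temps.agg(iqr)

# IQR via NumPy percentiles on the underlying array
q1, q3 = np.percentile(temps.to_numpy(), [25, 75])
iqr_numpy = q3 - q1

print(iqr_pandas, iqr_numpy)
```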


In [7]:
# Custom IQR function (same definition as in the previous cell)
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Print the IQR and median of temperature_c, fuel_price_usd_per_l, and unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, "median"]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr           11.3375                 0.036        0.1565
median        17.1780                 0.712        7.8550
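
`agg` also accepts a dictionary that maps each column to its own list of functions, which helps when not every statistic is wanted for every column. The sketch below uses a small made-up frame named `demo`, with values chosen only for illustration. Combinations that were not requested show up as `NaN` in the result.

```python
import pandas as pd

def iqr(column):
    # 75th percentile minus 25th percentile
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data with two of the column names used above
demo = pd.DataFrame({
    "temperature_c": [10.0, 12.0, 14.0, 16.0, 18.0],
    "unemployment": [7.5, 7.7, 7.9, 8.1, 8.3],
})

# IQR and median for temperature_c, median only for unemployment
summary = demo.agg({
    "temperature_c": [iqr, "median"],
    "unemployment": ["median"],
})

print(summary)
```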
