# February DS/AL-ML + BIA Data Jam - US Consumer Behavior

## Introduction

The goal of this project is to collaboratively evaluate the claim that inflation has structurally altered U.S. consumer spending, saving, and borrowing habits. We test it against real U.S. macroeconomic data from Federal Reserve Economic Data (FRED) and present a clear, evidence-based conclusion.

## 1. Environment Setup & Data Loading

In [1]:
# Import necessary libraries
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the starter datasets
credit_owned = pd.read_csv('../data/CREDIT_OWNED.csv')
personal_expenditure = pd.read_csv('../data/PERSONAL_EXPENDITURE.csv')
saving_rate = pd.read_csv('../data/SAVING_RATE.csv')
cpi = pd.read_csv('../data/cpiaucsl.csv')

## 2. Initial Data Inspection
Before merging, we must understand the shape, completeness, and historical timelines of our individual datasets.

In [3]:
# Display the first few rows of each dataset
datasets = {
    'Credit Owned': credit_owned,
    'Personal Expenditure': personal_expenditure,
    'Saving Rate': saving_rate,
    'CPI': cpi
}

for name, df in datasets.items():
    print(f"{name} Dataset:")
    display(df.head())
    print("\n")

Credit Owned Dataset:

| | observation_date | TOTALSL |
|---|---|---|
| 0 | 1943-01-01 | 6577.83 |
| 1 | 1943-02-01 | 6463.04 |
| 2 | 1943-03-01 | 6234.21 |
| 3 | 1943-04-01 | 6125.75 |
| 4 | 1943-05-01 | 5936.26 |

Personal Expenditure Dataset:

| | observation_date | PCEC96 |
|---|---|---|
| 0 | 2007-01-01 | 11181.0 |
| 1 | 2007-02-01 | 11178.2 |
| 2 | 2007-03-01 | 11190.7 |
| 3 | 2007-04-01 | 11201.5 |
| 4 | 2007-05-01 | 11218.0 |

Saving Rate Dataset:

| | observation_date | PSAVERT |
|---|---|---|
| 0 | 1959-01-01 | 11.3 |
| 1 | 1959-02-01 | 10.6 |
| 2 | 1959-03-01 | 10.3 |
| 3 | 1959-04-01 | 11.2 |
| 4 | 1959-05-01 | 10.6 |

CPI Dataset:

| | observation_date | CPIAUCSL |
|---|---|---|
| 0 | 1947-01-01 | 21.48 |
| 1 | 1947-02-01 | 21.62 |
| 2 | 1947-03-01 | 22.0 |
| 3 | 1947-04-01 | 22.0 |
| 4 | 1947-05-01 | 21.95 |

In [4]:
# Determining the size of all the DataFrames

for name, df in datasets.items():
    print(f"{name} Dataset Shape: {df.shape}")

Credit Owned Dataset Shape: (995, 2)
Personal Expenditure Dataset Shape: (227, 2)
Saving Rate Dataset Shape: (803, 2)
CPI Dataset Shape: (949, 2)
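
Row counts alone do not show which calendar period each series covers. The following is a minimal sketch of how the start and end dates could be summarized. It uses small made-up frames in place of the real CSVs, and the `toy_datasets` dict and `date_span` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

def date_span(df, date_col="observation_date"):
    """Return (first date, last date, row count) for one dataset."""
    dates = pd.to_datetime(df[date_col])
    return dates.min().date(), dates.max().date(), len(df)

# Toy stand-ins for the real FRED files (values are made up)
toy_datasets = {
    "Credit Owned": pd.DataFrame({
        "observation_date": ["1943-01-01", "1943-02-01", "1943-03-01"],
        "TOTALSL": [1.0, 2.0, 3.0],
    }),
    "Personal Expenditure": pd.DataFrame({
        "observation_date": ["2007-01-01", "2007-02-01"],
        "PCEC96": [10.0, 11.0],
    }),
}

spans = {name: date_span(df) for name, df in toy_datasets.items()}
for name, (start, end, n_rows) in spans.items():
    print(f"{name}: {start} to {end} ({n_rows} rows)")
```

Applied to the real `datasets` dict, the same loop would show how far each series extends before and after the shared window.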


In [5]:
# Display informative summary of each dataset
for name, df in datasets.items():
    print(f"{name} Informative Summary:")
    df.info()
    print("\n")

Credit Owned Informative Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   observation_date  995 non-null    object 
 1   TOTALSL           995 non-null    float64
dtypes: float64(1), object(1)
memory usage: 15.7+ KB


Personal Expenditure Informative Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   observation_date  227 non-null    object 
 1   PCEC96            227 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.7+ KB


Saving Rate Informative Summary:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 803 entries, 0 to 802
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  


In [6]:
# Display descriptive statistics of each dataset
for name, df in datasets.items():
    print(f"{name} Descriptive Statistics:")
    display(df.describe()) 

Credit Owned Descriptive Statistics:

| | TOTALSL |
|---|---|
| count | 995.0 |
| mean | 1230814.0 |
| std | 1481781.0 |
| min | 5354.36 |
| 25% | 74641.36 |
| 50% | 480994.5 |
| 75% | 2215879.0 |
| max | 5084831.0 |

Personal Expenditure Descriptive Statistics:

| | PCEC96 |
|---|---|
| count | 227.0 |
| mean | 13176.984581 |
| std | 1724.263802 |
| min | 11068.0 |
| 25% | 11555.0 |
| 50% | 12884.0 |
| 75% | 14455.35 |
| max | 16715.4 |

Saving Rate Descriptive Statistics:

| | PSAVERT |
|---|---|
| count | 803.0 |
| mean | 8.404857 |
| std | 3.424809 |
| min | 1.4 |
| 25% | 5.7 |
| 50% | 8.3 |
| 75% | 11.1 |
| max | 31.8 |

CPI Descriptive Statistics:

| | CPIAUCSL |
|---|---|
| count | 948.0 |
| mean | 124.062082 |
| std | 89.409534 |
| min | 21.48 |
| 25% | 32.825 |
| 50% | 109.65 |
| 75% | 199.95 |
| max | 326.588 |


In [7]:
# Check for missing values and duplicates
for name, df in zip(['Credit', 'Expenditure', 'Saving', 'CPI'], [credit_owned, personal_expenditure, saving_rate, cpi]):
    print(f"{name} - Missing: {df.isnull().sum().sum()} | Duplicates: {df.duplicated().sum()}")

Credit - Missing: 0 | Duplicates: 0
Expenditure - Missing: 0 | Duplicates: 0
Saving - Missing: 0 | Duplicates: 0
CPI - Missing: 1 | Duplicates: 0
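
The summary reports one missing value in the CPI data but not where it falls. Below is a minimal sketch of how the affected rows could be listed. It runs on a made-up frame (`toy_cpi` is a hypothetical name, and its dates and values are invented), not on the real file.

```python
import numpy as np
import pandas as pd

# Toy CPI-like frame with one gap (dates and values are made up)
toy_cpi = pd.DataFrame({
    "observation_date": ["2000-01-01", "2000-02-01", "2000-03-01", "2000-04-01"],
    "CPIAUCSL": [100.0, np.nan, 100.6, 100.9],
})

# Boolean mask: True for any row with at least one missing value
missing_mask = toy_cpi.isnull().any(axis=1)
missing_rows = toy_cpi[missing_mask]
print(missing_rows)
```

Knowing the date of the gap helps decide whether to drop the row or fill it, since a gap inside the analysis window affects every feature derived from that column.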


## 3. Clean & Merge Data
**Observation:** Our inspection reveals a significant misalignment in our historical data:
* `credit_owned` contains 995 rows (starting in 1943).
* `cpi` contains 949 rows (starting in 1947), one of which has a missing value.
* `saving_rate` contains 803 rows (starting in 1959).
* `personal_expenditure` contains 227 rows (starting in 2007).

**Action:** Because most machine learning models require a uniform feature matrix without missing values, we cannot simply concatenate these files. We must convert the date columns to `datetime` and perform an **inner join** on them. This trims our timeline to start in January 2007 (the first date present in all four datasets), ensuring we only analyze periods where we have a complete macroeconomic picture. The join does not remove the missing CPI value; that row is dropped later by `dropna` during feature engineering.
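
To illustrate how an inner join trims the timeline, here is a minimal sketch on two made-up monthly series with different start dates. The frame and column names (`long_series`, `short_series`, `long_value`, `short_value`) are hypothetical. The `validate="one_to_one"` argument is an optional safeguard, not something the notebook itself uses.

```python
import pandas as pd

# Toy monthly series with different start dates (values are made up)
long_series = pd.DataFrame({
    "DATE": pd.date_range("2006-10-01", periods=6, freq="MS"),
    "long_value": range(6),
})
short_series = pd.DataFrame({
    "DATE": pd.date_range("2007-01-01", periods=3, freq="MS"),
    "short_value": [10.0, 11.0, 12.0],
})

# how="inner" keeps only dates present in both frames;
# validate="one_to_one" raises an error if either side has duplicate dates
merged = long_series.merge(
    short_series, on="DATE", how="inner", validate="one_to_one"
)
print(merged)
```

Only the three months that appear in both frames survive, so the merged frame starts at the later of the two start dates.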

In [8]:
# Standardize the date column names and cast to datetime objects
for df in [credit_owned, saving_rate, personal_expenditure, cpi]:
    # Rename the first column to 'DATE' regardless of what FRED named it
    df.rename(columns={df.columns[0]: 'DATE'}, inplace=True)
    df['DATE'] = pd.to_datetime(df['DATE'])

# Perform the Inner Merge on the unified DATE key
master_df = personal_expenditure.merge(saving_rate, on='DATE', how='inner') \
                                .merge(credit_owned, on='DATE', how='inner') \
                                .merge(cpi, on='DATE', how='inner')

# Rename columns to be more descriptive
master_df.rename(columns={
    'PCEC96': 'expenditure_billions',
    'PSAVERT': 'saving_rate_pct',
    'TOTALSL': 'credit_owned_billions',
    'CPIAUCSL': 'cpi_index',
    'DATE': 'date'
}, inplace=True)

# Ensure chronological order
master_df.sort_values('date', inplace=True)

## 4. Feature Engineering
Raw dollar levels and static percentages are difficult for ML algorithms to interpret over long periods. We will engineer new features that capture *behavioral momentum* and *macroeconomic stress*.

In [9]:
# 1. Ratio Features
master_df['credit_to_spend_ratio'] = master_df['credit_owned_billions'] / master_df['expenditure_billions']

# 2. Year-over-Year (YoY) Growth Features
master_df['spend_yoy_growth'] = master_df['expenditure_billions'].pct_change(periods=12) * 100
master_df['credit_yoy_growth'] = master_df['credit_owned_billions'].pct_change(periods=12) * 100
master_df['inflation_yoy'] = master_df['cpi_index'].pct_change(periods=12) * 100

# 3. Regime Categorization
def assign_regime(date):
    if date < pd.to_datetime('2020-03-01'):
        return '1_pre_covid'
    elif date < pd.to_datetime('2021-06-01'):
        return '2_covid_stimulus'
    else:
        return '3_post_inflation_shock'

master_df['regime'] = master_df['date'].apply(assign_regime)

# Drop rows with NaNs: the first 12 months (no prior-year value for the YoY features)
# and any row with a missing source value (e.g., the one missing CPI observation)
master_df.dropna(inplace=True)

### Engineered Features Explained
To successfully model consumer behavior, we derived the following indicators:

* **`credit_to_spend_ratio`:** A proxy for financial health. It measures how much outstanding debt consumers hold relative to their current spending levels. An increasing ratio suggests consumers are relying more heavily on credit to fund their lifestyle. Note that the credit values shown in this notebook (about 2.6 million in early 2008) are consistent with millions of dollars, while expenditure is in billions, so the ratio is about 1,000 times a like-for-like ratio (roughly 231 rather than 0.23). The trend is unaffected, but the level should not be read as dollars of debt per dollar of spending.
* **`spend_yoy_growth` & `credit_yoy_growth`:** Year-over-year percentage changes. Comparing each month with the same month one year earlier largely cancels any remaining annual seasonality (e.g., the December holiday shopping spike) and lets our models measure underlying behavioral momentum.
* **`inflation_yoy`:** The core driver of our hypothesis. This translates the raw CPI index into the inflation rate experienced by consumers over the last 12 months.
* **`regime`:** A categorical flag that splits the timeline into three economic eras: `1_pre_covid` (before March 2020), `2_covid_stimulus` (March 2020 through May 2021), and `3_post_inflation_shock` (June 2021 onward). This is critical for detecting structural breaks, as it allows our models to compare pre-shock behavior against post-shock behavior.
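
The year-over-year calculation described above can be checked on a small made-up series. This is a sketch under stated assumptions: the `toy` frame and its values are invented, and only `pct_change(periods=12)` mirrors the notebook's own code.

```python
import pandas as pd

# Toy monthly index: 100 for the first year, 103 for the second (made-up values)
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=24, freq="MS"),
    "cpi_index": [100.0] * 12 + [103.0] * 12,
})

# pct_change(periods=12) compares each month with the same month one year earlier
toy["inflation_yoy"] = toy["cpi_index"].pct_change(periods=12) * 100

# The first 12 rows have no prior-year value, so their growth rate is NaN
n_missing = int(toy["inflation_yoy"].isna().sum())
print(n_missing)
print(toy["inflation_yoy"].iloc[12])
```

Each month in the second year is about 3% above the same month a year earlier, and the first 12 rows are NaN, which is why the notebook's `dropna` removes the first year of the merged data.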

In [10]:
# Save the enriched dataset
master_df.to_csv('../data/master_df.csv', index=False)
print(f"Data successfully cleaned and saved! Final shape: {master_df.shape}")
display(master_df.head())

Data successfully cleaned and saved! Final shape: (214, 10)


| | date | expenditure_billions | saving_rate_pct | credit_owned_billions | cpi_index | credit_to_spend_ratio | spend_yoy_growth | credit_yoy_growth | inflation_yoy | regime |
|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 2008-01-01 | 11333.2 | 2.6 | 2619427.65 | 212.174 | 231.128688 | 1.361238 | 6.569798 | 4.294696 | 1_pre_covid |
| 13 | 2008-02-01 | 11293.9 | 3.0 | 2634496.42 | 212.687 | 233.267199 | 1.03505 | 6.657618 | 4.142959 | 1_pre_covid |
| 14 | 2008-03-01 | 11322.1 | 2.9 | 2645603.64 | 213.448 | 233.667221 | 1.174189 | 6.487213 | 3.974904 | 1_pre_covid |
| 15 | 2008-04-01 | 11340.5 | 2.4 | 2654243.23 | 213.942 | 234.04993 | 1.240905 | 6.436682 | 3.903761 | 1_pre_covid |
| 16 | 2008-05-01 | 11361.6 | 6.8 | 2660193.15 | 215.208 | 234.138955 | 1.280086 | 5.983113 | 4.088414 | 1_pre_covid |
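
CSV files do not store column dtypes, so the `date` column comes back as plain strings unless it is parsed on load. Below is a minimal sketch of a round trip that restores it. It writes a small made-up frame to a temporary directory, and the file name `toy_master.csv` and the frame `toy_master` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame shaped like a slice of the saved dataset (values are illustrative)
toy_master = pd.DataFrame({
    "date": pd.to_datetime(["2008-01-01", "2008-02-01"]),
    "saving_rate_pct": [2.6, 3.0],
    "regime": ["1_pre_covid", "1_pre_covid"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_master.csv")  # hypothetical file name
    toy_master.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that CSV cannot store
    reloaded = pd.read_csv(path, parse_dates=["date"])

print(reloaded.dtypes)
```

Later analysis that filters or resamples by date would likely need the same `parse_dates` argument when reading `master_df.csv`.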


## Data Analysis