# Resampling
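We have been talking about splitting the data into train/test sets and using cross-validation to check whether a model overfits the data. In this exercise, we will try several resampling approaches: a single train/test split, repeated random splits, leave-one-out cross-validation, and $K$-fold cross-validation.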

In [None]:
# Import our most-used packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Import useful ones from the sklearn package 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, LeaveOneOut
from sklearn.model_selection import KFold, cross_val_score

In [None]:
# Set the style for plotting
sns.set_style('white')

## The Data
We have been given sample data from our customers. The data has been aggregated from individual purchases / transactions across the time period (e.g., last year). The goal is to see if we can predict the spending amounts for the next time period (e.g., next year).

In [None]:
# Read in the data and print out its shape
cust = pd.read_csv('./data/customers_clean.csv')
print(cust.shape)

In [None]:
# Look at info
cust.info()

In [None]:
# Sample the data
cust.sample(5)

In [None]:
# Using .value_counts is useful for categorical columns
cust.value_counts(subset=['gender', 'marital_status', 'home_ownership'],
                 normalize=True)

In [None]:
# See summary statistics
cust.describe()

In [None]:
# We can also use .describe() on all columns
# This can be useful to see if we might need to clean the data
cust.describe(include='all')

## End Result for Input
We want to have all numerical variables. This means we should create *dummy* variables for `gender`, `marital_status`, and `home_ownership`. We also do not need `cust_id` since it is just a unique id (now that we have dropped duplicates). The two date columns could be used to create numerical values, but we will simply ignore them for now.

In [None]:
# We want to create dummy variables for gender, marital_status,
# and home_ownership
pd.get_dummies(cust[['gender','marital_status','home_ownership']],
              dtype=int)

In [None]:
# Let's drop the following columns:
# cust_id, join_date, last_purchase_date
new_cust = cust.drop(columns=['cust_id','join_date','last_purchase_date'])
new_cust.info()

## Create Dummy Variables

If your `DataFrame` contains categorical (or object) columns, you can call `pd.get_dummies(your_dataframe)` to create the dummy variables for **every** categorical column in the `DataFrame`. Since we deleted the "extra" columns, let's try it on our `new_cust` variable and see the results.

In [None]:
# Create all dummies on all categorical columns
pd.get_dummies(new_cust, dtype=int)

In [None]:
# Really just want k-1 dummies for k categories
# We can use the argument drop_first=True
pd.get_dummies(new_cust, dtype=int, drop_first=True)

In [None]:
# Run it again and save it in a new DataFrame
data = pd.get_dummies(new_cust, dtype=int, drop_first=True)
data.info()
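
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its category values are assumptions for illustration only and are not taken from the customer file.

```python
import pandas as pd

# Hypothetical toy data; the category values are made up for illustration
toy = pd.DataFrame({
    'age': [34, 51, 27, 45],
    'home_ownership': ['own', 'rent', 'other', 'rent'],
})

# One dummy column per category (k columns for k categories)
full = pd.get_dummies(toy, dtype=int)

# k-1 dummy columns: the first category in sorted order becomes the baseline
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With three categories, the reduced frame keeps two dummy columns. A row with zeros in both belongs to the dropped baseline category.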

In [None]:
# Look at .describe()
data.describe()

## Some Visualizations

Let's try to look at a few visualizations for the attributes that we have.

In [None]:
# We can create histograms for each variable in data now
data.hist()

In [None]:
# Try boxplots?
data.boxplot()

In [None]:
# Could loop over every column and make a displot
for i in data.columns:
    sns.displot(data[i])

In [None]:
# What about correlations?
data.corr()

In [None]:
# Create a heatmap of correlations
sns.heatmap(data.corr(), vmin=-1, vmax=1, annot=True, linewidths=0.5)

In [None]:
# Try a pairplot
sns.pairplot(data)

## Time for Resampling Attempts
We first need to define `X` (the predictors) and `y` (the output, `spend`). Then we can try a train/test split.

In [None]:
# define the output variable, y
y = data.spend

# define the X
X = data.drop('spend', axis=1)

In [None]:
# Look at shape of X
X.shape

In [None]:
# Look at shape of y
y.shape

In [None]:
# Time to split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.2,
                                                   random_state=163)

In [None]:
# Look at shape of X_train
X_train.shape

In [None]:
# Look at shape of X_test
X_test.shape

In [None]:
# Kick out summary statistics for X_train
X_train.describe()

In [None]:
# Kick out summary statistics for X_test
# Hope these are close to X_train stats
X_test.describe()

In [None]:
# Time to run a regression
reg = LinearRegression()
reg.fit(X_train, y_train)
print(f'Intercept:    {reg.intercept_}')
print(f'Coefficients: {reg.coef_}')

In [None]:
# We can compute the training R-squared, MSE, and RMSE
trainR2 = r2_score(y_train, reg.predict(X_train))
trainMSE = mean_squared_error(y_train, reg.predict(X_train))
trainRMSE = np.sqrt(trainMSE)

print(f'Training R-squared is: {trainR2:.2%}')
print(f'          and MSE is:  {trainMSE:.2f}')
print(f'          and RMSE is: {trainRMSE:.2f}')

In [None]:
# Really interested in the test metrics
pred = reg.predict(X_test)
rSquare = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)

print(f'Test R-squared is: {rSquare:.2%}')
print(f'      and MSE is:  {mse:.2f}')
print(f'      and RMSE is: {rmse:.2f}')
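
For reference, the three metrics above follow the standard definitions. The notation is introduced here and is not part of the original notebook: $y_i$ is the observed spend, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the observed values in the set being scored, and $n$ is the number of observations in that set.

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\text{RMSE} = \sqrt{\text{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

RMSE is in the same units as `spend`, which usually makes it easier to interpret than MSE.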

## Try Different Splits
We said that a different 80/20 split will give us different metrics. We hope that they are not "too different". Let's split the full data set 10 different ways and see how the metrics differ. We can also calculate the average MSE, etc.

In [None]:
# Set the random number generator seed 
np.random.seed(131)

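# Create 10 distinct random_state values for splitting
# (replace=False avoids duplicate seeds, which would overwrite dictionary keys below)
random_states = np.random.choice(range(1,500), 10, replace=False)
random_states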

In [None]:
# Let's loop through random_states, split the data, fit the model,
# and calculate (and capture) the metrics
# Create 6 empty dictionaries, each keyed by the random_state used
trainR2s = {}
trainMSEs = {}
trainRMSEs = {}
testR2s = {}
testMSEs = {}
testRMSEs = {}

# Use our good friend the for loop
for i in random_states:
    # split the data using the current random_state value, stored in i
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                       random_state=i)
    
    # Build and fit the regression model
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    
    # Make predictions for the training set and calculate metrics
    trainPred = reg.predict(X_train)
    trainR2s[i] = r2_score(y_train, trainPred)
    trainMSEs[i] = mean_squared_error(y_train, trainPred)
    trainRMSEs[i] = np.sqrt(mean_squared_error(y_train, trainPred))
    
    # Make predictions for the test set and calculate metrics
    testPred = reg.predict(X_test)
    testR2s[i] = r2_score(y_test, testPred)
    testMSEs[i] = mean_squared_error(y_test, testPred)
    testRMSEs[i] = np.sqrt(mean_squared_error(y_test, testPred))

In [None]:
# Print out training metric dictionaries
print(trainR2s)
print(trainMSEs)
print(trainRMSEs)

In [None]:
# Make these dictionaries into DataFrames
trainingR2s = pd.DataFrame.from_dict(trainR2s, orient='index', columns=['Training_R2'])
trainingMSEs = pd.DataFrame.from_dict(trainMSEs, orient='index', columns=['Training_MSE'])
trainingRMSEs = pd.DataFrame.from_dict(trainRMSEs, orient='index', columns=['Training_RMSE'])

# Combine them into a single DataFrame
training_metrics = pd.concat([trainingR2s, trainingMSEs, trainingRMSEs], axis='columns')
training_metrics

In [None]:
# Print out test metric dictionaries
print(testR2s)
print(testMSEs)
print(testRMSEs)

In [None]:
# Make the test dictionaries into DataFrames
testingR2s = pd.DataFrame.from_dict(testR2s, orient='index', columns=['Testing_R2'])
testingMSEs = pd.DataFrame.from_dict(testMSEs, orient='index', columns=['Testing_MSE'])
testingRMSEs = pd.DataFrame.from_dict(testRMSEs, orient='index', columns=['Testing_RMSE'])

# Combine them into a single DataFrame
testing_metrics = pd.concat([testingR2s, testingMSEs, testingRMSEs], axis='columns')
testing_metrics

In [None]:
# Look at the average RMSE for training and test sets
print(f'Avg Training RMSE: {training_metrics.Training_RMSE.mean():.2f}')
print(f'Avg Test RMSE:     {testing_metrics.Testing_RMSE.mean():.2f}')

In [None]:
sns.scatterplot(x=list(range(1,11)),
                y=training_metrics.Training_RMSE,
                label='Train RMSE',
                color='blue')
sns.scatterplot(x=list(range(1,11)),
                y=testing_metrics.Testing_RMSE,
                label='Test RMSE',
                color='orange')
plt.ylabel('RMSE')
plt.xlabel('Split #')
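
The repeated-split experiment above can also be written more compactly by returning one dictionary of metrics per split and building a single DataFrame from the list. The sketch below is one possible arrangement, not the notebook's method. It runs on synthetic data, and `X_demo`, `y_demo`, and `split_metrics` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: this is not the customer data)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['x1', 'x2', 'x3'])
y_demo = 5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=200)

def split_metrics(X, y, seed, test_size=0.2):
    """Fit on one random split and return train/test metrics as a dict."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    row = {'seed': seed}
    parts = {'train': (X_tr, y_tr), 'test': (X_te, y_te)}
    for name, (X_part, y_part) in parts.items():
        pred = model.predict(X_part)
        mse = mean_squared_error(y_part, pred)
        row[f'{name}_R2'] = r2_score(y_part, pred)
        row[f'{name}_MSE'] = mse
        row[f'{name}_RMSE'] = np.sqrt(mse)
    return row

# One row per split, one column per metric
seeds = range(10)
metrics = pd.DataFrame(
    [split_metrics(X_demo, y_demo, s) for s in seeds]).set_index('seed')
print(metrics[['train_RMSE', 'test_RMSE']].mean())
```

Keeping train and test metrics in one DataFrame makes it easy to compare them side by side or to call `.describe()` on all of them at once.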

## What About Leave-One-Out CV?
Can we do it?

In [None]:
# LeaveOneOut really likes numpy arrays
# Create numpy arrays out of X and y
XArray = X.to_numpy()
yArray = y.to_numpy()
print(XArray)

In [None]:
# Create a LeaveOneOut object and call get_n_splits
loo = LeaveOneOut()
loo.get_n_splits(XArray)

In [None]:
# Create an empty list to store the test RMSEs from loo
looRMSEs = []

In [None]:
for i, (train_index, test_index) in enumerate(loo.split(XArray)):
    print(f'Fold {i}')
    print(f'   Train index = {train_index}')
    print(f'   Test index  = {test_index}')
    if i == 4:
        break

In [None]:
# Loop over all of the splits
for i, (train_index, test_index) in enumerate(loo.split(XArray)):
    X_train, X_test = XArray[train_index], XArray[test_index]
    y_train, y_test = yArray[train_index], yArray[test_index]
    # Fit the model
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    # make prediction
    pred = reg.predict(X_test)
    # find RMSE and store it
    looRMSEs.append(np.sqrt(mean_squared_error(y_test, pred)))

In [None]:
# see the average RMSE
np.mean(looRMSEs)

### What if we wanted to replicate LOOCV with cross_val_score?
We can do it!

In [None]:
# See what scorers we have for metrics
# sklearn.metrics.SCORERS was removed in scikit-learn 1.3;
# get_scorer_names() lists the available scoring strings instead
from sklearn.metrics import get_scorer_names
sorted(get_scorer_names())

In [None]:
# Create a LinearRegression object
# call cross_val_score
reg = LinearRegression()
cvScores = cross_val_score(reg, X, y, cv=len(X),
                          scoring='neg_mean_squared_error')

In [None]:
# print out average RMSE ... should be same as above
np.mean(np.sqrt(np.absolute(cvScores)))
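
One detail worth checking: in leave-one-out CV each test set holds a single observation, so the RMSE of a fold is just the absolute error for that observation. Averaging the per-fold RMSEs therefore gives the mean absolute error, while a pooled RMSE takes the square root after averaging the squared errors. The sketch below illustrates this on synthetic data. `X_demo` and `y_demo` are made-up stand-ins, not the customer data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in data (assumption: not the customer data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 2))
y_demo = 3 + X_demo @ np.array([1.5, -2.0]) + rng.normal(size=40)

loo = LeaveOneOut()
reg = LinearRegression()

# One score per left-out observation
neg_mse = cross_val_score(reg, X_demo, y_demo, cv=loo,
                          scoring='neg_mean_squared_error')
neg_mae = cross_val_score(reg, X_demo, y_demo, cv=loo,
                          scoring='neg_mean_absolute_error')

mean_of_fold_rmses = np.mean(np.sqrt(-neg_mse))  # average of per-fold RMSEs
mean_abs_error = np.mean(-neg_mae)               # LOOCV mean absolute error
pooled_rmse = np.sqrt(np.mean(-neg_mse))         # sqrt of the average squared error

print(f'Mean of fold RMSEs: {mean_of_fold_rmses:.4f}')
print(f'Mean absolute error: {mean_abs_error:.4f}')
print(f'Pooled LOOCV RMSE:  {pooled_rmse:.4f}')
```

The first two numbers match, and the pooled RMSE is never smaller than either of them. Which one to report depends on whether large errors should be penalized more heavily.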

## Try $K$-Fold CV

Remember $K$ is the number of folds.
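
As a quick illustration of what the folds look like, the sketch below splits ten made-up rows into $K=5$ folds with `KFold` and prints the row indices used for training and testing in each fold. The array `toy_X` is hypothetical and stands in for any data set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up rows; only the row positions matter here
toy_X = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # no shuffling: folds are consecutive blocks of rows
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(toy_X)):
    print(f'Fold {fold}: train={train_idx}, test={test_idx}')
    test_folds.append(test_idx)
```

Every row lands in the test set exactly once, and each model is trained on the remaining $K-1$ folds.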

In [None]:
# Let's try K=5 fold CV
folds = 5
reg = LinearRegression()
cvMSE = cross_val_score(reg, X, y, cv=folds, scoring='neg_mean_squared_error')
cvR2 = cross_val_score(reg, X, y, cv=folds)

In [None]:
# Print out RMSEs
print(np.sqrt(np.absolute(cvMSE)))

In [None]:
# Print out R-squareds
print(cvR2)

In [None]:
# Print out the average RMSE and average R-squared
print(f'Avg CV RMSE = {np.mean(np.sqrt(np.absolute(cvMSE))):.2f}')
print(f'Avg CV R2   = {np.mean(cvR2):.2%}')

In [None]:
# We can also use KFold
folds = 5
kf5 = KFold(n_splits=folds)
reg = LinearRegression()
cvMSE2 = cross_val_score(reg, X, y, cv=kf5, scoring='neg_mean_squared_error')
cvR2_2 = cross_val_score(reg, X, y, cv=kf5)

In [None]:
print(f'CV RMSE = {np.mean(np.sqrt(np.absolute(cvMSE2))):.2f}')
print(f'CV R2   = {np.mean(cvR2_2):.2%}')

In [None]:
# Let's try shuffling the folds
folds = 5
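# Set shuffle=True and fix random_state so that both calls to
# cross_val_score below use the same shuffled folds
kf5 = KFold(n_splits=folds, shuffle=True, random_state=163)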
reg = LinearRegression()
cvMSE3 = cross_val_score(reg, X, y, cv=kf5, scoring='neg_mean_squared_error')
cvR2_3 = cross_val_score(reg, X, y, cv=kf5)

print(f'CV RMSE = {np.mean(np.sqrt(np.absolute(cvMSE3))):.2f}')
print(f'CV R2   = {np.mean(cvR2_3):.2%}')
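
When several metrics are wanted for the same folds, scikit-learn's `cross_validate` accepts a list of scorers and evaluates all of them on identical splits in one pass. The following is a minimal sketch on synthetic data. `X_demo`, `y_demo`, and the seed values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption: not the customer data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = 1 + X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

# Shuffled folds with a fixed seed so the split is reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each scorer name produces a 'test_<name>' entry in the results dictionary
results = cross_validate(LinearRegression(), X_demo, y_demo, cv=kf,
                         scoring=['neg_mean_squared_error', 'r2'])

fold_rmse = np.sqrt(-results['test_neg_mean_squared_error'])
fold_r2 = results['test_r2']

print(f'CV RMSE = {fold_rmse.mean():.2f}')
print(f'CV R2   = {fold_r2.mean():.2%}')
```

This also fits the model only once per fold, rather than once per fold for each metric.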

## How Many Folds?

Let's try several different values for $K$, the number of folds.

In [None]:
# Create 2 dictionaries to hold results
avgRMSEs = {}
stdRMSEs = {}

# Loop over multiple k for the folds storing results
for i in range(2, 16):
    reg = LinearRegression()
    cvMSE = cross_val_score(reg, X, y, cv=i,
                           scoring='neg_mean_squared_error')
    avgRMSEs[i] = np.mean(np.sqrt(np.absolute(cvMSE)))
    stdRMSEs[i] = np.std(np.sqrt(np.absolute(cvMSE)))

In [None]:
# Print out the average RMSEs and their standard deviations
print(avgRMSEs)
print(stdRMSEs)

In [None]:
# Plot the avg RMSE for each value of k
sns.scatterplot(x=list(avgRMSEs.keys()), y=list(avgRMSEs.values()))
plt.xlabel('K used in cross-validation')
plt.ylabel('Average RMSE')

In [None]:
# Plot the standard deviation for each k used
sns.scatterplot(x=list(stdRMSEs.keys()), y=list(stdRMSEs.values()))
plt.xlabel('K used in cross-validation')
plt.ylabel('Standard Deviation of RMSE')

## What About Predicting?

You have been given a new list of customers. You want to predict how much they will spend next year. What should we do? The file is in `new_cust.csv`.
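
One possible approach, sketched here on made-up data because the layout of `new_cust.csv` is not shown: refit the model on all labelled customers, apply the same dummy-variable step to the new customers, and align the new columns with the training columns before predicting. The tables `train` and `new_customers` and the names `X_all`, `y_all`, and `final_model` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical labelled data standing in for the cleaned customer table
train = pd.DataFrame({
    'age': [25, 40, 33, 58, 47, 29],
    'gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'spend': [120.0, 310.0, 205.0, 480.0, 350.0, 150.0],
})
train_data = pd.get_dummies(train, dtype=int, drop_first=True)
X_all = train_data.drop('spend', axis=1)
y_all = train_data['spend']

# Refit on every labelled row once the resampling checks look acceptable
final_model = LinearRegression().fit(X_all, y_all)

# Hypothetical new customers; note that only one gender appears here
new_customers = pd.DataFrame({'age': [36, 52], 'gender': ['M', 'M']})

# Create all dummies, then keep exactly the training columns in the same order.
# Using drop_first=True here could drop a different category than in training.
new_X = pd.get_dummies(new_customers, dtype=int)
new_X = new_X.reindex(columns=X_all.columns, fill_value=0)

predicted_spend = final_model.predict(new_X)
print(predicted_spend)
```

The `reindex` step guards against new data that is missing a category (or contains the baseline category) by forcing the same column layout the model was trained on.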

**&copy; 2022 - Present: Matthew D. Dean, Ph.D.   
Clinical Associate Professor of Business Analytics at William \& Mary.**