# Cross-validation
Cross-validation is a validation framework that splits the training data into K parts (folds) and uses each fold in turn as the validation data.

We fit the model K times, each time training on K-1 folds and validating on the one fold that was held out.

We summarize the cross-validation result by averaging the K validation scores.
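
To make the fold mechanics concrete, here is a minimal sketch on a toy array of ten rows. The array and the names `toy_X`, `toy_kf`, and `seen_in_validation` are illustrative assumptions, not part of the insurance data. It prints the train and validation indices for each fold and checks that every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, purely illustrative (not the insurance dataset)
toy_X = np.arange(10).reshape(-1, 1)

# 5 folds over 10 rows gives 2 validation rows per fold
toy_kf = KFold(n_splits=5, shuffle=True, random_state=0)

seen_in_validation = []
for fold, (train_ind, val_ind) in enumerate(toy_kf.split(toy_X), start=1):
    print(f"Fold {fold}: train={train_ind.tolist()} validate={val_ind.tolist()}")
    seen_in_validation.extend(val_ind.tolist())

# Every row index appears in exactly one validation fold
print(sorted(seen_in_validation))
```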

It is also useful to look at how much the score fluctuates from fold to fold. Consistent performance across folds, or only small differences, is a good sign. Large gaps between validation folds are not: if the validation score bounces around between 0.2 and 0.6, that is a sign of a lot of variance in our model.
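
One way to quantify that fluctuation is to report the standard deviation and the max-min range alongside the mean. The sketch below uses two sets of fold scores that are invented for illustration (they are not results from this notebook), and `summarize_folds` is a hypothetical helper name.

```python
import numpy as np

def summarize_folds(scores):
    """Return the mean, standard deviation, and max-min range of fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(), scores.max() - scores.min()

# Hypothetical fold scores, invented for illustration only
stable_r2s = [0.74, 0.73, 0.70, 0.75, 0.72]
unstable_r2s = [0.20, 0.60, 0.35, 0.55, 0.25]

for name, scores in [("stable", stable_r2s), ("unstable", unstable_r2s)]:
    mean, std, spread = summarize_folds(scores)
    print(f"{name}: mean={mean:.3f} std={std:.3f} range={spread:.3f}")
```

A small standard deviation relative to the mean suggests the model behaves similarly no matter which rows it is validated on.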

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

insurance_df = pd.read_csv('../Course Materials/Data/insurance.csv')

In [7]:
# Encode smoker status as a 0/1 flag
insurance_df['smoker_flag'] = np.where(insurance_df['smoker'] == 'yes', 1, 0)
features = ['age', 'bmi', 'children', 'smoker_flag']

X = sm.add_constant(insurance_df[features])
y = insurance_df['charges']

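# Train-test split: hold out 20% of the rows as the test set.
# X and y are overwritten with the remaining 80%.
X, X_test, y, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining 80% again: 25% of it (20% of all rows) becomes the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=42)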

model = sm.OLS(y_train, X_train).fit()
model.summary()

Dep. Variable:,charges,R-squared:,0.745
Model:,OLS,Adj. R-squared:,0.744
Method:,Least Squares,F-statistic:,582.3
Date:,"Fri, 19 Dec 2025",Prob (F-statistic):,8.9e-235
Time:,12:10:37,Log-Likelihood:,-8099.6
No. Observations:,802,AIC:,16210.0
Df Residuals:,797,BIC:,16230.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.226e+04,1183.523,-10.361,0.000,-1.46e+04,-9938.840
age,253.8465,14.975,16.951,0.000,224.451,283.242
bmi,332.2119,34.140,9.731,0.000,265.197,399.227
children,400.9672,168.775,2.376,0.018,69.671,732.263
smoker_flag,2.326e+04,517.665,44.935,0.000,2.22e+04,2.43e+04

Omnibus:,184.908,Durbin-Watson:,2.106
Prob(Omnibus):,0.0,Jarque-Bera (JB):,429.63
Skew:,1.228,Prob(JB):,5.09e-94
Kurtosis:,5.613,Cond. No.,291.0


In [8]:
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score as r2

In [None]:
kf = KFold(n_splits=5, shuffle=True, random_state=32)

# KFold assigns every row in the training data to exactly one of the 5 validation folds

In [10]:
from sklearn.metrics import mean_absolute_error as mae

In [11]:
# Create a list to store validation scores for each fold
cv_lm_r2s = []
cv_lm_mae = []

# Loop through each fold in X and y
for train_ind, val_ind in kf.split(X,y):
    # Subset data based on CV folds
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_val, y_val = X.iloc[val_ind], y.iloc[val_ind]

    # Fit the model on this fold's training data
    model = sm.OLS(y_train, X_train).fit()

    # Score the held-out fold and append the results
    val_pred = model.predict(X_val)
    cv_lm_r2s.append(r2(y_val, val_pred))
    cv_lm_mae.append(mae(y_val, val_pred))

print("All Validation R2s: ", [round(x, 3) for x in cv_lm_r2s])
print(f"Cross Val R2s: {round(np.mean(cv_lm_r2s), 3)} +- {round(np.std(cv_lm_r2s), 3)}")

All Validation R2s:  [0.745, 0.735, 0.692, 0.756, 0.723]
Cross Val R2s: 0.73 +- 0.022


In [12]:
print("All validation MAEs: ", [round(x,3) for x in cv_lm_mae])
print(f'Cross Val MAEs: {round(np.mean(cv_lm_mae), 3)}+-{round(np.std(cv_lm_mae), 3)}')

All validation MAEs:  [4580.921, 3726.997, 4319.661, 4497.197, 4114.804]
Cross Val MAEs: 4247.916+-305.698
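
For comparison, scikit-learn's `cross_validate` can run the same kind of loop in a single call. The sketch below is not the notebook's method: it swaps in scikit-learn's `LinearRegression` for the statsmodels OLS model, and it uses synthetic data (`X_demo`, `y_demo`) generated here only so the example runs on its own. All `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, for illustration only (not the insurance dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=32)

# cross_validate fits one model per fold and returns an array of scores per metric
cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=kf_demo,
    scoring=("r2", "neg_mean_absolute_error"),
)

demo_r2s = cv_results["test_r2"]
# scikit-learn reports MAE negated so that higher is always better; flip the sign back
demo_maes = -cv_results["test_neg_mean_absolute_error"]

print(f"Cross Val R2s: {demo_r2s.mean():.3f} +- {demo_r2s.std():.3f}")
print(f"Cross Val MAEs: {demo_maes.mean():.3f} +- {demo_maes.std():.3f}")
```

The manual loop is still useful when the model is not a scikit-learn estimator, as with `sm.OLS` above, or when each fold needs custom handling.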
