## Assignment: Data Splitting

1. Split your data into training and test sets.
2. Use cross-validation to fit a model on all numeric features. Report the R2 value for each validation fold.
3. Fit your model on all of your training data and score it on the test set.
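
The three steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the assignment solution: it assumes scikit-learn's `LinearRegression` and `cross_val_score` in place of the statsmodels loop used in this notebook, and the `*_demo` arrays are made-up stand-ins for the real features and target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Hypothetical stand-in data: 500 rows, 3 numeric features, linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Step 1: hold out 20% of the rows as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 2: 5-fold cross-validation on the training portion only
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)
fold_r2 = cross_val_score(LinearRegression(), X_tr, y_tr, cv=kf_demo, scoring="r2")
print("Fold R2s:", np.round(fold_r2, 3))

# Step 3: refit on all training rows, then score once on the held-out test set
final_model = LinearRegression().fit(X_tr, y_tr)
test_r2 = final_model.score(X_te, y_te)
print("Test R2:", round(test_r2, 3))
```

The test rows are never seen during cross-validation, so the final test score is an independent check on the fold scores.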

In [4]:
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import numpy as np

computers = pd.read_csv("../Data/Computers.csv")

computers.tail()

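| Unnamed: 0 | price | speed | hd | ram | screen | cd | multi | premium | ads | trend |
|---|---|---|---|---|---|---|---|---|---|---|
| 6254 | 1690 | 100 | 528 | 8 | 15 | no | no | yes | 39 | 35 |
| 6255 | 2223 | 66 | 850 | 16 | 15 | yes | yes | yes | 39 | 35 |
| 6256 | 2654 | 100 | 1200 | 24 | 15 | yes | no | yes | 39 | 35 |
| 6257 | 2195 | 100 | 850 | 16 | 15 | yes | no | yes | 39 | 35 |
| 6258 | 2490 | 100 | 850 | 16 | 17 | yes | no | yes | 39 | 35 |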


### Test Data Split

In [5]:
from sklearn.model_selection import train_test_split


features = ["speed", "hd", "ram", "screen", "ads", "trend"]

X = sm.add_constant(computers[features])
y = computers["price"]

# Test Split
X, X_test, y, y_test = train_test_split(X, y, test_size=.2, random_state=12345)

### Cross Validation Loop

In [6]:
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score as r2
from sklearn.metrics import mean_absolute_error as mae



kf = KFold(n_splits=5, shuffle=True, random_state=2023)

# Create a list to store validation scores for each fold
cv_lm_r2s = []
cv_lm_mae = []

# Loop through each fold in X and y
for train_ind, val_ind in kf.split(X, y):
    # Subset data based on CV folds
    X_train, y_train = X.iloc[train_ind], y.iloc[train_ind]
    X_val, y_val = X.iloc[val_ind], y.iloc[val_ind]
    # Fit the Model on fold's training data
    model = sm.OLS(y_train, X_train).fit()
    # Append validation scores to the lists
    cv_lm_r2s.append(r2(y_val, model.predict(X_val)))
    cv_lm_mae.append(mae(y_val, model.predict(X_val)))

print("All Validation R2s: ", [round(x, 3) for x in cv_lm_r2s])
print(f"Cross Val R2s: {round(np.mean(cv_lm_r2s), 3)} +- {round(np.std(cv_lm_r2s), 3)}")

print("All Validation MAEs: ", [round(x, 3) for x in cv_lm_mae])
print(f"Cross Val MAEs: {round(np.mean(cv_lm_mae), 3)} +- {round(np.std(cv_lm_mae), 3)}")

All Validation R2s:  [0.728, 0.71, 0.714, 0.707, 0.687]
Cross Val R2s: 0.709 +- 0.013
All Validation MAEs:  [220.617, 228.202, 223.564, 227.74, 227.045]
Cross Val MAEs: 225.434 +- 2.908
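
The "mean +- std" summary line can be reproduced by hand from the fold scores. The sketch below uses the rounded R2 values from the printed output above. One detail worth knowing: `np.std` defaults to the population standard deviation (`ddof=0`); the sample version (`ddof=1`) is shown for comparison and comes out slightly larger with only five folds.

```python
import numpy as np

# Fold R2 values copied from the printed output (already rounded to 3 decimals)
fold_scores = [0.728, 0.71, 0.714, 0.707, 0.687]

mean_r2 = np.mean(fold_scores)
std_pop = np.std(fold_scores)             # ddof=0, what the notebook's print uses
std_sample = np.std(fold_scores, ddof=1)  # sample standard deviation, for comparison

print(f"mean = {mean_r2:.3f}")
print(f"population std = {std_pop:.3f}, sample std = {std_sample:.3f}")
```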


### Model Fit on All Training Data

In [7]:
model = sm.OLS(y, X).fit()

model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.711 |
| Model: | OLS | Adj. R-squared: | 0.71 |
| Method: | Least Squares | F-statistic: | 2048.0 |
| Date: | Wed, 23 Aug 2023 | Prob (F-statistic): | 0.0 |
| Time: | 13:24:19 | Log-Likelihood: | -35823.0 |
| No. Observations: | 5007 | AIC: | 71660.0 |
| Df Residuals: | 5000 | BIC: | 71700.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -160.6934 | 73.659 | -2.182 | 0.029 | -305.097 | -16.290 |
| speed | 8.9041 | 0.232 | 38.400 | 0.000 | 8.450 | 9.359 |
| hd | 0.7074 | 0.035 | 20.424 | 0.000 | 0.640 | 0.775 |
| ram | 46.9916 | 1.321 | 35.584 | 0.000 | 44.403 | 49.580 |
| screen | 121.6871 | 4.996 | 24.356 | 0.000 | 111.892 | 131.482 |
| ads | 0.9228 | 0.064 | 14.440 | 0.000 | 0.797 | 1.048 |
| trend | -47.0682 | 0.758 | -62.105 | 0.000 | -48.554 | -45.582 |

| | | | |
|---|---|---|---|
| Omnibus: | 1072.791 | Durbin-Watson: | 1.994 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2917.919 |
| Skew: | 1.138 | Prob(JB): | 0.0 |
| Kurtosis: | 5.968 | Cond. No. | 8850.0 |
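
To see how the coefficient table turns into a prediction, the fitted equation can be applied by hand. The sketch below hard-codes the coefficients from the summary and plugs in the feature values of the first row shown by `computers.tail()` (speed 100, hd 528, ram 8, screen 15, ads 39, trend 35, listed price 1690). The dictionary names are illustrative only, and whether that row ended up in the training or test split is not known from the output.

```python
# Coefficients copied from the OLS summary table
coefs = {
    "const": -160.6934,
    "speed": 8.9041,
    "hd": 0.7074,
    "ram": 46.9916,
    "screen": 121.6871,
    "ads": 0.9228,
    "trend": -47.0682,
}

# Feature values for one example row; "const" is the intercept column added by add_constant
row = {"const": 1, "speed": 100, "hd": 528, "ram": 8, "screen": 15, "ads": 39, "trend": 35}

# Linear prediction: sum of coefficient * feature value
predicted_price = sum(coefs[name] * row[name] for name in coefs)
residual = 1690 - predicted_price  # actual price minus predicted price

print(f"Predicted price: {predicted_price:.2f}")
print(f"Residual: {residual:.2f}")
```

The prediction comes out near 1693, a few dollars from the listed price, though a single row says little about overall fit; the R2 and MAE scores are the meaningful summary.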


### Score on Test Data

In [8]:
print(f"Test R2: {r2(y_test, model.predict(X_test))} ")
print(f"Test MAE: {mae(y_test, model.predict(X_test))}")

Test R2: 0.7171544267656859 
Test MAE: 225.60019419706245
