# Preprocessing & Modeling

This notebook contains my preprocessing of the data in preparation for modeling, as well as the steps to select and train a regression model. The product of this notebook is a model that can be used for prediction.

------

## Contents<a id='Contents'></a>
* [Introduction & Feature Descriptions](#introduction--feature-descriptions)
* [Imports & Reading Data](#imports--readingpreparing-data)
---

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option('display.max_columns',None)
plt.style.use('ggplot')
import os

from itertools import combinations

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_regression

import statsmodels.api as sm

In [2]:
# Read csv into a pandas dataframe
df = pd.read_csv('../data/Concrete_Data_Yeh.csv')
df.head()

Unnamed: 0,cement,slag,flyash,water,superplasticizer,coarseaggregate,fineaggregate,age,csMPa
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
# Split data into predictor features (X) and target feature (y)
X = df.drop('csMPa', axis = 1)
y = df['csMPa']

In [4]:
# Split the data into a training set and a test set (random state set for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

## Baseline Model (No relationship between predictor features and target feature)

In [5]:
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_

array([[35.92209951]])

This "model" predicts the compressive strength as the training set mean in every case (y = 35.922 + 0*X).

In [6]:
# Get r squared value for training data
dumb_reg.score(X_train, y_train)

0.0

For the training set, 0.0% of the variance in the target feature is explained by the model.
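
To see why the score is exactly zero, recall that R² = 1 - SS_res / SS_tot. When every prediction equals the mean of the data being scored, the residual and total sums of squares coincide. A minimal sketch in plain Python, using the five `csMPa` values from `df.head()` above purely as illustrative data (not the actual training set):

```python
# Five csMPa values from df.head(), used only as illustrative data.
y = [79.99, 61.89, 40.27, 41.05, 44.3]

mean_y = sum(y) / len(y)
predictions = [mean_y] * len(y)  # what a mean-strategy dummy model outputs

# Residual sum of squares (around the predictions) and
# total sum of squares (around the mean) are identical here.
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)

r2 = 1 - ss_res / ss_tot
print(r2)  # 0.0
```

On data other than the data used to compute the mean (for example the test set), the same predictor can score slightly below zero.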

In [7]:
# Get mean squared error for test data. This is the baseline that later models must beat.
y_predict_baseline = dumb_reg.predict(X_test)
mse = mean_squared_error(y_test, y_predict_baseline)
print("Baseline MSE:", mse)

Baseline MSE: 283.72877665204663
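
MSE is expressed in squared units of the target, so its square root is easier to interpret. A quick sketch using the value printed above, and assuming `csMPa` is measured in MPa as the column name suggests:

```python
import math

baseline_mse = 283.72877665204663  # value printed by the cell above
baseline_rmse = math.sqrt(baseline_mse)  # back on the scale of the target

print(f"Baseline RMSE: {baseline_rmse:.2f} MPa")
```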


## Methods for Model Selection

### Best Subset Selection:

Code adapted from [R. Jordan Crouser at Smith College for SDS293: Machine Learning (Spring 2016)](http://www.science.smith.edu/~jcrouser/SDS293/labs/lab8-py.html).

In [15]:
type(X_test)

pandas.core.frame.DataFrame

In [22]:
def processSubset(subset, X_train, y_train, X_test, y_test):
    """Fit a linear regression on a subset of the features and score it on the test set.

    Args:
        subset (iterable of str): names of the columns to train the model on
        X_train (pd.DataFrame): training set of predictor features
        y_train (pd.Series): training set of the target feature
        X_test (pd.DataFrame): test set of predictor features
        y_test (pd.Series): test set of the target feature

    Returns:
        dict: the fitted model, its feature names ('parameters'), its coefficients, its intercept, and the test set mean squared error
    """
    cols = list(subset)
    model = LinearRegression()
    model.fit(X_train[cols], y_train)
    y_hat = model.predict(X_test[cols])
    mse = mean_squared_error(y_test, y_hat)
    return {'model': model, 'parameters': model.feature_names_in_, 'coefficients': model.coef_, 'intercept': model.intercept_, 'MSE': mse}

In [9]:
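def getBestKModel(k: int, X_train, y_train, X_test, y_test):
    """Fit a linear regression on every k-feature subset and return the one with the lowest test set MSE."""
    results = []

    for combo in combinations(X_train.columns, k):
        results.append(processSubset(combo, X_train, y_train, X_test, y_test))

    models = pd.DataFrame(results)

    # idxmin returns the index label of the lowest MSE, which is what .loc expects
    best_model = models.loc[models['MSE'].idxmin()]

    print("Processed", models.shape[0], "models on", k, "predictor(s)")

    return best_model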

In [14]:
bestModels = []
for i in range(1, X_train.shape[1] + 1):  # subset sizes 1 through 8
    bestModels.append(getBestKModel(i, X_train, y_train, X_test, y_test))
    
bestModels = pd.DataFrame(bestModels)
bestModels

Processed 8 models on 1 predictor(s)
Processed 28 models on 2 predictor(s)
Processed 56 models on 3 predictor(s)
Processed 70 models on 4 predictor(s)
Processed 56 models on 5 predictor(s)
Processed 28 models on 6 predictor(s)
Processed 8 models on 7 predictor(s)
Processed 1 models on 8 predictor(s)


Unnamed: 0,model,parameters,coefficients,intercept,MSE
0,LinearRegression(),[cement],[0.07792767537004329],13.894509,195.56604
2,LinearRegression(),"[cement, water]","[0.07488426922156169, -0.18682469672520077]",48.696134,173.813442
1,LinearRegression(),"[cement, slag, water]","[0.08901323972515572, 0.06390238568424605, -0....",44.023246,152.086109
8,LinearRegression(),"[cement, slag, water, age]","[0.08419471576690608, 0.06815936622855842, -0....",55.661888,139.78382
3,LinearRegression(),"[cement, slag, flyash, water, age]","[0.11315050495625445, 0.09724247649332822, 0.0...",29.23165,135.349416
5,LinearRegression(),"[cement, slag, flyash, water, fineaggregate, age]","[0.11730063090601521, 0.10187957046620805, 0.0...",17.028503,135.566568
3,LinearRegression(),"[cement, slag, flyash, water, coarseaggregate,...","[0.11299931669821367, 0.09666429982466201, 0.0...",31.394116,136.525844
0,LinearRegression(),"[cement, slag, flyash, water, superplasticizer...","[0.11245418761334908, 0.09627841251483628, 0.0...",-0.819576,137.955851
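
The model counts printed above follow from the number of ways to choose k of the 8 predictors. A short sketch using only the standard library reproduces them, along with the total number of fits:

```python
from math import comb

n_features = 8  # number of predictor columns in X_train

# Number of distinct k-feature subsets for each subset size k.
counts = {k: comb(n_features, k) for k in range(1, n_features + 1)}
for k, n_models in counts.items():
    print(f"{n_models} models on {k} predictor(s)")

total = sum(counts.values())  # equals 2**8 - 1
print("Total models fitted:", total)
```

The total roughly doubles with each added feature, so exhaustive search is cheap here but grows quickly for wider datasets.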


### Forward Selection:
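
Forward selection starts from an empty feature set and greedily adds whichever remaining feature lowers the error the most, stopping when no addition helps. The following is a minimal sketch under stated assumptions, not this notebook's implementation: `forward_select` and `score_fn` are hypothetical names, and `score_fn` stands in for any function that maps a tuple of feature names to an error such as MSE. The toy scorer is made up for demonstration.

```python
def forward_select(features, score_fn):
    """Greedy forward selection; returns (selected features, best score)."""
    selected = []
    remaining = list(features)
    best_score = float("inf")

    while remaining:
        # Score each candidate set formed by adding one remaining feature.
        trials = [(score_fn(tuple(selected + [f])), f) for f in remaining]
        trial_score, trial_feature = min(trials)
        if trial_score >= best_score:
            break  # no single addition improves the score
        best_score = trial_score
        selected.append(trial_feature)
        remaining.remove(trial_feature)

    return selected, best_score


# Toy scorer (made up): "cement" and "age" reduce the error,
# any other feature adds a small penalty.
def toy_score(subset):
    score = 100.0
    if "cement" in subset:
        score -= 40.0
    if "age" in subset:
        score -= 20.0
    score += 1.0 * sum(f not in ("cement", "age") for f in subset)
    return score


chosen, score = forward_select(["cement", "water", "age"], toy_score)
print(chosen, score)  # ['cement', 'age'] 40.0
```

With the helpers defined earlier, a scorer along the lines of `lambda s: processSubset(s, X_train, y_train, X_test, y_test)['MSE']` could be passed as `score_fn`.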

### Backward Selection:
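
Backward selection works in the opposite direction: it starts from the full feature set and greedily drops whichever feature's removal lowers the error the most, stopping when no removal helps. The following is a minimal sketch under stated assumptions, not this notebook's implementation: `backward_select` and `score_fn` are hypothetical names, and `score_fn` stands in for any function that maps a tuple of feature names to an error such as MSE. The toy scorer is made up for demonstration.

```python
def backward_select(features, score_fn):
    """Greedy backward elimination; returns (kept features, best score)."""
    kept = list(features)
    best_score = score_fn(tuple(kept))

    while len(kept) > 1:
        # Score each candidate set formed by dropping one kept feature.
        trials = [
            (score_fn(tuple(f for f in kept if f != drop)), drop)
            for drop in kept
        ]
        trial_score, drop_feature = min(trials)
        if trial_score >= best_score:
            break  # no single removal improves the score
        best_score = trial_score
        kept.remove(drop_feature)

    return kept, best_score


# Toy scorer (made up): "cement" and "age" reduce the error,
# any other feature adds a small penalty.
def toy_score(subset):
    score = 100.0
    if "cement" in subset:
        score -= 40.0
    if "age" in subset:
        score -= 20.0
    score += 1.0 * sum(f not in ("cement", "age") for f in subset)
    return score


kept, score = backward_select(["cement", "water", "age"], toy_score)
print(kept, score)  # ['cement', 'age'] 40.0
```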