# Regression Modeling
___

In this notebook we read in the previously cleaned `df_corr_droptop_highcorr.csv` dataset for modeling. We build four models (Ridge, Lasso, KNeighbors, and Random Forest), assess them, and then try to improve their scores through boosting and stacking.

In [4]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, StackingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [5]:
# Read in df_corr_droptop_highcorr.csv dataset created and cleaned in the data_collection notebook

df = pd.read_csv('../datasets/df_corr_droptop_highcorr.csv')

In [6]:
# Create X and y variables; train/test split

X = df.drop(columns = ['CPI'])
y = df['CPI']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

### Instantiate pipelines

Instantiate a pipeline for each of the four regression models: Ridge, Lasso, KNeighbors, and Random Forest. Each pipeline standardizes the features before fitting the model.

In [7]:
# Instantiate pipelines 

# RidgeCV
ridge_cv_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('ridge_cv', RidgeCV())
])

# LassoCV
lasso_cv_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lasso_cv', LassoCV())
])

# KNeighborsRegressor
knn_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsRegressor())
])

# RandomForestRegressor
rf_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('rf', RandomForestRegressor())
])

### Pipeline parameters

Set the parameters for each pipeline to grid search over.

In [8]:
# Set parameters for each pipeline

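# RidgeCV pipeline parameters
# (the grid search passes one alpha value to RidgeCV per candidate)
ridge_cv_pipeline_params = {
    'ridge_cv__alphas': range(1,11)
}

# LassoCV pipeline parameters
# (alphas=None leaves the alpha grid to LassoCV's automatic selection)
lasso_cv_pipeline_params = {
    'lasso_cv__alphas': [None]
}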

# KNeighborsRegressor pipeline parameters
knn_pipeline_params = {
    'knn__n_neighbors': range(1, 50, 2)
}

# RandomForestRegressor pipeline parameters
rf_pipeline_params = {
    'rf__n_estimators': range(250, 500, 50),
    'rf__max_depth': [None, 5, 10]
}
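
The parameter names above follow scikit-learn's `<step name>__<parameter name>` convention for addressing a parameter inside a pipeline step. The sketch below illustrates that addressing on a stand-in pipeline; `demo_pipe` and `knn_keys` are hypothetical names introduced only for this illustration.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in with the same step names as knn_pipe
demo_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('knn', KNeighborsRegressor())
])

# Every tunable parameter of the 'knn' step is exposed as 'knn__<parameter>'
knn_keys = [key for key in demo_pipe.get_params() if key.startswith('knn__')]
print(knn_keys)

# set_params accepts the same keys that appear in a GridSearchCV parameter grid
demo_pipe.set_params(knn__n_neighbors=3)
print(demo_pipe.named_steps['knn'].n_neighbors)
```

A misspelled step or parameter name in a grid raises an error at fit time, so listing the keys this way can be a quick check before a long search.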

### Instantiate GridSearchCV objects

In [9]:
# Instantiate grid search objects

# RidgeCV grid search
ridge_cv_gs = GridSearchCV(ridge_cv_pipe,
                           ridge_cv_pipeline_params,
                           cv = 5)

# LassoCV grid search
lasso_cv_gs = GridSearchCV(lasso_cv_pipe,
                           lasso_cv_pipeline_params,
                           cv = 5)

# KNeighborsRegressor grid search
knn_gs = GridSearchCV(knn_pipe,
                      knn_pipeline_params,
                      cv = 5)

# RandomForestRegressor grid search
rf_gs = GridSearchCV(rf_pipe,
                     rf_pipeline_params,
                     cv = 5)

### Fit Models

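Fit each grid search on the training data, then output the scores along with the best parameters. The value printed as "Train Score" is `best_score_`, the mean cross-validated R² of the best candidate on the training data; "Test Score" is the R² on the held-out test set.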

In [7]:
ridge_cv_gs.fit(X_train, y_train)

In [8]:
lasso_cv_gs.fit(X_train, y_train)

In [10]:
knn_gs.fit(X_train, y_train)

In [13]:
rf_gs.fit(X_train, y_train)
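
Once a grid search has been fitted, its `cv_results_` attribute holds the score of every candidate, not just the best one. The following is a minimal sketch on synthetic data (an assumption, not the CPI dataset); `demo_gs` and `cv_table` are hypothetical names, and the three-value grid is chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the CPI dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

demo_gs = GridSearchCV(
    Pipeline([('sc', StandardScaler()), ('knn', KNeighborsRegressor())]),
    {'knn__n_neighbors': [1, 5, 15]},
    cv=5,
)
demo_gs.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean cross-validated R^2
columns = ['param_knn__n_neighbors', 'mean_test_score',
           'std_test_score', 'rank_test_score']
cv_table = pd.DataFrame(demo_gs.cv_results_)[columns].sort_values('rank_test_score')
print(cv_table)
```

Looking at `std_test_score` next to `mean_test_score` can show whether the winning candidate is clearly ahead or within fold-to-fold noise of its neighbors.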

In [11]:
# Output best parameters, best score, test score

gs_dict = {'RidgeCV' : ridge_cv_gs, 
           'LassoCV' : lasso_cv_gs, 
           'KNeighborsRegressor' : knn_gs, 
           'RandomForestRegressor' : rf_gs, 
          }

for key, value in gs_dict.items():
    print('=' * 40)
    print(key)
    print(f'Train Score: {(value.best_score_).round(3)}')
    print(f'Test Score: {(value.score(X_test, y_test)).round(3)}')
    print(f'Best Parameters: {value.best_params_}')

RidgeCV
Train Score: 0.838
Test Score: 0.836
Best Parameters: {'ridge_cv__alphas': 5}
LassoCV
Train Score: 0.837
Test Score: 0.837
Best Parameters: {'lasso_cv__alphas': None}
KNeighborsRegressor
Train Score: 0.969
Test Score: 0.972
Best Parameters: {'knn__n_neighbors': 1}
RandomForestRegressor
Train Score: 0.97
Test Score: 0.975
Best Parameters: {'rf__max_depth': None, 'rf__n_estimators': 400}


**Observation:** The Lasso and Ridge models performed similarly to the linear regression model previously used for coefficient interpretation. KNeighborsRegressor and RandomForestRegressor both performed very well, with test scores over 0.97.
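
The "Train Score" printed above is `best_score_`, which is a mean cross-validated score rather than a score on the full training set. The distinction matters for a model such as KNN with `n_neighbors` of 1, which reproduces its own training targets. The sketch below shows the three quantities side by side on synthetic data (an assumption, not the CPI dataset); all variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the CPI dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

# Single-candidate grid so the comparison is about the scores, not the search
demo_gs = GridSearchCV(
    Pipeline([('sc', StandardScaler()), ('knn', KNeighborsRegressor())]),
    {'knn__n_neighbors': [1]},
    cv=5,
)
demo_gs.fit(Xd_train, yd_train)

cv_r2 = demo_gs.best_score_                    # mean R^2 over held-out folds
train_r2 = demo_gs.score(Xd_train, yd_train)   # R^2 on data the model has seen
test_r2 = demo_gs.score(Xd_test, yd_test)      # R^2 on the held-out test set

print(f'Cross-validated R^2: {cv_r2:.3f}')
print(f'Training R^2:        {train_r2:.3f}')
print(f'Test R^2:            {test_r2:.3f}')
```

Because each training point is its own nearest neighbor, the training R² here is uninformative, while the cross-validated score is the one comparable to the test score.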

### Boost KNN and RandomForest

In [11]:
# Boost KNeighborsRegressor

ada = AdaBoostRegressor(estimator = knn_pipe)

ada_params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.9, 1.1]
}

gs = GridSearchCV(ada, param_grid=ada_params, cv = 3)

gs.fit(X_train, y_train)

In [12]:
print('Boosted KNN')
print(f'Train Score: {(gs.best_score_).round(3)}')
print(f'Test Score: {(gs.score(X_test, y_test)).round(3)}')
print(f'Best Parameters: {gs.best_params_}')

Boosted KNN
Train Score: 0.966
Test Score: 0.97
Best Parameters: {'learning_rate': 0.9, 'n_estimators': 50}


**Observation:** KNeighborsRegressor performance essentially stays the same with boosting, with the test score moving from 0.972 to 0.970.

In [15]:
# Boost RandomForestRegressor

ada = AdaBoostRegressor(estimator = rf_pipe)

ada_params = {
    'n_estimators': [50, 100],
    'learning_rate': [0.7, 0.9]
}

gs = GridSearchCV(ada, param_grid=ada_params, cv = 3)

gs.fit(X_train, y_train)

In [16]:
print('Boosted RandomForest')
print(f'Train Score: {(gs.best_score_).round(3)}')
print(f'Test Score: {(gs.score(X_test, y_test)).round(3)}')
print(f'Best Parameters: {gs.best_params_}')

Boosted RandomForest
Train Score: 0.974
Test Score: 0.981
Best Parameters: {'learning_rate': 0.7, 'n_estimators': 100}


**Observation:** Boosting improved RandomForestRegressor performance slightly, raising the test score from 0.975 to 0.981.
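
The boosting used above wraps a base estimator in `AdaBoostRegressor`, which fits a sequence of copies of that estimator, each on a resample that emphasizes the points earlier copies predicted poorly. The sketch below compares a single estimator with its boosted version on synthetic data (an assumption, not the CPI dataset). A shallow decision tree is swapped in as the base learner purely to keep the example fast; it is not the pipeline used in this notebook, and all variable names are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption; not the CPI dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=42)

# A deliberately weak base learner, used here only for illustration
base = DecisionTreeRegressor(max_depth=3, random_state=42)
base_r2 = base.fit(Xd_train, yd_train).score(Xd_test, yd_test)

# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
boosted = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3, random_state=42),
                            n_estimators=50, learning_rate=0.9, random_state=42)
boosted.fit(Xd_train, yd_train)
boosted_r2 = boosted.score(Xd_test, yd_test)

print(f'Single tree test R^2:   {base_r2:.3f}')
print(f'Boosted tree test R^2:  {boosted_r2:.3f}')
print(f'Fitted boosting rounds: {len(boosted.estimators_)}')
```

Whether boosting helps depends on the base learner: a model that already fits the data closely leaves little error for later rounds to correct, which is consistent with the small changes seen above.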

### Stack RandomForest

In [17]:
# Create level 1 of the stack

level1_estimators = [
    ('ridge_pipe', Pipeline([
        ('sc', StandardScaler()),
        ('ridge_cv', RidgeCV())
    ])),
    ('lasso_pipe', Pipeline([
        ('sc', StandardScaler()),
        ('lasso_cv', LassoCV())
    ])),
    ('knn_pipe', Pipeline([
        ('sc', StandardScaler()),
        ('knn', KNeighborsRegressor())
    ])),
]

In [18]:
# Create stacked model

stacked_model = StackingRegressor(estimators = level1_estimators,
                                 final_estimator = rf_pipe)

In [19]:
# Fit model

stacked_model.fit(X_train, y_train)

In [20]:
print('Stacked RandomForest')
print(f'Train Score: {(stacked_model.score(X_train, y_train)).round(3)}')
print(f'Test Score: {(stacked_model.score(X_test, y_test)).round(3)}')

Stacked RandomForest
Train Score: 0.974
Test Score: 0.958


**Observation:** Stacking did not help. The stacked model with a RandomForestRegressor final estimator reached a test score of 0.958, down from 0.981 for the boosted RandomForest.
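
One detail worth keeping in mind when reading the stacked result: by default, the final estimator of a `StackingRegressor` is trained only on the level 1 predictions, not on the original features. The sketch below shows the shape of that input with and without `passthrough=True` on synthetic data (an assumption, not the CPI dataset). The helper `make_level1` and the other names are hypothetical, and the small forest is chosen only for speed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the CPI dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

def make_level1():
    """Return fresh level 1 pipelines mirroring the three used in the stack."""
    return [
        ('ridge_pipe', Pipeline([('sc', StandardScaler()), ('ridge_cv', RidgeCV())])),
        ('lasso_pipe', Pipeline([('sc', StandardScaler()), ('lasso_cv', LassoCV())])),
        ('knn_pipe', Pipeline([('sc', StandardScaler()), ('knn', KNeighborsRegressor())])),
    ]

# Default behavior: the final estimator sees one column per level 1 model
stack = StackingRegressor(
    estimators=make_level1(),
    final_estimator=RandomForestRegressor(n_estimators=25, random_state=42))
stack.fit(X_demo, y_demo)
meta_features = stack.transform(X_demo)

# passthrough=True appends the original features to the level 1 predictions
stack_pt = StackingRegressor(
    estimators=make_level1(),
    final_estimator=RandomForestRegressor(n_estimators=25, random_state=42),
    passthrough=True)
stack_pt.fit(X_demo, y_demo)
meta_passthrough = stack_pt.transform(X_demo)

print(meta_features.shape)
print(meta_passthrough.shape)
```

With three level 1 models, the final RandomForest works from three prediction columns unless `passthrough` is enabled, which may be one reason a stacked forest does not match a forest trained on the full feature set. This is a possible explanation, not something the notebook tests.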

### Stack/Boosted RandomForest

In [22]:
# create boosted stack

stacked_model = StackingRegressor(estimators = level1_estimators,
                                 final_estimator = AdaBoostRegressor(estimator = rf_pipe))

In [23]:
# Fit model

stacked_model.fit(X_train, y_train)

In [24]:
print('Stacked/Boosted RandomForest')
print(f'Train Score: {(stacked_model.score(X_train, y_train)).round(3)}')
print(f'Test Score: {(stacked_model.score(X_test, y_test)).round(3)}')

Stacked/Boosted RandomForest
Train Score: 0.972
Test Score: 0.955


**Observation:** Combining stacking and boosting also lowered performance. The test score fell from 0.981 for the boosted RandomForest to 0.955.

## Key Takeaways

This notebook brings in the cleaned dataset from the data_collection notebook and tries several models and techniques to find the best model for this dataset. KNeighbors and RandomForest performed best with parameter optimization alone, with test scores of 0.972 and 0.975 respectively. These two models were then boosted to try to improve performance. Boosted KNeighbors performance decreased slightly compared to the unboosted model, from 0.972 to 0.970. Boosted RandomForest performance improved slightly, with the test score rising from 0.975 to 0.981. Stacking and stacking/boosting RandomForest reduced test scores to 0.958 and 0.955, respectively.
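
For a side-by-side view, the test scores reported in this notebook can be collected into one table. The sketch below copies the values from the printed outputs above; the names `reported` and `summary` and the `delta_vs_best` column are hypothetical, added only for this comparison.

```python
import pandas as pd

# Test scores copied from the outputs printed in this notebook
reported = pd.DataFrame({
    'model': ['RidgeCV', 'LassoCV', 'KNeighborsRegressor',
              'RandomForestRegressor', 'Boosted KNN', 'Boosted RandomForest',
              'Stacked RandomForest', 'Stacked/Boosted RandomForest'],
    'test_score': [0.836, 0.837, 0.972, 0.975, 0.970, 0.981, 0.958, 0.955],
})

# Rank models from best to worst and show each gap to the top score
summary = reported.sort_values('test_score', ascending=False).reset_index(drop=True)
summary['delta_vs_best'] = (summary['test_score'] - summary['test_score'].max()).round(3)
print(summary)
```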