<h1 style="font-size:30px">Model Training - Linear Regression</h1>
<hr>

1. Split the dataset
2. Build model pipelines
3. Declare hyperparameters to tune
4. Fit and tune models with cross-validation
5. Evaluate models and select winner
6. Saving the winning model

<span style="font-size:18px">**Import libraries**</span>

In [1]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplotlib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# Scikit-Learn for modeling
import sklearn

# Pickle for saving model files
import pickle

In [2]:
# Function for splitting training and test set
from sklearn.model_selection import train_test_split

# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

# Function for standardization
from sklearn.preprocessing import StandardScaler

# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

**Linear Regression Algorithms**

In [3]:
# Import Elastic Net, Ridge Regression and Lasso Regression
from sklearn.linear_model import ElasticNet, Ridge, Lasso

# Import Random Forest and Gradient Boosted Trees
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Import r2_score and mean_absolute_error functions
from sklearn.metrics import r2_score, mean_absolute_error

<span style="font-size:18px">**Load analytical base table**</span>

In [4]:
# Load analytical base table from Feature Engineering
df = pd.read_csv('analytical_base_table.csv')

<span style="font-size:18px">**1. Split the dataset**</span><br>

Split the dataframe into two objects: the target variable (y) and the input features (X).

In [5]:
# Create separate object for target variable
y = df.price

# Create separate object for input features
X = df.drop('price', axis = 1)

**Training sets** are used to fit and tune the models<br>
**Test sets** are put aside as unseen data to evaluate your models
<br>
* Split the train and test set, passing in the argument **test_size = 0.2** to set aside 20% of the observations for the test set
* The **random_state = 1234** is set for replicable results
* **Important**: For a classification model, also pass the argument **stratify = df.target** so that each subset keeps the same class proportions as the full dataset. This is **stratified random sampling**

In [6]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)

# Print number of observations in X_train, X_test, y_train, and y_test
print(len(X_train), len(X_test), len(y_train), len(y_test))

43136 10784 43136 10784
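
The split above is for a regression target, so it is not stratified. As a minimal sketch of what stratification does for a classification target, the cell below uses a small made-up dataset (`X_toy` and `y_toy` are hypothetical, not part of this project) with an 80/20 class imbalance and checks the class counts in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 observations of class 0 and 20 of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

# Class counts in the training set and in the test set
print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Without `stratify`, the minority class count in the test set would vary from one random split to the next.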


<span style="font-size:18px">**2. Build model pipelines**</span><br>
The pipeline will standardize the data first, then apply the model algorithm to it

**Preprocessing**: should be performed inside the cross-validation loop
* Transform or scale the features
* Perform automatic feature reduction (e.g. PCA)
* Remove correlated features<br>
<br>
**Standardization**: transforms all features to the same **scale** by subtracting each feature's mean and dividing by its standard deviation.
* Each feature's distribution ends up **centered around zero, with unit variance**
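
To make the standardization step concrete, here is a small sketch on made-up numbers (`X_demo` is a hypothetical array, not the project data). It compares `StandardScaler` with the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
X_demo = np.array([[1.0, 1000.0],
                   [2.0, 2000.0],
                   [3.0, 3000.0],
                   [4.0, 4000.0]])

# Standardize with scikit-learn
X_scaled = StandardScaler().fit_transform(X_demo)

# Same calculation by hand: subtract the column mean, divide by the column std
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_scaled.mean(axis=0))            # column means after scaling
print(X_scaled.std(axis=0))             # column standard deviations after scaling
print(np.allclose(X_scaled, X_manual))  # whether both approaches agree
```

Inside a pipeline, the scaler is fitted only on the training folds during cross-validation, which is why the scaling step lives in the pipeline rather than being applied to the whole dataset up front.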

* The **random_state = 123** is set for replicable results

**Linear Regression pipelines**

In [7]:
# Create pipelines dictionary
pipelines = {'lasso': make_pipeline(StandardScaler(), Lasso(random_state = 123)),
             'ridge': make_pipeline(StandardScaler(), Ridge(random_state = 123)),
             'enet': make_pipeline(StandardScaler(), ElasticNet(random_state = 123)),
             'rf': make_pipeline(StandardScaler(), RandomForestRegressor(random_state = 123)),
             'gb': make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state = 123))}

<span style="font-size:18px">**3. Declare hyperparameters to tune**</span><br>
**Hyperparameters** express "higher-level" structural settings for modeling algorithms<br>
* e.g. strength of the penalty used in regularized regression
* e.g. the number of trees to include in a random forest
* They are **decided** before training the model because they cannot be learned from the data

All of the keys that begin with 'lasso__' are hyperparameters for Lasso regression

In [8]:
# List of tuneable hyperparameters of Lasso pipeline
pipelines['lasso'].get_params()

{'memory': None,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('lasso', Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=123,
      selection='cyclic', tol=0.0001, warm_start=False))],
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'lasso': Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
    normalize=False, positive=False, precompute=False, random_state=123,
    selection='cyclic', tol=0.0001, warm_start=False),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'lasso__alpha': 1.0,
 'lasso__copy_X': True,
 'lasso__fit_intercept': True,
 'lasso__max_iter': 1000,
 'lasso__normalize': False,
 'lasso__positive': False,
 'lasso__precompute': False,
 'lasso__random_state': 123,
 'lasso__selection': 'cyclic',
 'lasso__tol': 0.0001,
 'lasso__warm_start': False}
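
The `step__parameter` naming seen in the output above can be tried out directly. This is a small sketch on a fresh pipeline (`pipe` is a throwaway name used only here), showing how `make_pipeline` names its steps and how a double underscore addresses one step's parameter.

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Lasso(random_state=123))

# make_pipeline names each step after its lowercased class name
print(list(pipe.named_steps))

# '<step name>__<parameter>' sets a parameter on that step
pipe.set_params(lasso__alpha=0.5)
print(pipe.named_steps['lasso'].alpha)
```

The same keys are what GridSearchCV expects in a hyperparameter grid.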

**Linear Regression hyperparameters**

For regularized regression, the most impactful hyperparameter is **alpha** (the **strength of the penalty**)
* Also known as lambda
* **alpha** is a positive value, typically between 0 and 10
* The default value is 1.0

In [9]:
# Lasso hyperparameters
lasso_hyperparameters = {'lasso__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}

In [10]:
# Ridge hyperparameters
ridge_hyperparameters = {'ridge__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]}
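
To see what the penalty strength does in practice, the sketch below fits Lasso at a few alpha values on synthetic data and counts the non-zero coefficients. The data (`X_demo`, `y_demo`) is made up for illustration: only the first three of ten features actually drive the target.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only the first 3 influence the target
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 10))
y_demo = (3 * X_demo[:, 0] - 2 * X_demo[:, 1] + X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

# Count non-zero coefficients for increasingly strong penalties
nonzero_counts = {}
for alpha in [0.001, 0.1, 1, 10]:
    model = Lasso(alpha=alpha, random_state=123).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(alpha, nonzero_counts[alpha])
```

A stronger penalty pushes more coefficients to exactly zero. With a large enough alpha every coefficient is zeroed and the model predicts only the mean.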

* **l1_ratio** is the share of the penalty assigned to **L1**; the remaining share (1 - l1_ratio) goes to **L2**
* The default value is 0.5
* When l1_ratio = 1, the penalty is pure L1, as in Lasso regression
* When l1_ratio = 0, the penalty is pure L2, as in Ridge regression

In [11]:
# Elastic Net hyperparameters
enet_hyperparameters = {'elasticnet__alpha': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
                        'elasticnet__l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9]}
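
For reference, the objective that scikit-learn's ElasticNet minimizes can be written as below. The symbols are shorthand chosen here, not notation from this notebook: $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ the `alpha` hyperparameter and $\rho$ the `l1_ratio` hyperparameter.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \, \rho \, \lVert w \rVert_1
  + \frac{\alpha \, (1 - \rho)}{2} \, \lVert w \rVert_2^2
```

Setting $\rho = 1$ removes the L2 term and setting $\rho = 0$ removes the L1 term. Note that the scaling of the L2 term differs from the one used by scikit-learn's Ridge, so the same alpha value does not mean the same penalty strength in both estimators.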

* **n_estimators** is the **number of decision trees** to include in the random forest
* Usually, more is better
* The default value is 10 in the scikit-learn version used here, which is usually too few (releases from 0.22 onward default to 100)
* Try 100 and 200<br>
<br>
* **max_features** controls the number of features each tree is allowed to choose from at every split
* It's what allows your random forest to perform feature selection
* The default value is 'auto', which sets max_features = n_features (newer scikit-learn releases have deprecated 'auto')
* 'sqrt' sets max_features = sqrt(n_features)
* 0.33 sets max_features = 0.33*n_features

In [12]:
# Random Forest hyperparameters
rf_hyperparameters = {'randomforestregressor__n_estimators': [100, 200],
                      'randomforestregressor__max_features': ['auto', 'sqrt', 0.33]}
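
As a rough check of how the `max_features` settings translate into feature counts, the sketch below fits tiny forests on synthetic data with 16 features (`X_demo` and `y_demo` are made up) and reads the resolved value from the first tree. The 'auto' setting is left out because recent scikit-learn releases may not accept it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with 16 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 16))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

resolved = {}
for setting in ['sqrt', 0.33]:
    forest = RandomForestRegressor(n_estimators=5, max_features=setting,
                                   random_state=123)
    forest.fit(X_demo, y_demo)
    # Each fitted tree stores the resolved integer count in max_features_
    resolved[setting] = forest.estimators_[0].max_features_
    print(setting, resolved[setting])
```

With 16 features, 'sqrt' resolves to 4 features per split and 0.33 resolves to 5 (0.33 * 16 = 5.28, truncated to an integer).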

* **n_estimators** same as for Random Forest
* Try 100 and 200<br>
<br>
* **learning_rate** shrinks the contribution of each tree
* There is a tradeoff between learning rate and number of trees
* The default value is 0.1
* Try 0.05, 0.1 and 0.2<br>
<br>
* **max_depth** controls the maximum depth of each tree
* The default value is 3
* Try 1, 3 and 5

In [13]:
# Boosted tree hyperparameters
gb_hyperparameters = {'gradientboostingregressor__n_estimators': [100, 200],
                      'gradientboostingregressor__learning_rate': [0.05, 0.1, 0.2],
                      'gradientboostingregressor__max_depth': [1, 3, 5]}

In [14]:
hyperparameters = {'lasso': lasso_hyperparameters,
                   'ridge': ridge_hyperparameters,
                   'enet': enet_hyperparameters,
                   'rf': rf_hyperparameters,
                   'gb': gb_hyperparameters}

<span style="font-size:18px">**4. Fit and tune models with cross-validation**</span><br>

**GridSearchCV** runs cross-validation over the **hyperparameter grid**, trying each **combination of values**. It calculates a **cross-validated score** (using a performance metric) for each combination, picks the combination with the best score, and by default refits that combination on the whole training set
* **cv** is the number of cross-validation folds
* **n_jobs = -1** runs the fits in parallel on all available CPU cores, which speeds up the search
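
To get a feel for how much work a grid implies, the combinations can be enumerated with `ParameterGrid`. The sketch below repeats the boosted-tree grid from this notebook under a local name (`gb_grid`) so that it runs on its own. The fold count of 10 matches the `cv` value used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the boosted-tree grid declared earlier
gb_grid = {'gradientboostingregressor__n_estimators': [100, 200],
           'gradientboostingregressor__learning_rate': [0.05, 0.1, 0.2],
           'gradientboostingregressor__max_depth': [1, 3, 5]}

combos = list(ParameterGrid(gb_grid))
n_folds = 10

print(len(combos))            # number of hyperparameter combinations
print(len(combos) * n_folds)  # number of model fits during the search
print(combos[0])              # one example combination
```

The grid has 2 * 3 * 3 = 18 combinations, so a 10-fold search fits 180 models before the final refit. This is why adding values to a grid gets expensive quickly.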

In [15]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipelines.items():
    
    # Create cross-validation object from pipeline and hyperparameters
    model = GridSearchCV(pipeline, hyperparameters[name], cv = 10, n_jobs = -1)
    
    # Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # Store model in fitted_models[name]
    fitted_models[name] = model
    
    # Print when the model is fitted
    print(name, 'has been fitted.')

lasso has been fitted.
ridge has been fitted.
enet has been fitted.
rf has been fitted.
gb has been fitted.
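
Each fitted GridSearchCV object also records which combination won and how every combination scored. The following is a small self-contained sketch on synthetic data (`X_demo`, `y_demo` and `search` are illustrative names, not objects from this notebook) showing the `best_params_` and `cv_results_` attributes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=120)

# Small grid search over the Ridge penalty strength
search = GridSearchCV(make_pipeline(StandardScaler(), Ridge()),
                      {'ridge__alpha': [0.01, 1, 100]}, cv=5)
search.fit(X_demo, y_demo)

# Winning combination, then the per-combination summary as a table
print(search.best_params_)
results = pd.DataFrame(search.cv_results_)
print(results[['param_ridge__alpha', 'mean_test_score', 'rank_test_score']])
```

The same attributes are available on every entry of `fitted_models`, for example `fitted_models['rf'].best_params_`.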


<span style="font-size:18px">**5. Evaluate models and select winner**</span><br>

In [16]:
# Display best_score_ for each fitted_model
for name, model in fitted_models.items():
    print(name, model.best_score_)

lasso 0.919009382798393
ridge 0.9184615703596563
enet 0.9188422102370558
rf 0.9806322144008012
gb 0.9808295698775237


**Linear Regression metrics**

For **regression problems**, the default scoring metric is the average **R²** on the holdout folds
* **R²** is the "proportion of variation in the target variable that can be explained by the model"
* Because it is the average R² from the **holdout folds**, higher is almost always better<br>
<br>
* **MAE** (Mean Absolute Error) is the average absolute difference between predicted and actual values for our target variable.
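
Both metrics are simple to compute by hand, which helps when interpreting them. The sketch below uses four made-up values (`y_true` and `y_pred` are hypothetical) and checks the manual formulas against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# MAE = mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))

print(r2_manual, r2_score(y_true, y_pred))
print(mae_manual, mean_absolute_error(y_true, y_pred))
```

Here the residual sum of squares is 1000 and the total sum of squares is 50000, giving an R² of 0.98, and the absolute errors (10, 10, 20, 20) average to an MAE of 15. MAE is in the same units as the target, which makes it easier to communicate than R².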

In [17]:
# Loop through model pipelines, predicting and calculating R^2 and MAE
for name, model in fitted_models.items():
    
    # Predict test set using the fitted models
    pred = model.predict(X_test)
    print(name)
    print('----------')
    
    # Calculate and print R^2 and MAE
    print('R^2: ', r2_score(y_test, pred))
    print('MAE: ', mean_absolute_error(y_test, pred))
    print()

lasso
----------
R^2:  0.9195452051901299
MAE:  742.7924726851604

ridge
----------
R^2:  0.9201148406728155
MAE:  738.4799951489126

enet
----------
R^2:  0.9192310296831971
MAE:  750.1782207428162

rf
----------
R^2:  0.9820270151300432
MAE:  264.8424436323212

gb
----------
R^2:  0.9816126039235502
MAE:  278.52906986162003



<span style="font-size:18px">**6. Saving the winning model**</span><br>

In [19]:
# Selected winning hyperparameters
fitted_models['rf'].best_estimator_

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=0.33, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=123, verbose=0, warm_start=False))])

In [20]:
# Save the winning pipeline (scaler and tuned random forest) to disk
with open('final_model.pkl', 'wb') as f:
    pickle.dump(fitted_models['rf'].best_estimator_, f)
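
A saved pipeline can later be loaded back and used for prediction without refitting. The sketch below shows the round trip with a small stand-in pipeline and a temporary file (`stand_in`, `demo_model.pkl` and the synthetic data are placeholders, not this project's model), so it can run on its own.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the winning pipeline, fitted on synthetic data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo.sum(axis=1)
stand_in = make_pipeline(StandardScaler(), Ridge()).fit(X_demo, y_demo)

# Save to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(stand_in, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# The loaded pipeline should reproduce the original predictions
same = np.allclose(stand_in.predict(X_demo), loaded.predict(X_demo))
print(same)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.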