# **Week 6:** Model Building
### Neural Network, Random Forest, Gradient Boosting
#### July 27, 2023
---------------- 

This notebook is an exercise in preparing and building random forest, gradient boosting, and neural network models using the Ames, Iowa housing data (`ames_housing.csv`). It closes with a decision tree classifier built on the Titanic data (`titanic.csv`).

[Link to Ames Housing Kaggle Dataset](https://www.kaggle.com/datasets/marcopale/housing)

[Link to Titanic Training Dataset](https://www.kaggle.com/competitions/titanic/data)

Import the necessary Python packages.

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics
import category_encoders as ce
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
import xgboost as xgb


### Random Forest Model - Regression 

#### Setup Dataset

Read the `ames_housing.csv` file into a pandas DataFrame, and call it `ames`.

In [None]:
ames = pd.read_csv('ames_housing.csv')

Declare the target and predictor variables for the Ames dataset. We want to predict the sale price for each home observation.

In [None]:
X = ames.drop(['Sale_Price'], axis = 1)
y = ames['Sale_Price']

Split the data into training and test sets (70%, 30%). Print the shape (rows, columns) of each set to make sure the data is split correctly.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
X_train.shape, X_test.shape
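As a quick sanity check of what to expect from the shapes, here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical, not the Ames data). With 100 rows and `test_size=0.3`, the split should leave 70 rows for training and 30 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 feature columns.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

# The first element of each shape is the row count.
print(X_tr.shape, X_te.shape)  # (70, 2) (30, 2)
```
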

Print the head of the dataset to preview the columns and column types. 

In [None]:
X_train.head()

#### Variable Selection and Feature Engineering

Reduce the number of variables to a smaller subset. This is only for ease of computation.

In [None]:
# Define the column subset once so the training and test sets stay in sync.
selected_cols = ['Bedroom_AbvGr', 'Year_Built', 'Mo_Sold', 'Lot_Area',
                 'Street', 'Central_Air', '1st_Flr_SF', '2nd_Flr_SF', 'Full_Bath',
                 'Half_Bath', 'Fireplaces', 'Garage_Area', 'Gr_Liv_Area', 'TotRms_AbvGrd']
X_train = X_train.loc[:, selected_cols]
X_test = X_test.loc[:, selected_cols]

Output the number of unique values in each variable. 

In [None]:
X_train.nunique()

We will use the random forest regressor in scikit-learn, which does not accept string variables. We need to encode the `Street` and `Central_Air` variables as a number for each unique value. Let's use the `category_encoders` package to do this. The encoder is fit on the training set only and then applied to the test set. Output the head of the dataset again to make sure the encoding worked properly (i.e., all variables in the dataframe should be numeric).

In [None]:
encoder = ce.OrdinalEncoder(cols=['Street', 'Central_Air'])

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train.head()
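To make the idea of ordinal encoding concrete, here is a hand-rolled sketch in plain pandas. It illustrates the fit-then-transform pattern only and is not the internals of `category_encoders`; the toy frames, the `mapping` dictionary, and the choice of -1 for unseen categories are all assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test.
train = pd.DataFrame({"Central_Air": ["Y", "N", "Y"]})
test = pd.DataFrame({"Central_Air": ["N", "Y", "Unknown"]})

# "Fit": assign 1, 2, ... to categories in order of first appearance in train.
categories = pd.unique(train["Central_Air"])
mapping = {cat: code for code, cat in enumerate(categories, start=1)}

# "Transform": reuse the training mapping; unseen categories get -1 here.
train_codes = train["Central_Air"].map(mapping)
test_codes = test["Central_Air"].map(mapping).fillna(-1).astype(int)

print(mapping)              # {'Y': 1, 'N': 2}
print(test_codes.tolist())  # [2, 1, -1]
```

The point to notice is that the test set never influences the mapping, which is why the notebook calls `fit_transform` on the training data and only `transform` on the test data.
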

#### Model Building

First, instantiate the RandomForestRegressor model. 

In [None]:
fit_rf = RandomForestRegressor(random_state=42)

Now, we fit the model we just instantiated on our feature-engineered training dataset.

In [None]:
fit_rf.fit(X_train, y_train)

#### Predictions and Scoring

We can make predictions for the sale price of the homes in the test dataset using the model we just fit to our training dataset.

Output the MAE and MAPE for the model predictions on the test dataset.

In [None]:
preds = fit_rf.predict(X_test)
print("MAE: ", sklearn.metrics.mean_absolute_error(y_test, preds))
print("MAPE: ", sklearn.metrics.mean_absolute_percentage_error(y_test, preds))
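For reference, both metrics can be computed by hand. The sketch below uses made-up prices (not Ames results) to show what each number means. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so a value of 0.10 corresponds to 10%.

```python
import numpy as np

# Hypothetical sale prices and predictions, for illustration only.
y_true = np.array([100_000.0, 200_000.0, 400_000.0])
y_hat = np.array([110_000.0, 180_000.0, 400_000.0])

abs_err = np.abs(y_true - y_hat)

mae = abs_err.mean()              # average error, in dollars
mape = (abs_err / y_true).mean()  # average relative error, as a fraction

print(f"MAE:  {mae:,.0f}")   # MAE:  10,000
print(f"MAPE: {mape:.2%}")   # MAPE: 6.67%
```
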

We can try different values for the `n_estimators` parameter (the number of trees in the forest), a process known as "tuning" the model. Create an array of numbers to try from 10 to 200 in increments of 10.

In [None]:
estimators = np.arange(10, 201, 10)  # the stop value is exclusive, so 201 includes 200
estimators

We will fit the model again using each of the numbers in the `estimators` array. This code may take a minute to run.

In [None]:
scores = []
for n in estimators:
    fit_rf.set_params(n_estimators=n)  # change the number of trees
    fit_rf.fit(X_train, y_train)       # refit on the training data
    scores.append(fit_rf.score(X_test, y_test))  # R^2 on the test data

Plot the value of `n_estimators` against the score of the tuned model for each value of the parameter. For a regressor, `score` returns R². We can see that the effect of `n_estimators` on the score of the model levels out around 125, so this might be a good setting for the parameter.

In [None]:
plt.title("Effect of n_estimators")
plt.xlabel("n_estimators")
plt.ylabel("score (R^2)")
plt.plot(estimators, scores)

You can output the raw scores to make your decision for tuning the `n_estimators` parameter.

In [None]:
scores
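The loop above scores each setting on the test set, so the test set ends up helping to choose the parameter. One common alternative is to compare settings with cross-validation on the training data and keep the test set for the final check. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, and the short `candidate_sizes` list are assumptions chosen so the example runs quickly, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

candidate_sizes = [10, 20, 40]  # kept small so the sketch runs quickly
cv_scores = []
for n in candidate_sizes:
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    # Mean R^2 over 3 folds of the training data only.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=3)
    cv_scores.append(fold_scores.mean())

# Pick the setting with the highest mean cross-validated score.
best_n = candidate_sizes[int(np.argmax(cv_scores))]
print(dict(zip(candidate_sizes, cv_scores)), best_n)
```
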

### XGBoost Model - Regression

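Let's keep the same training and test datasets for Ames housing we used for the Random Forest model. We will run a grid search across 5 different parameters: `max_depth`, `learning_rate`, `gamma`, `reg_lambda`, and `scale_pos_weight`. Note that `scale_pos_weight` is intended for balancing classes in imbalanced classification problems, so we would not expect it to change the results of this regression.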

In [None]:
param_grid = {
    'max_depth' :[3,4,5],
    'learning_rate':[0.1, 0.01, 0.5],
    'gamma':[0,0.25,1],
    'reg_lambda':[0, 1.0, 10.0],
    'scale_pos_weight':[1,3,5]
}

optimal_params = GridSearchCV(
    estimator=xgb.XGBRegressor(subsample=0.9, colsample_bytree=0.5),
    param_grid=param_grid,
    verbose=0,
)
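Before running the search, it helps to know how much work it involves. The sketch below counts the parameter combinations in the grid with scikit-learn's `ParameterGrid`. It assumes the default 5-fold cross-validation, which `GridSearchCV` uses when `cv` is not specified.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.5],
    'gamma': [0, 0.25, 1],
    'reg_lambda': [0, 1.0, 10.0],
    'scale_pos_weight': [1, 3, 5],
}

n_combinations = len(ParameterGrid(param_grid))  # 3 ** 5
n_folds = 5  # GridSearchCV's default when cv is not specified
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 243 1215
```

With the default `refit=True`, one more fit on the full training set follows once the best combination is found.
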

Now, we run the grid search with cross-validation on our feature-engineered training dataset. This fits one model per parameter combination and fold, so it may take a while to run.

In [None]:
gbm_model = optimal_params.fit(X_train, y_train, verbose = False)
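Once a grid search has been fit, the winning combination is available through `best_params_`, its mean cross-validated score through `best_score_`, and the refit model through `best_estimator_`. The sketch below shows these attributes on a small, fast example. To keep it lightweight it swaps in scikit-learn's `GradientBoostingRegressor` for XGBoost and uses synthetic data and a reduced grid, all of which are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=5.0, random_state=0
)

# A deliberately tiny grid so the search finishes quickly.
small_grid = {'max_depth': [2, 3], 'learning_rate': [0.1, 0.5]}
search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=20, random_state=0),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)           # the winning combination
print(round(search.best_score_, 3))  # its mean cross-validated R^2
best_model = search.best_estimator_  # refit on all of X_demo, y_demo
```

The same attributes exist on `gbm_model` above, since `fit` returns the `GridSearchCV` object itself.
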

#### Predictions and Scoring

We can make predictions for the sale price of the homes in the test dataset using the model we just fit to our training dataset.

Output the MAE and MAPE for the model predictions on the test dataset.

In [None]:
preds = gbm_model.predict(X_test)
print("MAE: ", sklearn.metrics.mean_absolute_error(y_test, preds))
print("MAPE: ", sklearn.metrics.mean_absolute_percentage_error(y_test, preds))

### Neural Network Model - Regression

#### Model Building

First, instantiate the MLPRegressor (Neural Network) model.

In [None]:
nnet = MLPRegressor(hidden_layer_sizes=(2000, ))

Now, we fit the model we just instantiated on our feature-engineered training dataset.

In [None]:
nnet_model = nnet.fit(X_train, y_train)

#### Predictions and Scoring

We can make predictions for the sale price of the homes in the test dataset using the model we just fit to our training dataset.

Output the MAE and MAPE for the model predictions on the test dataset.

In [None]:
preds = nnet_model.predict(X_test)
print("MAE: ", sklearn.metrics.mean_absolute_error(y_test, preds))
print("MAPE: ", sklearn.metrics.mean_absolute_percentage_error(y_test, preds))
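The scikit-learn documentation notes that multi-layer perceptrons are sensitive to feature scaling, and the Ames features span very different ranges (e.g., `Lot_Area` versus `Fireplaces`). One option worth trying is to standardize the inputs inside a pipeline. The sketch below shows the pattern on synthetic data; the data, the layer size, and `max_iter` are assumptions chosen for a quick run, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Ames dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=123
)

# The scaler is fit on the training data only, then reused at predict time.
scaled_nnet = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)
scaled_nnet.fit(X_tr, y_tr)

demo_preds = scaled_nnet.predict(X_te)
print(demo_preds.shape)  # (60,)
```
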

### Decision Tree Model - Classification

#### Setup Dataset

Read the `titanic.csv` file into a pandas DataFrame, and call it `titanic`.

In [None]:
titanic = pd.read_csv('titanic.csv')

Declare the target and predictor variables for the Titanic dataset. We want to classify observations on whether or not they survived the Titanic sinking.

In [None]:
X = titanic.drop(['Survived'], axis = 1)
y = titanic['Survived']

Split the data into training and test sets for both X and y (70%, 30%). Print out the number of rows in each set to make sure the data is split correctly.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
X_train.shape, X_test.shape

Print the head of the dataset to preview the columns and column types. 

In [None]:
X_train.head()

#### Variable Selection and Feature Engineering

We need to remove the `PassengerId`, `Name`, and `Ticket` variables because they are not useful to the model.

In [None]:
X_train = X_train.loc[:, ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']] 
X_test = X_test.loc[:, ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']] 

Output the number of unique values in each variable. 

In [None]:
X_train.nunique()

We can reduce the number of unique values in the `Cabin` variable by categorizing each passenger by the first letter of their Cabin (e.g., `B77` would become `B`).

In [None]:
X_train['Cabin'] = X_train['Cabin'].str[0]
X_test['Cabin'] = X_test['Cabin'].str[0]
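As a small illustration of what `.str[0]` does, the sketch below applies it to a few made-up cabin values. Missing cabins stay missing, which matters here because many passengers have no recorded cabin.

```python
import pandas as pd

# Hypothetical cabin values, including one missing entry.
cabins = pd.Series(['B77', 'C85', None, 'E46'])

# Take the first character of each string; missing values are left as missing.
decks = cabins.str[0]

print(decks.tolist())
print(decks.isna().sum())  # 1
```
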

We will use the decision tree classifier in scikit-learn, which does not accept string variables. We need to encode the `Sex`, `Cabin`, and `Embarked` variables as a number for each unique value. `Age` is already numeric, but it is encoded here as well so that its missing values receive a numeric code; keep in mind that this in effect treats each distinct age as a category rather than a quantity. Let's use the `category_encoders` package to do this. Output the head of the dataset again to make sure the encoding worked properly (i.e., all variables in the dataframe should be numeric).

In [None]:
encoder = ce.OrdinalEncoder(cols=['Sex', 'Cabin', 'Age', 'Embarked'])

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

X_train.head()

#### Model Building

First, instantiate the DecisionTreeClassifier model. Let's use the Gini index as the split criterion and limit the tree to a depth of 3. Feel free to come back and switch this criterion to another metric (e.g., `entropy`) later and compare.

In [None]:
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
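For intuition, the Gini impurity of a node is one minus the sum of squared class proportions: 0 for a pure node and 0.5 for an evenly mixed binary node. The helper below, `gini_impurity`, is a hypothetical function written for this illustration and is not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions p_k in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity([0, 0, 1, 1]))  # 0.5   (evenly mixed)
print(gini_impurity([1, 1, 1, 1]))  # 0.0   (pure)
print(gini_impurity([0, 1, 1, 1]))  # 0.375
```

At each split, the tree looks for the feature and threshold that most reduce this impurity in the resulting child nodes.
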

Now, we fit the model we just instantiated on our feature-engineered training dataset.

In [None]:
clf_gini.fit(X_train, y_train)

#### Predictions and Scoring

We can make predictions for survival of the passengers in the test dataset using the model we just fit to our training dataset. Set the array of predicted values (0s or 1s) to the variable `y_pred_gini`.

Get predictions for both the test data and training data.

In [None]:
y_pred_gini = clf_gini.predict(X_test)
# y_pred_gini

y_pred_train_gini = clf_gini.predict(X_train)
# y_pred_train_gini

Now, we can compare the accuracy of the training set predictions with the accuracy of the test set predictions to check for overfitting.

In [None]:
print('Training set score: {:.4f}'.format(sklearn.metrics.accuracy_score(y_train, y_pred_train_gini)))
print('Test set score: {:.4f}'.format(sklearn.metrics.accuracy_score(y_test, y_pred_gini)))

Output the confusion matrix for the model. Remember the confusion matrix should be made up of the following:

<img src="cm.png" width=500 height=300 />

In [None]:
cm = confusion_matrix(y_test, y_pred_gini)
print('Confusion matrix\n\n', cm)
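In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so a binary matrix reads `[[TN, FP], [FN, TP]]`. This may be arranged differently from a textbook diagram. The sketch below unpacks the four cells on made-up labels (not Titanic results) and derives accuracy, precision, and recall from them.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# ravel() flattens row by row: true 0 first, then true 1.
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)  # of predicted survivors, how many survived
recall = tp / (tp + fn)     # of actual survivors, how many were found

print(tn, fp, fn, tp)  # 2 1 1 2
```
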

#### Visualization and Variable Importance

Let's output a visual for the decision tree.

In [None]:
plt.figure(figsize=(12,8))
# Plot the already-fitted tree; there is no need to refit it here.
tree.plot_tree(clf_gini, feature_names=list(X_train.columns), filled=True)
plt.show()
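For variable importance, a fitted scikit-learn tree exposes `feature_importances_`, which sums to 1 across features. The sketch below shows how to read it on a toy dataset where one column fully determines the label; the `signal` and `noise` columns are hypothetical and only serve to make the result easy to check.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: 'signal' perfectly separates the classes, 'noise' carries no information.
X_demo = pd.DataFrame({
    'signal': [0, 0, 0, 0, 1, 1, 1, 1],
    'noise':  [0, 1, 0, 1, 0, 1, 0, 1],
})
y_demo = [0, 0, 0, 0, 1, 1, 1, 1]

demo_tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

The same two lines at the end should work on the Titanic model by substituting `clf_gini` and `X_train.columns`.
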