# Modeling (Poly)

# Table of Contents

- ## [Data Cleaning](./Data_Cleaning.ipynb)
    - ### [Import Libraries](./Data_Cleaning.ipynb#Import-Libraries)
    - ### [Import Data](./Data_Cleaning.ipynb#Import-Data)
    - ### [Clean the "Average Salary" Column](./Data_Cleaning.ipynb#Clean-the-Average-Salary-Column)
    - ### [Create Stop Words](./Data_Cleaning.ipynb#Create-Custom-Stop-Words)
    - ### [Prepare words to be vectorized](./Data_Cleaning.ipynb#Tokenize%2C-Remove-Stop-Words%2C-Remove-Punctuation%2C-Lemmatize)
    - ### [Vectorize Word Data](./Data_Cleaning.ipynb#Vectorize-Word-Data)
- ## [Modeling](./Models.ipynb)
    - ### [Import Libraries](./Models.ipynb#Import-Libraries)
    - ### [Models](./Models.ipynb#Models)
      - #### [Linear Regression](./Models.ipynb#Linear-Regression)
      - #### [Lasso](./Models.ipynb#Lasso)
      - #### [Ridge](./Models.ipynb#Ridge)
      - #### [Random Forest Regressor](./Models.ipynb#Random-Forest-Regressor)
      - #### [Gradient Boosting Regressor](./Models.ipynb#Gradient-Boosting-Regressor)
      - #### [Neural Network](./Models.ipynb#Neural-Network)

# Import Libraries

In [12]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [13]:
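import pickle
# Load the pickled feature matrix and target; "with" closes each file after reading
with open("../models/poly/X_95.pkl", "rb") as f:
    X = pickle.load(f)
with open("../models/poly/y.pkl", "rb") as f:
    y = pickle.load(f)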

In [14]:
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Models

In [15]:
import models as m

## Linear Regression

In [16]:
lr = m.linear_regression(X_train, y_train)

# Persist the fitted model
with open("../models/poly/Linear_model.pkl", "wb") as f:
    pickle.dump(lr, f)

# R2 on the held-out split (same scale as y)
print(f"The R2 score for this model is: {lr.score(X_test, y_test)}")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
lr_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(lr.predict(X))))

with open("../models/poly/margins/lr_margin.pkl", "wb") as f:
    pickle.dump(lr_margin, f)

print(f"The RMSE is: {lr_margin}")

The R2 score for this model is: 0.2965678068818711
The RMSE is: 25389.984654733835
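
The R2 above is measured on the held-out split, while the RMSE is computed over every row in X after an np.exp back-transform. As a minimal sketch, assuming y stores the natural log of salary (which the np.exp calls suggest), the back-transformed RMSE could be wrapped in a small helper and applied to any subset of rows. The name `dollar_rmse` and the toy salaries are illustrative and not part of the `models` module.

```python
import numpy as np


def dollar_rmse(y_log_true, y_log_pred):
    """RMSE in original units for targets stored on a natural-log scale.

    Hypothetical helper: assumes both inputs are log-transformed values.
    """
    # Undo the log transform and flatten, so (n,) and (n, 1) inputs both work
    y_true = np.exp(np.asarray(y_log_true, dtype=float)).ravel()
    y_pred = np.exp(np.asarray(y_log_pred, dtype=float)).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy check with made-up salaries (not data from this notebook)
true_salaries = np.array([50_000.0, 80_000.0, 120_000.0])
pred_salaries = np.array([55_000.0, 75_000.0, 120_000.0])
toy_rmse = dollar_rmse(np.log(true_salaries), np.log(pred_salaries))
print(f"Toy RMSE: {toy_rmse:.2f}")
```

Calling `dollar_rmse(y_test, lr.predict(X_test))` would give a held-out error to set beside the all-rows figure reported above.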


## Lasso

In [17]:
ls = m.lasso(X_train, y_train)

# Persist the fitted model
with open("../models/poly/Lasso_model.pkl", "wb") as f:
    pickle.dump(ls, f)

# R2 on the held-out split (same scale as y)
print(f"The R2 score for this model is: {ls.score(X_test, y_test)}")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
ls_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(ls.predict(X))))

with open("../models/poly/margins/ls_margin.pkl", "wb") as f:
    pickle.dump(ls_margin, f)

print(f"The RMSE is: {ls_margin}")

The R2 score for this model is: 0.489077559290809
The RMSE is: 27703.709795836134


## Ridge

In [18]:
rd = m.ridge(X_train, y_train)

# Persist the fitted model
with open("../models/poly/Ridge_model.pkl", "wb") as f:
    pickle.dump(rd, f)

# R2 on the held-out split (same scale as y)
print(f"The R2 score for this model is: {rd.score(X_test, y_test)}")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
rd_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(rd.predict(X))))

with open("../models/poly/margins/rd_margin.pkl", "wb") as f:
    pickle.dump(rd_margin, f)

print(f"The RMSE is: {rd_margin}")

The R2 score for this model is: 0.4959211623168871
The RMSE is: 24861.792029883523


## Random Forest Regressor

In [19]:
rf = m.random_forest(X_train, y_train)

# Persist the fitted model
with open("../models/poly/Random_Forest_model.pkl", "wb") as f:
    pickle.dump(rf, f)

# R2 on the held-out split (same scale as y)
print(f"The R2 score for this model is: {rf.score(X_test, y_test)}")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
rf_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(rf.predict(X))))

with open("../models/poly/margins/rf_margin.pkl", "wb") as f:
    pickle.dump(rf_margin, f)

print(f"The RMSE is: {rf_margin}")

The R2 score for this model is: 0.48080912468522974
The RMSE is: 18056.81980709426


## Gradient Boosting Regressor

In [20]:
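# NOTE: this call reuses m.random_forest, so "gb" is a second random forest rather than
# a gradient boosting model; swap in the gradient boosting helper from models if it has one.
gb = m.random_forest(X_train, y_train)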

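# Persist the fitted model
with open("../models/poly/Gradient_Boost_model.pkl", "wb") as f:
    pickle.dump(gb, f)

# R2 on the held-out split (same scale as y)
print(f"The R2 score for this model is: {gb.score(X_test, y_test)}")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
gb_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(gb.predict(X))))

with open("../models/poly/margins/gb_margin.pkl", "wb") as f:
    pickle.dump(gb_margin, f)

print(f"The RMSE is: {gb_margin}")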

The R2 score for this model is: 0.48426991127168206
The RMSE is: 18032.72566107865
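
The cell above fits its model with `m.random_forest`, which would explain why these scores sit so close to the Random Forest results. As a hedged sketch of what a gradient boosting fit could look like, the block below uses scikit-learn's `GradientBoostingRegressor` directly with default hyperparameters on synthetic data. The `_demo` names and the data are assumptions for illustration; the settings intended for this notebook are not known.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's real X and y come from the pickles loaded earlier
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, test_size=0.2
)

# Default hyperparameters: an assumption, not the notebook's tuned values
gb_demo = GradientBoostingRegressor(random_state=42)
gb_demo.fit(X_tr, y_tr)

r2_demo = gb_demo.score(X_te, y_te)
print(f"Demo R2 on held-out synthetic data: {r2_demo:.3f}")
```

Swapping `X_demo` and `y_demo` for the notebook's `X_train` and `y_train` would produce a model that can be scored and pickled the same way as the others.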


## Neural Network

In [21]:
best_epoch = m.get_best_epoch(X_train,y_train,X_test,y_test)

nn = m.neural_net(X_train,y_train,X_test,y_test,best_epoch)

nn.save("../models/poly/Neural_Net.h5")

# RMSE after the np.exp back-transform, computed over all rows (train and test)
nn_margin = np.sqrt(mean_squared_error(np.exp(y), np.exp(nn.predict(X))))

with open("../models/poly/margins/nn_margin.pkl", "wb") as f:
    pickle.dump(nn_margin, f)

print(f"The RMSE is: {nn_margin}")

The best epoch is 176 with a minimum loss of 0.1102323666608373
The RMSE is: 15193.511639815557
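
For a side-by-side view, the figures printed in the cells above can be collected and ranked. The sketch below hard-codes the reported values (rounded), and the list layout is only illustrative. Note that R2 was scored on the held-out split while each RMSE covers all rows, and that no R2 was printed for the neural network.

```python
# Metrics copied from this notebook's cell outputs (rounded).
# R2: held-out split, scale of y. RMSE: all rows, after the np.exp back-transform.
results = [
    {"model": "Linear Regression", "r2": 0.2966, "rmse": 25389.98},
    {"model": "Lasso", "r2": 0.4891, "rmse": 27703.71},
    {"model": "Ridge", "r2": 0.4959, "rmse": 24861.79},
    {"model": "Random Forest", "r2": 0.4808, "rmse": 18056.82},
    {"model": "Gradient Boosting", "r2": 0.4843, "rmse": 18032.73},
    {"model": "Neural Network", "r2": None, "rmse": 15193.51},
]

# Rank from lowest to highest RMSE
ranked = sorted(results, key=lambda row: row["rmse"])

for row in ranked:
    r2_text = "n/a" if row["r2"] is None else f"{row['r2']:.4f}"
    print(f"{row['model']:<18} R2={r2_text:>6}  RMSE={row['rmse']:>9.2f}")
```

On these reported numbers the neural network has the lowest RMSE and Lasso the highest, although the all-rows RMSE includes training data and so may flatter the more flexible models.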


# Go To:
[Original Modeling Process](../original/Models.ipynb) 

[Original Data Cleaning](../original/Data_Cleaning.ipynb)

[Poly Modeling Process](../poly/Models.ipynb)

[Poly Data Cleaning](../poly/Data_Cleaning.ipynb)