# Modeling (Original)

# Table of Contents

- ## [Data Cleaning](./Data_Cleaning.ipynb)
    - ### [Import Libraries](./Data_Cleaning.ipynb#Import-Libraries)
    - ### [Import Data](./Data_Cleaning.ipynb#Import-Data)
    - ### [Clean the "Average Salary" Column](./Data_Cleaning.ipynb#Clean-the-Average-Salary-Column)
    - ### [Create Stop Words](./Data_Cleaning.ipynb#Create-Custom-Stop-Words)
    - ### [Prepare words to be vectorized](./Data_Cleaning.ipynb#Tokenize%2C-Remove-Stop-Words%2C-Remove-Punctuation%2C-Lemmatize)
    - ### [Vectorize Word Data](./Data_Cleaning.ipynb#Vectorize-Word-Data)
- ## [Modeling](./Models.ipynb)
    - ### [Import Libraries](./Models.ipynb#Import-Libraries)
    - ### [Models](./Models.ipynb#Models)
      - #### [Linear Regression](./Models.ipynb#Linear-Regression)
      - #### [Lasso](./Models.ipynb#Lasso)
      - #### [Ridge](./Models.ipynb#Ridge)
      - #### [Random Forest Regressor](./Models.ipynb#Random-Forest-Regressor)
      - #### [Gradient Boosting Regressor](./Models.ipynb#Gradient-Boosting-Regressor)
      - #### [Neural Network](./Models.ipynb#Neural-Network)

# Import Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [3]:
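import pickle
# Load the pickled features (X) and target (y); `with` closes each file after reading.
with open("../models/original/X.pkl", "rb") as f:
    X = pickle.load(f)
with open("../models/original/y.pkl", "rb") as f:
    y = pickle.load(f)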

In [4]:
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
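
The load above and the `pickle.dump` calls in the model cells below pass a bare `open(...)` to pickle, which leaves closing the file to the garbage collector. A minimal sketch of two wrappers that close files deterministically is shown here. The names `save_pickle` and `load_pickle` are hypothetical helpers, not part of this project's `models` module, and the demo writes to a temporary folder rather than `../models`.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Serialize obj to path, creating parent folders and closing the file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load one pickled object and close the file afterwards."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip in a temporary folder so the demo does not touch ../models.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "margins" / "demo_margin.pkl"
    save_pickle({"margin": 123.45}, demo_path)
    restored = load_pickle(demo_path)
    print(restored)
```

With helpers like these, a call such as `pickle.dump(lr, open(path, "wb"))` would become `save_pickle(lr, path)`.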

# Models

In [6]:
import models as m

## Linear Regression

In [7]:
lr = m.linear_regression(X_train,y_train)

pickle.dump( lr, open( "../models/original/Linear_Model.pkl", "wb" ) )

print(f"The R2 score for this model is: {lr.score(X_test,y_test)}")

# RMSE over every row in X (train and test), with np.exp applied to targets and predictions first.
lr_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(lr.predict(X))))

pickle.dump(lr_margin, open( "../models/original/margins/lr_margin.pkl", "wb" ) )

print(f"The RMSE is: {lr_margin}")

The R2 score for this model is: 0.3723614458072618
The RMSE is: 29566.617367944724
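
The RMSE in these cells is computed after passing both `y` and the predictions through `np.exp`, which suggests (an assumption, since the cleaning step is not shown here) that the target is the natural log of salary and that the error is meant to be read in salary units. The sketch below reproduces that arithmetic in plain Python on made-up numbers. `rmse_original_units` is a hypothetical name, and the salaries are illustrative only.

```python
import math


def rmse_original_units(y_log_true, y_log_pred):
    """RMSE after undoing a natural-log transform on targets and predictions."""
    if len(y_log_true) != len(y_log_pred):
        raise ValueError("inputs must have the same length")
    squared_errors = [
        (math.exp(t) - math.exp(p)) ** 2
        for t, p in zip(y_log_true, y_log_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up salaries, for illustration only.
true_salaries = [50_000, 80_000, 120_000]
predicted_salaries = [55_000, 70_000, 130_000]

# Move to the log scale, as the models in this notebook appear to work there.
y_log_true = [math.log(s) for s in true_salaries]
y_log_pred = [math.log(s) for s in predicted_salaries]

demo_rmse = rmse_original_units(y_log_true, y_log_pred)
print(f"RMSE in original units: {demo_rmse:.2f}")
```

This mirrors `np.sqrt(mean_squared_error(np.exp(y), np.exp(model.predict(X))))`: the errors are measured after the back-transform, so large salaries contribute more to the total than they would on the log scale.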


## Lasso

In [8]:
ls = m.lasso(X_train,y_train)

pickle.dump( ls, open( "../models/original/Lasso_model.pkl", "wb" ) )

print(f"The R2 score for this model is: {ls.score(X_test,y_test)}")

ls_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(ls.predict(X))))

pickle.dump(ls_margin, open( "../models/original/margins/ls_margin.pkl", "wb" ) )

print(f"The RMSE is: {ls_margin}")

The R2 score for this model is: 0.4242258287244388
The RMSE is: 30009.573804426396


## Ridge

In [9]:
rd = m.ridge(X_train,y_train)

pickle.dump( rd, open( "../models/original/Ridge_model.pkl", "wb" ) )

print(f"The R2 score for this model is: {rd.score(X_test,y_test)}")

rd_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(rd.predict(X))))

pickle.dump(rd_margin, open( "../models/original/margins/rd_margin.pkl", "wb" ) )

print(f"The RMSE is: {rd_margin}")

The R2 score for this model is: 0.4203876940272564
The RMSE is: 29512.902460006022


## Random Forest Regressor

In [10]:
rf = m.random_forest(X_train,y_train)

pickle.dump( rf, open( "../models/original/Random_Forest_model.pkl", "wb" ) )

print(f"The R2 score for this model is: {rf.score(X_test,y_test)}")

rf_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(rf.predict(X))))

pickle.dump(rf_margin, open( "../models/original/margins/rf_margin.pkl", "wb" ) )

print(f"The RMSE is: {rf_margin}")

The R2 score for this model is: 0.4711669593689097
The RMSE is: 19087.208921791775
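
In these cells the R2 score uses the test split only, while the RMSE is computed on all of `X` and `y`, training rows included. A model that fits its training rows closely can therefore report a low RMSE on the full data even when its held-out error is higher, which may partly explain the drop seen here. The sketch below illustrates the effect with a toy `MemorizingModel` (a hypothetical stand-in, not a random forest) and made-up data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length sequences."""
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))


class MemorizingModel:
    """Toy stand-in for a flexible model: recalls training targets exactly."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.fallback = sum(y) / len(y)  # mean target for unseen inputs
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.fallback) for x in X]


# Hypothetical data: one numeric feature per row, targets already in dollars.
X_train_demo, y_train_demo = [1, 2, 3, 4], [40_000, 60_000, 80_000, 100_000]
X_test_demo, y_test_demo = [5, 6], [120_000, 50_000]

model = MemorizingModel().fit(X_train_demo, y_train_demo)

train_rmse = rmse(y_train_demo, model.predict(X_train_demo))
test_rmse = rmse(y_test_demo, model.predict(X_test_demo))
all_rmse = rmse(
    y_train_demo + y_test_demo,
    model.predict(X_train_demo + X_test_demo),
)

print(f"train: {train_rmse:.2f}  test: {test_rmse:.2f}  all rows: {all_rmse:.2f}")
```

Reporting the RMSE on `X_test` and `y_test` alongside the full-data figure would make the models easier to compare on equal terms.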


## Gradient Boosting Regressor

In [11]:
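# NOTE: this calls m.random_forest, so `gb` is a second random forest; a gradient boosting helper from the `models` module was probably intended.
gb = m.random_forest(X_train,y_train)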

pickle.dump( gb, open( "../models/original/Gradient_Boost_model.pkl", "wb" ) )

print(f"The R2 score for this model is: {gb.score(X_test,y_test)}")

gb_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(gb.predict(X))))

pickle.dump(gb_margin, open( "../models/original/margins/gb_margin.pkl", "wb" ) )

print(f"The RMSE is: {gb_margin}")

The R2 score for this model is: 0.48155460534615824
The RMSE is: 19043.784936360295


## Neural Network

In [12]:
best_epoch = m.get_best_epoch(X_train,y_train,X_test,y_test)

nn = m.neural_net(X_train,y_train,X_test,y_test,best_epoch)

nn.save("../models/original/Neural_Net.h5")

nn_margin = np.sqrt(mean_squared_error(np.exp(y),np.exp(nn.predict(X))))

pickle.dump(nn_margin, open( "../models/original/margins/nn_margin.pkl", "wb" ) )

print(f"The RMSE is: {nn_margin}")

The best epoch is 18 with a minimum loss of 0.10206479313645032
The RMSE is: 27505.906941230343
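
The scores printed above can be gathered into one table for a side-by-side comparison. The sketch below copies the values from the cell outputs in this notebook and ranks the models by RMSE. Labels follow the section headings, and the neural network has no R2 entry because its cell does not print one.

```python
# Scores copied from the cell outputs in this notebook: (name, R2 on test, RMSE).
results = [
    ("Linear Regression", 0.3723614458072618, 29566.617367944724),
    ("Lasso", 0.4242258287244388, 30009.573804426396),
    ("Ridge", 0.4203876940272564, 29512.902460006022),
    ("Random Forest", 0.4711669593689097, 19087.208921791775),
    ("Gradient Boosting", 0.48155460534615824, 19043.784936360295),
    ("Neural Network", None, 27505.906941230343),  # no R2 printed for this model
]

# Lower RMSE first.
ranked = sorted(results, key=lambda row: row[2])

print(f"{'Model':<20}{'R2 (test)':>12}{'RMSE':>14}")
for name, r2, model_rmse in ranked:
    r2_text = "n/a" if r2 is None else f"{r2:.4f}"
    print(f"{name:<20}{r2_text:>12}{model_rmse:>14,.2f}")
```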


# Go To:
[Original Modeling Process](../original/Models.ipynb) 

[Original Data Cleaning](../original/Data_Cleaning.ipynb)

[Poly Modeling Process](../poly/Models.ipynb)

[Poly Data Cleaning](../poly/Data_Cleaning.ipynb)