# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. You will then implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(WHRDataSet_filename)# YOUR CODE HERE

print(df.columns)
df.head()

Index(['country', 'year', 'Life Ladder', 'Log GDP per capita',
       'Social support', 'Healthy life expectancy at birth',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 'Positive affect', 'Negative affect',
       'Confidence in national government', 'Democratic Quality',
       'Delivery Quality', 'Standard deviation of ladder by country-year',
       'Standard deviation/Mean of ladder by country-year',
       'GINI index (World Bank estimate)',
       'GINI index (World Bank estimate), average 2000-15',
       'gini of household income reported in Gallup, by wp5-year'],
      dtype='object')


Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year"
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,,,
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,,,0.441906
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,,,0.327318
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,,,0.336764
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,,,0.34454


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. The data set I have chosen is the World Happiness Report (WHR) data set.
2. I will be predicting a country's life ladder score from the values of the other features, such as log GDP per capita, social support, healthy life expectancy at birth, etc. The label is the column "Life Ladder".
3. This is a supervised learning problem because I have identified a label column and will be training my model based on that label. This is a regression problem because the label is a numerical value.
4. My initial set of features will be log GDP per capita, social support, healthy life expectancy at birth, freedom to make life choices, perceptions of corruption, positive and negative affect, confidence in national government, democratic quality, delivery quality, GINI index (World Bank estimate), and GINI of household income reported in Gallup, by wp5-year. The life ladder score itself is the label, so it is not a feature.
5. This is an important problem because it could help countries identify what improves or worsens citizens' quality of life.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [23]:
df.describe()

Unnamed: 0,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year",label_life_ladder
count,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0,1562.0
mean,2011.820743,5.433676,9.220822,0.810669,62.249887,0.728975,7.9e-05,0.753622,0.708969,0.263171,0.480207,-0.126617,0.004947,2.003501,0.387271,0.372846,0.386948,0.445204,5.435275
std,3.419787,1.121017,1.17375,0.118872,7.937689,0.144051,0.159939,0.18011,0.107021,0.083682,0.180621,0.824041,0.925759,0.379684,0.119007,0.052884,0.078834,0.092575,1.11206
min,2005.0,2.661718,6.377396,0.290184,37.766476,0.257534,-0.322952,0.035198,0.362498,0.083426,0.068769,-2.448228,-2.144974,0.863034,0.133908,0.241,0.228833,0.22347,3.174264
25%,2009.0,4.606351,8.330659,0.749794,57.344959,0.635676,-0.108292,0.702761,0.622581,0.20468,0.348685,-0.713479,-0.671931,1.737934,0.309722,0.372846,0.32725,0.386856,4.606351
50%,2012.0,5.3326,9.361684,0.831776,63.763542,0.74432,-0.011797,0.798041,0.715595,0.252504,0.480207,-0.126617,-0.084389,1.960345,0.369751,0.372846,0.386948,0.445204,5.3326
75%,2015.0,6.271025,10.167549,0.904097,68.064693,0.841122,0.086098,0.874675,0.799524,0.310713,0.593869,0.50414,0.606049,2.21592,0.451833,0.372846,0.42925,0.480072,6.271025
max,2017.0,8.018934,11.770276,0.987343,76.536362,0.985178,0.677773,0.983276,0.943621,0.70459,0.993604,1.540097,2.184725,3.52782,1.022769,0.648,0.626,0.961435,7.614929


In [20]:
# Replace missing values in numerical columns with the column mean
nan_count = np.sum(df.isna(), axis = 0)
nan_cols = nan_count > 0
numerical_cols = (df.dtypes == 'int64') | (df.dtypes == 'float64')
to_replace = nan_cols & numerical_cols
for col in df.columns[to_replace]:
    # Assign the result back to the column; calling fillna(inplace=True)
    # on a column selection may not update df in newer pandas versions
    df[col] = df[col].fillna(df[col].mean())
df

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year",label_life_ladder
0,Afghanistan,2008,3.723590,7.168690,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.929690,-1.655084,1.774662,0.476600,0.372846,0.386948,0.445204,3.723590
1,Afghanistan,2009,4.401778,7.333790,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,0.372846,0.386948,0.441906,4.401778
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.137630,0.706766,0.618265,0.275324,0.299357,-1.991810,-1.617176,1.878622,0.394803,0.372846,0.386948,0.327318,4.758381
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.785360,0.465942,0.372846,0.386948,0.336764,3.831719
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.775620,0.710385,0.267919,0.435440,-1.842996,-1.404078,1.798283,0.475367,0.372846,0.386948,0.344540,3.782938
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1557,Zimbabwe,2013,4.690188,7.565154,0.799274,48.949745,0.575884,-0.076716,0.830937,0.711885,0.182288,0.527755,-1.026085,-1.526321,1.964805,0.418918,0.372846,0.432000,0.555439,4.690188
1558,Zimbabwe,2014,4.184451,7.562753,0.765839,50.051235,0.642034,-0.045885,0.820217,0.725214,0.239111,0.566209,-0.985267,-1.484067,2.079248,0.496899,0.372846,0.432000,0.601080,4.184451
1559,Zimbabwe,2015,3.703191,7.556052,0.735800,50.925652,0.667193,-0.094585,0.810457,0.715079,0.178861,0.590012,-0.893078,-1.357514,2.198865,0.593776,0.372846,0.432000,0.655137,3.703191
1560,Zimbabwe,2016,3.735400,7.538829,0.768425,51.800068,0.732971,-0.065283,0.723612,0.737636,0.208555,0.699344,-0.863044,-1.371214,2.776363,0.743257,0.372846,0.432000,0.596690,3.735400
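
To make the imputation step concrete, here is a minimal sketch of the same idea on a small made-up DataFrame. The column names and values are hypothetical and are not taken from the WHR data. Only numerical columns that contain missing values are filled with their own column mean; the string column is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the WHR data set
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "score": [1.0, np.nan, 3.0],
    "complete": [10.0, 20.0, 30.0],
})

# Numerical columns that contain at least one missing value
numeric_cols = toy.select_dtypes(include="number").columns
cols_to_fill = [col for col in numeric_cols if toy[col].isna().any()]

# Fill each such column with its own mean (computed over non-missing values)
for col in cols_to_fill:
    toy[col] = toy[col].fillna(toy[col].mean())

print(cols_to_fill)
print(toy["score"].tolist())
```

Mean imputation keeps every row, but it also pulls the filled values toward the center of the column, which is worth keeping in mind for columns with many missing entries.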


In [27]:
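#Limit outliers in the label by winsorizing: the lowest 25% and highest 25% of values are clipped to the nearest retained value (no rows are removed)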
import scipy.stats as stats
df['label_life_ladder'] = stats.mstats.winsorize(df['Life Ladder'], limits=[0.25, 0.25])
df.head(10)

Unnamed: 0,country,year,Life Ladder,Log GDP per capita,Social support,Healthy life expectancy at birth,Freedom to make life choices,Generosity,Perceptions of corruption,Positive affect,Negative affect,Confidence in national government,Democratic Quality,Delivery Quality,Standard deviation of ladder by country-year,Standard deviation/Mean of ladder by country-year,GINI index (World Bank estimate),"GINI index (World Bank estimate), average 2000-15","gini of household income reported in Gallup, by wp5-year",label_life_ladder
0,Afghanistan,2008,3.72359,7.16869,0.450662,49.209663,0.718114,0.181819,0.881686,0.517637,0.258195,0.612072,-1.92969,-1.655084,1.774662,0.4766,0.372846,0.386948,0.445204,4.606252
1,Afghanistan,2009,4.401778,7.33379,0.552308,49.624432,0.678896,0.203614,0.850035,0.583926,0.237092,0.611545,-2.044093,-1.635025,1.722688,0.391362,0.372846,0.386948,0.441906,4.606252
2,Afghanistan,2010,4.758381,7.386629,0.539075,50.008961,0.600127,0.13763,0.706766,0.618265,0.275324,0.299357,-1.99181,-1.617176,1.878622,0.394803,0.372846,0.386948,0.327318,4.758381
3,Afghanistan,2011,3.831719,7.415019,0.521104,50.367298,0.495901,0.175329,0.731109,0.611387,0.267175,0.307386,-1.919018,-1.616221,1.78536,0.465942,0.372846,0.386948,0.336764,4.606252
4,Afghanistan,2012,3.782938,7.517126,0.520637,50.709263,0.530935,0.247159,0.77562,0.710385,0.267919,0.43544,-1.842996,-1.404078,1.798283,0.475367,0.372846,0.386948,0.34454,4.606252
5,Afghanistan,2013,3.5721,7.503376,0.483552,51.04298,0.577955,0.074735,0.823204,0.620585,0.273328,0.482847,-1.879709,-1.403036,1.22369,0.342569,0.372846,0.386948,0.304368,4.606252
6,Afghanistan,2014,3.130896,7.484583,0.525568,51.370525,0.508514,0.118579,0.871242,0.531691,0.374861,0.409048,-1.773257,-1.312503,1.395396,0.445686,0.372846,0.386948,0.413974,4.606252
7,Afghanistan,2015,3.982855,7.466215,0.528597,51.693527,0.388928,0.094686,0.880638,0.553553,0.339276,0.260557,-1.844364,-1.291594,2.160618,0.54248,0.372846,0.386948,0.596918,4.606252
8,Afghanistan,2016,4.220169,7.461401,0.559072,52.016529,0.522566,0.057072,0.793246,0.564953,0.348332,0.32499,-1.917693,-1.432548,1.796219,0.425627,0.372846,0.386948,0.418629,4.606252
9,Afghanistan,2017,2.661718,7.460144,0.49088,52.339527,0.427011,-0.10634,0.954393,0.496349,0.371326,0.261179,-0.126617,0.004947,1.454051,0.546283,0.372846,0.386948,0.286599,4.606252
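
The effect of `limits=[0.25, 0.25]` can be easier to see on a tiny example. The following is a pure-Python sketch of rank-based winsorizing under my reading of how the limits are applied; `winsorize_sketch` is a hypothetical helper written for illustration and is not SciPy's implementation.

```python
def winsorize_sketch(values, lower=0.25, upper=0.25):
    """Clip the lowest and highest fractions of values to the nearest kept value."""
    n = len(values)
    ordered = sorted(values)
    n_low = int(lower * n)    # how many of the smallest values get replaced
    n_high = int(upper * n)   # how many of the largest values get replaced
    low_bound = ordered[n_low]             # smallest value that is kept
    high_bound = ordered[n - n_high - 1]   # largest value that is kept
    return [min(max(v, low_bound), high_bound) for v in values]


toy_scores = [1, 2, 3, 4, 5, 6, 7, 8]
clipped = winsorize_sketch(toy_scores)
print(clipped)  # [3, 3, 3, 4, 5, 6, 6, 6]
```

With 25% clipped on each side, half of the values are replaced. This is consistent with the table above, where most of the Afghanistan rows share the same `label_life_ladder` value.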


In [28]:
print("mean: {mean}".format(mean=df['label_life_ladder'].mean()))
print("median: {med}".format(med=df['label_life_ladder'].median()))

mean: 5.405399081148527
median: 5.33259964




In [29]:
# Correlate numerical columns only; 'country' is a string column
high_corrs = df.select_dtypes(include='number').corr().round(5)['label_life_ladder']
high_corrs.sort_values(ascending=False)

label_life_ladder                                           1.00000
Life Ladder                                                 0.94449
Log GDP per capita                                          0.75467
Healthy life expectancy at birth                            0.70915
Social support                                              0.66250
Delivery Quality                                            0.62369
Democratic Quality                                          0.54715
Positive affect                                             0.53377
Freedom to make life choices                                0.49078
Generosity                                                  0.14871
year                                                        0.02165
GINI index (World Bank estimate)                           -0.01376
GINI index (World Bank estimate), average 2000-15          -0.09612
Standard deviation of ladder by country-year               -0.10691
Confidence in national government               

**Analysis:** From the correlation list, the most relevant features are likely those with a strong positive or strong negative correlation with the label, such as "Log GDP per capita", "Healthy life expectancy at birth", "Social support", "Delivery Quality", "Democratic Quality", "Positive affect", and "Perceptions of corruption".
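
Because strongly negative correlations matter as much as strongly positive ones, it can help to rank features by the absolute value of their correlation with the label. The sketch below shows this on a made-up DataFrame; the column names `label`, `pos`, `neg`, and `weak` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one positively related, one negatively related,
# and one unrelated feature
toy = pd.DataFrame({
    "label": [1, 2, 3, 4, 5],
    "pos": [2, 4, 6, 8, 10],
    "neg": [10, 8, 6, 4, 1],
    "weak": [1, 3, 2, 3, 1],
})

# Correlation of every feature with the label, excluding the label itself
corr_with_label = toy.corr()["label"].drop("label")

# Rank by strength of the relationship, regardless of sign
ranked = corr_with_label.abs().sort_values(ascending=False)
print(ranked)
```

Sorting by the signed value, as in the cell above, places strong negative correlations at the bottom of the list, where they are easy to overlook.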

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

1. The feature list I chose to keep includes: "Log GDP per capita", "Healthy life expectancy at birth", "Social support", "Delivery Quality", "Democratic Quality", "Positive affect", and "Perceptions of corruption".
2. The data preparation techniques that I will use to prepare my data for modeling include replacing null values, limiting outliers, selecting features based on a high positive or negative correlation with the label, and splitting the data into training and test sets, with k-fold cross-validation on the training set for model selection.
3. I'm implementing a Random Forest model and a Neural Network model to predict the life ladder score of each country based on the feature list above.
4. I plan on using k-fold cross-validation to train and validate my model on the training data. For the Random Forest Regressor, I will use Grid Search to find the best hyperparameters and then train a model with the best combination on the training data set. Then I will compare the performance of the Random Forest and Neural Network models using evaluation metrics such as R2 and mean squared error, and take the best model. For the Neural Network model, because the columns have different ranges of values (shown below), I need to add a normalization layer to the neural network before the training and prediction process.
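
The k-fold idea in the plan can be sketched in a few lines. The helper `kfold_indices` below is hypothetical and written only for illustration; it splits row indices into contiguous folds without shuffling, and each fold serves once as the validation set while the remaining rows are used for training.

```python
def kfold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first (n_samples % k) folds get one extra row so all rows are used
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size


folds = list(kfold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(val_idx, len(train_idx))
```

Every row is used for validation exactly once, so the averaged validation score depends less on one particular split than a single hold-out set would.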

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

#### Random Forest Regressor

In [30]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

#### Neural Network

In [31]:
import tensorflow.keras as keras
import time

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

#### Random Forest

In [32]:
#Prep data
ft_use = ["Log GDP per capita", "Healthy life expectancy at birth", "Social support", "Delivery Quality", "Democratic Quality", "Positive affect", "Democratic Quality", "Perceptions of corruption"]
df_use = df.drop(columns=['Life Ladder'])
y = df_use['label_life_ladder']
X = df_use[ft_use]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1234)
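
As a rough check on the split, the arithmetic below estimates how many rows end up in each set. It assumes the 1562 rows reported by `df.describe()` and assumes that the test count is rounded up with the remainder going to training, which is my understanding of how `train_test_split` handles a fractional `test_size`.

```python
import math

n_rows = 1562        # row count reported by df.describe()
test_size = 0.33

# Assumption: the test count is rounded up, the remainder is used for training
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```

Comparing these numbers with `X_train.shape` and `X_test.shape` is a quick way to confirm that the split behaved as expected.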

In [33]:
# Grid search over tree depth and minimum leaf size with 5-fold cross-validation
param_grid = {"max_depth": [2**i for i in range(6)], "min_samples_leaf": [25*2**n for n in range(0,3)]}
rf_model = RandomForestRegressor()
grid = GridSearchCV(rf_model, cv=5, param_grid=param_grid)
grid_search = grid.fit(X_train, y_train)

In [34]:
best_rf_params = grid_search.best_params_
best_rf_params

{'max_depth': 8, 'min_samples_leaf': 25}
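
To see how much work the grid search does, the sketch below expands the same `param_grid` expressions and counts the combinations. It uses only the standard library and does not train any model.

```python
from itertools import product

param_grid = {
    "max_depth": [2**i for i in range(6)],
    "min_samples_leaf": [25 * 2**n for n in range(0, 3)],
}
cv_folds = 5

# Every pairing of a max_depth value with a min_samples_leaf value
combinations = list(product(param_grid["max_depth"], param_grid["min_samples_leaf"]))
n_fits = len(combinations) * cv_folds

print(param_grid["max_depth"])         # [1, 2, 4, 8, 16, 32]
print(param_grid["min_samples_leaf"])  # [25, 50, 100]
print(len(combinations), n_fits)       # 18 90
```

Each of the 18 combinations is fitted once per fold, giving 90 cross-validation fits. The selected `min_samples_leaf` of 25 is the smallest value in the grid, so extending the grid toward smaller leaf sizes might be worth trying.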

In [35]:
best_rf = RandomForestRegressor(max_depth=best_rf_params["max_depth"], min_samples_leaf=best_rf_params["min_samples_leaf"])
best_rf.fit(X_train, y_train)

best_rf_labels = best_rf.predict(X_test)

In [36]:
# Taking the square root of the MSE gives the root mean squared error (RMSE)
rmse_rf_score = np.sqrt(mean_squared_error(y_test, best_rf_labels))
r2_rf_score = r2_score(y_test, best_rf_labels)
print("rmse score for random forest: {rmse}".format(rmse=rmse_rf_score))
print("r2 score for random forest: {r2}".format(r2=r2_rf_score))

rmse score for random forest: 0.30822720232219464
r2 score for random forest: 0.7935080317982888
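
The cell above takes the square root of the mean squared error, so the first printed number is a root mean squared error (RMSE), expressed in the same units as the label. The sketch below computes MSE, RMSE, and R2 by hand on made-up values to show how the three quantities relate; the numbers are hypothetical.

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]   # hypothetical targets
y_pred = [2.0, 5.0, 8.0, 9.0]   # hypothetical predictions

n = len(y_true)

# Sum of squared residuals and total sum of squares around the mean
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
mean_true = sum(y_true) / n
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n            # mean squared error
rmse = math.sqrt(mse)       # root mean squared error
r2 = 1 - ss_res / ss_tot    # fraction of variance explained
print(mse, rmse, r2)
```

An R2 of 1 means perfect predictions, 0 means no better than always predicting the mean, and negative values mean worse than that baseline.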


#### Neural Network

In [37]:
# YOUR CODE HERE
for ft in ft_use:
    print("range of column {col}: [{min_val}, {max_val}]".format(col=ft, min_val=df[ft].min(), max_val=df[ft].max()))

range of column Log GDP per capita: [6.37739563, 11.77027607]
range of column Healthy life expectancy at birth: [37.76647568, 76.53636169]
range of column Social support: [0.29018417, 0.98734349]
range of column Delivery Quality: [-2.144973993, 2.184724569]
range of column Democratic Quality: [-2.448228121, 1.540097475]
range of column Positive affect: [0.362497687, 0.943620622]
range of column Democratic Quality: [-2.448228121, 1.540097475]
range of column Perceptions of corruption: [0.035197988, 0.98327601]


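The feature ranges differ widely (for example, roughly [0.29, 0.99] for "Social support" versus [37.8, 76.5] for "Healthy life expectancy at birth"), so these columns' values need to be normalized before training the neural network.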
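
One common way to put features on a comparable scale is z-score standardization. The sketch below is an illustration on made-up arrays and is not the approach used in this notebook, which adds a `BatchNormalization` layer to the network instead. The array names and values are hypothetical.

```python
import numpy as np

# Hypothetical features: column 0 is on a ~0-1 scale, column 1 on a ~40-80 scale
X_train_toy = np.array([[0.3, 40.0], [0.5, 60.0], [0.9, 76.0], [0.7, 64.0]])
X_test_toy = np.array([[0.6, 50.0]])

# Learn the statistics from the training data only, then reuse them for the test data
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Computing the mean and standard deviation on the training rows alone avoids leaking information from the test set into the model.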

In [63]:
# Create the model; a BatchNormalization layer normalizes the features
nn_model = keras.Sequential()

# The input layer comes first so that the following layers know the input shape
input_layer = keras.layers.InputLayer(input_shape=(len(ft_use),), name='input')
nn_model.add(input_layer)

normalization_layer = keras.layers.BatchNormalization()
nn_model.add(normalization_layer)

hidden_layer_1 = keras.layers.Dense(units=64, activation='relu')
nn_model.add(hidden_layer_1)

hidden_layer_2 = keras.layers.Dense(units=32, activation='relu')
nn_model.add(hidden_layer_2)

hidden_layer_3 = keras.layers.Dense(units=16, activation='relu')
nn_model.add(hidden_layer_3)

# A single linear output unit for regression
output_layer = keras.layers.Dense(units=1, name='output')
nn_model.add(output_layer)

sgd_optimizer = keras.optimizers.SGD(learning_rate=0.1)
# Train on mean squared logarithmic error and track plain MSE as a metric
loss_fn = keras.losses.MeanSquaredLogarithmicError(
    reduction="sum_over_batch_size", name="mean_squared_logarithmic_error"
)

nn_model.compile(optimizer=sgd_optimizer, loss=loss_fn, metrics=['mse'])
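
The model is compiled with a mean squared logarithmic error loss, while `metrics=['mse']` tracks the plain mean squared error. The two are different quantities. The pure-Python sketch below compares them on made-up, non-negative values; the helper functions are hypothetical and are not the Keras implementations.

```python
import math


def msle(y_true, y_pred):
    """Mean of (log(1 + y) - log(1 + y_hat))^2 for non-negative values."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [4.0, 5.0, 6.0]   # hypothetical targets
y_pred = [5.0, 6.0, 7.0]   # hypothetical predictions, each off by 1

print(mse(y_true, y_pred))
print(msle(y_true, y_pred))
```

The logarithmic loss penalizes relative rather than absolute differences, so the loss reported during training is not directly comparable with the error computed later by `mean_squared_error`.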

In [64]:
num_epochs = 100 # Number of epochs
t0 = time.time() # start time
history = nn_model.fit(X_train, y_train, epochs=num_epochs, verbose=1)# YOUR CODE HERE 
t1 = time.time() # stop time
print('Elapsed time: %.2fs' % (t1-t0))

Epoch 1/100
...
Epoch 78

In [65]:
nn_label_predictions = nn_model.predict(X_test)

In [66]:
# Taking the square root of the MSE gives the root mean squared error (RMSE)
rmse_nn_score = np.sqrt(mean_squared_error(y_test, nn_label_predictions))
r2_nn_score = r2_score(y_test, nn_label_predictions)
print("rmse score for neural network: {rmse}".format(rmse=rmse_nn_score))
print("r2 score for neural network: {r2}".format(r2=r2_nn_score))

rmse score for neural network: 0.3158557676196155
r2 score for neural network: 0.7831602697169349


**Conclusion:** The two models perform comparably on the test set. The Random Forest Regressor reaches an error of about 0.308 and an R2 of about 0.794, while the Neural Network reaches about 0.316 and 0.783. Note that the error reported here is the square root of the mean squared error (RMSE), since the evaluation cells apply `np.sqrt` to the MSE. Because the scores are so close, the Random Forest Regressor is preferable: it is faster to train and to predict with. The Neural Network was also tested with fewer epochs. With 10 epochs, the scores were 0.46 for the error and 0.54 for R2; with 1 epoch, they were 10.3 and -229.53.