# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(adultDataSet_filename, header=0)
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income_binary'],
      dtype='object')

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I have chosen the adult (census) data set, `censusData.csv`.

2. I will be predicting which type of employment people are in, based on education, age, and other factors. The label is workclass.

3. This will be a supervised learning problem. It will be a multi-class classification problem.

4. My features are age, hours-per-week, income_binary, occupation, education, fnlwgt, race, sex_selfID, capital-gain, capital-loss, marital-status, and native-country.

5. I think it is important because it could show that some geographic locations, ages, and education levels are more strongly associated with a specific work class. If there is a clear trend for some work classes, people could plan their education (or decide not to continue it) based on their desired work class.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
## dropping NaN values, then confirming that none remain
df = df.dropna()
nan_count = np.sum(df.isnull(), axis = 0)
nan_count

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex_selfID        0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income_binary     0
dtype: int64

In [4]:
## Regrouping marital status into Single/Married, then one-hot encoding it
single = ['Never-married', 'Divorced', 'Widowed', 'Separated']
married = ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']
# assign in place with .loc (no throwaway variable needed)
df.loc[df['marital-status'].isin(single), 'marital-status'] = 'Single'
df.loc[df['marital-status'].isin(married), 'marital-status'] = 'Married'
df_married_onehot = pd.get_dummies(df['marital-status'], prefix='marital-status')
df = df.join(df_married_onehot)
df.drop(columns='marital-status', inplace=True)
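
The regrouping above can also be written with a single mapping dictionary. The following is a minimal sketch on a toy DataFrame, not the census data; `toy`, `status_groups`, and `toy_encoded` are illustrative names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the category names match the cell above.
toy = pd.DataFrame({
    'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced',
                       'Married-AF-spouse', 'Widowed']
})

# Hypothetical helper dictionary: original category -> regrouped category.
status_groups = {
    'Never-married': 'Single', 'Divorced': 'Single',
    'Widowed': 'Single', 'Separated': 'Single',
    'Married-civ-spouse': 'Married', 'Married-spouse-absent': 'Married',
    'Married-AF-spouse': 'Married',
}

# Series.map replaces each value by its dictionary entry
# (values missing from the dictionary would become NaN).
toy['marital-status'] = toy['marital-status'].map(status_groups)

# One-hot encode the regrouped column; the original column is replaced.
toy_encoded = pd.get_dummies(toy, columns=['marital-status'], prefix='marital-status')
print(toy_encoded.columns.tolist())
```

One trade-off of `map` is that any category left out of the dictionary silently turns into NaN, so it is worth checking for missing values afterwards.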


In [5]:
## dropping relationship column because it is redundant
df.drop(columns='relationship', inplace=True)

In [6]:
## One hot encoding race column
df_race_onehot = pd.get_dummies(df['race'], prefix='race')
df = df.join(df_race_onehot)
df.drop(columns = 'race', inplace=True)

In [7]:
## Dropping the text education column; education-num already encodes education level numerically
df.drop(columns = 'education', inplace=True)

In [8]:
## One hot encoding occupation
df_occupation_onehot = pd.get_dummies(df['occupation'], prefix='occupation')
df = df.join(df_occupation_onehot)
df.drop(columns = 'occupation', inplace=True)

In [9]:
## One hot encoding sex
df_sex_onehot = pd.get_dummies(df['sex_selfID'], prefix='sex')
df = df.join(df_sex_onehot)
df.drop(columns = 'sex_selfID', inplace=True)

In [10]:
## Dropping 'South' because we can't tell which country that is
## Regrouping native-country into continents, then one-hot encoding it
df = df[df['native-country'] != 'South'].copy()

# Spellings below match the raw values in the data set
continents = {
    'North America': ['United-States', 'Cuba', 'Jamaica', 'Mexico', 'Puerto-Rico', 'Honduras', 'Canada', 'Dominican-Republic', 'El-Salvador', 'Guatemala', 'Haiti', 'Outlying-US(Guam-USVI-etc)', 'Trinadad&Tobago', 'Nicaragua'],
    'South America': ['Columbia', 'Ecuador', 'Peru'],
    'Europe': ['Germany', 'England', 'Poland', 'Portugal', 'France', 'Italy', 'Scotland', 'Yugoslavia', 'Greece', 'Ireland', 'Hungary', 'Holand-Netherlands'],
    'Asia': ['India', 'Iran', 'Philippines', 'Cambodia', 'Thailand', 'Laos', 'Taiwan', 'China', 'Japan', 'Vietnam', 'Hong'],
}
for continent, countries in continents.items():
    df.loc[df['native-country'].isin(countries), 'native-country'] = continent

df_nc_onehot = pd.get_dummies(df['native-country'], prefix='native-country')
df = df.join(df_nc_onehot)
df.drop(columns='native-country', inplace=True)

In [11]:
## One hot encoding income_binary
df_ib_onehot = pd.get_dummies(df['income_binary'], prefix='income')
df = df.join(df_ib_onehot)
df.drop(columns = 'income_binary', inplace=True)

In [12]:
wc_map = {
    'Private': 0,
    'Self-emp-not-inc': 1,
    'Local-gov': 2,
    'State-gov': 3,
    'Self-emp-inc': 4,
    'Federal-gov': 5,
    'Without-pay': 6
}

df['workclass'] = df['workclass'].map(wc_map)

In [13]:
df['workclass'].unique()

array([3, 1, 0, 5, 2, 4, 6])

In [14]:
df.dtypes

age                             float64
workclass                         int64
fnlwgt                            int64
education-num                     int64
capital-gain                      int64
capital-loss                      int64
hours-per-week                  float64
marital-status_Married            uint8
marital-status_Single             uint8
race_Amer-Indian-Inuit            uint8
race_Asian-Pac-Islander           uint8
race_Black                        uint8
race_Other                        uint8
race_White                        uint8
occupation_Adm-clerical           uint8
occupation_Armed-Forces           uint8
occupation_Craft-repair           uint8
occupation_Exec-managerial        uint8
occupation_Farming-fishing        uint8
occupation_Handlers-cleaners      uint8
occupation_Machine-op-inspct      uint8
occupation_Other-service          uint8
occupation_Priv-house-serv        uint8
occupation_Prof-specialty         uint8
occupation_Protective-serv        uint8


In [15]:
df.head(10)

Unnamed: 0,age,workclass,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,marital-status_Married,marital-status_Single,race_Amer-Indian-Inuit,...,occupation_Tech-support,occupation_Transport-moving,sex_Female,sex_Non-Female,native-country_Asia,native-country_Europe,native-country_North America,native-country_South America,income_<=50K,income_>50K
0,39.0,3,77516,13,2174,0,40.0,0,1,0,...,0,0,0,1,0,0,1,0,1,0
1,50.0,1,83311,13,0,0,13.0,1,0,0,...,0,0,0,1,0,0,1,0,1,0
2,38.0,0,215646,9,0,0,40.0,0,1,0,...,0,0,0,1,0,0,1,0,1,0
3,53.0,0,234721,7,0,0,40.0,1,0,0,...,0,0,0,1,0,0,1,0,1,0
4,28.0,0,338409,13,0,0,40.0,1,0,0,...,0,0,1,0,0,0,1,0,1,0
5,37.0,0,284582,14,0,0,40.0,1,0,0,...,0,0,1,0,0,0,1,0,1,0
6,49.0,0,160187,5,0,0,16.0,1,0,0,...,0,0,1,0,0,0,1,0,1,0
7,52.0,1,209642,9,0,0,45.0,1,0,0,...,0,0,0,1,0,0,1,0,0,1
8,31.0,0,45781,14,14084,0,50.0,0,1,0,...,0,0,1,0,0,0,1,0,0,1
9,42.0,0,159449,13,5178,0,40.0,1,0,0,...,0,0,0,1,0,0,1,0,0,1


In [16]:
df['workclass'].value_counts()

0    21907
1     2450
2     2041
3     1259
4     1055
5      921
6       14
Name: workclass, dtype: int64
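
The counts above show a strong class imbalance, so it helps to know what accuracy a model would get by always predicting the most common class. The sketch below recomputes that baseline from the counts printed above; `class_counts`, `class_share`, and `majority_baseline` are illustrative names. The baseline is computed on the full data set, so the value on a particular train/test split may differ slightly.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series(
    [21907, 2450, 2041, 1259, 1055, 921, 14],
    index=[0, 1, 2, 3, 4, 5, 6],
    name='workclass',
)

# Share of each class, and the accuracy of always predicting the largest one.
class_share = class_counts / class_counts.sum()
majority_baseline = class_share.max()

print(class_share.round(4))
print('Majority-class baseline accuracy: {0:.4f}'.format(majority_baseline))
```

The baseline comes out at roughly 74%, because class 0 (Private) makes up about that share of the examples. Any accuracy score in this notebook is easier to interpret when compared against this number.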

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

Yes, I am going to use a new feature list. 

I chose to keep most of the features.

I got rid of the relationship feature because it was very similar to the marital-status feature, which I kept and encoded.

I one-hot encoded income_binary, sex_selfID, race, and occupation.

For education level, I dropped the text education column and kept education-num, which already encodes each level as an ordered number.

For native country, I grouped each country by continent and then one-hot encoded the result.

For marital status, I grouped the values into single and married and then one-hot encoded the result.

I got rid of NaN values. For native-country, there was one value, 'South', that is not a country name, so I deleted all examples with that value.

My plan is to test several types of models: KNN, Decision Tree, Logistic Regression, Random Forest, GBDT, and a neural network. My goal in testing different types is to cover both simple and complex models. I think this is a fairly standard classification problem, but I am curious to see how the models will behave. Because the data is labeled, I will first train baseline versions of each model, see which performs best, and then fine-tune the best-performing one by searching over combinations of its most important hyperparameters.

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [17]:
import os
import time

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2" # suppress info and warning messages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, log_loss,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In total, I tested six different models. Here is how each of them went:

1. KNN

   Accuracy-wise, the KNN model was fairly successful. I optimized n_neighbors and found that a model with n_neighbors = 30 worked best (cross-validated accuracy of about 0.739).

2. Decision Tree

   This was the most successful model I created. I optimized max_depth and min_samples_leaf and found that max_depth = 8 and min_samples_leaf = 25 performed the best (cross-validated accuracy of about 0.750).

3. Logistic Regression

   I thought that this model would be good, but it never finished training. I kept getting a warning that the total number of iterations had reached the limit even though I had max_iter = 10000. My first thought was that the data set is too large for an LR model, but the warning itself points to scaling the data, so the unscaled features are a likely cause.

4. GB Decision Tree

   Was not successful at all. I tried multiple values for max_depth and every model I tried had a high RMSE and a low R2 score. Note that I used GradientBoostingRegressor here, which treats the workclass codes as numbers, so these scores are not directly comparable to the accuracy scores above.

5. Random Forest

   Same as GBDT, not successful. I tried different combinations of max_depth and n_estimators values with RandomForestRegressor, and the results were poor every time: high RMSE and low R2.

6. Neural Network

   This behaved very strangely. I tried to optimize a NN with 3 hidden layers using different values for the learning rate, and I got inconsistent results. Some accuracies and losses were really bad, and then with a slight change in learning rate accuracy would skyrocket and loss would equal NaN. A learning rate of 0.25 or 0.5 had the highest accuracy, but overall I wouldn't use a neural network for this problem based on how the model was behaving.

In [18]:
# creating my training and testing sets
y = df['workclass']
X = df.drop(columns='workclass', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1234)

In [19]:
print(X_train.shape)
print(X_test.shape)

(22235, 35)
(7412, 35)
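
Because the smallest workclass has only 14 examples, a purely random split could leave very few of them in the test set. One option is a stratified split, which `train_test_split` supports through its `stratify` argument. The sketch below uses synthetic labels (an assumption, not the census data) to show the effect; `X_toy`, `y_toy`, and the split names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the workclass codes.
rng = np.random.default_rng(1234)
y_toy = np.array([0] * 900 + [1] * 80 + [2] * 20)
X_toy = rng.normal(size=(len(y_toy), 3))

# stratify=y_toy keeps each class's share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1234, stratify=y_toy
)

# Per-class counts in the training and test splits.
print(np.bincount(y_tr), np.bincount(y_te))
```

With a 25% test size, each class contributes a quarter of its examples to the test split, so even the rarest class is represented.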


In [20]:
## TESTING A KNN MODEL

# values we want to test
kvals = [2, 3, 5, 7, 9, 13, 20, 30, 40, 50]
param_grid = {'n_neighbors': kvals}
param_grid

print('Running Grid Search...')

model = KNeighborsClassifier()
grid = GridSearchCV(model, param_grid, cv=5)
grid_search = grid.fit(X_train, y_train)

print('Done')

## finding the best n neighbors value
best_k = grid_search.best_estimator_.n_neighbors
print('Best value for k: {0}'.format(best_k))
print('Accuracy score: {0}'.format(grid_search.best_score_))

Running Grid Search...
Done
Best value for k: 30
Accuracy score: 0.7392849111760738


In [20]:
## TESTING A DECISION TREE MODEL

md = [2**n for n in range(2,6)]
msl = [25*2**n for n in range(0,3)]
param_grid={'max_depth':md, 'min_samples_leaf':msl}

# testing the values
print('Running Grid Search...')

model = DecisionTreeClassifier()
grid = GridSearchCV(model, param_grid, cv=5)
grid_search = grid.fit(X_train, y_train)

print('Done')

# finding the best values for min samples leaf and max depth
best_md = grid_search.best_estimator_.max_depth
best_msl = grid_search.best_estimator_.min_samples_leaf
print('Best value for max depth: {0}'.format(best_md))
print('Best value for the minimum number of samples required to be at a leaf node: {0}'
      .format(best_msl))
print('Accuracy score: {0}'.format(grid_search.best_score_))

Running Grid Search...
Done
Best value for max depth: 8
Best value for the minimum number of samples required to be at a leaf node: 25
Accuracy score: 0.7497638857656848
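
Accuracy alone can hide how a model treats the smaller classes. A confusion matrix and balanced accuracy (the mean of per-class recall) give a per-class view. The following is a hedged sketch on synthetic imbalanced data generated with `make_classification`, not on the census data; the tree hyperparameters mirror the best values reported above, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced 3-class data (an assumption; not the census data).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.15, 0.05], random_state=1234
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1234, stratify=y_syn
)

# Hyperparameters mirror the best values found by the grid search above.
tree = DecisionTreeClassifier(max_depth=8, min_samples_leaf=25, random_state=1234)
tree.fit(X_tr, y_tr)
y_pred = tree.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_pred)
bal_acc = balanced_accuracy_score(y_te, y_pred)
print(cm)
print('Balanced accuracy: {0:.3f}'.format(bal_acc))
```

If most of the off-diagonal mass sits in the first column, the model is mainly predicting the majority class, which plain accuracy would not reveal.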


In [21]:
## TESTING A LOGISTIC REGRESSION MODEL

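cvals = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

# collect (C, log loss, accuracy) for every value of C
results = []

for c in cvals:
    model = LogisticRegression(penalty='l2', C=c, max_iter=10000)
    model.fit(X_train, y_train)

    # log loss is computed from the predicted class probabilities;
    # labels= guards against a class being absent from y_test
    probability_predictions = model.predict_proba(X_test)
    l_loss = log_loss(y_test, probability_predictions, labels=model.classes_)

    # accuracy is computed from the predicted class labels
    class_label_predictions = model.predict(X_test)
    acc_score = accuracy_score(y_test, class_label_predictions)

    results.append((c, l_loss, acc_score))
    print('C = {0}: log loss = {1}, accuracy = {2}'.format(c, l_loss, acc_score))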

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


KeyboardInterrupt: 
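
The warning above suggests scaling the data. One way to try that is to put a `StandardScaler` in front of the logistic regression inside a scikit-learn pipeline, so the scaler is fit on the training data only. The sketch below is a minimal example on synthetic data, with one column inflated to mimic a wide-range feature such as fnlwgt; it is an assumption that scaling would resolve the convergence problem on the census data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (an assumption; not the census data).
X_syn, y_syn = make_classification(
    n_samples=1500, n_features=8, n_informative=5, n_classes=3,
    random_state=1234
)
# Inflate one column to mimic a feature with a very large range.
X_syn[:, 0] *= 1e5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1234
)

results = []
for c in [0.01, 1, 100]:
    # The scaler is fit on the training split only, then applied to the test split.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(C=c, max_iter=1000))
    pipe.fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_te)
    acc = accuracy_score(y_te, pipe.predict(X_te))
    results.append((c, log_loss(y_te, proba), acc))

for c, ll, acc in results:
    print('C = {0}: log loss = {1:.4f}, accuracy = {2:.4f}'.format(c, ll, acc))
```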

In [21]:
## TESTING GRADIENT BOOSTED DECISION TREES MODEL
## NOTE: this is a regressor, so the workclass codes are treated as ordered numbers
md = [2, 4, 8, 16, 32]
for item in md:
    print('Begin GBDT Implementation for Max Depth = ', str(item))
    
    gbdt_model = GradientBoostingRegressor(max_depth=item, n_estimators=300)
    gbdt_model.fit(X_train, y_train)
    
    y_gbdt_pred = gbdt_model.predict(X_test)
    
    gbdt_rmse = mean_squared_error(y_test, y_gbdt_pred, squared=False)
    
    gbdt_r2 = r2_score(y_test, y_gbdt_pred)
    
    print('[GBDT] Root Mean Squared Error: {0}'.format(gbdt_rmse))
    print('[GBDT] R2: {0}'.format(gbdt_r2)) 
    print('End')

Begin GBDT Implementation for Max Depth =  2




[GBDT] Root Mean Squared Error: 1.2152477819449725
[GBDT] R2: 0.12397359876649716
End
Begin GBDT Implementation for Max Depth =  4




[GBDT] Root Mean Squared Error: 1.2200581046557517
[GBDT] R2: 0.11702471172172857
End
Begin GBDT Implementation for Max Depth =  8




[GBDT] Root Mean Squared Error: 1.2602468777656142
[GBDT] R2: 0.05789614461990733
End
Begin GBDT Implementation for Max Depth =  16




[GBDT] Root Mean Squared Error: 1.3607608901972745
[GBDT] R2: -0.09837631180625284
End
Begin GBDT Implementation for Max Depth =  32
[GBDT] Root Mean Squared Error: 1.6062806917966501
[GBDT] R2: -0.5304896559980694
End
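
The GBDT cell above uses `GradientBoostingRegressor`, which predicts a number rather than a class, so RMSE and R2 measure how close the predictions are to the integer codes and not how often the class is right. For a multi-class label, scikit-learn also provides `GradientBoostingClassifier`, which can be scored with accuracy like the KNN and decision tree models. The sketch below shows the swap on synthetic data (an assumption, not the census data), with small hyperparameter values chosen only to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (an assumption; not the census data).
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_classes=3,
    random_state=1234
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=1234
)

# Classifier counterpart of the regressor used above.
gbdt_clf = GradientBoostingClassifier(max_depth=2, n_estimators=50, random_state=1234)
gbdt_clf.fit(X_tr, y_tr)

# Predictions are class labels, so accuracy is a meaningful score.
gbdt_pred = gbdt_clf.predict(X_te)
gbdt_acc = accuracy_score(y_te, gbdt_pred)
print('[GBDT classifier] Accuracy: {0:.4f}'.format(gbdt_acc))
```

The same swap applies to the random forest: `RandomForestClassifier` is the classification counterpart of `RandomForestRegressor`.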




In [25]:
## TESTING RANDOM FOREST
## NOTE: this is a regressor, so the workclass codes are treated as ordered numbers
md = [2, 4, 8, 16, 32]
num_estimators = [100, 200, 300, 400, 500]
for item in md:
    for j in num_estimators:
        print('Begin RF Implementationf for Max Depth =', str(item), 'and num estimators = ', str(j))
        
        rf_model = RandomForestRegressor(max_depth=item, n_estimators=j)
        rf_model.fit(X_train, y_train)
        y_rf_pred = rf_model.predict(X_test)
        rf_rmse = mean_squared_error(y_test, y_rf_pred, squared=False)

        rf_r2 = r2_score(y_test, y_rf_pred)
        
        print('[RF] Root Mean Squared Error: {0}'.format(rf_rmse))
        print('[RF] R2: {0}'.format(rf_r2)) 
        print('End')

Begin RF Implementationf for Max Depth = 2 and num estimators =  100




[RF] Root Mean Squared Error: 1.2619168209343332
[RF] R2: 0.05539774167614486
End
Begin RF Implementationf for Max Depth = 2 and num estimators =  200




[RF] Root Mean Squared Error: 1.2620788513979133
[RF] R2: 0.05515515172824115
End
Begin RF Implementationf for Max Depth = 2 and num estimators =  300




[RF] Root Mean Squared Error: 1.2624499915622391
[RF] R2: 0.054599368018381056
End
Begin RF Implementationf for Max Depth = 2 and num estimators =  400




[RF] Root Mean Squared Error: 1.262289334537457
[RF] R2: 0.05483997254047657
End
Begin RF Implementationf for Max Depth = 2 and num estimators =  500




[RF] Root Mean Squared Error: 1.2621935066507848
[RF] R2: 0.054983472525091615
End
Begin RF Implementationf for Max Depth = 4 and num estimators =  100




[RF] Root Mean Squared Error: 1.23815759183017
[RF] R2: 0.09063262205332501
End
Begin RF Implementationf for Max Depth = 4 and num estimators =  200




[RF] Root Mean Squared Error: 1.2378221185702005
[RF] R2: 0.09112533334038264
End
Begin RF Implementationf for Max Depth = 4 and num estimators =  300




[RF] Root Mean Squared Error: 1.2382383603853713
[RF] R2: 0.09051397692225405
End
Begin RF Implementationf for Max Depth = 4 and num estimators =  400




[RF] Root Mean Squared Error: 1.2382788628112056
[RF] R2: 0.09045447788906369
End
Begin RF Implementationf for Max Depth = 4 and num estimators =  500




[RF] Root Mean Squared Error: 1.238045181574836
[RF] R2: 0.09079773444213535
End
Begin RF Implementationf for Max Depth = 8 and num estimators =  100




[RF] Root Mean Squared Error: 1.2205524760161741
[RF] R2: 0.1163089980487696
End
Begin RF Implementationf for Max Depth = 8 and num estimators =  200




[RF] Root Mean Squared Error: 1.2209937411029512
[RF] R2: 0.11566992274150112
End
Begin RF Implementationf for Max Depth = 8 and num estimators =  300




[RF] Root Mean Squared Error: 1.2204988768877074
[RF] R2: 0.11638660884958096
End
Begin RF Implementationf for Max Depth = 8 and num estimators =  400




[RF] Root Mean Squared Error: 1.2210708129461765
[RF] R2: 0.11555827745145819
End
Begin RF Implementationf for Max Depth = 8 and num estimators =  500




[RF] Root Mean Squared Error: 1.2212032133904174
[RF] R2: 0.11536646740355805
End
Begin RF Implementationf for Max Depth = 16 and num estimators =  100




[RF] Root Mean Squared Error: 1.2416753476896776
[RF] R2: 0.0854580356827136
End
Begin RF Implementationf for Max Depth = 16 and num estimators =  200




[RF] Root Mean Squared Error: 1.2404681293538184
[RF] R2: 0.08723549727656144
End
Begin RF Implementationf for Max Depth = 16 and num estimators =  300




[RF] Root Mean Squared Error: 1.2397903823563003
[RF] R2: 0.0882326279400294
End
Begin RF Implementationf for Max Depth = 16 and num estimators =  400




[RF] Root Mean Squared Error: 1.2386814400478596
[RF] R2: 0.08986297648259367
End
Begin RF Implementationf for Max Depth = 16 and num estimators =  500




[RF] Root Mean Squared Error: 1.2391196662655792
[RF] R2: 0.08921887793511829
End
Begin RF Implementationf for Max Depth = 32 and num estimators =  100




[RF] Root Mean Squared Error: 1.2714252218312632
[RF] R2: 0.041109169407761104
End
Begin RF Implementationf for Max Depth = 32 and num estimators =  200




[RF] Root Mean Squared Error: 1.2659066880808856
[RF] R2: 0.049415103965195994
End
Begin RF Implementationf for Max Depth = 32 and num estimators =  300




[RF] Root Mean Squared Error: 1.26709191970533
[RF] R2: 0.04763426068164589
End
Begin RF Implementationf for Max Depth = 32 and num estimators =  400


KeyboardInterrupt: 

In [None]:
## TESTING A NEURAL NETWORK
def createModel():
    nn_model = keras.Sequential()
    
    input_layer = keras.layers.InputLayer(input_shape=(X_train.shape[1],))
    
    nn_model.add(input_layer)
    
    hidden_layer_1 = keras.layers.Dense(units=64, activation='relu')
    nn_model.add(hidden_layer_1)
    
    hidden_layer_2 = keras.layers.Dense(units=32, activation='relu')
    nn_model.add(hidden_layer_2)
    
    hidden_layer_3 = keras.layers.Dense(units=16, activation='relu')
    nn_model.add(hidden_layer_3)
    
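    # NOTE: one sigmoid unit with binary cross-entropy assumes a 0/1 label,
    # but workclass has seven classes (codes 0-6)
    output_layer = keras.layers.Dense(units=1, activation='sigmoid')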
    nn_model.add(output_layer)
    return nn_model

lrs = [0.05, 0.1, 0.25, 0.5, 0.75]
loss_fn = keras.losses.BinaryCrossentropy(from_logits=False)
for item in lrs:
    nn_model2 = createModel()
    sgd_optimizer = keras.optimizers.SGD(learning_rate=item)
    nn_model2.compile(optimizer=sgd_optimizer, loss=loss_fn, metrics=['accuracy'])
    print("LEARNING RATE = ", str(item))
    t0 = time.time() # start time

    num_epochs = 100 # epochs

    history = nn_model2.fit(X_train, y_train, epochs=num_epochs, verbose=0, callbacks=[ProgBarLoggerNEpochs(num_epochs, every_n=5)], validation_split=0.2)

    
    t1 = time.time() # stop time
    print('Elapsed time: %.2fs' % (t1-t0))
    print("\n")
    tf.keras.backend.clear_session()

LEARNING RATE =  0.05
Epoch [5/ 100], Loss: 0.6471, Accuracy: 0.0828, Val_loss: 0.6609, Val_accuracy: 0.0852
Epoch [10/ 100], Loss: 0.6474, Accuracy: 0.0828, Val_loss: 0.6614, Val_accuracy: 0.0852
Epoch [15/ 100], Loss: 0.6478, Accuracy: 0.0828, Val_loss: 0.6598, Val_accuracy: 0.0852
Epoch [20/ 100], Loss: 0.6478, Accuracy: 0.0828, Val_loss: 0.6607, Val_accuracy: 0.0852
Epoch [25/ 100], Loss: 0.6476, Accuracy: 0.0828, Val_loss: 0.6609, Val_accuracy: 0.0852
Epoch [30/ 100], Loss: 0.6477, Accuracy: 0.0828, Val_loss: 0.6603, Val_accuracy: 0.0852
Epoch [35/ 100], Loss: 0.6472, Accuracy: 0.0828, Val_loss: 0.6594, Val_accuracy: 0.0852
Epoch [40/ 100], Loss: 0.6476, Accuracy: 0.0828, Val_loss: 0.6609, Val_accuracy: 0.0852
Epoch [45/ 100], Loss: 0.6473, Accuracy: 0.0828, Val_loss: 0.6630, Val_accuracy: 0.0852
Epoch [50/ 100], Loss: 0.6471, Accuracy: 0.0828, Val_loss: 0.6617, Val_accuracy: 0.0852
Epoch [55/ 100], Loss: 0.6476, Accuracy: 0.0828, Val_loss: 0.6612, Val_accuracy: 0.0852
Epoch [60/ 
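
The network above ends in a single sigmoid unit trained with binary cross-entropy, while the label has seven classes (codes 0 to 6). The constant accuracy of about 0.083 in the log would be consistent with the network always predicting 1, since class 1 makes up roughly that share of the data, though this is an interpretation and not something the log proves. The NumPy sketch below illustrates why the loss is ill-defined in this setup: binary cross-entropy can become negative and unbounded when labels exceed 1, whereas a softmax output with sparse categorical cross-entropy stays well-behaved. The function and variable names are illustrative, and the numbers are toy values.

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """Mean binary cross-entropy; only meaningful when y_true is 0 or 1."""
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def sparse_categorical_cross_entropy(y_true, logits):
    """Mean cross-entropy between integer class codes and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(np.mean(-log_probs[np.arange(len(y_true)), y_true]))

# Toy workclass codes, including values larger than 1.
y_codes = np.array([0, 1, 3, 5, 6])

# A single sigmoid output pushed toward 1 for every example.
p_sigmoid = np.full(len(y_codes), 0.999)
bce = binary_cross_entropy(y_codes, p_sigmoid)

# One logit per class (7 classes); all zeros means a uniform prediction.
logits = np.zeros((len(y_codes), 7))
scce = sparse_categorical_cross_entropy(y_codes, logits)

print('Binary cross-entropy on class codes: {0:.4f}'.format(bce))
print('Sparse categorical cross-entropy:    {0:.4f}'.format(scce))
```

In Keras, the multi-class counterpart would be an output layer such as `keras.layers.Dense(units=7, activation='softmax')` combined with `keras.losses.SparseCategoricalCrossentropy()`, which accepts integer class codes directly.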

In [22]:
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    
    def __init__(self, num_epochs: int, every_n: int = 50):
        super().__init__()  # initialize the base Callback state
        self.num_epochs = num_epochs
        self.every_n = every_n
    
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v)
                      for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))
