<h1>Rocks or Mines</h1>

<h3>About the data</h3>
The data contains sonar signal data collected after they are bounced off two kinds of objects (underwater). The objects are either rocks or mines and the sonar signals are sent at 60 different frequencies. The value returned is then recorded. The goal of the exercise is to build a model that can figure out whether an object is a rock or a mine

<li>Dataset: 208 samples of sonar signals bounced off either a cylindrical metal cylinder (mine) or a cylinrical rock (rock)</li>
<li>Train a model to distinguish between a rock and a mine</li>
<li>We'll compare the performance of a random forest classifier, and a neural network classifier</li>

<h2>Prep the model results report</h2>


In [80]:
import pandas as pd
import numpy as np
results_df = pd.DataFrame(np.zeros(shape=(3,8)))
results_df.index=[1,2,3]
results_df.columns = ["description","training accuracy","testing accuracy","precision","recall","f1_score","AUC","AP"]
results_df.index.rename("Model",inplace=True)
results_df.description = ["Random Forest","Neural Net","Optimized Neural Net"]
results_df

Unnamed: 0_level_0,description,training accuracy,testing accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,Random Forest,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Neural Net,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Optimized Neural Net,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h2>Report the results here</h2>
<li>Do this after you've run all the models</li>


In [112]:
#Run your models below and then return to this cell to report your results
results_df

Unnamed: 0_level_0,description,training accuracy,testing accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,Random Forest,0.89759,0.761905,0.772727,0.772727,0.772727,0.886364,0.716155
2,Neural Net,1.0,0.761905,0.765368,0.761905,0.761905,0.820455,0.724675
3,Optimized Neural Net,1.0,0.785714,0.79307,0.785714,0.78535,0.856818,0.755297


<h2>Discuss your results</h2>
<li>Look at the various scores (e.g., AUC, AP, Accuracy, Precision) and try to explain the differences</li>
<li>Ask chatGPT to help. For example, you can ask it a question like "I'm analyzing the rocks vs. mines dataset. Why does my random forest model reports a higher AUC than my neural network model?"</li>
<li>Explore other questions. chatGPT may give you more detail than you need. Read the response, internalize it, and then, briefly summarize it (e.g., the random forest reports a higher AUC than the NN because ....)</li>

<h3>Summary of Results</h3>
<li>The random forest model performed fairly well across all scoring metrics. The testing accuracy was about 76%, and the model had an AUC score of .886. This AUC score was higher than the AUC score for either of the neural network models, so by that metric the random forest was the best model. </li>
<li>Model 2, the first neural network, performed the same or worse than the random forest across every metric except average precision. Overfitting appears to have been an issue, as the model reported a training accuracy of 100% but a testing accuracy of only 76%. Another issue could be that the rocks and mine dataset is quite small, as neural networks generally perform best on larger datasets</li>
<li>The optimized neural network saw better results than the first neural network across every metric. This makes sense as we would expect improved results given that we used a grid search to tune the hyperparameters. Overfitting still appears to be an issue, though, given that the model performed with 100% accuracy on the training data but only 78.6% accuracy on the testing data. However, model 3 nonetheless had better results than the random forest across all metrics except AUC score.</li>



<h1>Build the models below this cell</h1>

<h3>Get the data</h3>

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
df = fetch_openml('sonar')

<h2>Split the data into training and testing</h2>
<li>Set mines as 0 and rocks as 1</li>
<li>split the data into 20% testing and 80% training</li>
<li>x_train, y_train, x_test, y_test</li>

In [83]:
X, y = df.data, df.target
y = y.str.replace('Rock','1').str.replace('Mine', '0').astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)

<h2>Run cross validation grid search to find the best random forest model</h2>

In [84]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {
     'n_estimators':(25, 50, 100), #the number of trees
     'min_samples_split': (50, 100, 200),
    'class_weight': [{1:1},'balanced'],
     'min_samples_leaf': (5,10,20) #
}
gs_clf = GridSearchCV(RandomForestClassifier(random_state=42),parameters,cv=5,n_jobs=-1,
                      scoring='f1')
gs_clf.fit(X_train, np.ravel(y_train))

<h2>Get the best estimator and run it</h2>

In [85]:
gs_clf.best_estimator_

In [86]:
gs_clf.best_params_

{'class_weight': 'balanced',
 'min_samples_leaf': 10,
 'min_samples_split': 50,
 'n_estimators': 100}

In [87]:
model_1 = RandomForestClassifier(class_weight='balanced', min_samples_leaf=10, min_samples_split=50,n_estimators=100)
model_1.fit(X_train,y_train)


<h3>Report the results and add them to results_df</h3>

In [88]:
from sklearn.metrics import confusion_matrix,f1_score,precision_score,recall_score
from sklearn.metrics import roc_auc_score,average_precision_score
test_pred = model_1.predict(X_test)
cfm = confusion_matrix(y_test,test_pred)
accuracy_training=model_1.score(X_train,y_train)
accuracy_testing=model_1.score(X_test,y_test)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
print("confusion matrix:")
print(cfm)
f1 = f1_score(y_test,test_pred)
precision = precision_score(y_test,test_pred)
recall = recall_score(y_test,test_pred)
auc = roc_auc_score(y_test,model_1.predict_proba(X_test)[:,1])

ap = average_precision_score(y_test,test_pred)
print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)


Training accuracy:  0.8975903614457831
Testing  accuracy:  0.7619047619047619
confusion matrix:
[[15  5]
 [ 5 17]]
precision:  0.7727272727272727
recall:  0.7727272727272727
f1 score:  0.7727272727272727
auc 0.8863636363636362
ap 0.716155057064148


<h2>Update results table</h2>

In [89]:
results_df.loc[1,'training accuracy'] = accuracy_training
results_df.loc[1,'testing accuracy'] = accuracy_testing

results_df.loc[1,'precision'] = precision
results_df.loc[1,'recall'] = recall
results_df.loc[1,'f1_score'] = f1
results_df.loc[1,"AUC"] = auc
results_df.loc[1,"AP"] = ap
results_df

Unnamed: 0_level_0,description,training accuracy,testing accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,Random Forest,0.89759,0.761905,0.772727,0.772727,0.772727,0.886364,0.716155
2,Neural Net,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Optimized Neural Net,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h2>Build the neural network</h2>

<li>the output layer in the NN can have multiple values</li>
<li>Use one hot encoding to build an output layer of two nodes (0 and 1)</li>

<h2>Converting the y values into numbers</h2>
<li>In our regression example, we used 0 for rocks and 1 for mines
<li>sklearn has a LabelEncoder that will replace text with numbered labels

In [90]:
#Use this cell to suppress the many warnings you'll get during grid search
import warnings
warnings.filterwarnings('ignore')

<h1>Building a basic neural net</h1>
<li>You need to decide:
<ol>
<li>Number of hidden layers
<li>Number of nodes in each hidden layer
<li>Number of nodes in the input layer
<li>Number of nodes in the output layer
<li>Number of training passes (epochs)
<li>Activation function to use



<h3>Brief explanation of Neural Net Parameters</h3>
<li><b>solver</b>: sgd (stochastic gradient descent), lbfgs (limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm), adam (stochastic gradient based optimizer)
<li><b>activation</b>: logistic (sigmoid), tanh (hyperbolic tan function), relu (linear unit function). relu returns max(0,x) and works better on two class dependent variables (we don't want both returned
<li><b>alpha</b>: L2 regularization term. Regularization is used to prevent overfitting by not using the exact loss (difference between predicted and actual) when adjusting the weights (in a neural network model). L2 adds the sum of the square of the weights modified by a lambda parameter to each delta
<li><b>batch size</b>: Number of cases to use in one epoch. 
<li><b>momentum</b>: A number between 0 and 1 that accelerates a gradient descent (e.g., sigmoid) algorithm if it is moving in the right (consistent) direction
<li><b>shuffle</b>: shuffle the samples in each iteration (the order in which they are presented will change
<li><b>tol</b>: if the improvement is less than this, the algorithm stops
<li><b>Learning rate</b> A hyper parameter that controls how much weights should be adjusted after each epoch</li>
<ul>
<li>Too low, the model will take a long time to converge (expensive GPU cost)
<li>Too high, the model may never converge
<li>Bit of guesswork goes into this (e.g., start low, slowly increase the rate, see how the loss changes (loss = prediction error), and adjust the rate accordingly
    <li>Setting it to constant is the default and the easiest place to start</li>
</ul>

<h3>Basic NN model</h3>
<li>See <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html">https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html</a></li>
<li>We'll start with one hidden layer of 60 nodes (1-1 correspondence with the input layer)
<li>Use lbfgs as the solver</li>
<li>An alpha of 0.00001</li>
<li>Epochs (the max_iter parameter) set to 500</li>


<h3>Training and testing</h3>

In [91]:
from sklearn.neural_network import MLPClassifier

model_2 = MLPClassifier(solver='lbfgs', alpha=.00001, hidden_layer_sizes=(60,), max_iter=500)
model_2.fit(X_train,y_train)

In [92]:
from sklearn.metrics import confusion_matrix,f1_score,precision_score,recall_score
from sklearn.metrics import roc_auc_score,average_precision_score
test_pred = model_2.predict(X_test)
accuracy_training=model_2.score(X_train,y_train)
accuracy_testing=model_2.score(X_test,y_test)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
f1 = f1_score(y_test,test_pred,average="weighted")
precision = precision_score(y_test,test_pred,average="weighted")
recall = recall_score(y_test,test_pred,average="weighted")
auc = roc_auc_score(y_test,model_2.predict_proba(X_test)[:,1])

ap = average_precision_score(y_test,test_pred)
print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)


Training accuracy:  1.0
Testing  accuracy:  0.7619047619047619
precision:  0.7653679653679655
recall:  0.7619047619047619
f1 score:  0.7619047619047619
auc 0.8204545454545455
ap 0.7246753246753248


In [93]:
results_df.loc[2,'training accuracy'] = accuracy_training
results_df.loc[2,'testing accuracy'] = accuracy_testing

results_df.loc[2,'precision'] = precision
results_df.loc[2,'recall'] = recall
results_df.loc[2,'f1_score'] = f1
results_df.loc[2,"AUC"] = auc
results_df.loc[2,"AP"] = ap
results_df

Unnamed: 0_level_0,description,training accuracy,testing accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,Random Forest,0.89759,0.761905,0.772727,0.772727,0.772727,0.886364,0.716155
2,Neural Net,1.0,0.761905,0.765368,0.761905,0.761905,0.820455,0.724675
3,Optimized Neural Net,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h2>Grid search</h2>
<li>Use grid search to find a better NN model</li>
<li>Note that this is open-ended. Use as many or as few values as you like</li>
<li>Refer to the documentation for help</li>

<h3>Brief explanation of Neural Net Parameters</h3>
<li><b>solver</b>: sgd (stochastic gradient descent), lbfgs (limited memory Broyden–Fletcher–Goldfarb–Shanno algorithm), adam (stochastic gradient based optimizer)
<li><b>activation</b>: logistic (sigmoid), tanh (hyperbolic tan function), relu (linear unit function). relu returns max(0,x) and works better on two class dependent variables (we don't want both returned
<li><b>alpha</b>: L2 regularization term. Regularization is used to prevent overfitting by not using the exact loss (difference between predicted and actual) when adjusting the weights (in a neural network model). L2 adds the sum of the square of the weights modified by a lambda parameter to each delta
<li><b>batch size</b>: Number of cases to use in one epoch. 
<li><b>momentum</b>: A number between 0 and 1 that accelerates a gradient descent (e.g., sigmoid) algorithm if it is moving in the right (consistent) direction
<li><b>shuffle</b>: shuffle the samples in each iteration (the order in which they are presented will change
<li><b>tol</b>: if the improvement is less than this, the algorithm stops
<li><b>Learning rate</b> A hyper parameter that controls how much weights should be adjusted after each epoch</li>
<ul>
<li>Too low, the model will take a long time to converge (expensive GPU cost)
<li>Too high, the model may never converge
<li>Bit of guesswork goes into this (e.g., start low, slowly increase the rate, see how the loss changes (loss = prediction error), and adjust the rate accordingly
    <li>Setting it to constant is the default and the easiest place to start</li>
</ul>

In [106]:
from sklearn.model_selection import GridSearchCV

parameters = {
     'solver':('lbfgs', 'sgd', 'adam'),
     'activation': ('logistic', 'tanh', 'relu'),
    'alpha': (.0001, .00001, .000005),
     'max_iter': (100, 300, 500),
     'hidden_layer_sizes': [(60,),(60,30),(100,)]
}
gs_clf = GridSearchCV(MLPClassifier(random_state=5),parameters,cv=3,n_jobs=-1,
                      scoring='f1')
gs_clf.fit(X_train, np.ravel(y_train))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("

<h2>Get the best estimator and apply it</h2>
<li>Note that you may get different values of the best estimator at different times!</li>

In [107]:
gs_clf.best_estimator_

In [108]:
gs_clf.best_params_

{'activation': 'relu',
 'alpha': 1e-05,
 'hidden_layer_sizes': (60, 30),
 'max_iter': 300,
 'solver': 'adam'}

In [109]:
model_3 = MLPClassifier(activation='relu', solver='adam', alpha=.000005, hidden_layer_sizes=(60,30), max_iter=300)
model_3.fit(X_train,y_train)

<h2>Get the metrics and update results_df</h2>
<b>Then, go to the top of the notebook and answer the question!</b>

In [110]:
from sklearn.metrics import roc_auc_score,accuracy_score,precision_score,recall_score,f1_score,average_precision_score
test_pred = model_3.predict(X_test)
accuracy_training=model_3.score(X_train,y_train)
accuracy_testing=model_3.score(X_test,y_test)
print("Training accuracy: ",accuracy_training)
print("Testing  accuracy: ",accuracy_testing)
f1 = f1_score(y_test,test_pred,average="weighted")
precision = precision_score(y_test,test_pred,average="weighted")
recall = recall_score(y_test,test_pred,average="weighted")
auc = roc_auc_score(y_test,model_3.predict_proba(X_test)[:,1])

ap = average_precision_score(y_test,test_pred)
print("precision: ",precision)
print("recall: ",recall)
print("f1 score: ",f1)
print("auc",auc)
print("ap",ap)

Training accuracy:  1.0
Testing  accuracy:  0.7857142857142857
precision:  0.7930696305982347
recall:  0.7857142857142857
f1 score:  0.7853496475164088
auc 0.8568181818181818
ap 0.7552973342447027


In [111]:
results_df.loc[3,'training accuracy'] = accuracy_training
results_df.loc[3,'testing accuracy'] = accuracy_testing

results_df.loc[3,'precision'] = precision
results_df.loc[3,'recall'] = recall
results_df.loc[3,'f1_score'] = f1
results_df.loc[3,"AUC"] = auc
results_df.loc[3,"AP"] = ap
results_df

Unnamed: 0_level_0,description,training accuracy,testing accuracy,precision,recall,f1_score,AUC,AP
Model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,Random Forest,0.89759,0.761905,0.772727,0.772727,0.772727,0.886364,0.716155
2,Neural Net,1.0,0.761905,0.765368,0.761905,0.761905,0.820455,0.724675
3,Optimized Neural Net,1.0,0.785714,0.79307,0.785714,0.78535,0.856818,0.755297
