# Homework 2

Jiss Xavier, 916427256

For this assignment, you will be developing an artificial neural network to classify data given in the __[Dry Beans Data Set](https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset#)__. This data set was obtained as a part of a research study by Selcuk University, Turkey, in which a computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features. More details on the study can be found in the following __[research paper](https://www.sciencedirect.com/science/article/pii/S0168169919311573)__.

## About the Data Set
Seven different types of dry beans were used in a study in Selcuk University, Turkey, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13611 grains of 7 different registered dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features - 12 dimensions and 4 shape forms - were obtained from the grains.

Number of Instances (records in the data set): __13611__

Number of Attributes (fields within each record, including the class): __17__

### Data Set Attribute Information:

1. __Area (A)__ : The area of a bean zone and the number of pixels within its boundaries.
2. __Perimeter (P)__ : Bean circumference is defined as the length of its border.
3. __Major axis length (L)__ : The distance between the ends of the longest line that can be drawn from a bean.
4. __Minor axis length (l)__ : The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. __Aspect ratio (K)__ : Defines the relationship between L and l.
6. __Eccentricity (Ec)__ : Eccentricity of the ellipse having the same moments as the region.
7. __Convex area (C)__ : Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. __Equivalent diameter (Ed)__ : The diameter of a circle having the same area as a bean seed area.
9. __Extent (Ex)__ : The ratio of the pixels in the bounding box to the bean area.
10. __Solidity (S)__ : Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. __Roundness (R)__ : Calculated with the following formula: (4piA)/(P^2)
12. __Compactness (CO)__ : Measures the roundness of an object: Ed/L
13. __ShapeFactor1 (SF1)__
14. __ShapeFactor2 (SF2)__
15. __ShapeFactor3 (SF3)__
16. __ShapeFactor4 (SF4)__

17. __Classes : *Seker, Barbunya, Bombay, Cali, Dermosan, Horoz, Sira*__

### Libraries that can be used :
- NumPy, SciPy, Pandas, Sci-Kit Learn, TensorFlow, Keras
- Any other library used during the lectures and discussion sessions.

### Other Notes
- Don't worry about not being able to achieve high accuracy, it is neither the goal nor the grading standard of this assignment.
- Discussion materials should be helpful for doing the assignments.
- The homework submission should be a .ipynb file.


Worked on high level concepts with Tejes Srivastava


## Exercise 1 : Building a Feed-Forward Neural Network(50 points)

### Exercise 1.1 : Data Preprocessing (10 points)

- As the classes are categorical, use one-hot encoding to represent the set of classes. You will find this useful when developing the output layer of the neural network.
- Normalize each field of the input data using the min-max normalization technique.

### Exercise 1.2 : Training and Testing the Neural Network (40 points)

Design a 4-layer artificial neural network, specifically a feed-forward multi-layer perceptron (using the sigmoid activation function), to classify the type of 'Dry Bean' given the other attributes in the data set, similar to the one mentioned in the paper above. Please note that this is a multi-class classification problem so select the right number of nodes accordingly for the output layer.

For training and testing the model, split the data into training and testing set by __90:10__ and use the training set for training the model and the test set to evaluate the model performance.

Consider the following hyperparameters while developing your model :

- Number of nodes in each hidden layer should be (12, 3)
- Learning rate should be 0.3
- Number of epochs should be 500
- The sigmoid function should be used as the activation function in each layer
- Stochastic Gradient Descent should be used to minimize the error rate

__Requirements once the model has been trained :__

- A confusion matrix for all classes, specifying the true positive, true negative, false positive, and false negative cases for each category in the class
- The accuracy and mean squared error (MSE) of the model
- The precision and recall for each label in the class

__Notes :__

- Splitting of the dataset should be done __after__ the data preprocessing step.
- The mean squared error (MSE) values obtained __should be positive__.


In [2]:
#Based off One Hot Encoding example from 10/08 Discussion

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder 
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

#Read in the file
beans_df = pd.read_csv('./Dry_Beans_Dataset.csv')

#Label Encoding
myData_encoder = LabelEncoder()
myData_encoded = myData_encoder.fit_transform(beans_df['Class'])
print(myData_encoded)


#One Hot Encoding
onehot_encoder = OneHotEncoder(sparse=False)
# reshape the array
myData_encoded = myData_encoded.reshape(len(myData_encoded), 1) 
onehot_encoded = onehot_encoder.fit_transform(myData_encoded)

print(onehot_encoded)

#Normalizing each field of the input data using the min-max normalization technique
sc_X = MinMaxScaler()
X_scaled = sc_X.fit_transform(beans_df.drop(columns=['Class']))

[6 3 3 ... 6 4 6]
[[0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]]


In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.metrics import multilabel_confusion_matrix


#Split the data to training and test
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, onehot_encoded, test_size=0.1)

model = Sequential()
#Number of nodes in the first hidden layer
model.add(Dense(12, input_dim=16, activation='sigmoid'))
#Number of nodes in the second hidden layer
model.add(Dense(3, activation='sigmoid'))
#Maps to the 7 different types of bean
model.add(Dense(7, activation='sigmoid'))

#From Discussion 10/15
model.compile(loss='mean_squared_error',
              optimizer=SGD(learning_rate=0.3),
              metrics=['accuracy', 'mse'])

#From Discussion 10/15
model.fit(X_train,
          Y_train, 
          epochs=500, 
          verbose=0)

#Find training/testing accuracy and MSE
_, training_accuracy, training_mse = model.evaluate(X_train, Y_train, verbose=0)
_, testing_accuracy, testing_mse = model.evaluate(X_test, Y_test, verbose=0)
Y_pred = model.predict(X_test)

#Print the MSE and Accuracy of the models
print('MSE of training model:', training_mse)
print('MSE of testing model:', testing_mse)
print('Accuracy of training model:', training_accuracy)
print('Accuracy of testing model:', testing_accuracy)

#https://scikit-learn.org/stable/modules/model_evaluation.html
for i in multilabel_confusion_matrix(Y_test.argmax(axis=1), Y_pred.argmax(axis=1)):
    print(i)
    print('Recall:', (i[0][0])/(i[0][0] + i[0][1]))
    print('Precision:',(i[0][0])/(i[0][0] + i[1][0]))
    
#Formula for recall and precision: https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html

MSE of training model: 0.02512207254767418
MSE of testing model: 0.025625422596931458
Accuracy of training model: 0.8766430020332336
Accuracy of testing model: 0.8744493126869202
[[1166   50]
 [  15  131]]
Recall: 0.9588815789473685
Precision: 0.9872988992379339
[[1304    0]
 [  58    0]]
Recall: 1.0
Precision: 0.9574155653450808
[[1158   36]
 [  12  156]]
Recall: 0.9698492462311558
Precision: 0.9897435897435898
[[976  36]
 [ 25 325]]
Recall: 0.9644268774703557
Precision: 0.975024975024975
[[1155    6]
 [   9  192]]
Recall: 0.9948320413436692
Precision: 0.9922680412371134
[[1179    8]
 [  11  164]]
Recall: 0.9932603201347936
Precision: 0.9907563025210084
[[1063   35]
 [  41  223]]
Recall: 0.9681238615664846
Precision: 0.9628623188405797


## Exercise 2 : k-fold Cross Validation (20 points)

In order to avoid using biased models, use 10-fold cross validation to generalize the model based on the given data set.

__Requirements :__
- The accuracy and MSE values during each iteration of the cross validation
- The overall average accuracy and MSE value

__Note :__ The mean squared error (MSE) values obtained should be positive.

In [15]:
#Takes around 20 min to compile and execute

from sklearn.model_selection import KFold

#initalize values
overall_accuracy = 0
overall_mse = 0
iterator = 0

#Create the model

model = Sequential()
#Number of nodes in the first hidden layer
model.add(Dense(12, input_dim=16, activation='sigmoid'))
#Number of nodes in the second hidden layer
model.add(Dense(3, activation='sigmoid'))
#Maps to the 7 different types of bean
model.add(Dense(7, activation='sigmoid'))

#From Discussion 10/15
model.compile(loss='mean_squared_error',
              optimizer=SGD(learning_rate=0.3),
              metrics=['accuracy', 'mse'])

#10 fold cross validation
kf = KFold(n_splits=10, random_state=None, shuffle=False)

#Loop through each iteration of the cross validation
for trainingIndex, testingIndex in kf.split(X_scaled, onehot_encoded):
    X_train, X_test = X_scaled[trainingIndex], X_scaled[testingIndex]
    y_train, y_test = onehot_encoded[trainingIndex], onehot_encoded[testingIndex]
    
    #From Discussion 10/15
    model.fit(X_train, 
              y_train, 
              epochs=500, 
              verbose=0)
    
    #Find testing accuracy and MSE
    _, testing_accuracy, testing_mse = model.evaluate(X_test, y_test, verbose=0)
    
    print('Cross Validation Iteration #', iterator+1, ':')
    print('\t','Accuracy:', testing_accuracy)
    print('\t','MSE:', testing_mse)
    iterator+=1
    overall_accuracy += testing_accuracy
    overall_mse += testing_mse

print('Overall Average Accuracy :', overall_accuracy/10)
print('Overall Average MSE:', overall_mse/10)


Cross Validation Iteration # 1 :
	 Accuracy: 0.8869310021400452
	 MSE: 0.02343909442424774
Cross Validation Iteration # 2 :
	 Accuracy: 0.886113166809082
	 MSE: 0.02304946258664131
Cross Validation Iteration # 3 :
	 Accuracy: 0.9118295311927795
	 MSE: 0.02214258909225464
Cross Validation Iteration # 4 :
	 Accuracy: 0.9155033230781555
	 MSE: 0.02224731259047985
Cross Validation Iteration # 5 :
	 Accuracy: 0.9316678643226624
	 MSE: 0.020588815212249756
Cross Validation Iteration # 6 :
	 Accuracy: 0.920646607875824
	 MSE: 0.02184421569108963
Cross Validation Iteration # 7 :
	 Accuracy: 0.9294636249542236
	 MSE: 0.019694708287715912
Cross Validation Iteration # 8 :
	 Accuracy: 0.9338721632957458
	 MSE: 0.01894899643957615
Cross Validation Iteration # 9 :
	 Accuracy: 0.9140337705612183
	 MSE: 0.02210644632577896
Cross Validation Iteration # 10 :
	 Accuracy: 0.9191770553588867
	 MSE: 0.02302875928580761
Overall Average Accuracy : 0.9149238109588623
Overall Average MSE: 0.021709039993584155


## Exercise 3 : Hyperparameter Tuning (30 points)

Use either grid search or random search methodology to find the optimal number of nodes required in each hidden layer, as well as the optimal learning rate and the number of epochs, such that the accuracy of the model is maximum for the given data set.

__Requirements :__
- The set of optimal hyperparameters
- The maximum accuracy achieved using this set of optimal hyperparameters

__Note :__ Hyperparameter tuning takes a lot of time to execute. Make sure that you choose the appropriate number of each hyperparameter (preferably 3 of each), and that you allocate enough time to execute your code.

In [4]:
#Takes a long time to execute, might have to run again to view results on your local machine.
#Code is working, may just have to run it again to view results.

from sklearn.model_selection import RandomizedSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

#Based of code RandomizedSearch from Discussion 10/20

#function from discussion to create the model
def create_model(layer1=12, layer2=3, learnRate=0.3):
    model = Sequential()
    #Number of nodes in the first hidden layer
    model.add(Dense(layer1, input_dim=16, activation='sigmoid'))
    #Number of nodes in the second hidden layer
    model.add(Dense(layer2, activation='sigmoid'))
    #Maps to the 7 different types of bean
    model.add(Dense(7, activation='sigmoid'))
    #From Discussion 10/15
    model.compile(loss='mean_squared_error',
                  optimizer=SGD(learning_rate=learnRate),
                  metrics=['accuracy', 'mse'])
    return model

#Use the function to create the model for randomSearch
model = KerasClassifier(build_fn=create_model, verbose=0)
print(model)

#sample model hyperparameters
layer1=[5, 10, 15]
layer2=[3, 6, 17]
learnRate=[0.1, .25, 0.5]
epochs=[300, 600, 900]

#Creating dictionary to be used for randomizedSearchCV
randomSearchDict = dict(layer1 = layer1,
                    layer2 = layer2,
                    epochs=epochs, 
                    learnRate = learnRate) 

#run RandomizedSearchCV
rs = RandomizedSearchCV(estimator=model, 
                          param_distributions=randomSearchDict, 
                          cv=3)
#get the results
randomSearch = rs.fit(X_scaled, onehot_encoded.argmax(axis=1))


#https://stackoverflow.com/questions/64209804/hyperparameter-tuning-with-gridsearch-with-various-parameters
meanScore = randomSearch.cv_results_['mean_test_score']
stdTestScore = randomSearch.cv_results_['std_test_score']
params = randomSearch.cv_results_['params']

for mean, stdev, param in zip(meanScore, stdTestScore, params):
    print("Mean test score of %f and standard deviation of (%f) are obtained with: %r" % (mean, stdev, param))

print("\n","The best score observed is %f and it is obtained using %s" % (randomSearch.best_score_, randomSearch.best_params_))


<keras.wrappers.scikit_learn.KerasClassifier object at 0x7fe602f20a60>
Mean test score of 0.101756 and standard deviation of (0.015757) are obtained with: {'learnRate': 0.5, 'layer2': 3, 'layer1': 15, 'epochs': 900}
Mean test score of 0.126809 and standard deviation of (0.091969) are obtained with: {'learnRate': 0.1, 'layer2': 3, 'layer1': 10, 'epochs': 300}
Mean test score of 0.130409 and standard deviation of (0.002926) are obtained with: {'learnRate': 0.25, 'layer2': 6, 'layer1': 10, 'epochs': 900}
Mean test score of 0.171112 and standard deviation of (0.022588) are obtained with: {'learnRate': 0.1, 'layer2': 6, 'layer1': 5, 'epochs': 300}
Mean test score of 0.111234 and standard deviation of (0.021690) are obtained with: {'learnRate': 0.1, 'layer2': 3, 'layer1': 15, 'epochs': 300}
Mean test score of 0.137389 and standard deviation of (0.016635) are obtained with: {'learnRate': 0.25, 'layer2': 3, 'layer1': 5, 'epochs': 300}
Mean test score of 0.090074 and standard deviation of (0.00