#Final project AI
#Iñigo Echeagaray Rodríguez
#May 9th 2022
Copyright TensorFlow 2019

Special thanks to Dr. Gerardo Ayala San Martin

In this project, Iñigo Echeagaray analyzes a dataset using TensorFlow. The dataset contains data for 7 different kinds of dried beans.

The dataset citation is given as: KOKLU, M. and OZKAN, I.A., (2020), Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques. Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507.

The cited publication is the original study, in which the authors also analyze the dataset, using a MATLAB GUI, with a multilayer perceptron, a support vector machine, k-nearest neighbors, and decision trees. They also compare the performance of these models.

In this project, the dataset of image features (not the images themselves) will be analyzed. Features will be selected to achieve a good classification metric with a neural network.

The goal is to reach a model with accuracy that is at least as good as the ones given in the study.

#Set up and library importation

In [2]:
# Installing necessary complements to read excel files
!pip install openpyxl



In [3]:
#Importing Tensorflow, to create the neural network
import tensorflow as TensorFlow
#Importing numpy, to perform algebraic operations on arrays and the dataset
import numpy as np
#Importing pandas, to be able to manipulate and read the excel dataset
import pandas as pd
#Importing model_selection, to randomly split our dataset into training and testing sets
from sklearn.model_selection import train_test_split
#Import a preprocessing part of sklearn that has an encoder to convert our string labels to integers (class 0,1,2,3,4,5 and 6)
from sklearn import preprocessing

In [4]:
#Import the dataset to the environment
Data=pd.read_excel("DriedBeansDataset.xlsx")
#Check if there are any null values on the dataset, and see which features it has
Data.isnull().sum(axis=0)

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

Luckily, there are no null values in the dataset. Now we can split the dataset in two: one part containing the features and the other containing the classes.

In [5]:
#Dataset of features, dropping the class columns
DataFeat=Data.drop('Class', axis=1)
#Dataset(array) of the class
DataClass=Data[['Class']]

Since TensorFlow needs numerical classes (that is, class 0, 1, 2, 3, 4, 5, 6), we need to turn the strings in DataClass into numbers. The LabelEncoder assigns the numbers based on the sorted order of the class names.

In [6]:
#Create the numerical transformer
ClassTransformerToInt = preprocessing.LabelEncoder()
#Fit the classes to the transformer (assign each class a number from 0 to 6)
ClassTransformerToInt.fit(DataClass.Class)
#Create array that has the numerical classes
DataClassNum=ClassTransformerToInt.transform(DataClass.Class)
#Visualize unique values (the classes) of the array
np.unique(DataClassNum)

array([0, 1, 2, 3, 4, 5, 6])
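
As a rough illustration of what the encoder does, the mapping can be reproduced with NumPy alone. This is only a sketch: the sample labels below are a small hypothetical list (the class names are assumed from the cited dataset, they are not printed in this notebook), and the real LabelEncoder may differ in details.

```python
import numpy as np

# Hypothetical sample of string labels (class names assumed, not taken from this notebook)
labels = np.array(["SEKER", "BOMBAY", "SIRA", "BOMBAY", "CALI"])

# np.unique returns the sorted distinct labels; return_inverse gives, for each
# element, the position of its label in that sorted array
classes, encoded = np.unique(labels, return_inverse=True)

print(classes.tolist())  # ['BOMBAY', 'CALI', 'SEKER', 'SIRA']
print(encoded.tolist())  # [2, 0, 3, 0, 1]

# Decoding is a plain index lookup into the sorted class names
decoded = classes[encoded]
```

Because the integers follow alphabetical order, they carry no meaning beyond identifying the class.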

#Feature selection

We should scale the features so that they all have a range of 0 to 1, both to simplify the model and to make it easier to select features. One way to do this is to apply, to each value X of a feature, the formula (X-min)/(max-min), where min and max are the smallest and largest values of that feature. This way the min value of the set becomes 0 and the max value becomes 1, with every other value in between.

We can perform this operation on every value of the feature columns using the apply function.

In [7]:
#Apply this to every column of the dataset of features
#(iterate over a copy of the column names, since columns may be dropped inside the loop)
for col in list(DataFeat.columns):
  #Save the max value of the column (named col_max so that the built-in max is not shadowed)
  col_max=DataFeat[col].max()
  #Save the min value of the column
  col_min=DataFeat[col].min()
  #Make sure that the column has different values, that is, that its minimum value isn't the same as its maximum value
  if col_max!=col_min:
    #Make the feature column equal to the scaled feature column, the apply function is a function that goes through a whole array (in this case, column) and
    #applies a mathematical formula to each value
    DataFeat[col]=DataFeat[col].apply(
        lambda value: (value-col_min)/(col_max-col_min)
        )
  #If the column is a column of all equal values, drop it, since it doesn't provide any info
  else:
    #Here, the argument axis=1 implies that the dropping will be made for a column, not a row (which would be axis=0, the default)
    #drop returns a new dataframe, so the result has to be assigned back
    DataFeat=DataFeat.drop(col,axis=1)
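
The same min-max scaling can also be written without an explicit loop. The sketch below uses a small hypothetical NumPy matrix rather than the bean dataset, and drops constant columns with a boolean mask; it is meant as an illustration of the formula, not as a replacement for the cell above.

```python
import numpy as np

# Hypothetical toy feature matrix: 4 samples, 3 features (the third column is constant)
X = np.array([[10.0, 200.0, 5.0],
              [20.0, 400.0, 5.0],
              [30.0, 600.0, 5.0],
              [40.0, 1000.0, 5.0]])

# Per-column minimum, maximum and range
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_range = col_max - col_min

# Keep only the columns whose values actually vary (this avoids dividing by zero)
varying = col_range > 0

# (X - min) / (max - min), applied to all remaining columns at once via broadcasting
X_scaled = (X[:, varying] - col_min[varying]) / col_range[varying]
print(X_scaled)
```

On a pandas DataFrame of non-constant columns, the equivalent expression would be `(df - df.min()) / (df.max() - df.min())`.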

In [8]:
#Visualize the scaled Dataset of features
DataFeat

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4
0,0.034053,0.058574,0.044262,0.152142,0.122612,0.477797,0.033107,0.070804,0.671024,0.922824,0.934823,0.786733,0.593432,0.833049,0.750996,0.980620
1,0.035500,0.077557,0.030479,0.178337,0.051577,0.278472,0.034991,0.073577,0.735504,0.871514,0.793138,0.903549,0.547447,0.967315,0.884987,0.974979
2,0.038259,0.068035,0.052633,0.158190,0.131521,0.496448,0.037126,0.078816,0.716671,0.932141,0.914511,0.773514,0.582016,0.800942,0.736200,0.987196
3,0.040940,0.082942,0.048548,0.177691,0.091623,0.403864,0.041389,0.083854,0.731365,0.761614,0.826871,0.829912,0.552408,0.854744,0.799846,0.893675
4,0.041504,0.065313,0.032862,0.200679,0.025565,0.165680,0.040123,0.084906,0.700538,0.949832,0.988408,0.951583,0.510741,1.000000,0.941770,0.989116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,0.092559,0.160862,0.189318,0.187843,0.375584,0.788553,0.089967,0.172180,0.512286,0.942381,0.852151,0.465175,0.531785,0.382135,0.412185,0.974113
13607,0.092576,0.159358,0.176450,0.201964,0.321303,0.746241,0.089910,0.172207,0.786890,0.947954,0.862952,0.523974,0.509582,0.426233,0.470848,0.970912
13608,0.092739,0.160605,0.176384,0.203370,0.318558,0.743877,0.090219,0.172463,0.561689,0.936648,0.855785,0.525351,0.508683,0.427019,0.472240,0.943025
13609,0.092773,0.163657,0.179703,0.200669,0.330472,0.753971,0.090623,0.172517,0.482741,0.908991,0.834795,0.510145,0.514216,0.415330,0.456919,0.913342


We can see that the dataset has a lot of features, because a lot of features can be extracted from images.

When creating a model, having a lot of features can be a good thing, but it can also lead to overfitting or to a longer running time when fitting the model. It can also add unnecessary noise that makes the model worse, so there are techniques to select the best features to use.

This can be framed as a search problem: we could write an algorithm that trains models on different feature subsets and keeps the best one. However, 16 features give 2^16 = 65,536 possible subsets (counting the one that uses no features), and each candidate requires training and testing a model. Testing all of them is not practical, so an exhaustive strategy such as breadth-first search is not a good idea. We also need a good solution rather than the first one found, so depth-first search is not a good idea either. Even informed search algorithms would be slow, because evaluating just a few candidates already takes a long time. So we need a way to select the features beforehand.
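
To make the size of that search space concrete, the count can be checked directly. The one-minute training time in the sketch below is a purely hypothetical figure chosen for illustration, and the three feature names in the toy example are placeholders.

```python
from itertools import combinations
from math import comb

n_features = 16

# Every feature is either in or out, so there are 2**16 subsets (empty one included)
total_subsets = 2 ** n_features
print(total_subsets)  # 65536

# The same total, obtained by summing the number of subsets of each size k = 0..16
total_by_size = sum(comb(n_features, k) for k in range(n_features + 1))

# Hypothetical cost: if training and testing one model took 60 seconds
seconds_per_model = 60
days_needed = total_subsets * seconds_per_model / 86400
print(round(days_needed, 1))  # 45.5

# Tiny illustration with 3 placeholder feature names: all 2**3 = 8 subsets
toy = ["A", "B", "C"]
toy_subsets = [c for k in range(len(toy) + 1) for c in combinations(toy, k)]
print(len(toy_subsets))  # 8
```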

One way to do this is to select the features with the highest scaled standard deviation (scaled so that the standard deviations are actually comparable)**. Why? The standard deviation is a measure of variability in a set. A feature with a high standard deviation takes many different values across the dataset, while a feature with a low standard deviation takes very similar values across the dataset, so it "doesn't have much to say": if the feature has almost the same value everywhere, it is hard to use it to distinguish between classes, because the values for different classes will be very close to each other. Note that this is a heuristic: it does not look at the class labels, so it can only flag features that are unlikely to help.

Thankfully, since we already scaled the features, we can simply check the standard deviation of each feature using the std() function.

**Note: If the features weren't scaled, the standard deviation wouldn't tell us much. For example, is a standard deviation of 3 high? It depends on the range of the set. If the set has a range of 0 to 6, the standard deviation is extremely high, but if the set has a range of 0 to 1,000,000, the standard deviation is extremely low, and 1,000,000 is almost certainly an isolated outlier. That is why, when comparing the standard deviations of different features, the feature values should be scaled so they all have the same range, in this case from 0 to 1.
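
The point of the note can be checked numerically. The sketch below uses two hypothetical features that hold the same measurements in different units; `ddof=1` is passed so that NumPy computes the sample standard deviation, which is the default of the pandas std() function.

```python
import numpy as np

# Two hypothetical features: the same measurements, expressed in different units
small = np.array([0.0, 1.0, 2.0, 3.0, 6.0])
large = small * 1000.0

def min_max(x):
    """Scale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# Unscaled: the standard deviations differ by a factor of 1000,
# although both features carry exactly the same information
std_small = small.std(ddof=1)
std_large = large.std(ddof=1)
print(std_small, std_large)

# Scaled: the standard deviations coincide and can be compared directly
std_small_scaled = min_max(small).std(ddof=1)
std_large_scaled = min_max(large).std(ddof=1)
print(std_small_scaled, std_large_scaled)
```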

In [9]:
#Checking the standard deviation of the features
DataFeat.std()

Area               0.125212
Perimeter          0.146710
MajorAxisLength    0.154332
MinorAxisLength    0.133171
AspectRation       0.175517
Eccentricity       0.132860
ConvexArea         0.122744
EquivDiameter      0.144996
Extent             0.157895
Solidity           0.061783
roundness          0.118786
Compactness        0.177989
ShapeFactor1       0.147006
ShapeFactor2       0.192168
ShapeFactor3       0.175392
ShapeFactor4       0.083898
dtype: float64

We can clearly see that some features, like ShapeFactor4 and Solidity, have lower standard deviations than the others. We can try removing the features with a standard deviation under 0.1, that is, under 10% of the maximum scaled value, which is 1.

Note: This value is chosen because this is an already "clean" dataset, so we want to be conservative when removing features. In a "dirty" dataset, one should use a larger threshold.

In [10]:
#Create the array to save the names of columns that will be used
column_array=[]
#Apply this to every column of the dataset of scaled features
for col in DataFeat:
  #Obtain the Standard Deviation of the feature column
  FeatureStandardDeviation = DataFeat[col].std()
  #If the Standard Deviation of the feature column is at least 0.1
  if FeatureStandardDeviation >= 0.1:
    #save the feature column name to the column array
    column_array.append(col)
#Display the columns to be used
column_array

['Area',
 'Perimeter',
 'MajorAxisLength',
 'MinorAxisLength',
 'AspectRation',
 'Eccentricity',
 'ConvexArea',
 'EquivDiameter',
 'Extent',
 'roundness',
 'Compactness',
 'ShapeFactor1',
 'ShapeFactor2',
 'ShapeFactor3']
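
As a cross-check of this selection, the threshold can be applied to the standard deviations printed earlier. The sketch below copies those values into a plain dictionary; the variable names are illustrative only.

```python
# Standard deviations of the scaled features, copied from the DataFeat.std() output
feature_std = {
    "Area": 0.125212, "Perimeter": 0.146710, "MajorAxisLength": 0.154332,
    "MinorAxisLength": 0.133171, "AspectRation": 0.175517, "Eccentricity": 0.132860,
    "ConvexArea": 0.122744, "EquivDiameter": 0.144996, "Extent": 0.157895,
    "Solidity": 0.061783, "roundness": 0.118786, "Compactness": 0.177989,
    "ShapeFactor1": 0.147006, "ShapeFactor2": 0.192168, "ShapeFactor3": 0.175392,
    "ShapeFactor4": 0.083898,
}

threshold = 0.1

# Features kept (standard deviation of at least 0.1) and features removed
selected = [name for name, std in feature_std.items() if std >= threshold]
removed = [name for name, std in feature_std.items() if std < threshold]

print(len(selected))  # 14
print(removed)        # ['Solidity', 'ShapeFactor4']
```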

In [None]:
#Create new feature dataframe that contains only the features we are going to use
DataFeatAnalysis=DataFeat[column_array]
#Display the new dataframe
DataFeatAnalysis

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3
0,0.034053,0.058574,0.044262,0.152142,0.122612,0.477797,0.033107,0.070804,0.671024,0.934823,0.786733,0.593432,0.833049,0.750996
1,0.035500,0.077557,0.030479,0.178337,0.051577,0.278472,0.034991,0.073577,0.735504,0.793138,0.903549,0.547447,0.967315,0.884987
2,0.038259,0.068035,0.052633,0.158190,0.131521,0.496448,0.037126,0.078816,0.716671,0.914511,0.773514,0.582016,0.800942,0.736200
3,0.040940,0.082942,0.048548,0.177691,0.091623,0.403864,0.041389,0.083854,0.731365,0.826871,0.829912,0.552408,0.854744,0.799846
4,0.041504,0.065313,0.032862,0.200679,0.025565,0.165680,0.040123,0.084906,0.700538,0.988408,0.951583,0.510741,1.000000,0.941770
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,0.092559,0.160862,0.189318,0.187843,0.375584,0.788553,0.089967,0.172180,0.512286,0.852151,0.465175,0.531785,0.382135,0.412185
13607,0.092576,0.159358,0.176450,0.201964,0.321303,0.746241,0.089910,0.172207,0.786890,0.862952,0.523974,0.509582,0.426233,0.470848
13608,0.092739,0.160605,0.176384,0.203370,0.318558,0.743877,0.090219,0.172463,0.561689,0.855785,0.525351,0.508683,0.427019,0.472240
13609,0.092773,0.163657,0.179703,0.200669,0.330472,0.753971,0.090623,0.172517,0.482741,0.834795,0.510145,0.514216,0.415330,0.456919


#Model creation

##Training and testing datasets

Now that we have chosen which features we are going to use, we can start creating the model. First, we need a training dataset and a testing dataset.

In [None]:
#Create the training and testing sets, in order, these are:
#The feature part of the training set, the feature part of the testing set, the class part of the training set and the class part of the testing set
#test_size=0.1 implies that, out of the 100% that is the whole data set, 90% will be dedicated to training and 10% to testing
DataFeatAnalysis_train, DataFeatAnalysis_test, DataClass_train, DataClass_test = train_test_split(DataFeatAnalysis, DataClassNum, test_size=0.1)
#Important note: in smaller datasets, the test size has to be limited so that enough data is left for training; because our dataset is relatively large,
#even 10% leaves a test set of over 1,300 rows, so we can get away with a bigger train size

Now, we need to convert the dataframes to numpy arrays.

In [None]:
#Convert the dataframes to numpy arrays
FeaturesArray_train=DataFeatAnalysis_train.to_numpy()
FeaturesArray_test=DataFeatAnalysis_test.to_numpy()
#Change the name of the Class arrays to fit the context
ClassArray_train=DataClass_train
ClassArray_test=DataClass_test

In [None]:
#Verify that the shape of each set is correct (14 columns for the features, a single dimension for the class):
print('Shape of training Feature array: ' + str(FeaturesArray_train.shape))
print('Shape of training Class array: ' + str(ClassArray_train.shape))
print('Shape of testing Feature array:  '  + str(FeaturesArray_test.shape))
print('Shape of testing Class array:  '  + str(ClassArray_test.shape))

Shape of training Feature array: (12249, 14)
Shape of training Class array: (12249,)
Shape of testing Feature array:  (1362, 14)
Shape of testing Class array:  (1362,)
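
These row counts are consistent with a 90/10 split of 13,611 rows in which the test count is rounded up. The sketch below reproduces the arithmetic; the rounding rule is an assumption about how train_test_split handles a fractional test_size, not something shown in this notebook.

```python
import math

# Total number of rows, from the shapes printed above (12249 + 1362)
n_samples = 12249 + 1362
test_size = 0.1

# Assumed rule: the test count is rounded up, and the training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 12249 1362
```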


##Loss and base model creation

Let's recall that a Keras Sequential model is a plain stack of layers, where each layer has exactly one input tensor and one output tensor.

The Flatten layer flattens the input. In our case it is not necessary, since each input sample is already a one-dimensional array.

The Dense layer is the densely connected layer of the Neural Network.

The dimension of the output of each Dense layer is set by its first argument (66, 22 and 7 in the model below).

The rectifier or ReLU (Rectified Linear Unit) activation function is an activation function defined as the positive part of its argument:

f(x)= max(0,x)

where x is the input to a neuron. This is also known as a ramp function.
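
As a quick illustration of the formula, ReLU can be written in one line with NumPy. This is a sketch of the mathematical function, not the Keras implementation.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x): negative inputs become 0, positive inputs pass through."""
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
y = relu(x)
print(y.tolist())  # [0.0, 0.0, 0.0, 0.5, 2.0]
```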

The Dropout layer randomly sets input units to 0 with a certain rate (0.2 in this case) at each step during training time. This prevents the network from depending too much on just a few neurons, which would cause overfitting.
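
The behaviour can be sketched in NumPy. The version below is the common "inverted dropout" form, in which the units that are kept are scaled up by 1/(1 - rate) so that the expected sum stays the same, and in which nothing happens outside of training. The function name and the seed are illustrative, and the real Keras layer may differ in details.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero each unit with probability `rate`; scale the kept units by 1 / (1 - rate)."""
    if not training:
        # At evaluation time the layer passes its input through unchanged
        return x
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
activations = np.ones(10)

# During training, every output is either 0 (dropped) or 1.25 (kept and rescaled)
during_training = dropout(activations, rate=0.2, rng=rng)
print(during_training)

# During evaluation, the activations are left untouched
during_evaluation = dropout(activations, rate=0.2, rng=rng, training=False)
```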

In our case, we use 3 dense layers to train the network (a first layer of 66 units that receives the 14 input features, a second layer of 22 units, and an output layer of 7 units), no flatten layer (our input is already one-dimensional), and 2 dropout layers to reduce overfitting.

In [None]:
#Create the neural model
NeuralModel= TensorFlow.keras.models.Sequential([
  #The layer has 66 output units, the activation function will be relu, and each input sample has 14 features
  TensorFlow.keras.layers.Dense(66,activation='relu',input_shape=(14,)),
  #Set the rate for the dropout to 0.2
  TensorFlow.keras.layers.Dropout(0.2),
  #Add another dense layer with 22 output units (no activation is given, so this layer is linear)
  TensorFlow.keras.layers.Dense(22),
  #Set the rate for the dropout to 0.2
  TensorFlow.keras.layers.Dropout(0.2),
  #Final dense layer, with the 7 classes as its output
  TensorFlow.keras.layers.Dense(7)
])

print(NeuralModel.summary())

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 66)                990       
                                                                 
 dropout_2 (Dropout)         (None, 66)                0         
                                                                 
 dense_4 (Dense)             (None, 22)                1474      
                                                                 
 dropout_3 (Dropout)         (None, 22)                0         
                                                                 
 dense_5 (Dense)             (None, 7)                 161       
                                                                 
Total params: 2,625
Trainable params: 2,625
Non-trainable params: 0
_________________________________________________________________
None
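
The parameter counts in the summary can be verified by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A short sketch of that arithmetic, with an illustrative helper function:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a fully connected layer."""
    return n_inputs * n_units + n_units

# Input features followed by the number of units of each Dense layer
layer_sizes = [14, 66, 22, 7]

# Pair each layer with the size of the layer that feeds it
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [990, 1474, 161]
print(sum(counts))  # 2625
```

The Dropout layers add no parameters, which matches the zeros in the summary.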


Before defining the loss function, we can compute the vector of logits (the raw, unnormalized outputs of the network) for one example.

In [None]:
#Create the vector of predictions for the neural model for just "one row" of the FeaturesArray
predictions = NeuralModel(FeaturesArray_train[:1])
#Display the vector
print(predictions)

tf.Tensor(
[[-0.29375508 -0.3100034  -0.2588174  -0.22384077 -0.05209648 -0.47950733
   0.08307511]], shape=(1, 7), dtype=float32)


We create the loss function using SparseCategoricalCrossentropy, since there are multiple classes and the labels are integers. It takes a vector of logits and the index of the true class, and returns the loss for each example.

In [None]:
#Create the loss function
lossFunc = TensorFlow.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
#Compute the loss of the first example
print(lossFunc(ClassArray_train[:1], predictions).numpy())

2.0351434


We can see that the loss function behaves as expected, since the value is close to -ln(1/7) = ln(7) ≈ 1.95, which is the expected loss for an untrained model with 7 classes (it assigns roughly the same probability to each of the 7 classes).
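
The reference value can be reproduced with a small NumPy version of the loss. This is a sketch of the formula (negative log of the softmax probability of the true class) for a single example, not the Keras implementation; the function name simply mirrors the Keras one.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example, given raw logits and the integer true class."""
    # Subtracting the maximum keeps the exponentials numerically stable
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

# A model with no preference gives the same logit to all 7 classes
uniform_logits = np.zeros(7)
loss_uniform = sparse_categorical_crossentropy(uniform_logits, label=3)

print(round(float(loss_uniform), 4))  # 1.9459
```

With equal logits the result does not depend on which class is the true one, so the choice of `label=3` is arbitrary.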

#Final model creation

To create the final model, we need to call the compile method. Here we use Adam, an adaptive gradient descent method. Among the available optimizers, Adam is one of those that tend to converge quickly (reaching high accuracy in few epochs), and it can also reach very high accuracy.

In [None]:
#Use the adam optimizer with our loss function and measure its accuracy
NeuralModel.compile(optimizer='adam',
              loss=lossFunc,
              metrics=['Accuracy'])
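
For intuition about what the optimizer does, here is a sketch of the textbook Adam update applied to a toy one-parameter problem. The hyperparameter values are assumed to match the usual Keras defaults, the toy objective is hypothetical, and the actual Keras implementation may differ in details.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for parameter w, given its gradient and the running moments."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy objective f(w) = w**2, whose gradient is 2*w
w, m, v = 5.0, 0.0, 0.0
history = [w]
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * w, m, v, t)
    history.append(w)

print(history[1], history[-1])
```

In this sketch the first step moves the parameter by almost exactly the learning rate, regardless of how large the gradient is, which is one reason Adam is fairly insensitive to the scale of the loss.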

Note: 80 epochs were chosen after some trial and error; progress seems to stall at that number of epochs.

In [None]:
#Train the model using the feature array, for 80 epochs or repetitions
NeuralModel.fit(FeaturesArray_train, ClassArray_train, epochs=80)

Epoch 1/80
...
Epoch 80/80


<keras.callbacks.History at 0x7f9b591fa5d0>

We can see that our model reached a training accuracy of around 91%, which isn't great, but isn't bad either. Compared to the accuracies of the original models from the study, it is a little on the low side. Now we need to check how accurate the model is on unseen data, to make sure it did not overfit.

In [None]:
#Test the model to verify how good it is on previously unseen data
NeuralModel.evaluate(FeaturesArray_test,  ClassArray_test, verbose=2)

43/43 - 0s - loss: 0.2081 - Accuracy: 0.9258 - 217ms/epoch - 5ms/step


[0.20805798470973969, 0.9258443713188171]

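We can see that the model is a little better on previously unseen data, with an accuracy of around 92.6%, which is pretty decent, all things considered. A test accuracy slightly above the training accuracy is plausible here, because the Dropout layers are active during training but disabled during evaluation. I wasn't able to get better results than the models described in the study, but it is still a pretty good result.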
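
For reference, the accuracy reported above is the fraction of examples whose highest logit corresponds to the true class. The sketch below shows that computation on hypothetical logits and labels, not on the output of the trained model.

```python
import numpy as np

# Hypothetical logits for 4 samples and 3 classes, with their true labels
logits = np.array([[2.0, 0.1, -1.0],
                   [0.3, 0.2, 1.5],
                   [0.0, 3.0, 0.5],
                   [1.0, 0.9, 0.8]])
true_labels = np.array([0, 2, 1, 1])

# The predicted class is the index of the largest logit in each row
predicted = logits.argmax(axis=1)

# Accuracy is the fraction of predictions that match the true labels
accuracy = float((predicted == true_labels).mean())

print(predicted.tolist(), accuracy)  # [0, 2, 1, 0] 0.75
```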

In [None]:
#Now that we have the model, save it:
NeuralModel.save("NeuralNetwork166831.h5")