## Dry Beans Classification

Seven different types of dry beans were used in this research, taking into account the features such as form, shape, type, and structure by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification

## Importing all libraries and data preprocessing

In [1]:

import numpy as np 
import pandas as pd
import seaborn as sns 
import matplotlib as plt 
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC ,SVR
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

In [2]:
#reading excel data file 
data = pd.read_excel('Dry_Bean_Dataset.xlsx')

In [3]:
data.head()    # getting some data 

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.27275,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,SEKER
2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,SEKER


In [4]:
data.shape      # showing the total number of rows and column

(13611, 17)

In [5]:
data.info   #showing the values 

<bound method DataFrame.info of         Area  Perimeter  MajorAxisLength  MinorAxisLength  AspectRation  \
0      28395    610.291       208.178117       173.888747      1.197191   
1      28734    638.018       200.524796       182.734419      1.097356   
2      29380    624.110       212.826130       175.931143      1.209713   
3      30008    645.884       210.557999       182.516516      1.153638   
4      30140    620.134       201.847882       190.279279      1.060798   
...      ...        ...              ...              ...           ...   
13606  42097    759.696       288.721612       185.944705      1.552728   
13607  42101    757.499       281.576392       190.713136      1.476439   
13608  42139    759.321       281.539928       191.187979      1.472582   
13609  42147    763.779       283.382636       190.275731      1.489326   
13610  42159    772.237       295.142741       182.204716      1.619841   

       Eccentricity  ConvexArea  EquivDiameter    Extent  Solidity 

In [6]:
data.isna().sum()     # finding null values  , no null values

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

In [7]:
data.describe().T      # getting statastical information

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Area,13611.0,53048.284549,29324.095717,20420.0,36328.0,44652.0,61332.0,254616.0
Perimeter,13611.0,855.283459,214.289696,524.736,703.5235,794.941,977.213,1985.37
MajorAxisLength,13611.0,320.141867,85.694186,183.601165,253.303633,296.883367,376.495012,738.860153
MinorAxisLength,13611.0,202.270714,44.970091,122.512653,175.84817,192.431733,217.031741,460.198497
AspectRation,13611.0,1.583242,0.246678,1.024868,1.432307,1.551124,1.707109,2.430306
Eccentricity,13611.0,0.750895,0.092002,0.218951,0.715928,0.764441,0.810466,0.911423
ConvexArea,13611.0,53768.200206,29774.915817,20684.0,36714.5,45178.0,62294.0,263261.0
EquivDiameter,13611.0,253.06422,59.17712,161.243764,215.068003,238.438026,279.446467,569.374358
Extent,13611.0,0.749733,0.049086,0.555315,0.718634,0.759859,0.786851,0.866195
Solidity,13611.0,0.987143,0.00466,0.919246,0.98567,0.988283,0.990013,0.994677


In [8]:
data.columns       # all columns 

Index(['Area', 'Perimeter', 'MajorAxisLength', 'MinorAxisLength',
       'AspectRation', 'Eccentricity', 'ConvexArea', 'EquivDiameter', 'Extent',
       'Solidity', 'roundness', 'Compactness', 'ShapeFactor1', 'ShapeFactor2',
       'ShapeFactor3', 'ShapeFactor4', 'Class'],
      dtype='object')

In [9]:
data['Class'].unique()    #getting unique values from class column

array(['SEKER', 'BARBUNYA', 'BOMBAY', 'CALI', 'HOROZ', 'SIRA', 'DERMASON'],
      dtype=object)

At the end "Class" is the Multiple calssification column where there are more classification factors 
need to analyse and that is target column to know about, where we can apply machine learning modules and get the prediction value.
  

In [10]:
data["Class"].value_counts()     #calssifications 

DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64

Summary -      

1 - Above are the bean classes that divided into  the 7 .

2 - Dermason is Most occuring class.

3 - Bombay  is Least occuring class.

## Feature Engineering 

Labeling the calssifications and try to get values in int. so modules can work fast and accurate target values from class column are str. format but after labelising it it cahnges  to the int.format.

In [11]:

from sklearn.preprocessing import LabelEncoder
L = LabelEncoder()
data['Class'] = L.fit_transform(data['Class'])
data


Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,5
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.272750,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.998430,5
2,29380,624.110,212.826130,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,5
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,5
4,30140,620.134,201.847882,190.279279,1.060798,0.333680,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.941900,0.999166,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,42097,759.696,288.721612,185.944705,1.552728,0.765002,42508,231.515799,0.714574,0.990331,0.916603,0.801865,0.006858,0.001749,0.642988,0.998385,3
13607,42101,757.499,281.576392,190.713136,1.476439,0.735702,42494,231.526798,0.799943,0.990752,0.922015,0.822252,0.006688,0.001886,0.676099,0.998219,3
13608,42139,759.321,281.539928,191.187979,1.472582,0.734065,42569,231.631261,0.729932,0.989899,0.918424,0.822730,0.006681,0.001888,0.676884,0.996767,3
13609,42147,763.779,283.382636,190.275731,1.489326,0.741055,42667,231.653248,0.705389,0.987813,0.907906,0.817457,0.006724,0.001852,0.668237,0.995222,3


In [12]:
data["Class"].value_counts()   # after labelling the class values are changed

3    3546
6    2636
5    2027
4    1928
2    1630
0    1322
1     522
Name: Class, dtype: int64

In [13]:
#splitting the dataset
X = data.drop(columns='Class', axis=1)  # storing training data in the X
Y = data['Class']                       # storing testing data in the Y ( target data )

In [14]:
print(X)

        Area  Perimeter  MajorAxisLength  MinorAxisLength  AspectRation  \
0      28395    610.291       208.178117       173.888747      1.197191   
1      28734    638.018       200.524796       182.734419      1.097356   
2      29380    624.110       212.826130       175.931143      1.209713   
3      30008    645.884       210.557999       182.516516      1.153638   
4      30140    620.134       201.847882       190.279279      1.060798   
...      ...        ...              ...              ...           ...   
13606  42097    759.696       288.721612       185.944705      1.552728   
13607  42101    757.499       281.576392       190.713136      1.476439   
13608  42139    759.321       281.539928       191.187979      1.472582   
13609  42147    763.779       283.382636       190.275731      1.489326   
13610  42159    772.237       295.142741       182.204716      1.619841   

       Eccentricity  ConvexArea  EquivDiameter    Extent  Solidity  roundness  \
0          0.54981

In [15]:
print(Y)

0        5
1        5
2        5
3        5
4        5
        ..
13606    3
13607    3
13608    3
13609    3
13610    3
Name: Class, Length: 13611, dtype: int32


1 - splitting data into train and testing dataset

2 - if random state changes , it changes in the prediction accuracy

In [16]:

X_test,X_train,Y_test,Y_train = train_test_split(X,Y, random_state=0)       

In [17]:
print(X.shape,X_test.shape, X_train.shape)

(13611, 16) (10208, 16) (3403, 16)


## Models Tuning

##  1  #  Logistic Regression Model 

In [18]:
#model training  #Logistic regression
model = LogisticRegression()
model.fit(X_train,Y_train)    # model fitting and training datasets
#model.fit(X_test,Y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [19]:
#accuracy train prediction
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

#accuracy test prediction
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 0.7355274757566853
Accuracy of testing : 0.7259012539184952


Logistic Regressio gives Accuracy above  70% in Both training and testing data

## 2 #Random Forest Model 


In [20]:
#using randomforest model for best prediction

model = RandomForestClassifier()
model.fit(X_train,Y_train)

In [21]:
#accuracy train prediction
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

#accuracy test prediction
X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 1.0
Accuracy of testing : 0.9163401253918495


Random forest model gives abot 100% accuracy in trainig data 
and 91% accuracy in testing data


## 3 #  Decision Tree Classifier Model


In [22]:

model = DecisionTreeClassifier()
model.fit(X_train,Y_train)


X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)


X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 1.0
Accuracy of testing : 0.8874412225705329


### 4 # kNeighbours Classification 

In [23]:
model = KNeighborsClassifier()
model.fit(X_train,Y_train)

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 0.7813693799588598
Accuracy of testing : 0.6561520376175548


### 5 # Adaboost Classifier Model

In [24]:
model = AdaBoostClassifier()
model.fit(X_train,Y_train)

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 0.56303261827799
Accuracy of testing : 0.5634796238244514


## 6 # Support Vector Model 


In [25]:
model = SVC()
model.fit(X_train,Y_train)

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)

Acurracy of training dataset: 0.6444313840728769
Accuracy of testing : 0.6320532915360502


## 7 # Naveis Bayes Model

In [26]:
model = GaussianNB()
model.fit(X_train,Y_train)

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Acurracy of training dataset:', training_data_accuracy)

X_test_prediction = model.predict(X_test)
testing_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy of testing :', testing_data_accuracy)


Acurracy of training dataset: 0.7616808698207465
Accuracy of testing : 0.7623432601880877


### Conclusion

1 -  From above EDA part come know that there was Multiclass calssificaion, where seven types of dry bean classifications was  there. 

2 - No Null Values so it was easy to perform on data .

3 - Class columns is classification column also it was Target column.

4 - Class column was in str.format so it nedded to convert into int.format so can perform operations on target data it was done by LableEncoding Method.

5 - Then Train and Test data for getting best output,so data fit it into different Machine Learning Models  for training and testing , used Logistic , Randomforest Model, Support Vector , Adaboost, Decision tree, Kneighbours ,Navies Bayes etc.

6 - Got best output By RandomForest Model 91 % Accuracy score .