# DryBeanDataset

This project classifies the DryBeanDataset, a tabular dataset of 13,611 samples covering seven dry bean types, using **XGBoost** and **Random Forest** models.

1. Ikhsan Assidiqie
2. Zharfan Dawud Harwiraputera
3. Helmi Wira Tahta Haikal

## Method Summary

1. **XGBoost**
XGBoost is a gradient boosting algorithm: it builds decision trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far. It is efficient and typically accurate on tabular data, which makes it a good fit for separating the bean types in the DryBeanDataset.

2. **Random Forest**
Random Forest trains many decision trees, each on a random (bootstrap) subset of the training data, and combines their votes into the final prediction. Because the trees differ from one another, their individual errors tend to average out, which helps on datasets with complex feature interactions such as the DryBeanDataset.
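The contrast between the two ensemble styles can be sketched on synthetic data. This is only an illustration under stated assumptions: the data comes from `make_classification` (not the bean dataset), scikit-learn's `GradientBoostingClassifier` stands in for XGBoost, and all variable names are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 7-class, 16-feature tabular problem (NOT the bean data).
X_demo, y_demo = make_classification(
    n_samples=700, n_features=16, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Bagging: each tree sees a bootstrap sample and a random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
# Boosting: trees are added one at a time, each fitted to the current errors.
boosted = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=0)

demo_scores = {}
for name, clf in [("random_forest", forest), ("gradient_boosting", boosted)]:
    clf.fit(Xd_train, yd_train)
    demo_scores[name] = clf.score(Xd_test, yd_test)  # mean accuracy on held-out rows

print(demo_scores)
```

The scores on synthetic data say nothing about the bean results; the point is only that both models share the same `fit`/`score` interface, so they can be compared on an identical split.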

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder


In [2]:
df = pd.read_csv('../../data/raw/DryBeanDataset/Dry_Bean_Dataset.csv', delimiter=';')
df = df.replace(',', '.', regex=True)

df

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.1781167,173.888747,1.197191424,0.549812187,28715,190.1410973,0.763922518,0.988855999,0.958027126,0.913357755,0.007331506,0.003147289,0.834222388,0.998723889,SEKER
1,28734,638.018,200.5247957,182.7344194,1.097356461,0.411785251,29172,191.2727505,0.783968133,0.984985603,0.887033637,0.953860842,0.006978659,0.003563624,0.909850506,0.998430331,SEKER
2,29380,624.11,212.8261299,175.9311426,1.209712656,0.562727317,29690,193.4109041,0.778113248,0.989558774,0.947849473,0.908774239,0.007243912,0.003047733,0.825870617,0.999066137,SEKER
3,30008,645.884,210.557999,182.5165157,1.153638059,0.498615976,30724,195.4670618,0.782681273,0.976695743,0.903936374,0.928328835,0.007016729,0.003214562,0.861794425,0.994198849,SEKER
4,30140,620.134,201.8478822,190.2792788,1.06079802,0.333679658,30417,195.896503,0.773098035,0.99089325,0.984877069,0.970515523,0.00669701,0.003664972,0.941900381,0.999166059,SEKER
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13606,42097,759.696,288.721612,185.9447054,1.55272833,0.765002201,42508,231.5157988,0.71457428,0.990331232,0.916603122,0.80186515,0.006858484,0.001749094,0.642987719,0.998385248,DERMASON
13607,42101,757.499,281.5763923,190.7131365,1.476439419,0.735702218,42494,231.5267977,0.799942998,0.990751636,0.922015342,0.822252163,0.006688116,0.001885835,0.67609862,0.998218654,DERMASON
13608,42139,759.321,281.5399279,191.1879789,1.472581747,0.734064781,42569,231.6312612,0.729932444,0.989898753,0.918424091,0.822729703,0.00668122,0.001888271,0.676884164,0.996767264,DERMASON
13609,42147,763.779,283.3826364,190.2757308,1.489326228,0.741054787,42667,231.6532475,0.705389121,0.987812595,0.907906457,0.817457451,0.006723673,0.001852025,0.668236684,0.99522242,DERMASON
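The `replace(',', '.', regex=True)` call swaps the decimal commas but leaves the affected columns as strings, as the later `df.dtypes` output shows. One possible alternative, sketched here on a tiny made-up sample rather than the real file, is to let `pd.read_csv` parse the decimal mark directly through its `decimal` parameter.

```python
import io

import pandas as pd

# Tiny made-up sample in the same layout: ';' separators, ',' as decimal mark.
raw = io.StringIO(
    "Area;Perimeter;Solidity;Class\n"
    "28395;610,291;0,988855999;SEKER\n"
    "42097;759,696;0,990331232;DERMASON\n"
)

# decimal="," tells the parser to treat the comma as the decimal separator.
sample = pd.read_csv(raw, sep=";", decimal=",")
print(sample.dtypes)
```

With this approach the numeric columns arrive as floats, so a separate `astype(float)` step would not be needed.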


In [3]:
df.head(6)

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.1781167,173.888747,1.197191424,0.549812187,28715,190.1410973,0.763922518,0.988855999,0.958027126,0.913357755,0.007331506,0.003147289,0.834222388,0.998723889,SEKER
1,28734,638.018,200.5247957,182.7344194,1.097356461,0.411785251,29172,191.2727505,0.783968133,0.984985603,0.887033637,0.953860842,0.006978659,0.003563624,0.909850506,0.998430331,SEKER
2,29380,624.11,212.8261299,175.9311426,1.209712656,0.562727317,29690,193.4109041,0.778113248,0.989558774,0.947849473,0.908774239,0.007243912,0.003047733,0.825870617,0.999066137,SEKER
3,30008,645.884,210.557999,182.5165157,1.153638059,0.498615976,30724,195.4670618,0.782681273,0.976695743,0.903936374,0.928328835,0.007016729,0.003214562,0.861794425,0.994198849,SEKER
4,30140,620.134,201.8478822,190.2792788,1.06079802,0.333679658,30417,195.896503,0.773098035,0.99089325,0.984877069,0.970515523,0.00669701,0.003664972,0.941900381,0.999166059,SEKER
5,30279,634.927,212.5605564,181.5101816,1.171066849,0.52040066,30600,196.3477022,0.775688485,0.989509804,0.943851783,0.923725952,0.007020065,0.003152779,0.853269634,0.999235781,SEKER


In [4]:
df.tail(6)

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
13605,42070,760.701,276.6916506,193.9453665,1.426647389,0.713216263,42458,231.4415426,0.730813327,0.990861557,0.913596447,0.836460161,0.006576935,0.001986023,0.699665601,0.998176102,DERMASON
13606,42097,759.696,288.721612,185.9447054,1.55272833,0.765002201,42508,231.5157988,0.71457428,0.990331232,0.916603122,0.80186515,0.006858484,0.001749094,0.642987719,0.998385248,DERMASON
13607,42101,757.499,281.5763923,190.7131365,1.476439419,0.735702218,42494,231.5267977,0.799942998,0.990751636,0.922015342,0.822252163,0.006688116,0.001885835,0.67609862,0.998218654,DERMASON
13608,42139,759.321,281.5399279,191.1879789,1.472581747,0.734064781,42569,231.6312612,0.729932444,0.989898753,0.918424091,0.822729703,0.00668122,0.001888271,0.676884164,0.996767264,DERMASON
13609,42147,763.779,283.3826364,190.2757308,1.489326228,0.741054787,42667,231.6532475,0.705389121,0.987812595,0.907906457,0.817457451,0.006723673,0.001852025,0.668236684,0.99522242,DERMASON
13610,42159,772.237,295.142741,182.2047159,1.619841394,0.786693016,42600,231.6862231,0.788962497,0.989647887,0.888380369,0.784997193,0.007000705,0.001639812,0.616220592,0.998179623,DERMASON


In [5]:
df['Class'].value_counts()

Class
DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: count, dtype: int64

In [6]:
df.isnull().sum()

Area               0
Perimeter          0
MajorAxisLength    0
MinorAxisLength    0
AspectRation       0
Eccentricity       0
ConvexArea         0
EquivDiameter      0
Extent             0
Solidity           0
roundness          0
Compactness        0
ShapeFactor1       0
ShapeFactor2       0
ShapeFactor3       0
ShapeFactor4       0
Class              0
dtype: int64

In [7]:
df.dtypes

Area                int64
Perimeter          object
MajorAxisLength    object
MinorAxisLength    object
AspectRation       object
Eccentricity       object
ConvexArea          int64
EquivDiameter      object
Extent             object
Solidity           object
roundness          object
Compactness        object
ShapeFactor1       object
ShapeFactor2       object
ShapeFactor3       object
ShapeFactor4       object
Class              object
dtype: object

In [8]:
df.shape

(13611, 17)

## Preprocess Data

In [9]:
le = LabelEncoder()
df['Class'] = le.fit_transform(df['Class'])
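After encoding, the reports and the confusion matrix refer to classes by integer. `LabelEncoder` stores the original names in sorted order in `classes_`, so the mapping can be recovered. A small sketch, using a separate encoder (`demo_le`, a name chosen for this example) fitted on the seven class names from the `value_counts()` output:

```python
from sklearn.preprocessing import LabelEncoder

# Class names as listed in the value_counts() output above.
bean_classes = ["DERMASON", "SIRA", "SEKER", "HOROZ", "CALI", "BARBUNYA", "BOMBAY"]

demo_le = LabelEncoder().fit(bean_classes)
# classes_ is sorted, so position i holds the name encoded as integer i.
label_map = {code: str(name) for code, name in enumerate(demo_le.classes_)}
print(label_map)
# {0: 'BARBUNYA', 1: 'BOMBAY', 2: 'CALI', 3: 'DERMASON', 4: 'HOROZ', 5: 'SEKER', 6: 'SIRA'}
```

In the notebook itself, `le.classes_` or `le.inverse_transform` on the fitted `le` would give the same mapping.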

In [10]:
X = df.drop('Class', axis=1)
y = df['Class']

In [11]:
# Columns that were read as strings because of the decimal commas.
float_cols = [
    'Perimeter', 'MajorAxisLength', 'MinorAxisLength', 'AspectRation',
    'Eccentricity', 'EquivDiameter', 'Extent', 'Solidity', 'roundness',
    'Compactness', 'ShapeFactor1', 'ShapeFactor2', 'ShapeFactor3', 'ShapeFactor4',
]
X[float_cols] = X[float_cols].astype(float)

In [12]:
X.dtypes

Area                 int64
Perimeter          float64
MajorAxisLength    float64
MinorAxisLength    float64
AspectRation       float64
Eccentricity       float64
ConvexArea           int64
EquivDiameter      float64
Extent             float64
Solidity           float64
roundness          float64
Compactness        float64
ShapeFactor1       float64
ShapeFactor2       float64
ShapeFactor3       float64
ShapeFactor4       float64
dtype: object

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.22)
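The class counts are uneven (522 BOMBAY samples against 3,546 DERMASON), and the split above does not pass `stratify`. Whether that matters here is not tested in this notebook; the sketch below only illustrates, on made-up toy labels, how `stratify` keeps class proportions identical in the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: class 0 is common, class 1 is rare (made-up data).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, _, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
# With stratify, the rare class keeps its 10% share of the 20 test rows.
print(np.bincount(y_te_strat))
```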

In [14]:
model = XGBClassifier()
model.fit(X_train, y_train)

In [15]:
y_pred = model.predict(X_test)

In [16]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.93


In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.93      0.92      0.93       295
           1       1.00      1.00      1.00       123
           2       0.94      0.94      0.94       348
           3       0.91      0.92      0.91       757
           4       0.97      0.96      0.97       434
           5       0.95      0.95      0.95       445
           6       0.87      0.88      0.88       593

    accuracy                           0.93      2995
   macro avg       0.94      0.94      0.94      2995
weighted avg       0.93      0.93      0.93      2995


In [18]:
from sklearn.metrics import confusion_matrix

In [19]:
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

Confusion matrix:
 [[271   0  14   0   0   3   7]
 [  0 123   0   0   0   0   0]
 [  9   0 328   0   7   2   2]
 [  0   0   0 697   1   8  51]
 [  2   0   4   3 418   0   7]
 [  7   0   0   9   0 421   8]
 [  1   0   3  58   4   7 520]]
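The matrix can be read numerically as well. The following sketch copies the reported values into a NumPy array (rows are true classes, columns are predicted classes) and derives per-class recall, overall accuracy, and the single most frequent confusion. Variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true, columns = predicted).
cm_reported = np.array([
    [271,   0,  14,   0,   0,   3,   7],
    [  0, 123,   0,   0,   0,   0,   0],
    [  9,   0, 328,   0,   7,   2,   2],
    [  0,   0,   0, 697,   1,   8,  51],
    [  2,   0,   4,   3, 418,   0,   7],
    [  7,   0,   0,   9,   0, 421,   8],
    [  1,   0,   3,  58,   4,   7, 520],
])

# Recall divides correct predictions by row totals; accuracy uses the full matrix.
recall_per_class = cm_reported.diagonal() / cm_reported.sum(axis=1)
overall_accuracy = cm_reported.trace() / cm_reported.sum()

# Largest off-diagonal cell = most frequent single confusion.
off_diag = cm_reported.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print(np.round(recall_per_class, 2))
print(round(float(overall_accuracy), 2))
print(int(true_cls), int(pred_cls), int(off_diag.max()))
```

The largest off-diagonal entries sit between classes 3 and 6, so most of the remaining error comes from that one pair.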


In [20]:
from sklearn.model_selection import GridSearchCV

param_grid = {
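    # Note: XGBoost documents learning_rate (eta) as ranging over [0, 1],
    # so the values above 1 are unlikely to be competitive.
    'learning_rate': [0.001, 0.01, 0.1, 1, 10, 100],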
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
    'n_estimators': [50, 100, 200, 300]
}

grid_search = GridSearchCV(XGBClassifier(), param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 200}


In [21]:
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)
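The tuned predictions in `y_pred_tuned` are not scored in the cells above. A minimal sketch of how a baseline and a tuned model could be compared on the same held-out split follows. It is an illustration only: it uses synthetic data and a `DecisionTreeClassifier` as a lightweight stand-in for `XGBClassifier`, and every name in it is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the bean dataset).
Xs, ys = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.25, random_state=0, stratify=ys
)

baseline = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
)
search.fit(Xs_train, ys_train)

# Score both models on the same held-out split so the numbers are comparable.
acc_baseline = accuracy_score(ys_test, baseline.predict(Xs_test))
acc_tuned = accuracy_score(ys_test, search.best_estimator_.predict(Xs_test))
print(f"baseline={acc_baseline:.2f} tuned={acc_tuned:.2f} best={search.best_params_}")
```

In the notebook, the equivalent step would be calling `accuracy_score(y_test, y_pred_tuned)` and `classification_report(y_test, y_pred_tuned)` next to the baseline figures.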

## Results and Analysis

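The cells above report model performance metrics for XGBoost on the DryBeanDataset. The baseline model reaches an accuracy of 0.93 on the 2,995-sample test set, with a macro-averaged F1-score of 0.94. Because `LabelEncoder` assigns integers in alphabetical order, class 1 is BOMBAY, which is classified perfectly, while most errors come from classes 3 (DERMASON) and 6 (SIRA): 51 DERMASON samples are predicted as SIRA and 58 SIRA samples as DERMASON. The grid search selects `learning_rate=0.1`, `max_depth=4`, and `n_estimators=200`; predictions from the tuned model (`y_pred_tuned`) are generated but not scored in the cells above.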

## Conclusion

Based on the available results and our team's analysis, we conclude that dry bean classification was successful: all models achieved high precision and F1-scores. The best result came from schema four, which used XGBoost and reached a precision of 93%, followed closely by the Random Forest schemas at 92%.