Source:
https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset

Predict the class of a dry bean based on its numeric features.

In [5]:
import numpy as np
import pandas as pd
import xgboost as xgb

Load the dataset.

In [6]:
file_name = "Dry_Bean_Dataset.xlsx"
data = pd.read_excel(file_name)  # read_excel already returns a DataFrame
data.head()

Unnamed: 0,Area,Perimeter,MajorAxisLength,MinorAxisLength,AspectRation,Eccentricity,ConvexArea,EquivDiameter,Extent,Solidity,roundness,Compactness,ShapeFactor1,ShapeFactor2,ShapeFactor3,ShapeFactor4,Class
0,28395,610.291,208.178117,173.888747,1.197191,0.549812,28715,190.141097,0.763923,0.988856,0.958027,0.913358,0.007332,0.003147,0.834222,0.998724,SEKER
1,28734,638.018,200.524796,182.734419,1.097356,0.411785,29172,191.27275,0.783968,0.984986,0.887034,0.953861,0.006979,0.003564,0.909851,0.99843,SEKER
2,29380,624.11,212.82613,175.931143,1.209713,0.562727,29690,193.410904,0.778113,0.989559,0.947849,0.908774,0.007244,0.003048,0.825871,0.999066,SEKER
3,30008,645.884,210.557999,182.516516,1.153638,0.498616,30724,195.467062,0.782681,0.976696,0.903936,0.928329,0.007017,0.003215,0.861794,0.994199,SEKER
4,30140,620.134,201.847882,190.279279,1.060798,0.33368,30417,195.896503,0.773098,0.990893,0.984877,0.970516,0.006697,0.003665,0.9419,0.999166,SEKER


Convert the response classes from strings to integer codes.

In [16]:
# Map each class name to an integer code, in order of first appearance
class_names = data["Class"].unique()
data["Class"] = data["Class"].map({name: code for code, name in enumerate(class_names)})
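
The encoding step assigns each class name an integer in order of first appearance. The sketch below illustrates that idea with plain Python on a made-up list of labels, and also builds the reverse mapping so integer predictions can be turned back into class names. The variable names and the toy labels are assumptions for illustration, not part of the notebook.

```python
# Toy class column; the values are placeholders for illustration only
labels = ["SEKER", "SEKER", "BOMBAY", "CALI", "BOMBAY"]

# Assign codes in order of first appearance (dicts preserve insertion order)
name_to_code = {}
for name in labels:
    name_to_code.setdefault(name, len(name_to_code))

# Reverse mapping, useful for decoding model predictions
code_to_name = {code: name for name, code in name_to_code.items()}

encoded = [name_to_code[name] for name in labels]
decoded = [code_to_name[code] for code in encoded]
print(encoded)
print(decoded)
```

Keeping the reverse mapping around avoids having to guess later which integer stands for which bean class.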

Define train and test datasets.

In [17]:
from sklearn.model_selection import train_test_split

X = data.drop('Class', axis=1)
y = data["Class"]

# Split into training and test sets, keeping class proportions equal in both (stratify)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Features and labels already share the same index, so no reindexing is needed
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

Build an XGBoost model on the training set and evaluate it on the test set.

In [18]:
from sklearn.metrics import accuracy_score
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# predict() already returns integer class labels, so no rounding is needed
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))



Accuracy: 93.24%


The XGBoost model reaches about 93 percent accuracy on the test set.

Now fit the same data with LightGBM.

In [19]:
import lightgbm as lgb
from sklearn.metrics import accuracy_score
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Report test accuracy (true labels first, predictions second)
accuracy = accuracy_score(y_test, y_pred)
print('LightGBM Model accuracy score: {0:0.4f}'.format(accuracy))

LightGBM Model accuracy score: 0.9321


LightGBM also reaches about 93 percent accuracy on the test set. For comparison, check the accuracy on the training set.

In [20]:
y_pred_train = clf.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

Training-set accuracy score: 0.9998
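
The training accuracy (0.9998) is well above the test accuracy (0.9321), which suggests the model fits the training data more closely than it generalizes to unseen data. As a minimal sketch of what these scores measure, the snippet below computes accuracy by hand (the fraction of predictions that match the true labels) and then the train/test gap from the two values printed in this notebook. The helper name `accuracy` and the toy labels are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels, made up for illustration only
toy_true = [0, 1, 2, 2, 1, 0, 3, 3]
toy_pred = [0, 1, 2, 1, 1, 0, 3, 2]
toy_acc = accuracy(toy_true, toy_pred)  # 6 of 8 predictions match

# Scores reported by the LightGBM cells in this notebook
train_acc = 0.9998
test_acc = 0.9321
gap = train_acc - test_acc

print("Toy accuracy: {0:0.4f}".format(toy_acc))
print("Train/test gap: {0:0.4f}".format(gap))
```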


Now save the models.

In [25]:
import pickle
filename1 = 'finalized_model_xgb.sav'
filename2 = 'finalized_model_lgb.sav'
# pickle.dump returns None, so do not assign its result back to model or clf
with open(filename1, 'wb') as f:
    pickle.dump(model, f)
with open(filename2, 'wb') as f:
    pickle.dump(clf, f)

Load them back in.

In [27]:
# XGBoost
with open(filename1, 'rb') as f:
    loaded_model_xgb = pickle.load(f)
result_xgb = loaded_model_xgb.score(X_test, y_test)
print(result_xgb)

0.932427469702534


In [28]:
# LightGBM
with open(filename2, 'rb') as f:
    loaded_model_lgb = pickle.load(f)
result_lgb = loaded_model_lgb.score(X_test, y_test)
print(result_lgb)

0.9320602276900477


The reloaded models reproduce the earlier test accuracies: about 0.9324 for XGBoost and 0.9321 for LightGBM.
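
The save-and-reload round trip can be sketched on its own with a stand-in object. This is a minimal example, assuming a plain dictionary in place of a fitted model and a temporary directory in place of the notebook's working directory; the file name `demo_model.sav` is hypothetical. It uses `with` blocks so the files are closed promptly, and it does not assign the result of `pickle.dump`, which is always `None`.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object round-trips the same way
stand_in_model = {"name": "demo", "weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")

    # Write the object to disk; the file is closed when the block exits
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back as a new, equal object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == stand_in_model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.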