## Classification of mushrooms

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

## Exploratory analysis and data pre-processing

Load the dataset and do some quick exploratory analysis. Many of the entries are encoded as single letters. To make the data usable for machine learning, we need to convert them into integers.

In [2]:
data = pd.read_csv('mushrooms.csv', index_col=False)
data.head(5)

,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [3]:
print(data.shape)

(8124, 23)


In [4]:
data.describe()

,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [5]:
print("Columns with missing/NA values in data in descending order")
print(data.isna().sum().sort_values(ascending=False))

Columns with missing/NA values in data in descending order
habitat                     0
stalk-shape                 0
cap-shape                   0
cap-surface                 0
cap-color                   0
bruises                     0
odor                        0
gill-attachment             0
gill-spacing                0
gill-size                   0
gill-color                  0
stalk-root                  0
population                  0
stalk-surface-above-ring    0
stalk-surface-below-ring    0
stalk-color-above-ring      0
stalk-color-below-ring      0
veil-type                   0
veil-color                  0
ring-number                 0
ring-type                   0
spore-print-color           0
class                       0
dtype: int64


In [6]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Encode every column (features and target) as integers; the encoder is refit for each column
for col in data.columns:
    data[col] = encoder.fit_transform(data[col])
 
data.head()

,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
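
To make the letter-to-integer conversion easier to inspect, here is a minimal sketch on a toy frame (the `toy` data and the `encoders` dictionary are illustrative assumptions, not part of the notebook). It keeps one fitted `LabelEncoder` per column, so the mapping can be read back and inverted, which the single reused encoder above does not allow.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same kind of single-letter codes as mushrooms.csv (illustrative only)
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor":  ["p", "a", "l", "p"],
})

# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integers following the sorted order of the labels
class_mapping = {str(label): idx for idx, label in enumerate(encoders["class"].classes_)}
print(class_mapping)

# Recover the original letters from the integer codes
decoded_odor = encoders["odor"].inverse_transform(encoded["odor"])
print(list(decoded_odor))
```

Because the labels are sorted before integers are assigned, `e` (edible) becomes 0 and `p` (poisonous) becomes 1, which matches the encoded `class` column above.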


Let's look at the number of poisonous (1) and edible (0) cases in the dataset. As the output below shows, edible (0) cases form a slight majority (4208 of 8124, about 52%), so the two classes are roughly balanced.

In [7]:
print(data.groupby('class').size())

class
0    4208
1    3916
dtype: int64
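
As a quick check on the class balance, the counts can be turned into proportions. This is a small sketch that rebuilds the counts by hand from the output above; in the notebook itself, `data['class'].value_counts(normalize=True)` would give the same information directly.

```python
import pandas as pd

# Class counts copied from the groupby output above (0 = edible, 1 = poisonous)
counts = pd.Series({0: 4208, 1: 3916}, name="count")

# Share of each class, rounded to three decimals
proportions = (counts / counts.sum()).round(3)
print(proportions)
```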


Finally, we split the data into predictor variables and the target variable, then break them into train and test sets, using 30% of the data as the test set.

In [8]:
from sklearn.model_selection import train_test_split

Y = data['class'].values
X = data.drop('class', axis=1).values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=21)
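
The split above is a plain random split. One optional variation, shown here only as a sketch on stand-in arrays (`X_demo` and `y_demo` are assumptions that mimic the row and class counts of the data), is to pass `stratify` so that the train and test sets keep the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same row count and class counts as the notebook's data
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(8124, 22))
y_demo = np.array([0] * 4208 + [1] * 3916)

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=21, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # share of class 1 in each split
```

With classes this close to balanced, stratification is unlikely to change the results much, but it removes one source of variation between splits.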

## Ensemble algorithm checking


I evaluate four different ensemble machine learning algorithms, two boosting and two bagging methods:

- **Boosting methods**: AdaBoost (AB) and Gradient Boosting (GBM).
- **Bagging methods**: Random Forests (RF) and Extra Trees (ET).

I do not standardize the training data (it is used as is), and I run 10-fold cross-validation for each algorithm. The notebook cells that follow first screen a broad set of classifiers with LazyPredict, then tune AdaBoost with a 10-fold stratified grid search.
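
The four-algorithm comparison described above is not spelled out cell by cell in this notebook. As a rough sketch of what it could look like, the snippet below runs 10-fold stratified cross-validation for each of the four ensembles. The synthetic data from `make_classification`, the `candidates` dictionary, and `n_estimators=25` are assumptions chosen to keep the example small and fast; in the notebook, `X_train` and `Y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); replace with X_train / Y_train in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=21)

# Two boosting and two bagging-style ensembles, keyed by the abbreviations used above
candidates = {
    "AB": AdaBoostClassifier(n_estimators=25, random_state=21),
    "GBM": GradientBoostingClassifier(n_estimators=25, random_state=21),
    "RF": RandomForestClassifier(n_estimators=25, random_state=21),
    "ET": ExtraTreesClassifier(n_estimators=25, random_state=21),
}

# The same 10 stratified folds are reused for every model so the scores are comparable
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```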

In [9]:
import lazypredict
from lazypredict.Supervised import LazyClassifier



In [10]:
clf = LazyClassifier(verbose=0,
                     ignore_warnings=True,
                     custom_metric=None,
                     predictions=False,
                     random_state=12,
                     classifiers='all')

models, predictions = clf.fit(X_train, X_test, Y_train, Y_test)

100%|██████████| 29/29 [00:06<00:00,  4.74it/s]


In [11]:
models.head(15)

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
AdaBoostClassifier,1.0,1.0,1.0,1.0,0.18
KNeighborsClassifier,1.0,1.0,1.0,1.0,0.27
XGBClassifier,1.0,1.0,1.0,1.0,0.17
SVC,1.0,1.0,1.0,1.0,0.14
RandomForestClassifier,1.0,1.0,1.0,1.0,0.21
BaggingClassifier,1.0,1.0,1.0,1.0,0.07
LabelSpreading,1.0,1.0,1.0,1.0,1.7
LabelPropagation,1.0,1.0,1.0,1.0,1.1
LGBMClassifier,1.0,1.0,1.0,1.0,0.08
ExtraTreeClassifier,1.0,1.0,1.0,1.0,0.01


In [12]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier

k = 10  # number of cross-validation folds
skfold = StratifiedKFold(n_splits=k, shuffle=True, random_state=21)
# Hyperparameter grid: 4 x 3 = 12 combinations, each scored with 10-fold CV
param_grid = {'n_estimators': [25, 50, 100, 200],
              'learning_rate': [0.05, 0.1, 1.0]}

model = AdaBoostClassifier(random_state=21)
grid = GridSearchCV(model, param_grid=param_grid, cv=skfold, scoring='accuracy')
start = time.time()
grid_result = grid.fit(X_train, Y_train)
end = time.time()

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

print("Best: %f using %s (run time :%f)" % (grid_result.best_score_, grid_result.best_params_, end-start))

0.959725 (0.006169) with: {'learning_rate': 0.05, 'n_estimators': 25}
0.967288 (0.006205) with: {'learning_rate': 0.05, 'n_estimators': 50}
0.979950 (0.005114) with: {'learning_rate': 0.05, 'n_estimators': 100}
0.985754 (0.005130) with: {'learning_rate': 0.05, 'n_estimators': 200}
0.960954 (0.007943) with: {'learning_rate': 0.1, 'n_estimators': 25}
0.979774 (0.005297) with: {'learning_rate': 0.1, 'n_estimators': 50}
0.985754 (0.005130) with: {'learning_rate': 0.1, 'n_estimators': 100}
0.992437 (0.004241) with: {'learning_rate': 0.1, 'n_estimators': 200}
1.000000 (0.000000) with: {'learning_rate': 1.0, 'n_estimators': 25}
1.000000 (0.000000) with: {'learning_rate': 1.0, 'n_estimators': 50}
1.000000 (0.000000) with: {'learning_rate': 1.0, 'n_estimators': 100}
1.000000 (0.000000) with: {'learning_rate': 1.0, 'n_estimators': 200}
Best: 1.000000 using {'learning_rate': 1.0, 'n_estimators': 25} (run time :35.114364)


In [13]:
model.set_params(**grid.best_params_)
model.fit(X_train, Y_train)

AdaBoostClassifier(n_estimators=25, random_state=21)

In [14]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

predictions = model.predict(X_test)
print("Accuracy score %f" % accuracy_score(Y_test, predictions))
print(classification_report(Y_test, predictions))

Accuracy score 1.000000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1268
           1       1.00      1.00      1.00      1170

    accuracy                           1.00      2438
   macro avg       1.00      1.00      1.00      2438
weighted avg       1.00      1.00      1.00      2438



In [15]:
print(confusion_matrix(Y_test, predictions))

[[1268    0]
 [   0 1170]]


## Conclusion

The ensemble method (AdaBoost) achieves 100% accuracy on the test set of this dataset: the confusion matrix shows not a single misclassification. Personally, I am a little skeptical of this result and want to investigate it further. Meanwhile, let me know if you have any comments :)
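
One possible starting point for that investigation, offered as a suggestion rather than as the plan for this notebook, is to check whether a single feature nearly determines the class on its own. The sketch below scores a decision tree trained on one feature at a time. The helper `single_feature_scores` and the toy frame are hypothetical, and in the toy data `odor` is constructed to determine the class while `habitat` is not.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_scores(frame, target, cv=5):
    """Cross-validated accuracy of a decision tree trained on one feature at a time."""
    y = frame[target]
    results = {}
    for col in frame.columns.drop(target):
        tree = DecisionTreeClassifier(random_state=21)
        scores = cross_val_score(tree, frame[[col]], y, cv=cv, scoring="accuracy")
        results[col] = scores.mean()
    # Most predictive single feature first
    return pd.Series(results).sort_values(ascending=False)


# Toy integer-encoded frame (illustrative only): "odor" determines the class, "habitat" does not
toy = pd.DataFrame({
    "class":   [0, 1] * 20,
    "odor":    [3, 6] * 20,
    "habitat": [0, 0, 1, 1] * 10,
})
ranking = single_feature_scores(toy, "class")
print(ranking)
```

Applied to the encoded `data` frame with `target='class'`, a ranking like this would show whether the perfect score comes from one or two dominant features or from the combination of many.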