## Classification of mushrooms using ensemble methods

I use ensemble methods to classify the mushrooms in this dataset. I try a selection of bagging and boosting methods and evaluate each one with 10-fold cross-validation.

Do vote for this if you like this analysis :)

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
import time

## Exploratory analysis and data pre-processing

Load the dataset and do some quick exploratory analysis. The entries are encoded as single letters, so to make the data usable for machine learning we need to convert them into integers.

In [3]:
data = pd.read_csv('mushrooms.csv', index_col=False)
data.head(5)

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [4]:
print(data.shape)

(8124, 23)


In [5]:
data.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [8]:
encoder = LabelEncoder()

for col in data.columns:
    data[col] = encoder.fit_transform(data[col])
 
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,...,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,...,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,...,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,...,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,...,2,7,7,0,2,1,0,3,0,1
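The loop above reuses a single `LabelEncoder`, so only the mapping for the last column survives. A minimal sketch of keeping one encoder per column, so that the integer codes can be decoded again later, is shown here. The toy frame and the names `toy`, `encoders`, and `encoded` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the mushroom data (hypothetical values).
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
})

# Keep one fitted encoder per column so each mapping stays available.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns codes in sorted label order, so 'e' -> 0 and 'p' -> 1.
class_mapping = {str(label): int(code)
                 for code, label in enumerate(encoders["class"].classes_)}
print(class_mapping)

# Round-trip the odor column back to its original letters.
decoded_odor = encoders["odor"].inverse_transform(encoded["odor"])
print(list(decoded_odor))
```

Integer codes impose an arbitrary order on the categories. Tree-based ensembles usually tolerate this, but one-hot encoding may be a safer choice for models that treat the codes as magnitudes.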


Let's take a look at the number of poisonous (1) and edible (0) cases in the dataset. The output below shows that the two classes are fairly balanced, with edible (0) cases forming a slight majority (4208 of 8124).

In [9]:
print(data.groupby('class').size())

class
0    4208
1    3916
dtype: int64
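To express the class balance as proportions, a small sketch using the counts printed above could look like this. The series is built by hand here so the snippet runs on its own; on the real frame, `data['class'].value_counts(normalize=True)` should give the same information.

```python
import pandas as pd

# Class counts copied from the output above (0 = edible, 1 = poisonous).
counts = pd.Series({0: 4208, 1: 3916}, name="count")

# Share of each class in the full dataset, rounded to three decimals.
proportions = (counts / counts.sum()).round(3)
print(proportions)
```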


Finally, we'll split the data into predictor variables and the target variable, and then break them into train and test sets. We will use 30% of the data as the test set.

In [17]:
Y = data['class'].values
X = data.drop('class', axis=1).values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=21)
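As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses synthetic stand-in arrays (`X_demo`, `Y_demo`, an assumption for illustration) with the same class counts as the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the mushroom data.
Y_demo = np.array([0] * 4208 + [1] * 3916)
X_demo = np.arange(len(Y_demo)).reshape(-1, 1)

# stratify=Y_demo preserves the class ratio in the train and test sets.
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.30, random_state=21, stratify=Y_demo
)

print(len(Y_te))                    # number of test samples
print(Y_tr.mean(), Y_te.mean())     # share of class 1 in each split
```

The split in the notebook does not stratify. With classes this balanced the difference is likely small, but stratification removes one source of variation when results are compared.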

## Ensemble algorithm checking


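I evaluate four different ensemble machine learning algorithms, two boosting and two bagging methods:

- **Boosting methods**: AdaBoost (AB) and Gradient Boosting (GBM).
- **Bagging methods**: Random Forests (RF) and Extra Trees (ET). Note that scikit-learn's Extra Trees does not bootstrap the samples by default, so it is bagging-like rather than bagging in the strict sense.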

I do not standardize the training data (it is used as is), and I run 10-fold cross-validation for each algorithm.

In [18]:
# ensembles
ensembles = []
ensembles.append(('AB', AdaBoostClassifier()))
ensembles.append(('GBM', GradientBoostingClassifier()))
ensembles.append(('RF', RandomForestClassifier()))
ensembles.append(('ET', ExtraTreesClassifier()))

In [20]:
import warnings

results = []
names = []
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for name, model in ensembles:
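        # random_state only has an effect when shuffle=True; recent scikit-learn
        # versions raise an error if it is set without shuffling, so it is omitted.
        kfold = KFold(n_splits=10)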
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

AB: 1.000000 (0.000000)
GBM: 1.000000 (0.000000)
RF: 1.000000 (0.000000)
ET: 1.000000 (0.000000)
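As an aside, the loop above reports accuracy only and uses unshuffled folds. A hedged sketch of a more detailed check, using stratified, shuffled folds and several metrics via `cross_validate`, is shown below. It runs on synthetic data from `make_classification` (an assumption for illustration), so its scores say nothing about the mushroom data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for X_train / Y_train (hypothetical data).
X_demo, Y_demo = make_classification(n_samples=300, n_features=8, random_state=21)

# shuffle=True is required for random_state to have any effect.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=21)
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=21),
    X_demo, Y_demo, cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# Report mean and standard deviation per metric, as in the loop above.
for metric in ("accuracy", "precision", "recall"):
    fold_scores = scores["test_" + metric]
    print("%s: %f (%f)" % (metric, fold_scores.mean(), fold_scores.std()))
```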


Interestingly, all the algorithms hit 100% accuracy in cross-validation on the training set! It would be interesting to look at the performance in more detail. For now, I will just pick RandomForestClassifier and validate it on the held-out test set.

In [49]:
# prepare the model
model = RandomForestClassifier(random_state=21, n_estimators=100) 
model.fit(X_train, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=21,
            verbose=0, warm_start=False)

In [50]:
predictions = model.predict(X_test)
print("Accuracy score %f" % accuracy_score(Y_test, predictions))
print(classification_report(Y_test, predictions))

Accuracy score 1.000000
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1268
          1       1.00      1.00      1.00      1170

avg / total       1.00      1.00      1.00      2438



In [51]:
print(confusion_matrix(Y_test, predictions))

[[1268    0]
 [   0 1170]]


## Conclusion

It looks like the ensemble methods are able to achieve 100% accuracy on this dataset. The confusion matrix shows not a single misclassification. Personally, I am a little skeptical of this result and want to investigate it further. Meanwhile, let me know if you have any comments :)
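One possible starting point for that investigation is to check whether a few features carry almost all of the signal. The sketch below is an assumption-laden illustration on synthetic data (`demo`, `signal`, and `noise` are made-up names), not a result for the mushroom dataset. It ranks features by the forest's impurity-based importances and cross-tabulates one feature against the label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(21)

# Synthetic stand-in: 'signal' fully determines the label, 'noise' is irrelevant.
demo = pd.DataFrame({
    "signal": rng.integers(0, 3, size=500),
    "noise": rng.integers(0, 5, size=500),
})
labels = (demo["signal"] == 2).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=21)
forest.fit(demo, labels)

# Rank features by impurity-based importance (the values sum to 1).
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))

# A crosstab shows directly whether a single feature separates the classes.
table = pd.crosstab(demo["signal"], labels)
print(table)
```

On the real data, the same two steps could be applied to the fitted `model` and to individual columns such as `odor`. If one column nearly separates the classes on its own, perfect accuracy would be less surprising.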