# Mushroom Classification - Edible or Poisonous?
### by Renee Teate
### *Using Bernoulli Naive Bayes Classification from scikit-learn*
For Activity 5 of the Data Science Learning Club: http://www.becomingadatascientist.com/learningclub/forum-13.html

Dataset from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Mushroom



In [1]:
#import pandas and numpy libraries
import pandas as pd
import numpy as np
import sys #sys needed only for python version
#import scikit-learn (the Bernoulli naive Bayes classifier is imported from it later)
import sklearn as sk
#seaborn for pretty plots
import seaborn as sns

#display versions of python and packages
print('\npython version ' + sys.version)
print('pandas version ' + pd.__version__)
print('numpy version ' + np.__version__)
print('sk-learn version ' + sk.__version__)
print('seaborn version ' + sns.__version__)



python version 3.5.1 |Anaconda 2.4.1 (64-bit)| (default, Jan 29 2016, 15:01:46) [MSC v.1900 64 bit (AMD64)]
pandas version 0.17.1
numpy version 1.10.4
sk-learn version 0.17
seaborn version 0.7.0


### The dataset doesn't include column names, and the values are text characters

In [2]:
#read in data. it's comma-separated with no column names.
df = pd.read_csv('agaricus-lepiota.data', sep=',', header=None,
                 error_bad_lines=False, warn_bad_lines=True, low_memory=False)
#set pandas to display all of the columns
pd.options.display.max_columns = 25
#show 5 randomly sampled rows
print(df.sample(n=5))

     0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
7913  e  b  s  n  f  n  a  c  b  n  e  ?  s  s  o  o  p  o  o  p  o  v  l
4161  e  f  f  g  t  n  f  c  b  u  t  b  s  s  g  w  p  w  o  p  n  v  d
2403  e  f  y  e  t  n  f  c  b  w  t  b  s  s  w  p  p  w  o  p  k  v  d
587   e  f  f  n  f  n  f  c  n  n  e  e  s  s  w  w  p  w  o  p  k  y  u
79    e  f  y  n  t  a  f  c  b  n  e  r  s  y  w  w  p  w  o  p  n  y  g
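
The `error_bad_lines` and `warn_bad_lines` arguments used above belong to the older pandas listed in the version output. A minimal sketch of an equivalent call, assuming pandas 1.3 or later (where `on_bad_lines` replaces them) and using a few made-up rows in place of the real file:

```python
import io
import pandas as pd

#three made-up rows standing in for agaricus-lepiota.data (not real data)
raw = io.StringIO("e,x,s\np,b,y\ne,f,?\n")

#assumption: pandas >= 1.3, where on_bad_lines replaces error_bad_lines/warn_bad_lines
toy = pd.read_csv(raw, header=None, on_bad_lines='warn')

print(toy.shape)
print(toy.iloc[2, 2])  #the '?' is read as an ordinary string, not as NaN
```

Passing `na_values='?'` to `read_csv` would turn those placeholders into NaN instead, which is a different design choice from the one this notebook makes.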


### Added column names from the UCI documentation

In [3]:
#manually add column names from documentation (1st col is class: e=edible,p=poisonous; rest are attributes)
df.columns = ['class','cap-shape','cap-surface','cap-color','bruises','odor','gill-attachment',
             'gill-spacing','gill-size','gill-color','stalk-shape','stalk-root',
             'stalk-surf-above-ring','stalk-surf-below-ring','stalk-color-above-ring','stalk-color-below-ring',
             'veil-type','veil-color','ring-number','ring-type','spore-color','population','habitat']

print("Example values:\n")
print(df.iloc[3984]) #this one has a ? value - how are those treated by classifier?

Example values:

class                     e
cap-shape                 x
cap-surface               y
cap-color                 b
bruises                   t
odor                      n
gill-attachment           f
gill-spacing              c
gill-size                 b
gill-color                e
stalk-shape               e
stalk-root                ?
stalk-surf-above-ring     s
stalk-surf-below-ring     s
stalk-color-above-ring    e
stalk-color-below-ring    w
veil-type                 p
veil-color                w
ring-number               t
ring-type                 e
spore-color               w
population                c
habitat                   w
Name: 3984, dtype: object
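
One way to see how much the `?` placeholder matters is to count it per column. The sketch below uses a small made-up frame (`toy` and its rows are hypothetical, not taken from the dataset); the same expression could be applied to `df`.

```python
import pandas as pd

#small made-up stand-in for df (hypothetical rows)
toy = pd.DataFrame({
    'class': ['e', 'p', 'e', 'p'],
    'stalk-root': ['?', 'b', 'c', '?'],
    'odor': ['n', 'f', 'a', 'n'],
})

#count how many '?' placeholders appear in each column
missing_counts = (toy == '?').sum()
print(missing_counts)
```

If nothing special is done with them, a label encoder will treat `?` as just one more category value in that column.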


## The dataset is split fairly evenly between the edible and poisonous classes

In [4]:
#show plots in notebook
%matplotlib inline

#print the count of each class (the pandas bar chart below is commented out)
print(df['class'].value_counts())
#df['class'].value_counts().plot(kind='bar')


e    4208
p    3916
Name: class, dtype: int64
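
To put "fairly evenly" into numbers, the class shares can be worked out from the counts printed above. This is plain arithmetic on those two counts; the variable names are only illustrative.

```python
#class counts copied from the value_counts() output above
counts = {'e': 4208, 'p': 3916}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(label, round(share, 3))
```

A classifier that always predicted the larger class would be right about 52% of the time, which gives a baseline for judging the accuracy figures later on.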


## Let's see how well our classifier can identify poisonous mushrooms by combinations of features

In [31]:
#put the features into X (everything except the 0th column)
X = pd.DataFrame(df, columns=df.columns[1:len(df.columns)], index=df.index)
#put the class values (0th column) into Y 
Y = df['class']

#encode the class labels as numeric
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(Y)
#print(le.classes_)
#print(np.array(Y))
#y values are now integers 0/1; LabelEncoder sorts the classes alphabetically, so e (edible) = 0 and p (poisonous) = 1
y = le.transform(Y)
#print(y)

#have to initialize or get error below
x = pd.DataFrame(X,columns=[X.columns[0]])

#encode each feature column as integers and add it to x (the one-hot encoder in this scikit-learn version requires integer input)
for colname in X.columns:
    le.fit(X[colname])
    #print(colname, le.classes_)
    x[colname] = le.transform(X[colname])

#encode the integer-coded features using one-hot encoding (preprocessing is already imported above)
oh = preprocessing.OneHotEncoder(categorical_features='all')
oh.fit(x)
xo = oh.transform(x).toarray()
#print(xo)

print('\nEncoder Value Counts Per Column:')
print(oh.n_values_)   
print('\nExample Feature Values - row 1 in X:')
print(X.iloc[1])
print('\nExample Encoded Feature Values - row 1 in xo:')
print(xo[1])
print('\nClass Values (Y):')
print(np.array(Y))
print('\nEncoded Class Values (y):')
print(y)



Example Feature Values - row 1 in X:
cap-shape                 x
cap-surface               s
cap-color                 y
bruises                   t
odor                      a
gill-attachment           f
gill-spacing              c
gill-size                 b
gill-color                k
stalk-shape               e
stalk-root                c
stalk-surf-above-ring     s
stalk-surf-below-ring     s
stalk-color-above-ring    w
stalk-color-below-ring    w
veil-type                 p
veil-color                w
ring-number               o
ring-type                 p
spore-color               n
population                n
habitat                   g
Name: 1, dtype: object

Encoder Value Counts Per Column:
[ 6  4 10  2  9  2  2  2 12  2  5  4  4  9  9  1  4  3  5  9  6  7]

Example Encoded Feature Values - row 1 in xo:
[ 0.  0.  0.  0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  1.  0.  1.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  0.  1.
  0.  0.  0.  0.  0.  1.  0.
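
The per-column value counts printed above determine how wide the one-hot matrix is. A short sketch that adds them up and flags any column with a single value, assuming the counts are listed in the same order as the feature column names assigned earlier:

```python
#value counts copied from the encoder output above
n_values = [6, 4, 10, 2, 9, 2, 2, 2, 12, 2, 5, 4, 4, 9, 9, 1, 4, 3, 5, 9, 6, 7]

#feature names in the order they were assigned to df.columns (class excluded)
feature_names = ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
                 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
                 'stalk-shape', 'stalk-root', 'stalk-surf-above-ring',
                 'stalk-surf-below-ring', 'stalk-color-above-ring',
                 'stalk-color-below-ring', 'veil-type', 'veil-color',
                 'ring-number', 'ring-type', 'spore-color', 'population', 'habitat']

#each feature contributes one 0/1 column per distinct value
total_columns = sum(n_values)
print(total_columns)

#a feature with only one value carries no information for the classifier
constant = [name for name, n in zip(feature_names, n_values) if n == 1]
print(constant)
```

Under that assumption, each encoded row has 117 entries, exactly 22 of which are 1 (one per original feature).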

In [32]:
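#split the dataset into training and test sets
#(sklearn.cross_validation matches the scikit-learn 0.17 used here; from 0.18 on, train_test_split lives in sklearn.model_selection)
from sklearn.cross_validation import train_test_split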
x_train, x_test, y_train, y_test = train_test_split(xo, y, test_size=0.33)

#initialize and fit the naive bayes classifier
from sklearn.naive_bayes import BernoulliNB
skbnb = BernoulliNB()
skbnb.fit(x_train,y_train)
train_predict = skbnb.predict(x_train)
#print(train_predict)

#see how accurately the model fits the training data
from sklearn import metrics
print("Training accuracy:",metrics.accuracy_score(y_train, train_predict))

#use the trained model to predict the test values
test_predict = skbnb.predict(x_test)
print("Testing accuracy:",metrics.accuracy_score(y_test, test_predict))


Training accuracy: 0.938453058975
Testing accuracy: 0.931368892204
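
Because the split above is random, the accuracy figures will shift a little from run to run. A minimal sketch of a repeatable, class-balanced split on made-up binary data; `random_state` and `stratify` are optional `train_test_split` arguments that the cell above does not use, and the toy arrays are hypothetical:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

try:
    from sklearn.model_selection import train_test_split  #scikit-learn 0.18+
except ImportError:
    from sklearn.cross_validation import train_test_split  #scikit-learn 0.17 and earlier

#made-up binary features: column 0 matches the class, column 1 is unrelated to it
toy_y = np.array([0, 1] * 6)
toy_x = np.column_stack([toy_y, np.tile([0, 0, 1, 1], 3)])

#random_state fixes the shuffle; stratify keeps the class ratio in both parts
x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.33, random_state=0, stratify=toy_y)

model = BernoulliNB().fit(x_tr, y_tr)
print(len(x_tr), len(x_te))
print(model.score(x_te, y_te))
```

With a fixed `random_state`, rerunning the cell gives the same rows in each part, so changes in accuracy can be attributed to changes in the model rather than the shuffle.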


In [33]:
print("\nClassification Report:")
print(metrics.classification_report(y_test, test_predict, target_names=['edible','poisonous']))
print("\nConfusion Matrix:")
skcm = metrics.confusion_matrix(y_test,test_predict)
#putting it into a dataframe so it prints the labels
skcm = pd.DataFrame(skcm, columns=['predicted-edible','predicted-poisonous'])
skcm['actual'] = ['edible','poisonous']
skcm = skcm.set_index('actual')

#NOTE: confusion_matrix puts actual classes in rows and predicted classes in columns, in sorted label order (0=edible, 1=poisonous), so the labels above match
print(skcm)

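#score() on a classifier returns the mean accuracy, so this should equal the testing accuracy above
print("\nScore (same thing as test accuracy?): ", skbnb.score(x_test,y_test))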




Classification Report:
             precision    recall  f1-score   support

     edible       0.89      0.99      0.94      1398
  poisonous       0.98      0.87      0.92      1283

avg / total       0.94      0.93      0.93      2681


Confusion Matrix:
           predicted-edible  predicted-poisonous
actual                                          
edible                 1378                   20
poisonous               164                 1119

Score (same thing as test accuracy?):  0.931368892204
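
Two of the open questions in the cell above can be checked directly. The sketch below uses made-up labels to show which way scikit-learn's `confusion_matrix` is oriented, then recomputes the accuracy from the four counts in the matrix printed above.

```python
from sklearn.metrics import confusion_matrix

#made-up labels: 0 = edible, 1 = poisonous
actual    = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]  #one edible mushroom predicted as poisonous

cm = confusion_matrix(actual, predicted)
print(cm)

#the single error lands in row 0 (actual edible), column 1 (predicted poisonous)
print(cm[0, 1], cm[1, 0])

#accuracy recomputed from the confusion matrix printed above
correct = 1378 + 1119
total = 1378 + 20 + 164 + 1119
print(correct / total)
```

The recomputed accuracy agrees with both the testing accuracy and the `score` value, which is consistent with `score` reporting mean accuracy.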


### Add interpretation of numbers above (after verifying I entered the parameters correctly and the metrics are labeled right)
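
As a starting point for that interpretation, the report's figures for the poisonous class can be recomputed by hand from the confusion matrix. This sketch assumes the row and column labels shown above are correct (scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns); the variable names are only illustrative.

```python
#counts copied from the confusion matrix above, treating poisonous as the positive class
tn, fp = 1378, 20    #actual edible: predicted edible, predicted poisonous
fn, tp = 164, 1119   #actual poisonous: predicted edible, predicted poisonous

precision = tp / (tp + fp)      #of the mushrooms predicted poisonous, how many really are
recall = tp / (tp + fn)         #of the poisonous mushrooms, how many were caught
f1 = 2 * precision * recall / (precision + recall)
missed_share = fn / (fn + tp)   #poisonous mushrooms predicted as edible

print(round(precision, 2), round(recall, 2), round(f1, 2))
print(round(missed_share, 3))
```

The 164 poisonous mushrooms predicted as edible are the costly errors for this problem: about 13% of the poisonous mushrooms in the test set.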