# DECISION TREE TO PREDICT WHETHER MUSHROOMS ARE POISONOUS OR EDIBLE

1. Split your data into train and test sets. 
2. Get basic descriptive statistics for the training data and check for missing and incorrect or extreme values. Get scatterplots or heatmaps showing the relationship between the variables. 
3. What are the factors that predict whether a mushroom is poisonous? 
4. Report the accuracy of your model on the training set and on the test set. How successful is the model - what is its precision and recall? 
5. What is the prevalence of poisonous mushrooms in the dataset? How might prevalence affect the positive and negative predictive values of a test/model?

In [1]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from io import StringIO  # sklearn.externals.six has been removed from scikit-learn
from IPython.display import Image 
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder

Importing the CSV file in order to read and analyse the data.

In [2]:
mushroom_data = pd.read_csv('agaricus-lepiota.data')

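We are going to look at the data and rename the columns to make the data more understandable. Note that the file has no header row, so pandas has used the first record (p, x, s, n, ...) as the column names.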

In [3]:
mushroom_data.head()

Unnamed: 0,p,x,s,n,t,p.1,f,c,n.1,k,...,s.2,w,w.1,p.2,w.2,o,p.3,k.1,s.3,u
0,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g


In [4]:
mushroom_data = mushroom_data.rename(columns={
    "p": "classes",
    "x":"cap-shape",
    "s":"cap-surface",
    "n":"cap-color",
    "t":"bruises",
    "p.1":"odor",
    "f":"gill-attachment",
    "c":"gill-spacing",
    "n.1":"gill-size",
    "k":"gill-color",
    "e":"stalk-shape",
    "e.1":"stalk-root",
    "s.1":"stalk-surface-above-ring",
    "s.2":"stalk-surface-below-ring",
    "w":"stalk-color-above-ring",
    "w.1":"stalk-color-below-ring",
    "p.2":"veil-type",
    "w.2":"veil-color",
    "o":"ring-number",
    "p.3":"ring-type",
    "k.1":"spore-print-color",
    "s.3":"population",
    "u":"habitat"
})
mushroom_data.head()

Unnamed: 0,classes,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
1,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
2,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
3,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
4,e,x,y,y,t,a,f,c,b,n,...,s,w,w,p,w,o,p,k,n,g
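
Because the data file has no header row, reading it with the default settings turns the first record into column names, so that record is lost as data. A minimal sketch of an alternative, assuming the standard UCI attribute order, is to pass `header=None` together with `names=`. The two-record `sample` string and the `column_names` list are illustrative stand-ins, not part of the original notebook:

```python
import io
import pandas as pd

# Assumed attribute order (the usual UCI mushroom layout).
column_names = [
    "classes", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Two illustrative records in the same layout as the data file.
sample = (
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
)

# header=None keeps the first record as data; na_values marks '?' as missing.
df = pd.read_csv(io.StringIO(sample), header=None, names=column_names,
                 na_values="?")
print(df.shape)
```

With the real file, the `io.StringIO(sample)` argument would be replaced by the file path, and the later rename step would no longer be needed.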


Now we need to check for any missing values and find out what type of data we are working with.

In [5]:
mushroom_data = mushroom_data.replace({'?':np.nan})  # np.NaN was removed in NumPy 2.0

In [6]:
mushroom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
classes                     8123 non-null object
cap-shape                   8123 non-null object
cap-surface                 8123 non-null object
cap-color                   8123 non-null object
bruises                     8123 non-null object
odor                        8123 non-null object
gill-attachment             8123 non-null object
gill-spacing                8123 non-null object
gill-size                   8123 non-null object
gill-color                  8123 non-null object
stalk-shape                 8123 non-null object
stalk-root                  5643 non-null object
stalk-surface-above-ring    8123 non-null object
stalk-surface-below-ring    8123 non-null object
stalk-color-above-ring      8123 non-null object
stalk-color-below-ring      8123 non-null object
veil-type                   8123 non-null object
veil-color                  8123 non-null object
ring-number

In [7]:
mushroom_data.isnull().values.any()

True

In [8]:
mushroom_data.isnull().values.sum()

2480

Before checking for missing values, we replace the placeholder '?' with NaN so that pandas recognises it as null. The info output shows that only the stalk-root column has missing values (5643 non-null out of 8123 rows), and the total number of missing values is 2480.
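
The total of 2480 does not show which columns are affected. A small sketch of a per-column count, using a toy frame in place of mushroom_data (the values in `toy` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for mushroom_data; '?' marks an unknown stalk root.
toy = pd.DataFrame({
    "classes": ["p", "e", "e", "p"],
    "stalk-root": ["e", "?", "c", "?"],
})

toy = toy.replace({"?": np.nan})
missing_per_column = toy.isnull().sum()   # number of missing values per column
missing_share = toy.isnull().mean()       # fraction of rows missing per column
print(missing_per_column[missing_per_column > 0])
```

The same two calls on mushroom_data would list stalk-root as the only column with a non-zero count.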

In [9]:
mushroom_data["classes"].value_counts()

e    4208
p    3915
Name: classes, dtype: int64

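The classes column has no missing values: 4208 mushrooms are edible (e) and 3915 are poisonous (p), so the prevalence of poisonous mushrooms is 3915/8123, about 48%. Every column has dtype object, so the data must be converted to a numeric representation before modelling.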
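
Question 5 asks how prevalence affects the positive and negative predictive values. The sketch below computes the prevalence from the counts above and applies Bayes' rule. The 99% sensitivity and specificity are assumed values for a hypothetical classifier, not results from this notebook, and `predictive_values` is an illustrative helper:

```python
# Class counts reported by value_counts() above.
n_edible, n_poisonous = 4208, 3915
prevalence = n_poisonous / (n_edible + n_poisonous)
print(f"prevalence of poisonous: {prevalence:.3f}")

def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for the given rates, using Bayes' rule."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)

# Hypothetical classifier with assumed 99% sensitivity and specificity.
for prev in (prevalence, 0.10, 0.01):
    ppv, npv = predictive_values(0.99, 0.99, prev)
    print(f"prevalence={prev:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumed rates the positive predictive value drops to 0.5 when only 1% of mushrooms are poisonous, while the negative predictive value rises. A model evaluated on this roughly balanced dataset could therefore look better on PPV than it would in a setting where poisonous mushrooms are rare.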

Now we will split the data into the dependent (target) variable and the independent (predictor) variables. The dependent variable is the classes column, because we use the data in the other columns to determine which features result in mushrooms being poisonous or edible.

The data in the dataframe is categorical, so we use a dummy (one-hot) representation: each column is split into one indicator column per value it takes, and drop_first=True drops the first indicator of each column. The result is a numeric matrix that the models can work with.
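
To make the effect of the dummy representation concrete, here is a small sketch on a toy frame (the `toy` values are made up; only the column names echo the dataset):

```python
import pandas as pd

toy = pd.DataFrame({"odor": ["a", "l", "p", "a"],
                    "bruises": ["t", "f", "t", "t"]})

# drop_first=True drops the first level of each column (in sorted order),
# so 'odor' with 3 levels yields 2 indicator columns and 'bruises' yields 1.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped level becomes the baseline: a row with zeros in both odor columns has odor 'a'.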

In [10]:
frame2 = mushroom_data.drop('classes',axis=1)

In [11]:
frame2 = pd.get_dummies(frame2,drop_first = True)

In [12]:
Y = mushroom_data['classes']
# Note: columns[1:] skips the first dummy column of frame2, so that feature is not used.
X = frame2[frame2.columns[1:]]

In [13]:
X_train, X_test, y_train, y_test = train_test_split(  
    X, Y, test_size = 0.3, random_state = 42) 

In [14]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(5686, 109)
(5686,)
(2437, 109)
(2437,)


In [15]:
svc=SVC() # The default kernel used by SVC is the gaussian kernel
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [16]:
prediction = svc.predict(X_test)

In [17]:
cm = confusion_matrix(y_test, prediction)
# Correct predictions lie on the diagonal of the confusion matrix.
# np.trace avoids shadowing the built-in sum().
accuracy = np.trace(cm) / cm.sum()
print(accuracy)

0.9963069347558473
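
Accuracy alone does not answer the precision and recall part of question 4. A minimal sketch using scikit-learn's metric functions follows; the toy `y_true` and `y_pred` lists stand in for y_test and prediction, and 'p' (poisonous) is treated as the positive class:

```python
from sklearn.metrics import classification_report, precision_score, recall_score

# Toy labels standing in for y_test and prediction ('p' = poisonous).
y_true = ["p", "p", "e", "e", "p", "e"]
y_pred = ["p", "e", "e", "p", "p", "e"]

# Precision: of the mushrooms predicted poisonous, how many really are.
precision = precision_score(y_true, y_pred, pos_label="p")
# Recall: of the truly poisonous mushrooms, how many were caught.
recall = recall_score(y_true, y_pred, pos_label="p")
print(f"precision={precision:.3f} recall={recall:.3f}")

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true, y_pred))
```

For this problem recall on the poisonous class is arguably the more important figure, since a missed poisonous mushroom is the costly error.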


The data has already been split into training and test sets. Next we prepare the data for a correlation matrix, which will allow us to plot a heatmap. The heatmap gives a clearer view of the relationships between the columns and shows how strong each relationship is.

In [18]:
# Note: iloc[:, 1:-1] keeps every column except the first ('classes') and the last ('habitat').
mushroom_data = mushroom_data.iloc[:,1:-1]

In [19]:
label_encoder = LabelEncoder()
mushroom_data.iloc[:,0] = label_encoder.fit_transform(mushroom_data.iloc[:,0]).astype('float64')
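
The correlation matrix itself is not computed in the cells above. One possible sketch, assuming every categorical column is first replaced by integer codes, is shown here on a toy frame (the `toy` values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "classes": ["p", "e", "e", "p", "e", "p"],
    "odor":    ["f", "a", "l", "f", "a", "p"],
    "bruises": ["f", "t", "t", "f", "t", "t"],
})

# Replace each category with an integer code, column by column.
encoded = toy.apply(lambda col: col.astype("category").cat.codes)

# Pairwise Pearson correlation of the integer codes.
corr = encoded.corr()
print(corr.round(2))
```

Passing `corr` to `sns.heatmap(corr, annot=True, cmap="coolwarm")` would then draw the heatmap. Integer codes impose an arbitrary order on nominal categories, so these correlations are only a rough guide, and a column with a single value gives NaN entries.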

In [20]:
prediction = svc.predict(X_test)

In [21]:
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
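
Questions 3 and 4 ask which factors predict the class and how accurate the tree is. A sketch of both checks on toy one-hot data follows; `X_toy`, `y_toy` and `tree_model` are illustrative names, and the data is constructed so that a single feature separates the classes:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy one-hot features: 'odor_f' separates the classes perfectly,
# 'cap-shape_x' carries no information about the class.
X_toy = pd.DataFrame({
    "odor_f":      [1, 1, 1, 1, 0, 0, 0, 0],
    "cap-shape_x": [1, 0, 1, 0, 1, 0, 1, 0],
})
y_toy = pd.Series(["p", "p", "p", "p", "e", "e", "e", "e"])

tree_model = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# score() returns the mean accuracy on the data it is given.
train_accuracy = tree_model.score(X_toy, y_toy)

# feature_importances_ ranks features by their share of impurity reduction.
importances = pd.Series(tree_model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(train_accuracy)
print(importances)
```

In the notebook the equivalent calls would be `dt.score(X_train, y_train)`, `dt.score(X_test, y_test)` and `pd.Series(dt.feature_importances_, index=X.columns)`.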

In [22]:
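import graphviz  # Python package; rendering also needs the Graphviz binaries

# Export the fitted tree to DOT. class_names must follow dt.classes_,
# which is sorted: ['e', 'p'].
dot_data = export_graphviz(dt, out_file=None,
                           feature_names=list(X.columns),
                           class_names=list(dt.classes_),
                           filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph  # displays the tree when it is the last expression in a cell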