<font size=5>**Mushroom Classification**</font>



**About this Dataset**


This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one.


**Attribute Information**


1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
4. bruises?: bruises=t,no=f
5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
6. gill-attachment: attached=a,descending=d,free=f,notched=n
7. gill-spacing: close=c,crowded=w,distant=d
8. gill-size: broad=b,narrow=n
9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
10. stalk-shape: enlarging=e,tapering=t
11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
16. veil-type: partial=p,universal=u
17. veil-color: brown=n,orange=o,white=w,yellow=y
18. ring-number: none=n,one=o,two=t
19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d

**Import the necessary data processing and machine learning libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns


from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

import os
print(os.listdir("../input"))


Read the CSV file into a pandas DataFrame.

In [None]:
df=pd.read_csv('../input/mushrooms.csv')

Check the sample data and do some simple analysis on the dataset

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Check the distribution of the target class.
Here the two classes are almost evenly balanced.

In [None]:
sns.countplot(x='class',data=df)

scikit-learn estimators expect numeric values, so add a new column `class_code`
that assigns a numeric value to each of the two classes.

In [None]:
lb_class = LabelEncoder()
df["class_code"] = lb_class.fit_transform(df["class"])
df[["class", "class_code"]].head(5)
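To see which integer each label receives, the fitted encoder can be inspected. The snippet below is a minimal sketch on a made-up five-row frame (the `toy` data is an assumption, not part of the mushroom file); it relies on `LabelEncoder` storing its labels in sorted order in `classes_`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the `class` column (e = edible, p = poisonous).
toy = pd.DataFrame({"class": ["p", "e", "e", "p", "e"]})

encoder = LabelEncoder()
toy["class_code"] = encoder.fit_transform(toy["class"])

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

With the labels `e` and `p`, this means edible is encoded as 0 and poisonous as 1.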

<font size=5>**Exploratory Data Analysis**</font>

Let us do some exploratory data analysis on this dataset.

The graph below shows, for each attribute, how the target class (poisonous or edible) is distributed within each attribute value, as a percentage.

Example: consider the bruises variable (bruises=t, no=f).

If a sample has bruises (bruises=t), the probability that it is edible is 82%.

In [None]:
sns.set(style="darkgrid")
fig,axs=plt.subplots(nrows=8,ncols=3,figsize=(30, 75))

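# Walk the 8x3 subplot grid row by row, one column of the dataframe per subplot
for k, col in enumerate(df.columns):
    i, j = divmod(k, 3)  # row and column of the current subplot

    axe=sns.countplot(x=col, hue="class", data=df,ax=axs[i][j]) # for Seaborn version 0.7 and more

    # The first half of the bars belongs to one class, the second half to the other
    bars = axe.patches
    half = int(len(bars)/2)
    left_bars = bars[:half]
    right_bars = bars[half:]

    for left, right in zip(left_bars, right_bars):
        # A missing bar has a NaN height; treat it as a count of 0
        height_l = np.nan_to_num(left.get_height())
        height_r = np.nan_to_num(right.get_height())
        total = height_l + height_r

        # Label each bar with its class's share within this attribute value
        axe.text(left.get_x() + left.get_width()/2., height_l + 40, '{0:.0%}'.format(height_l/total), ha="center")
        axe.text(right.get_x() + right.get_width()/2., height_r + 40, '{0:.0%}'.format(height_r/total), ha="center")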

From the graph above we can make some interesting observations.

* If a sample has bruises (bruises=t), the probability that it is edible is 82%.
* If a sample has no odor (odor=n), the probability that it is edible is 97%.
* If a sample has crowded gill spacing (gill-spacing=w), the probability that it is edible is 91%.
* If a sample has a narrow gill size (gill-size=n), the probability that it is poisonous is 89%.

Various other attributes can also be used to determine whether a sample is edible or poisonous.
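The per-value percentages can also be computed as a table rather than read off the bars. The following is a minimal sketch using `pd.crosstab` with `normalize="index"`; the eight-row `toy` frame is invented for illustration, so its numbers are not the dataset's.

```python
import pandas as pd

# Hypothetical sample: one attribute column and the target class.
toy = pd.DataFrame({
    "bruises": ["t", "t", "t", "t", "f", "f", "f", "f"],
    "class":   ["e", "e", "e", "p", "p", "p", "p", "e"],
})

# normalize="index" divides each row by its row total, so every row
# holds the class shares within one attribute value.
shares = pd.crosstab(toy["bruises"], toy["class"], normalize="index")
print((shares * 100).round(0))
```

On the real dataframe, the same call with `df["bruises"]` and `df["class"]` should give the figures annotated on the corresponding subplot.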


Check out all the columns present in the dataframe.

In [None]:
df.columns

Since these are categorical variables without any definite order (nominal variables), dummy variables have to be created for all of them.

This can be done with the `get_dummies` function from pandas.

In [None]:
df=pd.get_dummies(data=df,columns=[ 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],drop_first=False)

Check out the head of the dataframe: it now has 119 columns!

In [None]:
print(len(df.columns))
df.head()
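To make the effect of `get_dummies` concrete, here is a small sketch on a made-up frame with two of the attributes (the three rows are an assumption for illustration). Each category of each column becomes its own indicator column named `<column>_<value>`.

```python
import pandas as pd

# Hypothetical three-row sample with two nominal attributes.
toy = pd.DataFrame({
    "gill-size": ["b", "n", "b"],
    "habitat": ["g", "d", "u"],
})

# One indicator column per (attribute, value) pair; drop_first=False keeps them all.
encoded = pd.get_dummies(toy, columns=["gill-size", "habitat"], drop_first=False)
print(list(encoded.columns))
print(encoded)
```

Every row has exactly one active indicator per original attribute, which is why the column count grows with the number of distinct values.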

In [None]:
X=df.drop(['class','class_code'],axis=1)
y=df['class_code']

Since the dataframe now has 119 columns (117 one-hot features once `class` and `class_code` are dropped), it is very likely that some of the attributes are correlated. One way to address this problem and reduce the dimensionality is to use **Principal Component Analysis (PCA)**.

**PCA** is a technique for feature extraction: it combines the input variables into new components ordered by how much variance they capture, so we can drop the least important components while still retaining the most valuable parts of all of the original variables.

Also please note that PCA does make independent variables less interpretable.

Let's try to figure out how many Principal components we need so that it captures a good variance of the dataset.

In [None]:
n_components=[1,10,20,30,40,50,75,100]

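for comp in n_components:
    pca_comp=PCA(n_components=comp)
    pca_comp.fit(X)  # fitting is enough here; the transformed data is not used
    # percentage of variance explained by the first `comp` components together
    print(comp,sum(pca_comp.explained_variance_ratio_)*100)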

The output above shows that the first component captures around 16% of the variance in the dataset, and that 10 components capture around 65%.

With 117 feature columns after encoding (119 including `class` and `class_code`), we are able to explain 65% of the variance with only 10 principal components!

With 40 components we capture 95% of the variance, and with 50 and 75 components around 98% and 99% respectively.
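An alternative to refitting PCA for each candidate size is to fit it once with all components and take the cumulative sum of `explained_variance_ratio_`. The sketch below does this on random binary data standing in for `X` (the `X_toy` array and the 0.95 `target` are assumptions for illustration), so the resulting count will differ from the mushroom data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for X: 200 rows of 12 binary indicator columns.
X_toy = rng.integers(0, 2, size=(200, 12)).astype(float)

# Fit once with every component kept, then accumulate the variance ratios.
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.95
# Index of the first cumulative value >= target, plus 1 to turn it into a count.
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed, cumulative[n_needed - 1])
```

Plotting `cumulative` against the component index gives the same information as the printed table, for every size at once.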

**Note:** One more important aspect to consider when using PCA is that the variables normally need to be scaled. However, since all the features here are one-hot encoded categorical variables on the same 0/1 scale, I am choosing to skip this step, as it does not provide a considerable benefit.

Let's go ahead and choose the number of components as 40.

In [None]:
pca = PCA(n_components=40)
pca_x=pd.DataFrame(pca.fit_transform(X))

sum(pca.explained_variance_ratio_)

Check out the head of the newly created dataframe after applying PCA. You can see that it has 40 principal components.

In [None]:
pca_x.head()

Do a train-test split on the newly created dataframe with test_size as 0.3

In [None]:
X_train, X_test, y_train, y_test = train_test_split(pca_x,y,test_size=0.3)
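The split above is random on every run. If reproducible results and identical class proportions in both parts are wanted, `train_test_split` accepts `random_state` and `stratify`. This is a minimal sketch on a made-up balanced dataset (`X_toy`, `y_toy`, and the seed 42 are assumptions), not a change to the notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, perfectly balanced labels.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(len(X_tr), len(X_te), np.bincount(y_te))
```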

<font size=4>**Machine Learning**</font>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

LR_model= LogisticRegression()

LR_model.fit(X_train,y_train)

LR_y_pred = LR_model.predict(X_test) # predicted class labels (0 or 1), not probabilities

accuracy=accuracy_score(y_test, LR_y_pred)*100
print("Accuracy Score: ","{0:.2f}".format(accuracy))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, LR_y_pred)),annot=True,fmt="g", cmap='viridis')
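Note that `predict` returns hard class labels, while class probabilities come from `predict_proba`. The sketch below contrasts the two on synthetic data from `make_classification` (the data and the `clf` name are assumptions for illustration, not the notebook's model).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical binary classification data.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

labels = clf.predict(X_toy[:5])        # hard labels, shape (5,)
proba = clf.predict_proba(X_toy[:5])   # shape (5, 2), columns ordered as clf.classes_
positive_proba = proba[:, 1]           # probability of the class encoded as 1

print(labels)
print(positive_proba.round(3))
```

With the encoding used in this notebook, the second column would be the probability of the poisonous class.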

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF_model=RandomForestClassifier()

RF_model.fit(X_train,y_train)
RF_y_pred = RF_model.predict(X_test) # predicted class labels (0 or 1), not probabilities

accuracy=accuracy_score(y_test, RF_y_pred)*100
print("Accuracy Score: ","{0:.2f}".format(accuracy))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, RF_y_pred)),annot=True,fmt="g", cmap='viridis')

In [None]:
from sklearn import svm

SVM_model=svm.LinearSVC()

SVM_model.fit(X_train,y_train)
SVM_y_pred = SVM_model.predict(X_test) # predicted class labels (0 or 1), not probabilities

accuracy=accuracy_score(y_test, SVM_y_pred)*100

print("Accuracy Score: ","{0:.2f}".format(accuracy))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, SVM_y_pred)),annot=True,fmt="g", cmap='viridis')

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model=KNeighborsClassifier()
knn_model.fit(X_train,y_train)

knn_y_pred = knn_model.predict(X_test)  

accuracy=accuracy_score(y_test, knn_y_pred)*100

print("Accuracy Score: ","{0:.2f}".format(accuracy))
sns.heatmap(pd.DataFrame(confusion_matrix(y_test, knn_y_pred)),annot=True,fmt="g", cmap='viridis')
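Each score above comes from a single random split. The `cross_val_score` function imported earlier can give a steadier estimate by averaging over several folds. The following is a rough sketch on synthetic data (`X_toy`, `y_toy`, the 10 components and 5 folds are all assumptions); it wraps PCA and the classifier in a pipeline so that PCA is refit on the training part of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical data standing in for the encoded mushroom features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# PCA followed by the classifier; both are fitted inside each fold.
model = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print("Mean accuracy: {0:.2f} (std {1:.2f})".format(scores.mean() * 100, scores.std() * 100))
```

Swapping `LogisticRegression` for any of the other models in this section would allow them to be compared on the same folds.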