## Mushroom Prediction

### The Dataset

This dataset contains 61,069 hypothetical mushrooms with caps, based on 173 species (353 mushrooms per species). Each mushroom is labeled as definitely edible, definitely poisonous, or of unknown edibility and not recommended; the last class is merged into the poisonous class. Of the 20 variables, 17 are nominal and 3 are metrical.



#### Attribute Information:

The class is binary: edible=e or poisonous=p (the latter also contains mushrooms of unknown edibility).

The twenty remaining variables are nominal (n) or metrical (m). The first nine are:

- cap-diameter (m): float number in cm
- cap-shape (n): bell=b, conical=c, convex=x, flat=f, sunken=s, spherical=p, others=o
- cap-surface (n): fibrous=i, grooves=g, scaly=y, smooth=s, shiny=h, leathery=l, silky=k, sticky=t, wrinkled=w, fleshy=e
- cap-color (n): brown=n, buff=b, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y, blue=l, orange=o, black=k
- does-bruise-or-bleed (n): bruises-or-bleeding=t, no=f
- gill-attachment (n): adnate=a, adnexed=x, decurrent=d, free=e, sinuate=s, pores=p, none=f, unknown=?
- gill-spacing (n): close=c, distant=d, none=f
- gill-color (n): see cap-color + none=f
- stem-height (m): float number in cm

### Project Aims


The aim is to predict whether a mushroom is poisonous using only the variables described above.

### EDA

In [2]:
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Note: newer releases of pandas_profiling are published as ydata_profiling
from pandas_profiling import ProfileReport

from category_encoders import OrdinalEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import pickle


In [3]:
# import dataset
dataset = pd.read_csv('../raw_data/secondary_data.csv', sep=";", low_memory=False)

In [4]:
df = dataset.copy()

In [5]:
df.head()

Unnamed: 0,class,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-spacing,gill-color,stem-height,...,stem-root,stem-surface,stem-color,veil-type,veil-color,has-ring,ring-type,spore-print-color,habitat,season
0,p,15.26,x,g,o,f,e,,w,16.95,...,s,y,w,u,w,t,g,,d,w
1,p,16.6,x,g,o,f,e,,w,17.99,...,s,y,w,u,w,t,g,,d,u
2,p,14.07,x,g,o,f,e,,w,17.8,...,s,y,w,u,w,t,g,,d,w
3,p,14.17,f,h,e,f,e,,w,15.77,...,s,y,w,u,w,t,p,,d,w
4,p,14.64,x,h,o,f,e,,w,16.53,...,s,y,w,u,w,t,p,,d,w


In [6]:
#profile = ProfileReport(df, title="Pandas Profiling Report", html={"style":{"full_width": True}})
#profile

### Insights:
- Of the 21 columns, only 3 are numeric; 2 are boolean and 16 are categorical.
- There is a lot of missing data: some columns have more than 80% missing values.
- The numeric columns are all strongly correlated with one another.
- The boolean and numeric columns have no missing values.
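The correlation claim can be checked with `DataFrame.corr`. The sketch below is only an illustration: it uses the five rows of numeric values from the `X.head()` output further down, so its numbers say nothing reliable about the full dataset. The names `sample` and `corr` are assumptions made for this example.

```python
import pandas as pd

# The five rows of numeric values shown in the encoded X.head() output.
sample = pd.DataFrame({
    "cap-diameter": [15.26, 16.60, 14.07, 14.17, 14.64],
    "stem-height": [16.95, 17.99, 17.80, 15.77, 16.53],
    "stem-width": [17.09, 18.19, 17.74, 15.98, 17.20],
})

# Pairwise Pearson correlation between the numeric columns.
corr = sample.corr(method="pearson")
print(corr.round(2))
```

On the real data the same call would be applied to the three numeric columns of `df`.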

Columns with too many missing values have to go. I chose to drop every column with more than 40% missing data.

In [7]:
drop_columns = ['gill-spacing', 'stem-root', 
                'stem-surface', 'veil-type', 
                'veil-color', 'spore-print-color']
df.drop(columns=drop_columns, inplace=True)

This removes 6 columns.
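The column list above is hard-coded. As a minimal sketch, and assuming a simple share-of-missing-values rule, the same kind of list could be derived from the 40% threshold directly. The toy DataFrame and the names `toy`, `threshold`, `missing_share`, and `to_drop` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mushroom data (hypothetical values).
toy = pd.DataFrame({
    "cap-diameter": [15.26, 16.60, 14.07, 14.17, 14.64],
    "gill-spacing": [np.nan, np.nan, np.nan, "c", "d"],   # 60% missing
    "stem-root": ["s", np.nan, np.nan, np.nan, np.nan],   # 80% missing
    "season": ["w", "u", "w", "w", np.nan],               # 20% missing
})

threshold = 0.40  # drop columns with more than 40% missing values

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()
to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(missing_share)
print("Dropped:", to_drop)
```

Deriving the list this way keeps the drop rule and the dropped columns in sync if the threshold changes.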

In [8]:
df.shape

(61069, 15)

#### Splitting the data

In [9]:
X = df.drop(columns='class')
y = df['class']

In [10]:
X = OrdinalEncoder().fit_transform(X)
X.head()

Unnamed: 0,cap-diameter,cap-shape,cap-surface,cap-color,does-bruise-or-bleed,gill-attachment,gill-color,stem-height,stem-width,stem-color,has-ring,ring-type,habitat,season
0,15.26,1,1,1,1,1,1,16.95,17.09,1,1,1,1,1
1,16.6,1,1,1,1,1,1,17.99,18.19,1,1,1,1,2
2,14.07,1,1,1,1,1,1,17.8,17.74,1,1,1,1,1
3,14.17,2,2,2,1,1,1,15.77,15.98,1,1,2,1,1
4,14.64,1,2,1,1,1,1,16.53,17.2,1,1,2,1,1


In [11]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
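In the cells above the encoder is fitted on the full `X` before the split, so the category mapping is learned from rows that later end up in the test set. One possible alternative is to split first and fit the encoder on the training rows only. The sketch below does this with scikit-learn's `OrdinalEncoder` instead of the `category_encoders` one used in the notebook; the toy data and all variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the feature matrix and target (hypothetical values).
toy_X = pd.DataFrame({
    "cap-diameter": [15.26, 16.60, 14.07, 14.17, 14.64, 3.10, 2.80, 4.20],
    "cap-shape": ["x", "x", "x", "f", "x", "b", "b", "s"],
    "season": ["w", "u", "w", "w", "w", "a", "s", "a"],
})
toy_y = pd.Series(["p", "p", "p", "p", "p", "e", "e", "e"], name="class")

# Split first, so the encoder never sees the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42
)

categorical = ["cap-shape", "season"]
encoder = ColumnTransformer(
    [(
        "cat",
        # Categories not seen during fit are encoded as -1 at transform time.
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical,
    )],
    remainder="passthrough",  # numeric columns pass through unchanged
)

X_tr_enc = encoder.fit_transform(X_tr)  # learn the mapping on training rows
X_te_enc = encoder.transform(X_te)      # reuse that mapping on test rows
print(X_tr_enc.shape, X_te_enc.shape)
```

For ordinal codes feeding a tree model the practical effect of the leak is likely small, but fitting on the training split keeps the test score an honest estimate.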

#### Baseline Prediction 

In [12]:
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 4))

Baseline Accuracy: 0.5536
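The baseline is the share of the most frequent class in the training set, which is the accuracy of a model that always predicts that class. A minimal sketch of the same idea with scikit-learn's `DummyClassifier`, on hypothetical toy labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 5 poisonous, 4 edible.
labels = np.array(["p"] * 5 + ["e"] * 4)
features = np.zeros((len(labels), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(features, labels)

baseline_acc = dummy.score(features, labels)
print("Baseline accuracy:", round(baseline_acc, 4))
```

Any real model has to beat this number to be worth using.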


#### Random Forest Model

In [13]:
params = {
    "n_estimators": range(100,1001,150),
    "max_depth": range(5,41,5),
    "criterion": ["gini", "entropy"]
}

In [14]:
model_random_forest = GridSearchCV(
    RandomForestClassifier(random_state=12),
    param_grid=params,
    cv=5,
    n_jobs=-1,
    verbose=2
)
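It helps to know how much work this grid implies before running it. The sketch below counts the candidates with plain Python; `cv_folds`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as `params` above.
params = {
    "n_estimators": range(100, 1001, 150),
    "max_depth": range(5, 41, 5),
    "criterion": ["gini", "entropy"],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of the three parameter lists is one candidate.
n_candidates = len(list(product(*params.values())))
n_fits = n_candidates * cv_folds

print(list(params["n_estimators"]))
print("candidates:", n_candidates, "fits:", n_fits)
```

That is 7 x 8 x 2 = 112 candidates and 560 cross-validation fits, plus one final refit of the best candidate on the full training set, since `GridSearchCV` refits by default.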

In [15]:
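# The grid search is slow, so the fit is commented out;
# the fitted search object is loaded from disk below.
#model_random_forest.fit(X_train, y_train)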

In [16]:
#with open("model_random_forest_mushrooms.pkl", "wb") as m:
#    pickle.dump(model_random_forest, m)

In [20]:
with open("model_random_forest_mushrooms.pkl", "rb") as input_file: 
    rf_model = pickle.load(input_file)

In [21]:
cv_results = pd.DataFrame(rf_model.cv_results_)
cv_results.sort_values("rank_test_score").head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_criterion,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
55,80.231532,7.966767,2.509274,0.212306,gini,40,1000,"{'criterion': 'gini', 'max_depth': 40, 'n_esti...",0.999795,0.999693,0.999693,0.999693,0.999693,0.999713,4.1e-05,1
41,61.039107,0.678129,2.183204,0.103324,gini,30,1000,"{'criterion': 'gini', 'max_depth': 30, 'n_esti...",0.999795,0.999693,0.999693,0.999693,0.999693,0.999713,4.1e-05,1
48,94.348839,6.314678,3.284325,0.889481,gini,35,1000,"{'criterion': 'gini', 'max_depth': 35, 'n_esti...",0.999795,0.999693,0.999693,0.999693,0.999693,0.999713,4.1e-05,1
34,58.420459,0.693013,2.169327,0.156153,gini,25,1000,"{'criterion': 'gini', 'max_depth': 25, 'n_esti...",0.999795,0.999591,0.999693,0.999693,0.999693,0.999693,6.5e-05,4
40,52.282253,0.428641,1.827644,0.037089,gini,30,850,"{'criterion': 'gini', 'max_depth': 30, 'n_esti...",0.999795,0.999591,0.999693,0.999693,0.999693,0.999693,6.5e-05,4


In [22]:
# Get feature names from training data
features = X_train.columns
# Extract importances from model
importances = rf_model.best_estimator_.feature_importances_
# Create a series with feature names and importances
feat_imp = pd.Series(importances, index=features)
# Show the 10 most important features
feat_imp.sort_values(ascending=False).head(10)

stem-width         0.131240
cap-surface        0.129728
gill-attachment    0.111682
stem-height        0.091989
gill-color         0.086448
stem-color         0.086102
cap-diameter       0.082104
cap-color          0.057896
cap-shape          0.055819
ring-type          0.049047
dtype: float64
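The cell above lists the importances but does not plot them. A minimal sketch of a horizontal bar chart, using the values copied from the output above; the `top10` name, the figure size, and the axis labels are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Importances copied from the output above.
top10 = pd.Series({
    "stem-width": 0.131240,
    "cap-surface": 0.129728,
    "gill-attachment": 0.111682,
    "stem-height": 0.091989,
    "gill-color": 0.086448,
    "stem-color": 0.086102,
    "cap-diameter": 0.082104,
    "cap-color": 0.057896,
    "cap-shape": 0.055819,
    "ring-type": 0.049047,
})

fig, ax = plt.subplots(figsize=(8, 5))
# Sort ascending so the most important feature ends up at the top of the chart.
top10.sort_values().plot(kind="barh", ax=ax)
ax.set_xlabel("Impurity-based importance")
ax.set_ylabel("Feature")
ax.set_title("Top 10 feature importances")
fig.tight_layout()
```

Because scikit-learn normalizes the importances to sum to 1, these ten features account for roughly 88% of the total, with no single feature dominating.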

In [23]:
acc_train = rf_model.score(X_train, y_train)
acc_test = rf_model.score(X_test, y_test)

In [31]:
print('best accuracy score is:', round(acc_test*100,2),'%')

best accuracy score is: 99.98 %
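Accuracy alone does not show which kind of error the model makes, and for this problem labeling a poisonous mushroom as edible is the costly one. A confusion matrix separates the two error types. The sketch below uses hypothetical labels, not the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true and predicted labels, for illustration only.
y_true = ["p", "p", "p", "p", "e", "e", "e", "e"]
y_pred = ["p", "p", "p", "e", "e", "e", "e", "p"]

labels = ["e", "p"]
# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
poisonous_as_edible = cm[1, 0]

# Recall for the poisonous class: share of poisonous mushrooms that were caught.
recall_poisonous = recall_score(y_true, y_pred, pos_label="p")

print(cm)
print("Poisonous predicted as edible:", poisonous_as_edible)
print("Recall (poisonous):", recall_poisonous)
```

On the real data the inputs would be `y_test` and `rf_model.predict(X_test)`.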
