# Mushroom classification
The dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/uciml/mushroom-classification

## Summary

The dataset turned out to be so simple that the first classifier I trained reached an F1 score of 1.0. <b>It was a totally unchallenging dataset.</b>

If I made a mistake that led to such a high score, please let me know.

## Setting up

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import zipfile

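# Build paths with os.path.join so they work on any OS and avoid the invalid "\m" escape.
ZIP_path = os.path.join("datasets", "mushrooms", "archive.zip")
CSV_directory = os.path.join("datasets", "mushrooms")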
CSV_filename = "mushrooms.csv"
CSV_path = os.path.join(CSV_directory, CSV_filename)

If the .csv file has not yet been extracted from the .zip archive, it has to be extracted first.

In [8]:
if not os.path.exists(CSV_path):
    with zipfile.ZipFile(ZIP_path, 'r') as zip_f:
        zip_f.extractall(CSV_directory)

## Data insight

In [10]:
from sklearn.model_selection import train_test_split

mushrooms_df = pd.read_csv(CSV_path)
mushrooms_df.columns

Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')

In [15]:
X_train, X_test, y_train, y_test = train_test_split(mushrooms_df.drop("class", axis=1), mushrooms_df["class"])
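The split above relies on the defaults, so the rows that land in each set change from run to run. As a minimal sketch, assuming a reproducible and class-balanced split is wanted, `random_state` and `stratify` can be passed explicitly. The toy frame and the value `42` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for mushrooms_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "class": ["e", "p"] * 10,
    "odor": ["n", "f"] * 10,
})

# random_state fixes the shuffle; stratify keeps the class ratio similar in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.drop("class", axis=1),
    toy_df["class"],
    test_size=0.25,
    random_state=42,
    stratify=toy_df["class"],
)

print(len(X_tr), len(X_te))
```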

In [16]:
X_train.head(3)

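| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 791 | f | y | n | t | l | f | c | b | p | e | ... | y | w | w | p | w | o | p | k | y | p |
| 6744 | x | f | g | f | n | f | w | b | g | e | ... | k | w | w | p | w | t | p | w | s | g |
| 2113 | x | f | e | t | n | f | c | b | n | t | ... | s | p | g | p | w | o | p | n | v | d |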


Attribute Information: (classes: edible=e, poisonous=p)

* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
* cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
* bruises: bruises=t, no=f
* odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
* gill-attachment: attached=a, descending=d, free=f, notched=n
* gill-spacing: close=c, crowded=w, distant=d
* gill-size: broad=b, narrow=n
* gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
* stalk-shape: enlarging=e, tapering=t
* stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
* stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
* stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
* stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
* stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
* veil-type: partial=p, universal=u
* veil-color: brown=n, orange=o, white=w, yellow=y
* ring-number: none=n, one=o, two=t
* ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
* spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
* population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
* habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d

In [18]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6093 entries, 791 to 547
Data columns (total 22 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   cap-shape                 6093 non-null   object
 1   cap-surface               6093 non-null   object
 2   cap-color                 6093 non-null   object
 3   bruises                   6093 non-null   object
 4   odor                      6093 non-null   object
 5   gill-attachment           6093 non-null   object
 6   gill-spacing              6093 non-null   object
 7   gill-size                 6093 non-null   object
 8   gill-color                6093 non-null   object
 9   stalk-shape               6093 non-null   object
 10  stalk-root                6093 non-null   object
 11  stalk-surface-above-ring  6093 non-null   object
 12  stalk-surface-below-ring  6093 non-null   object
 13  stalk-color-above-ring    6093 non-null   object
 14  stalk-color-below-ring 

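There are no null values in the dataset. Note, however, that missing `stalk-root` values are encoded as `?` (see the attribute information above), so they are not reported as nulls.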
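Because the attribute information lists `missing=?` for `stalk-root`, a null check alone may not tell the whole story. The sketch below shows one way to count such placeholders; `toy_df` is a made-up stand-in for `mushrooms_df`, so the counts it prints say nothing about the real data.

```python
import pandas as pd

# Toy frame standing in for mushrooms_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "stalk-root": ["b", "?", "e", "?", "c"],
    "odor": ["n", "f", "n", "a", "l"],
})

# isnull() does not treat the '?' placeholder as missing.
null_counts = toy_df.isnull().sum()

# Count the placeholder explicitly, column by column.
placeholder_counts = (toy_df == "?").sum()

print(null_counts)
print(placeholder_counts)
```

On the real data the same comparison could be run on `mushrooms_df` to decide whether `?` should be kept as its own category or handled differently.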

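We have to transform the labels 'e' and 'p' into 1 and 0 (edible is the positive class), because the F1 scoring used below is built on precision and recall, which need a designated positive class; scikit-learn's `"f1"` scorer treats label 1 as positive by default.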

In [39]:
def transform_labels_to_01(labels, pos_lab):
    return [1 if y == pos_lab else 0 for y in labels]

y_train_01 = transform_labels_to_01(y_train, 'e')
y_test_01 = transform_labels_to_01(y_test, 'e')
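The list comprehension above works as is. As an alternative sketch, the same mapping can be done with a vectorized pandas comparison, which keeps the result as a Series with its original index. `toy_labels` is a hypothetical stand-in for `y_train`.

```python
import pandas as pd

# Toy labels standing in for y_train; the values are made up for illustration.
toy_labels = pd.Series(["e", "p", "p", "e"], index=[791, 6744, 2113, 547])

# The comparison yields a boolean Series; astype(int) maps True -> 1 and False -> 0.
toy_labels_01 = (toy_labels == "e").astype(int)

print(toy_labels_01.tolist())  # [1, 0, 0, 1]
```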

## Model selection

In [26]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
X_train_tr = one_hot.fit_transform(X_train)
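With default settings, `OneHotEncoder` raises an error at transform time if it meets a category it did not see while fitting, which could happen if a rare value ends up only in the test split. The sketch below illustrates this on made-up data and shows the `handle_unknown="ignore"` option as one possible safeguard; it is an optional variation, not what the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration.
train_toy = pd.DataFrame({"odor": ["n", "f", "a"], "habitat": ["d", "g", "d"]})
test_toy = pd.DataFrame({"odor": ["m"], "habitat": ["g"]})  # 'm' was not seen in training

# Default behaviour: an unknown category raises ValueError during transform.
strict_enc = OneHotEncoder().fit(train_toy)
try:
    strict_enc.transform(test_toy)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore" the unknown category becomes an all-zero block.
tolerant_enc = OneHotEncoder(handle_unknown="ignore").fit(train_toy)
encoded = tolerant_enc.transform(test_toy).toarray()

print(strict_failed)
print(encoded)
```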

In [41]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rnd_clf = RandomForestClassifier()
params = {
    "n_estimators": [100, 250],
    "max_leaf_nodes": [20, 30]
}

grid_cv = GridSearchCV(rnd_clf, params, verbose=3, cv=3, scoring="f1")
grid_cv.fit(X_train_tr, y_train_01)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV 1/3] END max_leaf_nodes=20, n_estimators=100;, score=0.999 total time=   0.2s
[CV 2/3] END max_leaf_nodes=20, n_estimators=100;, score=1.000 total time=   0.2s
[CV 3/3] END max_leaf_nodes=20, n_estimators=100;, score=0.999 total time=   0.1s
[CV 1/3] END max_leaf_nodes=20, n_estimators=250;, score=0.999 total time=   0.4s
[CV 2/3] END max_leaf_nodes=20, n_estimators=250;, score=1.000 total time=   0.4s
[CV 3/3] END max_leaf_nodes=20, n_estimators=250;, score=0.999 total time=   0.5s
[CV 1/3] END max_leaf_nodes=30, n_estimators=100;, score=1.000 total time=   0.1s
[CV 2/3] END max_leaf_nodes=30, n_estimators=100;, score=1.000 total time=   0.1s
[CV 3/3] END max_leaf_nodes=30, n_estimators=100;, score=1.000 total time=   0.1s
[CV 1/3] END max_leaf_nodes=30, n_estimators=250;, score=1.000 total time=   0.5s
[CV 2/3] END max_leaf_nodes=30, n_estimators=250;, score=1.000 total time=   0.5s
[CV 3/3] END max_leaf_nodes=30, n_esti

In [42]:
best_clf = grid_cv.best_estimator_

In [43]:
from sklearn.pipeline import Pipeline

full_pipeline = Pipeline([
    ("one_hot", one_hot),
    ("clf", best_clf)
])
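The pipeline above is assembled from an encoder and a classifier that were already fitted separately. A different arrangement, shown here only as a sketch on made-up data, is to hand the whole pipeline to `GridSearchCV` so that the encoder is refitted inside every cross-validation fold. Hyperparameters are then addressed with the step name as a prefix (`clf__`). The toy frame, the small grid values and `random_state=0` are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration.
X_toy = pd.DataFrame({
    "odor": ["n", "f", "n", "f", "a", "p"] * 5,
    "habitat": ["d", "g", "g", "d", "d", "g"] * 5,
})
y_toy = [1, 0, 1, 0, 1, 0] * 5

pipe = Pipeline([
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Step name + double underscore + parameter name.
param_grid = {"clf__n_estimators": [10, 20], "clf__max_leaf_nodes": [4, 8]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(X_toy, y_toy)

# best_estimator_ is a fitted pipeline that accepts raw categorical rows.
fitted_pipeline = search.best_estimator_
predictions = fitted_pipeline.predict(X_toy)
print(search.best_params_)
```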

## Validation using test dataset

In [45]:
from sklearn.metrics import f1_score

y_pred = full_pipeline.predict(X_test)
f1_score(y_test_01, y_pred)

1.0
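One possible way to check whether a perfect score is plausible, rather than the result of a mistake, is to look at how well a single feature separates the classes. The sketch below cross-tabulates one feature against the class on made-up data and computes the accuracy of a "majority class per value" rule. `toy_df` and the choice of `odor` are illustrative assumptions; the numbers do not describe the real dataset.

```python
import pandas as pd

# Toy frame, made up for illustration; on the real data mushrooms_df would be used.
toy_df = pd.DataFrame({
    "class": ["e", "e", "p", "p", "e", "p"],
    "odor":  ["n", "a", "f", "f", "n", "p"],
})

# Rows: odor values, columns: classes, cells: counts.
table = pd.crosstab(toy_df["odor"], toy_df["class"])

# Accuracy of the rule "predict the majority class for each odor value".
single_feature_accuracy = table.max(axis=1).sum() / len(toy_df)

print(table)
print(single_feature_accuracy)
```

If a single column already gives near-perfect separation on the real data, a perfect F1 score from a random forest is less surprising.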