# Random search

Dataset for analysis: https://www.kaggle.com/datasets/prishasawhney/mushroom-dataset

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

We have data describing mushrooms. The model's task is to assess whether a mushroom is edible.
Business goal: build an application that helps the user judge whether a mushroom is edible, improving safety.

Variables:
- Cap Diameter
- Cap Shape
- Gill Attachment
- Gill Color
- Stem Height
- Stem Width
- Stem Color
- Season
- Target Class - Is it edible or not?

In [1]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [2]:
# Run this cell if the notebook is executed from the solutions folder
# while the data frame is stored in the data folder
import os
os.chdir('../')

In [3]:
# Load the data
df = pd.read_csv('data/mushroom.csv')

In [4]:
# First rows of the data frame
df.head()

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
0,1372,2,2,10,3.807467,1545,11,1.804273,1
1,1461,2,2,10,3.807467,1557,11,1.804273,1
2,1371,2,2,10,3.612496,1566,11,1.804273,1
3,1261,6,2,10,3.787572,1566,11,1.804273,1
4,1305,6,2,10,3.711971,1464,11,0.943195,1


In [5]:
# Class counts
df['class'].value_counts()

class
1    29675
0    24360
Name: count, dtype: int64

In [6]:
# Spearman rank correlation between all columns
df.corr(method='spearman')

Unnamed: 0,cap-diameter,cap-shape,gill-attachment,gill-color,stem-height,stem-width,stem-color,season,class
cap-diameter,1.0,0.185547,0.305216,0.219728,0.041821,0.872176,0.070386,0.098406,-0.187262
cap-shape,0.185547,1.0,0.071889,0.142247,0.034181,0.211521,0.038383,0.050108,-0.127997
gill-attachment,0.305216,0.071889,1.0,0.114296,-0.075199,0.318893,0.023405,-0.012126,-0.061537
gill-color,0.219728,0.142247,0.114296,1.0,-0.011244,0.170294,0.18598,0.037277,-0.057532
stem-height,0.041821,0.034181,-0.075199,-0.011244,1.0,0.015236,-0.013012,1.3e-05,0.198088
stem-width,0.872176,0.211521,0.318893,0.170294,0.015236,1.0,0.14663,0.08138,-0.228617
stem-color,0.070386,0.038383,0.023405,0.18598,-0.013012,0.14663,1.0,0.019379,-0.104076
season,0.098406,0.050108,-0.012126,0.037277,1.3e-05,0.08138,0.019379,1.0,-0.067507
class,-0.187262,-0.127997,-0.061537,-0.057532,0.198088,-0.228617,-0.104076,-0.067507,1.0


In [7]:
# Missing values: True if a column contains at least one NaN
df.isna().max()

cap-diameter       False
cap-shape          False
gill-attachment    False
gill-color         False
stem-height        False
stem-width         False
stem-color         False
season             False
class              False
dtype: bool

In [8]:
# Split into training and test sets (80% / 20%)
train_x, test_x, train_y, test_y = train_test_split(
    df.drop('class', axis=1), df['class'], test_size=0.2, random_state=1000
)
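
The split above is purely random. Since the two classes are not perfectly balanced, one optional variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below is only an illustration on made-up labels: `X_demo`, `y_demo` and the 55/45 class ratio are assumptions, not the mushroom data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels with a 55% / 45% class ratio (not the real dataset)
y_demo = np.array([1] * 550 + [0] * 450)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio equal in the training and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1000, stratify=y_demo
)

# Share of class 1 in each part
print(y_tr.mean(), y_te.mean())
```

In the notebook itself this would amount to passing `stratify=df['class']` to `train_test_split`.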


In [9]:
# Search space: candidate values for each hyperparameter
params = {
    'n_estimators': [50, 100, 200],
    'min_samples_split': [5, 20, 50],
    'min_samples_leaf': [5, 20, 50],
    'criterion': ['gini', 'entropy'],
    'min_impurity_decrease': [0, 0.001, 0.01],
}
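
To get a feel for how much of this space a random search covers, the sketch below counts all combinations in the grid and draws five of them at random, roughly what the search does internally when every parameter is given as a list. The seed `random_state=0` and the names `n_combinations` and `sampled` are arbitrary choices made for this illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    "n_estimators": [50, 100, 200],
    "min_samples_split": [5, 20, 50],
    "min_samples_leaf": [5, 20, 50],
    "criterion": ["gini", "entropy"],
    "min_impurity_decrease": [0, 0.001, 0.01],
}

# Number of combinations an exhaustive grid search would have to evaluate
n_combinations = len(ParameterGrid(params))
print(n_combinations)  # 3 * 3 * 3 * 2 * 3 = 162

# Five random draws from the same space (seed chosen arbitrarily)
sampled = list(ParameterSampler(params, n_iter=5, random_state=0))
for candidate in sampled:
    print(candidate)
```

With `n_iter=5` the search therefore evaluates only about 3% of the 162 possible combinations.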


In [10]:
# Model to optimize
rf = RandomForestClassifier()

In [11]:
# Random search object: evaluates n_iter=5 randomly sampled parameter combinations
rs = RandomizedSearchCV(estimator=rf, param_distributions=params, n_iter=5)
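
The search above relies on defaults: no seed is set, and candidates are scored with the estimator's default metric. A possible variant, shown here as a sketch on synthetic data, fixes `random_state` for reproducibility and scores candidates with ROC AUC, the metric used for evaluation later on. The data from `make_classification`, the reduced `demo_params` grid and all seeds are assumptions made to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the mushroom data (assumption, not the real dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Reduced, hypothetical search space so the example runs quickly
demo_params = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [5, 20],
    "criterion": ["gini", "entropy"],
}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=demo_params,
    n_iter=3,            # number of sampled combinations
    scoring="roc_auc",   # rank candidates by ROC AUC instead of accuracy
    cv=3,                # 3-fold cross-validation
    random_state=0,      # makes the sampled combinations reproducible
)
demo_search.fit(X, y)

print(demo_search.best_params_)
print(demo_search.best_score_)
```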

In [12]:
# Run the search
res = rs.fit(train_x,train_y)

In [13]:
# Selected parameters
res.best_params_

{'n_estimators': 50,
 'min_samples_split': 20,
 'min_samples_leaf': 20,
 'min_impurity_decrease': 0,
 'criterion': 'entropy'}

In [14]:
# Selected model (refit on the whole training set with the best parameters)
model = res.best_estimator_

In [15]:
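# Mean cross-validated score of the best candidate
# (accuracy, because no `scoring` argument was passed to the search)
res.best_score_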

0.9781853582335703
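
`best_score_` reports only the winning candidate. To compare all sampled candidates, the `cv_results_` attribute can be loaded into a data frame. The sketch below does this for a small search on synthetic data; the data, the one-parameter search space and the name `small_search` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption, not the mushroom dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

small_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=1),
    param_distributions={"min_samples_leaf": [1, 5, 20, 50]},
    n_iter=3,
    cv=3,
    random_state=1,
)
small_search.fit(X, y)

# One row per sampled candidate, best candidate first
results = pd.DataFrame(small_search.cv_results_)
summary = results[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(summary)
```

In the notebook, the same table would be available as `pd.DataFrame(res.cv_results_)`.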

In [20]:
# Predicted probabilities of the positive class (class 1)
train_pred = model.predict_proba(train_x)[:,1]
test_pred  = model.predict_proba(test_x)[:,1]

In [21]:
# ROC AUC on the training set
roc_auc_score(train_y,train_pred)

0.999030201170738

In [22]:
# ROC AUC on the test set
roc_auc_score(test_y, test_pred)

0.998656969787435

In [23]:
# Cross-validation result: ROC AUC on 5 folds of the training set
cross_val_score(model, train_x, train_y, cv=5, scoring='roc_auc')

array([0.99777965, 0.99805377, 0.99811559, 0.99752707, 0.99803005])
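
The five fold scores are easier to report as a mean with a spread. A minimal sketch, using the values printed above (the names `fold_auc`, `mean_auc` and `std_auc` are introduced only for this example):

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
fold_auc = np.array([0.99777965, 0.99805377, 0.99811559, 0.99752707, 0.99803005])

mean_auc = fold_auc.mean()
std_auc = fold_auc.std(ddof=1)  # sample standard deviation across folds

print(f"ROC AUC: {mean_auc:.4f} +/- {std_auc:.4f}")
```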