# Binary Prediction of Poisonous Mushrooms

Playground Series - Season 4, Episode 8

(https://www.kaggle.com/competitions/playground-series-s4e8)

![image](./data/header.png)

__Overview__
Welcome to the 2024 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

__Your Goal:__ The goal of this competition is to predict whether a mushroom is edible or poisonous based on its physical characteristics.

__Dataset Description__
The dataset for this competition (both train and test) was generated from a deep learning model trained on the UCI Mushroom dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

__Note:__ Unlike many previous Tabular Playground datasets, data artifacts have not been cleaned up. There are categorical values in the dataset that are not found in the original. It is up to the competitors how to handle this.

__Files (297.82 MB)__

- train.csv - the training dataset; class is the binary target (either e or p)
- test.csv - the test dataset; your objective is to predict target class for each row
- sample_submission.csv - a sample submission file in the correct format

__Models__
- K-Nearest Neighbors Model
- Gaussian Naive Bayes Model
- Logistic Regression Model
- Support Vector Classification Model
- Decision Tree Model
- Random Forest Model
- Linear Discriminant Analysis Model
- Gradient Boosting Classifier Model
- Neural Network Classifier Model
- XGBoost Classifier
- CatBoost Classifier

In [1]:
%pip install -r requirements.txt --break-system-packages

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import all necessary libraries

import os
import time
import numpy as np
import pandas as pd
import pickle as pkl
import seaborn as sns
import datetime as dt
import warnings as wn
import matplotlib.pyplot as plt

from catboost import Pool
from sklearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

In [3]:
# Ignore all warnings
wn.filterwarnings('ignore')

In [4]:
# Set all path variables

_plots = './plots/'
_tested = './tested/'
_test = './data/test.csv'
_train = './data/train.csv'
_info = './model/model.docx'
_model = './model/model.pkl'
_submission = './data/sample_submission.csv'

In [5]:
# Helpers for loading large CSV files with reduced memory usage

def reduce_mem_usage(df):
    """Iterate through all the columns of a dataframe and downcast each one
    to the smallest dtype that can hold its values, to reduce memory usage.

    Floats may be stored as float16, which keeps only about three
    significant decimal digits; object columns become 'category'.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Inclusive bounds, so values equal to a type's limits still fit
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # Otherwise the column stays as int64
            else:
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # String columns are stored as categoricals
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

def import_data(file, **kwargs):
    """Create a dataframe from a CSV file and optimize its memory usage."""
    # keep_date_col only affects combined date columns and is deprecated in recent pandas
    df = pd.read_csv(file, parse_dates=True, **kwargs)
    df = reduce_mem_usage(df)
    return df
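
As a point of comparison, the same idea can be sketched with pandas' built-in `pd.to_numeric(..., downcast=...)`, which never goes below `float32` for floats. The helper name `downcast_numeric` and the toy frame are hypothetical and only illustrate the technique; this is not the function used in this notebook.

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: shrink dtypes using pandas' own downcasting."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest signed integer type that holds every value
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # Downcasts to float32 at most, never float16
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif out[col].dtype == object or pd.api.types.is_string_dtype(out[col]):
            # String columns become categoricals
            out[col] = out[col].astype("category")
    return out

# Toy data, not taken from the competition files
demo = pd.DataFrame({
    "count": [1, 2, 3],
    "size": [8.8, 4.51, 6.94],
    "shape": ["f", "x", "f"],
})
small = downcast_numeric(demo)
print(small.dtypes)
```

The trade-off is memory versus precision: `float32` uses twice the space of `float16` but keeps roughly seven significant digits.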

In [6]:
# Read the training dataset and optimize its memory usage
train = import_data(_train, index_col = "id", engine="pyarrow")

Memory usage of dataframe is 523.17 MB
Memory usage after optimization is: 95.15 MB
Decreased by 81.8%


In [11]:
# Inspect the unique values of each column

for col in train.columns:
    print(train[col].unique())

['e', 'p']
Categories (2, object): ['e', 'p']
[ 8.8   4.51  6.94 ... 29.8  32.22 38.12]
['f', 'x', 'p', 'b', 'o', ..., '17.44', '4.33', '2.82', '6.53', '19.06']
Length: 75
Categories (75, object): ['', '0.82', '1.66', '10.13', ..., 'w', 'x', 'y', 'z']
['s', 'h', 'y', 'l', 't', ..., '10.34', '10.1', '1.08', 'is k', '0.87']
Length: 84
Categories (84, object): ['', '0.85', '0.87', '0.88', ..., 'w', 'x', 'y', 'z']
['u', 'o', 'b', 'g', 'w', ..., '20.02', '20', '25.98', '8.67', '9.02']
Length: 79
Categories (79, object): ['', '1.51', '10.1', '10.56', ..., 'w', 'x', 'y', 'z']
['f', 't', 'd', 'has-ring', 'w', ..., '3.43', 'e', '4.42', '2.9', 'u']
Length: 27
Categories (27, object): ['', '2.9', '3.43', '4.42', ..., 'w', 'x', 'y', 'z']
['a', 'x', 's', 'd', 'e', ..., '16.27', '11.26', '2.79', 'is f', '13.94']
Length: 79
Categories (79, object): ['', '0.92', '1', '1.32', ..., 'w', 'x', 'y', 'z']
['c', '', 'd', 'f', 'x', ..., '3.62', 'does f', '6.4', '1.88', '55.13']
Length: 49
Categories (49, obje
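
The printout above shows stray values such as numbers and fragments like `'is k'` mixed into the categorical columns. One possible way to handle them, sketched below, is to collapse every value that occurs fewer than a chosen number of times into a single placeholder. The helper name `collapse_rare`, the threshold, and the `"rare"` label are assumptions for illustration, not part of the competition or of this notebook's pipeline.

```python
import pandas as pd

def collapse_rare(series: pd.Series, min_count: int = 100,
                  placeholder: str = "rare") -> pd.Series:
    """Hypothetical helper: replace infrequent categories with a placeholder."""
    # Work on plain objects so a new label can be introduced freely
    values = series.astype(object)
    counts = values.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent values and leave missing values untouched
    mask = values.isin(keep) | values.isna()
    return values.where(mask, placeholder).astype("category")

# Toy column mimicking the artifacts seen in the unique-value printout
toy = pd.Series(["f"] * 5 + ["x"] * 4 + ["17.44", "is k"], dtype="category")
cleaned = collapse_rare(toy, min_count=3)
print(cleaned.value_counts())
```

If such a mapping is used, the set of kept categories should be derived from the training data and then applied unchanged to the test data.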

In [8]:
# Display the first n rows of training
train.head(n=10)

| id | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e | 8.796875 | f | s | u | f | a | c | w | 4.511719 | ... |  |  | w |  |  | f | f |  | d | a |
| 1 | p | 4.511719 | x | h | o | f | a | c | n | 4.789062 | ... |  | y | o |  |  | t | z |  | d | w |
| 2 | e | 6.941406 | f | s | b | f | x | c | w | 6.851562 | ... |  | s | n |  |  | f | f |  | l | w |
| 3 | e | 3.880859 | f | y | g | f | s |  | g | 4.160156 | ... |  |  | w |  |  | f | f |  | d | u |
| 4 | e | 5.851562 | x | l | w | f | d |  | w | 3.369141 | ... |  |  | w |  |  | f | f |  | g | a |
| 5 | p | 4.300781 | x | t | n | f | s | c | n | 5.910156 | ... |  |  | w |  | n | t | z |  | d | a |
| 6 | e | 9.648438 | p | y | w | f | e | c | k | 19.0625 | ... |  | s | w |  |  | t | e |  | g | w |
| 7 | p | 4.550781 | x | e | e | f | a |  | y | 8.3125 | ... |  |  | y |  | w | t | z |  | d | a |
| 8 | p | 7.359375 | f | h | e | f | x | d | w | 5.769531 | ... | b |  | w |  |  | f | f |  | d | a |
| 9 | e | 6.449219 | x | t | n | f | a | d | w | 7.128906 | ... |  |  | e |  |  | f | f |  | d | a |
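
The `cap-diameter` values in this preview (for example 8.796875) are not round numbers because `reduce_mem_usage` stored the column as `float16`. The short sketch below reproduces that rounding, assuming the raw CSV values are 8.8, 4.51 and 6.94 as the earlier unique-value printout suggests.

```python
import numpy as np

# Assumed raw values, taken from the unique() printout of cap-diameter
raw = np.array([8.8, 4.51, 6.94])

# float16 has a 10-bit mantissa, so values are snapped to a coarse grid
as_f16 = raw.astype(np.float16)
as_f32 = raw.astype(np.float32)

# Convert back to float64 to see the values that are actually stored
print("float16:", as_f16.astype(np.float64))
print("float32:", as_f32.astype(np.float64))

# Absolute rounding error introduced by each dtype
print("float16 error:", np.abs(as_f16.astype(np.float64) - raw))
print("float32 error:", np.abs(as_f32.astype(np.float64) - raw))

# Largest finite value representable in float16
print("float16 max:", np.finfo(np.float16).max)
```

Whether this loss of precision matters depends on the model; tree-based models split on thresholds and are usually less sensitive to it than distance-based ones.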


In [None]:
# Describe the data
train.describe()