# Binary Prediction of Poisonous Mushrooms - Modeling

[Competition Link](https://www.kaggle.com/competitions/playground-series-s4e8/data)

The goal of the competition is to predict whether a mushroom is poisonous, based on various mushroom features.

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 29/08/2024   | Martin | Create   | Notebook created. Feature engineering and XGBoost | 
| 17/09/2024   | Martin | Update   | Feature engineering exploration | 


# Content

* [Feature Engineering](#feature-engineering)
* [Baseline XGBoost](#baseline-xgboost)

# Feature Engineering

In [1]:
import os
os.chdir("/tmp/poison_mushrooms")

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import useful_functions as uf

import string

In [None]:
df = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

## General cleaning

In [33]:
df.head()

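|   | id | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | e | 8.8 | f | s | u | f | a | c | w | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | d | a |
| 1 | 1 | p | 4.51 | x | h | o | f | a | c | n | ... | NaN | y | o | NaN | NaN | t | z | NaN | d | w |
| 2 | 2 | e | 6.94 | f | s | b | f | x | c | w | ... | NaN | s | n | NaN | NaN | f | f | NaN | l | w |
| 3 | 3 | e | 3.88 | f | y | g | f | s | NaN | g | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | d | u |
| 4 | 4 | e | 5.85 | x | l | w | f | d | NaN | w | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | g | a |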


In [34]:
df_test.head()

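|   | id | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 3116945 | 8.64 | x | NaN | n | t | NaN | NaN | w | 11.13 | ... | b | NaN | w | u | w | t | g | NaN | d | a |
| 1 | 3116946 | 6.9 | o | t | o | f | NaN | c | y | 1.27 | ... | NaN | NaN | n | NaN | NaN | f | f | NaN | d | a |
| 2 | 3116947 | 2.0 | b | g | n | f | NaN | c | n | 6.18 | ... | NaN | NaN | n | NaN | NaN | f | f | NaN | d | s |
| 3 | 3116948 | 3.47 | x | t | n | f | s | c | n | 4.98 | ... | NaN | NaN | w | NaN | n | t | z | NaN | d | u |
| 4 | 3116949 | 6.17 | x | h | y | f | p | NaN | y | 6.73 | ... | NaN | NaN | y | NaN | y | t | NaN | NaN | d | u |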


In [35]:
# Remove the id column and the columns with too many null values
columns_to_remove = [
  "id",
  "stem-root",
  "veil-type",
  "veil-color",
  "spore-print-color"
]
df = df.drop(columns_to_remove, axis=1)

df_test_id = df_test['id']
df_test = df_test.drop(columns_to_remove, axis=1)

In [36]:
# Check which columns contain NaN values and how many
df.isna().sum()

class                         0
cap-diameter                  4
cap-shape                    40
cap-surface              671023
cap-color                    12
does-bruise-or-bleed          8
gill-attachment          523936
gill-spacing            1258435
gill-color                   57
stem-height                   0
stem-width                    0
stem-surface            1980861
stem-color                   38
has-ring                     24
ring-type                128880
habitat                      45
season                        0
dtype: int64
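Raw null counts are hard to compare without the row count, so a share of missing values per column can make the "too many nulls" decision more explicit. The following is a minimal sketch on toy data; the helper name `columns_above_null_fraction` and the 0.5 threshold are assumptions for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd


def columns_above_null_fraction(frame: pd.DataFrame, threshold: float) -> list:
    """Return the columns whose share of missing values exceeds `threshold`."""
    # isna() gives booleans, so the column mean is the fraction of missing rows
    null_fraction = frame.isna().mean()
    return null_fraction[null_fraction > threshold].index.tolist()


# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "stem-root": [np.nan, np.nan, np.nan, "b"],
    "cap-shape": ["f", "x", np.nan, "x"],
    "season": ["a", "w", "w", "u"],
})

# Hypothetical threshold: flag columns that are more than half empty
mostly_missing = columns_above_null_fraction(toy, threshold=0.5)
print(mostly_missing)
```

On the real data the same call would be `columns_above_null_fraction(df, threshold)` with a threshold chosen by inspection.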

In [65]:
# Set invalid categorical values to NA for each column
valid_values = {
  'cap-shape': list(string.ascii_lowercase),
  'cap-surface': list(string.ascii_lowercase), 
  'cap-color': list(string.ascii_lowercase), 
  'does-bruise-or-bleed': ["f", "t"],
  'gill-attachment': list(string.ascii_lowercase),
  'gill-spacing': ["c", "d", "e", "f"],
  'gill-color': list(string.ascii_lowercase),
  'stem-surface': list(string.ascii_lowercase),
  'stem-color': list(string.ascii_lowercase),
  'has-ring': ["f", "t"],
  'ring-type': list(string.ascii_lowercase),
  'habitat': list(string.ascii_lowercase),
}

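for col, allowed in valid_values.items():
  # Keep values in the allowed list; everything else becomes NaN.
  # Apply the same rule to the test set so both frames stay consistent.
  df[col] = df[col].where(df[col].isin(allowed))
  df_test[col] = df_test[col].where(df_test[col].isin(allowed))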

array(['f', 'x', 'p', 'b', 'o', 'c', 's', 'd', 'e', 'n', nan, 'w', 'k',
       'l', 't', 'g', 'z', 'a', 'r', 'u', 'y', 'i', 'm', 'h'],
      dtype=object)

* https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0
* https://contrib.scikit-learn.org/category_encoders/catboost.html
* https://xgboost.readthedocs.io/en/stable/get_started.html

# Baseline XGBoost

In [20]:
import xgboost as xgb
from xgboost import XGBClassifier

In [5]:
df.dtypes

class                    object
cap-diameter            float64
cap-shape                object
cap-surface              object
cap-color                object
does-bruise-or-bleed     object
gill-attachment          object
gill-spacing             object
gill-color               object
stem-height             float64
stem-width              float64
stem-surface             object
stem-color               object
has-ring                 object
ring-type                object
habitat                  object
season                   object
dtype: object

In [9]:
# Split variables
y = df['class']
X = df.drop('class', axis=1)

mapper = {
  'e': 0,
  'p': 1
}
y = [mapper[i] for i in y]


In [13]:
# Setting categorical variables
for t, col in zip(X.dtypes, X.columns):
  if t == 'object':
    X[col] = X[col].astype("category")
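One thing to watch with `astype("category")`: pandas assigns integer codes from the values present in each frame, so converting train and test independently can give the same letter different codes. Depending on the XGBoost version, the model may rely on those codes. A minimal sketch of the mismatch and of reusing the training categories (toy data, column values made up):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"habitat": ["d", "g", "l", "d"]})
test = pd.DataFrame({"habitat": ["l", "d", "l", "m"]})

# Independent conversion: test categories are ['d', 'l', 'm'],
# so 'l' gets code 1 here while it has code 2 in the training frame.
independent_codes = test["habitat"].astype("category").cat.codes.tolist()

# Shared dtype built from the training categories ['d', 'g', 'l']
train_categories = train["habitat"].astype("category").cat.categories
shared = CategoricalDtype(categories=train_categories)

# Values unseen in training ('m') become NaN, which pandas encodes as -1
aligned_codes = test["habitat"].astype(shared).cat.codes.tolist()

print(independent_codes)
print(aligned_codes)
```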

In [17]:
# Define XGBoost model
clf = XGBClassifier(
  tree_method='hist',
  enable_categorical=True,
  device='cuda'
)
clf.fit(X, y)
clf.save_model("models/baseline_xgb.json")

In [25]:
# Predictions
# 'id' was already dropped from df_test during general cleaning; reuse the saved ids
ids = df_test_id

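# Setting columns: reuse the training dtypes so the category codes match the model's
for col in X.columns:
  if isinstance(X[col].dtype, pd.CategoricalDtype):
    # Values not seen in training become NaN
    df_test[col] = df_test[col].astype(X[col].dtype)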

# predict() takes no device argument; the device is set on the classifier
preds = clf.predict(df_test)

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




In [32]:
# Creating output
reverse_mapper = {v: k for k, v in mapper.items()}
result = [reverse_mapper[i] for i in preds]

final = pd.DataFrame({
  'id': ids,
  'class': result
})

final.to_csv('results/baseline_xgb.csv', index=False)

Score on Kaggle: 0.17899
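To read the score, it helps to know the metric's range. The notebook does not name the metric; assuming the competition uses the Matthews correlation coefficient (MCC), values run from -1 to 1, with 0 being no better than chance. A minimal NumPy sketch of MCC for binary labels (the function name is hypothetical):

```python
import numpy as np


def matthews_corrcoef_binary(y_true, y_pred) -> float:
    """MCC for 0/1 labels, computed from the confusion-matrix counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Plain Python ints avoid overflow in the product below
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denominator = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef_binary([1, 0, 1, 0], [1, 0, 1, 0])
inverted = matthews_corrcoef_binary([1, 0, 1, 0], [0, 1, 0, 1])
chance = matthews_corrcoef_binary([1, 1, 0, 0], [1, 0, 1, 0])
print(perfect, inverted, chance)
```

Computing this on a held-out slice of the training data would give a local estimate to compare against the leaderboard score.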