# Binary Prediction of Poisonous Mushrooms - Modeling

[Competition Link](https://www.kaggle.com/competitions/playground-series-s4e8/data)

The goal of the competition is to predict whether a mushroom is poisonous, based on its physical characteristics.

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 29/08/2024   | Martin | Create   | Notebook created. Feature engineering and XGBoost | 
| 17/09/2024   | Martin | Update   | Feature engineering exploration | 


# Contents

* [Feature Engineering](#feature-engineering)
* [Full Data Transformation Pipeline](#full-data-transformation-pipeline)
* [Baseline XGBoost](#baseline-xgboost)

# Feature Engineering

In [1]:
import os
os.chdir("/tmp/poison_mushrooms")

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import useful_functions as uf

import string

In [3]:
df = pd.read_csv("./data/train.csv")
df_test = pd.read_csv("./data/test.csv")

## General cleaning

In [16]:
df.head()

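|   | id | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | e | 8.8 | f | s | u | f | a | c | w | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | d | a |
| 1 | 1 | p | 4.51 | x | h | o | f | a | c | n | ... | NaN | y | o | NaN | NaN | t | z | NaN | d | w |
| 2 | 2 | e | 6.94 | f | s | b | f | x | c | w | ... | NaN | s | n | NaN | NaN | f | f | NaN | l | w |
| 3 | 3 | e | 3.88 | f | y | g | f | s | NaN | g | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | d | u |
| 4 | 4 | e | 5.85 | x | l | w | f | d | NaN | w | ... | NaN | NaN | w | NaN | NaN | f | f | NaN | g | a |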


In [17]:
df_test.head()

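|   | id | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-height | ... | stem-root | stem-surface | stem-color | veil-type | veil-color | has-ring | ring-type | spore-print-color | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 3116945 | 8.64 | x | NaN | n | t | NaN | NaN | w | 11.13 | ... | b | NaN | w | u | w | t | g | NaN | d | a |
| 1 | 3116946 | 6.9 | o | t | o | f | NaN | c | y | 1.27 | ... | NaN | NaN | n | NaN | NaN | f | f | NaN | d | a |
| 2 | 3116947 | 2.0 | b | g | n | f | NaN | c | n | 6.18 | ... | NaN | NaN | n | NaN | NaN | f | f | NaN | d | s |
| 3 | 3116948 | 3.47 | x | t | n | f | s | c | n | 4.98 | ... | NaN | NaN | w | NaN | n | t | z | NaN | d | u |
| 4 | 3116949 | 6.17 | x | h | y | f | p | NaN | y | 6.73 | ... | NaN | NaN | y | NaN | y | t | NaN | NaN | d | u |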


In [18]:
# Remove the id column and the columns with too many null values
columns_to_remove = [
  "id",
  "stem-root",
  "veil-type",
  "veil-color",
  "spore-print-color"
]
df = df.drop(columns_to_remove, axis=1)

df_test_id = df_test['id']
df_test = df_test.drop(columns_to_remove, axis=1)

In [7]:
# Check which columns contain NaN values and how many
df.isna().sum()

class                         0
cap-diameter                  4
cap-shape                    40
cap-surface              671023
cap-color                    12
does-bruise-or-bleed          8
gill-attachment          523936
gill-spacing            1258435
gill-color                   57
stem-height                   0
stem-width                    0
stem-surface            1980861
stem-color                   38
has-ring                     24
ring-type                128880
habitat                      45
season                        0
dtype: int64
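The raw counts above are easier to compare as fractions of the row count. The following is a minimal sketch of how the sparse columns could be picked out programmatically; the helper names, the toy frame and the 0.5 threshold are illustrative assumptions, not values used in this notebook.

```python
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Share of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


def columns_above(frame: pd.DataFrame, threshold: float) -> list:
    """Names of the columns whose missing share exceeds the threshold."""
    frac = missing_fraction(frame)
    return frac[frac > threshold].index.tolist()


# Toy frame standing in for the training data (hypothetical values)
toy = pd.DataFrame({
    "veil-type": [None, None, None, "u"],
    "cap-shape": ["x", "f", None, "x"],
    "season": ["a", "w", "u", "s"],
})

# 0.5 is an arbitrary cut-off chosen for illustration only
sparse_cols = columns_above(toy, threshold=0.5)
print(missing_fraction(toy))
print(sparse_cols)
```

On the real data, `columns_above(df, threshold)` would return candidates for `columns_to_remove`; the right threshold remains a judgment call.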

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer, SimpleImputer

import keras
from keras import Sequential
from keras.layers import Embedding, Dense, Flatten


Fill missing categorical values with the most frequent category of each column.

In [None]:
le = LabelEncoder()
mapper = {}

# Set invalid categorical values to NA for each column
valid_values = {
  'cap-shape': list(string.ascii_lowercase),
  'cap-surface': list(string.ascii_lowercase), 
  'cap-color': list(string.ascii_lowercase), 
  'does-bruise-or-bleed': ["f", "t"],
  'gill-attachment': list(string.ascii_lowercase),
  'gill-spacing': ["c", "d", "e", "f"],
  'gill-color': list(string.ascii_lowercase),
  'stem-surface': list(string.ascii_lowercase),
  'stem-color': list(string.ascii_lowercase),
  'has-ring': ["f", "t"],
  'ring-type': list(string.ascii_lowercase),
  'habitat': list(string.ascii_lowercase),
  'season': ['a', 'w', 'u', 's']
}

for col, l in valid_values.items():
  # Replace all invalid characters with NA
  df[col] = df[col].apply(lambda x: np.nan if x not in l else x)

  # Add column and entry to mapper, map non-NA values
  col_subset = df.loc[df[col].notna(), col]
  unique_values = col_subset.unique()
  mapper[col] = {unique_values[i]: i for i in range(len(unique_values))}
  col_subset = col_subset.apply(lambda x: mapper[col][x])
  df.loc[df[col].notna(), col] = col_subset

# Convert remaining class into label
df['class'] = le.fit_transform(df['class'])

# Use most-frequent to fill missing values
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
cat_filled = imputer.fit_transform(df[valid_values.keys()])
cat_filled = pd.DataFrame(cat_filled, columns=valid_values.keys())

# Join back to main dataframe
df = pd.concat([df[['class', 'cap-diameter', 'stem-width', 'stem-height']], cat_filled], axis=1)
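The same clean, encode and fill steps can be tried on a single toy column to see what they produce. This is a pandas-only sketch that should behave like the loop above under the assumption that codes are assigned in order of first appearance; `encode_and_fill` and the sample values are hypothetical, and `pd.factorize` stands in for the hand-built mapper.

```python
import string

import pandas as pd


def encode_and_fill(series: pd.Series, valid: list):
    """Set invalid entries to NaN, integer-encode, then fill gaps with the mode."""
    cleaned = series.where(series.isin(valid))       # invalid or missing -> NaN
    codes, uniques = pd.factorize(cleaned)           # -1 marks missing values
    codes = pd.Series(codes, index=series.index)
    most_frequent = codes[codes >= 0].mode().iloc[0]
    filled = codes.where(codes >= 0, most_frequent)  # replace -1 with the mode
    mapping = {value: code for code, value in enumerate(uniques)}
    return filled, mapping


# Toy column with one invalid entry ("12.5") and one missing value
raw = pd.Series(["f", "x", "f", "12.5", None, "f"], name="cap-shape")
filled, mapping = encode_and_fill(raw, list(string.ascii_lowercase))
print(filled.tolist())
print(mapping)
```

Returning the mapping alongside the codes makes it possible to apply the same encoding to another frame later.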

Converting the categorical variables to embeddings
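Before the training loop, it may help to see the lookup in isolation: each integer category code selects one row of an embedding matrix, and the width of that matrix follows the `min(50, round((input_dim + 1) / 2))` rule used in this notebook. The sketch below uses random numbers as a stand-in for trained weights, so the values themselves are meaningless; `embedding_size` and the toy codes are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def embedding_size(n_categories: int, cap: int = 50) -> int:
    """Width heuristic: about half the number of categories, capped."""
    return min(cap, round((n_categories + 1) / 2))


rng = np.random.default_rng(0)

# Toy integer-encoded column; codes are assumed to run from 0 to n - 1
codes = pd.Series([0, 2, 1, 2], name="cap-shape")
n_categories = codes.nunique()

# Random stand-in for the weights of a trained Embedding layer
weights = rng.normal(size=(n_categories, embedding_size(n_categories)))

# Row i of the weight matrix is the vector for category code i
lookup = pd.DataFrame(
    weights[codes.to_numpy()],
    columns=[f"cap-shape-{i}" for i in range(weights.shape[1])],
    index=codes.index,
)
print(lookup)
```

Direct indexing gives the same result as a left merge on the integer code, provided every code has a row in the matrix.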

In [20]:
df

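|   | class | cap-diameter | stem-width | stem-height | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-surface | stem-color | has-ring | ring-type | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 8.80 | 15.39 | 4.51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 4.51 | 6.48 | 4.79 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 6.94 | 9.93 | 6.85 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 3.88 | 6.53 | 4.16 | 0 | 2 | 3 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 2 |
| 4 | 0 | 5.85 | 8.36 | 3.37 | 1 | 3 | 4 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3116940 | 0 | 9.29 | 18.81 | 12.14 | 0 | 4 | 5 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 5 | 0 | 2 |
| 3116941 | 0 | 10.88 | 26.97 | 6.65 | 6 | 4 | 4 | 1 | 3 | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 2 |
| 3116942 | 1 | 7.82 | 11.06 | 9.51 | 1 | 5 | 6 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 0 | 0 |
| 3116943 | 0 | 9.45 | 17.77 | 9.13 | 2 | 8 | 5 | 1 | 4 | 0 | 6 | 0 | 0 | 1 | 3 | 0 | 2 |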


In [11]:
df1 = df.copy()

In [12]:
df1.head()

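|   | class | cap-diameter | stem-width | stem-height | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-surface | stem-color | has-ring | ring-type | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 8.8 | 15.39 | 4.51 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 4.51 | 6.48 | 4.79 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 6.94 | 9.93 | 6.85 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 3.88 | 6.53 | 4.16 | 0 | 2 | 3 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 2 |
| 4 | 0 | 5.85 | 8.36 | 3.37 | 1 | 3 | 4 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 |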


In [13]:
EPOCHS = 10
BATCH_SIZE = 10000

for col in valid_values.keys():
  # Dynamically set the input dimensions and embedding size
  input_dim = len(df1[col].unique())
  embedding_size = min(50, round((input_dim + 1) / 2))

  # Define the model
  model = Sequential()
  model.add(Embedding(input_dim=input_dim, output_dim=embedding_size, input_length=1, name='embedding'))
  model.add(Flatten())
  model.add(Dense(50, activation='relu'))
  model.add(Dense(15, activation='relu'))
  model.add(Dense(1))
  model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
  model.fit(
    x=df1[col].astype(np.float32),
    y=df1['class'],
    epochs=EPOCHS,
    batch_size=BATCH_SIZE
  )

  # Map the embedding columns back to the data
  mapper_df = pd.DataFrame(model.layers[0].get_weights()[0])
  mapper_df.columns = [f"{col}-{i}" for i in range(len(mapper_df.columns))]
  mapper_df['mapper'] = range(len(df1[col].unique()))

  df1 = df1.merge(
    mapper_df,
    how='left',
    left_on=col,
    right_on='mapper'
  )
  df1 = df1.drop(['mapper', col], axis=1)

  # Save mapper for embeddings in test set
  mapper_df.to_csv(f'embedding_maps/{col}-map.csv', index=False)

Epoch 1/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 4s 5ms/step - accuracy: 0.5554 - loss: 0.2808
Epoch 2/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 2s 869us/step - accuracy: 0.5844 - loss: 0.2423
Epoch 3/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 896us/step - accuracy: 0.5847 - loss: 0.2423
Epoch 4/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 960us/step - accuracy: 0.5849 - loss: 0.2423
Epoch 5/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 970us/step - accuracy: 0.5846 - loss: 0.2423
Epoch 6/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 982us/step - accuracy: 0.5849 - loss: 0.2423
Epoch 7/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5846 - loss: 0.2423
Epoch 8/10
312/312 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.5851 - loss: 0.2422
Epoch 9/10
...

In [15]:
# Tried KNNImputer, but it took too long to run
# imputer = KNNImputer()
# # perform imputation on categorical variables
# imputer.fit_transform(df1[['cap-shape', 'cap-color', 'cap-surface']])


In [None]:
# For continuous variables we use mean imputation
# cap-diameter, stem-height and stem-width are numerical values
continuous_cols = ['cap-diameter', 'stem-height', 'stem-width']
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
continuous_vals = pd.DataFrame(imputer.fit_transform(df[continuous_cols]), columns=continuous_cols)
df = df.drop(continuous_cols, axis=1)
df = pd.concat([df, continuous_vals], axis=1)
df.head()
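When the same imputation is later applied to the test set, one common option is to fit the imputer on the training data only and reuse its learned means, so that both sets are filled with identical values. The sketch below illustrates this on toy frames; the frames and their numbers are made up, and this is a suggestion rather than what the notebook currently does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

continuous_cols = ["cap-diameter", "stem-height", "stem-width"]

# Toy stand-ins for the train and test frames (hypothetical values)
train = pd.DataFrame({
    "cap-diameter": [2.0, np.nan, 4.0],
    "stem-height": [1.0, 3.0, np.nan],
    "stem-width": [5.0, 5.0, 8.0],
})
test = pd.DataFrame({
    "cap-diameter": [np.nan],
    "stem-height": [np.nan],
    "stem-width": [np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn the column means from the training data only
train_filled = pd.DataFrame(
    imputer.fit_transform(train[continuous_cols]),
    columns=continuous_cols,
    index=train.index,
)

# Reuse the training means on the test data (transform, not fit_transform)
test_filled = pd.DataFrame(
    imputer.transform(test[continuous_cols]),
    columns=continuous_cols,
    index=test.index,
)
print(test_filled)
```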

# Full Data Transformation Pipeline

In [5]:
# Categorical variable map
categorical_map = {
  'cap-shape': list(string.ascii_lowercase),
  'cap-surface': list(string.ascii_lowercase), 
  'cap-color': list(string.ascii_lowercase), 
  'does-bruise-or-bleed': ["f", "t"],
  'gill-attachment': list(string.ascii_lowercase),
  'gill-spacing': ["c", "d", "e", "f"],
  'gill-color': list(string.ascii_lowercase),
  'stem-surface': list(string.ascii_lowercase),
  'stem-color': list(string.ascii_lowercase),
  'has-ring': ["f", "t"],
  'ring-type': list(string.ascii_lowercase),
  'habitat': list(string.ascii_lowercase),
  'season': ['a', 'w', 'u', 's']
}

# Continuous variable columns
continuous_cols = ['cap-diameter', 'stem-height', 'stem-width']

def transformation_pipeline(df, categorical_map, continuous_cols, is_train=True):
  le = LabelEncoder()
  mapper = {}

  # Remove the id column and the columns with too many null values
  columns_to_remove = [
    "id",
    "stem-root",
    "veil-type",
    "veil-color",
    "spore-print-color"
  ]
  df = df.drop(columns_to_remove, axis=1)

  for col, l in categorical_map.items():
    # Replace all invalid characters with NA
    df[col] = df[col].apply(lambda x: np.nan if x not in l else x)

    # Add column and entry to mapper, map non-NA values
    col_subset = df.loc[df[col].notna(), col]
    unique_values = col_subset.unique()
    mapper[col] = {unique_values[i]: i for i in range(len(unique_values))}
    col_subset = col_subset.apply(lambda x: mapper[col][x])
    df.loc[df[col].notna(), col] = col_subset

  # Use most-frequent to fill missing values
  imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
  cat_filled = imputer.fit_transform(df[categorical_map.keys()])
  cat_filled = pd.DataFrame(cat_filled, columns=categorical_map.keys())

  # Mean Imputer for continuous variables
  imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
  continuous_vals = pd.DataFrame(imputer.fit_transform(df[continuous_cols]), columns=continuous_cols)
  df = df.drop(continuous_cols, axis=1)
  df = pd.concat([df, continuous_vals], axis=1)

  # Convert to trained embeddings
  dataframes = []
  for col, _ in categorical_map.items():
    embedding = pd.read_csv(f"./embedding_maps/{col}-map.csv")
    # Use the imputed values so that missing categories still map to an embedding row
    to_merge = cat_filled[[col]]
    to_merge = to_merge.merge(
      embedding,
      how='left',
      left_on=col,
      right_on='mapper'
    )
    to_merge = to_merge.drop([col, 'mapper'], axis=1)
    dataframes.append(to_merge)

  if is_train:
    # Convert remaining class into label
    df['class'] = le.fit_transform(df['class'])

    # Join back to main dataframe
    df = pd.concat([df[['class', 'cap-diameter', 'stem-width', 'stem-height']], cat_filled], axis=1)
  else:
    df = pd.concat([df[['cap-diameter', 'stem-width', 'stem-height']], cat_filled], axis=1)

  return df, dataframes
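One thing to watch in the pipeline above: the integer codes are assigned by order of first appearance, so running it on the test set could give the same letter a different code than it had during training, and the saved embedding rows would then no longer line up. A possible remedy, sketched below under that assumption, is to fit the mapping once on the training column, persist it, and only apply it to new data. The helper names and toy columns are hypothetical.

```python
import json

import pandas as pd


def fit_category_map(series: pd.Series) -> dict:
    """Assign codes in order of first appearance, ignoring missing values."""
    values = series.dropna().unique()
    return {str(value): code for code, value in enumerate(values)}


def apply_category_map(series: pd.Series, mapping: dict) -> pd.Series:
    """Encode with an existing mapping; unseen or missing values become NaN."""
    return series.map(mapping)


# Toy columns standing in for one categorical feature (hypothetical values)
train_col = pd.Series(["f", "x", "f", "s"])
test_col = pd.Series(["x", "s", "q", None])

mapping = fit_category_map(train_col)

# Round-trip through JSON, as one would when saving the mapping to disk
restored = json.loads(json.dumps(mapping))

test_codes = apply_category_map(test_col, restored)
print(mapping)
print(test_codes.tolist())
```

The NaN codes left for unseen categories can then go through the same most-frequent imputation as any other missing value.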

In [6]:
# # Removing IDs for test set later
# df_test_id = df_test['id']
# df_test = df_test.drop(columns_to_remove, axis=1)

df_train, dataframes = transformation_pipeline(df, categorical_map, continuous_cols, is_train=True)
# df_t, dataframes = transformation_pipeline(df_test, categorical_map, continuous_cols, is_train=False)

In [7]:
df_train = df_train.drop(categorical_map.keys(), axis=1)

In [8]:
categorical_data = pd.concat(dataframes, axis=1)


In [9]:
df_train = pd.concat([df_train, categorical_data], axis=1)
df_train.to_csv('./data/processed_train.csv', index=False)


In [29]:
df

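|   | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | gill-attachment | gill-spacing | gill-color | stem-surface | stem-color | has-ring | ring-type | habitat | season |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| 3 | 0 | 2 | 3 | 0 | 2 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 2 |
| 4 | 1 | 3 | 4 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3116940 | 0 | 4 | 5 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 5 | 0 | 2 |
| 3116941 | 6 | 4 | 4 | 1 | 3 | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 2 |
| 3116942 | 1 | 5 | 6 | 0 | 0 | 0 | 0 | 1 | 3 | 1 | 1 | 0 | 0 |
| 3116943 | 2 | 8 | 5 | 1 | 4 | 0 | 6 | 0 | 0 | 1 | 3 | 0 | 2 |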


* https://towardsdatascience.com/deep-embeddings-for-categorical-variables-cat2vec-b05c8ab63ac0
* https://contrib.scikit-learn.org/category_encoders/catboost.html
* https://xgboost.readthedocs.io/en/stable/get_started.html

# Baseline XGBoost

In [20]:
import xgboost as xgb
from xgboost import XGBClassifier

In [5]:
df.dtypes

class                    object
cap-diameter            float64
cap-shape                object
cap-surface              object
cap-color                object
does-bruise-or-bleed     object
gill-attachment          object
gill-spacing             object
gill-color               object
stem-height             float64
stem-width              float64
stem-surface             object
stem-color               object
has-ring                 object
ring-type                object
habitat                  object
season                   object
dtype: object

In [9]:
# Split variables
y = df['class']
X = df.drop('class', axis=1)

mapper = {
  'e': 0,
  'p': 1
}
y = [mapper[i] for i in y]


In [13]:
# Setting categorical variables
for t, col in zip(X.dtypes, X.columns):
  if t == 'object':
    X[col] = X[col].astype("category")
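A caveat with `astype("category")`: pandas derives the category list from whatever values are present in each frame, so calling it separately on train and test can give the same letter different integer codes. As far as I understand, XGBoost works from those codes, so a mismatch may hurt predictions. The sketch below records the dtypes from the training frame and reuses them; the helper names and toy frames are hypothetical.

```python
import pandas as pd


def fit_category_dtypes(frame: pd.DataFrame) -> dict:
    """Record one CategoricalDtype per non-numeric column of the training data."""
    return {
        col: pd.CategoricalDtype(categories=frame[col].dropna().unique())
        for col in frame.columns
        if not pd.api.types.is_numeric_dtype(frame[col])
    }


def apply_category_dtypes(frame: pd.DataFrame, dtypes: dict) -> pd.DataFrame:
    """Cast columns to the recorded dtypes; unseen values become NaN."""
    out = frame.copy()
    for col, dtype in dtypes.items():
        out[col] = out[col].astype(dtype)
    return out


# Toy frames standing in for X and df_test (hypothetical values)
train_toy = pd.DataFrame({"habitat": ["d", "g", "d", "l"], "stem-width": [1.0, 2.0, 3.0, 4.0]})
test_toy = pd.DataFrame({"habitat": ["l", "d", "z"], "stem-width": [5.0, 6.0, 7.0]})

dtypes = fit_category_dtypes(train_toy)
train_cat = apply_category_dtypes(train_toy, dtypes)
test_cat = apply_category_dtypes(test_toy, dtypes)

# Codes now agree between the two frames; -1 marks a value unseen in training
aligned_codes = test_cat["habitat"].cat.codes.tolist()

# For comparison: converting the test frame on its own yields different codes
naive_codes = test_toy["habitat"].astype("category").cat.codes.tolist()
print(aligned_codes, naive_codes)
```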

In [17]:
# Define XGBoost model
clf = XGBClassifier(
  tree_method='hist',
  enable_categorical=True,
  device='cuda'
)
clf.fit(X, y)
clf.save_model("models/baseline_xgb.json")

In [25]:
# Predictions
ids = df_test['id']
df_test = df_test.drop('id', axis=1)

# Setting columns
for t, col in zip(df_test.dtypes, df_test.columns):
  if t == 'object':
    df_test[col] = df_test[col].astype("category")

# The device is already set on the classifier; predict() does not accept a device argument
preds = clf.predict(df_test)




In [32]:
# Creating output
reverse_mapper = {v: k for k, v in mapper.items()}
result = [reverse_mapper[i] for i in preds]

final = pd.DataFrame({
  'id': ids,
  'class': result
})

final.to_csv('results/baseline_xgb.csv', index=False)

Score on Kaggle: 0.17899
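To sanity-check a score like this before submitting, the metric can be computed locally on held-out labels. The sketch below assumes the leaderboard uses the Matthews correlation coefficient (MCC), which is my understanding of this competition's metric and worth confirming on the competition page. The labels are toy values, and `mcc_from_counts` is a hypothetical helper included only to show the formula.

```python
import math

from sklearn.metrics import matthews_corrcoef


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC computed directly from the confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy labels: 0 = edible, 1 = poisonous (hypothetical values)
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

score = matthews_corrcoef(y_true, y_pred)

# The same value from the counts: 2 true positives, 2 true negatives,
# 1 false positive and 1 false negative
manual = mcc_from_counts(tp=2, tn=2, fp=1, fn=1)
print(score, manual)
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level predictions, so a local value far above the leaderboard score would point to a mismatch between how the train and test data were prepared.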