## Greetings Everyone,


##### I found a very useful notebook [[Mushrooms] Single LightGBM model [~20 minutes]](https://www.kaggle.com/code/carlmcbrideellis/mushrooms-single-lightgbm-model-20-minutes) 

#### Published by [@Carl McBride Ellis](https://www.kaggle.com/carlmcbrideellis)

> Everything in the notebook was clear, so I decided to use it as my second baseline.
> I kept the hyperparameters (HPs) unchanged, but my approach was to ensemble LightGBM models trained with different random states. I also made a few other changes to the code.



> I also experimented with XGBoost and CatBoost.

> **I did not tune the hyperparameters of XGBoost, CatBoost, or LightGBM.**



In [1]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import matthews_corrcoef

import sklearn
sklearn.set_config(transform_output="pandas")


In [2]:
train_competition = pd.read_csv("/kaggle/input/playground-series-s4e8/train.csv", index_col="id")
train_original    = pd.read_csv("/kaggle/input/secondary-mushroom-dataset-data-set/MushroomDataset/secondary_data.csv", sep=";")
train = pd.concat([train_competition, train_original], ignore_index=True)
cols = train.columns.to_list()
cols.remove("class")
train = train.drop_duplicates(subset=cols, keep='first')
X_test = pd.read_csv("/kaggle/input/playground-series-s4e8/test.csv", index_col="id")
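
The de-duplication step above compares rows on the feature columns only, so when two rows share the same features but carry different labels, the first one wins. Here is a minimal sketch of that behaviour on made-up data; the two toy frames and their values are assumptions for illustration, not rows from the real datasets.

```python
import pandas as pd

# Toy stand-ins for the competition and original data (values are made up).
competition = pd.DataFrame(
    {"class": ["e", "p"], "cap-shape": ["x", "f"], "season": ["a", "w"]}
)
original = pd.DataFrame(
    {"class": ["p", "e"], "cap-shape": ["x", "b"], "season": ["a", "s"]}
)

combined = pd.concat([competition, original], ignore_index=True)

# Compare rows on the feature columns only, ignoring the label.
feature_cols = [c for c in combined.columns if c != "class"]
deduped = combined.drop_duplicates(subset=feature_cols, keep="first")

# Row 2 has the same features as row 0 but a different label; it is dropped.
print(deduped)
```

Because `keep='first'` is used and the competition rows come first in the concatenation, the competition label is the one that survives a conflict.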

In [3]:
def cleaning(df):
    
    threshold = 100  # categories seen fewer than this many times are relabelled "noise"
    
    cat_feats = ["cap-shape","cap-surface","cap-color","does-bruise-or-bleed","gill-attachment",
                "gill-spacing","gill-color","stem-root","stem-surface","stem-color","veil-type",
                "veil-color","has-ring","ring-type","spore-print-color",
                "habitat","season"]
    
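    for feat in cat_feats:
        # Treat missing values as their own category
        df[feat] = df[feat].fillna('missing')
        # Look up each row's category frequency and relabel rare categories as "noise"
        df.loc[df[feat].value_counts(dropna=False)[df[feat]].values < threshold, feat] = "noise"
        df[feat] = df[feat].astype('category')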
    
    return df

train  = cleaning(train)
X_test = cleaning(X_test)
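
The rare-category line in `cleaning` is dense, so here is a small sketch of the same idea on a toy column. The helper name `replace_rare` and the toy values are hypothetical, and it uses `Series.map` rather than the indexing trick in the cell above, which I believe gives the same result.

```python
import pandas as pd

def replace_rare(series: pd.Series, threshold: int) -> pd.Series:
    """Fill missing values, then relabel categories seen fewer than `threshold` times."""
    filled = series.fillna("missing")
    # For every row, look up how often that row's category occurs in the column.
    counts = filled.map(filled.value_counts())
    # Keep frequent categories, replace the rest with "noise".
    return filled.where(counts >= threshold, "noise").astype("category")

# Made-up column: "x" appears 4 times, "f" twice, "z" once, plus one missing value.
toy = pd.Series(["x", "x", "x", "f", None, "x", "f", "z"])
cleaned = replace_rare(toy, threshold=2)
print(cleaned.tolist())
```

Note that the notebook calls `cleaning` separately on the train and test frames, so the counts (and therefore which categories become "noise") are computed per frame.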

In [4]:
X = train.drop(columns=["class"])
# Encode the target: edible ('e') -> 0, poisonous ('p') -> 1
y = train["class"].map({'e': 0, 'p': 1})
# Inverse mapping, used later for the submission: {0: 'e', 1: 'p'}

In [5]:
y.head()

0    0
1    1
2    0
3    0
4    0
Name: class, dtype: int64

In [6]:
# # One-hot encode categorical features
# X = pd.get_dummies(X)
# X_test = pd.get_dummies(X_test)

# # Align the columns of the test set with the train set
# X_test = X_test.reindex(columns=X.columns, fill_value=0)

In [7]:
# Define model parameters
lgb_params = {
    'n_estimators': 2500,
    'max_bin': 256,
    'colsample_bytree': 0.6,
    'reg_lambda': 80,
    'verbose': -1,
    'device': 'gpu',
    'n_jobs': -1
    
}

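# Note: X holds pandas 'category' columns; XGBoost only accepts them
# when enable_categorical=True is passed (not set here, these models are not trained below).
xgb_params = {
    'n_estimators': 2500,
    'max_bin': 1024,
    'colsample_bytree': 0.6,
    'reg_lambda': 80,
    'verbosity': 0,
    'use_label_encoder': False,  # deprecated in recent XGBoost releases
    'n_jobs': -1
}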

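# Note: CatBoost generally expects categorical columns to be named via cat_features
# (not set here, these models are not trained below).
catb_params = {
    'iterations': 2500,
    'depth': 10,
    'l2_leaf_reg': 80,
    'verbose': 0
}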

In [8]:
# Create models with different random states
xgb_models = [(f"xgb_{i}", XGBClassifier(**xgb_params, random_state=i)) for i in range(9)]
lgb_models = [(f"lgb_{i}", LGBMClassifier(**lgb_params, random_state=i)) for i in range(12)]
catb_models = [(f"cat_{i}", CatBoostClassifier(**catb_params, random_state=i)) for i in range(5)]


In [9]:
%%time
# Models to train in this run: only the LightGBM models
all_models = lgb_models

# Train each model and predict probabilities
test_pred_probas = []

for name, model in all_models:
    model.fit(X, y)
    # MCC on the training data itself: an optimistic estimate, not a validation score
    train_preds = model.predict(X)
    mcc = matthews_corrcoef(y, train_preds)
    print(f'Training Data -> Model: {name}, MCC: {round(mcc, 5)}')
    print()
    test_pred_probas.append(model.predict_proba(X_test)[:, 1])
    
print()    
print('Finally Done')



Training Data -> Model: lgb_0, MCC: 0.98589

Training Data -> Model: lgb_1, MCC: 0.98583

Training Data -> Model: lgb_2, MCC: 0.98591

Training Data -> Model: lgb_3, MCC: 0.98589

Training Data -> Model: lgb_4, MCC: 0.98589

Training Data -> Model: lgb_5, MCC: 0.98593

Training Data -> Model: lgb_6, MCC: 0.98592

Training Data -> Model: lgb_7, MCC: 0.9859

Training Data -> Model: lgb_8, MCC: 0.98593

Training Data -> Model: lgb_9, MCC: 0.98592

Training Data -> Model: lgb_10, MCC: 0.9859

Training Data -> Model: lgb_11, MCC: 0.98589


Finally Done
CPU times: user 8h 43min 44s, sys: 1min 23s, total: 8h 45min 7s
Wall time: 2h 14min 28s
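
The MCC values printed above are measured on the same rows the models were fitted on, so they likely overstate how the ensemble will do on unseen data. Below is a rough sketch of how a hold-out split could be used to score both the individual seeds and their average. Everything in it is an assumption for illustration: the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for LightGBM so the example runs without a GPU.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the mushroom features (assumption).
X_toy, y_toy = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

val_probas = []
for seed in range(3):
    # Row and column subsampling make the seed matter, as colsample_bytree does above.
    model = GradientBoostingClassifier(
        n_estimators=50, subsample=0.8, max_features=0.6, random_state=seed
    )
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_val)[:, 1]
    val_probas.append(proba)
    seed_mcc = matthews_corrcoef(y_val, (proba > 0.5).astype(int))
    print(f"seed {seed}: hold-out MCC = {seed_mcc:.5f}")

# Score the seed-averaged probabilities on the same hold-out rows.
ensemble_pred = (np.mean(val_probas, axis=0) > 0.5).astype(int)
ensemble_mcc = matthews_corrcoef(y_val, ensemble_pred)
print(f"ensemble: hold-out MCC = {ensemble_mcc:.5f}")
```

With the real data, the same pattern would apply to `X` and `y`, at the cost of training on fewer rows or training twice.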


In [10]:
# Calculate the mean of predictions
mean_test_pred_probas = np.mean(test_pred_probas, axis=0)

# Apply threshold
threshold = 0.5
test_predictions = mean_test_pred_probas > threshold

# Prepare the submission
submission = pd.read_csv("/kaggle/input/playground-series-s4e8/sample_submission.csv")
submission["class"] = test_predictions.astype(int)
submission['class'] = submission['class'].map({0: 'e', 1: 'p'})
submission.to_csv('submission.csv', index=False)
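
To make the averaging step concrete, here is a tiny worked example with made-up probabilities from three models for four test rows. The numbers are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up positive-class ('p') probabilities from three models for four test rows.
probas = [
    np.array([0.9, 0.4, 0.6, 0.1]),
    np.array([0.8, 0.6, 0.4, 0.2]),
    np.array([0.7, 0.2, 0.8, 0.3]),
]

# Average across models (axis=0), giving one probability per test row.
mean_proba = np.mean(probas, axis=0)

# Threshold at 0.5 and map the 0/1 result back to the competition labels.
labels = pd.Series((mean_proba > 0.5).astype(int)).map({0: "e", 1: "p"})
print(mean_proba)
print(labels.tolist())
```

The second model on its own would label row 1 as 'p' and row 2 as 'e'; averaging the probabilities lets the other two models outvote it on both rows.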

In [11]:
submission.head()

        id class
0  3116945     e
1  3116946     p
2  3116947     p
3  3116948     p
4  3116949     e


## If you have any other suggestions, please feel free to share them in the comments section. I would love to improve.