<a href="https://www.kaggle.com/code/jonathankao/dec-2023-tabular-ensemble-rf-xgboost-lgbm?scriptVersionId=157157836" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Task Overview 
Task: In this synthetic dataset, based on a real dataset funded by the Mayo Clinic, each example contains general and survival information about a patient who has liver cirrhosis, a condition involving prolonged liver damage. The goal is to train a machine learning model that predicts the patient's current survival status from the data features.

Approach v1.0: Train a Logistic Regression model with sklearn as a baseline model, then train a performance-focused model with XGBoost as a learning exercise.

Approach v2.0: After a few days of initial work, we realized it was feasible within the competition timeframe to improve on Approach v1.0 by training an ensemble of models. We therefore switched to training multiple Logistic Regression, Random Forest, and XGBoost models and combining them into an ensemble.

Approach v3.0: After manually tuning the hyperparameters of all the models, we realized that the current Logistic Regression and Random Forest models were too weak to include in the final Ensemble model. Instead we added an LGBM classifier and built the Ensemble model from the LGBM classifier and the XGBoost classifier.

Next Steps: Implement more robust automated hyperparameter tuning so that we can add additional strong-performing models to the Ensemble.

Version History
1. v1.0-1.3 - Implemented dataset loading and dataset pre-processing. Added explanations and comments for each step. 
2. v1.4 - Added Logistic Regression model training and Kaggle submission formatting
3. v2.0 - Added Random Forest and XGBoost model training and prediction.
4. v2.1 - Added log-loss calculations for each model on the validation set. 
5. v2.2 - Added initial hyperparameter tuning for Logistic Regression, Random Forest, and XGBoost models. 
6. v3.0 - Added LGBM Classifier and did hyperparameter tuning for the LGBM Classifier. Added explanations for model training / tuning
7. v3.1 - Combined the tuned LGBM Classifier and XGBoost classifier into an Ensemble model.


# Dataset Loading

In [1]:
import numpy as np
import pandas as pd

# Read in the initial competition training and test data as Pandas DataFrames 
train_file_path = "/kaggle/input/playground-series-s3e26/train.csv"
train_df = pd.read_csv(train_file_path) 
test_file_path = "/kaggle/input/playground-series-s3e26/test.csv"
test_df = pd.read_csv(test_file_path)

# Remove the id column since it is not useful for prediction and might confuse the model in training
test_id_df = test_df['id'].astype(int)
train_df = train_df.drop('id', axis=1)
test_df = test_df.drop('id', axis=1)

# Use the head method to visually see that the dataset has been loaded
train_df.head(5)

Unnamed: 0,N_Days,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage,Status
0,999,D-penicillamine,21532,M,N,N,N,N,2.3,316.0,3.35,172.0,1601.0,179.8,63.0,394.0,9.7,3.0,D
1,2574,Placebo,19237,F,N,N,N,N,0.9,364.0,3.54,63.0,1440.0,134.85,88.0,361.0,11.0,3.0,C
2,3428,Placebo,13727,F,N,Y,Y,Y,3.3,299.0,3.55,131.0,1029.0,119.35,50.0,199.0,11.7,4.0,D
3,2576,Placebo,18460,F,N,N,N,N,0.6,256.0,3.5,58.0,1653.0,71.3,96.0,269.0,10.7,3.0,C
4,788,Placebo,16658,F,N,Y,N,N,1.1,346.0,3.65,63.0,1181.0,125.55,96.0,298.0,10.6,4.0,C


In [2]:
# Use info to see the number of categorical/text columns (any column with Dtype=object is text and will need to be converted to a number since ML models work only with numbers) 
# Here you can also see if there are any missing values by looking at the number of non-null values in each column
train_df.info()

# Make a list of categorical columns for future use (in the feature scaling section)
categorical_cols = ['Drug', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema', 'Status']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7905 entries, 0 to 7904
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   N_Days         7905 non-null   int64  
 1   Drug           7905 non-null   object 
 2   Age            7905 non-null   int64  
 3   Sex            7905 non-null   object 
 4   Ascites        7905 non-null   object 
 5   Hepatomegaly   7905 non-null   object 
 6   Spiders        7905 non-null   object 
 7   Edema          7905 non-null   object 
 8   Bilirubin      7905 non-null   float64
 9   Cholesterol    7905 non-null   float64
 10  Albumin        7905 non-null   float64
 11  Copper         7905 non-null   float64
 12  Alk_Phos       7905 non-null   float64
 13  SGOT           7905 non-null   float64
 14  Tryglicerides  7905 non-null   float64
 15  Platelets      7905 non-null   float64
 16  Prothrombin    7905 non-null   float64
 17  Stage          7905 non-null   float64
 18  Status  

In [3]:
# Use the describe method to generate statistical information such as standard deviation, mean, and min/max. 
train_df.describe()

Unnamed: 0,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0,7905.0
mean,2030.173308,18373.14649,2.594485,350.561923,3.548323,83.902846,1816.74525,114.604602,115.340164,265.228969,10.629462,3.032511
std,1094.233744,3679.958739,3.81296,195.379344,0.346171,75.899266,1903.750657,48.790945,52.530402,87.465579,0.781735,0.866511
min,41.0,9598.0,0.3,120.0,1.96,4.0,289.0,26.35,33.0,62.0,9.0,1.0
25%,1230.0,15574.0,0.7,248.0,3.35,39.0,834.0,75.95,84.0,211.0,10.0,2.0
50%,1831.0,18713.0,1.1,298.0,3.58,63.0,1181.0,108.5,104.0,265.0,10.6,3.0
75%,2689.0,20684.0,3.0,390.0,3.77,102.0,1857.0,137.95,139.0,316.0,11.0,4.0
max,4795.0,28650.0,28.0,1775.0,4.64,588.0,13862.4,457.25,598.0,563.0,18.0,4.0


# Dataset Pre-Processing Task Overview

1. Target Variable Label Encoding - Make sure the training labels are numerical, otherwise we cannot train the model.
2. Missing Feature Value Handling - Make sure the dataset is not missing any values, since most models cannot be trained on data with missing values.
3. Feature Scaling - Make sure the features are similar in numerical value, otherwise some ML models will struggle to weigh them appropriately during training.
4. Categorical Attribute Handling (Should be done at the end since it may change the number of columns) - Make sure text values have been converted to numbers, otherwise ML algorithms cannot learn from text data.
5. Train/Test Split - Create a validation set for hyperparameter tuning and evaluation, otherwise the hyperparameters will be overfit to the training dataset and the model will not generalize well to the real world / the Kaggle private test set.
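
As a rough illustration of how steps 3 and 4 fit together, the sketch below chains scaling and one-hot encoding in a scikit-learn `Pipeline`. This is not how the notebook does it (the cells below apply each step by hand); the toy values and the names `toy_df`, `preprocess`, and `pipeline` are made up for illustration, and only the column names mirror the dataset. scikit-learn classifiers accept string labels, so the sketch skips label encoding.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical values; only the column names mirror the dataset.
toy_df = pd.DataFrame({
    "Bilirubin": [2.3, 0.9, 3.3, 0.6, 1.1, 4.0],
    "Albumin": [3.35, 3.54, 3.55, 3.50, 3.65, 3.10],
    "Sex": ["M", "F", "F", "F", "F", "M"],
    "Status": ["D", "C", "D", "C", "C", "D"],
})
X_toy = toy_df.drop("Status", axis=1)
y_toy = toy_df["Status"]

# Scale the numerical columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Bilirubin", "Albumin"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Chain pre-processing and the model so both are fit on the same rows.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_toy, y_toy)

# One probability column per class, one row per example.
toy_proba = pipeline.predict_proba(X_toy)
print(toy_proba.shape)
```

A pipeline like this refits the scaler and encoder inside each cross-validation fold, which is one reason to prefer it once tuning is automated.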

## Target Variable Label Encoding

Problem: The Label column that we are trying to predict, 'Status', is a text column, so we need to convert its values into integers before the machine learning model can process it (ML models do not support text input, so we must convert the text into a numerical representation). We have three possible categories: C, CL, and D, which correspond to the submission columns Status_C, Status_CL, and Status_D.

Possible Approaches:
1. Label Encoding - Convert each text category into an integer label. (Ex: 0, 1, 2)
2. Ordinal Encoding - Convert each text category into an integer label but with a particular order. Used when the categories have some quantitative order that can be taken advantage of like low, medium, high.   (Ex: low -> 0, medium-> 1, high -> 2)
3. One-Hot Encoding - Convert each text category into a separate column. For example, a neural network with a softmax output layer is typically trained against one-hot encoded labels.

Chosen Approach: Label Encoding - There is no obvious ordering in the 'Status' column and ML libraries typically expect a single Label column which rules out one-hot encoding. Therefore we will use label encoding.

In [4]:
# Check to see what the initial text categories are
train_df['Status'].value_counts()

Status
C     4965
D     2665
CL     275
Name: count, dtype: int64

In [5]:
from sklearn.preprocessing import LabelEncoder

# Use sklearn's LabelEncoder class to transform the Status column's values from strings to integers.
label_encoder = LabelEncoder() 
train_df['Status'] = label_encoder.fit_transform(train_df['Status'])

# Check that the label encoder transformation was applied correctly and the categories are now numbers
train_df['Status'].value_counts()

Status
0    4965
2    2665
1     275
Name: count, dtype: int64
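
The integer codes matter later because the columns returned by `predict_proba` follow the encoder's class order. A minimal sketch on toy labels (the names `toy_status`, `toy_encoder`, and `mapping` are illustrative and not part of the notebook) shows how to read the mapping back from `classes_` and how to reverse it:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels using the same three categories as the 'Status' column.
toy_status = pd.Series(["C", "D", "CL", "C", "D"])

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_status)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: int(code) for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original text labels from the codes.
decoded = toy_encoder.inverse_transform(encoded)
print(list(decoded))
```

Because `LabelEncoder` sorts the labels, C, CL, and D map to 0, 1, and 2, which matches the value counts printed above.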

## Missing Feature Value Handling

Problem: We need to check if there are any missing feature values, since most machine learning algorithms cannot handle an empty cell with no value. For example, a Logistic Regression model would throw an error in training, although some ML algorithms like XGBoost handle missing values natively by learning which branch of each split they should follow.

Solution: We check for missing feature values by using the isna DataFrame method which returns a boolean DataFrame where each cell is True if the value is missing and False otherwise. We then apply the sum method to find the number of missing values in each column. We find there are no missing values in this dataset so we can skip this step. 

In [6]:
# Check for missing feature values
print("Number of missing feature values by column: ")
print(train_df.isna().sum())

Number of missing feature values by column: 
N_Days           0
Drug             0
Age              0
Sex              0
Ascites          0
Hepatomegaly     0
Spiders          0
Edema            0
Bilirubin        0
Cholesterol      0
Albumin          0
Copper           0
Alk_Phos         0
SGOT             0
Tryglicerides    0
Platelets        0
Prothrombin      0
Stage            0
Status           0
dtype: int64


In [7]:
# Alternatively we can check the total number of missing feature values in the dataset this way (shown here for the test data).
print("Total number of missing feature values in test data:", test_df.isna().sum().sum())

Total number of missing feature values in test data: 0


## Feature Scaling

Problem: Since feature column values are typically combined (often by adding them together) to create the final classification, the ML model will perform better if the features are on the same scale. (Ex: The ML model would struggle to weigh values correctly if one column's values were in the billions and another column had values from 1-10)

Possible Feature Scaling Approaches: 
1. Min-Max Scaling (Default) - Scales the data to a fixed range between two values (typically 0 and 1). Most useful for neural networks.
2. Standardization (Default) - Scales the data so that the mean is 0 and the standard deviation is 1. Most useful for algorithms that are sensitive to the scale of their inputs, such as SVMs and regularized logistic regression.
3. Robust Scaling (Advanced) - Scaling based on median and IQR. Most useful for handling significant outliers.
4. MaxAbsScaler (Advanced) - Scales each feature based on its maximum absolute value. Useful for sparse data. 
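
To make the differences concrete, here is a small sketch that applies the four scalers to one toy column containing an outlier. The values and the names `toy_column`, `scalers`, and `scaled` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# One toy feature column with a large outlier (hypothetical values).
toy_column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scalers = {
    "min-max": MinMaxScaler(),
    "standard": StandardScaler(),
    "robust": RobustScaler(),
    "max-abs": MaxAbsScaler(),
}

# Fit each scaler on the column and keep the flattened result.
scaled = {
    name: scaler.fit_transform(toy_column).ravel()
    for name, scaler in scalers.items()
}

for name, values in scaled.items():
    print(f"{name:>8}: {np.round(values, 3)}")
```

With min-max, standard, and max-abs scaling the outlier squeezes the four small values close together, while robust scaling keeps them spread out because it centers on the median and divides by the IQR.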

Chosen Approach: Standardization - XGBoost's decision trees split on one column at a time, so values from different columns are never combined and scaling the features is not necessary for it. However, Logistic Regression, our baseline model, combines the features in a weighted sum during training, so it does need feature scaling. Since we are scaling specifically for the Logistic Regression model, we use Standardization, a commonly used approach that is effective for many datasets.

In [8]:
from sklearn.preprocessing import StandardScaler 

# Here we find the numerical columns that need to be scaled by removing the text columns from the list of total columns in the df.
numerical_cols = train_df.columns.difference(categorical_cols)

# Now we use the StandardScaler class from sklearn to transform our numerical columns to a standardized scale
std_scaler = StandardScaler() 
std_scaler.fit(train_df[numerical_cols])

train_df[numerical_cols] = std_scaler.transform(train_df[numerical_cols])
test_df[numerical_cols] = std_scaler.transform(test_df[numerical_cols])
                      
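# Confirm the transformation was successful by checking that mean is close to 0 and std is close to 1 for numerical columns
# (the test set is scaled with the training set's statistics, so the values are approximate rather than exact)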
test_df.describe()

Unnamed: 0,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0,5271.0
mean,0.00779,0.033864,0.001549,0.009851,-0.029617,0.010526,-0.002895,-0.020847,-0.001029,-0.013781,0.004353,0.005175
std,0.993309,0.973958,1.010406,1.025961,1.02524,1.021709,1.016664,1.003627,1.001441,1.001418,1.014105,0.987967
min,-1.817984,-2.384728,-0.601797,-1.180148,-4.588553,-1.052815,-0.802543,-1.808946,-1.567576,-2.323678,-2.08455,-2.345776
25%,-0.727654,-0.7183,-0.496885,-0.524971,-0.57294,-0.591648,-0.522026,-0.811772,-0.596648,-0.64291,-0.805263,-1.191649
50%,-0.135421,0.117632,-0.391973,-0.263923,0.062625,-0.249068,-0.354452,-0.156896,-0.215892,-0.071221,-0.037691,-0.037522
75%,0.604869,0.627996,0.106359,0.201867,0.640411,0.238452,0.011428,0.478508,0.431393,0.591939,0.474024,1.116605
max,2.526884,2.792831,6.663359,7.291089,3.15378,6.642081,6.327728,7.023169,9.188781,3.404652,5.84703,1.116605


## Handling Categorical Attributes/Columns 
Problem: ML models cannot handle text data natively. They can only handle numbers, so we need to convert text data into a numerical representation that still contains the relevant information.

Possible Approaches: The main approaches for categorical attribute handling are 
1. Ordinal Encoding - Useful when the categories correspond to an ascending or descending order. 
2. One-Hot Encoding (Default Choice) - For each categorical column, convert it into multiple columns, one for each possible category. This is used when the categories do not have an obvious logical order. 
3. Numerical Feature Replacement (Advanced) - In cases where the number of categories is very large (hundreds or thousands), one should consider replacing the categorical column with a numerical column that maps each category to a meaningful number. For example, one could convert a country code into the country's population.
4. Embedding Replacement (Advanced) - Alternatively, one can replace categories with embeddings, which are low dimensional vectors that represent the category. 

Chosen Approach : One-Hot Encoding - In this case, we use a one-hot encoding since none of the categories seem to have an order to the classes (ruling out ordinal encoding) and the number of categories for each column is low (under 10 for all categorical columns) which rules out needing advanced methods. 
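
One thing to watch with one-hot encoding is that the training and test sets are encoded separately, so a category that is absent from one of them produces mismatched columns. A minimal sketch on toy data shows one way to guard against that with `reindex`. The values, including the `S` category, and the names `toy_train`, `toy_test`, `train_encoded`, and `test_encoded` are assumptions for illustration.

```python
import pandas as pd

# Toy frames with hypothetical values; the test frame lacks the 'S' and 'Y' categories.
toy_train = pd.DataFrame({"Edema": ["N", "S", "Y"], "Age": [50, 60, 70]})
toy_test = pd.DataFrame({"Edema": ["N", "N"], "Age": [55, 65]})

# Each text column becomes one indicator column per category.
train_encoded = pd.get_dummies(toy_train)

# Align the test columns to the training columns, filling absent categories with False.
test_encoded = pd.get_dummies(toy_test).reindex(
    columns=train_encoded.columns, fill_value=False
)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

When every category appears in both the training and test data, as the 25 test columns reported below suggest, the alignment step changes nothing. It only matters when a category is missing from one side.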

In [9]:
# Confirm that the number of categories in the categorical columns is manageable (< 100) since we will be adding a column to the df for each category
unique_values_per_column = train_df.nunique()
print(unique_values_per_column)

N_Days           461
Drug               2
Age              391
Sex                2
Ascites            2
Hepatomegaly       2
Spiders            2
Edema              3
Bilirubin        111
Cholesterol      226
Albumin          160
Copper           171
Alk_Phos         364
SGOT             206
Tryglicerides    154
Platelets        227
Prothrombin       49
Stage              4
Status             3
dtype: int64


In [10]:
# Convert the categorical columns into one-hot encodings using the get_dummies function
status = train_df['Status']
train_df_dummies = pd.get_dummies(train_df.drop('Status', axis=1))
train_df = pd.concat([train_df_dummies, status], axis=1)
test_df = pd.get_dummies(test_df)

# Confirm the transformation was successful
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5271 entries, 0 to 5270
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   N_Days                5271 non-null   float64
 1   Age                   5271 non-null   float64
 2   Bilirubin             5271 non-null   float64
 3   Cholesterol           5271 non-null   float64
 4   Albumin               5271 non-null   float64
 5   Copper                5271 non-null   float64
 6   Alk_Phos              5271 non-null   float64
 7   SGOT                  5271 non-null   float64
 8   Tryglicerides         5271 non-null   float64
 9   Platelets             5271 non-null   float64
 10  Prothrombin           5271 non-null   float64
 11  Stage                 5271 non-null   float64
 12  Drug_D-penicillamine  5271 non-null   bool   
 13  Drug_Placebo          5271 non-null   bool   
 14  Sex_F                 5271 non-null   bool   
 15  Sex_M                

## Train/Test Split

Problem: We need to split the dataset into a training dataset and a testing/tuning dataset. If we were not to do this and did both training / tuning / evaluation with the same dataset, the final model would likely be overfitted and would not generalize well to the real-world / the Kaggle competition private test set. 

Possible Approaches
1. K-Fold Cross-Validation - In this approach the dataset is divided into k equal-sized subsets (folds). We then train the model k times, each time using a different fold as the validation set and the other k-1 folds as the training set, and use the average score across all k models as the final score. This approach is commonly used because every example is used for validation exactly once, and each model can train on a larger share of the data (for example, 90% with k=10 vs. 80% with a single hold-out split). Typical values are k=5 or k=10.
2. Train/Validation/Test Split - In this approach we split the dataset into three distinct sets: a training set for training the model, a validation set for tuning the hyperparameters and a test set for evaluating the final model. By separating the test set and validation set we reduce / avoid overfitting to the test set. A typical split varies but could be 80/10/10 or 70/15/15.

Chosen Approach : Initially we used the train/validation/test approach since it is simpler to implement for manual hyperparameter tuning. Later, in some cases, we used k-fold cross-validation since that is the default for automated hyperparameter tuning methods such as GridSearchCV.

In [11]:
from sklearn.model_selection import StratifiedKFold

# Set parameters for k-folds cross-validation
kfold = StratifiedKFold(n_splits=10, shuffle=True)

from sklearn.model_selection import train_test_split

# Split the training set into a training and validation set 
X = train_df.drop("Status", axis=1)
y = train_df["Status"]

X_train, X_val, y_train, y_val = train_test_split(X, y,  test_size = 0.1)
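
For comparison with the single hold-out split, here is a sketch of how a stratified k-fold splitter can be passed to `cross_val_score` to get one log-loss value per fold. It runs on synthetic data from `make_classification` rather than the competition data, and the names `X_demo`, `y_demo`, `splitter`, and `fold_losses` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every fold.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports log loss as a negative score (higher is better),
# so we flip the sign to get the usual lower-is-better log loss.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=splitter,
    scoring="neg_log_loss",
)
fold_losses = -scores

print("Log loss per fold:", fold_losses.round(3))
print("Mean log loss:", fold_losses.mean().round(3))
```

Fixing `random_state` on the splitter (and on `train_test_split`) makes the reported scores reproducible between runs.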


# Logistic Regression Baseline Model
Default Model Performance: 0.506

Tuned Model Performance: 0.506

Hyperparameter Tuning: 
Here we manually tuned the most important parameters for Logistic Regression (C, max_iter, class_weight, tol, penalty, solver) but were unable to significantly increase performance (still around 0.506)

Conclusion: We set the baseline of model performance at 0.506 log-loss since Logistic Regression is the simplest model we used on the dataset. Since the performance was so weak, we decided to not include Logistic Regression in the final ensemble of models for competition submission.  

In [12]:
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import log_loss

# Train the model using scikit-learn's LogisticRegression class
model = LogisticRegression(C=1, max_iter=1000)
# The fit method learns the parameters (weights) for the model
model.fit(X_train, y_train)
# Predict the probability of each class for the validation set and the test set 
y_val_pred_lr = model.predict_proba(X_val)
y_pred_lr = model.predict_proba(test_df)

# Evaluate the model predictions by calculating log loss for the validation set
loss = log_loss(y_val, y_val_pred_lr)
print(f'Validation Set Log Loss: {loss}')

Validation Set Log Loss: 0.5015551624001687
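
To put these log-loss numbers in context, the sketch below computes log loss by hand (the negative mean log-probability assigned to the true class), checks it against `sklearn.metrics.log_loss`, and derives two naive reference points. The toy predictions are made up, while the class counts come from the `value_counts` output earlier in the notebook.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy true labels and predicted probabilities (hypothetical values).
y_true = np.array([0, 2, 1, 0])
proba = np.array([
    [0.7, 0.1, 0.2],
    [0.2, 0.1, 0.7],
    [0.3, 0.4, 0.3],
    [0.5, 0.2, 0.3],
])

# Log loss = negative mean of log(probability given to the true class).
true_class_proba = proba[np.arange(len(y_true)), y_true]
manual_loss = -np.mean(np.log(true_class_proba))
sklearn_loss = log_loss(y_true, proba, labels=[0, 1, 2])
print(manual_loss, sklearn_loss)

# Reference point 1: predicting 1/3 for every class gives ln(3).
uniform_loss = -np.log(1.0 / 3.0)

# Reference point 2: always predicting the training class frequencies
# (counts of C, CL, and D taken from the value_counts output above).
counts = np.array([4965, 275, 2665])
prior = counts / counts.sum()
prior_loss = -np.sum(prior * np.log(prior))

print("Uniform baseline:", uniform_loss)
print("Class-frequency baseline:", prior_loss)
```

A model is only adding information if its validation log loss is below the class-frequency baseline.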


# Random Forest Model 
Default Model Performance: 0.438

Tuned Model Performance: 0.434 

Hyperparameter Tuning:
We manually tuned hyperparameters starting from the most important hyperparameter (n_estimators) and working through the rest (max_depth, max_features, min_samples_split, min_impurity_decrease, min_samples_leaf, min_weight_fraction_leaf). We only found a meaningful improvement from tuning the n_estimators hyperparameter, which improved the log loss from 0.438 to 0.434.

Conclusion: The overall conclusion is that the Random Forest model performed significantly better than the baseline Logistic Regression model with a log-loss of 0.434 compared to 0.506. However, this model still performed significantly worse than the best XGBoost and LGBM models so we decided to exclude the Random Forest from the final ensemble for competition submission. 


In [13]:
from sklearn.ensemble import RandomForestClassifier

# Train the model using scikit-learn's RandomForestClassifier class
rf_classifier = RandomForestClassifier(n_estimators=300)
# The fit method builds the decision trees that make up the forest
rf_classifier.fit(X_train, y_train) 
# Predict the probability of each class for the validation set and the test set (Kaggle's unlabeled dataset)
y_val_pred_rf = rf_classifier.predict_proba(X_val)
y_pred_rf = rf_classifier.predict_proba(test_df)
print(y_val_pred_rf, "\n")

# Evaluate the model predictions by calculating log loss for the validation set
loss = log_loss(y_val, y_val_pred_rf)
print(f'Validation Set Log Loss: {loss}')

[[0.88333333 0.04333333 0.07333333]
 [0.72333333 0.06       0.21666667]
 [0.17333333 0.07       0.75666667]
 ...
 [0.7        0.10333333 0.19666667]
 [0.78666667 0.03333333 0.18      ]
 [0.30333333 0.02       0.67666667]] 

Validation Set Log Loss: 0.42991751918593946


# XGBoost Model
Default Model Performance: 0.457

Tuned Model Performance: 0.38

Approach: Initially we tuned the model hyperparameters manually as before, an approach I've seen suggested in some ML books such as Corey Wade's XGBoost book. However, we were not able to tune the model beyond a log loss of ~0.42.

Conclusion: After comparing with other XGBoost notebooks, it was clear that tuning the model hyperparameters manually requires a great deal of experience. For example, with the method I have been using, where hyperparameters are tuned one at a time by changing the numbers in the XGBClassifier(n_estimators=700) constructor call, n_estimators=700 performs very poorly and is therefore discarded early in the process. However, it turns out n_estimators=700 performs very well *if* the learning_rate is tuned at the same time. This suggests I need to improve my understanding of how hyperparameters interact and to automate part of my process with GridSearchCV in future hyperparameter tuning.
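
As a sketch of what tuning the two interacting hyperparameters together could look like, the example below runs `GridSearchCV` over every combination of `n_estimators` and `learning_rate`. To keep it dependent on scikit-learn only, it uses `GradientBoostingClassifier` as a stand-in for `XGBClassifier` (both accept these two parameter names) and synthetic data; the grid values are arbitrary assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Every combination is evaluated, so an n_estimators value is never
# discarded based on a single, possibly unsuitable, learning_rate.
param_grid = {
    "n_estimators": [20, 100],
    "learning_rate": [0.05, 0.5],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid,
    scoring="neg_log_loss",
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validated log loss:", -search.best_score_)
```

The full table of combinations is available in `search.cv_results_`, which makes it easier to see how the two parameters trade off against each other.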


In [14]:
from xgboost import XGBClassifier 

# Train the model using xgboost's XGBClassifier class
xgb_model = XGBClassifier(
    n_estimators=700,
    max_depth=6,
    learning_rate=0.04,
    colsample_bytree=0.168,
    min_child_weight=17,
    subsample=0.7,
)
# The fit method builds the boosted trees for the model
xgb_model.fit(X_train, y_train)
# Predict the probability of each class for the validation set and the test set (Kaggle's unlabeled dataset)
y_val_pred_xgb = xgb_model.predict_proba(X_val)
y_pred_xgb = xgb_model.predict_proba(test_df)
print(y_val_pred_xgb, "\n")

# Evaluate the model predictions by calculating log loss for the validation set
loss = log_loss(y_val, y_val_pred_xgb)
print(f'Validation Set Log Loss: {loss}')

[[0.8166731  0.0058847  0.1774422 ]
 [0.7644712  0.09509657 0.14043225]
 [0.06047722 0.00893804 0.9305847 ]
 ...
 [0.924562   0.01742059 0.05801747]
 [0.9405738  0.00984116 0.04958501]
 [0.12290483 0.00321576 0.87387943]] 

Validation Set Log Loss: 0.3882653256144413


# LGBM Classifier
Default Model Performance: 0.455

Tuned Model Performance: 0.396

Approach / Conclusion: The approach and conclusions are similar to those found in the XGBoost section.

In [15]:
import lightgbm as lgb

# Define your parameters
lgb_params = {
    'max_depth': 15,
    'min_child_samples': 13,
    'learning_rate': 0.05285597081335651,
    'n_estimators': 294,
    'min_child_weight': 5,
    'colsample_bytree': 0.10012816493265511,
    'reg_alpha': 0.8767668608061822,
    'reg_lambda': 0.8705834466355764
}

# Create the LGBMClassifier with the specified parameters
lgbm_model = lgb.LGBMClassifier(**lgb_params)

# Now you can fit this classifier to your data
lgbm_model.fit(X_train, y_train)

# And make predictions
y_val_pred_lgbm = lgbm_model.predict_proba(X_val)
y_pred_lgbm = lgbm_model.predict_proba(test_df)

# Evaluate the model predictions by calculating log loss for the validation set
loss = log_loss(y_val, y_val_pred_lgbm)
print(f'Validation Set Log Loss: {loss}')

Validation Set Log Loss: 0.38630631371856494


# Ensemble Model

Approach: Since our current Logistic Regression and Random Forest models are significantly weaker than the XGBoost and LGBM models, we will only use the tuned XGBoost / LGBM models in the Ensemble. Further work will involve trying to tune the other models further so they can be included in the Ensemble. 

Tuned Ensemble Model Score: 0.398
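
Soft voting simply averages the class probabilities predicted by each member model. The sketch below checks this on synthetic data, with two simple scikit-learn models standing in for the LGBM and XGBoost classifiers; the names `X_demo`, `y_demo`, `voter`, and `manual_average` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the competition data.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Two simple stand-in models combined with soft voting.
voter = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_demo, y_demo)

# estimators_ holds the fitted copies that the ensemble trained internally.
member_probas = [est.predict_proba(X_demo) for est in voter.estimators_]
manual_average = np.mean(member_probas, axis=0)

# The ensemble's probabilities equal the unweighted mean of its members'.
matches = np.allclose(voter.predict_proba(X_demo), manual_average)
print("Soft voting equals the mean of member probabilities:", matches)
```

Note that `VotingClassifier.fit` trains its own copies of the estimators it is given, so models passed in are refit on whatever data the ensemble is fit on.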

In [16]:
from sklearn.ensemble import VotingClassifier

# Train an Ensemble model using a combination of the XGBoost and LGBM Classifiers, voting='soft' is used since we are predicting probabilities, not the actual classes
ensemble_model = VotingClassifier(
    estimators=[
        ('lgb', lgbm_model),
        ('xgb', xgb_model),
    ],
    voting='soft'
)

ensemble_model.fit(X_train, y_train)

# Make predictions using the ensemble
y_val_pred_ensemble = ensemble_model.predict_proba(X_val)
y_pred_ensemble = ensemble_model.predict_proba(test_df)

# Evaluate the model predictions by calculating log loss for the validation set
loss = log_loss(y_val, y_val_pred_ensemble)
print(f'Validation Set Log Loss: {loss}')

Validation Set Log Loss: 0.38521794412139393


In [17]:
# Train the final Ensemble model for competition submission using all the data, including data from the validation set 
ensemble_model_final = VotingClassifier(
    estimators=[
        ('lgb', lgbm_model),
        ('xgb', xgb_model),
    ],
    voting='soft'
)

ensemble_model_final.fit(X, y)

# Make predictions using the ensemble
y_pred_ensemble_final = ensemble_model_final.predict_proba(test_df)

# Kaggle Submission Processing
1. Create a submission dataframe from the model's predictions 
2. Concatenate the test set's id column with the predictions to adhere to submission formatting requirements
3. Convert the submission dataframe into a csv file for submission 
4. Now in order to submit to Kaggle, save the notebook and navigate to the Submissions page for this competition and click 'Submit Prediction' in the top-right corner -> Notebook -> Submit. 

In [18]:
# Modify the probability predictions into the submission format 
submission_df = pd.DataFrame(y_pred_ensemble_final, columns=['Status_C', 'Status_CL', 'Status_D'])
final_submission_df = pd.concat([test_id_df, submission_df], axis=1)
final_submission_df.head(10)

# Create a submission.csv file that Kaggle will automatically evaluate for submission
final_submission_df.to_csv('submission.csv', index = False)
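
Before submitting, it can be worth running a few sanity checks on the submission frame. The sketch below builds a toy submission with made-up ids and probabilities and checks the column names, missing values, and that each row's probabilities sum to 1; the names `toy_submission`, `columns_ok`, `no_missing`, and `rows_sum_to_one` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy ids and probabilities (hypothetical values).
toy_ids = pd.Series([7905, 7906, 7907], name="id")
toy_proba = np.array([
    [0.70, 0.10, 0.20],
    [0.20, 0.05, 0.75],
    [0.50, 0.30, 0.20],
])
class_columns = ["Status_C", "Status_CL", "Status_D"]

toy_submission = pd.concat(
    [toy_ids, pd.DataFrame(toy_proba, columns=class_columns)], axis=1
)

# Check the column names and order expected by the submission format.
columns_ok = list(toy_submission.columns) == ["id"] + class_columns

# Check that no cell is missing.
no_missing = not toy_submission.isna().any().any()

# Check that each row's class probabilities sum to 1.
rows_sum_to_one = np.allclose(toy_submission[class_columns].sum(axis=1), 1.0)

print(columns_ok, no_missing, rows_sum_to_one)
```

The same three checks can be applied to `final_submission_df` before writing `submission.csv`.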