# Multi-Class Prediction of Obesity Risk
Easy solution💸 91% Accuracy 🧿 XGBoost+Optuna
 Playground Series - Season 4, Episode 2
 
 Check out all my notebooks on : [www.kaggle.com/divyam6969](http://www.kaggle.com/divyam6969)
 

## Overview

This project predicts obesity risk as a multi-class classification problem. The target column, `NObeyesdad`, takes one of seven weight categories, such as `Insufficient_Weight`, `Normal_Weight`, and `Obesity_Type_III`. The model is an XGBoost classifier combined with the Optuna hyperparameter optimization framework, and the competition data is supplemented with the original `obesity-or-cvd-risk-classifyregressorcluster` dataset.

## Key Components and Techniques

1. **XGBoost Algorithm:** An `XGBClassifier`, a widely used gradient boosting model, serves as the multi-class classifier.

2. **Optuna Hyperparameter Optimization:** The Optuna library is used to search for XGBoost hyperparameters that improve model performance.

3. **Data and Features:** The competition training set is combined with the original `ObesityDataSet.csv`, giving 22,845 rows after duplicates are dropped. Each row has 16 features (8 numeric, 8 categorical) plus the `NObeyesdad` target.

4. **GPU Acceleration:** Training runs on a GPU (specifically, a P100) to reduce computation time.
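
How GPU training is switched on depends on the installed XGBoost version. The following is a minimal sketch, assuming that XGBoost 2.0 or later takes `device="cuda"` together with `tree_method="hist"`, while older releases use `tree_method="gpu_hist"`. The helper name `gpu_params` is hypothetical and does not appear in the notebook.

```python
def gpu_params(xgboost_version: str) -> dict:
    """Return constructor keyword arguments that request GPU training.

    Hypothetical helper: call it with xgboost.__version__.
    """
    major = int(xgboost_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the device separately from the tree method.
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases expose GPU training through a dedicated tree method.
    return {"tree_method": "gpu_hist"}


print(gpu_params("2.0.3"))
print(gpu_params("1.7.6"))
```

These keyword arguments could then be merged into the classifier's constructor call, for example `XGBClassifier(random_state=42, **gpu_params(xgboost.__version__))`.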

## Project Outcome

- **Competition Performance:** The project earned a bronze medal in the competition.

- **Public Score:** The submission reached a public score of 0.91473, in line with the 5-fold cross-validation accuracy of about 0.915 reported below.

- **Version History:** The project was improved iteratively; version 3 contains the final model.

- **Execution Time:** The reported total execution time is 155.2 seconds with GPU acceleration.

Overall, the notebook shows how XGBoost with tuned hyperparameters handles multi-class prediction of obesity risk, with GPU training keeping the run time short.


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/playground-series-s4e2/sample_submission.csv
/kaggle/input/playground-series-s4e2/train.csv
/kaggle/input/playground-series-s4e2/test.csv
/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv


## First, we load the necessary libraries

In [2]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score

from xgboost import XGBClassifier

import optuna

## Now go to Data -> Add Data, search for the obesity-or-cvd-risk-classifyregressorcluster dataset, and add it before running the following code

In [3]:
train = pd.read_csv("/kaggle/input/playground-series-s4e2/train.csv", index_col="id")
test = pd.read_csv("/kaggle/input/playground-series-s4e2/test.csv", index_col="id")
obesity = pd.read_csv("/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv")

train = pd.concat([train, obesity], axis=0)
train = train.drop_duplicates()

display(train.shape, train.head(), train.describe(include=[np.number]).T, train.describe(include=[object]).T, train.isna().sum())

(22845, 17)

| | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 24.443011 | 1.699998 | 81.66995 | yes | yes | 2.0 | 2.983297 | Sometimes | no | 2.763573 | no | 0.0 | 0.976473 | Sometimes | Public_Transportation | Overweight_Level_II |
| 1 | Female | 18.0 | 1.56 | 57.0 | yes | yes | 2.0 | 3.0 | Frequently | no | 2.0 | no | 1.0 | 1.0 | no | Automobile | Normal_Weight |
| 2 | Female | 18.0 | 1.71146 | 50.165754 | yes | yes | 1.880534 | 1.411685 | Sometimes | no | 1.910378 | no | 0.866045 | 1.673584 | no | Public_Transportation | Insufficient_Weight |
| 3 | Female | 20.952737 | 1.71073 | 131.274851 | yes | yes | 3.0 | 3.0 | Sometimes | no | 1.674061 | no | 1.467863 | 0.780199 | Sometimes | Public_Transportation | Obesity_Type_III |
| 4 | Male | 31.641081 | 1.914186 | 93.798055 | yes | yes | 2.679664 | 1.971472 | Sometimes | no | 1.979848 | no | 1.967973 | 0.931721 | Sometimes | Public_Transportation | Overweight_Level_II |


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 22845.0 | 23.888513 | 5.755338 | 14.0 | 20.0 | 22.815416 | 26.0 | 61.0 |
| Height | 22845.0 | 1.700467 | 0.087865 | 1.45 | 1.631856 | 1.7 | 1.763029 | 1.98 |
| Weight | 22845.0 | 87.793761 | 26.363367 | 39.0 | 66.0 | 84.0 | 111.531208 | 173.0 |
| FCVC | 22845.0 | 2.443675 | 0.533392 | 1.0 | 2.0 | 2.393837 | 3.0 | 3.0 |
| NCP | 22845.0 | 2.755837 | 0.711185 | 1.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| CH2O | 22845.0 | 2.027165 | 0.608479 | 1.0 | 1.755907 | 2.0 | 2.531984 | 3.0 |
| FAF | 22845.0 | 0.984585 | 0.839728 | 0.0 | 0.01586 | 1.0 | 1.600431 | 3.0 |
| TUE | 22845.0 | 0.620984 | 0.602802 | 0.0 | 0.0 | 0.58284 | 1.0 | 2.0 |


| | count | unique | top | freq |
|---|---|---|---|---|
| Gender | 22845 | 2 | Female | 11457 |
| family_history_with_overweight | 22845 | 2 | yes | 18736 |
| FAVC | 22845 | 2 | yes | 20826 |
| CAEC | 22845 | 4 | Sometimes | 19290 |
| SMOKE | 22845 | 2 | no | 22556 |
| SCC | 22845 | 2 | no | 22062 |
| CALC | 22845 | 4 | Sometimes | 16446 |
| MTRANS | 22845 | 5 | Public_Transportation | 18245 |
| NObeyesdad | 22845 | 7 | Obesity_Type_III | 4370 |


Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
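
The loading cell appends the original dataset to the competition data and then drops duplicate rows. Here is a small sketch of what those two calls do, using invented toy frames (the values are illustrative and not taken from the real files):

```python
import pandas as pd

# Toy stand-ins for the competition data (indexed by id) and the original
# dataset (default integer index). All values are invented.
competition = pd.DataFrame(
    {"Age": [21.0, 35.0], "NObeyesdad": ["Normal_Weight", "Obesity_Type_I"]},
    index=pd.Index([0, 1], name="id"),
)
original = pd.DataFrame(
    {"Age": [35.0, 50.0], "NObeyesdad": ["Obesity_Type_I", "Overweight_Level_I"]}
)

# Stack the rows of both frames; they share the same columns.
combined = pd.concat([competition, original], axis=0)

# drop_duplicates compares column values only (the index is ignored) and
# keeps the first occurrence of each repeated row.
deduplicated = combined.drop_duplicates()

print(combined.shape)                # (4, 2)
print(deduplicated.shape)            # (3, 2)
print(deduplicated.index.is_unique)  # False: index labels from the two sources collide
```

The repeated index labels are harmless in this notebook because the index is not used as a feature; passing `ignore_index=True` to `pd.concat` would renumber the rows if unique labels were needed.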

### With data loading complete, we now preprocess the data ;)

In [4]:
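# One-hot encode the categorical (object) columns and standardize the numeric ones.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include=object)),
    ("scale", StandardScaler(), make_column_selector(dtype_include=np.number)),
])

# Separate the features from the target column.
X_train, y_train = train.drop("NObeyesdad", axis=1), train["NObeyesdad"]

# Fit on train and test features together so both splits share the same
# category set and scaling statistics, then transform each split.
preprocess.fit(pd.concat([X_train, test]))
X_train = preprocess.transform(X_train)
X_test = preprocess.transform(test)

# Encode the class names as integer labels for XGBoost.
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)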
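
To make the preprocessing step concrete, here is a minimal sketch on invented toy data. It lists columns by name instead of selecting them by dtype, and it sets `sparse_threshold=0` so that the transformer always returns a dense array; both choices are simplifications for this example rather than what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Invented toy data with one categorical and one numeric feature.
features = pd.DataFrame(
    {"Gender": ["Male", "Female", "Female", "Male"], "Age": [20.0, 30.0, 40.0, 50.0]}
)
labels = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight", "Obesity_Type_I"]

# Output columns follow the transformer order: one-hot columns first, then scaled ones.
toy_preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
        ("scale", StandardScaler(), ["Age"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)
X = toy_preprocess.fit_transform(features)

# A category unseen during fit is encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({"Gender": ["Other"], "Age": [35.0]})
X_unseen = toy_preprocess.transform(unseen)

# LabelEncoder maps class names to integers (in sorted order) and back.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)
restored = encoder.inverse_transform(y)

print(X.shape)      # (4, 3): two one-hot columns plus one scaled column
print(X_unseen)     # all zeros: unknown category, and Age equal to the mean
print(y.tolist())   # [0, 1, 0, 1]
```

The round trip through `inverse_transform` is what later turns the integer predictions back into class names for the submission file.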


### After preprocessing the data, we train the model using XGBoost+Optuna

#### We use the tqdm library to display a progress bar over the training runs

In [5]:
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import optuna
from tqdm import tqdm

In [6]:
params = {
    'n_estimators': 1312,
    'learning_rate': 0.018279520260162645,
    'gamma': 0.0024196354156454324,
    'reg_alpha': 0.9025931173755949,
    'reg_lambda': 0.06835667255875388,
    'max_depth': 5,
    'min_child_weight': 5,
    'subsample': 0.883274050086088,
    'colsample_bytree': 0.6579828557036317
}

xgb = XGBClassifier(random_state=42, **params)

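# Wrap the loop in tqdm to show a progress bar.
# Note: cv=5 uses unshuffled stratified folds and random_state is fixed, so every
# iteration repeats the same evaluation and prints the same score; one pass suffices.
for i in tqdm(range(5), desc="Training XGBoost"):
    score = cross_val_score(xgb, np.array(X_train), y_train, scoring='accuracy', cv=5, n_jobs=-1).mean()
    print("Accuracy: ", score)

    # Refit the model on the full training set (this runs on every iteration)
    xgb.fit(np.array(X_train), y_train)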


Training XGBoost:   0%|          | 0/5 [00:00<?, ?it/s]

Accuracy:  0.9151236594440796


Training XGBoost:  20%|██        | 1/5 [02:24<09:38, 144.72s/it]

Accuracy:  0.9151236594440796


Training XGBoost:  40%|████      | 2/5 [04:48<07:11, 143.89s/it]

Accuracy:  0.9151236594440796


Training XGBoost:  60%|██████    | 3/5 [07:10<04:46, 143.16s/it]

Accuracy:  0.9151236594440796


Training XGBoost:  80%|████████  | 4/5 [09:34<02:23, 143.40s/it]

Accuracy:  0.9151236594440796


Training XGBoost: 100%|██████████| 5/5 [11:56<00:00, 143.28s/it]
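
The `params` dictionary above is hard-coded, and the notebook does not show the search that produced it. The following is a minimal sketch of how such a search could look with Optuna. The search ranges, the number of trials, and the function names `suggest_xgb_params` and `tune` are all assumptions made for illustration; they are not the author's actual setup.

```python
def suggest_xgb_params(trial):
    """Draw one candidate configuration from an ASSUMED search space."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 200, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "gamma": trial.suggest_float("gamma", 1e-4, 1.0, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 10.0, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
    }


def tune(X, y, n_trials=50):
    """Run an Optuna study that maximizes mean 5-fold accuracy."""
    # Imported lazily so the search-space function above has no dependencies.
    import optuna
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    def objective(trial):
        model = XGBClassifier(random_state=42, **suggest_xgb_params(trial))
        return cross_val_score(model, X, y, scoring="accuracy", cv=5).mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    # best_params is keyed by the names passed to the suggest_* calls.
    return study.best_params
```

With the notebook's variables, a call such as `params = tune(np.array(X_train), y_train)` would return a dictionary in the same shape as the hard-coded one. Each trial runs a full 5-fold cross-validation, so the search takes many times longer than a single evaluation.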


### Now that the model is trained, we predict on test.csv

In [7]:
y_pred = xgb.predict(np.array(X_test))
y_pred = label_encoder.inverse_transform(y_pred)

submission = pd.DataFrame({"id": test.index, "NObeyesdad": y_pred})
submission.to_csv("submission.csv", index=False)
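
Before uploading, it can help to confirm that the submission lines up with the competition's sample file. Below is a minimal sketch, assuming `sample_submission.csv` has the same two columns (`id`, `NObeyesdad`) and the same ids as the test set. The helper `check_submission` is hypothetical, and the toy frames use invented values.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Raise ValueError if the submission does not match the sample's layout."""
    if list(submission.columns) != list(sample.columns):
        raise ValueError(
            f"expected columns {list(sample.columns)}, got {list(submission.columns)}"
        )
    if submission["id"].tolist() != sample["id"].tolist():
        raise ValueError("ids differ from the sample submission")
    if submission["NObeyesdad"].isna().any():
        raise ValueError("submission contains missing predictions")


# Toy frames with invented ids and labels.
sample = pd.DataFrame({"id": [100, 101], "NObeyesdad": ["Normal_Weight"] * 2})
good = pd.DataFrame({"id": [100, 101], "NObeyesdad": ["Obesity_Type_I", "Normal_Weight"]})
bad = pd.DataFrame({"id": [100, 999], "NObeyesdad": ["Obesity_Type_I", "Normal_Weight"]})

check_submission(good, sample)  # passes silently
try:
    check_submission(bad, sample)
except ValueError as error:
    print("rejected:", error)
```

In the notebook, the sample frame would come from `pd.read_csv("/kaggle/input/playground-series-s4e2/sample_submission.csv")`.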