# Multi-Class Prediction of Obesity Risk

(Playground Series - Season 4, Episode 2)

https://www.kaggle.com/competitions/playground-series-s4e2/overview

![Image](./data/picture.png)

__About the Tabular Playground Series__
The goal of the Tabular Playground Series is to provide the Kaggle community with a variety of fairly light-weight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. Each competition will generally last only a few weeks, though the duration may be longer or shorter depending on the challenge. The challenges will generally use fairly light-weight datasets that are synthetically generated from real-world data, and will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

__Synthetically-Generated Datasets__
Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state of the art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

__Dataset Description__
The dataset for this competition (both train and test) was generated from a deep learning model trained on the Obesity or CVD risk dataset. Feature distributions are close to, but not exactly the same as, those of the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

Note: This dataset is particularly well suited for visualizations, clustering, and general EDA. Show off your skills!
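
The suggestion to incorporate the original dataset in training could be sketched as follows. This is a minimal illustration on toy frames, not part of the notebook: the `combine_with_original` helper and the `source` flag column are hypothetical names, the toy rows are made up, and no location for the original CSV is assumed.

```python
import pandas as pd


def combine_with_original(synthetic: pd.DataFrame, original: pd.DataFrame) -> pd.DataFrame:
    """Stack synthetic and original rows and flag where each row came from."""
    # The synthetic competition data carries an 'id' column that the original lacks
    synthetic = synthetic.drop(columns=["id"], errors="ignore").assign(source="synthetic")
    original = original.assign(source="original")
    return pd.concat([synthetic, original], ignore_index=True)


# Toy stand-ins for train.csv and the original dataset (values are illustrative only)
synthetic = pd.DataFrame({
    "id": [0, 1],
    "Age": [24.4, 18.0],
    "NObeyesdad": ["Overweight_Level_II", "Normal_Weight"],
})
original = pd.DataFrame({"Age": [21.0], "NObeyesdad": ["Normal_Weight"]})

combined = combine_with_original(synthetic, original)
print(combined)
```

Keeping the `source` flag makes it possible to compare feature distributions between the two sources, or to validate on synthetic rows only.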

__Files__
* train.csv - the training dataset; NObeyesdad is the categorical target
* test.csv - the test dataset; your objective is to predict the class of NObeyesdad for each row
* sample_submission.csv - a sample submission file in the correct format

__Model Features__

- Frequent consumption of high caloric food (FAVC)
- Frequency of consumption of vegetables (FCVC)
- Number of main meals (NCP)
- Consumption of food between meals (CAEC)
- Consumption of water daily (CH2O)
- Calorie consumption monitoring (SCC)
- Physical activity frequency (FAF)
- Time using technology devices (TUE)
- Consumption of alcohol (CALC)
- Transportation used (MTRANS)

In [3]:
# Import all the libraries

import time
import numpy as np
import pandas as pd
import pickle as pkl
import datetime as dt
import warnings as wn
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

In [2]:
# Ignore all warnings
wn.filterwarnings('ignore')

In [4]:
# Set all variable paths 

_plots = './plots/'
_test = './data/test.csv'
_train = './data/train.csv'
_info = './model/model.docx'
_model = './model/model.pkl'
_submission = './data/submission.csv'

In [5]:
# Read the data from file

test_data = pd.read_csv(_test)
train_data = pd.read_csv(_train)

In [6]:
# Display the first n rows from dataset
train_data.head(n=10)

| | id | Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 24.443011 | 1.699998 | 81.66995 | yes | yes | 2.0 | 2.983297 | Sometimes | no | 2.763573 | no | 0.0 | 0.976473 | Sometimes | Public_Transportation | Overweight_Level_II |
| 1 | 1 | Female | 18.0 | 1.56 | 57.0 | yes | yes | 2.0 | 3.0 | Frequently | no | 2.0 | no | 1.0 | 1.0 | no | Automobile | Normal_Weight |
| 2 | 2 | Female | 18.0 | 1.71146 | 50.165754 | yes | yes | 1.880534 | 1.411685 | Sometimes | no | 1.910378 | no | 0.866045 | 1.673584 | no | Public_Transportation | Insufficient_Weight |
| 3 | 3 | Female | 20.952737 | 1.71073 | 131.274851 | yes | yes | 3.0 | 3.0 | Sometimes | no | 1.674061 | no | 1.467863 | 0.780199 | Sometimes | Public_Transportation | Obesity_Type_III |
| 4 | 4 | Male | 31.641081 | 1.914186 | 93.798055 | yes | yes | 2.679664 | 1.971472 | Sometimes | no | 1.979848 | no | 1.967973 | 0.931721 | Sometimes | Public_Transportation | Overweight_Level_II |
| 5 | 5 | Male | 18.128249 | 1.748524 | 51.552595 | yes | yes | 2.919751 | 3.0 | Sometimes | no | 2.13755 | no | 1.930033 | 1.0 | Sometimes | Public_Transportation | Insufficient_Weight |
| 6 | 6 | Male | 29.883021 | 1.754711 | 112.725005 | yes | yes | 1.99124 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 0.696948 | Sometimes | Automobile | Obesity_Type_II |
| 7 | 7 | Male | 29.891473 | 1.75015 | 118.206565 | yes | yes | 1.397468 | 3.0 | Sometimes | no | 2.0 | no | 0.598655 | 0.0 | Sometimes | Automobile | Obesity_Type_II |
| 8 | 8 | Male | 17.0 | 1.7 | 70.0 | no | yes | 2.0 | 3.0 | Sometimes | no | 3.0 | yes | 1.0 | 1.0 | no | Public_Transportation | Overweight_Level_I |
| 9 | 9 | Female | 26.0 | 1.638836 | 111.275646 | yes | yes | 3.0 | 3.0 | Sometimes | no | 2.632253 | no | 0.0 | 0.218645 | Sometimes | Public_Transportation | Obesity_Type_III |


In [7]:
# Describe the dataset
train_data.describe()

| | id | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|---|
| count | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 | 20758.0 |
| mean | 10378.5 | 23.841804 | 1.700245 | 87.887768 | 2.445908 | 2.761332 | 2.029418 | 0.981747 | 0.616756 |
| std | 5992.46278 | 5.688072 | 0.087312 | 26.379443 | 0.533218 | 0.705375 | 0.608467 | 0.838302 | 0.602113 |
| min | 0.0 | 14.0 | 1.45 | 39.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5189.25 | 20.0 | 1.631856 | 66.0 | 2.0 | 3.0 | 1.792022 | 0.008013 | 0.0 |
| 50% | 10378.5 | 22.815416 | 1.7 | 84.064875 | 2.393837 | 3.0 | 2.0 | 1.0 | 0.573887 |
| 75% | 15567.75 | 26.0 | 1.762887 | 111.600553 | 3.0 | 3.0 | 2.549617 | 1.587406 | 1.0 |
| max | 20757.0 | 61.0 | 1.975663 | 165.057269 | 3.0 | 4.0 | 3.0 | 3.0 | 2.0 |


In [8]:
# Display more info
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   Gender                          20758 non-null  object 
 2   Age                             20758 non-null  float64
 3   Height                          20758 non-null  float64
 4   Weight                          20758 non-null  float64
 5   family_history_with_overweight  20758 non-null  object 
 6   FAVC                            20758 non-null  object 
 7   FCVC                            20758 non-null  float64
 8   NCP                             20758 non-null  float64
 9   CAEC                            20758 non-null  object 
 10  SMOKE                           20758 non-null  object 
 11  CH2O                            20758 non-null  float64
 12  SCC                             

In [9]:
# Set the array for columns and target values
names = train_data.columns
target = train_data['NObeyesdad']
target

0        Overweight_Level_II
1              Normal_Weight
2        Insufficient_Weight
3           Obesity_Type_III
4        Overweight_Level_II
                ...         
20753        Obesity_Type_II
20754    Insufficient_Weight
20755        Obesity_Type_II
20756    Overweight_Level_II
20757        Obesity_Type_II
Name: NObeyesdad, Length: 20758, dtype: object

In [10]:
# Remove unnecessary data

train_data = train_data.drop(['id'], axis=1)
test_data = test_data.drop(['id'], axis=1)

In [11]:
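# Check for empty values
# Note: notna().any() is True when a column has at least one non-null value,
# and only the last expression (test_data) is displayed below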
train_data.notna().any()
test_data.notna().any()

Gender                            True
Age                               True
Height                            True
Weight                            True
family_history_with_overweight    True
FAVC                              True
FCVC                              True
NCP                               True
CAEC                              True
SMOKE                             True
CH2O                              True
SCC                               True
FAF                               True
TUE                               True
CALC                              True
MTRANS                            True
dtype: bool
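
Because `notna().any()` only confirms that each column holds at least one non-null value, it would not reveal a column with a few missing entries. A more direct check counts missing values per column with `isna().sum()`. The sketch below uses a made-up toy frame (the values are not from the competition data) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (illustrative values only)
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 18.0],
    "Gender": ["Male", "Female", None],
})

# True for every column, even though both columns contain a missing value
has_any_value = toy.notna().any()

# Number of missing entries per column; all zeros means nothing is missing
missing_counts = toy.isna().sum()

print(has_any_value)
print(missing_counts)
```

Applied to the notebook's frames, `train_data.isna().sum()` and `test_data.isna().sum()` would give the per-column counts for both sets.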

In [12]:
# Adding the BMI (Body Mass Index) Column

def BodyMassIndex():
    train_data['BMI'] = train_data['Weight'] / (train_data['Height'] ** 2)
    bmi_column = train_data.pop('BMI') 
    train_data.insert(1, 'BMI', bmi_column)

# BodyMassIndex() 
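
If the BMI feature were enabled, it would have to be added to `test_data` as well, since the function above only touches `train_data`. One possible variant, sketched here under the assumption that `Height` is in metres and `Weight` in kilograms, takes the frame as an argument so the same step can be applied to both sets. The `add_bmi` name is hypothetical; the sample row reuses the height and weight of row 1 in the preview above.

```python
import pandas as pd


def add_bmi(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a BMI column (Weight / Height**2) at position 1."""
    out = df.copy()
    out.insert(1, "BMI", out["Weight"] / out["Height"] ** 2)
    return out


# Height and Weight taken from row 1 of train_data.head()
toy = pd.DataFrame({"Gender": ["Female"], "Height": [1.56], "Weight": [57.0]})
toy_bmi = add_bmi(toy)
print(toy_bmi)
```

Usage would then be `train_data = add_bmi(train_data)` followed by `test_data = add_bmi(test_data)`, which keeps the column layout identical in both frames.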

In [13]:
# Encode the dataframe to numerical 

for column in train_data.columns[:]:
    if train_data[column].dtype == 'O':
        encoder = LabelEncoder()
        train_data[column] = encoder.fit_transform(train_data[column]) + 1
        mapping_dict = dict(zip(encoder.classes_, encoder.transform(encoder.classes_) + 1))
        print(f"Mapping for {column}: {mapping_dict}")

# Note: a separate encoder is fitted on the test set, so its integer codes match
# the training mapping only if both sets contain the same categories per column
for column in test_data.columns[:]:
    if test_data[column].dtype == 'O':
        encoder = LabelEncoder()
        test_data[column] = encoder.fit_transform(test_data[column]) + 1


Mapping for Gender: {'Female': 1, 'Male': 2}
Mapping for family_history_with_overweight: {'no': 1, 'yes': 2}
Mapping for FAVC: {'no': 1, 'yes': 2}
Mapping for CAEC: {'Always': 1, 'Frequently': 2, 'Sometimes': 3, 'no': 4}
Mapping for SMOKE: {'no': 1, 'yes': 2}
Mapping for SCC: {'no': 1, 'yes': 2}
Mapping for CALC: {'Frequently': 1, 'Sometimes': 2, 'no': 3}
Mapping for MTRANS: {'Automobile': 1, 'Bike': 2, 'Motorbike': 3, 'Public_Transportation': 4, 'Walking': 5}
Mapping for NObeyesdad: {'Insufficient_Weight': 1, 'Normal_Weight': 2, 'Obesity_Type_I': 3, 'Obesity_Type_II': 4, 'Obesity_Type_III': 5, 'Overweight_Level_I': 6, 'Overweight_Level_II': 7}
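
Fitting one encoder per set can assign different integers to the same category when the sets do not contain identical categories. A hedged alternative is to fit the encoder on the training column only and reuse it for the test column. The sketch below uses scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"`; the toy values, including the unseen `'Always'` category, are hypothetical, and the codes are 0-based rather than the 1-based codes used above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy columns: the test set holds a category ("Always") that train never saw
train = pd.DataFrame({"CALC": ["Sometimes", "no", "Frequently"]})
test = pd.DataFrame({"CALC": ["no", "Always", "Sometimes"]})

# Unknown categories at transform time are mapped to -1 instead of raising
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the mapping from the training data only, then reuse it for the test data
train_enc = encoder.fit_transform(train[["CALC"]])
test_enc = encoder.transform(test[["CALC"]])

print(encoder.categories_)
print(train_enc.ravel(), test_enc.ravel())
```

With a shared encoder, a given category always receives the same code in both sets, and unseen categories are made explicit instead of silently shifting the mapping.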


In [14]:
# Split the dataframe to feature and target

rows = train_data.shape[0]
cols = train_data.shape[1]

y_data = pd.DataFrame(train_data['NObeyesdad'])
X_data = pd.DataFrame(train_data.iloc[:,:-1])

X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)
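
The class sizes in the held-out split are uneven (the `support` column in the reports below ranges from 484 to 804). One optional refinement, not used in this notebook, is to pass `stratify` to `train_test_split` so that each split keeps the class proportions of the full data. A minimal sketch on toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 15 samples of class 1 and 5 samples of class 2
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 15 + [2] * 5)

# stratify=y keeps the 3:1 class ratio in both the training and the test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.bincount(y_tr)[1:], np.bincount(y_te)[1:])
```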

In [15]:
# Replace any remaining missing values with zero
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

In [16]:
# Class names in encoded order: target[k - 1] is the name of encoded label k
target = ['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I',
       'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I',
       'Overweight_Level_II']

In [17]:
# Encoded class labels present in the training split
print(y_train['NObeyesdad'].unique())

[3 1 4 7 2 5 6]


In [18]:
# Plot the training dataframe

def plot():
    numerical_columns = train_data.select_dtypes(include=['float64', 'int64']).columns
    colors = ['red', 'blue', 'purple', 'green', 'orange', 'olive', 'cyan']
    for column in numerical_columns[:-1]:
        # Overlay one histogram per encoded class (labels 1 to 7)
        for label, (name, color) in enumerate(zip(target, colors), start=1):
            plt.hist(train_data[train_data['NObeyesdad'] == label][column], label=name, color=color, alpha=0.7, density=False)
        plt.legend()
        plt.title(column)
        # The x-axis shows the feature values, the y-axis the number of samples
        plt.xlabel(column)
        plt.ylabel('Count')
        plt.savefig(f'{_plots}{column}.png')
        plt.show()

# plot()

# K Nearest Neighbors Model

In [107]:
from sklearn.neighbors import KNeighborsClassifier

In [108]:
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print(classification_report(y_test, knn_pred))

              precision    recall  f1-score   support

           1       0.91      0.94      0.92       524
           2       0.86      0.82      0.84       626
           3       0.84      0.85      0.85       543
           4       0.96      0.95      0.96       657
           5       0.99      0.99      0.99       804
           6       0.73      0.78      0.75       484
           7       0.77      0.75      0.76       514

    accuracy                           0.88      4152
   macro avg       0.87      0.87      0.87      4152
weighted avg       0.88      0.88      0.88      4152
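
K Nearest Neighbors relies on distances, so features with large ranges (such as `Weight`, 39 to 165) can outweigh features with small ranges (such as `Height`, 1.45 to 1.98). The notebook imports `StandardScaler` and `Pipeline` but fits KNN on unscaled data. The sketch below shows, on synthetic toy data rather than the competition data, how scaling inside a pipeline could be compared against the unscaled model. The effect on the real dataset is not measured here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data: feature 0 carries the signal on a small scale,
# feature 1 is pure noise on a much larger scale
signal = rng.normal(size=400)
noise = rng.normal(scale=1000.0, size=400)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

# The scaler is fitted on the training data only, inside the pipeline
scaled_knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5)),
])
scaled_acc = scaled_knn.fit(X_tr, y_tr).score(X_te, y_te)

# Same classifier without scaling, for comparison
raw_acc = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)

print(f"scaled: {scaled_acc:.3f}  unscaled: {raw_acc:.3f}")
```

The same pipeline pattern would apply to the other scale-sensitive models in this notebook, such as `SVC`, `LogisticRegression`, and `MLPClassifier`.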



# Gaussian Naive Bayes Model

In [109]:
from sklearn.naive_bayes import GaussianNB

In [110]:
nbc_model = GaussianNB()
nbc_model.fit(X_train, y_train)
nbc_pred = nbc_model.predict(X_test)
print(classification_report(y_test, nbc_pred))

              precision    recall  f1-score   support

           1       0.81      0.94      0.87       524
           2       0.82      0.66      0.73       626
           3       0.62      0.64      0.63       543
           4       0.82      0.94      0.87       657
           5       0.97      1.00      0.98       804
           6       0.67      0.54      0.59       484
           7       0.59      0.57      0.58       514

    accuracy                           0.78      4152
   macro avg       0.76      0.76      0.75      4152
weighted avg       0.77      0.78      0.77      4152



# Logistic Regression Model

In [111]:
from sklearn.linear_model import LogisticRegression

In [112]:
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
log_pred = log_model.predict(X_test)
print(classification_report(y_test, log_pred))

              precision    recall  f1-score   support

           1       0.79      0.74      0.76       524
           2       0.63      0.67      0.65       626
           3       0.57      0.60      0.58       543
           4       0.82      0.88      0.85       657
           5       0.89      0.92      0.90       804
           6       0.56      0.56      0.56       484
           7       0.57      0.45      0.50       514

    accuracy                           0.71      4152
   macro avg       0.69      0.69      0.69      4152
weighted avg       0.71      0.71      0.71      4152



# Support Vector Machine Model

In [113]:
from sklearn.svm import SVC

In [114]:
svc_model = SVC()
svc_model.fit(X_train, y_train)
svc_pred = svc_model.predict(X_test)
print(classification_report(y_test, svc_pred))

              precision    recall  f1-score   support

           1       0.84      0.95      0.89       524
           2       0.86      0.72      0.78       626
           3       0.80      0.80      0.80       543
           4       0.93      0.89      0.91       657
           5       0.94      0.96      0.95       804
           6       0.66      0.66      0.66       484
           7       0.67      0.73      0.70       514

    accuracy                           0.83      4152
   macro avg       0.81      0.82      0.81      4152
weighted avg       0.83      0.83      0.83      4152



# Decision Tree Classifier Model

In [115]:
from sklearn.tree import DecisionTreeClassifier

In [116]:
dtc_model = DecisionTreeClassifier()
dtc_model.fit(X_train, y_train)
dtc_pred = dtc_model.predict(X_test)
print(classification_report(y_test, dtc_pred))

              precision    recall  f1-score   support

           1       0.89      0.89      0.89       524
           2       0.79      0.78      0.79       626
           3       0.81      0.81      0.81       543
           4       0.97      0.94      0.95       657
           5       0.99      1.00      0.99       804
           6       0.64      0.66      0.65       484
           7       0.70      0.70      0.70       514

    accuracy                           0.84      4152
   macro avg       0.82      0.82      0.82      4152
weighted avg       0.84      0.84      0.84      4152



# Random Forest Classifier Model

In [117]:
from sklearn.ensemble import RandomForestClassifier

In [118]:
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
rfc_pred = rfc_model.predict(X_test)
print(classification_report(y_test, rfc_pred))

              precision    recall  f1-score   support

           1       0.93      0.93      0.93       524
           2       0.86      0.88      0.87       626
           3       0.88      0.89      0.88       543
           4       0.97      0.97      0.97       657
           5       1.00      1.00      1.00       804
           6       0.76      0.77      0.77       484
           7       0.81      0.79      0.80       514

    accuracy                           0.90      4152
   macro avg       0.89      0.89      0.89      4152
weighted avg       0.90      0.90      0.90      4152



# Linear Discriminant Analysis Model

In [119]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [120]:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
lda_pred = lda_model.predict(X_test)
print(classification_report(y_test, lda_pred))

              precision    recall  f1-score   support

           1       0.84      0.94      0.88       524
           2       0.82      0.71      0.76       626
           3       0.81      0.76      0.79       543
           4       0.92      0.96      0.94       657
           5       0.99      0.99      0.99       804
           6       0.62      0.67      0.65       484
           7       0.68      0.64      0.66       514

    accuracy                           0.83      4152
   macro avg       0.81      0.81      0.81      4152
weighted avg       0.83      0.83      0.83      4152



# Gradient Boosting Classifier Model

In [20]:
from sklearn.ensemble import GradientBoostingClassifier

In [122]:
gbc_model = GradientBoostingClassifier()
gbc_model.fit(X_train, y_train)
gbc_pred = gbc_model.predict(X_test)
print(classification_report(y_test, gbc_pred))

              precision    recall  f1-score   support

           1       0.94      0.94      0.94       524
           2       0.87      0.88      0.88       626
           3       0.90      0.87      0.89       543
           4       0.97      0.97      0.97       657
           5       1.00      1.00      1.00       804
           6       0.77      0.77      0.77       484
           7       0.79      0.80      0.80       514

    accuracy                           0.90      4152
   macro avg       0.89      0.89      0.89      4152
weighted avg       0.90      0.90      0.90      4152



# Neural Network Classifier Model

In [123]:
from sklearn.neural_network import MLPClassifier

In [124]:
nnc_model = MLPClassifier()
nnc_model.fit(X_train, y_train)
nnc_pred = nnc_model.predict(X_test)
print(classification_report(y_test, nnc_pred))

              precision    recall  f1-score   support

           1       0.90      0.94      0.92       524
           2       0.89      0.71      0.79       626
           3       0.84      0.85      0.85       543
           4       0.96      0.96      0.96       657
           5       1.00      1.00      1.00       804
           6       0.63      0.78      0.70       484
           7       0.74      0.71      0.72       514

    accuracy                           0.86      4152
   macro avg       0.85      0.85      0.85      4152
weighted avg       0.87      0.86      0.86      4152



In [153]:
def Statistics():
    print("1. KNeighborsClassifier Score: \t\t\t", accuracy_score(y_test, knn_pred))
    print("2. Gaussian Naive Bayes Score: \t\t\t", accuracy_score(y_test, nbc_pred))
    print("3. Logistic Regressor Score: \t\t\t", accuracy_score(y_test, log_pred))
    print("4. Support Vector Classification Score: \t", accuracy_score(y_test, svc_pred))
    print("5. Decision Tree Score: \t\t\t", accuracy_score(y_test, dtc_pred))
    print("6. Random Forest Score: \t\t\t", accuracy_score(y_test, rfc_pred))
    print("7. Linear Discriminant Analysis Score: \t\t", accuracy_score(y_test, lda_pred))
    print("8. Gradient Boost Classifier Score: \t\t", accuracy_score(y_test, gbc_pred))
    print("9. Neural Network Score: \t\t\t", accuracy_score(y_test, nnc_pred))

Statistics()

1. KNeighborsClassifier Score: 			 0.8774084778420038
2. Gaussian Naive Bayes Score: 			 0.7776974951830443
3. Logistic Regressor Score: 			 0.7105009633911368
4. Support Vector Classification Score: 	 0.8273121387283237
5. Decision Tree Score: 			 0.8403179190751445
6. Random Forest Score: 			 0.899325626204239
7. Linear Discriminant Analysis Score: 		 0.8282755298651252
8. Gradient Boost Classifier Score: 		 0.9019749518304432
9. Neural Network Score: 			 0.8619942196531792
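
The scores above come from a single 80/20 split, so small differences between models (for example Random Forest versus Gradient Boosting) may partly reflect the split. A possible cross-check is k-fold cross-validation over a dictionary of models. The sketch below uses synthetic data from `make_classification` and only two models to stay fast; it is an illustration of the pattern, not a rerun of the comparison above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features and target
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {score:.4f}")
```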


In [154]:
# Print the confusion matrix for the most accurate model
print(confusion_matrix(y_true=y_test, y_pred=gbc_pred))

[[491  30   0   0   0   2   1]
 [ 29 554   0   0   0  38   5]
 [  3   1 473  12   1  12  41]
 [  0   0  13 640   2   0   2]
 [  0   0   1   1 802   0   0]
 [  1  43   7   0   0 375  58]
 [  0  10  31   5   1  57 410]]
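
The raw matrix is easier to read with class names attached. In scikit-learn's `confusion_matrix`, rows are true classes and columns are predicted classes, so per-class recall and precision follow from the row and column sums. The sketch below copies the matrix printed above and the class order from the `target` list.

```python
import numpy as np
import pandas as pd

target = ['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I',
          'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I',
          'Overweight_Level_II']

# Confusion matrix of the Gradient Boosting model, copied from the output above
cm = np.array([
    [491,  30,   0,   0,   0,   2,   1],
    [ 29, 554,   0,   0,   0,  38,   5],
    [  3,   1, 473,  12,   1,  12,  41],
    [  0,   0,  13, 640,   2,   0,   2],
    [  0,   0,   1,   1, 802,   0,   0],
    [  1,  43,   7,   0,   0, 375,  58],
    [  0,  10,  31,   5,   1,  57, 410],
])

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(cm, index=target, columns=target)

recall = pd.Series(np.diag(cm) / cm.sum(axis=1), index=target)     # per true class
precision = pd.Series(np.diag(cm) / cm.sum(axis=0), index=target)  # per predicted class
accuracy = np.trace(cm) / cm.sum()

print(cm_df)
print(recall.round(2))
print(f"accuracy: {accuracy:.4f}")
```

The largest off-diagonal counts in this matrix are between `Overweight_Level_I` and `Overweight_Level_II` (58 and 57), which matches the lower precision and recall reported for those two classes.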


# Pipeline & Scaler Model

In [21]:
# Create the pipeline

model = GradientBoostingClassifier(learning_rate=0.1, max_depth=4, min_samples_leaf=2, min_samples_split=2, n_estimators=200, subsample=0.9)
pipeline = Pipeline([
    # ('scaler', MinMaxScaler()),
    ('model', model)
])

pipeline.fit(X_train, y_train)
y_pipe = pipeline.predict(X_test)
print(classification_report(y_test, y_pipe))
print("Final Score: ", accuracy_score(y_test, y_pipe))

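# Write a summary of the model to the info file

def update_info():
    # Note: the pipeline is refit on the full dataset, which includes X_test,
    # so the accuracy and report written here are optimistic, not held-out estimates
    pipeline.fit(X_data, y_data)
    y_pipe = pipeline.predict(X_test)
    model_info = [f"Accuracy Score: {accuracy_score(y_test, y_pipe)}\n", 
                  f"Model Name: {pipeline.named_steps['model']}\n", 
                  f"Time:   {dt.datetime.now()}\n\n",
                  f"Report: {classification_report(y_test, y_pipe)}"]
    # The context manager closes the file once the summary is written
    with open(_info, "w") as info:
        info.writelines(model_info)
    print(''.join(model_info))
    # Restore the pipeline fitted on the training split only
    pipeline.fit(X_train, y_train)


# Save the model to file

val = input('Are you sure you want to save the last model: ')
if val == 'y':
    print('Saving . . .')
    update_info()
    model = pipeline
    with open(_model, 'wb') as model_file:
        pkl.dump(model, model_file)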

              precision    recall  f1-score   support

           1       0.95      0.93      0.94       524
           2       0.88      0.90      0.89       626
           3       0.88      0.88      0.88       543
           4       0.97      0.97      0.97       657
           5       1.00      1.00      1.00       804
           6       0.79      0.79      0.79       484
           7       0.80      0.79      0.80       514

    accuracy                           0.90      4152
   macro avg       0.89      0.89      0.89      4152
weighted avg       0.90      0.90      0.90      4152

Final Score:  0.9046242774566474
Saving . . .
Accuracy Score: 0.9588150289017341
Model Name: GradientBoostingClassifier(max_depth=4, min_samples_leaf=2, n_estimators=200,
                           subsample=0.9)
Time:   2024-02-08 00:52:24.944035

Report:               precision    recall  f1-score   support

           1       0.98      0.98      0.98       524
           2       0.93      0.94    
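
To reuse the saved pipeline later, it can be loaded back with `pickle.load`. The sketch below shows a save and load round trip on a tiny toy model written to a temporary directory; the toy data and the small `n_estimators` value are illustrative, and the notebook's own `./model/model.pkl` path is not touched. Pickled scikit-learn models should be loaded with the same scikit-learn version that saved them, and only from trusted sources.

```python
import pickle as pkl
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Tiny toy problem standing in for the real training data
X = np.array([[0.0], [1.0], [2.0], [3.0]] * 5)
y = np.array([1, 1, 2, 2] * 5)

pipeline = Pipeline([
    ("model", GradientBoostingClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    # Save the fitted pipeline
    with open(path, "wb") as f:
        pkl.dump(pipeline, f)
    # Load it back as a new object
    with open(path, "rb") as f:
        restored = pkl.load(f)

# The restored pipeline should reproduce the original predictions
same = np.array_equal(pipeline.predict(X), restored.predict(X))
print(same)
```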

In [36]:
# Save the output dataframe to file

predictions = pipeline.predict(test_data)
# Map encoded labels (1 to 7) back to class names; the ids are hardcoded and
# assume the test rows are numbered consecutively from 20758
output_data = pd.DataFrame({
    'id': range(20758, 20758 + len(predictions)),
    'NObeyesdad': [target[pred - 1] for pred in predictions]
})
output_data.to_csv('./data/output.csv', index=False)

In [38]:
output_data

| | id | NObeyesdad |
|---|---|---|
| 0 | 20758 | Obesity_Type_II |
| 1 | 20759 | Overweight_Level_I |
| 2 | 20760 | Obesity_Type_III |
| 3 | 20761 | Obesity_Type_I |
| 4 | 20762 | Obesity_Type_III |
| ... | ... | ... |
| 13835 | 34593 | Overweight_Level_II |
| 13836 | 34594 | Overweight_Level_I |
| 13837 | 34595 | Insufficient_Weight |
| 13838 | 34596 | Normal_Weight |
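
Instead of hardcoding the starting id, the ids could be kept from `test.csv` before the `id` column is dropped and then reused for the submission. The sketch below illustrates this with the first three ids and predictions shown above; the `test_ids` variable is a hypothetical name, and the encoded predictions are the labels that correspond to those three class names.

```python
import pandas as pd

target = ['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I',
          'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I',
          'Overweight_Level_II']

# In the notebook this would be test_data['id'], saved before dropping the column
test_ids = pd.Series([20758, 20759, 20760], name="id")

# Encoded predictions (1 to 7) for the same three rows
predictions = [4, 6, 5]

# Build the submission with the ids taken from the test file
submission = pd.DataFrame({
    "id": test_ids,
    "NObeyesdad": [target[pred - 1] for pred in predictions],
})
print(submission)
```

The resulting frame has the two columns expected by `sample_submission.csv` and can be written with `to_csv(..., index=False)`.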
