We're going to do some dimension reduction, so let's pip install UMAP.

In [1]:
!pip install --quiet umap-learn
print('umap install complete.')

umap install complete.


Let's load up the data and do a little feature engineering.

In [2]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

XLSX = '/kaggle/input/obesity-dataset/Obesity_Dataset.xlsx'

df = pd.read_excel(io=XLSX, sheet_name='Obesity_Dataset ', )
df['Class'] = df['Class'].map({1: 'A', 2: 'B', 3: 'C', 4: 'D'})
preprocessor = ColumnTransformer(
    [
        ('age', StandardScaler(), ['Age']),
        ('height', StandardScaler(), ['Height']),
    ],
    verbose_feature_names_out=False,
)
df[['Age', 'Height', ]] = preprocessor.fit_transform(X=df.drop(columns=['Class']), y=df['Class'])
df = pd.get_dummies(data=df, columns=['Sex'])


df.head()

,Age,Height,Overweight_Obese_Family,Consumption_of_Fast_Food,Frequency_of_Consuming_Vegetables,Number_of_Main_Meals_Daily,Food_Intake_Between_Meals,Smoking,Liquid_Intake_Daily,Calculation_of_Calorie_Intake,Physical_Excercise,Schedule_Dedicated_to_Technology,Type_of_Transportation_Used,Class,Sex_1,Sex_2
0,-1.537377,-1.597215,2,2,3,1,3,2,1,2,3,3,4,B,False,True
1,-1.537377,-1.221152,2,2,3,1,1,2,1,2,1,3,3,B,False,True
2,-1.537377,-1.095798,2,2,2,1,3,2,3,2,2,3,4,B,False,True
3,-1.537377,-0.719736,2,2,2,2,2,2,2,2,1,3,4,B,False,True
4,-1.537377,-0.343673,2,1,2,1,3,2,1,2,3,3,2,B,False,True


What kind of data do we have?

In [3]:
df.dtypes

Age                                  float64
Height                               float64
Overweight_Obese_Family                int64
Consumption_of_Fast_Food               int64
Frequency_of_Consuming_Vegetables      int64
Number_of_Main_Meals_Daily             int64
Food_Intake_Between_Meals              int64
Smoking                                int64
Liquid_Intake_Daily                    int64
Calculation_of_Calorie_Intake          int64
Physical_Excercise                     int64
Schedule_Dedicated_to_Technology       int64
Type_of_Transportation_Used            int64
Class                                 object
Sex_1                                   bool
Sex_2                                   bool
dtype: object

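All of our features are numeric or boolean; only the target variable is stored as an object column. Note that most of the int64 columns are integer-coded survey answers rather than continuous measurements.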

Are the classes in our target variable balanced?

In [4]:
df['Class'].value_counts().to_dict()

{'B': 658, 'C': 592, 'D': 287, 'A': 73}

If we use a dummy model that always guesses the largest class, how accurate will it be?

In [5]:
print('{:5.4f}'.format(len(df[df['Class'] == 'B'])/len(df)))

0.4087


A dummy model that always guesses B will be accurate about 41% of the time.
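
The same baseline can be reproduced with scikit-learn's DummyClassifier. This is only a sketch: it rebuilds the label vector from the class counts printed above rather than from the real dataframe, and the placeholder feature matrix X is an assumption (the most-frequent strategy ignores the features anyway).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() output above.
counts = {'B': 658, 'C': 592, 'D': 287, 'A': 73}

# Rebuild a label vector with the same class distribution.
y = np.repeat(list(counts.keys()), list(counts.values()))

# Placeholder features (an assumption): the dummy strategy never looks at them.
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = dummy.score(X, y)
print('{:5.4f}'.format(baseline_accuracy))
```

Any real model should beat this number to be worth keeping.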

Our classes are substantially unbalanced, which suggests this may be a difficult problem. Let's use dimension reduction to add x/y coordinates so we can visualize our data.

In [6]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df.drop(columns=['Class']))
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:16.444025


In [7]:
from plotly import express

express.scatter(data_frame=df, x='x', y='y', color='Class', hover_data=['Sex_1', 'Sex_2'], height=800)

What do we see? UMAP easily separates men from women. Beyond that, the plot suggests that a model will easily distinguish class B from class C instances, but that separating all four classes may be challenging. Let's build a model.

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['x', 'y', 'Class']), df['Class'], test_size=0.2, random_state=2024, stratify=df['Class'])

svc = SVC(kernel='linear', C=0.025, random_state=2024).fit(X_train, y_train)
# predict once and reuse the result for every metric
svc_pred = svc.predict(X=X_test)
print('accuracy: {:5.4f} f1: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=svc_pred), f1_score(average='weighted', y_true=y_test, y_pred=svc_pred, zero_division=0)))
print(classification_report(y_true=y_test, y_pred=svc_pred, zero_division=0))

accuracy: 0.7578 f1: 0.7337
              precision    recall  f1-score   support

           A       0.00      0.00      0.00        15
           B       0.79      0.91      0.85       132
           C       0.72      0.80      0.76       118
           D       0.77      0.53      0.62        57

    accuracy                           0.76       322
   macro avg       0.57      0.56      0.56       322
weighted avg       0.72      0.76      0.73       322



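As expected, our model does better with the large classes (B and C) than it does with the small classes (A and D). It never correctly identifies a class A instance: recall is 0.00 on the 15 class A rows in the test set.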

We can do a little better on weighted F1, and noticeably better on class A, with logistic regression, although overall accuracy is slightly lower.

In [9]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter=500, tol=1e-12).fit(X_train, y_train)
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
# predict once and reuse the result for every metric
logreg_pred = logreg.predict(X=X_test)
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=logreg_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=logreg_pred, zero_division=0)))
print(classification_report(y_true=y_test, y_pred=logreg_pred, zero_division=0))

model fit in 369 iterations
accuracy: 0.7453
f1: 0.7385
              precision    recall  f1-score   support

           A       0.45      0.33      0.38        15
           B       0.78      0.87      0.82       132
           C       0.74      0.75      0.74       118
           D       0.72      0.54      0.62        57

    accuracy                           0.75       322
   macro avg       0.67      0.63      0.64       322
weighted avg       0.74      0.75      0.74       322
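
To see why the two models look close on weighted F1 but further apart on macro F1, here is a rough sketch that recomputes both averages from the per-class rows of the two reports above. The helper names macro_f1 and weighted_f1 are made up for illustration, and because the inputs are the two-decimal values printed in the reports, the results only approximate the unrounded scores.

```python
# Per-class (f1, support) pairs copied from the two classification reports above.
svc_report = {'A': (0.00, 15), 'B': (0.85, 132), 'C': (0.76, 118), 'D': (0.62, 57)}
logreg_report = {'A': (0.38, 15), 'B': (0.82, 132), 'C': (0.74, 118), 'D': (0.62, 57)}


def macro_f1(report):
    # Unweighted mean: every class counts equally, however small it is.
    return sum(f1 for f1, _ in report.values()) / len(report)


def weighted_f1(report):
    # Support-weighted mean: large classes dominate the result.
    total = sum(support for _, support in report.values())
    return sum(f1 * support for f1, support in report.values()) / total


for name, report in [('svc', svc_report), ('logreg', logreg_report)]:
    print('{}: macro {:5.4f} weighted {:5.4f}'.format(name, macro_f1(report), weighted_f1(report)))
```

Class A has only 15 test rows, so its jump from 0.00 to 0.38 barely moves the weighted average but lifts the macro average from roughly 0.56 to 0.64.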

