# 1. Business Understanding

From Kaggle: (https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster)
## Dataset Description

The data contain estimates of obesity levels for people from Mexico, Peru and Colombia, aged between 14 and 61, with diverse eating habits and physical condition. The data was collected using a web platform with a survey where anonymous users answered each question; the information was then processed to obtain 17 attributes and 2111 records.

The attributes related to eating habits are:

- Frequent consumption of high caloric food (FAVC)
- Frequency of consumption of vegetables (FCVC)
- Number of main meals (NCP)
- Consumption of food between meals (CAEC)
- Consumption of water daily (CH2O)
- Consumption of alcohol (CALC)

The attributes related to physical condition are:

- Calories consumption monitoring (SCC)
- Physical activity frequency (FAF)
- Time using technology devices (TUE)
- Transportation used (MTRANS)

Other variables obtained: Gender, Age, Height and Weight.

NObesity values (BMI ranges) are:

- Underweight: less than 18.5
- Normal: 18.5 to 24.9
- Overweight: 25.0 to 29.9
- Obesity I: 30.0 to 34.9
- Obesity II: 35.0 to 39.9
- Obesity III: higher than 40

The data contains numerical and continuous data, so it can be used for analysis based on algorithms of classification, prediction, segmentation and association. Data is available in CSV format.

Files:

- train.csv: the training dataset; NObeyesdad is the categorical target
- test.csv: the test dataset; your objective is to predict the class of NObeyesdad for each row

# 2. Data Understanding

In [13]:
# Importing Libraries

import pandas as pd

# Scoring
from sklearn.metrics import accuracy_score

# Basics for pipeline class
from sklearn.base import BaseEstimator, TransformerMixin

In [4]:
# Importing Datasets

df_train = pd.read_csv('../data/train.csv')

Let's explore the dataset:

In [6]:
df_train

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,0,Male,24.443011,1.699998,81.669950,yes,yes,2.000000,2.983297,Sometimes,no,2.763573,no,0.000000,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,1,Female,18.000000,1.560000,57.000000,yes,yes,2.000000,3.000000,Frequently,no,2.000000,no,1.000000,1.000000,no,Automobile,Normal_Weight
2,2,Female,18.000000,1.711460,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,3,Female,20.952737,1.710730,131.274851,yes,yes,3.000000,3.000000,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,20753,Male,25.137087,1.766626,114.187096,yes,yes,2.919584,3.000000,Sometimes,no,2.151809,no,1.330519,0.196680,Sometimes,Public_Transportation,Obesity_Type_II
20754,20754,Male,18.000000,1.710000,50.000000,no,yes,3.000000,4.000000,Frequently,no,1.000000,no,2.000000,1.000000,Sometimes,Public_Transportation,Insufficient_Weight
20755,20755,Male,20.101026,1.819557,105.580491,yes,yes,2.407817,3.000000,Sometimes,no,2.000000,no,1.158040,1.198439,no,Public_Transportation,Obesity_Type_II
20756,20756,Male,33.852953,1.700000,83.520113,yes,yes,2.671238,1.971472,Sometimes,no,2.144838,no,0.000000,0.973834,no,Automobile,Overweight_Level_II


In [7]:
df_train.describe()

Unnamed: 0,id,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
count,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0,20758.0
mean,10378.5,23.841804,1.700245,87.887768,2.445908,2.761332,2.029418,0.981747,0.616756
std,5992.46278,5.688072,0.087312,26.379443,0.533218,0.705375,0.608467,0.838302,0.602113
min,0.0,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,5189.25,20.0,1.631856,66.0,2.0,3.0,1.792022,0.008013,0.0
50%,10378.5,22.815416,1.7,84.064875,2.393837,3.0,2.0,1.0,0.573887
75%,15567.75,26.0,1.762887,111.600553,3.0,3.0,2.549617,1.587406,1.0
max,20757.0,61.0,1.975663,165.057269,3.0,4.0,3.0,3.0,2.0


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20758 non-null  int64  
 1   Gender                          20758 non-null  object 
 2   Age                             20758 non-null  float64
 3   Height                          20758 non-null  float64
 4   Weight                          20758 non-null  float64
 5   family_history_with_overweight  20758 non-null  object 
 6   FAVC                            20758 non-null  object 
 7   FCVC                            20758 non-null  float64
 8   NCP                             20758 non-null  float64
 9   CAEC                            20758 non-null  object 
 10  SMOKE                           20758 non-null  object 
 11  CH2O                            20758 non-null  float64
 12  SCC                             

In [9]:
df_train['NObeyesdad'].value_counts()

Obesity_Type_III       4046
Obesity_Type_II        3248
Normal_Weight          3082
Obesity_Type_I         2910
Insufficient_Weight    2523
Overweight_Level_II    2522
Overweight_Level_I     2427
Name: NObeyesdad, dtype: int64

In [10]:
df_train.groupby(['NObeyesdad'])['Weight'].agg('mean').reset_index()

Unnamed: 0,NObeyesdad,Weight
0,Insufficient_Weight,49.860773
1,Normal_Weight,61.533289
2,Obesity_Type_I,92.371026
3,Obesity_Type_II,115.995914
4,Obesity_Type_III,117.697452
5,Overweight_Level_I,74.228266
6,Overweight_Level_II,82.085513


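The columns listed above show no missing values, and the numeric ranges look plausible (for example, Age runs from 14 to 61), so the data needs little cleaning. Let's move to the next step of CRISP-DM.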
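
One way to back up this impression is to count missing values and duplicated rows explicitly. The sketch below uses a tiny hand-made frame (`df_demo`, a hypothetical stand-in for `df_train`) so that it runs on its own; the same two calls can be applied to `df_train`.

```python
import pandas as pd

# Hypothetical stand-in for df_train, with one missing value and one duplicated row
df_demo = pd.DataFrame({
    'Height': [1.70, 1.56, 1.56, 1.80],
    'Weight': [81.7, 57.0, 57.0, None],
})

# Missing values per column
missing_per_column = df_demo.isna().sum()

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df_demo.duplicated().sum())

print(missing_per_column)
print('duplicated rows:', n_duplicates)
```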

# 3. Data Preparation

Next, let's evaluate the BMI value.



## Calculating the BMI

BMI stands for *Body Mass Index*. It is a measurement that uses an individual's height and weight to assess whether they fall within healthy weight ranges.

The BMI formula is as follows:

$$ BMI = \frac{{\text{weight in kilograms}}}{{(\text{height in meters})^2}} $$

The BMI result is often categorized into ranges indicating whether a person is underweight, normal weight, overweight, or obese. However, it's important to note that BMI has limitations as it does not consider factors such as body fat distribution, body composition, or other health-relevant aspects. While BMI is a useful tool for population assessments, it is not a perfect measure for individual health. Clinical context and guidance from healthcare professionals are essential for a comprehensive evaluation.
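
As a quick check of the formula, the sketch below computes the BMI for the second row of the training table shown earlier (weight 57.0 kg, height 1.56 m). The helper name `bmi` is illustrative only.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

# Second row of the training data: Weight = 57.0, Height = 1.56
value = bmi(57.0, 1.56)
print(round(value, 2))  # 23.42
```

A value of about 23.42 falls in the 18.5 to 24.9 range, which agrees with that row's `Normal_Weight` label.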

In [11]:
# Calculating BMI

df_BMI = df_train.copy()

df_BMI['BMI'] = df_BMI['Weight'] / (df_BMI['Height'] * df_BMI['Height'])

df_BMI

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI
0,0,Male,24.443011,1.699998,81.669950,yes,yes,2.000000,2.983297,Sometimes,no,2.763573,no,0.000000,0.976473,Sometimes,Public_Transportation,Overweight_Level_II,28.259565
1,1,Female,18.000000,1.560000,57.000000,yes,yes,2.000000,3.000000,Frequently,no,2.000000,no,1.000000,1.000000,no,Automobile,Normal_Weight,23.422091
2,2,Female,18.000000,1.711460,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight,17.126706
3,3,Female,20.952737,1.710730,131.274851,yes,yes,3.000000,3.000000,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III,44.855798
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II,25.599151
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,20753,Male,25.137087,1.766626,114.187096,yes,yes,2.919584,3.000000,Sometimes,no,2.151809,no,1.330519,0.196680,Sometimes,Public_Transportation,Obesity_Type_II,36.587084
20754,20754,Male,18.000000,1.710000,50.000000,no,yes,3.000000,4.000000,Frequently,no,1.000000,no,2.000000,1.000000,Sometimes,Public_Transportation,Insufficient_Weight,17.099278
20755,20755,Male,20.101026,1.819557,105.580491,yes,yes,2.407817,3.000000,Sometimes,no,2.000000,no,1.158040,1.198439,no,Public_Transportation,Obesity_Type_II,31.889841
20756,20756,Male,33.852953,1.700000,83.520113,yes,yes,2.671238,1.971472,Sometimes,no,2.144838,no,0.000000,0.973834,no,Automobile,Overweight_Level_II,28.899693


We can evaluate the obesity level by the value of the BMI:

- Insufficient_Weight Less than 18.5
- Normal_Weight from 18.5 to 24.9
- Overweight_Level_I from 25.0 to 27.4
- Overweight_Level_II from 27.5 to 29.9
- Obesity_Type_I from 30.0 to 34.9
- Obesity_Type_II from 35.0 to 39.9
- Obesity_Type_III Higher than 40

In [29]:
df_BMI['NObeyesdad_2'] = ""

for i in range(df_BMI.shape[0]):
    if df_BMI['BMI'][i] < 18.5:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Insufficient_Weight'
    elif df_BMI['BMI'][i] < 25.0:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Normal_Weight'
    elif df_BMI['BMI'][i] < 27.4:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Overweight_Level_I'
    elif df_BMI['BMI'][i] < 30.0:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Overweight_Level_II'
    elif df_BMI['BMI'][i] < 35.0:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_I'
    elif df_BMI['BMI'][i] < 40.0:
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_II'
    else:  # BMI of 40.0 or higher; a strict `> 40.0` test would leave exactly 40.0 unlabeled
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_III'
df_BMI

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI,NObeyesdad_2
0,0,Male,24.443011,1.699998,81.669950,yes,yes,2.000000,2.983297,Sometimes,no,2.763573,no,0.000000,0.976473,Sometimes,Public_Transportation,Overweight_Level_II,28.259565,Overweight_Level_II
1,1,Female,18.000000,1.560000,57.000000,yes,yes,2.000000,3.000000,Frequently,no,2.000000,no,1.000000,1.000000,no,Automobile,Normal_Weight,23.422091,Normal_Weight
2,2,Female,18.000000,1.711460,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight,17.126706,Insufficient_Weight
3,3,Female,20.952737,1.710730,131.274851,yes,yes,3.000000,3.000000,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III,44.855798,Obesity_Type_III
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II,25.599151,Overweight_Level_I
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,20753,Male,25.137087,1.766626,114.187096,yes,yes,2.919584,3.000000,Sometimes,no,2.151809,no,1.330519,0.196680,Sometimes,Public_Transportation,Obesity_Type_II,36.587084,Obesity_Type_II
20754,20754,Male,18.000000,1.710000,50.000000,no,yes,3.000000,4.000000,Frequently,no,1.000000,no,2.000000,1.000000,Sometimes,Public_Transportation,Insufficient_Weight,17.099278,Insufficient_Weight
20755,20755,Male,20.101026,1.819557,105.580491,yes,yes,2.407817,3.000000,Sometimes,no,2.000000,no,1.158040,1.198439,no,Public_Transportation,Obesity_Type_II,31.889841,Obesity_Type_I
20756,20756,Male,33.852953,1.700000,83.520113,yes,yes,2.671238,1.971472,Sometimes,no,2.144838,no,0.000000,0.973834,no,Automobile,Overweight_Level_II,28.899693,Overweight_Level_II
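
The row-by-row loop above can also be written as a single vectorized call. The following is a minimal sketch, assuming the cut-offs from the list above and using `pd.cut` with `right=False` so that each bin includes its lower bound (a BMI of exactly 25.0 lands in `Overweight_Level_I`). The frame `df_demo` and the names `bmi_edges` and `bmi_labels` are hypothetical, and the Level I/II boundary is set to 27.5 here, whereas the loop uses 27.4.

```python
import numpy as np
import pandas as pd

# Bin edges and labels; intervals are closed on the left: [18.5, 25.0), [25.0, 27.5), ...
bmi_edges = [-np.inf, 18.5, 25.0, 27.5, 30.0, 35.0, 40.0, np.inf]
bmi_labels = [
    'Insufficient_Weight', 'Normal_Weight',
    'Overweight_Level_I', 'Overweight_Level_II',
    'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III',
]

# Hypothetical stand-in for df_BMI, using BMI values from the table above plus 40.0
df_demo = pd.DataFrame({'BMI': [28.259565, 23.422091, 17.126706, 44.855798, 40.0]})

# right=False makes each bin include its lower edge, so 40.0 maps to Obesity_Type_III
df_demo['NObeyesdad_2'] = pd.cut(
    df_demo['BMI'], bins=bmi_edges, labels=bmi_labels, right=False
).astype(str)

print(df_demo)
```

Besides being shorter, this avoids one `.loc` assignment per row, which matters on a frame with 20758 rows.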


Now let's compare the BMI-based classification with the original classification from the dataset.

In [30]:
pred_BMI = df_BMI['NObeyesdad_2']
y_test = df_BMI['NObeyesdad']

print('BMI accuracy: {}'.format(accuracy_score(y_test, pred_BMI)))

BMI accuracy: 0.7764717217458329
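
A single accuracy figure hides which classes disagree. A cross-tabulation of the two label columns shows this directly. The sketch below uses a small hypothetical frame (`df_demo`) in place of `df_BMI`, and computes accuracy by hand as the share of rows where the two labels match, which is the same quantity `accuracy_score` reports.

```python
import pandas as pd

# Hypothetical stand-in for df_BMI with the original and the BMI-derived labels
df_demo = pd.DataFrame({
    'NObeyesdad': ['Normal_Weight', 'Overweight_Level_II', 'Obesity_Type_II', 'Obesity_Type_II'],
    'NObeyesdad_2': ['Normal_Weight', 'Overweight_Level_I', 'Obesity_Type_II', 'Obesity_Type_I'],
})

# Rows: original label, columns: BMI-derived label; mismatched pairs are disagreements
confusion = pd.crosstab(df_demo['NObeyesdad'], df_demo['NObeyesdad_2'])

# Accuracy is the fraction of rows where both labels agree
accuracy = float((df_demo['NObeyesdad'] == df_demo['NObeyesdad_2']).mean())

print(confusion)
print('accuracy:', accuracy)
```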


Using BMI alone, the classification agrees with the dataset's original labels in about 77.6% of rows. Adjusting the BMI cut-offs to fit this dataset gives a better result.
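
How the adjusted cut-offs are found is not shown here. One plausible approach, sketched as an assumption rather than the method actually used, is to scan candidate values for a single boundary and keep the one that agrees best with the original labels. The data in `df_demo` and the helper `boundary_accuracy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-class example: find the best BMI boundary between Level I and Level II
df_demo = pd.DataFrame({
    'BMI': [25.3, 26.1, 26.7, 27.0, 27.2, 28.4, 29.5],
    'label': ['Overweight_Level_I'] * 3 + ['Overweight_Level_II'] * 4,
})

def boundary_accuracy(df: pd.DataFrame, cutoff: float) -> float:
    """Share of rows labeled correctly when BMI < cutoff means Level I."""
    predicted = np.where(df['BMI'] < cutoff, 'Overweight_Level_I', 'Overweight_Level_II')
    return float((predicted == df['label']).mean())

# Scan candidate cut-offs in steps of 0.1 and keep the first one with the best score
candidates = np.round(np.arange(25.0, 30.0, 0.1), 1)
scores = [boundary_accuracy(df_demo, c) for c in candidates]
best_cutoff = float(candidates[int(np.argmax(scores))])

print(best_cutoff, max(scores))
```

Cut-offs tuned on the same rows that are used to measure accuracy give an optimistic estimate; a held-out split would give a fairer number.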

In [105]:
df_BMI['NObeyesdad_2'] = ""

for i in range(df_BMI.shape[0]):
    if df_BMI['BMI'][i] < 18.5: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Insufficient_Weight'
    elif df_BMI['BMI'][i] < 25.0: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Normal_Weight'
    elif df_BMI['BMI'][i] < 26.9: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Overweight_Level_I'
    elif df_BMI['BMI'][i] < 30.0: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Overweight_Level_II'
    elif df_BMI['BMI'][i] < 34.3: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_I'
    elif df_BMI['BMI'][i] < 39.2: # Optimal
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_II'
    else: # Optimal; covers a BMI of 39.2 or higher, so exactly 39.2 is not left unlabeled
        df_BMI.loc[i, "NObeyesdad_2"] = 'Obesity_Type_III'

In [107]:
pred_BMI = df_BMI['NObeyesdad_2']
y_test = df_BMI['NObeyesdad']

print('BMI accuracy: {}'.format(accuracy_score(y_test, pred_BMI)))

BMI accuracy: 0.8065324212351864


With the adjusted cut-offs, BMI alone reaches about 80.7% accuracy.

In [None]:
# Transformer that adds a BMI column computed from Height and Weight
class CalculateBMI(BaseEstimator, TransformerMixin):
  '''Add a BMI column computed from the Weight and Height columns of X.'''
  def __init__(self, df=None):
    # df is still accepted but no longer needed: BMI is computed from the
    # frame passed to transform, so train and test data are handled alike
    self.df = df

  def fit(self, X=None, y=None):
    # Nothing to learn; BMI is a row-wise function of Height and Weight
    return self

  def transform(self, X, y=None):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    X['BMI'] = X['Weight'] / (X['Height'] * X['Height'])
    return X
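
A transformer like this is typically used as one step of a scikit-learn `Pipeline`. The sketch below shows the same calculation with `FunctionTransformer`, which computes BMI from whatever frame is passed in, so it applies equally to training and test data. The function `add_bmi` and the frame `df_demo` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_bmi(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X with a BMI column computed from Weight and Height."""
    X = X.copy()
    X['BMI'] = X['Weight'] / X['Height'] ** 2
    return X

# Hypothetical pipeline with a single step; encoders and a model would follow it
pipeline = Pipeline([('bmi', FunctionTransformer(add_bmi))])

# Two rows with Height and Weight values taken from the training table
df_demo = pd.DataFrame({'Height': [1.56, 1.71], 'Weight': [57.0, 50.0]})
result = pipeline.fit_transform(df_demo)

print(result)
```

Because `add_bmi` works on a copy, `df_demo` itself is left without a `BMI` column.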