# **Dataset:** [Multi-Class Prediction of Obesity Risk Dataset](https://www.kaggle.com/competitions/playground-series-s4e2)

## **Questions to answer:**
- (Classification) How important is each variable in predicting an individual’s risk of having cardiovascular disease?
- (Classification) Is there a way to categorise individuals' level of obesity based on information about health and daily habits?

## **IMPORT NECESSARY LIBRARIES**

In [1]:
import numpy as np
import pandas as pd
import seaborn as sb

from pathlib import Path

## **IMPORT DATA**

In [2]:
# Load the dataset
train_set = "train.csv"
data = pd.read_csv(train_set)
data

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,0,Male,24.443011,1.699998,81.669950,yes,yes,2.000000,2.983297,Sometimes,no,2.763573,no,0.000000,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,1,Female,18.000000,1.560000,57.000000,yes,yes,2.000000,3.000000,Frequently,no,2.000000,no,1.000000,1.000000,no,Automobile,Normal_Weight
2,2,Female,18.000000,1.711460,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,3,Female,20.952737,1.710730,131.274851,yes,yes,3.000000,3.000000,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,20753,Male,25.137087,1.766626,114.187096,yes,yes,2.919584,3.000000,Sometimes,no,2.151809,no,1.330519,0.196680,Sometimes,Public_Transportation,Obesity_Type_II
20754,20754,Male,18.000000,1.710000,50.000000,no,yes,3.000000,4.000000,Frequently,no,1.000000,no,2.000000,1.000000,Sometimes,Public_Transportation,Insufficient_Weight
20755,20755,Male,20.101026,1.819557,105.580491,yes,yes,2.407817,3.000000,Sometimes,no,2.000000,no,1.158040,1.198439,no,Public_Transportation,Obesity_Type_II
20756,20756,Male,33.852953,1.700000,83.520113,yes,yes,2.671238,1.971472,Sometimes,no,2.144838,no,0.000000,0.973834,no,Automobile,Overweight_Level_II


*Quick overview*

In [3]:
display('Data set:', data.head())

'Data set:'

Unnamed: 0,id,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,0,Male,24.443011,1.699998,81.66995,yes,yes,2.0,2.983297,Sometimes,no,2.763573,no,0.0,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,1,Female,18.0,1.56,57.0,yes,yes,2.0,3.0,Frequently,no,2.0,no,1.0,1.0,no,Automobile,Normal_Weight
2,2,Female,18.0,1.71146,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,3,Female,20.952737,1.71073,131.274851,yes,yes,3.0,3.0,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II


*Data description:*

| Abbreviation   | Full Form                                 |
|:---------------|:------------------------------------------|
| FAVC           | Frequent consumption of high caloric food |
| FCVC           | Frequency of consumption of vegetables    |
| NCP            | Number of main meals                      |
| CAEC           | Consumption of food between meals         |
| CH2O           | Consumption of water daily                |
| CALC           | Consumption of alcohol                    |
| SCC            | Calories consumption monitoring           |
| FAF            | Physical activity frequency               |
| TUE            | Time using technology devices             |
| MTRANS         | Transportation used                       |
| NObeyesdad     | Obesity level (target class)              |

## **DATA PREPARATION**

### **DATA CLEANING**

#### > CHECK FOR NULL VALUES

In [4]:
if data.isnull().any().any():
    print("There are null values in the DataFrame.")
else:
    print("There are no null values in the DataFrame.")

There are no null values in the DataFrame.
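
The check above only reports whether any value is missing. If it ever reported nulls, a per-column count would show where they are. A minimal sketch on a toy frame with hypothetical values (in the notebook, `toy` would be `data`):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, used only to illustrate the call
toy = pd.DataFrame({
    "Age": [24.4, np.nan, 18.0],
    "Gender": ["Male", "Female", None],
    "Weight": [81.7, 57.0, 50.2],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Show only the columns that actually contain missing values
print(null_counts[null_counts > 0])
```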


#### > DROP REDUNDANT AND DUPLICATE FEATURES

In [5]:
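# Drop the redundant 'id' column (it duplicates the row index)
data.drop(columns=['id'], inplace=True)

# drop_duplicates() returns a new DataFrame, so assign the result back
data = data.drop_duplicates()
data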

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Male,24.443011,1.699998,81.669950,yes,yes,2.000000,2.983297,Sometimes,no,2.763573,no,0.000000,0.976473,Sometimes,Public_Transportation,Overweight_Level_II
1,Female,18.000000,1.560000,57.000000,yes,yes,2.000000,3.000000,Frequently,no,2.000000,no,1.000000,1.000000,no,Automobile,Normal_Weight
2,Female,18.000000,1.711460,50.165754,yes,yes,1.880534,1.411685,Sometimes,no,1.910378,no,0.866045,1.673584,no,Public_Transportation,Insufficient_Weight
3,Female,20.952737,1.710730,131.274851,yes,yes,3.000000,3.000000,Sometimes,no,1.674061,no,1.467863,0.780199,Sometimes,Public_Transportation,Obesity_Type_III
4,Male,31.641081,1.914186,93.798055,yes,yes,2.679664,1.971472,Sometimes,no,1.979848,no,1.967973,0.931721,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,Male,25.137087,1.766626,114.187096,yes,yes,2.919584,3.000000,Sometimes,no,2.151809,no,1.330519,0.196680,Sometimes,Public_Transportation,Obesity_Type_II
20754,Male,18.000000,1.710000,50.000000,no,yes,3.000000,4.000000,Frequently,no,1.000000,no,2.000000,1.000000,Sometimes,Public_Transportation,Insufficient_Weight
20755,Male,20.101026,1.819557,105.580491,yes,yes,2.407817,3.000000,Sometimes,no,2.000000,no,1.158040,1.198439,no,Public_Transportation,Obesity_Type_II
20756,Male,33.852953,1.700000,83.520113,yes,yes,2.671238,1.971472,Sometimes,no,2.144838,no,0.000000,0.973834,no,Automobile,Overweight_Level_II
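
Note that `drop_duplicates()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. Counting duplicates first makes it easy to see whether the step changes anything. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy frame with hypothetical values; the third row repeats the first
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Height": [1.70, 1.56, 1.70],
    "Weight": [81.7, 57.0, 81.7],
})

# Rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(toy), len(deduped))
```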


#### > CREATE A CSV FILE FOR CLEANED DATASET

In [6]:
filepath = Path('data/cleaned.csv')
filepath.parent.mkdir(parents=True, exist_ok=True)  
data.to_csv(filepath) 

### **FEATURE EDITING**

#### > CLASSIFY AGE GROUP

In [7]:
# Classify age into groups: 'Under 20', decade bands ('20-29', ...), 'Above 50'
def age2group(age):
    if age < 20:
        return 'Under 20'
    if age >= 50:
        return 'Above 50'  # note: this group includes age 50 itself
    lowerbound = int(age / 10) * 10
    upperbound = lowerbound + 9
    return f'{lowerbound}-{upperbound}'

data['Age'] = data['Age'].apply(age2group)
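
The same grouping as `age2group` can also be expressed with `pandas.cut`, which avoids a Python-level function call per row. This is an alternative sketch, not the method used in this notebook; the sample ages are hypothetical, and the bin edges are chosen to reproduce the labels above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample ages, including the boundary values 20 and 50
ages = pd.Series([18.0, 20.0, 24.44, 31.64, 49.9, 50.0, 61.0])

# Left-closed bins: [-inf, 20), [20, 30), [30, 40), [40, 50), [50, inf)
bins = [-np.inf, 20, 30, 40, 50, np.inf]
labels = ["Under 20", "20-29", "30-39", "40-49", "Above 50"]

# right=False makes each bin include its lower edge and exclude its upper edge
groups = pd.cut(ages, bins=bins, labels=labels, right=False).astype(str)
print(groups.tolist())
```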

#### > ROUND VALUES

In [8]:
#Round values of each column to the appropriate decimal places
data['FCVC'] = data['FCVC'].round(1)
data['NCP'] = data['NCP'].round(1)
data['FAF'] = data['FAF'].round(2)
data['Height'] = data['Height'].round(2)
data['Weight'] = data['Weight'].round(1)
data['TUE'] = data['TUE'].round(2)
data['CH2O'] = data['CH2O'].round(2)

#Abbreviate features
data.rename(columns={'family_history_with_overweight': 'FHWO'}, inplace=True)

#Print dataset
data

Unnamed: 0,Gender,Age,Height,Weight,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Male,20-29,1.70,81.7,yes,yes,2.0,3.0,Sometimes,no,2.76,no,0.00,0.98,Sometimes,Public_Transportation,Overweight_Level_II
1,Female,Under 20,1.56,57.0,yes,yes,2.0,3.0,Frequently,no,2.00,no,1.00,1.00,no,Automobile,Normal_Weight
2,Female,Under 20,1.71,50.2,yes,yes,1.9,1.4,Sometimes,no,1.91,no,0.87,1.67,no,Public_Transportation,Insufficient_Weight
3,Female,20-29,1.71,131.3,yes,yes,3.0,3.0,Sometimes,no,1.67,no,1.47,0.78,Sometimes,Public_Transportation,Obesity_Type_III
4,Male,30-39,1.91,93.8,yes,yes,2.7,2.0,Sometimes,no,1.98,no,1.97,0.93,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,Male,20-29,1.77,114.2,yes,yes,2.9,3.0,Sometimes,no,2.15,no,1.33,0.20,Sometimes,Public_Transportation,Obesity_Type_II
20754,Male,Under 20,1.71,50.0,no,yes,3.0,4.0,Frequently,no,1.00,no,2.00,1.00,Sometimes,Public_Transportation,Insufficient_Weight
20755,Male,20-29,1.82,105.6,yes,yes,2.4,3.0,Sometimes,no,2.00,no,1.16,1.20,no,Public_Transportation,Obesity_Type_II
20756,Male,30-39,1.70,83.5,yes,yes,2.7,2.0,Sometimes,no,2.14,no,0.00,0.97,no,Automobile,Overweight_Level_II
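
The per-column rounding can also be written as a single call, since `DataFrame.round` accepts a dict that maps column names to decimal places. A minimal sketch using the raw values of row 2 from the original dataset preview; the names `row`, `decimals` and `rounded` are introduced here for illustration only:

```python
import pandas as pd

# Raw values of row 2 from the dataset preview
row = pd.DataFrame({
    "Height": [1.711460], "Weight": [50.165754], "FCVC": [1.880534],
    "NCP": [1.411685], "CH2O": [1.910378], "FAF": [0.866045], "TUE": [1.673584],
})

# Decimal places per column, matching the rounding cell above
decimals = {"FCVC": 1, "NCP": 1, "FAF": 2, "Height": 2,
            "Weight": 1, "TUE": 2, "CH2O": 2}

# Columns not listed in the dict would be left unchanged
rounded = row.round(decimals)
print(rounded)
```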


#### > ADDING AND CALCULATING BMI
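Here we add a new feature `BMI`, calculated as weight (kg) divided by the square of height (m): `BMI = Weight / Height**2`. Note that it is computed from the already rounded `Height` and `Weight` values.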

In [9]:
# Calculate Body Mass Index (BMI), rounded to 2 decimal places
data['BMI'] = (data['Weight']/((data['Height'])**2)).round(2)
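
As a worked check of the formula, the sketch below recomputes BMI for the first three rows using the rounded `Height` and `Weight` values from the dataset preview. The frame name `sample` is introduced here for illustration only.

```python
import pandas as pd

# Rounded Height and Weight of rows 0-2 from the dataset preview
sample = pd.DataFrame({
    "Height": [1.70, 1.56, 1.71],   # metres
    "Weight": [81.7, 57.0, 50.2],   # kilograms
})

# BMI = weight / height squared, rounded to 2 decimal places
sample["BMI"] = (sample["Weight"] / sample["Height"] ** 2).round(2)
print(sample)
```

For row 0 this gives 81.7 / 1.70² = 81.7 / 2.89 ≈ 28.27.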

#### > REORDERING COLUMNS

In [10]:
# Place BMI right after Weight; all other columns keep their order
data = data[['Gender', 'Age', 'Height', 'Weight', 'BMI', 'FHWO', 'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS', 'NObeyesdad']]
data

Unnamed: 0,Gender,Age,Height,Weight,BMI,FHWO,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Male,20-29,1.70,81.7,28.27,yes,yes,2.0,3.0,Sometimes,no,2.76,no,0.00,0.98,Sometimes,Public_Transportation,Overweight_Level_II
1,Female,Under 20,1.56,57.0,23.42,yes,yes,2.0,3.0,Frequently,no,2.00,no,1.00,1.00,no,Automobile,Normal_Weight
2,Female,Under 20,1.71,50.2,17.17,yes,yes,1.9,1.4,Sometimes,no,1.91,no,0.87,1.67,no,Public_Transportation,Insufficient_Weight
3,Female,20-29,1.71,131.3,44.90,yes,yes,3.0,3.0,Sometimes,no,1.67,no,1.47,0.78,Sometimes,Public_Transportation,Obesity_Type_III
4,Male,30-39,1.91,93.8,25.71,yes,yes,2.7,2.0,Sometimes,no,1.98,no,1.97,0.93,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20753,Male,20-29,1.77,114.2,36.45,yes,yes,2.9,3.0,Sometimes,no,2.15,no,1.33,0.20,Sometimes,Public_Transportation,Obesity_Type_II
20754,Male,Under 20,1.71,50.0,17.10,no,yes,3.0,4.0,Frequently,no,1.00,no,2.00,1.00,Sometimes,Public_Transportation,Insufficient_Weight
20755,Male,20-29,1.82,105.6,31.88,yes,yes,2.4,3.0,Sometimes,no,2.00,no,1.16,1.20,no,Public_Transportation,Obesity_Type_II
20756,Male,30-39,1.70,83.5,28.89,yes,yes,2.7,2.0,Sometimes,no,2.14,no,0.00,0.97,no,Automobile,Overweight_Level_II


#### > CREATE A CSV FILE FOR EDITED DATASET

In [11]:
newfile = Path('data/edited-train-set.csv')
newfile.parent.mkdir(parents=True, exist_ok=True)  
data.to_csv(newfile)
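
Because `to_csv` writes the row index as an unnamed first column by default, reading the file back with a plain `read_csv` produces an extra `Unnamed: 0` column. A minimal sketch of the round trip on a toy frame with hypothetical values, written to a temporary directory rather than the real `data/` folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with hypothetical values standing in for the edited dataset
toy = pd.DataFrame({"Gender": ["Male", "Female"], "BMI": [28.27, 23.42]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data" / "edited-train-set.csv"
    path.parent.mkdir(parents=True, exist_ok=True)

    # The index is written as an unnamed first column
    toy.to_csv(path)

    # A plain read turns that column into 'Unnamed: 0'
    naive = pd.read_csv(path)

    # index_col=0 restores it as the index instead
    restored = pd.read_csv(path, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Alternatively, passing `index=False` to `to_csv` leaves the index out of the file altogether.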