<a href="https://colab.research.google.com/github/joliebao/TCS-Data-Science-Obesity/blob/main/TCS_DataScience_Obesity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



---


# **Obesity Data Set**


---




Obesity is a growing global health issue that affects people of all ages and backgrounds. Understanding the relationship between daily habits (such as diet and physical activity) and obesity levels can help identify patterns that contribute to weight gain. This project uses a publicly available dataset on obesity levels based on eating habits and physical condition to explore these patterns through data analysis.

By applying data science techniques such as data cleaning, visualization, and classification, this project aims to uncover how specific behaviors influence body weight categories. The goal is to develop insights that may inform healthier lifestyle choices for teens and young adults. Ultimately, this research demonstrates how data science can be used to address real-world health challenges.



**Key Columns:**

  + Gender: Biological sex of the individual

  + Age: Age in years

  + Height: Height in meters

  + Weight: Weight in kilograms

  + family_history_with_overweight: Family history of overweight

  + FAVC: Eats high-calorie foods frequently

  + FCVC: Frequency of vegetable consumption

  + NCP: Number of daily main meals

  + CAEC: Eats between meals

  + SMOKE: Smokes regularly

  + CH2O: Daily water intake

  + SCC: Monitors daily calorie intake

  + FAF: Weekly physical activity frequency

  + TUE: Daily screen time usage

  + CALC: Alcohol consumption frequency

  + MTRANS: Main transportation method

  + NObeyesdad: Obesity classification label


**Note:** The dataset has already been cleaned and filtered.

***Importing***

In [None]:
# Import libraries and clone the repo

!git clone https://github.com/joliebao/TCS-Data-Science-Obesity

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, accuracy_score

fatal: destination path 'TCS-Data-Science-Obesity' already exists and is not an empty directory.


In [None]:
# Load the dataset from the repo's raw CSV into a pandas DataFrame
url = 'https://raw.githubusercontent.com/joliebao/TCS-Data-Science-Obesity/refs/heads/main/ObesityDataSet.csv'
df = pd.read_csv(url)

***Basic Dataset Exploration***

In [None]:
df.shape            # 2111 rows (entries), 17 columns

(2111, 17)

In [None]:
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [None]:
df.info()         # all filled out! (all non-null)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

In [None]:
df.isna()             # element-wise missing-value mask; every visible entry is False (non-null)

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2107,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2108,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2109,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
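
The boolean table from `df.isna()` only shows the first and last few rows, so it cannot confirm on its own that every cell is filled. A minimal sketch of a more compact check is to count missing values per column. The small `sample` frame below is a stand-in built from the first three rows shown by `df.head()`; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Stand-in for df: a few columns from the first three rows printed by df.head().
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "Age": [21.0, 21.0, 23.0],
    "Height": [1.62, 1.52, 1.80],
    "Weight": [64.0, 56.0, 77.0],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Normal_Weight"],
})

# Count missing values per column, then across the whole frame.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero means no column contains a missing value, which is the claim the `df.info()` output supports for the full dataset.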


In [None]:
df.describe()

Unnamed: 0,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
count,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0
mean,24.3126,1.701677,86.586058,2.419043,2.685628,2.008011,1.010298,0.657866
std,6.345968,0.093305,26.191172,0.533927,0.778039,0.612953,0.850592,0.608927
min,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,19.947192,1.63,65.473343,2.0,2.658738,1.584812,0.124505,0.0
50%,22.77789,1.700499,83.0,2.385502,3.0,2.0,1.0,0.62535
75%,26.0,1.768464,107.430682,3.0,3.0,2.47742,1.666678,1.0
max,61.0,1.98,173.0,3.0,4.0,3.0,3.0,2.0
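
By default `df.describe()` summarizes only the numeric columns, so the categorical columns and the `NObeyesdad` label are absent from the table above. One possible way to cover them is sketched below. The `sample` frame is a stand-in holding four categorical columns from the five rows shown by `df.head()`, so its counts are not those of the full dataset.

```python
import numpy as np
import pandas as pd

# Stand-in for df: categorical columns from the five rows printed by df.head().
sample = pd.DataFrame({
    "CAEC": ["Sometimes"] * 5,
    "CALC": ["no", "Sometimes", "Frequently", "Frequently", "Sometimes"],
    "MTRANS": [
        "Public_Transportation",
        "Public_Transportation",
        "Public_Transportation",
        "Walking",
        "Public_Transportation",
    ],
    "NObeyesdad": [
        "Normal_Weight",
        "Normal_Weight",
        "Normal_Weight",
        "Overweight_Level_I",
        "Overweight_Level_II",
    ],
})

# Summarize non-numeric columns: count, number of unique values, most common value, its frequency.
categorical_summary = sample.describe(exclude=[np.number])

# Count how many rows fall into each obesity class.
class_counts = sample["NObeyesdad"].value_counts()

print(categorical_summary)
print(class_counts)
```

On the full `df`, the class counts would also show whether the obesity categories are balanced, which matters when interpreting classification accuracy later.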




---


***Research/Literature Review***

*Pre-Dataset Exploration*:
+


***Research Question***

*How do behavioral factors affect obesity?*

*How do environmental factors affect obesity?*

***Modeling/Visualizations***

*Possible Relations*

In [None]:
# Suggestions of number of models to test -----
# 1 supervised
# 1 unsupervised

# Recommended Model -----
# SVM Models
# Nearest Neighbors
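
The plan above (one supervised and one unsupervised model) could be sketched as follows. This is only an illustration under stated assumptions: the `toy` frame is synthetic placeholder data, not the real dataset; k-means is an assumed choice for the unsupervised model, since the notes do not name one; and the feature set (`Height`, `Weight`) is picked for simplicity rather than taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (NOT the real dataset): two well-separated weight groups.
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "Height": rng.normal(1.70, 0.05, 2 * n),
    "Weight": np.concatenate([rng.normal(60, 3, n), rng.normal(95, 3, n)]),
    "NObeyesdad": ["Normal_Weight"] * n + ["Overweight_Level_II"] * n,
})

X = toy[["Height", "Weight"]]
y = toy["NObeyesdad"]

# Supervised: k-nearest neighbors on standardized features, scored on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
knn_accuracy = accuracy_score(y_test, knn.predict(X_test))

# Unsupervised: k-means on the same standardized features, ignoring the labels.
kmeans = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
)
cluster_labels = kmeans.fit_predict(X)

print("KNN accuracy on toy data:", knn_accuracy)
print("Cluster sizes:", np.bincount(cluster_labels))
```

To try the SVM option from the notes, `KNeighborsClassifier(n_neighbors=5)` could be swapped for `sklearn.svm.SVC()` in the same pipeline. On the real `df`, the categorical columns would need encoding before either model can use them.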