# <center>**Obesity Prediction Analysis**</center>

## **Business Understanding**

### Introduction

Obesity is a significant global health concern, and its rising rates are particularly notable in countries such as Mexico, Peru, and Colombia. This dataset includes information about individuals from these countries, focusing on their eating habits, physical activity, and other lifestyle factors to predict obesity levels. By leveraging this data, we can help public health initiatives target the root causes of obesity and develop personalized interventions to combat this issue. 

### Problem Statement

The goal of this project is to explore and predict obesity levels based on demographic and lifestyle attributes. Specifically, we aim to answer the following key questions:

- **What factors contribute to obesity?**  
  The dataset contains information on eating habits, physical activity, and lifestyle choices that can help us identify the key factors leading to obesity in this population.
  
- **Can we predict obesity levels based on demographic and lifestyle attributes?**  
  The target variable, `Obesity_level` (stored as the `Obesity` column in the loaded file), allows us to classify individuals into different obesity categories. Our goal is to build a machine learning model capable of accurately predicting these categories.

- **How can these predictions help in public health planning and personalized interventions?**  
  By predicting obesity levels, we can provide valuable information that helps health professionals and policymakers design more effective public health strategies and personalized intervention programs.

### Objectives

#### Main Objective
The primary objective of this project is to develop a predictive model that can accurately classify obesity levels in individuals based on their demographic and lifestyle attributes.

#### Specific Objectives
1. **Data Exploration and Preprocessing**: Explore the dataset, handle missing values, remove outliers, and transform features as necessary for modeling.
2. **Feature Analysis and Selection**: Identify and analyze key features contributing to obesity using exploratory data analysis (EDA).
3. **Model Building and Evaluation**: Build and evaluate multiple machine learning models (e.g., logistic regression, random forests, SVM) to predict obesity levels.
4. **Model Optimization**: Tune the selected model to improve accuracy while checking that it generalizes to data it was not trained on.
5. **Actionable Insights**: Provide valuable insights that can be used by health professionals and policymakers for obesity prevention strategies.

### Metric of Success

The success of this project will be measured using the following metrics:
1. **Accuracy**: The percentage of correctly predicted obesity levels compared to the actual labels.
2. **Precision, Recall, and F1-Score**: Computed per class and then averaged, because this is a multi-class problem and accuracy alone can hide poor performance on individual obesity categories.
3. **AUC-ROC**: Computed one-vs-rest for each class to evaluate how well the model separates that class from the others, which is especially useful when classes are imbalanced.
4. **Model Interpretability**: The ability to interpret the model's predictions and provide actionable insights to stakeholders.
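
As a minimal sketch of how the first two metrics could be computed for a multi-class target, the snippet below derives accuracy and macro-averaged precision, recall, and F1 from two label lists. The function name `macro_scores` and the example predictions are hypothetical and chosen only for illustration; they are not results from this project, and macro averaging is one assumed choice among several.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up labels for illustration only (not model output)
y_true = ["Normal_Weight", "Normal_Weight", "Overweight_Level_I",
          "Overweight_Level_I", "Overweight_Level_II", "Overweight_Level_II"]
y_pred = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_I",
          "Overweight_Level_I", "Overweight_Level_II", "Normal_Weight"]

scores = macro_scores(y_true, y_pred)
print(scores)
```

Macro averaging gives every obesity category the same weight regardless of how many records it has, so a model that ignores a small class is penalized.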








## **Data Understanding**
### About the Dataset

#### Overview

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.

The data contains 17 attributes and 2,111 records. Each record is labeled with the class variable NObesity (Obesity Level), stored as the `Obesity` column in the loaded file, which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.

#### Data Details

- Gender: Gender
- Age: Age
- Height: Height in metres
- Weight: Weight in kilograms
- family_history: Has a family member suffered, or does a family member currently suffer, from being overweight?
- FAVC: Do you eat high-calorie food frequently?
- FCVC: Do you usually eat vegetables in your meals?
- NCP: How many main meals do you have daily?
- CAEC: Do you eat any food between meals?
- SMOKE: Do you smoke?
- CH2O: How much water do you drink daily?
- SCC: Do you monitor the calories you eat daily?
- FAF: How often do you have physical activity?
- TUE: How much time do you spend using technological devices such as cell phones, video games, television, computers, and others?
- CALC: How often do you drink alcohol?
- MTRANS: Which mode of transportation do you usually use?
- Obesity_level (target column, named `Obesity` in the loaded file): Obesity level
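
Because the seven obesity levels have a natural order, one option is to store the target as an ordered categorical. The sketch below is illustrative only: the label spellings `Normal_Weight`, `Overweight_Level_I`, and `Overweight_Level_II` appear in the data preview, while the remaining four spellings in the hypothetical `LEVELS` list are assumptions that should be checked against `df['Obesity'].unique()`.

```python
import pandas as pd

# Assumed label spellings, ordered from lowest to highest weight category
LEVELS = [
    "Insufficient_Weight",
    "Normal_Weight",
    "Overweight_Level_I",
    "Overweight_Level_II",
    "Obesity_Type_I",
    "Obesity_Type_II",
    "Obesity_Type_III",
]

# A few target values for illustration
labels = pd.Series(["Normal_Weight", "Overweight_Level_II", "Overweight_Level_I"])

# Ordered categorical: comparisons and sorting follow the order in LEVELS
ordered = pd.Categorical(labels, categories=LEVELS, ordered=True)

# Integer position of each label within LEVELS
codes = list(ordered.codes)
print(codes)
```

Any label not listed in `LEVELS` would be converted to a missing value with code `-1`, which makes spelling mismatches easy to spot.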

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Load the dataset into a DataFrame
df = pd.read_csv('Obesity prediction.csv', index_col=None)


In [13]:
# Inspect the first few rows of the dataset
df.head()

| | Gender | Age | Height | Weight | family_history | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | Obesity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
| 1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
| 2 | Male | 23.0 | 1.8 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
| 3 | Male | 27.0 | 1.8 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
| 4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |


In [4]:
# Check the number of rows and columns
df.shape

(2111, 17)

In [None]:
# inspect dataset information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          2111 non-null   object 
 1   Age             2111 non-null   float64
 2   Height          2111 non-null   float64
 3   Weight          2111 non-null   float64
 4   family_history  2111 non-null   object 
 5   FAVC            2111 non-null   object 
 6   FCVC            2111 non-null   float64
 7   NCP             2111 non-null   float64
 8   CAEC            2111 non-null   object 
 9   SMOKE           2111 non-null   object 
 10  CH2O            2111 non-null   float64
 11  SCC             2111 non-null   object 
 12  FAF             2111 non-null   float64
 13  TUE             2111 non-null   float64
 14  CALC            2111 non-null   object 
 15  MTRANS          2111 non-null   object 
 16  Obesity         2111 non-null   object 
dtypes: float64(8), object(9)
memory u
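
The `df.info()` output shows eight `float64` columns and nine `object` columns. A minimal sketch of separating the two groups, which is often a first step before encoding and scaling, is shown below. The small `sample` frame is a stand-in built from a few values in the `df.head()` preview; on the full dataset the same calls would be applied to `df`.

```python
import pandas as pd

# Tiny stand-in frame; values copied from the first rows shown by df.head()
sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "Age": [21.0, 21.0, 23.0],
    "Height": [1.62, 1.52, 1.80],
    "CAEC": ["Sometimes", "Sometimes", "Sometimes"],
    "Obesity": ["Normal_Weight", "Normal_Weight", "Normal_Weight"],
})

# Numeric columns (float64 in this dataset)
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical text
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that the target column `Obesity` ends up in the categorical group and would need to be set aside before the features are encoded.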

In [14]:
# inspect dataset stats
df.describe()

| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
|---|---|---|---|---|---|---|---|---|
| count | 2111.0 | 2111.0 | 2111.0 | 2111.0 | 2111.0 | 2111.0 | 2111.0 | 2111.0 |
| mean | 24.3126 | 1.701677 | 86.586058 | 2.419043 | 2.685628 | 2.008011 | 1.010298 | 0.657866 |
| std | 6.345968 | 0.093305 | 26.191172 | 0.533927 | 0.778039 | 0.612953 | 0.850592 | 0.608927 |
| min | 14.0 | 1.45 | 39.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 19.947192 | 1.63 | 65.473343 | 2.0 | 2.658738 | 1.584812 | 0.124505 | 0.0 |
| 50% | 22.77789 | 1.700499 | 83.0 | 2.385502 | 3.0 | 2.0 | 1.0 | 0.62535 |
| 75% | 26.0 | 1.768464 | 107.430682 | 3.0 | 3.0 | 2.47742 | 1.666678 | 1.0 |
| max | 61.0 | 1.98 | 173.0 | 3.0 | 4.0 | 3.0 | 3.0 | 2.0 |


In [15]:
# Check whether the dataset has any missing values
df.isna().any()

Gender            False
Age               False
Height            False
Weight            False
family_history    False
FAVC              False
FCVC              False
NCP               False
CAEC              False
SMOKE             False
CH2O              False
SCC               False
FAF               False
TUE               False
CALC              False
MTRANS            False
Obesity           False
dtype: bool
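
`df.isna().any()` only reports whether a column has missing values. When a dataset does contain gaps, counts and percentages are usually more informative. The sketch below uses a hypothetical helper, `missing_report`, on a small made-up frame with two injected gaps; the obesity dataset itself has no missing values, so this is illustrative only.

```python
import pandas as pd


def missing_report(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    return pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })


# Made-up frame with one gap per column, for illustration only
demo = pd.DataFrame({
    "Age": [21.0, None, 23.0, 27.0],
    "Gender": ["Female", "Female", "Male", None],
})

report = missing_report(demo)
print(report)
```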

In [20]:
# Count fully duplicated rows (rows where every column matches an earlier row)
duplicate_count = df.duplicated().sum()

In [27]:
if duplicate_count > 0:
    duplicate_rows = df[df.duplicated(keep=False)]  # keep=False marks all occurrences of duplicates
    print("Duplicate rows found:")
    print(duplicate_rows)


Duplicate rows found:
     Gender   Age  Height  Weight family_history FAVC  FCVC  NCP        CAEC  \
97   Female  21.0    1.52    42.0             no   no   3.0  1.0  Frequently   
98   Female  21.0    1.52    42.0             no   no   3.0  1.0  Frequently   
105  Female  25.0    1.57    55.0             no  yes   2.0  1.0   Sometimes   
106  Female  25.0    1.57    55.0             no  yes   2.0  1.0   Sometimes   
145    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
174    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
179    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
184    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
208  Female  22.0    1.69    65.0            yes  yes   2.0  3.0   Sometimes   
209  Female  22.0    1.69    65.0            yes  yes   2.0  3.0   Sometimes   
282  Female  18.0    1.62    55.0            yes  yes   2.0  3.0  Frequently   
295  Female  16.0 

In [26]:
# View only the repeated occurrences (the first copy of each duplicated row is excluded)
if duplicate_count > 0:
    duplicate_rows = df[df.duplicated()]
    print("Duplicate rows:")
    print(duplicate_rows)

Duplicate rows:
     Gender   Age  Height  Weight family_history FAVC  FCVC  NCP        CAEC  \
98   Female  21.0    1.52    42.0             no   no   3.0  1.0  Frequently   
106  Female  25.0    1.57    55.0             no  yes   2.0  1.0   Sometimes   
174    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
179    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
184    Male  21.0    1.62    70.0             no  yes   2.0  1.0          no   
209  Female  22.0    1.69    65.0            yes  yes   2.0  3.0   Sometimes   
309  Female  16.0    1.66    58.0             no   no   2.0  1.0   Sometimes   
460  Female  18.0    1.62    55.0            yes  yes   2.0  3.0  Frequently   
467    Male  22.0    1.74    75.0            yes  yes   3.0  3.0  Frequently   
496    Male  18.0    1.72    53.0            yes  yes   2.0  3.0   Sometimes   
527  Female  21.0    1.52    42.0             no  yes   3.0  1.0  Frequently   
659  Female  21.0    1.5
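
The two duplicate listings above differ because of the `keep` argument: `keep=False` marks every copy of a duplicated row, while the default `keep='first'` marks only the later copies. The sketch below shows the difference on a small made-up frame and then removes the extra copies with `drop_duplicates`. Whether dropping is appropriate for this dataset is an assumption to verify, since two different respondents could plausibly give identical answers.

```python
import pandas as pd

# Made-up frame: rows 0-1 are identical, rows 2-4 are identical, row 5 is unique
demo = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "Age": [21.0, 21.0, 21.0, 21.0, 21.0, 22.0],
    "Weight": [42.0, 42.0, 70.0, 70.0, 70.0, 65.0],
})

# keep=False flags every member of a duplicated group
all_copies = demo[demo.duplicated(keep=False)]

# Default keep='first' flags only the second and later copies
extra_copies = demo[demo.duplicated()]

# Keep the first copy of each group and rebuild a clean 0..n-1 index
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(extra_copies), len(deduplicated))
```

The number of rows removed by `drop_duplicates()` equals `df.duplicated().sum()`, which is the same quantity stored in `duplicate_count`.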