### Initial Dataset Evaluation

In [2]:
import pandas as pd

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')

We check the dimensions of the dataset, the data types assigned to each column, and obtain summary statistics for the numerical variables. This provides a first understanding of the dataset's structure and highlights any immediate issues such as incorrect data types or unexpected missing values.

In [3]:
df.info()
df.describe()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
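
In a notebook, only the value of a cell's last expression is rendered automatically, so intermediate results such as a `describe()` call need an explicit `print` or `display` to appear. The following is a minimal sketch of collecting and printing every piece of the overview. It uses a small hypothetical DataFrame (`toy`) in place of the real dataset, and the helper name `overview` is an assumption introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; only a few columns are mimicked.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [21.0, 23.0, 27.0, 22.0],
    "Height": [1.62, 1.80, 1.80, 1.52],
    "Weight": [64.0, 77.0, 87.0, 56.0],
})

def overview(frame):
    """Collect shape, dtypes, numeric summary and missing-value counts."""
    return {
        "shape": frame.shape,              # (rows, columns)
        "dtypes": frame.dtypes,            # dtype assigned to each column
        "summary": frame.describe(),       # statistics for numeric columns only
        "missing": frame.isnull().sum(),   # missing values per column
    }

report = overview(toy)

# Print each part explicitly so nothing is silently discarded.
for name, value in report.items():
    print(f"\n{name}")
    print(value)
```

The same helper could be called on `df` once the CSV has been loaded.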

## Data Cleaning and Preparation

### Outlier Identification
We first separate the numerical features from the categorical ones, since the IQR-based outlier check applies only to numerical columns.

In [4]:
import numpy as np

num_cols = df.select_dtypes(include=[np.number]).columns
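
The cell above keeps only the numerical column names. As a minimal sketch, the complementary categorical list can be obtained with the `exclude` argument of `select_dtypes`. The DataFrame `toy` and the name `cat_cols` are assumptions introduced here for illustration and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset, mixing numeric and text columns.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Age": [21.0, 23.0, 27.0],
    "SMOKE": ["no", "yes", "no"],
    "Height": [1.62, 1.80, 1.80],
})

# Numeric columns: candidates for the outlier check.
num_cols = toy.select_dtypes(include=[np.number]).columns

# Everything else is treated as categorical here.
cat_cols = toy.select_dtypes(exclude=[np.number]).columns

print("Numerical:", list(num_cols))
print("Categorical:", list(cat_cols))
```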

#### IQR Outlier Detection Function

We implement a reusable function that computes the first quartile (Q1), third quartile (Q3), and the interquartile range.  
Values lying more than 1.5 × IQR below Q1 or above Q3 are considered potential outliers.

In [6]:
def detect_outliers_iqr(series):
    """Return indexes of outliers using IQR."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series[(series < lower) | (series > upper)].index

outlier_summary = {}

for col in num_cols:
    # Skip discrete small-range variables
    if df[col].nunique() <= 5:
        outlier_summary[col] = "Skipped (discrete variable)"
        continue
    
    outlier_idx = detect_outliers_iqr(df[col])
    outlier_summary[col] = len(outlier_idx)

print("\nOutlier Report")
for k, v in outlier_summary.items():
    print(f"{k}: {v}")


Outlier Report
Age: 168
Height: 1
Weight: 1
FCVC: 0
NCP: 579
CH2O: 0
FAF: 0
TUE: 0
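
A count alone does not show where the IQR fences fall. The sketch below, using a hypothetical series (`meals`) and a hypothetical helper (`iqr_fences`), illustrates how a column whose values cluster on a single number can produce a zero-width IQR, so that every other value is flagged. This is one possible explanation for a large count such as the one reported for NCP; it has not been verified against the real data here.

```python
import pandas as pd

def iqr_fences(series):
    """Return the (lower, upper) fences of the 1.5 x IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data: eight values at 3.0 plus two values elsewhere.
meals = pd.Series([3.0] * 8 + [1.0, 4.0])

lower, upper = iqr_fences(meals)
flagged = meals[(meals < lower) | (meals > upper)]

# Both quartiles equal 3.0, so the IQR is 0 and the fences coincide.
print(f"fences: [{lower}, {upper}], flagged: {len(flagged)}")
```

Printing the fences next to each count in the outlier report would make it easier to judge whether a flagged value is extreme or merely different from a dominant value.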


### Interpretation of Outliers

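The IQR rule flags values in four variables: NCP (579), Age (168), Height (1), and Weight (1). The remaining numerical variables (FCVC, CH2O, FAF, and TUE) have no flagged values. The large count for NCP may reflect a narrow interquartile range rather than genuinely extreme values, so it should be read with caution. None of the flagged values are removed, as they can represent valid physiological variability. Human biometric data naturally contains extreme cases, and the synthetic component of the dataset may also generate rare but plausible values.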

To preserve the natural variability of the dataset and avoid discarding meaningful information, all outliers are retained for the subsequent analysis and modelling stages.