### Initial Dataset Evaluation

In [1]:
import pandas as pd

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')

We check the dimensions of the dataset, the data types assigned to each column, and obtain summary statistics for the numerical variables. This provides a first understanding of the dataset's structure and highlights any immediate issues such as incorrect data types or unexpected missing values.

In [2]:
df.info()
# Print explicitly: a notebook cell only displays its last bare expression.
print(df.describe())
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

## Data Cleaning and Preparation

### Outliers Identification
Let's check for outliers. First, we separate the numerical features from the categorical ones, since outliers can only occur in numerical features.

In [3]:
import numpy as np

num_cols = df.select_dtypes(include=[np.number]).columns

#### IQR Outlier Detection Function

We implement a reusable function that computes the first quartile (Q1), third quartile (Q3), and the interquartile range.  
Values lying more than 1.5 × IQR below Q1 or above Q3 are considered potential outliers.

In [4]:
def detect_outliers_iqr(series):
    """Return indexes of outliers using IQR."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series[(series < lower) | (series > upper)].index

outlier_summary = {}

for col in num_cols:
    # Skip discrete small-range variables
    if df[col].nunique() <= 5:
        outlier_summary[col] = "Skipped (discrete variable)"
        continue
    
    outlier_idx = detect_outliers_iqr(df[col])
    outlier_summary[col] = len(outlier_idx)

print("\nOutlier Report")
for k, v in outlier_summary.items():
    print(f"{k}: {v}")


Outlier Report
Age: 168
Height: 1
Weight: 1
FCVC: 0
NCP: 579
CH2O: 0
FAF: 0
TUE: 0


### Interpretation of Outliers

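The IQR rule flags values in Age (168), NCP (579), Height (1), and Weight (1). For Age, Height, and Weight these values are not removed, as they can represent valid physiological variability: human biometric data naturally contains extreme cases, and the synthetic component of the dataset may also generate rare but plausible values. The large NCP count probably reflects how tightly the values cluster around 3 meals (the median is 3) rather than genuinely extreme observations, since a narrow IQR places even ordinary values outside the fences.  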

To preserve the natural variability of the dataset and avoid discarding meaningful information, all outliers are retained for the subsequent analysis and modelling stages.
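A high outlier count does not always mean many extreme values. When most observations share one value, Q1 and Q3 coincide, the IQR collapses to zero, and every other value is flagged. The following is a minimal sketch of that effect; the `iqr_fences` helper and the `toy` series are hypothetical and are not taken from the dataset.

```python
import pandas as pd

def iqr_fences(series):
    """Return the (lower, upper) IQR fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical toy column: eight values equal to 3, plus a 1 and a 4.
toy = pd.Series([3.0] * 8 + [1.0, 4.0])

lower, upper = iqr_fences(toy)
flagged = toy[(toy < lower) | (toy > upper)]

# Q1 and Q3 are both 3, so both fences sit at 3 and the 1 and 4 are flagged.
print(f"fences: [{lower}, {upper}], flagged: {len(flagged)}")
```

Comparing `series.quantile([0.25, 0.75])` for a column against its flagged count is a quick way to tell whether a large count comes from a narrow IQR.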

### Min Max Normalization

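Min-max normalization rescales each feature to the [0, 1] range, so that features with a large range such as Weight (39 to 173) do not dominate features with a small range such as Height (1.45 to 1.98). Several columns are stored as strings (`object` dtype), so they must be converted to numbers before scaling: the categorical features are one-hot encoded with `pd.get_dummies`, and the target `NObeyesdad` is converted to integer class labels with `LabelEncoder`. The scaler is fitted on the training split only and then applied to the test split.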
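The rescaling itself can be written as x' = (x - min) / (max - min), with min and max taken per column from the training data. Below is a minimal NumPy sketch of that formula, not the scikit-learn implementation. The `min_max_scale` helper, the toy arrays, and the guard for constant columns are assumptions made for illustration.

```python
import numpy as np

def min_max_scale(train, test):
    """Scale both arrays with the per-column min and range of `train`."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Guard chosen for this sketch: avoid dividing by zero on constant columns.
    col_range[col_range == 0] = 1.0
    return (train - col_min) / col_range, (test - col_min) / col_range

# Hypothetical rows of (Weight, Height).
train = np.array([[39.0, 1.45], [173.0, 1.98], [106.0, 1.715]])
test = np.array([[72.5, 1.45]])

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Because the statistics come from the training rows only, test values can fall outside [0, 1] when they exceed the training range. That is expected and keeps information about the test split out of the scaler.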

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')

target = "NObeyesdad"
# One-hot encode the features only; encoding the full frame would leak
# target-derived dummy columns into X.
X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)

# Encode the class labels as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df[target])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

print("Normalization complete.")


Normalization complete.
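To look at the encoding step in isolation, here is a small pandas-only sketch on a hypothetical three-row frame. The `toy` frame and its `label` column are invented for illustration, and `pd.factorize(..., sort=True)` is used here as a stand-in for `LabelEncoder`; it is not the call used in the cell above.

```python
import pandas as pd

# Hypothetical frame: one string feature, one numeric feature, one label column.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Weight": [64.0, 80.0, 95.0],
    "label": ["Normal_Weight", "Overweight_Level_I", "Normal_Weight"],
})

# One-hot encode the features only, so no label-derived column ends up in X.
features = pd.get_dummies(toy.drop(columns=["label"]), drop_first=True)

# Map each class name to an integer code, with classes in sorted order.
codes, classes = pd.factorize(toy["label"], sort=True)

print(list(features.columns))
print(codes.tolist(), list(classes))
```

Checking that no feature column name starts with the target name is a cheap guard against target leakage before training.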


## Exploratory Data Analysis (EDA)
### Min and Max Values, Mean, Median, Standard Deviation, Variance, and Correlation Matrix
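As a compact alternative to printing each statistic separately, `DataFrame.agg` can collect them in a single table with one row per variable. This is a minimal sketch on a hypothetical two-column frame (the `toy` values are invented, not drawn from the dataset).

```python
import pandas as pd

# Hypothetical numeric frame standing in for the numeric columns of df.
toy = pd.DataFrame({
    "Age": [21.0, 23.0, 30.0, 26.0],
    "Weight": [64.0, 80.0, 95.0, 77.0],
})

# One row per variable, one column per statistic.
summary = toy.agg(["min", "max", "mean", "median", "std", "var"]).T
print(summary)

# Pairwise Pearson correlation between the numeric columns.
print(toy.corr())
```

Note that `std` and `var` in pandas use the sample definition (`ddof=1`) by default.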
 

In [6]:
numeric_cols = df.select_dtypes(include=['int64', 'float64'])

print("\nMinimum values\n", numeric_cols.min())

print("\nMaximum values\n", numeric_cols.max())

print("\nMean")
print(numeric_cols.mean())

print("\nMedian")
print(numeric_cols.median())

print("\nStandard Deviation")
print(numeric_cols.std())

print("\nVariance")
print(numeric_cols.var())

print("\nCorrelation Matrix")
print(numeric_cols.corr())



Minimum values
 Age       14.00
Height     1.45
Weight    39.00
FCVC       1.00
NCP        1.00
CH2O       1.00
FAF        0.00
TUE        0.00
dtype: float64

Maximum values
 Age        61.00
Height      1.98
Weight    173.00
FCVC        3.00
NCP         4.00
CH2O        3.00
FAF         3.00
TUE         2.00
dtype: float64

Mean
Age       24.312600
Height     1.701677
Weight    86.586058
FCVC       2.419043
NCP        2.685628
CH2O       2.008011
FAF        1.010298
TUE        0.657866
dtype: float64

Median
Age       22.777890
Height     1.700499
Weight    83.000000
FCVC       2.385502
NCP        3.000000
CH2O       2.000000
FAF        1.000000
TUE        0.625350
dtype: float64

Standard Deviation
Age        6.345968
Height     0.093305
Weight    26.191172
FCVC       0.533927
NCP        0.778039
CH2O       0.612953
FAF        0.850592
TUE        0.608927
dtype: float64

Variance
Age        40.271313
Height      0.008706
Weight    685.977477
FCVC        0.285078
NCP         0.60534