<div style="text-align: left; font-size: 14px; line-height: 1.4">
<strong>Pontificia Universidad Católica de Chile</strong><br>
Facultad de Matemática<br>
Magíster en Inteligencia Artificial - MIA
</div>

<hr style="border: 1px solid #999;">

<div style="text-align: center">
<h2 style="margin-bottom: 0.3em">Applied Project</h2>
<h4 style="margin-top: 0.2em; margin-bottom: 0.2em">EPG4001 – Supervised Learning</h4>
<h5 style="margin-top: 0.2em; font-weight: normal">Second Bimester</h5>
</div>

<hr style="border: 1px solid #999;">

<div style="text-align: center; font-size: 15px; margin-bottom: 1em">
<strong>Professor:</strong> Jonathan Acosta
</div>

<div style="font-size: 14px">
July 2025<br>
Glen Restrepo A.<br>
Javiera Vukasovic F.<br>
Marco Gutierrez C.<br>
Maximiliano Zapater C. <br>
Sebastián Silva E.
</div>


## Library Imports

In [66]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats.contingency import association

## Data Imports

In [None]:
df_raw = pd.read_csv('../ObesityDataSet_raw_and_data_sinthetic.csv')
df_raw.shape

## Preliminary analysis

- The dataset has roughly 2,100 rows and 17 variables, with no null values.
- There are no obvious errors such as negative ages or weights.
- The dataset contains more men than women, and respondents tend to eat high-calorie food (`FAVC=yes`) and to use public transportation.
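
The checks behind these observations can be sketched as follows. This is a minimal sketch rather than the notebook's own code: `basic_quality_report` is a hypothetical helper, and the `toy` frame is made-up data included only so the block runs on its own. In the notebook it would be called on `df_raw`.

```python
import pandas as pd

def basic_quality_report(frame: pd.DataFrame) -> dict:
    """Summarize shape, missing values, negative numeric entries, and modes."""
    numeric = frame.select_dtypes(include="number")
    non_numeric = frame.select_dtypes(exclude="number")
    return {
        "n_rows": frame.shape[0],
        "n_cols": frame.shape[1],
        "n_missing": int(frame.isna().sum().sum()),
        # Number of negative values per numeric column (e.g. Age, Weight)
        "n_negative": (numeric < 0).sum().to_dict(),
        # Most frequent level of each non-numeric column (e.g. Gender, FAVC)
        "modes": non_numeric.mode().iloc[0].to_dict(),
    }

# Toy data (hypothetical), only to make the sketch runnable on its own
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female"],
    "Age": [21.0, 23.0, 27.0],
    "Weight": [64.0, 77.0, 87.0],
    "FAVC": ["yes", "yes", "no"],
})
report = basic_quality_report(toy)
print(report)
```

The `modes` entry is what supports statements such as "more men" or "`FAVC=yes`": it reports the most frequent level of each categorical column, not a proportion.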

In [None]:
df = df_raw.copy()
df.head(3)

In [None]:
df.info()

Note that FCVC and NCP are stored as numeric values, although they could be treated as categorical.
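
One possible way to categorize such columns is sketched below. It assumes that rounding to the nearest integer is an acceptable way to recover discrete levels, which the notebook does not state. `to_ordered_levels` and the sample values are hypothetical.

```python
import pandas as pd

def to_ordered_levels(values: pd.Series) -> pd.Series:
    """Round a numeric column and return it as an ordered categorical."""
    rounded = values.round().astype(int)
    levels = sorted(rounded.unique())
    return pd.Series(
        pd.Categorical(rounded, categories=levels, ordered=True),
        index=values.index,
        name=values.name,
    )

# Hypothetical sample values in the style of FCVC
fcvc = pd.Series([2.0, 3.0, 2.45, 1.2, 2.9], name="FCVC")
fcvc_cat = to_ordered_levels(fcvc)
print(fcvc_cat.tolist())
print(fcvc_cat.cat.categories.tolist())
```

Keeping the categorical ordered preserves the fact that the levels are ranked, which matters if the column is later encoded for a model.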

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.NObeyesdad.value_counts()

## Visual inspection

#### Numeric
We can see a relationship between `NObeyesdad` and:
- Age
- Weight
- FCVC (vegetable consumption)
- NCP (number of meals), only for *insufficient_weight*
- FAF (physical activity), a slight one
- TUE (time using technology), where the means appear slightly higher

We found no relationship with Height.
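
Impressions read off boxplots can be cross-checked with per-class medians. The sketch below is illustrative only: `medians_by_class` is a hypothetical helper and the `toy` values are made up. In the notebook it would be applied to `df` with `'NObeyesdad'` as the target.

```python
import pandas as pd

def medians_by_class(frame: pd.DataFrame, target: str, features: list) -> pd.DataFrame:
    """Median of each feature within each level of the target."""
    return frame.groupby(target)[features].median()

# Toy data (hypothetical), only to make the sketch runnable on its own
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"],
    "Age": [20.0, 22.0, 30.0, 34.0],
    "Height": [1.70, 1.72, 1.70, 1.74],
})
medians = medians_by_class(toy, "NObeyesdad", ["Age", "Height"])
print(medians)
```

A feature whose medians barely move across classes, as Height does in this toy example, matches the "no relationship" reading of a boxplot.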

In [None]:
numerical_features = ['Age', 'Height', 'Weight']
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for i, feature in enumerate(numerical_features):
    sns.boxplot(ax=axes[i], x='NObeyesdad', y=feature, data=df)
    axes[i].set_title(f'{feature} vs. NObeyesdad')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout() 
plt.show()

In [None]:
numerical_categorical_features = ['FCVC','NCP','CH2O','FAF','TUE']

fig, axes = plt.subplots(2, 3, figsize=(18,9))
axes = axes.flatten()
for i, feature in enumerate(numerical_categorical_features):
    sns.boxplot(ax=axes[i], x='NObeyesdad', y=feature, data=df)
    axes[i].set_title(f'{feature} vs. NObeyesdad')
    axes[i].tick_params(axis='x', rotation=45)

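# Remove the unused sixth subplot (5 features on a 2x3 grid)
for ax in axes[len(numerical_categorical_features):]:
    fig.delaxes(ax)

plt.tight_layout()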
plt.show()

#### Categorical

In [None]:
categorical_features_batch1 = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS']

fig, axes = plt.subplots(2, 4, figsize=(18, 9))
axes = axes.flatten()

for i, feature in enumerate(categorical_features_batch1):
    # Calculate proportions
    prop_df = df.groupby('NObeyesdad')[feature].value_counts(normalize=True).rename('proportion').reset_index()
    
    # Plotting with swapped axes
    sns.barplot(ax=axes[i], x='NObeyesdad', y='proportion', hue=feature, data=prop_df)
    axes[i].set_title(f'Proportion of {feature} by NObeyesdad')
    axes[i].tick_params(axis='x', rotation=45)
    axes[i].set_ylabel('Proportion')

plt.tight_layout()
plt.show()

#### Statistical relationships

In [None]:
numerical_df = df.select_dtypes(include=['float64', 'int64'])

# Correlation matrix
corr_matrix = numerical_df.corr()

# Plot  heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numerical Features')
plt.show()

In [None]:
# Categorical columns: strength of association with the target
for var in ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS']:
    tabla = pd.crosstab(df[var], df['NObeyesdad'])
    v = association(tabla, method='cramer')
    p = association(tabla, method='pearson')    # Pearson's contingency coefficient
    t = association(tabla, method='tschuprow')
    print(f'{var}')
    print(f'  - Cramér’s V = {v:.4f}')
    # print(f'    - Pearson’s contingency coefficient = {p:.4f}') # Hard to interpret: its upper bound depends on the table size
    # print(f'    - Tschuprow’s T = {t:.4f}') # Drops when the numbers of categories differ widely


**Ranges for Cramér's V**

| V value   | Association  |
| --------- | ------------ |
| 0.00–0.10 | Weak or none |
| 0.10–0.30 | Moderate     |
| 0.30–0.50 | Strong       |
| > 0.50    | Very strong  |
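
To make the statistic concrete, the sketch below computes Cramér's V by hand as $V = \sqrt{\chi^2 / (n \cdot \min(r-1,\, c-1))}$ and compares it with `association(..., method='cramer')`. The helper names and the toy contingency table are hypothetical. `strength_label` assumes each range in the table above includes its lower bound, which the table itself leaves open.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V from the chi-squared statistic, without continuity correction."""
    observed = table.to_numpy()
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    r, c = observed.shape
    return float(np.sqrt(chi2 / (n * min(r - 1, c - 1))))

def strength_label(v: float) -> str:
    """Map V to the qualitative ranges (assumed to include their lower bound)."""
    if v < 0.10:
        return "Weak or none"
    if v < 0.30:
        return "Moderate"
    if v < 0.50:
        return "Strong"
    return "Very strong"

# Toy 2x3 contingency table (hypothetical counts)
toy = pd.DataFrame([[30, 10, 5], [10, 25, 20]],
                   index=["no", "yes"], columns=["A", "B", "C"])
v_manual = cramers_v(toy)
v_scipy = association(toy.to_numpy(), method="cramer")
print(f"manual = {v_manual:.4f}, scipy = {v_scipy:.4f}, {strength_label(v_manual)}")
```

Because V is normalized by `min(r - 1, c - 1)`, it stays within [0, 1] whatever the table shape, which makes values comparable across variables with different numbers of categories.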


In [None]:
# Numerical columns
# Test whether the group means differ (assumes normality and homoscedasticity)
target_categories = df['NObeyesdad'].unique()

numerical_features = df.select_dtypes(include=['float64', 'int64']).columns

print("ANOVA F-test results (Feature vs. NObeyesdad):")
for feature in numerical_features:
    groups = [df[feature][df['NObeyesdad'] == category] for category in target_categories]

    # One-way ANOVA
    f_stat, p_value = stats.f_oneway(*groups)
    
    print(f" - {feature}: F-statistic = {f_stat:.2f}, p-value = {p_value:.4e}")

In [None]:
# CHECKING NORMALITY WITHIN EACH GROUP (Q-Q PLOTS)
n_features = len(numerical_features)
n_categories = df['NObeyesdad'].nunique()
total_plots = n_features * n_categories

# Subplot layout
n_cols = 7
n_rows = (total_plots + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 3))
axes = axes.flatten()

plot_idx = 0
for feature in numerical_features:
    for category in df['NObeyesdad'].unique():
        subset = df[df['NObeyesdad'] == category][feature].dropna()
        stats.probplot(subset, dist="norm", plot=axes[plot_idx])
        axes[plot_idx].set_title(f'{feature} | {category}', fontsize=9)
        plot_idx += 1

# Remove any leftover empty subplots
for i in range(plot_idx, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()
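
The Q-Q plots can be complemented with a formal test. The sketch below applies the Shapiro-Wilk test (`scipy.stats.shapiro`) within each group. This is an addition, not part of the original analysis: `shapiro_by_group` is a hypothetical helper and the `toy` data is simulated. With large groups the test tends to reject even for small deviations, so it is best read alongside the plots.

```python
import numpy as np
import pandas as pd
from scipy import stats

def shapiro_by_group(frame: pd.DataFrame, target: str, feature: str) -> dict:
    """Shapiro-Wilk p-value of `feature` within each level of `target`."""
    p_values = {}
    for category, values in frame.groupby(target)[feature]:
        p_values[category] = float(stats.shapiro(values.dropna()).pvalue)
    return p_values

# Simulated data (hypothetical): one roughly normal group, one skewed group
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight"] * 50 + ["Obesity_Type_I"] * 50,
    "Weight": np.concatenate([rng.normal(62, 8, 50), rng.exponential(10, 50) + 80]),
})
p_by_group = shapiro_by_group(toy, "NObeyesdad", "Weight")
for category, p_value in p_by_group.items():
    print(f"{category}: p = {p_value:.4f}")
```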


In [81]:
# HOMOSCEDASTICITY ASSUMPTION (EQUAL VARIANCE ACROSS GROUPS)
from scipy.stats import levene
categories = df['NObeyesdad'].unique()
for feature in numerical_features:
    # One Levene test per feature, comparing all target groups at once
    groups = [df[df['NObeyesdad'] == category][feature] for category in categories]
    stat, p = levene(*groups)
    if p > 0.05:
        print(f"Feature '{feature}' meets homogeneity of variance assumption (p > 0.05)")
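
If the normality or equal-variance checks fail for a feature, a rank-based alternative to one-way ANOVA is the Kruskal-Wallis H-test (`scipy.stats.kruskal`). The sketch below is a suggested complement, not part of the original analysis. `kruskal_by_feature` is a hypothetical helper and the `toy` values are made up.

```python
import pandas as pd
from scipy import stats

def kruskal_by_feature(frame: pd.DataFrame, target: str, features: list) -> dict:
    """Kruskal-Wallis p-value for each feature across the levels of `target`."""
    p_values = {}
    for feature in features:
        groups = [values.dropna() for _, values in frame.groupby(target)[feature]]
        p_values[feature] = float(stats.kruskal(*groups).pvalue)
    return p_values

# Toy data (hypothetical): `separated` differs by group, `same` does not
toy = pd.DataFrame({
    "NObeyesdad": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "separated": list(range(1, 6)) + list(range(11, 16)) + list(range(21, 26)),
    "same": [1, 2, 3, 4, 5] * 3,
})
p_kruskal = kruskal_by_feature(toy, "NObeyesdad", ["separated", "same"])
for feature, p_value in p_kruskal.items():
    print(f" - {feature}: p-value = {p_value:.4e}")
```

The test compares rank distributions rather than means, so it does not rely on normality, at the cost of some power when the ANOVA assumptions do hold.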

