# Obesity Risk Prediction Model
This notebook develops a classification model to predict individuals at high risk for obesity based on demographic and lifestyle features. It includes data loading, preprocessing, exploratory data analysis, model training, and evaluation.

### Dataset Information

The *Obesity Levels*$^{1}$ dataset estimates obesity levels in individuals from Mexico, Peru, and Colombia based on their eating habits and physical condition. It contains 17 attributes and 2111 records. Each record is labeled with the class variable NObeyesdad (obesity level), which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.

This dataset contains the following columns:

- **Gender:** Feature, Categorical, "Gender"
- **Age:** Feature, Continuous, "Age"
- **Height:** Feature, Continuous
- **Weight:** Feature, Continuous
- **family_history_with_overweight:** Feature, Binary, "Has a family member suffered or suffers from overweight?"
- **FAVC:** Feature, Binary, "Do you eat high caloric food frequently?"
- **FCVC:** Feature, Integer, "Do you usually eat vegetables in your meals?"
- **NCP:** Feature, Continuous, "How many main meals do you have daily?"
- **CAEC:** Feature, Categorical, "Do you eat any food between meals?"
- **SMOKE:** Feature, Binary, "Do you smoke?"
- **CH2O:** Feature, Continuous, "How much water do you drink daily?"
- **SCC:** Feature, Binary, "Do you monitor the calories you eat daily?"
- **FAF:** Feature, Continuous, "How often do you have physical activity?"
- **TUE:** Feature, Integer, "How much time do you use technological devices such as cell phone, videogames, television, computer and others?"
- **CALC:** Feature, Categorical, "How often do you drink alcohol?"
- **MTRANS:** Feature, Categorical, "Which transportation do you usually use?"
- **NObeyesdad:** Target, Categorical, "Obesity level"

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.exceptions import ConvergenceWarning

# Suppress only ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## 2. Load Data
Load the dataset and examine the first few rows.

In [None]:
data_raw = pd.read_csv('ObesityDataSet_raw.csv')
data_raw.head()

## 3. Data Preprocessing
Convert categorical features to numeric, handle missing values, and scale numerical features.

In [None]:
# TODO: Add more cleaning steps if needed. 
# Check for missing values
data_raw.isnull().sum()

In [None]:
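# Drop rows with missing values; copy so later column assignments
# operate on an independent DataFrame rather than a view of data_raw
data = data_raw.dropna().copy()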

In [None]:
# Encode categorical features
label_encoders = {}
# Dictionary to store the relationship between original and encoded values
value_mapping = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    # Store the mapping of original values to encoded values
    value_mapping[column] = dict(zip(le.classes_, range(len(le.classes_))))
    # Keep the fitted encoder so the codes can be decoded later
    label_encoders[column] = le

data.head()
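Because `LabelEncoder` assigns integer codes in sorted order of the class labels, a fitted encoder such as those stored in `label_encoders` can also be used to recover the original strings. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Female']})

le = LabelEncoder()
toy['Gender_encoded'] = le.fit_transform(toy['Gender'])

# classes_ is sorted, so 'Female' maps to 0 and 'Male' maps to 1
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
decoded = le.inverse_transform(toy['Gender_encoded'])
print(list(decoded))
```

Keeping the fitted encoders around makes it possible to report predictions with the original class names rather than integer codes.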

In [None]:
def printMappedValues(MyDictionary):
    for column, mapping in MyDictionary.items():
        # Build a plain-text list of the current column's mapping
        resultText = ''

        # Add one line per original value and its encoded value
        for original, encoded in mapping.items():
            resultText += f'- {original}: {encoded}\n'

        resultText += '-------------------------'

        # Print the mapping for the current column
        print(f'Values for column {column}:')
        print(resultText)
        print('\n')  # Add space between columns for readability

printMappedValues(value_mapping)

## 4. Exploratory Data Analysis (EDA)
Explore the distribution of obesity levels and visualize relationships between features.

In [None]:
# Plot with labeled x-axis
sns.countplot(x='NObeyesdad', data=data_raw)
plt.title('Distribution of Obesity Levels')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(12,6))
sns.heatmap(data.corr(), annot=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
# Helper to plot the distribution of a column, split by gender
def plt_dist(data, column, title="", x="Value", y="Frequency"):
    # Gender was label-encoded above (Female: 0, Male: 1). Map it back to
    # readable labels on a copy so the modeling DataFrame is left unchanged;
    # adding a string column to `data` itself would later break model fitting.
    plot_data = data.assign(Gender_Label=data['Gender'].map({0: 'Female', 1: 'Male'}))
    fg = sns.FacetGrid(plot_data, col='Gender_Label', height=5, aspect=0.8)  # one panel per gender
    fg.map(sns.histplot, column, kde=True)  # histogram with a KDE overlay
    fg.set_axis_labels(x, y)  # x and y axis labels
    fg.fig.suptitle(title, y=1.05)  # overall plot title
    plt.show()

# plot Age by gender
plt_dist(data, column='Age', title='Age Distribution by Gender', x='Age', y='Frequency')
# plot height by gender
plt_dist(data, column='Height', title='Height Distribution by Gender', x='Height', y='Frequency')
# plot weight by gender
plt_dist(data, column='Weight', title='Weight Distribution by Gender', x='Weight', y='Frequency')

In [None]:
#TODO: Add more EDA steps

## 5. Train-Test Split
Split the data into training and testing sets.

In [None]:
# TODO: Refine feature selection. Currently all features are considered.
X = data.drop('NObeyesdad', axis=1)
y = data['NObeyesdad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
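With seven target classes, a plain random split can leave the class proportions slightly different between the training and test sets. One option, sketched below, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins (an assumption for illustration), not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the obesity classes (assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 140 + [1] * 40 + [2] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, each class keeps its share of the data in the test split
test_counts = np.bincount(y_te)
print(test_counts)
```

Applied to this notebook, the same idea would amount to passing `stratify=y` in the call above.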

## 6. Model Training
Train multiple models and compare performance.

In [None]:
# TODO: We can add a few more models...
# NOTE: We can take this approach or run each model in separate cells. 

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(),
    # Fixed seed so the forest's results are reproducible across runs
    'Random Forest': RandomForestClassifier(random_state=42),
}

# Train and evaluate models
for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy:.2f}')
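Logistic regression is sensitive to feature scale and may fail to converge on unscaled inputs. A possible approach, sketched here under assumptions, is to wrap `StandardScaler` and `LogisticRegression` in a pipeline so the scaler is fit on the training split only. The data comes from `make_classification` as a stand-in for the notebook's features, and `max_iter=1000` is an illustrative choice rather than a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multi-class data standing in for X and y (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fit on the training split only, then reused at predict time
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
demo_accuracy = scaled_lr.score(X_te, y_te)
print(f'Scaled Logistic Regression accuracy: {demo_accuracy:.2f}')
```

A pipeline like this could be added to the `models` dictionary as another entry, since it exposes the same `fit` and `predict` interface.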

## 7. Model Evaluation
Evaluate the best model with detailed metrics.

In [None]:
# TODO: Compare models performance 
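Accuracy alone can hide weak performance on individual classes. As a hedged sketch of the "detailed metrics" this section calls for, the block below computes a per-class report and a confusion matrix with the already-imported `classification_report` and `confusion_matrix`. It uses synthetic data from `make_classification` (an assumption) in place of the notebook's split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic multi-class data standing in for the notebook's X and y (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Per-class precision, recall, and F1
report = classification_report(y_te, y_hat)
print(report)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
print(cm)
```

In the notebook itself, `y_test` and the predictions of the chosen model would replace `y_te` and `y_hat`.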

## 8. Conclusion
Summarize model performance, key findings from feature importance analysis, and potential applications for public health resource allocation.

In [None]:
# TODO: Gather models results, pick the model with best accuracy and identify features to be used. 
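One way to identify which features matter, sketched here under assumptions, is to rank the impurity-based `feature_importances_` of a fitted random forest. The data and the `feature_i` names below are synthetic placeholders, not the notebook's columns.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4,
    n_redundant=2, n_classes=3, random_state=42
)
feature_names = [f'feature_{i}' for i in range(X_demo.shape[1])]

rf = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort to rank the features
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's data, `X_train.columns` would supply the index in place of the placeholder names.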


----------------
$^{1}$ Mehrparvar, F. (2021). Obesity Levels. Kaggle. Retrieved November 9, 2024, from https://www.kaggle.com/datasets/fatemehmehrparvar/obesity-levels/data