# Obesity Risk Prediction Model
This notebook develops a classification model to predict individuals at high risk for obesity based on demographic and lifestyle features. It includes data loading, preprocessing, exploratory data analysis, model training, and evaluation.

### Dataset Information

The *Obesity Levels*$^{1}$ dataset contains estimates of obesity levels for individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition. The data has 17 attributes and 2111 records. Each record is labeled with the class variable `NObeyesdad` (obesity level), which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.

This dataset contains the following columns:

| Column                        | Type        | Data Type   | Description                                                                                   |
|-------------------------------|-------------|-------------|-----------------------------------------------------------------------------------------------|
| **Gender**                    | Feature     | Categorical | Gender                                                                                        |
| **Age**                       | Feature     | Continuous  | Age                                                                                           |
| **Height**                    | Feature     | Continuous  | Height                                                                                        |
| **Weight**                    | Feature     | Continuous  | Weight                                                                                        |
| **family_history_with_overweight** | Feature | Binary      | "Has a family member suffered or suffers from overweight?"                                    |
| **FAVC**                      | Feature     | Binary      | "Do you eat high caloric food frequently?"                                                    |
| **FCVC**                      | Feature     | Integer     | "Do you usually eat vegetables in your meals?"                                                |
| **NCP**                       | Feature     | Continuous  | "How many main meals do you have daily?"                                                      |
| **CAEC**                      | Feature     | Categorical | "Do you eat any food between meals?"                                                          |
| **SMOKE**                     | Feature     | Binary      | "Do you smoke?"                                                                               |
| **CH2O**                      | Feature     | Continuous  | "How much water do you drink daily?"                                                          |
| **SCC**                       | Feature     | Binary      | "Do you monitor the calories you eat daily?"                                                  |
| **FAF**                       | Feature     | Continuous  | "How often do you have physical activity?"                                                    |
| **TUE**                       | Feature     | Integer     | "How much time do you use technological devices such as cell phone, videogames, television, computer and others?" |
| **CALC**                      | Feature     | Categorical | "How often do you drink alcohol?"                                                             |
| **MTRANS**                    | Feature     | Categorical | "Which transportation do you usually use?"                                                    |
| **NObeyesdad**                | Target      | Categorical | "Obesity level"                                                                               |


## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, f1_score)
from sklearn.exceptions import ConvergenceWarning
from sklearn.tree import DecisionTreeClassifier

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, explained_variance_score
from statsmodels.stats import anova
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from IPython.display import display, HTML


# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)

## 2. Load Data
Load the dataset and examine the first few rows.

In [None]:
data_raw = pd.read_csv('ObesityDataSet_raw.csv')
data_raw.head()

In [None]:
# Print shape of the dataset
data_raw.shape

In [None]:
# Print information of the dataset.
data_raw.info()

## 3. Data Preprocessing
Convert categorical features to numeric, handle missing values, and scale numerical features.

In [None]:
# Drop missing values; copy so the later column assignments do not act on a view of data_raw
data = data_raw.dropna().copy()

In [None]:
# Encode categorical features
label_encoders = {}
# Dictionary to store the relationship between original and encoded values
value_mapping = {}
for column in data.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    # Store the mapping of original values to encoded values
    value_mapping[column] = dict(zip(le.classes_, range(len(le.classes_))))
    # Keep the fitted encoder so the codes can be inverted later
    label_encoders[column] = le

data.head()

In [None]:
def printMappedValues(MyDictionary):
    """Print the original-to-encoded value mapping for each encoded column."""
    for column, mapping in MyDictionary.items():
        # Build a plain-text list of the column's mappings
        resultText = ''

        # Add one line per original value and its code
        for original, encoded in mapping.items():
            resultText += f'- {original}: {encoded}\n'

        resultText += '-------------------------'

        # Print the mapping for the current column
        print(f'Values for column: {column}:')
        print(resultText)
        print('\n')  # Add space between columns for readability

printMappedValues(value_mapping)

### Data Encoding

| Column                         | Encodings                                                      |
|--------------------------------|----------------------------------------------------------------|
| **Gender**                     | Female: 0; Male: 1                                             |
| **family_history_with_overweight** | no: 0; yes: 1                                         |
| **FAVC**                       | no: 0; yes: 1                                                  |
| **CAEC**                       | Always: 0; Frequently: 1; Sometimes: 2; no: 3                  |
| **SMOKE**                      | no: 0; yes: 1                                                  |
| **SCC**                        | no: 0; yes: 1                                                  |
| **CALC**                       | Always: 0; Frequently: 1; Sometimes: 2; no: 3                  |
| **MTRANS**                     | Automobile: 0; Bike: 1; Motorbike: 2; Public_Transportation: 3; Walking: 4 |
| **NObeyesdad**                 | Insufficient_Weight: 0; Normal_Weight: 1; Obesity_Type_I: 2; Obesity_Type_II: 3; Obesity_Type_III: 4; Overweight_Level_I: 5; Overweight_Level_II: 6 |
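
`LabelEncoder` assigns codes in alphabetical order of the labels, so the codes for frequency-style answers such as CAEC and CALC (Always: 0 through no: 3) do not follow the frequency itself. If an ordered encoding is preferred, an explicit mapping is one option. The sketch below is a minimal illustration on a hypothetical toy frame; the ordering `no < Sometimes < Frequently < Always` and the `CAEC_ordinal` column name are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame using the same category labels as CAEC/CALC.
toy = pd.DataFrame({"CAEC": ["Sometimes", "no", "Always", "Frequently"]})

# Assumed ordering (not from the notebook): no < Sometimes < Frequently < Always.
frequency_order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}

# Map each label to its position in the assumed ordering.
toy["CAEC_ordinal"] = toy["CAEC"].map(frequency_order)
print(toy)
```

Tree-based models are largely insensitive to this choice, while linear models and SVMs may be affected by the order of the codes.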


## 4. Exploratory Data Analysis (EDA)
Explore the distribution of obesity levels and visualize relationships between features.

In [None]:
# Set up the number of columns for the grid
num_cols = 3
num_vars = len(data_raw.columns)
num_rows = (num_vars + num_cols - 1) // num_cols  # Calculate required number of rows

# Create a grid of subplots for all variables
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, num_rows * 4))
axes = axes.flatten()  # Flatten the axes array for easy iteration

for i, column in enumerate(data_raw.columns):
    ax = axes[i]
    if data_raw[column].dtype == 'object':  # Plot count plot for categorical variables
        sns.countplot(x=column, data=data_raw, ax=ax)
        ax.set_title(f'Distribution of {column}')
        ax.set_xlabel(column)
        ax.set_ylabel('Count')
        ax.tick_params(axis='x', rotation=45)  # Rotate labels for count plot
    else:
        # Plot histogram for numerical variables
        sns.histplot(data_raw[column], kde=True, ax=ax)
        ax.set_title(f'Distribution of {column}')
        ax.set_xlabel(column)
        ax.set_ylabel('Count')  # histplot plots counts by default (stat='count')
        ax.tick_params(axis='x', rotation=45)  # Rotate labels for histogram

# Remove any empty subplots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(12,6))
sns.heatmap(data.corr(), annot=True, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

## 5. Train-Test Split
Split the data into training and testing sets.
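
With seven target classes, a purely random split can leave the class proportions slightly different between the training and test sets. Passing `stratify` to `train_test_split` keeps the proportions aligned. The sketch below shows the effect on hypothetical toy labels (`toy_X`, `toy_y` are illustrative stand-ins, not the notebook's data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 samples of class 0, 20 of class 1.
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

# stratify=toy_y preserves the 80/20 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print("Train class counts:", np.bincount(y_tr))
print("Test class counts: ", np.bincount(y_te))
```

In the notebook's own split, the equivalent change would be adding `stratify=y` to the existing call.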

In [None]:
# TODO: Refine feature selection. Currently all features are used.
RANDOM_SEED = 42 # Define our random seed

X = data.drop('NObeyesdad', axis=1)
y = data['NObeyesdad']

# Splits the data into training/test sets
X_train, X_test, y_train, y_test = train_test_split(X,                          # Features variables
                                                    y,                          # Target variable
                                                    test_size=0.25,             # 25% of the data for test 
                                                    random_state=RANDOM_SEED)   # Set random seed

In [None]:
# Standardize training/test data
# Fit the scaler on the training set only (after splitting) to avoid information leakage
scaler = StandardScaler() 
X_train_scale = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_scale = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
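
One way to make the fit-on-train-only rule harder to break is to bundle the scaler and the estimator in a scikit-learn `Pipeline`, so that scaling statistics are always learned from whatever data is passed to `fit`. The following is a minimal sketch on synthetic stand-in data from `make_classification`; the variable names are illustrative and the notebook itself does not use a pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The scaler is fitted inside pipe.fit, using the training rows only.
pipe = make_pipeline(StandardScaler(), SVC(random_state=42))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting.
demo_accuracy = pipe.score(X_te, y_te)
print(f"Accuracy: {demo_accuracy:.3f}")
```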

## 6. Model Training
Train multiple models and compare performance.
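
The per-model cells in this section repeat the same fit, predict, and score steps. As a compact alternative, those steps can be run in a loop over a dictionary of estimators. This is only a sketch on synthetic stand-in data; the `candidates` and `results` names are hypothetical, and `max_iter=1000` is an assumed setting to help logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Hypothetical set of candidate classifiers.
candidates = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Record the same two metrics the notebook computes for each model.
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "weighted_f1": f1_score(y_te, pred, average="weighted"),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"weighted F1={results[name]['weighted_f1']:.3f}")
```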

In [None]:
# create a function for our confusion matrix 
def plot_confusion_matrix(_true, _pred, classes, cmap='Blues', title=''):
    cm = confusion_matrix(_true, _pred) # set our confusion matrix true and predicted values

    # display our confusion matrix
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=classes)
    disp.plot(cmap=cmap)
    plt.title(title)
    plt.show()

### Random Forest Classifier

In [None]:
# Instantiate and train the Random Forest model
rf_model = RandomForestClassifier(random_state=RANDOM_SEED)
rf_model.fit(X_train_scale, y_train)

# Make predictions on the test set
rf_prediction = rf_model.predict(X_test_scale)

# Calculate accuracy
rf_accuracy = accuracy_score(y_test, rf_prediction)
print(f'Accuracy: {rf_accuracy}')

# Calculate F1 score
rf_weighted_f1 = f1_score(y_test, rf_prediction, average='weighted')

# call confusion matrix function 
plot_confusion_matrix(y_test, rf_prediction, rf_model.classes_, title='Random Forest CF Matrix')

### Support Vector Machine (SVM)

In [None]:
# Instantiate and train the SVM
svm_model = SVC(random_state=RANDOM_SEED)
svm_model.fit(X_train_scale, y_train)

# Make predictions on the test set
svm_prediction = svm_model.predict(X_test_scale)

# Calculate accuracy
svm_accuracy = accuracy_score(y_test, svm_prediction)
print(f'Accuracy: {svm_accuracy}')

# Calculate F1 score
svm_weighted_f1 = f1_score(y_test, svm_prediction, average='weighted')

# call confusion matrix function 
plot_confusion_matrix(y_test, svm_prediction, svm_model.classes_, title='SVM CF Matrix')

### Logistic Regression

In [None]:
# Instantiate and train the Logistic Regression
lg_model = LogisticRegression(random_state=RANDOM_SEED)
lg_model.fit(X_train_scale, y_train)

# Make predictions on the test set
lg_prediction = lg_model.predict(X_test_scale)

# Calculate accuracy
lg_accuracy = accuracy_score(y_test, lg_prediction)
print(f'Accuracy: {lg_accuracy}')

# Calculate F1 score
lg_weighted_f1 = f1_score(y_test, lg_prediction, average='weighted')

# call confusion matrix function 
plot_confusion_matrix(y_test, lg_prediction, lg_model.classes_, title='Logistic Regression CF Matrix')

### Decision Tree

In [None]:
# Instantiate and train the Decision Tree
dt_model = DecisionTreeClassifier(random_state=RANDOM_SEED)
dt_model.fit(X_train_scale, y_train)

# Make predictions on the test set
dt_prediction = dt_model.predict(X_test_scale)

# Calculate accuracy
dt_accuracy = accuracy_score(y_test, dt_prediction)
print(f'Accuracy: {dt_accuracy}')

# Calculate F1 score
dt_weighted_f1 = f1_score(y_test, dt_prediction, average='weighted')

# call confusion matrix function 
plot_confusion_matrix(y_test, dt_prediction, dt_model.classes_, title='Decision Tree CF Matrix')

In [None]:
####################################################################################
#
#   Base Class for the ModelSelector
#
####################################################################################
class ModelSelector:
    def __init__(
        self,
        X,
        y,
        testingSize=0.2,
        randomSeed=42,
        validationSize=0.25,
        selectedModel="LinearRegression"
    ):
        '''
        Definition of parameters:
        X: Features
        y: Target column
        testingSize: Float from 0 to 1; fraction of the data assigned to the testing set
        randomSeed: Value for reproducibility in randomization (random state)
        validationSize: Float from 0 to 1; fraction of the remaining training data assigned to the validation set
        selectedModel: Name of the model to be tested by the class.
        '''
        self.model_name = selectedModel
        self.selectedModels = {
            "LinearRegression": {
                "model": lambda X, y: LinearRegression(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "RandomForestRegressor": {
                "model": lambda X, y: RandomForestRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "DecisionTreeRegressor": {
                "model": lambda X, y: DecisionTreeRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "GradientBoostingRegressor": {
                "model": lambda X, y: GradientBoostingRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "AdaBoostRegressor": {
                "model": lambda X, y: AdaBoostRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "BaggingRegressor": {
                "model": lambda X, y: BaggingRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "OLS": {
                "model":lambda X, y: sm.OLS(y, X),
                "fit": lambda X, y: self.selectedModel.fit()
            },
            "SVR": {
                "model": lambda X, y: SVR(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "MLPRegressor": {
                "model": lambda X, y: MLPRegressor(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            },
            "LogisticRegression":{
                "model": lambda X, y: LogisticRegression(),
                "fit": lambda X, y: self.selectedModel.fit(X, y)
            }
        }
        self.X = X
        self.y = y
        self.testingSize = testingSize
        self.randomSeed = randomSeed
        self.validationSize = validationSize
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(
            self.X,
            self.y,
            test_size=self.testingSize,
            random_state=self.randomSeed
        )
        self.X_train, self.X_val, self.y_train, self.y_val = train_test_split(
            self.X_train,
            self.y_train,
            test_size=self.validationSize,
            random_state=self.randomSeed
        )
        self.selectedModel = self.selectedModels[self.model_name]['model'](self.X_train, self.y_train)

    @staticmethod
    def getAvailableModels():
        return [
            "LinearRegression",
            "RandomForestRegressor",
            "DecisionTreeRegressor",
            "GradientBoostingRegressor",
            "AdaBoostRegressor",
            "BaggingRegressor",
            "OLS",
            "SVR",
            "MLPRegressor",
            "LogisticRegression"
        ]

    def train(self):
        self.selectedModel=self.selectedModels[self.model_name]['fit'](self.X_train, self.y_train)
        return self.evaluate(self.X_train, self.y_train)

    def predict(self, X):
        return self.selectedModel.predict(X)
    
    def validate(self):
        return self.evaluate(self.X_val, self.y_val)
    
    def test(self):
        return self.evaluate(self.X_test, self.y_test)

    def evaluate(self, X, y):
        y_pred = self.predict(X)
        mse = mean_squared_error(y, y_pred)
        mae = mean_absolute_error(y, y_pred)
        r2 = r2_score(y, y_pred)
        # Round to 4 decimals to match the 4-decimal formatting used when reporting results
        return np.round((mse, mae, r2), decimals=4)
    
    def summary(self):
        return self.selectedModel.summary()
    
    def plotResiduals(self, X, y):
        residuals = y - self.predict(X)
        sns.scatterplot(x=self.predict(X), y=residuals)
        plt.xlabel('Predicted Values')
        plt.ylabel('Residuals')
        plt.axhline(y=0, color='r', linestyle='--')
        plt.title('Predicted Values vs. Residuals')
        plt.show()

    def plot_residuals(self):
        self.plotResiduals(self.X_val, self.y_val)
    
    def plot_residuals_test(self):
        self.plotResiduals(self.X_test, self.y_test)

    def plot_residuals_val(self):
        self.plotResiduals(self.X_val, self.y_val)

In [None]:
# Collect the data into a list of lists
resultsData = [['Model',
         'Train MSE',
         'Train MAE',
         'Train R²',
         'Validation MSE',
         'Validation MAE',
         'Validation R²',
         'Test MSE',
         'Test MAE',
         'Test R²'
]]

for modelName in ModelSelector.getAvailableModels():
    currentModel = ModelSelector(X, y, selectedModel=modelName)
    trainingResults = currentModel.train()
    validationResults = currentModel.validate()
    testingResults = currentModel.test()
    
    # Append each row of model results to the data list
    resultsData.append([modelName,
                 f'{trainingResults[0]:.4f}', f'{trainingResults[1]:.4f}', f'{trainingResults[2]:.4f}',
                 f'{validationResults[0]:.4f}', f'{validationResults[1]:.4f}', f'{validationResults[2]:.4f}',
                 f'{testingResults[0]:.4f}', f'{testingResults[1]:.4f}', f'{testingResults[2]:.4f}'])

# Build a DataFrame, using the first row as the column headers
dfResultsData = pd.DataFrame(resultsData[1:], columns=resultsData[0])

# Display the DataFrame as an HTML table
html_ResultsData = dfResultsData.to_html(index=False)
display(HTML(html_ResultsData))




## 7. Model Evaluation
Evaluate each trained classifier with detailed per-class metrics.
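
Because the target was label-encoded, the classification reports list classes as integers 0 to 6. Passing the encoder's class names as `target_names` makes the report easier to read. The sketch below uses a few hypothetical labels and a fresh `LabelEncoder`; in the notebook, the fitted encoder stored in `label_encoders['NObeyesdad']` would presumably play the same role.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Hypothetical true and predicted labels (illustrative only).
true_labels = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight",
               "Overweight_Level_I", "Obesity_Type_I", "Overweight_Level_I"]
pred_labels = ["Normal_Weight", "Obesity_Type_I", "Overweight_Level_I",
               "Overweight_Level_I", "Obesity_Type_I", "Normal_Weight"]

# Encode the strings to integers, as the notebook does for the target column.
encoder = LabelEncoder()
y_true_demo = encoder.fit_transform(true_labels)
y_pred_demo = encoder.transform(pred_labels)

# classes_ is sorted in the same order as the integer codes, so it lines up with the report rows.
report = classification_report(
    y_true_demo, y_pred_demo, target_names=list(encoder.classes_)
)
print(report)
```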

### Models Explored:

| Model                  | Description                                                             | Limitations                                               | Reason for Inclusion                                      |
|------------------------|-------------------------------------------------------------------------|-----------------------------------------------------------|-----------------------------------------------------------|
| **Random Forest**      | An ensemble model that builds multiple decision trees and combines them for more stable and accurate predictions. | Computationally expensive, especially with large datasets. | Robust to overfitting, captures non-linear relationships, and provides feature importance insights. |
| **Support Vector Machine (SVM)** | Finds the hyperplane that best separates classes in the feature space, maximizing the margin between classes. | Sensitive to noise, can be slow on large datasets.         | Effective for high-dimensional data and when clear class boundaries exist. |
| **Logistic Regression** | A linear model that predicts class probabilities based on feature values. | Assumes a linear relationship between features and target, may underperform on complex data. | Simple, interpretable, and computationally efficient for baseline performance comparison. |
| **Decision Tree**      | A model that splits data into decision nodes based on feature values, creating an interpretable path to classification. | Prone to overfitting without regularization or pruning.   | Easily interpretable, provides a foundation for ensemble methods like Random Forest. |



### Random Forest Classification Report

In [None]:
print(classification_report(y_test, rf_prediction))
print("Confusion Matrix:")
print(confusion_matrix(y_test, rf_prediction))

### SVM Classification Report

In [None]:
print(classification_report(y_test, svm_prediction))
print("Confusion Matrix:")
print(confusion_matrix(y_test, svm_prediction))

### Logistic Regression Classification Report

In [None]:
print(classification_report(y_test, lg_prediction))
print("Confusion Matrix:")
print(confusion_matrix(y_test, lg_prediction))

### Decision Tree Classification Report

In [None]:
print(classification_report(y_test, dt_prediction))
print("Confusion Matrix:")
print(confusion_matrix(y_test, dt_prediction))

In [None]:
# NOTE: This takes a long time to run.

n_iterations = 30  # Number of random train/test splits

# Lists to store accuracy scores for each model
rf_scores = []
svm_scores = []
lg_scores = []
dt_scores = []

for _ in range(n_iterations):
    # Splits the data into training/test sets (no fixed seed, so each split differs)
    _x_train, _x_test, _y_train, _y_test = train_test_split(X, y, test_size=0.25)

    # Standardizes training/test data
    _x_train_scaled = pd.DataFrame(scaler.fit_transform(_x_train), columns=_x_train.columns, index=_x_train.index)
    _x_test_scaled = pd.DataFrame(scaler.transform(_x_test), columns=_x_test.columns, index=_x_test.index)

    # Trains and evaluates Random Forest
    rf_model.fit(_x_train_scaled, _y_train)
    _rf_predictions = rf_model.predict(_x_test_scaled)
    _rf_score = accuracy_score(_y_test, _rf_predictions)

    # Trains and evaluates SVM 
    svm_model.fit(_x_train_scaled, _y_train)
    _svm_predictions = svm_model.predict(_x_test_scaled)
    _svm_score = accuracy_score(_y_test, _svm_predictions)

    # Trains and evaluates Logistic Regression 
    lg_model.fit(_x_train_scaled, _y_train)
    _lg_predictions = lg_model.predict(_x_test_scaled)
    _lg_score = accuracy_score(_y_test, _lg_predictions)

    # Trains and evaluates Decision Tree 
    dt_model.fit(_x_train_scaled, _y_train)
    _dt_predictions = dt_model.predict(_x_test_scaled)
    _dt_score = accuracy_score(_y_test, _dt_predictions)

    # Collects the scores 
    rf_scores.append(_rf_score)
    svm_scores.append(_svm_score)
    lg_scores.append(_lg_score)
    dt_scores.append(_dt_score)

# Plot histograms of the scores
plt.hist(rf_scores, bins=8, alpha=0.6, label='Random Forest')
plt.hist(svm_scores, bins=8, alpha=0.6, label='SVM')
plt.hist(lg_scores, bins=8, alpha=0.6, label='Logistic Regression')
plt.hist(dt_scores, bins=8, alpha=0.6, label='Decision Tree')

#  Title, Labels, and Legend
plt.legend()
plt.xlabel("Accuracy Score")
plt.ylabel("Frequency")
plt.title("Accuracy Score Distribution of Classifiers")
plt.show()
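
Repeated random splits, as above, give a distribution of accuracy scores. A related and often cheaper option is k-fold cross-validation, where every sample is used for testing exactly once. The sketch below is one possible setup on synthetic stand-in data, with scaling wrapped in a pipeline so it is refitted within each fold; the five folds and `max_iter=1000` are assumed settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Five stratified folds; shuffling with a fixed seed keeps the folds reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The scaler is refitted on the training part of each fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```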

## 8. Conclusion
Summarize model performance, key findings from feature importance analysis, and potential applications for public health resource allocation.
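
For the feature importance analysis mentioned above, a fitted random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below shows how they could be ranked, using synthetic stand-in data and hypothetical feature names; in the notebook, `rf_model` and `X_train_scale.columns` would be the natural inputs.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
feature_names = [f"feature_{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.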

In [None]:
# TODO: Gather model results, pick the model with the best accuracy, and identify the features to be used.


----------------
$^{1}$ Mehrparvar, F. (2021). Obesity Levels. Kaggle. Retrieved November 9, 2024, from https://www.kaggle.com/datasets/fatemehmehrparvar/obesity-levels/data