#                               Multinomial Logistic Regression in Predicting Obesity Risk




##                                                        Gladys Murage

##                              College of Business, Engineering, and  Technology, National University

##                                         DDS8555 v1: PREDICTIVE ANALYSIS(3602869492)

##                                                        Dr MOHAMED NABEEL

##                                                            April 06, 2025



# Key Components of the Code:
#### 1. Data loading: reads the training, test, and sample submission files
#### 2. Pre-processing: label-encodes the categorical target, standardizes numerical features, and one-hot encodes categorical features for logistic regression
#### 3. Model: multinomial logistic regression for multi-class classification
#### 4. Output: generates a submission file
#### 5. Evaluation: included if test labels are available; test labels are not available for the Kaggle competition
#### 6. Training: trains and evaluates the logistic regression model
#### 7. Validation: creates a validation set for local evaluation and prints validation metrics
#### 8. Model interpretation: inspects the coefficients for each class



In [14]:
# Import and load libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load data
train = pd.read_csv('Otrain.csv')
test = pd.read_csv('Otest.csv')
sample_sub = pd.read_csv('Osample_submission.csv')

# Prepare features and target
X_train = train.drop('NObeyesdad', axis=1)
y_train = train['NObeyesdad']
X_test = test.drop('NObeyesdad', axis=1) if 'NObeyesdad' in test.columns else test.copy()

# Identify categorical and numerical columns
# NOTE: 'id' is a row identifier, but it is numeric and is therefore selected as a feature here
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_cols}")
print(f"Numerical columns: {numerical_cols}")

# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Encode target variable
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)

# Preprocess the data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

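# Train the Logistic Regression model on the full training set
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_processed, y_train_encoded)

# Create validation set for local evaluation
# NOTE: these rows are a subset of the data the model was fitted on above,
# so the metrics printed below are in-sample and likely optimistic.
X_train_split, X_val, y_train_split, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)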

# Preprocess validation data
X_val_processed = preprocessor.transform(X_val)
y_val_encoded = le.transform(y_val)

# Make predictions on validation set
val_preds = le.inverse_transform(logreg.predict(X_val_processed))

# Print validation metrics
print("\nValidation Metrics:")
print(f"Accuracy: {accuracy_score(y_val, val_preds):.4f}")
print(classification_report(y_val, val_preds))

# Generate submission file
test_preds = le.inverse_transform(logreg.predict(X_test_processed))
submission = sample_sub.copy()
submission['NObeyesdad'] = test_preds
submission.to_csv("multinomialLogR.csv", index=False)

# Model interpretation (coefficients)
feature_names = (numerical_cols + 
                list(preprocessor.named_transformers_['cat']
                    .get_feature_names_out(categorical_cols)))

print("\nModel Coefficients:")
for i, class_name in enumerate(le.classes_):
    print(f"\nClass {class_name}:")
    coef_df = pd.DataFrame({'Feature': feature_names, 
                           'Coefficient': logreg.coef_[i]})
    print(coef_df.sort_values('Coefficient', ascending=False).head(10))

Categorical columns: ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS']
Numerical columns: ['id', 'Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']

Validation Metrics:
Accuracy: 0.8702
                     precision    recall  f1-score   support

Insufficient_Weight       0.88      0.94      0.91       524
      Normal_Weight       0.86      0.81      0.83       626
     Obesity_Type_I       0.82      0.84      0.83       543
    Obesity_Type_II       0.96      0.97      0.97       657
   Obesity_Type_III       1.00      1.00      1.00       804
 Overweight_Level_I       0.74      0.73      0.73       484
Overweight_Level_II       0.73      0.70      0.72       514

           accuracy                           0.87      4152
          macro avg       0.85      0.86      0.86      4152
       weighted avg       0.87      0.87      0.87      4152


Model Coefficients:

Class Insufficient_Weight:
                              Fea

# Interpretation of the Multinomial Logistic Regression Model
# 1. Overall Model Performance (Validation Metrics)
#### Accuracy: 87.02%, a strong overall result. The validation rows were also part of the data used to fit the model, so this figure is likely optimistic.
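
The accuracy above is computed on rows that were also used to fit the model, because `logreg.fit` runs on the full training set before `train_test_split` is called. A minimal sketch of a held-out evaluation follows, assuming scikit-learn's `Pipeline` and a small synthetic stand-in for `Otrain.csv`. The `demo` frame, its generated values, and the four BMI-derived labels are hypothetical and serve only to illustrate the split-then-fit order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training file (hypothetical values).
rng = np.random.default_rng(42)
n = 600
demo = pd.DataFrame({
    "id": np.arange(n),
    "Height": rng.uniform(1.50, 1.95, n),   # assumed to be metres
    "Weight": rng.uniform(40.0, 140.0, n),  # assumed to be kilograms
    "Gender": rng.choice(["Male", "Female"], n),
})
bmi = demo["Weight"] / demo["Height"] ** 2
demo["NObeyesdad"] = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Insufficient_Weight", "Normal_Weight", "Overweight", "Obesity"],
).astype(str)

# Drop the row identifier so it is not treated as a predictor.
X = demo.drop(columns=["id", "NObeyesdad"])
y = demo["NObeyesdad"]

# Split first, so the validation rows are never seen during fitting.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Height", "Weight"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])

# The pipeline fits the scaler, the encoder, and the classifier on X_tr only.
model = Pipeline(steps=[
    ("pre", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)

val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Held-out accuracy: {val_acc:.4f}")
```

After local evaluation, the same pipeline can be refitted on all training rows before predicting the competition test set.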

#### Class-wise Performance:

##### Near-perfect classification: Obesity_Type_III (precision and recall both round to 1.00)

##### Best of the remaining classes: Obesity_Type_II (F1 = 0.97), Insufficient_Weight (F1 = 0.91)

##### Most challenging: Overweight_Level_I and Overweight_Level_II (F1 = 0.73 and 0.72)

#### Key Insight: the model excels at identifying the extreme weight categories but struggles with the intermediate overweight classes.
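
To see which neighbouring classes the intermediate categories are confused with, a confusion matrix is more informative than per-class F1 alone. The sketch below uses `sklearn.metrics.confusion_matrix` on a handful of made-up labels; `y_true_demo` and `y_pred_demo` are hypothetical and are not taken from the run above.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical labels for three adjacent classes (not real model output).
labels = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_II"]
y_true_demo = [
    "Normal_Weight", "Normal_Weight", "Normal_Weight",
    "Overweight_Level_I", "Overweight_Level_I", "Overweight_Level_I",
    "Overweight_Level_II", "Overweight_Level_II", "Overweight_Level_II",
]
y_pred_demo = [
    "Normal_Weight", "Normal_Weight", "Overweight_Level_I",
    "Overweight_Level_I", "Normal_Weight", "Overweight_Level_II",
    "Overweight_Level_II", "Overweight_Level_II", "Overweight_Level_I",
]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# Per-class recall is the diagonal divided by the row totals.
recall = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, recall.round(2))))
```

With the notebook's variables, the equivalent call would be `confusion_matrix(y_val, val_preds, labels=le.classes_)`.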

# 2. Feature Importance by Class
Values in parentheses are the fitted coefficients for that class. Numerical features were standardized, so each coefficient refers to a one-standard-deviation change.

## Insufficient_Weight
### Top Predictors:
1. Height (3.78): at a given weight, taller individuals are more likely to be underweight.
2. Automobile transport (1.25): car users are more likely to be classified as underweight.
3. Frequent eating between meals (CAEC_Frequently, 1.18): counter-intuitive; this needs domain validation.

## Normal_Weight
### Key Drivers:
1. Height (2.03): height remains an important predictor.
2. Motorbike transport (0.89): motorbike users are more likely to be classified as normal weight.
3. Male gender (0.62): males are more likely to be classified as normal weight.

## Obesity Type I
### Dominant Factors:
1. Weight (7.98): by far the strongest predictor of Obesity Type I.
2. Female gender (0.67): a higher predicted risk for females than for males.
3. Bike transport (0.36): possibly reflects reverse causality.

## Obesity Type II
### Extreme Predictors:
1. Weight (14.77): a very large coefficient, and the strongest predictor of Obesity Type II.
2. Male gender (2.01): a strong male association.
3. Age (1.22): older age increases the predicted risk.

## Obesity Type III
### Critical Factors:
1. Weight (12.61) and female gender (3.31): weight is the strongest predictor of the severe obesity profile, followed by female gender.
2. FCVC (vegetable consumption frequency, 3.21): paradoxical, possibly due to reporting bias.
3. Public transport (2.17): possibly an indicator of limited mobility.

## Overweight Classes
### Different Patterns:
1. Level I: CAEC_no (1.18, no eating between meals) and Height (0.66)
2. Level II: Weight (1.30) and CALC_Frequently (0.68, frequent alcohol consumption)
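
When reading these coefficients, it helps to recall that a multinomial logistic regression turns per-class linear scores into probabilities with a softmax, so only differences between class scores matter. The sketch below illustrates this with the Weight coefficients quoted above; the `softmax` helper, the `odds_multiplier` function, and the example `scores` are hypothetical names and values introduced for illustration, and the calculation assumes all other features are held fixed.

```python
import numpy as np

def softmax(scores):
    """Convert per-class linear scores into probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Weight coefficients as reported in the text (per one standard deviation).
weight_coef = {
    "Obesity_Type_I": 7.98,
    "Obesity_Type_II": 14.77,
    "Obesity_Type_III": 12.61,
    "Overweight_Level_II": 1.30,
}

def odds_multiplier(class_a, class_b, delta_sd=1.0):
    """Factor by which the odds of class_a versus class_b change
    when standardized Weight increases by delta_sd."""
    diff = weight_coef[class_a] - weight_coef[class_b]
    return float(np.exp(diff * delta_sd))

ratio = odds_multiplier("Obesity_Type_II", "Obesity_Type_I")
print(f"Odds multiplier, Type II vs Type I, per +1 SD of Weight: {ratio:.1f}")

# Adding the same constant to every class score leaves the probabilities
# unchanged, which is why a single coefficient is not meaningful on its own.
scores = np.array([0.5, 1.0, -0.3])  # hypothetical class scores
probs = softmax(scores)
probs_shifted = softmax(scores + 10.0)
print(probs.round(3), probs_shifted.round(3))
```

A large coefficient for one class therefore signals a strong effect only relative to the coefficients of the other classes for the same feature.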

# 3. Key Biological and Behavioral Insights
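## Weight/Height Duality:
1. Weight drives the obesity classes

2. Height, at a given weight, shifts predictions toward underweight and normal weight

## Gender Differences:
1. Female gender has positive coefficients for Obesity Types I and III

2. Male gender has positive coefficients for Obesity Type II and normal weight

## Transport Matters:
1. Motorbike transport is associated with normal weight (0.89), while bike transport appears only as a weak predictor of Obesity Type I (0.36)

2. Public transport is associated with Obesity Type III (2.17), while automobile transport is associated with underweight (1.25)

These are associations in the fitted model, not causal effects.

## Data Quality Flags:

Counter-intuitive findings, e.g., higher vegetable consumption associated with obesity, suggest possible measurement issues or complex interactions.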

# 4. Recommendations for Improvement
1. Address Class Imbalance: the overweight classes have the smallest support (484 and 514 rows) and show the lowest recall

2. Feature Engineering:

   a. Create BMI (weight/height²) as composite feature

   b. Investigate interaction terms (e.g., Gender × Transport)

3. Data Collection:

   a. Validate self-reported measures (FCVC, CAEC)

   b. Consider adding socioeconomic features

4. Model Tuning:

   a. Apply class weights to up-weight the overweight classes

   b. Try non-linear models that can capture complex interactions, such as XGBoost and Random Forest.
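
Two of the recommendations above, the BMI feature and class weighting, can be sketched briefly. The example assumes `Height` is in metres and `Weight` in kilograms; the `add_bmi` helper and the small `demo` frame are hypothetical, and the class counts are the validation support values from the classification report, used here only to illustrate how scikit-learn's `"balanced"` weights are derived.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

def add_bmi(df):
    """Return a copy of df with a BMI column (kg / m^2)."""
    out = df.copy()
    out["BMI"] = out["Weight"] / out["Height"] ** 2
    return out

# Hypothetical rows, for illustration only.
demo = pd.DataFrame({"Height": [1.60, 1.75, 1.80],
                     "Weight": [50.0, 70.0, 110.0]})
demo_bmi = add_bmi(demo)
print(demo_bmi)

# Support values from the validation report above.
support = {
    "Insufficient_Weight": 524, "Normal_Weight": 626,
    "Obesity_Type_I": 543, "Obesity_Type_II": 657,
    "Obesity_Type_III": 804, "Overweight_Level_I": 484,
    "Overweight_Level_II": 514,
}
classes = np.array(list(support))
y_demo = np.repeat(classes, list(support.values()))

# "balanced" weights are n_samples / (n_classes * class_count),
# so the rarer classes receive larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
class_weights = dict(zip(support, weights))
print({name: round(w, 3) for name, w in class_weights.items()})

# The same weighting can be requested directly on the estimator.
weighted_logreg = LogisticRegression(max_iter=1000, class_weight="balanced")
```

Whether either change improves the overweight classes would need to be checked on a held-out split.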

# 5. Deployment Considerations
Strongest Predictors: Weight, height, transport mode

Ethical Notes: Gender/transport features may require careful handling to avoid bias

# 6. Conclusion:
This model provides clinically meaningful insights into obesity risk while achieving strong accuracy (87.02% on the validation split). The coefficients largely align with known epidemiological patterns of obesity, while revealing some surprising behavioral associations that are worth further investigation.