# LifeSync Project - Machine Learning Model Training

This notebook implements the model training step for the LifeSync project. We'll build and evaluate multiple models to predict Happiness Score and Stress Level based on lifestyle factors.

## Steps:
1. Load and explore the dataset
2. Prepare features and target variables
3. Train and tune multiple models for Happiness Score prediction
4. Compare the models and select the best one
5. Generate SHAP explainability visualizations for the Happiness model
6. Train and tune models for Stress Level prediction
7. Compare the Stress Level models and select the best one
8. Generate SHAP explainability visualizations for the Stress model
9. Generate the final model summary

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import shap
import os

# Create the output directory if it does not already exist
os.makedirs('../outputs', exist_ok=True)

# Set random seed for reproducibility
np.random.seed(42)

# Configure plot styles (using compatible style names)
plt.style.use('seaborn-v0_8') 
sns.set_theme(style="whitegrid")  # Modern seaborn syntax
sns.set_palette('viridis')

## Step 1: Load and Explore the Dataset

In [4]:
# Load the dataset
dataset_path = '../Mental_Health_Lifestyle_Dataset.csv'
df = pd.read_csv(dataset_path)

# Display basic information
print(f"Dataset shape: {df.shape}")
print("\nFirst 5 rows:")
df.head()

Dataset shape: (3000, 12)

First 5 rows:


| | Country | Age | Gender | Exercise Level | Diet Type | Sleep Hours | Stress Level | Mental Health Condition | Work Hours per Week | Screen Time per Day (Hours) | Social Interaction Score | Happiness Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Brazil | 48 | Male | Low | Vegetarian | 6.3 | Low | NaN | 21 | 4.0 | 7.8 | 6.5 |
| 1 | Australia | 31 | Male | Moderate | Vegan | 4.9 | Low | PTSD | 48 | 5.2 | 8.2 | 6.8 |
| 2 | Japan | 37 | Female | Low | Vegetarian | 7.2 | High | NaN | 43 | 4.7 | 9.6 | 9.7 |
| 3 | Brazil | 35 | Male | Low | Vegan | 7.2 | Low | Depression | 43 | 2.2 | 8.2 | 6.6 |
| 4 | Germany | 46 | Male | Low | Balanced | 7.3 | Low | Anxiety | 35 | 3.6 | 4.7 | 4.4 |


In [5]:
# Check data types and missing values
print("Data types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

Data types:
Country                         object
Age                              int64
Gender                          object
Exercise Level                  object
Diet Type                       object
Sleep Hours                    float64
Stress Level                    object
Mental Health Condition         object
Work Hours per Week              int64
Screen Time per Day (Hours)    float64
Social Interaction Score       float64
Happiness Score                float64
dtype: object

Missing values:
Country                          0
Age                              0
Gender                           0
Exercise Level                   0
Diet Type                        0
Sleep Hours                      0
Stress Level                     0
Mental Health Condition        595
Work Hours per Week              0
Screen Time per Day (Hours)      0
Social Interaction Score         0
Happiness Score                  0
dtype: int64


In [6]:
# Statistical summary
df.describe()

| | Age | Sleep Hours | Work Hours per Week | Screen Time per Day (Hours) | Social Interaction Score | Happiness Score |
|---|---|---|---|---|---|---|
| count | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 | 3000.0 |
| mean | 41.229667 | 6.475933 | 39.466333 | 5.089833 | 5.4702 | 5.395067 |
| std | 13.428416 | 1.499866 | 11.451459 | 1.747231 | 2.563532 | 2.557601 |
| min | 18.0 | 1.4 | 20.0 | 2.0 | 1.0 | 1.0 |
| 25% | 30.0 | 5.5 | 30.0 | 3.6 | 3.3 | 3.2 |
| 50% | 41.0 | 6.5 | 39.0 | 5.1 | 5.5 | 5.4 |
| 75% | 53.0 | 7.5 | 50.0 | 6.6 | 7.6 | 7.5 |
| max | 64.0 | 11.3 | 59.0 | 8.0 | 10.0 | 10.0 |


In [7]:
# Check unique values for categorical columns
categorical_cols = ['Country', 'Gender', 'Exercise Level', 'Diet Type', 'Stress Level', 'Mental Health Condition']
for col in categorical_cols:
    print(f"\n{col} unique values:")
    print(df[col].value_counts())


Country unique values:
Country
USA          446
Japan        439
Australia    434
India        434
Canada       428
Brazil       415
Germany      404
Name: count, dtype: int64

Gender unique values:
Gender
Female    1024
Other      996
Male       980
Name: count, dtype: int64

Exercise Level unique values:
Exercise Level
Low         1033
Moderate     998
High         969
Name: count, dtype: int64

Diet Type unique values:
Diet Type
Junk Food     637
Balanced      625
Vegetarian    592
Vegan         573
Keto          573
Name: count, dtype: int64

Stress Level unique values:
Stress Level
Low         1008
High        1002
Moderate     990
Name: count, dtype: int64

Mental Health Condition unique values:
Mental Health Condition
Anxiety       628
PTSD          624
Depression    580
Bipolar       573
Name: count, dtype: int64


## Step 2: Prepare Features and Target Variables

Let's prepare our data for modeling by:
1. Converting categorical variables to numeric
2. Separating features from target variables
3. Splitting into training and testing sets
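
As a minimal sketch of the two encoding styles used in the next cell, the toy frame below (hypothetical values, not rows from the dataset) maps an ordered category to integers and one-hot encodes an unordered one. It also shows that `pd.get_dummies`, with its default settings, leaves every indicator column off for a missing value, which is how rows without a `Mental Health Condition` are represented after encoding.

```python
import pandas as pd

# Toy data for illustration only; the third row has a missing Diet Type
toy = pd.DataFrame({
    "Exercise Level": ["Low", "High", "Moderate"],
    "Diet Type": ["Vegan", "Keto", None],
})

# Ordinal encoding: the categories have a natural order, so map them to integers
order = {"Low": 1, "Moderate": 2, "High": 3}
toy["Exercise Level"] = toy["Exercise Level"].map(order)

# One-hot encoding: no natural order, so create one indicator column per category
dummies = pd.get_dummies(toy["Diet Type"], prefix="Diet")
encoded = pd.concat([toy.drop(columns="Diet Type"), dummies], axis=1)

print(encoded)
```

The missing value produces no `Diet_` indicator at all, so tree models see it as "none of the listed categories" rather than as a separate category.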

In [8]:
# Create a copy of the dataframe to avoid modifying the original
df_processed = df.copy()

# Encode categorical variables
# Exercise Level: Low (1), Moderate (2), High (3)
exercise_mapping = {'Low': 1, 'Moderate': 2, 'High': 3}
df_processed['Exercise Level'] = df_processed['Exercise Level'].map(exercise_mapping)

# Diet Type: Using one-hot encoding
diet_dummies = pd.get_dummies(df_processed['Diet Type'], prefix='Diet')
df_processed = pd.concat([df_processed, diet_dummies], axis=1)
df_processed.drop('Diet Type', axis=1, inplace=True)

# Stress Level: Low (1), Moderate (2), High (3)
stress_mapping = {'Low': 1, 'Moderate': 2, 'High': 3}
df_processed['Stress Level'] = df_processed['Stress Level'].map(stress_mapping)

# Gender: Using one-hot encoding
gender_dummies = pd.get_dummies(df_processed['Gender'], prefix='Gender')
df_processed = pd.concat([df_processed, gender_dummies], axis=1)
df_processed.drop('Gender', axis=1, inplace=True)

# Mental Health Condition: Using one-hot encoding
# (the 595 rows with a missing value get False in every MH_ column)
mh_dummies = pd.get_dummies(df_processed['Mental Health Condition'], prefix='MH')
df_processed = pd.concat([df_processed, mh_dummies], axis=1)
df_processed.drop('Mental Health Condition', axis=1, inplace=True)

# Country: Using one-hot encoding
country_dummies = pd.get_dummies(df_processed['Country'], prefix='Country')
df_processed = pd.concat([df_processed, country_dummies], axis=1)
df_processed.drop('Country', axis=1, inplace=True)

# Display the processed dataframe
print(f"Processed dataframe shape: {df_processed.shape}")
df_processed.head()

Processed dataframe shape: (3000, 27)


| | Age | Exercise Level | Sleep Hours | Stress Level | Work Hours per Week | Screen Time per Day (Hours) | Social Interaction Score | Happiness Score | Diet_Balanced | Diet_Junk Food | ... | MH_Bipolar | MH_Depression | MH_PTSD | Country_Australia | Country_Brazil | Country_Canada | Country_Germany | Country_India | Country_Japan | Country_USA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48 | 1 | 6.3 | 1 | 21 | 4.0 | 7.8 | 6.5 | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 1 | 31 | 2 | 4.9 | 1 | 48 | 5.2 | 8.2 | 6.8 | False | False | ... | False | False | True | True | False | False | False | False | False | False |
| 2 | 37 | 1 | 7.2 | 3 | 43 | 4.7 | 9.6 | 9.7 | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | 35 | 1 | 7.2 | 1 | 43 | 2.2 | 8.2 | 6.6 | False | False | ... | False | True | False | False | True | False | False | False | False | False |
| 4 | 46 | 1 | 7.3 | 1 | 35 | 3.6 | 4.7 | 4.4 | True | False | ... | False | False | False | False | False | False | True | False | False | False |


In [9]:
# Define feature sets and target variables

# Target variables
happiness_target = df_processed['Happiness Score']
stress_target = df_processed['Stress Level']

# Features (excluding the target variables)
feature_columns = [col for col in df_processed.columns if col not in ['Happiness Score', 'Stress Level']]
features = df_processed[feature_columns]

print(f"Number of features: {len(feature_columns)}")
print(f"Feature columns: {feature_columns}")

# Split the data into training and testing sets (for Happiness Score)
X_train_happiness, X_test_happiness, y_train_happiness, y_test_happiness = train_test_split(
    features, happiness_target, test_size=0.2, random_state=42)

# Split the data into training and testing sets (for Stress Level)
X_train_stress, X_test_stress, y_train_stress, y_test_stress = train_test_split(
    features, stress_target, test_size=0.2, random_state=42)

print(f"\nTraining set size (Happiness Score): {X_train_happiness.shape}")
print(f"Testing set size (Happiness Score): {X_test_happiness.shape}")
print(f"Training set size (Stress Level): {X_train_stress.shape}")
print(f"Testing set size (Stress Level): {X_test_stress.shape}")

Number of features: 25
Feature columns: ['Age', 'Exercise Level', 'Sleep Hours', 'Work Hours per Week', 'Screen Time per Day (Hours)', 'Social Interaction Score', 'Diet_Balanced', 'Diet_Junk Food', 'Diet_Keto', 'Diet_Vegan', 'Diet_Vegetarian', 'Gender_Female', 'Gender_Male', 'Gender_Other', 'MH_Anxiety', 'MH_Bipolar', 'MH_Depression', 'MH_PTSD', 'Country_Australia', 'Country_Brazil', 'Country_Canada', 'Country_Germany', 'Country_India', 'Country_Japan', 'Country_USA']

Training set size (Happiness Score): (2400, 25)
Testing set size (Happiness Score): (600, 25)
Training set size (Stress Level): (2400, 25)
Testing set size (Stress Level): (600, 25)


## Step 3: Train and Tune Models for Happiness Score Prediction

We'll train four different regression models:
1. Decision Tree Regressor
2. Random Forest Regressor
3. Gradient Boosting Regressor
4. XGBoost Regressor

For each model, we'll use GridSearchCV to find the best hyperparameters.
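
To give a sense of the search cost, the sketch below enumerates the parameter grids used in this step (copied from the cells that follow) and counts how many model fits a 5-fold grid search implies. It uses only the standard library; the helper name `candidates` is illustrative and not part of scikit-learn.

```python
from itertools import product

# Parameter grids copied from the cells in this step
grids = {
    "Decision Tree": {
        "max_depth": [3, 5, 7, 10, None],
        "min_samples_split": [2, 5, 10],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [5, 10, 15, None],
        "min_samples_split": [2, 5, 10],
    },
    "Gradient Boosting": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
    },
    "XGBoost": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 7],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.8, 0.9, 1.0],
    },
}
CV_FOLDS = 5


def candidates(grid):
    """Yield every parameter combination in the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


fit_counts = {}
for model_name, grid in grids.items():
    n_candidates = sum(1 for _ in candidates(grid))
    fit_counts[model_name] = n_candidates * CV_FOLDS
    print(f"{model_name}: {n_candidates} candidates x {CV_FOLDS} folds "
          f"= {fit_counts[model_name]} fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the winning combination on the full training set, which is the model exposed as `best_estimator_`.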

In [10]:
# Define function to evaluate model performance
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name):
    """Score a fitted regressor on the train and test sets and return its metrics as a dict."""
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Calculate metrics
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
    
    # Print metrics
    print(f"\n{model_name} Performance:")
    print(f"R² (Train): {r2_train:.4f}, R² (Test): {r2_test:.4f}")
    print(f"MAE (Train): {mae_train:.4f}, MAE (Test): {mae_test:.4f}")
    print(f"RMSE (Train): {rmse_train:.4f}, RMSE (Test): {rmse_test:.4f}")
    
    # Cross-validation scores
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print(f"5-Fold CV R² Scores: {cv_scores}")
    print(f"Mean CV R²: {cv_scores.mean():.4f}, Std: {cv_scores.std():.4f}")
    
    # Return metrics dictionary
    metrics = {
        'model_name': model_name,
        'r2_train': r2_train,
        'r2_test': r2_test,
        'mae_train': mae_train,
        'mae_test': mae_test,
        'rmse_train': rmse_train,
        'rmse_test': rmse_test,
        'cv_r2_mean': cv_scores.mean(),
        'cv_r2_std': cv_scores.std()
    }
    
    return metrics
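
To make the metrics in `evaluate_model` concrete, here is a small standard-library sketch with toy numbers (not dataset values). The helper names `r2`, `mae`, and `rmse` are illustrative and are not the scikit-learn functions. It shows that always predicting the mean yields an R² of exactly 0, and that a model scores below 0 when it does worse than that baseline, which helps when reading the near-zero and negative scores reported in the outputs below.

```python
import math


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [2.0, 4.0, 6.0, 8.0]
mean_baseline = [5.0] * 4          # always predict the mean of y_true
reversed_pred = [8.0, 6.0, 4.0, 2.0]  # systematically wrong predictions

print("baseline R2:", r2(y_true, mean_baseline))
print("baseline MAE:", mae(y_true, mean_baseline))
print("baseline RMSE:", rmse(y_true, mean_baseline))
print("reversed R2:", r2(y_true, reversed_pred))
```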

In [11]:
# Model 1: Decision Tree Regressor with GridSearchCV
print("Training Decision Tree Regressor for Happiness Score...")

dt_param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10]
}

dt_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    dt_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

dt_grid.fit(X_train_happiness, y_train_happiness)
dt_best = dt_grid.best_estimator_

print(f"Best parameters: {dt_grid.best_params_}")
print(f"Best CV score: {dt_grid.best_score_:.4f}")

# Evaluate Decision Tree
dt_metrics = evaluate_model(dt_best, X_train_happiness, y_train_happiness, 
                           X_test_happiness, y_test_happiness, "Decision Tree")

Training Decision Tree Regressor for Happiness Score...
Best parameters: {'max_depth': 3, 'min_samples_split': 5}
Best CV score: -0.0195

Decision Tree Performance:
R² (Train): 0.0236, R² (Test): -0.0043
MAE (Train): 2.1656, MAE (Test): 2.2532
RMSE (Train): 2.5157, RMSE (Test): 2.6074
5-Fold CV R² Scores: [ 0.00525498 -0.00262762 -0.05505982 -0.03456341 -0.01051347]
Mean CV R²: -0.0195, Std: 0.0222


In [12]:
# Model 2: Random Forest Regressor with GridSearchCV
print("\nTraining Random Forest Regressor for Happiness Score...")

rf_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10]
}

rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    rf_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

rf_grid.fit(X_train_happiness, y_train_happiness)
rf_best = rf_grid.best_estimator_

print(f"Best parameters: {rf_grid.best_params_}")
print(f"Best CV score: {rf_grid.best_score_:.4f}")

# Evaluate Random Forest
rf_metrics = evaluate_model(rf_best, X_train_happiness, y_train_happiness, 
                           X_test_happiness, y_test_happiness, "Random Forest")


Training Random Forest Regressor for Happiness Score...
Best parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 200}
Best CV score: -0.0105

Random Forest Performance:
R² (Train): 0.1164, R² (Test): 0.0036
MAE (Train): 2.0662, MAE (Test): 2.2538
RMSE (Train): 2.3932, RMSE (Test): 2.5971
5-Fold CV R² Scores: [-5.64693629e-03 -4.77533464e-03 -1.33948876e-02 -2.87042518e-02
 -7.31139638e-05]
Mean CV R²: -0.0105, Std: 0.0100


In [13]:
# Model 3: Gradient Boosting Regressor
print("\nTraining Gradient Boosting Regressor for Happiness Score...")

gb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1]
}

gb_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    gb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

gb_grid.fit(X_train_happiness, y_train_happiness)
gb_best = gb_grid.best_estimator_

print(f"Best parameters: {gb_grid.best_params_}")
print(f"Best CV score: {gb_grid.best_score_:.4f}")

# Evaluate Gradient Boosting
gb_metrics = evaluate_model(gb_best, X_train_happiness, y_train_happiness, 
                           X_test_happiness, y_test_happiness, "Gradient Boosting")


Training Gradient Boosting Regressor for Happiness Score...
Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
Best CV score: -0.0014

Gradient Boosting Performance:
R² (Train): 0.0179, R² (Test): 0.0023
MAE (Train): 2.1814, MAE (Test): 2.2536
RMSE (Train): 2.5230, RMSE (Test): 2.5987
5-Fold CV R² Scores: [-1.82458914e-03  1.47640281e-05 -1.33233903e-03 -7.66737483e-03
  3.57137643e-03]
Mean CV R²: -0.0014, Std: 0.0036


In [14]:
# Model 4: XGBoost Regressor
print("\nTraining XGBoost Regressor for Happiness Score...")

xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 0.9, 1.0]
}

xgb_grid = GridSearchCV(
    xgb.XGBRegressor(random_state=42),
    xgb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

xgb_grid.fit(X_train_happiness, y_train_happiness)
xgb_best = xgb_grid.best_estimator_

print(f"Best parameters: {xgb_grid.best_params_}")
print(f"Best CV score: {xgb_grid.best_score_:.4f}")

# Evaluate XGBoost
xgb_metrics = evaluate_model(xgb_best, X_train_happiness, y_train_happiness, 
                           X_test_happiness, y_test_happiness, "XGBoost")


Training XGBoost Regressor for Happiness Score...
Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
Best CV score: -0.0009

XGBoost Performance:
R² (Train): 0.0302, R² (Test): 0.0030
MAE (Train): 2.1674, MAE (Test): 2.2531
RMSE (Train): 2.5072, RMSE (Test): 2.5978
5-Fold CV R² Scores: [-0.00236995  0.00379966 -0.0019257  -0.01139132  0.00717501]
Mean CV R²: -0.0009, Std: 0.0063


## Step 4: Compare Models and Select the Best One for Happiness Score Prediction
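
Before running the comparison, it may help to see how close the four candidates are. The sketch below copies the test R² and mean 5-fold CV R² values from the Step 3 outputs and ranks the models by each criterion. It is only an illustration of the selection rule; the notebook itself selects on test R².

```python
# Scores copied from the Step 3 outputs: (test R², mean 5-fold CV R²)
scores = {
    "Decision Tree": (-0.0043, -0.0195),
    "Random Forest": (0.0036, -0.0105),
    "Gradient Boosting": (0.0023, -0.0014),
    "XGBoost": (0.0030, -0.0009),
}

# Rank by each criterion
best_by_test = max(scores, key=lambda name: scores[name][0])
best_by_cv = max(scores, key=lambda name: scores[name][1])

# Gap between the best and worst test R² across all four models
test_values = [test_r2 for test_r2, _ in scores.values()]
test_spread = max(test_values) - min(test_values)

print(f"Best by test R²: {best_by_test}")
print(f"Best by mean CV R²: {best_by_cv}")
print(f"Spread of test R² across models: {test_spread:.4f}")
```

The two criteria pick different models and every score is within about 0.01 of zero, so the ranking should be read as essentially a tie rather than as evidence that one model is better.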

In [15]:
# Compare model performances
happiness_models = [dt_metrics, rf_metrics, gb_metrics, xgb_metrics]
happiness_comparison_df = pd.DataFrame(happiness_models)

# Sort by test R² score in descending order
happiness_comparison_df = happiness_comparison_df.sort_values('r2_test', ascending=False)

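print("Model Comparison for Happiness Score Prediction:")
# Note: a bare expression is only rendered when it is the last statement in a cell,
# so wrap this in display(...) or print(...) to show the table at this point.
happiness_comparison_df[['model_name', 'r2_test', 'mae_test', 'rmse_test', 'cv_r2_mean']]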

# Save comparison metrics to CSV
happiness_comparison_df.to_csv('../outputs/happiness_model_comparison_metrics.csv', index=False)

# Select the best model based on test R² score
best_happiness_model_name = happiness_comparison_df.iloc[0]['model_name']
print(f"\nBest model for Happiness Score prediction: {best_happiness_model_name}")

# Map model name to actual model object
model_name_to_object = {
    'Decision Tree': dt_best,
    'Random Forest': rf_best,
    'Gradient Boosting': gb_best,
    'XGBoost': xgb_best
}

best_happiness_model = model_name_to_object[best_happiness_model_name]

# Save the best model
happiness_model_path = '../outputs/lifesync_happiness_model.pkl'
joblib.dump(best_happiness_model, happiness_model_path)
print(f"Best Happiness Score model saved to: {happiness_model_path}")

Model Comparison for Happiness Score Prediction:

Best model for Happiness Score prediction: Random Forest
Best Happiness Score model saved to: ../outputs/lifesync_happiness_model.pkl


## Step 5: Generate SHAP Explainability for Happiness Model
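
SHAP values are based on Shapley values, which split a prediction among the features by averaging each feature's marginal contribution over every order in which features could be added. The brute-force sketch below illustrates that definition on a hypothetical two-feature value function (`toy_value` and its numbers are invented for illustration). It is not how `shap.TreeExplainer` computes its values, since that class uses an algorithm specialized for tree models.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    return {feature: total / len(orderings) for feature, total in totals.items()}


def toy_value(present):
    """Hypothetical model output given the set of features that are 'known'."""
    value = 5.0                      # base prediction with no features
    if "sleep" in present:
        value += 2.0
    if "social" in present:
        value += 1.0
    if {"sleep", "social"} <= present:
        value += 1.0                 # interaction, shared equally by Shapley
    return value


phi = shapley_values(toy_value, ["sleep", "social"])
print(phi)
```

The contributions always sum to the difference between the full prediction and the base value. The bar-style summary plot saved in this step ranks features by the mean absolute SHAP value across the test rows.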

In [16]:
# Create SHAP explainer for the best happiness model
print(f"Generating SHAP values for {best_happiness_model_name}...")

# TreeExplainer handles all four tree-based candidates, so no per-model branch is needed
explainer = shap.TreeExplainer(best_happiness_model)
shap_values = explainer.shap_values(X_test_happiness)

# Create and save SHAP summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(shap_values, X_test_happiness, plot_type='bar', show=False)
plt.tight_layout()
plt.savefig('../outputs/shap_summary_happiness.png', dpi=300, bbox_inches='tight')
plt.close()

# Create and save SHAP summary dot plot
plt.figure(figsize=(12, 10))
shap.summary_plot(shap_values, X_test_happiness, show=False)
plt.tight_layout()
plt.savefig('../outputs/shap_dot_happiness.png', dpi=300, bbox_inches='tight')
plt.close()

print("SHAP plots for Happiness model saved.")

# Get feature importance from the best model
if hasattr(best_happiness_model, 'feature_importances_'):
    feature_importances = best_happiness_model.feature_importances_
    feature_importance_df = pd.DataFrame({
        'Feature': X_train_happiness.columns,
        'Importance': feature_importances
    }).sort_values(by='Importance', ascending=False)
    
    feature_importance_df.to_csv('../outputs/happiness_feature_importance.csv', index=False)
    print("Feature importance for Happiness model saved.")
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(15))
    plt.title(f'Top 15 Feature Importance for {best_happiness_model_name} (Happiness Score)')
    plt.tight_layout()
    plt.savefig('../outputs/happiness_feature_importance.png', dpi=300)
    plt.close()

Generating SHAP values for Random Forest...




SHAP plots for Happiness model saved.
Feature importance for Happiness model saved.


## Step 6: Train and Tune Models for Stress Level Prediction

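Now we'll repeat the same process for Stress Level prediction. Stress Level was encoded as an ordinal value in Step 2 (Low = 1, Moderate = 2, High = 3), so it is modeled here with the same four regressors and the same parameter grids used for Happiness Score.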
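
As a rough yardstick for the error metrics in this step, the sketch below computes the MAE and RMSE of a trivial predictor that always outputs the mean Stress Level code. The class counts are taken from the Step 1 `value_counts` output and the integer codes from the Step 2 mapping; the variable names are illustrative.

```python
import math

# Stress Level counts from Step 1, keyed by the ordinal codes from Step 2
counts = {1: 1008, 2: 990, 3: 1002}  # Low, Moderate, High

n = sum(counts.values())
mean_level = sum(level * c for level, c in counts.items()) / n

# Errors of a constant predictor that always returns mean_level
baseline_mae = sum(abs(level - mean_level) * c for level, c in counts.items()) / n
baseline_rmse = math.sqrt(
    sum((level - mean_level) ** 2 * c for level, c in counts.items()) / n
)

print(f"mean={mean_level:.3f}  MAE={baseline_mae:.4f}  RMSE={baseline_rmse:.4f}")
```

These baseline figures are computed on all 3,000 rows rather than the 600-row test split, so they are only an approximate reference point for the test MAE and RMSE values reported for each model.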

In [17]:
# Model 1: Decision Tree Regressor for Stress Level
print("Training Decision Tree Regressor for Stress Level...")

dt_stress_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    dt_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

dt_stress_grid.fit(X_train_stress, y_train_stress)
dt_stress_best = dt_stress_grid.best_estimator_

print(f"Best parameters: {dt_stress_grid.best_params_}")
print(f"Best CV score: {dt_stress_grid.best_score_:.4f}")

# Evaluate Decision Tree for Stress Level
dt_stress_metrics = evaluate_model(dt_stress_best, X_train_stress, y_train_stress, 
                                 X_test_stress, y_test_stress, "Decision Tree")

Training Decision Tree Regressor for Stress Level...
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Best CV score: -0.0274

Decision Tree Performance:
R² (Train): 0.0198, R² (Test): -0.0171
MAE (Train): 0.6882, MAE (Test): 0.6935
RMSE (Train): 0.8116, RMSE (Test): 0.8201
5-Fold CV R² Scores: [-0.02026765 -0.04667296 -0.02427944 -0.0213455  -0.0245212 ]
Mean CV R²: -0.0274, Std: 0.0098


In [18]:
# Model 2: Random Forest Regressor for Stress Level
print("\nTraining Random Forest Regressor for Stress Level...")

rf_stress_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    rf_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

rf_stress_grid.fit(X_train_stress, y_train_stress)
rf_stress_best = rf_stress_grid.best_estimator_

print(f"Best parameters: {rf_stress_grid.best_params_}")
print(f"Best CV score: {rf_stress_grid.best_score_:.4f}")

# Evaluate Random Forest for Stress Level
rf_stress_metrics = evaluate_model(rf_stress_best, X_train_stress, y_train_stress, 
                                 X_test_stress, y_test_stress, "Random Forest")


Training Random Forest Regressor for Stress Level...
Best parameters: {'max_depth': 5, 'min_samples_split': 2, 'n_estimators': 200}
Best CV score: -0.0057

Random Forest Performance:
R² (Train): 0.1087, R² (Test): -0.0067
MAE (Train): 0.6486, MAE (Test): 0.6799
RMSE (Train): 0.7739, RMSE (Test): 0.8159
5-Fold CV R² Scores: [-0.00821471 -0.01282651 -0.00645407  0.0005401  -0.00152551]
Mean CV R²: -0.0057, Std: 0.0048


In [19]:
# Model 3: Gradient Boosting Regressor for Stress Level
print("\nTraining Gradient Boosting Regressor for Stress Level...")

gb_stress_grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    gb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

gb_stress_grid.fit(X_train_stress, y_train_stress)
gb_stress_best = gb_stress_grid.best_estimator_

print(f"Best parameters: {gb_stress_grid.best_params_}")
print(f"Best CV score: {gb_stress_grid.best_score_:.4f}")

# Evaluate Gradient Boosting for Stress Level
gb_stress_metrics = evaluate_model(gb_stress_best, X_train_stress, y_train_stress, 
                                 X_test_stress, y_test_stress, "Gradient Boosting")


Training Gradient Boosting Regressor for Stress Level...
Best parameters: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
Best CV score: -0.0057

Gradient Boosting Performance:
R² (Train): 0.0140, R² (Test): 0.0004
MAE (Train): 0.6738, MAE (Test): 0.6686
RMSE (Train): 0.8140, RMSE (Test): 0.8130
5-Fold CV R² Scores: [-0.00721545 -0.01105423 -0.0023893  -0.00394895 -0.00396564]
Mean CV R²: -0.0057, Std: 0.0031


In [20]:
# Model 4: XGBoost Regressor for Stress Level
print("\nTraining XGBoost Regressor for Stress Level...")

xgb_stress_grid = GridSearchCV(
    xgb.XGBRegressor(random_state=42),
    xgb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1
)

xgb_stress_grid.fit(X_train_stress, y_train_stress)
xgb_stress_best = xgb_stress_grid.best_estimator_

print(f"Best parameters: {xgb_stress_grid.best_params_}")
print(f"Best CV score: {xgb_stress_grid.best_score_:.4f}")

# Evaluate XGBoost for Stress Level
xgb_stress_metrics = evaluate_model(xgb_stress_best, X_train_stress, y_train_stress, 
                                   X_test_stress, y_test_stress, "XGBoost")


Training XGBoost Regressor for Stress Level...
Best parameters: {'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 50, 'subsample': 0.8}
Best CV score: -0.0003

XGBoost Performance:
R² (Train): 0.1288, R² (Test): 0.0014
MAE (Train): 0.6368, MAE (Test): 0.6737
RMSE (Train): 0.7652, RMSE (Test): 0.8126
5-Fold CV R² Scores: [-0.00433099 -0.00524187 -0.00199151  0.00645107 -0.00038803]
Mean CV R²: -0.0011, Std: 0.0041


## Step 7: Compare Models and Select the Best One for Stress Level Prediction

In [21]:
# Compare model performances for Stress Level
stress_models = [dt_stress_metrics, rf_stress_metrics, gb_stress_metrics, xgb_stress_metrics]
stress_comparison_df = pd.DataFrame(stress_models)

# Sort by test R² score in descending order
stress_comparison_df = stress_comparison_df.sort_values('r2_test', ascending=False)

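print("Model Comparison for Stress Level Prediction:")
# Note: a bare expression is only rendered when it is the last statement in a cell,
# so wrap this in display(...) or print(...) to show the table at this point.
stress_comparison_df[['model_name', 'r2_test', 'mae_test', 'rmse_test', 'cv_r2_mean']]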

# Save comparison metrics to CSV
stress_comparison_df.to_csv('../outputs/stress_model_comparison_metrics.csv', index=False)

# Combine both model comparisons for an overall metrics file
all_models_df = pd.concat([
    happiness_comparison_df.assign(Target='Happiness Score'),
    stress_comparison_df.assign(Target='Stress Level')
])
all_models_df.to_csv('../outputs/model_comparison_metrics.csv', index=False)

# Select the best model based on test R² score
best_stress_model_name = stress_comparison_df.iloc[0]['model_name']
print(f"\nBest model for Stress Level prediction: {best_stress_model_name}")

# Map model name to actual model object
stress_model_name_to_object = {
    'Decision Tree': dt_stress_best,
    'Random Forest': rf_stress_best,
    'Gradient Boosting': gb_stress_best,
    'XGBoost': xgb_stress_best
}

best_stress_model = stress_model_name_to_object[best_stress_model_name]

# Save the best model
stress_model_path = '../outputs/lifesync_stress_model.pkl'
joblib.dump(best_stress_model, stress_model_path)
print(f"Best Stress Level model saved to: {stress_model_path}")

Model Comparison for Stress Level Prediction:

Best model for Stress Level prediction: XGBoost
Best Stress Level model saved to: ../outputs/lifesync_stress_model.pkl


## Step 8: Generate SHAP Explainability for Stress Model
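
Besides the SHAP plots, the cell below merges the Happiness and Stress feature-importance tables with an outer join and fills gaps with 0. The toy sketch here (invented feature lists and values) shows what that merge does when the two tables do not list the same features. In this notebook both models are trained on the same 25 columns, so the fill should never actually trigger.

```python
import pandas as pd

# Toy importance tables for illustration only
happy = pd.DataFrame({"Feature": ["Age", "Sleep Hours"], "Importance": [0.6, 0.4]})
stress = pd.DataFrame({"Feature": ["Age", "Gender_Male"], "Importance": [0.7, 0.3]})

# Outer join keeps every feature from either table; missing entries become 0
combined = pd.merge(
    happy.rename(columns={"Importance": "Happiness_Importance"}),
    stress.rename(columns={"Importance": "Stress_Importance"}),
    on="Feature",
    how="outer",
).fillna(0)

by_feature = combined.set_index("Feature")
print(by_feature)
```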

In [22]:
# Create SHAP explainer for the best stress model
print(f"Generating SHAP values for {best_stress_model_name}...")

# TreeExplainer handles all four tree-based candidates, so no per-model branch is needed
stress_explainer = shap.TreeExplainer(best_stress_model)
stress_shap_values = stress_explainer.shap_values(X_test_stress)

# Create and save SHAP summary plot
plt.figure(figsize=(10, 8))
shap.summary_plot(stress_shap_values, X_test_stress, plot_type='bar', show=False)
plt.tight_layout()
plt.savefig('../outputs/shap_summary_stress.png', dpi=300, bbox_inches='tight')
plt.close()

# Create and save SHAP summary dot plot
plt.figure(figsize=(12, 10))
shap.summary_plot(stress_shap_values, X_test_stress, show=False)
plt.tight_layout()
plt.savefig('../outputs/shap_dot_stress.png', dpi=300, bbox_inches='tight')
plt.close()

print("SHAP plots for Stress model saved.")

# Get feature importance from the best model
if hasattr(best_stress_model, 'feature_importances_'):
    stress_feature_importances = best_stress_model.feature_importances_
    stress_feature_importance_df = pd.DataFrame({
        'Feature': X_train_stress.columns,
        'Importance': stress_feature_importances
    }).sort_values(by='Importance', ascending=False)
    
    stress_feature_importance_df.to_csv('../outputs/stress_feature_importance.csv', index=False)
    print("Feature importance for Stress model saved.")
    
    # Combine both feature importances
    feature_importance_combined = pd.merge(
        feature_importance_df.rename(columns={'Importance': 'Happiness_Importance'}),
        stress_feature_importance_df.rename(columns={'Importance': 'Stress_Importance'}),
        on='Feature', how='outer'
    ).fillna(0)
    
    feature_importance_combined.to_csv('../outputs/feature_importance.csv', index=False)
    print("Combined feature importance saved.")
    
    # Plot feature importance
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=stress_feature_importance_df.head(15))
    plt.title(f'Top 15 Feature Importance for {best_stress_model_name} (Stress Level)')
    plt.tight_layout()
    plt.savefig('../outputs/stress_feature_importance.png', dpi=300)
    plt.close()

Generating SHAP values for XGBoost...




SHAP plots for Stress model saved.
Feature importance for Stress model saved.
Combined feature importance saved.


## Step 9: Generate Final Model Summary

Let's create a summary of our best models and their key features.

In [23]:
# Create a summary of the best models
summary_data = {
    'Happiness Score': {
        'Model': best_happiness_model_name,
        'R²': happiness_comparison_df.iloc[0]['r2_test'],
        'MAE': happiness_comparison_df.iloc[0]['mae_test'],
        'RMSE': happiness_comparison_df.iloc[0]['rmse_test'],
        'Top Features': feature_importance_df.head(5)['Feature'].tolist()
    },
    'Stress Level': {
        'Model': best_stress_model_name,
        'R²': stress_comparison_df.iloc[0]['r2_test'],
        'MAE': stress_comparison_df.iloc[0]['mae_test'],
        'RMSE': stress_comparison_df.iloc[0]['rmse_test'],
        'Top Features': stress_feature_importance_df.head(5)['Feature'].tolist()
    }
}

print("\n--- FINAL MODEL SUMMARY ---")
for target, details in summary_data.items():
    print(f"\n{target}:")
    print(f"Best Model: {details['Model']}")
    print(f"R² Score: {details['R²']:.4f}")
    print(f"Mean Absolute Error: {details['MAE']:.4f}")
    print(f"Root Mean Squared Error: {details['RMSE']:.4f}")
    print(f"Top 5 Features: {', '.join(details['Top Features'])}")

# Save the model summary to a text file
with open('../outputs/model_summary.txt', 'w') as f:
    f.write("LifeSync Model Training Summary\n")
    f.write("==============================\n\n")
    
    for target, details in summary_data.items():
        f.write(f"{target} Prediction:\n")
        f.write(f"- Best Model: {details['Model']}\n")
        f.write(f"- R² Score: {details['R²']:.4f}\n")
        f.write(f"- Mean Absolute Error: {details['MAE']:.4f}\n")
        f.write(f"- Root Mean Squared Error: {details['RMSE']:.4f}\n")
        f.write(f"- Top 5 Features: {', '.join(details['Top Features'])}\n\n")

print("\nModel summary saved to '../outputs/model_summary.txt'")
print("\nModel training completed successfully!")


--- FINAL MODEL SUMMARY ---

Happiness Score:
Best Model: Random Forest
R² Score: 0.0036
Mean Absolute Error: 2.2538
Root Mean Squared Error: 2.5971
Top 5 Features: Sleep Hours, Age, Social Interaction Score, Work Hours per Week, Screen Time per Day (Hours)

Stress Level:
Best Model: XGBoost
R² Score: 0.0014
Mean Absolute Error: 0.6737
Root Mean Squared Error: 0.8126
Top 5 Features: Social Interaction Score, Country_Canada, Gender_Male, Country_India, Screen Time per Day (Hours)

Model summary saved to '../outputs/model_summary.txt'

Model training completed successfully!
