# Capstone2 Modeling Mental Health Dataset

## Table of Contents
- [1. Overview](#1.-Overview)
- [2. Import Libraries](#2.-Import-Libraries)
- [3. Load Data](#3.-Load-Data)
- [4. Analyze Data](#4.-Analyze-Data)
- [5. Modeling](#5.-Modeling)
    - [5.1 Global Model](#5.1-Global-Model)
    - [5.2 Segmented Country Models](#5.2-Segmented-Country-Models)
        - [5.2.1 Load Segments](#5.2.1-Load-Segments)
        - [5.2.2 Modeling High Performing Countries](#5.2.2-Modeling-High-Performing-Countries)
        - [5.2.3 Modeling Low Performing Countries](#5.2.3-Modeling-Low-Performing-Countries) 
    - [5.3 Combined Global and Segmented Models](#5.3-Combined-Global-and-Segmented-Models)
    - [5.4 Cross Validation Score](#5.4-Cross-Validation-Score)
- [6. Summary](#6.-Summary)

## 1. Overview

The focus of the mental health modeling project is to develop a predictive model that identifies individuals likely to need mental health support. This analysis is based on a comprehensive dataset of categorical features, encompassing over 290,000 observations. Key factors considered in the model include demographics, mental health conditions, sentiment analysis scores, and psychological indicators.

The project builds on a prior feature engineering phase that used logistic regression for both the global and segmented models. That baseline reached an accuracy of 0.69 and an AUC of 0.76 on the global model, and an accuracy of 0.69 and an AUC of 0.73 on the high-performing segment. The low-performing segment reached an accuracy of 0.54 and an AUC of 0.55.

In the upcoming modeling phase, we will implement XGBoost, LightGBM, and Random Forest algorithms, focusing on fine-tuning hyperparameter settings to enhance performance. We will also explore a stacking approach to optimize how these models are combined for better results. Additionally, we will create separate models for the high and low segments to address their specific needs and assess how these can be integrated with the global model for improved performance.
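As a rough illustration of the stacking idea described above, the sketch below trains two tree-based base learners and a logistic regression meta-learner with scikit-learn's `StackingClassifier`. It is a minimal sketch under stated assumptions, not the notebook's `utils.train_stacked_models`: the data is synthetic, and `GradientBoostingClassifier` stands in for XGBoost and LightGBM so that the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the notebook uses MentalHealthCleaned.csv.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Base learners. GradientBoostingClassifier is a stand-in for XGBoost/LightGBM.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

# The meta-learner is fitted on out-of-fold predictions from the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = accuracy_score(y_test, stack.predict(X_test))
print(f"Stacked accuracy on synthetic data: {stack_accuracy:.3f}")
```

Training the meta-learner on out-of-fold predictions (the `cv=5` argument) keeps it from simply memorizing base-learner outputs on rows those learners have already seen.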

To check that the results generalize beyond a single train/test split, we will supplement accuracy and AUC with k-fold cross-validation.
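To make the k-fold idea concrete, here is a small pure-Python sketch of how the rows are partitioned: each fold serves once as the validation set while the remaining rows are used for training. The helper `k_fold_indices` is hypothetical and shows plain, unshuffled k-fold. Note that scikit-learn's `cross_val_score` uses a stratified variant when given a classifier and an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


folds = list(k_fold_indices(n_samples=12, k=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} rows")
```

Every row appears in exactly one validation fold, so the mean of the per-fold scores uses all of the data for evaluation without ever scoring a model on rows it was trained on.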

## 2. Import Libraries

In [1]:
import pickle

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from bayes_opt import BayesianOptimization
from xgboost import XGBClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression


## 3. Load Data

In [2]:
df = pd.read_csv('MentalHealthCleaned.csv')

## 4. Analyze Data

In [3]:
df.head()

Unnamed: 0,family_history,treatment,Coping_Struggles,Gender_Male,Country_Belgium,Country_Bosnia and Herzegovina,Country_Brazil,Country_Canada,Country_Colombia,Country_Costa Rica,...,Mood_Swings_Low,Mood_Swings_Medium,Work_Interest_No,Work_Interest_Yes,Social_Weakness_No,Social_Weakness_Yes,mental_health_interview_No,mental_health_interview_Yes,care_options_Not sure,care_options_Yes
0,0,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,1,1,0,1,0
1,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,1,1,0,0,0
2,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,1,1,0,0,1
3,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,1,0,0,0,1
4,1,1,0,0,0,0,0,0,0,0,...,0,1,1,0,0,1,1,0,0,1


In [4]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| family_history | 292364.0 | 0.395165 | 0.488887 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| treatment | 292364.0 | 0.504871 | 0.499977 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| Coping_Struggles | 292364.0 | 0.472137 | 0.499224 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Gender_Male | 292364.0 | 0.820381 | 0.383870 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Country_Belgium | 292364.0 | 0.002818 | 0.053014 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Social_Weakness_Yes | 292364.0 | 0.313332 | 0.463849 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| mental_health_interview_No | 292364.0 | 0.794099 | 0.404359 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| mental_health_interview_Yes | 292364.0 | 0.029497 | 0.169197 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| care_options_Not sure | 292364.0 | 0.265990 | 0.441860 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


## 5. Modeling

### 5.1 Global Model

In [5]:
import libs.data_utils as utils
# Common runtime settings
utils.random_state=42
utils.verbose=1

In [6]:
global_model = utils.train_stacked_models(df, 'treatment')
global_model

1. Hyperparameter optimization
LGBM:
Accuracy: 0.8370843989769821
XGB:
|   iter    |  target   |   gamma   | learni... | max_depth | n_esti... |
-------------------------------------------------------------------------
| 14        | 0.8493    | 3.018     | 0.3       | 11.46     | 138.0     |
Accuracy:  0.8492966751918158
Random Forest:
|   iter    |  target   | criterion | max_depth | min_sa... | n_esti... |
-------------------------------------------------------------------------
| 7         | 0.8339    | 0.0       | 150.0     | 50.0      | 245.1     |
| 10        | 0.8345    | 0.7433    | 91.42     | 50.0      | 287.5     |
| 15        | 0.8346    | 1.0       | 96.97     | 50.0      | 152.3     |
Accuracy:  0.8345907928388747

2. Model Performance
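
The optimizer logs above report every hyperparameter as a float (for example `max_depth` of 96.97 and `criterion` of 1.0 for the Random Forest), so the values have to be converted before they can be passed to an estimator. How `utils.train_stacked_models` does this is not shown. The sketch below is one plausible decoding, and the helper name `decode_search_point` and the 0.5 threshold for `criterion` are assumptions made for illustration.

```python
def decode_search_point(criterion, max_depth):
    """Map a continuous search point to discrete settings (hypothetical mapping)."""
    return {
        # Assumption: the lower half of [0, 1] means "gini", the upper half "entropy".
        "criterion": "gini" if criterion < 0.5 else "entropy",
        # Tree depth must be an integer, so round the sampled float.
        "max_depth": int(round(max_depth)),
    }


# Values taken from Random Forest iterations 7 and 15 in the log above.
iter_7 = decode_search_point(criterion=0.0, max_depth=150.0)
iter_15 = decode_search_point(criterion=1.0, max_depth=96.97)
print(iter_7)
print(iter_15)
```

Decoding inside the objective function keeps the search space continuous for the optimizer while the estimator only ever receives valid settings.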


### 5.2 Segmented Country Models

#### 5.2.1 Load Segments

In [7]:
# Load datasets into pd DF
hperf_df = pd.read_csv('hperf_countries.csv')
lperf_df = pd.read_csv('lperf_countries.csv')

print(hperf_df.shape)
print(lperf_df.shape)

(18160, 64)
(241438, 64)


In [8]:
hperf_df.head()

Unnamed: 0,family_history,treatment,Coping_Struggles,Gender_Male,Country_Belgium,Country_Bosnia and Herzegovina,Country_Brazil,Country_Canada,Country_Colombia,Country_Costa Rica,...,Mood_Swings_Low,Mood_Swings_Medium,Work_Interest_No,Work_Interest_Yes,Social_Weakness_No,Social_Weakness_Yes,mental_health_interview_No,mental_health_interview_Yes,care_options_Not sure,care_options_Yes
0,1,1,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,1,0,0,0
1,0,0,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0
2,0,0,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,1,0,0,1
3,0,0,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,0,0,0,0
4,1,0,0,1,0,0,1,0,0,0,...,0,0,1,0,0,1,1,0,0,0


#### 5.2.2 Modeling High Performing Countries

In [9]:
# Train and get best models for high performing countries
hperf_model = utils.train_stacked_models(hperf_df, 'treatment')

1. Hyperparameter optimization
LGBM:
Accuracy: 0.8591304347826086
XGB:
|   iter    |  target   |   gamma   | learni... | max_depth | n_esti... |
-------------------------------------------------------------------------
| 2         | 0.8577    | 0.7801    | 0.05524   | 3.523     | 179.9     |
Accuracy:  0.8577391304347826
Random Forest:
|   iter    |  target   | criterion | max_depth | min_sa... | n_esti... |
-------------------------------------------------------------------------
| 6         | 0.8059    | 0.5436    | 143.1     | 36.43     | 339.0     |
| 9         | 0.8129    | 0.1427    | 145.4     | 43.68     | 321.3     |
Accuracy:  0.8128695652173913

2. Model Performance
LGB:
Confusion Matrix:
[[1486  295]
 [ 110  984]]

Classification Report:
              precision    recall  f1-score   support

     

#### 5.2.3 Modeling Low Performing Countries

In [10]:
# Train and get best models for low performing countries
lperf_model = utils.train_stacked_models(lperf_df, 'treatment')

1. Hyperparameter optimization
LGBM:
Accuracy: 0.5630578750359919
XGB:
|   iter    |  target   |   gamma   | learni... | max_depth | n_esti... |
-------------------------------------------------------------------------
| 5         | 0.7373    | 1.521     | 0.1622    | 6.888     | 93.68     |
| 9         | 0.7374    | 4.052     | 0.07196   | 11.91     | 98.57     |
Accuracy:  0.7374028217679239
Random Forest:
|   iter    |  target   | criterion | max_depth | min_sa... | n_esti... |
-------------------------------------------------------------------------
Accuracy:  0.7104808522890872

2. Model Performance
LGB:
Confusion Matrix:
[[ 917 2018]
 [1017 2994]]

Classification Report:
              precision    recall  f1-score   support

           0       0.47      0.31      0.38      2935
           1       0.60      0.75      0.66      4011

    accuracy                  

### 5.3 Combined Global and Segmented Models

In [11]:
# Split global data
X_train, x_test, y_train, y_test = utils.train_test_split_with_duplicates(df, 'treatment')

stacked_models = [
    ('global', global_model),
    ('hperf_countries', hperf_model),
    ('lperf_countries', lperf_model) # We'll give this 20% weight over the others.
]

combined_model = utils.WeightedStackingClassifier(
    estimators=stacked_models,
    final_estimator=LogisticRegression(random_state=utils.random_state),
    #weights=[0.33, 0.33, 0.33] # Weights - top to bottom
)

combined_model.fit(X_train, y_train)
# Evaluate the combined model on the held-out test split
y_pred = combined_model.predict(x_test)
combined_accuracy = accuracy_score(y_test, y_pred)
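
The internals of `utils.WeightedStackingClassifier` are not shown, so the role of its `weights` argument can only be guessed. One simple interpretation is weighted soft voting, where each model's predicted probability is scaled by its weight before the results are averaged. The sketch below illustrates that technique only. The helper `weighted_average_proba`, the probabilities, and the weights are all made up for the example and are not the notebook's implementation.

```python
def weighted_average_proba(probas, weights):
    """Blend per-model positive-class probabilities using normalized weights."""
    if len(probas) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_rows = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities from three models for four rows (illustrative only).
global_p = [0.90, 0.40, 0.20, 0.70]
hperf_p = [0.80, 0.60, 0.10, 0.60]
lperf_p = [0.60, 0.30, 0.40, 0.20]

blended = weighted_average_proba(
    [global_p, hperf_p, lperf_p], weights=[0.4, 0.4, 0.2]
)
labels = [int(p >= 0.5) for p in blended]
print([round(p, 2) for p in blended], labels)
```

Unlike stacking with a trained meta-learner, fixed weights are chosen by hand, which makes them easy to reason about but means they are not fitted to the data.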


In [12]:
print(f'Accuracy: {combined_accuracy}')
_ = utils.print_model_metrics(y_test, y_pred)


Accuracy: 0.8642583120204603
Confusion Matrix:
[[6813 1537]
 [ 586 6704]]

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.82      0.87      8350
           1       0.81      0.92      0.86      7290

    accuracy                           0.86     15640
   macro avg       0.87      0.87      0.86     15640
weighted avg       0.87      0.86      0.86     15640
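
The figures in the classification report follow directly from the confusion matrix. As a cross-check, the sketch below recomputes accuracy, precision, recall, and F1 from the combined model's matrix in plain Python. The helper `binary_report` is hypothetical, and it assumes the scikit-learn convention of true classes in rows and predicted classes in columns, with no empty row or column.

```python
def binary_report(cm):
    """Return accuracy plus per-class precision, recall, and F1 from a 2x2 matrix.

    cm[i][j] counts rows whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    report = {"accuracy": (cm[0][0] + cm[1][1]) / total}
    for cls in (0, 1):
        tp = cm[cls][cls]
        predicted = cm[0][cls] + cm[1][cls]  # column sum: rows predicted as cls
        actual = sum(cm[cls])                # row sum: rows truly in cls
        precision = tp / predicted
        recall = tp / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report


combined_cm = [[6813, 1537], [586, 6704]]  # copied from the output above
report = binary_report(combined_cm)
print(f"accuracy: {report['accuracy']:.4f}")
for cls in (0, 1):
    m = report[cls]
    print(
        f"class {cls}: precision={m['precision']:.2f} "
        f"recall={m['recall']:.2f} f1={m['f1']:.2f}"
    )
```

The asymmetry is visible in the raw counts: 1,537 rows of class 0 are predicted as class 1, against only 586 errors in the other direction, which is why class 1 has higher recall but lower precision.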



### 5.4 Cross Validation Score

In [13]:
from sklearn.model_selection import cross_val_score

# Evaluate individual global model
global_cv_scores = cross_val_score(global_model, X_train, y_train, cv=5)

print("\nGlobal Model:")
print("============")
print(f'CV Scores: {global_cv_scores}')
print(f'Mean CV Score: {global_cv_scores.mean()}')

# Evaluate individual segmented models
hperf_cv_scores = cross_val_score(hperf_model, X_train, y_train, cv=5)

print("\nHigh Performing Countries Segment Model:")
print("========================================")
print(f'CV Scores: {hperf_cv_scores}')
print(f'Mean CV Score: {hperf_cv_scores.mean()}')

lperf_cv_scores = cross_val_score(lperf_model, X_train, y_train, cv=5)

print("\nLow Performing Countries Segment Model:")
print("========================================")
print(f'CV Scores: {lperf_cv_scores}')
print(f'Mean CV Score: {lperf_cv_scores.mean()}')

# Evaluate combined model
combined_cv_scores = cross_val_score(combined_model, X_train, y_train, cv=5)

print("\n Combined Model:")
print("===============")
print(f'CV Scores: {combined_cv_scores}')
print(f'Mean CV Score: {combined_cv_scores.mean()}')



Global Model:
CV Scores: [0.65772879 0.76688048 0.78184118 0.76599512 0.74535632]
Mean CV Score: 0.7435603771119952

High Performing Countries Segment Model:
CV Scores: [0.62323606 0.76729605 0.74205439 0.73330924 0.71568734]
Mean CV Score: 0.7163166164507629

Low Performing Countries Segment Model:
CV Scores: [0.64358117 0.81644232 0.80301744 0.77439696 0.74167028]
Mean CV Score: 0.7558216340058919

 Combined Model:
CV Scores: [0.63682356 0.79262806 0.79595266 0.77544494 0.75095765]
Mean CV Score: 0.7503613730513929
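
The mean alone hides how much the folds differ. The sketch below takes the fold scores printed above and adds the sample standard deviation and the index of the weakest fold for each model, using only the standard library. The dictionary keys are labels chosen for this example.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above.
cv_scores = {
    "global": [0.65772879, 0.76688048, 0.78184118, 0.76599512, 0.74535632],
    "hperf": [0.62323606, 0.76729605, 0.74205439, 0.73330924, 0.71568734],
    "lperf": [0.64358117, 0.81644232, 0.80301744, 0.77439696, 0.74167028],
    "combined": [0.63682356, 0.79262806, 0.79595266, 0.77544494, 0.75095765],
}

summary = {}
for name, scores in cv_scores.items():
    summary[name] = {
        "mean": mean(scores),
        "std": stdev(scores),  # sample standard deviation across folds
        "weakest_fold": scores.index(min(scores)),
    }
    s = summary[name]
    print(
        f"{name:>8}: mean={s['mean']:.4f} std={s['std']:.4f} "
        f"weakest fold={s['weakest_fold']}"
    )
```

For all four models the first fold scores lowest. Because `cross_val_score` does not shuffle rows when `cv` is an integer, this may indicate that the rows are ordered in some way, which would be worth checking before relying on the mean scores.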


## 6. Summary

In the modeling phase, we implemented XGBoost, LightGBM, and Random Forest algorithms to predict individuals likely to need mental health support. After fine-tuning the hyperparameters for each model, we achieved the following results:

Baseline accuracy from the feature engineering phase (logistic regression): 0.69

**Global Model**:

- XGBoost (XGB): Achieved an accuracy of **0.85**.
- LightGBM (LGBM): Achieved an accuracy of **0.84**.
- Random Forest Classifier (RFC): Achieved an accuracy of **0.83**.
- Stack Model: Achieved an accuracy of **0.87**.

**High Performing Countries Model**:

- XGBoost (XGB): Achieved an accuracy of **0.86**.
- LightGBM (LGBM): Achieved an accuracy of **0.86**.
- Random Forest Classifier (RFC): Achieved an accuracy of **0.81**.
- Stack Model: Achieved an accuracy of **0.85**.

**Low Performing Countries Model**:

- XGBoost (XGB): Achieved an accuracy of **0.74**.
- LightGBM (LGBM): Achieved an accuracy of **0.56**.
- Random Forest Classifier (RFC): Achieved an accuracy of **0.71**.
- Stack Model: Achieved an accuracy of **0.64**.

**Combined Model**:

- Stack Model: Achieved an accuracy of **0.86**.

**Cross-Validation Scores** (mean over 5 folds):

- Global model: **0.74**.
- High-performing countries segment model: **0.72**.
- Low-performing countries segment model: **0.76**.
- Combined model: **0.75**.

Performance improved over the 0.69 accuracy achieved by the logistic regression model. However, the mean cross-validation scores (0.72 to 0.76) fall well below the 0.86 test accuracy of the combined model, which indicates that the models still require improvement.

### Next Steps

**Experiment with Other Algorithms:** Introduce other machine learning algorithms (e.g., Support Vector Machines, Gradient Boosting Machines, Neural Networks) and compare their performance against the current models.

**Ensemble Methods:** Experiment with other ensemble techniques, such as bagging and boosting, to see whether combining models differently yields better predictive accuracy.

**Segment-Specific Modeling:** Analyze the high and low-performing segments separately to tailor models specifically to their characteristics, potentially leading to better predictions.


