# Stat 451 Project
### Mental Health Diagnosis and Treatment Monitoring
Group 28: 
1. Mai Tah Lee mtlee2@wisc.edu
2. Annie Purisch apurisch@wisc.edu
3. Seth Mlodzik smlodzik@wisc.edu
4. Tianxing Liu tliu398@wisc.edu

Link: https://www.kaggle.com/datasets/uom190346a/mental-health-diagnosis-and-treatment-monitoring

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, svm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree


df = pd.read_csv("mental_health_diagnosis_treatment_.csv")

## Preliminary Measures

In [2]:
df

Unnamed: 0,Patient ID,Age,Gender,Diagnosis,Symptom Severity (1-10),Mood Score (1-10),Sleep Quality (1-10),Physical Activity (hrs/week),Medication,Therapy Type,Treatment Start Date,Treatment Duration (weeks),Stress Level (1-10),Outcome,Treatment Progress (1-10),AI-Detected Emotional State,Adherence to Treatment (%)
0,1,43,Female,Major Depressive Disorder,10,5,8,5,Mood Stabilizers,Interpersonal Therapy,2024-01-25,11,9,Deteriorated,7,Anxious,66
1,2,40,Female,Major Depressive Disorder,9,5,4,7,Antipsychotics,Interpersonal Therapy,2024-02-27,11,7,No Change,7,Neutral,78
2,3,55,Female,Major Depressive Disorder,6,3,4,3,SSRIs,Mindfulness-Based Therapy,2024-03-20,14,7,Deteriorated,5,Happy,62
3,4,34,Female,Major Depressive Disorder,6,3,6,5,SSRIs,Mindfulness-Based Therapy,2024-03-29,8,8,Deteriorated,10,Excited,72
4,5,52,Male,Panic Disorder,7,6,6,8,Anxiolytics,Interpersonal Therapy,2024-03-18,12,5,Deteriorated,6,Excited,63
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496,24,Male,Generalized Anxiety,10,4,8,6,Mood Stabilizers,Dialectical Behavioral Therapy,2024-04-09,8,9,Improved,10,Depressed,73
496,497,22,Male,Panic Disorder,5,6,6,7,Benzodiazepines,Mindfulness-Based Therapy,2024-02-05,13,6,Deteriorated,8,Happy,86
497,498,23,Male,Major Depressive Disorder,7,3,4,2,Antidepressants,Cognitive Behavioral Therapy,2024-03-24,10,5,Deteriorated,5,Neutral,87
498,499,48,Male,Bipolar Disorder,9,4,6,9,Antidepressants,Mindfulness-Based Therapy,2024-03-22,10,6,Improved,7,Anxious,73


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Patient ID                    500 non-null    int64 
 1   Age                           500 non-null    int64 
 2   Gender                        500 non-null    object
 3   Diagnosis                     500 non-null    object
 4   Symptom Severity (1-10)       500 non-null    int64 
 5   Mood Score (1-10)             500 non-null    int64 
 6   Sleep Quality (1-10)          500 non-null    int64 
 7   Physical Activity (hrs/week)  500 non-null    int64 
 8   Medication                    500 non-null    object
 9   Therapy Type                  500 non-null    object
 10  Treatment Start Date          500 non-null    object
 11  Treatment Duration (weeks)    500 non-null    int64 
 12  Stress Level (1-10)           500 non-null    int64 
 13  Outcome             

### Data Cleaning
Make necessary features binary (e.g., gender).
1. Gender
2. Symptom Severity (1-10)
3. Mood Score (1-10)    
4. Sleep Quality (1-10)
5. Stress Level (1-10)
6. Treatment Progress (1-10)
7. Adherence to Treatment (%)

        
Features rated on a 1-10 scale are recoded as 0 (a rating of 5 or below) or 1 (a rating above 5); a rating above 5 is treated as a significant level on that scale. Adherence to Treatment (%) is coded as 1 when adherence is greater than 80% and 0 otherwise, so 1 denotes a patient who adhered to the treatment.

In [4]:
# binomial gender
df_b = df.copy()
df_b['Gender'] = df_b['Gender'].map({'Female':1, 'Male':0})

In [5]:
# make rating scaled features binary
features = ['Symptom Severity (1-10)', 'Sleep Quality (1-10)', 'Mood Score (1-10)', 'Stress Level (1-10)', 'Treatment Progress (1-10)']

for feat in features:
    df_b[feat] = (df_b[feat] > 5).astype(int)
    
df_b["Adherence to Treatment (%)"] = (df_b["Adherence to Treatment (%)"] > 80).astype(int)
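The thresholding rule used in the cell above can be checked on a few values. The sketch below uses hypothetical ratings (not rows from the dataset) to show that a rating of exactly 5 falls into the 0 group, because the comparison is strictly greater than 5.

```python
import pandas as pd

# Hypothetical ratings spanning the 1-10 scale (not taken from the dataset)
ratings = pd.Series([1, 4, 5, 6, 10], name="rating")

# Same rule as the cell above: strictly greater than 5 becomes 1
binary = (ratings > 5).astype(int)

print(pd.DataFrame({"rating": ratings, "binary": binary}))
```

The same pattern applies to the adherence column, with 80 as the cutoff.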


### Numericalize Categorical Values with LabelEncoder
1. Medication
2. Therapy Type
3. AI-Detected Emotional State
4. Diagnosis

In [6]:
# use labelEncoder to numericalize categorical values
encoder = LabelEncoder()

encode = ["Medication", "Therapy Type", "AI-Detected Emotional State", "Diagnosis"]
encode_n = ["Med_n", "Therapy_n", "Emot_n", "Diag_n"]

# encode each categorical column into its matching "_n" column
for src, dst in zip(encode, encode_n):
    df_b[dst] = encoder.fit_transform(df_b[src])
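To read the encoded columns, it helps to know which integer stands for which label. The sketch below, on hypothetical therapy labels, shows one way to recover that mapping from a fitted `LabelEncoder`: the labels are stored sorted in `classes_`, and each label's position is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical therapy labels, used only to illustrate the encoding
therapy = pd.Series([
    "Mindfulness-Based Therapy",
    "Interpersonal Therapy",
    "Cognitive Behavioral Therapy",
    "Interpersonal Therapy",
])

encoder = LabelEncoder()
codes = encoder.fit_transform(therapy)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the codes follow alphabetical order rather than any clinical ordering, a linear model's coefficient on such a column should be read with caution.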
    

    

In [7]:
# drop start date
df_b = df_b.drop("Treatment Start Date", axis="columns")

### Important Feature: Outcome

Create 2 different dataframes that: 

1. Drop deteriorated: Focus only on no change and improved

2. Drop improved: Focus only on no change and deteriorated


This will initially make the outcome binary within each dataframe: 0 = no change, 1 = improved (first dataframe) or deteriorated (second dataframe).

In [8]:
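# Build two binary-outcome subsets from df_b.
# .copy() keeps each subset independent of df_b, so adding a column
# below does not trigger pandas' SettingWithCopyWarning.

# drop if outcome=deteriorated (keeps No Change and Improved)
df1 = df_b[df_b['Outcome'] != 'Deteriorated'].copy()

# drop if outcome=improved (keeps No Change and Deteriorated)
df2 = df_b[df_b['Outcome'] != 'Improved'].copy()

df1['Out_b'] = df1['Outcome'].map({'No Change': 0, 'Improved':1})
df2['Out_b'] = df2['Outcome'].map({'No Change': 0, 'Deteriorated':1})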

## Feature Importance with Logistic Regression

In [9]:
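X = df1[features]
# String target with classes 'Improved' and 'No Change'. scikit-learn sorts
# class labels, so coef_[0] below describes the log-odds of 'No Change'.
y = df1['Outcome'] # for no change & improved 
model = linear_model.LogisticRegression()
model.fit(X,y)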

importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.coef_[0]
}).sort_values(by='Importance', ascending=False)

print(importance)

                     Feature  Importance
4  Treatment Progress (1-10)    0.244637
3        Stress Level (1-10)    0.180013
2          Mood Score (1-10)    0.151808
0    Symptom Severity (1-10)   -0.177218
1       Sleep Quality (1-10)   -0.309771
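When the target is a string column, the sign of each coefficient depends on how scikit-learn orders the classes. The sketch below uses synthetic data (an assumption, not the project data) in which a feature value of 1 makes "Improved" more likely. It shows that `coef_[0]` is reported for `classes_[1]`, which is "No Change" under alphabetical sorting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic binary feature; "Improved" is made more likely when x == 1
x = rng.integers(0, 2, size=400)
p_improved = np.where(x == 1, 0.8, 0.2)
y = np.where(rng.random(400) < p_improved, "Improved", "No Change")

model = LogisticRegression().fit(x.reshape(-1, 1), y)

# Classes are sorted, so classes_[1] is "No Change" and coef_ describes it
print(model.classes_)
print(model.coef_[0])
```

Here the coefficient comes out negative even though the feature favors "Improved". Fitting on the numeric `Out_b` column instead would make class 1 the improved (or deteriorated) outcome directly.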


In [10]:
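X = df2[features]
# String target with classes 'Deteriorated' and 'No Change'. scikit-learn sorts
# class labels, so coef_[0] below describes the log-odds of 'No Change'.
y = df2['Outcome'] # for no change & deter
model = linear_model.LogisticRegression()
model.fit(X,y)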

importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.coef_[0]
}).sort_values(by='Importance', ascending=False)

print(importance)

                     Feature  Importance
4  Treatment Progress (1-10)    0.521745
2          Mood Score (1-10)    0.275338
0    Symptom Severity (1-10)   -0.230991
3        Stress Level (1-10)   -0.306581
1       Sleep Quality (1-10)   -0.335855


In [11]:
X = df1[encode_n]
y = df1['Outcome']
model = linear_model.LogisticRegression()
model.fit(X,y)

importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.coef_[0]
}).sort_values(by='Importance', ascending=False)

print(importance)

     Feature  Importance
3     Diag_n    0.078957
1  Therapy_n    0.051382
0      Med_n   -0.022722
2     Emot_n   -0.063074


In [12]:
X = df2[encode_n]
y = df2['Outcome']
model = linear_model.LogisticRegression()
model.fit(X,y)

importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.coef_[0]
}).sort_values(by='Importance', ascending=False)

print(importance)

     Feature  Importance
0      Med_n    0.103152
2     Emot_n    0.026927
3     Diag_n   -0.021898
1  Therapy_n   -0.083610


In [13]:
X = df1[['Treatment Progress (1-10)', 'Stress Level (1-10)', 'Mood Score (1-10)']]
y = df1['Outcome']

model = linear_model.LogisticRegression(class_weight='balanced', random_state=0)
model.fit(X, y)
print(f"Logistic Regression accuracy = {model.score(X, y)}")

Logistic Regression accuracy = 0.5197568389057751
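The accuracy above is measured on the same rows the model was fit on. A held-out estimate could be obtained with `cross_val_score`, which the first cell already imports. The sketch below is a minimal example on synthetic stand-in data; the array shapes and the choice of 5 folds are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for three binarized features and a binary outcome
X = rng.integers(0, 2, size=(300, 3))
y = rng.integers(0, 2, size=300)

model = LogisticRegression(class_weight="balanced", random_state=0)

# 5-fold cross-validation: each score comes from rows the model did not train on
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With the project data, `X` and `y` from the cell above could be passed in place of the synthetic arrays.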


Features: 
1. Treatment Progress
2. Stress Level
3. Mood Score


These are the three features with positive coefficients in the first logistic regression (no change vs. improved).

In [14]:
b = model.intercept_[0]
w = model.coef_[0]
print(f"Intercept={b}\nCoefficient={w}")

Intercept=-0.3596880125577921
Coefficient=[0.20562397 0.14927444 0.12649911]
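The intercept and coefficients define the model's predicted probability through the logistic function, p = 1 / (1 + exp(-(b + w·x))). The sketch below plugs in the printed values for two hypothetical patients, one with all three binarized features at 0 and one with all at 1. The helper name `predicted_probability` is illustrative, and the probability refers to the model's second class (`classes_[1]`).

```python
import numpy as np

# Values printed by the cell above
b = -0.3596880125577921
w = np.array([0.20562397, 0.14927444, 0.12649911])

def predicted_probability(x):
    """Logistic model probability of classes_[1] for one feature row."""
    z = b + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))

# Rows ordered as [Treatment Progress, Stress Level, Mood Score], all binarized
p_all_low = predicted_probability(np.array([0, 0, 0]))
p_all_high = predicted_probability(np.array([1, 1, 1]))
print(p_all_low, p_all_high)
```

The two probabilities come out at roughly 0.41 and 0.53, so the three features together move the prediction only slightly away from an even split.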


In [15]:
y_pred = model.predict(X)
y_pred_prob = model.predict_proba(X)[:, 1]
print(classification_report(y, y_pred))
auc_score = roc_auc_score(y, y_pred_prob)
print(f"AUC-ROC: {auc_score}")

              precision    recall  f1-score   support

    Improved       0.53      0.66      0.59       170
   No Change       0.50      0.37      0.43       159

    accuracy                           0.52       329
   macro avg       0.52      0.51      0.51       329
weighted avg       0.52      0.52      0.51       329

AUC-ROC: 0.5316500184979651


## Decision Tree

In [17]:
allFeat = features + encode

In [22]:
allFeat

['Symptom Severity (1-10)',
 'Sleep Quality (1-10)',
 'Mood Score (1-10)',
 'Stress Level (1-10)',
 'Treatment Progress (1-10)',
 'Medication',
 'Therapy Type',
 'AI-Detected Emotional State',
 'Diagnosis']
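The list above holds the raw categorical column names, which contain strings. scikit-learn's `DecisionTreeClassifier` expects numeric input, so a tree would need the encoded columns (`Med_n`, `Therapy_n`, and so on) instead. The sketch below shows one possible setup on synthetic stand-in data; the depth limit, the fold count, and the subset of columns are assumptions, not the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in: binarized ratings plus integer-encoded categories
X = pd.DataFrame({
    "Mood Score (1-10)": rng.integers(0, 2, size=n),
    "Stress Level (1-10)": rng.integers(0, 2, size=n),
    "Med_n": rng.integers(0, 6, size=n),
    "Therapy_n": rng.integers(0, 4, size=n),
})
y = rng.integers(0, 2, size=n)

# A shallow tree limits overfitting on a small dataset
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)

# Fit on all rows to inspect impurity-based feature importances
clf.fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(scores.mean())
print(importances.sort_values(ascending=False))
```

With the project data, `df1[features + encode_n]` and `df1['Out_b']` could take the place of the synthetic `X` and `y`.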

## Exploratory Data Analysis

In [None]:
df1['Outcome'].value_counts(normalize=True)

In [None]:
df2['Outcome'].value_counts(normalize=True)

Both subsets show a similar split between "No Change" and the other outcome (improved or deteriorated), so the binary classes are roughly balanced in each.