Multiple simple models based on the depression dataset:

Idea: We have a basic dataset, which is given in: https://www.kaggle.com/datasets/shahzadahmad0402/depression-and-anxiety-data


We deal with 4 different target columns, each related to its own set of feature columns, e.g., identified with Power BI or with ML (feature importance):

1.) Prediction of `anxiousness` (Target variable: 0 (False) / 1 (True))

    good feature variables:

    - phq_score: a measure of depression; higher values indicate more general psychological distress.

    - gender: Gender-specific differences in susceptibility to anxiety disorders (often females are more anxious).

    - age: Age groups can react differently to fear.


2.) Prediction of `depressiveness` (Target variable: 0 (False) / 1 (True))

    good feature variables:

    - phq_score: A direct indicator of depression.

    - gender: Gender-specific differences in susceptibility to depression.

    - age: Age groups can react differently to depression.

    - sleepiness: Sleep disorders can be an indication of depressive symptoms.


3.) Prediction of `will_get_treatment` (Target variable: 0 (False) / 1 (True))

    good feature variables:

    - phq_score: More severe depression may be more likely to require treatment.

    - depression_severity: Higher levels of depression severity increase the likelihood of treatment.

    - age: Age groups have different access to treatment.

    - depression_diagnosis: A formal diagnosis can increase the likelihood of treatment.

    - sleepiness: Sleep disorders can increase the need for treatment.


4.) Prediction of `suicidality` (Target variable: 0 (False, no suicidality) / 1 (True, suicidality))

    possible feature variables:

    - phq_score: A higher value can correlate with suicidality.

    - depression_severity: Higher levels of depression increase the risk of suicide.

    - depressiveness: Direct indicator of depressive symptoms.

    - age: Age groups have different suicide rates.

    - gender: Gender-specific differences in susceptibility to suicide.
    
    - sleepiness: Sleep disorders often correlate with suicide risk.
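
The feature lists above were chosen by inspection (Power BI) or by feature importance. As a rough illustration of the second route, the sketch below ranks features with a random forest. The data is synthetic, and the target rule (`phq_score >= 10`) is an assumption made only so that the example runs on its own; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset (assumption, not real data)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "phq_score": rng.integers(0, 25, n),
    "age": rng.integers(18, 31, n),
    "gender": rng.integers(0, 2, n),
})
# Hypothetical target that depends on phq_score only
demo["target"] = (demo["phq_score"] >= 10).astype(int)

feature_names = ["phq_score", "age", "gender"]
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(demo[feature_names], demo["target"])

# Impurity-based importances sum to 1; larger means more influential
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

On the real data, the same ranking could be computed per target to check whether the hand-picked features are also the ones the model relies on.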


We apply a multi-target depression model, following the Kaggle notebook: https://www.kaggle.com/code/geovaniwoll/machine-learningproject 
We classify the target depressiveness with three feature columns: gender, phq_score, and gad_score.
See also the description of the dataset, linked above and given on Kaggle: https://www.kaggle.com/datasets/shahzadahmad0402/depression-and-anxiety-data

| **Column** | **Description** |
| ------------ | :-----------------: |
| id | each number is a participant in the experiment |
| school_year | years in school |
| age | |
| gender | |
| bmi | body mass index |
| who_bmi | bmi category |
| phq_score | measure the severity of symptoms related to depression, anxiety, and other related disorders in patients |
| depression_severity | degree or intensity of symptoms experienced by an individual with depression |
| depressiveness | |
| suicidal | the participant has suicidal thoughts |
| depression_diagnosis | the participant already has a depression diagnosis |
| depression_treatment | the participant already receives depression treatment |
| gad_score | measure that assesses the severity of Generalized Anxiety Disorder |
| anxiety_severity |  intensity of symptoms experienced by an individual with anxiety |
| anxiousness | |
| anxiety_diagnosis | the participant already has an anxiety diagnosis |
| anxiety_treatment | the participant already receives anxiety treatment |
| epworth_score | score to assess daytime sleepiness |
| sleepiness | |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import resample

In [None]:
# Read the csv-data
df = pd.read_csv('data/depression_anxiety_data.csv')

In [None]:
# see the data
df.head()

In [None]:
print("\nData-Types of the columns:")
display(df.dtypes)

In [None]:
#check NaNs and duplicates
print('Index')
print('index_size', df.index.size)
print('Columns with NaN')
print('is NaN', df.isna().sum())
print('Duplicates in Columns')
print('duplicated', df.duplicated().sum())
# note: check the printed counts; any rows with NaNs are dropped in the cleaning cell below

In [None]:
# Models with targets and features

# Target columns (later we combine 'depression_treatment' and 'anxiety_treatment' into 'treatment_status')

target_cols = ['anxiousness', 'depressiveness', 'depression_treatment', 'anxiety_treatment', 'suicidal']

# First, all numerical columns

num_cols = ['school_year', 'age', 'bmi', 'phq_score', 'gad_score', 'epworth_score']


# Categorical columns that can be converted directly to numerical columns (only two distinct values)

cat_cols_trans = ['gender', 'depression_diagnosis', 'depression_treatment', 'anxiety_diagnosis', 'sleepiness']


# Three categorical columns have more than 2 distinct values:

cat_cols = ['who_bmi', 'anxiety_severity', 'depression_severity']
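
The three multi-level categorical columns are not used by the models in this notebook. If they were needed as features, one possible approach is an ordinal code for the ordered severity columns and one-hot encoding for `who_bmi`. The sketch below uses a small made-up frame; the category labels and their order are assumptions and should be checked against `df['depression_severity'].unique()` and `df['who_bmi'].unique()`.

```python
import pandas as pd

# Made-up rows; the label spellings are assumptions, not the dataset's values
demo = pd.DataFrame({
    "depression_severity": ["mild", "none", "severe", "moderate"],
    "who_bmi": ["Normal", "Overweight", "Normal", "Underweight"],
})

# Ordered categories: encode as increasing integer codes
severity_order = ["none", "mild", "moderate", "severe"]  # assumed order
demo["depression_severity_code"] = pd.Categorical(
    demo["depression_severity"], categories=severity_order, ordered=True
).codes

# Unordered categories: one 0/1 column per level
encoded = pd.get_dummies(demo, columns=["who_bmi"], dtype=int)
print(encoded)
```

A label that is missing from `severity_order` receives the code -1, which makes spelling mismatches easy to spot.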

In [None]:
# Data cleaning

# Drop all rows containing NaNs (only a few rows are affected):
df = df.dropna()


# Correct the data types of the targets
# and of the feature gender (all int)

df['gender'] = df['gender'].map({'male': 1, 'female': 0})


# Define the targets:
#1.) anxiousness: Prediction: 0: False (not anxious) / 1: True (anxious)
#2.) depressiveness: Prediction: 0: False (no depression) / 1: True (depression)
#3.) will_get_treatment (column treatment_status): Prediction: 0: False (no treatment) / 1: True (treatment)
#4.) suicidality (column suicidal): Prediction: 0: False (no suicidality) / 1: True (suicidality)

df['anxiousness'] = df['anxiousness'].astype(int)
df['depressiveness'] = df['depressiveness'].astype(int)

# Combine both treatment columns: 1 if the participant receives any treatment
df['treatment_status'] = (df['depression_treatment'].astype(bool) | df['anxiety_treatment'].astype(bool)).astype(int)

df['suicidal'] = df['suicidal'].astype(int)

# Convert the binary categorical features to integer
df[cat_cols_trans] = df[cat_cols_trans].astype(int)

# Describe the data
display(df.describe())



In [None]:
df.info()

**Most important features, given by the clinical test:**
- gender
- phq_score
- gad_score

In [None]:
# correlation of the three important features:  gender, gad_score, phq_score 

correlation_matrix = df[['gender', 'phq_score', 'gad_score']].corr()
print(correlation_matrix)


import seaborn as sns
import matplotlib.pyplot as plt

# Set the size of the plot
plt.figure(figsize=(8, 6))

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)

# Set the title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()

**Exploration of the dataset with the important features**

We use seaborn plots (histograms and boxplots), split by each target.

In [None]:


# Define your target and numerical columns
target_cols = ['anxiousness', 'depressiveness', 'treatment_status', 'suicidal']
num_cols = ['gender', 'phq_score', 'gad_score']

# Iterate over each numerical column
for num_col in num_cols:
    # Iterate over each target column
    for target_col in target_cols:
        # Create the figure and axes for the plots
        fig, axes = plt.subplots(2, 1, figsize=(20, 6), sharex=True, gridspec_kw={'height_ratios': [5, 1]})
        
        # Histogram
        sns.histplot(data=df, x=num_col, hue=target_col, kde=True, multiple="stack", ax=axes[0])
        axes[0].set_title(f'Histogram of {num_col} by {target_col}')
        
        # Boxplot
        sns.boxplot(data=df, x=num_col, hue=target_col, ax=axes[1])
        axes[1].set_title(f'Boxplot of {num_col} by {target_col}')
        
        # Titles of the axes and display the plot
        axes[1].set_xlabel(num_col)
        axes[1].set_ylabel('')
        plt.show()
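
The plots can be complemented by a numeric summary per target class. The sketch below shows the idea with `groupby` on a tiny made-up frame; the numbers are invented for illustration and are not results from the dataset.

```python
import pandas as pd

# Invented values, only to demonstrate the grouping
demo = pd.DataFrame({
    "phq_score": [2, 4, 12, 15, 3, 18],
    "gad_score": [1, 3, 9, 14, 2, 16],
    "depressiveness": [0, 0, 1, 1, 0, 1],
})

# Mean and median of each score, separately for each target class
summary = demo.groupby("depressiveness")[["phq_score", "gad_score"]].agg(["mean", "median"])
print(summary)
```

With the notebook's `df`, the same call can be repeated for each entry of `target_cols`.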


* Base model: a logistic regression with its own train/test split for each of the 4 targets

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Define your target columns
target_cols = ['anxiousness', 'depressiveness', 'treatment_status', 'suicidal']

# Define the feature columns for each model
feature_sets = {
    'anxiousness': ['phq_score', 'gender', 'age'],
    'depressiveness': ['phq_score', 'gender', 'age', 'sleepiness'],
#    'treatment_status': ['phq_score', 'depression_severity', 'age', 'depression_diagnosis', 'sleepiness'],  # here, we have one categorical column: depression_severity
    'treatment_status': ['phq_score', 'age', 'depression_diagnosis', 'sleepiness'],
    'suicidal': ['phq_score', 'depressiveness', 'age', 'gender', 'sleepiness']
}

# Initialize an empty list to store the models
models = []

# Iterate over each target column
for target in target_cols:
    # Select the corresponding features for the current target
    X = df[feature_sets[target]]
    y = df[target]
    
    # Split the train and test dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Apply a simple logistic regression model
    model = LogisticRegression()
    
    # Fit the model
    model.fit(X_train, y_train)
    
    # Append the fitted model to the list
    models.append(model)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy for {target}: {accuracy}")
    
    # Print the classification report
    print(f"Classification report for {target}:\n{classification_report(y_test, y_pred)}")

# The models list now contains the fitted models for each target
# models[0] -> model for 'anxiousness'
# models[1] -> model for 'depressiveness'
# models[2] -> model for 'treatment_status'
# models[3] -> model for 'suicidal'
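
A single 80/20 split yields one score per target, which can vary with the random seed. One way to get a more stable estimate is stratified k-fold cross-validation. The sketch below runs on synthetic data (the noisy threshold rule is an assumption) and reports balanced accuracy, which is less sensitive to class imbalance than plain accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data (assumption): the label follows a noisy threshold on phq
rng = np.random.default_rng(42)
n = 300
phq = rng.integers(0, 25, n)
age = rng.integers(18, 31, n)
X_demo = np.column_stack([phq, age])
y_demo = ((phq + rng.normal(0, 3, n)) >= 10).astype(int)

# Five folds, each preserving the class ratio of y_demo
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo,
    cv=cv, scoring="balanced_accuracy",
)
print(f"balanced accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```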


In [None]:
# Data cleaning of the aim dataset:

# This is the connection to the ML part: we also need the functions clean_data, feature_engineering, etc., i.e. the same steps that we applied to the data in the ML part
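
The helper functions mentioned above are not defined in this notebook. As a minimal sketch of what a `clean_data` function could look like, the example below repeats the cleaning steps from the training data (drop NaNs, map gender, convert boolean columns to int). The function body and the demo rows are hypothetical; the real aim dataset may need additional steps.

```python
import pandas as pd

def clean_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the same cleaning as for the training data."""
    cleaned = raw.dropna().copy()
    # Same gender encoding as in the training data
    cleaned["gender"] = cleaned["gender"].map({"male": 1, "female": 0})
    # Boolean columns used as features become 0/1
    for col in ["depressiveness", "depression_diagnosis", "sleepiness"]:
        cleaned[col] = cleaned[col].astype(bool).astype(int)
    return cleaned

# Made-up rows standing in for an uncleaned aim dataset
demo = pd.DataFrame({
    "phq_score": [3, 14, None],
    "age": [19, 22, 20],
    "gender": ["female", "male", "male"],
    "depressiveness": [False, True, False],
    "depression_diagnosis": [False, True, False],
    "sleepiness": [False, True, True],
})
aim_cleaned = clean_data(demo)
print(aim_cleaned)
```

The cleaned frame could then be written with `aim_cleaned.to_csv(...)` to produce the file that the prediction cell reads.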


In [None]:
# Define the status names for each model
status_names = {
    'anxiousness': ['non-anxious', 'anxious'],
    'depressiveness': ['non-depressive', 'depressive'],
    'treatment_status': ['not in treatment', 'in treatment'],
    'suicidal': ['non-suicidal', 'suicidal']
}

# Reading the cleaned CSV file into a DataFrame
X_aim = pd.read_csv('aim_test_cleaned.csv')  # Ensure this file is cleaned similarly to the training data

# Ensure X_aim has the same features as used for training
for i, target in enumerate(target_cols):
    # Select the corresponding features for the current target in the aim dataset
    X_aim_target = X_aim[feature_sets[target]]
    
    # Predict using the corresponding model
    y_pred_aim = models[i].predict(X_aim_target)
    print(f"Predictions for {target}:")
    
    # Format the predictions for better readability
    for j, prediction in enumerate(y_pred_aim, start=1):
        status = status_names[target][prediction]
        print(f'person{j} is {status}')
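
The loop above prints hard labels only. For a first screening it can help to see how confident the model is, which `predict_proba` provides. The sketch below trains a small stand-in model on synthetic data (the threshold rule and the two example persons are assumptions) and prints the probability of the positive class.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data (assumption, not the real dataset)
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "phq_score": rng.integers(0, 25, 200),
    "age": rng.integers(18, 31, 200),
})
train["depressiveness"] = (train["phq_score"] >= 10).astype(int)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(train[["phq_score", "age"]], train["depressiveness"])

# Two made-up persons with the same feature columns as in training
new_people = pd.DataFrame({"phq_score": [2, 20], "age": [20, 25]})

# Column 1 of predict_proba holds the probability of class 1
proba = demo_model.predict_proba(new_people)[:, 1]
for j, p in enumerate(proba, start=1):
    print(f"person{j}: P(depressive) = {p:.2f}")
```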


**Improvement of the model, via oversampling to balance the target (we apply it for all the targets)**

In [None]:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Define your target columns
target_cols = ['anxiousness', 'depressiveness', 'treatment_status', 'suicidal']

# Define the feature columns for each model
feature_sets = {
    'anxiousness': ['phq_score', 'gender', 'age'],
    'depressiveness': ['phq_score', 'gender', 'age', 'sleepiness'],
    'treatment_status': ['phq_score', 'age', 'depression_diagnosis', 'sleepiness'],
    'suicidal': ['phq_score', 'depressiveness', 'age', 'gender', 'sleepiness']
}

# Initialize an empty list to store the models
models = []

# Iterate over each target column
for target in target_cols:
    # Split first, so that duplicated minority rows cannot leak into the test set
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df[target])
    
    # Separate the majority and minority classes of the training data
    # (value_counts sorts by frequency, most frequent class first)
    class_counts = train_df[target].value_counts()
    data_majority = train_df[train_df[target] == class_counts.index[0]]
    data_minority = train_df[train_df[target] == class_counts.index[-1]]
    
    # Oversample the minority class (training data only)
    data_minority_oversampled = resample(data_minority, 
                                         replace=True, 
                                         n_samples=len(data_majority), 
                                         random_state=42)
    
    # Combine the majority class with the oversampled minority class
    train_oversampled = pd.concat([data_majority, data_minority_oversampled])
    
    # Features and target for training (oversampled) and testing (untouched)
    X_train = train_oversampled[feature_sets[target]]
    y_train = train_oversampled[target]
    X_test = test_df[feature_sets[target]]
    y_test = test_df[target]
    
    # Apply a simple logistic regression model
    model = LogisticRegression()
    
    # Fit the model
    model.fit(X_train, y_train)
    
    # Append the fitted model to the list
    models.append(model)
    
    # Predict the model
    y_pred = model.predict(X_test)
    
    # Calculate the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy for {target}: {accuracy}")
    
    # Print the classification report
    print(f"Classification report for {target}:\n{classification_report(y_test, y_pred)}")

# The models list now contains the fitted models for each target
# models[0] -> model for 'anxiousness'
# models[1] -> model for 'depressiveness'
# models[2] -> model for 'treatment_status'
# models[3] -> model for 'suicidal'
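
An alternative to resampling is to let the model reweight the classes itself via `class_weight="balanced"` in `LogisticRegression`. The sketch below compares recall on the rare class with and without this option. The data is synthetic and the imbalance is constructed on purpose, so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced data (assumption): positives are rare
rng = np.random.default_rng(1)
n = 1000
phq = rng.integers(0, 25, n)
y_demo = ((phq + rng.normal(0, 4, n)) >= 19).astype(int)
X_demo = phq.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same model, once unweighted and once with balanced class weights
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall (plain):    {recall_plain:.2f}")
print(f"recall (balanced): {recall_weighted:.2f}")
```

Class weighting keeps the training set at its original size, which avoids duplicated rows altogether; it typically trades some precision for recall on the rare class.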


**Final predictions on a custom or provided dataset: aim_test.csv**

In [None]:

# Reading the cleaned CSV file into a DataFrame
X_aim = pd.read_csv('aim_test_cleaned.csv')  # Ensure this file is cleaned similarly to the training data

# Ensure X_aim has the same features as used for training
for i, target in enumerate(target_cols):
    # Select the corresponding features for the current target in the aim dataset
    X_aim_target = X_aim[feature_sets[target]]
    
    # Predict using the corresponding model
    y_pred_aim = models[i].predict(X_aim_target)
    print(f"Predictions for {target}:")
    
    # Format the predictions for better readability
    for j, prediction in enumerate(y_pred_aim, start=1):
        status = status_names[target][prediction]
        print(f'person{j} is {status}')


*We apply a simplified depression model based on the three important features: gender, phq_score and gad_score.
We use a logistic regression model for the classification and obtain good classifications.
Based on the fitted model, we apply the prediction to our own dataset.
Such a simple model could be used as a first classification of depressiveness.*




In [None]:
import pickle
import pandas as pd

# Assume 'models' is the list of trained models with the entries models[0], models[1], ..., models[3]
# Save a single model to a file
#with open('model.pkl', 'wb') as file:
#    pickle.dump(model, file)


# Save the models and their metadata in one file.
# Note: the objects must later be read back with pickle.load in the same order.
with open('models.pkl', 'wb') as file:
    pickle.dump(models, file)
    pickle.dump(status_names, file)
    pickle.dump(target_cols, file)
    pickle.dump(feature_sets, file)
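
Because several objects are dumped into the same file one after another, they have to be loaded with one `pickle.load` call per object, in the order in which they were written. The sketch below demonstrates this round trip with placeholder objects and a temporary file; the names prefixed with `demo_` are stand-ins, not the notebook's real models.

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for the trained models and their metadata
demo_models = ["model_a", "model_b"]
demo_status_names = {"anxiousness": ["non-anxious", "anxious"]}

path = os.path.join(tempfile.mkdtemp(), "models_demo.pkl")

# Write both objects sequentially into one file
with open(path, "wb") as file:
    pickle.dump(demo_models, file)
    pickle.dump(demo_status_names, file)

# Read them back in the same order
with open(path, "rb") as file:
    loaded_models = pickle.load(file)
    loaded_status_names = pickle.load(file)

print(loaded_models, loaded_status_names)
```

For `models.pkl`, the load order would be `models`, `status_names`, `target_cols`, `feature_sets`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.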