We apply a simple depression model, following the Kaggle notebook: https://www.kaggle.com/code/geovaniwoll/machine-learningproject
We classify the target depressiveness using three columns (features): gender, phq_score, and gad_score.
See also the dataset description above and on the Kaggle dataset page: https://www.kaggle.com/datasets/shahzadahmad0402/depression-and-anxiety-data

| **Column** | **Description** |
| ------------ | :-----------------: |
| id | each number is a participant in the experiment |
| school_year | years in school |
| age | |
| gender | |
| bmi | body mass index |
| who_bmi | bmi category |
| phq_score | measures the severity of symptoms related to depression, anxiety, and other related disorders in patients |
| depression_severity | degree or intensity of symptoms experienced by an individual with depression |
| depressiveness | |
| suicidal | the participant has suicidal thoughts |
| depression_diagnosis | the participant already has a depression diagnosis |
| depression_treatment | the participant already receives depression treatment |
| gad_score | measure that assesses the severity of Generalized Anxiety Disorder |
| anxiety_severity | intensity of symptoms experienced by an individual with anxiety |
| anxiousness | |
| anxiety_diagnosis | the participant already has an anxiety diagnosis |
| anxiety_treatment | the participant already receives anxiety treatment |
| epworth_score | score to assess daytime sleepiness |
| sleepiness | |

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.utils import resample

In [None]:
# Read the CSV data
df = pd.read_csv('data/depression_anxiety_data.csv')

In [None]:
# Look at the first rows of the data
df.head()

In [None]:
print("\nData types of the columns:")
display(df.dtypes)

In [None]:
# Check for NaNs and duplicates
print('Index')
print('index_size', df.index.size)
print('Columns with NaN')
print('is NaN', df.isna().sum())
print('Duplicated rows')
print('duplicated', df.duplicated().sum())
# Note: rows containing NaNs are dropped in the data-cleaning step below

In [None]:
# Data cleaning

# Drop all rows with NaNs (only a few rows are affected):
df = df.dropna()


# Correct the data types of the feature gender
# and of the target depressiveness (both int)

df['gender'] = df['gender'].map({'male': 1, 'female': 0})

df['depressiveness'] = df['depressiveness'].astype(int)
display(df.head())
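
One caveat of the dictionary mapping is that `Series.map` returns NaN for every value that is not a key of the dictionary, so a different spelling would silently become a missing value. The following minimal sketch shows how this could be checked; the toy values and the names `toy` and `n_unmapped` are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"gender": ["male", "female", "Male", None]})

# Series.map returns NaN for any value that is not a key of the dictionary
toy["gender_encoded"] = toy["gender"].map({"male": 1, "female": 0})

# Count the rows that could not be encoded
n_unmapped = int(toy["gender_encoded"].isna().sum())
print("Unmapped gender values:", n_unmapped)
```

If the count is larger than zero, the affected rows would need to be inspected before converting the column to int.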


In [None]:
df.info()

**Most important features, given by the clinical test:**
- gender
- phq_score
- gad_score

In [None]:
# Correlation of the three important features: gender, gad_score, phq_score

correlation_matrix = df[['gender', 'phq_score', 'gad_score']].corr()
print(correlation_matrix)


import seaborn as sns

# Set the size of the plot
plt.figure(figsize=(8, 6))

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)

# Set the title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()

**Exploration of the dataset with the important features**

We use seaborn plots.

In [None]:
# Histogram and boxplot for the important numerical columns with the target 'depressiveness' as hue


num_cols = ['gender', 'phq_score', 'gad_score']

for col in num_cols:


    # Full plot of the histogram and combined boxplot
    fig, axes = plt.subplots(2, 1, figsize=(20, 6), sharex=True, gridspec_kw={'height_ratios': [5, 1]})
    
    # Histogram
    sns.histplot(data=df, x=col, hue='depressiveness', kde=True, multiple="stack", ax=axes[0])
    axes[0].set_title(f'Histogram of {col} by depressiveness')
    
    # Boxplot
    sns.boxplot(data=df, x=col, hue='depressiveness', ax=axes[1])
    axes[1].set_title(f'Boxplot of {col} by depressiveness')



    # Label the axes and display the plot
    axes[1].set_xlabel(col)
    axes[1].set_ylabel('')
    plt.show()
    

**Base model with the training and test datasets**

In [None]:

# Important features and the target 
X = df[['phq_score', 'gad_score', 'gender']]
y = df['depressiveness']


# Split the train and test data-set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply a simple logistic regression model
model = LogisticRegression()

# Fit the model
model.fit(X_train, y_train)

# Predict with the model
y_pred = model.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print the classification report
print(classification_report(y_test, y_pred))
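
With an imbalanced target, accuracy alone can hide how the minority class is handled. As a minimal sketch, assuming synthetic toy data in place of the real dataset, the split can be stratified on the target and the result inspected with a confusion matrix. The labelling rule `phq_score >= 18` and the names `toy`, `clf`, and `cm` are hypothetical and only serve to create an imbalanced example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic toy data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "phq_score": rng.integers(0, 25, size=n),
    "gad_score": rng.integers(0, 22, size=n),
    "gender": rng.integers(0, 2, size=n),
})
# Hypothetical labelling rule, used only to obtain an imbalanced toy target
toy["depressiveness"] = (toy["phq_score"] >= 18).astype(int)

X_toy = toy[["phq_score", "gad_score", "gender"]]
y_toy = toy["depressiveness"]

# stratify keeps the class proportions equal in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```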

**Improvement of the model via oversampling to balance the target**

In [None]:
data_majority = df[df['depressiveness'] == 0]
data_minority = df[df['depressiveness'] == 1]


In [None]:
data_minority_oversampled = resample(data_minority, replace=True, n_samples=len(data_majority), random_state=42)


In [None]:
df_oversampled = pd.concat([data_majority, data_minority_oversampled])


In [None]:
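# Refit the model on the oversampled dataset
# Note: the oversampling was applied before the train/test split, so copies of the
# same minority row can end up in both sets and the scores below may be optimistic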

X = df_oversampled[['phq_score', 'gad_score', 'gender']]
y = df_oversampled['depressiveness']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply and fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Prediction of the model
y_pred = model.predict(X_test)
print(X_test)
print(y_pred)

# Accuracy and classification report
print("Accuracy:", model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
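
Because resampling with replacement duplicates minority rows, oversampling the full dataset before the split lets copies of the same row land in both the training and the test set. A possible alternative, sketched here on synthetic toy data, is to split first and oversample only the training part. This is not the approach of the notebook above; the labelling rule and all names (`toy`, `train_balanced`, `clf`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic toy data (assumption: not the real dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "phq_score": rng.integers(0, 25, size=n),
    "gad_score": rng.integers(0, 22, size=n),
    "gender": rng.integers(0, 2, size=n),
})
# Hypothetical labelling rule, used only to obtain an imbalanced toy target
toy["depressiveness"] = (toy["phq_score"] >= 18).astype(int)

features = ["phq_score", "gad_score", "gender"]

# 1) Split first, so the test set contains only original rows
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["depressiveness"]
)

# 2) Oversample the minority class inside the training set only
train_majority = train_df[train_df["depressiveness"] == 0]
train_minority = train_df[train_df["depressiveness"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up])

# 3) Fit on the balanced training data, evaluate on the untouched test data
clf = LogisticRegression(max_iter=1000)
clf.fit(train_balanced[features], train_balanced["depressiveness"])
print(classification_report(test_df["depressiveness"], clf.predict(test_df[features])))
```

The test set keeps the original class proportions, so the reported scores refer to data the model has not seen in any form.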

**Final predictions with a custom or provided dataset: aim_test.csv**

In [None]:
# Idea of an app interface:

# We ask the following questions:
# phq_score: measures the severity of symptoms related to depression, anxiety, and other related disorders in patients: between 0 and 24
# gad_score: measures the severity of generalized anxiety disorder: between 0 and 21
# gender: 0 for female / 1 for male

# Ask the students/children these questions and then predict whether they are depressive or not


# Example aim file


# Read the CSV file into a DataFrame
X_aim = pd.read_csv('aim_test.csv')

# Select the feature columns in the same order as used for training
X_aim = X_aim[['phq_score', 'gad_score', 'gender']]

y_pred_aim = model.predict(X_aim)
print(X_aim)
print(y_pred_aim)

# 0 is not depressive
# 1 is depressive
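
For an app interface, the answers could be checked before they reach the model. The following is a minimal sketch, assuming the value ranges given in the comments above; the helper `validate_inputs` and the constants `FEATURES` and `RANGES` are hypothetical names introduced for illustration, and an in-memory DataFrame stands in for aim_test.csv.

```python
import pandas as pd

# Feature order used for training
FEATURES = ["phq_score", "gad_score", "gender"]
# Allowed ranges, taken from the comments in the cell above
RANGES = {"phq_score": (0, 24), "gad_score": (0, 21), "gender": (0, 1)}


def validate_inputs(frame):
    """Return the feature columns in training order, or raise ValueError."""
    missing = [col for col in FEATURES if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for col, (low, high) in RANGES.items():
        # between() is inclusive on both ends; NaN values fail the check
        if not frame[col].between(low, high).all():
            raise ValueError(f"{col} must lie between {low} and {high}")
    return frame[FEATURES]


# Example answers (hypothetical values)
example = pd.DataFrame({"gender": [0, 1], "phq_score": [3, 17], "gad_score": [2, 15]})
checked = validate_inputs(example)
print(checked)
```

The returned frame could then be passed to `model.predict` in place of the raw file content.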

*We apply a simplified depression model based on three important features: gender, phq_score, and gad_score.
A logistic regression model is used for the classification and gives good classification results.
Based on the fitted model, we apply the prediction to our own dataset.
Such a simple model could be used as a first classification of depressiveness.*




In [None]:
import pickle

# Assuming 'model' is your trained model,
# save the model to a file
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)
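
To use a saved model later, for example in an app, it can be loaded again with `pickle.load`. The sketch below is a self-contained round trip with a small stand-in model; the toy arrays, the temporary path, and the names `demo_model` and `loaded_model` are assumptions for illustration. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on toy values (columns: phq_score, gad_score, gender)
X_demo = np.array([[0, 0, 0], [2, 1, 1], [20, 15, 0], [22, 18, 1]])
y_demo = np.array([0, 0, 1, 1])
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Save the model to a temporary file
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as file:
    pickle.dump(demo_model, file)

# Load the model back from the file
with open(path, "rb") as file:
    loaded_model = pickle.load(file)

# The loaded model should give the same predictions as the original one
same = bool((loaded_model.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same)
```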
