# Portfolio project: Depression prediction with the best model (method file for choosing the best model)

The dataset is taken from:

https://www.kaggle.com/datasets/shahzadahmad0402/depression-and-anxiety-data/code


***
We have the following tasks:
* Data reading and cleaning,
* Feature Engineering,
* Optimal model: fit the data to the model and grid-search over different models,
* Combination into a pipeline,
* Choosing the best model with the optimal columns.
***


The dataset has the following columns:

| **Column** | **Description** |
| ------------ | :-----------------: |
| id | each number is a participant in the experiment |
| school_year | years in school |
| age | |
| gender | |
| bmi | body mass index |
| who_bmi | bmi category |
| phq_score | measures the severity of symptoms related to depression, anxiety, and other related disorders in patients |
| depression_severity | degree or intensity of symptoms experienced by an individual with depression |
| depressiveness | |
| suicidal | the participant has suicidal thoughts |
| depression_diagnosis | the participant already has a depression diagnosis |
| depression_treatment | the participant already receives depression treatment |
| gad_score | measure that assesses the severity of Generalized Anxiety Disorder |
| anxiety_severity | intensity of symptoms experienced by an individual with anxiety |
| anxiousness | |
| anxiety_diagnosis | the participant already has an anxiety diagnosis |
| anxiety_treatment | the participant already receives anxiety treatment |
| epworth_score | score to assess daytime sleepiness |
| sleepiness | |

### Define Metric

We use the F1 score to balance precision and recall.
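As a minimal sketch of what this balance means, the snippet below computes precision, recall and F1 from confusion-matrix counts. The helper name `f1_from_counts` and the counts used are illustrative assumptions, not results from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the positive class
precision, recall, f1 = f1_from_counts(tp=13, fp=3, fn=8)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 score by maximizing precision alone or recall alone.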

In [None]:
from sklearn.metrics import classification_report, accuracy_score, recall_score, f1_score, precision_score


# Metric considerations

# Accuracy: the proportion of correct predictions.
#   Less informative here, because the dataset may be imbalanced
#   (more participants without depressiveness than with it).
# Better metrics:
# Recall (sensitivity): the proportion of actual positive cases (depressiveness)
#   that the model identifies. Optimizing it means finding as many positive cases as possible.
# Precision: the proportion of predicted positive cases that are actually positive.
#   Important for minimizing false alarms.
# It is best to balance the two:
# F1 score: the harmonic mean of precision and recall.
#   It forces a balance between precision and recall.

# We use the F1 score

### Gather Data

We use the dataset *depression_anxiety_data.csv* with the target column 'depressiveness'.

In [None]:
# import modules 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Pipeline
from sklearn.pipeline import Pipeline
# ColumnTransformer
from sklearn.compose import ColumnTransformer
# OneHotEncoder, scaler
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Train/test split, cross-validation score, SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures



In [None]:
# Target encoder

# !pip install category_encoders  # done in yml
import category_encoders as ce


## EDA

**EDA**:
Here we run only a brief EDA; the full EDA is done in the EDA files.

In [None]:

# read data (start from the dataset)
df = pd.read_csv("../data/depression_anxiety_data.csv")

display(df.head())
#print(df)



## Data Preparation

Here we clean the data based on our EDA in the "eda_model_depression.ipynb" file.

In [None]:
# Cleaning of the full dataset before the split, based on the targets
# Optional 

# import pandas as pd

# def clean_data_target(df, target_columns=None):
#     """
#     Cleans and transforms the input DataFrame.

#     This function performs the following operations on the DataFrame:
#         - Drops rows with any missing values (NaNs) in the specified target columns.
#         - Converts binary categorical features into numerical values (0 and 1).
#         - Converts datetime features to the appropriate datetime format.
#         - Returns the cleaned DataFrame.

#     Args: 
#         df (pd.DataFrame): The data to clean.
#         target_columns (list of str, optional): List of target column names for NaN removal. 
#                                                 If None, no target-specific NaN removal is performed.
        
#     Returns:
#         pd.DataFrame: A cleaned DataFrame with transformed features.
#     """
    
#     # Drop rows with missing values in the specified target columns
#     if target_columns:
#         df_clean = df.dropna(subset=target_columns)
#     else:
#         df_clean = df.copy()

#     # Further cleaning or transformation can be added here if needed
    
#     return df_clean

# # Define the target of the dataset

# target = ['depressiveness']

# df = clean_data_target(df, target)



In [None]:
display(df.head())


# target

# depressiveness

# binary categories:

# gender	suicidal	depression_diagnosis	depression_treatment	anxiousness	anxiety_diagnosis	anxiety_treatment	sleepiness

In [None]:
# Filter the DataFrame for rows where epworth_score >= 31
filtered_df = df[df['epworth_score'] >= 31]

# Display the filtered DataFrame
print(filtered_df)

General data cleaning after the split
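The cleaning step has to drop incomplete rows from the features and the target together, otherwise the two get out of sync. The following is a minimal sketch of that idea on hypothetical toy data (the values and the names `X_toy`, `y_toy` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in features and target
X_toy = pd.DataFrame({
    "age": [19, 21, np.nan, 23, 25],
    "bmi": [22.5, np.nan, 24.0, 27.1, 30.2],
})
y_toy = pd.Series([0, 1, 1, np.nan, 1], name="depressiveness")

# Join features and target, drop every row with a NaN anywhere, then split again
combined = pd.concat([X_toy, y_toy], axis=1).dropna()
X_clean = combined[X_toy.columns].copy()
y_clean = combined[y_toy.name].astype(int)

print(X_clean.index.tolist(), y_clean.index.tolist())
```

Dropping NaNs on the joined frame guarantees that both parts keep exactly the same row index.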

In [None]:
import pandas as pd

# Data Cleaning and Transformation Function, this is done for the train and test datasets
def clean_data(X, y=None):
    """Cleans and transforms the input feature DataFrame and optionally the target.

    This function performs the following operations on the feature DataFrame (X):
        - Drops rows with any missing values (NaNs) in either X or y to ensure they remain aligned.
        - Converts binary categorical features into numerical values (0 and 1).
        - Optionally, converts the target DataFrame (y) to integers if provided.
        - If 'epworth_score' >= 31, sets the value to 17.
        - Returns the cleaned features and optionally the cleaned target.

    Args: 
        X (pd.DataFrame): The feature data.
        y (pd.DataFrame, optional): The target data. Default is None.
        
    Returns:
        pd.DataFrame: A cleaned DataFrame with transformed features.
        pd.DataFrame: Optionally, the cleaned target DataFrame if y is provided.
    """
    # Concatenate X and y to handle missing values across both simultaneously
    if y is not None:
        # Combine X and y into a single DataFrame for joint NaN removal
        combined = pd.concat([X, y], axis=1)
        
        # Drop rows with any missing values in either X or y
        combined_clean = combined.dropna()
        
        # Separate X and y again (copy to avoid a SettingWithCopyWarning on later assignments)
        X_clean = combined_clean.iloc[:, :X.shape[1]].copy()
        y_clean = combined_clean.iloc[:, X.shape[1]:].copy()
    else:
        # If y is not provided, only clean X
        X_clean = X.dropna().copy()

    # List of categorical columns with binary values that need to be converted to integers
    cat_cols_trans = ['gender', 'suicidal' , 'depression_diagnosis', 'depression_treatment',  
                      'anxiousness', 'anxiety_diagnosis', 'anxiety_treatment', 'sleepiness']

    # Map 'gender' column to integers: 'male' -> 1, 'female' -> 0
    X_clean['gender'] = X_clean['gender'].map({'male': 1, 'female': 0})

    # Convert the specified binary categorical columns in X to integers (0 and 1)
    X_clean[cat_cols_trans] = X_clean[cat_cols_trans].astype(int)

    # If 'epworth_score' >= 31, set the value to 17
    X_clean.loc[X_clean['epworth_score'] >= 31, 'epworth_score'] = 17

    if y is not None:
        # Convert all target variables to integers
        y_clean = y_clean.astype(int)
        # Return both cleaned X and cleaned y
        return X_clean, y_clean
    else:
        # If y is not provided, return only the cleaned features
        return X_clean


In [None]:
# We apply the data-cleaning after the train-test split

## Train-Test-Split

In this project you need the training data as well as the test and target data.

No test data are provided. We therefore recommend splitting the data into a training and a test set at this point and working with the training set as if it were all the data you have. Once your model is built, you can use the test set to simulate what happens when new data enter your data pipeline.

Always remember: **Always fit on Train Set only!** This applies in particular to data cleaning and *feature engineering* as well. Ideally you touch the test set only once, namely when you have built and evaluated a model that is optimal for you and want to know how well it performs on unseen data. Just imagine you did not have the test set at all.

Use `train_test_split` from the `sklearn.model_selection` submodule to split the data into test and training sets. Pass the parameters `random_state=42` and `test_size=0.1` so that you can later compare your predictions with our model and get a first assessment. Additionally, save `features_test` as *features_test.csv*.
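The split described above can be sketched on toy data as follows. The `stratify` argument is an optional addition that is not part of the instructions; it is shown here only as an assumption-labelled variant that keeps the class ratio equal in both sets. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 30 % positive class
toy = pd.DataFrame({"feature": np.arange(100), "label": [1] * 30 + [0] * 70})

X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature"]],
    toy["label"],
    test_size=0.1,          # 10 % of the rows go to the test set
    random_state=42,        # reproducible shuffling
    stratify=toy["label"],  # optional: preserve the class ratio in both sets
)

print(len(X_train), len(X_test))
print("share of positives in test set:", y_test.mean())
```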

In [None]:
# read data (start from the dataset)
df = pd.read_csv("../data/depression_anxiety_data.csv")


# perform train-test-split


# Define the feature matrix and the target vector
features = df.drop(columns=['depressiveness'])
target = df['depressiveness']

# Split the data into training and test sets
features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                            target, test_size=0.1, random_state=42)




# Save the test features (and optionally the target)
features_test.to_csv('../data/features_test.csv', index=False)
# target_test.to_csv('target_test.csv', index=False)


In [None]:
# Clean the data:

features_train, target_train = clean_data(features_train, target_train)

display(features_train.head())
display(target_train.head())


# Preparation of the test-data-set
features_test, target_test = clean_data(features_test, target_test)

display(features_test.head())
display(target_test.head())

# save features_test as 'features_test.csv'

In [None]:
# Resampling:
# oversample features_train

from imblearn.over_sampling import RandomOverSampler

# Run the oversampler manually


# Apply the RandomOverSampler
ros = RandomOverSampler(random_state=42)
features_resampled, target_resampled = ros.fit_resample(features_train, target_train)

print(target_resampled)


In [None]:

# Flatten the 2D array to 1D
target_resampled_flattened = target_resampled.values.ravel()

# Convert to Pandas Series
target_resampled_series = pd.Series(target_resampled_flattened)

# Plotting the distribution
plt.figure(figsize=(8, 6))
sns.countplot(x=target_resampled_series)
plt.title('Distribution of Target Variable After Resampling')
plt.xlabel('Target Class')
plt.ylabel('Count')
plt.show()

In [None]:
display(features_train)

## Data Preparation


### Datatype Transformation

This was already done in the cleaning step above.

### Data Imputation

We impute the numerical columns with the median and the categorical columns with the most frequent value; mean and KNN imputation are kept as commented-out alternatives.
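To illustrate how an imputer learns its fill value from the training data only, here is a minimal sketch with `SimpleImputer` on hypothetical toy arrays. The numbers are made up; the point is that the median is robust to the large value 100 and that the test set is filled with the statistic learned on the training set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column toy data
train = np.array([[1.0], [3.0], [np.nan], [100.0]])
test = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")
train_imputed = imputer.fit_transform(train)  # learns the median on train only
test_imputed = imputer.transform(test)        # reuses the training median

print("learned median:", imputer.statistics_[0])
print("mean for comparison:", np.nanmean(train))
```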

In [None]:

# Pipeline for the imputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer


num_cols = ['school_year', 'age', 'gender', 'bmi', 'phq_score', 'depression_diagnosis', 'gad_score',
            'anxiety_diagnosis', 'epworth_score', 'sleepiness']

num_cols_no_pca = ['gender', 'bmi']

num_cols_pca = ['school_year', 'age', 'phq_score', 'depression_diagnosis', 'gad_score',
            'anxiety_diagnosis', 'epworth_score', 'sleepiness']

targets = ['anxiousness', 'depressiveness', 'treatment_status', 'suicidal']

cat_cols = ['who_bmi', 'depression_severity', 'anxiety_severity']





#cat_pipe = Pipeline(steps=[
#    ("imp", SimpleImputer(strategy="constant", fill_value="missing")),
#    ("ohe", OneHotEncoder(handle_unknown='ignore')),
#])


# Pipeline for categorical columns
cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="most_frequent")),  # Simple Imputer for categorical data
    ("ohe", OneHotEncoder(handle_unknown='ignore'))  # OneHotEncoding
])



#num_pipe = Pipeline(steps=[
#    ("imp", SimpleImputer(strategy="mean")),
#])

num_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

#num_pipe = Pipeline(steps=[
#    ('imputer', KNNImputer(n_neighbors=5)),
#    ('std_scaler', StandardScaler())
#])



# Copy the DataFrames to avoid a warning; very important, otherwise this does not work!
features_train = features_train.copy()
target_train = target_train.copy()


# Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipe, num_cols),
        ('cat', cat_pipe, cat_cols)
    ])

# Build the complete pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

# Fit the pipeline on the training data
features_train_transformed = pipeline.fit_transform(features_train)

# Transform the test data
features_test_transformed = pipeline.transform(features_test)

# Show the results
print("Transformed training data:\n", features_train_transformed)
print("Transformed test data:\n", features_test_transformed)


# Check for NaN values
print('Number of NaN values in the transformed training data:', np.isnan(features_train_transformed).sum())
print('Number of NaN values in the transformed test data:', np.isnan(features_test_transformed).sum())



### Deal with outliers

* We decided to keep all outliers; they are important.

Feature engineering, see the notebook:

https://www.kaggle.com/code/geovaniwoll/machine-learningproject

We adopt the idea and add the new features `severity_index` and `phq_gad_relation`:

* `df['severity_index'] = df['phq_score'] + df['gad_score']`
* `df['phq_gad_relation'] = df['phq_score'] / df['gad_score']`

Polynomial features are a further option.

We also define a new target (target engineering):

* `df['treatment_status'] = df['depression_treatment'] | df['anxiety_treatment']`
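One caveat of the ratio feature is that a `gad_score` of 0 produces infinite or missing values. The sketch below shows the effect on hypothetical toy scores and two possible workarounds; both the masking and the `+ 1` smoothing are assumptions offered for illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy scores, including zeros in the denominator
phq = pd.Series([0, 5, 12, 0])
gad = pd.Series([0, 0, 6, 4])

# Naive ratio: 0/0 gives NaN and 5/0 gives inf
naive = phq / gad

# Option 1: mask zero denominators so the result is NaN and can be imputed later
masked = phq / gad.replace(0, np.nan)

# Option 2: shift both scores by one so the ratio is always finite
smoothed = (phq + 1) / (gad + 1)

print(pd.DataFrame({"naive": naive, "masked": masked, "smoothed": smoothed}))
```

Option 1 keeps the original scale and leaves the decision to the imputer; option 2 changes the scale slightly but never produces missing values.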


In [None]:
# Feature Engineering


features_train['severity_index'] = features_train['phq_score'] + features_train['gad_score']
features_train['phq_gad_relation'] = features_train['phq_score'] / features_train['gad_score']

# Part: new features

features_train['depressiveness_awareness'] = features_train['depression_diagnosis'] | features_train['depression_treatment']
features_train['anxiety_awareness'] = features_train['anxiety_diagnosis'] | features_train['anxiety_treatment']

df_mod = features_train.copy()

df_mod['depressiveness'] = target_train['depressiveness']

display(df_mod.head())


num_new_cols = ['severity_index', 'phq_gad_relation', 'depressiveness_awareness', 'anxiety_awareness']

# Histogram and boxplot for the numerical features with the target "depressiveness" as Hue
for col in num_new_cols:

    # Full plot histogram and boxplot
    fig, axes = plt.subplots(2, 1, figsize=(20, 10), sharex=True, gridspec_kw={'height_ratios': [5, 1]})
    
    # Histogram
    sns.histplot(data=df_mod, x=col, hue='depressiveness', kde=True, multiple="stack", ax=axes[0])
    axes[0].set_title(f'Histogram of {col} by depressiveness')
    
    # Boxplot
    sns.boxplot(data=df_mod, x=col, hue='depressiveness', ax=axes[1])
    axes[1].set_title(f'Boxplot of {col} by depressiveness')

    # Titles of the axis
    axes[1].set_xlabel(col)
    axes[1].set_ylabel('')

    plt.tight_layout()
    plt.show()



In [None]:

# Optionally, you can check just the new features for NaNs
print("\nNumber of NaN values in the newly engineered features:")
print(features_train[['severity_index', 'phq_gad_relation']].isnull().sum())

In [None]:


# Feature Engineering

# Design promising new features



def engineer_features(X, y=None, inplace=False):
    if not inplace:
        X = X.copy()
        if y is not None:
            y = y.copy()

    X['severity_index'] = X['phq_score'] + X['gad_score']
    X['phq_gad_relation'] = X['phq_score'] / X['gad_score']
    
    # Features: combine diagnosis and treatment
    X['depressiveness_awareness'] = X['depression_diagnosis'] | X['depression_treatment']
    X['anxiety_awareness'] = X['anxiety_diagnosis'] | X['anxiety_treatment']

    if y is not None:
        return X, y
    else:
        return X
    
    



# Apply the feature engineering to train and test
features_train, target_train = engineer_features(features_train, target_train)
display(features_train)

features_test, target_test = engineer_features(features_test, target_test)
display(features_test)








In [None]:
# Add the new features

display(features_train.head())



In [None]:
# Function to filter outliers (the filter itself is currently commented out)


def sampling_features(X, y=None, inplace=False):
    if not inplace:
        X = X.copy()
        if y is not None:
            y = y.copy()
 
    # Outliers, if necessary
#    if y is not None:
#        mask = X['epworth_score'] > 21
#        X = X[mask]
#        y = y[mask]
#        return X, y
#    else:
#        X = X[X['epworth_score'] > 21]
#        return X
    
    if y is not None:
        return X, y
    else:
        return X
    
    
    


### Resample

The imbalanced target is handled by *resampling*, e.g. with `imblearn.pipeline`.
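To make the idea of random oversampling concrete, here is a plain-pandas stand-in that duplicates randomly chosen minority rows until all classes have the same size. It is only a sketch of the principle, not the `imblearn` implementation; the function name `oversample_minority` and the toy data are hypothetical.

```python
import pandas as pd


def oversample_minority(X, y, random_state=42):
    """Duplicate random rows of the smaller classes until all classes are equally large."""
    data = X.assign(_target=y)
    n_max = data["_target"].value_counts().max()
    parts = []
    for _, group in data.groupby("_target"):
        n_extra = n_max - len(group)
        parts.append(group)
        if n_extra > 0:
            # Draw additional rows with replacement from the same class
            parts.append(group.sample(n=n_extra, replace=True, random_state=random_state))
    balanced = pd.concat(parts)
    return balanced.drop(columns="_target"), balanced["_target"]


# Hypothetical toy data: 6 negatives, 2 positives
X_toy = pd.DataFrame({"gad_score": [1, 2, 3, 4, 5, 6, 15, 18]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(y_bal.value_counts())
```

Resampling should only ever see training rows. Placing the sampler inside the pipeline ensures that, during cross-validation, the duplicated rows never leak into the validation folds.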

In [None]:
# See also at the beginning:

print(target_resampled)
# Assuming target_resampled is a Pandas Series, check the balance
print(target_resampled.value_counts())

# Optionally, you can also print the proportion of each class
print(target_resampled.value_counts(normalize=True))


In [None]:
# to store the files with pickle (optional)

# import pickle

# state = {
#     'features_train': features_train,
#     'target_train': target_train,
#     'features_test': features_test,
#     'target_test': target_test,    
# }
# # Save the dictionary:
# with open('state.pkl', 'wb') as f:
#     pickle.dump(state, f)


In [None]:

# # Reopen the stored pickle-files (optional):
# import pickle

# with open('state.pkl', 'rb') as f:
#     state = pickle.load(f)

# features_train = state['features_train']
# target_train = state['target_train']

# features_test = state['features_test']
# target_test = state['target_test']


# #rf_pipe = state['rf_pipe']

In [None]:
display(features_train)

display(features_train.dtypes)

display(target_train)

display(target_train.dtypes)



## Modeling

Here we apply the different models

In [None]:
display(features_train.head(2))
display(features_test.head(2))

In [None]:
# define num_cols and cat_cols for the depressiveness

# Important Columns and their relation, e.g., for the PCA

#num_cols = ['school_year', 'age', 'gender', 'bmi', 'phq_score', 'depression_diagnosis', 'gad_score',
#            'anxiety_diagnosis', 'epworth_score', 'sleepiness']


num_new_features = ['severity_index', 'depressiveness_awareness', 'anxiety_awareness']  # 'phq_gad_relation' is left out because it can contain NaN

#num_cols_no_pca = ['gender', 'bmi']

#num_cols_pca = ['school_year', 'age', 'depression_diagnosis', 'gad_score',
#            'anxiety_diagnosis', 'epworth_score', 'sleepiness']

targets = ['anxiousness', 'depressiveness', 'treatment_status', 'suicidal']

# (depression_treatment and anxiety_treatment  are columns, which are used in the target treatment_status)

#cat_cols = ['who_bmi', 'depression_severity', 'anxiety_severity']
#cat_cols = ['who_bmi', 'anxiety_severity']
cat_cols = []

# num-cols for the target depressiveness

num_cols = ['age', 'gender', 'gad_score', 'epworth_score', 'bmi', 'depressiveness_awareness', 'anxiety_awareness'] # , 'severity_index']

#num_cols = [] 
num_cols_poly = ['age','gender', 'gad_score', 'epworth_score', 'bmi', 'depressiveness_awareness', 'anxiety_awareness']
num_cols_poly = []
target = ['depressiveness']



In [None]:
# instantiate model

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier





In [None]:


from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler

cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="constant", fill_value="missing")),
    ("target_enc", ce.TargetEncoder())  # add the TargetEncoder
#    ("ohe", OneHotEncoder(handle_unknown='ignore')),  # One-Hot encoder
])

num_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="median")),
#    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # adding PolynomialFeatures brings no benefit,
#    since we have linear combinations
    ("scaler", StandardScaler()),
])

num_pipe_poly = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # adding PolynomialFeatures brings no benefit,
#    since we have linear combinations
    ("scaler", StandardScaler()),
])



# for the depressiveness as target, we only have num-cols

preprocessing = ColumnTransformer(
    transformers=[
        ("num_trans", num_pipe, num_cols),
#       ("num_trans_poly", num_pipe_poly, num_cols_poly),
#        ("cat_trans", cat_pipe, cat_cols),
    ],
    remainder = "drop",
)


rf_pipe = ImbPipeline(steps=[
    ("oversampler", RandomOverSampler()),   # add the RandomOverSampler
    ("pre", preprocessing),
    ("kbest", SelectKBest(k="all", score_func=f_classif)),  # k=20 exceeded the 7 available features, so keep all
    ("model", RandomForestClassifier()),
#    ("model", RandomForestClassifier_mod()),
])


In [None]:
# fit pipeline on cleaned (and filtered) training set

rf_pipe.fit(features_train, target_train)

In [None]:
# predict and evaluate on test set

print(classification_report(target_test, rf_pipe.predict(features_test)))






In [None]:
#
#              precision    recall  f1-score   support
#
#           0       0.86      0.94      0.90        54
#           1       0.81      0.62      0.70        21
#
#   accuracy                           0.85        75
#   macro avg       0.84      0.78      0.80        75
#weighted avg       0.85      0.85      0.85        75


#polynomial feature all num. features
#              precision    recall  f1-score   support
#
#           0       0.93      0.98      0.95        54
#           1       0.94      0.81      0.87        21
#
#    accuracy                           0.93        75
#   macro avg       0.94      0.90      0.91        75
# weighted avg       0.93      0.93      0.93        75

In [None]:
# Extracting the feature names after the transformation
preprocessor = rf_pipe.named_steps['pre']
feature_names = preprocessor.get_feature_names_out()

# Selecting the features after SelectKBest
kbest = rf_pipe.named_steps['kbest']
selected_feature_names = [feature_names[i] for i in kbest.get_support(indices=True)]

# Extracting feature importances from the RandomForest model
rf_model = rf_pipe.named_steps['model']
feature_importance_rf = pd.Series(rf_model.feature_importances_, index=selected_feature_names)

# Sorting the values and selecting the top 20 features
top_20_features = feature_importance_rf.nlargest(20)

# Creating a bar plot
plt.style.use('fivethirtyeight')
top_20_features.plot(kind='barh', figsize=(10, 8))

# Adding labels and title
plt.xlabel('Feature Importance')
plt.ylabel('Features')
plt.title('Feature Importance in Random Forest')

# Showing the plot
plt.show()


### Data Scaling

This is done with the `StandardScaler`.
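As a small sketch of what the scaler does, the snippet below fits `StandardScaler` on hypothetical training values and checks the result against the formula z = (x - mean) / std, where mean and standard deviation come from the training data only. The toy values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column toy data (e.g. ages)
train = np.array([[18.0], [20.0], [22.0], [24.0]])
test = np.array([[21.0], [30.0]])

scaler = StandardScaler().fit(train)   # learns mean and std on the training set
test_scaled = scaler.transform(test)   # applies the training statistics to the test set

# Manual computation with the population standard deviation (ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(test_scaled.ravel())
print(np.allclose(test_scaled, manual))
```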

### Dimensionality Reduction

Here we apply the knowledge from the previous EDA and Power BI analysis. The features fall into the following groups:

* all organic features (age, gender, etc.)
* BMI and Epworth score (OR the who_bmi and sleepiness classification columns)
* anxiety (GAD score OR anxiety severity OR anxiousness)
* anxiety diagnosis OR treatment
* depression diagnosis OR treatment (since independent of the test)
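Choosing one column per group can be supported by checking which features are nearly redundant. The following sketch lists feature pairs whose absolute correlation exceeds a threshold. The toy values, the column `anxiety_level` (a hypothetical ordinal coding of anxiety severity) and the threshold of 0.9 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: anxiety_level is an ordinal coding that follows gad_score closely
toy = pd.DataFrame({
    "gad_score": [2, 5, 9, 14, 18, 7],
    "anxiety_level": [0, 1, 1, 2, 3, 1],
    "age": [19, 23, 20, 22, 19, 21],
})

corr = toy.corr().abs()

# Keep only the upper triangle so every pair appears once and the diagonal is skipped
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the threshold are candidates for keeping only one of the two columns
high = pairs[pairs > 0.9]
print(high)
```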

In [None]:
# Correlation matrix

target = ['depressiveness']


# Compute the correlation matrix
corr_matrix = features_train[num_cols].corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin = -1, vmax = 1, fmt='.2f')
plt.title('Correlation matrix')
plt.show()


### Train model

In [None]:
from sklearn.exceptions import DataConversionWarning
import warnings
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [None]:
#useful imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, precision_score, accuracy_score, confusion_matrix, classification_report

Here we build the pipelines:

In [None]:
cat_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="constant", fill_value="missing")),
#   ("ohe", OneHotEncoder(handle_unknown='ignore')),
    ("target_enc", ce.TargetEncoder()),  # add the TargetEncoder
])

# Makes more sense here
num_pipe = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="mean")),
#    ("poly", PolynomialFeatures(degree=2, include_bias=False)),  # add PolynomialFeatures
    ("scaler", StandardScaler()),

])



# Pipeline for numerical data with PCA
num_pipe_pca = Pipeline(steps=[
    ("imp", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
#    ("pca", PCA(n_components=0.90))  # keep 90% of the variance
])



In [None]:
# RF model



num_cols = ['age', 'gender', 'bmi', 'gad_score', 'epworth_score', 'depressiveness_awareness', 'anxiety_awareness']


target = ['depressiveness']



# We have only numerical cols

preprocessing_rf = ColumnTransformer(
    transformers=[
        ("num_trans", num_pipe, num_cols),
#        ("num_pca", num_pipe_pca, num_cols_pca),
#        ("cat_trans", cat_pipe, cat_cols)
    ],
    remainder="drop"
)

# Add PolynomialFeatures
# poly_features = PolynomialFeatures(degree=2, include_bias=False)


rf_pipe_pca = ImbPipeline(steps=[
    ("oversampler", RandomOverSampler()),   # add or remove the RandomOverSampler; double-check at the end!
    ("pre", preprocessing_rf),
#    ("kbest", SelectKBest(k=10, score_func=f_classif)),
    ("model", RandomForestClassifier()),
])




print(rf_pipe_pca)


In [None]:
# fit pipeline on cleaned (and filtered) training set

rf_pipe_pca.fit(features_train, target_train)

In [None]:

#display(features_train)
print(classification_report(target_test, rf_pipe_pca.predict(features_test)))



In [None]:
# Logging the models:

import logging
import os

# Manually close the current log handlers
logging.shutdown()

In [None]:
import logging
import os

# Ensure the 'log' directory exists
log_dir = '../log'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)

# Confirm that the directory was created
if os.path.exists(log_dir):
    print(f"Directory '{log_dir}' exists or was successfully created.")
else:
    print(f"Failed to create directory '{log_dir}'.")

# Check the current working directory
print(f"Current working directory: {os.getcwd()}")

# Configure logging explicitly
log_file_path = os.path.join(log_dir, 'best_model_evaluation.log')
print(f"Log file path: {log_file_path}")  # This should point to the 'log' directory

# Create and configure the logger explicitly
logger = logging.getLogger('model_evaluation_logger')
logger.setLevel(logging.INFO)


# Avoid duplicate logs by clearing existing handlers if any

if logger.hasHandlers():

    logger.handlers.clear()


# Create a file handler for logging

file_handler = logging.FileHandler(log_file_path)

file_handler.setLevel(logging.INFO)


# Create a logging format

formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

file_handler.setFormatter(formatter)


# Add the file handler to the logger

logger.addHandler(file_handler)


# Example log entry

logger.info("Logging to the log/best_model_evaluation.log file.")

print("Logging setup completed.")



In [None]:


def log_model(model_pipeline, features_test, target_test):
    """
    Logs the details of a model including its name, parameters, and classification report.

    Args:
    - model_pipeline: The model pipeline to be logged.
    - features_test: The test features used to generate the classification report.
    - target_test: The test target used to generate the classification report.
    """

    # Model name
    model_name = type(model_pipeline.named_steps['model']).__name__

    # Model parameters
    model_params = model_pipeline.named_steps['model'].get_params()

    # Classification report
    classification_rep = classification_report(target_test, model_pipeline.predict(features_test))

    # Log the model details
    logger.info("Model Evaluation")
    logger.info(f"Model Name: {model_name}")
    logger.info("Model Parameters:")
    for param, value in model_params.items():
        logger.info(f"{param}: {value}")

    # Log the classification report
    logger.info("Classification Report:")
    logger.info("\n" + classification_rep)

    print("Model details and evaluation have been logged to 'log/best_model_evaluation.log'.")

# Example usage
log_model(rf_pipe_pca, features_test, target_test)


In [None]:
# Model 2: Logistic Regression


preprocessing_log = ColumnTransformer(
    transformers=[
        ("num_trans", num_pipe, num_cols),
#        ("num_pca", num_pipe_pca, num_cols_pca),
#        ("cat_trans", cat_pipe, cat_cols)
    ],
    remainder="drop"
)

# Optionally add PolynomialFeatures
# poly_features = PolynomialFeatures(degree=2, include_bias=False)


log_pipe_pca = ImbPipeline(steps=[
#    ("pre", preprocessing_rf),
    ("oversampler", RandomOverSampler()),   # add or remove the RandomOverSampler; check this at the end!
    ("pre", preprocessing_log),
#    ("poly", poly_features),   # try later; brings no further improvement
#    ("kbest", SelectKBest(k=10, score_func=f_classif)),
#    ("model", LogisticRegression(max_iter=1000))
    ("model", LogisticRegression(max_iter=5000))    
])


print(log_pipe_pca)





In [None]:
# fit pipeline on cleaned (and filtered) training set

log_pipe_pca.fit(features_train, target_train)

In [None]:
# predict and evaluate on test set
# (apply the feature engineering to the test set as well)
#
#display(target_test)
#display(features_test)

#display(features_train)
print(classification_report(target_test, log_pipe_pca.predict(features_test)))

In [None]:
# Logging the output 
log_model(log_pipe_pca, features_test, target_test)

In [None]:
# Model 3: KNN


preprocessing_knn = ColumnTransformer(
    transformers=[
        ("num_trans", num_pipe, num_cols),
#        ("num_pca", num_pipe_pca, num_cols_pca),
#        ("cat_trans", cat_pipe, cat_cols)
    ],
    remainder="drop"
)

# Optionally add PolynomialFeatures
# poly_features = PolynomialFeatures(degree=2, include_bias=False)


knn_pipe_pca = ImbPipeline(steps=[
#    ("pre", preprocessing_rf),
    ("oversampler", RandomOverSampler()),   # add or remove the RandomOverSampler; check this at the end!
    ("pre", preprocessing_knn),
#    ("poly", poly_features),   # try later; brings no further improvement
#    ("kbest", SelectKBest(k=10, score_func=f_classif)),
    ("model", KNeighborsClassifier())
])


print(knn_pipe_pca)


In [None]:
# fit it

knn_pipe_pca.fit(features_train, target_train)

In [None]:

#display(features_train)
print(classification_report(target_test, knn_pipe_pca.predict(features_test)))

In [None]:
# Logging the output 
log_model(knn_pipe_pca, features_test, target_test)

In [None]:
# Model 4: Decision Tree


preprocessing_dec = ColumnTransformer(
    transformers=[
        ("num_trans", num_pipe, num_cols),
#        ("num_pca", num_pipe_pca, num_cols_pca),
#        ("cat_trans", cat_pipe, cat_cols)
    ],
    remainder="drop"
)

# Optionally add PolynomialFeatures
# poly_features = PolynomialFeatures(degree=2, include_bias=False)


dec_pipe_pca = ImbPipeline(steps=[
#    ("pre", preprocessing_rf),
    ("oversampler", RandomOverSampler()),   # add or remove the RandomOverSampler; check this at the end!
    ("pre", preprocessing_dec),
#    ("poly", poly_features),   # try later; brings no further improvement
#    ("kbest", SelectKBest(k=10, score_func=f_classif)),
    ("model", DecisionTreeClassifier()),
])


print(dec_pipe_pca)


In [None]:
# fit pipeline on cleaned (and filtered) training set

dec_pipe_pca.fit(features_train, target_train)

In [None]:

#display(features_train)
print(classification_report(target_test, dec_pipe_pca.predict(features_test)))

In [None]:
# Logging the output 
log_model(dec_pipe_pca, features_test, target_test)

In [None]:
# Validation curve to visualize the learning behavior

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import validation_curve


# Define the tree depths to be tested
depth = [3, 6, 9, 10, 11, 12] # , 13, 14, 15, 18, 21]

# Compute the validation curve
train_scores, test_scores = validation_curve(
    estimator=dec_pipe_pca, # The pipeline
    X=features_train, # Feature matrix
    y=target_train, # Target vector
    param_name='model__max_depth', # Hyperparameter of the DecisionTreeClassifier inside the pipeline
    param_range=depth, # Values of the hyperparameter
    cv=5, # 5-fold cross-validation
    scoring='f1', # F1 score for evaluation
    n_jobs=-1 # Use all available cores to speed things up
)

# Mean and standard deviation of the training and validation scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

# Plot the validation curve
plt.figure(figsize=(10, 6))
plt.title('Validation Curve with Decision Tree')
plt.xlabel('max_depth')
plt.ylabel('F1 Score')
plt.ylim(0.0, 1.1)

# Training scores with a band of one standard deviation
plt.plot(depth, train_scores_mean, label='Training score', color='r')
plt.fill_between(depth, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.2, color='r')

# Cross-validation scores with a band of one standard deviation
plt.plot(depth, test_scores_mean, label='Cross-validation score', color='g')
plt.fill_between(depth, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.2, color='g')

plt.legend(loc='best')
plt.grid()
plt.show()

# Show the F1 scores in a DataFrame
F1 = pd.DataFrame({
    'max_depth': depth,
    'Training F1-score': train_scores_mean,
    'Validation F1-score': test_scores_mean
})
print(F1)
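
One way to read such a table is to pick the depth with the highest mean validation score and to watch the gap between training and validation score, since a growing gap hints at overfitting. The sketch below uses made-up fold scores purely for illustration; they are not results from this notebook.

```python
from statistics import mean

# Made-up fold scores for illustration (rows: depths, columns: CV folds)
depth = [3, 6, 9]
train_scores = [[0.70, 0.72, 0.71], [0.85, 0.86, 0.84], [0.99, 0.98, 0.99]]
test_scores = [[0.68, 0.70, 0.69], [0.78, 0.80, 0.79], [0.74, 0.73, 0.75]]

train_mean = [mean(row) for row in train_scores]
test_mean = [mean(row) for row in test_scores]

# Depth with the highest mean validation score
best_index = max(range(len(depth)), key=lambda i: test_mean[i])
best_depth = depth[best_index]

# Gap between training and validation score per depth
gaps = [tr - te for tr, te in zip(train_mean, test_mean)]

print(f"Best max_depth: {best_depth}")
for d, gap in zip(depth, gaps):
    print(f"max_depth={d}: train-validation gap = {gap:.3f}")
```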

In [None]:

# Create the classification reports
dec_report = classification_report(target_test, dec_pipe_pca.predict(features_test), output_dict=True)
knn_report = classification_report(target_test, knn_pipe_pca.predict(features_test), output_dict=True)
log_report = classification_report(target_test, log_pipe_pca.predict(features_test), output_dict=True)
rf_report = classification_report(target_test, rf_pipe_pca.predict(features_test), output_dict=True)

# Function to extract the metrics from a report
def extract_metrics(report, model_name):
    metrics = []
    for key in ['0', '1', 'weighted avg']:
        precision = report[key]['precision']
        recall = report[key]['recall']
        f1 = report[key]['f1-score']  # named f1 to avoid shadowing the imported f1_score function
        metrics.append({
            'Model': model_name,
            'Class': key,
            'Precision': precision,
            'Recall': recall,
            'F1 Score': f1
        })
    return metrics

# Extract the metrics for all models
all_metrics = []
all_metrics.extend(extract_metrics(dec_report, 'Decision Tree'))
all_metrics.extend(extract_metrics(knn_report, 'KNN'))
all_metrics.extend(extract_metrics(log_report, 'Logistic Regression'))
all_metrics.extend(extract_metrics(rf_report, 'Random Forest'))

# Create a DataFrame
metrics_df = pd.DataFrame(all_metrics)

# Print the table
print(metrics_df)

In [None]:
# logging the results


# Log the metrics for a single model
def log_metrics(metrics):
    for metric in metrics:
        logger.info(f"Model: {metric['Model']}, Class: {metric['Class']}, Precision: {metric['Precision']:.4f}, Recall: {metric['Recall']:.4f}, F1 Score: {metric['F1 Score']:.4f}")


# Log all metrics
log_metrics(all_metrics)

In [None]:

# Find the model with the highest F1 score for class '1'
best_model_row = metrics_df[metrics_df['Class'] == '1'].sort_values(by='F1 Score', ascending=False).iloc[0]
best_model_name = best_model_row['Model']

# Store the best model
if best_model_name == 'Decision Tree':
    best_model = dec_pipe_pca
elif best_model_name == 'KNN':
    best_model = knn_pipe_pca
elif best_model_name == 'Logistic Regression':
    best_model = log_pipe_pca
elif best_model_name == 'Random Forest':
    best_model = rf_pipe_pca

print(f"The best model is: {best_model_name} with F1 Score: {best_model_row['F1 Score']}")
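
The if/elif chain above can also be written as a dictionary lookup, which keeps the model names and the pipelines in one place. This is a minimal sketch under assumptions: plain strings stand in for the fitted pipelines, and the F1 scores are made up for illustration.

```python
# Hypothetical stand-ins for the fitted pipelines (strings keep the sketch self-contained)
candidates = {
    'Decision Tree': 'dec_pipe_pca',
    'KNN': 'knn_pipe_pca',
    'Logistic Regression': 'log_pipe_pca',
    'Random Forest': 'rf_pipe_pca',
}

# Made-up class-1 F1 scores, not results from this notebook
f1_class_1 = {
    'Decision Tree': 0.61,
    'KNN': 0.58,
    'Logistic Regression': 0.66,
    'Random Forest': 0.69,
}

# Name with the highest score, then the matching pipeline
best_model_name = max(f1_class_1, key=f1_class_1.get)
best_model = candidates[best_model_name]

print(f"The best model is: {best_model_name} with F1 Score: {f1_class_1[best_model_name]}")
```

A lookup with an unknown name raises a `KeyError`, whereas the if/elif chain would silently leave `best_model` undefined.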

In [None]:
logger.info(".....")
logger.info(f"The best model is: {best_model_name} with F1 Score: {best_model_row['F1 Score']}")


In [None]:
# Optimization of the logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100],  # Regularization parameter
    'model__solver': ['newton-cg', 'lbfgs', 'sag'],
#    'model__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], # would be relevant for L1
    'model__penalty': ['l2'],  # L2 regularization
}

# Logistic Regression Pipeline
log_pipe_pca = ImbPipeline(steps=[
    ("oversampler", RandomOverSampler()),  # Add the RandomOverSampler
    ("pre", preprocessing_log),
    ("model", LogisticRegression(max_iter=5000))  # Increase the maximum number of iterations
])

# Run GridSearchCV

# Optionally use scoring='recall' to optimize recall instead of F1
grid_search_log = GridSearchCV(log_pipe_pca, param_grid, cv=5, scoring='f1', n_jobs=-1)
#grid_search_log = GridSearchCV(log_pipe_pca, param_grid, cv=5, scoring='recall', n_jobs=-1)
grid_search_log.fit(features_train, target_train)

# Show the best parameters and the best score
print(f'Best parameters: {grid_search_log.best_params_}')
print(f'Best score: {grid_search_log.best_score_}')
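
To estimate the cost of a grid search beforehand, the number of parameter combinations can be multiplied by the number of folds. The sketch below does this for the logistic regression grid from the cell above; the variable names `combinations` and `n_fits` are chosen here for illustration.

```python
from itertools import product

# Same grid as in the logistic regression search
param_grid = {
    'model__C': [0.01, 0.1, 1, 10, 100],
    'model__solver': ['newton-cg', 'lbfgs', 'sag'],
    'model__penalty': ['l2'],
}
cv = 5

# Every combination of one value per parameter
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# One fit per combination and fold
n_fits = len(combinations) * cv

print(f"{len(combinations)} combinations x {cv} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best combination on the whole training set.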

In [None]:
logger.info(".....")
logger.info("Best logistic regression model:")
logger.info(f'Best parameters: {grid_search_log.best_params_}')
logger.info(f'Best score: {grid_search_log.best_score_}')



In [None]:
# Logistic Regression

# Save the best model
best_model_log = grid_search_log.best_estimator_

print(f'The best model in logistic regression is {best_model_log}.')

In [None]:
# Classification report of the best logistic regression model
print(classification_report(target_test, grid_search_log.predict(features_test)))

In [None]:
# Optimization of the random forest


from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier



# Define the parameter grid
param_grid = {
    'model__n_estimators': [100, 200],  # , 300],
    'model__max_features': ['sqrt', 'log2'],
    'model__max_depth': [10, 20],  # , 30, None],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [2, 4],  # [1, 2, 4]
}

# Run GridSearchCV, optimizing F1 (the commented-out variant optimizes precision)
grid_search_rf = GridSearchCV(rf_pipe_pca, param_grid, cv=5, scoring='f1', n_jobs=-1)
#grid_search_rf = GridSearchCV(rf_pipe_pca, param_grid, cv=5, scoring='precision', n_jobs=-1)
grid_search_rf.fit(features_train, target_train)

# Show the best parameters and the best score
print(f'Best parameters: {grid_search_rf.best_params_}')
print(f'Best score: {grid_search_rf.best_score_}')


In [None]:
# Classification of the best RF model
print(classification_report(target_test, grid_search_rf.predict(features_test)))

In [None]:



# Save and print the best model
best_model_rf = grid_search_rf.best_estimator_

print(f'The best model in Random Forest is {best_model_rf}.')
 


In [None]:
logger.info(".....")
logger.info("Best random forest model:")
logger.info(f'Best parameters: {grid_search_rf.best_params_}')
logger.info(f'Best score: {grid_search_rf.best_score_}')

In [None]:
# Optimization of KNN


# Define the parameter grid for KNN
param_grid_knn = {
    'model__n_neighbors': [3, 5, 7, 9],  # Number of neighbors to consider
    'model__weights': ['uniform', 'distance'],  # Weight function used in prediction
    'model__metric': ['euclidean', 'manhattan', 'minkowski'],  # Distance metric to use
}

# Set up GridSearchCV for the KNN pipeline
grid_search_knn = GridSearchCV(knn_pipe_pca, param_grid_knn, cv=5, scoring='f1', n_jobs=-1)

# Fit the GridSearchCV
grid_search_knn.fit(features_train, target_train)

# Display the best parameters and score
print(f'Best parameters for KNN: {grid_search_knn.best_params_}')
print(f'Best score for KNN: {grid_search_knn.best_score_}')

In [None]:
# Classification report of the KNN model

print(classification_report(target_test, grid_search_knn.predict(features_test)))

In [None]:
logger.info(".....")
logger.info("Best KNN model:")
logger.info(f'Best parameters: {grid_search_knn.best_params_}')
logger.info(f'Best score: {grid_search_knn.best_score_}')

### Model selection

Choose the best model. Decide for yourself which metric is most important to you. Using `confusion_matrix()` from `sklearn.metrics` you can see exactly how many data points were classified correctly and incorrectly. Maybe this will help you with your decision.
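
To make the four cells of a binary confusion matrix concrete, the sketch below counts them by hand in plain Python. The labels are made up for illustration (1 = depressive, 0 = not depressive). For the labels 0 and 1, `confusion_matrix()` uses the same layout: rows are the true classes, columns the predicted classes.

```python
from collections import Counter

# Made-up labels for illustration, not data from this notebook
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count each (true, predicted) pair
counts = Counter(zip(y_true, y_pred))
tn, fp = counts[(0, 0)], counts[(0, 1)]
fn, tp = counts[(1, 0)], counts[(1, 1)]

# Rows = true class, columns = predicted class
matrix = [[tn, fp], [fn, tp]]

# Metrics for class 1 derived from the four cells
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

If missing a depressive participant (a false negative) is considered worse than a false alarm, recall for class 1 is the cell to watch.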

In [None]:
# Best model


from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Make predictions on the test set
predictions = best_model_rf.predict(features_test)

# Compute the confusion matrix
cm = confusion_matrix(target_test, predictions)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix of Random Forest')
plt.show()


# Logistic Regression

# Make predictions on the test set
predictions = best_model_log.predict(features_test)

# Compute the confusion matrix
cm = confusion_matrix(target_test, predictions)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix of Logistic Regression')
plt.show()


In [None]:
# `best_model_rf` is the tuned random forest pipeline and `features_test` is the test set

# Get the predicted probabilities for each class
probabilities = best_model_rf.predict_proba(features_test)

prob_class_1 = probabilities[:, 1]  # Probability of class 1 for each sample
prob_class_0 = probabilities[:, 0]  # Probability of class 0 for each sample

# Predict all samples once instead of calling predict() for every row
predicted_classes = best_model_rf.predict(features_test)

# Print each prediction together with the probability of the predicted class
for i, predicted_class in enumerate(predicted_classes):
    if predicted_class == 0:
        print(f"Sample {i+1}: Predicted Depressiveness = {predicted_class}, Probability = {prob_class_0[i]:.2f}")
    else:
        print(f"Sample {i+1}: Predicted Depressiveness = {predicted_class}, Probability = {prob_class_1[i]:.2f}")

In [None]:
# We choose the RF model

## Final Data-pipeline

We apply the full data set to our best model.

In [None]:


def predict(csv_file, model):
    """Predict depressiveness for the participants in a CSV file.
    
    Args:
        csv_file (str): Path of the incoming data CSV file.
        model (model): The trained model.
    
    Returns:
        numpy.ndarray: Predicted depressiveness for each row.
    """
    # Read the file
    test = pd.read_csv(csv_file)


    # Clean the file
    test = clean_data(test)

    # sample the file
    test = sampling_features(test)
    
    # Feature Engineering
    test = engineer_features(test)

    # Prediction
    predictions = model.predict(test)
    
    return predictions

# Example: test the function with the stored features_test.csv, which is never touched during training and serves as the test set
csv_file = '../data/features_test.csv'

# We choose the random forest as the best model

#best_pipe = dec_pipe_pca
#best_pipe = best_model_log
best_pipe = best_model_rf


trained_model = best_pipe # choosing the best model
test = predict(csv_file, trained_model)
print(test)



In [None]:
# Alternative prediction function for the case that the target is stored as well and passed through clean_data, sampling_features and engineer_features

# def predict(csv_file_1, csv_file_2, model):
#     """Predict depressiveness for the participants in a CSV file.
    
#     Args:
#         csv_file_1 (str): Path of the incoming features CSV file.
#         csv_file_2 (str): Path of the incoming target CSV file.
#         model (model): The trained model.
    
#     Returns:
#         numpy.ndarray: Predicted depressiveness for each row.
#     """
#     # Read data
#     test = pd.read_csv(csv_file_1)
#     target = pd.read_csv(csv_file_2)


#     # clean data
#     test, target = clean_data(test, target)

#     # apply sampling
#     test, target = sampling_features(test, target)
    
#     # Feature Engineering
#     test, target = engineer_features(test, target)

#     # Prediction
#     predictions = model.predict(test)
    
#     return predictions

# csv_file_1 = 'features_test.csv'
# csv_file_2 = 'target_test.csv'



# best_pipe = best_model_rf  
# #best_pipe = best_model_rf
# #best_pipe = best_model_log  


# trained_model = best_pipe 
# test = predict(csv_file_1, csv_file_2, trained_model)
# print(test)



In [None]:

# Classification report for the test data


print(classification_report(target_test, predict(csv_file, trained_model)))

## Model Interpretation

We discuss the feature importance of the best model

In [None]:

# We apply the best trained model, e.g., best_model_rf
trained_model = best_model_rf 
# trained_model = best_model_log

# Calculate the baseline F1 score on the unpermuted test set
y_pred = trained_model.predict(features_test)
f1_orig = f1_score(target_test, y_pred)

# Calculate the Permutation Feature Importance
perm_importances = []

for col in features_test.columns:
    permuted_features = features_test.copy()
    permuted_features[col] = np.random.permutation(permuted_features[col])
    permuted_pred = trained_model.predict(permuted_features)
    permuted_f1 = f1_score(target_test, permuted_pred)
    perm_important = f1_orig - permuted_f1
    perm_importances.append(perm_important)

# Build a Series of the permutation importances with the column names as index
perm_importance = pd.Series(perm_importances, index=features_test.columns)

# Select the top 10 features
top_10_features = perm_importance.sort_values(ascending=False).head(10)

# Visualization of the top 10 permutation importances
plt.figure(figsize=(10, 8))
top_10_features.plot(kind='barh')
plt.xlabel('Feature Importance (F1 Score Decrease)')
plt.ylabel('Features')
plt.title('Top 10 Permutation Feature Importance')
plt.gca().invert_yaxis()  # Invert the y-axis so that the most important feature is at the top
plt.show()
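
A single shuffle per column gives a noisy estimate, so the drop in F1 is often averaged over several shuffles. The sketch below shows this idea in plain Python under assumptions: `toy_model`, the two columns `score` and `noise` and all values are hypothetical stand-ins, not the notebook's pipeline or data. scikit-learn offers the same technique as `sklearn.inspection.permutation_importance` with an `n_repeats` parameter.

```python
import random


def f1_binary(y_true, y_pred):
    """F1 score for class 1, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def toy_model(rows):
    """Hypothetical stand-in for a fitted pipeline: it only looks at 'score'."""
    return [1 if row['score'] >= 10 else 0 for row in rows]


def mean_permutation_importance(predict, rows, y_true, columns, n_repeats=20, seed=0):
    """Average F1 decrease per column over n_repeats random shuffles."""
    rng = random.Random(seed)
    baseline = f1_binary(y_true, predict(rows))
    result = {}
    for col in columns:
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Copy every row and overwrite only the shuffled column
            permuted = [dict(row, **{col: value}) for row, value in zip(rows, shuffled)]
            drops.append(baseline - f1_binary(y_true, predict(permuted)))
        result[col] = sum(drops) / n_repeats
    return result


# Made-up data: the target depends on 'score' only, 'noise' is irrelevant
scores = [2, 15, 7, 12, 3, 18, 9, 11]
noise = [5, 1, 4, 2, 8, 3, 7, 6]
rows = [{'score': s, 'noise': n} for s, n in zip(scores, noise)]
y_true = [1 if s >= 10 else 0 for s in scores]

importances = mean_permutation_importance(toy_model, rows, y_true, ['score', 'noise'])
print(importances)
```

Shuffling a column the model ignores leaves the predictions unchanged, so its importance is exactly zero, while the column the model relies on shows a positive average drop.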


In [None]:
# Permutation Feature Importance

# Calculate the Permutation Feature Importance

# features_train, target_train    Train-data-set
# features_test, target_test      Test-data-set

trained_model = best_model_rf 
#trained_model = best_model_log

# Calculate the baseline F1 score on the unpermuted test set
y_pred = trained_model.predict(features_test)
f1_orig = f1_score(target_test, y_pred)

# Columns to focus on
columns_finals = ['age', 'gender', 'bmi', 'gad_score', 'epworth_score', 'depressiveness_awareness', 'anxiety_awareness']

# Computation of the Permutation Feature Importance
perm_importances = []

for col in columns_finals:
    permuted_features = features_test.copy()
    permuted_features[col] = np.random.permutation(permuted_features[col])
    permuted_pred = trained_model.predict(permuted_features)
    permuted_f1 = f1_score(target_test, permuted_pred)
    perm_important = f1_orig - permuted_f1
    perm_importances.append(perm_important)

# Set a series of the permutation-importances and their column-name as index
perm_importance = pd.Series(perm_importances, index=columns_finals)

# Sorting the values
perm_importance = perm_importance.sort_values()

# Visualization of the Permutation Importances
plt.style.use('fivethirtyeight')

# Plot the features
perm_importance.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Feature Importance (F1 Score Decrease)')
plt.ylabel('Features')
plt.title('Permutation Feature Importance for Selected Features')
plt.show()


All the features

In [None]:
print(features_train.columns)



In [None]:
print(best_pipe)

## Final fitting of the best model on the full dataset (train and test data combined)

We use the file data/depression_anxiety_data.csv as the full dataset for fitting our best model and the dataset aim_test.csv for our prediction.
We store the predicted values in predictions_depression.csv.

In [None]:
import pandas as pd

def fit(csv_file, model, target):
    """ Final fitting of the best model
    
    Args:
        csv_file (str): Path of the incoming data csv file.
        model: The trained model.
        target (list of str): List of target column names.
    
    Returns:
        model: Fitted model.
    """
    # Read the datafile
    df_clean = pd.read_csv(csv_file)

    # Skip the NaN of the Target  # Optional
#    df_clean = clean_data_target(df_clean, target)
    
    # Separate features and target
    features = df_clean.drop(columns=target)
    target = df_clean[target]

    # Optional additional cleaning if needed
    features, target = clean_data(features, target)

    # Apply sampling
    features, target = sampling_features(features, target)
    
    # Feature Engineering
    features, target = engineer_features(features, target)

    # Fit the model
    fitted_model = model.fit(features, target)
    
    return fitted_model

# Original and full dataset for the final training

csv_file = '../data/depression_anxiety_data.csv'
target = ['depressiveness']
trained_model = best_pipe  # Best model (assumed to be defined elsewhere)

fitted_model = fit(csv_file, trained_model, target)


In [None]:

# Final prediction
csv_file = '../data/aim_test.csv'
trained_model = fitted_model # Best model

# Calculate the prediction
predictions = predict(csv_file, trained_model)

# Number of persons predicted as depressive
num_predict = sum(predictions)

# Save the predictions to a CSV file
predictions_df = pd.DataFrame({'Depressiveness': predictions})
predictions_df.to_csv('../Data/predictions_depression.csv', index=False)

# Print the number of persons predicted as depressive
print(f"Number of persons predicted as depressive: {num_predict}")





In [None]:
print(trained_model)

Save the best model, the feature names and the target name as a pickle file for the interface.

In [None]:
import pickle
import pandas as pd


#  Save with pickle

target_col = ['depressiveness']



all_features = ['age', 'gender', 'bmi', 'gad_score', 'epworth_score', 'depressiveness_awareness', 'anxiety_awareness']

with open('../models/best_model_depression.pkl', 'wb') as file:
    pickle.dump(trained_model, file)
    pickle.dump(target_col, file)
    pickle.dump(all_features, file)
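
Since three objects are dumped into the same file one after another, the interface has to call `pickle.load` three times in the same order. The sketch below shows the round trip under assumptions: a dictionary stands in for the fitted pipeline, and a temporary path is used instead of `../models/`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in; in the notebook the first object is the fitted pipeline
trained_model = {'name': 'placeholder for the fitted pipeline'}
target_col = ['depressiveness']
all_features = ['age', 'gender', 'bmi', 'gad_score', 'epworth_score',
                'depressiveness_awareness', 'anxiety_awareness']

# Temporary demo path, not the notebook's ../models/ directory
path = os.path.join(tempfile.mkdtemp(), 'best_model_depression.pkl')

with open(path, 'wb') as file:
    pickle.dump(trained_model, file)
    pickle.dump(target_col, file)
    pickle.dump(all_features, file)

# Each pickle.load call returns the next object, in the order it was dumped
with open(path, 'rb') as file:
    loaded_model = pickle.load(file)
    loaded_target = pickle.load(file)
    loaded_features = pickle.load(file)

print(loaded_target, loaded_features)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.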