<a href="https://colab.research.google.com/github/insarov2014/Depression-Data-Analysis/blob/main/Depression_Data_Analysis_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

I aim to find a model that helps predict whether a message shows signs of depression.

Ref: https://www.kaggle.com/datasets/infamouscoder/depression-reddit-cleaned/data?select=depression_dataset_reddit_cleaned.csv

In [None]:
import numpy as np
import pandas as pd

In [None]:
pip install num2words

In [None]:
pip install autocorrect

In [None]:
pip install evaluate

In [None]:
from num2words import num2words
import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller

# ML imports:
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
import sklearn.linear_model as lm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# DL imports:
from transformers import AutoTokenizer, TrainingArguments, Trainer
from torch.utils.data import DataLoader
import evaluate

In [None]:
# Items:
# source_of_dataset: Whether to load the dataset from the package, or from a URL (for the particular dataset in this notebook, both options are available)
# json_url: The url for the dataset's .json file
# db_name: The db name from HuggingFace that holds the raw data
# do_preprocessing: Logical, should preprocessing be performed
# do_enhanced_preprocessing: Logical, should the computation-heavy preprocessing be performed
# do_feature_eng: Logical
# maximize_a_priori: Logical, should the univariate preliminary feature selection be based on a priori or a posteriori stats
# num_chosen_features_per_class: Int, for the preliminary feature selection, how many features should be selected per class
# test_size: ratio between 0 - 1
# feature_eng_details: Either "CountVectorizer-binary" (for binary one hot encoding) or "CountVectorizer-BOW" (for bag-of-words counts)
# seed: Integer, the random seed used to ensure reproducibility of results
config_dict = {#'source_of_dataset': "json",
               #'json_url': "https://huggingface.co/datasets/medalpaca/medical_meadow_health_advice/raw/main/medical_meadow_health_advice.json",
               #'db_name': "medalpaca/medical_meadow_health_advice",
               'do_preprocessing': True,
               'do_enhanced_preprocessing': False,
               'do_feature_eng': True,
               'maximize_a_priori': False,
               'num_chosen_features_per_class': 200,
               'test_size': 0.25,
               'feature_eng_details': "CountVectorizer-binary",
               'ngram_range_min': 1,
               'ngram_range_max': 3,
               'max_features': 1000,
               'seed': 0}

# Deep learning training parameters:
# See description of input parameters in documentation for transformers.TrainingArguments.
lm_training_args = TrainingArguments(
    output_dir="test_trainer",
    num_train_epochs=4, #2
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="steps",
    logging_steps=100,
    report_to=[],  # Disable logging to Weights & Biases or other services
    )

layers_to_fine_tune = None

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)


In [None]:
# Mount Google Drive so that files stored there can be read easily
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("infamouscoder/depression-reddit-cleaned")

print("Path to dataset files:", path)

In [None]:
import os

# Construct the full path to the CSV file
csv_file_path = os.path.join(path, "depression_dataset_reddit_cleaned.csv")

# Read the CSV file into a pandas DataFrame
try:
    df = pd.read_csv(csv_file_path)
    print("Successfully loaded 'depression_dataset_reddit_cleaned.csv':")
    print(df.head())
except FileNotFoundError:
    print(f"Error: The file '{csv_file_path}' was not found.")
except Exception as e:
    print(f"An error occurred while reading the CSV: {e}")

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df[df.duplicated().values==True].shape

In [None]:
df_prep = df.drop_duplicates().copy()
df_prep.shape

In [None]:
mask_NaN = df_prep.isnull().any(axis=1)
df_prep[mask_NaN].shape

In [None]:
df_prep['is_depression'].value_counts()

In [None]:
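df_prep['is_depression'].value_counts().index[0]

In [None]:
# value_counts() sorts the classes by frequency, so its first index entry is the most frequent class
# (df_prep.index[0] would only return the first row label of the dataframe):
most_frequent_class = df_prep['is_depression'].value_counts().index[0]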
print("The most frequent class is:", most_frequent_class)
print("And its baseline accuracy is:", round((df_prep['is_depression'] == most_frequent_class).mean(), 3))
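As a minimal sketch of the majority-class baseline, assuming a small hypothetical list of labels (`toy_labels`, not the actual dataset): the most frequent label serves as the constant prediction of a dummy classifier, and its share of the data is the baseline accuracy.

```python
from collections import Counter

# Hypothetical toy labels, for illustration only (1 = depression, 0 = no depression).
toy_labels = [0, 1, 0, 0, 1, 0, 1, 0]

# The most frequent class is what a dummy classifier would always predict.
majority_class, majority_count = Counter(toy_labels).most_common(1)[0]

# Baseline accuracy = share of observations that belong to that class.
baseline_accuracy = majority_count / len(toy_labels)

print("Most frequent class:", majority_class)
print("Baseline accuracy:", round(baseline_accuracy, 3))
```

Any model worth keeping should beat this number; on a nearly balanced dataset the baseline sits close to 0.5.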

Now let's look for a good model. Although the data look quite clean, I will still clean them a bit further to remove stop words, such as reflexive pronouns.

In [None]:
import re

In [None]:
def spelling_correction(text):
    """
    Replace misspelled words with the correct spelling.

    Input: str
    Output: str
    """
    corrector = Speller()
    spells = [corrector(word) for word in text.split()]
    return " ".join(spells)


def remove_stop_words(text):
    """
    Remove stopwords.

    Input: str
    Output: str
    """
    stopwords_set = set(stopwords.words('english'))
    return " ".join([word for word in text.split() if word not in stopwords_set])


def stemming(text):
    """
    Perform stemming of each word individually.

    Input: str
    Output: str
    """
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in text.split()])


def lemmatizing(text):
    """
    Perform lemmatization for each word individually.

    Input: str
    Output: str
    """
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])


def preprocessing(input_text):
  """
  This function represents a complete pipeline for text preprocessing.

  Input: str
  Output: str
  """
  output = input_text
  # Lower casing:
  output = output.lower()
  # Remove punctuations and other special characters:
  output = re.sub('[^ A-Za-z0-9]+', '', output)

  if config_dict["do_enhanced_preprocessing"]:
    # Spelling corrections:
    output = spelling_correction(output)

  # Remove stop words:
  output = remove_stop_words(output)

  if config_dict["do_enhanced_preprocessing"]:
    # Stemming:
    output = stemming(output)
    # Lemmatizing:
    output = lemmatizing(output)

  return output
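To illustrate what the basic (non-enhanced) path of the pipeline does to a message, here is a small self-contained sketch. It assumes a tiny hand-picked stop-word set (`TOY_STOPWORDS`) standing in for NLTK's English list, and the helper name `basic_clean` is hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop-word set, standing in for NLTK's English list.
TOY_STOPWORDS = {"i", "am", "so", "the", "of", "all", "myself", "and", "to"}


def basic_clean(text, stop_words=TOY_STOPWORDS):
    """Lower-case, drop non-alphanumeric characters, and remove stop words."""
    text = text.lower()
    # Keep only spaces, letters and digits (punctuation is deleted, not replaced by a space).
    text = re.sub(r"[^ a-z0-9]+", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)


print(basic_clean("I am SO tired of all this... I can't sleep!"))
```

Note that punctuation is stripped before the stop words are filtered, so a contraction such as "can't" becomes "cant" and would no longer match a stop-word entry that contains an apostrophe.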

In [None]:
dataset_clean = df_prep.copy()
if config_dict["do_preprocessing"]:
  dataset_clean['clean_text'] = [preprocessing(text) for text in dataset_clean['clean_text']]

In [None]:
dataset_clean[['clean_text', 'is_depression']].head(10).style.set_properties(**{'text-align': 'left'})

# EDA

In [None]:
dataset_clean["length of text"] = dataset_clean['clean_text'].map(len)

In [None]:
import matplotlib.pyplot as plt

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(12,4), sharey=False, tight_layout=True)

bins = 12
axs[0].hist(dataset_clean[dataset_clean['is_depression']==0][["length of text"]], bins=bins, alpha=0.5)
axs[0].set_title("Distribution of string length of class 0")
axs[0].set_ylim(0,1000)
axs[0].grid(True)

axs[1].hist(dataset_clean[dataset_clean['is_depression']==1][["length of text"]], bins=bins, alpha=0.5)
axs[1].set_title("Distribution of string length of class 1")
axs[1].set_ylim(0,4000)
axs[1].grid(True)
plt.show()

It seems that the messages associated with depression tend to be much longer than the others!

Next, I look for the words that indicate depression and non-depression.

# Feature Engineering

In [None]:
def feat_eng_text_df(in_df, text_col, labels_col, config_dict):
  if "CountVectorizer-binary" == config_dict["feature_eng_details"]:
    print("Feature Engineering method: Binary (one hot encoding)")
    countvectorizer = CountVectorizer(ngram_range=(config_dict["ngram_range_min"], config_dict["ngram_range_max"]),
                                      stop_words='english',
                                      max_features=config_dict["max_features"],
                                      binary=True)

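  elif "CountVectorizer-BOW" == config_dict["feature_eng_details"]:
    print("Feature Engineering method: Bag of words")
    countvectorizer = CountVectorizer(ngram_range=(config_dict["ngram_range_min"], config_dict["ngram_range_max"]),
                                      stop_words='english',
                                      max_features=config_dict["max_features"],
                                      binary=False)

  else:
    # Fail early with a clear message instead of a NameError further down:
    raise ValueError("Unsupported feature_eng_details: " + str(config_dict["feature_eng_details"]))
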
  out_arr = countvectorizer.fit_transform(in_df[text_col])
  count_tokens = countvectorizer.get_feature_names_out()
  out_df = pd.DataFrame(data = out_arr.toarray(),columns = count_tokens)
  out_df[labels_col] = list(in_df[labels_col])
  return out_df


if config_dict["do_feature_eng"]:
  dataset_feat_eng = feat_eng_text_df(dataset_clean, 'clean_text', 'is_depression', config_dict)
else:
  # This option isn't supported; the notebook would fail further down. It is a placeholder
  # for an ML pipeline that uses deep learning language models, which consume text rather than engineered features.
  dataset_feat_eng = dataset_clean.copy()
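To make the binary encoding concrete, here is a minimal pure-Python sketch of what a binary n-gram representation looks like. It is not how `CountVectorizer` is implemented; the toy corpus and the helper `ngrams` are hypothetical and only illustrate the idea of one 0/1 column per n-gram.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


# Hypothetical toy corpus, for illustration only.
toy_docs = ["feel sad feel tired", "happy day"]

# Vocabulary: the sorted set of every n-gram seen in the corpus.
vocab = sorted({gram for doc in toy_docs for gram in ngrams(doc.split())})

# Binary encoding: 1 if the n-gram occurs in the document, else 0
# (a bag-of-words encoding would store the number of occurrences instead).
binary_rows = [[int(gram in set(ngrams(doc.split()))) for gram in vocab]
               for doc in toy_docs]

print(vocab)
for row in binary_rows:
    print(row)
```

The word "feel" appears twice in the first toy document, yet its column holds 1; that is the difference between the binary and the bag-of-words settings.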

# Exploring the new numerical features

In [None]:
dataset_feat_eng.head()

# Split to Train/Test

In [None]:
dataset_feat_eng_test = dataset_feat_eng.sample(frac=config_dict["test_size"],random_state=config_dict['seed'])
dataset_feat_eng_train = dataset_feat_eng.drop(dataset_feat_eng_test.index)

# Preliminary statistical analysis and feasibility study

In [None]:
## Statistics of features per class:
means_by_class = dataset_feat_eng_train.groupby(by=['is_depression']).mean().T.sort_index()
means_by_class.head()

Calculate the ratio that reflects the statistical dependence between a class and a feature:  
P(class, feature) / (P(class) P(feature))  
Note that it can be rewritten as:  
P(class | feature) / P(class)  
or equivalently:  
P(feature | class) / P(feature)  
A ratio above 1 means that the feature co-occurs with the class more often than it would if the two were independent.

Note:  
The calculation below assumes that the numerical feature of each text term is binary; only then are these quantities probabilities.  
If another feature method is used, such as BoW or TF-IDF, the result is not a probability but a proxy for it.
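A small worked example may help show that the three forms of the ratio agree. The data below is a hypothetical toy set of (feature present, class) pairs, not taken from the Reddit dataset.

```python
# Hypothetical toy data: (feature_present, class_label) pairs.
toy = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0), (1, 1)]
n = len(toy)

p_feature = sum(f for f, _ in toy) / n                       # P(feature)
p_class = sum(c for _, c in toy) / n                         # P(class = 1)
p_joint = sum(1 for f, c in toy if f == 1 and c == 1) / n    # P(class = 1, feature)

# Three equivalent ways to write the dependence ratio:
ratio_joint = p_joint / (p_class * p_feature)
ratio_posterior = (p_joint / p_feature) / p_class     # P(class | feature) / P(class)
ratio_likelihood = (p_joint / p_class) / p_feature    # P(feature | class) / P(feature)

print(ratio_joint, ratio_posterior, ratio_likelihood)
```

Here the feature appears in 3 of the 4 class-1 messages but in only half of all messages, so the ratio is 1.5 in every form.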

In [None]:
# All probabilities are estimated on the training set only, to match means_by_class
# and to keep the held-out test set out of the feature selection:
P_class = sorted([[c, np.mean(dataset_feat_eng_train['is_depression'] == c)] for c in set(means_by_class.columns)])
P_feature = sorted([[f, np.mean(dataset_feat_eng_train[f] > 0)] for f in dataset_feat_eng_train.columns if f != 'is_depression'])
P_feature_inv = [[f, 1/p] for f, p in P_feature]

P_class_arr = np.array(P_class)
P_feature_arr = np.array(P_feature)
P_feature_inv_arr = np.array(P_feature_inv)
# Multiplying a "column vector" of inverse feature probabilities with a "row vector" of
# class probabilities to get a matrix where each element is P(class) / P(feature):
P_class_prod_P_feature_inv_arr = np.outer(P_feature_inv_arr[:, 1].astype(float), P_class_arr[:, 1].astype(float))

# P(feature | class) / P(feature) for every feature and class. Dividing by a Series
# aligns rows on the feature name, which avoids chained positional assignment:
P_feature_series = pd.Series(dict(P_feature))
P_class_given_feature = means_by_class.div(P_feature_series, axis=0)

In [None]:
P_class_given_feature.sort_values([0], ascending=False).head(10)

In [None]:
P_class_given_feature.sort_values([1], ascending=False).head(10)

The two tables reveal that some words are clearly indicative of depression-associated messages!

# Feature Selection

This is a univariate feature selection process.  
It is based on the conditional dependency between a feature being 0/1 and a class being 0/1; since the features are binary, the mean value of a feature is its probability.  
Note that the process of feature selection is done on the training set.

For each class, choose the most indicative features.  
Either maximize:

the a priori distribution P(feature | class), i.e. Maximum Likelihood  
or  
the a posteriori P(class | feature), i.e. MAP (the code ranks by P(class | feature) / P(class), which gives the same order within a class because P(class) is constant)
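The selection rule can be sketched as follows, assuming hypothetical per-class scores (`toy_scores`) in place of the real tables; the helper name `top_k_per_class` is also an assumption made for this illustration.

```python
# Hypothetical dependence scores per class: {class: {feature: score}}.
toy_scores = {
    0: {"happy": 1.8, "game": 1.5, "tired": 0.4, "alone": 0.2},
    1: {"happy": 0.2, "game": 0.5, "tired": 1.6, "alone": 1.9},
}


def top_k_per_class(scores, k):
    """Union of the k highest-scoring features of every class, sorted for reproducibility."""
    chosen = set()
    for per_feature in scores.values():
        ranked = sorted(per_feature, key=per_feature.get, reverse=True)
        chosen.update(ranked[:k])  # exactly k features per class
    return sorted(chosen)


print(top_k_per_class(toy_scores, k=1))
print(top_k_per_class(toy_scores, k=2))
```

Because the per-class picks are merged into one set, the final number of features is at most k times the number of classes, and smaller whenever a feature ranks high for more than one class.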

In [None]:
chosen_features = []
num_per_class = config_dict["num_chosen_features_per_class"]
if config_dict["maximize_a_priori"]:
  classes = means_by_class.columns
  for c in classes:
    # Slice with [:num_per_class] to pick exactly the configured number of features per class
    chosen_features += list(means_by_class[c].sort_values(ascending=False).index[:num_per_class])
else:
  classes = P_class_given_feature.columns
  for c in classes:
    chosen_features += list(P_class_given_feature[c].sort_values(ascending=False).index[:num_per_class])


# Remove duplicates; sorting keeps the column order reproducible between runs:
chosen_features = sorted(set(chosen_features))

In [None]:
len(chosen_features)

In [None]:
chosen_features

### Leave only chosen features:
Now that we have deduced which features are "important" based on the train set, we select them for both the train set and the test set.  

In [None]:
dataset_feat_eng_train_selected = dataset_feat_eng_train.filter(chosen_features + ['is_depression'])
dataset_feat_eng_test_selected = dataset_feat_eng_test.filter(chosen_features + ['is_depression'])

dataset_feat_eng_train_selected.head()

The data now has far fewer columns, which gives us a more compact dataset to train on.

In [None]:
dataset_feat_eng_train_selected['is_depression'].value_counts()

# Machine Learning

We start with classical ML models.

In [None]:
dataset_feat_eng_train_selected.head()

In [None]:
x_features_train = dataset_feat_eng_train_selected.values[:, 0:-1]
y_labels_train = dataset_feat_eng_train_selected.values[:, -1]

x_features_test = dataset_feat_eng_test_selected.values[:, :-1]
y_labels_test = dataset_feat_eng_test_selected.values[:, -1]

In [None]:
%%time
models = []
models.append(("Random Forest", RandomForestClassifier(random_state=config_dict['seed'])))
models.append(("LASSO", lm.LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=config_dict['seed'])))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=config_dict['seed'])))
models.append(("SVM", SVC(gamma='auto', random_state=config_dict['seed'])))

results = []
names = []
best_mean_result = 0
best_std_result = 0
for name, model in models:
  kfold = StratifiedKFold()
  cv_results = cross_val_score(model, X=x_features_train, y=y_labels_train, scoring='accuracy', cv=kfold)
  results.append(cv_results)
  names.append(name)
  print(name + ": mean(accuracy)=" + str(round(np.mean(cv_results), 3)) + ", std(accuracy)=" + str(round(np.std(cv_results), 3)))
  if (best_mean_result < np.mean(cv_results)) or \
    ((best_mean_result == np.mean(cv_results)) and (best_std_result > np.std(cv_results))):
    best_mean_result = np.mean(cv_results)
    best_std_result = np.std(cv_results)
    best_model_name = name
    best_model = model
print("\nBest model is:\n" + best_model_name)

In [None]:
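plt.boxplot(results)
# Set the tick labels separately, which works across Matplotlib versions:
plt.xticks(range(1, len(names) + 1), names)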
plt.title("Models' results' distribution of accuracy")
plt.show()

Logistic Regression with LASSO (L1) regularization is the best candidate.

In [None]:
model = lm.LogisticRegression(solver='liblinear', penalty='l1', max_iter=1000, random_state=config_dict['seed'])
params = {"C": np.linspace(start=0.001, stop=10, num=20)}
grid_search = GridSearchCV(model, params, scoring='accuracy')
grid_search.fit(x_features_train, y_labels_train)
print("The optimal hyperparameter 'C' is:", grid_search.best_params_["C"])


In [None]:
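# Refit with the same L1 penalty and solver that were used in the grid search:
model = lm.LogisticRegression(solver='liblinear', penalty='l1', C=grid_search.best_params_["C"], max_iter=1000, random_state=config_dict['seed'])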
model.fit(x_features_train, y_labels_train)

# Generate the ML train results: Use for Design Choices

In [None]:
y_train_estimated = model.predict(x_features_train)
accuracy_train = np.mean(y_train_estimated == y_labels_train)
baseline_accuracy_train = np.mean(0 == y_labels_train)
accuracy_lift_train = 100 * (accuracy_train/baseline_accuracy_train - 1)

print("Results on the train set for a traditional ML model:\n-------------------------")
print("Baseline (dummy classifier) accuracy:", round(baseline_accuracy_train, 2))
print("Current model's accuracy:", round(accuracy_train, 2))
print("The accuracy lift is:", round(accuracy_lift_train), "%")

# Generate the ML test results: Use for presenting performance

In [None]:
y_test_estimated = model.predict(x_features_test)
accuracy_test = np.mean(y_test_estimated == y_labels_test)
baseline_accuracy_test = np.mean(0 == y_labels_test)
accuracy_lift = 100 * (accuracy_test/baseline_accuracy_test - 1)

print("Results on the test set for a traditional ML model:\n-------------------------")
print("Baseline (dummy classifier) accuracy:", round(baseline_accuracy_test, 2))
print("Current model's accuracy:", round(accuracy_test, 2))
print("The accuracy lift is:", round(accuracy_lift), "%")


print("\nConfusion Matrix:")
print(confusion_matrix(y_labels_test, y_test_estimated))
print("\nClassification Report:")
print(classification_report(y_labels_test, y_test_estimated))

Results on the test set for a traditional ML model:
-------------------------
Baseline (dummy classifier) accuracy: 0.5
Current model's accuracy: 0.82
The accuracy lift is: 64 %

Confusion Matrix:
[[910  50]
 [288 664]]

Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.95      0.84       960
           1       0.93      0.70      0.80       952

    accuracy                           0.82      1912
   macro avg       0.84      0.82      0.82      1912
weighted avg       0.84      0.82      0.82      1912
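As a quick sanity check, the headline numbers can be recomputed by hand from the confusion matrix reported above. This is only an arithmetic sketch; the variable names are chosen for illustration.

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = [[910, 50],
      [288, 664]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Baseline: always predict class 0, as the dummy classifier in the notebook does.
baseline = sum(cm[0]) / total
lift_pct = 100 * (accuracy / baseline - 1)

# Precision (column-wise) and recall (row-wise) for class 1.
precision_1 = cm[1][1] / (cm[0][1] + cm[1][1])
recall_1 = cm[1][1] / sum(cm[1])

print("accuracy:", round(accuracy, 2), "baseline:", round(baseline, 2), "lift %:", round(lift_pct))
print("class 1 precision:", round(precision_1, 2), "recall:", round(recall_1, 2))
```

The low recall for class 1 comes from the 288 depression messages that were classified as class 0, which is the model's main weakness here.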

****
# Deep Learning  
Applying BERT, a Language Model to Text Classification

## Formatting our data
Adjusting the name of the label column:  
The design of the Transformers package requires the dataset's labels column to be named exactly `label`.  
In the part of this notebook above, where we did traditional ML work, we had to pick a column name that **isn't** a natural word. The reason is that when we performed feature engineering, each word/n-gram was allocated its own column named after it. If the word "label" had happened to appear in the text, a column called `label` could have been defined for it in the dataframe, which would then **conflict with the labels' column name**.  
We no longer have that risk, and we need to comply with Transformers' requirements:  

Load the tokenizer and the pre-trained Language Model:  

Note about fine-tuning with Hugging Face:  
As of 2025, Hugging Face's Trainer may default to logging metrics with Weights & Biases, which then demands an API key.  
To fine-tune without needing a W&B API key, you can disable this integration by setting the environment variable `WANDB_DISABLED=true` (the `report_to=[]` argument in `lm_training_args` above serves the same purpose).  

In [None]:
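import os
# A `!export ...` shell command runs in a separate subprocess and would not affect
# this Python process, so the variable is set directly instead:
os.environ["WANDB_DISABLED"] = "true"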

In [None]:
labels = list(dataset_clean['is_depression'].unique())

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
language_model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

>>
The previous cell outputs a warning starting with:  
`Some weights of the model checkpoint at bert-base-uncased were not used when initializing...`

>>
This is expected: the imported model had its pre-training head (i.e. the last neural layer) removed, and a new, "fresh" classification layer was initialized.  
That's what we want, as we seek to train that classification head to suit our dataset. Depending on our choice, we may also fine-tune other layers.    

In [None]:
print(f"The size of the model's token dictionary: {language_model.config.vocab_size}")

Split the dataset into three subsets:  
1. A held-out test set  
2. A train set that is split into two:  
  2.1 A subset used for training the neural network's parameters  
  2.2 A subset used to evaluate the progress of the training  
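The resulting proportions are easy to misjudge, so here is a small standard-library sketch of the same nested split. It assumes the fraction 0.25 from `config_dict["test_size"]` is applied twice; the function `three_way_split` is hypothetical, and `random.shuffle` stands in for pandas' `sample`, so the actual rows chosen would differ.

```python
import random


def three_way_split(n_rows, frac=0.25, seed=0):
    """Split row indices into test, train-eval and train-train index lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # First cut: hold out `frac` of all rows as the test set.
    n_test = round(n_rows * frac)
    test, train = indices[:n_test], indices[n_test:]

    # Second cut: hold out `frac` of the remaining rows for training-time evaluation.
    n_eval = round(len(train) * frac)
    train_eval, train_train = train[:n_eval], train[n_eval:]
    return test, train_eval, train_train


test_idx, eval_idx, train_idx = three_way_split(1600)
print(len(test_idx), len(eval_idx), len(train_idx))
```

With a fraction of 0.25 at both levels, about 25% of the rows end up in the test set, 18.75% in the evaluation subset, and 56.25% are actually used to update the network's parameters.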

In [None]:
# Create a training set and a test set
test_df = dataset_clean.sample(frac=config_dict["test_size"],random_state=config_dict['seed'])
train_df = dataset_clean.drop(test_df.index)

# Splitting the train set to "just train" and "training evaluation" set:
train_eval_df = train_df.sample(frac=config_dict["test_size"],random_state=config_dict['seed'])
train_train_df = train_df.drop(train_eval_df.index)

# Rename the label column to 'label' as required by the Transformers Trainer
train_train_df = train_train_df.rename(columns={'is_depression': 'label'})
train_eval_df = train_eval_df.rename(columns={'is_depression': 'label'})
test_df = test_df.rename(columns={'is_depression': 'label'})

# Convert the dataframes to the Dataset format required by the Transformers package
# (Dataset is provided by the Hugging Face `datasets` package):
from datasets import Dataset
dataset_train_train = Dataset.from_pandas(train_train_df)
dataset_train_eval = Dataset.from_pandas(train_eval_df)
dataset_test = Dataset.from_pandas(test_df)

In order for the LM to process the text, it must be tokenized:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples['clean_text'], padding="max_length", truncation=True)

train_train_tokenized = dataset_train_train.map(tokenize_function, batched=True)
train_eval_tokenized = dataset_train_eval.map(tokenize_function, batched=True)
test_tokenized = dataset_test.map(tokenize_function, batched=True)
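To illustrate what `padding="max_length"` and `truncation=True` do to each message, here is a toy sketch. It is not the BERT tokenizer: the real one splits text into WordPiece sub-words and adds special tokens, while this hypothetical `toy_encode` maps whole words to ids from a made-up vocabulary and uses a very short `max_length`.

```python
def toy_encode(text, vocab, max_length=6, pad_id=0, unk_id=1):
    """Map whitespace tokens to ids, then truncate or pad to max_length."""
    ids = [vocab.get(token, unk_id) for token in text.split()]
    ids = ids[:max_length]                           # truncation
    attention_mask = [1] * len(ids)                  # 1 = real token
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,       # padding to a fixed length
        "attention_mask": attention_mask + [0] * padding,  # 0 = padding
    }


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"feel": 2, "tired": 3, "today": 4}

print(toy_encode("feel tired today", toy_vocab))
print(toy_encode("feel so tired today and feel tired again", toy_vocab))
```

Every encoded message ends up with the same length, which lets the trainer stack them into batches; the attention mask tells the model which positions hold padding.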

# Training LM

We fine-tune our pre-trained Language Model via the Transformers `Trainer`.

### Choosing which neural network layers to fine-tune

In [None]:
if layers_to_fine_tune == "head":
  print("Fine-tuning only the classification head!")
  language_model.train()
  for name, param in language_model.named_parameters():
    # Freeze parameters of all layers except classifier head:
    if 'classifier' not in name:
        param.requires_grad = False
else:
  print("Fine-tuning the entire neural network!")

### Training hyperparameters

The settings for training our model were defined earlier, in `lm_training_args` (see the configuration cell near the top of the notebook).

### Evaluation metric

Defining the evaluation metric for the Language Model fine-tuning:

In [None]:
metric = evaluate.load("accuracy")

Setting the metric evaluation function for the trainer to utilize:

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The model returns a pair of logit values for each observation,
    # where each of the two logit values reflects the likelihood of one
    # class, so we convert them to a classification:
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
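The conversion from logits to an accuracy score can be sketched with plain NumPy. The logits and labels below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical logits for four observations and two classes.
toy_logits = np.array([[ 2.1, -1.3],
                       [-0.4,  0.9],
                       [ 0.2,  0.1],
                       [-1.5,  1.7]])
toy_labels = np.array([0, 1, 1, 1])

# The predicted class is the position of the largest logit in each row.
predictions = np.argmax(toy_logits, axis=-1)

# Accuracy is the share of predictions that match the reference labels.
accuracy = float(np.mean(predictions == toy_labels))

print(predictions, accuracy)
```

The third observation shows why no softmax is needed here: the logits 0.2 and 0.1 are close, but the argmax still picks class 0, which is the wrong class for that row.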

### Trainer object

In [None]:
trainer = Trainer(
    model=language_model,
    args=lm_training_args,
    train_dataset=train_train_tokenized,
    eval_dataset=train_eval_tokenized,
    compute_metrics=compute_metrics,
)

### Fine tuning

In [None]:
%%time
trainer.train()

Converting the training log to a dataframe for plotting:

In [None]:
training_logs_df = pd.DataFrame(trainer.state.log_history).groupby("step", as_index=False).first()

In [None]:
training_logs_df

In [None]:
training_logs_df.plot(x="epoch", y=["loss", "eval_loss"])
plt.title('Observing the performance as the training progresses')
plt.legend(['Train Loss', 'Validation Loss'], loc='upper right')
plt.show()

In [None]:
results_train_train = trainer.predict(train_train_tokenized)
predictions_train_train = np.argmax(results_train_train[0], axis=-1)

accuracy_dl_train = np.mean(predictions_train_train == train_train_df["label"])
baseline_accuracy_dl_train = np.mean(most_frequent_class == train_train_df["label"])
accuracy_dl_lift_train = 100 * (accuracy_dl_train/baseline_accuracy_dl_train - 1)

print("Results on the train set for a DL Language Model:\n----------------------------------------------------")
print("Baseline (dummy classifier) accuracy:", round(baseline_accuracy_dl_train, 2))
print("Current model's accuracy:", round(accuracy_dl_train, 2))
print("The accuracy lift is:", round(accuracy_dl_lift_train), "%")

What an improvement!

In [None]:
results_test = trainer.predict(test_tokenized)
predictions_test = np.argmax(results_test[0], axis=-1)

accuracy_dl_test = np.mean(predictions_test == test_df["label"])
baseline_accuracy_dl_test = np.mean(most_frequent_class == test_df["label"])
accuracy_dl_lift = 100 * (accuracy_dl_test/baseline_accuracy_dl_test - 1)

print("Results on the test set for a DL Language Model:\n---------------------------------------------------")
print("Baseline (dummy classifier) accuracy:", round(baseline_accuracy_dl_test, 2))
print("Current model's accuracy:", round(accuracy_dl_test, 2))
print("The accuracy lift is:", round(accuracy_dl_lift), "%")


print("\nConfusion Matrix:")
print(confusion_matrix(test_df["label"], predictions_test))
print("\nClassification Report:")
print(classification_report(test_df["label"], predictions_test))

Results on the test set for a DL Language Model:
---------------------------------------------------
Baseline (dummy classifier) accuracy: 0.5
Current model's accuracy: 0.97
The accuracy lift is: 93 %

Confusion Matrix:
[[945  15]
 [ 41 911]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.98      0.97       960
           1       0.98      0.96      0.97       952

    accuracy                           0.97      1912
   macro avg       0.97      0.97      0.97      1912
weighted avg       0.97      0.97      0.97      1912

The fine-tuned language model did a much better job than the classical ML models on the held-out test set (0.97 vs. 0.82 accuracy). It can now be used to predict whether a new message reflects symptoms of depression.

Reference: Mastering NLP from Foundations to LLMs, by Lior Gazit and Meysam Ghaffari