In [571]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

Load the datasets:<br>
train_dataset = original training set with no augmentation<br>
back_dataset = training set augmented with back-translation<br>
aug_dataset = training set augmented with EDA paraphrasing

In [5]:
train_dataset=pd.read_csv(r'..\DAIC\Preprocessed\train_dataset.csv')
back_dataset=pd.read_csv(r'..\DAIC\Preprocessed\back_dataset.csv')
aug_dataset=pd.read_csv(r'..\DAIC\Preprocessed\aug_dataset.csv')
test_dataset=pd.read_csv(r'..\DAIC\Preprocessed\test_dataset.csv')
val_dataset=pd.read_csv(r'..\DAIC\Preprocessed\dev_dataset.csv')

Check the number of data points and the class balance in each split.

In [11]:
# Print the sample count and the class distribution of each split.
print('The samples in training dataset is: ',len(train_dataset['response']),'and the distribution is ',train_dataset['PHQ8_Binary'].value_counts())
print('The samples in back dataset is: ',len(back_dataset['response']),'and the distribution is ',back_dataset['PHQ8_Binary'].value_counts())
print('The samples in aug dataset is: ',len(aug_dataset['response']),'and the distribution is ',aug_dataset['PHQ8_Binary'].value_counts())
print('The samples in validation dataset is: ',len(val_dataset['response']),'and the distribution is ',val_dataset['PHQ8_Binary'].value_counts())
print('The samples in test dataset is: ',len(test_dataset['response']),'and the distribution is ',test_dataset['PHQ8_Binary'].value_counts())

The samples in training dataset is:  107 and the distribution is  PHQ8_Binary
0    77
1    30
Name: count, dtype: int64
The samples in back dataset is:  136 and the distribution is  PHQ8_Binary
0    77
1    59
Name: count, dtype: int64
The samples in aug dataset is:  127 and the distribution is  PHQ8_Binary
0    77
1    50
Name: count, dtype: int64
The samples in validation dataset is:  35 and the distribution is  PHQ8_Binary
0    23
1    12
Name: count, dtype: int64
The samples in test dataset is:  47 and the distribution is  PHQ8_Binary
0    33
1    14
Name: count, dtype: int64
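
The classifiers later in this notebook pass `class_weight='balanced'`, which scikit-learn documents as `n_samples / (n_classes * count_per_class)`. As a rough illustration, the sketch below applies that formula to the training counts printed above (77 and 30). The helper name `balanced_weights` is an assumption made for this example.

```python
def balanced_weights(counts):
    """Return per-class weights n_samples / (n_classes * count) for a {label: count} dict."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Class counts of the original training set, taken from the output above.
train_counts = {0: 77, 1: 30}
weights = balanced_weights(train_counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

The minority class (label 1) receives the larger weight, so each of its samples contributes more to the loss.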


Let's start with the original training dataset. First we run the classification without under- or oversampling, using TF-IDF, Word2Vec and GloVe features. We then repeat it with resampling to balance the classes. Each model is evaluated on the validation set and then on the test set. The same steps are repeated for the back-translated and augmented datasets. Finally, since a separate test set is available, we will also try merging the validation set into the training data.
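
Since the same fit-and-report steps recur for every feature type and dataset, they could be wrapped in one helper. The sketch below is only one possible shape for such a helper: the name `fit_and_report` and the random toy features are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def fit_and_report(X_train, y_train, eval_sets, C=1.0):
    """Fit a balanced logistic regression and return {split name: report string}."""
    model = LogisticRegression(max_iter=1000, class_weight='balanced', C=C, random_state=42)
    model.fit(X_train, y_train)
    reports = {}
    for name, (X, y) in eval_sets.items():
        reports[name] = classification_report(
            y, model.predict(X),
            labels=[0, 1],
            target_names=['Controlled', 'Depression'],
            zero_division=0,
        )
    return reports

# Toy stand-in features (assumption): random numbers, not the DAIC data.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(40, 5)), np.array([0, 1] * 20)
X_va, y_va = rng.normal(size=(10, 5)), np.array([0, 1] * 5)
reports = fit_and_report(X_tr, y_tr, {'Validation': (X_va, y_va)})
print(reports['Validation'])
```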

<h1>TF-IDF</h1>
<h3>Train_set</h3>

In [420]:
tfidf_vectorizer=TfidfVectorizer(lowercase=True,stop_words='english',max_features=6100)

X_train_tfidf=tfidf_vectorizer.fit_transform(train_dataset['response'])
X_val_tfidf=tfidf_vectorizer.transform(val_dataset['response'])
X_test_tfidf=tfidf_vectorizer.transform(test_dataset['response'])

y_train=train_dataset['PHQ8_Binary']
y_val=val_dataset['PHQ8_Binary']
y_test=test_dataset['PHQ8_Binary']
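
The cell above fits the vectorizer on the training responses only and then reuses it for the validation and test responses. A minimal sketch of what that implies, using made-up sentences (an assumption, not DAIC transcripts): words that appear only in the validation text get no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences (assumption): placeholders, not DAIC transcripts.
train_texts = ['i feel tired today', 'i slept well and feel fine']
val_texts = ['i feel exhausted']

vectorizer = TfidfVectorizer(lowercase=True, stop_words='english')
X_train = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_val = vectorizer.transform(val_texts)          # reuses them; unseen words are dropped

vocab = vectorizer.vocabulary_
print(sorted(vocab))
print(X_train.shape, X_val.shape)
```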

In [439]:
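# C is the inverse regularization strength; a value this large leaves the model almost unregularized.
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000000.0,random_state=42)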
lr.fit(X_train_tfidf,y_train)
y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.67      0.87      0.75        23
  Depression       0.40      0.17      0.24        12

    accuracy                           0.63        35
   macro avg       0.53      0.52      0.50        35
weighted avg       0.58      0.63      0.58        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.74      0.94      0.83        33
  Depression       0.60      0.21      0.32        14

    accuracy                           0.72        47
   macro avg       0.67      0.58      0.57        47
weighted avg       0.70      0.72      0.67        47
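
To make the report easier to read, the sketch below recomputes the Depression row of the test report from confusion counts. The counts are inferred from the rounded figures above (an assumption): support 14 with recall 0.21 suggests 3 true positives and 11 false negatives, and precision 0.60 then suggests 2 false positives.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Inferred counts for the Depression class on the test set (assumption, see text).
p, r, f1 = precision_recall_f1(tp=3, fp=2, fn=11)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Under these counts, only 3 of the 14 depressed participants in the test set are detected, which the overall accuracy of 0.72 hides.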



In [205]:
# Random undersampling (no random_state is set, so the selected subset differs between runs)
rus=RandomUnderSampler()
X_train_tfidf_un,y_train_un=rus.fit_resample(X_train_tfidf,y_train)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_train_tfidf_un,y_train_un)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.83      0.65      0.73        23
  Depression       0.53      0.75      0.62        12

    accuracy                           0.69        35
   macro avg       0.68      0.70      0.68        35
weighted avg       0.73      0.69      0.69        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.87      0.79      0.83        33
  Depression       0.59      0.71      0.65        14

    accuracy                           0.77        47
   macro avg       0.73      0.75      0.74        47
weighted avg       0.78      0.77      0.77        47
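
The undersampling step above keeps every minority sample and draws an equally sized random subset of the majority class. The sketch below mimics that idea with NumPy on labels that have the same counts as the training set. It is an illustration only, not imblearn's implementation, and `random_undersample` is a hypothetical name.

```python
import numpy as np

def random_undersample(y, seed=0):
    """Return sorted row indices that keep every class at the minority-class count."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Draw n_keep distinct rows per class, without replacement.
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False) for c in classes]
    return np.sort(np.concatenate(kept))

# Labels with the same class counts as the original training set (77 vs 30).
y_toy = np.array([0] * 77 + [1] * 30)
idx = random_undersample(y_toy)
print(np.bincount(y_toy[idx]))
```

With 77 and 30 samples this leaves 60 training rows, so which 30 majority rows are drawn can noticeably change the fitted model.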



In [254]:
# SMOTE oversampling (synthetic minority samples; no random_state is set, so results vary between runs)
smote=SMOTE()
X_train_tfidf_smote,y_train_smote=smote.fit_resample(X_train_tfidf,y_train)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100.0,random_state=42)
lr.fit(X_train_tfidf_smote,y_train_smote)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      1.00      0.81        23
  Depression       1.00      0.08      0.15        12

    accuracy                           0.69        35
   macro avg       0.84      0.54      0.48        35
weighted avg       0.79      0.69      0.58        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.73      0.97      0.83        33
  Depression       0.67      0.14      0.24        14

    accuracy                           0.72        47
   macro avg       0.70      0.56      0.53        47
weighted avg       0.71      0.72      0.65        47
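
The cell above uses SMOTE, which does not duplicate minority rows but creates synthetic ones by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up points. The neighbour search is omitted and `smote_like_sample` is a hypothetical helper, not imblearn code.

```python
import numpy as np

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a minority-class neighbor."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbor - x)

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0])
neighbor = np.array([2.0, 3.0])
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```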



<h3>Back_set</h3>

In [261]:
tfidf_vectorizer=TfidfVectorizer(lowercase=True,stop_words='english',max_features=6100)

X_back_tfidf=tfidf_vectorizer.fit_transform(back_dataset['response'])
X_val_tfidf=tfidf_vectorizer.transform(val_dataset['response'])
X_test_tfidf=tfidf_vectorizer.transform(test_dataset['response'])

y_back=back_dataset['PHQ8_Binary']
y_val=val_dataset['PHQ8_Binary']
y_test=test_dataset['PHQ8_Binary']

In [291]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1.0,random_state=42)
lr.fit(X_back_tfidf,y_back)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.66      0.83      0.73        23
  Depression       0.33      0.17      0.22        12

    accuracy                           0.60        35
   macro avg       0.49      0.50      0.48        35
weighted avg       0.54      0.60      0.56        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.70      0.91      0.79        33
  Depression       0.25      0.07      0.11        14

    accuracy                           0.66        47
   macro avg       0.47      0.49      0.45        47
weighted avg       0.56      0.66      0.59        47



In [347]:
# Random undersampling (no random_state is set, so the selected subset differs between runs)
rus=RandomUnderSampler()
X_back_tfidf_un,y_back_un=rus.fit_resample(X_back_tfidf,y_back)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_back_tfidf_un,y_back_un)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.69      0.96      0.80        23
  Depression       0.67      0.17      0.27        12

    accuracy                           0.69        35
   macro avg       0.68      0.56      0.53        35
weighted avg       0.68      0.69      0.62        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.76      0.94      0.84        33
  Depression       0.67      0.29      0.40        14

    accuracy                           0.74        47
   macro avg       0.71      0.61      0.62        47
weighted avg       0.73      0.74      0.71        47



In [371]:
# SMOTE oversampling (synthetic minority samples; no random_state is set, so results vary between runs)
smote=SMOTE()
X_back_tfidf_smote,y_back_smote=smote.fit_resample(X_back_tfidf,y_back)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1.0,random_state=42)
lr.fit(X_back_tfidf_smote,y_back_smote)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.64      0.78      0.71        23
  Depression       0.29      0.17      0.21        12

    accuracy                           0.57        35
   macro avg       0.46      0.47      0.46        35
weighted avg       0.52      0.57      0.54        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.70      0.94      0.81        33
  Depression       0.33      0.07      0.12        14

    accuracy                           0.68        47
   macro avg       0.52      0.51      0.46        47
weighted avg       0.59      0.68      0.60        47



<h3>Aug_set</h3>

In [442]:
tfidf_vectorizer=TfidfVectorizer(lowercase=True,stop_words='english',max_features=6200)

X_aug_tfidf=tfidf_vectorizer.fit_transform(aug_dataset['response'])
X_val_tfidf=tfidf_vectorizer.transform(val_dataset['response'])
X_test_tfidf=tfidf_vectorizer.transform(test_dataset['response'])

y_aug=aug_dataset['PHQ8_Binary']
y_val=val_dataset['PHQ8_Binary']
y_test=test_dataset['PHQ8_Binary']

In [452]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
lr.fit(X_aug_tfidf,y_aug)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.91      0.78        23
  Depression       0.50      0.17      0.25        12

    accuracy                           0.66        35
   macro avg       0.59      0.54      0.51        35
weighted avg       0.62      0.66      0.60        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.74      0.88      0.81        33
  Depression       0.50      0.29      0.36        14

    accuracy                           0.70        47
   macro avg       0.62      0.58      0.58        47
weighted avg       0.67      0.70      0.67        47



In [528]:
# Random undersampling (no random_state is set, so the selected subset differs between runs)
rus=RandomUnderSampler()
X_aug_tfidf_un,y_aug_un=rus.fit_resample(X_aug_tfidf,y_aug)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_aug_tfidf_un,y_aug_un)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.74      0.71        23
  Depression       0.40      0.33      0.36        12

    accuracy                           0.60        35
   macro avg       0.54      0.54      0.54        35
weighted avg       0.58      0.60      0.59        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.80      0.85      0.82        33
  Depression       0.58      0.50      0.54        14

    accuracy                           0.74        47
   macro avg       0.69      0.67      0.68        47
weighted avg       0.74      0.74      0.74        47



In [535]:
# SMOTE oversampling (synthetic minority samples; no random_state is set, so results vary between runs)
smote=SMOTE()
X_aug_tfidf_smote,y_aug_smote=smote.fit_resample(X_aug_tfidf,y_aug)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
lr.fit(X_aug_tfidf_smote,y_aug_smote)

y_val_pred=lr.predict(X_val_tfidf)
y_test_pred=lr.predict(X_test_tfidf)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.91      0.78        23
  Depression       0.50      0.17      0.25        12

    accuracy                           0.66        35
   macro avg       0.59      0.54      0.51        35
weighted avg       0.62      0.66      0.60        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.72      0.88      0.79        33
  Depression       0.43      0.21      0.29        14

    accuracy                           0.68        47
   macro avg       0.58      0.55      0.54        47
weighted avg       0.64      0.68      0.64        47



<h1>Word2Vec</h1>
<h3>Train_set</h3>

In [536]:
import gensim

# Raw string so the backslashes in the Windows path are not read as escape sequences.
word2vec_path=r'..\GoogleNews-vectors-negative300.bin\GoogleNews-vectors-negative300.bin'
word2vec=gensim.models.KeyedVectors.load_word2vec_format(word2vec_path,binary=True)

In [746]:
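def get_average_word2vec(tokens_list,vector,k=300):
    # Average the embeddings of all tokens that exist in the Word2Vec vocabulary.
    valid_vectors=[vector[word] for word in tokens_list if word in vector]

    if not valid_vectors:
        # No token is in the vocabulary: show the tokens and fall back to a zero vector.
        print(tokens_list)
        return np.zeros(k)

    return np.mean(valid_vectors,axis=0)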
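
The averaging above can be tried without loading the full Word2Vec model. The sketch below uses a tiny made-up 3-dimensional lookup (an assumption; the real vectors have 300 dimensions) and a hypothetical helper `average_embedding` that follows the same logic.

```python
import numpy as np

def average_embedding(tokens, lookup, dim):
    """Mean of the vectors of in-vocabulary tokens, or a zero vector if none are known."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Tiny made-up lookup (assumption), standing in for the Word2Vec KeyedVectors.
toy_lookup = {
    'sad': np.array([1.0, 0.0, 0.0]),
    'tired': np.array([0.0, 1.0, 0.0]),
}
doc_vec = average_embedding(['sad', 'and', 'tired'], toy_lookup, dim=3)
oov_vec = average_embedding(['zzz'], toy_lookup, dim=3)
print(doc_vec, oov_vec)
```

Unknown tokens such as `'and'` here are skipped, so they do not pull the average toward zero.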

In [540]:
train_dataset['tokens']=train_dataset['response'].apply(lambda x:x.split())
val_dataset['tokens']=val_dataset['response'].apply(lambda x:x.split())
test_dataset['tokens']=test_dataset['response'].apply(lambda x:x.split())
X_train_word2vec=np.array([get_average_word2vec(tokens,word2vec) for tokens in train_dataset['tokens'] ])
X_val_word2vec=np.array([get_average_word2vec(tokens,word2vec) for tokens in val_dataset['tokens']])
X_test_word2vec=np.array([get_average_word2vec(tokens,word2vec) for tokens in test_dataset['tokens']])

In [565]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100.0,random_state=42)
lr.fit(X_train_word2vec,y_train)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.75      0.65      0.70        23
  Depression       0.47      0.58      0.52        12

    accuracy                           0.63        35
   macro avg       0.61      0.62      0.61        35
weighted avg       0.65      0.63      0.64        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.85      0.70      0.77        33
  Depression       0.50      0.71      0.59        14

    accuracy                           0.70        47
   macro avg       0.68      0.71      0.68        47
weighted avg       0.75      0.70      0.71        47



In [636]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_train_word2vec_un,y_train_word2vec_un=rus.fit_resample(X_train_word2vec,y_train)

In [722]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_train_word2vec_un,y_train_word2vec_un)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.87      0.57      0.68        23
  Depression       0.50      0.83      0.62        12

    accuracy                           0.66        35
   macro avg       0.68      0.70      0.65        35
weighted avg       0.74      0.66      0.66        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.84      0.48      0.62        33
  Depression       0.39      0.79      0.52        14

    accuracy                           0.57        47
   macro avg       0.62      0.64      0.57        47
weighted avg       0.71      0.57      0.59        47



In [729]:
# SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_train_word2vec_smote,y_train_word2vec_smote=smote.fit_resample(X_train_word2vec,y_train)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
lr.fit(X_train_word2vec_smote,y_train_word2vec_smote)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.69      0.78      0.73        23
  Depression       0.44      0.33      0.38        12

    accuracy                           0.63        35
   macro avg       0.57      0.56      0.56        35
weighted avg       0.61      0.63      0.61        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.79      0.70      0.74        33
  Depression       0.44      0.57      0.50        14

    accuracy                           0.66        47
   macro avg       0.62      0.63      0.62        47
weighted avg       0.69      0.66      0.67        47



<h3>Back_set</h3>

In [768]:
back_dataset['tokens']=back_dataset['response'].apply(lambda x:x.split())
X_back_word2vec=np.array([get_average_word2vec(tokens,word2vec) for tokens in back_dataset['tokens'] ])

In [789]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000.0,random_state=42)
lr.fit(X_back_word2vec,y_back)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.70      0.83      0.76        23
  Depression       0.50      0.33      0.40        12

    accuracy                           0.66        35
   macro avg       0.60      0.58      0.58        35
weighted avg       0.63      0.66      0.64        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.74      0.79      0.76        33
  Depression       0.42      0.36      0.38        14

    accuracy                           0.66        47
   macro avg       0.58      0.57      0.57        47
weighted avg       0.65      0.66      0.65        47



In [807]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_back_word2vec_un,y_back_word2vec_un=rus.fit_resample(X_back_word2vec,y_back)

In [813]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_back_word2vec_un,y_back_word2vec_un)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.83      0.75        23
  Depression       0.43      0.25      0.32        12

    accuracy                           0.63        35
   macro avg       0.55      0.54      0.53        35
weighted avg       0.59      0.63      0.60        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.69      0.73      0.71        33
  Depression       0.25      0.21      0.23        14

    accuracy                           0.57        47
   macro avg       0.47      0.47      0.47        47
weighted avg       0.56      0.57      0.56        47



In [814]:
# SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_back_word2vec_smote,y_back_word2vec_smote=smote.fit_resample(X_back_word2vec,y_back)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
lr.fit(X_back_word2vec_smote,y_back_word2vec_smote)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.70      0.83      0.76        23
  Depression       0.50      0.33      0.40        12

    accuracy                           0.66        35
   macro avg       0.60      0.58      0.58        35
weighted avg       0.63      0.66      0.64        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.74      0.79      0.76        33
  Depression       0.42      0.36      0.38        14

    accuracy                           0.66        47
   macro avg       0.58      0.57      0.57        47
weighted avg       0.65      0.66      0.65        47



<h3>Aug_set</h3>

In [815]:
aug_dataset['tokens']=aug_dataset['response'].apply(lambda x:x.split())
X_aug_word2vec=np.array([get_average_word2vec(tokens,word2vec) for tokens in aug_dataset['tokens'] ])

In [827]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000.0,random_state=42)
lr.fit(X_aug_word2vec,y_aug)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.70      0.83      0.76        23
  Depression       0.50      0.33      0.40        12

    accuracy                           0.66        35
   macro avg       0.60      0.58      0.58        35
weighted avg       0.63      0.66      0.64        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.81      0.76      0.78        33
  Depression       0.50      0.57      0.53        14

    accuracy                           0.70        47
   macro avg       0.65      0.66      0.66        47
weighted avg       0.72      0.70      0.71        47



In [828]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_aug_word2vec_un,y_aug_word2vec_un=rus.fit_resample(X_aug_word2vec,y_aug)

In [833]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_aug_word2vec_un,y_aug_word2vec_un)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.74      0.74      0.74        23
  Depression       0.50      0.50      0.50        12

    accuracy                           0.66        35
   macro avg       0.62      0.62      0.62        35
weighted avg       0.66      0.66      0.66        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.85      0.67      0.75        33
  Depression       0.48      0.71      0.57        14

    accuracy                           0.68        47
   macro avg       0.66      0.69      0.66        47
weighted avg       0.74      0.68      0.69        47



In [840]:
# SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_aug_word2vec_smote,y_aug_word2vec_smote=smote.fit_resample(X_aug_word2vec,y_aug)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000000.0,random_state=42)
lr.fit(X_aug_word2vec_smote,y_aug_word2vec_smote)

y_val_pred=lr.predict(X_val_word2vec)
y_test_pred=lr.predict(X_test_word2vec)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.65      0.74      0.69        23
  Depression       0.33      0.25      0.29        12

    accuracy                           0.57        35
   macro avg       0.49      0.49      0.49        35
weighted avg       0.54      0.57      0.55        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.78      0.76      0.77        33
  Depression       0.47      0.50      0.48        14

    accuracy                           0.68        47
   macro avg       0.62      0.63      0.63        47
weighted avg       0.69      0.68      0.68        47



<h1>GloVe</h1>
<h3>Train_set</h3>

In [841]:
embedding_index={}
glove_path='../glove.6B.100d.txt'
with open(glove_path,'r',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]
        coefs=np.asarray(values[1:],dtype='float32')
        embedding_index[word]=coefs
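
The loader above expects one word per line followed by its vector components, separated by spaces. The sketch below runs the same parsing on two made-up 4-dimensional entries held in memory (an assumption, standing in for `glove.6B.100d.txt`). `parse_glove` is a hypothetical helper name.

```python
import io
import numpy as np

def parse_glove(handle):
    """Parse 'word v1 v2 ...' lines into a {word: float32 vector} dict."""
    index = {}
    for line in handle:
        values = line.split()
        if not values:
            continue  # skip blank lines
        index[values[0]] = np.asarray(values[1:], dtype='float32')
    return index

# Two made-up entries (assumption); the real file has 100 components per word.
sample = io.StringIO('happy 0.1 0.2 0.3 0.4\nsad -0.1 -0.2 -0.3 -0.4\n')
toy_index = parse_glove(sample)
print(toy_index['happy'].shape, toy_index['sad'].dtype)
```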

In [842]:
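def get_average_glove(tokens_list,embedding_index,k=100):
    # Average the GloVe vectors of all tokens found in embedding_index.
    valid_vectors=[embedding_index[word] for word in tokens_list if word in embedding_index]

    if not valid_vectors:
        # No known token: fall back to a zero vector of the embedding size.
        return np.zeros(k)

    return np.mean(valid_vectors,axis=0)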

In [844]:
X_train_glove=np.array([get_average_glove(tokens,embedding_index) for tokens in train_dataset['tokens']])
X_back_glove=np.array([get_average_glove(tokens,embedding_index) for tokens in back_dataset['tokens']])
X_aug_glove=np.array([get_average_glove(tokens,embedding_index) for tokens in aug_dataset['tokens']])
X_val_glove=np.array([get_average_glove(tokens,embedding_index) for tokens in val_dataset['tokens']])
X_test_glove=np.array([get_average_glove(tokens,embedding_index) for tokens in test_dataset['tokens']])

In [904]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000.0,random_state=42)
lr.fit(X_train_glove,y_train)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.74      0.71        23
  Depression       0.40      0.33      0.36        12

    accuracy                           0.60        35
   macro avg       0.54      0.54      0.54        35
weighted avg       0.58      0.60      0.59        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.82      0.70      0.75        33
  Depression       0.47      0.64      0.55        14

    accuracy                           0.68        47
   macro avg       0.65      0.67      0.65        47
weighted avg       0.72      0.68      0.69        47



In [860]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_train_glove_un,y_train_glove_un=rus.fit_resample(X_train_glove,y_train)

In [870]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_train_glove_un,y_train_glove_un)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.57      0.62        23
  Depression       0.38      0.50      0.43        12

    accuracy                           0.54        35
   macro avg       0.53      0.53      0.52        35
weighted avg       0.58      0.54      0.55        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.80      0.61      0.69        33
  Depression       0.41      0.64      0.50        14

    accuracy                           0.62        47
   macro avg       0.60      0.62      0.59        47
weighted avg       0.68      0.62      0.63        47



In [908]:
#SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_train_glove_smote,y_train_glove_smote=smote.fit_resample(X_train_glove,y_train)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000000.0,random_state=42)
lr.fit(X_train_glove_smote,y_train_glove_smote)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.71      0.74      0.72        23
  Depression       0.45      0.42      0.43        12

    accuracy                           0.63        35
   macro avg       0.58      0.58      0.58        35
weighted avg       0.62      0.63      0.62        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.86      0.73      0.79        33
  Depression       0.53      0.71      0.61        14

    accuracy                           0.72        47
   macro avg       0.69      0.72      0.70        47
weighted avg       0.76      0.72      0.73        47
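
The macro and weighted averages in these reports can be reproduced by hand, which helps when comparing configurations. The sketch below recomputes the test-set rows of the report above from per-class counts. The counts (24 of 33 Controlled and 10 of 14 Depression predicted correctly) are inferred from the rounded recall and support columns, so they are a reconstruction rather than logged output.

```python
# Per-class counts inferred from the test report above (assumption, see lead-in)
support = {'Controlled': 33, 'Depression': 14}
correct = {'Controlled': 24, 'Depression': 10}


def prf(label, other):
    """Precision, recall and F1 for `label` in a two-class problem."""
    tp = correct[label]
    fn = support[label] - tp          # samples of `label` predicted as `other`
    fp = support[other] - correct[other]  # samples of `other` predicted as `label`
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


scores = {
    'Controlled': prf('Controlled', 'Depression'),
    'Depression': prf('Depression', 'Controlled'),
}
total = sum(support.values())

# Macro average: plain mean over classes; weighted average: mean weighted by support
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
weighted_f1 = sum(support[k] * scores[k][2] for k in scores) / total
accuracy = sum(correct.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2), round(accuracy, 2))  # 0.7 0.73 0.72
```

The macro average treats both classes equally, so it is the more informative summary here, where the Depression class has only 14 test samples.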



<h3>Back_set</h3>

In [918]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1000.0,random_state=42)
lr.fit(X_back_glove,y_back)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.67      0.78      0.72        23
  Depression       0.38      0.25      0.30        12

    accuracy                           0.60        35
   macro avg       0.52      0.52      0.51        35
weighted avg       0.57      0.60      0.58        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.83      0.88      0.85        33
  Depression       0.67      0.57      0.62        14

    accuracy                           0.79        47
   macro avg       0.75      0.73      0.73        47
weighted avg       0.78      0.79      0.78        47



In [919]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_back_glove_un,y_back_glove_un=rus.fit_resample(X_back_glove,y_back)

In [945]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_back_glove_un,y_back_glove_un)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.64      0.70      0.67        23
  Depression       0.30      0.25      0.27        12

    accuracy                           0.54        35
   macro avg       0.47      0.47      0.47        35
weighted avg       0.52      0.54      0.53        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.80      0.85      0.82        33
  Depression       0.58      0.50      0.54        14

    accuracy                           0.74        47
   macro avg       0.69      0.67      0.68        47
weighted avg       0.74      0.74      0.74        47



In [953]:
#SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_back_glove_smote,y_back_glove_smote=smote.fit_resample(X_back_glove,y_back)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
lr.fit(X_back_glove_smote,y_back_glove_smote)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.68      0.74      0.71        23
  Depression       0.40      0.33      0.36        12

    accuracy                           0.60        35
   macro avg       0.54      0.54      0.54        35
weighted avg       0.58      0.60      0.59        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.81      0.79      0.80        33
  Depression       0.53      0.57      0.55        14

    accuracy                           0.72        47
   macro avg       0.67      0.68      0.68        47
weighted avg       0.73      0.72      0.73        47



<h3>Aug_set</h3>

In [976]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=1000000000.0,random_state=42)
lr.fit(X_aug_glove,y_aug)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.69      0.78      0.73        23
  Depression       0.44      0.33      0.38        12

    accuracy                           0.63        35
   macro avg       0.57      0.56      0.56        35
weighted avg       0.61      0.63      0.61        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.81      0.79      0.80        33
  Depression       0.53      0.57      0.55        14

    accuracy                           0.72        47
   macro avg       0.67      0.68      0.68        47
weighted avg       0.73      0.72      0.73        47



In [977]:
#Random Undersampling
rus=RandomUnderSampler(random_state=42)
X_aug_glove_un,y_aug_glove_un=rus.fit_resample(X_aug_glove,y_aug)

In [978]:
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=10000000.0,random_state=42)
# lr=LogisticRegression(max_iter=1000)
lr.fit(X_aug_glove_un,y_aug_glove_un)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.72      0.78      0.75        23
  Depression       0.50      0.42      0.45        12

    accuracy                           0.66        35
   macro avg       0.61      0.60      0.60        35
weighted avg       0.64      0.66      0.65        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.89      0.73      0.80        33
  Depression       0.55      0.79      0.65        14

    accuracy                           0.74        47
   macro avg       0.72      0.76      0.72        47
weighted avg       0.79      0.74      0.75        47



In [982]:
#SMOTE oversampling (synthetic minority samples, not random duplication)
smote=SMOTE(random_state=42)
X_aug_glove_smote,y_aug_glove_smote=smote.fit_resample(X_aug_glove,y_aug)
lr=LogisticRegression(max_iter=1000,class_weight='balanced',C=100000.0,random_state=42)
lr.fit(X_aug_glove_smote,y_aug_glove_smote)

y_val_pred=lr.predict(X_val_glove)
y_test_pred=lr.predict(X_test_glove)

print('Validation Set Performance:')
print(classification_report(y_val,y_val_pred,target_names=['Controlled','Depression'],zero_division=0.0))

print('Test Set Performance:')
print(classification_report(y_test,y_test_pred,target_names=['Controlled','Depression'],zero_division=0.0))

Validation Set Performance:
              precision    recall  f1-score   support

  Controlled       0.69      0.78      0.73        23
  Depression       0.44      0.33      0.38        12

    accuracy                           0.63        35
   macro avg       0.57      0.56      0.56        35
weighted avg       0.61      0.63      0.61        35

Test Set Performance:
              precision    recall  f1-score   support

  Controlled       0.81      0.76      0.78        33
  Depression       0.50      0.57      0.53        14

    accuracy                           0.70        47
   macro avg       0.65      0.66      0.66        47
weighted avg       0.72      0.70      0.71        47



<h1>Conclusion</h1>

word2vec and GloVe perform well on the test set without any augmentation or resampling, but their validation results are noticeably weaker. One possible explanation is that the test set is closer in distribution to the training set than the validation set is; with only 35 validation and 47 test samples, a handful of predictions can also shift these scores by several points. <br>
Overall, train_dataset (non-augmented data) with <strong>TF-IDF features</strong> and <strong>undersampling</strong> performed best: it achieved both the highest test scores and the highest validation scores.
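
The suggestion that the test set may be closer to the training set than the validation set is can be checked roughly. One crude first look, sketched below, compares the mean feature vector of each split using cosine distance. This is an assumption-laden illustration, not a formal test of distribution shift: `centroid_cosine_distance` is a hypothetical helper and the arrays are synthetic stand-ins with the same row counts as the splits (107, 35, 47).

```python
import numpy as np


def centroid_cosine_distance(A, B):
    """Cosine distance between the mean row of A and the mean row of B."""
    a, b = A.mean(axis=0), B.mean(axis=0)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos


# Synthetic stand-ins for the averaged embedding matrices of each split
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=1.0, size=(107, 50))
X_val_demo = rng.normal(loc=1.0, size=(35, 50))
X_test_demo = rng.normal(loc=1.0, size=(47, 50))

d_val = centroid_cosine_distance(X_train_demo, X_val_demo)
d_test = centroid_cosine_distance(X_train_demo, X_test_demo)
print('train vs validation:', round(d_val, 4))
print('train vs test:', round(d_test, 4))
```

In the notebook, the same function could be applied to `X_train_glove`, `X_val_glove` and `X_test_glove`. A clearly smaller train-to-test distance would be consistent with the explanation above, although with splits this small it would not confirm it.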