# Social Media Posts as a Readout for Mental Health Wellness

## Dataset
Link: https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch

Citation
The dataset was originally published in "Suicide Ideation Detection in Social Media Forums" (https://ieeexplore.ieee.org/document/9591887).

## Column Description

The current version of the dataset has only two labels: suicide and non-suicide.
Version V13 has three labels: suicide, depression, and teenagers (normal conversations).

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [4]:
df = pd.read_csv('../data/suicide_detection_kaggle.csv')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232074 entries, 0 to 232073
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  232074 non-null  int64 
 1   text        232074 non-null  object
 2   class       232074 non-null  object
dtypes: int64(1), object(2)
memory usage: 5.3+ MB


In [6]:
df.head()

,Unnamed: 0,text,class
0,2,Ex Wife Threatening SuicideRecently I left my ...,suicide
1,3,Am I weird I don't get affected by compliments...,non-suicide
2,4,Finally 2020 is almost over... So I can never ...,non-suicide
3,8,i need helpjust help me im crying so hard,suicide
4,9,"I’m so lostHello, my name is Adam (16) and I’v...",suicide


In [7]:
df['class'].value_counts()

class
suicide        116037
non-suicide    116037
Name: count, dtype: int64

In [8]:
df1 = df.copy()
# Encode the labels as integers; map() avoids the downcasting FutureWarning raised by replace(..., inplace=True)
df1['class'] = df1['class'].map({'suicide': 1, 'non-suicide': 0})


In [9]:
df2 = df1.drop(columns=['Unnamed: 0'])

In [10]:
df2.head()

,text,class
0,Ex Wife Threatening SuicideRecently I left my ...,1
1,Am I weird I don't get affected by compliments...,0
2,Finally 2020 is almost over... So I can never ...,0
3,i need helpjust help me im crying so hard,1
4,"I’m so lostHello, my name is Adam (16) and I’v...",1


In [11]:
X = df2['text']
y = df2['class']

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
X_train.shape, X_test.shape

((174055,), (58019,))
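
The split above relies on the default `test_size` of 0.25 and on `stratify=y` to keep both classes equally represented in each part. A minimal sketch on toy data (the arrays `X_toy` and `y_toy` are hypothetical stand-ins, not the real dataset) illustrates that behaviour:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels standing in for the real `class` column.
y_toy = np.array([0] * 100 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# test_size is left at its default (0.25), as in the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy
)

# With stratification, both splits keep the 50/50 class balance.
print("train size:", len(y_tr), "positives:", int(y_tr.sum()))
print("test size:", len(y_te), "positives:", int(y_te.sum()))
```

On the real data the same mechanism yields the near-identical class supports seen in the classification reports further down.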

In [13]:
# Create the CountVectorizer with default settings
vect = CountVectorizer()

# Fit on the training data only, then transform both splits to document-term matrices
X_train_vectorized = vect.fit_transform(X_train)
X_test_vectorized = vect.transform(X_test)
# print("X_train_vectorized: ", X_train_vectorized)

In [14]:
print("X_train shape = {}".format(X_train.shape))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))

X_train shape = (174055,)
Vocabulary length = 142535


In [15]:
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[:20]

[('00', 0),
 ('000', 1),
 ('0000', 2),
 ('0000000', 3),
 ('00000000', 4),
 ('000000000000', 5),
 ('0000000000000000000000000000000', 6),
 ('000000000000000000000000000000000', 7),
 ('0000000000000000000000001', 8),
 ('0000000000000000001', 9),
 ('000000000000000001', 10),
 ('000000000000001', 11),
 ('000000000001', 12),
 ('000000001', 13),
 ('0000001', 14),
 ('000001', 15),
 ('00000101', 16),
 ('00000111', 17),
 ('00000459183', 18),
 ('00001', 19)]
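
The first vocabulary entries are purely numeric tokens, which are unlikely to carry much signal. One possible way to exclude them, sketched here on a hypothetical toy corpus, is to pass a custom `token_pattern` that only matches alphabetic tokens. This is an assumption about what might help, not something the notebook evaluates, and the ASCII-only pattern below would also drop words with accented or non-Latin letters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents containing numeric "words".
docs = [
    "I scored 000001 points in 2020",
    "call 00 or 000 for help",
    "help me please",
]

# Default tokenizer: any run of two or more word characters, digits included.
default_vect = CountVectorizer().fit(docs)

# Alternative: keep only tokens made of two or more ASCII letters.
alpha_vect = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b").fit(docs)

default_vocab = sorted(default_vect.vocabulary_)
alpha_vocab = sorted(alpha_vect.vocabulary_)

print("default:", default_vocab)
print("letters only:", alpha_vocab)
```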

In [None]:
# Train the model
lr = LogisticRegression(max_iter=1500)
lr.fit(X_train_vectorized, y_train)

# Predict on the already-transformed test documents
predictions = lr.predict(X_test_vectorized)
predict_probab = lr.predict_proba(X_test_vectorized)[:, 1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab)))

AUC = 0.978
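
As a reminder of what the AUC measures: it is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below checks this on four hypothetical scores; the numbers are illustrative and unrelated to the model above.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Manual pairwise computation: count pairs where the positive outranks the negative
# (ties count as half).
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairs = list(product(pos, neg))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print("roc_auc_score:", auc)
print("pairwise ranking:", pairwise)
```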


In [20]:
y_pred_train_lr = lr.predict(X_train_vectorized)
y_pred_test_lr = lr.predict(X_test_vectorized)

In [21]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, precision_recall_curve, f1_score, classification_report, confusion_matrix

In [22]:
# Evaluate the logistic regression model
print('Performance on Train Dataset')
print(classification_report(y_train, y_pred_train_lr))
print('______________________________________________')
print('Performance on Test Dataset')
print(classification_report(y_test, y_pred_test_lr))

Performance on Train Dataset
              precision    recall  f1-score   support

           0       0.96      0.98      0.97     87027
           1       0.98      0.96      0.97     87028

    accuracy                           0.97    174055
   macro avg       0.97      0.97      0.97    174055
weighted avg       0.97      0.97      0.97    174055

______________________________________________
Performance on Test Dataset
              precision    recall  f1-score   support

           0       0.92      0.96      0.94     29010
           1       0.95      0.91      0.93     29009

    accuracy                           0.94     58019
   macro avg       0.94      0.94      0.94     58019
weighted avg       0.94      0.94      0.94     58019



In [24]:
from sklearn.model_selection import GridSearchCV, cross_val_score

In [52]:
lr = LogisticRegression(random_state=42, max_iter=1500, class_weight="balanced")
param_lr = {'C': [0.1, 0.2, 0.5],
            'solver': ['liblinear', 'lbfgs', 'sag']
            }

grid_lr = GridSearchCV(lr, param_grid=param_lr, cv=3, scoring='roc_auc', 
                           verbose=5, n_jobs=-1)
grid_lr.fit(X_train_vectorized, y_train)


print("Best parameters for Logistic Regression:", grid_lr.best_params_)
print('Best score:\n{:.2f}'.format(grid_lr.best_score_))

Fitting 3 folds for each of 9 candidates, totalling 27 fits
Best parameters for Logistic Regression: {'C': 0.2, 'solver': 'lbfgs'}
Best score:
0.98
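
Note that the grid search above runs on a matrix that was vectorized once on the full training set, so each cross-validation fold sees a vocabulary that was built partly from its own validation documents. A possible alternative, sketched here on a hypothetical toy corpus, is to wrap the vectorizer and the classifier in a `Pipeline` so that the vocabulary is refitted inside every fold. The step names `vect` and `clf` and the toy texts are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 documents per class.
texts = [
    "feeling hopeless and alone tonight",
    "cannot cope anymore feeling hopeless",
    "so tired of everything feeling alone",
    "nothing helps and everything hurts",
    "alone again and cannot sleep",
    "hopeless tired and hurting",
    "great movie night with friends",
    "new game release looks great",
    "homework is boring but friends help",
    "pizza and a movie this weekend",
    "my friends love this game",
    "weekend plans pizza and homework",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is part of the estimator, so it is refitted on each training fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1500, class_weight="balanced", random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.1, 0.2, 0.5]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="roc_auc")
grid.fit(texts, labels)

print("Best parameters:", grid.best_params_)
```

With this layout, `grid.best_estimator_` contains the fitted vectorizer as well as the classifier, so it can be applied directly to raw text.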


In [54]:
# Keep the best logistic regression found by the grid search
# (the CountVectorizer is fitted separately and is not part of this estimator)
best_model_lr = grid_lr.best_estimator_
# Apply the best model to the training and test data
y_pred_train_lrg = best_model_lr.predict(X_train_vectorized)
y_pred_test_lrg = best_model_lr.predict(X_test_vectorized)
# Evaluate the best logistic regression model
print('Performance on Train Dataset')
print(classification_report(y_train, y_pred_train_lrg))
print('______________________________________________')
print('Performance on Test Dataset')
print(classification_report(y_test, y_pred_test_lrg))

Performance on Train Dataset
              precision    recall  f1-score   support

           0       0.93      0.97      0.95     87027
           1       0.97      0.93      0.95     87028

    accuracy                           0.95    174055
   macro avg       0.95      0.95      0.95    174055
weighted avg       0.95      0.95      0.95    174055

______________________________________________
Performance on Test Dataset
              precision    recall  f1-score   support

           0       0.92      0.96      0.94     29010
           1       0.96      0.91      0.93     29009

    accuracy                           0.93     58019
   macro avg       0.94      0.93      0.93     58019
weighted avg       0.94      0.93      0.93     58019



In [35]:
param_lr = best_model_lr.get_params()
param_lr

{'C': 0.2,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1500,
 'multi_class': 'deprecated',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [55]:
cm = confusion_matrix(y_test, y_pred_test_lrg)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[27803  1207]
 [ 2571 26438]]
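
The per-class figures in the test report can be recovered from this confusion matrix. scikit-learn lays it out as `[[TN, FP], [FN, TP]]` for labels 0 and 1; the short sketch below re-derives the class 1 metrics from the counts printed above.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[27803, 1207],
               [2571, 26438]])

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of predicted positives that are correct
recall = tp / (tp + fn)          # share of true positives that are found
specificity = tn / (tn + fp)     # recall of class 0
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f}")
```

The 2571 false negatives are posts labelled suicide that the model classified as non-suicide, which is the error type that matters most for this use case.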


In [31]:
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 15
# This means a word must appear in at least 15 posts to be kept in the vocabulary
vect_tf = TfidfVectorizer(min_df=15)
vect_tf.fit(X_train)


# transform the documents in the training data to a document-term matrix
X_train_vect_tf = vect_tf.transform(X_train)
X_test_vect_tf = vect_tf.transform(X_test)

# Look at some of the words gathered with this method
sorted(vect_tf.vocabulary_.items(), key=lambda x: x[1])[10:30]

[('01101000', 10),
 ('01101111', 11),
 ('01110000', 12),
 ('01110011', 13),
 ('01110100', 14),
 ('01110101', 15),
 ('01111001', 16),
 ('02', 17),
 ('03', 18),
 ('04', 19),
 ('05', 20),
 ('06', 21),
 ('07', 22),
 ('08', 23),
 ('09', 24),
 ('10', 25),
 ('100', 26),
 ('1000', 27),
 ('10000', 28),
 ('100k', 29)]
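
To make the effect of `min_df` concrete, here is a minimal sketch on a hypothetical four-document corpus, using `min_df=2` instead of 15 so that the pruning is visible at this scale:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; "help" appears in 3 documents, "please" in 2, everything else in 1.
docs = [
    "help me please",
    "please help",
    "help 01101000",
    "random words here",
]

full = TfidfVectorizer().fit(docs)
pruned = TfidfVectorizer(min_df=2).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)

print("all terms:", full_vocab)
print("min_df=2:", pruned_vocab)
```

Terms that occur in too few documents, including one-off tokens such as binary strings, are dropped from the vocabulary. Tokens that recur often enough survive, which is why some numeric tokens still appear in the output above.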

In [32]:
lr = LogisticRegression(random_state=42, max_iter=1500, class_weight="balanced")
lr.fit(X_train_vect_tf, y_train)

# Predicted probabilities of the positive class for the transformed test documents
predict_probab_tf = lr.predict_proba(X_test_vect_tf)[:, 1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab_tf)))

AUC = 0.983


In [34]:
y_pred_train_tf = lr.predict(X_train_vect_tf)
y_pred_test_tf = lr.predict(X_test_vect_tf)

print('Performance on Train Dataset')
print(classification_report(y_train, y_pred_train_tf))
print('______________________________________________')
print('Performance on Test Dataset')
print(classification_report(y_test, y_pred_test_tf))

Performance on Train Dataset
              precision    recall  f1-score   support

           0       0.94      0.95      0.95     87027
           1       0.95      0.94      0.94     87028

    accuracy                           0.94    174055
   macro avg       0.94      0.94      0.94    174055
weighted avg       0.94      0.94      0.94    174055

______________________________________________
Performance on Test Dataset
              precision    recall  f1-score   support

           0       0.93      0.95      0.94     29010
           1       0.95      0.93      0.94     29009

    accuracy                           0.94     58019
   macro avg       0.94      0.94      0.94     58019
weighted avg       0.94      0.94      0.94     58019



In [36]:
param_tf = lr.get_params()
param_tf

{'C': 1.0,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 1500,
 'multi_class': 'deprecated',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [47]:
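lr = LogisticRegression(random_state=42, max_iter=1500, class_weight="balanced")
# Note: the 'lbfgs' solver does not support the 'l1' penalty, so the
# 3 C values x 3 folds for that combination fail (see the warning in the output)
param_lr_tf = {'C': [2, 5, 7],
               'penalty': ['l1', 'l2'],
               'solver': ['liblinear', 'lbfgs']
               }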

grid_lr_tf = GridSearchCV(lr, param_grid=param_lr_tf, cv=3, scoring='roc_auc', 
                           verbose=5, n_jobs=-1)
grid_lr_tf.fit(X_train_vect_tf, y_train)


print("Best parameters for Logistic Regression:", grid_lr_tf.best_params_)
print('Best score:\n{:.2f}'.format(grid_lr_tf.best_score_))

Fitting 3 folds for each of 12 candidates, totalling 36 fits


9 fits failed out of a total of 36.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
9 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\lekal\DS_repos\Capstone_Project\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 859, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\lekal\DS_repos\Capstone_Project\.venv\Lib\site-packages\sklearn\base.py", line 1363, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\lekal\DS_repos\Capstone_Project\.venv\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1210, in fit
    solver = _check_solver(self.solver, self.penalty, se

Best parameters for Logistic Regression: {'C': 5, 'penalty': 'l2', 'solver': 'liblinear'}
Best score:
0.98
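
The 9 failed fits correspond to the `penalty='l1'`, `solver='lbfgs'` combination (3 values of `C` times 3 folds), which `lbfgs` does not support. One way to avoid them, sketched below, is to pass a list of dictionaries so that only valid combinations are generated. The name `param_grid_valid` is hypothetical, and `ParameterGrid` is used here only to enumerate the candidates without fitting anything.

```python
from sklearn.model_selection import ParameterGrid

# Each dictionary is expanded separately, so 'l1' is only ever paired with 'liblinear'.
param_grid_valid = [
    {"C": [2, 5, 7], "penalty": ["l1", "l2"], "solver": ["liblinear"]},
    {"C": [2, 5, 7], "penalty": ["l2"], "solver": ["lbfgs"]},
]

candidates = list(ParameterGrid(param_grid_valid))
invalid = [c for c in candidates if c["penalty"] == "l1" and c["solver"] == "lbfgs"]

print("candidates:", len(candidates))
print("invalid combinations:", len(invalid))
```

The same list could be passed as `param_grid` to `GridSearchCV` in place of `param_lr_tf`, giving 9 candidates instead of 12 and no failed fits.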


In [50]:
best_model_lr_tf = grid_lr_tf.best_estimator_
best_model_lr_tf

parameter          value
penalty            'l2'
dual               False
tol                0.0001
C                  5
fit_intercept      True
intercept_scaling  1
class_weight       'balanced'
random_state       42
solver             'liblinear'
max_iter           1500


In [51]:
y_pred_train_tfg = best_model_lr_tf.predict(X_train_vect_tf)
y_pred_test_tfg = best_model_lr_tf.predict(X_test_vect_tf)

print('Performance on Train Dataset')
print(classification_report(y_train, y_pred_train_tfg))
print('______________________________________________')
print('Performance on Test Dataset')
print(classification_report(y_test, y_pred_test_tfg))

Performance on Train Dataset
              precision    recall  f1-score   support

           0       0.95      0.96      0.96     87027
           1       0.96      0.95      0.96     87028

    accuracy                           0.96    174055
   macro avg       0.96      0.96      0.96    174055
weighted avg       0.96      0.96      0.96    174055

______________________________________________
Performance on Test Dataset
              precision    recall  f1-score   support

           0       0.94      0.95      0.94     29010
           1       0.95      0.94      0.94     29009

    accuracy                           0.94     58019
   macro avg       0.94      0.94      0.94     58019
weighted avg       0.94      0.94      0.94     58019



In [56]:
cm = confusion_matrix(y_test, y_pred_test_tfg)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[27538  1472]
 [ 1874 27135]]
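
Placing the two tuned models side by side makes the trade-off easier to read. The sketch below uses only the two confusion matrices printed in this notebook; the helper `summarize` is a hypothetical name introduced for illustration.

```python
import numpy as np

def summarize(cm):
    """Return error counts and class 1 precision/recall from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "fn": int(fn),
        "fp": int(fp),
        "recall": float(tp / (tp + fn)),
        "precision": float(tp / (tp + fp)),
    }

# Test-set confusion matrices copied from the outputs above.
count_model = summarize([[27803, 1207], [2571, 26438]])   # CountVectorizer + tuned LR
tfidf_model = summarize([[27538, 1472], [1874, 27135]])   # TfidfVectorizer + tuned LR

fn_diff = count_model["fn"] - tfidf_model["fn"]   # fewer missed suicide posts with TF-IDF
fp_diff = tfidf_model["fp"] - count_model["fp"]   # more false alarms with TF-IDF

print(f"CountVectorizer: recall={count_model['recall']:.3f} precision={count_model['precision']:.3f}")
print(f"TF-IDF:          recall={tfidf_model['recall']:.3f} precision={tfidf_model['precision']:.3f}")
print(f"false negatives avoided: {fn_diff}, extra false positives: {fp_diff}")
```

On this test split, the TF-IDF model misses fewer suicide-labelled posts at the cost of a smaller increase in false positives.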
