## Analyzing the Manually Coded Data


Sentiment Analysis: Analyzing the sentiment of posts helps us understand the emotional tone expressed in the text. This could be particularly useful for identifying posts that describe treatments and treatment outcomes associated with mental health issues.

[Link to Mental Disorders Identification Reddit NLP dataset](https://www.kaggle.com/datasets/kamaruladha/mental-disorders-identification-reddit-nlp)

# CODE

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

In [3]:
df1=pd.read_csv('data/sample_posts_manual_coding_2.csv')

# Looking at the manually coded data
As of 5/18 at 6:13 AM I have manually coded the first 600 entries of this CSV file, which contains all posts from the BPD subreddit between 01-01-2022 and 02-01-2022. There are 1882 posts in total, so this is roughly 1/3 of the total. Note that the CSV file is not in chronological order.

The column 'self' records whether the poster is the person with (diagnosed or suspected) BPD. It defaults to 1 and is set to 0 if the post is about someone else.
The column 'is_relevant' takes an inclusive view: it covers any post which discusses therapy or medication in any way. It defaults to 0 and is set to 1 if the post is relevant. Most of these mentions are only made in passing, so I created a new column called 'highly_relevant', which only includes posts that mention a specific type of therapy or medication and some type of outcome. It defaults to null and is set to 1 if the post is highly relevant.
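
The coding scheme can be written down as a small consistency check. This is only a sketch: the rows are made up (not taken from the data), `check_coding` is a hypothetical helper, 'highly_relevant' is stored as an integer for simplicity, and the rule that a highly relevant post must also be marked relevant is an assumption drawn from the column descriptions.

```python
# Hypothetical rows illustrating the coding scheme; not taken from the data set.
rows = [
    {"self": 1, "is_relevant": 0, "highly_relevant": None},  # default coding
    {"self": 1, "is_relevant": 1, "highly_relevant": None},  # therapy mentioned in passing
    {"self": 0, "is_relevant": 1, "highly_relevant": 1},     # about someone else, treatment plus outcome
    {"self": 1, "is_relevant": 0, "highly_relevant": 1},     # inconsistent under the assumption below
]

def check_coding(row):
    """Return a list of problems found in one coded row (empty if none)."""
    problems = []
    if row["self"] not in (0, 1):
        problems.append("self must be 0 or 1")
    if row["is_relevant"] not in (0, 1):
        problems.append("is_relevant must be 0 or 1")
    if row["highly_relevant"] not in (None, 1):
        problems.append("highly_relevant must be null or 1")
    # Assumption: a highly relevant post is also a relevant one.
    if row["highly_relevant"] == 1 and row["is_relevant"] != 1:
        problems.append("highly_relevant set without is_relevant")
    return problems

for i, row in enumerate(rows):
    print(i, check_coding(row))
```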

From reading through 500 of these posts, here are some common patterns:
1. There are relatively few posts which go into detail about treatments. I suspect the 'is_relevant' column will not be particularly useful, but the 'highly_relevant' one might be better.
2. There may be other research questions which this data set is better suited to address, since there are a lot of commonalities between the posts.
3. A number of posts mention that other posts are getting removed or deleted. Since I have already removed those from the data (and wouldn't be able to analyze them anyway), we are getting a filtered sample.

In [4]:
print(df1.shape)

(1882, 9)


In [5]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1882 entries, 0 to 1881
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   title            1882 non-null   object
 1   selftext         1882 non-null   object
 2   created_utc      1882 non-null   int64 
 3   over_18          1882 non-null   bool  
 4   subreddit        1882 non-null   object
 5   date_created     1882 non-null   object
 6   self             1882 non-null   int64 
 7   is_relevant      1882 non-null   int64 
 8   highly_relevant  21 non-null     object
dtypes: bool(1), int64(3), object(5)
memory usage: 119.6+ KB


In [6]:
## Counting the relevant posts

df1[df1['is_relevant'] != 0]['is_relevant'].count()

106

Of the first 600 posts in this sample, there are 106 relevant posts and 20 highly relevant posts (the last non-null entry is a marker for the last post which was coded).
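
To put these counts in perspective, the short calculation below turns them into rates. It uses only the numbers reported above. The last figure is the accuracy a classifier would reach by labelling every coded post as not highly relevant, which is a useful reference point for any accuracy score on this data.

```python
# Counts reported above for the coded portion of the sample.
posts_coded = 600
relevant = 106
highly_relevant = 20

relevant_rate = relevant / posts_coded
highly_relevant_rate = highly_relevant / posts_coded

# Accuracy of a trivial classifier that labels every post as not highly relevant.
always_zero_accuracy = (posts_coded - highly_relevant) / posts_coded

print(f"relevant: {relevant_rate:.1%}")                   # 17.7%
print(f"highly relevant: {highly_relevant_rate:.1%}")     # 3.3%
print(f"always-0 accuracy: {always_zero_accuracy:.1%}")   # 96.7%
```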

In [7]:
posts_analyzed = 600

# Work on an explicit copy so that later assignments do not write into a slice of df1
df_coded = df1.head(posts_analyzed).copy()

## Removing the marker of where things were left off.

df_coded.loc[598, 'highly_relevant'] = np.nan

df_coded['highly_relevant'] = df_coded['highly_relevant'].fillna(0)

In [22]:
distinct_entries = df_coded['highly_relevant'].unique()

# Print the distinct entries
print("Distinct entries in column 'highly_relevant':")
print(distinct_entries)

Distinct entries in column 'highly_relevant':
[0 '1']


In [23]:
## Note that the '1's are being read as strings. We'll fix that now

# assign() returns a new DataFrame, so this does not write into a slice of df1
df_coded = df_coded.assign(highly_relevant=df_coded['highly_relevant'].astype(int))
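
An alternative way to clean a column that mixes missing values and string digits is `pd.to_numeric`. The sketch below runs on a toy Series, not the real column. Note that `errors="coerce"` silently turns any non-numeric entry into NaN (and then 0), which may or may not be what you want for a hand-coded column.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the mixed types seen above: NaN for uncoded rows, '1' as a string.
raw = pd.Series([np.nan, "1", np.nan, "1"], dtype="object")

# Parse to numbers first, then fill missing values, then cast to int.
clean = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)

print(clean.tolist())  # [0, 1, 0, 1]
```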


#### First attempts at building NLP models
Probably going to fail miserably, but there's only one way to find out...

In [8]:
## pip install nltk

## nltk.download('stopwords')

## nltk.download('wordnet') 

## For the life of me, I could not get this to work and had to manually download the folders and put them in the folder by hand.
## Hopefully, you have better luck with this.


In [9]:
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, make_scorer, accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#### Creating a single column with all the text and removing stopwords

In [29]:
# Combine text columns into a single column.
# assign() returns a new DataFrame, which avoids writing into a slice of df1.
df_coded = df_coded.assign(combined_text=df_coded['title'] + ' ' + df_coded['selftext'])

# Build these once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Text preprocessing function
def preprocess_text(text):
    # Tokenization (whitespace only, so punctuation stays attached to words)
    tokens = text.split()

    # Lowercase and remove stopwords
    tokens = [word.lower() for word in tokens if word.lower() not in stop_words]

    # Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return ' '.join(tokens)

# Apply preprocessing to the combined text column
df_coded = df_coded.assign(processed_text=df_coded['combined_text'].apply(preprocess_text))
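
One side effect of tokenizing with `text.split()` is that punctuation stays attached to words, so `stop?` and `stop` end up as different tokens. The sketch below contrasts this with a simple regex tokenizer. It uses a tiny made-up stop word set instead of NLTK's list and leaves out lemmatization, so it illustrates the tokenization step only.

```python
import re

# Hypothetical, tiny stop word set used only for this illustration
# (the notebook itself uses NLTK's English list).
STOP_WORDS = {"i", "to", "the", "it", "does"}

text = "Does it ever stop? I went to the therapist."

# Whitespace tokenization, as in preprocess_text: punctuation stays attached.
split_tokens = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

# Regex tokenization: keep runs of letters, digits and apostrophes only.
regex_tokens = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]

print(split_tokens)  # ['ever', 'stop?', 'went', 'therapist.']
print(regex_tokens)  # ['ever', 'stop', 'went', 'therapist']
```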


#### The 0th level model

In [38]:
# Split the data into a training and hold-out set
X = df_coded['processed_text']
y = df_coded['highly_relevant']

X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, stratify=y)

# Create a text classification pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Perform k-fold cross-validation on the training set
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=kfold, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {cv_scores.mean()}")

# Train the model on the entire training set and evaluate on the hold-out set
pipeline.fit(X_train, y_train)
y_pred_holdout = pipeline.predict(X_holdout)

print("\nEvaluation on the hold-out set:")
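# zero_division=0 reports undefined precision/F1 as 0.00 without emitting warnings
print(classification_report(y_holdout, y_pred_holdout, zero_division=0))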



Cross-validation scores: [0.96875 0.96875 0.9625 ]
Mean cross-validation score: 0.9666666666666667

Evaluation on the hold-out set:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       116
           1       0.00      0.00      0.00         4

    accuracy                           0.97       120
   macro avg       0.48      0.50      0.49       120
weighted avg       0.93      0.97      0.95       120





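This is not working at all. Because the classes are so imbalanced, the model simply predicts 0 (not 'highly_relevant') for every post: the 97% accuracy comes entirely from the majority class, and recall on the 4 highly relevant hold-out posts is 0. So let's start changing things to better reward finding 1s.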
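
To see why accuracy is misleading here, the sketch below scores a model that never predicts the positive class on labels with the same balance as the hold-out set (116 negatives, 4 positives). `precision_recall_f1` is a hypothetical helper written for this illustration. It returns 0.0 where a metric is undefined, which is a convention chosen here for simplicity.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, returning 0.0 where undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Same class balance as the hold-out set above: 116 negatives, 4 positives.
y_true = [0] * 116 + [1] * 4
# A model that never predicts the positive class.
y_pred = [0] * 120

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")          # accuracy: 0.97
print(precision_recall_f1(y_true, y_pred))  # (0.0, 0.0, 0.0)
```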

#### A slightly better baseline

In [56]:
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

# Create a text classification pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('classifier', MultinomialNB())
])

# Define the stratified k-fold splitter (cross-validation is not actually run in this cell)
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Train the model on the entire training set and evaluate on the hold-out set
pipeline.fit(X_train, y_train)
y_pred_prob_holdout = pipeline.predict_proba(X_holdout)[:, 1]  # Get probabilities for the positive class

# Adjust the threshold
threshold = 0.001  # Example threshold less than 0.5 to increase sensitivity
y_pred_holdout = (y_pred_prob_holdout >= threshold).astype(int)

print("\nEvaluation on the hold-out set with adjusted threshold:")
print(classification_report(y_holdout, y_pred_holdout))


Evaluation on the hold-out set with adjusted threshold:
              precision    recall  f1-score   support

           0       0.99      0.81      0.89       116
           1       0.12      0.75      0.21         4

    accuracy                           0.81       120
   macro avg       0.55      0.78      0.55       120
weighted avg       0.96      0.81      0.87       120
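
The report does not print the confusion matrix, but the counts can be inferred from the rounded figures: recall 0.75 on 4 positives means 3 true positives, and recall 0.81 on 116 negatives means 94 true negatives. The check below treats those inferred counts as an assumption and confirms that they reproduce the reported metrics.

```python
# Counts inferred from the rounded report above (an assumption, not printed by the notebook).
tp, fn = 3, 1    # positives: 3 found, 1 missed
tn, fp = 94, 22  # negatives: 94 correctly left alone, 22 flagged by mistake

precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"flagged as positive: {tp + fp}")  # flagged as positive: 25
print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f} accuracy={accuracy:.2f}")
# precision=0.12 recall=0.75 f1=0.21 accuracy=0.81
```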



In [57]:
# Filter the holdout set for entries predicted as positive
positive_entries = X_holdout[y_pred_holdout == 1]

# Display the positive entries from the holdout set
print("\nEntries from the holdout set that the model predicted as positive:")
print(positive_entries)

# Additionally, show the actual labels for these entries
positive_labels = y_holdout[y_pred_holdout == 1]
print("\nActual labels for the positive entries:")
print(positive_labels)


Entries from the holdout set that the model predicted as positive:
570    book movie recommendation i've recently gotten...
316    got second opinion confirmed bpd left sub firs...
451    using sleep coping mechanism past 2 year copin...
285    i’m quit job mental health bad feel guilty lik...
103    shfkfhfj boy showing slight interest i’m prett...
17     ever stop? pretty much always go like this: so...
269    quiet bpd day like today inside out, organ bur...
137    ex boyfriend fetish messaged girl reddit about...
367    bipolar bdp well today first appointment new t...
376    really get diagnosed 25? told psychologist i’v...
356    better mood swing bad im getting better managi...
400    re: removed post hi all! we've lot people reac...
117    chaos embodied borderline personality disorder...
47     advice overcoming future failure hey everyone,...
565    whats wrong ? posted long ago best friend bpd ...
75     diagnosed bpd tendency mean full blown disorde...
126    period make b

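There are a ton of false positives (precision for the positive class is only 0.12), but we seem to be doing a reasonable job of finding the highly relevant posts: 3 of the 4 in the hold-out set are recovered. More importantly, we can use this as a baseline to improve upon.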
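
The 0.001 threshold was picked by hand. The sketch below shows how precision and recall shift as the threshold changes. The probabilities and labels are made up for illustration and are not the model's output, and `sweep` is a hypothetical helper.

```python
# Made-up predicted probabilities and labels, for illustration only.
probs = [0.0004, 0.0008, 0.0015, 0.002, 0.003, 0.01, 0.04, 0.20]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

def sweep(probs, labels, thresholds):
    """Return (threshold, precision, recall) for each candidate threshold."""
    results = []
    for th in thresholds:
        preds = [int(p >= th) for p in probs]
        tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
        fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((th, precision, recall))
    return results

results = sweep(probs, labels, [0.001, 0.005, 0.05, 0.5])
for th, precision, recall in results:
    print(f"threshold={th:<6} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold raises recall at the cost of precision. Choosing the threshold from cross-validated predictions on the training set, rather than on the hold-out set, would keep the hold-out evaluation independent of that choice.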