This is the second of three notebooks to analyze whether we can distinguish between a depressed note and a suicidal note. In this notebook, I conduct topic modeling for the posts pulled from r/depression and r/SuicideWatch to see if differentiation between the two types of posts can be discerned.

In [1]:
import pandas as pd
df_depression = pd.read_csv("r_depression.csv",index_col = 0)
df_suicide = pd.read_csv("r_suicide.csv", index_col = 0)

In [4]:
print(df_depression.shape)
df_depression.dropna(how = 'any', inplace = True)
print(df_depression.shape)
df_depression[:5]

(3796, 2)
(3714, 2)


Unnamed: 0,Title,Body
0,Not sure if I'm being annoying and overbearing...,This semester a friend of mine has skipped sev...
1,Why can't I just be homeless and die in the cold,I dont want to work I dont want to get up I do...
2,I’m better,I’d officially better and I’m ready to leave t...
3,i have therapy and i’m not going,i’ve got a therapy appointment in 1 hour and i...
4,This one girl has actually driven me into depr...,This girl and I had known each other for a whi...


In [5]:
print(df_suicide.shape)
df_suicide.dropna(how = 'any', inplace = True)
print(df_suicide.shape)
df_suicide[:5]

(3900, 2)
(3680, 2)


Unnamed: 0,Title,Body
0,Holy crap I’m back,Is that what this account will be? Pouring my ...
1,Telling someone not to kill themselves seems a...,"Also, stop telling people ""Think about your fa..."
2,My life is meaningless,\n\n\n\nI just want to kill myself. I cannot g...
3,I think this is my last option,It's not that I want to die. It's that I have ...
4,I don't think I can do this,I have midterms and the last thing I wanna do ...


## High-level steps: 
for each dataframe:
1. preprocess: merge the two columns (title + body), tokenize, remove stopwords, lemmatize
2. topic modeling

then concatenate the two dataframe (by rows) and perform topic modeling again

## 1. Topic modeling for r/Depression

In [6]:
# merge title and body for comprehensive analysis
df_depression['Combined'] = df_depression['Title']+' '+ df_depression['Body']
df_depression[:5]

Unnamed: 0,Title,Body,Combined
0,Not sure if I'm being annoying and overbearing...,This semester a friend of mine has skipped sev...,Not sure if I'm being annoying and overbearing...
1,Why can't I just be homeless and die in the cold,I dont want to work I dont want to get up I do...,Why can't I just be homeless and die in the co...
2,I’m better,I’d officially better and I’m ready to leave t...,I’m better I’d officially better and I’m ready...
3,i have therapy and i’m not going,i’ve got a therapy appointment in 1 hour and i...,i have therapy and i’m not going i’ve got a th...
4,This one girl has actually driven me into depr...,This girl and I had known each other for a whi...,This one girl has actually driven me into depr...


In [7]:
# prep for preprocessing
# tokenization
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# stopword
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Jenny\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jenny\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Jenny\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
import string
import re
def preprocess(post):
    """
    preprocess (tokenize, remove stopwords, lemmatize) a post for topic modeling
    args:
        post (string): one string object that user would like to process
    returns:
        a list of words
    """
    word_tokens = word_tokenize(post) # tokenize words
    filtered_sentence = [word.lower() for word in word_tokens if word not in stop_words] # remove stopwords
    filtered_sentence_2 = [word for word in filtered_sentence if not re.match(r"^\W+$", word)] # remove standalone punctuations
    filtered_sentence_3 = [re.sub(r"\W+", "", word) for word in filtered_sentence_2] # delete any non-word characters (e.g., '')
    
    # lemmatize
    lemmatized = []
    for word in filtered_sentence_3:
        lemmatized.append(lemmatizer.lemmatize(word))
    return lemmatized

# test
preprocess("Mainstream of 'depression and anxiety' is ruing the empathy for people that have it.")

['mainstream', 'depression', 'anxiety', 'ruing', 'empathy', 'people']

In [11]:
df_depression['Combined'] = df_depression['Combined'].astype('str')
df_depression[:5]

Unnamed: 0,Title,Body,Combined
0,Not sure if I'm being annoying and overbearing...,This semester a friend of mine has skipped sev...,Not sure if I'm being annoying and overbearing...
1,Why can't I just be homeless and die in the cold,I dont want to work I dont want to get up I do...,Why can't I just be homeless and die in the co...
2,I’m better,I’d officially better and I’m ready to leave t...,I’m better I’d officially better and I’m ready...
3,i have therapy and i’m not going,i’ve got a therapy appointment in 1 hour and i...,i have therapy and i’m not going i’ve got a th...
4,This one girl has actually driven me into depr...,This girl and I had known each other for a whi...,This one girl has actually driven me into depr...


In [12]:
# prepare document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
corpus = df_depression['Combined'].values
vectorizer = CountVectorizer(tokenizer = preprocess, # use our own preprocessor that tokenizes, removes stopwords, and lemmatizes
                            min_df = 15)
depression_dtm = vectorizer.fit_transform(corpus)

In [14]:
print(len(vectorizer.get_feature_names()))
vectorizer.get_feature_names()[:5]

2004


['0', '1', '10', '100', '11']

In [15]:
X = depression_dtm.toarray()
vocab = vectorizer.get_feature_names()
titles = df_depression.index.values

print(type(X), X.sum(), X.shape) # an array may be needed for lda

<class 'numpy.ndarray'> 288114 (3714, 2004)


In [18]:
# get topics with LDA
import numpy as np
import lda

model = lda.LDA(n_topics=5, n_iter=1500, random_state=1) # 5 topics seem to be too much, but 4 seem confused
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 20
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, '/'.join(topic_words)))

INFO:lda:n_documents: 3714
INFO:lda:vocab_size: 2004
INFO:lda:n_words: 288114
INFO:lda:n_topics: 5
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -2437687
INFO:lda:<10> log likelihood: -2202905
INFO:lda:<20> log likelihood: -2141844
INFO:lda:<30> log likelihood: -2113252
INFO:lda:<40> log likelihood: -2096042
INFO:lda:<50> log likelihood: -2083532
INFO:lda:<60> log likelihood: -2073214
INFO:lda:<70> log likelihood: -2066408
INFO:lda:<80> log likelihood: -2061395
INFO:lda:<90> log likelihood: -2056455
INFO:lda:<100> log likelihood: -2052748
INFO:lda:<110> log likelihood: -2049825
INFO:lda:<120> log likelihood: -2050034
INFO:lda:<130> log likelihood: -2047948
INFO:lda:<140> log likelihood: -2048064
INFO:lda:<150> log likelihood: -2046765
INFO:lda:<160> log likelihood: -2045989
INFO:lda:<170> log likelihood: -2045036
INFO:lda:<180> log likelihood: -2045671
INFO:lda:<190> log likelihood: -2045744
INFO:lda:<200> log likelihood: -2045191
INFO:lda:<210> log likelihood: -2044249
INFO:lda:<

Topic 0: nt/m/s/ve/ca/like/feel/want/know/life/ll/re/even/get/d/really/people/anymore/one/thing
Topic 1: emptiness/im/dont/cant/ive/help/didnt/need/tired/hate/wont/thats/want/doesnt/shes/please/wasnt/havent/isnt/there
Topic 2: year/friend/school/got/job/time/get/go/day/would/back/told/one/month/work/mom/home/want/parent/life
Topic 3: feel/like/depression/get/time/help/day/know/really/thing/life/even/make/thought/feeling/people/year/work/depressed/go
Topic 4: feel/like/want/know/life/people/even/friend/make/get/thing/fucking/time/never/one/really/think/would/hate/someone


### The four-five topics that people are depressed about seem to be

- confused: feel, even, anymore
- lonely and depressed: emptiness, can't, tired, hate 
- people and career: friend, school, job, work, mom, parent
- mental health: depression, feeling, depressed 
- anger (more ambiguous): life, fucking, never, hate

In [20]:
# understand the top topic of each post
doc_topic = model.doc_topic_
topic_index = {0:'confused', 1: 'lonely', 2:'people_career', 3:'mental_health', 4:'anger'}
topic_count = {v : 0 for v in topic_index.values()}
for i in range(len(titles)):
    topic_count[topic_index[doc_topic[i].argmax()]] += 1

In [21]:
topic_count

{'confused': 234,
 'lonely': 11,
 'people_career': 721,
 'mental_health': 1156,
 'anger': 1592}

In [58]:
# # check each topic
# for i in range(len(titles)):
#     if doc_topic[i].argmax() == 3:
#         print(df_depression['Combined'].iloc[i])

## 1. Topic modeling for r/SuicideWatch

In [22]:
# merge two columns
df_suicide['Combined'] = df_suicide['Title']+' '+ df_suicide['Body']
df_suicide[:5]

Unnamed: 0,Title,Body,Combined
0,Holy crap I’m back,Is that what this account will be? Pouring my ...,Holy crap I’m back Is that what this account w...
1,Telling someone not to kill themselves seems a...,"Also, stop telling people ""Think about your fa...",Telling someone not to kill themselves seems a...
2,My life is meaningless,\n\n\n\nI just want to kill myself. I cannot g...,My life is meaningless \n\n\n\nI just want to ...
3,I think this is my last option,It's not that I want to die. It's that I have ...,I think this is my last option It's not that I...
4,I don't think I can do this,I have midterms and the last thing I wanna do ...,I don't think I can do this I have midterms an...


In [24]:
df_suicide['Combined'] = df_suicide['Combined'].astype('str')
df_suicide[:5]

Unnamed: 0,Title,Body,Combined
0,Holy crap I’m back,Is that what this account will be? Pouring my ...,Holy crap I’m back Is that what this account w...
1,Telling someone not to kill themselves seems a...,"Also, stop telling people ""Think about your fa...",Telling someone not to kill themselves seems a...
2,My life is meaningless,\n\n\n\nI just want to kill myself. I cannot g...,My life is meaningless \n\n\n\nI just want to ...
3,I think this is my last option,It's not that I want to die. It's that I have ...,I think this is my last option It's not that I...
4,I don't think I can do this,I have midterms and the last thing I wanna do ...,I don't think I can do this I have midterms an...


In [25]:
# prepare document-term matrix
corpus_2 = df_suicide['Combined'].values
vectorizer = CountVectorizer(tokenizer = preprocess, # use our own preprocessor that tokenizes, removes stopwords, and lemmatizes
                            min_df = 15)
suicide_dtm = vectorizer.fit_transform(corpus_2)

In [26]:
print(len(vectorizer.get_feature_names()))
vectorizer.get_feature_names()[:5]

1957


['0', '1', '10', '100', '11']

In [27]:
X = suicide_dtm.toarray()
vocab = vectorizer.get_feature_names()
titles = df_suicide.index.values

print(type(X), X.sum(), X.shape) # an array may be needed for lda

<class 'numpy.ndarray'> 281251 (3680, 1957)


In [28]:
# get topics with LDA
import numpy as np
import lda

model = lda.LDA(n_topics=3, n_iter=1500, random_state=1) # the topics seem to overlap quite a bit; limiting to 3 topics
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 20
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, '/'.join(topic_words)))

INFO:lda:n_documents: 3680
INFO:lda:vocab_size: 1957
INFO:lda:n_words: 281251
INFO:lda:n_topics: 3
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -2186687
INFO:lda:<10> log likelihood: -2056377
INFO:lda:<20> log likelihood: -2012810
INFO:lda:<30> log likelihood: -1981041
INFO:lda:<40> log likelihood: -1958350
INFO:lda:<50> log likelihood: -1940562
INFO:lda:<60> log likelihood: -1930191
INFO:lda:<70> log likelihood: -1925771
INFO:lda:<80> log likelihood: -1922929
INFO:lda:<90> log likelihood: -1921691
INFO:lda:<100> log likelihood: -1920409
INFO:lda:<110> log likelihood: -1920617
INFO:lda:<120> log likelihood: -1919763
INFO:lda:<130> log likelihood: -1917280
INFO:lda:<140> log likelihood: -1917494
INFO:lda:<150> log likelihood: -1917251
INFO:lda:<160> log likelihood: -1916453
INFO:lda:<170> log likelihood: -1916141
INFO:lda:<180> log likelihood: -1915471
INFO:lda:<190> log likelihood: -1915435
INFO:lda:<200> log likelihood: -1915269
INFO:lda:<210> log likelihood: -1914500
INFO:lda:<

Topic 0: like/feel/life/want/know/time/get/year/even/would/friend/people/thing/go/really/day/think/make/going/never
Topic 1: nt/m/s/ve/ca/want/ll/like/know/feel/life/fucking/get/die/d/people/anymore/help/even/kill
Topic 2: one/care/im/dont/kill/cant/ive/doesnt/thats/ill/isnt/there/wont/wouldnt/didnt/he/havent/shes/id/now


### The three topics that people are suicidal over seem to be

- help-seeking? feel, friend, people, think
- anger and given up? feel, fucking, die, anymore, kill
- ...everything

In [29]:
# understand the top topic of each post
doc_topic = model.doc_topic_
topic_index = {0:'help-seeking', 1: 'anger', 2:'everything'}
topic_count = {v : 0 for v in topic_index.values()}
for i in range(len(titles)):
    topic_count[topic_index[doc_topic[i].argmax()]] += 1

In [30]:
topic_count

{'help-seeking': 3421, 'anger': 256, 'everything': 3}

In [71]:
# # check each topic
# for i in range(len(titles)):
#     if doc_topic[i].argmax() == 2:
#         print(df_suicide['Combined'].iloc[i])

## 3. Topic modeling for concatenated dataframe

In [31]:
df_both = pd.concat([df_depression, df_suicide],ignore_index=True).dropna(how = 'any') #reset the index since both were #'s
print(df_both.shape)
df_both[:5]

(7394, 3)


Unnamed: 0,Title,Body,Combined
0,Not sure if I'm being annoying and overbearing...,This semester a friend of mine has skipped sev...,Not sure if I'm being annoying and overbearing...
1,Why can't I just be homeless and die in the cold,I dont want to work I dont want to get up I do...,Why can't I just be homeless and die in the co...
2,I’m better,I’d officially better and I’m ready to leave t...,I’m better I’d officially better and I’m ready...
3,i have therapy and i’m not going,i’ve got a therapy appointment in 1 hour and i...,i have therapy and i’m not going i’ve got a th...
4,This one girl has actually driven me into depr...,This girl and I had known each other for a whi...,This one girl has actually driven me into depr...


In [33]:
# prepare document-term matrix
corpus_3 = df_both['Combined'].values
vectorizer = CountVectorizer(tokenizer = preprocess, # use our own preprocessor that tokenizes, removes stopwords, and lemmatizes
                            min_df = 15)
both_dtm = vectorizer.fit_transform(corpus_3)

In [34]:
print(len(vectorizer.get_feature_names()))
vectorizer.get_feature_names()[:5]

3031


['0', '1', '10', '100', '1000']

In [35]:
X = both_dtm.toarray()
vocab = vectorizer.get_feature_names()
titles = df_both.index.values

print(type(X), X.sum(), X.shape) # an array may be needed for lda

<class 'numpy.ndarray'> 591620 (7394, 3031)


In [36]:
# get topics with LDA
import numpy as np
import lda

model = lda.LDA(n_topics=5, n_iter=1500, random_state=1) # the topics get quite blurry once we combine the two dataframes...
# going with 5 for now (tried 3-5; all have overlapping words but with three there's essentially no difference)
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 20
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, '/'.join(topic_words)))

INFO:lda:n_documents: 7394
INFO:lda:vocab_size: 3031
INFO:lda:n_words: 591620
INFO:lda:n_topics: 5
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -5075575
INFO:lda:<10> log likelihood: -4630994
INFO:lda:<20> log likelihood: -4491901
INFO:lda:<30> log likelihood: -4430139
INFO:lda:<40> log likelihood: -4393164
INFO:lda:<50> log likelihood: -4369894
INFO:lda:<60> log likelihood: -4354611
INFO:lda:<70> log likelihood: -4343500
INFO:lda:<80> log likelihood: -4334337
INFO:lda:<90> log likelihood: -4327494
INFO:lda:<100> log likelihood: -4322840
INFO:lda:<110> log likelihood: -4319536
INFO:lda:<120> log likelihood: -4314655
INFO:lda:<130> log likelihood: -4312751
INFO:lda:<140> log likelihood: -4309404
INFO:lda:<150> log likelihood: -4308907
INFO:lda:<160> log likelihood: -4305176
INFO:lda:<170> log likelihood: -4304190
INFO:lda:<180> log likelihood: -4301987
INFO:lda:<190> log likelihood: -4298440
INFO:lda:<200> log likelihood: -4298081
INFO:lda:<210> log likelihood: -4295932
INFO:lda:<

Topic 0: like/feel/life/get/time/year/thing/even/depression/day/really/know/work/job/make/go/people/going/thought/friend
Topic 1: friend/year/would/time/got/mom/told/one/said/back/day/go/dad/parent/month/kill/never/school/last/week
Topic 2: one/care/emptiness/mountain/climb/http/river/cheap/friendship/enjoyable/afternoon/consciousness/45/happened/drifting/private/face/continuously/heaven/everywhere
Topic 3: want/feel/like/know/life/im/people/fucking/even/get/think/dont/anymore/really/hate/die/make/friend/much/everything
Topic 4: nt/m/s/ve/ca/want/like/feel/know/life/people/ll/get/even/anymore/think/would/make/fucking/d


### The five topics in the combined dataframe include

- depressed? work?
- friends and parents? school?
- lonely; life is too challenging
- anger
- also anger?

In [38]:
# understand the top topic of each post
doc_topic = model.doc_topic_
topic_index = {0:'work', 1: 'school', 2:'lonely', 3:'anger', 4:'also_anger'}
topic_count = {v : 0 for v in topic_index.values()}
for i in range(len(titles)):
    topic_count[topic_index[doc_topic[i].argmax()]] += 1

In [39]:
topic_count

{'work': 3252, 'school': 829, 'lonely': 6, 'anger': 2140, 'also_anger': 1167}

In [37]:
# # check each topic
# for i in range(len(titles)):
#     if doc_topic[i].argmax() == 4:
#         print(df_both['Combined'].iloc[i])

**Final Comment:** In general, the topics in r/depression are __slightly__ more clear cut; that said, everything overlaps quite a bit in both subreddits. Moreover, the topics in the combined dataframe are not at all obvious. I would hesitate to conclude the topics in the two subreddits can be distinguished.