This notebook explores topic modeling using [Latent Dirichlet allocation (LDA)](https://www.ibm.com/think/topics/latent-dirichlet-allocation).

In [1]:
from tqdm.auto import tqdm
import os
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'; None silences the "copying a slice of a DataFrame" warning
tqdm.pandas() # activate progress_apply

import numpy as np
from copy import copy

from collections import Counter

# Data

Let's keep using the dataset of psychologists' responses to patient comments.

In [2]:
mental_health = pd.read_csv("data/mental_health.csv")
mental_health.rename(columns={"Context": "patient_comment", "Response": "psyc_response"}, inplace=True)
mental_health.dropna(inplace=True)

# take a sample of 20 responses
mental_health = mental_health.sample(20, random_state=1337)
mental_health

Unnamed: 0,patient_comment,psyc_response
3296,"My boyfriend is in Ireland for 11 days, and I ...",It sounds like you and your boyfriend are very...
2011,I have so many issues to address. I have a his...,I think this is a very common question that pe...
1683,"After 40 years of being straight, how could I ...",Sexuality is normally formed during adolescenc...
1560,I feel like I took our relationship for grante...,A key factor in a relationship is trust.I'd st...
1267,"I crave attention, companionship, and sex. She...","Hi Hampton,Although I'd bet your wife also wan..."
2895,He said he would try and he never did. It's be...,If your husband is changing his mind about whe...
2697,"I always feel the need to impress people, whet...",It is normal to seek other’s attention and not...
1042,We're in an eight year relationship. My boyfri...,"First, let me extend my compassion to both of ..."
3312,I've gone to a couple therapy sessions so far ...,"Yes, it is completely normal to feel anxious a..."
2788,He is an adolescent. He has peed his pant mult...,"Sounds as though your son is ""pissed off"" abou..."


# Process text

The LDA algorithm takes as input the Term Document Frequency (TDF) matrix: one row per text and one column per word, where each cell counts how often that word appears in that text.

We will use `spacy` to process each text and extract the lemmas of both the patient comments and the psychologist responses.
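
To make the idea of a term-document frequency matrix concrete, here is a minimal sketch that builds one by hand with `collections.Counter`. The three texts in `toy_docs` are hypothetical, already-lemmatized stand-ins and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical, already-lemmatized texts (not from the dataset)
toy_docs = [
    "feel anxious therapy session",
    "relationship trust relationship",
    "feel trust therapy",
]

# One Counter per document: term -> number of occurrences in that document
doc_counts = [Counter(doc.split()) for doc in toy_docs]

# Shared, sorted vocabulary across all documents (the matrix columns)
vocab = sorted(set().union(*doc_counts))

# Rows = documents, columns = vocabulary terms, cells = counts
doc_term_matrix = [[counts[term] for term in vocab] for counts in doc_counts]

print(vocab)
for row in doc_term_matrix:
    print(row)
```

Each row is one text and each column one word, which is the same layout that `CountVectorizer` produces further down, only stored there as a sparse matrix.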

In [3]:
import spacy
from spacy.language import Language

# check and download the needed models
for model in ["en_core_web_sm"]:     # smallest processing pipeline
    try:
        spacy.load(model)
        print(f"{model} is available")
    except OSError:  # spacy.load raises OSError when the model is not installed
        spacy.cli.download(model)

en_core_web_sm is available


**YOU MIGHT NEED TO RESTART THE KERNEL TO USE THE MODELS**

In [4]:
# Define a custom spaCy pipeline component to clean the text
@Language.component("cleaning_component")
def remove_stop_lemma_POS(doc):
    # Lemmatize, remove stopwords, and filter out specific POS tags
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.pos_ not in ['PROPN', 'PUNCT', 'INTJ', 'NUM']]
    # Filter out tokens containing '\n\n', tokens with '-', and tokens starting with '\'
    tokens = [token for token in tokens if '\n\n' not in token and '-' not in token and not token.startswith('\\')]
    # Remove any empty strings and special characters like '\xa0'
    tokens = [token for token in tokens if token and not (token.startswith('\xa0') or token.startswith('\n') or token.startswith(' '))]
    tokens = [token for token in tokens if token not in ['','.','¿','20','sé','"laziness']]
    doc.user_data["cleaned_tokens"] = tokens
    return doc

In [5]:
nlp_en = spacy.load("en_core_web_sm")
nlp_en.add_pipe("cleaning_component", name="cleaning_component", last=True);

Now we can process the text so that it contains only the lemmas.

In [6]:
# Prepare the text into 1 vector with labels just to keep track
# Process patient comments
patient_comments = mental_health["patient_comment"].tolist()
patient_docs = list(nlp_en.pipe(patient_comments))
patient_cleaned_tokens = [" ".join(doc.user_data["cleaned_tokens"]) for doc in patient_docs]

# Process psychologist responses
psyc_responses = mental_health["psyc_response"].tolist()
psyc_docs = list(nlp_en.pipe(psyc_responses))
psyc_cleaned_tokens = [" ".join(doc.user_data["cleaned_tokens"]) for doc in psyc_docs]

# Create dataframe with processed texts
# First create separate records for patient and psyc texts
patient_records = [
    {"index": idx, "text": tokens, "label": "patient"} 
    for idx, tokens in zip(mental_health.index, patient_cleaned_tokens)
]
psyc_records = [
    {"index": idx, "text": tokens, "label": "psyc"}
    for idx, tokens in zip(mental_health.index, psyc_cleaned_tokens)
]

# Combine both record types
all_records = patient_records + psyc_records

# Create the final dataframe
processed_texts = pd.DataFrame(all_records)
processed_texts['length'] = [len(text.split()) for text in processed_texts['text']]
processed_texts.sample(5)

Unnamed: 0,index,text,label,length
7,1042,year relationship boyfriend drink lot experien...,patient,12
13,1078,pretty day middle child trust take year finall...,patient,19
37,1606,foremost want acknowledge effort gain ideal er...,psyc,80
23,1560,key factor relationship trust i'd start unders...,psyc,58
1,2011,issue address history sexual abuse breast canc...,patient,25


We now need to create the Term Document Frequency matrix. Note that you can also use [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weights by swapping `CountVectorizer` for `TfidfVectorizer`, which may improve performance.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
all_texts = " ".join(processed_texts["text"].tolist())
all_texts = all_texts.split(" ")
dictionary = set(all_texts)

In [9]:
vec = CountVectorizer(vocabulary=dictionary,token_pattern=r"(?u)\b\w+\b", lowercase=True)
X = vec.fit_transform(processed_texts["text"].values)



In [10]:
tdf = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
tdf.head()

Unnamed: 0,/,People,También,ability,able,absolute,absorb,abuse,accept,acknowledge,...,word,work,worried,worry,wrap,wreck,write,year,yell,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we can feed the matrix `X` (the sparse representation behind `tdf`) to the LDA algorithm.

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
lda = LatentDirichletAllocation(n_components=5, random_state=1337)
topics_vecs = lda.fit_transform(X)

In [13]:
topics_vecs[0]

array([0.04083957, 0.04022991, 0.04001249, 0.04181473, 0.83710331])

This gives you, for each document, a probability for each of the 5 topics found in the data. For the first text in the dataset, the most likely topic is topic 5 (about 84%).

In [14]:
processed_texts['topic_probs'] = [list(topic_vec) for topic_vec in topics_vecs]
# assuming that you assign the topic with the highest probability
processed_texts['topic'] = topics_vecs.argmax(axis=1) + 1

In [15]:
processed_texts.head()

Unnamed: 0,index,text,label,length,topic_probs,topic
0,3296,boyfriend day emotional wreck,patient,4,"[0.04083956812543484, 0.040229907313569714, 0....",5
1,2011,issue address history sexual abuse breast canc...,patient,25,"[0.007735944645487772, 0.007854819609671722, 0...",5
2,1683,year straight find interested people sex sex e...,patient,8,"[0.022514455359502604, 0.022460127729601535, 0...",4
3,1560,feel like take relationship grant point give t...,patient,26,"[0.007486141632854933, 0.007476137812433564, 0...",5
4,1267,crave attention companionship sex hysterectomy...,patient,7,"[0.8981821340230539, 0.025192398635983493, 0.0...",1


Now we will use the `pyLDAvis` package to understand the results. See the following [example](https://nbviewer.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb) for other things you can do.

In [16]:
import pyLDAvis

In [17]:
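# lda.components_ holds unnormalized pseudo-counts; normalize so each topic's term distribution sums to 1
topic_term_dists = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

movies_vis_data = pyLDAvis.prepare(topic_term_dists = topic_term_dists,                  # The distribution of terms in each topic
                                   doc_topic_dists = topics_vecs,                        # The distribution of topics in each document
                                   doc_lengths = processed_texts['length'],              # The length of each document
                                   vocab = vec.get_feature_names_out(),                  # The vocabulary or dictionary
                                   # Corpus count of each term (vec.vocabulary_ maps terms to column indices, not counts)
                                   term_frequency = np.asarray(X.sum(axis=0)).ravel(),
                                   mds='tsne'                                            # The latent space method
                                   )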

In [18]:
pyLDAvis.display(movies_vis_data)

LDA does not label the topics it finds, but you can infer what each one is about from its most relevant words.

Notice how topic 4 relates to problems with *intimacy*, with important words like `relationship`, `sex`, `sexual`.

For our dataset, you can now check whether the estimated topics of the psychologist and the patient align!

In [19]:
# Create wide format DataFrame with patient and psychologist information
wide_df = pd.DataFrame(index=mental_health.index)

# Add original text columns
wide_df['patient_comment'] = mental_health['patient_comment']
wide_df['psyc_response'] = mental_health['psyc_response']

# Add processed text
patient_data = processed_texts[processed_texts['label'] == 'patient']
psyc_data = processed_texts[processed_texts['label'] == 'psyc']

# Add processed texts
wide_df['patient_processed_text'] = patient_data.set_index('index')['text']
wide_df['psyc_processed_text'] = psyc_data.set_index('index')['text']

# Add topic information
wide_df['patient_topic'] = patient_data.set_index('index')['topic']
wide_df['psyc_topic'] = psyc_data.set_index('index')['topic']

# Add topic probabilities
wide_df['patient_topic_probs'] = patient_data.set_index('index')['topic_probs']
wide_df['psyc_topic_probs'] = psyc_data.set_index('index')['topic_probs']

# Add text length
wide_df['patient_text_length'] = patient_data.set_index('index')['length']
wide_df['psyc_text_length'] = psyc_data.set_index('index')['length']

# Check if topics align between patient and psychologist
wide_df['topics_match'] = wide_df['patient_topic'] == wide_df['psyc_topic']

In [20]:
wide_df.head(2)

Unnamed: 0,patient_comment,psyc_response,patient_processed_text,psyc_processed_text,patient_topic,psyc_topic,patient_topic_probs,psyc_topic_probs,patient_text_length,psyc_text_length,topics_match
3296,"My boyfriend is in Ireland for 11 days, and I ...",It sounds like you and your boyfriend are very...,boyfriend day emotional wreck,sound like boyfriend close typically spend tim...,5,4,"[0.04083956812543484, 0.040229907313569714, 0....","[0.005947458295939941, 0.005977912364450864, 0...",4,33,False
2011,I have so many issues to address. I have a his...,I think this is a very common question that pe...,issue address history sexual abuse breast canc...,think common question people counseling lot an...,5,2,"[0.007735944645487772, 0.007854819609671722, 0...","[0.0017249217399812869, 0.9930907611200566, 0....",25,117,False


In [21]:
wide_df['topics_match'].value_counts()

topics_match
False    11
True      9
Name: count, dtype: int64

The topics match in 9 of 20 pairs, which is not a strong agreement. However, it is worth looking deeper into why the topics disconnect: some topics may simply be more common than others, and assigning only the single most likely topic hides how close the runner-up was. You can use the topic probabilities to check both.
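
One possible way to look deeper, sketched here under assumptions: count how often each topic is the dominant one, and compare the full probability vectors of each pair with cosine similarity instead of comparing only the top topic. `toy_wide` is a hypothetical stand-in for `wide_df` with three topics and made-up probabilities, and `cosine_similarity` is an illustrative helper.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for wide_df: three patient/psychologist pairs, 3 topics
toy_wide = pd.DataFrame({
    "patient_topic_probs": [[0.8, 0.1, 0.1], [0.45, 0.5, 0.05], [0.1, 0.1, 0.8]],
    "psyc_topic_probs":    [[0.7, 0.2, 0.1], [0.5, 0.45, 0.05], [0.8, 0.1, 0.1]],
})


def cosine_similarity(p, q):
    """Cosine similarity between two topic-probability vectors (1.0 = same direction)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))


# Hard assignment: the single most likely topic (1-based, as in the notebook)
toy_wide["patient_topic"] = [int(np.argmax(p)) + 1 for p in toy_wide["patient_topic_probs"]]
toy_wide["psyc_topic"] = [int(np.argmax(p)) + 1 for p in toy_wide["psyc_topic_probs"]]
toy_wide["topics_match"] = toy_wide["patient_topic"] == toy_wide["psyc_topic"]

# Soft comparison: similarity of the full probability vectors
toy_wide["topic_similarity"] = [
    cosine_similarity(p, q)
    for p, q in zip(toy_wide["patient_topic_probs"], toy_wide["psyc_topic_probs"])
]

# How often each topic is the dominant one across all texts
topic_counts = pd.concat([toy_wide["patient_topic"], toy_wide["psyc_topic"]]).value_counts()

print(toy_wide[["patient_topic", "psyc_topic", "topics_match", "topic_similarity"]])
print(topic_counts)
```

In this toy data the second pair counts as a mismatch under the hard assignment even though the two probability vectors are nearly identical, while the third pair is a genuine disagreement. The same columns can be computed on `wide_df` from `patient_topic_probs` and `psyc_topic_probs`.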

For a paper that uses LDA, you can look at [Michalopoulos and Xue (2021)](https://academic.oup.com/qje/article/136/4/1993/6124640?searchresult=1&login=true).