### MIE 2024 Submission - Topic Modelling 

#### J. Hastings and M. Wosny


We evaluate the behaviour of topic modelling approaches on qualitative interview datasets with relevance to healthcare research. 



#### Datasets

We consider two datasets, of which only the first is loaded in the cell below (loading of the second is commented out):
- Newcastle perspectives on flu and vaccination
- Extracts from our own study of clinician perspectives on digitalisation in hospitals



In [14]:
from datasets import load_dataset

ds1 = load_dataset(
    "text",
    data_dir="~/Work/Python/hastingslab-aitools/topic-modelling/datasets/health-promotion-interviews-newcastle/txt/",
    sample_by="document",  # alternative: sample_by="paragraph"
    split="train",
)
#ds2 = load_dataset("text", data_dir="datasets/cedi-study-extracts/*.txt", sample_by="paragraph",split='train')
ds1

Dataset({
    features: ['text'],
    num_rows: 12
})

### LSA approach

We implement latent semantic analysis (LSA) as an exemplar of traditional topic modelling approaches.

In [5]:
import spacy
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Load spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Preprocessed (lemmatised, stopword-filtered) documents
processed_docs_all = []

# Custom stopwords: interview fillers and other uninformative tokens,
# removed in addition to spaCy's built-in English stopword list
custom_stopwords = set(['probably', 'simply', 'exactly', 'bit', 'tell', 'okay', 'datum', 'stadt', 'yeah','look','um','like','sure'])

for text in ds1['text']:
    # Remove custom stopwords from the raw, whitespace-split text
    text_filtered = ' '.join([word for word in text.split() if word.lower() not in custom_stopwords])

    # Lemmatise; keep alphabetic, non-stopword tokens whose lemma is not a custom stopword
    processed_doc = ' '.join([token.lemma_ for token in nlp(text_filtered) if not token.is_stop
                              and token.is_alpha and token.lemma_ not in custom_stopwords])
    processed_docs_all.append(processed_doc)
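
One caveat with the whitespace-based filter above: `text.split()` leaves punctuation attached to tokens, so a filler such as "Yeah," does not match the entry `yeah` at that stage and is only caught, if at all, by the later lemma-level check. The following is a minimal, standard-library-only sketch of a punctuation-tolerant filter. The helper name `strip_fillers`, the constant `FILLERS` (a subset of the custom list) and the example sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# A subset of the custom filler list used in the preprocessing cell
FILLERS = {"probably", "simply", "exactly", "bit", "tell", "okay",
           "yeah", "look", "um", "like", "sure"}


def strip_fillers(text, fillers=FILLERS):
    """Drop filler words, ignoring case and attached punctuation."""
    kept = []
    for word in text.split():
        # Compare on the bare word: strip leading/trailing non-letters
        bare = re.sub(r"^[^A-Za-z]+|[^A-Za-z]+$", "", word).lower()
        if bare not in fillers:
            kept.append(word)
    return " ".join(kept)


# Hypothetical example sentence, not taken from the interview data
print(strip_fillers("Yeah, I think the vaccine is, um, probably fine."))
# prints: I think the vaccine is, fine.
```

Tokens that are kept retain their original punctuation, which the subsequent `token.is_alpha` check in the spaCy step would discard anyway.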

In [None]:
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

tfidf_data = vectorizer.fit_transform(processed_docs_all)

# Define the number of topics (or components in LSA)
n_topics = 10

# Create a Truncated SVD (LSA) model
lsa = TruncatedSVD(n_components=n_topics, random_state=42)

# Fit the model to the TF-IDF data
lsa.fit(tfidf_data)

# Transform the TF-IDF data using the fitted LSA model
lsa_topic_matrix = lsa.transform(tfidf_data)

# Number of top words per topic
num_top_words = 20
    
# Print the top 20 words for each topic
feature_names = np.array(vectorizer.get_feature_names_out())
for topic_idx, topic in enumerate(lsa.components_):
    top_words_idx = topic.argsort()[:-num_top_words-1:-1]
    top_words = feature_names[top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}\n")

Topic 1: participant, yes, agree, child, right, think, know, fine, flu, speak, price, timothy, vaccine, great, kind, thing, year, sorry, pretty, absolutely

Topic 2: timothy, price, flu, know, child, think, vaccine, kind, right, thing, sort, year, question, old, little, want, absolutely, vaccination, information, benefit

Topic 3: price, timothy, agree, study, yes, right, interview, understand, information, research, copy, quotation, collect, researcher, receive, excellent, finding, statement, sheet, permission

Topic 4: yes, agree, think, study, know, read, statement, kind, consent, sheet, thing, ask, vaccine, series, start, question, actually, interesting, information, flu

Topic 5: agree, study, think, know, kind, flu, thing, read, people, statement, information, ask, sort, series, consent, sheet, come, symptom, maybe, obviously

Topic 6: right, child, agree, old, speak, younger, year, way, school, ahead, yes, absolutely, start, laughter, help, age, reception, little, dive, benefit
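
The ranking step in the cell above, `topic.argsort()[:-num_top_words-1:-1]`, lists the terms with the largest positive loadings in each component. Because SVD components are signed, terms with strongly negative loadings never appear in these lists. The sketch below contrasts the two rankings in plain Python. The loadings are made-up numbers, not values from this analysis, and the function names are illustrative.

```python
def top_terms(weights, terms, k):
    """Terms with the k largest loadings, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [terms[i] for i in order[:k]]


def top_terms_abs(weights, terms, k):
    """Terms with the k largest absolute loadings, highest magnitude first."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [terms[i] for i in order[:k]]


# Hypothetical loadings for a single component
terms = ["flu", "vaccine", "consent", "school"]
loadings = [0.6, 0.3, -0.7, 0.1]

print(top_terms(loadings, terms, 2))      # ['flu', 'vaccine']
print(top_terms_abs(loadings, terms, 2))  # ['consent', 'flu']
```

Which ranking is more useful depends on whether negatively loaded terms are considered part of a topic's interpretation.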


### Semantic Signal Separation

We compare the above topics with those produced by Semantic Signal Separation, an approach based on a small language model (here the `all-MiniLM-L12-v2` sentence encoder).


In [8]:
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10, encoder="all-MiniLM-L12-v2")
model.fit(processed_docs_all)

model.print_topics()


### OpenAI's GPT-3.5-based approach

We use GPT-3.5 via OpenAI's API with system prompt `You are a helpful research assistant` and user prompt `Please tell me what themes are mentioned in the following interview transcript. Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview.  Please categorise themes based on the COM-B model (capability, opportunity, and motivation), and identify barriers and facilitators.`
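
In the Chat Completions message format, a system prompt is sent with the role `system` and the task prompt with the role `user`. The following is a minimal sketch of assembling that message list for one transcript, without calling the API. The helper `build_messages`, the constant names and the example transcript are illustrative assumptions rather than the notebook's own code.

```python
SYSTEM_PROMPT = "You are a helpful research assistant."
USER_PROMPT = (
    "Please tell me what themes are mentioned in the following interview "
    "transcript. Themes are short words or phrases that capture something "
    "important about the research topic and purpose as revealed in the "
    "interview. Please categorise themes based on the COM-B model "
    "(capability, opportunity, and motivation), and identify barriers and "
    "facilitators."
)


def build_messages(transcript):
    """Return the system and user messages for a single transcript."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f'{USER_PROMPT}\n\nTranscript: "{transcript}"'},
    ]


# Hypothetical one-line transcript, used only to show the structure
messages = build_messages("I always get the flu jab for my children.")
print([m["role"] for m in messages])  # ['system', 'user']
```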



In [None]:
%env OPENAI_API_KEY=xxx

In [23]:
import os
#import wandb
from openai import OpenAI
client = OpenAI()

#gpt_assistant_prompt = "You are a " + input ("Who should I be, as I answer your prompt?") 
#gpt_user_prompt = input ("What prompt do you want me to do?") 

#wandb.init()
#prediction_table = wandb.Table(columns=["Prompt", "Response", "Tokens", "Max Tokens", "Frequency Penalty", "Temperature"])

identified_themes = []

for doc in ds1['text']: 
    
    gpt_assistant_prompt = "You are a helpful research assistant. "
    gpt_user_prompt = '''Please tell me what themes are mentioned in the following interview transcript. 
                      Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview. 
                      Please categorise themes based on the COM-B model (capability, opportunity, and motivation), and identify barriers and facilitators.'''
    # Separate the instructions from the transcript so the two do not run together
    gpt_user_prompt = gpt_user_prompt + f'\n\nTranscript: " {doc} "'
    
    gpt_prompt = gpt_assistant_prompt, gpt_user_prompt
    #print(gpt_prompt)
    
    # The system prompt must be sent with the "system" role, not "assistant"
    message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": gpt_user_prompt}]
    temperature=0.2
    max_tokens=1000
    frequency_penalty=0.0
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",#model="gpt-4",
        messages = message,
        temperature=temperature,
        max_tokens=max_tokens,
        frequency_penalty=frequency_penalty
    )
    print(".")
    response_text = response.choices[0].message.content
    identified_themes.append(response_text)
    #tokens_used = response.usage.total_tokens
    #prediction_table.add_data(gpt_prompt, response_text, tokens_used, max_tokens, frequency_penalty, temperature)

#wandb.log({'predictions': prediction_table})
#wandb.finish()




.
.
.
.
.
.
.
.
.
.
.
.
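
The per-transcript loop above and the integration step that follows form a two-stage pipeline: one model call per transcript, then a single call that merges the per-transcript themes. The sketch below shows that control flow with the model call abstracted behind a callable, so it runs without an API key. The function `extract_then_integrate`, the stub `fake_ask` and the abbreviated prompts are hypothetical stand-ins, not the notebook's code.

```python
def extract_then_integrate(transcripts, ask):
    """Two-stage theme analysis.

    `ask` is any callable that maps a prompt string to a response string,
    for example a thin wrapper around a chat completion request.
    """
    # Stage 1: one request per transcript
    per_doc = [ask(f'Identify themes.\n\nTranscript: "{t}"') for t in transcripts]
    # Stage 2: one request that merges all per-transcript responses
    merged = ask("Integrate these themes:\n\n" + "\n\n".join(per_doc))
    return per_doc, merged


calls = []


def fake_ask(prompt):
    # Stand-in for a model call: records the prompt and returns a label
    calls.append(prompt)
    return f"themes-{len(calls)}"


per_doc, merged = extract_then_integrate(["a", "b", "c"], fake_ask)
print(len(calls))  # 4
print(per_doc)     # ['themes-1', 'themes-2', 'themes-3']
print(merged)      # themes-4
```

Note that the stage 2 prompt grows with the number of transcripts, since it concatenates every stage 1 response.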


In [24]:
print(identified_themes)

gpt_assistant_prompt = "You are a helpful research assistant. "
gpt_user_prompt = "Please integrate and summarise the following themes that have been identified in interview transcripts into a core set of key repeating themes, according to the COM-B model. Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview."
# Separate the instructions from the collected themes so the two do not run together
gpt_user_prompt = gpt_user_prompt + f'\n\nIdentified themes: " {" ".join(identified_themes)} "'
    
gpt_prompt = gpt_assistant_prompt, gpt_user_prompt
    
# The system prompt must be sent with the "system" role, not "assistant"
message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": gpt_user_prompt}]
temperature=0.2
max_tokens=1000
frequency_penalty=0.0
    
response = client.chat.completions.create(
    model="gpt-3.5-turbo",#model="gpt-4",
    messages = message,
    temperature=temperature,
    max_tokens=max_tokens,
    frequency_penalty=frequency_penalty
)
print(".")
response_text = response.choices[0].message.content

print(response_text)


["Themes identified in the interview transcript based on the COM-B model (capability, opportunity, and motivation) are as follows:\n\nCapability:\n1. Understanding of the flu and flu vaccine: The participant demonstrates a good understanding of the flu, its symptoms, and the flu vaccine, including how it is chosen and administered.\n2. Information-seeking behavior: The participant mentions using resources like the internet and health visitor's book for information on illnesses like the flu.\n\nOpportunity:\n1. Access to healthcare professionals: The participant discusses the process of accessing healthcare professionals like GPs and health visitors for advice on vaccinations.\n2. Barriers to vaccination: Personal circumstances, such as moving quickly and not having a new GP, are mentioned as barriers to accessing the flu vaccine for the participant's child.\n\nMotivation:\n1. Concern for children's health: The participant expresses concern for their children's health, especially the ol