### MIE 2024 Submission - Topic Modelling 

#### J. Hastings and M. Wosny


We evaluate the behaviour of topic modelling approaches on qualitative interview datasets with relevance to healthcare research. 



#### Datasets

We load two different datasets: 
- Newcastle perspectives on flu and vaccination
- Extracts from our own study of clinician perspectives on digitalisation in hospitals



In [1]:
from datasets import load_dataset

# Treat each transcript file as one document (the alternative is sample_by="paragraph")
ds1 = load_dataset(
    "text",
    data_dir="~/Work/Python/hastingslab-aitools/topic-modelling/datasets/health-promotion-interviews-newcastle/txt/",
    sample_by="document",
    split="train",
)

ds1

Dataset({
    features: ['text'],
    num_rows: 12
})

### LSA approach

We implement latent semantic analysis (LSA) as an exemplar of traditional topic modelling approaches.

In [2]:
import spacy
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt

# Load spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Custom stopwords (mostly conversational fillers), applied in addition to
# spaCy's built-in English stop list via token.is_stop below
stopwords = set(['probably', 'simply', 'exactly', 'bit', 'tell', 'okay', 'datum', 'stadt', 'yeah','look','um','like','sure'])

processed_docs_all = []

for text in ds1['text']:
    # Remove custom stopwords at the whitespace-token level before parsing
    filtered_text = ' '.join([word for word in text.split() if word.lower() not in stopwords])

    # Lemmatise, keeping only alphabetic tokens that are not stopwords
    processed_doc = ' '.join([token.lemma_ for token in nlp(filtered_text) if not token.is_stop
                              and token.is_alpha and token.lemma_ not in stopwords])
    processed_docs_all.append(processed_doc)

In [3]:
# Vectorize the text data using TF-IDF
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

tfidf_data = vectorizer.fit_transform(processed_docs_all)

# Define the number of topics (or components in LSA)
n_topics = 10

# Create a Truncated SVD (LSA) model
lsa = TruncatedSVD(n_components=n_topics, random_state=42)

# Fit the model to the TF-IDF data
lsa.fit(tfidf_data)

# Transform the TF-IDF data using the fitted LSA model
lsa_topic_matrix = lsa.transform(tfidf_data)

# Number of top words per topic
num_top_words = 20
    
# Print the top 20 words for each topic
feature_names = np.array(vectorizer.get_feature_names_out())
for topic_idx, topic in enumerate(lsa.components_):
    top_words_idx = topic.argsort()[:-num_top_words-1:-1]
    top_words = feature_names[top_words_idx]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}\n")

Topic 1: feel, come, obviously, younger, school, guess, nhs, health, kid, interesting, oh, young, speak, happen, ill, effect, letter, appointment, remember, cold

Topic 2: wife, straightforward, pretty, faith, obviously, seek, letter, beneficial, medical, website, uk, alright, important, garden, world, certainly, scientist, correct, science, text

Topic 3: son, team, specialist, daughter, health, blood, problem, visitor, today, unwell, advice, help, form, june, allergic, particularly, eat, play, laughter, necessarily

Topic 4: daughter, feel, season, jab, guess, mind, issue, definitely, book, covid, come, friend, protect, vulnerable, mainly, strain, immunization, outbreak, public, importance

Topic 5: son, specialist, team, unwell, phone, faith, available, guess, come, organize, clinic, arrange, word, medical, ah, beneficial, bear, food, reception, trust

Topic 6: practice, guess, concerned, invite, try, cost, rest, fear, speak, provide, important, mention, correct, contact, ring, wint
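
The slice `topic.argsort()[:-num_top_words-1:-1]` in the cell above lists the terms with the largest positive weights in each component. SVD components carry signed weights, so terms with large negative loadings never appear in these lists. The following is a minimal sketch of the same idiom with an optional ranking by absolute weight; the helper name `top_terms` and the toy weights are assumptions made for illustration and are not part of the analysis.

```python
import numpy as np

def top_terms(component, feature_names, k, by_abs=False):
    """Return the k terms with the largest (optionally absolute) weights."""
    scores = np.abs(component) if by_abs else component
    # argsort is ascending, so the reversed tail slice yields the k largest first
    top_idx = scores.argsort()[:-k - 1:-1]
    return feature_names[top_idx].tolist()

# Toy weights, invented for illustration only
names = np.array(["flu", "jab", "school", "nhs"])
weights = np.array([0.1, 0.9, -0.95, 0.5])

print(top_terms(weights, names, 2))               # ['jab', 'nhs']
print(top_terms(weights, names, 2, by_abs=True))  # ['school', 'jab']
```

Whether negatively weighted terms should inform the interpretation of a topic is a judgment call; the sketch only makes the difference between the two rankings visible.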

### Semantic Signal Separation

We compare the above topics with the results of a small language-model-based approach, Semantic Signal Separation.


In [6]:
from turftopic import SemanticSignalSeparation

model = SemanticSignalSeparation(10, encoder="all-MiniLM-L12-v2")
model.fit(processed_docs_all)

model.print_topics()

Output()

### OpenAI's GPT-3.5 based approach

We use GPT-3.5 via OpenAI's API with the system prompt `You are a helpful research assistant` and the user prompt `Please tell me what themes are mentioned in the following interview transcript. Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview. Please categorise themes based on the COM-B model (capability, opportunity, and motivation), and identify barriers and facilitators.` The transcript is appended to the user prompt, and each transcript is sent in a separate request.
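
The Chat Completions API distinguishes `system`, `user` and `assistant` message roles, so the two prompts described above map onto a two-message list. The following is a minimal sketch of assembling that list for one transcript without calling the API; the helper name `build_messages` and the blank-line separator before the transcript are assumptions for illustration.

```python
SYSTEM_PROMPT = "You are a helpful research assistant."
USER_PROMPT = (
    "Please tell me what themes are mentioned in the following interview transcript. "
    "Themes are short words or phrases that capture something important about the "
    "research topic and purpose as revealed in the interview. "
    "Please categorise themes based on the COM-B model (capability, opportunity, "
    "and motivation), and identify barriers and facilitators."
)

def build_messages(transcript: str) -> list:
    """Assemble the chat messages for one transcript (hypothetical helper)."""
    # Separate the instructions from the transcript with a blank line
    user_content = f'{USER_PROMPT}\n\nTranscript: "{transcript}"'
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

messages = build_messages("I had the jab last year.")
print(messages[0]["role"], messages[1]["role"])
```

Keeping the prompt assembly in one function makes it easier to reuse the same instructions for every transcript and to inspect exactly what is sent.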



In [None]:
%env OPENAI_API_KEY=xxx

In [8]:
import os
#import wandb
from openai import OpenAI
client = OpenAI()

#gpt_assistant_prompt = "You are a " + input ("Who should I be, as I answer your prompt?") 
#gpt_user_prompt = input ("What prompt do you want me to do?") 

#wandb.init()
#prediction_table = wandb.Table(columns=["Prompt", "Response", "Tokens", "Max Tokens", "Frequency Penalty", "Temperature"])

identified_themes = []

for doc in ds1['text']: 
    
    gpt_assistant_prompt = "You are a helpful research assistant. "
    gpt_user_prompt = '''Please tell me what themes are mentioned in the following interview transcript. 
                      Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview. 
                      Please categorise themes based on the COM-B model (capability, opportunity, and motivation), and identify barriers and facilitators.'''
    # Separate the instructions from the transcript with a blank line
    gpt_user_prompt = gpt_user_prompt + f'\n\nTranscript: " {doc} "'
    
    gpt_prompt = gpt_assistant_prompt, gpt_user_prompt
    #print(gpt_prompt)
    
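    # The instruction to act as a research assistant is a system prompt, so it is sent with the "system" role
    message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": gpt_user_prompt}]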
    temperature=0.2
    max_tokens=1000
    frequency_penalty=0.0
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",#model="gpt-4",
        messages = message,
        temperature=temperature,
        max_tokens=max_tokens,
        frequency_penalty=frequency_penalty
    )
    print(".")
    response_text = response.choices[0].message.content
    identified_themes.append(response_text)
    #tokens_used = response.usage.total_tokens
    #prediction_table.add_data(gpt_prompt, response_text, tokens_used, max_tokens, frequency_penalty, temperature)

#wandb.log({'predictions': prediction_table})
#wandb.finish()




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


.
.
.
.
.
.
.
.
.
.
.
.
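
Each entry of `identified_themes` is free text returned by the model. Assuming a response follows the layout the prompt asks for (a line per COM-B category or barrier/facilitator heading ending in a colon, followed by dash-prefixed bullets), it can be split into a dictionary for later comparison. This is a sketch under that assumption: the function name `parse_themes` and the heading list are hypothetical, and responses that deviate from the layout would need manual review.

```python
HEADINGS = ("Capability", "Opportunity", "Motivation", "Barriers", "Facilitators")

def parse_themes(response_text):
    """Group dash-prefixed bullets under the most recent recognised heading."""
    themes = {}
    current = None
    for raw_line in response_text.splitlines():
        line = raw_line.strip()
        if line.endswith(":") and line[:-1] in HEADINGS:
            # Start collecting bullets for a new category
            current = line[:-1]
            themes.setdefault(current, [])
        elif line.startswith("- ") and current is not None:
            themes[current].append(line[2:].strip())
        # Any other line (preamble, blank line) is ignored
    return themes

example = (
    "Themes identified in the interview transcript based on the COM-B model:\n\n"
    "Capability:\n- Understanding of the flu vaccine and its benefits\n\n"
    "Opportunity:\n- Access to healthcare professionals for vaccination advice\n"
)
print(parse_themes(example))
```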


In [9]:
print(identified_themes)

gpt_assistant_prompt = "You are a helpful research assistant. "
gpt_user_prompt = "Please integrate and summarise the following themes that have been identified in interview transcripts into a core set of key repeating themes, according to the COM-B model. Themes are short words or phrases that capture something important about the research topic and purpose as revealed in the interview."
# Separate the instructions from the per-transcript themes with a blank line
gpt_user_prompt = gpt_user_prompt + f'\n\nIdentified themes: " {" ".join(identified_themes)} "'
    
gpt_prompt = gpt_assistant_prompt, gpt_user_prompt
    
# As above, the research-assistant instruction is sent with the "system" role
message=[{"role": "system", "content": gpt_assistant_prompt}, {"role": "user", "content": gpt_user_prompt}]
temperature=0.2
max_tokens=1000
frequency_penalty=0.0
    
response = client.chat.completions.create(
    model="gpt-3.5-turbo",#model="gpt-4",
    messages = message,
    temperature=temperature,
    max_tokens=max_tokens,
    frequency_penalty=frequency_penalty
)
print(".")
response_text = response.choices[0].message.content

print(response_text)


["Themes identified in the interview transcript based on the COM-B model:\n\nCapability:\n- Understanding of the flu vaccine and its benefits\n- Knowledge about flu symptoms in children\n- Access to information sources like the internet and health visitor's resources\n\nOpportunity:\n- Access to healthcare professionals for vaccination advice\n- Challenges in accessing healthcare due to personal circumstances and moving\n- Lack of discussion with GP about flu vaccine for preschool-age child\n\nMotivation:\n- Willingness to vaccinate children based on medical advice\n- Concern for children's health, especially with underlying health conditions\n- Personal experience with vaccination due to health condition\n\nBarriers:\n- Lack of discussion with healthcare professionals about flu vaccine for preschool-age child\n- Personal circumstances affecting access to healthcare and vaccination information\n\nFacilitators:\n- Willingness to vaccinate children based on medical advice\n- Personal exp