LLM-based recategorization of responses into 13 topics. The code below was run in JupyterLab hosted on the Yale cluster Misha, not Google Colab.
This takes urge_docs.csv and outputs recategorized_13_full.csv. urge_docs.csv is a renamed copy of the output file 'URGE_new_embeds_{current_datetime}_topic_documents.csv' from clustering with the new embeddings file (sorry, the rename is confusing!).

In [None]:
#Load necessary packages
import random
import time

import numpy as np
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
#Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
random.seed(42)
np.random.seed(42)

In [None]:
#Model and tokenizer loading from local Hugging Face cache in bfloat16 precision for GPU inference.
#Model identifier and path (must match cache folder)
model_id = "meta-llama/Llama-3.3-70B-Instruct"
cache_dir = "/gpfs/radev/project/s_wang/lmb248/huggingface"

#Load tokenizer and model from local cache only
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    cache_dir=cache_dir,
    local_files_only=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    cache_dir=cache_dir,
    local_files_only=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

print("✅ Model loaded from cache successfully!")

Loading checkpoint shards:   0%|          | 0/30 [00:00<?, ?it/s]

✅ Model loaded from cache successfully!


In [None]:
#Check it's working/responding with a test prompt. Update: it works!
#Test response
test_prompt = "<s>[INST] What are the benefits of having a cat? [/INST]"

#Tokenize and move to GPU
inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")

#Generate output
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)

#Decode and print result
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("LLM response:\n", response)


LLM response:
 <s>[INST] What are the benefits of having a cat? [/INST] Cats are often considered low-maintenance pets compared to dogs, requiring less attention and exercise. They are generally easy to care for, and their independent nature means they can entertain themselves for periods of time. This makes them a great choice for busy people or those who live in small spaces. Additionally, cats are known for their affectionate personalities and can provide companionship and emotional support to their owners. They also have a calming presence, which can help reduce stress and anxiety. Overall, having a cat can
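
The `[INST] ... [/INST]` tags in the test prompt are the Llama 2 / Mistral instruction format, and the output above shows them echoed back as plain text. Llama 3 instruct checkpoints ship their own chat template with the tokenizer, so one hedged alternative is to let `apply_chat_template` build the prompt. This is a minimal sketch, not what the notebook ran: the helper name `build_chat_inputs` is hypothetical, and it assumes the tokenizer loaded earlier defines a chat template.

```python
def build_chat_inputs(tokenizer, user_text, device="cuda"):
    """Format a single user turn with the tokenizer's own chat template.

    Hypothetical helper, not part of the original notebook.
    """
    messages = [{"role": "user", "content": user_text}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the assistant header so the model answers
        tokenize=True,
        return_dict=True,            # return input_ids and attention_mask together
        return_tensors="pt",
    )
    return inputs.to(device)

# Usage (assumes `model` and `tokenizer` are already loaded as above):
# inputs = build_chat_inputs(tokenizer, "What are the benefits of having a cat?")
# outputs = model.generate(**inputs, max_new_tokens=100)
```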


In [None]:
#Load the dataset (just responses and their topic assignments)
input_path = "/gpfs/radev/project/s_wang/lmb248/working_env/urge_docs.csv"
responses_df = pd.read_csv(input_path)
responses_df = responses_df[['Document', 'Topic']]

print(responses_df.head())


                                            Document  Topic
0  I'd think it'd be better to not be alive, but ...      5
1                       self doubt, insecurity, pain      9
2         I was considering it and maybe planning it      3
3  Thoughts about meaninglessness, nothingness, o...      0
4                                I won't consider it      1


Ignore the next three cells. They were a test of how well the LLM could label the topics. We ended up keeping our own labels because the LLM did not do a good job: it missed subtleties and nuances within suicidal thinking at the label level.

In [None]:
#Ask LLM to make labels for each topic.
def generate_topic_label(model, tokenizer, topic_num, topic_responses):
    joined = "\n".join(f"- {resp}" for resp in topic_responses)
    prompt = f"""<s>[INST]
You are an expert in psychology and suicidology. Based on the following responses from a topic cluster, generate a concise, descriptive label that summarizes the dominant theme.

Responses:
{joined}

Return only the label. [/INST]"""

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )
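    #Decode only the newly generated tokens; outputs[0] also contains the prompt tokens.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    label = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    return label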

In [None]:
def generate_all_labels(responses_df, model, tokenizer):
    labels = {}
    for topic in sorted(responses_df['Topic'].unique()):
        all_responses = responses_df[responses_df['Topic'] == topic]['Document'].dropna().tolist()
        label = generate_topic_label(model, tokenizer, topic, all_responses)
        labels[topic] = label
        print(f"Topic {topic}: {label}")
    return labels

In [None]:
#Run everything
# Step 1: Generate new labels
generated_labels = generate_all_labels(responses_df, model, tokenizer)

Topic 0: <s>[INST]
You are an expert in psychology and suicidology. Based on the following responses from a topic cluster, generate a concise, descriptive label that summarizes the dominant theme.

Responses:
- Thoughts about meaninglessness, nothingness, overall numbness
- Sad thoughts about life
- N\a
- Just keep going, it's no that serious
- There's things I would miss
- I never want to think again 
- strong
- strong
- same as 8 but not as much i would probably disassociate a lot between thoughts
- Little to none
- Not at all
- Imagining it in not a lot of detail but enough detail to be able to kinda see it happening
- Very little thoughts just about not waking up. 
- Probably normal other thoughts about my day
- Myself dead
- Disappearing
- Hopeless
- hardly any
- Nothing
- Nothing
- Stuff
- Gay
- Usually third person and less realistic, but repetitive
- Slightly more vivid, now I imagne what it might feel like, still in third person 
- I might be having slight thoughts, but it is 

LLM recategorization! 1: Define a dictionary of our labels, followed by definition examples. 2: Build LLM prompt to recategorize responses into 13 predefined themes using themes_dict. 3: Inference loop, send each response to the model, receive predicted categories, store results.

In [None]:
#Themes dictionary for our 13 topics
themes_dict = {
    0: ("Normal life thoughts", "business as usual, normal thoughts, neutral thoughts, no overt distress"),
    1: ("Existential reflections", "things happen for a reason, thinking about life and death, life choices"),
    2: ("Family and friends reactions", "family and friends grieving, funeral, leaving pets, consequences of suicide on others, traumatizing other people, family and friends feelings"),
    3: ("Thinking of a plan", "ways to die, how to die, notes, planning, suicide note, planning how, how, say goodbye"),
    4: ("Ruminating about death", "what happens when you die, what would happen, no plan or intent to die, thinking about death"),
    5: ("Passive suicidal ideation", "want to die but won't do it, no intent, wishing death, thinking of suicide, passive thoughts of suicide"),
    6: ("Depressed, sadness, exhaustion", "tired, life sucks, depressed, sad, sadness, fatigue, tears, sleeping"),
    7: ("Images of self-harm and suicide methods", "bloody images, blood, pills, scratching, cutting, jumping, visualizing suicide"),
    8: ("Happy life thoughts", "happy, having fun, socializing, love life, enjoying life, feeling good"),
    9: ("Low self-esteem and self-hatred", "low self esteem, hating oneself, caring about what others think, waste of space"),
    10: ("Active suicidal ideation", "want to die now, need to die, need to kill, suicide now, suicidal intent"),
    11: ("Not sure, I don't know", "n/a, na, NA, I don't know, idk"),
    12: ("No thoughts of suicide, none", "no thoughts, none, no images, no other thoughts, not thinking of suicide")
}


In [None]:
#LLM prompt generation (this was a bit of trial and error but ultimately the below prompt worked best)
def generate_theme_prompt(response, themes_dict):
    themes_text = "\n".join([
        f"{idx}: {theme} - {description}"
        for idx, (theme, description) in themes_dict.items()
    ])

    prompt = f"""<s>[INST]
You are categorizing a participant's response into one of the following 13 themes related to thoughts of suicidal urge.

Themes:
{themes_text}

Response:
"{response}"

Respond with the single number (0–12) corresponding to the most appropriate theme. No other text. [/INST]"""

    return prompt
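
The prompt hard-codes both the theme count ("13") and the id range ("0–12"), so it can drift out of sync if `themes_dict` is edited. A minimal sketch of one way to derive both from the dictionary is shown here; `build_theme_prompt` and `example_themes` are hypothetical names used for illustration only, and the wording otherwise mirrors the prompt above.

```python
def build_theme_prompt(response, themes):
    """Build the categorization prompt, deriving the theme count and id range from `themes`."""
    ids = sorted(themes)
    themes_text = "\n".join(
        f"{idx}: {themes[idx][0]} - {themes[idx][1]}" for idx in ids
    )
    return (
        "<s>[INST]\n"
        "You are categorizing a participant's response into one of the following "
        f"{len(ids)} themes related to thoughts of suicidal urge.\n\n"
        f"Themes:\n{themes_text}\n\n"
        f'Response:\n"{response}"\n\n'
        f"Respond with the single number ({ids[0]}–{ids[-1]}) corresponding to the "
        "most appropriate theme. No other text. [/INST]"
    )

# Hypothetical three-theme dictionary, shortened for illustration only
example_themes = {
    0: ("Normal life thoughts", "business as usual, neutral thoughts"),
    1: ("Existential reflections", "thinking about life and death"),
    2: ("Not sure, I don't know", "n/a, idk"),
}

prompt = build_theme_prompt("I don't know", example_themes)
print(prompt)
```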


In [None]:
#Send categorization prompt to the LLM, decode its output, and extract the predicted theme number (0-12).
#If the model answer isn't valid, default back to the original topic assignment.
def get_model_categorization(model, tokenizer, prompt, current_theme):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
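        max_new_tokens=5,
        do_sample=False, #greedy (deterministic) decoding for reproducibility; temperature is not used in this mode
        pad_token_id=tokenizer.eos_token_id
    )
    #decode only the newly generated tokens into a string; outputs[0] also contains the prompt,
    #and scanning the prompt could match a number inside the participant's own response
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    print("Decoded model output:\n", response)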

    valid_responses = [str(i) for i in range(0, 13)]
    for word in response.split():
        if word in valid_responses:
            return int(word)
    return current_theme #if no other category is found, default back to original assignment.
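
Matching whole whitespace-separated words means an answer such as `3.` is not recognised and silently falls back to the original topic. A minimal sketch of a more tolerant parser follows, assuming it is given only the newly generated text and not the echoed prompt. `parse_theme_id` is a hypothetical name and the regular expression is one possible choice, not the notebook's method.

```python
import re

# Match a standalone one- or two-digit integer; digits glued to letters or
# to other digits (e.g. "123", "a3") are not matched.
_NUMBER = re.compile(r"(?<!\w)(\d{1,2})(?!\w)")

def parse_theme_id(generated_text, fallback, n_themes=13):
    """Return the first integer in [0, n_themes) found in the text, else `fallback`."""
    for match in _NUMBER.finditer(generated_text):
        value = int(match.group(1))
        if 0 <= value < n_themes:
            return value
    return fallback

print(parse_theme_id("3.", fallback=5))        # 3
print(parse_theme_id("Theme 10", fallback=5))  # 10
print(parse_theme_id("13", fallback=5))        # 5 (out of range, falls back)
```

Returning the fallback for out-of-range or missing numbers keeps the same behaviour as the original function: the response keeps its original topic assignment.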


In [None]:
#Recategorize a subset (or all) responses with the LLM.
#Takes a DataFrame with columns 'Document' (text) and 'Topic' (current label).
#Optionally limits to the first `sample_size` rows.
#For each row: builds a categorization prompt, queries the LLM, and records the new theme id.
#Prints occasional progress and total runtime.
#Returns a copy of the processed slice with a new column: 'new_theme'.
def recategorize_responses(responses_df, model, tokenizer, themes_dict, sample_size):
    """
    Recategorizes responses using the LLM.

    :param responses_df: The dataframe containing responses.
    :param model: The LLM model.
    :param tokenizer: The tokenizer for the model.
    :param themes_dict: Dictionary of theme descriptions.
    :param sample_size: Number of rows to process (None = full dataset).
    """
    start_time = time.time()
    new_categories = []

    #Select sample
    df_to_process = responses_df.head(sample_size).copy() if sample_size else responses_df.copy()

    for idx, row in df_to_process.iterrows():
        prompt = generate_theme_prompt(row['Document'].strip(), themes_dict)
        new_category = get_model_categorization(model, tokenizer, prompt, row['Topic'])
        new_categories.append(new_category)

        if idx % 200 == 0:
            print(f"Processed response {idx + 1}: {new_category}") #progress log

    df_to_process['new_theme'] = new_categories
    print(f"Time taken for {len(df_to_process)} responses: {time.time() - start_time:.2f} seconds")

    return df_to_process


In [None]:
#Run
#sample_size=None processes the full dataset; pass an integer to test on the first rows only
sample_output = recategorize_responses(responses_df, model, tokenizer, themes_dict, sample_size=None)
sample_output.to_csv("/gpfs/radev/project/s_wang/lmb248/working_env/recategorized_13_full.csv", index=False)
print("Results saved.")

Decoded model output:
 <s>[INST]
You are categorizing a participant's response into one of the following 13 themes related to thoughts of suicide or self-harm.

Themes:
0: Normal life thoughts - business as usual, normal thoughts, neutral thoughts, no overt distress
1: Existential reflections - things happen for a reason, thinking about life and death, life choices
2: Family and friends reactions - family and friends grieving, funeral, leaving pets, consequences of suicide on others, traumatizing other people, family and friends feelings
3: Thinking of a plan - ways to die, how to die, notes, planning, suicide note, planning how, how, say goodbye
4: Ruminating about death - what happens when you die, what would happen, no plan or intent to die, thinking about death
5: Passive suicidal ideation - want to die but won't do it, no intent, wishing death, thinking of suicide, passive thoughts of suicide
6: Depressed, sadness, exhaustion - tired, life sucks, depressed, sad, sadness, fatigue, 