# Analysis of Survey Responses using LLMs and Clustering

This notebook contains code that can label and cluster survey responses.

It is designed to be run in the Federated Data Platform (FDP).

This notebook explains the process of annotating responses from one survey location. However, the code can easily be edited to handle more datasets.

## Method

### 1 - Setup and Labelling

First, we will import the required modules.

In [None]:
from foundry.transforms import Dataset
import pandas as pd
import numpy as np
from sklearn.cluster import AffinityPropagation
import json
import re
import ast

from language_model_service_api.languagemodelservice_api_embeddings_v3 import GenericEmbeddingsRequest
from language_model_service_api.languagemodelservice_api_completion_v3 import GptChatCompletionRequest
from language_model_service_api.languagemodelservice_api import ChatMessage, ChatMessageRole

from palantir_models.models import GenericEmbeddingModel
from palantir_models.models import OpenAiGptChatLanguageModel

from string import Template

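class PromptTemplate(Template):
    # Fill {name} placeholders only. With an empty delimiter the default
    # pattern matches the empty string first, so nothing is ever substituted.
    delimiter = "{"
    pattern = r"""
    \{(?:
      (?P<escaped>(?!))                    |  # no escape sequence is needed
      (?P<named>[_a-zA-Z][_a-zA-Z0-9]*)\}  |  # a {name} placeholder
      (?P<braced>(?!))                     |  # unused, but Template requires the group
      (?P<invalid>)                           # any other "{" is left untouched
    )
    """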

Next, we will import the required data.

In [None]:
responses = Dataset.get("<<Name of Dataset>>").read_table()

# Optional - load in multiple data sources.
responses_2 = Dataset.get("<<Name of Dataset>>").read_table()


Let's define a prompt which can analyse a response and output the sentiment, alongside ideas for change.

In [None]:
prompt = PromptTemplate("""
You identify 'Sentiment' and 'Ideas for Change' from a survey response. 

You are to note on 'Ideas for Change' and 'Sentiment', alongside giving your 'Reasoning'.

Each of these fields is a free text field. Sentiment must be one of 'Positive', 'Neutral', 'Poor' or 'Very Poor'.

Your response must be valid json.

Here are two examples of key topics extracted from survey responses.

Example 1:
{example_1_prompt}

The output would be:
{example_1_output}

Example 2:
{example_2_prompt}

The output would be:
{example_2_output}

Here is the survey response you must label with topics:

Survey Response:
{survey_response}

Within the 'Reasoning' section of your json output, think step by step to reason each topic. This area is for you to outline your thoughts, consider if you have missed any key topics, and check your responses.
Then, score on 'Sentiment' and 'Ideas for Change'. Your response must be valid json.

Your response must be in the format:

{"Reasoning":"xx","Ideas for Change": "xx", "Sentiment": "xx"}

Do not put any personal identifiable information in your response. Include nothing else in your output.
""")

We can also define a function that, given a row of the dataset, outputs a readable example that can be included in a prompt.

In [None]:
def row_to_prompt(row,
                  row_columns):
    
    string = ''
    
    for column, response in zip(row_columns, row[row_columns]):
        new_string = f'{column.replace("_"," ")}\n{response}\n\n'
        string += new_string
        
    return string
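
To make the output format concrete, here is a small sketch that mirrors the logic of `row_to_prompt` with a plain dictionary standing in for a DataFrame row. The helper name `row_to_prompt_from_dict`, the column names and the answers are all invented for illustration and are not part of the notebook.

```python
def row_to_prompt_from_dict(row, row_columns):
    """Same formatting as row_to_prompt, but for a plain dict (hypothetical stand-in)."""
    parts = []
    for column in row_columns:
        # Column names become readable headings: underscores turn into spaces.
        parts.append(f'{column.replace("_", " ")}\n{row[column]}\n\n')
    return "".join(parts)


example_row = {
    "What_works_well": "Friendly staff",
    "What_could_improve": "Shorter waiting times",
    "Internal_id": 17,  # not selected below, so it never reaches the prompt
}
example_text = row_to_prompt_from_dict(
    example_row, ["What_works_well", "What_could_improve"]
)
print(example_text)
```

Only the selected columns appear in the text, which is a simple way to keep identifiers and other unneeded fields out of the prompt.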

Currently, structured outputs are not supported within the FDP.

We will use a cleaning prompt to clean json outputs if needed. That is defined here.

In [None]:
prompt_cleaning = """Clean this json file: {json_file}. Only output a json file, include no other text. The json should be of the format: {"Reasoning":"xx","Ideas for Change": "xx","Sentiment": "xx"}"""

Here, we select which columns we want to include in the examples given to our prompt.

In [None]:
columns_to_include_for_prompt = ["<<Column 1 Name>>", "<<Column 2 Name>>", ...]

Next, we define the examples we want to feed into our prompt.

In [None]:
example_1_prompt = row_to_prompt(responses.loc[0],columns_to_include_for_prompt)
# You will need to give the example output
example_1_output = {"Reasoning":"xx","Ideas for Change": "xx", "Sentiment": "xx"}

example_2_prompt = row_to_prompt(responses.loc[1],columns_to_include_for_prompt)
# You will need to give the example output
example_2_output = {"Reasoning":"xx","Ideas for Change": "xx", "Sentiment": "xx"}
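
The labelling prompt mixes `{name}` placeholders with literal JSON braces, and the example outputs are Python dictionaries. The sketch below shows one possible way to fill only the named placeholders and to render a dictionary as valid JSON before it is inserted. `fill_placeholders`, `demo_template` and `demo_output` are hypothetical names used for illustration; this is not the notebook's own substitution mechanism.

```python
import json
import re


def fill_placeholders(template, **values):
    """Replace {name} placeholders that have a value; leave all other braces alone."""
    def replace(match):
        name = match.group(1)
        # Unknown names are kept verbatim rather than raising an error.
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", replace, template)


demo_template = 'Example output:\n{example_output}\n\nFormat: {"Sentiment": "xx"}'
demo_output = {"Ideas for Change": "More parking", "Sentiment": "Poor"}

# json.dumps renders the dict with double quotes, so the example shown is valid JSON.
filled = fill_placeholders(demo_template, example_output=json.dumps(demo_output))
print(filled)
```

Passing the dictionary directly would insert its Python representation, which uses single quotes and is therefore not valid JSON.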

Now, let us call an LLM and label each response.

In [None]:
model = OpenAiGptChatLanguageModel.get("GPT_4o")

LLM_responses = [example_1_output,example_2_output]
length = len(responses)

print("Starting...")
for i in range(2, length):
    
    if i%100==0:
        print(f"Completed {i}/{length}")
    
    formatted_prompt = prompt.safe_substitute(
        example_1_prompt = example_1_prompt,
        example_1_output = example_1_output,
        example_2_prompt = example_2_prompt,
        example_2_output = example_2_output,
        survey_response = row_to_prompt(responses.iloc[i],columns_to_include_for_prompt))
    
    response = model.create_chat_completion(GptChatCompletionRequest([ChatMessage(ChatMessageRole.USER, formatted_prompt)]))
    raw_content = response.choices[0].message.content  # Extract the raw content
    
    try: 
        json_content = json.loads(raw_content)
        LLM_responses.append(json_content)
    
    except:
        
        try:
            raw_content = raw_content.replace("```","")
            raw_content = raw_content.replace("json","")
            raw_content = raw_content.replace("\n","")
            json_content = json.loads(raw_content)
            LLM_responses.append(json_content)
        except:
            
            try:
                # Ask the model to repair the malformed JSON it produced.
                formatted_cleaning_prompt = prompt_cleaning.replace("{json_file}", raw_content)
                response = model.create_chat_completion(GptChatCompletionRequest([ChatMessage(ChatMessageRole.USER, formatted_cleaning_prompt)]))
                raw_content = response.choices[0].message.content  # Extract the raw content
                json_content = json.loads(raw_content)
                LLM_responses.append(json_content)
            except:
                LLM_responses.append(f"Error: {raw_content}")
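
The first two fallback steps in the loop above (parse directly, then strip code fences and retry) can also be written as a single helper. This is a minimal sketch under the assumption that the reply contains at most one JSON object; `parse_llm_json` is a hypothetical name, not a function from the notebook or from any library.

```python
import json
import re

FENCE = "`" * 3  # the Markdown code-fence marker that models often wrap JSON in


def parse_llm_json(raw):
    """Return the JSON object found in an LLM reply, or None if parsing fails."""
    # Remove fence markers (with or without a "json" tag) but keep the payload.
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()
    # Keep only the text between the first "{" and the last "}".
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = (
    "Here you go:\n" + FENCE + "json\n"
    '{"Reasoning": "xx", "Ideas for Change": "More parking", "Sentiment": "Poor"}\n'
    + FENCE
)
print(parse_llm_json(reply))
print(parse_llm_json("no json here"))
```

Returning `None` on failure gives the caller one clear signal for when to fall back to the cleaning prompt, and it avoids deleting the word "json" from inside the response text.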

Now, we can save these responses.

In [None]:
output = responses
output["labels"] = LLM_responses

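# Rows whose labelling failed are stored as "Error: ..." strings, so guard against non-dict labels.
ideas_for_change = [x.get("Ideas for Change") if isinstance(x, dict) else None for x in output["labels"]]
sentiments = [x.get("Sentiment") if isinstance(x, dict) else None for x in output["labels"]]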

output["Ideas_for_Change_LLM"] = ideas_for_change
output["Sentiment"] = sentiments

final_labels = Dataset.get("<<Dataset Name>>")
final_labels.write_table(output)

### 2 - LLM Topic Modelling

We can continue to use an LLM for topic modelling. This has been shown to be effective, for example in this [paper](https://arxiv.org/pdf/2403.16248).

In [None]:
all_ideas = ideas_for_change # you can add together ideas for change from other documents here.

We are going to use an LLM to label the 'Ideas for Change' with topics.

In [None]:
extraction_prompt = PromptTemplate("""
You are categorizing "Ideas for Change" from survey responses into topics.

- The "Current Topics" are: {topics_}
- The "Ideas for Change" are: {ideas_for_change}

Instructions:
1. Assign the "Ideas for Change" to one or more topics from the "Current Topics".
2. If the appropriate topic does not exist in "Current Topics", create a new topic and include it in your response. Give your new topic a descriptive name.
3. If the "Ideas for Change" applies to two topics, you may assign it to two topics.
4. Do not assign more than two topics.

Example:
If the "Current Topics" are ["Improved Communication", "Employee Wellbeing"] and the "Idea for Change" is "Install solar panels on the roof and improve communication between staff members." your response might be:
["Sustainability", "Improved Communication"]

Your response must strictly be a Python list containing the relevant topics, such as:
["Topic A", "Topic B"]

Do not include any additional text in your response.
Do not label a topic as a "New Topic" - instead give it a descriptive name.
""")

cleaning_prompt = PromptTemplate("""
Clean this list of topics to be a flat python list. The list is: {response}.

Your response must strictly be a flat python list containing the relevant topics, such as:
["Topic A", "Topic B"]
""")

We next call our LLM and apply topics to each idea.

In [None]:
current_topics = ["Improved Communication"]

all_responses = []
failed_responses = []

model = OpenAiGptChatLanguageModel.get("GPT_4o")

def add_topics(all_responses, topic_array, current_topics):
    
    all_responses.append(topic_array)
    
    for topic in topic_array:
        if topic not in current_topics:
            current_topics.append(topic)
            
    return all_responses, topic_array, current_topics

def contains_nests(arr):
    for element in arr:
        if isinstance(element, list):
            return True
    return False
    
def flatten(array):
    result = []
    for item in array:
        if isinstance(item, list):
            result.extend(flatten(item))  # Recursively flatten nested lists
        else:
            result.append(item)
    return result

for i in range(len(all_ideas)):
    
    formatted_prompt = extraction_prompt.safe_substitute(
            topics_ = current_topics,
            ideas_for_change = all_ideas[i])
    
    response = model.create_chat_completion(GptChatCompletionRequest([ChatMessage(ChatMessageRole.USER, formatted_prompt)]))
    raw_content = response.choices[0].message.content  # Extract the raw content
    try:
        topic_array = ast.literal_eval(raw_content)
        if contains_nests(topic_array):
            topic_array = flatten(topic_array)
        all_responses, topic_array, current_topics = add_topics(all_responses, topic_array, current_topics)
    except:
        try:
            formatted_cleaning_prompt = cleaning_prompt.safe_substitute(response=raw_content)
            response = model.create_chat_completion(GptChatCompletionRequest([ChatMessage(ChatMessageRole.USER, formatted_cleaning_prompt)]))
            raw_content = response.choices[0].message.content  # Extract the raw content
            topic_array = ast.literal_eval(raw_content)
            
            if contains_nests(topic_array):
                topic_array = flatten(topic_array)
            all_responses, topic_array, current_topics = add_topics(all_responses, topic_array, current_topics)
            
        except:
            print("Failed at",i)
            print("with", raw_content)
            all_responses.append("ERROR: ")
            failed_responses.append([i, raw_content]) 
    
    if i%100==0:
        print(f"Completed {i}/{len(all_ideas)}")
        print(f"Current Topics: {current_topics}")
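
The parsing steps in the loop above (evaluate the reply as a Python literal, flatten nested lists, then register the topics) can be sketched as one standalone helper. `parse_topic_list` and its `max_topics` parameter are hypothetical; the limit of two mirrors the instruction in the extraction prompt and is an assumption about how strictly you want to enforce it.

```python
import ast


def parse_topic_list(raw, max_topics=2):
    """Parse an LLM reply that should be a Python list of topic names."""
    try:
        value = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return None  # the caller can fall back to the cleaning prompt
    if not isinstance(value, list):
        return None

    def walk(items):
        # Flatten nested lists and keep only non-empty strings.
        for item in items:
            if isinstance(item, list):
                yield from walk(item)
            elif isinstance(item, str) and item.strip():
                yield item.strip()

    topics = list(dict.fromkeys(walk(value)))  # de-duplicate, keeping order
    return topics[:max_topics]


print(parse_topic_list('[["Sustainability"], "Improved Communication", "Sustainability"]'))
print(parse_topic_list("Sure! The topics are listed below."))
```

Catching only `ValueError` and `SyntaxError` keeps unrelated problems, such as a failed model call, visible instead of hiding them behind a bare `except`.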

Some cleaning is needed before saving: every idea must be a string and every response must be a list, so anything else is replaced with a placeholder.

In [None]:
cleaned_ideas = []
cleaned_responses = []

for idea in all_ideas:
    if isinstance(idea, str):
        cleaned_ideas.append(idea)
    else:
        cleaned_ideas.append("None")
    

for response in all_responses:
    if isinstance(response, list):
        cleaned_responses.append(response)
    else:
        cleaned_responses.append(["ERROR"])

Let's look at how many topics there are, and save our first lot of data.

In [None]:
print("The number of topics is:", len(current_topics))

df = pd.DataFrame({"Ideas for Change": cleaned_ideas, "Topics": cleaned_responses})

print("Modelling failed at:", [f[0] for f in failed_responses])

llm_topic_modeling = Dataset.get("llm_topic_modeling")
llm_topic_modeling.write_table(df)

We can now explore the topics in a bit more detail, and save another DataFrame.

In [None]:
topic_df = {"Topic": [],
           "Count": [],
           "Cluster": []}

for topic_id in current_topics:
    
    count = 0
    ideas = []
    
    for topic_labels, idea in zip(all_responses,all_ideas):
        if isinstance(topic_labels, list) and topic_id in topic_labels:
            count += 1
            ideas.append(idea)
        
    topic_df["Topic"].append(topic_id)
    topic_df["Count"].append(count)
    topic_df["Cluster"].append(ideas)

pd_topic_df = pd.DataFrame(topic_df).sort_values(by="Count",ascending=False)

llm_topic_counts = Dataset.get("llm_topic_counts")
llm_topic_counts.write_table(pd_topic_df)
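
If only the counts are needed, the nested loop above can be replaced by a single pass with `collections.Counter`. The sketch below uses a small invented list in place of `all_responses`; as in the loop above, each topic is counted once per response and failed rows (stored as strings) are skipped.

```python
from collections import Counter

labelled = [
    ["Improved Communication", "Sustainability"],
    ["Improved Communication"],
    "ERROR: ",  # a failed row, stored as a string rather than a list
    ["Employee Wellbeing"],
]

# Count each topic once per response, skipping rows that are not lists.
topic_counts = Counter(
    topic
    for labels in labelled
    if isinstance(labels, list)
    for topic in set(labels)
)
print(topic_counts.most_common())
```

This visits every response once, whereas the nested loop visits every response once per topic, which matters when the topic list grows large.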

### 3 - Clustering

Using our labelled responses, we can use embedding and clustering to analyse the responses.

We expect that some responses will contain multiple ideas for change. For that reason, we analyse both the full 'Ideas for Change' responses and the same responses split on commas.

In [None]:
all_ideas = ideas_for_change # you can add together ideas for change from other documents here.

split_all_ideas = []
cleaned_all_ideas = []

for idea in all_ideas:
    if not pd.isna(idea):
        cleaned_all_ideas.append(idea)
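        # Strip whitespace so that pieces such as " better signage" are stored cleanly.
        split_all_ideas.extend(part.strip() for part in idea.split(","))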
    else:
        split_all_ideas.append("None")
        cleaned_all_ideas.append("None")
        

We also have to do some cleaning of the data.

In [None]:
all_ideas = [string for string in cleaned_all_ideas if string !=""]  
split_all_ideas = [string for string in split_all_ideas if string !=""]  
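
The splitting and cleaning steps can be illustrated on a few invented strings. `split_ideas` is a hypothetical helper; unlike the cells above, it skips missing values rather than replacing them with the string "None", which is one possible design choice and not the notebook's behaviour.

```python
def split_ideas(ideas, separator=","):
    """Split each idea on the separator, trim whitespace and drop empty pieces."""
    pieces = []
    for idea in ideas:
        if not isinstance(idea, str):
            continue  # skip missing values such as None or NaN
        pieces.extend(part.strip() for part in idea.split(separator) if part.strip())
    return pieces


sample = ["More parking, better signage", "Longer opening hours,", None]
print(split_ideas(sample))
```

Splitting on commas is a heuristic: a single idea that happens to contain a comma is cut into two pieces, which is one reason the notebook analyses both the split and the unsplit versions.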

We split the data into 'batches' - smaller chunks which can be passed into models.

In [None]:
split_batch_locations = []
batch_locations = []

for i in range(len(split_all_ideas)):
    if i % 100 == 0 and i != 0:
        split_batch_locations.append(i)
        
for i in range(len(all_ideas)):
    if i % 100 == 0 and i != 0:
        batch_locations.append(i)
        
split_batches = np.split(split_all_ideas, split_batch_locations)
batches = np.split(all_ideas, batch_locations)
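
The same batching can be done with plain list slicing, which keeps each batch as a list of Python strings. This is a minimal sketch with a hypothetical helper `make_batches`; the batch size of 100 matches the cell above.

```python
def make_batches(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


demo_items = [f"idea {i}" for i in range(250)]
demo_batches = list(make_batches(demo_items))
print([len(batch) for batch in demo_batches])  # [100, 100, 50]
```

Every item appears in exactly one batch and the original order is preserved, so embeddings computed per batch can be concatenated back in the same order as the ideas.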

Now, we calculate 'embeddings'. This converts our strings of text into vectors. These vectors are essentially coordinates, and we can use these coordinates to see how similar different strings of text are and group them into clusters.

In [None]:
embeddings = []
split_embeddings = []

model = GenericEmbeddingModel.get("text-embedding-ada-002")
print("Calculating Embeddings for split ideas...")

for n, batch in enumerate(split_batches):
    try:
        response = model.create_embeddings(GenericEmbeddingsRequest(inputs=list(batch)))
        split_embeddings.extend((response.embeddings))
    except:
        print("Error at",n)

In [None]:
print("Calculating Embeddings for ideas...")
for n, batch in enumerate(batches):
    try:
        response = model.create_embeddings(GenericEmbeddingsRequest(inputs=list(batch)))
        embeddings.extend((response.embeddings))
    except:
        print("Error at",n)
        
print("Done")
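
To illustrate how embedding vectors can express similarity, the sketch below computes cosine similarity between made-up three-dimensional vectors. Real embeddings have far more dimensions, the numbers here are invented, and cosine similarity is only one common measure; the clustering step in this notebook may rely on a different one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for the embeddings of three short texts.
parking = [0.9, 0.1, 0.0]
car_park = [0.8, 0.2, 0.1]
canteen = [0.0, 0.2, 0.9]

similar = cosine_similarity(parking, car_park)
different = cosine_similarity(parking, canteen)
print(f"parking vs car park: {similar:.2f}")
print(f"parking vs canteen:  {different:.2f}")
```

Texts about related subjects end up with vectors pointing in similar directions, which is what allows them to be grouped into clusters.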

Now that we have embeddings, we want to do some 'clustering'. 

For this example we use a technique called [Affinity Propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation).

This has the advantage of working in non-flat geometry, and you do not need to state how many clusters you want it to generate.

However, **we have not tested any other methods yet**.

You can view other methods [here](https://scikit-learn.org/stable/modules/clustering.html).

In [None]:
embeddings = np.array(embeddings)
clusters = AffinityPropagation(random_state=0).fit(embeddings)

split_embeddings = np.array(split_embeddings)
split_clusters = AffinityPropagation(random_state=0).fit(split_embeddings)

We can look at how many clusters there are:

In [None]:
# Cluster labels start at 0, so the number of clusters is the largest label plus one.
print("When ideas are split:")
print(f"Number of Clusters: {split_clusters.labels_.max() + 1}")

print("When ideas are not split:")
print(f"Number of Clusters: {clusters.labels_.max() + 1}")

We can wrap the clustering step in a single function, which also asks an LLM to write a short summary of what appears in each cluster.

In [None]:
cluster_summary_prompt = """
Given a list of topics, summarise the topics into a short, single sentence summary of approximately 10 words.

You must only include information from the list of topics.
Do not assume any additional information.

The topics are: {topics}
"""

def cluster_analysis(embeddings, data, prompt):
    
    sentence_summaries = []
    cluster_sizes = []
    
    model = OpenAiGptChatLanguageModel.get("GPT_4o")
    
    embeddings = np.array(embeddings)
    clusters = AffinityPropagation(random_state=1).fit(embeddings)
    
    # Cluster labels start at 0, so the number of clusters is the largest label plus one.
    n_clusters = clusters.labels_.max() + 1
    print(f"Number of Clusters: {n_clusters}")
    
    for cluster_id in range(n_clusters):
        
        topic_cluster = []
        
        for label, string in zip(clusters.labels_, data):
            if label == cluster_id:
                topic_cluster.append(string)
        
        if cluster_id % 10 == 0:
            print(f"Analysing Cluster {cluster_id}/{n_clusters}")
        response = model.create_chat_completion(
            GptChatCompletionRequest([ChatMessage(ChatMessageRole.USER, prompt.format(topics=topic_cluster))]))

        cluster_sizes.append(len(topic_cluster))
        sentence_summaries.append(response.choices[0].message.content)
    
    print("Done")
    return cluster_sizes, sentence_summaries, clusters

We can call the function like this:

In [None]:
cluster_sizes, cluster_summaries, cluster_ids = cluster_analysis(embeddings,
                                                all_ideas,
                                                cluster_summary_prompt,
                                                )

split_cluster_sizes, split_cluster_summaries, split_cluster_ids = cluster_analysis(split_embeddings,
                                                split_all_ideas,
                                                cluster_summary_prompt,
                                                )

Now, let's save our results to a DataFrame for further analysis later.

In [None]:
topics = []

# Include the final cluster: labels run from 0 to labels_.max() inclusive.
for cluster_id in range(cluster_ids.labels_.max() + 1):
    topic_cluster = []
    for label, string in zip(cluster_ids.labels_, all_ideas):
        if label == cluster_id:
            topic_cluster.append(string)
    topics.append(topic_cluster)

print(len(topics))
LLM_df = pd.DataFrame({"Counts": cluster_sizes, "Summaries": cluster_summaries, "Cluster Content":topics})

# Sort the clusters by size, largest to smallest.
LLM_df_sorted = LLM_df.sort_values(by = "Counts", ascending = False)

clusters_summary = Dataset.get("clusters_summary")
clusters_summary.write_table(LLM_df_sorted)

In [None]:
split_topics = []

for cluster_id in range(split_cluster_ids.labels_.max() + 1):
    topic_cluster = []
    # The split clusters were fitted on the split ideas, so pair the labels with those.
    for label, string in zip(split_cluster_ids.labels_, split_all_ideas):
        if label == cluster_id:
            topic_cluster.append(string)
    split_topics.append(topic_cluster)

LLM_df_split = pd.DataFrame({"Counts": split_cluster_sizes, "Summaries": split_cluster_summaries, "Cluster Content":split_topics})

# Sort the clusters by size, largest to smallest.
LLM_df_sorted_split = LLM_df_split.sort_values(by = "Counts", ascending = False)

clusters_summary_split = Dataset.get("clusters_summary_split")
clusters_summary_split.write_table(LLM_df_sorted_split)
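
The cells above collect the members of each cluster by scanning all labels once per cluster. As an alternative sketch, the same grouping can be done in a single pass with a dictionary. `group_by_label` is a hypothetical helper and the labels and texts are invented; in the notebook the labels would come from the fitted clustering object.

```python
from collections import defaultdict


def group_by_label(labels, texts):
    """Map each cluster label to the list of texts assigned to it."""
    groups = defaultdict(list)
    for label, text in zip(labels, texts):
        groups[int(label)].append(text)
    return dict(groups)


demo_labels = [0, 1, 0, 2, 1]
demo_texts = [
    "More parking",
    "Better food",
    "Cheaper parking",
    "Flexible hours",
    "Healthier canteen",
]
groups = group_by_label(demo_labels, demo_texts)
print(len(groups))  # 3
print(groups[0])    # ['More parking', 'Cheaper parking']
```

Because labels start at 0, the number of clusters here equals the largest label plus one, which is why three groups are found for labels 0 to 2.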

This concludes the example in this notebook. 