# Topic Explorer

In [1]:
import json
import numpy as np
import pandas as pd
import re
from datasets import load_dataset
import os

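file_path = "./out"
os.makedirs(file_path, exist_ok=True)  # make sure the output directory exists before anything is saved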

## Specific Category

We began by summarizing the English prompts from the 06/2024 - 08/2024 leaderboard dataset into specific categories.

### Data Processing

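From the conversations, we kept those tagged as English, concatenated the user turns of each conversation into a single prompt, removed duplicate prompts, and dropped prompts of 8,000 characters or more.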

In [None]:
df = pd.read_parquet("hf://datasets/lmarena-ai/arena-explorer-preference-100k/data/arena-explorer-preference-100k.parquet")

In [4]:
english_df = df[df['language'] == 'English'].copy()
english_df['Prompt'] = english_df.apply(lambda x: ' '.join([i['content'] for i in x['conversation_a'] if i['role'] == 'user']), axis=1)
english_df = english_df.drop_duplicates(subset='Prompt')
english_df = english_df[english_df['Prompt'].str.len() < 8000]
doc = english_df['Prompt']

In [6]:
len(doc)

48586

### Create Embedding

Computing embeddings is resource-intensive, so we recommend precomputing and saving them. 

In [None]:
import openai
from bertopic.backend import OpenAIBackend

client = openai.OpenAI()
embedding_model = OpenAIBackend(client, "text-embedding-3-large", batch_size=1000)
embeddings = embedding_model.embed(doc, verbose=True)

# save embeddings
np.save(f"{file_path}/embeddings.npy", embeddings)

49it [08:09,  9.98s/it]
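
The precompute-and-save advice above can be wrapped in a small caching helper. This is a minimal sketch, not part of the original pipeline: `load_or_compute_embeddings` and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for the real `embedding_model.embed` call so the example runs without an API key.

```python
import os
import tempfile

import numpy as np


def load_or_compute_embeddings(path, texts, embed_fn):
    """Return cached embeddings from `path`; compute and save them on a miss."""
    if os.path.exists(path):
        cached = np.load(path)
        # Guard against a stale cache built from a different document list.
        if len(cached) == len(texts):
            return cached
    embeddings = np.asarray(embed_fn(texts))
    np.save(path, embeddings)
    return embeddings


# Hypothetical stand-in for the real embedding call (illustration only).
calls = []


def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t)), 1.0] for t in texts]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.npy")
first = load_or_compute_embeddings(cache_path, ["a", "bb", "ccc"], fake_embed)
second = load_or_compute_embeddings(cache_path, ["a", "bb", "ccc"], fake_embed)
print(len(calls))  # the embedding function only ran for the first call
```

The length check is a deliberately cheap safeguard; it would not notice a document list of the same size with different contents.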


We saved the embeddings used to create Arena Explorer, which can be quickly loaded here for demonstration purposes.

In [None]:
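# load saved embeddings
# np.load cannot open hf:// URLs itself, so download the file to the local cache first
from huggingface_hub import hf_hub_download
embeddings = np.load(hf_hub_download("lmarena-ai/arena-explorer-preference-100k",
                                     "data/embeddings.npy", repo_type="dataset"))
len(embeddings)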

### BERTopic Topic Clustering

We performed topic clustering on the English conversation dataset using BERTopic.

In [8]:
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
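import openai
from bertopic.backend import OpenAIBackend  # needed here if the optional embedding cell above was skipped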

client = openai.OpenAI()
embedding_model = OpenAIBackend(client, "text-embedding-3-large", batch_size=1000)
umap_model = UMAP(n_neighbors=20, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 3))

topic_model = BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        
        top_n_words=10,
        verbose=True,
        calculate_probabilities=True
)

topics, probs = topic_model.fit_transform(doc, embeddings=embeddings)

2025-02-06 18:20:54,490 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-02-06 18:22:15,998 - BERTopic - Dimensionality - Completed ✓
2025-02-06 18:22:16,000 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-02-06 18:24:37,849 - BERTopic - Cluster - Completed ✓
2025-02-06 18:24:37,861 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-02-06 18:24:45,930 - BERTopic - Representation - Completed ✓


In [9]:
# number of clusters
len(topic_model.get_topic_info())

268

In [10]:
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,24155,-1_self_data_like_time,"[self, data, like, time, use, new, 10, make, w...",[System: you are a math assistant\nUser: Here ...
1,0,758,0_div_const_class_react,"[div, const, class, react, button, div div, cl...","[请你分析：import React, {useEffect, useState } fro..."
2,1,654,1_spinal_margins_spine_blood,"[spinal, margins, spine, blood, safety margins...",[My girlfriend Lisa who is 22 had a cut in her...
3,2,597,2_moon_quantum_earth_magnetic,"[moon, quantum, earth, magnetic, wave, mass, v...",[Act as an expert in space exploration history...
4,3,525,3_linux_bash_ubuntu_armin,"[linux, bash, ubuntu, armin, command, sed, fil...",[how to groub list by service name and sort by...


Before reducing outliers, we sampled up to 20 example prompts from each identified cluster. Candidates were prompts whose HDBSCAN assignment probability (the likelihood that a prompt belongs to its cluster) was at or above the 80th percentile across all documents, that is, the top 20%. For readability, we excluded very short prompts (< 5 words) and preferred prompts of at most 100 words, relaxing that cap in steps of 50 words only when a cluster had too few candidates.

In [11]:
from collections import defaultdict

sampled_prompts = defaultdict(list)
topic_info = topic_model.get_topic_info()
doc_info = topic_model.get_document_info(doc)

for topic_id in topic_info['Topic'][1:]:
    filtered_docs = doc_info[(doc_info['Topic'] == topic_id) & 
                             (doc_info['Probability'] >= doc_info['Probability'].quantile(0.8)) &
                             (doc_info['Document'].str.split().str.len() >= 5)]

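    # Start with prompts of at most 100 words; if fewer than 20 remain, relax
    # the cap in steps of 50 words until 20 are available (or all are kept).
    word_counts = filtered_docs['Document'].str.split().str.len()
    cap = 100
    res = filtered_docs[word_counts <= cap]
    while len(res) < min(20, len(filtered_docs)):
        cap += 50
        res = filtered_docs[word_counts <= cap]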
    
    sampled_docs = res.sample(n=min(20, 
                            len(res)),
                            random_state=42,
                            replace=False)
    
    sampled_prompts[topic_id] = sampled_docs['Document'].tolist()
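
To make the selection rule concrete, here is a toy illustration of the probability and word-count filters. The five prompts and their probabilities are invented for the example; only the column names (`Document`, `Topic`, `Probability`) follow what `get_document_info` returns in the cell above.

```python
import pandas as pd

# Invented toy data with the same columns the pipeline filters on.
toy = pd.DataFrame({
    "Document": [
        "short one",                               # 2 words: too short
        "how do I mount a partition in ubuntu",    # 8 words
        "write a bash script that renames files",  # 7 words, low probability
        "explain sed in one line please",          # 6 words
        "what is the capital of france today",     # 7 words, low probability
    ],
    "Topic": [3, 3, 3, 3, 3],
    "Probability": [1.0, 1.0, 0.4, 1.0, 0.3],
})

# 80th percentile of the assignment probability over all documents.
threshold = toy["Probability"].quantile(0.8)
word_counts = toy["Document"].str.split().str.len()

kept = toy[
    (toy["Probability"] >= threshold)
    & (word_counts >= 5)
    & (word_counts <= 100)
]
print(kept["Document"].tolist())
```

Only the two high-probability prompts with at least five words survive; the two-word prompt is dropped even though its probability is 1.0.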

In [14]:
sampled_prompts[0]

['can you write a nodejs typescript class that is initialized with a dir name , and implement a method to cleanup files \nthat receives a list of absolute paths of files to keep and a list of extensions to include ( other extensions are ignored and files are not deleted )  or maybe globpatterns to exclude . ',
 "If I have this response in ts:\n  if (adType === AdType.FBStory) {\n    return {\n      type: AdType.FBStory,\n      title: results.title,\n      body: results.body\n    } as FBStoryAd;\n  } else { // AdType.InstagramPost\n    return {\n      type: AdType.InstagramPost,\n      title: results.title\n    } as InstagramPostAd;\n  }\n\nBut I'd like for the kets using results to not be included if they are null, so for example if results.title doesn't exist it should not be included. How can I do that? Can I some how do it on each line?",
 'Extract the keys of the typescript type type and put them in a ruby list of symbols using the %i approach.\n\nexport interface TradeDataFilterPa

In [15]:
import pickle 

with open(f"{file_path}/example_prompts.pkl", 'wb') as f:
    pickle.dump(sampled_prompts, f)

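Next, we reassigned outlier documents (topic -1) to existing topics, first using the c-TF-IDF strategy and then the topic-distribution strategy, and updated the topic representations accordingly.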

In [28]:
new_topics = topic_model.reduce_outliers(list(doc), topics , strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(list(doc), new_topics, strategy="distributions")
topic_model.update_topics(doc, topics=new_topics)

100%|██████████| 19/19 [00:11<00:00,  1.62it/s]
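
One quick way to see the effect of this step is to compare the share of documents in topic -1 before and after. In this run the outlier topic shrinks from 24,155 to 187 of 48,586 documents, according to the two `get_topic_info()` previews. The helper below is a hypothetical sketch on a toy list; in the notebook it could be applied to `topics` and `new_topics`.

```python
from collections import Counter


def outlier_share(topic_ids):
    """Fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topic_ids)
    return counts.get(-1, 0) / len(topic_ids)


# Toy assignments before and after outlier reduction (invented values).
before = [-1, -1, 0, 1]
after = [0, -1, 0, 1]
print(outlier_share(before), outlier_share(after))
```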


In [30]:
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,187,-1_what_threadautoarchiveduration_autoarchived...,"[what, threadautoarchiveduration, autoarchived...",[System: you are a math assistant\nUser: Here ...
1,0,1036,0_const_div_id_class,"[const, div, id, class, text, button, type, er...","[请你分析：import React, {useEffect, useState } fro..."
2,1,780,1_her_patient_blood_spinal,"[her, patient, blood, spinal, leg, right, marg...",[My girlfriend Lisa who is 22 had a cut in her...
3,2,877,2_earth_magnetic_moon_wave,"[earth, magnetic, moon, wave, theory, charge, ...",[Act as an expert in space exploration history...
4,3,759,3_ubuntu_file_command_linux,"[ubuntu, file, command, linux, echo, files, ba...",[how to groub list by service name and sort by...


In [31]:
# save the model for future analysis
topic_model.save(
    path=f"{file_path}/model",
    serialization="safetensors",
    save_ctfidf=True
)

### Summarize Category Names

For each cluster, we used GPT-4o (`gpt-4o`) to assign a category name and a short description based on the selected example prompts.

In [33]:
def summarize_topic(prompts):
    input_text = "Based on the sampled prompts below, extract a short but highly descriptive \
                  topic label of at most 5 words and a short description of this category in \
                  two sentences:\n\n" + "\n\n".join(prompts)
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You help summarize the category of the given prompts. \
              Make sure it is in the following format: The topic of doc is: '...'. Description: '...'."},
            {"role": "user", "content": input_text}
        ],
        temperature=0
    )

    return response.choices[0].message.content

summaries = {}
for topic_id, prompts in sampled_prompts.items():
    summary = summarize_topic(prompts)
    summaries[topic_id] = summary

In [35]:
def extract_category(summary):
    try:
        return re.search(r"is: '(.*?)'", summary).group(1)
    except AttributeError:
        try:
            return re.search(r"'(.*?)'. ", summary).group(1)
        except AttributeError:
            print(f"Regex failed for: {summary}")
            return None
def extract_description(summary):
    try:
        return re.search(r"Description: '(.*?)'", summary).group(1)
    except AttributeError:
        try:
            # Greedy match: a lazy (.*?) with nothing after it always matches the empty string
            return re.search(r"Description: (.*)", summary).group(1)
        except AttributeError:
            print(f"Regex failed for: {summary}")
            return None
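
As an illustration of the parsing step, the sketch below extracts both fields with one pattern and tolerates apostrophes inside the description. It is an alternative under stated assumptions, not the notebook's method: `parse_summary` and the sample response are hypothetical, and the pattern assumes the model followed the requested "The topic of doc is: '...'. Description: '...'." format.

```python
import re

# Lazy match for the topic, greedy match for the description so that
# apostrophes inside the description do not cut it short.
SUMMARY_PATTERN = re.compile(
    r"is:?\s*'(?P<topic>.+?)'\.\s*Description:\s*'(?P<description>.+)'",
    re.DOTALL,
)


def parse_summary(summary):
    """Return (topic, description), or (None, None) if the format is off."""
    match = SUMMARY_PATTERN.search(summary)
    if match is None:
        return None, None
    return match.group("topic"), match.group("description")


# Invented response in the requested format.
sample = (
    "The topic of doc is: 'Linux Shell Troubleshooting'. "
    "Description: 'Prompts about the user's shell. They cover sed and bash.'."
)
topic, description = parse_summary(sample)
print(topic)
print(description)
```

Returning `(None, None)` instead of raising keeps the behavior close to the functions above, which also fall back to `None` when nothing matches.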

In [41]:
summaries[-1] = "The topic of doc is: 'Miscellaneous Categories'. Description: 'They are outliers in the topic modeling process'."
summaries_df = pd.DataFrame(list(summaries.items()), columns=['Topic', 'Summary'])
summaries_df['Category'] = summaries_df['Summary'].apply(extract_category)
summaries_df['Description'] = summaries_df['Summary'].apply(extract_description)

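# Note: topic_info was captured before outlier reduction, so Count holds the pre-reduction cluster sizes
topic_info_modified = topic_info[['Topic', 'Count']]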
summaries_df = summaries_df.merge(topic_info_modified, on='Topic')[['Topic', 'Category', 'Description', 'Count']]
summaries_df['Percentage'] = summaries_df['Count'] / summaries_df['Count'].sum()
summaries_df['Example Prompt'] = summaries_df.apply(lambda x: sampled_prompts[x.Topic], axis=1)
summaries_df['Example Prompt'] = summaries_df['Example Prompt'].str.join('|||')

In [44]:
summaries_df.head()

Unnamed: 0,Topic,Category,Description,Count,Percentage,Example Prompt
0,0,Web Development and Programming,This document contains various prompts related...,758,0.015601,can you write a nodejs typescript class that i...
1,1,Medical Diagnosis and Treatment Queries,This document contains a series of medical-rel...,654,0.013461,My girlfriend Lisa who is 22 had a cut in her ...
2,2,Physics and Astronomy Concepts,This category encompasses a range of prompts r...,597,0.012287,Is liquid core in earth stop rotating|||write ...
3,3,Linux and Unix Troubleshooting,This category encompasses various troubleshoot...,525,0.010806,how to mount a partition in ubuntu 22.04 using...
4,4,Counting Letters in Words,This category involves prompts that ask for th...,486,0.010003,How many r are there in strawberry|||how many ...


In [45]:
# save if needed
summaries_df.to_csv(f"{file_path}/narrow_categories.csv", index=False)

## Broad Category

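We then ran topic clustering again, this time on the name and description of each specific category (excluding the outlier category), and summarized the resulting clusters into broad categories. In the run shown here, the 267 specific categories were grouped into 10 broad categories. The summarization process was almost identical to the one above.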

In [47]:
from bertopic.backend import OpenAIBackend

broad_doc = list(summaries_df['Category'] + ': ' + summaries_df['Description'])
broad_doc.pop() # not considering outliers

# Create embeddings
client = openai.OpenAI()
embedding_model = OpenAIBackend(client, "text-embedding-3-large")
embeddings = embedding_model.embed(broad_doc)

# BERTopic
umap_model = UMAP(n_neighbors=13, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 3))
topic_model= BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,

        top_n_words=3,
        verbose=True
)

topics, probs = topic_model.fit_transform(broad_doc, embeddings=embeddings)

# Reduce all outliers
new_topics = topic_model.reduce_outliers(broad_doc, topics , strategy="c-tf-idf", threshold=0.1)
new_topics = topic_model.reduce_outliers(broad_doc, new_topics, strategy="distributions")
topic_model.update_topics(broad_doc, topics=new_topics)

2025-02-06 18:56:26,299 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-02-06 18:56:26,801 - BERTopic - Dimensionality - Completed ✓
2025-02-06 18:56:26,801 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-02-06 18:56:26,809 - BERTopic - Cluster - Completed ✓
2025-02-06 18:56:26,812 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-02-06 18:56:26,844 - BERTopic - Representation - Completed ✓
100%|██████████| 1/1 [00:00<00:00, 546.49it/s]


In [53]:
len(topic_model.get_topic_info())

10

In [52]:
topic_model.get_topic_info().head()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,60,0_the_puzzles_and_to,"[the, puzzles, and, to, solving, of, involves,...",[Calculus Problems and Solutions: This categor...
1,1,36,1_the_of_and_prompts,"[the, of, and, prompts, explore, historical, t...",[Taiwan and Hong Kong Geopolitical Issues: The...
2,2,30,2_and_document_as_contains,"[and, document, as, contains, code, like, prog...",[Django and Database Models: The document cont...
3,3,22,3_and_the_business_strategies,"[and, the, business, strategies, of, for, on, ...",[Business Strategy and Sector Analysis: The pr...
4,4,19,4_python_and_encompasses_programming,"[python, and, encompasses, programming, catego...",[Python File Manipulation Scripts: This catego...


In [None]:
# Summarize category names
def summarize_topic(prompts):
    input_text = "Based on the topic names, extract a short but highly descriptive and concrete \
                  label of at most 2 words:\n\n" + "\n\n".join(prompts)
    client = openai.OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You help summarize the topic of the given fine grained \
             categories in the following format: The topic is '...'."},
            {"role": "user", "content": input_text}
        ],
        temperature=1
    )

    return response.choices[0].message.content

broad_topic_info = topic_model.get_topic_info()
broad_doc_info = topic_model.get_document_info(broad_doc)
summaries = {}

for topic_id in broad_topic_info['Topic']:
    docs = list(broad_doc_info[broad_doc_info['Topic'] == topic_id]['Document'])
    names = [re.search(r"(.*?): ", x).group(1) for x in docs]
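    # Pass the list of names: summarize_topic joins its argument with "\n\n",
    # which would split a pre-joined string into single characters.
    summary = summarize_topic(names)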
    summaries[topic_id] = summary

In [50]:
# Combine results 
broad_summaries_df = pd.DataFrame(list(summaries.items()), columns=['Topic', 'Summary'])
# Extract the quoted label; summaries without one become NaN and are filled with 'Other' below
broad_summaries_df['Category'] = broad_summaries_df['Summary'].str.extract(r"'(.*?)'", expand=False)
topic_info_modified = broad_topic_info[['Topic', 'Count']]
broad_summaries_df = broad_summaries_df.merge(topic_info_modified, on='Topic')[['Topic', 'Category', 'Count']]
broad_summaries_df['Percentage'] = broad_summaries_df['Count'] / broad_summaries_df['Count'].sum()
broad_summaries_df = broad_summaries_df.fillna('Other')

In [51]:
broad_summaries_df.head()

Unnamed: 0,Topic,Category,Count,Percentage
0,0,Logic Puzzles,60,0.224719
1,1,Diverse Topics,36,0.134831
2,2,Technical Programming,30,0.11236
3,3,Business Strategies,22,0.082397
4,4,Programming Techniques,19,0.071161


In [None]:
# save if needed
broad_summaries_df.to_csv(f"{file_path}/broad_categories.csv", index=False)

## Data Processing: combine broad and narrow topics

The clustering results are stored in JSON format to facilitate future visualizations.

### Combine broad, narrow category, and examples

In [54]:
# Merge categories
merged = broad_doc_info[['Topic']].merge(summaries_df, left_index=True, right_index=True)
merged = merged.merge(broad_summaries_df, left_on='Topic_x', right_on='Topic')
merged = merged[['Topic_x', 'Category_y', 'Topic_y', 'Category_x', 'Count_x', 'Percentage_x', 'Example Prompt']]
merged = merged.rename(columns={
    'Topic_x': 'broad_category_id', 
    'Category_y': 'broad_category', 
    'Topic_y': 'narrower_category_id',
    'Category_x': 'narrower_category',
    'Count_x': 'prompt_count',
    'Percentage_x': 'prompt_percentage',
    'Example Prompt': 'example_prompt'})

In [55]:
merged.head()

Unnamed: 0,broad_category_id,broad_category,narrower_category_id,narrower_category,prompt_count,prompt_percentage,example_prompt
0,2,Technical Programming,0,Web Development and Programming,758,0.015601,can you write a nodejs typescript class that i...
1,5,Creative Technologies,1,Medical Diagnosis and Treatment Queries,654,0.013461,My girlfriend Lisa who is 22 had a cut in her ...
2,1,Diverse Topics,2,Physics and Astronomy Concepts,597,0.012287,Is liquid core in earth stop rotating|||write ...
3,4,Programming Techniques,3,Linux and Unix Troubleshooting,525,0.010806,how to mount a partition in ubuntu 22.04 using...
4,0,Logic Puzzles,4,Counting Letters in Words,486,0.010003,How many r are there in strawberry|||how many ...


In [76]:
# save if needed
merged.to_csv(f"{file_path}/category_summary.csv", index=False)

For each conversation in the original dataset, assign the corresponding broad and narrow category.

In [77]:
topic_model = BERTopic.load(f"{file_path}/model")
doc_info = topic_model.get_document_info(doc)
merged = pd.read_csv(f"{file_path}/category_summary.csv")



### Label conversations with broad, narrow category

In [86]:
english_df.reset_index(inplace=True)
llm_df = english_df.merge(doc_info[['Topic']], left_index=True, right_index=True)
llm_df = llm_df.merge(merged, how='left', left_on='Topic', right_on='narrower_category_id')
llm_df = llm_df[['question_id', 'broad_category_id', 'broad_category', 
    'narrower_category_id', 'narrower_category', 'model_a', 'model_b', 'winner']]
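
Because this is a left merge, conversations whose topic has no row in the category summary (for example, the remaining outliers in topic -1) get missing values, which is also why the ID columns appear as floats such as `2.0` in the preview. The sketch below shows one way to surface such rows using the `indicator` option of `DataFrame.merge`; the two small frames are invented for illustration.

```python
import pandas as pd

# Invented stand-ins for the conversation table and the category summary.
conversations = pd.DataFrame({
    "question_id": ["q1", "q2", "q3"],
    "Topic": [0, 1, -1],
})
categories = pd.DataFrame({
    "narrower_category_id": [0, 1],
    "narrower_category": ["Web Development", "Medical Queries"],
})

labeled = conversations.merge(
    categories,
    how="left",
    left_on="Topic",
    right_on="narrower_category_id",
    indicator=True,  # adds a _merge column: "both" or "left_only"
)

# Rows that found no category keep NaN in the category columns.
unmatched = labeled[labeled["_merge"] == "left_only"]
print(unmatched["question_id"].tolist())
print(labeled["narrower_category_id"].dtype)
```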

In [87]:
llm_df.shape

(48586, 8)

In [88]:
llm_df.head()

Unnamed: 0,question_id,broad_category_id,broad_category,narrower_category_id,narrower_category,model_a,model_b,winner
0,76ce56f8ba474768bc66128c7993ccb8,2.0,Technical Programming,0.0,Web Development and Programming,mistral-large-2407,athene-70b-0725,model_b
1,e8fe7c9f75ab4e528367cc7de625c475,7.0,Interest Categories,122.0,Knowledge Cutoff Date,gemma-2-9b-it,qwen2-72b-instruct,model_b
2,772d53e5c51c487e8a293eadcd9d4855,0.0,Logic Puzzles,7.0,Comparing Decimal Numbers,mixtral-8x22b-instruct-v0.1,llama-3.1-70b-instruct,tie (bothbad)
3,6ccd7a51825249d5881ee501e06bb9ab,0.0,Logic Puzzles,80.0,Algebraic Equation Solving,mixtral-8x22b-instruct-v0.1,gemma-2-2b-it,model_a
4,463aa4efacf34f27b6a5c3f1f7417e86,3.0,Business Strategies,16.0,Business and Marketing Strategies,gemini-1.5-pro-api-0514,reka-flash-preview-20240611,model_a


In [89]:
# save if needed
llm_df.to_csv(f"{file_path}/conversations_and_category.csv", index=False)

### Create visualization 

In [90]:
# Export results in JSON format
root = {
    "name": "categories",
    "children": []
}
for broad_category, group in merged.groupby(["broad_category_id", "broad_category"]):
    parent = {
        "id": int(broad_category[0]),
        "name": broad_category[1],
        "children": []
    }
    
    for _, row in group.iterrows():
        child = {
            "id": row["narrower_category_id"],
            "name": row["narrower_category"],
            "count": row["prompt_count"],
            "percent": row['prompt_percentage'],
        }

        parent["children"].append(child)
    
    root["children"].append(parent)

json_output = json.dumps(root, indent=4)

with open(f"{file_path}/data.json", "w") as f:
    f.write(json_output)
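
For reference, the nesting written to data.json can be reproduced on a couple of rows without pandas. This is a simplified sketch: `build_tree` is a hypothetical helper, and the two rows reuse values from the `merged.head()` preview earlier in this section.

```python
import json
from itertools import groupby

# Two rows with the same column names as `merged`.
rows = [
    {"broad_category_id": 2, "broad_category": "Technical Programming",
     "narrower_category_id": 0,
     "narrower_category": "Web Development and Programming",
     "prompt_count": 758, "prompt_percentage": 0.015601},
    {"broad_category_id": 0, "broad_category": "Logic Puzzles",
     "narrower_category_id": 4,
     "narrower_category": "Counting Letters in Words",
     "prompt_count": 486, "prompt_percentage": 0.010003},
]


def broad_key(row):
    return (row["broad_category_id"], row["broad_category"])


def build_tree(rows):
    """Nest narrow categories under their broad category."""
    root = {"name": "categories", "children": []}
    # groupby needs its input sorted by the grouping key.
    for (broad_id, broad_name), group in groupby(sorted(rows, key=broad_key), key=broad_key):
        children = [
            {"id": r["narrower_category_id"], "name": r["narrower_category"],
             "count": r["prompt_count"], "percent": r["prompt_percentage"]}
            for r in group
        ]
        root["children"].append({"id": broad_id, "name": broad_name, "children": children})
    return root


tree = build_tree(rows)
print(json.dumps(tree, indent=4))
```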

In [92]:
# json file for example prompts
import pickle 

with open(f"{file_path}/example_prompts.pkl", 'rb') as f:
    sampled_prompts = pickle.load(f)

# Build one entry per narrow category with its name and example prompts
root = []
for i in sampled_prompts:
    obj = {
        "id": i,
        # Look the name up by category id rather than relying on the row index matching the id
        "name": merged.loc[merged['narrower_category_id'] == i, 'narrower_category'].iloc[0],
        "examples": sampled_prompts[i],
    }
    root.append(obj)

json_output = json.dumps(root, indent=4)
with open(f"{file_path}/examples.json", "w") as f:
    f.write(json_output)

Instructions for generating the explorer visualization:
1. Run the pipeline to produce two output files: data.json and examples.json.
2. Clone the [Arena-catalog](https://github.com/lmarena/arena-catalog) repository, which contains the necessary HTML, CSS, and JavaScript files for the explorer.
3. In explorer/index.html, replace the file paths on lines 44 & 45 with the correct paths to your generated data.json and examples.json files.
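
Before pointing the explorer at the two files, it may help to check that they describe the same narrow categories. The sketch below assumes the explorer matches examples to categories by the narrow-category `id`, which is an assumption worth verifying against the Arena-catalog code. `find_mismatches` is a hypothetical helper, shown here on small in-memory structures shaped like data.json and examples.json.

```python
def find_mismatches(data, examples):
    """Return ids found only in the tree and ids found only in the examples."""
    tree_ids = {
        child["id"]
        for parent in data["children"]
        for child in parent["children"]
    }
    example_ids = {entry["id"] for entry in examples}
    return sorted(tree_ids - example_ids), sorted(example_ids - tree_ids)


# Invented miniature versions of the two output files.
data = {
    "name": "categories",
    "children": [
        {"id": 0, "name": "Logic Puzzles", "children": [
            {"id": 4, "name": "Counting Letters in Words", "count": 486},
            {"id": 7, "name": "Comparing Decimal Numbers", "count": 10},
        ]},
    ],
}
examples = [
    {"id": 4, "name": "Counting Letters in Words",
     "examples": ["How many r are there in strawberry"]},
    {"id": 9, "name": "Unlisted Category", "examples": []},
]

missing_examples, missing_in_tree = find_mismatches(data, examples)
print(missing_examples, missing_in_tree)
```

With the real files, `data` and `examples` would come from `json.load` on data.json and examples.json respectively.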