Embeddings are a foundational building block of LLMs - by locating semantics in a multi-dimensional space they allow LLMs to make sense of meaning and "understand" what texts are about. In this sample, we look at two ways of clustering texts by their meaning. In the first, we use the embeddings directly and learn the clusters from the embedding vectors. In the second, we summarise the texts and pass the summaries on to a powerful LLM, prompting it to cluster the articles. As we'll see, both approaches work well, and each has its advantages.
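
To make "locating semantics in a multi-dimensional space" a bit more concrete, here is a minimal sketch of cosine similarity, a common way of comparing two embedding vectors. The three-dimensional vectors and their labels are made up for illustration; they are not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions.
toy_embeddings = {
    "football match": [0.9, 0.1, 0.0],
    "boxing tournament": [0.8, 0.2, 0.1],
    "regional election": [0.1, 0.9, 0.2],
}

sports_pair = cosine_similarity(
    toy_embeddings["football match"], toy_embeddings["boxing tournament"]
)
mixed_pair = cosine_similarity(
    toy_embeddings["football match"], toy_embeddings["regional election"]
)

print(f"sports vs sports:   {sports_pair:.3f}")
print(f"sports vs politics: {mixed_pair:.3f}")
```

Texts about similar topics should end up with vectors that point in similar directions, which is the property a clustering algorithm can exploit.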

Learning clusters directly from embeddings:
- Fast and inexpensive
- Requires an understanding of basic machine learning methods (K-Means clustering)
- Does not provide any additional information about the clusters, since the simple machine learning algorithm we're using is working directly with the numerical representation of the embeddings and does not have any insight into their meaning.

Prompting an LLM to cluster the articles:
- Relatively slow and expensive - this is a task for GPT-4; simpler LLMs do not produce convincing results
- Easy to accomplish - we're "programming" the LLM in English, so no additional knowledge of machine-learning techniques is required
- Can provide additional information about the clusters, since the clustering happens within the context of the LLM, where the summaries of the articles are available

In [1]:
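# scikit-learn is the package that provides the `sklearn` module.
# The cells below use the pre-1.0 openai API (openai.Embedding, openai.ChatCompletion), hence the version pin.
%pip install "openai<1" scikit-learn numpy requests beautifulsoup4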
from IPython.display import clear_output ; clear_output()

Now that we've installed the necessary packages, we'll gather some data for our experiment by fetching 23 random articles from Wikipedia.

In [2]:
import requests
from bs4 import BeautifulSoup

articles = []

def get_random_wikipedia_article_title():
    response = requests.get("https://en.wikipedia.org/wiki/Special:Random")
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.find(class_="firstHeading").text
    text = soup.get_text()
    return title, text

for _ in range(23):
    title, text = get_random_wikipedia_article_title()
    articles.append({
        "title": title,
        "text": text,
    })
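
Calling `get_text()` on the whole page also picks up navigation menus and other page chrome, not just the article body. As a possible refinement, the sketch below keeps only the text inside `<p>` elements. It uses Python's built-in `html.parser` so that it stays self-contained; the names `ParagraphText` and `extract_paragraphs` are our own, and we have not checked how well this works across all Wikipedia page layouts.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []
        self._current = []

    def _flush(self):
        # Store the paragraph collected so far, if it contains any text.
        text = "".join(self._current).strip()
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate a previous <p> that was never closed
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self._current.append(data)

def extract_paragraphs(html):
    """Return the paragraph text of an HTML document, one paragraph per line."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)

# Small made-up page to show the effect.
sample_html = (
    "<nav>Main page</nav>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<div>Sidebar</div>"
    "<p>Second.</p>"
)
sample_text = extract_paragraphs(sample_html)
print(sample_text)
```

In the cell above, something like `extract_paragraphs(response.text)` could stand in for `soup.get_text()` if the extra page text turns out to hurt the results.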

We configure the OpenAI client to use our Azure OpenAI deployment.

In [3]:
import openai

openai.api_type = "azure"
openai.api_version = "2023-06-01-preview"
openai.api_key = " ... " # Replace with your Azure OpenAI key
openai.api_base = " ... " # Replace with your Azure OpenAI endpoint

We'll use the OpenAI embeddings endpoint to calculate embeddings for all articles.

Note that we trim each article to its first 5,555 characters so that it fits within the model's context window. This shouldn't be a problem, since encyclopedia articles usually contain the most important information at the beginning.

In [4]:
for article in articles:
    response = openai.Embedding.create(
        engine="text-embedding-ada-002",
        input=[article["text"][:5555]],
    )
    article["embedding"] = response["data"][0]["embedding"]
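
Slicing by character count can cut the last word in half. If that matters, a small helper along these lines could be used instead of the plain slice. This is only a sketch: `trim_text` is a hypothetical name of our own, and it still counts characters, not tokens.

```python
def trim_text(text, max_chars):
    """Shorten text to at most max_chars characters, cutting at whitespace when possible."""
    if len(text) <= max_chars:
        return text
    # Walk back from the limit to the nearest whitespace character.
    cut = max_chars
    while cut > 0 and not text[cut].isspace():
        cut -= 1
    if cut == 0:
        # No whitespace found before the limit: fall back to a hard cut.
        return text[:max_chars]
    return text[:cut].rstrip()

trimmed_example = trim_text("alpha beta gamma", 12)
print(repr(trimmed_example))
```

With this helper, the `input` in the cell above would become `[trim_text(article["text"], 5555)]`.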

In addition to embeddings, we'll also create a summary for each article, which we will use later to get an LLM to do the clustering without having to include the complete articles. GPT-3.5 is more than capable of this task, and we're using the version of the model with a larger context window so that we can comfortably fit more of the article content (the first 9,999 characters) in our request.

In [5]:
for article in articles:
    completion = openai.ChatCompletion.create(
        engine = "gpt-35-turbo-16k",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert encyclopedia editor. "
                    "Your task is to read an article from Wikipedia and write a short summary of it. "
                    "Your response should always be a single paragraph of text with the article summary, "
                    "and should never include any titles, headings or markup."
                )
            },
            {
                "role": "user",
                "content": (
                    f"Read the following article and write a short summary of it."
                    f"\n\nTitle:\n{article['title']}\nArticle:\n{article['text'][:9999]}"
                )
            },
        ],
        temperature=0.3,
        max_tokens=250,
    )
    article["summary"] = completion['choices'][0]['message']["content"]

Let's look at option 1 - learning the clustering of the articles from their embeddings. We'll use K-Means clustering, a simple but very effective clustering model.

In [6]:
from sklearn.cluster import KMeans
import numpy as np

embeddings = np.array([article["embedding"] for article in articles])
kmeans = KMeans(n_clusters=7, random_state=0, n_init=10).fit(embeddings)

# kmeans.labels_ holds the cluster index assigned to each article during fitting
for cluster in range(7):
    print(f"Cluster {cluster + 1}")
    for article, label in zip(articles, kmeans.labels_):
        if label == cluster:
            print(f" - {article['title']}")
    print()

Cluster 1
 - 2010 Campania regional election
 - List of mayors of Vicenza

Cluster 2
 - Boxing at the 2014 Commonwealth Games – Lightweight
 - 1965 Nuneaton by-election
 - 2019 Copa do Brasil
 - 1934 County Championship
 - 1699 in science

Cluster 3
 - Paracongroides
 - Berlin Correspondent
 - Karl Gether Bomhoff
 - Zeitraumexit
 - Fotonovela (film)

Cluster 4
 - Jewish Neo-Aramaic dialect of Koy Sanjaq
 - Special Assault Team
 - Olympic Security Command Centre

Cluster 5
 - Carlos Uzabeaga
 - Thunder Live
 - Luke Kibet Bowen
 - Dan Duffy (artist)
 - Arthit Sunthornpit

Cluster 6
 - Mohtarma Benazir Bhutto Shaheed Medical College

Cluster 7
 - Array Network Facility
 - Young African Leaders Initiative



Looks like K-Means got us pretty good results. We don't have an explicit title for each cluster, but from looking at the included articles we can see that it did a decent job of grouping them by topic. We may be able to achieve even better results by adjusting the parameters of the model, but for a generic solution this is not bad, if a bit obscure.
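
One way to make such clusters a little less obscure, without involving an LLM, is to pick a representative article for each cluster: the one whose vector lies closest to the cluster mean. The sketch below shows the idea on made-up two-dimensional points. The function names are our own, and a representative title is only a rough stand-in for a proper cluster description.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def representative_titles(titles, vectors, labels):
    """Map each cluster label to the title whose vector is nearest the cluster mean."""
    members = {}
    for title, vector, label in zip(titles, vectors, labels):
        members.setdefault(label, []).append((title, vector))

    result = {}
    for label, items in members.items():
        center = centroid([vector for _, vector in items])
        nearest = min(items, key=lambda item: math.dist(item[1], center))
        result[label] = nearest[0]
    return result

# Made-up data: six "articles" in two clusters.
toy_titles = ["A", "B", "C", "D", "E", "F"]
toy_vectors = [
    [0.0, 0.0], [1.0, 0.0], [4.0, 0.0],        # cluster 0
    [10.0, 10.0], [10.0, 11.0], [10.0, 15.0],  # cluster 1
]
toy_labels = [0, 0, 0, 1, 1, 1]

representatives = representative_titles(toy_titles, toy_vectors, toy_labels)
print(representatives)
```

For the notebook's data, the inputs would be the article titles, the `embeddings` array and `kmeans.labels_`.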

Now, let's see what a powerful LLM (GPT-4) can come up with, using nothing but prompting, with the titles and summaries of the articles included in the context.

In [9]:
completion = openai.ChatCompletion.create(
    engine="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are an expert encyclopedia editor. "
                "Your task is to read through the summaries of Wikipedia articles, "
                "and assign them to one of seven clusters. "
            )
        },
        {
            "role": "user",
            "content": (
                "The following are titles and summaries of Wikipedia articles. "
                "Read the summaries. For each article, come up with a few tags that would "
                "describe the cluster it belongs to. "
                "Then assign the article to one of exactly 7 clusters (from Cluster 1 to Cluster 7), "
                "based on their topics. It is very important that you only assign each article to one cluster. "
                "You must have exactly 7 clusters, and each cluster must have at least two articles assigned to it. "
                "Your final output should follow the format:\n"
                "Cluster N: Cluster Title\n"
                " - First Article Title\n"
                " - Second Article Title\n"
                "etc...\n\n\n"
            ) + "\n".join(f"Title: {article['title']}\nSummary: {article['summary']}\n\n" for article in articles)
        },
    ],
    temperature=0.1,
)
clusters = completion['choices'][0]['message']["content"]

print(clusters)

Cluster 1: Politics and Governance
 - 2010 Campania regional election
 - List of mayors of Vicenza
 - 1965 Nuneaton by-election
 - Karl Gether Bomhoff

Cluster 2: Sports and Athletics
 - Carlos Uzabeaga
 - Boxing at the 2014 Commonwealth Games – Lightweight
 - Luke Kibet Bowen
 - 2019 Copa do Brasil
 - Arthit Sunthornpit

Cluster 3: Science and Research
 - Paracongroides
 - Array Network Facility
 - 1699 in science

Cluster 4: Language and Culture
 - Jewish Neo-Aramaic dialect of Koy Sanjaq
 - Young African Leaders Initiative

Cluster 5: Law Enforcement and Security
 - Special Assault Team
 - Olympic Security Command Centre

Cluster 6: Arts and Entertainment
 - Berlin Correspondent
 - Thunder Live
 - Dan Duffy (artist)
 - Fotonovela (film)
 - Zeitraumexit

Cluster 7: Education and Health
 - Mohtarma Benazir Bhutto Shaheed Medical College
 - 1934 County Championship
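
Since the LLM's answer is free text, it can be worth checking that it followed the rules we set. The sketch below parses the `Cluster N: Cluster Title` format requested in the prompt and reports articles that are missing or assigned more than once. It assumes the model sticks to that format; `parse_clusters` and `find_assignment_problems` are names of our own.

```python
import re

def parse_clusters(text):
    """Parse 'Cluster N: Title' headers followed by ' - Article' lines into a dict."""
    clusters = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\s*Cluster\s+(\d+):\s*(.+)", line)
        if header:
            current = header.group(2).strip()
            clusters[current] = []
        elif line.strip().startswith("- ") and current is not None:
            clusters[current].append(line.strip()[2:].strip())
    return clusters

def find_assignment_problems(clusters, expected_titles):
    """Return (missing, duplicated) article titles, each as a sorted list."""
    assigned = [title for titles in clusters.values() for title in titles]
    missing = sorted(set(expected_titles) - set(assigned))
    duplicated = sorted({title for title in assigned if assigned.count(title) > 1})
    return missing, duplicated

# A shortened copy of the output above, used as sample input.
sample_output = """Cluster 5: Law Enforcement and Security
 - Special Assault Team
 - Olympic Security Command Centre

Cluster 7: Education and Health
 - Mohtarma Benazir Bhutto Shaheed Medical College
 - 1934 County Championship
"""

parsed = parse_clusters(sample_output)
missing, duplicated = find_assignment_problems(
    parsed, ["Special Assault Team", "Thunder Live"]
)
print(parsed)
print("Missing:", missing, "Duplicated:", duplicated)
```

In the notebook, `parse_clusters(clusters)` together with the list of article titles would do the same check on the real response.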


In conclusion, we can achieve pretty good clustering using both techniques. Using LLMs to summarise the articles and then asking for a clustering gets us excellent results, but at a price: we need a minute and a half to complete the task, and the costs of the LLM calls add up. Learning the clustering directly from the embeddings is very fast (4 seconds) and inexpensive, but getting high-quality results is much harder. We also lose the added benefit of the LLM being able to reason about the clusters and give them a clear description. A human has to do that instead, which is many times more expensive.