# Topic Modeling with the Cohere API
Topic modeling is a technique used to extract the main topics from a large collection of text documents. It helps people make sense of large volumes of unstructured text data, such as incoming messages to a chatbot. For example, this can be useful in the context of customer service, where teams can analyze common questions and feedback coming from customers about a product or service, so they can better serve customers in the future.

The Topic Modeler sample app analyzes a dataset of commands that people give to their AI-based personal assistant and extracts the dataset's key topics and themes. It relies on text embeddings, which are numerical representations of text that capture its meaning, and uses the Cohere Embed endpoint to retrieve them. The topics are generated by a clustering algorithm that groups the text inputs into clusters based on the topics they represent.
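To make the idea of "embeddings that capture meaning" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors below are made up for illustration; they are not real outputs of the Cohere Embed endpoint, whose vectors have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (invented values, for illustration only).
toy_embeddings = {
    "set an alarm for seven am": [0.9, 0.1, 0.0],
    "wake me up at nine am": [0.8, 0.2, 0.1],
    "play some jazz music": [0.1, 0.9, 0.2],
}

alarm = toy_embeddings["set an alarm for seven am"]
wake = toy_embeddings["wake me up at nine am"]
music = toy_embeddings["play some jazz music"]

# Texts with related meaning should score higher than unrelated ones.
sim_alarm_wake = cosine_similarity(alarm, wake)
sim_alarm_music = cosine_similarity(alarm, music)
print(sim_alarm_wake > sim_alarm_music)
```

With these toy values, the two alarm-related commands are far more similar to each other than either is to the music command, which is the property a clustering algorithm exploits when grouping inputs into topics.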

The steps to build the Topic Modeler are:

- Step 1: Load the text dataset
- Step 2: Create clusters
- Step 3: Get cluster keywords
- Step 4: Visualize clusters on a plot

In [14]:
!pip install pandas cohere datasets altair topically umap-learn bertopic > /dev/null

## Step 1: Load the Text Dataset



In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import cohere
import umap
import altair as alt
from bertopic import BERTopic
from datasets import load_dataset
from typing import Optional, List

# Load the en-US training split of the dataset (a small sample is drawn from it later)
dataset = load_dataset("AmazonScience/massive", "en-US", split="train" )


In [23]:
df_full = pd.DataFrame(dataset)
df_full.shape

(11514, 10)

In [24]:
df_full.columns

Index(['id', 'locale', 'partition', 'scenario', 'intent', 'utt', 'annot_utt',
       'worker_id', 'slot_method', 'judgments'],
      dtype='object')

In [25]:
df_full.head()

| | id | locale | partition | scenario | intent | utt | annot_utt | worker_id | slot_method | judgments |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | en-US | train | 16 | 48 | wake me up at nine am on friday | wake me up at [time : nine am] on [date : friday] | 1 | {'slot': [], 'method': []} | {'worker_id': [], 'intent_score': [], 'slots_s... |
| 1 | 2 | en-US | train | 16 | 48 | set an alarm for two hours from now | set an alarm for [time : two hours from now] | 1 | {'slot': [], 'method': []} | {'worker_id': [], 'intent_score': [], 'slots_s... |
| 2 | 4 | en-US | train | 10 | 46 | olly quiet | olly quiet | 1 | {'slot': [], 'method': []} | {'worker_id': [], 'intent_score': [], 'slots_s... |
| 3 | 5 | en-US | train | 10 | 46 | stop | stop | 1 | {'slot': [], 'method': []} | {'worker_id': [], 'intent_score': [], 'slots_s... |
| 4 | 6 | en-US | train | 10 | 46 | olly pause for ten seconds | olly pause for [time : ten seconds] | 1 | {'slot': [], 'method': []} | {'worker_id': [], 'intent_score': [], 'slots_s... |


In [26]:
df = df_full.sample(100)

In [27]:
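# Initialize the Cohere client. The API key is read from an environment
# variable instead of being hardcoded (the name COHERE_API_KEY is an assumption).
import os
co = cohere.Client(os.environ["COHERE_API_KEY"])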

# Embed with Cohere’s embedding model, then convert into a numpy array
embeddings = co.embed(texts=list(df['utt']),
                       truncate="RIGHT").embeddings
embeddings = np.array(embeddings)


title = "Commands to AI personal assistant"

## Step 2: Create Clusters

In [15]:
from sklearn.cluster import KMeans

# Configure BERTopic to use KMeans clustering with a fixed number of clusters (10).
n_clusters = 10
topic_model = BERTopic(hdbscan_model=KMeans(n_clusters=n_clusters))

# df is a dataframe. df['utt'] is the column of text we're modeling
df['topic'], probabilities = topic_model.fit_transform(df['utt'], embeddings)
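As a rough illustration of what a k-means step does with the embedding vectors, the sketch below alternates between assigning points to their nearest centroid and moving each centroid to the mean of its members. It is a simplified stand-in, not scikit-learn's `KMeans` (which uses a more careful initialization and other refinements); the function name `kmeans_sketch`, the naive "first k points" initialization, and the 2-D toy points are assumptions made for this example.

```python
def kmeans_sketch(points, k, iterations=10):
    """Tiny k-means: returns one cluster index per input point."""
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 2-D points forming two well-separated groups (invented values).
toy_points = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2], [4.9, 5.0]]
toy_labels = kmeans_sketch(toy_points, k=2)
print(toy_labels)
```

In the notebook itself, the same kind of assignment is carried out on the Cohere embeddings, and the resulting cluster index for each utterance is stored in `df['topic']`.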

## Step 3: Get Cluster Keywords

In [16]:
keywords = topic_model.generate_topic_labels()
df['cluster_keywords'] = df['topic'].map(lambda x: keywords[x])
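To give some intuition for how per-cluster keywords can be chosen, here is a simplified sketch loosely based on the class-based TF-IDF idea described in BERTopic's documentation: each cluster is treated as one large document, and a word scores highly when it is frequent inside the cluster but rare across all clusters. This is not BERTopic's actual implementation; the function name `cluster_keywords_sketch`, the whitespace tokenization, and the toy texts are assumptions for illustration.

```python
import math
from collections import Counter

def cluster_keywords_sketch(texts, labels, top_n=3):
    """Return the top_n highest-scoring words for each cluster label."""
    # Word counts per cluster, treating each cluster as one large document.
    per_cluster = {}
    for text, label in zip(texts, labels):
        per_cluster.setdefault(label, Counter()).update(text.lower().split())

    # Word counts across all clusters, and the average cluster size in words.
    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_cluster)

    keywords = {}
    for label, counts in per_cluster.items():
        cluster_size = sum(counts.values())
        scores = {
            word: (count / cluster_size) * math.log(1 + avg_words / total[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[label] = ranked[:top_n]
    return keywords

toy_texts = ["set the alarm", "cancel the alarm", "play the song", "skip the song"]
toy_cluster_ids = [0, 0, 1, 1]
toy_keywords = cluster_keywords_sketch(toy_texts, toy_cluster_ids, top_n=1)
print(toy_keywords)
```

Note that "the" appears as often as "alarm" within the first toy cluster, yet it ranks lower because it also appears in the other cluster. That down-weighting of words shared across clusters is what makes the resulting labels descriptive.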

## Step 4: Visualize Clusters on a Plot

In [17]:
def interactive_clusters_scatterplot(
        df: pd.DataFrame,
        fields_in_tooltip: Optional[List[str]] = None,
        title: str = '',
        title_column: str = 'keywords'
):
    if fields_in_tooltip is None:
        fields_in_tooltip = ['']

    selection = alt.selection_multi(fields=[title_column], bind='legend')

    chart = alt.Chart(df).transform_calculate(
    ).mark_circle(size=20, stroke='#666', strokeWidth=1, opacity=0.1).encode(
        x=
        alt.X('x',
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
              ),
        y=
        alt.Y('y',
              scale=alt.Scale(zero=False),
              axis=alt.Axis(labels=False, ticks=False, domain=False)
              ),

        color=alt.Color(f'{title_column}:N',
                        legend=alt.Legend(columns=2,
                                          symbolLimit=0,
                                          orient='right',
                                          labelFontSize=12)
                        ),
        opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
        tooltip=fields_in_tooltip
    ).properties(
        width=1000,
        height=800
    ).add_selection(
        selection
    ).configure_legend(labelLimit=0).configure_view(
        strokeWidth=0
    ).configure(background="#F6f6f6").properties(
        title=title
    ).configure_range(
        category={'scheme': 'category20'}
    )
    return chart

# Reduce dimensions to be able to plot the embeddings
n_neighbors = 15
reducer = umap.UMAP(n_neighbors=n_neighbors)
umap_embeds = reducer.fit_transform(embeddings)
df['x'] = umap_embeds[:, 0]
df['y'] = umap_embeds[:, 1]

# Specify the names of columns to plot

title_column = 'cluster_keywords'
fields_in_tooltip = ['utt', 'topic', 'cluster_keywords']

title = "Commands to AI personal assistant"

# The cluster count in the title is derived from the assigned topics
chart = interactive_clusters_scatterplot(df,
                                         fields_in_tooltip=fields_in_tooltip,
                                         title=title + " - " + str(df['topic'].nunique()) + " clusters",
                                         title_column=title_column)
chart