# Measure the scores for BERTopic

This notebook was run on Google Colab because the required resources were not available locally.

## Setup

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# path_to_list_of_contents='[...]/list_of_contents.txt'

In [1]:
import pandas as pd
import numpy as np
from bertopic import BERTopic
from umap import UMAP
from bertopic.representation import MaximalMarginalRelevance
import random
import tensorflow as tf
random.seed(42)  # random.shuffle is used by the metric helpers below
np.random.seed(42)
tf.random.set_seed(42)

from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.evaluation_metrics.topic_significance_metrics import KL_background, KL_vacuous

import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

import spacy
import nltk
from nltk.stem import PorterStemmer
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
path_list=path_to_list_of_contents
with open(path_list, 'r') as file:
  list_of_contents = file.readlines()
list_of_contents[0]

"The BBC is being urged to drop singer Olly Alexander as its entrant for Eurovision after it emerged he signed a letter calling Israel an 'apartheid regime'. The Years And Years frontman, 33, was unveiled as next year's candidate for the UK during the Strictly Come Dancing final, which aired on the BBC on Saturday. But he now faces having that role stripped from him after he signed a letter from LGBT charity Voices4London which described Israel as an 'apartheid regime' which is trying to 'ethnically cleanse' Palestine. The statement, which was published on October 20, almost two weeks after Hamas' October 7 attack, also says that Israel has 'terrorised' Palestinian people and there is now a 'genocide' taking place 'in real time'. The Conservatives have accused the BBC of 'either a massive oversight or sheer brass neck' for selecting Alexander, while a Jewish charity has called for him to be replaced and for the broadcaster to cut ties with him. The BBC is not planning on taking any act

## Functions to measure the performance

1. TopicDiversity
  * https://github.com/MIND-Lab/OCTIS/blob/master/octis/evaluation_metrics/diversity_metrics.py#L12
  * inspired by https://github.dev/lbl-camera/berteley/tree/main
  * collects the top-k words of all topics into a set, so that only unique words remain, and divides the number of unique words by the total number of words considered; a value near 0 means the topics largely repeat the same words, and higher values are better
  * not perfect: e.g. with two topics, one about friendly domestic animals and the other about dangerous ones, both should contain the word "animal", which is desirable, but this score penalizes the overlap
2. Another version of the TopicDiversity metric:
  * measures the diversity inside each topic, not across all topics together
  * the final score is the average of the per-topic scores
3. Coherence:
  * from the issue https://github.com/MaartenGr/BERTopic/issues/90
  * OCTIS also uses the gensim implementation
  * measures how closely the top-k words in a topic relate to each other
4. Significance metrics KL_background and KL_vacuous
  * aim at distinguishing high-quality topics from junk topics based on the document-topic and topic-word distributions
  * a score of 0 means that the two distributions are exactly the same; the higher the score, the more the distributions differ
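
The two diversity variants described in items 1 and 2 can be illustrated on a toy example. This is a minimal sketch in plain Python, not the OCTIS implementation; the function names and the toy topics are made up for illustration, and every topic is assumed to contain at least `topk` words.

```python
def global_topic_diversity(topics, topk=10):
    """Unique words across the top-k words of all topics, divided by the total word count."""
    words = [w for topic in topics for w in topic[:topk]]
    return len(set(words)) / len(words)


def mean_within_topic_diversity(topics):
    """Average, over topics, of the unique-word ratio inside each single topic."""
    scores = [len(set(topic)) / len(topic) for topic in topics]
    return sum(scores) / len(scores)


# Hypothetical topics: both legitimately contain "animal".
toy_topics = [
    ["animal", "cat", "dog", "pet"],
    ["animal", "lion", "shark", "danger"],
]

print(global_topic_diversity(toy_topics, topk=4))  # 7 unique / 8 total = 0.875
print(mean_within_topic_diversity(toy_topics))     # no repeats inside a topic = 1.0
```

The shared word "animal" lowers the global score but leaves the per-topic score untouched, which is the limitation noted in item 1.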

In [19]:
def TopicDiveristy_inside_topics(list_of_topics):
  """Mean ratio of unique to total normalized tokens inside each topic (1.0 means no repeats)."""
  nlp = spacy.load('en_core_web_sm')
  stemmer = PorterStemmer()
  scores_sum=[0 for _ in range(len(list_of_topics))]
  for i, topic in enumerate(list_of_topics):
    if i%50==0:
      print(i)  # progress indicator
    # Lemmatize, then stem, so that variants of a word collapse to one token
    topic_str=' '.join(topic)
    doc1=nlp(topic_str)
    lemmatized_tokens=[token.lemma_ for token in doc1]
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens] # the tokens are not perfect, but this helps count "tesla" and "teslas" as one word
    # Treat "tweet" and "twitter" as the same word
    updated_list=[item.replace("tweet", "twitter") for item in stemmed_tokens]
    unique_words=set(updated_list)
    score=len(unique_words)/len(updated_list)
    scores_sum[i]=score
  return sum(scores_sum)/len(scores_sum)
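
To see why the tokens are normalized before counting, here is a small sketch. It does not use spaCy or the Porter stemmer: `crude_normalize` is a hypothetical stand-in that only strips a trailing "s", and the example topic is made up.

```python
def crude_normalize(word):
    # Hypothetical stand-in for lemmatization + stemming: strip one trailing "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def within_topic_ratio(words, normalize=None):
    """Unique-word ratio inside one topic, optionally after normalizing each word."""
    if normalize is not None:
        words = [normalize(w) for w in words]
    return len(set(words)) / len(words)


topic = ["tesla", "teslas", "battery", "charging"]

print(within_topic_ratio(topic))                   # 4 unique / 4 = 1.0
print(within_topic_ratio(topic, crude_normalize))  # 3 unique / 4 = 0.75
```

Without normalization the topic looks fully diverse even though two of its four words refer to the same thing.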

In [20]:
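def measure_topic_diversities(topic_model, topics, probs, diversity_mmr=0.5):
  # topics, probs and diversity_mmr are not used here; they are kept so existing calls still work

  # Top words of every topic, skipping the outlier topic (-1) by key rather than by position
  topic_dict = topic_model.topic_representations_
  word_list = [[x[0] for x in topic_dict[k]] for k in topic_dict.keys() if k != -1]

  # OCTIS TopicDiversity: unique words across all topics / total words (top 10 per topic)
  topic_diversity = TopicDiversity(topk=10)
  random.shuffle(word_list)
  output_tm = {"topics": word_list}
  diversity_score = topic_diversity.score(output_tm)
  # Per-topic variant: mean ratio of unique words inside each topic
  mean_diversity=TopicDiveristy_inside_topics(word_list)
  return diversity_score, mean_diversity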

In [21]:
def calculate_coherence(list_of_contents, topics, probs,topic_model, coherence_metric='c_npmi'):
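  """
    list_of_contents - list of documents
    topics - list of topic assignments, from: topics, probs = topic_model.fit_transform(list_of_contents)
    probs - topic probabilities from the same call (not used here)
    topic_model - fitted BERTopic model
    coherence_metric - gensim coherence measure, e.g. 'c_npmi'
  """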

  #from issue https://github.com/MaartenGr/BERTopic/issues/90

  # Preprocess Documents
  documents = pd.DataFrame({"Document": list_of_contents,
                            "ID": range(len(list_of_contents)),
                            "Topic": topics})
  documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
  cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

  # Extract vectorizer and analyzer from BERTopic
  vectorizer = topic_model.vectorizer_model
  analyzer = vectorizer.build_analyzer()

  # Extract features for Topic Coherence evaluation
  words = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
  tokens = [analyzer(doc) for doc in cleaned_docs]
  dictionary = corpora.Dictionary(tokens)
  corpus = [dictionary.doc2bow(token) for token in tokens]
  topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                for topic in range(len(set(topics))-1)]

  coherence_model = CoherenceModel(topics=topic_words,
                                  texts=tokens,
                                  corpus=corpus,
                                  dictionary=dictionary,
                                  coherence=coherence_metric)
  coherence = coherence_model.get_coherence()

  return coherence
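
For intuition about what an NPMI-based coherence measures, here is a simplified sketch in plain Python. It is not the gensim `c_npmi` pipeline: it counts co-occurrence over whole documents instead of a sliding window, and the toy documents and function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, docs):
    """Normalized pointwise mutual information from document-level co-occurrence."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never appear together
    if p_ab == 1:
        return 1.0   # limit value when both words appear in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(top_words, docs):
    """Coherence of one topic: mean NPMI over all pairs of its top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens.
docs = [
    {"mining", "copper", "ore"},
    {"mining", "copper", "site"},
    {"radiation", "site"},
    {"editorial", "staff"},
]

print(npmi("mining", "copper", docs))   # always together: close to 1
print(npmi("mining", "staff", docs))    # never together: -1.0
print(topic_npmi(["mining", "copper", "site"], docs))
```

Values range from -1 (words never co-occur) through 0 (independent) to 1 (words always co-occur), so higher topic scores indicate top words that tend to appear in the same texts.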

In [14]:
def measure_significance_metric(topic_model, topics, probs, kl_type='background'):
  #topics
  topic_words = {}
  topic_dict = topic_model.topic_representations_
  for k in topic_dict.keys():
      topic_words[k] = [x[0] for x in topic_dict[k]]
  word_list = list(topic_words.values())
  word_list.pop(0)
  random.shuffle(word_list)

  #topic-word-matrix
  topic_words_matrix = []
  topic_words=topic_model.get_topic_info()

  for topic in topic_words.Topic:
    if topic !=-1:
      words, scores = zip(*topic_model.get_topic(topic))
      topic_words_matrix.append(scores)
  topic_words_matrix=np.array(topic_words_matrix)

  # topic-document-matrix
  topic_document_matrix=np.array(probs).T

  output_tm = {"topics": word_list, "topic-word-matrix": topic_words_matrix, "topic-document-matrix":  topic_document_matrix}
  if kl_type=='background':
    kl_background_metric=KL_background()
    score=kl_background_metric.score(output_tm)
  elif kl_type=='vacuous':
    kl_vacuous_metric=KL_vacuous()
    score=kl_vacuous_metric.score(output_tm)
  else:
    raise ValueError(f"Wrong type of metric: {kl_type!r}, expected 'background' or 'vacuous'")
  return score
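
Both significance metrics rest on the Kullback-Leibler divergence between a topic's distribution and a reference distribution. The sketch below shows only that building block, comparing toy distributions against a uniform reference; it is not the OCTIS code, and the reference distributions OCTIS actually uses are not reproduced here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
mild = [0.40, 0.30, 0.20, 0.10]
peaked = [0.97, 0.01, 0.01, 0.01]

print(kl_divergence(uniform, uniform))  # identical distributions: 0.0
print(kl_divergence(mild, uniform))     # small positive value
print(kl_divergence(peaked, uniform))   # larger value: far from the reference
```

A topic whose distribution is close to the reference scores near 0 and is a candidate junk topic, while a sharply concentrated distribution scores higher.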

## Usage

In [17]:
def check_measures(diversity=0.5, coherence='c_npmi'):
  # create model
  umap_model = UMAP(n_neighbors=15, n_components=5,
                    min_dist=0.0, metric='cosine', random_state=42)
  representation_model = MaximalMarginalRelevance(diversity=diversity)
  topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, representation_model=representation_model, umap_model=umap_model)
  topics, probs = topic_model.fit_transform(list_of_contents)

  # measure metrics
  diversity_score, mean_diversity = measure_topic_diversities(diversity_mmr=diversity, topics=topics, probs=probs, topic_model=topic_model)
  coherence_score = calculate_coherence(list_of_contents, coherence_metric=coherence, topic_model=topic_model, topics=topics, probs=probs)
  background_score=measure_significance_metric(topic_model, topics, probs, kl_type='background')
  vacuous_score=measure_significance_metric(topic_model, topics, probs, kl_type='vacuous')

  print(f"Diversity parameter: {diversity}, coherence metric: {coherence}, TopicDiversity: {diversity_score}, mean diversity in topics: {mean_diversity}, coherence score: {coherence_score}, KL background: {background_score}, KL vacuous: {vacuous_score}.")
  return diversity_score, mean_diversity, coherence_score, background_score, vacuous_score

In [22]:
diversity_score05, mean_diversity05, coherence05, background05, vacuous_score05 = check_measures(diversity=0.5, coherence='c_npmi')

2024-05-27 15:06:29,970 - BERTopic - Embedding - Transforming documents to embeddings.


2024-05-27 15:07:21,890 - BERTopic - Embedding - Completed ✓
2024-05-27 15:07:21,891 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-27 15:07:32,938 - BERTopic - Dimensionality - Completed ✓
2024-05-27 15:07:32,940 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-27 15:07:55,566 - BERTopic - Cluster - Completed ✓
2024-05-27 15:07:55,576 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-27 15:08:07,427 - BERTopic - Representation - Completed ✓


0
50
100
150
200
Diversity parameter: 0.5, coherence metric: c_npmi, TopicDiversity: 0.7886792452830189, mean diversity in topics: 1.0, coherence score: 0.034461948245713726, KL background: 2.534053223796575, KL vacuous: 2.4354541148173947.


In [23]:
diversity_score02, mean_diversity02, coherence02, background02, vacuous_score02 = check_measures(diversity=0.2, coherence='c_npmi')

2024-05-27 15:12:35,051 - BERTopic - Embedding - Transforming documents to embeddings.


2024-05-27 15:13:27,275 - BERTopic - Embedding - Completed ✓
2024-05-27 15:13:27,277 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-27 15:13:38,638 - BERTopic - Dimensionality - Completed ✓
2024-05-27 15:13:38,641 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-27 15:14:01,296 - BERTopic - Cluster - Completed ✓
2024-05-27 15:14:01,307 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-27 15:14:13,387 - BERTopic - Representation - Completed ✓


0
50
100
150
200
Diversity parameter: 0.2, coherence metric: c_npmi, TopicDiversity: 0.7061320754716981, mean diversity in topics: 0.9001286449399674, coherence score: 0.09791122425972787, KL background: 2.534053223796575, KL vacuous: 2.3216755782442813.


In [24]:
diversity_score005, mean_diversity005, coherence005, background005, vacuous_score005 = check_measures(diversity=0.05, coherence='c_npmi')

2024-05-27 15:15:52,253 - BERTopic - Embedding - Transforming documents to embeddings.


2024-05-27 15:16:42,479 - BERTopic - Embedding - Completed ✓
2024-05-27 15:16:42,481 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-27 15:16:53,900 - BERTopic - Dimensionality - Completed ✓
2024-05-27 15:16:53,902 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-27 15:17:17,414 - BERTopic - Cluster - Completed ✓
2024-05-27 15:17:17,424 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-27 15:17:28,165 - BERTopic - Representation - Completed ✓


0
50
100
150
200
Diversity parameter: 0.05, coherence metric: c_npmi, TopicDiversity: 0.680188679245283, mean diversity in topics: 0.8406534503232632, coherence score: 0.11335504226096892, KL background: 2.534053223796575, KL vacuous: 2.3040921834697.
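
The three runs are easier to compare side by side. The sketch below copies the scores printed above, rounded to four decimals, into a list and prints them as a table; the variable and key names are chosen here for illustration.

```python
# Scores copied from the three printed results above, rounded to 4 decimals.
runs = [
    {"mmr_diversity": 0.5, "topic_diversity": 0.7887, "mean_diversity": 1.0,
     "c_npmi": 0.0345, "kl_background": 2.5341, "kl_vacuous": 2.4355},
    {"mmr_diversity": 0.2, "topic_diversity": 0.7061, "mean_diversity": 0.9001,
     "c_npmi": 0.0979, "kl_background": 2.5341, "kl_vacuous": 2.3217},
    {"mmr_diversity": 0.05, "topic_diversity": 0.6802, "mean_diversity": 0.8407,
     "c_npmi": 0.1134, "kl_background": 2.5341, "kl_vacuous": 2.3041},
]

columns = list(runs[0].keys())
print(" | ".join(f"{name:>15}" for name in columns))
for run in runs:
    print(" | ".join(f"{run[name]:>15.4f}" for name in columns))
```

In these runs, lowering the MMR diversity parameter reduces both diversity scores and KL vacuous, raises the NPMI coherence, and leaves KL background unchanged.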


## Show sample output from the model

In [None]:
umap_model = UMAP(n_neighbors=15, n_components=5,
                    min_dist=0.0, metric='cosine', random_state=42)
representation_model = MaximalMarginalRelevance(diversity=0.5)
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True, representation_model=representation_model, umap_model=umap_model)
topics, probs = topic_model.fit_transform(list_of_contents)

2024-05-26 20:49:25,457 - BERTopic - Embedding - Transforming documents to embeddings.


2024-05-26 20:50:24,024 - BERTopic - Embedding - Completed ✓
2024-05-26 20:50:24,027 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-05-26 20:50:39,525 - BERTopic - Dimensionality - Completed ✓
2024-05-26 20:50:39,528 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-05-26 20:51:20,580 - BERTopic - Cluster - Completed ✓
2024-05-26 20:51:20,591 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-05-26 20:51:40,395 - BERTopic - Representation - Completed ✓


In [None]:
topic_model.topic_representations_[topic_model.topics_[0]]

[('hamas', 0.03437410007487483),
 ('terrorist', 0.01593485845474893),
 ('antisemitism', 0.01382230042677843),
 ('jews', 0.012036425841046161),
 ('civilians', 0.007678645589811692),
 ('staff', 0.007132417021782438),
 ('attacks', 0.0069416562088209215),
 ('bbcs', 0.005977607178207836),
 ('editorial', 0.005890806183172142),
 ('simpson', 0.005484746849865494)]

In [None]:
list_of_contents[0]

"The BBC is being urged to drop singer Olly Alexander as its entrant for Eurovision after it emerged he signed a letter calling Israel an 'apartheid regime'. The Years And Years frontman, 33, was unveiled as next year's candidate for the UK during the Strictly Come Dancing final, which aired on the BBC on Saturday. But he now faces having that role stripped from him after he signed a letter from LGBT charity Voices4London which described Israel as an 'apartheid regime' which is trying to 'ethnically cleanse' Palestine. The statement, which was published on October 20, almost two weeks after Hamas' October 7 attack, also says that Israel has 'terrorised' Palestinian people and there is now a 'genocide' taking place 'in real time'. The Conservatives have accused the BBC of 'either a massive oversight or sheer brass neck' for selecting Alexander, while a Jewish charity has called for him to be replaced and for the broadcaster to cut ties with him. The BBC is not planning on taking any act

In [None]:
topic_model.topic_representations_[topic_model.topics_[1000]]

[('tinto', 0.05576038748508442),
 ('radiation', 0.0188352131003142),
 ('mining', 0.01701427688531484),
 ('site', 0.011923775095855302),
 ('alumina', 0.011651744094925356),
 ('ore', 0.011522331192205635),
 ('rios', 0.011260890707743408),
 ('copper', 0.010048154882721563),
 ('shelters', 0.009212359054452753),
 ('aboriginal', 0.00858673843806499)]

In [None]:
list_of_contents[1000]

"An investigation into how a tiny radioactive capsule was lost while being transported in outback Western Australia earlier this year has cleared mining giant Rio Tinto of any wrongdoing. On Thursday, the miner said the Western Australian Radiological Council, which had been investigating the incident, had not identified any breaches of WA's Radiation Safety Act by the company. 'We are grateful to the state and federal governments and all of those involved in the successful recovery of the capsule,' a Rio Tinto spokesperson said. 'Our own internal review has identified opportunities for improvement in the selection of radiation gauges and the way they are packaged and transported.' The capsule - which could fit onto a 10 cent piece - came loose during transportation from north of the Pilbara mining town of Newman to the Perth suburb of Malaga sometime between January 10 and January 16. An investigation into how a radioactive capsule (above) was lost in Western Australia in January has 