# BERTopic Parameters

In this final notebook, we take what we learned in the first five notebooks and apply it directly with the BERTopic library.

In [5]:
import sqlite3
import msgspec
import numpy as np

db_conn = sqlite3.connect("ChinaGraph2024.db")
cursor = db_conn.cursor()
decoder = msgspec.json.Decoder()

# each row is (id, explanation_text, explanation_embedding)
data = cursor.execute('SELECT id, explanation_text, explanation_embedding FROM neurons;').fetchall()

explanations = []
embeddings = []
for row in data:
  explanations.append(row[1])
  # the embedding column is decoded from JSON into a list of floats
  embeddings.append(decoder.decode(row[2]))

# BERTopic takes precomputed embeddings as a 2D array, one row per document
embeddings = np.array(embeddings)

To pass the same custom parameters we used in the previous notebooks to BERTopic, we initialize our own instances of UMAP and HDBSCAN.

In [6]:
from umap import UMAP
from hdbscan import HDBSCAN

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.01, random_state=42)
hdbscan_model = HDBSCAN(min_samples=80, min_cluster_size=775, gen_min_span_tree=True, prediction_data=True)


Then we pass these to the `umap_model` and `hdbscan_model` arguments when initializing `BERTopic`. In the previous notebook we used the `sentence-transformers/all-MiniLM-L6-v2` embedding model. Because we pass the precomputed embeddings loaded above when fitting, BERTopic does not need to embed the documents again, so we leave the `embedding_model` argument unset. Finally, we add a `CountVectorizer` that removes a custom list of stopwords for us in the c-TF-IDF step.

In [7]:
import bertopic
from sklearn.feature_extraction.text import CountVectorizer

stopwords = ['and', 'to', 'of', 'the', 'or', 'in', 'related', 'words', 'word', 'phrases', 'with']

# we add this to remove stopwords that can pollute topics
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords)

model = bertopic.BERTopic(
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  top_n_words=10,
  language='english',
  calculate_probabilities=True,
  verbose=True
)
# fit_transform both fits the model and returns the topic assignments,
# so a separate model.fit() call would only repeat the same work
topics, probs = model.fit_transform(documents=explanations, embeddings=embeddings)


2024-06-02 15:42:47,785 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-06-02 15:43:24,298 - BERTopic - Dimensionality - Completed ✓
2024-06-02 15:43:24,300 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-06-02 15:43:30,254 - BERTopic - Cluster - Completed ✓
2024-06-02 15:43:30,264 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-06-02 15:43:31,101 - BERTopic - Representation - Completed ✓
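
For intuition about the representation step logged above, here is a minimal pure-Python sketch of class-based TF-IDF on a toy corpus. It assumes the weighting tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters. It is an illustration only, not BERTopic's implementation, which differs in details such as normalization and n-gram handling. The documents, the labels and the helper name `c_tf_idf` are made up for this example.

```python
import math
from collections import Counter

# Same custom stopword list as in the notebook cell above.
STOPWORDS = {"and", "to", "of", "the", "or", "in", "related",
             "words", "word", "phrases", "with"}

def c_tf_idf(docs, labels, stopwords=STOPWORDS):
  """Score each term per cluster as tf(t, c) * log(1 + A / f(t))."""
  # Merge all documents of a cluster into one bag of words.
  per_cluster = {}
  for doc, label in zip(docs, labels):
    tokens = [t for t in doc.lower().split() if t not in stopwords]
    per_cluster.setdefault(label, Counter()).update(tokens)

  # f(t): frequency of each term across all clusters.
  total = Counter()
  for counts in per_cluster.values():
    total.update(counts)

  # A: average number of words per cluster.
  avg_words = sum(total.values()) / len(per_cluster)

  return {
    label: {t: tf * math.log(1 + avg_words / total[t]) for t, tf in counts.items()}
    for label, counts in per_cluster.items()
  }

# Toy documents and cluster labels (hypothetical, not from the database).
docs = [
  "names of people and organizations",
  "names of companies",
  "numbers and measurements",
  "numbers related to dates",
]
labels = [0, 0, 1, 1]

scores = c_tf_idf(docs, labels)
top_words = {
  label: sorted(s, key=s.get, reverse=True)[:2] for label, s in scores.items()
}
print(top_words)
```

Terms that are frequent inside one cluster but rare overall get the highest scores, which is why "names" and "numbers" lead their respective toy clusters.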


In [8]:
for i in range(5):
  print(f"{topics[i]}: {data[i][0]}")

print({key: topics.count(key) for key in set(topics)})

-1: L00N0000
0: L00N0001
3: L00N0002
0: L00N0003
6: L00N0004
{0: 9389, 1: 8493, 2: 2808, 3: 2575, 4: 2103, 5: 1276, 6: 1213, 7: 1100, 8: 896, 9: 848, -1: 6163}
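
To put these counts in perspective, the short sketch below recomputes the total and the share of outliers from the printed dictionary. The numbers are copied from the output above. In the notebook itself, `Counter(topics)` would presumably give the same counts directly.

```python
from collections import Counter

# Topic sizes copied from the printed output; -1 is the outlier label.
topic_sizes = Counter({0: 9389, 1: 8493, 2: 2808, 3: 2575, 4: 2103,
                       5: 1276, 6: 1213, 7: 1100, 8: 896, 9: 848, -1: 6163})

total = sum(topic_sizes.values())
outlier_share = topic_sizes[-1] / total
largest_two_share = sum(size for _, size in topic_sizes.most_common(2)) / total

print(f"{total} documents, {outlier_share:.1%} outliers, "
      f"{largest_two_share:.1%} in the two largest topics")
```

Roughly one in six explanations ends up unassigned, and the two largest topics together hold almost half of the documents.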


In [9]:
model.visualize_barchart()

In [10]:
model.visualize_hierarchy()

## Save topics to database

In [None]:
import json

for topic_index in model.get_topics().keys():
  # topic -1 holds the outliers, which we do not store as a topic
  if topic_index < 0:
    continue

  # get_topic returns (word, score) pairs; keep only the words
  top_words = [word for word, _ in model.get_topic(topic_index)]

  top_words_json = json.dumps(top_words, separators=(',', ':'))
  print(f"{topic_index}: {top_words_json}")

  # the title is left empty here and filled in by hand in the next cell
  cursor.execute(
    "INSERT INTO topics (id, title, top_words) VALUES (?, ?, ?)",
    (topic_index, "", top_words_json),
  )

db_conn.commit()
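
The insert above presupposes a `topics` table with `id`, `title` and `top_words` columns. The notebook does not show that schema, so the sketch below is only a guess at what it might look like: the column types and constraints are assumptions, and the words are made up. It runs against an in-memory database so that it does not touch the real file.

```python
import json
import sqlite3

# In-memory database so the sketch leaves the real file alone.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema: only the column names come from the INSERT statement.
cur.execute(
  "CREATE TABLE IF NOT EXISTS topics ("
  "id INTEGER PRIMARY KEY, title TEXT NOT NULL DEFAULT '', top_words TEXT NOT NULL)"
)

# Made-up words standing in for the output of model.get_topic(0).
top_words = ["names", "people", "organizations"]
cur.execute(
  "INSERT INTO topics (id, title, top_words) VALUES (?, ?, ?)",
  (0, "", json.dumps(top_words, separators=(",", ":"))),
)
conn.commit()

# Read the row back and decode the JSON column.
row = cur.execute("SELECT id, title, top_words FROM topics").fetchone()
stored_words = json.loads(row[2])
print(row[0], stored_words)
```

With a primary key on `id`, running the insert cell a second time would raise `sqlite3.IntegrityError`. Using `INSERT OR REPLACE` is one way to make the cell re-runnable.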

In [29]:
# integer keys, so they match the integer ids inserted into the topics table
titles_by_topic_index = {
  0: "Names, names of people and organizations",
  1: "Numerical, numbers, measurements and punctuation",
  2: "Parts of words, letters and syllables",
  3: "Actions, movement and change",
  4: "Connecting words, prepositions and conjunctions",
  5: "Pronouns",
  6: "Locations, positions and places",
  7: "Temporal",
  8: "Negative, bad sentiment",
  9: "Positive, good sentiment",
}

for topic_index, title in titles_by_topic_index.items():
  cursor.execute(
    "UPDATE topics SET title = ? WHERE id = ?",
    (title, topic_index),
  )

db_conn.commit()

## Update neurons with topics

In [33]:
print(len(data), len(topics))

36864 36864


In [34]:
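# rows in data and entries in topics are in the same order;
# outliers keep the topic id -1, which has no row in the topics table
for row, topic in zip(data, topics):
  cursor.execute(
    "UPDATE neurons SET explanation_topic_id = ? WHERE id = ?",
    (topic, row[0]),
  )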

db_conn.commit()
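
One way to check the result is to join `neurons` to `topics` and count the neurons per topic. The sketch below does this on toy rows in an in-memory database. The table definitions are assumptions that reuse only the column names from the statements above, and the shortened titles are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schemas; only the column names are taken from the notebook.
cur.execute("CREATE TABLE topics (id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("CREATE TABLE neurons (id TEXT PRIMARY KEY, explanation_topic_id INTEGER)")

cur.executemany("INSERT INTO topics VALUES (?, ?)", [(0, "Names"), (3, "Actions")])
# Toy assignments as (neuron id, topic id) pairs.
cur.executemany(
  "INSERT INTO neurons VALUES (?, ?)",
  [("L00N0000", -1), ("L00N0001", 0), ("L00N0002", 3), ("L00N0003", 0)],
)

# LEFT JOIN keeps outliers (topic -1), which have no row in topics.
rows = cur.execute(
  "SELECT n.explanation_topic_id, t.title, COUNT(*) "
  "FROM neurons AS n LEFT JOIN topics AS t ON t.id = n.explanation_topic_id "
  "GROUP BY n.explanation_topic_id, t.title "
  "ORDER BY n.explanation_topic_id"
).fetchall()
print(rows)
```

Outlier neurons show up with a `None` title, which makes it easy to see how many explanations were left without a topic.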