# B.2 - Topic Classification - BERT Topic Modelling

In this notebook, we use S-BERT embeddings with k-means clustering to discover topics among the goals in our network. To do so, we load the network, create embeddings, and assign each node exactly one topic. We then extract the most relevant keywords for each topic. Finally, we export the network with the assigned topics as the starting point for a manual open-coding process.

In [1]:
# importing all relevant packages
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import pandas as pd
import numpy as np
import requests
import pickle
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer



In [2]:
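# download the pickled network from the project repository
url = "https://raw.githubusercontent.com/nicosrp/The-Architecture-of-Aspiration-A-Network-Perspective-on-Human-Goals/main/Networks/Prior%20Network%20Versions/b1_network.pkl"
response = requests.get(url)
response.raise_for_status()

# unpickling executes arbitrary code, so only load files from a trusted source
G = pickle.loads(response.content)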

In [3]:
# current node attributes
attr_keys = {k for _, attrs in G.nodes(data=True) for k in attrs}
print(attr_keys)

{'wants_to_do', 'have_done', 'merged_goals', 'included_by_our_users', 'description', 'tags', 'comments', 'title'}


In [4]:
# checking the number of nodes to ensure network is correct
len(G.nodes())

2890

## S-BERT with K-Means Clustering

First, we extract titles and descriptions of our nodes and combine them.

In [5]:
texts = []
node_ids = []

for node, attrs in G.nodes(data=True):
    title = attrs.get("title", "")
    description = attrs.get("description", "")
    
    # Combined text for topic modeling
    text = f"{title}. {description}".strip()
    
    texts.append(text)
    node_ids.append(node)
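
The f-string above assumes both attributes are non-empty strings. An empty description leaves a bare trailing period, an empty title leaves a leading ". ", and a `None` value would be rendered as the literal text "None". A minimal sketch of a more defensive combiner follows; `combine_text` is a hypothetical helper and not part of the original notebook.

```python
def combine_text(title, description):
    """Join title and description, skipping missing or empty parts."""
    parts = []
    for value in (title, description):
        # Treat None as missing
        if value is None:
            continue
        # Treat empty and whitespace-only strings as missing
        value = str(value).strip()
        if value:
            parts.append(value)
    return ". ".join(parts)


print(combine_text("Visit Sydney", "Capital of New South Wales."))
print(combine_text("Run a marathon", None))
```

Inside the loop, `text = combine_text(attrs.get("title"), attrs.get("description"))` would then replace the f-string.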

Next, we set up our model and create embeddings using the texts extracted from the network.

In [6]:
# Use a reasonably small but strong model
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Create embeddings (matrix of size 2890 x 384, one row per node)
embeddings = model.encode(
    texts,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

Batches: 100%|██████████| 91/91 [00:32<00:00,  2.78it/s]
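
K-means compares vectors by Euclidean distance, whereas sentence embeddings are usually compared by cosine similarity. For unit-length vectors the two agree, because the squared Euclidean distance equals 2 - 2 * cosine. The sketch below checks this identity on random stand-in vectors (not the real embeddings). Whether the embeddings produced above are already unit length is not verified here; `encode` also accepts `normalize_embeddings=True` if normalization is wanted.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real embeddings: 5 random vectors with 384 dimensions
toy_embeddings = rng.normal(size=(5, 384))

# Scale each row to unit length
norms = np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
unit_embeddings = toy_embeddings / norms

# Compare squared Euclidean distance and cosine similarity for two rows
sq_dist = np.sum((unit_embeddings[0] - unit_embeddings[1]) ** 2)
cosine = unit_embeddings[0] @ unit_embeddings[1]
print(np.isclose(sq_dist, 2 - 2 * cosine))
```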


Then we run k-means clustering with 20 clusters on the embeddings, fixing the random state for reproducibility. The number of clusters was chosen by trial and error, aiming for topics that are coherent without being too specific. Finally, we assign the topic labels back to the nodes in our graph.

In [8]:
# Choose number of clusters
num_clusters = 20

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(embeddings)

# Assign topic labels back to graph nodes
for node, topic in zip(node_ids, cluster_labels):
    G.nodes[node]["topic"] = int(topic)
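
The number of clusters above was picked by trial and error. As a complement to manual inspection, the silhouette score can be compared across candidate values of k. The sketch below is not the procedure used in this notebook: it runs on synthetic points from `make_blobs` with four hand-picked centers, and the candidate range 2 to 8 is an arbitrary assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the embedding matrix: 300 points around 4 centers
centers = [[0, 0], [10, 10], [-10, 10], [10, -10]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On real text embeddings the silhouette curve tends to be much flatter than on such well-separated toy data, so it is best read alongside the keyword inspection further down rather than as a replacement for it.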

To get an overview of how the nodes are distributed across the topics created above, we print the number of nodes assigned to each topic/cluster.

In [9]:
# Extract all topic values from nodes
topics = [G.nodes[n].get("topic") for n in G.nodes()]

topic_counts = Counter(topics)

for topic, count in sorted(topic_counts.items()):
    print(topic, count)

0 156
1 193
2 164
3 83
4 179
5 181
6 210
7 220
8 117
9 172
10 191
11 136
12 75
13 80
14 172
15 180
16 103
17 81
18 98
19 99


Cluster sizes range from 75 to 220 nodes (roughly 2.6% to 7.6% of the 2,890 nodes), so no cluster is extremely small or extremely large.
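
The size range can be checked directly from the printed counts. The list below copies them from the output above.

```python
# Cluster sizes as printed above, indexed by topic id
sizes = [156, 193, 164, 83, 179, 181, 210, 220, 117, 172,
         191, 136, 75, 80, 172, 180, 103, 81, 98, 99]

total = sum(sizes)
smallest, largest = min(sizes), max(sizes)

print(total)                             # 2890, matches the node count
print(smallest, largest)                 # 75 220
print(round(smallest / total * 100, 1))  # 2.6 (percent of all nodes)
print(round(largest / total * 100, 1))   # 7.6
```

A perfectly even split would put 144.5 nodes (5%) in each of the 20 clusters.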

Now we use a TF-IDF vectorizer, with English stop words removed and the vocabulary capped at 5,000 terms, to build a document-term matrix from the same texts as above (the titles and descriptions of the nodes).

In [10]:
vectorizer = TfidfVectorizer(
    stop_words="english",
    max_features=5000
)

tfidf_matrix = vectorizer.fit_transform(texts)
terms = vectorizer.get_feature_names_out()

Next, we define a function that returns the top terms for a given topic/cluster (ten by default), ranked by their mean TF-IDF weight across the documents in that cluster.

In [11]:
def top_terms_for_topic(topic_id, n=10):
    # find all rows belonging to this topic
    idx = [i for i, t in enumerate(cluster_labels) if t == topic_id]
    
    # average TF-IDF score for each term across documents in this topic
    sub = tfidf_matrix[idx].mean(axis=0)
    
    # get top n word indices
    top_indices = np.asarray(sub).ravel().argsort()[::-1][:n]
    
    return [terms[i] for i in top_indices]
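
To make the ranking step concrete, here is the same idea on a toy example small enough to follow by hand. All names and values are made up for illustration, and a dense NumPy array stands in for the sparse TF-IDF matrix.

```python
import numpy as np

# Toy stand-ins: 4 documents, 5 terms, and a cluster label per document
toy_terms = np.array(["island", "museum", "park", "city", "festival"])
toy_tfidf = np.array([
    [0.9, 0.0, 0.3, 0.1, 0.0],   # doc 0, cluster 0
    [0.7, 0.0, 0.5, 0.0, 0.0],   # doc 1, cluster 0
    [0.0, 0.8, 0.0, 0.4, 0.0],   # doc 2, cluster 1
    [0.0, 0.6, 0.1, 0.2, 0.3],   # doc 3, cluster 1
])
toy_labels = np.array([0, 0, 1, 1])


def toy_top_terms(topic_id, n=2):
    # Boolean mask selects the rows belonging to this cluster
    rows = toy_tfidf[toy_labels == topic_id]
    # Column means give one average weight per term
    mean_weights = rows.mean(axis=0)
    # argsort is ascending, so reverse it and keep the first n indices
    top = mean_weights.argsort()[::-1][:n]
    return [str(toy_terms[i]) for i in top]


print(toy_top_terms(0))  # ['island', 'park']
print(toy_top_terms(1))  # ['museum', 'city']
```

For cluster 0 the mean weights are 0.8 for "island" and 0.4 for "park", with every other term at 0.05 or below, which is why those two terms come first.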

Following this, we collect the unique cluster labels and print the top ten terms of each cluster for inspection.

In [12]:
unique_topics = sorted(set(cluster_labels))

for topic in unique_topics:
    words = top_terms_for_topic(topic, n=10)
    print(f"Topic {topic}: {', '.join(words)}")

Topic 0: island, islands, coast, km, mi, north, west, visit, south, largest
Topic 1: city, population, county, largest, visit, capital, province, area, lake, europe
Topic 2: zoo, park, theme, aquarium, disney, located, opened, world, resort, visit
Topic 3: australia, sydney, south, melbourne, beach, park, australian, harbour, kilometres, tasmania
Topic 4: state, states, united, national, visit, located, memorial, american, washington, historic
Topic 5: museum, art, collection, visit, located, national, history, arts, gallery, united
Topic 6: park, national, lake, canyon, state, river, visit, natural, colorado, wildlife
Topic 7: city, visit, san, argentina, park, la, el, spanish, located, brazil
Topic 8: island, islands, park, zealand, national, visit, new, reef, marine, waters
Topic 9: city, history, vibrant, offers, rich, promises, unforgettable, bustling, visit, blend
Topic 10: festival, held, attend, event, music, marathon, annual, day, world, year
Topic 11: mountain, mount, highest

While some terms appear in more than one topic, most clusters seem to capture a sound and coherent theme.

In order to use the keywords assigned to each topic/cluster for inspiration in the open coding phase, we add them as node attributes.

In [13]:
topic_keywords = {
    topic: top_terms_for_topic(topic, n=10)
    for topic in unique_topics
}

for node in G.nodes():
    t = G.nodes[node]["topic"]
    G.nodes[node]["topic_keywords_SBERT"] = topic_keywords[t]

Finally, we export the network's nodes and their attributes to Excel in order to conduct the manual open-coding process.

In [14]:
data = []

for node, attrs in G.nodes(data=True):
    row = {"node_id": node}
    row.update(attrs)
    data.append(row)

# Create dataframe
df = pd.DataFrame(data)

# Export to Excel (writing .xlsx needs an engine such as openpyxl installed)
df.to_excel("../Data/Validation/graph_nodes.xlsx", index=False)
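
The `topic_keywords_SBERT` attribute holds a Python list, and other attributes may do so as well. Such values are likely to show up in the spreadsheet as their string representation, brackets and quotes included, which is awkward to read during coding. A minimal sketch of flattening list-like values before building the dataframe follows; `flatten_value` is a hypothetical helper and the sample row is made up.

```python
def flatten_value(value):
    """Turn list or tuple attribute values into a comma-separated string."""
    if isinstance(value, (list, tuple)):
        return ", ".join(str(v) for v in value)
    # Leave scalars and strings untouched
    return value


row = {
    "node_id": 1,
    "topic": 0,
    "topic_keywords_SBERT": ["island", "islands", "coast"],
}
flat_row = {key: flatten_value(val) for key, val in row.items()}
print(flat_row["topic_keywords_SBERT"])  # island, islands, coast
```

Applied inside the export loop, each `row` would be passed through the same dictionary comprehension before being appended to `data`.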