# Topic modeling using BERTopic
## Introduction
As an alternative to the previously presented approaches, LDA and NMF, we can also use a transformer-based approach. In this example, we use BERTopic as our model and check whether it also yields meaningful results.

The goal of this experiment is to find out whether we can reduce the time needed to preprocess the data for the traditional approaches (as seen in the previous notebook).

## Preprocessing
Quote from [BERTopic GitHub](https://github.com/MaartenGr/BERTopic/issues/40):

> In general, no, you do not need to preprocess your data. Like you said, keeping the original structure of the text is especially important for transformer-based models to understand the context.
> However, there are exceptions to this. For example, if you were to have scraped documents with a lot of html tags, then it might be beneficial to remove those as they do not provide any interesting context.
> If you have paragraphs in a document, then it might be worthwhile to split up the paragraphs in order to more precisely extract the correct topics.


### Remaining tasks
Since the protocols are not HTML-based, we do not need to strip tags from them. However, the dataset contains a large number of paragraphs, so we split the speeches into paragraphs before model training. This happens in the `train_bertopic` function defined below.
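
The paragraph split can be illustrated on a toy example. This is a minimal sketch that mirrors the list comprehension used in `train_bertopic`, but works on a plain Python list instead of the `speech_content` column. The sample speeches and the helper name `split_into_paragraphs` are made up for illustration.

```python
# Hypothetical sample speeches; the real data comes from the pickled DataFrame.
speeches = [
    "Herr Präsident!\n\nWir müssen handeln.\n",
    "Vielen Dank.\n   \nDas ist der zweite Absatz.",
]

def split_into_paragraphs(texts):
    """Split each text on newlines and drop empty or whitespace-only parts."""
    return [
        part.strip()
        for text in texts
        for part in text.split("\n")
        if part.strip()
    ]

paragraphs = split_into_paragraphs(speeches)
print(paragraphs)
```

Each non-empty line becomes its own document, so the model later assigns one topic per paragraph rather than one per speech.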

As the documentation suggests using a `CountVectorizer` to remove stopwords, we pass one to the model as well.

Most of the code was already explained previously. New is the UMAP model: it reduces the dimensionality of the embeddings before they are clustered.

In [None]:
from sentence_transformers import SentenceTransformer
from tqdm.notebook import tqdm
import numpy as np
import os
import torch
import pandas as pd
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from umap import UMAP

BASE_INPUT = "../../data/dataPreprocessedStage/speechContentCleaned"
BASE_OUTPUT = "../../data/dataTopicModeling/bertopic"

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

def train_bertopic(input_pickle, output_npy, model, vectorizer_model):
	df = pd.read_pickle(input_pickle)
	paragraphs = [
		paragraph.strip()
		for paragraphs in df['speech_content'].str.split('\n')
		for paragraph in paragraphs
		if paragraph.strip()
	]
	# check if output_npy exists, and if not train embeddings
	if os.path.exists(output_npy):
		embeddings = np.load(output_npy)
		print(f"Embeddings already exist at {output_npy}, skipping training.")
	else:
		print(f"Training embeddings for {len(paragraphs)} paragraphs...")
		embeddings = train_embeddings(output_npy, paragraphs, model)
	# now check if trained bertopic model has already been saved
	if os.path.exists(output_npy.replace('.npy', '.model')):
		print(f"BERTopic model already exists at {output_npy.replace('.npy', '.model')}, skipping training.")
		return
	# train BERTopic model
	print("Training BERTopic model...")
	# calculate_probabilities=False speeds up training; probabilities can be calculated later if needed.
	# language="german" only selects BERTopic's default (multilingual) embedding model;
	# stop words are handled by the CountVectorizer passed in as vectorizer_model.
	os.makedirs(os.path.dirname(output_npy), exist_ok=True)
	topic_model = BERTopic(vectorizer_model=vectorizer_model, language="german", verbose=True, calculate_probabilities=False, umap_model=umap_model)
	print(f"len(paragraphs): {len(paragraphs)}")
	print(f"embeddings.shape: {embeddings.shape}")
	# fit_transform returns a (topics, probabilities) tuple; only the topic assignments are kept
	topics, _ = topic_model.fit_transform(paragraphs, embeddings=embeddings)
	# save the model
	topic_model.save(output_npy.replace('.npy', '.model'))
	# save topics as well
	np.save(output_npy.replace('.npy', '_topics.npy'), topics)
	#np.save(output_npy.replace('.npy', '_probs.npy'), probs)
	print(f"BERTopic model trained and saved to {output_npy.replace('.npy', '.model')}")
	#return topic_model, embeddings

def train_embeddings(output_npy, paragraphs, model):
	print(f"Number of threads used: {torch.get_num_threads()}")
	embedding_model = SentenceTransformer(model, backend='torch')
	embeddings = embedding_model.encode(paragraphs, show_progress_bar=True, batch_size=128)
	os.makedirs(os.path.dirname(output_npy), exist_ok=True)
	np.save(output_npy, embeddings)  # persist the embeddings so they can be reused
	return embeddings

### Training
We will now train the model on the data of the 19th and 20th electoral terms, just like we did with LDA and NMF.

In [7]:
german_stop_words = stopwords.words('german')
nltk_vectorizer_model = CountVectorizer(stop_words=german_stop_words)

for term in [19, 20]:
    # define paths
    input_pickle = os.path.join(BASE_INPUT, f"speech_content_cleaned_{term}.pkl")
    output_npy = os.path.join(BASE_OUTPUT, f"iteration_1/embeddings_{term}.npy")
    train_bertopic(input_pickle, output_npy, model="paraphrase-multilingual-MiniLM-L12-v2", vectorizer_model=nltk_vectorizer_model)


Embeddings already exist at bertopic/iteration_1/embeddings_19.npy, skipping training.
BERTopic model already exists at bertopic/iteration_1/embeddings_19.model, skipping training.
Embeddings already exist at bertopic/iteration_1/embeddings_20.npy, skipping training.
BERTopic model already exists at bertopic/iteration_1/embeddings_20.model, skipping training.


In [9]:
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
german_stop_words = stopwords.words('german')
nltk_vectorizer_model = CountVectorizer(stop_words=german_stop_words)

topics = np.load(os.path.join(BASE_OUTPUT, "iteration_1/embeddings_19_topics.npy"), allow_pickle=True)
topic_model = BERTopic.load(os.path.join(BASE_OUTPUT, "iteration_1/embeddings_19.model"))
# show topics
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 156871 | `-1_herr_deutschen_deutschland_schon` | [herr, deutschen, deutschland, schon, kollegen... | [Ich glaube, dass wir im Deutschen Bundestag b... |
| 1 | 0 | 27468 | `0_büttenrede_verwechseln_ach_nein` | [büttenrede, verwechseln, ach, nein, mal, scho... | [({0}), ({0}), ({0})] |
| 2 | 1 | 22544 | `1_entscheidungssatz_leere_läuft_daher` | [entscheidungssatz, leere, läuft, daher, völli... | [({1}), ({1}), ({1})] |
| 3 | 2 | 20201 | `2____` | [, , , , , , , , , ] | [({2}), ({2}), ({2})] |
| 4 | 3 | 18020 | `3_ziel___` | [ziel, , , , , , , , , ] | [({3}), ({3}), ({3})] |
| ... | ... | ... | ... | ... | ... |
| 1179 | 1178 | 10 | `1178_wertpapierinstituten_wertpapierfirmen_ric...` | [wertpapierinstituten, wertpapierfirmen, richt... | [Jetzt ist die Frage, wie wir das innerhalb de... |
| 1180 | 1179 | 10 | `1179_hartz_iv_damoklesschwert_paternalistisches` | [hartz, iv, damoklesschwert, paternalistisches... | [Aber die Analyse, die Sie in Ihrem Antrag vor... |
| 1181 | 1180 | 10 | `1180_pestizide_pestiziden_exportiert_doppelsta...` | [pestizide, pestiziden, exportiert, doppelstan... | [Da ich jetzt noch ein bisschen Zeit habe: Auc... |
| 1182 | 1181 | 10 | `1181_vögel_fledermäuse_insekten_lungen` | [vögel, fledermäuse, insekten, lungen, windkra... | [Die Menschen, die im Umfeld von Windindustrie... |


## Outcome
As the table above shows, the model produced a list of topics. However, the representations are not yet precise or meaningful enough to give a detailed overview of the topics. Therefore, we switch to a bigger embedding model and compare the results.


In [None]:
for term in [19, 20, "19_20"]:
	# define paths
	model = "paraphrase-multilingual-mpnet-base-v2"
	input_pickle = os.path.join(BASE_INPUT, f"speech_content_cleaned_{term}.pkl")
	output_npy = os.path.join(BASE_OUTPUT, model, f"embeddings_{term}.npy")
	train_bertopic(input_pickle, output_npy, model, vectorizer_model=nltk_vectorizer_model)


Embeddings already exist at dataProcessedStage/topicModeling/paraphrase-multilingual-mpnet-base-v2/embeddings_19.npy, skipping training.
BERTopic model already exists at dataProcessedStage/topicModeling/paraphrase-multilingual-mpnet-base-v2/embeddings_19.model, skipping training.
Embeddings already exist at dataProcessedStage/topicModeling/paraphrase-multilingual-mpnet-base-v2/embeddings_20.npy, skipping training.
BERTopic model already exists at dataProcessedStage/topicModeling/paraphrase-multilingual-mpnet-base-v2/embeddings_20.model, skipping training.
Embeddings already exist at dataProcessedStage/topicModeling/paraphrase-multilingual-mpnet-base-v2/embeddings_19_20.npy, skipping training.
Training BERTopic model...
len(paragraphs): 1001726
embeddings.shape: (1001726, 768)


2025-06-15 13:16:20,832 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm



In [4]:
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words)

topics = np.load(os.path.join(BASE_OUTPUT, "paraphrase-multilingual-mpnet-base-v2/embeddings_19_topics.npy"), allow_pickle=True)
topic_model = BERTopic.load(os.path.join(BASE_OUTPUT, "paraphrase-multilingual-mpnet-base-v2/embeddings_19.model"))
# show topics
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 162398 | `-1_herr_antrag_müssen_afd` | [herr, antrag, müssen, afd, deutschland, schon, sagen, mal, ja, heute] | [– Gut. Jetzt lassen Sie mich doch mal darüber reden., Ich gebe der Opposition recht, die uns vorwirft, dass wir sehr lange gebraucht haben etc. Da haben Sie einen Punkt gemacht. Wir sind in einer... |
| 1 | 0 | 27469 | `0_nullkommagarnichts_büttenrede_verwechseln_ach` | [nullkommagarnichts, büttenrede, verwechseln, ach, nein, mal, schon, , , ] | [({0}), ({0}), ({0})] |
| 2 | 1 | 22542 | `1____` | [, , , , , , , , , ] | [({1}), ({1}), ({1})] |
| 3 | 2 | 20201 | `2____` | [, , , , , , , , , ] | [({2}), ({2}), ({2})] |
| 4 | 3 | 18019 | `3____` | [, , , , , , , , , ] | [({3}), ({3}), ({3})] |
| ... | ... | ... | ... | ... | ... |
| 1186 | 1185 | 10 | `1185_auswärtigen_diplomaten_ortsfesten_amt` | [auswärtigen, diplomaten, ortsfesten, amt, spezialisten, amtes, rotation, dienstposten, auslandsbezug, ausländerrechtes] | [Der diplomatische Dienst ist durch zwei Prinzipien gekennzeichnet: Die Angehörigen des diplomatischen Dienstes müssen zur Rotation bereit sein, und sie verstehen sich als Generalisten. Das heißt ... |
| 1187 | 1186 | 10 | `1186_aushandlungssystem_schimke_tarifverträge_lohnsenkungen` | [aushandlungssystem, schimke, tarifverträge, lohnsenkungen, unternehmenszentralen, auffangböden, bargeldauszahlungen, osten, tarifverträgen, menschenrechtsthema] | [Ein weiterer Punkt ist: Dort, wo wir Regelungen haben – es wurde schon von Frau Schimke genannt, der ich sonst selten zustimme –, beispielsweise bei den Tarifverträgen, bei den Löhnen, haben wir ... |
| 1188 | 1187 | 10 | `1187_regierungsansprache_wohlstandsindex_hauptgegenstand_einprogrammiert` | [regierungsansprache, wohlstandsindex, hauptgegenstand, einprogrammiert, premium, dazugeben, einigungsprozesses, ropäische, regionalbanken, partizipiert] | [Die Krise wird benutzt: zum einen, wie erwähnt, für den Umbau der Wirtschaft – man könnte auch sagen: für die Abschaffung des Wirtschaftsstandorts Deutschland –, zum anderen für den Marsch in die... |
| 1189 | 1188 | 10 | `1188_dieselfahrer_fahrverbote_hardwarenachrüstungen_dieselbetrug` | [dieselfahrer, fahrverbote, hardwarenachrüstungen, dieselbetrug, hotline, dieselfahrverbote, pflichtnachrüstungen, dieselmodelle, schadstoffverhinderung, großflächigste] | [Herr Präsident! Liebe Kolleginnen und Kollegen! Millionen Dieselfahrer sind verunsichert wegen der Fahrverbote und der immensen Wertverluste. Ein riesiger volkswirtschaftlicher Schaden entsteht. ... |


## BERTopic Analysis and Comparison to LDA/NMF

After applying the traditional topic modeling approaches, LDA and NMF, we observed that both models yielded interpretable and semantically meaningful topic representations. The extracted topics typically consisted of clear, thematically coherent keyphrases and could be readily mapped to real-world political domains or current debates.

In contrast, the **BERTopic**-based approach, despite leveraging advanced transformer embeddings, has thus far produced less interpretable results. As seen in the `get_topic_info()` output, the automatically assigned topic representations are either dominated by generic, high-frequency tokens (such as "herr", "afd", "müssen", "antrag") or occasionally by unrelated or placeholder-like terms (e.g., "nullkommagarnichts"). Some topics (e.g., index 1) consist entirely of empty strings, suggesting issues in preprocessing, topic assignment, or model configuration.
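
To quantify how many topics are affected, one could compute the share of empty strings per representation. The sketch below does this on a few rows copied from the table above. The helper name `empty_share` is made up for illustration; on the real model, the lists would come from the `Representation` column of `get_topic_info()`.

```python
# Representations copied from the first rows of the mpnet table above.
representations = {
    0: ["nullkommagarnichts", "büttenrede", "verwechseln", "ach", "nein", "mal", "schon", "", "", ""],
    1: [""] * 10,
    2: [""] * 10,
    3: [""] * 10,
}

def empty_share(words):
    """Return the fraction of entries that are empty strings."""
    return sum(1 for word in words if not word) / len(words)

# Topics whose representation contains no words at all.
fully_empty = [topic for topic, words in representations.items() if empty_share(words) == 1.0]
print(fully_empty)
```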

### **Potential Causes**
- **Input granularity:** As we already observed in the first topic modeling notebook, input length varies a lot between speeches. Filtering out very short speeches might help here as well.
- **Stopword handling:** The `CountVectorizer` removes German stopwords only when the topic representations are built. The embedding model still sees them, and frequent parliamentary terms that are not on the NLTK list (such as "herr" or "afd") remain in the topic names and representations.
- **Representation calculation:** Probabilities were not calculated (to save compute). This setting affects the document-topic probabilities rather than the c-TF-IDF topic representations, so it is unlikely to be the main cause of the empty representations.
- **Embeddings:** The difference between the smaller ("MiniLM") and larger ("mpnet") embedding models is visible but does not fully solve the interpretability issue, possibly due to the factors above.

### **Possible solutions**:
- **Preprocessing:** Although it was mentioned that preprocessing is generally not needed, the results show that at least some preprocessing could help improve the topics' coherence and meaningfulness.

#### Should we compare embeddings-based topic models on data with stopwords removed or lemmatized?
Since we already preprocessed and stored the data for topic modeling with LDA and NMF, the question is whether that preprocessed data can help us here as well.
##### 1. **Context Preservation**
Transformer models (like BERT or models used in BERTopic) are trained on large corpora with “normal” text, including function words (stopwords) and inflections. They exploit the syntactic and semantic context provided by these words to build their representations.  
- **Lemmatization** removes morphological variety, which can lead to loss of subtle meanings or syntactic information that these models leverage.
- **Stopword removal** might remove words that, while not individually meaningful, contribute to overall meaning and fluency at the sentence or paragraph level.

##### 2. **Intended Use vs. Classic Models**
- **Bag-of-words methods (LDA, NMF):** Work best on preprocessed, lemmatized, and stopword-removed texts because they have no understanding of structure or context.
- **Embeddings-based models:** Designed to work with raw or lightly cleaned text.

##### 3. **Conclusion for the Project**
- Re-computing the embeddings on the preprocessed data would not lead to the coherent topic model we are trying to achieve.
- The next step is a combination: new embeddings are computed, again on the paragraph list (not lemmatized, no stopwords removed).
- However, since we previously identified 50 words as a good minimum length, we apply that threshold here as well to filter out speeches that cannot be directly assigned to a topic.
- Additionally, the `CountVectorizer` now uses the spaCy stopword list instead of the one provided by NLTK. Because it is bigger and more up-to-date, it might improve the results as well.
- We keep the bigger embedding model, as the difference was already visible to a certain degree.

In [None]:
# Drop short speeches (fewer than 50 words)

def strip_short_speeches(df, min_words=50):
	# Count whitespace-separated words without adding a helper column to the caller's DataFrame
	word_count = df['speech_content'].str.split().str.len()
	return df[word_count >= min_words]

for term in [19, 20]:
    # define paths
    input_pickle = os.path.join(BASE_INPUT, f"speech_content_cleaned_{term}.pkl")
    output_pickle = os.path.join(BASE_OUTPUT, f"iteration_3/speech_content_stripped_{term}.pkl")
    df = pd.read_pickle(input_pickle)
    df = strip_short_speeches(df, min_words=50)
    # check first if output directory exists
    if not os.path.exists(os.path.dirname(output_pickle)):
        os.makedirs(os.path.dirname(output_pickle))
    df.to_pickle(output_pickle)

In [14]:
import spacy

spacy_instance = spacy.load("de_core_news_sm")
german_stop_words = spacy_instance.Defaults.stop_words
spacy_vectorizer_model = CountVectorizer(stop_words=german_stop_words)

for term in [19, 20]:
    # define paths
    input_pickle = os.path.join(BASE_OUTPUT, f"iteration_3/speech_content_stripped_{term}.pkl")
    output_npy = os.path.join(BASE_OUTPUT, f"iteration_3/embeddings_{term}.npy")
    train_bertopic(input_pickle, output_npy, model="paraphrase-multilingual-mpnet-base-v2", vectorizer_model=spacy_vectorizer_model)


Embeddings already exist at ../dataProcessedStage/topicModeling/bertopic/iteration_3/embeddings_19.npy, skipping training.
Training BERTopic model...
len(paragraphs): 459009
embeddings.shape: (459009, 768)


2025-06-21 18:12:37,598 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-06-21 18:31:46,193 - BERTopic - Dimensionality - Completed ✓
2025-06-21 18:31:46,230 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

InvalidParameterError: The 'stop_words' parameter of CountVectorizer must be a str among {'english'}, an instance of 'list' or None. Got {'ag', 'davor', 'ihre', 'rechter', 'erster', 'teil', 'na', 'etwas', 'jedem', 'daneben', 'eben', 'seinem', 'geht', 'vierte', 'danach', 'dass', 'irgend', 'sind', 'immer', 'dessen', 'dort', 'hätte', 'recht', 'tagen', 'gewesen', 'muss', 'erste', 'grosses', 'wart', 'derselben', 'vielen', 'wann', 'gemocht', 'zum', 'zugleich', 'besonders', 'um', 'je', 'kleinen', 'demgemäss', 'jenes', 'manches', 'acht', 'sechsten', 'viel', 'demgemäß', 'wo', 'ganzen', 'nachdem', 'eine', 'dieser', 'neunten', 'derjenigen', 'alle', 'welchem', 'könnt', 'hoch', 'wollten', 'wenigstens', 'zwei', 'sechster', 'ersten', 'rechte', 'auf', 'denselben', 'ihm', 'großer', 'drittes', 'einmal', 'bereits', 'sich', 'eigene', 'welcher', 'seit', 'manchem', 'diejenige', 'daran', 'los', 'vierten', 'wenn', 'gehabt', 'siebenter', 'nicht', 'alles', 'ende', 'selbst', 'uhr', 'übrigens', 'wollt', 'dritter', 'auch', 'her', 'natürlich', 'durften', 'hatten', 'daher', 'tat', 'ob', 'nur', 'so', 'manchen', 'vier', 'jahr', 'früher', 'man', 'eigener', 'deshalb', 'sie', 'geschweige', 'außer', 'dritte', 'siebenten', 'oben', 'ehrlich', 'machte', 'viele', 'meine', 'weniger', 'kannst', 'würde', 'ganzes', 'weil', 'dein', 'damit', 'von', 'das', 'seine', 'gute', 'als', 'durfte', 'durchaus', 'ihrem', 'bei', 'kleines', 'sieben', 'siebter', 'dieses', 'macht', 'mich', 'dafür', 'derselbe', 'einige', 'vergangene', 'gekannt', 'hast', 'willst', 'zehnten', 'deswegen', 'jenen', 'war', 'unserer', 'an', 'grossen', 'solcher', 'sollen', 'demzufolge', 'deren', 'wollen', 'oft', 'dagegen', 'dahinter', 'später', 'dementsprechend', 'allein', 'sein', 'über', 'darüber', 'gegen', 'ohne', 'a', 'wohl', 'daselbst', 'fünfter', 'drei', 'diesem', 'solang', 'gemacht', 'dasein', 'rechtes', 'wieder', 'beiden', 'weiteres', 'kommt', 'erstes', 'gutes', 'aber', 'große', 'achtes', 'sonst', 'ach', 'bald', 'niemandem', 'offen', 'daß', 
'seiner', 'sollte', 'geworden', 'besser', 'werdet', 'dazwischen', 'anderem', 'morgen', 'bis', 'mancher', 'möchte', 'ab', 'seien', 'darauf', 'rechten', 'anderen', 'dahin', 'allgemeinen', 'weiter', 'achte', 'zeit', 'also', 'zur', 'weitere', 'kommen', 'dich', 'solchem', 'mehr', 'gibt', 'kann', 'leider', 'hätten', 'dank', 'weit', 'neunte', 'würden', 'jemanden', 'grosser', 'magst', 'mag', 'leicht', 'lang', 'siebentes', 'solchen', 'dermaßen', 'gegenüber', 'solche', 'deine', 'wie', 'neben', 'uns', 'wen', 'warum', 'gleich', 'neue', 'groß', 'müssen', 'demgegenüber', 'sechste', 'kein', 'großen', 'du', 'jetzt', 'damals', 'siebtes', 'zwar', 'soll', 'vom', 'bisher', 'dieselben', 'wer', 'im', 'siebte', 'hin', 'musst', 'davon', 'da', 'einen', 'grosse', 'ganze', 'andern', 'viertes', 'jedermanns', 'sagte', 'zuerst', 'demselben', 'kurz', 'kaum', 'einiges', 'kleiner', 'richtig', 'jemandem', 'währenddem', 'vierter', 'neun', 'ihr', 'eigenes', 'ich', 'diejenigen', 'darum', 'deiner', 'fünf', 'mögen', 'mittel', 'infolgedessen', 'keinem', 'denen', 'jemand', 'nichts', 'sah', 'zwischen', 'gemusst', 'ist', 'haben', 'seitdem', 'dermassen', 'vergangenen', 'jedoch', 'was', 'ihrer', 'wirst', 'seid', 'noch', 'gehen', 'während', 'welche', 'wird', 'sagt', 'bekannt', 'jenem', 'darfst', 'neuen', 'besten', 'allen', 'ein', 'einer', 'diese', 'nun', 'er', 'will', 'keinen', 'einem', 'allerdings', 'jedermann', 'elf', 'mir', 'ganz', 'heisst', 'zunächst', 'seinen', 'für', 'mussten', 'entweder', 'keiner', 'beim', 'konnten', 'niemanden', 'heißt', 'zusammen', 'gab', 'ihres', 'zweiten', 'wäre', 'wurde', 'rund', 'sechs', 'mochten', 'sechstes', 'dadurch', 'jede', 'fünften', 'ausserdem', 'genug', 'fünfte', 'nach', 'guter', 'lange', 'statt', 'währenddessen', 'gerade', 'vielleicht', 'satt', 'dasselbe', 'solches', 'sehr', 'siebten', 'mochte', 'schlecht', 'jeder', 'diesen', 'ihren', 'ebenso', 'durch', 'überhaupt', 'eigenen', 'weniges', 'darf', 'nahm', 'endlich', 'dabei', 'niemand', 'einmaleins', 'muß', 'dies', 'wem', 
'dürfen', 'neunter', 'drin', 'mögt', 'gekonnt', 'deinem', 'wollte', 'heute', 'wenig', 'ja', 'und', 'aus', 'kam', 'außerdem', 'etwa', 'wirklich', 'ganzer', 'gedurft', 'jene', 'schon', 'aller', 'trotzdem', 'jahre', 'seines', 'weiteren', 'müsst', 'zehn', 'sei', 'ging', 'dann', 'mit', 'es', 'jener', 'werden', 'zehnte', 'gesagt', 'dieselbe', 'werde', 'habt', 'fünftes', 'können', 'zweite', 'sondern', 'achten', 'die', 'siebente', 'am', 'worden', 'sowie', 'einander', 'zehnter', 'zu', 'doch', 'á', 'daraus', 'sollten', 'beispiel', 'zehntes', 'waren', 'darin', 'bist', 'zwanzig', 'derjenige', 'dritten', 'gross', 'manche', 'gewollt', 'eigen', 'einigen', 'den', 'denn', 'tun', 'darunter', 'wir', 'hier', 'zweiter', 'hinter', 'einiger', 'hatte', 'meinen', 'en', 'mein', 'indem', 'tel', 'allem', 'ihn', 'vor', 'ins', 'lieber', 'dir', 'wahr', 'euch', 'welches', 'meines', 'jeden', 'unter', 'anders', 'tage', 'desselben', 'in', 'oder', 'konnte', 'meiner', 'nein', 'keine', 'der', 'musste', 'nie', 'möglich', 'jahren', 'welchen', 'vielem', 'großes', 'gut', 'neuntes', 'tag', 'bin', 'machen', 'beide', 'des', 'andere', 'dazu', 'könnte', 'eines', 'hat', 'kleine', 'erst', 'dem', 'ausser', 'achter', 'habe', 'unsere', 'gar', 'ihnen', 'meinem', 'zurück', 'dürft', 'gern', 'unser', 'zweites', 'wurden', 'wessen', 'wenige', 'wegen'} instead.

## Error
Unlike NLTK, spaCy provides the stopwords as a set, not a list. We therefore have to convert it to a list before passing it to the `CountVectorizer`.
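
A minimal sketch of the conversion, using a small made-up set in place of spaCy's stop word set. Using `sorted` instead of `list` is an optional choice here: it gives a reproducible ordering, which plain set iteration does not guarantee.

```python
# Hypothetical stand-in for spacy_instance.Defaults.stop_words, which is a set.
stop_words_set = {"und", "oder", "aber"}

# According to the error message above, a list (or None) is expected, so convert explicitly.
stop_words_list = sorted(stop_words_set)

print(type(stop_words_list).__name__, stop_words_list)
```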

In [18]:
import spacy

spacy_instance = spacy.load("de_core_news_sm")
german_stop_words = list(spacy_instance.Defaults.stop_words)
spacy_vectorizer_model = CountVectorizer(stop_words=german_stop_words)

for term in [19, 20]:
    # define paths
    input_pickle = os.path.join(BASE_OUTPUT, f"iteration_3/speech_content_stripped_{term}.pkl")
    output_npy = os.path.join(BASE_OUTPUT, f"iteration_3/embeddings_{term}.npy")
    train_bertopic(input_pickle, output_npy, model="paraphrase-multilingual-mpnet-base-v2", vectorizer_model=spacy_vectorizer_model)


Embeddings already exist at ../dataProcessedStage/topicModeling/bertopic/iteration_3/embeddings_19.npy, skipping training.
BERTopic model already exists at ../dataProcessedStage/topicModeling/bertopic/iteration_3/embeddings_19.model, skipping training.
Embeddings already exist at ../dataProcessedStage/topicModeling/bertopic/iteration_3/embeddings_20.npy, skipping training.
BERTopic model already exists at ../dataProcessedStage/topicModeling/bertopic/iteration_3/embeddings_20.model, skipping training.


In [21]:
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words)

topics = np.load(os.path.join(BASE_OUTPUT, "iteration_3/embeddings_19_topics.npy"), allow_pickle=True)
topic_model = BERTopic.load(os.path.join(BASE_OUTPUT, "iteration_3/embeddings_19.model"))
# show topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))

   Topic   Count                                     Name  \
0     -1  167329           -1_herr_afd_deutschland_antrag   
1      0   24676            0_büttenrede_verwechseln_mal_   
2      1   22262                                    1____   
3      2   20179                                    2____   
4      3   18017                        3_aufhören_ziel__   
5      4   15750                                    4____   
6      5   13440                   5_oh_nennt_arzt_punkte   
7      6   11139     6_basispunkte_kumuliert_zuruf_gründe   
8      7    9058  7_thebaner_kundigen_biblische_testament   
9      8    7273                                  8_uh___   

                                      Representation  \
0  [herr, afd, deutschland, antrag, sagen, mal, k...   
1       [büttenrede, verwechseln, mal, , , , , , , ]   
2                               [, , , , , , , , , ]   
3                               [, , , , , , , , , ]   
4                   [aufhören, ziel, , , , , , ,

### Analysis of Empty or Low-Content Topics in BERTopic

Several topics with either blank or underscored labels and no significant representative words can be observed here. This phenomenon typically appears in the `topic_info` table, where certain topics are denoted by names such as `2____` and a largely empty word list under `Representation`.

This issue arises when the cluster of documents assigned to a topic lacks cohesive, meaningful vocabulary after stopword removal and other preprocessing. In the specific context of parliamentary records, it is mainly attributable to recurring formal phrases, greetings, speaker tags, or ultra-short utterances (such as approvals or heckles). When such segments are grouped into a topic, they fail to yield distinguishing keywords for the vectorizer, resulting in apparently "empty" or non-interpretable topics.

From a methodological perspective, these empty topics are important diagnostics:

- **They indicate the presence of noise or non-informative text that remains after initial preprocessing.**
- **They highlight limitations of the model and vectorization settings, for instance overly aggressive `min_df` or `max_features` filters, or insufficient cleansing of domain-specific boilerplate.**
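
One plausible contributor, stated here as an assumption rather than a verified cause: by default, scikit-learn's `CountVectorizer` lowercases the text and only keeps tokens of at least two word characters (default `token_pattern` `(?u)\b\w\w+\b`). The sketch below reproduces that tokenization with the standard library to show which short paragraphs leave no vocabulary behind. The stop word set, the helper name `surviving_tokens`, and the sample paragraphs are made up, apart from the `({0})` placeholder that appears under `Representative_Docs` in the tables above.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical mini stop word list, for illustration only.
STOP_WORDS = {"vielen", "dank", "ja", "nein"}

def surviving_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop stop words, roughly as CountVectorizer would."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stop_words]

samples = ["({0})", "({12})", "Vielen Dank!", "Die Bundeswehr braucht mehr Soldaten."]
for sample in samples:
    print(repr(sample), "->", surviving_tokens(sample))
```

Paragraphs that map to an empty list contribute nothing to a topic's vocabulary, so a cluster made up only of such paragraphs ends up with an empty representation.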

**Actions to take next:**

- We will use the same embedding model.
- However, we now define a fixed number of topics. We use 15, as this was already found to be a good number in the previous notebook.
- We use BERTopic's c-TF-IDF model (`ClassTfidfTransformer` with `reduce_frequent_words=True`) to reduce the impact of frequent, stopword-like terms.
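
For intuition, c-TF-IDF scores a term within a topic by its normalized frequency in that topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the term's frequency across all topics. As far as the BERTopic documentation describes it, `reduce_frequent_words=True` additionally takes the square root of the normalized term frequency. The following is a simplified pure-Python sketch of that idea on made-up data, not BERTopic's actual implementation; the function name `c_tf_idf` and the toy topics are assumptions.

```python
import math
from collections import Counter

def c_tf_idf(class_docs, reduce_frequent_words=False):
    """Score terms per class: normalized tf (optionally square-rooted) times log(1 + A / f)."""
    counts = {
        label: Counter(token for doc in docs for token in doc.split())
        for label, docs in class_docs.items()
    }
    # f: frequency of each term across all classes
    total_per_term = Counter()
    for counter in counts.values():
        total_per_term.update(counter)
    # A: average number of words per class
    avg_words = sum(sum(c.values()) for c in counts.values()) / len(counts)

    scores = {}
    for label, counter in counts.items():
        n_words = sum(counter.values())
        scores[label] = {}
        for term, freq in counter.items():
            tf = freq / n_words
            if reduce_frequent_words:
                tf = math.sqrt(tf)
            scores[label][term] = tf * math.log(1 + avg_words / total_per_term[term])
    return scores

# Made-up toy topics; "herr" occurs in both and acts like a stopword.
toy_topics = {
    "defence": ["herr soldaten bundeswehr", "herr bundeswehr"],
    "health": ["herr pflege", "herr pflege patienten"],
}
plain = c_tf_idf(toy_topics)
reduced = c_tf_idf(toy_topics, reduce_frequent_words=True)
print(sorted(plain["defence"], key=plain["defence"].get, reverse=True))
print(sorted(reduced["defence"], key=reduced["defence"].get, reverse=True))
```

In this toy data, the generic word `herr` loses weight relative to the rarer `soldaten` once the square root is applied.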

In [12]:
def train_bertopic(input_pickle, output_npy, embeddings_model, vectorizer_model,
                   min_topic_size=50, nr_topics=None, calculate_probabilities=False, umap_model=None, ctfidf_model=None):
    df = pd.read_pickle(input_pickle)
    paragraphs = [
        paragraph.strip()
        for paragraphs in df['speech_content'].str.split('\n')
        for paragraph in paragraphs
        if paragraph.strip()
    ]
    # Embeddings
    if os.path.exists(output_npy):
        embeddings = np.load(output_npy)
        print(f"Embeddings already exist at {output_npy}, skipping training.")
    else:
        print(f"Training embeddings for {len(paragraphs)} paragraphs...")
        embeddings = train_embeddings(output_npy, paragraphs, embeddings_model)

    # BERTopic Model Check
    if os.path.exists(output_npy.replace('.npy', '.model')):
        print(f"BERTopic model already exists at {output_npy.replace('.npy', '.model')}, skipping training.")
        return

    if not os.path.exists(os.path.dirname(output_npy)):
        os.makedirs(os.path.dirname(output_npy))
        
    # UMAP Model Fallback/Default
    if umap_model is None:
        from umap import UMAP
        umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

    print(f"Training BERTopic model (min_topic_size={min_topic_size}, nr_topics={nr_topics})...")
    
    topic_model = BERTopic(
        vectorizer_model=vectorizer_model,
        language="german",
        verbose=True,
        calculate_probabilities=calculate_probabilities,
        umap_model=umap_model,
        min_topic_size=min_topic_size,
        nr_topics=nr_topics,
        ctfidf_model=ctfidf_model
    )
    print(f"len(paragraphs): {len(paragraphs)}")
    print(f"embeddings.shape: {embeddings.shape}")
    # fit_transform returns a (topics, probabilities) tuple; only the topic assignments are kept
    topics, _ = topic_model.fit_transform(paragraphs, embeddings=embeddings)
    topic_model.save(output_npy.replace('.npy', '.model'))
    np.save(output_npy.replace('.npy', '_topics.npy'), topics)
    print(f"BERTopic model trained and saved to {output_npy.replace('.npy', '.model')}")

In [15]:
from bertopic.vectorizers import ClassTfidfTransformer

spacy_vectorizer_model = CountVectorizer(stop_words=german_stop_words,
                                         min_df=5,
                                         max_df=0.99)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

for term in [19]:
    # define paths
    input_pickle = os.path.join(BASE_OUTPUT, f"iteration_3/speech_content_stripped_{term}.pkl")
    output_npy = os.path.join(BASE_OUTPUT, f"iteration_3/embeddings_{term}.npy")
    train_bertopic(input_pickle, output_npy, embeddings_model="paraphrase-multilingual-mpnet-base-v2", vectorizer_model=spacy_vectorizer_model, nr_topics=15, ctfidf_model=ctfidf_model)


2025-06-22 22:40:30,515 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm


Embeddings already exist at bertopic/iteration_3/embeddings_19.npy, skipping training.
Training BERTopic model (min_topic_size=50, nr_topics=15)...
len(paragraphs): 459009
embeddings.shape: (459009, 768)


2025-06-22 23:07:01,316 - BERTopic - Dimensionality - Completed ✓
2025-06-22 23:07:01,337 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-22 23:07:46,674 - BERTopic - Cluster - Completed ✓
2025-06-22 23:07:46,676 - BERTopic - Representation - Extracting topics using c-TF-IDF for topic reduction.
2025-06-22 23:07:53,223 - BERTopic - Representation - Completed ✓
2025-06-22 23:07:53,242 - BERTopic - Topic reduction - Reducing number of topics
2025-06-22 23:07:53,433 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-22 23:07:58,428 - BERTopic - Representation - Completed ✓
2025-06-22 23:07:58,452 - BERTopic - Topic reduction - Reduced number of topics from 296 to 15


BERTopic model trained and saved to bertopic/iteration_3/embeddings_19.model


In [16]:
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
german_stop_words = stopwords.words('german')
vectorizer_model = CountVectorizer(stop_words=german_stop_words)

topics = np.load(os.path.join(BASE_OUTPUT, "iteration_3/embeddings_19_topics.npy"), allow_pickle=True)
topic_model = BERTopic.load(os.path.join(BASE_OUTPUT, "iteration_3/embeddings_19.model"))
# show topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(10))

   Topic   Count                                   Name  \
0     -1  141259          -1_herr_müssen_deutschland_ja   
1      0  169781                          0_10_11_12_13   
2      1   53074  1_soldaten_bundeswehr_europa_russland   
3      2   52801             2_euro_mehr_müssen_prozent   
4      3   14008         3_dank_vielen_herzlichen_danke   
5      4   12793      4_pflege_patienten_pandemie_virus   
6      5    6736        5_frauen_familien_kinder_eltern   
7      6    3589     6_präsident_präsidentin_herr_liebe   
8      7    2562        7_sea_guardian_tourismus_kultur   
9      8    1670            8_lehnen_ablehnen_ab_antrag   

                                      Representation  \
0  [herr, müssen, deutschland, ja, mehr, kollegen...   
1           [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]   
2  [soldaten, bundeswehr, europa, russland, deuts...   
3  [euro, mehr, müssen, prozent, bildung, brauche...   
4  [dank, vielen, herzlichen, danke, aufmerksamke...   
5  [pflege, pa

## Summary and evaluation
Across the iterations, we got closer to coherent, meaningful topics. However, the topics still contain a lot of noise and, to a certain extent, stopwords.

Contrary to expectations, we **had to preprocess** the data, although we did not need lemmatization or punctuation removal.

The choice of embedding model makes a visible difference. However, every iteration required a lot of time and computing power.

In the end, both approaches (the traditional models and BERTopic) produced a topic overview.