# **BERTopic - Tutorial**
We start by installing BERTopic from PyPI before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!

# **Prepare data**

In [1]:
import json
import math
import pprint
import string
from typing import List

import gensim
import gensim.corpora as corpora
import ijson
import networkx as nx
import nl_core_news_sm
import nltk
import numpy as np
import pandas as pd
import spacy
from bertopic import BERTopic
from bertopic.backend import languages
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics.pairwise import cosine_similarity



In [8]:
# check how the data looks loaded from csv
df1 = pd.read_csv('2+actors_per_topic.csv')
df1

Unnamed: 0.1,Unnamed: 0,title,paragraph_num,paragraph_text,actors_in_paragraph,actors_list,2_plus_actors,numb_unique_actors
0,292,Antwoord op vragen van de leden Van Raan en Te...,5,Vragen van de leden Van Raan en Teunissen (bei...,"['Raan', 'Teunissen', 'Hoekstra']","['Raan', 'Teunissen', 'Hoekstra']",True,3
1,357,Verslag van een schriftelijk overleg met de mi...,10,"Koffeman (PvdD), Faber-Van de Klashorst (PVV),...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...",True,9
2,385,Verslag van een schriftelijk overleg met de mi...,38,"Zo nee, waarom niet?\nEn deelt u de mening dat...","['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2
3,396,Verslag van een schriftelijk overleg met de mi...,49,"Zo ja, kunt u aangeven waarom u SATL niet de g...","['Adegeest', 'van der Linden']","['Adegeest', 'van der Linden']",True,2
4,492,Verslag van een schriftelijk overleg met de mi...,145,Op 29 september jl. nam de Tweede Kamer een mo...,"['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2
...,...,...,...,...,...,...,...,...
2926,222774,"Begroting Landbouw, Natuur en Voedselkwaliteit...",968,"Ik wil ook heel graag naar de lunch, maar toch...","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']",True,2
2927,223192,2 PAGW brief ministeries LNV en IenW,335,waterschap Aa en Maas\nwaterschap\nScheldestro...,"['Aa', 'Maas']","['Aa', 'Maas']",True,2
2928,223267,3 PAGW voorstel Waddenprovincies,14,3e tranche Waddengebied Geachte mevrouw Van Du...,"['Duin', 'Beekman']","['Duin', 'Beekman']",True,2
2929,223363,3 PAGW voorstel Waddenprovincies,110,"# drs. A.A.M. Brok, voorzitter #@ValidSign_Ond...","['Brok', 'Schepers']","['Brok', 'Schepers']",True,2


# **Splitting up the documents into paragraphs**
BERT topic modeling based on paragraphs instead of whole documents has several advantages, including:

1. Improved granularity: Topic modeling based on paragraphs allows for a more fine-grained analysis of text data. It allows for a better understanding of the themes and topics within a larger document, which can help with more precise and accurate categorization of text data.

2. Better representation of content: Analyzing individual paragraphs rather than whole documents can provide a more accurate representation of the content in a given document. This is particularly important for longer documents where the content can vary significantly across different sections.

3. Better results for shorter documents: BERT-based topic modeling can be challenging for short documents, as there may not be enough information to generate meaningful topics. Analyzing individual paragraphs can provide more reliable results for shorter documents.

4. Ability to identify multiple topics: BERT topic modeling based on paragraphs can help identify multiple topics within a single document, which can be particularly useful in cases where there are multiple themes or subtopics.

Overall, BERT topic modeling based on paragraphs can provide a more detailed and accurate analysis of text data compared to analyzing whole documents.

In [56]:
import spacy
import pandas as pd

# Load the Dutch spaCy model once instead of on every function call
nlp_splitter = spacy.load('nl_core_news_lg')

def split_into_paragraphs(text, threshold=0.6):
    """Group consecutive sentences into paragraphs based on vector similarity."""
    doc = nlp_splitter(text)
    paragraphs = []
    current_paragraph = ''

    for sentence in doc.sents:
        if len(current_paragraph) == 0:
            current_paragraph = str(sentence)
        else:
            similarity = sentence.similarity(nlp_splitter(current_paragraph))
            if similarity < threshold:  # too dissimilar: start a new paragraph
                paragraphs.append(current_paragraph)
                current_paragraph = str(sentence)
            else:
                current_paragraph += '\n' + str(sentence)

    paragraphs.append(current_paragraph)  # add last paragraph
    return paragraphs

# collect one row per paragraph; DataFrame.append was removed in pandas 2.0
rows = []

# loop over documents in the dataset and split each one into paragraphs
# (dataDF is not defined in this notebook; it is assumed to be the
# document-level dataframe with 'title' and 'content' columns)
for i, row in dataDF.iterrows():
    title = row['title']
    content = row['content']
    paragraphs = split_into_paragraphs(content)

    # add each paragraph to the list of rows
    for j, paragraph in enumerate(paragraphs):
        rows.append({'title': title, 'paragraph_num': j+1, 'paragraph_text': paragraph})

# create new dataframe for paragraphs and save it to a csv file
paragraphsDF = pd.DataFrame(rows, columns=['title', 'paragraph_num', 'paragraph_text'])
paragraphsDF.to_csv('paragraphs_dataset.csv', index=False)
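
To see the thresholding logic in isolation, here is a minimal sketch that swaps spaCy's vector similarity for a simple bag-of-words cosine similarity, so it runs without any model. The function names `cosine` and `group_sentences` and the three example sentences are made up for illustration; only the 0.6 threshold is taken from the cell above.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_sentences(sentences, threshold=0.6):
    """Start a new paragraph whenever a sentence is too dissimilar to the current one."""
    paragraphs, current = [], []
    for sentence in sentences:
        if current:
            sim = cosine(Counter(sentence.lower().split()),
                         Counter(" ".join(current).lower().split()))
            if sim < threshold:
                # Close the current paragraph and start a fresh one
                paragraphs.append("\n".join(current))
                current = []
        current.append(sentence)
    if current:
        paragraphs.append("\n".join(current))  # add last paragraph
    return paragraphs

# Hypothetical toy sentences: two about nitrogen emissions, one about water boards
sentences = [
    "de stikstof uitstoot daalt",
    "de stikstof uitstoot daalt snel",
    "het waterschap beheert dijken",
]
paragraphs = group_sentences(sentences)
print(len(paragraphs))
```

The first two sentences share most of their words and stay together, while the third shares none and opens a new paragraph. Word-vector similarity as used by spaCy is softer than word overlap, so the same threshold will behave differently on real text.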

In [11]:
paragraphsDF = df1
paragraphsDF

Unnamed: 0.1,Unnamed: 0,title,paragraph_num,paragraph_text,actors_in_paragraph,actors_list,2_plus_actors,numb_unique_actors
0,292,Antwoord op vragen van de leden Van Raan en Te...,5,Vragen van de leden Van Raan en Teunissen (bei...,"['Raan', 'Teunissen', 'Hoekstra']","['Raan', 'Teunissen', 'Hoekstra']",True,3
1,357,Verslag van een schriftelijk overleg met de mi...,10,"Koffeman (PvdD), Faber-Van de Klashorst (PVV),...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...",True,9
2,385,Verslag van een schriftelijk overleg met de mi...,38,"Zo nee, waarom niet?\nEn deelt u de mening dat...","['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2
3,396,Verslag van een schriftelijk overleg met de mi...,49,"Zo ja, kunt u aangeven waarom u SATL niet de g...","['Adegeest', 'van der Linden']","['Adegeest', 'van der Linden']",True,2
4,492,Verslag van een schriftelijk overleg met de mi...,145,Op 29 september jl. nam de Tweede Kamer een mo...,"['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2
...,...,...,...,...,...,...,...,...
2926,222774,"Begroting Landbouw, Natuur en Voedselkwaliteit...",968,"Ik wil ook heel graag naar de lunch, maar toch...","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']",True,2
2927,223192,2 PAGW brief ministeries LNV en IenW,335,waterschap Aa en Maas\nwaterschap\nScheldestro...,"['Aa', 'Maas']","['Aa', 'Maas']",True,2
2928,223267,3 PAGW voorstel Waddenprovincies,14,3e tranche Waddengebied Geachte mevrouw Van Du...,"['Duin', 'Beekman']","['Duin', 'Beekman']",True,2
2929,223363,3 PAGW voorstel Waddenprovincies,110,"# drs. A.A.M. Brok, voorzitter #@ValidSign_Ond...","['Brok', 'Schepers']","['Brok', 'Schepers']",True,2


# **Text preprocessing**
The preprocessing pipeline is described below.
#### 1. Tokenisation
First, basic tokenization is applied to split the text into
tokens, as recommended by Kannan et al. (2014). For this step I used gensim’s
simple_preprocess, which lowercases the text, splits it into tokens, and removes punctuation.

In [12]:
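# Tokenization using gensim
def sent_to_words(sentences, deacc=False):
    # simple_preprocess lowercases, tokenizes and drops punctuation;
    # deacc=True would additionally strip accent marks from the tokens
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=deacc)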
        
# Convert the data to a list
data = paragraphsDF["paragraph_text"].values.tolist()
data_words = list(sent_to_words(data))

#### 2. Stop word removal 
Second, stop words are removed from the data, along with the punctuation characters in
string.punctuation, a predefined string constant. These are
removed because they have little relevance for understanding the content of a text (Kannan et al.,
2014).

In [13]:
# Create the stop word list: NLTK's Dutch stop words plus the characters in string.punctuation.
# Run nltk.download('stopwords') once if the corpus is not yet available.
from nltk.corpus import stopwords
stop_words = stopwords.words('dutch') + list(string.punctuation)
# We also remove additional common words that occur in many documents and have no link to a distinct industry.
additional_stop_words = ['geer','minister', 'postbus', 'retouradres','kamer', 'antwoord', 'www', 'brief', 'voorzitter', 'generaal', 'rijksoverheid', 'voorzitter', 'kamervrag', 'kamervraag','voorzitter', 'generaal', 'annoteren', 'agenda', 'kamerstuk', 'beantwoording', 'stichting','lid', 'partij', 'fractie', 'waarom', 'https', 'brief', 'verslag', 'motie', 'agendapunt', 'indiener', 'Tweede','tweede', 'Kenmerk', 'kenmerk', 'voortgang','kamerstukk', 'website', 'org', 'kamerbrief', 'idem', 'bijlage', 'wet', 'artikel', 'vergaderjaar', 'overheid', 'vraag', 'bericht', 'rapport', 'aanhangsel','staan', 'beleidsreactie','inhoudsopgave','lid', 'jaar', 'commissie', 'reactie', 'reactie', 'mededeling', 'http','zien', 'Elzijn', 'isie', 'ieren', 'pa', 'ibidem','programma','algemeen', 'pagina', 'context', 'circulair', 'voorbeeld', 'bijlaag', 'hoofdstuk', 'zien', 'leeswijzer', 'algemeen', 'blad', 'vooronderzoek', 'revisie', 'zone', 'legenda', 'lineair', 'stof', 'kolomn', 'tabel', 'zone', 'voorstellen', 'heer', 'dank', 'mevrouw', 'wel', 'tijd', 'meneer','adema', 'zaak', 'besluit','commisiedebat','datum', 'onderzoek', 'pagina','geer','minister','vraag','heer','kabinet','agenta','gemeente','gaan','kamer','wel','www', 'aanwezig', 'bijvoorbeeld', 'beide', 'dergelijke', 'dezelfde', 'elke', 'enkele', 'eveneens', 'gaande', 'gaandeweg', 'gehele', 'gehouden', 'genoeg', 'geweest', 'groter', 'hebben', 'heel', 'hetzelfde', 'hetzij', 'huidige', 'hunne', 'immers', 'inmiddels', 'intussen', 'juist', 'kleine', 'komt', 'korte', 'laatst', 'laten', 'lijken', 'maken', 'meeste', 'meestal', 'mede', 'middel', 'misschien', 'namelijk', 'nemen', 'net', 'nieuwe', 'niemand', 'niets', 'nodig', 'nogal', 'normaal', 'nu', 'o.a.', 'ofwel', 'omtrent', 'ondanks', 'onder andere', 'ongeveer', 'ons', 'onzes', 'onzeker', 'overal', 'precies', 'redelijk', 'sinds', 'slechts', 'sommige', 'steeds', 'terwijl', 'toch', 'totaal', 'uiteraard', 'vaak', 'vanaf', 'verschillende', 'vervolgens', 'volledig', 'volgens', 'vroeg', 'vroeger', 
'waaronder', 'waarvan', 'wat betreft', 'weer', 'weinig', 'weliswaar', 'waarom', 'wanneer', 'zoals', 'zoveel', 'zulke', 'biodiversiteit', 'natuur', 'ecologie', 'soort', 'soorten', 'plant', 'planten', 'dier', 'dieren', 'bos', 'bosgebied', 'natuurgebied', 'bescherming', 'milieu', 'vervuiling', 'klimaatverandering', 'duurzaamheid', 'ecosysteem', 'biologisch', 'natuurlijk', 'gezondheid', 'beschermen', 'behoud', 'natuurbeheer', 'landschap', 'landschapsbeheer', 'fauna', 'flora', 'wetlands', 'bodem', 'water', 'lucht', 'biodiversiteitsverdrag', 'conventiebiologie', 'duurzaamheidsdoelen', 'habitats', 'inheems', 'soortenrijkdom', 'natuurlijke hulpbronnen', 'sustainable', 'sustainability']
stop_words_final = stop_words + additional_stop_words

In [14]:
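# Removing the stopwords from the data
stop_words_set = set(stop_words_final)  # set lookup is much faster than scanning a list

def rem_stopwords(texts):
    # texts is already tokenized, so the tokens can be filtered directly
    return [[word for word in doc if word not in stop_words_set] for doc in texts]

# remove stop words
data_words_nostops = rem_stopwords(data_words)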
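
The filtering step can be illustrated on a tiny example. The stop word set and documents below are hypothetical stand-ins for the NLTK list and the real corpus. The sketch also shows a point worth checking in the list above: entries with a space or a capital letter never match, because the tokens are single lowercase words.

```python
# Hypothetical miniature stop word list; the notebook uses NLTK's Dutch list plus custom words
stop_words_demo = {"de", "het", "van", "minister", "onder andere", "Tweede"}

# Already tokenized and lowercased documents, as produced by the tokenization step
tokenized_docs = [
    ["de", "minister", "beantwoordt", "vragen"],
    ["onder", "andere", "het", "tweede", "debat"],
]

def remove_stopwords(docs, stop_words):
    """Keep only the tokens that are not in the stop word set."""
    return [[token for token in doc if token not in stop_words] for doc in docs]

filtered = remove_stopwords(tokenized_docs, stop_words_demo)
print(filtered)
```

In the second document, "onder", "andere" and "tweede" survive, since the set only contains the phrase "onder andere" and the capitalized "Tweede".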

#### 3. Lemmatization
Lastly, lemmatization is performed, a text normalization technique that reduces each word to its
lemma; it was chosen because it is superior to stemming (Khyani et al., 2021). For this step I
used the open-source library spaCy, but NLTK could also have been used. The pre-trained spaCy
model used here, nl_core_news_lg, can be thought of as a pipeline: when the model is called on a
text, the text runs through each component in turn. If the text is not yet tokenized,
it is tokenized first, after which the other components are applied. The most
interesting component for this step is the tagger, which assigns Part-of-Speech (POS) tags based on spaCy’s
Dutch language model. A POS tag is a label assigned to every token in the corpus that indicates
the type of that token (for example verb, punctuation, or adjective) and other grammatical categories.
These POS tags can then be used during preprocessing to filter out unwanted tokens. The only tags
allowed in my analysis are NOUN, ADJ, VERB and ADV.

In [15]:
# Load the Dutch model; the parser and NER components are not needed for lemmatization
nlp = spacy.load('nl_core_news_lg', disable=['parser', 'ner'])
nlp.max_length = 1322782  # or even higher

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each tokenized text and keep only tokens with an allowed POS tag.

    See https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
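
The POS filter itself is easy to see on hand-tagged input. The sketch below does not call spaCy: the (token, lemma, POS) triples are written by hand as an assumption about what a tagger might return, and `keep_content_lemmas` is an illustrative name.

```python
from typing import List, Tuple

ALLOWED_POS = {"NOUN", "ADJ", "VERB", "ADV"}

# Hand-written (token, lemma, POS) triples standing in for spaCy's output
tagged_sentence: List[Tuple[str, str, str]] = [
    ("de", "de", "DET"),
    ("leden", "lid", "NOUN"),
    ("stelden", "stellen", "VERB"),
    ("kritische", "kritisch", "ADJ"),
    ("vragen", "vraag", "NOUN"),
    (".", ".", "PUNCT"),
]

def keep_content_lemmas(tagged, allowed=ALLOWED_POS):
    """Return the lemma of every token whose POS tag is in the allowed set."""
    return [lemma for _, lemma, pos in tagged if pos in allowed]

lemmas = keep_content_lemmas(tagged_sentence)
print(lemmas)
```

Determiners and punctuation drop out, and the remaining words appear in their base form, which is what the topic model receives.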

In [16]:
# Store the lemmatized tokens and drop paragraphs that are empty after preprocessing
paragraphsDF['content'] = data_lemmatized
newDF = paragraphsDF.drop(paragraphsDF[paragraphsDF['content'].apply(lambda x: len(x)==0)].index)

# **Creating Corpus & BERTopics**
BERTopic is a smart topic modeling algorithm that utilizes BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing model developed by Google, to create meaningful and accurate topics from a given corpus. Here are some reasons why BERTopic is considered smart to use:

1. Incorporates contextual understanding: BERT is designed to understand the context of text data, which allows BERTopic to create topics that are based on the full context of the documents. This makes it more accurate and meaningful compared to other topic modeling algorithms.

2. Utilizes clustering: BERTopic uses clustering algorithms, such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), to group similar documents together and create coherent topics. This clustering approach helps ensure that the topics are not only meaningful but also distinguishable from one another.

3. Customizable: BERTopic is highly customizable and can be tailored to specific needs. For example, users can adjust the number of topics they want to extract, or exclude specific words from the analysis to improve the quality of topics generated.

4. Efficient: BERTopic is designed to be computationally efficient and can process large datasets quickly. It also has the ability to update topics as new documents are added to the corpus, making it a scalable solution for topic modeling.

5. Easy to use: BERTopic is user-friendly and can be implemented with just a few lines of code. The resulting topics can be visualized using a variety of tools, making it easy to interpret and communicate the findings to others.

Overall, BERTopic is a smart choice for topic modeling as it combines the power of BERT with efficient clustering algorithms and customizability to create meaningful and accurate topics from text data.

We select "dutch" as the main language for our documents. If you want a multilingual model that supports 50+ languages, select "multilingual" instead.

In [17]:
# Join the lemmatized tokens of each paragraph into a single space-separated string
newDF['corp'] = newDF['content'].apply(' '.join)

In [18]:
# reset the index of the dataframe
newDF = newDF.reset_index(drop=True)
newDF

Unnamed: 0.1,Unnamed: 0,title,paragraph_num,paragraph_text,actors_in_paragraph,actors_list,2_plus_actors,numb_unique_actors,content,corp
0,292,Antwoord op vragen van de leden Van Raan en Te...,5,Vragen van de leden Van Raan en Teunissen (bei...,"['Raan', 'Teunissen', 'Hoekstra']","['Raan', 'Teunissen', 'Hoekstra']",True,3,"[vraag, lid, raan, stikstof, economisch, zaak,...",vraag lid raan stikstof economisch zaak klimaa...
1,357,Verslag van een schriftelijk overleg met de mi...,10,"Koffeman (PvdD), Faber-Van de Klashorst (PVV),...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...","['Strien', 'Pijlman', 'Rooijen', 'Gurp', 'Van ...",True,9,"[strien, ondervoorzitter, sgp, klip, Ballekom,...",strien ondervoorzitter sgp klip Ballekom dessi...
2,385,Verslag van een schriftelijk overleg met de mi...,38,"Zo nee, waarom niet?\nEn deelt u de mening dat...","['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2,"[delen, mening, toestaan, extra, opgave, betek...",delen mening toestaan extra opgave betekenen e...
3,396,Verslag van een schriftelijk overleg met de mi...,49,"Zo ja, kunt u aangeven waarom u SATL niet de g...","['Adegeest', 'van der Linden']","['Adegeest', 'van der Linden']",True,2,"[aangeven, satl, vragen, gegeven, verstrekken,...",aangeven satl vragen gegeven verstrekken reden...
4,492,Verslag van een schriftelijk overleg met de mi...,145,Op 29 september jl. nam de Tweede Kamer een mo...,"['Vestering', 'Raan', 'Vestering', 'Raan']","['Vestering', 'Raan', 'Vestering', 'Raan']",True,2,"[jl, deelnemen, lid, vestering, raan, pvdden, ...",jl deelnemen lid vestering raan pvdden lid ves...
...,...,...,...,...,...,...,...,...,...,...
2840,222774,"Begroting Landbouw, Natuur en Voedselkwaliteit...",968,"Ik wil ook heel graag naar de lunch, maar toch...","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']","['Grinwis', 'Kluis', 'Kluis', 'Grinwis']",True,2,"[graag, lunch, grinwis, mooi, spreken, heten, ...",graag lunch grinwis mooi spreken heten kluis l...
2841,223192,2 PAGW brief ministeries LNV en IenW,335,waterschap Aa en Maas\nwaterschap\nScheldestro...,"['Aa', 'Maas']","['Aa', 'Maas']",True,2,"[waterschap, scheldestroem, waterschap, waters...",waterschap scheldestroem waterschap waterschap...
2842,223267,3 PAGW voorstel Waddenprovincies,14,3e tranche Waddengebied Geachte mevrouw Van Du...,"['Duin', 'Beekman']","['Duin', 'Beekman']",True,2,"[tranche, geacht, duin, ontvingen, landbouw, v...",tranche geacht duin ontvingen landbouw visseri...
2843,223363,3 PAGW voorstel Waddenprovincies,110,"# drs. A.A.M. Brok, voorzitter #@ValidSign_Ond...","['Brok', 'Schepers']","['Brok', 'Schepers']",True,2,[secretaris],secretaris


### Transformer embedding
BERTopic supports several libraries for encoding text into dense vector embeddings. If the embeddings are of poor quality, nothing in the later steps can compensate, so choosing a suitable embedding model is very important. The Sentence Transformers library provides the most extensive collection of high-performing sentence embedding models. They can be found on the HuggingFace Hub by searching for “sentence-transformers”. The first result of this search is sentence-transformers/all-MiniLM-L6-v2, a popular high-performing model that creates 384-dimensional sentence embeddings.

In [19]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

### UMAP
UMAP is an amazing technique for dimensionality reduction. In BERTopic, it is used to reduce the dimensionality of the document embeddings into something that HDBSCAN can cluster more easily.

However, it does have a significant number of parameters you could take into account. As exposing all parameters in BERTopic would be difficult to manage, we can instantiate our UMAP model and pass it to BERTopic:

- n_neighbors is the number of neighboring sample points used when making the manifold approximation. Increasing n_neighbors preserves more global structure, whereas a lower value better preserves local structure. A good n_neighbors value preserves both local and global structure relatively well.
- n_components refers to the dimensionality of the embeddings after reducing them. Too low a dimensionality results in a loss of information, while too high a dimensionality results in poorer clustering.
- metric refers to the method used to compute the distances in high dimensional space.
- low_memory can be set to True when the nearest-neighbor computation consumes too much memory; this is slower but uses less memory.

In [31]:
from umap import UMAP
umap_model = UMAP(n_neighbors=15, 
                  n_components=5, 
                  metric='cosine', 
                  low_memory=False)

### HDBSCAN
After reducing the embeddings with UMAP, we use HDBSCAN to group our documents into clusters of similar documents. Similar to UMAP, HDBSCAN has many parameters that can be tweaked to improve the quality of the clusters.
- min_cluster_size is arguably the most important parameter in HDBSCAN. It controls the minimum size of a cluster and thereby the number of clusters that will be generated.
- metric, as with UMAP, is used to calculate the distances.
- prediction_data should always be set to True, as it is needed to predict new points later on.


In [42]:
from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=15, 
                        metric='euclidean', 
                        prediction_data=True)

### BERTopic model
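- The language parameter is used to simplify the selection of models for those who are not familiar with sentence-transformers models. It is not used when an explicit embedding_model is passed, as is done below.
- top_n_words refers to the number of words per topic that you want to be extracted.
- The n_gram_range parameter is passed to the CountVectorizer used when creating the topic representation; (1, 2) means that topics can be described by single words and two-word phrases.
- min_topic_size is an important parameter! It specifies the minimum size of a topic. It is not used when a custom hdbscan_model is passed, as is done below; the minimum size is then controlled by min_cluster_size.
- nr_topics can be a tricky parameter. It specifies the number of topics to reduce to after training the topic model; "auto" reduces the topics automatically.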
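
To make n_gram_range concrete, the sketch below lists the terms that a range of (1, 2) produces for one tokenized document. It is a plain-Python illustration of the idea, not scikit-learn's CountVectorizer; the function name `extract_ngrams` and the example tokens are made up.

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous n-grams for every n in the inclusive range."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the document
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

terms = extract_ngrams(["stikstof", "uitstoot", "landbouw"])
print(terms)
```

With (1, 2), a three-word document yields three single words and two word pairs, so topic labels can contain phrases such as "stikstof uitstoot" as well as individual words.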

In [43]:
model = BERTopic(language="dutch",
                 nr_topics='auto',
                 top_n_words = 10, 
                 n_gram_range = (1,2), 
                 min_topic_size = 200,
                 umap_model=umap_model,
                 hdbscan_model=hdbscan_model,
                 embedding_model = embedding_model)
topics, probs = model.fit_transform(newDF['corp'])

We can then extract the most and least frequent topics. Topic -1 contains the outliers, documents that were not assigned to any topic:

In [44]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,1019
1,0,484
2,1,441
3,2,119
4,3,107
5,4,99
6,5,92
7,6,58
8,7,46
9,8,41
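
A quick sanity check on such a table is the share of documents that ended up as outliers (topic -1 in BERTopic). The sketch below copies the counts shown above into a plain dictionary; it assumes the ten rows shown are the complete table, and the variable names are illustrative.

```python
# Topic counts copied from the output above; -1 is the outlier topic
topic_counts = {-1: 1019, 0: 484, 1: 441, 2: 119, 3: 107,
                4: 99, 5: 92, 6: 58, 7: 46, 8: 41}

total_docs = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_docs

# Largest real topic, ignoring the outlier bucket
largest_topic = max((t for t in topic_counts if t != -1), key=topic_counts.get)

print(f"{total_docs} documents, {outlier_share:.1%} outliers, largest topic: {largest_topic}")
```

Here roughly four in ten documents are outliers, which is a signal that the clustering parameters such as min_cluster_size may deserve another look.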


In [45]:
model.get_topic(6)

[('grinwis', 0.32582674676502626),
 ('voorstellen', 0.28991354903459293),
 ('boswijk', 0.23678296599134094),
 ('camp', 0.23472110423310166),
 ('tjeerd', 0.20131225939781813),
 ('lid', 0.18991597842980154),
 ('groot', 0.13850106468120144),
 ('daarna', 0.07756868646690736),
 ('bromet', 0.06106152419411089),
 ('lunch', 0.0597415755524681)]

# **Visualize Topics**
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [46]:
model.visualize_topics()

### Reduce topics based on the above visualization

We can reduce the number of topics after having trained a BERTopic model. The advantage of doing so is that you can decide on the number of topics after knowing how many were created. Before training, it is difficult to predict how many topics your documents contain and how many will be extracted. Instead, we can decide afterward how many topics seem realistic and set the "nr_topics" parameter to that number.

In [118]:
model.reduce_topics(newDF['corp'], nr_topics=23, topics = topics)

([-1, -1, -1, 0, -1, -1, -1, -1, 9, 10, -1, 1, -1, -1, -1, 9, 1, -1, -1, -1,
  ...
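
The returned list holds one topic ID per document, in document order. A minimal sketch of how such a list can be turned into a lookup from topic to document positions, using only the first twelve values of the output above (the variable names are illustrative):

```python
from collections import defaultdict

# First twelve topic assignments from the output above
first_assignments = [-1, -1, -1, 0, -1, -1, -1, -1, 9, 10, -1, 1]

# Map each topic ID to the positions of the documents assigned to it
docs_by_topic = defaultdict(list)
for doc_index, topic_id in enumerate(first_assignments):
    docs_by_topic[topic_id].append(doc_index)

print(dict(docs_by_topic))
```

The positions can then be used to pull the matching rows from the dataframe, for example to read the paragraphs behind a given topic.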

In [119]:
model.get_topic_freq()

Unnamed: 0,Topic,Count
0,-1,4297
1,0,889
2,1,303
3,2,222
4,3,179
...,...,...
19,18,70
20,19,68
21,20,66
22,21,65


In [120]:
model.visualize_topics()