# **BERTopic - Tutorial**
We start by installing BERTopic from PyPI before preparing the data.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!

# **Prepare data**

In [None]:
import json
import pandas as pd
import string, pprint
# gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from bertopic import BERTopic
import spacy
import nl_core_news_sm
import ijson
import nltk
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from sentence_transformers import util
from bertopic.backend import languages
import math

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
from nltk.tokenize import sent_tokenize
from typing import List

In [10]:
# Load the documents into a documents list
documents = []

with open("../Source/sample_100.json", "rb") as j:
    for record in ijson.items(j, "item"):
        documents.append(record)

In [11]:
# Collect the title and body text of every document
doc_titles = [doc['document_title'] for doc in documents]
doc_content = [doc['content'] for doc in documents]

data = {'title': doc_titles, 'content': doc_content}
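
The extraction above assumes that every record carries both a `document_title` and a `content` key. As a minimal sketch, missing or empty fields could be skipped as shown below. The sample records and the helper name `extract_fields` are made up for illustration and are not part of the original notebook.

```python
import json

# Hypothetical records that use the same two keys as the cell above
raw = '''[
  {"document_title": "Brief A", "content": "Eerste tekst."},
  {"document_title": "Brief B"},
  {"document_title": "Brief C", "content": ""}
]'''

def extract_fields(records):
    """Return parallel lists of titles and contents, skipping incomplete records."""
    titles, contents = [], []
    for record in records:
        title = record.get("document_title")
        content = record.get("content")
        if title and content:  # skips missing keys and empty strings
            titles.append(title)
            contents.append(content)
    return titles, contents

titles, contents = extract_fields(json.loads(raw))
print(titles, contents)
```

Whether incomplete records should be dropped or kept with a placeholder depends on the dataset; dropping them is only one possible choice.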

In [12]:
dataDF = pd.DataFrame(data)

# **Splitting up the documents into paragraphs**
BERT topic modeling based on paragraphs instead of whole documents has several advantages, including:

1. Improved granularity: Topic modeling based on paragraphs allows for a more fine-grained analysis of text data. It allows for a better understanding of the themes and topics within a larger document, which can help with more precise and accurate categorization of text data.

2. Better representation of content: Analyzing individual paragraphs rather than whole documents can provide a more accurate representation of the content in a given document. This is particularly important for longer documents where the content can vary significantly across different sections.

3. Better results for shorter documents: BERT-based topic modeling can be challenging for short documents, as there may not be enough information to generate meaningful topics. Analyzing individual paragraphs can provide more reliable results for shorter documents.

4. Ability to identify multiple topics: BERT topic modeling based on paragraphs can help identify multiple topics within a single document, which can be particularly useful in cases where there are multiple themes or subtopics.

Overall, BERT topic modeling based on paragraphs can provide a more detailed and accurate analysis of text data compared to analyzing whole documents.

In [13]:
import spacy
import pandas as pd

# Load the full pipeline once; reloading it on every call is very slow.
# The parser is needed for sentence boundaries and the vectors for similarity.
split_nlp = spacy.load('nl_core_news_lg')

def split_into_paragraphs(text, threshold=0.6):
    """Group consecutive sentences into paragraphs by vector similarity."""
    doc = split_nlp(text)
    paragraphs = []
    current_paragraph = ''

    for sentence in doc.sents:
        if len(current_paragraph) == 0:
            current_paragraph = str(sentence)
        else:
            # compare the sentence with the paragraph built so far
            similarity = sentence.similarity(split_nlp(current_paragraph))
            if similarity < threshold:  # below the threshold: start a new paragraph
                paragraphs.append(current_paragraph)
                current_paragraph = str(sentence)
            else:
                current_paragraph += '\n' + str(sentence)

    paragraphs.append(current_paragraph)  # add last paragraph
    return paragraphs

# loop over documents in the dataset and split each one into paragraphs
# rows are collected in a list because DataFrame.append was removed in pandas 2.0
rows = []
for i, row in dataDF.iterrows():
    paragraphs = split_into_paragraphs(row['content'])
    for j, paragraph in enumerate(paragraphs):
        rows.append({'title': row['title'], 'paragraph_num': j + 1, 'paragraph_text': paragraph})

# create new dataframe for paragraphs
paragraphsDF = pd.DataFrame(rows, columns=['title', 'paragraph_num', 'paragraph_text'])

# save new dataframe to a csv file
paragraphsDF.to_csv('paragraphs_dataset.csv', index=False)
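
The splitting rule above can be illustrated without spaCy. The sketch below uses hand-written two-dimensional vectors in place of real sentence embeddings and compares each sentence with the mean vector of the current paragraph. Both the toy vectors and the use of a plain mean are assumptions made for illustration; spaCy computes its own document vectors. The helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def group_by_similarity(sentences, vectors, threshold=0.6):
    """Start a new paragraph when a sentence is dissimilar to the current one."""
    paragraphs, current, current_vecs = [], [], []
    for sentence, vec in zip(sentences, vectors):
        if current and cosine(vec, mean_vector(current_vecs)) < threshold:
            paragraphs.append("\n".join(current))
            current, current_vecs = [], []
        current.append(sentence)
        current_vecs.append(vec)
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs

# Toy data: the first two sentences point in a similar direction, the third does not
sentences = ["Vraag over vlees.", "Vlees van grazers.", "Banken en pensioenen."]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
paragraphs = group_by_similarity(sentences, vectors)
print(paragraphs)
```

A lower threshold merges more sentences into one paragraph, a higher threshold produces more and shorter paragraphs; 0.6 is the value used in the cell above.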

In [14]:
paragraphsDF

Unnamed: 0,title,paragraph_num,paragraph_text
0,Contaminatie in het vlees van ‘grote grazers’ ...,1,>
1,Contaminatie in het vlees van ‘grote grazers’ ...,2,Retouradres Postbus
2,Contaminatie in het vlees van ‘grote grazers’ ...,3,20350 2500 EJ Den Haag
3,Contaminatie in het vlees van ‘grote grazers’ ...,4,De Voorzitter van de Tweede Kamer der Staten-G...
4,Contaminatie in het vlees van ‘grote grazers’ ...,5,Postbus
...,...,...,...
22324,Antwoord op de vragen van het lid Tjeerd de Gr...,66,Antwoord 14
22325,Antwoord op de vragen van het lid Tjeerd de Gr...,67,"Banken, pensioenfondsen en andere financiële i..."
22326,Antwoord op de vragen van het lid Tjeerd de Gr...,68,Vraag 15 Kunt u deze vragen elk afzonderlijk e...
22327,Antwoord op de vragen van het lid Tjeerd de Gr...,69,Antwoord 15


# **Text preprocessing**
The preprocessing pipeline consists of the three steps below.
#### 1. Tokenisation
First, basic tokenization is applied to split the text into 
tokens, as recommended by Kannan et al. (2014). For this step I used gensim's 
`simple_preprocess`, which lowercases the text, splits it into tokens and removes punctuation. 

In [16]:
# Tokenization using gensim
def sent_to_words(sentences, deacc=True):  # deacc=True strips accent marks; punctuation is always dropped
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=deacc)
        
# Convert the data to a list
data = paragraphsDF["paragraph_text"].values.tolist()
data_words = list(sent_to_words(data))
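
To make the effect of this step concrete, here is a rough standard-library approximation of what `simple_preprocess` does: lowercase, optionally strip accents, keep alphabetic tokens, and drop tokens shorter than 2 or longer than 15 characters (gensim's default limits). The function `approx_simple_preprocess` is a hypothetical stand-in and differs from gensim in details such as underscore handling.

```python
import re
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'ë' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def approx_simple_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess with the standard library only."""
    text = text.lower()
    if deacc:
        text = strip_accents(text)
    # runs of letters only: digits, punctuation and underscores act as separators
    tokens = re.findall(r"[^\W\d_]+", text)
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

address_tokens = approx_simple_preprocess("Retouradres Postbus 20350 2500 EJ Den Haag")
accent_tokens = approx_simple_preprocess("financiële", deacc=True)
print(address_tokens, accent_tokens)
```

The address line shows why so many paragraphs in this corpus shrink to a few tokens: numbers and punctuation disappear entirely.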

#### 2. Stop word removal 
Secondly, stop words are removed from the data, together with the punctuation characters in 
`string.punctuation`, a pre-initialized string constant. These are 
removed because they have little relevance for understanding the content of a text (Kannan et al., 
2014).

In [23]:
# Create stopword list
# string.punctuation is a string of punctuation characters
# import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('dutch') + list(string.punctuation)
# create list of additional stop words
# We remove additional common words that occur in many documents and have no link to a distinct industry.
additional_stop_words = ['geer','minister','vraag', 'postbus', 'retouradres', 'antwoord']
stop_words = stop_words + additional_stop_words

In [24]:
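# Removing the stopwords from the data
def rem_stopwords(texts):
    # the documents are already tokenized, so filter the token lists directly
    stop_set = set(stop_words)  # set membership is much faster than scanning a list
    return [[word for word in doc if word not in stop_set] for doc in texts]

# remove stop words
data_words_nostops = rem_stopwords(data_words)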
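
The filtering step can be tried on a small scale without NLTK. In the sketch below, `sample_stop_words` is a tiny hand-picked stand-in for NLTK's Dutch list, and `remove_stop_words` is a hypothetical helper name; only the extra words are taken from the cell above.

```python
import string

# Hand-picked sample; the notebook uses NLTK's full Dutch stop word list instead
sample_stop_words = {"de", "het", "van", "en"} | set(string.punctuation)
extra_stop_words = {"minister", "vraag", "postbus", "retouradres", "antwoord"}
stop_set = sample_stop_words | extra_stop_words

def remove_stop_words(token_lists, stop_set):
    """Drop every token that appears in stop_set from each token list."""
    return [[tok for tok in tokens if tok not in stop_set] for tokens in token_lists]

docs = [
    ["de", "voorzitter", "van", "de", "tweede", "kamer"],
    ["retouradres", "postbus"],
]
cleaned = remove_stop_words(docs, stop_set)
print(cleaned)
```

The second toy document ends up empty, which is the situation the notebook later handles by dropping rows without any remaining tokens.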

#### 3. Lemmatization
Lastly, lemmatization is performed, a text normalization technique that reduces each word to its 
lemma. It is preferred here because it is superior to stemming (Khyani et al., 2021). For this step I 
used the open-source library spaCy, but NLTK could also have been used. The pre-trained spaCy 
model `nl_core_news_lg` can be thought of as a pipeline: when the model is called on a text, the 
text is first tokenized and then passed through the pipeline components. The most relevant part of 
this pipeline is the tagging step, which assigns Part-of-Speech (POS) tags based on spaCy's 
Dutch language model. A POS tag is a label assigned to every token in the corpus that indicates 
its type (verb, punctuation, adjective and so on). These tags are then used during 
preprocessing to remove unwanted tokens. The only tags allowed in my analysis are NOUN, ADJ, 
VERB and ADV.

In [25]:
# load the Dutch pipeline without the components that lemmatization does not need
nlp = spacy.load('nl_core_news_lg', disable=['parser', 'ner'])
nlp.max_length = 1322782  # raise the character limit for long texts, or set it even higher

def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize tokenized texts and keep only the allowed POS tags.

    POS tag overview: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
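
The POS filter itself is independent of spaCy. The sketch below applies the same rule to tokens whose tags and lemmas were written by hand for illustration; they stand in for what a tagger would produce and are not actual model output. `TaggedToken` and `keep_lemmas` are hypothetical names.

```python
from typing import List, NamedTuple

class TaggedToken(NamedTuple):
    text: str   # surface form
    pos: str    # coarse part-of-speech tag
    lemma: str  # dictionary form

ALLOWED = {"NOUN", "ADJ", "VERB", "ADV"}

def keep_lemmas(tokens: List[TaggedToken], allowed=ALLOWED) -> List[str]:
    """Return the lemma of every token whose POS tag is allowed."""
    return [tok.lemma for tok in tokens if tok.pos in allowed]

# Hand-tagged example sentence: "banken en financiële instellingen spelen"
tagged = [
    TaggedToken("banken", "NOUN", "bank"),
    TaggedToken("en", "CCONJ", "en"),
    TaggedToken("financiële", "ADJ", "financieel"),
    TaggedToken("instellingen", "NOUN", "instelling"),
    TaggedToken("spelen", "VERB", "spelen"),
]
lemmas = keep_lemmas(tagged)
print(lemmas)
```

Function words such as the conjunction are removed, while content words survive in their lemma form.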

In [28]:
paragraphsDF['content'] = data_lemmatized
# drop paragraphs that have no tokens left after preprocessing
newDF = paragraphsDF[paragraphsDF['content'].apply(len) > 0].copy()

# **Creating Corpus & BERTopics**
BERTopic is a topic modeling technique that builds on transformer-based language models such as BERT (Bidirectional Encoder Representations from Transformers), originally developed by Google, to create meaningful topics from a given corpus. Here are some reasons to use BERTopic:

1. Incorporates contextual understanding: BERT is designed to understand the context of text data, which allows BERTopic to create topics that are based on the full context of the documents. This makes it more accurate and meaningful compared to other topic modeling algorithms.

2. Utilizes clustering: BERTopic uses clustering algorithms, such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), to group similar documents together and create coherent topics. This clustering approach helps ensure that the topics are not only meaningful but also distinguishable from one another.

3. Customizable: BERTopic is highly customizable and can be tailored to specific needs. For example, users can adjust the number of topics they want to extract, or exclude specific words from the analysis to improve the quality of topics generated.

4. Efficient: BERTopic is designed to be computationally efficient and can process large datasets quickly. It also has the ability to update topics as new documents are added to the corpus, making it a scalable solution for topic modeling.

5. Easy to use: BERTopic is user-friendly and can be implemented with just a few lines of code. The resulting topics can be visualized using a variety of tools, making it easy to interpret and communicate the findings to others.

Overall, BERTopic combines BERT-style embeddings with clustering algorithms and a high degree of customizability to create meaningful topics from text data.

We select "dutch" as the main language for our documents. If you want a multilingual model that supports 50+ languages, select "multilingual" instead.

In [32]:
# join the lemmas of each paragraph into a single space-separated string
newDF['corp'] = [' '.join(map(str, l)) for l in newDF['content']]

In [55]:
# reset the index of the dataframe
newDF = newDF.reset_index(drop=True)

In [65]:
newDF

Unnamed: 0,title,paragraph_num,paragraph_text,content,corp
0,Contaminatie in het vlees van ‘grote grazers’ ...,4,De Voorzitter van de Tweede Kamer der Staten-G...,"[voorzitter, generaal]",voorzitter generaal
1,Contaminatie in het vlees van ‘grote grazers’ ...,7,: Parnassusplein 5,[parnassusplein],parnassusplein
2,Contaminatie in het vlees van ‘grote grazers’ ...,9,www.rijksoverheid.nl,"[www, rijksoverheid]",www rijksoverheid
3,Contaminatie in het vlees van ‘grote grazers’ ...,10,Kenmerk 3477122-1040635-VGP,[Kenmerk],Kenmerk
4,Contaminatie in het vlees van ‘grote grazers’ ...,11,Uw brief,[brief],brief
...,...,...,...,...,...
15942,Antwoord op de vragen van het lid Tjeerd de Gr...,61,Dat geldt bijvoorbeeld voor certificeringsstan...,"[gelden, bijvoorbeeld, rtrs, nieuw, voorstelle...",gelden bijvoorbeeld rtrs nieuw voorstellen Eur...
15943,Antwoord op de vragen van het lid Tjeerd de Gr...,63,Zie hiervoor het antwoord op vraag 11.,"[zien, hiervoor]",zien hiervoor
15944,Antwoord op de vragen van het lid Tjeerd de Gr...,65,"Bent u van mening dat banken, pensioenfondsen ...","[mening, bank, pensioenfond, financieel, inste...",mening bank pensioenfond financieel instelling...
15945,Antwoord op de vragen van het lid Tjeerd de Gr...,67,"Banken, pensioenfondsen en andere financiële i...","[bank, pensioenfond, financieel, instelling, s...",bank pensioenfond financieel instelling spelen...


In [None]:
model = BERTopic(language="dutch")
# fit_transform expects a list of strings, so convert the column first
topics, probs = model.fit_transform(newDF['corp'].tolist())

We can then extract the most and least frequent topics:

In [None]:
model.get_topic_freq()
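
The same kind of overview can be derived directly from the list of topic assignments returned by `fit_transform`. The sketch below counts assignments with the standard library; the `topics` list is made up for illustration. In BERTopic, topic `-1` collects the documents that were treated as outliers, so it is worth checking how large that group is.

```python
from collections import Counter

# Hypothetical topic assignments, one per document
topics = [0, 1, 0, -1, 2, 0, -1, 1]

def topic_frequencies(topics):
    """Return (topic, count) pairs sorted from most to least frequent."""
    return Counter(topics).most_common()

freq = topic_frequencies(topics)
outlier_count = Counter(topics)[-1]
print(freq)
print("documents without a topic:", outlier_count)
```

A large share of outliers usually means that many documents are too short or too generic to cluster, which is plausible for paragraphs that consist of a single address line or reference number.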

# **Visualize Topics**
After having trained our `BERTopic` model, we can iteratively go through perhaps a hundred topics to get a good 
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. 
Instead, we can visualize the topics that were generated in a way very similar to 
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
model.visualize_topics()