# **Comparing corpora using distributional semantics methods**

---
Academic Year: 2022/23


Cognitive Science Program, University of Warsaw, Poland

**Abstract**:

Distributional semantics is a powerful concept that captures the meaning of linguistic units (here, words) based on their occurrence patterns in corpora. The development of this field has produced a number of models that represent the meaning of texts by mapping them onto a multidimensional space. High-quality vectors (embeddings) should reflect various relations between linguistic units, such as similarity, so similar words should be close to each other in that space. Thanks to this property, similar words can be clustered, for example to find the topics occurring in a given corpus.

The research question of my work was which method of extracting topics from Polish-language corpora, based on clustering embeddings, is best suited for use in an application that compares corpora semantically. The methods were therefore evaluated in terms of accuracy, but also in terms of ease of use and speed.

---

**Description of the work**

This work compares three families of topic models, tested in five configurations, on Polish corpora, using different clustering algorithms and pretrained language models.

The primary objective is to evaluate the performance of these models on Polish data and to identify the approach that best captures the internal structure of the text.

Throughout this study, a key consideration was given to algorithm and application simplicity. The goal was to explore models that are not only effective but also easy to implement and interpret, allowing for practical and efficient utilization.

The selected corpora encompass:

1. A corpus consisting of journalistic texts.
2. A corpus containing literary and poetic texts.
3. A corpus comprised of non-fiction, guidebooks, and didactic texts.
4. A corpus comprising parliamentary texts.
5. Volume I of the Polish novel "Lalka."
6. Volume II of the Polish novel "Lalka."

These chosen corpora encompass both sets that are intentionally designed to be closely related to each other (e.g., corpora 5, 6, and 2) and those that exhibit semantic distance (e.g., corpus 1 and corpus 6).

The employed models are as follows:

1. A pretrained Fasttext model, applied on lemmatized corpora, and using KMeans as the clustering algorithm.
2. The BERTopic model, trained on sentences extracted from corpora, utilizing vectors from the SentenceTransformers library, and employing HDBSCAN for clustering.
3. An analogous BERTopic model, but using a pretrained Spacy model for the Polish language.
4. The Top2Vec model, trained from scratch on sentences derived from the corpora.
5. An analogous Top2Vec model, but using SentenceTransformers.

Throughout the experimentation, different combinations of hyperparameters were evaluated for each of these models. Subsequently, the identified topics were subjected to an automated evaluation process, wherein metrics such as coherence, topic diversity, and cosine similarity were calculated.
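
As a rough illustration of two of these metrics, the sketch below computes topic diversity and cosine similarity in plain Python. Topic diversity is taken here to be the share of unique words among the top words of all topics, which is one common definition and an assumption about the exact variant used in the evaluation. The toy topics and vectors are invented for illustration.

```python
import math

def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic (1.0 = no overlap)."""
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented example: the word "ustawa" appears in both topics.
toy_topics = [["sejm", "poseł", "ustawa"], ["lalka", "sklep", "ustawa"]]
diversity = topic_diversity(toy_topics, top_k=3)        # 5 unique words out of 6
similarity = cosine_similarity([1.0, 0.0], [1.0, 1.0])  # vectors 45 degrees apart
```

A diversity close to 1 means the topics share few words; a cosine similarity close to 1 means two vectors point in nearly the same direction.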

It is important to note that this notebook is organized in a manner that facilitates a step-by-step approach rather than a one-time exhaustive process, resulting in some code repetition to enhance readability and comprehension. The order of the code mirrors the sequence of experiments undertaken.

This notebook contains the code that was used to generate the results that were ultimately included in my thesis.

## Libraries

Python version 3.10.12

bs4==4.11.2

spacy==3.5.4

umap-learn==0.5.3

fasttext==v0.9.1

sklearn==1.2.2

seaborn==0.12.2

pandas==1.5.3

matplotlib==3.7

tqdm==4.65.0

numpy==1.22.4

sentence_transformers==2.2.2

hdbscan==0.8.33

bertopic==v0.14.0

keybert==v0.7.0

jax>=0.4.9

bertopic[spacy]==v0.14.0

keybert[spacy]==v0.7.0

top2vec==1.0.29

top2vec[sentence_transformers]==1.0.29

gensim==4.3.1

editdistance==0.6.2

# Preprocessing

## Data fetching and dumping

---
Here are two helper functions for loading and saving pickled files.

---

In [None]:
import pickle

In [None]:
def load_file(file_name):
  file_path = ".../topic-modelling-polish/" + file_name + ".pkl"
  with open(file_path, "rb") as f:
    fil = pickle.load(f)
  return fil

In [None]:
def dump_result(result, result_name):
  result_path = ".../topic-modelling-polish/" + result_name + ".pkl"
  with open(result_path, "wb") as f:
    pickle.dump(result, f)

In [None]:
#Saving the .txt version

publ = load_file('corpora/pickle/original/publ')
lit = load_file('corpora/pickle/original/lit')
fakt_ind_nd = load_file('corpora/pickle/original/fakt_ind_nd')



with open('.../topic-modelling-polish/corpora/txt/publ.txt','w',encoding='utf-8') as f:
  f.writelines(publ)

with open('.../topic-modelling-polish/corpora/txt/lit.txt','w',encoding='utf-8') as g:
  g.writelines(lit)

with open('.../topic-modelling-polish/corpora/txt/fakt_ind_nd.txt','w',encoding='utf-8') as h:
  h.writelines(fakt_ind_nd)

## Downloading 1-million-word subcorpus of the National Corpus of Polish

---
The [National Corpus of Polish](http://nkjp.pl/index.php?page=0&lang=1) is a balanced corpus of the Polish language. In this work, I use a 1-million-word sample of it that is easily available here: http://nkjp.pl/index.php?page=14&lang=1

This sample contains texts from various registers, following the typology published in [this book](https://www.researchgate.net/publication/262184393_Recznie_znakowany_milionowy_podkorpus_NKJP/link/00463536e8212b2a3f000000/download) (p. 53; available only in Polish).

Below, I present a slightly modified (and translated) table taken from that book.

Category | Rate   | Type
-------- | ------ | ----
Journals | 25.5%  | #typ_publ
Other Periodicals | 23.5%  | #typ_publ
Journalistic Books | 1.0%   | #typ_publ
Fiction Literature | 16.0%  | #typ_lit, #typ_lit_poezja
Non-Fiction Literature | 5.5%   | #typ_fakt
Informative and Guide Type | 5.5%   | #typ_inf-por
Educational and Didactic Type | 2.0%   | #typ_nd
Online Interactive (Blogs, Forums, Usenet) | 3.5%   | #typ_net_interakt
Online Non-Interactive (Static Pages, Wikipedia) | 3.5%   | #typ_net_nieinterakt
Quasi-Spoken (Parliamentary Session Protocols) | 2.5%   | #typ_qmow
Media Spoken | 2.5%   | #typ_media
Conversational Spoken | 5.0%   | #typ_konwers
Other Written Texts | 3.0%   | #typ_urzed
Non-Fiction Unclassified Books | 1.0%   | #typ_nklas
 |   | #typ_listy

 ---


The 1-million-word subcorpus was the source for the three corpora used here in the comparisons:

* **publ**
  * corpus with around 426,000 tokens composed of journals, other periodicals, and journalistic books (typ_publ).
* **lit**
  * corpus with around 204,000 tokens composed of fiction literature (typ_lit, typ_lit_poezja).
* **fakt_ind_nd**
  * corpus with around 160,000 tokens composed of non-fiction literature, informative and guide type, and educational and didactic type (typ_fakt, typ_inf-por, typ_nd).
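
The assignment of registers to corpora can also be written as a lookup table. This is only a sketch: `TYPE_TO_CORPUS` and `corpus_for` are hypothetical names that are not used elsewhere in this notebook, and the register labels are copied from the table above.

```python
# Register label (as found in the header files) -> corpus name used in this notebook.
TYPE_TO_CORPUS = {
    "#typ_publ": "publ",
    "#typ_lit": "lit",
    "#typ_lit_poezja": "lit",
    "#typ_fakt": "fakt_ind_nd",
    "#typ_inf-por": "fakt_ind_nd",
    "#typ_nd": "fakt_ind_nd",
}

def corpus_for(register):
    """Return the corpus name for a register label, or None if the register is not used."""
    return TYPE_TO_CORPUS.get(register)

print(corpus_for("#typ_lit_poezja"))  # lit
print(corpus_for("#typ_qmow"))        # None (parliamentary texts come from a separate corpus)
```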

In [None]:
!mkdir .../working/corpora/

In [None]:
import re
import string
import pickle
import tarfile
import os
from bs4 import BeautifulSoup #needed by the cells below that parse header.xml and text.xml

In [None]:
with tarfile.open('.../topic-modelling-polish/corpora/raw/NKJP-PodkorpusMilionowy-1.2.tar.gz') as archive:
  archive.extractall('.../working/corpora/')

In [None]:
#Here, I'm extracting a list of text registers from NKJP sample corpus
lista_plikow = os.listdir('.../working/corpora/')


lista_typow = []

for l in lista_plikow:
  if not re.search('[a-zA-Z]', l):

    path = '.../working/corpora/' + l + '/header.xml'
    with open(path, 'r') as f:
      file = f.read()

      soup = BeautifulSoup(file, 'xml')
      types = soup.find('catRef')

      typ = types.attrs['target']

      if typ not in lista_typow:
        lista_typow.append(typ)


In [None]:
def extract_texts(path):
  with open(path, 'r') as f:
    file = f.read()

    texts = []

    soup = BeautifulSoup(file, 'xml')
    text_objects = soup.find_all('ab')

    for txt in text_objects:
      texts.append(txt.get_text())

    texts = "\n".join(texts)
    return texts

In [None]:
publ = []
lit = []
fakt_ind_nd = []

for folder in lista_plikow:
  if not re.search('[a-zA-Z]', folder):

    path_header = '.../working/corpora/' + folder + '/header.xml'
    path_text = '.../working/corpora/' + folder + '/text.xml'

    with open(path_header, 'r') as g:
      file_g = g.read()

      soup_g = BeautifulSoup(file_g, 'xml')
      types = soup_g.find('catRef')

      typ = types.attrs['target']

      if typ == "#typ_publ":
        publ.append(extract_texts(path_text))
      elif typ == "#typ_lit" or typ == "#typ_lit_poezja":
        lit.append(extract_texts(path_text))
      elif typ == "#typ_fakt" or typ == "#typ_inf-por" or typ == "#typ_nd":
        fakt_ind_nd.append(extract_texts(path_text))

with open(".../topic-modelling-polish/corpora/pickle/original/publ.pkl", "wb") as f:
  pickle.dump(publ, f)

with open(".../topic-modelling-polish/corpora/pickle/original/lit.pkl", "wb") as p:
  pickle.dump(lit, p)

with open(".../topic-modelling-polish/corpora/pickle/original/fakt_ind_nd.pkl", "wb") as t:
  pickle.dump(fakt_ind_nd, t)

## Downloading *Lalka* Corpus

---
*Lalka* is a 19th-century Polish novel written by Bolesław Prus. It consists of two volumes, each with 19 chapters. Volume I has around 154,000 tokens, while Volume II has around 173,000 tokens. I used it in the comparisons as an example of texts that are very similar to each other.

Throughout the remainder of this notebook, the reference term `lalka-tom-pierwszy` pertains to Volume One, and `lalka-tom-drugi` pertains to Volume Two.

The novel *Lalka* was downloaded from the public domain book site [Wolne Lektury](https://wolnelektury.pl/katalog/lektura/lalka/).

---



In [None]:
import tarfile
from bs4 import BeautifulSoup
import pickle

In [None]:
with open('.../topic-modelling-polish/corpora/txt/lalka-tom-pierwszy.txt', 'r') as f:
  text = f.read()
  raw_korpus = text.split('\n\n\n\n') #In this corpus, chapters are considered documents
  raw_korpus = raw_korpus[2:23]
  with open('.../topic-modelling-polish/corpora/pickle/original/lalka-tom-pierwszy.pkl','wb') as g:
    pickle.dump(raw_korpus,g)

In [None]:
with open('.../topic-modelling-polish/corpora/txt/lalka-tom-drugi.txt', 'r') as f:
  text = f.read()
  raw_korpus = text.split('\n\n\n\n') #In this corpus, chapters are considered documents
  raw_korpus = raw_korpus[2:39]
  with open('.../topic-modelling-polish/corpora/pickle/original/lalka-tom-drugi.pkl','wb') as g:
    pickle.dump(raw_korpus,g)

## Downloading Polish Parliamentary Corpus

---
The Polish Parliamentary Corpus is a large collection of texts from the plenary sittings of the Sejm and Senate of the Republic of Poland, dating from 1919 to the present.

In this work, I use only a small sample of texts from different periods that can be downloaded from [this site](http://clip.ipipan.waw.pl/PPC) (around 84,000 tokens).

For a longer description of this corpus, visit [its website](https://clarin-pl.eu/index.php/en/kdp-en/).

In the rest of the notebook, I refer to this corpus using the term `parlament_sample`.

---

In [None]:
!mkdir /content/parlament_corpus/

In [None]:
import os
import zipfile
import tqdm

In [None]:
with zipfile.ZipFile('.../topic-modelling-polish/corpora/raw/ppc-sample.zip', 'r') as zip_ref:
    zip_ref.extractall('/content/parlament_corpus/')

In [None]:
def get_file_paths(directory):
    file_paths = []

    for root, directories, files in os.walk(directory):
        for filename in files:
            filepath = os.path.join(root, filename)
            file_paths.append(filepath)

    return file_paths

directory = '/content/parlament_corpus/'
file_paths = get_file_paths(directory)

In [None]:
def czy_ending(path):
  #Keep only the paths that point to a text_structure.xml file
  return 'text_structure.xml' in path

pliki = list(filter(czy_ending,file_paths))

In [None]:
corpus = []
for plik in tqdm.tqdm(pliki):
  with open(plik,'r') as f:
    file = f.read()

    soup = BeautifulSoup(file, 'xml')
    text_objects = soup.find_all('u')

    texts = []
    for txt in text_objects:
      texts.append(txt.get_text())

    corpus.append("\n".join(texts))


In [None]:
with open('.../topic-modelling-polish/corpora/txt/korpus_parlament_sample.txt','w',encoding='utf-8') as f:
  f.writelines(corpus)

In [None]:
#Saved under the name that later cells load with load_file('corpora/pickle/original/parlament_sample')
with open('.../topic-modelling-polish/corpora/pickle/original/parlament_sample.pkl','wb') as f:
  pickle.dump(corpus,f)

## Corpora preprocessing

---
Corpus preprocessing here is language-specific and uses the language model [pl_core_news_lg](https://spacy.io/models/pl) trained for Polish.

---

In [None]:
def preprocess(corpus):
  corpus0 = []
  for text in corpus:
    text = re.sub(r'\d+','0',text) #change all the numbers into 0s
    text = re.sub(r'\n+',' ',text)
    corpus0.append(text)

  return corpus0
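
A quick usage example may help to see what `preprocess` does. The function body is repeated here only so that the snippet runs on its own, and the sample sentence is invented.

```python
import re

def preprocess(corpus):
    # Same logic as the cell above, repeated so the example is self-contained.
    cleaned = []
    for text in corpus:
        text = re.sub(r'\d+', '0', text)  # every run of digits becomes a single 0
        text = re.sub(r'\n+', ' ', text)  # every run of newlines becomes one space
        cleaned.append(text)
    return cleaned

sample = ["W 1890 roku\n\nkupił 15 lalek."]
result = preprocess(sample)
print(result)  # ['W 0 roku kupił 0 lalek.']
```

Note that the function expects a list of documents, not a single string.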


In [None]:
def spacy_tokenizer(text):
  text_object = nlp(text)
  tokenized = [text_raw.text for text_raw in text_object]
  return tokenized

In [None]:
def spacy_lemmatizer(text,nlp):
  text_object = nlp(text)
  tokenized = [str(text_raw.lemma_) for text_raw in text_object]
  return tokenized

In [None]:
def spacy_sentenciser(text,nlp):
  text_object = nlp(text)
  sentences = [str(sentence) for sentence in text_object.sents]

  return sentences

## Sentences preprocessing

---
The code below creates a sentence-based version of the corpora. These versions are used for the BERTopic and Top2Vec models. In my experiments, treating each sentence as a separate document yielded the best results in the topic modeling task.

---

In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_lg-3.5.0/pl_core_news_lg-3.5.0.tar.gz

In [None]:
import re
import itertools
import spacy

In [None]:
nlp = spacy.load('pl_core_news_lg')

In [None]:
names = ['parlament_sample', 'lit', 'lalka-tom-pierwszy', 'lalka-tom-drugi', 'fakt_ind_nd', 'publ']

nlp = spacy.load('pl_core_news_lg')
for name in names:
  corpus = load_file(f'corpora/pickle/original/{name}')
  corpus = preprocess(corpus)
  sentencized_corpus = [] #list of sentences from a corpus
  for doc in corpus:
    sentencized_doc = spacy_sentenciser(doc, nlp=nlp)
    if name == 'parlament_sample': #In this corpus, some sentences were very long, which caused problems when using transformer models.
    #These were sentences in which the names of MPs were mentioned.
      new_sentences = []
      for s in sentencized_doc: #Therefore, I divided sentences longer than 80 words into shorter sentences.
        if len(s.split()) > 80:
          new_sentences.append([' '.join(s.split()[x:x+80]) for x in range(0, len(s.split()), 80)])
        else:
          new_sentences.append([s])
      sentencized_corpus.append(list(itertools.chain(*new_sentences)))
    else:
      sentencized_corpus.append(sentencized_doc)
  sentencized_corpus = [doc for doc in sentencized_corpus if len(doc) > 0]
  sentencized_corpus = list(itertools.chain(*sentencized_corpus))
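  #No leading slash: dump_result already prepends the project directory.
  #Note that later cells load these files as '{name}_sentences', without the date suffix.
  dump_result(sentencized_corpus, f'corpora/pickle/sentences/{name}_sentences_27lipca')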
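
The splitting of overly long sentences in the cell above can be isolated into a small helper. This is a sketch of the same idea: `split_long_sentence` is a hypothetical name, and the 200-word test sentence is artificial.

```python
def split_long_sentence(sentence, max_words=80):
    """Return the sentence unchanged, or cut it into chunks of at most max_words words."""
    words = sentence.split()
    if len(words) <= max_words:
        return [sentence]
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Artificial 200-word "sentence": expected chunks of 80, 80, and 40 words.
long_sentence = ' '.join(f'w{i}' for i in range(200))
chunks = split_long_sentence(long_sentence)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)  # [80, 80, 40]
```

Splitting on a fixed word count ignores syntax, so a chunk boundary can fall in the middle of a phrase; for lists of names this matters little.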


# Models

## Fasttext

---
[FastText](https://fasttext.cc/) is a library for efficient text classification and word representation learning developed by Facebook's AI Research (FAIR) lab. It represents words as continuous vectors built from character n-grams, which lets it produce vectors even for out-of-vocabulary words. It is also fast to train and to query, which makes it popular for tasks involving large text datasets. Here, I use a pretrained Fasttext model for Polish in a vector clustering task.

I downloaded the model from [the CLARIN-PL website](https://clarin-pl.eu/dspace/handle/11321/606) and then loaded it using the Facebook `fasttext` library; however, it is also possible to load the model directly using [HuggingFace](https://huggingface.co/clarin-pl/fastText-kgr10).

The method presented here is inspired by the following article:

Sia, S., Dalmia, A., & Mielke, S. J. (2020). Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1728–1736. https://doi.org/10.18653/v1/2020.emnlp-main.135.

The presented topic modeling methodology comprises several sequential steps:

1. The corpus is lemmatized using the pretrained Spacy `pl_core_news_lg` model for the Polish language.
2. Employing TfidfVectorizer from the `sklearn` library, word weights are computed for individual terms in the corpus. This process involves eliminating words with minimal weights, denoting insignificance in the context of the corpus.
3. The pretrained Fasttext model is utilized to obtain word vectors, stored in a dictionary format of word:embedding.
4. Subsequently, the dimensionality of the stored vectors is reduced through the application of the UMAP algorithm.
5. Employing the KMeans algorithm, vector clustering is performed, with application of the previously derived weights.
6. Topics are extracted in the format of topic:list of tuples, each tuple containing (word, weight, vector). This extraction process is based on the top 30 words in each topic, comprising words with the highest weights calculated in step 2.
7. Topics are initially output in random order. A subsequent reranking of topics takes place to prioritize those conveying more informative content about the corpus, thus optimizing the readability.
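
Step 6 can be sketched without any of the models: given a cluster label and a weight for every word, keep the highest-weighted words of each cluster. The words, labels, and weights below are invented, and `top_words_per_cluster` is a hypothetical helper, not the function used later in this notebook.

```python
from collections import defaultdict

def top_words_per_cluster(labels, weights, top_n=2):
    """Group words by cluster label and keep the top_n words with the highest weight."""
    clusters = defaultdict(list)
    for word, label in labels.items():
        clusters[label].append((word, weights[word]))
    return {
        label: sorted(members, key=lambda pair: pair[1], reverse=True)[:top_n]
        for label, members in clusters.items()
    }

# Invented cluster assignments (word -> cluster) and TF-IDF-like weights (word -> score).
labels = {"sejm": 0, "poseł": 0, "ustawa": 0, "lalka": 1, "sklep": 1}
weights = {"sejm": 3.2, "poseł": 1.1, "ustawa": 2.4, "lalka": 5.0, "sklep": 0.7}
topics = top_words_per_cluster(labels, weights, top_n=2)
print(topics[0])  # [('sejm', 3.2), ('ustawa', 2.4)]
```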

---

First, I import the necessary libraries and define helper functions.

In [None]:
!pip install fasttext==v0.9.1

In [None]:
!pip install umap-learn==0.5.3

In [None]:
import umap
import pickle
import fasttext
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from sklearn.cluster import KMeans
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
import tqdm
import numpy as np

In [None]:
def import_models(model_name):
  #Loads the pretrained 100-dimensional Fasttext model (CBOW, trained on KGR10); returns None for any other name
  if model_name == "FastText100":
    model = fasttext.load_model("/content/drive/MyDrive/Projekt magisterski gotowy/models/Fasttext100/kgr10.plain.cbow.dim100.neg10.bin")
    return model

In [None]:
def get_embeddings(vector_model,weighting):
  embedding_list = [vector_model.get_word_vector(x) for x in weighting.keys()]
  embedding_dict = dict(zip(weighting.keys(),embedding_list))


  return embedding_dict


In [None]:
def get_tfidf_scores(corpus):

  vectorizer = TfidfVectorizer(use_idf=True)
  tfidf_vectorizer_vectors = vectorizer.fit_transform(corpus)
  words = vectorizer.get_feature_names_out()
  total_tf_idf = tfidf_vectorizer_vectors.toarray().sum(axis=0)


  tf_idf_score = {}
  for i, word in enumerate(words):
    tf_idf_score[word] = total_tf_idf[i]
  return tf_idf_score, words.astype("str") #tf_idf_score - dictionary lemma:score, words - list of words that were not rejected


In [None]:
def cluster(n_clusters,random_state,embeddings, weights=None):

  if weights is None:
    kmeans = KMeans(n_clusters=n_clusters,random_state=random_state, n_init=100).fit(embeddings)
  else:
    kmeans = KMeans(n_clusters=n_clusters,random_state=random_state, n_init=100).fit(embeddings,sample_weight = weights)
  return kmeans
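
To give an intuition for what `sample_weight` changes: in weighted k-means, each centroid is the weighted mean of the points assigned to it, so words with a high weight pull the centroid toward themselves. The pure-Python sketch below shows only that single update step on invented points; it does not reproduce the full scikit-learn algorithm.

```python
def weighted_centroid(points, weights):
    """Weighted mean of a list of points (one centroid update step of weighted k-means)."""
    total = sum(weights)
    dims = len(points[0])
    return [sum(w * p[d] for p, w in zip(points, weights)) / total for d in range(dims)]

points = [[0.0, 0.0], [4.0, 0.0]]
unweighted = weighted_centroid(points, [1.0, 1.0])  # midpoint of the two points
weighted = weighted_centroid(points, [1.0, 3.0])    # pulled toward the heavier point
print(unweighted, weighted)  # [2.0, 0.0] [3.0, 0.0]
```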

In [None]:
#A function that pulls out words related to a topic
def extract_topic_words(clusters,embedding_dict, labels = False, top_n = None, weights = None):

  vocabulary = list(embedding_dict.keys())
  if labels: #if you pass clusters as a list of numbers
    topic_dict = dict(zip(list(vocabulary),clusters))
  else: #if you pass clusters as an cluster object
    topic_dict = dict(zip(list(vocabulary),clusters.labels_)) #dictionary word:cluster_number

  n_topics = len(set(topic_dict.values()))
  topic_words_embeddings = [[] for i in range(n_topics)]

  for word in topic_dict:
    n_topic = topic_dict[word]

    if top_n is not None:
      word_emb = (word,weights[word],embedding_dict[word])
      topic_words_embeddings[n_topic].append(word_emb)
    else:
      word_emb = (word,embedding_dict[word])
      topic_words_embeddings[n_topic].append(word_emb)


  if top_n is not None:
    final_topic_words_embeddings = []
    for topic in topic_words_embeddings:
      cut = sorted(topic,key=lambda x: x[1], reverse = True)[:top_n] #keep exactly top_n words with the highest weights
      final_topic_words_embeddings.append(cut)
    return final_topic_words_embeddings

  return topic_words_embeddings

In the following cell, a word embedding clustering process is conducted on the corpora utilizing Fasttext, umap, and KMeans. Subsequently, visualizations are generated to provide a representation of the clustered word embeddings as well as the relationship between the Silhouette score and the number of clusters.
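
For reference, the Silhouette score of a single point is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster; the overall score is the mean over all points. The sketch below is a small pure-Python version for illustration only (it assumes Euclidean distance and at least two clusters); the notebook itself relies on `sklearn.metrics.silhouette_score`.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative sketch)."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        def mean_dist(cluster):
            dists = [math.dist(p, q) for j, q in enumerate(points)
                     if labels[j] == cluster and j != i]
            return sum(dists) / len(dists) if dists else None

        a = mean_dist(labels[i])   # cohesion: mean distance within the point's own cluster
        if a is None:              # singleton cluster: the score is defined as 0
            scores.append(0.0)
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated groups of invented points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the groups: score close to 1
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups: negative score
```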

In [None]:
names = ['parlament_sample', 'lit', 'lalka-tom-pierwszy', 'lalka-tom-drugi', 'fakt_ind_nd', 'publ']
model = import_models("FastText100")

for name in names:
  corpus = load_file(f'corpora/pickle/original/{name}')
  corpus = preprocess(corpus)




  lemmatized_korpus = [] #lemmatized documents; list of strings
  lemmas = [] #lemmatized documents; list of tokens
  for doc in tqdm.tqdm(corpus):
    lemmatized_doc = spacy_lemmatizer(doc,nlp)
    lemmatized_doc = list(map(lambda x:x.lower(),lemmatized_doc))
    lemmas.append(lemmatized_doc)
    lemmatized_korpus.append(' '.join(lemmatized_doc))
  new_lemmatized = []
  for l in lemmas:
    rob = []
    for word in l:
      new_word = word.split(' ')
      rob.append(new_word)
    new_lemmatized.append(list(itertools.chain(*rob)))

  weights, vocab = get_tfidf_scores(lemmatized_korpus)
  embeddings_dict = get_embeddings(model, weights)
  dump_result(embeddings_dict, f'embeddings/fasttext100/{name}_fasttext100_embeddings_dict')
  dump_result(weights, f'weights/{name}_tfidf_weights')

In [None]:
import matplotlib.colors as mcolors
names = ['parlament_sample', 'lit', 'lalka-tom-pierwszy', 'lalka-tom-drugi', 'fakt_ind_nd', 'publ']
umap_model = umap.UMAP(n_neighbors=15, n_components=5,
              min_dist=0.0, metric='cosine', random_state=42)
tsne = TSNE()

for name in tqdm.tqdm(names):


  embeddings_dict = load_file(f'embeddings/fasttext100/{name}_fasttext100_embeddings_dict')
  embeddings = list(embeddings_dict.values())
  reduced_embeddings = umap_model.fit_transform(embeddings)

  scores = load_file(f'weights/{name}_tfidf_weights')
  weights = list(scores.values())
  #projection = load_file(f'projections/fasttext100/{name}_projection_fasttext100')
  projection = TSNE().fit_transform(np.array(reduced_embeddings))
  dump_result(projection, f'projections/fasttext100/{name}_projection_fasttext100')
  print('Projection done')

  sns.set_theme()

  silhouette_scores = []
  for k in [10, 50, 100, 200]:
    kmeans = KMeans(n_clusters=k,random_state=42, n_init=100).fit(reduced_embeddings,sample_weight = weights)
    cluster_labels = kmeans.labels_ #labels of the weighted fit; calling fit_predict here would refit the model without the weights
    dump_result(cluster_labels, f'clusters/fasttext100/{name}_{k}_clusters_fasttext100')
    silhouette_scores.append(silhouette_score(reduced_embeddings, cluster_labels))
    print(f'Clusters for {name}_{k} done')

    topics = extract_topic_words(kmeans,embeddings_dict, False, 30, scores)
    topic_words = list(map(lambda x: list(x[0]), list(map(lambda x: list(zip(*x)), topics))))
    dump_result(topics,f'topics/fasttext100/full topics/{name}_fasttext100_{k}_topics_full_topic')
    dump_result(topic_words, f'topics/fasttext100/word lists/{name}_fasttext100_{k}_topics_word_list')

    plt.clf()
    plot = sns.scatterplot(x=projection[:,0], y=projection[:,1], hue=cluster_labels, palette='Paired', legend=False)
    plt.title(f'{name} fasttext, {k} clusters')
    plot.figure.savefig(f'.../topic-modelling-polish/figures/fasttext100/clusters/{name}_fasttext100_{k}_clusters.png')
    plt.clf()
    print(f'Figure title {name} fasttext, {k} clusters done')


  df = pd.DataFrame({'clusters':[10,50,100, 200],'score':silhouette_scores})



  plot_sil = sns.relplot(
    data=df, kind="line",
    x="clusters", y="score")
  plt.xlabel('Number of clusters')
  plt.ylabel('Silhouette score')
  plt.title(f'{name}')
  plot_sil.savefig(f'.../topic-modelling-polish/figures/fasttext100/silhouette/silhouette_{name}.png')
  plt.clf()
  print(f'Silhouette figure title {name} done')





## BERTopic

[BERTopic](https://maartengr.github.io/BERTopic/index.html) is a topic modeling technique that uses Transformer embeddings to cluster documents into coherent topics. It uses contextualized representations to create document embeddings and then applies a hierarchical clustering algorithm to group similar documents together. BERTopic is particularly effective for unsupervised topic modeling tasks and has shown promising results in various natural language processing applications.


BERTopic is highly customizable. I have tested multiple approaches and here I am presenting only the ones I ultimately decided to proceed with.

These are BERTopic models utilizing two language models to generate vectors:

* [Sentence Transformers](https://www.sbert.net/docs/pretrained_models.html) `paraphrase-multilingual-MiniLM-L12-v2` (BERTopic Multilingual)
* Spacy `pl_core_news_lg` (BERTopic Spacy)


### Visualizing clustering results of Sentence Transformers vectors with HDBSCAN and UMAP

---
This section serves as an introduction to testing the BERTopic model's performance. BERTopic can use several pretrained language models, clustering algorithms, and a variety of other features. In this work, HDBSCAN clustering algorithm and Sentence Transformers `paraphrase-multilingual-MiniLM-L12-v2` language model will be applied.

Before committing to specific HDBSCAN hyperparameters, I sought to explore their impact on clustering quality of pretrained embeddings generated by this particular model. In this section, I delved into investigating the influence of the `min_cluster_size` hyperparameter. My approach was straightforward, using visualizations to gain insights.

In theory, each color on the graphs should represent a distinct cluster. However, in practice, the number of clusters is so vast that colors repeat themselves due to limited color availability. Nevertheless, this method provides valuable assessments of how effectively the algorithm clusters data points.
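
Because the colors repeat, a short numeric summary of the cluster labels can complement the plots. The sketch below assumes HDBSCAN's convention that noise points receive the label -1; the label list is invented, and `summarize_labels` is a hypothetical helper that is not part of this notebook.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters, the share of noise points (label -1), and the largest cluster size."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "noise_share": noise / len(labels),
        "largest_cluster": max(counts.values(), default=0),
    }

toy_labels = [0, 0, 0, 1, 1, -1, -1, 2, 2, 2]
summary = summarize_labels(toy_labels)
print(summary)  # {'n_clusters': 3, 'noise_share': 0.2, 'largest_cluster': 3}
```

Comparing these numbers across `min_cluster_size` values shows, for example, whether larger values mostly merge clusters or mostly push points into noise.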

---

Although in the final version of BERTopic, word vectors are clustered, here I decided to visualize the results of clustering sentence vectors for the sake of implementation simplicity.

In [None]:
!pip install -U sentence-transformers==2.2.2

In [None]:
!pip install hdbscan==0.8.33


In [None]:
!pip install umap-learn==0.5.3

In [None]:
from sentence_transformers import SentenceTransformer
from hdbscan import HDBSCAN
import umap
from sklearn.manifold import TSNE
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import tqdm


In [None]:
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
umap_model = umap.UMAP(n_neighbors=15, n_components=5,
              min_dist=0.0, metric='cosine', random_state=42)

In [None]:
names = ['parlament_sample', 'lit', 'lalka-tom-pierwszy', 'lalka-tom-drugi', 'fakt_ind_nd', 'publ']


for name in tqdm.tqdm(names[1:]):
  sentences = load_file(f'corpora/pickle/sentences/{name}_sentences')
  embeddings = model.encode(sentences)
  dump_result(embeddings, f'embeddings/sentence_transformers/{name}_sentence_transformers_embeddings')
  embeddings_reduced = umap_model.fit_transform(embeddings)
  projection = TSNE().fit_transform(np.array(embeddings_reduced))
  dump_result(projection,f'projections/sentence_transformers/{name}_sentence_transformers_projection')
  print('Projection done')
  for k in [5, 20, 40, 60, 100]:
    hdb = HDBSCAN(min_cluster_size = k, cluster_selection_method="eom", prediction_data = True, metric = 'euclidean').fit(embeddings_reduced)
    print(f'Clustered for {k} min cluster size')
    dump_result(hdb.labels_, f'clusters/sentence_transformers/{name}_{k}_sentence_transformers_labels')

    sns.set_theme()


    values = np.linspace(0, 1, max(hdb.labels_)+1)
    hsv_colormap = mcolors.hsv_to_rgb(np.column_stack([values, np.ones_like(values), np.ones_like(values)]))
    custom_palette = sns.color_palette(hsv_colormap)
    color_palette = custom_palette
    print('Color palette done')
    cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in hdb.labels_]
    print('Cluster colors done')
    cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, hdb.probabilities_)]
    print('Cluster member color done')
    plt.clf()
    plot = plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)
    print('Plot done')


    plt.title(f'{name}, {k} min cluster size')
    print('Plot title done')

    plot.figure.savefig(f'.../topic-modelling-polish/figures/sentence_transformers/{name}_sentence_transformers_{k}_min_cluster_size.png')
    print(f'Figure with the title {name}, {k} min cluster size saved')





## BERTopic Multilingual

---
In the subsequent cells, the finalized approach that has been chosen for implementation is presented.

BERTopic allows for a large range of settings. The library's documentation e.g., [Tips&Tricks section](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html) helped me choose the parameters that are used below.

---

In [None]:
!pip install bertopic==v0.14.0

In [None]:
!pip install keybert==v0.7.0

In [None]:
from hdbscan import HDBSCAN
import umap
import re
import itertools
import tqdm
import pickle
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.representation import MaximalMarginalRelevance



In the following cell, the BERTopic model is used to embed the texts with the `paraphrase-multilingual-MiniLM-L12-v2` language model. The embeddings are then clustered to group texts with similar semantic representations. Finally, the embeddings of the words extracted from the corpora are saved for later evaluation.

In [None]:
names = ['fakt_ind_nd', 'publ', 'parlament_sample', 'lit', 'lalka-tom-pierwszy', 'lalka-tom-drugi'] #each corpus listed once

for name in tqdm.tqdm(names):

  file_name = name
  corpus = load_file(f'corpora/pickle/sentences/{name}_sentences') #loading sentencized corpora



  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
  kw_model = KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2')
  keywords = kw_model.extract_keywords(corpus)
  vocabulary = [k[0] for keyword in keywords for k in keyword]
  vocabulary = list(set(vocabulary))
  vectorizer_model= CountVectorizer(vocabulary=vocabulary)


  umap_model = umap.UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric='cosine', random_state=42)


  hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

  representation_model = MaximalMarginalRelevance(diversity=0.3)

  topic_model = BERTopic(language='multilingual', ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model, umap_model=umap_model, hdbscan_model=hdbscan_model,
                       representation_model = representation_model,
                       verbose=True)
  topics, probs = topic_model.fit_transform(corpus)
  topic_model.save(f".../topic-modelling-polish/models/bertopic_multilingual/{file_name}_bertopic_model_multilingual")



  tokenized = []
  for doc in corpus:
    tokenized_doc = spacy_tokenizer(doc)
    tokenized_doc = list(map(lambda x:x.lower(),tokenized_doc))
    tokenized.append(tokenized_doc)

  words = list(set(list(itertools.chain(*tokenized))))

  bertopic_embeddings = topic_model._extract_embeddings(words, method='word', verbose=True)

  embedding_dict = {}

  for i, word in enumerate(words):
    embedding_dict[word] = bertopic_embeddings[i]

  with open(f'.../topic-modelling-polish/embeddings/bertopic_multilingual/{file_name}_bertopic_multilingual_embeddings_dict.pkl', "wb") as f:
    pickle.dump(embedding_dict,f)


This code cell extracts the topics from the BERTopic model in three versions, each serving a different purpose in the later analysis:

1. **Raw Form in the BERTopic Model**:
   First, the topics are retrieved directly from the BERTopic model in their native representation: for every topic, the words and phrases assigned to its cluster together with their scores. These raw topics are the basis for further exploration and interpretation.

2. **Topics as Word Lists**:
   Next, the topics are reduced to plain lists of words. Without the scores, the constituent words of each topic are easier to read, so the key themes that emerge from the text data can be readily identified and interpreted.

3. **Topic Vectors**:
   Lastly, the topics are represented as vectors that summarize the semantic content of each topic cluster. These compact representations enable quantitative analysis, such as clustering and similarity comparisons between topics. The vectors are extracted for later evaluation purposes.
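
The conversion from the raw form to word lists can be illustrated on toy data. The sketch below assumes the raw topics are a dictionary mapping a topic id to a list of `(word, score)` pairs; the words and scores are invented for illustration and do not come from the corpora.

```python
# Toy stand-in for the raw topics: topic id -> list of (word, score) pairs.
# All words and scores below are invented for illustration.
raw_topics = {
    -1: [("i", 0.02), ("w", 0.01)],  # -1 plays the role of the outlier topic
    0: [("sejm", 0.12), ("ustawa", 0.09), ("poseł", 0.07)],
    1: [("lalka", 0.15), ("wokulski", 0.11), ("sklep", 0.05)],
}

# Keep only the words of every topic, preserving their original order.
word_lists = [[word for word, _score in pairs] for pairs in raw_topics.values()]

for topic_id, words in zip(raw_topics, word_lists):
    print(topic_id, words)
```

The list comprehension does the same job as the `zip(*x)` idiom used in the cell, but it also works for a topic with no words, where `zip(*x)` would produce an empty result and indexing it would fail.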



In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']

for name in names:
  model_path = f".../topic-modelling-polish/models/bertopic_multilingual/{name}_bertopic_model_multilingual"

  model = BERTopic.load(model_path)
  topics = model.get_topics()
  extracted_topics = list(map(lambda x:list(zip(*x))[0],topics.values()))
  extracted_topics = list(map(lambda x:list(x),extracted_topics)) #topic_words
  topic_vectors = model.topic_embeddings_
  dump_result(extracted_topics, f'/topics/bertopic_multilingual/word lists/{name}_bertopic_multilingual_topics_word_list')
  dump_result(topics, f'/topics/bertopic_multilingual/full topics/{name}_bertopic_multilingual_full_topics')
  dump_result(topic_vectors, f'/embeddings/bertopic_multilingual/topic/{name}_bertopic_multilingual_topic_embeddings')

## BERTopic Spacy



---
The BERTopic library supports a variety of pre-trained models. The complete list can be found [here](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#flair).

As a second approach, the pretrained spaCy model for the Polish language, `pl_core_news_lg`, is tested. The same model was used for preprocessing the corpora.

---

In [None]:
!pip install "jax>=0.4.9"

In [None]:
!pip install https://github.com/explosion/spacy-models/releases/download/pl_core_news_lg-3.5.0/pl_core_news_lg-3.5.0.tar.gz

In [None]:
!pip install bertopic[spacy]==v0.14.0

In [None]:
!pip install keybert[spacy]

In [None]:
from hdbscan import HDBSCAN
import umap
import re
import itertools
import spacy
import pickle
import tqdm
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from keybert import KeyBERT
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.dimensionality import BaseDimensionalityReduction
from bertopic.representation import MaximalMarginalRelevance

In the cell below, the BERTopic model using the language model `pl_core_news_lg` is trained, and subsequently, the vectors for the tokens extracted from the corpora are saved.

In [None]:
names = ['lalka-tom-pierwszy','lalka-tom-drugi','lit','fakt_ind_nd','publ', 'parlament_sample']



nlp = spacy.load("pl_core_news_lg", exclude=['tagger', 'parser', 'ner',
                                            'attribute_ruler', 'lemmatizer',  'tok2vec', 'morphologizer', '(trainable_lemmatizer)', 'senter'])

for name in tqdm.tqdm(names):

  corpus = load_file(f'corpora/pickle/sentences/{name}_sentences') #loading sentencized corpora

  ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
  kw_model = KeyBERT(model=nlp)
  keywords = kw_model.extract_keywords(corpus)
  vocabulary = [k[0] for keyword in keywords for k in keyword]
  vocabulary = list(set(vocabulary))
  vectorizer_model= CountVectorizer(vocabulary=vocabulary)


  umap_model = umap.UMAP(n_neighbors=15, n_components=5,
                  min_dist=0.0, metric='cosine', random_state=42)


  hdbscan_model = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

  representation_model = MaximalMarginalRelevance(diversity=0.3)

  topic_model = BERTopic(embedding_model=nlp, ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model, umap_model=umap_model, hdbscan_model=hdbscan_model,
                       representation_model = representation_model,
                       verbose=True)
  topics, probs = topic_model.fit_transform(corpus)
  topic_model.save(f".../topic-modelling-polish/models/bertopic_spacy/{name}_bertopic_spacy_model")




  tokenized = []
  for doc in corpus:
      tokenized_doc = spacy_tokenizer(doc)
      tokenized_doc = list(map(lambda x:x.lower(),tokenized_doc))
      tokenized.append(tokenized_doc)

  words = list(set(list(itertools.chain(*tokenized))))

  bertopic_embeddings = topic_model._extract_embeddings(words, method='word', verbose=True)

  embedding_dict = {}

  for i, word in enumerate(words):
    embedding_dict[word] = bertopic_embeddings[i]


  with open(f'.../topic-modelling-polish/embeddings/bertopic_spacy/{name}_bertopic_spacy', "wb") as f:
      pickle.dump(embedding_dict,f)


Topics and topic vectors are extracted.

In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']

for name in names:
  model_path = f".../topic-modelling-polish/models/bertopic_spacy/{name}_bertopic_spacy_model"
  model = BERTopic.load(model_path)
  topics = model.get_topics()
  extracted_topics = list(map(lambda x:list(zip(*x))[0],topics.values()))
  extracted_topics = list(map(lambda x:list(x),extracted_topics)) #topic_words
  topic_vectors = model.topic_embeddings_





  dump_result(topics, f'topics/bertopic_spacy/full topics/{name}_bertopic_spacy_full_topics')
  dump_result(extracted_topics, f'topics/bertopic_spacy/word lists/{name}_bertopic_spacy_topics_word_list')
  dump_result(topic_vectors, f'embeddings/bertopic_spacy/topic/{name}_bertopic_spacy_topic_embeddings')


## Top2Vec

---
Top2Vec is a topic modeling technique that discovers topics automatically, without requiring a predefined number of topics. In its original form it trains Doc2Vec to embed documents and words in the same vector space, finds dense clusters of document vectors, and takes the centroid of each cluster as the topic vector; the words closest to that vector describe the topic. Because topics, documents and words share one space, topic similarity can be measured directly and documents with similar thematic content can be grouped efficiently. It also scales well with large datasets.

You can find more details about the model [here](https://github.com/ddangelov/Top2Vec).

There are multiple ways to apply Top2Vec - here, I decided to try out the original algorithm with training the embeddings without the usage of any pretrained models.
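
The core idea of deriving a topic vector from clustered document vectors can be sketched with NumPy. This is a simplified illustration and not the library's implementation: the 2-D vectors, the cluster labels and the two-word vocabulary are invented, and documents labelled `-1` are simply treated as noise and ignored.

```python
import numpy as np

# Invented 2-D document vectors and cluster labels (-1 marks noise).
doc_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.2, 0.8],
    [0.5, 0.5],
])
labels = np.array([0, 0, 1, 1, -1])

# Invented word vectors living in the same space as the documents.
word_vectors = {
    "sejm": np.array([0.9, 0.1]),
    "lalka": np.array([0.1, 0.9]),
}


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Topic vector = centroid (mean) of the document vectors in each cluster.
topic_vectors = {
    int(label): doc_vectors[labels == label].mean(axis=0)
    for label in np.unique(labels)
    if label != -1
}

# The word whose vector is closest to the topic vector describes the topic.
nearest_word = {
    label: max(word_vectors, key=lambda w: cosine(word_vectors[w], vector))
    for label, vector in topic_vectors.items()
}

print(nearest_word)
```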

---

In [None]:
!pip install "jax>=0.4.9"

In [None]:
!pip install top2vec==1.0.29

In [None]:
from top2vec import Top2Vec
import tqdm

In the cell below, the model is trained on each corpus and the topics are extracted.

In [None]:
names = ['lalka-tom-pierwszy','lalka-tom-drugi','lit','fakt_ind_nd','publ', 'parlament_sample']

for name in tqdm.tqdm(names):


  corpus = load_file(f'corpora/pickle/sentences/{name}_sentences') #loading sentencized corpora


  model = Top2Vec(corpus, speed = 'learn', workers = 8,
                hdbscan_args = {'min_cluster_size':5, 'metric':'euclidean', 'cluster_selection_method':'eom', 'prediction_data':True},
                umap_args = {'metric':'cosine', 'n_components':5, 'min_dist':0.0, 'random_state':42,'n_neighbors':15},
                min_count=5)



  model.save(f'.../topic-modelling-polish/models/top2vec/{name}_top2vec_model')

In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']

for name in names:
  model_path = f".../topic-modelling-polish/models/top2vec/{name}_top2vec_model"
  model = Top2Vec.load(model_path)
  full_topics = model.get_topics()
  extracted_topics = list(map(lambda x: list(x), full_topics[0]))
  topic_embeddings = model.topic_vectors




  dump_result(full_topics, f'topics/top2vec/full topics/{name}_top2vec_full_topics')
  dump_result(extracted_topics, f'topics/top2vec/word lists/{name}_top2vec_topics_word_list')
  dump_result(topic_embeddings, f'embeddings/top2vec/topic/{name}_top2vec_topic_embeddings')

## Top2Vec Multilingual

---
Top2Vec can also be initialized using pre-trained vectors. In this approach, the same model as before will be tested: `paraphrase-multilingual-MiniLM-L12-v2`.

---

In [None]:
!pip install "jax>=0.4.9"

In [None]:
!pip install top2vec[sentence_transformers]

In [None]:
from top2vec import Top2Vec
import tqdm

In [None]:
names = ['lalka-tom-pierwszy','lalka-tom-drugi','lit','fakt_ind_nd','publ', 'parlament_sample']

for name in tqdm.tqdm(names):


  corpus = load_file(f'corpora/pickle/sentences/{name}_sentences') #loading sentencized corpora


  model = Top2Vec(corpus, embedding_model = 'paraphrase-multilingual-MiniLM-L12-v2', speed = 'learn', workers = 8,
                hdbscan_args = {'min_cluster_size':5, 'metric':'euclidean', 'cluster_selection_method':'eom', 'prediction_data':True},
                umap_args = {'metric':'cosine', 'n_components':5, 'min_dist':0.0, 'random_state':42,'n_neighbors':15},
                min_count=5)



  model.save(f'.../topic-modelling-polish/models/top2vec_multilingual/{name}_top2vec_multilingual_model')

In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']

for name in names:
  model_path = f".../topic-modelling-polish/models/top2vec_multilingual/{name}_top2vec_multilingual_model"  # same folder the training cell saves to
  model = Top2Vec.load(model_path)
  full_topics = model.get_topics()
  extracted_topics = list(map(lambda x: list(x), full_topics[0]))
  topic_embeddings = model.topic_vectors




  dump_result(full_topics, f'topics/top2vec_multilingual/full topics/{name}_top2vec_multilingual_full_topics')
  dump_result(extracted_topics, f'topics/top2vec_multilingual/word lists/{name}_top2vec_multilingual_topics_word_list')
  dump_result(topic_embeddings, f'embeddings/top2vec_multilingual/topic/{name}_top2vec_multilingual_topic_embeddings')

# Evaluation

## Coherence, unique words, Jaccard diversity

---
Assessing the quality of topic models is difficult, and it is far from obvious how to do it automatically. Several metrics are commonly used to evaluate topic models, including coherence, the proportion of unique words, and Jaccard distance.

**Coherence**: Coherence is a measure that evaluates the interpretability and semantic consistency of topics generated by a model. It quantifies the degree to which words within a topic are related and can be understood as a coherent theme. High coherence values indicate that the words in a topic have strong semantic associations. Common coherence metrics include C_V, C_NPMI, and UMass, which assess word co-occurrence patterns and the strength of their connections.

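**Unique Words**: The proportion of unique words is another indicator of topic quality. It is the number of distinct words across the top words of all topics divided by the total number of top words. A value close to 1 means that the topics rarely share words, which suggests that the model has captured diverse aspects of the data; a low value signals redundant topics. A high value alone does not guarantee quality, however, since noisy or unfocused topics can also consist of distinct words.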

**Jaccard Distance**: Jaccard distance is a measure of dissimilarity between sets. In the context of topic modeling, it can be used to compare the similarity between the word sets of different topics. By calculating the Jaccard distance between topics, one can assess how distinct or overlapping they are in terms of vocabulary. This measure helps in identifying topics that might be too similar, potentially indicating redundancy in the model's output.
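
The two diversity measures can be checked by hand on a small example. The sketch below uses three invented topics of four words each; the helper names are hypothetical and exist only for this illustration.

```python
from itertools import combinations


def jaccard_distance(words_a, words_b):
    """1 - |A ∩ B| / |A ∪ B| for two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)


def unique_word_proportion(topics):
    """Distinct words divided by the total number of words over all topics."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


# Invented topics: the first two overlap, the third is disjoint from both.
toy_topics = [
    ["sejm", "ustawa", "poseł", "rząd"],
    ["rząd", "minister", "ustawa", "budżet"],
    ["lalka", "sklep", "kupiec", "subiekt"],
]

pairwise = [jaccard_distance(a, b) for a, b in combinations(toy_topics, 2)]
mean_jaccard = sum(pairwise) / len(pairwise)
puw = unique_word_proportion(toy_topics)

print(f"pairwise distances: {pairwise}")
print(f"mean Jaccard distance: {mean_jaccard:.3f}")
print(f"proportion of unique words: {puw:.3f}")
```

The first two topics share two of six distinct words, so their distance is 1 - 2/6; the other pairs share nothing and have distance 1. Ten of the twelve listed words are distinct, which gives a unique-word proportion of 10/12.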

---

In [None]:
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
import pandas as pd
from itertools import combinations

In [None]:
#Source: https://github.com/silviatti/topic-model-diversity/blob/master/diversity_metrics.py

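def proportion_unique_words(topics):
    # Share of distinct words among all top words of all topics (1.0 = no word is repeated).
    unique_words = set()
    total_words = 0
    for topic in topics:
        unique_words = unique_words.union(set(topic))
        total_words += len(topic)
    return len(unique_words) / total_words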

In [None]:
#Source: https://github.com/silviatti/topic-model-diversity/blob/master/diversity_metrics.py

def pairwise_jaccard_diversity(topics):

    dist = 0
    count = 0
    for list1, list2 in combinations(topics, 2):
        js = 1 - len(set(list1).intersection(set(list2)))/len(set(list1).union(set(list2)))
        dist = dist + js
        count = count + 1
    return dist/count

In [None]:
def evaluate(corpus, topic_words):

  dictionary = corpora.Dictionary(corpus)

  used_corpus = [dictionary.doc2bow(token) for token in corpus]

  coherence_model = CoherenceModel(topics=topic_words,
                                      texts=corpus,

                                      dictionary=dictionary,
                                      coherence='c_v', topn=30)
  cv = coherence_model.get_coherence()

  coherence_model = CoherenceModel(topics=topic_words,
                                      texts=corpus,

                                      dictionary=dictionary,
                                      coherence='c_npmi', topn=30)
  c_npmi = coherence_model.get_coherence()

  coherence_model = CoherenceModel(topics=topic_words,
                                      texts=corpus,
                                      corpus=used_corpus,
                                      dictionary=dictionary,
                                      coherence='u_mass',topn=30)
  umass = coherence_model.get_coherence()

  uniques = proportion_unique_words(topic_words)

  jaccard = pairwise_jaccard_diversity(topic_words)

  results = {"proportion unique words": uniques, "cv coherence": cv, "c_npmi coherence": c_npmi, "u_mass coherence":umass, "jaccard":jaccard}


  return results



In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']
models = ['bertopic_multilingual', 'bertopic_spacy', 'top2vec', 'top2vec_multilingual']


for name in names:

  corpus = load_file(f'corpora/pickle/original/{name}')
  tokenized = []
  for doc in corpus:
    tokenized_doc = spacy_tokenizer(doc)
    tokenized_doc = list(map(lambda x:x.lower(),tokenized_doc))
    tokenized.append(tokenized_doc)

  for model in models:
    topics = load_file(f'topics/{model}/word lists/{name}_{model}_topics_word_list')
    topics = [topic for topic in topics if "" not in topic] #those models used min_cluster_size = 5, so sometimes it happened that Bertopic added empty strings to a topic


    results = evaluate(tokenized, topics)
    df = pd.DataFrame(results.values(), index=results.keys(), columns=['value'])


    df.to_csv(f'.../topic-modelling-polish/results/{model}/metrics/{name}_{model}_topics_results.csv')


The same evaluation for the Fasttext100 topics.

In [None]:
names = ['fakt_ind_nd','publ', 'parlament_sample', 'lit','lalka-tom-pierwszy','lalka-tom-drugi']

for name in tqdm.tqdm(names):

  corpus = load_file(f'corpora/pickle/original/{name}')
  corpus = preprocess(corpus)


  lemmatized_korpus = [] #lemmatized documents list; list of strings
  lemmas = [] #lemmatized documents list; list of lists
  for doc in corpus:
    lemmatized_doc = spacy_lemmatizer(doc,nlp)
    lemmatized_doc = list(map(lambda x:x.lower(),lemmatized_doc))
    lemmas.append(lemmatized_doc)
    lemmatized_korpus.append(' '.join(lemmatized_doc))
  new_lemmatized = []
  for l in lemmas:
    rob = []
    for word in l:
      new_word = word.split(' ')
      rob.append(new_word)
    new_lemmatized.append(list(itertools.chain(*rob)))


  for k in [10,100]:

    topics = load_file(f'topics/fasttext100/word lists/{name}_fasttext100_{k}_topics_word_list')




    results = evaluate(new_lemmatized,topics)
    df = pd.DataFrame(results.values(), index=results.keys(), columns=['value'])


    df.to_csv(f'.../topic-modelling-polish/results/fasttext100/metrics/{name}_fasttext100_{k}_topics_results.csv')


## Cosine similarity

---
Another effective method for assessing the quality of topics and their learnt embedding representations involves utilizing cosine similarity as a metric. Cosine similarity is a mathematical measure that evaluates the angular separation between two vectors in a high-dimensional space, providing an indication of their similarity in direction and orientation.

Cosine similarity between topics can be used for semantic comparison of corpora. If two distinct corpora share a substantial number of similar topics, the cosine similarity should reflect this, yielding values close to 1 for topic pairs as well as an overall mean value close to 1. However, the efficacy of this approach hinges on the quality of the topic vectors, that is, on the ability of each model to capture the nuances of the corpora and encode them in the embeddings.

In the code snippet below, an implementation for calculating the cosine similarity matrix across all topics in a given pair of corpora can be found. The outcomes and visual representations are accessible within the repository.

Both the BERTopic and Top2Vec models offer means to extract topic vectors from their class objects. BERTopic represents these vectors as a weighted average of constituent word vectors within the topic, while Top2Vec incorporates topic representation as a fundamental part of its algorithm.

Given that this method compares separately learned vectors from different corpora, it employs both the standard cosine similarity and an adjusted version for a comprehensive evaluation.
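
The difference between the two variants can be seen on a pair of toy vectors. The sketch assumes that the adjusted version subtracts from every vector its own mean before the cosine is computed; the vectors are invented for illustration.

```python
import numpy as np


def cosine(u, v):
    """Standard cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def adjusted_cosine(u, v):
    """Cosine similarity after centring each vector on its own mean."""
    return cosine(u - u.mean(), v - v.mean())


# Two invented vectors with only positive components but opposite trends.
u = np.array([1.0, 2.0, 3.0])
v = np.array([3.0, 2.0, 1.0])

plain = cosine(u, v)
adjusted = adjusted_cosine(u, v)

print(f"standard cosine: {plain:.3f}")
print(f"adjusted cosine: {adjusted:.3f}")
```

Because all components are positive, the standard cosine stays clearly above zero (10/14), while the centred vectors point in opposite directions and the adjusted value is -1. Centring removes a shared offset that would otherwise inflate the similarity.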

---

In [None]:
!pip3 install umap-learn==0.5.3

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from itertools import combinations_with_replacement
import umap
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tqdm

In [None]:
def adjusted_cosine_similarity(vec1, vec2):
    avg_vec1 = np.mean(vec1, axis=1)
    avg_vec2 = np.mean(vec2, axis=1)

    Cvec1 = vec1 - avg_vec1[:, np.newaxis]
    Cvec2 = vec2 - avg_vec2[:, np.newaxis]

    n_rows_vec1, n_rows_vec2 = Cvec1.shape[0], Cvec2.shape[0]
    sim_matrix = np.zeros((n_rows_vec1, n_rows_vec2))
    for u in range(n_rows_vec1):
        for v in range(n_rows_vec2):
            cos_sim = np.dot(Cvec1[u], Cvec2[v]) / (np.linalg.norm(Cvec1[u]) * np.linalg.norm(Cvec2[v]))
            sim_matrix[u, v] = cos_sim

    return sim_matrix


In [None]:
def compute_topic_similarities(vectors1, vectors2, fig_title, store_path, umap_model, adjust = False):
  vectors1 = umap_model.fit_transform(vectors1)
  vectors2 = umap_model.fit_transform(vectors2)

  if adjust:
    similarities = adjusted_cosine_similarity(vectors1,vectors2)

  else:
    similarities = cosine_similarity(vectors1,vectors2)



  mean_similarity = np.mean(similarities)
  plt.clf()
  fig, ax = plt.subplots(figsize=(6, 6))
  plot = sns.heatmap(similarities, ax=ax,annot=False, cmap='coolwarm')
  plt.title(f'{fig_title}, mean: {mean_similarity:.2f}')

  plot.figure.savefig(store_path)
  plt.close(fig)  # release the figure so that repeated calls do not accumulate open figures

  return similarities, mean_similarity



The cell below corresponds to cosine similarity computation for BERTopic and Top2Vec models.

In [None]:
names = ['lit','publ', 'parlament_sample', 'fakt_ind_nd','lalka-tom-pierwszy','lalka-tom-drugi']
models = ['bertopic_multilingual', 'bertopic_spacy', 'top2vec', 'top2vec_multilingual']
adjusted = [False, True]
adjusted_encoded = {True:'adjusted', False:"not_adjusted"}
pairs = list(combinations_with_replacement(names,2))
models_encoded = {'bertopic_multilingual':'multilingual', 'bertopic_spacy':'spacy', 'top2vec':'t2v', 'top2vec_multilingual':'t2v_m'}

umap_model = umap.UMAP(n_neighbors=15, n_components=5,
              min_dist=0.0, metric='cosine', random_state=42)


for model in models:
  sim = np.zeros(shape=(len(names), len(names)))
  df_adjusted = pd.DataFrame(data= sim, index=names,columns=names)
  df_not_adjusted = pd.DataFrame(data= sim.copy(), index=names,columns=names)

  for pair in tqdm.tqdm(pairs):
    embs1 = np.array(load_file(f'embeddings/{model}/topic/{pair[0]}_{model}_topic_embeddings'))
    embs2 = np.array(load_file(f'embeddings/{model}/topic/{pair[1]}_{model}_topic_embeddings'))

    for a in adjusted:

      path_to_store_fig = f'.../topic-modelling-polish/figures/cosine/{model}/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_cosine_similarities_{adjusted_encoded[a]}'
      plt.clf()


      fig_title = f'{pair[0]} {pair[1]} {models_encoded[model]} cosine {adjusted_encoded[a]}'

      similarities, mean_similarity  = compute_topic_similarities(embs1, embs2, fig_title, path_to_store_fig, umap_model, adjust = a)
      if a:
        df_adjusted.loc[pair[0], pair[1]] = mean_similarity
        df_adjusted.loc[pair[1], pair[0]] = mean_similarity
      else:
        df_not_adjusted.loc[pair[0], pair[1]] = mean_similarity
        df_not_adjusted.loc[pair[1], pair[0]] = mean_similarity

      #similarities between topics of two corpora
      dump_result(similarities, f'results/{model}/cosine/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_{model}_similarities_{adjusted_encoded[a]}')

  #mean similarities between pair of corpora
  df_adjusted.to_csv(f'.../topic-modelling-polish/results/{model}/cosine/corpora_similarity/adjusted/{model}_mean_similarities_adjusted.csv')

  df_not_adjusted.to_csv(f'.../topic-modelling-polish/results/{model}/cosine/corpora_similarity/not_adjusted/{model}_mean_similarities_not_adjusted.csv')


The cosine similarity computation for Fasttext is placed in a separate cell because the topic vectors have to be derived independently. A topic vector is derived simply as the average of all vectors of a topic.

In [None]:
names = ['fakt_ind_nd','publ','lit', 'parlament_sample','lalka-tom-pierwszy','lalka-tom-drugi']
pairs = list(combinations_with_replacement(names,2))
adjusted = [False, True]
adjusted_encoded = {True:'adjusted', False:"not_adjusted"}
umap_model = umap.UMAP(n_neighbors=15, n_components=5,
              min_dist=0.0, metric='cosine', random_state=42)





def compute_topic_vectors_from_Fasttext(topics):
  topic_vectors = []
  unzipped = list(map(lambda x: list(zip(*x)),topics))
  for tupl in unzipped:
    result = np.array(tuple(s * a for s, a in zip(tupl[1], tupl[2])))
    topic_vector = np.mean(result,axis=0)
    topic_vectors.append(topic_vector)
  topic_vectors = np.array(topic_vectors)
  return topic_vectors


for k in tqdm.tqdm([10,50,100,200]):
  sim = np.zeros(shape=(len(names), len(names)))
  df_adjusted = pd.DataFrame(data= sim, index=names,columns=names)
  df_not_adjusted = pd.DataFrame(data= sim.copy(), index=names,columns=names)
  for pair in pairs:

    topics1 = load_file(f'topics/fasttext100/full topics/{pair[0]}_fasttext100_{k}_topics_full_topic')
    topics2 = load_file(f'topics/fasttext100/full topics/{pair[1]}_fasttext100_{k}_topics_full_topic')

    topic_vectors1 = compute_topic_vectors_from_Fasttext(topics1)
    topic_vectors2 = compute_topic_vectors_from_Fasttext(topics2)
    for a in adjusted:


      path_to_store_fig = f'.../topic-modelling-polish/figures/cosine/fasttext100/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_fasttext100_{k}_similarities_{adjusted_encoded[a]}'
      plt.clf()

      fig_title = f'{pair[0]} {pair[1]} fasttext100 cosine {adjusted_encoded[a]}'

      similarities, mean_similarity  = compute_topic_similarities(topic_vectors1, topic_vectors2, fig_title, path_to_store_fig, umap_model, adjust = a)
      if a:
        df_adjusted.loc[pair[0], pair[1]] = mean_similarity
        df_adjusted.loc[pair[1], pair[0]] = mean_similarity
      else:
        df_not_adjusted.loc[pair[0], pair[1]] = mean_similarity
        df_not_adjusted.loc[pair[1], pair[0]] = mean_similarity



      dump_result(similarities, f'results/fasttext100/cosine/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_similarities_fasttext100_{k}_{adjusted_encoded[a]}')


  df_adjusted.to_csv(f'.../topic-modelling-polish/results/fasttext100/cosine/corpora_similarity/adjusted/fasttext100_{k}_mean_similarities_adjusted.csv')
  df_not_adjusted.to_csv(f'.../topic-modelling-polish/results/fasttext100/cosine/corpora_similarity/not_adjusted/fasttext100_{k}_mean_similarities_not_adjusted.csv')


## Average closest topic

---
The last measure used in this work for semantic comparison of corpora is the average closest topic. For each topic in a given corpus, the closest topic in the other corpus (the one with the highest cosine similarity) is found, and these maximum similarity values are then averaged. This process is repeated for each model and for each pair of corpora.
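
A small worked example shows why the measure is not symmetric. The similarity matrix below is invented: its rows stand for the topics of one corpus (A) and its columns for the topics of another (B).

```python
import numpy as np

# Invented similarity matrix: 2 topics in corpus A (rows), 3 in corpus B (columns).
similarities = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
])

# For each topic of A: similarity to its closest topic in B (maximum of the row).
closest_for_a = similarities.max(axis=1)

# For each topic of B: similarity to its closest topic in A (maximum of the column).
closest_for_b = similarities.max(axis=0)

avg_a_to_b = closest_for_a.mean()
avg_b_to_a = closest_for_b.mean()

print(f"A -> B: {avg_a_to_b:.2f}")
print(f"B -> A: {avg_b_to_a:.2f}")
```

Every topic of A has a close counterpart in B, so the row-wise average is 0.85. The third topic of B has no good match in A, which pulls the column-wise average down to 0.70. This is why the result tables store the two directions in separate cells.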

The cell below corresponds again only to BERTopic and Top2Vec models.

---

In [None]:
names = ['lit','publ', 'parlament_sample', 'fakt_ind_nd','lalka-tom-pierwszy','lalka-tom-drugi']
pairs = list(combinations_with_replacement(names,2))
models = ['bertopic_multilingual', 'bertopic_spacy', 'top2vec', 'top2vec_multilingual']
adjusted = [False, True]
adjusted_encoded = {True:'adjusted', False:"not_adjusted"}




for model in tqdm.tqdm(models):
  sim = np.zeros(shape=(len(names), len(names)))
  df_adjusted = pd.DataFrame(data= sim, index=names,columns=names)
  df_not_adjusted = pd.DataFrame(data= sim.copy(), index=names,columns=names)
  for pair in pairs:
    for a in adjusted:

      matrix = load_file(f'results/{model}/cosine/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_{model}_similarities_{adjusted_encoded[a]}')
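      column_max_values = np.max(matrix,axis=0) #for each topic of pair[1] (columns): similarity to its closest topic in pair[0]
      row_max_values = np.max(matrix,axis=1) #for each topic of pair[0] (rows): similarity to its closest topic in pair[1]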

      mean_column = np.mean(column_max_values)
      mean_row = np.mean(row_max_values)

      if a:
        df_adjusted.loc[pair[0], pair[1]] = mean_column
        df_adjusted.loc[pair[1], pair[0]] = mean_row
      else:
        df_not_adjusted.loc[pair[0], pair[1]] = mean_column
        df_not_adjusted.loc[pair[1], pair[0]] = mean_row




  # written once per model, after all pairs have been processed
  df_adjusted.to_csv(f'.../topic-modelling-polish/results/{model}/cosine/average_closest_topics/adjusted/{model}_average_closest_topic_adjusted.csv')
  df_not_adjusted.to_csv(f'.../topic-modelling-polish/results/{model}/cosine/average_closest_topics/not_adjusted/{model}_average_closest_topic_not_adjusted.csv')

100%|██████████| 4/4 [00:17<00:00,  4.27s/it]


Average closest topic calculation for Fasttext.

In [None]:
names = ['fakt_ind_nd','publ','lit', 'parlament_sample','lalka-tom-pierwszy','lalka-tom-drugi']
pairs = list(combinations_with_replacement(names,2))
adjusted = [False, True]
adjusted_encoded = {True:'adjusted', False:"not_adjusted"}




for k in tqdm.tqdm([10,50,100,200]):
  sim = np.zeros(shape=(len(names), len(names)))
  df_adjusted = pd.DataFrame(data= sim, index=names,columns=names)
  df_not_adjusted = pd.DataFrame(data= sim.copy(), index=names,columns=names)
  for pair in pairs:
    for a in adjusted:

      matrix = load_file(f'results/fasttext100/cosine/corpora_similarity/{adjusted_encoded[a]}/{pair[0]}_{pair[1]}_similarities_fasttext100_{k}_{adjusted_encoded[a]}')
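      column_max_values = np.max(matrix,axis=0) #for each topic of pair[1] (columns): similarity to its closest topic in pair[0]
      row_max_values = np.max(matrix,axis=1) #for each topic of pair[0] (rows): similarity to its closest topic in pair[1]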

      mean_column = np.mean(column_max_values)
      mean_row = np.mean(row_max_values)

      if a:
        df_adjusted.loc[pair[0], pair[1]] = mean_column
        df_adjusted.loc[pair[1], pair[0]] = mean_row
      else:
        df_not_adjusted.loc[pair[0], pair[1]] = mean_column
        df_not_adjusted.loc[pair[1], pair[0]] = mean_row




  df_adjusted.to_csv(f'.../topic-modelling-polish/results/fasttext100/cosine/average_closest_topics/adjusted/fasttext100_{k}_average_closest_topic_adjusted.csv')
  df_not_adjusted.to_csv(f'.../topic-modelling-polish/results/fasttext100/cosine/average_closest_topics/not_adjusted/fasttext100_{k}_average_closest_topic_not_adjusted.csv')

100%|██████████| 4/4 [00:00<00:00,  4.78it/s]


## Reaching the topics

---
For better readability, the extracted topics are also saved in `json` format. They can be found in the `topics/json/` folder.
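
For Polish text, readability of the JSON files depends on how non-ASCII characters are written. The sketch below compares the default escaping of `json.dumps` with `ensure_ascii=False` on an invented topic dictionary.

```python
import json

# Invented topics containing Polish diacritics.
toy_topics = {0: ["poseł", "ustawa"], 1: ["książka", "sklep"]}

escaped = json.dumps(toy_topics)                       # default: non-ASCII is escaped
readable = json.dumps(toy_topics, ensure_ascii=False)  # characters are kept as they are

print(escaped)
print(readable)

# Both strings decode to the same data; note that integer keys become strings in JSON.
restored = json.loads(readable)
```

Files written with `ensure_ascii=False` should be opened with an explicit `encoding='utf-8'`, so that the characters are stored the same way on every platform.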

---

In [None]:
import json

In [None]:
def pickle_to_json(topics, model, name, k = None):

  if k is not None:
    model = f'{model}_{k}'

  top_dict = {}
  for i,top in enumerate(topics):
    top_dict[i] = list(top)

  # ensure_ascii=False keeps Polish characters readable in the output file
  with open(f'.../topic-modelling-polish/topics/json/{model}_{name}_topics.json', 'w', encoding = 'utf-8') as f:
      json.dump(top_dict, f, indent=3, ensure_ascii=False)


In [None]:
names = ['fakt_ind_nd','publ', 'lit','parlament_sample','lalka-tom-pierwszy','lalka-tom-drugi']
models = ['bertopic_multilingual', 'bertopic_spacy', 'top2vec', 'fasttext100', 'top2vec_multilingual']


for name in names:
  for model in models:
    if model == 'fasttext100':
      for k in [10,100]:
        topics = load_file(f'topics/fasttext100/word lists/{name}_fasttext100_{k}_topics_word_list')
        pickle_to_json(topics, model, name, k = k)
    else:
      topics = load_file(f'topics/{model}/word lists/{name}_{model}_topics_word_list')
      pickle_to_json(topics, model, name)



## Reranking Fasttext topics

---
Topics found with the Fasttext model appear in the list in random order. Many of these topics are of very good quality: they are consistent and seem to mirror the content of the corpora effectively. Certain topics, however, appear to have been formed on the basis of orthographic resemblance rather than semantic relevance. Reranking can make the results clearer.

In the subsequent section, a simple reranking technique is applied that visibly helps to filter the results. It relies on three variables: the mean edit distance between all words within a given topic (accounting for orthographic similarity), the mean cosine similarity between the words of the topic (capturing semantic coherence), and the average TF-IDF weight of the topic. The latter is the average of the weights assigned to each word within the topic and indicates how significant these words are within the corpus.

In the following section, reranking will be applied to the topics extracted from the parliamentary corpus (in the version of 100 topics).
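
The orthographic signal can be illustrated on toy topics before it is applied to the corpus. The sketch below uses a pure-Python Levenshtein distance as a stand-in for the `editdistance` package, and a min-max normalization helper; both helper names and the example topics are invented for illustration.

```python
from itertools import combinations


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def mean_pairwise_edit_distance(words):
    """Average edit distance over all unordered pairs of words."""
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Invented topics: inflected forms of one word vs. semantically related words.
orthographic_topic = ["poseł", "posła", "posłem"]
semantic_topic = ["sejm", "ustawa", "rząd"]

d_orth = mean_pairwise_edit_distance(orthographic_topic)
d_sem = mean_pairwise_edit_distance(semantic_topic)

print(f"orthographic topic: {d_orth:.2f}")
print(f"semantic topic: {d_sem:.2f}")
print(min_max([d_orth, d_sem]))
```

A topic built from spelling variants of one word has a low mean edit distance, so after normalization it ends up at the bottom of the scale, while a topic of distinct but related words scores higher.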

---

In [None]:
import itertools
import math
import pandas as pd
import numpy as np
import json
import editdistance

In [None]:
def average_cosine_similarity(vectors):
    """Return the mean cosine similarity over all unique pairs of vectors."""
    vectors = [np.array(vector) for vector in vectors]
    num_vectors = len(vectors)
    if num_vectors < 2:
        return 0.0  # No pairs to compare; avoids a division by zero below

    total_similarity = 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):  # Upper triangle only (no self-pairs, no reversed pairs)
            similarity = np.dot(vectors[i], vectors[j]) / (np.linalg.norm(vectors[i]) * np.linalg.norm(vectors[j]))
            total_similarity += similarity

    num_pairs = num_vectors * (num_vectors - 1) / 2  # Number of unique pairs
    return total_similarity / num_pairs

In [None]:
def mean_edit_distanc(str_list):
    total_distance = 0
    num_pairs = 0

    for i in range(len(str_list)):
        for j in range(i + 1, len(str_list)):
            distance = editdistance.eval(str_list[i], str_list[j])
            total_distance += distance
            num_pairs += 1

    if num_pairs > 0:
        mean_distance = total_distance / num_pairs
    else:
        mean_distance = 0

    return mean_distance

In [None]:
weights = load_file('weights/parlament_sample_tfidf_weights')
embeddings = load_file('embeddings/fasttext100/parlament_sample_fasttext100_embeddings_dict')
topics = load_file('topics/fasttext100/word lists/parlament_sample_fasttext100_100_topics_word_list')

# For each topic, collect its words and compute three signals:
# mean TF-IDF weight, mean pairwise cosine similarity, mean pairwise edit distance.
top_dict = {}
top_weight = {}
epsilon = 1e-9  # Keeps the logarithms in the score finite when a normalized value is 0
for i, top in enumerate(topics):
  l = []
  weight_sum = 0
  vectors = []
  for word in top:
    weight_sum += weights[word]
    l.append(str(word))
    vectors.append(embeddings[word])

  av_sim = average_cosine_similarity(vectors)
  mean_edit_dist = mean_edit_distanc(l)
  top_weight[i] = (weight_sum / len(top), av_sim, mean_edit_dist)
  top_dict[i] = l

# Value ranges of each signal, used for min-max normalization
weight_values = [value[0] for value in top_weight.values()]
av_sim_values = [value[1] for value in top_weight.values()]
edit_values = [value[2] for value in top_weight.values()]
min_weight, max_weight = min(weight_values), max(weight_values)
min_av_sim, max_av_sim = min(av_sim_values), max(av_sim_values)
min_edit, max_edit = min(edit_values), max(edit_values)

# Combine the normalized signals into one score per topic.
# The TF-IDF weight enters linearly; cosine similarity and edit distance enter
# through logarithms, so a topic at the minimum of either is pushed to the bottom.
top_weight_2 = {}
for key, value in top_weight.items():
  try:
    norm_weight = (value[0] - min_weight) / (max_weight - min_weight)
    norm_sim = (value[1] - min_av_sim) / (max_av_sim - min_av_sim)
    norm_edit = (value[2] - min_edit) / (max_edit - min_edit)
    score = norm_weight * 0.5 + math.log(norm_sim * 0.3 + epsilon) + math.log(norm_edit * 0.4 + epsilon)
    top_weight_2[key] = score
  except ValueError:
    print((key, value))

# Order topics from highest to lowest score
sorted_dict_with_lists = {
    key: top_dict[key] for key, _ in sorted(top_weight_2.items(), key=lambda x: x[1], reverse=True)
}




with open('.../topic-modelling-polish/topics/json/reranked/parlament_sample_topics_fasttext100_100_reranked.json', 'w', encoding ='utf8') as json_file:
    json.dump(sorted_dict_with_lists, json_file, ensure_ascii = False, indent = 4)


Further steps to improve the results, such as stopword removal, could also be considered.
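
As a rough idea of what such a filter might look like, the sketch below drops stopwords from each topic and discards topics that become too short. Everything in it is an assumption made for illustration: the function name `remove_stopwords`, the `min_words` threshold, the tiny stopword set, and the example topics are all hypothetical, and a real run would load a full Polish stopword list.

```python
# Hypothetical, deliberately tiny stopword set used only for this illustration
POLISH_STOPWORDS = {"i", "w", "na", "że", "się", "nie", "to"}


def remove_stopwords(topics, stopwords, min_words=2):
    """Drop stopwords from every topic; skip topics left with fewer than min_words words."""
    filtered = {}
    for topic_id, words in topics.items():
        kept = [word for word in words if word.lower() not in stopwords]
        if len(kept) >= min_words:
            filtered[topic_id] = kept
    return filtered


# Made-up topics in the same {topic_id: [words]} shape as the reranked JSON output
example_topics = {
    0: ["sejm", "i", "ustawa"],
    1: ["że", "się", "nie"],
    2: ["poseł", "na", "głosowanie"],
}

cleaned = remove_stopwords(example_topics, POLISH_STOPWORDS)
print(cleaned)
```

Filtering after reranking keeps the scoring untouched; filtering before it would change the mean weights and distances, so the order of the two steps is a design choice worth testing.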

The reranked topics are saved in the `topics/json/reranked` directory. Reranking moved topics related to parliament, politics, and the nation toward the top of the list, which matches the thematic content one would expect from the parliamentary corpus.