<a href="https://www.kaggle.com/code/mortezaheidari/ctm-contextualized-topic-modeling?scriptVersionId=168601617" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Combined TM combines the BoW with SBERT

## Introduction: Unveiling the Significance of Topic Modeling

Topic modeling, a powerful analytical tool, enables the extraction of meaningful patterns embedded within textual data. Its primary function is to facilitate the comprehension of large document collections by uncovering latent distributions of topics. This process simplifies the task of gaining insights into the content without the need to read each document individually. Through the lens of topic models, we can obtain a bird's eye view of the textual landscape, making it an invaluable asset for anyone dealing with extensive sets of documents.

## Exploring the Functionality of Topic Models

In practical terms, topic models provide a structured approach to understanding the recurring themes present in a document collection. Rather than delving into individual documents one by one, topic modeling allows for the extraction of prevalent topics, offering a condensed summary of the overall content. By identifying patterns and themes that repeat across documents, this analytical technique transforms the daunting task of reviewing large datasets into a more manageable and insightful process. It becomes an indispensable aid for researchers, analysts, and anyone seeking a comprehensive overview of complex textual information.

## Applications and Benefits of Topic Modeling

Beyond the efficient summarization of document collections, topic modeling finds applications in various fields. It is widely used in natural language processing, information retrieval, and content recommendation systems. By revealing the underlying structures of textual data, topic models contribute to enhanced decision-making processes and foster a deeper understanding of the subject matter. Whether applied in academic research, business intelligence, or data analysis, topic modeling stands as a versatile tool that unlocks valuable insights from the vast realm of textual information.


#### Ref1: Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. European Chapter of the Association for Computational Linguistics (EACL). https://arxiv.org/pdf/2004.07737/
#### Ref2: https://github.com/MilaNLProc/contextualized-topic-models/tree/master

## Getting the data from Wikipedia for training and validation

In [1]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_prep.txt

### Installing Contextualized Topic Models


In [2]:
%%capture
!pip install contextualized-topic-models
!pip install torch torchvision

In [3]:
conda install -c conda-forge library_name

Retrieving notices: ...working... done
done
Solving environment: unsuccessful initial attempt using frozen solve. Retrying with flexible solve.
done
Solving environment: unsuccessful initial attempt using frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - library_name

Current channels:

  - https://conda.anaconda.org/conda-forge/linux-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://conda.anaconda.org/rapidsai/linux-64
  - https://conda.anaconda.org/rapidsai/noarch
  - https://conda.anaconda.org/nvidia/linux-64
  - https://conda.anaconda.org/nvidia/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search ba

### Loading libraries:

In [4]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file, TopicModelDataPreparation
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.models import ldamodel 
import os
import numpy as np
import pickle
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"



## Let's read our data files and store the documents as lists of strings

In [5]:
with open("dbpedia_sample_abstract_20k_prep.txt", 'r') as fr_prep:
  text_training_preprocessed = [line.strip() for line in fr_prep.readlines()]

with open("dbpedia_sample_abstract_20k_unprep.txt", 'r') as fr_unprep:
  text_training_not_preprocessed = [line.strip() for line in fr_unprep.readlines()]

### NOTE: Make sure that the lenghts of the two lists of documents are the same and the index of a not preprocessed document corresponds to the index of the same preprocessed document.

In [6]:
assert len(text_training_preprocessed) == len(text_training_not_preprocessed)

print(text_training_not_preprocessed[0])
print(text_training_preprocessed[0])

The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry


# Split the documents into training and testing

In [7]:
training_bow_documents = text_training_preprocessed[0:15000]
training_contextual_document = text_training_not_preprocessed[0:15000]

testing_bow_documents = text_training_preprocessed[15000:]
testing_contextual_documents = text_training_not_preprocessed[15000:]

## Creating the Training Dataset
#### Let's pass our files with preprocess data to our TopicModelDataPreparation object. This object takes care of creating the bag of words for you and of obtaining the contextualized BERT representations of documents. This operation allows us to create our training dataset.

In [8]:
tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")
# qt = TopicModelDataPreparation("all-mpnet-base-v2") # TODO: you can change this data preparation 
training_dataset = tp.fit(training_contextual_document, training_bow_documents)

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/3.99k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/75 [00:00<?, ?it/s]

Why do we use the preprocessed text here? We need text without punctuation to build the bag of word. Also, we might want only to have the most frequent words inside the BoW. Too many words might not help.

And what about the unpreprocessed text? We provide unpreprocessed text as the input for BERT (or the contextualized model of your choice) to let the model output more accurate document representations.

In [9]:
tp.vocab[:10]

array(['abbreviated', 'academic', 'academy', 'access', 'according',
       'achieved', 'acquired', 'acre', 'acres', 'across'], dtype=object)

### Combined Topic Model
Here is how you can use the CombinedTM. This is a standard topic model that also uses contextualized embeddings. The good thing about CombinedTM is that it makes your topic much more coherent (see the paper https://arxiv.org/abs/2004.03974)
#### ***  Combined TM combines the BoW with SBERT, a process that seems to increase the coherence of the predicted topics (https://arxiv.org/pdf/2004.03974.pdf). 

In [10]:
TOKENIZERS_PARALLELISM= False
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=50)
ctm.fit(training_dataset)  
print(ctm.get_topics(2))

Epoch: [100/100]	 Seen Samples: [1497600/1500000]	Train Loss: 134.99105241563586	Time: 0:00:02.541938: : 100it [04:26,  2.67s/it]
100%|██████████| 235/235 [00:02<00:00, 103.54it/s]

defaultdict(<class 'list'>, {0: ['chinese', 'pinyin'], 1: ['film', 'best'], 2: ['west', 'within'], 3: ['new', 'york'], 4: ['rugby', 'union'], 5: ['chemical', 'given'], 6: ['league', 'club'], 7: ['women', 'world'], 8: ['states', 'united'], 9: ['government', 'responsible'], 10: ['town', 'parish'], 11: ['radio', 'fm'], 12: ['sierra', 'western'], 13: ['english', 'first'], 14: ['election', 'held'], 15: ['book', 'published'], 16: ['school', 'high'], 17: ['born', 'professional'], 18: ['brown', 'described'], 19: ['island', 'point'], 20: ['american', 'television'], 21: ['college', 'school'], 22: ['family', 'found'], 23: ['university', 'professor'], 24: ['french', 'son'], 25: ['iran', 'persian'], 26: ['park', 'state'], 27: ['research', 'journal'], 28: ['summer', 'metres'], 29: ['served', 'john'], 30: ['directed', 'film'], 31: ['album', 'band'], 32: ['historic', 'national'], 33: ['music', 'known'], 34: ['game', 'video'], 35: ['company', 'manufacturer'], 36: ['system', 'computer'], 37: ['member', 




## Save the model for the future references

In [11]:
ctm.save(models_dir="./")




## Loading the Model from saved directory

In [12]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=50)
file_dir = "/kaggle/working/contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99"
ctm.load("/kaggle/working/contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99/",
                                                                                                      epoch=99)



After training, now it is the time to look at our topics: we can use the function  get_topic_lists  to get the topics. It also accept a parameter that allows you to select how many words you want to see for each topic. If you look at the topics, you will see that they all make sense and are representative of a collection of documents that comes from Wikipedia (general knowledge).

In [13]:
ctm.get_topic_lists(5)

[['chinese', 'pinyin', 'china', 'station', 'metro'],
 ['film', 'best', 'films', 'director', 'award'],
 ['west', 'within', 'south', 'approximately', 'east'],
 ['new', 'york', 'zealand', 'united', 'city'],
 ['rugby', 'union', 'australian', 'club', 'played'],
 ['chemical', 'given', 'either', 'space', 'defined'],
 ['league', 'club', 'season', 'football', 'stadium'],
 ['women', 'world', 'tennis', 'tournament', 'held'],
 ['states', 'united', 'list', 'county', 'state'],
 ['government', 'responsible', 'act', 'workers', 'police'],
 ['town', 'parish', 'small', 'village', 'located'],
 ['radio', 'fm', 'station', 'broadcasting', 'owned'],
 ['sierra', 'western', 'native', 'natural', 'southern'],
 ['english', 'first', 'made', 'right', 'class'],
 ['election', 'held', 'council', 'seats', 'member'],
 ['book', 'published', 'author', 'writer', 'books'],
 ['school', 'high', 'public', 'located', 'schools'],
 ['born', 'professional', 'world', 'player', 'russian'],
 ['brown', 'described', 'black', 'family', '

##  Now we are going to use the testset: we want to predict the topic for unseen documents.

In [14]:
# cear test set
testing_dataset = tp.fit(testing_contextual_documents, testing_bow_documents) # create dataset for the testset
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=10)

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


Batches:   0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 79/79 [00:00<00:00, 87.31it/s] 


In [15]:
print(testing_contextual_documents[15])

topic_index = np.argmax(predictions[15])
ctm.get_topic_lists(5)[topic_index]

Dhale (Arabic: الضالع‎‎ Aḍ Ḍāliʿ) province is one of the governorates of Yemen that have been created after the announcement of Yemeni unification. The population of the province accounted for (2.4%) of the total population of the Republic, and allocated administratively into (9) districts. Dali city is the centre of


['iran', 'persian', 'district', 'also', 'rural']

In [16]:
testing_dataset = tp.transform(text_for_contextual=testing_contextual_documents, text_for_bow=testing_bow_documents)

# n_sample how many times to sample the distribution (see the doc)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document


  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


Batches:   0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 79/79 [00:00<00:00, 84.37it/s] 


array([[0.00709262, 0.04367334, 0.00681018, ..., 0.00664655, 0.01195287,
        0.00797423],
       [0.00852594, 0.01326338, 0.00632783, ..., 0.0052157 , 0.01131592,
        0.00638541],
       [0.02267164, 0.01285871, 0.01620427, ..., 0.00576139, 0.00794419,
        0.02032054],
       ...,
       [0.02139176, 0.03955531, 0.01359563, ..., 0.0753573 , 0.01213   ,
        0.00568026],
       [0.01651042, 0.01284623, 0.01245495, ..., 0.01171216, 0.01281426,
        0.01155141],
       [0.01543206, 0.01359013, 0.00893943, ..., 0.0089167 , 0.00657846,
        0.00734071]], dtype=float32)

### Topic Predictions
Now we can take a document and see which topic has been assigned to it. Results will obviously change with respect to the documents you are using. For example, let's predict the topic of the first preprocessed document that is talking about a peninsula.

In [17]:
topics_predictions = ctm.get_thetas(testing_dataset, n_samples=5) # get all the topic predictions

100%|██████████| 79/79 [00:00<00:00, 81.96it/s] 


In [18]:
training_bow_documents[0] # see the text of our preprocessed document

'mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry'

In [19]:
topic_number = np.argmax(topics_predictions[0]) # get the topic id of the first document
print(ctm.get_topic_lists(5)[15])
print(ctm.get_topic_lists(5)[topic_number]) #and the topic should be about natural location/places/related things


['book', 'published', 'author', 'writer', 'books']
['university', 'professor', 'studied', 'received', 'degree']
