**Contextualized topic modeling to extract topics from a collection of Wikipedia abstracts**

Reference : https://colab.research.google.com/drive/1euxW3ya3_PX6Kj1tnCNrIQ7pjZIODsB6?usp=sharing <br/>
Dataset : A sample of Wikipedia (DBpedia) abstracts, downloaded below in both preprocessed and unpreprocessed form and used to run the topic modeling pipeline.

In [None]:
%%capture
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_prep.txt

In [None]:
%%capture
# Install the contextualized topic model library (%%capture must be the first line of the cell)
!pip install contextualized-topic-models==1.8.1
!pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

**Installing TensorBoard**

In [1]:
!pip install tensorboard



In [None]:
# from keras.callbacks import TensorBoard
# from time import time

# # Create a TensorBoard instance with the path to the logs directory
# tensorboard = TensorBoard(log_dir='logs/{}'.format(time()))

In [4]:
from torch.utils.tensorboard import SummaryWriter
tb = SummaryWriter()

**Importing necessary libraries**

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file, TopicModelDataPreparation
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.evaluation.measures import CoherenceNPMI, InvertedRBO
from gensim.corpora.dictionary import Dictionary
from gensim.models import ldamodel 
import os
import numpy as np
import pickle

Reading our data files and storing the documents as lists of strings:

In [None]:
with open("dbpedia_sample_abstract_20k_prep.txt", 'r') as fr_prep:
  text_training_preprocessed = [line.strip() for line in fr_prep.readlines()]

with open("dbpedia_sample_abstract_20k_unprep.txt", 'r') as fr_unprep:
  text_training_not_preprocessed = [line.strip() for line in fr_unprep.readlines()]

NOTE: The two lists must have the same length, and the document at a given index of the unpreprocessed list must be the same document found at that index of the preprocessed list.

In [None]:
assert len(text_training_preprocessed) == len(text_training_not_preprocessed)

print(text_training_not_preprocessed[0])
print(text_training_preprocessed[0])

Splitting the documents into training and test sets

In [None]:
training_bow_documents = text_training_preprocessed[0:15000]
training_contextual_document = text_training_not_preprocessed[0:15000]

testing_bow_documents = text_training_preprocessed[15000:]
testing_contextual_documents = text_training_not_preprocessed[15000:]
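Because the two lists must stay index-aligned, it can help to split them through a single helper rather than slicing each list by hand. A minimal sketch, assuming a hypothetical helper named `split_aligned` (it is not part of the contextualized-topic-models library):

```python
def split_aligned(preprocessed, unpreprocessed, n_train):
    """Split two index-aligned document lists at the same position."""
    if len(preprocessed) != len(unpreprocessed):
        raise ValueError("document lists must have the same length")
    train = (preprocessed[:n_train], unpreprocessed[:n_train])
    test = (preprocessed[n_train:], unpreprocessed[n_train:])
    return train, test


# Toy example with four documents instead of the full collection
prep = ["a b", "c d", "e f", "g h"]
unprep = ["A, b.", "C, d.", "E, f.", "G, h."]
(train_bow, train_ctx), (test_bow, test_ctx) = split_aligned(prep, unprep, 3)
```

Splitting both lists in one call makes it harder to use different cut points for the bag-of-words and contextual documents by accident.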

Creating the Training Dataset <br/>

*   We pass the unpreprocessed and preprocessed training documents to our TopicModelDataPreparation object.
*   This object takes care of creating the bag of words and of obtaining the contextualized BERT representations of the documents.
*   The result is our training dataset.

In [None]:
tp = TopicModelDataPreparation("bert-base-nli-mean-tokens")

training_dataset = tp.create_training_set(training_contextual_document, training_bow_documents)

Preprocessed text:<br/>

We need text without punctuation to build the bag of words. We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help. <br/>

Unpreprocessed text: <br/>

We provide unpreprocessed text as the input for BERT (or the contextualized model of your choice) to let the model output more accurate document representations. <br/>
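The preprocessing described for the bag of words can be sketched with the standard library alone. This is only an illustration, not the procedure used to produce the downloaded `_prep` file: the function name `simple_preprocess` and the `vocabulary_size` parameter are assumptions made for this example.

```python
import string
from collections import Counter


def simple_preprocess(documents, vocabulary_size=2000):
    """Lowercase, strip punctuation, and keep only the most frequent words."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(table).split() for doc in documents]

    # Count every token across the collection and keep the top words
    counts = Counter(token for tokens in tokenized for token in tokens)
    vocabulary = {word for word, _ in counts.most_common(vocabulary_size)}

    return [" ".join(t for t in tokens if t in vocabulary) for tokens in tokenized]


docs = ["The cat sat.", "The cat ran!", "A dog barked?"]
cleaned = simple_preprocess(docs, vocabulary_size=2)
print(cleaned)  # ['the cat', 'the cat', '']
```

Note that the third document becomes empty after filtering. If such documents are dropped, they have to be dropped from the unpreprocessed list as well so that the two lists stay aligned.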

Vocabulary:

In [None]:
tp.vocab[:10]

Training the Combined Contextualized Topic Model <br/>
Finally, we can fit our new topic model. We ask the model to find 50 topics in our collection (the n_components parameter of the CombinedTM object).

In [None]:
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=50, n_components=50)
ctm.fit(training_dataset)

In [None]:
tb.add_scalar("Loss", ctm.best_loss_train)
tb.close()

Saving the Model

In [None]:
ctm.save(models_dir="./")

Loading the Model

In [None]:
# del ctm

In [None]:
!ls

In [None]:
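# Re-create the model with the same hyperparameters that were used for training
ctm = CombinedTM(input_size=len(tp.vocab), bert_input_size=768, num_epochs=50, n_components=50)

# Epochs are numbered from 0, so a 50-epoch run saves its final checkpoint as epoch 49
ctm.load("contextualized_topic_model_nc_50_tpm_0.0_tpv_0.98_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99/",
         epoch=49)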

**Topics** <br/>
After training, it is time to look at our topics: we can use the 'get_topic_lists' function to retrieve them. It also accepts a parameter that selects how many words to show for each topic.<br/>

If you look at the topics, you should see that they are largely coherent and representative of a collection of documents that comes from Wikipedia (general knowledge).

In [None]:
ctm.get_topic_lists(5)
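Reading the topics by eye can be complemented with a quick redundancy check. The sketch below computes the fraction of distinct words across all topic lists; it is a simple illustration written for this notebook (the name `topic_diversity` and the example topics are made up), not one of the measures shipped with the library.

```python
def topic_diversity(topics):
    """Fraction of unique words across all topic word lists (1.0 means no overlap)."""
    all_words = [word for topic in topics for word in topic]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


# Hypothetical topic lists standing in for the output of ctm.get_topic_lists(5)
example_topics = [
    ["film", "directed", "starring", "drama", "released"],
    ["album", "band", "released", "song", "music"],
]
diversity = topic_diversity(example_topics)
print(diversity)  # 0.9
```

A value close to 1.0 suggests the topics share few top words, while a low value suggests that several topics repeat the same vocabulary.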

**Using the Test Set** <br/>
Now we are going to use the test set: we want to predict the topic of unseen documents.

In [None]:
testing_dataset = tp.create_test_set(testing_contextual_documents, testing_bow_documents) # create the dataset for the test set
predictions = ctm.get_doc_topic_distribution(testing_dataset, n_samples=1)

In [None]:
print(testing_contextual_documents[10])

topic_index = np.argmax(predictions[10])
ctm.get_topic_lists(5)[topic_index]
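Taking the argmax keeps only the single most probable topic. When a document mixes several themes, it can be useful to look at the top few topics instead. A minimal pure-Python sketch, assuming a hypothetical helper `top_topics` and a made-up distribution over four topics:

```python
def top_topics(distribution, k=1):
    """Return the indices of the k most probable topics, most probable first."""
    ranked = sorted(range(len(distribution)), key=lambda i: distribution[i], reverse=True)
    return ranked[:k]


# Hypothetical topic distribution for one document
doc_topics = [0.05, 0.60, 0.10, 0.25]
best = top_topics(doc_topics)
best_two = top_topics(doc_topics, k=2)
print(best)      # [1], the same index an argmax would select
print(best_two)  # [1, 3]
```

Each returned index can then be used to look up the corresponding word list in the output of 'get_topic_lists'.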

**Gradio**

In [None]:
!pip install -q gradio

In [None]:
import numpy as np
import gradio as gr

def predict_topic(text):
    # Build a one-document dataset from the submitted text. The raw text is also
    # used for the bag of words, which assumes the fitted vectorizer copes with
    # unpreprocessed input.
    text_dataset = tp.create_test_set([text], [text])
    prediction = ctm.get_doc_topic_distribution(text_dataset, n_samples=1)
    topic_index = np.argmax(prediction[0])
    # Return the top words of the predicted topic as a single string
    return ", ".join(ctm.get_topic_lists(5)[topic_index])

gr.Interface(fn=predict_topic,
             inputs="textbox",
             outputs="textbox").launch(share=True);