<a href="https://colab.research.google.com/github/maxmatical/ml-cheatsheet/blob/master/Contextualized_Topic_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installing Contextualized Topic Models

First, we install the Contextualized Topic Models library.

In [None]:
# !pip install contextualized-topic-models==2.2.0

In [None]:
# !pip install pyldavis

## Restart the Notebook

For the changes to take effect, we now need to restart the notebook.

From the Menu:

Runtime → Restart Runtime

In [None]:
import pandas as pd
import numpy as np
from contextualized_topic_models.models.ctm import CombinedTM, CTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk

In [None]:
df = pd.read_csv("freedom_intent_no_escalate.csv")
df.head(5)

Unnamed: 0,CLUSTER_ID,MESSAGE,SESSION_ID,CHANNEL,MESSAGE_INTENT_1,MESSAGE_INTENT_CONFIDENCE_1,MESSAGE_INTENT_2,MESSAGE_INTENT_CONFIDENCE_2,MESSAGE_INTENT_3,MESSAGE_INTENT_CONFIDENCE_3,MESSAGE_INTENT,MESSAGE_CONFIDENCE,IF_VALID_INTENT,LANGUAGE
0,0,netel,d63b71e0-e4c7-4012-822b-135f884dd90d:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
1,0,netel,5241a58a-9b9a-422e-b451-3832ac96e1dc:::6,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
2,0,netel,b864d516-66b3-4223-a852-e4b428b7578d:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
3,0,netel,719801fc-34c6-4ff9-8d7b-bb32d4de6766:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
4,0,netel,3808e306-4e23-4292-94f3-d1fd1b9bfd33:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en


## Preprocessing

Why do we use the **preprocessed text** here? We need text without punctuation to build the bag of words (BoW). We may also want to keep only the most frequent words in the BoW, since a very large vocabulary might not help.

In [None]:
remove_words = ["message with a rep", "hello", "hi", "neper", "neprd", "neloc", "netel", "good morning", "message rep", "nemail"]
df["MESSAGE"] = df["MESSAGE"].str.replace("|".join(remove_words), "", regex=True)  # remove unwanted phrases; the "|" pattern needs regex=True on pandas >= 2.0

df.head(5)

  


Unnamed: 0,CLUSTER_ID,MESSAGE,SESSION_ID,CHANNEL,MESSAGE_INTENT_1,MESSAGE_INTENT_CONFIDENCE_1,MESSAGE_INTENT_2,MESSAGE_INTENT_CONFIDENCE_2,MESSAGE_INTENT_3,MESSAGE_INTENT_CONFIDENCE_3,MESSAGE_INTENT,MESSAGE_CONFIDENCE,IF_VALID_INTENT,LANGUAGE
0,0,,d63b71e0-e4c7-4012-822b-135f884dd90d:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
1,0,,5241a58a-9b9a-422e-b451-3832ac96e1dc:::6,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
2,0,,b864d516-66b3-4223-a852-e4b428b7578d:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
3,0,,719801fc-34c6-4ff9-8d7b-bb32d4de6766:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
4,0,,3808e306-4e23-4292-94f3-d1fd1b9bfd33:::4,BOT,-1.0,-1.0,-1.0,-1.0,-1,-1,$shop_accessory_misc,0.9532752633,True,en
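
One thing to watch in the replacement above: the joined pattern matches substrings, so a short entry such as "hi" is also removed from inside longer words like "this". A minimal sketch of a word-boundary variant follows, using only the standard `re` module. The sample messages and the `strip_phrases` helper are hypothetical and only illustrate the idea; this is not what the cell above does.

```python
import re

# A subset of the phrases from remove_words above.
remove_words = ["message with a rep", "hello", "hi", "netel"]

# Longest phrases first so multi-word phrases are tried before shorter ones;
# re.escape guards against regex metacharacters, \b requires whole-word matches.
ordered = sorted(remove_words, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b")

def strip_phrases(message: str) -> str:
    """Remove whole-word occurrences of the unwanted phrases and tidy whitespace."""
    cleaned = pattern.sub("", message)
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical sample messages.
messages = ["hi this is about billing", "netel", "hello i want to message with a rep"]
cleaned = [strip_phrases(m) for m in messages]
print(cleaned)
```

With pandas, the same compiled pattern could be passed to `str.replace(..., regex=True)`. Whether the stricter matching is desirable depends on the data.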


In [None]:
documents = [str(txt).strip() for txt in list(df["MESSAGE"].unique())]
documents[:5]

['', '', 'data + talk', 'general inquiries', 'billing, payments']

In [None]:
nltk.download('stopwords')
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
preprocessed_documents[:5]

['data talk',
 'general inquiries',
 'billing payments',
 'troubleshooting',
 'account details']
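
To make the bag-of-words idea concrete, here is a small dependency-free sketch: it strips punctuation, keeps only the most frequent words, and counts them per document. The documents, the `tokenize` helper, and `vocab_size` are illustrative assumptions; `WhiteSpacePreprocessing` has its own implementation, which may differ in details such as stopword handling.

```python
import string
from collections import Counter

# Hypothetical documents in the style of the messages above.
docs = ["data + talk", "billing, payments", "billing problem: data is slow"]

def tokenize(text):
    """Lowercase, turn punctuation into spaces, and split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

tokenized = [tokenize(d) for d in docs]

# Keep only the vocab_size most frequent words (ties broken alphabetically).
vocab_size = 3
counts = Counter(tok for doc in tokenized for tok in doc)
vocab = sorted(counts, key=lambda w: (-counts[w], w))[:vocab_size]

# One count vector per document, with one column per vocabulary word.
bow = [[doc.count(w) for w in vocab] for doc in tokenized]
print(vocab)
print(bow)
```

Words outside the truncated vocabulary simply drop out of the count vectors, which is why an overly small vocabulary can leave short documents empty.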

We keep the unpreprocessed texts because we use them as input for the contextualized document representations.

Let's pass our preprocessed and unpreprocessed documents to the `TopicModelDataPreparation` object. It builds the bag of words and computes the contextualized representations of the documents. Together, these form our training dataset.

Note: Here we use the contextualized model "paraphrase-distilroberta-base-v1".


In [None]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)

Batches:   0%|          | 0/11 [00:00<?, ?it/s]

Let's check the size of the vocabulary and its first ten words.

In [None]:
len(tp.vocab)

1869

In [None]:
tp.vocab[:10]

['aap',
 'ab',
 'ability',
 'able',
 'ablle',
 'abroad',
 'absolute',
 'ac',
 'acc',
 'accept']

## Training Combined TM


In [None]:
# use this to see the arguments for CombinedTM
# ??CombinedTM

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=150, num_epochs=100)
ctm.fit(training_dataset) # run the model

Epoch: [100/100]	 Seen Samples: [216500/216500]	Train Loss: 93.05529463770208	Time: 0:00:01.062807: : 100it [01:42,  1.03s/it]


# Topics

Use `ctm.get_topic_lists(k)` to get the top `k` keywords for each topic.

In [None]:
ctm.get_topic_lists(3)

[['working', 'great', 'services'],
 ['images', 'https', 'activation'],
 ['data', 'bars', 'slower'],
 ['phone', 'card', 'number'],
 ['bill', 'paid', 'pay'],
 ['payment', 'error', 'go'],
 ['survey', 'freedommobile', 'signature'],
 ['sim', 'activate', 'new'],
 ['https', 'comes', 'hst'],
 ['agent', 'speak', 'person'],
 ['like', 'cancel', 'would'],
 ['question', 'missed', 'answering'],
 ['amendment', 'mentions', 'agreement'],
 ['recognizing', 'remembered', 'amendment'],
 ['account', 'value', 'months'],
 ['card', 'sim', 'credit'],
 ['locked', 'billing', 'maintenence'],
 ['change', 'address', 'payment'],
 ['payment', 'account', 'bill'],
 ['com', 'survey', 'freedommobile'],
 ['service', 'like', 'would'],
 ['issues', 'threats', 'continuously'],
 ['problem', 'billing', 'mytab'],
 ['bill', 'pay', 'account'],
 ['internet', 'want', 'home'],
 ['thank', 'past', 'much'],
 ['account', 'friends', 'card'],
 ['data', 'minutes', 'plan'],
 ['working', 'good', 'great'],
 ['plan', 'freedom', 'mobile'],
 ['pho
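
Conceptually, a top-`k` list like the one above can be read off a topic-word weight matrix by ranking each row. The sketch below shows that idea on a toy matrix. The `top_k_words` helper, the vocabulary, and the weights are made up for illustration, and the library's internal implementation may differ.

```python
def top_k_words(topic_word_weights, vocab, k):
    """For each topic (row), return the k vocabulary words with the largest weights."""
    topics = []
    for weights in topic_word_weights:
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        topics.append([vocab[i] for i in ranked[:k]])
    return topics

# Hypothetical vocabulary and weights: one row per topic, one column per word.
vocab = ["bill", "pay", "sim", "activate", "data"]
weights = [
    [0.9, 0.7, 0.1, 0.0, 0.2],  # a billing-like topic
    [0.0, 0.1, 0.8, 0.6, 0.3],  # a SIM-activation-like topic
]

topics = top_k_words(weights, vocab, k=2)
print(topics)
```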

# Let's Draw!

We can use pyLDAvis to plot our topics in a nice and friendly manner :)

In [None]:
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [00:09,  1.06it/s]


In [None]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [00:09,  1.08it/s]


# Topic Modelling - Inference

In [None]:
# an inference dataset built from unseen documents can be passed here as well
topics_predictions = ctm.get_thetas(training_dataset, n_samples=20) # more samples give a more stable estimate; the default is 20

Sampling: [20/20]: : 20it [00:18,  1.08it/s]
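
Why do more samples help? Our understanding is that the document-topic distribution comes from a stochastic encoder, so each pass gives a slightly different result and averaging several passes smooths out the noise. The sketch below illustrates only that averaging idea with Gaussian noise on toy logits. The functions, the noise model, and the numbers are assumptions, not the library's code.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_theta(mean, std, rng):
    """One stochastic draw: perturb the logits with Gaussian noise, then normalize."""
    return softmax([rng.gauss(mu, std) for mu in mean])

def average_theta(mean, std, n_samples, rng):
    """Average n_samples stochastic draws into one topic distribution."""
    acc = [0.0] * len(mean)
    for _ in range(n_samples):
        for i, p in enumerate(sample_theta(mean, std, rng)):
            acc[i] += p
    return [a / n_samples for a in acc]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta = average_theta(mean=[2.0, 0.5, -1.0], std=1.0, n_samples=20, rng=rng)
print(theta)
```

Each draw sums to one, so the average does too. Increasing `n_samples` reduces the run-to-run variation at the cost of proportionally more compute.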


In [None]:
preprocessed_documents[:5] # see the text of our preprocessed document

['data talk',
 'general inquiries',
 'billing payments',
 'troubleshooting',
 'account details']

In [None]:
topic_numbers = np.argmax(topics_predictions, axis=1) # get the topic ids

topic_numbers

array([ 48, 116,  16, ...,  28, 124,  19])
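
Taking the argmax throws away how confident each assignment is. As a small dependency-free sketch, we can keep the winning probability next to the topic id, which makes it possible to flag documents whose best topic is barely ahead of the rest. The `assign_topics` helper and the toy distributions are hypothetical.

```python
def assign_topics(thetas):
    """Return (topic_id, probability) for the most likely topic of each document."""
    assignments = []
    for row in thetas:
        best = max(range(len(row)), key=row.__getitem__)
        assignments.append((best, row[best]))
    return assignments

# Hypothetical document-topic distributions: one row per document.
thetas = [
    [0.10, 0.70, 0.20],  # clearly topic 1
    [0.50, 0.30, 0.20],  # topic 0
    [0.34, 0.33, 0.33],  # topic 0, but only barely
]

assignments = assign_topics(thetas)
topic_ids = [topic for topic, _ in assignments]
low_confidence = [i for i, (_, p) in enumerate(assignments) if p < 0.4]
print(topic_ids, low_confidence)
```

The 0.4 threshold is arbitrary. With 150 topics, a sensible cutoff would need to be chosen by inspecting the actual distributions.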

In [None]:
topic_lists = ctm.get_topic_lists(5)  # compute the topic word lists once, not once per document
topic_names = [topic_lists[topic_number] for topic_number in topic_numbers]
topic_names[:5]


[['talk', 'slow', 'matter', 'data', 'tng'],
 ['recognizing', 'cares', 'youtu', 'international', 'https'],
 ['locked', 'billing', 'maintenence', 'lowest', 'wash'],
 ['https', 'freedomcustomercare', 'yonyx', 'url', 'incident'],
 ['payment', 'account', 'credit', 'make', 'pay']]

In [None]:
data_inf = {"processed_txt": preprocessed_documents, "topic_num": topic_numbers, "topic_name": topic_names}
df_inf = pd.DataFrame(data_inf)
df_inf.head(5)

Unnamed: 0,processed_txt,topic_num,topic_name
0,data talk,48,"[talk, slow, matter, data, tng]"
1,general inquiries,116,"[recognizing, cares, youtu, international, https]"
2,billing payments,16,"[locked, billing, maintenence, lowest, wash]"
3,troubleshooting,108,"[https, freedomcustomercare, yonyx, url, incid..."
4,account details,110,"[payment, account, credit, make, pay]"


# Save and load ctm

In [None]:
ctm.save(models_dir="./")



In [None]:
del ctm

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, num_epochs=100, n_components=150)  # same settings as the saved model (nc_150)

# use the saved folder name
# epoch must match the epoch_{epoch}.pth file in that folder
ctm.load("./contextualized_topic_model_nc_150_tpm_0.0_tpv_0.9933333333333333_hs_prodLDA_ac_(100, 100)_do_softplus_lr_0.2_mo_0.002_rp_0.99",
         epoch=99)
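
Since the `epoch` argument has to match an `epoch_{epoch}.pth` file in the saved folder, it can help to look the number up instead of typing it by hand. The sketch below is a hypothetical helper (`latest_epoch` is not part of the library) that scans a folder for files following that naming pattern. It is demonstrated on a temporary directory with empty placeholder files.

```python
import re
import tempfile
from pathlib import Path

def latest_epoch(model_dir):
    """Return the largest N among files named epoch_N.pth in model_dir, or None."""
    epochs = []
    for path in Path(model_dir).glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        if match:
            epochs.append(int(match.group(1)))
    return max(epochs) if epochs else None

# Demonstration on a throwaway directory; the files are empty placeholders.
with tempfile.TemporaryDirectory() as tmp:
    for n in (9, 49, 99):
        (Path(tmp) / f"epoch_{n}.pth").touch()
    found = latest_epoch(tmp)

print(found)
```

In the notebook, the result could be passed as `epoch=latest_epoch(saved_folder)`, where `saved_folder` is the folder name used in the cell above.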



In [None]:
ctm.get_topic_lists(5)

[['working', 'great', 'services', 'good', 'day'],
 ['images', 'https', 'activation', 'display', 'gl'],
 ['data', 'bars', 'slower', 'uber', 'normal'],
 ['phone', 'card', 'number', 'freedom', 'new'],
 ['bill', 'paid', 'pay', 'money', 'account'],
 ['payment', 'error', 'go', 'account', 'back'],
 ['survey', 'freedommobile', 'signature', 'com', 'questiontype'],
 ['sim', 'activate', 'new', 'card', 'insert'],
 ['https', 'comes', 'hst', 'ccken', 'mutton'],
 ['agent', 'speak', 'person', 'help', 'live'],
 ['like', 'cancel', 'would', 'service', 'plan'],
 ['question', 'missed', 'answering', 'show', 'restarted'],
 ['amendment', 'mentions', 'agreement', 'original', 'referring'],
 ['recognizing', 'remembered', 'amendment', 'mentions', 'threats'],
 ['account', 'value', 'months', 'bill', 'three'],
 ['card', 'sim', 'credit', 'ts', 'new'],
 ['locked', 'billing', 'maintenence', 'lowest', 'wash'],
 ['change', 'address', 'payment', 'set', 'account'],
 ['payment', 'account', 'bill', 'trying', 'money'],
 ['com