<h2 align=center> Topic Modelling with BERTopic: B2B Case</h2>

<div align="center">
    <img width="1112px" src='Capture.PNG' />
    <p style="text-align: center;color:gray">Figure 1: BERTopic() Topic Modelling</p>
</div>

In this notebook, we apply topic modelling to the B2B use case, with the goal of discovering new topics. 

####  The steps followed are: 
##### 1) Take the different columns of the dataset as input: `col1`, `col2`, `col3`, and `col4`. 
##### The `col1` column appears to be the most important, as it contains all the newly scraped keywords from the websites. The `col3` column contains the top 5 keywords with the highest similarity from the `col4` column. 
##### 2) After running the model on each column, <b>result_df</b> holds the keywords assigned to each topic (-1, 0, ...). It can be used to analyse and discover new topics. 
##### 3) The model returns topics numbered from -1 upwards. Topic -1 collects the keywords the model treats as outliers, i.e. too different or too general to fit any other topic. 
             - The keywords in topic -1 are analysed to see whether any are relevant and could be moved into one of the discovered topics, or whether they can be discarded. 
             - First, the topic -1 keywords are used as the input to a new topic model; if new topics are generated, we analyse them.
             - Second, we also review them manually, since the number of keywords is relatively low and this builds trust in our set of topics. 
##### 4) To analyse the generated topics, the following steps are followed: 
    1) We look at the intertopic distance map; the clusters show which topics are similar and which may be outliers.
    2) We also find semantically similar topics by using cosine similarity to group topics together.
    3) We create several visualizations, such as bar charts of the top word scores per topic, for better clarity.
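
The handling of topic -1 described in step 3 can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: it assumes `docs` and `topics` are parallel lists, as returned by `fit_transform`, and both the helper name `split_outliers` and the toy keywords are invented for the example.

```python
def split_outliers(docs, topics, outlier_topic=-1):
    """Split documents into outliers (topic -1) and assigned documents.

    Hypothetical helper: `docs` and `topics` are assumed to be parallel
    lists, where topics[i] is the topic number assigned to docs[i].
    """
    outliers = [doc for doc, topic in zip(docs, topics) if topic == outlier_topic]
    assigned = [doc for doc, topic in zip(docs, topics) if topic != outlier_topic]
    return outliers, assigned


# Toy data, invented for illustration only
docs = ["invoice", "billing", "hello", "payment", "misc"]
topics = [0, 0, -1, 0, -1]

outliers, assigned = split_outliers(docs, topics)
outlier_share = len(outliers) / len(docs)

print(outliers)       # keywords to feed into a second topic model or review manually
print(outlier_share)  # fraction of keywords left unclassified
```

The outlier list is what would be passed to a second model or checked by hand; the share gives a rough sense of how much was left unclassified.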
             
Evaluation of Topic Modelling: https://highdemandskills.com/topic-model-evaluation/

### Installing the dependencies

In [1]:
### Installing all the dependencies 
!pip install bertopic[visualization] --quiet

Exception:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2851, in _dep_map
    return self.__dep_map
  File "/opt/conda/lib/python3.7/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2685, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/pip/basecommand.py", line 209, in main
    status = self.run(options, args)
  File "/opt/conda/lib/python3.7/site-packages/pip/commands/install.py", line 310, in run
    wb.build(autobuilding=True)
  File "/opt/conda/lib/python3.7/site-packages/pip/wheel.py", line 748, in build
    self.requirement_set.prepare_files(self.finder)
  File "/opt/conda/lib/python3.7/site-packages/pip/req/req_set.py", line 360, in prepare_files
    ignore_dependencies=self.ignore_dependen

In [115]:
#Importing Libraries
import numpy as np 
import pandas as pd
from ast import literal_eval
import openpyxl
from copy import deepcopy
from bertopic import BERTopic

import matplotlib.pyplot as plt

from wordcloud import WordCloud

import plotly as py
import plotly.graph_objs as go
import ipywidgets as widgets
from scipy import special
import plotly.express as px

py.offline.init_notebook_mode(connected = True)
%matplotlib inline

In [3]:
#Printing the requirements 
print("=======================Library Versions=================================")
print(f'Numpy Version: {np.__version__}')
print(f'Pandas Version: {pd.__version__}')
print(f'Plotly Version: {py.__version__}')

Numpy Version: 1.20.0
Pandas Version: 1.3.4
Plotly Version: 4.14.2


## Loading the Dataset and Analysing

In [None]:
df = pd.read_excel('df.xlsx')
df = df.iloc[: , :5]
df = df.rename_axis('Index').reset_index()
df.pop('Unnamed: 0')
df

In [5]:
def get_analysis_values(dataframe, column):
    print("============================================Exploratory Data Analysis=====================================================")
    print(f'Shape of the dataframe is {dataframe.shape}')
    print()
    print(dataframe.info())
    print()
    wordcloud2 = WordCloud().generate(' '.join(dataframe[column]))
    plt.figure(figsize = (10, 8), facecolor = None)
    plt.imshow(wordcloud2)
    plt.axis("off")
    plt.show()

In [None]:
get_analysis_values(df,'topics')

## Reusable Functions

In [7]:
def get_topic_val(modelname, topics):
    """
    Input: a) modelname: Name of the BERTopic() model used. 
           b) topics: The list of topics generated by the model.
           
    Function: Takes in the input and returns a dictionary with topic names as the keys and the keyword's index values from the df as the values.
    """
    
    grouped_topics = {topic: [] for topic in set(topics)}
    
    for index, topic in enumerate(topics):
        grouped_topics[topic].append(index)
        
    return grouped_topics
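
As a quick illustration of what `get_topic_val` returns, the grouping step can be reproduced on a toy topic list. This sketch uses `collections.defaultdict` in place of the pre-filled dictionary above; the function name `group_indices_by_topic` and the topic assignments are invented for the example.

```python
from collections import defaultdict


def group_indices_by_topic(topics):
    """Map each topic number to the positions of the documents assigned to it."""
    grouped = defaultdict(list)
    for index, topic in enumerate(topics):
        grouped[topic].append(index)
    return dict(grouped)


# Invented per-document topic assignments (position i belongs to document i)
toy_topics = [-1, 0, 1, 0, -1, 1]
grouped = group_indices_by_topic(toy_topics)
print(grouped)  # {-1: [0, 4], 0: [1, 3], 1: [2, 5]}
```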

In [8]:
def make_result_df(dictionary, topicsdict):
    '''
    Input: a) dictionary: The output of the get_topic_val function. Its keys are the topic numbers and its values are the row indices of the keywords the model assigned to each topic.
           b) topicsdict: A dictionary mapping each row index in the dataframe to its keyword string.
           
    Function: Maps the keyword indices to the actual keyword strings and returns a result dataframe.
    '''
    
    # Replace every row index with its keyword string, topic by topic
    new_dict = {topic: [topicsdict.get(index) for index in indices]
                for topic, indices in dictionary.items()}
    
    result_df = pd.DataFrame(list(new_dict.items()), columns = ['Topic Nr','Present_Input_Keywords'])
    result_df = result_df.rename_axis('Index').reset_index()
    
    return result_df

In [9]:
def representativedocs(model, topics, docs, keywords):
    """
    Input: a) model: The fitted model you want the results for.
           b) topics: Name of the output column holding the topic numbers (also used as the merge key).
           c) docs: Name of the output column holding the topic names that the model suggests (top n).
           d) keywords: Name of the output column holding the representative documents, i.e. the input keywords.
    
    Function: Extracts the representative documents per topic and joins them with the topic names.
    """
    #extracting the topic names/numbers 
    top_names = model.topic_names
    top_names = pd.DataFrame(list(top_names.items()), columns = [topics, docs])
    
    #extracting representative docs for all the topics 
    rep_docs = model.representative_docs
    rep_docs = pd.DataFrame(list(rep_docs.items()), columns = [topics, keywords])
    
    #joining names and representative docs on the topic-number column
    output = pd.merge(top_names, 
                rep_docs, 
                how='left', 
                on=topics)
    return output

In [60]:
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity_score(model,topicnr,resultdf, threshold):
    '''
    Parameters: 
        Inputs: a) model: the fitted topic model
                b) topicnr: the topic for which you want to see the similarity scores. IMPORTANT: this is the row index, not the topic number, so topic -1 corresponds to topicnr = 0
                c) resultdf: the result dataframe to merge with in order to get the combined results 
                d) threshold: the similarity score at or above which topics are returned
        Output: A pandas dataframe with the topic number, topic names, present input keywords and the similarity score for each topic. 
    '''
    
    if model.topic_embeddings is not None:
        embeddings = np.array(model.topic_embeddings)
    else:
        embeddings = model.c_tf_idf
        
    #cosine similarity between every pair of topics (higher = more similar)
    similarity_matrix = cosine_similarity(embeddings)
    data = similarity_matrix[topicnr]
    score_df = pd.DataFrame(data = data, columns = ['similarity_score'])
    score_df = score_df.rename_axis('Index').reset_index()

    #merging score with resultant dataframe
    df = pd.merge(score_df, resultdf, on = 'Index')
    
    df = df[df['similarity_score'] >= threshold]
    
    return df
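
The row-index convention in `get_similarity_score` is easy to trip over, so here is a small self-contained sketch of the same idea in plain Python. The topic vectors are invented, and the helper names `cosine` and `similar_topics` are hypothetical; the sketch takes a topic number and converts it to a row index itself, which is one possible way to avoid the offset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented topic vectors; row 0 is topic -1, row 1 is topic 0, and so on
topic_numbers = [-1, 0, 1, 2]
vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]


def similar_topics(topic_number, threshold):
    """Return (topic number, score) pairs at or above the threshold."""
    row = topic_numbers.index(topic_number)  # topic number -> row index
    scores = [cosine(vectors[row], other) for other in vectors]
    return [(t, round(s, 3)) for t, s in zip(topic_numbers, scores) if s >= threshold]


print(similar_topics(0, 0.5))  # [(0, 1.0), (1, 0.707)]
```

The topic itself always appears in the result with a score of 1.0, which is also the case for the dataframe returned above.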

In [61]:
def make_final_dataframe(model, representdocsdf):
    """
    Inputs: a) model: the fitted model
            b) representdocsdf: a dataframe with a 'Topic Nr' column holding the keywords present in each topic, e.g. the output of make_result_df
            
    Function: Returns the resultant dataframe with the topic numbers, their top n names with c-TF-IDF scores and all the keywords they contain. 
    """
    #topics and their top n topic names with c-TF-IDF scores
    dataframe1 = pd.DataFrame(model.topics.items(), columns = ['Topic Nr', 'Possible Topic Names'])
    finaldfname = pd.merge(dataframe1, representdocsdf)
    
    return finaldfname

In [74]:
from sklearn.metrics.pairwise import cosine_similarity

def get_class_similarity_score(model):
    '''
    Parameters: 
        Inputs: a) model: the fitted topic model
        Output: A pandas dataframe with, for each topic, the row index of its most similar topic and the cosine similarity of their c-TF-IDF vectors. 
                IMPORTANT: most_similar is a row index and not a topic number, so topic -1 corresponds to 0.
    '''
    
    topic_list = sorted(model.get_topics().keys())
    
    embeddings = model.c_tf_idf
    similarity_matrix = cosine_similarity(embeddings)
    
    most_similar_ind = []
    most_similar_val = [] 

    #rows follow the sorted topic order, so we iterate by row position;
    #indexing with the topic number itself would map topic -1 to the last row
    for row, topic in enumerate(topic_list):
        data = similarity_matrix[row]
        i = np.argsort(data, axis=0)[-2] #the largest value is normally the topic itself, so take the second largest
        most_similar_ind.append(i)   #ensure length and order for the list 
        most_similar_val.append(data[i])
                 
    similar_df = pd.DataFrame()
    similar_df['Topic Nr'] = topic_list
    similar_df['most_similar'] =  most_similar_ind
    similar_df['c_similarity_score'] = most_similar_val
    similar_df = similar_df.rename_axis('Index').reset_index()
    return similar_df
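
The `[-2]` in the `argsort` call above picks the second-largest similarity in each row, on the assumption that the largest one is the topic's similarity with itself. A plain-Python sketch of that selection, using an invented similarity matrix and a hypothetical helper `nearest_neighbour`:

```python
# Invented symmetric similarity matrix for three topics
# (rows and columns in sorted topic order)
similarity = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
]


def nearest_neighbour(similarity, row):
    """Return (column, score) of the most similar other topic for a row."""
    # Sort column positions by score; the last one is normally the row itself
    order = sorted(range(len(similarity[row])), key=lambda col: similarity[row][col])
    best = order[-2]
    return best, similarity[row][best]


neighbours = [nearest_neighbour(similarity, row) for row in range(len(similarity))]
print(neighbours)  # [(2, 0.7), (2, 0.4), (0, 0.7)]
```

If two topics had identical vectors, the self-similarity would no longer be the unique maximum, so the second-largest entry could be the topic itself; the sketch does not guard against that case.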


In [42]:
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler

def intertopic_distance(topic_model, topics=None, top_n_topics=None):
    # Select topics based on top_n and topics args
    if topics is not None:
        topics = list(topics)
    elif top_n_topics is not None:
        topics = sorted(topic_model.get_topic_freq().Topic.to_list()[1:top_n_topics + 1])
    else:
        topics = sorted(list(topic_model.get_topics().keys()))

    # Extract topic words and their frequencies
    topic_list = sorted(topics)
    frequencies = [topic_model.topic_sizes[topic] for topic in topic_list]
    words = [" | ".join([word[0] for word in topic_model.get_topic(topic)[:5]]) for topic in topic_list]

    # Embed c-TF-IDF into 2D
    all_topics = sorted(list(topic_model.get_topics().keys()))
    indices = np.array([all_topics.index(topic) for topic in topics])
    embeddings = topic_model.c_tf_idf.toarray()[indices]
    embeddings = MinMaxScaler().fit_transform(embeddings)
    embeddings = UMAP(n_neighbors=2, n_components=2, metric='hellinger').fit_transform(embeddings)
    
    print(embeddings)

    # Dataframe containing the values; the first entry is skipped
    df = pd.DataFrame({"x": embeddings[1:, 0], "y": embeddings[1:, 1],
                       "Topic": topic_list[1:], "Words": words[1:], "Size": frequencies[1:]})
    
    # Prepare figure range
    #narrowing down the size of the fig
    #x_range = (df.x.min() - abs((df.x.min()) * .15), df.x.max() + abs((df.x.max()) * .15))
    #y_range = (df.y.min() - abs((df.y.min()) * .15), df.y.max() + abs((df.y.max()) * .15))

    return df

## `Use Case 0`: Trying on the <i>col1</i>  Column

The default embedding model for English is `all-MiniLM-L6-v2`, while for multilingual input it is `paraphrase-multilingual-MiniLM-L12-v2`.

In [None]:
docs = list(df.loc[:,'col1'].values)
print(docs[:5])
print(len(docs))

In [None]:
model_0 = BERTopic(embedding_model = 'sentence-transformers/LaBSE', language="multilingual",calculate_probabilities=True,verbose=True)
topics, probs = model_0.fit_transform(docs)

In [None]:
input_topics_freq_0 = model_0.get_topic_info()

fig = px.bar(input_topics_freq_0,x='Topic',y='Count', title = 'Distribution of Input Topic Generated')
fig.show()

In [None]:
model_0.visualize_barchart(topics = [0,1])

In [None]:
most_similar_dict = dict(zip(df.Index, df.col1))
grouped_topics = get_topic_val(model_0, topics)
res_df = make_result_df(grouped_topics,most_similar_dict)
result_df = make_final_dataframe(model_0,res_df)
result_df

In [None]:
get_similarity_score(model_0, 1, result_df, 0.50)

In [None]:
result_df['Present_Input_Keywords'][0]

`get_topic()`: Return top n words for a specific topic and their c-TF-IDF scores

In [None]:
model_0.get_topic(-1)

We look at the top words in topic -1 by making a word cloud of all its keywords, as shown below. 

In [None]:
my_list = result_df['Present_Input_Keywords'][1]

from collections import Counter
word_could_dict=Counter(my_list)
wordcloud = WordCloud(width = 1000, height = 500).generate_from_frequencies(word_could_dict)

plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

-----

## `Use Case 1`: Trying on the <i>col2</i> column

In [None]:
docs = list(df.loc[:,'col2'].values)
print(docs[:5])
print(len(docs))

In [None]:
model_1 = BERTopic(embedding_model = 'paraphrase-multilingual-mpnet-base-v2',language="multilingual",calculate_probabilities=True,verbose=True)
topics, probs = model_1.fit_transform(docs)

In [None]:
input_topics_freq = model_1.get_topic_info()
fig = px.bar(input_topics_freq,x='Topic',y='Count', title = "Distribution of Input 'col2' Topic Generated")
fig.show()

In [None]:
model_1.visualize_barchart(topics = [-1,0,1])

In [None]:
#making a dictionary of the topics and their corresponding keywords
df_col2_dict = dict(zip(df.Index, df.col2))

In [None]:
grouped_val = get_topic_val(model_1, topics)
res = make_result_df(grouped_val,df_col2_dict)
result_df = make_final_dataframe(model_1, res) 
result_df

In [None]:
get_similarity_score(model_1, 1, result_df, 0.70)

The `result_df` above is the result dataframe for the column `col2`. Here, the column `Possible Topic Names` holds the possible topic names with their c-TF-IDF scores, and the column `Present_Input_Keywords` contains all the input keywords that the model grouped together to form a new topic. 

The above dataframe can be analysed by looking at the list of keywords in `Present_Input_Keywords` and then at the top 10 topic names the model assigns to them via the <b>.get_topic()</b> function. By checking which keywords are grouped together correctly and have an appropriate topic name (as suggested by the model), we can assign topic names to them and discover new relevant topics. For better visualization, we can also look at the similarity matrix and the word scores for each topic. 

In [None]:
model_1.get_topic(-1)

---

## `Use Case 02`: Trying on the <i>col3</i> column

In [None]:
#removes the '' from string set values
df['col3'] = df.col3.apply(lambda x: literal_eval(str(x)))
df.head(20)

In [None]:
#Take the new topics column and explode each topic into a new row and add it into a pd Dataframe
newdf = df['col3']
#ignore_index=True gives every exploded keyword its own row index
topics = newdf.explode(ignore_index=True)
topics_df = pd.DataFrame(topics)
topics_df = topics_df.rename_axis('Index').reset_index()
topics_df

In [None]:
#making a dictionary of the row indices and their corresponding keywords
topics_dict = dict(zip(topics_df.Index, topics_df.col3))
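
The lookup dictionary built above maps a row index to a keyword, so it only keeps every keyword if each exploded row has its own index. A plain-Python sketch with invented data (the names `col3_rows`, `by_row` and `by_position` are made up for the example) shows the difference between keying by the original row number and keying by a fresh running index:

```python
# Invented stand-in for the `col3` column: one list of keywords per row
col3_rows = [["billing", "invoice"], ["login"], ["account", "password", "login"]]

# "Explode": one entry per keyword, remembering the row it came from
exploded = [(row, keyword) for row, keywords in enumerate(col3_rows) for keyword in keywords]

# Keyed by original row number: later keywords overwrite earlier ones
by_row = {row: keyword for row, keyword in exploded}

# Keyed by a fresh running index: every keyword is kept
by_position = {position: keyword for position, (_, keyword) in enumerate(exploded)}

print(len(exploded), len(by_row), len(by_position))  # 6 3 6
```

Only the second dictionary can be used to map the positions returned by the model back to all of the input keywords.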

In [None]:
docs = list(topics_df.loc[:,'col3'].values)

model_02 = BERTopic(embedding_model = "sentence-transformers/LaBSE",language="multilingual",calculate_probabilities=True,verbose=True,n_gram_range=(1, 2), nr_topics = 'auto')
topics, probs = model_02.fit_transform(docs)

In [None]:
new_topics = model_02.get_topic_freq()

In [None]:
fig = px.bar(new_topics,x='Topic',y='Count', title = 'Distribution of Topic Generated Use Case 02')
fig.show()

In [None]:
#save as a tab-separated file
result_df.to_csv('b2b_results_keywords', sep='\t', encoding='utf-8')

In [None]:
fig = model_02.visualize_topics()
fig

In [None]:
model_02.visualize_heatmap(n_clusters = 4)

In [None]:
model_02.visualize_barchart(topics = [-1,0,1,2,3,105])

In [None]:
result_df_2['Present_Input_Keywords'][2]

In [None]:
result_df_2['Present_Input_Keywords'][1]

---

## Analysing the unclassified keywords, i.e. `Topic -1`, from the <b>topics</b> column

About 545 keywords from the topics column were not assigned to a specific topic. We use them as the input to a new BERTopic model to see whether it finds any new topics within them. 

In [None]:
not_classified = result_df['Present_Input_Keywords'][0]
not_classified_df = pd.DataFrame(not_classified, columns = ['not_classified'])
not_classified_df = not_classified_df.rename_axis('Index').reset_index()

not_classified_topics_dict = dict(zip(not_classified_df.Index, not_classified_df.not_classified))
not_classified_df.to_csv('not_classified_topics_col', sep='\t', encoding='utf-8')
not_classified = list(not_classified_df['not_classified'])
print(not_classified[:7])
print(len(not_classified))

In [None]:
not_classified_model = BERTopic(embedding_model = "sentence-transformers/LaBSE",language="multilingual",calculate_probabilities=True,verbose=True, nr_topics = 'auto')
topics, probs = not_classified_model.fit_transform(not_classified)

In [None]:
topics_formed = not_classified_model.get_topic_freq()
fig = px.bar(topics_formed,x='Topic',y='Count', title = 'Distribution of Topic Generated by Topic -1')
fig.show()

In [None]:
not_classified_model.visualize_topics()

### Let's look at `topic 3` and `topic 0` to investigate whether they are actually similar.

In [None]:
all_topics = get_topic_val(not_classified_model, topics)
r_top = make_result_df(all_topics,not_classified_topics_dict)
result_df = make_final_dataframe(not_classified_model, r_top)
result_df

In [None]:
#Topic number 3 keywords
result_df['Present_Input_Keywords'][4]

In [None]:
not_classified_model.get_topic(3)

In [None]:
#Topic number 0 keywords
result_df['Present_Input_Keywords'][1][:10]

In [None]:
not_classified_model.get_topic(0)

In [None]:
get_similarity_score(not_classified_model, 3, result_df, 0.60)

---

## `Use Case 03`: Trying on the <i>col4</i> column

In [None]:
#removes the '' from string set values
df['col4'] = df.col4.apply(lambda x: literal_eval(str(x)))
df.head(20)

In [None]:
#Take the new topics column and explode each topic into a new row and add it into a pd Dataframe
df2 = df['col4']
#df2.loc[0] = np.array(['français'])
topics2 = df2.explode(ignore_index=True)
topics2 = pd.DataFrame(topics2)
topics2.iloc[0] = np.array(['français'])
topics2 = topics2.rename_axis('Index').reset_index()
topics2.head(6)

In [None]:
#making a dictionary of the topics and their corresponding keywords
topics_dict_03 = dict(zip(topics2.Index, topics2.col4))

In [None]:
docs_2 = list(topics2['col4'])
docs_2[:2]
print(len(docs_2))

In [None]:
model_3 = BERTopic(embedding_model = "sentence-transformers/LaBSE",language="multilingual",calculate_probabilities=True,verbose=True,n_gram_range=(1, 2), nr_topics = 'auto')
topics, probs = model_3.fit_transform(docs_2)

In [None]:
topics_freq_3_use = model_3.get_topic_freq()
topics_freq_3_use

In [None]:
fig = px.bar(topics_freq_3_use,x='Topic',y='Count', title = 'Distribution of Topic Generated UseCase 03')
fig.show()

In [None]:
model_3.visualize_barchart(topics = [-1,0,1,2,3,4,5,6,7])

In [None]:
model_3.visualize_topics()

In [None]:
all_topics = get_topic_val(model_3, topics)
r_top = make_result_df(all_topics,topics_dict_03)
result_df = make_final_dataframe(model_3, r_top)
result_df

In [None]:
result_df.to_csv('topics_selection_res', sep = '\t', encoding = 'utf8')

In [None]:
result_df['Present_Input_Keywords'][1][:10]

In [None]:
model_3.get_topic(0)

Looking at all the keywords in topic 0, the topic could be about obtaining consultancy, but also about getting information on a possible topic name. This is reflected in the model's suggested topic names as well, which contain `string` and `keyword`. 

In [None]:
#Looking at the similar topics using cosine similarity
get_similarity_score(model_3, 1, result_df, 0.50)

In [None]:
model_3.visualize_heatmap()

## Finding Similar Topics
###  UMAP embeddings

In [None]:
#stored under a new name so that the intertopic_distance function stays callable
intertopic_df = intertopic_distance(model_02,topics)

In [None]:
intertopic_df = intertopic_df.groupby('Topic').mean(numeric_only=True)
intertopic_df = intertopic_df.rename_axis('Topic Nr').reset_index()
#intertopic_df['centroid'] = intertopic_df[['x','y']].mean(axis=1)
intertopic_df['diff'] = intertopic_df.x - intertopic_df.y
intertopic_df 

In [None]:
grouped_topics = get_topic_val(model_02, topics)
res_top = make_result_df(grouped_topics,topics_dict)
result_df = make_final_dataframe(model_02,res_top)
result_df_2 = pd.merge(result_df, intertopic_df, on = 'Topic Nr')
result_df_2

In [None]:
get_similarity_score(model_3, 6, result_df, 0.50)

In [None]:
result_df['Possible Topic Names'][14][:10]

In [None]:
result_df['Possible Topic Names'][22][:10]

The drawback of using UMAP embeddings to find similar topics is that UMAP is a dimensionality reduction technique: the distances between clusters in the reduced space are the only metric available for comparison, and new x and y values are generated each time the cell is run. 

### c-TF-IDF scores

In [None]:
c_df = get_class_similarity_score(model_3)
c_df = pd.merge(c_df, result_df, on = 'Index')
c_df.pop('Topic Nr_y')
c_df

In [None]:
c_df.sort_values(['c_similarity_score'], ascending=[False]).head(10)

In [None]:
c_df['Possible Topic Names'][1][:10]

In [None]:
c_df['Possible Topic Names'][3][:10]

## Assigning New Keywords to Topics

In [None]:
input_keywords = ["my account","your account"]
assign_new_topics(model_3, input_keywords,5)
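
`assign_new_topics` is not defined in the cells shown in this notebook. The sketch below is one possible reconstruction, offered only as an assumption about what such a helper might do: it calls `find_topics` for each keyword and collects the results. The `StubModel` class and its canned scores are invented so that the example runs without a fitted model.

```python
def assign_new_topics(model, keywords, top_n):
    """Hypothetical helper: look up the closest topics for each new keyword.

    Assumes `model` exposes `find_topics(search_term, top_n)` returning a
    pair (topic numbers, similarity scores), as BERTopic models do.
    """
    rows = []
    for keyword in keywords:
        topic_numbers, scores = model.find_topics(keyword, top_n=top_n)
        for topic, score in zip(topic_numbers, scores):
            rows.append({"keyword": keyword, "topic": topic, "similarity": score})
    return rows


class StubModel:
    """Stand-in with canned results so the sketch runs without a fitted model."""

    def find_topics(self, search_term, top_n=5):
        canned = {
            "my account": ([1, 4], [0.9, 0.6]),
            "your account": ([1, 2], [0.8, 0.5]),
        }
        topic_numbers, scores = canned[search_term]
        return topic_numbers[:top_n], scores[:top_n]


rows = assign_new_topics(StubModel(), ["my account", "your account"], 1)
print(rows)
```

With a real fitted model in place of `StubModel`, the rows could be passed to `pd.DataFrame` for inspection alongside the other result tables.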

In [None]:
similar_topics, similarity = model_3.find_topics("your account", top_n=5)
print(similar_topics)
print(similarity)

In [None]:
model_3.get_topic(1)

In [None]:
#"我的賬戶" is Chinese for "my account"
topics, similarity = model_3.find_topics("我的賬戶", top_n=5)
print(topics)
print(similarity)

## Save/Load the Model

In [None]:
topic_model = model_3
topic_model.save("my_model")

In [None]:
#load the model
topic_model = BERTopic.load("my_model")