<a href="https://colab.research.google.com/github/nisha1365/Sanofi-POC/blob/main/Bert_Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

You can install BERTopic via pip:

In [None]:
!pip install bertopic



### The Libraries

We will use the following libraries to load the data and build a BERTopic model.

In [None]:
#import packages

import pandas as pd
import numpy as np
from bertopic import BERTopic

We will use a dataset of Tokyo 2020 Olympics tweets, with the goal of building a model that automatically groups the tweets by topic.

In [None]:
#load data
import pandas as pd

df = pd.read_csv("/content/tokyo_2020_tweets.csv")

# select only 6000 tweets
df = df[0:6000]


We select only 6,000 tweets for computational reasons.

In [None]:
df.head()

Unnamed: 0,id,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,retweets,favorites,is_retweet
0,1418888645105356803,Abhishek Srivastav,"Udupi, India",Trying to be mediocre in many things,2021-02-01 06:33:51,45.0,39.0,293.0,False,2021-07-24 10:59:49,Let the party begin\n#Tokyo2020,['Tokyo2020'],Twitter for Android,0.0,0.0,False
1,1418888377680678918,Saikhom Mirabai Channu🇮🇳,"Manipur, India",Indian weightlifter 48 kg category. Champion🏆,2018-04-07 10:10:22,5235.0,5.0,2969.0,False,2021-07-24 10:58:45,Congratulations #Tokyo2020 https://t.co/8OFKMs...,['Tokyo2020'],Twitter for Android,0.0,0.0,False
2,1418888260886073345,Big Breaking,Global,All breaking news related to Financial Market....,2021-05-29 08:51:25,3646.0,3.0,5.0,False,2021-07-24 10:58:17,Big Breaking Now \n\nTokyo Olympic Update \n\n...,,Twitter for Android,0.0,1.0,False
3,1418888172864299008,International Hockey Federation,Lausanne,Official International Hockey Federation Twitt...,2010-10-20 10:45:59,103975.0,2724.0,36554.0,True,2021-07-24 10:57:56,Q4: 🇬🇧3-1🇿🇦\n\nGreat Britain finally find a wa...,,Twitter Web App,1.0,0.0,False
4,1418886894478270464,Cameron Hart,Australia,Football & Tennis Coach,2020-10-31 08:46:17,6.0,37.0,31.0,False,2021-07-24 10:52:51,All I can think of every time I watch the ring...,"['Tokyo2020', 'ArtisticGymnastics', '7Olympics...",Twitter for iPhone,0.0,0.0,False


### Create Model

To create a model with BERTopic, load the tweets as a list of strings and pass that list to the fit_transform method. This method does the following:

Fits the model on the collection of tweets.

Generates topics.

Returns the topic assigned to each tweet, together with the corresponding probabilities.

In [None]:
# create model

model = BERTopic(verbose=True)

#convert to list
docs = df.text.to_list()

topics, probabilities = model.fit_transform(docs)

2024-01-10 16:19:02,718 - BERTopic - Embedding - Transforming documents to embeddings.



2024-01-10 16:21:38,984 - BERTopic - Embedding - Completed ✓
2024-01-10 16:21:38,986 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-01-10 16:22:22,582 - BERTopic - Dimensionality - Completed ✓
2024-01-10 16:22:22,585 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-01-10 16:22:22,824 - BERTopic - Cluster - Completed ✓
2024-01-10 16:22:22,834 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-01-10 16:22:23,159 - BERTopic - Representation - Completed ✓
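
The list of topics returned by fit_transform has one entry per tweet, in the same order as the input. The following is a minimal sketch of how the two lists can be paired up to inspect what landed in each topic. The documents and topic ids are made-up stand-ins for the real output, and group_by_topic is a hypothetical helper, not part of BERTopic.

```python
from collections import defaultdict

# Toy stand-ins for `docs` and the `topics` list returned by fit_transform;
# the values are invented for illustration only.
docs = [
    "Let the party begin #Tokyo2020",
    "What a goal in the hockey quarter-final",
    "The opening ceremony parade was stunning",
    "Great Britain finally find a way through",
]
topics = [-1, 0, 1, 0]  # one topic id per document, same order as docs


def group_by_topic(docs, topics):
    """Map each topic id to the list of documents assigned to it."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


docs_per_topic = group_by_topic(docs, topics)
for topic, members in sorted(docs_per_topic.items()):
    print(topic, len(members), members[0])
```

Reading a few tweets per topic this way is a quick sanity check before looking at the keyword representations.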


### Select Top Topics

After training the model, you can list the topics and their sizes in descending order.

In [None]:
model.get_topic_freq().head(11)

Unnamed: 0,Topic,Count
7,-1,1709
0,0,305
9,1,222
102,2,207
1,3,177
15,4,164
27,5,145
3,6,135
59,7,99
28,8,86


Note: Topic -1 is the largest. It collects the outlier tweets that were not assigned to any of the generated topics, so we will ignore it here.
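
To make the idea of ignoring the outlier topic concrete, here is a small sketch that counts topic sizes from a list of per-tweet topic ids and drops topic -1. The topic ids are invented, and topic_sizes is a hypothetical helper that only mimics the kind of table shown above; it is not how BERTopic computes it internally.

```python
from collections import Counter

# Made-up per-tweet topic assignments; -1 marks outliers.
topics = [-1, 0, 1, 0, -1, 2, 0, 1, -1, -1]


def topic_sizes(topics, include_outliers=False):
    """Return (topic, count) pairs sorted by count in descending order."""
    counts = Counter(topics)
    if not include_outliers:
        counts.pop(-1, None)  # drop the outlier topic if it is present
    # Sort by size (largest first), then by topic id to break ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


print(topic_sizes(topics))
print(topic_sizes(topics, include_outliers=True))
```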

### Select One Topic
You can select a specific topic and get the top n words for that topic and their c-TF-IDF scores.

In [None]:
model.get_topic(6)

[('olympics', 0.030642462533385522),
 ('olympicgames', 0.02073734076762076),
 ('tokyo2020', 0.014879021909668473),
 ('olympic', 0.0145096057229863),
 ('ceremony', 0.014465319252353906),
 ('tokyoolympics', 0.01343766689525597),
 ('the', 0.012514722044885825),
 ('parade', 0.01221655766156882),
 ('games', 0.01219401203633868),
 ('wearing', 0.01178352823886967)]
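
The scores above come from c-TF-IDF, which treats all tweets of a topic as one document and weights each term by how specific it is to that topic. The sketch below is a simplified version under stated assumptions: it uses plain whitespace tokenization and the weighting tf × log(1 + A / f), where tf is the term's share of the topic's words, A is the average number of words per topic, and f is the term's frequency across all topics. The toy tweets and the function names c_tf_idf and top_terms are hypothetical, and the numbers will not match BERTopic's output.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score every term in every topic with a simplified c-TF-IDF."""
    # Treat all documents of a topic as one long "class document".
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }

    # f: how often each term occurs across all topics.
    total_per_term = Counter()
    for counts in term_counts.values():
        total_per_term.update(counts)

    # A: average number of words per topic.
    avg_words = sum(sum(c.values()) for c in term_counts.values()) / len(term_counts)

    scores = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            term: (count / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(scores, topic, n=3):
    """Return the n highest-scoring (term, score) pairs of a topic."""
    ranked = sorted(scores[topic].items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Invented example tweets, grouped by topic id.
docs_per_topic = {
    0: ["olympics ceremony parade", "ceremony tonight"],
    1: ["hockey goal", "hockey olympics win"],
}
scores = c_tf_idf(docs_per_topic)
print(top_terms(scores, 0))
print(top_terms(scores, 1))
```

In this toy data, "olympics" appears in both topics, so it scores lower than terms that occur in only one topic.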

### Visualize Topics
The visualize_topics method helps you visualize the generated topics with their sizes and corresponding words. The visualization is inspired by LDAvis.

In [None]:
model.visualize_topics()

### Visualize Terms
The visualize_barchart method shows the top terms for a few topics as bar charts of their c-TF-IDF scores. You can then compare the topic representations with each other and gain more insight into the generated topics.

In [None]:
model.visualize_barchart()

### Visualize Topic Similarity
You can also visualize how similar the topics are to each other. To draw the heatmap, simply call the visualize_heatmap method.

In [None]:
model.visualize_heatmap()
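
A similarity heatmap is a matrix of pairwise similarity scores between topics. As a rough sketch of the idea, assuming each topic is represented by a vector and similarity is measured with cosine similarity, the matrix can be built like this. The three-dimensional topic vectors are invented for illustration; real topic vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(topic_vectors):
    """Return the topic ids and the matrix of pairwise cosine similarities."""
    ids = sorted(topic_vectors)
    matrix = [
        [cosine_similarity(topic_vectors[i], topic_vectors[j]) for j in ids]
        for i in ids
    ]
    return ids, matrix


# Hypothetical topic vectors: topics 0 and 1 point in nearly the same direction.
topic_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 0.0, 1.0],
}
ids, matrix = similarity_matrix(topic_vectors)
for topic_id, row in zip(ids, matrix):
    print(topic_id, [round(value, 3) for value in row])
```

Each cell of the heatmap corresponds to one entry of such a matrix; values close to 1 indicate topics that may be candidates for merging.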

## Topic Reduction
Sometimes you end up with too many or too few topics. BERTopic lets you control this in several ways.

(a) You can set the number of topics you want by passing it as the "nr_topics" argument. BERTopic will then find similar topics and merge them.

In [None]:
model = BERTopic(nr_topics=20)

In the above code, the number of topics that will be generated is 20.

(b) Another option is to reduce the number of topics automatically. To use this option, set "nr_topics" to "auto" before training the model.

In [None]:
model = BERTopic(nr_topics="auto")

(c) The last option is to reduce the number of topics after training the model. This is a great option if retraining the model would take many hours.

In [None]:
model.reduce_topics(docs, nr_topics=15)  # call on a fitted model; recent BERTopic versions update it in place
new_topics, new_probs = model.topics_, model.probabilities_
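
To illustrate what "find similar topics and merge them" can mean, here is a toy sketch that repeatedly merges the two most similar topics until a target number remains. This is only an illustration under simplifying assumptions and not BERTopic's actual reduction procedure: the topic vectors and sizes are invented, and reduce_topics_sketch is a hypothetical function.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics):
    """Greedily merge the most similar pair of topics until nr_topics remain.

    Returns a mapping from every original topic id to the id it ends up in.
    """
    vectors = {topic: list(vec) for topic, vec in topic_vectors.items()}
    sizes = dict(topic_sizes)
    mapping = {topic: topic for topic in vectors}

    while len(vectors) > nr_topics:
        # Pick the pair of remaining topics with the highest similarity.
        a, b = max(
            combinations(sorted(vectors), 2),
            key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
        )
        # Keep the larger topic and fold the smaller one into it.
        keep, drop = (a, b) if sizes[a] >= sizes[b] else (b, a)
        total = sizes[keep] + sizes[drop]
        # The merged vector is the size-weighted average of both vectors.
        vectors[keep] = [
            (sizes[keep] * x + sizes[drop] * y) / total
            for x, y in zip(vectors[keep], vectors[drop])
        ]
        sizes[keep] = total
        del vectors[drop], sizes[drop]
        # Redirect every topic that pointed at the dropped id.
        for old, new in mapping.items():
            if new == drop:
                mapping[old] = keep
    return mapping


# Invented vectors and sizes for four topics.
topic_vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0], 3: [0.2, 0.8]}
topic_sizes = {0: 305, 1: 222, 2: 207, 3: 177}

mapping = reduce_topics_sketch(topic_vectors, topic_sizes, nr_topics=2)
print(mapping)
```

The returned mapping can be applied to the per-tweet topic list to relabel each tweet with its merged topic.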

## Save Model
You can save a trained model by using the save method.

In [None]:
model.save("my_topics_model")



## Load Model
You can load the model by using the load method.

In [None]:
BerTopic_model = BERTopic.load("my_topics_model")