# **Installing BERTopic**

We start by mounting Google Drive and installing BERTopic from PyPI:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls

drive  sample_data


In [None]:
%%capture
!pip install bertopic

# Data
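For this example, we use a CSV of COVID-19 research articles with one row per article, keyed by `cord_uid`. Each row carries the title, abstract, and a `sentences` column; the `sentences` text is what we later pass to BERTopic. The file has 33206 rows before deduplication.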

In [None]:
import pandas as pd
df2 = pd.read_csv('articles_with_sentences and words1.csv')

In [None]:
df2.head(2)

Unnamed: 0,cord_uid,title,doi,abstract,publish_time,authors,journal,doi.1,pmcid,pubmed_id,publish_time_new,lang,sentences,words
0,azdtbnqj,Combined Hyperglycemic Hyperosmolar Syndrome a...,10.1155/2021/6429710,although most children with coronavirus diseas...,2021-11-27,"Tseng, Yu Shan; Tilford, Bradley; Sethuraman, ...",Case Rep Crit Care,10.1155/2021/6429710,PMC8627355,34791286.0,2021-11-27,en,"as the pandemic continues, clinicians should ...",further studies
1,k8nbgxsf,Evidences suggesting a possible role of Vitami...,10.4103/ijp.ijp_654_20,the severe acute respiratory syndrome coronavi...,2021-11-24,"Singh, Shruti; Singh, C. M.; Ranjan, Alok; Kum...",Indian J Pharmacol,10.4103/ijp.ijp_654_20,PMC8641745,34854410.0,2021-11-24,en,while some risk factors such as the presence ...,further research


In [None]:
print(len(df2))
df2.drop_duplicates(subset='cord_uid',inplace=True)
print(len(df2))
df2.drop_duplicates(subset='doi',inplace=True)
print(len(df2))
df2.drop_duplicates(subset='pmcid',inplace=True)
print(len(df2))
df2.drop_duplicates(subset='pubmed_id',inplace=True)
print(len(df2))
df2.drop_duplicates(subset='title',inplace=True)
print(len(df2))
df2.drop_duplicates(subset='abstract',inplace=True)
print(len(df2))

33206
30267
22977
14468
12880
12858
12853
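One thing worth checking in the counts above: `drop_duplicates` treats missing values as equal to each other, so every row with a missing `doi`, `pmcid`, or `pubmed_id` beyond the first is dropped as a "duplicate". Whether that explains the large drops here is an assumption, since the missing-value counts are not shown. The sketch below uses a toy frame and a hypothetical helper, `drop_duplicates_ignoring_missing`, to contrast the two behaviours:

```python
import pandas as pd


def drop_duplicates_ignoring_missing(df, column):
    """Drop repeated values in `column`, but keep every row where it is missing."""
    # A row is a true repeat only if it has a value AND that value was seen before.
    is_repeat = df[column].notna() & df[column].duplicated(keep="first")
    return df[~is_repeat]


# Toy data (not from the article CSV): two rows share PMC1, two have no pmcid.
demo = pd.DataFrame({
    "cord_uid": ["a", "b", "c", "d"],
    "pmcid": ["PMC1", None, None, "PMC1"],
})

naive = demo.drop_duplicates(subset="pmcid")              # missing values collapse to one row
safe = drop_duplicates_ignoring_missing(demo, "pmcid")    # missing values are all kept

print(len(naive), len(safe))
```

If rows without an identifier should survive, a helper like this can replace the corresponding `drop_duplicates` call; the resulting counts would then differ from the ones printed above.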


In [None]:
docs = df2.sentences.to_list()

In [None]:
docs[2]

' ehealth apps have been recognized as a valuable tool to reduce covid-19â\x80\x99s effective reproduction number. the factors that determine the acceptance of covid-19 apps remain unknown. the exception here is privacy'
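The sample above contains `â\x80\x99`, which is what a UTF-8 right single quotation mark looks like when its bytes are decoded as Latin-1. A minimal sketch of one way to undo this before modeling, assuming the whole corpus was mis-decoded the same way (`repair_mojibake` is a hypothetical helper name):

```python
def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already clean): leave the text untouched.
        return text


sample = "covid-19â\x80\x99s effective reproduction number"
print(repair_mojibake(sample))
```

Applied to the corpus, this would be `docs = [repair_mojibake(d) for d in docs]` before calling `fit_transform`. Strings that do not round-trip are returned unchanged, so clean documents are not affected.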

# **Topic Modeling**

In this example, we will go through the main components of BERTopic and the steps necessary to create a strong topic model.




In [None]:
from bertopic import BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(docs)

2022-08-30 13:10:29,478 - BERTopic - Transformed documents to Embeddings
2022-08-30 13:11:03,805 - BERTopic - Reduced dimensionality
2022-08-30 13:11:22,045 - BERTopic - Clustered reduced embeddings


**NOTE**: Use `language="multilingual"` to select a model that supports 50+ languages.

## Extracting Topics
After fitting our model, we can start by looking at the results. Typically, we look at the most frequent topics first as they best represent the collection of documents.

In [None]:
freq = topic_model.get_topic_info(); freq.head(15)

Unnamed: 0,Topic,Count,Name
0,-1,5176,-1_the_and_of_to
1,0,238,0_telehealth_telemedicine_care_services
2,1,202,1_vte_venous_anticoagulation_thrombosis
3,2,200,2_pregnancy_pregnant_women_neonates
4,3,192,3_cardiac_myocarditis_myocardial_injury
5,4,189,4_syndrome_respiratory_acute_severe
6,5,136,5_nurses_frontline_workers_mental
7,6,122,6_ventilation_prone_oxygen_positioning
8,7,121,7_learning_online_students_teaching
9,8,115,8_china_wuhan_pneumonia_december


In [None]:
len(freq)

191

Topic -1 collects all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
topic_model.get_topic(0)  # Select the most frequent topic

[('telehealth', 0.03822303992054911),
 ('telemedicine', 0.03195903052099399),
 ('care', 0.016150336489070678),
 ('services', 0.011328970031638243),
 ('access', 0.01020407715167184),
 ('satisfaction', 0.009816386721125605),
 ('use', 0.009803743760159415),
 ('inperson', 0.009675272017503388),
 ('patient', 0.008667799602995923),
 ('visits', 0.007866743913834982)]

**NOTE**: BERTopic is stochastic, which means that the topics might differ across runs. This is mostly due to the stochastic nature of UMAP.
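If repeatable topics matter, one option is to pass BERTopic a UMAP instance with a fixed `random_state`. The sketch below is a suggestion rather than part of the run shown in this notebook. The parameter values are the ones commonly documented as BERTopic's UMAP defaults (treat that as an assumption), the seed 42 is arbitrary, and `build_reproducible_model` is a hypothetical helper name:

```python
# UMAP settings: assumed BERTopic defaults plus a fixed seed for repeatable runs.
UMAP_PARAMS = {
    "n_neighbors": 15,
    "n_components": 5,
    "min_dist": 0.0,
    "metric": "cosine",
    "random_state": 42,
}


def build_reproducible_model(params=UMAP_PARAMS):
    """Return a BERTopic model whose dimensionality-reduction step is seeded."""
    # Imported here so the settings above can be inspected without the libraries.
    from umap import UMAP
    from bertopic import BERTopic

    return BERTopic(
        language="english",
        calculate_probabilities=True,
        umap_model=UMAP(**params),
        verbose=True,
    )
```

Usage would be `topic_model = build_reproducible_model()` followed by `topic_model.fit_transform(docs)`. Fixing the seed generally makes UMAP slower, because it gives up some parallelism in exchange for determinism.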

## Visualize Topics
After training our `BERTopic` model, we could go through perhaps a hundred topics one by one to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the topics that were generated in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_hierarchy(top_n_topics=100)

## Visualize Terms

We can visualize the selected terms for a few topics by creating bar charts out of the c-TF-IDF scores for each topic representation. Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other.

In [None]:
topic_model.visualize_barchart(top_n_topics=10)

In [None]:
topic_model.visualize_term_rank()
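To make the c-TF-IDF scores behind these charts concrete, here is a minimal pure-Python sketch of the weighting as described in the BERTopic documentation: a term's frequency within a topic, multiplied by log(1 + A / f_t), where A is the average number of words per topic and f_t is the term's frequency across all topics. The whitespace tokenizer, the toy documents, and the function name `c_tf_idf` are illustrative assumptions, not BERTopic's implementation:

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """class_docs maps a topic id to the list of documents assigned to it."""
    # Treat all documents of a topic as one long "class document" and count terms.
    class_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in class_docs.items()
    }

    # Term frequency across all topics, and the average number of words per topic.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[topic] = {
            # Normalized in-topic frequency times the class-based IDF term.
            term: (n / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, n in counts.items()
        }
    return scores


toy = {
    0: ["telehealth visits increased", "telehealth care access"],
    1: ["venous thrombosis risk", "anticoagulation for thrombosis"],
}
scores = c_tf_idf(toy)
for topic, weights in scores.items():
    print(topic, max(weights, key=weights.get))
```

In this toy example the highest-scoring term per topic is the one repeated within that topic, which is the same within-topic comparison the bar charts display.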

# **Topic Representation**
After having created the topic model, you might not be satisfied with some of the parameters you have chosen. Fortunately, BERTopic allows you to update the topics after they have been created.

This allows for fine-tuning the model to your specifications and wishes.

## Update Topics
When you have trained a model and viewed the topics and the words that represent them,
you might not be satisfied with the representation. Perhaps you forgot to remove
stopwords or you want to try out a different `n_gram_range`. We can use the function `update_topics` to update
the topic representation with new parameters for `c-TF-IDF`:
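A minimal sketch of such an update, assuming the fitted `topic_model`, `docs`, and `topics` from the cells above. `refresh_topic_words` is a hypothetical helper name; `update_topics` and its `n_gram_range` argument belong to BERTopic, and `topics` is passed by keyword because the positional order of this method may differ between releases:

```python
def refresh_topic_words(topic_model, docs, topics, n_gram_range=(1, 2)):
    """Recompute the c-TF-IDF topic words with a different n-gram range."""
    # Only the topic representations change; document assignments stay the same.
    topic_model.update_topics(docs, topics=topics, n_gram_range=n_gram_range)
    # Return the refreshed words of topic 0 for a quick look.
    return topic_model.get_topic(0)
```

Calling `refresh_topic_words(topic_model, docs, topics)` would allow two-word phrases in the topic descriptions. To remove stopwords as well, `update_topics` also accepts a `vectorizer_model`, for example a scikit-learn `CountVectorizer` configured with `stop_words="english"`.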


## Topic Reduction
We can also reduce the number of topics after having trained a BERTopic model. The advantage of doing so
is that you can decide on the number of topics after knowing how many were actually created. It is difficult to
predict before training your model how many topics are in your documents and how many will be extracted.
Instead, we can decide afterwards how many topics seem realistic:





In [None]:
new_topics, new_probs = topic_model.reduce_topics(docs, topics, probs, nr_topics=50)

2022-08-30 13:26:21,830 - BERTopic - Reduced number of topics from 191 to 51


In [None]:
freq = topic_model.get_topic_info(); freq.head(15)

Unnamed: 0,Topic,Count,Name
0,-1,7581,-1_the_and_of_to
1,0,256,0_syndrome_respiratory_acute_severe
2,1,248,1_telehealth_telemedicine_care_to
3,2,224,2_thrombosis_patients_venous_vte
4,3,218,3_pregnancy_pregnant_women_neonates
5,4,216,4_cardiac_myocarditis_myocardial_patients
6,5,177,5_neurological_gbs_brain_cognitive
7,6,158,6_surgery_surgical_elective_pandemic
8,7,149,7_ecmo_ventilation_patients_respiratory
9,8,146,8_model_the_parameters_models


In [None]:
topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=20)

In [None]:
topic_model.visualize_hierarchy(top_n_topics=50)

In [None]:
len(new_topics),len(topics)

(12853, 12853)

In [None]:
df2["topics"] = topics
df2["new_topics"] = new_topics
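With both assignments stored side by side, it is straightforward to see which original topics ended up in each reduced topic. A minimal sketch with toy label lists (the helper name `merged_topics` and the toy values are illustrative; in the notebook you would pass `topics` and `new_topics`):

```python
from collections import defaultdict


def merged_topics(topics, new_topics):
    """Map each reduced topic id to the sorted original topic ids it contains."""
    groups = defaultdict(set)
    for old, new in zip(topics, new_topics):
        groups[new].add(old)
    return {new: sorted(olds) for new, olds in groups.items()}


# Toy labels: original topics 0 and 2 merge into 0, topics 1 and 3 merge into 1.
toy_topics = [0, 1, 2, 3, -1, 2]
toy_new_topics = [0, 1, 0, 1, -1, 0]

mapping = merged_topics(toy_topics, toy_new_topics)
print(mapping)
```

The same grouping can be read from the DataFrame with `df2.groupby("new_topics")["topics"].unique()`.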

In [None]:
!ls

'articles_with_sentences and words1.csv'   drive   sample_data


In [None]:
df2.to_csv("articles_with_topics.csv",index=None)

# **Search Topics**
After having trained our model, we can use `find_topics` to search for topics that are similar
to an input `search_term`. Here, we are going to be searching for topics that closely relate to the
search term "vehicle". Then, we extract the most similar topic and check the results:

In [None]:
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5); similar_topics

[71, 45, 77, 9, 56]

In [None]:
topic_model.get_topic(71)

[('car', 0.03740731827314482),
 ('the car', 0.027790363401304377),
 ('dealer', 0.013837911908704722),
 ('the dealer', 0.009515109324321468),
 ('owner', 0.008430722097917726),
 ('previous owner', 0.008157988442865012),
 ('cars', 0.005827046491488879),
 ('the odometer', 0.00514870077683653),
 ('bought car', 0.004667512506960727),
 ('car with', 0.004498685875558186)]
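Conceptually, `find_topics` embeds the search term and ranks topics by how similar their embeddings are to it. The sketch below illustrates that ranking step with cosine similarity on made-up three-dimensional vectors; the vectors, the helper names `cosine` and `rank_topics`, and the use of plain lists are all assumptions for illustration, not BERTopic internals:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(query_vec, topic_vecs, top_n=2):
    """Return the ids and similarities of the top_n topics closest to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), topic) for topic, vec in topic_vecs.items()),
        reverse=True,
    )
    best = scored[:top_n]
    return [topic for _, topic in best], [score for score, _ in best]


# Hypothetical topic embeddings and a hypothetical embedded search term.
topic_vecs = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.7, 0.7, 0.0]}
query_vec = [0.9, 0.1, 0.0]

ids, sims = rank_topics(query_vec, topic_vecs)
print(ids)
```

The two return values mirror the `similar_topics, similarity` pair returned by `find_topics`.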

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved.

In [None]:
# Save model
topic_model.save("my_model")

In [None]:
drive.flush_and_unmount()

In [None]:
# Load model
my_model = BERTopic.load("my_model")

# **Embedding Models**
The parameter `embedding_model` takes in a string pointing to a sentence-transformers model, a SentenceTransformer, or a Flair DocumentEmbedding model.

## Sentence-Transformers
You can select any model from sentence-transformers here and pass it to BERTopic with `embedding_model`:



In [None]:
topic_model = BERTopic(embedding_model="xlm-r-bert-base-nli-stsb-mean-tokens")

Or select a SentenceTransformer model with your own parameters:


In [None]:
from sentence_transformers import SentenceTransformer

sentence_model = SentenceTransformer("distilbert-base-nli-mean-tokens", device="cpu")
topic_model = BERTopic(embedding_model=sentence_model, verbose=True)

Click [here](https://www.sbert.net/docs/pretrained_models.html) for a list of supported sentence transformers models.  
