# **BERTopic - Tutorial**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [1]:
!pip install bertopic[all]

Collecting bertopic[all]
  Downloading bertopic-0.8.1-py2.py3-none-any.whl (53 kB)
Collecting hdbscan>=0.8.27
  Downloading hdbscan-0.8.27.tar.gz (6.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting plotly<4.14.3,>=4.7.0
  Downloading plotly-4.14.2-py2.py3-none-any.whl (13.2 MB)
Collecting numpy>=1.20.0
  Using cached num

# **Imports**

In [2]:
import numpy as np
import pandas as pd
from copy import deepcopy
from bertopic import BERTopic

# **Load data**

In [3]:
df = pd.read_csv('drive/MyDrive/bbc_1807_1906_novideos.csv', encoding='utf-8')
docs = df["text"].tolist()
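
BERTopic expects `docs` to be a list of strings, and a CSV column can contain missing values that pandas loads as `NaN` floats. A minimal sketch of a cleaning step, assuming such rows should simply be dropped (`clean_docs` is a hypothetical helper, not part of pandas or BERTopic, and the `raw` list is made-up sample data):

```python
def clean_docs(values):
    """Keep only non-empty strings; drop None, NaN, and whitespace-only entries."""
    cleaned = []
    for value in values:
        # NaN and None are not str instances, so they are skipped here.
        if not isinstance(value, str):
            continue
        text = value.strip()
        if text:
            cleaned.append(text)
    return cleaned


# Illustrative input only, not rows from the BBC dataset.
raw = ["First article.", float("nan"), None, "   ", "Second article."]
print(clean_docs(raw))
```

In this notebook it could be applied as `docs = clean_docs(docs)` before fitting.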

In [4]:
docs[:5]

['My name is Tim and I\'m a cheese addict. But what I\'ve been discovering recently has shaken me to the core. I can barely look a Babybel in the face. A half-eaten halloumi squeaklessly lies yellowing in the fridge. My cheese dreams are shattering. For, after a lifetime of unfettered devotion, could it possibly be that cheese is more foe than friend? That I am addicted to something that is not so good for my body? That cheese should be toast? These are questions that began surfacing a couple of months ago when I began making an episode for my new podcast for the BBC, All Hail Kale, looking into whether dairy was scary.  For some time, I\'d increasingly been questioning the logic of adults drinking milk.  While milk and dairy products, such as cheese and yoghurt, are good sources of protein and calcium and can form part of a healthy, balanced diet, as Dr Michael Greger, from NutritionFacts.org, put it to me: "There\'s no animal on the planet that drinks milk after weaning - and then to

# **Creating Topics**

In [5]:
model = BERTopic(language="english")

In [6]:
topics, probs = model.fit_transform(docs)
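
`fit_transform` returns one topic id per document, in the same order as `docs`. A small sketch of grouping documents by their assigned topic, using toy stand-ins for `docs` and `topics` (the values are illustrative, not output from this dataset, and `group_by_topic` is a hypothetical helper):

```python
from collections import defaultdict


def group_by_topic(docs, topics):
    """Map each topic id to the list of documents assigned to it."""
    grouped = defaultdict(list)
    for doc, topic in zip(docs, topics):
        grouped[topic].append(doc)
    return dict(grouped)


# Toy data standing in for the real `docs` and `topics`.
toy_docs = ["cheese and dairy", "sugar and diet", "mental health", "unrelated note"]
toy_topics = [1, 1, 0, -1]

for topic_id, members in sorted(group_by_topic(toy_docs, toy_topics).items()):
    print(topic_id, members)
```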

# We can then extract the most frequent topics:

In [7]:
model.get_topic_freq()

| | Topic | Count |
|---|---|---|
| 0 | -1 | 259 |
| 1 | 0 | 80 |
| 2 | 1 | 57 |
| 3 | 2 | 36 |
| 4 | 3 | 35 |
| 5 | 4 | 26 |
| 6 | 5 | 25 |
| 7 | 6 | 24 |
| 8 | 7 | 22 |
| 9 | 8 | 20 |
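
Topic `-1` is the label BERTopic gives to outliers, that is, documents that were not assigned to any cluster. A rough sketch of how large that share is, assuming the corpus holds 712 documents (an inference from the later Bio_ClinicalBERT run on the same `docs`, whose two counts sum to 577 + 135; the output above appears to list only the ten largest topics). `outlier_share` is a hypothetical helper:

```python
def outlier_share(outlier_count, n_docs):
    """Fraction of documents that landed in the outlier topic (-1)."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    return outlier_count / n_docs


# 259 comes from the Topic -1 row; 712 is an assumed corpus size (see lead-in).
share = outlier_share(259, 712)
print(f"Outlier share: {share:.1%}")
```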


# Get Individual Topics

In [8]:
model.get_topic(0)

[('she', 0.02613080485909688),
 ('her', 0.02123995155920055),
 ('my', 0.01872332156081929),
 ('had', 0.016157602529131707),
 ('with', 0.01547396473354893),
 ('he', 0.013427907553789229),
 ('mental', 0.013206239680318466),
 ('me', 0.01296629064408543),
 ('have', 0.012638890043611907),
 ('health', 0.012132838509576084)]

In [9]:
model.get_topic(1)

[('food', 0.025027787431095724),
 ('meat', 0.021521680420279153),
 ('of', 0.02061127831515749),
 ('sugar', 0.019907622221104117),
 ('diet', 0.01741842153875879),
 ('that', 0.016955937269056545),
 ('calories', 0.013862823832960667),
 ('eating', 0.013446457308502366),
 ('eat', 0.012799717414837701),
 ('foods', 0.01259666049069027)]

In [10]:
model.get_topic(2)

[('is', 0.021062173424953964),
 ('women', 0.018767955003600803),
 ('baby', 0.01765251940271902),
 ('sperm', 0.01617947587167811),
 ('ivf', 0.016176304761559672),
 ('not', 0.014011990023695754),
 ('babies', 0.01324254744359828),
 ('fertility', 0.012713861947914532),
 ('placenta', 0.012336980293269871),
 ('she', 0.012166021109298125)]
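
Several of the top terms in these topics ('she', 'her', 'of', 'is') are stop words rather than topical vocabulary. BERTopic's documentation suggests addressing this at fit time by passing a `CountVectorizer(stop_words="english")` as `vectorizer_model`; the sketch below instead shows a simple post-hoc filter over a topic's `(word, score)` list. `STOP_WORDS` is a small hand-picked set for illustration, not a complete list, and the scores are rounded from the topic 0 output above.

```python
# Hand-picked for this example; a real run would use a fuller stop-word list.
STOP_WORDS = {"she", "her", "my", "had", "with", "he", "me", "have", "of", "that", "is", "not"}


def drop_stop_words(topic_terms, stop_words=STOP_WORDS):
    """Return the (word, score) pairs whose word is not a stop word."""
    return [(word, score) for word, score in topic_terms if word.lower() not in stop_words]


# Subset of topic 0 with scores rounded to four decimals.
topic_0 = [("she", 0.0261), ("her", 0.0212), ("my", 0.0187), ("mental", 0.0132), ("health", 0.0121)]
print(drop_stop_words(topic_0))
```

Filtering after the fact only hides the stop words; it does not change which documents were clustered together.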

# **Visualize Topics**

In [11]:
model.visualize_topics()

# Model: Bio_ClinicalBERT embeddings


In [12]:
from flair.embeddings import TransformerDocumentEmbeddings

biobert = TransformerDocumentEmbeddings("emilyalsentzer/Bio_ClinicalBERT")
topic_model = BERTopic(embedding_model=biobert).fit(docs)

In [13]:
topic_model.get_topic_freq()

| | Topic | Count |
|---|---|---|
| 0 | -1 | 577 |
| 1 | 0 | 135 |


In [14]:
topic_model.get_topic(0)

[('the', 0.12576350776718567),
 ('and', 0.07383185081671338),
 ('in', 0.07362510339692079),
 ('that', 0.03502013409434996),
 ('it', 0.03139620672066278),
 ('was', 0.030408994939807955),
 ('were', 0.020631000905447953),
 ('they', 0.01979948641744089),
 ('been', 0.018640708211968046),
 ('who', 0.01857105869593315)]

In [15]:
topic_model.get_topic(1)

False
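
Here `get_topic(1)` returned `False` because this model found only topic 0 besides the outliers. A hedged sketch of a guard that normalizes that result to an empty list, based only on the behavior shown in this run (other BERTopic versions may behave differently). `safe_get_topic` and `_StubModel` are hypothetical; the stub exists only so the example runs without fitting a model.

```python
def safe_get_topic(model, topic_id):
    """Return the topic's (word, score) list, or [] if the topic does not exist."""
    terms = model.get_topic(topic_id)
    return terms if terms else []


class _StubModel:
    """Stand-in that mimics the get_topic behavior observed in this notebook."""

    def __init__(self, topics):
        self._topics = topics

    def get_topic(self, topic_id):
        # Mirrors the run above: a missing topic yields False.
        return self._topics.get(topic_id, False)


stub = _StubModel({0: [("the", 0.1258), ("and", 0.0738)]})
print(safe_get_topic(stub, 0))
print(safe_get_topic(stub, 1))
```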

In [16]:
topic_model.visualize_hierarchy()