
<div align="center">
  <h2 style="font-family: 'Helvetica', sans-serif; color: #2980b9;">BERT</h2>
</div>
<p style="font-family: 'Arial', sans-serif; font-size: 14px; color: #444;">BERT (Bidirectional Encoder Representations from Transformers) is a language model built on the Transformer architecture and designed to capture the contextual relationships between words in a sentence. Unlike earlier models that read text in one direction only, BERT is trained bidirectionally, so the representation of each word draws on the context to both its left and its right.</p>
<div align="center">
  <p style="font-family: 'Arial', sans-serif; font-size: 16px; color: #2980b9;">BERTopic</p>
</div>
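<p style="font-family: 'Arial', sans-serif; font-size: 14px; color: #444;">BERTopic is a topic modeling technique that builds on transformer embeddings such as those produced by BERT (Bidirectional Encoder Representations from Transformers). It embeds each document, reduces the dimensionality of the embeddings, clusters them, and then describes each cluster with its most characteristic words using a class-based TF-IDF score (c-TF-IDF).</p>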
<div align="center">
  <p style="font-family: 'Arial', sans-serif; font-size: 14px; color: #444;"><b>Note:</b> In Colab, BERTopic needs to be installed in every session. The install command is placed at the bottom of this notebook so that it is not distracting.</p>
</div>

In [12]:
# load data
import json

import pandas as pd
from google.colab import drive

# connect Google Drive to this Colab notebook (prompts for account access)
drive.mount('/content/drive')

# next, open Drive in the file browser and copy the path to the data

# load csv file
spam_data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Data/spam_dataset.csv")

# load json data and keep the first 1,000 descriptions
with open("/content/drive/MyDrive/Colab Notebooks/Data/vol7.json", "r",
          encoding="utf-8") as f:
    docs = json.load(f)['descriptions'][:1000]

len(docs)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


1000

In [25]:
%%capture
from bertopic import BERTopic

# instantiate the BERTopic model with a sentence-transformers embedding model
topic_model = BERTopic(embedding_model = "all-MiniLM-L6-v2")

# fit the model - this may take a bit!
topics, probs = topic_model.fit_transform(docs)

In [26]:
# summarize the topics: one row per topic with its ID,
# the number of documents assigned to it (Count),
# and a name built from its top words; topic -1 collects outliers
topic_model.get_topic_info() 

Unnamed: 0,Topic,Count,Name
0,-1,220,-1_was_the_in_and
1,0,159,0_ifp_burnt_supporters_down
2,1,94,1_mk_operatives_ac_were
3,2,76,2_police_was_detained_in
4,3,63,3_shot_sap_cape_on
5,4,42,4_udf_1985_who_supporter
6,5,41,5_shot_ifp_transvaal_dead
7,6,32,6_dead_ifp_natal_anc
8,7,32,7_apla_attack_were_amnesty
9,8,31,8_detained_he_was_anc
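
As a quick sanity check on the table above, the sketch below copies the ten displayed counts into a plain dictionary (typed in by hand here, not read from the model) and computes the share of documents that landed in topic -1, the outlier topic. It assumes the 1,000 documents loaded earlier; the variable names are illustrative only.

```python
# Counts copied by hand from the ten rows of get_topic_info() shown above.
shown_counts = {-1: 220, 0: 159, 1: 94, 2: 76, 3: 63,
                4: 42, 5: 41, 6: 32, 7: 32, 8: 31}
n_docs = 1000  # number of documents passed to fit_transform

# Topic -1 holds documents that were not assigned to any cluster.
outlier_share = shown_counts[-1] / n_docs

# Documents covered by the displayed rows; the rest sit in topics not shown.
covered = sum(shown_counts.values())
remaining = n_docs - covered

print(f"outlier share: {outlier_share:.0%}")
print(f"documents in displayed rows: {covered}, in other topics: {remaining}")
```

Roughly one document in five is left unassigned here, which is worth keeping in mind when reading the topic sizes.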


In [27]:
# print out top words for a specific topic
topic_model.get_topic(0) 

[('ifp', 0.054013239986103904),
 ('burnt', 0.05055532420602651),
 ('supporters', 0.04918616111074514),
 ('down', 0.049134632458068886),
 ('her', 0.047308801534641),
 ('home', 0.0419612367267321),
 ('near', 0.04144539320179428),
 ('in', 0.04106927371347143),
 ('had', 0.04006711419702383),
 ('1994', 0.0379906951995378)]

Notice that stopwords are present in the BERTopic topics. Removing stopwords from the documents before fitting is not recommended: the transformer embedding model relies on the full sentence context, stopwords included, unlike bag-of-words models such as LDA.
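
If the stopwords make the topic labels hard to read, they can be filtered out of the labels after the fact while the documents stay untouched. The sketch below shows the idea in plain Python on the topic 0 output above. The stopword set is a tiny hand-picked assumption, not a real stopword list.

```python
# Top words for topic 0, copied from the get_topic(0) output above
# (scores rounded for readability).
topic_words = [
    ("ifp", 0.0540), ("burnt", 0.0506), ("supporters", 0.0492),
    ("down", 0.0491), ("her", 0.0473), ("home", 0.0420),
    ("near", 0.0414), ("in", 0.0411), ("had", 0.0401), ("1994", 0.0380),
]

# Hypothetical, hand-picked stopword set; a real one would be much longer.
stopwords = {"down", "her", "near", "in", "had"}

# Filter the label words only; the documents themselves are not modified.
content_words = [(word, score) for word, score in topic_words
                 if word not in stopwords]

print([word for word, _ in content_words])
```

Within BERTopic itself, the usual route is to pass a scikit-learn `CountVectorizer(stop_words="english")` as the `vectorizer_model` argument, which affects only the topic representation step and not the embeddings.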

In [28]:
# print out representative docs for a given topic
topic_model.get_representative_docs(0)

['Had her house burnt down by IFP supporters on 16 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political conflict in the area. See Sonkombo arson attacks.',
 'Had her house burnt down by IFP supporters on 16 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political conflict in the area. See Sonkombo arson attacks.',
 'Had her house burnt down by IFP supporters on 16 March 1994 at Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political conflict in the area. See Sonkombo arson attacks.']
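
The three representative documents above are word-for-word identical, so it can help to collapse duplicates before reading them. A minimal sketch, with the returned list typed in by hand:

```python
# The representative documents returned above, typed in by hand.
rep_docs = [
    "Had her house burnt down by IFP supporters on 16 March 1994 at "
    "Sonkombo, Ndwedwe, KwaZulu, near Durban, in intense political "
    "conflict in the area. See Sonkombo arson attacks."
] * 3

# dict.fromkeys keeps the first occurrence of each string, in order.
unique_docs = list(dict.fromkeys(rep_docs))

print(len(rep_docs), "returned,", len(unique_docs), "distinct")
```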

Next, we can put each document's topic assignment into a data frame.

In [29]:
# pair each document with its assigned topic in a data frame
df = pd.DataFrame({"topic": topics, "document": docs})
df

Unnamed: 0,topic,document
0,-1,An ANCYL member who was shot and severely inju...
1,-1,A member of the SADF who was severely injured ...
2,-1,A member of QIBLA who disappeared in September...
3,2,A COSAS supporter who was kicked and beaten wi...
4,16,Was shot and blinded in one eye by members of ...
...,...,...
995,10,An Inkatha supporter whose home was burnt down...
996,10,An ANC supporter whose house was burnt down by...
997,10,An ANC supporter who was forced to flee his ho...
998,-1,An ANC supporter who was axed to death by name...
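
Once topics and documents are paired, a natural next step is to group the documents by topic. The sketch below does this in plain Python on the nine rows displayed above, with the document text truncated exactly as printed. The names `rows` and `docs_by_topic` are illustrative.

```python
from collections import defaultdict

# (topic, document) pairs copied from the rows displayed above; the
# document text is truncated exactly as in the printed data frame.
rows = [
    (-1, "An ANCYL member who was shot and severely inju..."),
    (-1, "A member of the SADF who was severely injured ..."),
    (-1, "A member of QIBLA who disappeared in September..."),
    (2, "A COSAS supporter who was kicked and beaten wi..."),
    (16, "Was shot and blinded in one eye by members of ..."),
    (10, "An Inkatha supporter whose home was burnt down..."),
    (10, "An ANC supporter whose house was burnt down by..."),
    (10, "An ANC supporter who was forced to flee his ho..."),
    (-1, "An ANC supporter who was axed to death by name..."),
]

# Collect the documents that belong to each topic ID.
docs_by_topic = defaultdict(list)
for topic, document in rows:
    docs_by_topic[topic].append(document)

# Print each topic ID with the number of documents grouped under it.
for topic in sorted(docs_by_topic):
    print(topic, len(docs_by_topic[topic]))
```

On the full data frame, `df.groupby("topic")` gives the same grouping directly in pandas.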


Visualize the topics in 2D space to examine the relationships between them. Topics that sit close together in the 2D projection should be semantically similar, although the projection is only an approximation of the distances in the original high-dimensional space.

In [30]:
# intertopic distance map: one circle per topic, sized by topic frequency
topic_model.visualize_topics()
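
To make "semantic similarity" concrete, the sketch below computes cosine similarity, a common way to compare vector representations. The three vectors are invented for illustration and are not the model's real topic representations; the function is a hand-written helper, not part of BERTopic.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical topic vectors, invented for illustration only.
topic_a = [0.9, 0.1, 0.0]
topic_b = [0.8, 0.2, 0.0]  # points in nearly the same direction as topic_a
topic_c = [0.0, 0.1, 0.9]  # points in a very different direction

print(cosine_similarity(topic_a, topic_b))  # close to 1: similar
print(cosine_similarity(topic_a, topic_c))  # close to 0: dissimilar
```

Topics whose vectors score near 1 would be expected to land near each other on the map, and topics scoring near 0 far apart.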

In [23]:
# bar charts of the top-scoring words for the first few topics
topic_model.visualize_barchart()

Below is the command needed to install BERTopic for each session in Colab.

In [None]:
# install bertopic in Colab for the session
!pip install bertopic
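
Since the install has to be repeated every session, one option is to check first and install only when the package is missing. This is a sketch using the standard library. The helper names are hypothetical, and it assumes the pip package name matches the import name, as it does for `bertopic`.

```python
import importlib.util
import subprocess
import sys


def is_installed(package_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(package_name) is not None


def install_if_missing(package_name):
    """Install the package with pip only when it is not already available."""
    if not is_installed(package_name):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", package_name]
        )


# Report the current state without installing anything.
print("bertopic installed:", is_installed("bertopic"))
```

Calling `install_if_missing("bertopic")` at the top of the notebook would then be a no-op in sessions where the package is already present.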