<a href="https://colab.research.google.com/github/iwan-rg/Arabic-Topic-Modeling/blob/main/BERT_for_Arabic_Topic_Modeling_ACLing2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook is based on Maarten Grootendorst's [BERTopic](https://github.com/MaartenGr/BERTopic/tree/v0.4.2) tutorial, available [here](https://github.com/MaartenGr/BERTopic/blob/v0.4.2/notebooks/BERTopic.ipynb).

# **BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique**
Abeer Abuzayed and Hend Al-Khalifa

# Abstract
Topic modeling is an unsupervised machine learning technique for finding abstract topics in a large collection of documents. It
helps in organizing, understanding and summarizing large collections of textual information and discovering the latent topics that
vary among documents in a given corpus. Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are
two of the most popular topic modeling techniques. LDA uses a probabilistic approach, whereas NMF uses a matrix factorization
approach. Newer techniques based on BERT for topic modeling also exist. In this paper, we experiment with
BERTopic using different pre-trained Arabic language models as embeddings, and compare its results against the LDA and NMF
techniques. We used the Normalized Pointwise Mutual Information (NPMI) measure to evaluate the results of the topic modeling
techniques. Overall, BERTopic achieved better results than NMF and LDA.

**NOTE**: Make sure to select a GPU runtime. Otherwise, the model can take quite some time to create the document embeddings!

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# we start with installing bertopic from pypi before preparing the data

!pip install bertopic[all]

In [42]:
import pandas as pd
from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
from gensim.models import LdaMulticore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF


# Load Data
For this experiment, we used the [(DataSet for Arabic Classification)](https://data.mendeley.com/datasets/v524p5dhpj/2), which contains 111,728 Arabic documents written in Modern
Standard Arabic (MSA). The dataset was collected from three Arabic online newspapers: Assabah, Hespress and
Akhbarona. The documents in the dataset are categorized into 5 classes: sport, politics, culture, economy and diverse.
We removed 2939 missing documents and ran the experiments with the remaining 108789 documents without any
document labels. 

In [35]:
# add your data path

data = pd.read_csv("/content/drive/MyDrive/Topic Modeling/arabic_dataset_classifiction.csv")
data.head()

Unnamed: 0,text,targe
0,بين أستوديوهات ورزازات وصحراء مرزوكة وآثار ولي...,0
1,قررت النجمة الأمريكية أوبرا وينفري ألا يقتصر ع...,0
2,أخبارنا المغربية الوزاني تصوير الشملالي ألهب ا...,0
3,اخبارنا المغربية قال ابراهيم الراشدي محامي سعد...,0
4,تزال صناعة الجلود في المغرب تتبع الطريقة التقل...,0


In [None]:
data.shape

In [7]:
# drop rows with missing values
data = data.dropna()
data.shape

In [None]:
documents = data['text'].values

# Embedding model
BERTopic has two default embedding models: "distilbert-base-nli-stsb-mean-tokens" for the English language and "xlm-r-bert-base-nli-stsb-mean-tokens" for any language other than English; the XLM-R models support 50+ languages.

You can also use any model from [Hugging Face](https://huggingface.co/models) instead of the preselected models by passing it to
BERTopic through the `embedding_model` parameter.

For more details, check out the BERTopic documentation [here](https://maartengr.github.io/BERTopic/tutorial/embeddings/embeddings.html).

In [10]:
# to experiment with other BERT models, simply change the model name below

arabert = TransformerDocumentEmbeddings('aubmindlab/bert-base-arabertv02')


# **Create Topics**


With BERTopic, you do not need to define the number of topics in advance. If you want to do so, simply pass the number of topics to BERTopic with the `nr_topics` parameter.

In [11]:
topic_model = BERTopic(language="arabic", low_memory=True, calculate_probabilities=False,
                       embedding_model=arabert)

NOTE: Calculating probabilities can slow down BERTopic significantly on large datasets (>100,000 documents). It is advised to turn this off if you want to speed up the model.

In [12]:
topics, probs = topic_model.fit_transform(documents)

In [None]:
#extract most frequent topics

topic_model.get_topic_freq().head(5)

`-1` refers to all outliers and should typically be ignored. Next, let's take a look at one of the frequent topics that were generated:

In [None]:
#show the top 10 words in topic 1

topic_model.get_topic(1)[:10]
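
As a small illustration of setting the outliers aside, the sketch below counts per-document topic assignments while skipping the `-1` label. The `example_topics` list is made up for illustration; it only mimics the shape of the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical per-document topic assignments (not real model output);
# -1 marks documents treated as outliers.
example_topics = [0, 0, -1, 1, 0, 2, -1, 1, 0, -1]

# Count documents per topic, leaving the outlier label out.
counts = Counter(t for t in example_topics if t != -1)

# Share of documents that were not assigned to any topic.
outlier_share = example_topics.count(-1) / len(example_topics)

print(counts.most_common())
print(f"outliers: {outlier_share:.0%}")
```

A high outlier share is worth checking before interpreting the remaining topics, since those documents contribute to none of them.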

# Evaluation
To evaluate the coherence of the model's topics, we use the [Gensim](https://radimrehurek.com/gensim/models/coherencemodel.html) implementation of Normalized
Pointwise Mutual Information (NPMI).
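
To give a rough feel for what NPMI measures, the sketch below computes it for single word pairs on a toy corpus. It is a simplification and not Gensim's implementation: probabilities are estimated here from whole-document co-occurrence, whereas Gensim's `c_npmi` uses sliding windows and aggregates over word pairs and topics. The function name `npmi` and the English placeholder tokens are assumptions for illustration.

```python
import math

def npmi(word_a, word_b, docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(docs)
    p_a = sum(word_a in d for d in docs) / n
    p_b = sum(word_b in d for d in docs) / n
    p_ab = sum(word_a in d and word_b in d for d in docs) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur: lowest possible score
    if p_ab == 1:
        return 1.0   # both words appear in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)  # normalize PMI into [-1, 1]

# Toy "documents" as sets of tokens (placeholders for tokenized news articles).
toy_docs = [
    {"ball", "match", "team"},
    {"ball", "team", "goal"},
    {"government", "parliament"},
    {"government", "minister", "parliament"},
]

print(npmi("ball", "team", toy_docs))        # always appear together
print(npmi("ball", "goal", toy_docs))        # sometimes appear together
print(npmi("ball", "government", toy_docs))  # never appear together
```

Scores close to 1 indicate words that tend to occur together, scores near 0 indicate independence, and -1 indicates words that never co-occur.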

In [28]:
# tokenize each document on whitespace
texts = [str(document).split() for document in documents]
id2word = corpora.Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

In [29]:
# collect the words of every topic (dropping their scores) for the coherence model
topics = [
    [word for word, _ in topic_model.get_topic(i)]
    for i in topic_model.get_topics()
]
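
Since `-1` collects the outliers, one possible variation is to leave that topic out before scoring coherence. The sketch below shows the idea on a hand-made dictionary; `example_topics` and the helper `topic_words` are hypothetical and only mimic the `{topic_id: [(word, score), ...]}` shape assumed for `get_topics()`. This is not necessarily how the reported scores were computed.

```python
# Hypothetical stand-in for the topic dictionary: topic id -> [(word, score), ...]
example_topics = {
    -1: [("the", 0.01), ("of", 0.01)],
    0: [("match", 0.12), ("team", 0.10)],
    1: [("government", 0.09), ("minister", 0.07)],
}

def topic_words(topic_dict, skip_outliers=True):
    """Return one list of words per topic, optionally dropping the -1 outlier topic."""
    return [
        [word for word, _score in words]
        for topic_id, words in sorted(topic_dict.items())
        if not (skip_outliers and topic_id == -1)
    ]

print(topic_words(example_topics))
print(len(topic_words(example_topics, skip_outliers=False)))
```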

In [None]:
# compute Coherence Score

cm = CoherenceModel(topics=topics, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_npmi')
coherence = cm.get_coherence() 
print('\nCoherence Score: ', coherence)

# **Visualize Topics**
After training our `BERTopic` model, we could go through perhaps a hundred topics one by one to get a good
understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation.
Instead, we can visualize the generated topics in a way very similar to
[LDAvis](https://github.com/cpsievert/LDAvis):

In [None]:
topic_model.visualize_topics()

# **Model serialization**
The model and its internal settings can easily be saved. Note that the documents and embeddings will not be saved. However, UMAP and HDBSCAN will be saved. 

In [None]:
# Save model
topic_model.save("my_model")	

In [None]:
# Load model
my_model = BERTopic.load("my_model")	

# LDA

We use the [parallelized Latent Dirichlet Allocation (LDA)](https://radimrehurek.com/gensim/models/ldamulticore.html) from Gensim.

Note: for LDA, you have to define the number of topics in advance.

In [39]:
# change the number of topics here
no_topics = 5

# run LDA
lda = LdaMulticore(corpus, id2word=id2word, num_topics=no_topics)


In [None]:
#compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda, texts=texts, dictionary=id2word, coherence='c_npmi')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# NMF
We use the scikit-learn implementation of [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html).

Note: for NMF, you have to define the number of topics in advance.
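
To illustrate why the number of topics has to be fixed up front, the sketch below factorizes a tiny document-term matrix `V` into non-negative factors `W` (documents by topics) and `H` (topics by terms), where the inner dimension is the chosen number of topics. It uses plain multiplicative updates in NumPy rather than scikit-learn's solver, and the function name `nmf_multiplicative` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Approximate V ~ W @ H with non-negative W, H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps   # document-topic weights
    H = rng.random((n_topics, n_terms)) + eps  # topic-term weights
    for _ in range(n_iter):
        # Each update rescales the factor element-wise, so it stays non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H = nmf_multiplicative(V, n_topics=2)
error = np.linalg.norm(V - W @ H)  # Frobenius reconstruction error
print(W.shape, H.shape, round(float(error), 3))
```

Each row of `H` plays the role of a topic: its largest entries correspond to the terms that characterize that topic.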

In [43]:
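# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use get_feature_names()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()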

In [44]:
# change the number of topics here
no_topics = 5

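# run NMF
# note: the `alpha` argument was deprecated in scikit-learn 1.0 and removed in 1.2 (replaced by `alpha_W` / `alpha_H`)
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)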

In [48]:
# top 10 highest-weighted words of each NMF topic
# (on scikit-learn older than 1.0, use get_feature_names() instead)
feature_names = tfidf_vectorizer.get_feature_names_out()
topics_NMF = [
    [feature_names[i] for i in topic.argsort()[-10:]]
    for topic in nmf.components_
]

In [None]:
cm = CoherenceModel(topics=topics_NMF, texts=texts, corpus=corpus, dictionary=id2word, coherence='c_npmi')
coherence_nmf = cm.get_coherence()  
print('\nCoherence Score: ', coherence_nmf)

If you use this notebook, please cite our paper :)

```
Abeer Abuzayed and Hend Al-Khalifa. BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique. Arabic Computational Linguistics, Procedia Computer Science, Elsevier, (in press).
```



