<a href="https://colab.research.google.com/github/muhanangmahrub/topic-modeling/blob/main/topic_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling using Scikit-Learn

In [None]:
!pip install bertopic

Mount Google Drive in Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Extract the dataset archive

In [None]:
import tarfile

with tarfile.open('/content/drive/MyDrive/dataMachineLearning/aclImdb_v1.tar.gz', 'r:gz') as tar:
  tar.extractall()

Install the PyPrind library
PyPrind (Python Progress Indicator) is a module that provides a progress bar or percentage indicator, so we can monitor the progress of a running process.

In [None]:
!pip install PyPrind



Import the required libraries (pyprind, pandas, os, and sys) and set the base folder path of the dataset.

In [None]:
import pyprind
import pandas as pd
import os
import sys

basepath = 'aclImdb'

Create a labels dictionary that maps the `pos` key to 1 and the `neg` key to 0. Initialize the progress bar and read the reviews from the `test` and `train` folders into a single list. Once every record has been appended to the list, we convert the list to a pandas DataFrame.

In [None]:
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
ls = []
for s in ('test', 'train'):
  for l in ('pos', 'neg'):
    path = os.path.join(basepath, s, l)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
        txt = infile.read()
      ls.append([txt, labels[l]])
      pbar.update()
df = pd.DataFrame(ls)
df.columns = ['review', 'sentiment']
df

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1
...,...,...
49995,"Towards the end of the movie, I felt it was to...",0
49996,This is the kind of movie that my enemies cont...,0
49997,I saw 'Descent' last night at the Stockholm Fi...,0
49998,Some films that you pick up for a pound turn o...,0


Shuffle the rows of the DataFrame and save it as a CSV file named `movie_data.csv`

In [None]:
import numpy as np
np.random.seed(0)
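# Shuffle the rows, then reset the index so row labels match row positions again
df = df.reindex(np.random.permutation(df.index)).reset_index(drop=True)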
df.to_csv('movie_data.csv', index=False, encoding='utf-8')
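
Shuffling with `reindex` alone moves the row labels together with the rows, so a label-based lookup (`.loc`) and a position-based lookup (`.iloc`) can return different rows until the index is reset. A small sketch with a hypothetical toy frame (the names `toy`, `shuffled`, and `aligned` are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the review table.
toy = pd.DataFrame({'review': ['a', 'b', 'c', 'd']})

rng = np.random.RandomState(0)
shuffled = toy.reindex(rng.permutation(toy.index))

# Row labels travel with the rows, so label 0 still points at 'a' ...
by_label = shuffled.loc[0, 'review']
# ... while position 0 is whichever row the shuffle put first.
by_position = shuffled['review'].iloc[0]

# Resetting the index makes labels and positions agree again.
aligned = shuffled.reset_index(drop=True)
print(by_label, by_position, aligned.loc[0, 'review'])
```

This matters later on, because matrices produced from the reviews (word counts, topic weights, embeddings) are ordered by row position.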

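Convert the collection of reviews into a bag-of-words matrix of token counts. With `max_df=0.1`, words that occur in more than 10% of the documents are ignored, and `max_features=5000` keeps only the 5,000 most frequent remaining words.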

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english', max_df=0.1, max_features=5000)
X = count.fit_transform(df['review'])

## Class LatentDirichletAllocation

Use the `LatentDirichletAllocation` class from scikit-learn to model our bag-of-words matrix and uncover latent topics in the collection of documents

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=123, learning_method='batch')
X_topics = lda.fit_transform(X)
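
The matrix returned by `fit_transform` holds one row per document and one column per topic. A minimal sketch on a hypothetical toy corpus (the documents and the names `toy_lda` and `doc_topics` are assumptions for illustration) shows that each row is a distribution over topics:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus with two rough themes.
docs = [
    'zombie blood horror gore',
    'horror killer blood scary',
    'comedy jokes laugh fun',
    'laugh comedy humor jokes',
]
counts = CountVectorizer().fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0,
                                    learning_method='batch')
doc_topics = toy_lda.fit_transform(counts)

# One row per document, one column per topic; each row sums to 1.
print(doc_topics.shape)
print(doc_topics.sum(axis=1))
# Index of the most probable topic for each document.
print(doc_topics.argmax(axis=1))
```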

Check the shape of `lda.components_`: the first number is the number of topics, and the second is the number of words in the vocabulary.

In [None]:
lda.components_.shape

(10, 5000)

Show the top n words for each topic

In [None]:
n_top_words = 6
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
  print(f'Topic {(topic_idx + 1)}:')
  print(' '.join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1: -1]]))

Topic 1:
comedy jokes laugh humor fun original
Topic 2:
guy girl sex women woman minutes
Topic 3:
war american wife murder men police
Topic 4:
human book feel audience documentary different
Topic 5:
series tv episode dvd episodes shows
Topic 6:
horror gore house scary blood killer
Topic 7:
performance role wonderful beautiful family performances
Topic 8:
action john western killer hero town
Topic 9:
script worst minutes awful budget terrible
Topic 10:
action fun music animation disney nice
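
The slice `argsort()[:-n_top_words - 1:-1]` used above can look cryptic. A small sketch with made-up weights shows that it walks the ascending sort order backwards and stops after `n_top` entries:

```python
import numpy as np

# Made-up word weights for a single topic.
weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n_top = 3

# argsort() returns the indices in ascending order of weight.
ascending = weights.argsort()
print(ascending)  # [0 2 4 1 3]

# Stepping backwards from the end for n_top entries gives the top indices, largest first.
top = weights.argsort()[:-n_top - 1:-1]
print(top)  # [3 1 4]

# An equivalent and arguably more readable form.
top_alt = weights.argsort()[::-1][:n_top]
```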


Show the three reviews most strongly associated with the horror topic (Topic 6 above, which is column index 5 of `X_topics`)

In [None]:
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
  # Rows of X_topics are ordered by position, so look the review up by position, not by label
  print(f'{iter_idx + 1}: {df["review"].iloc[movie_idx]}')

1: There is so much that can be said about this film. It is not your typical nunsploitation. Of course, there is nudity and sex with nuns, but that is almost incidental to the story.<br /><br />It is set in 15th Century Italy, at the time of the martyrdom of 800 Christians at Otranto. The battle between the Muslims and the Christians takes up a good part of the film. It was interesting when everyone was running from the Muslim hoards, that the mother superior would ask, "Why do you fear the Muslims,; they will not do anything that the Christians have done to you?" Certainly, there was enough torture on both sides.<br /><br />Sister Flavia (Florinda Bolkan) is sent to a convent for defying her father. In the process, she witnesses and endures many things: the gelding of a stallion, the rape of a local woman by a new Duke, the torture of a nun who was overcome during a visit by the Tarantula Sect, and a whipping herself when she ran off with a Jew. The torture was particularly gruesome w

# Topic Modeling using BERTopic

---
Based on the BERTopic [documentation](https://maartengr.github.io/BERTopic/index.html), BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.


## Embedding Documents

---
We convert each review into a numeric vector using Sentence Transformers. Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models. This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining ([source](https://sbert.net/)).

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Pass a plain list of strings so the embeddings follow the row order of df
embeddings = model.encode(df['review'].tolist(), show_progress_bar=True)


In [None]:
embeddings.shape

(50000, 384)
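
Each review is now a 384-dimensional vector, and vectors are usually compared with cosine similarity. A minimal sketch, using random vectors as a hypothetical stand-in for real rows of `embeddings` (the helper `cosine_similarity` and `fake_embeddings` are illustrative names):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in for three rows of the embedding matrix.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(3, 384))

sim_self = cosine_similarity(fake_embeddings[0], fake_embeddings[0])
sim_other = cosine_similarity(fake_embeddings[0], fake_embeddings[1])
print(sim_self, sim_other)
```

A vector compared with itself scores 1; with real embeddings, reviews with similar meaning should score higher than unrelated ones.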

## Reducing Dimensionality

---
According to the documentation, BERTopic supports UMAP, PCA, Truncated SVD, and cuML UMAP for dimensionality reduction. In this example, we use UMAP to reduce the 384-dimensional embeddings to 5 dimensions. Although BERTopic reduces dimensionality by default, this step can also be skipped.

In [None]:
from umap import UMAP

umap_model = UMAP(n_components=5, min_dist=0.0, metric='cosine', random_state=42)
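
For comparison, here is a sketch of the same reduction step with PCA, one of the alternatives listed above. It uses PCA instead of UMAP, and random data as a hypothetical stand-in for the real embeddings; `pca_model` and `fake_embeddings` are illustrative names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the (n_documents, 384) embedding matrix.
rng = np.random.default_rng(42)
fake_embeddings = rng.normal(size=(200, 384))

# Reduce 384 dimensions to 5, matching n_components in the UMAP setup.
pca_model = PCA(n_components=5, random_state=42)
reduced = pca_model.fit_transform(fake_embeddings)
print(reduced.shape)  # (200, 5)
```

The BERTopic documentation describes passing such a model through the `umap_model` argument in place of UMAP.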

## Clustering Reduced Embeddings

---
After reducing dimensionality, the next step in BERTopic is clustering. We use the default clustering method, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).

In [None]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean', cluster_selection_method='eom')

## Vectorizer

---
The count vectorizer is responsible for creating the topic representations: it builds the word counts from which the keywords of each topic are extracted.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words='english', max_df=0.9, max_features=500)

## The Last Piece of the Puzzle: BERTopic

---
The last step is to combine the embedding model, UMAP model, HDBSCAN model, and vectorizer in the `BERTopic` class.

In [None]:
from bertopic import BERTopic

topic_model = BERTopic(
    embedding_model=model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
    verbose=True
).fit(df['review'], embeddings)

topic_model.get_topic_info()

2025-02-01 11:33:07,245 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-02-01 11:34:14,495 - BERTopic - Dimensionality - Completed ✓
2025-02-01 11:34:14,498 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-02-01 11:34:21,176 - BERTopic - Cluster - Completed ✓
2025-02-01 11:34:21,190 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-02-01 11:34:28,946 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,27982,-1_class_earth_sequence_named,"[class, earth, sequence, named, sequel, britis...",[I was very willing to give Rendition the bene...
1,0,1752,0_powerful_disney_british_tom,"[powerful, disney, british, tom, filled, siste...","[Dreamgirls, despite its fistful of Tony wins ..."
2,1,1630,1_tom_force_season_animation,"[tom, force, season, animation, famous, sequen...",[I'm going to write about this movie and about...
3,2,1423,2_bond_dance_train_impressive,"[bond, dance, train, impressive, plenty, gay, ...",[I am beginning to see a very consistent patte...
4,3,779,3_charlie_bond_zombie_race,"[charlie, bond, zombie, race, doctor, mad, mys...",[DarkWolf tells the tale of a young waitress n...
...,...,...,...,...,...
96,95,52,95_race_soul_state_shooting,"[race, soul, state, shooting, basic, created, ...","[""The Muppets Take Manhattan"" is different in ..."
97,96,52,96_japanese_rate_silent_garbage,"[japanese, rate, silent, garbage, zombies, tra...",[Jacqueline Hyde starts like any other normal ...
98,97,52,97_state_general_government_ghost,"[state, general, government, ghost, law, pictu...","[Hearkening back to those ""Good Old Days"" of 1..."
99,98,52,98_steve_south_japanese_soldiers,"[steve, south, japanese, soldiers, computer, 1...",[What I hoped for (or even expected) was the w...
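
Topic `-1` in the table above collects the documents that HDBSCAN labelled as noise, meaning they were not assigned to any cluster. A quick calculation with the counts from the table shows how large that group is:

```python
n_docs = 50000       # total number of reviews
n_outliers = 27982   # Count for topic -1 in the table above

share = n_outliers / n_docs
print(f'{share:.1%} of the reviews were not assigned to a topic')
```

More than half of the reviews are outliers here, which is worth keeping in mind when reading the topic representations.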


In [None]:
topic_model.get_topic(0)

[('powerful', 0.00587381079307173),
 ('disney', 0.005811938821039015),
 ('british', 0.005731934120619491),
 ('tom', 0.005705048142691955),
 ('filled', 0.005615485349470466),
 ('sister', 0.005429710772993798),
 ('killing', 0.005411342140862361),
 ('german', 0.005390414174562585),
 ('suspense', 0.005371182502925229),
 ('monster', 0.005282683041516854)]

In [None]:
topic_model.find_topics('history')

([87, 30, 57, 29, 69],
 [0.17280701, 0.16494063, 0.16327499, 0.13606219, 0.12904948])
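
`find_topics` returns two parallel lists: the topic IDs and their similarity to the search term, ordered from most to least similar. A small sketch that pairs them up, using the values from the output above (the variable names are only for illustration):

```python
# Values copied from the output of find_topics('history') above.
topics = [87, 30, 57, 29, 69]
similarities = [0.17280701, 0.16494063, 0.16327499, 0.13606219, 0.12904948]

# Pair each topic ID with its similarity score.
for topic_id, score in zip(topics, similarities):
    print(f'topic {topic_id}: similarity {score:.3f}')

# The first entry is the closest match.
best_topic = topics[0]
```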

In [None]:
topic_model.get_topic(87)

[('george', 0.025907660558383134),
 ('dancing', 0.017904626955317245),
 ('girlfriend', 0.016332432303062555),
 ('fair', 0.015776469061416235),
 ('trip', 0.015730679808340642),
 ('law', 0.014994102580259249),
 ('mystery', 0.014802338969874243),
 ('island', 0.013453648374804754),
 ('ship', 0.013331499692873738),
 ('detective', 0.013233534195267141)]