## Using Gensim for Topic Modeling

We’ll start with the gensim implementations because they offer more functionality out of the box, and then replicate that functionality with sklearn. Let’s first prepare the dataset we’ll be working with.


In [1]:
!pip install sastrawi
!pip install pyldavis
!pip install gensim==3.8.0

import nltk
from bs4 import BeautifulSoup
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')


True

In [2]:
!pip install gensim==3.8.0
import pkg_resources
pkg_resources.get_distribution("gensim").version




'3.6.0'

In [3]:
!mkdir -p dataset
!wget https://raw.githubusercontent.com/project303/dataset/master/Berita.txt -P dataset
!wget https://raw.githubusercontent.com/project303/dataset/master/Judul-Berita.txt -P dataset

In [4]:
# Articles in the file are separated by the marker 'BERHENTI DISINI' ("stop here")
with open('dataset/Berita.txt', encoding="utf8") as f:
    article = f.read().split('BERHENTI DISINI')
len(article)

32

Strip the HTML tags from the articles with ``BeautifulSoup``:

In [5]:
article_clean = []
for text in article:
    text = BeautifulSoup(text, 'html.parser').getText()
    article_clean.append(text)
article = article_clean
print(article[0][:100])



Kroasia: Melawan Argentina adalah Pertandingan Termudah





Jakarta, CNN Indonesia -- Agung Rahma


Tokenize the text, remove stopwords, and stem the remaining tokens:

In [6]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [7]:
def tokenize_and_stem(text):
    # Use a set for fast membership tests; the name avoids shadowing the imported `stopwords` module
    stop_words = set(nltk.corpus.stopwords.words('indonesian'))
    # First tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep only tokens that contain a letter (drops numeric tokens and raw punctuation)
    # and that are not stopwords
    filtered_tokens = [
        token for token in tokens
        if re.search(r'[a-zA-Z]', token) and token not in stop_words
    ]
    # Reduce each remaining token to its stem with Sastrawi
    return [stemmer.stem(t) for t in filtered_tokens]
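
To make the filtering step in `tokenize_and_stem` concrete, here is a minimal standard-library sketch of the same keep/drop rule. The helper `keep_token`, the hand-tokenized sentence, and the three-word stopword set are stand-ins invented for illustration; the notebook itself uses NLTK's Indonesian stopword list.

```python
import re

def keep_token(token, stop_words):
    """Same rule as the filter in tokenize_and_stem:
    keep a token only if it contains a letter and is not a stopword."""
    return bool(re.search(r'[a-zA-Z]', token)) and token not in stop_words

# Hypothetical, hand-tokenized input and a tiny stand-in stopword set
tokens = ["Tim", "lawan", "dan", "skor", "1-0", ".", "2018"]
stop_words = {"dan", "yang", "di"}

kept = [t for t in tokens if keep_token(t, stop_words)]
print(kept)  # ['Tim', 'lawan', 'skor']

# The stopword comparison is case-sensitive, so a capitalized
# stopword is NOT removed by this filter
print(keep_token("Dan", stop_words))  # True
```

The last line points at a design choice worth knowing about: because tokens are compared before any lowercasing, a stopword at the start of a sentence passes the filter. Lowercasing tokens before the comparison would close that gap, at the cost of changing the outputs shown in this notebook.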

In [8]:
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in article:
    tokenized_data.append(tokenize_and_stem(text))

# Build a Dictionary: a mapping from each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)

# Transform the collection of texts to a numerical (bag-of-words) form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]

# Look at the document at index 20: a list of (word_id, count) pairs
print(corpus[20])


[(2, 1), (3, 1), (11, 1), (19, 4), (22, 2), (24, 1), (26, 1), (44, 1), (50, 1), (164, 2), (183, 9), (196, 1), (223, 2), (230, 1), (252, 4), (274, 1), (280, 1), (284, 1), (309, 1), (314, 1), (335, 2), (341, 1), (404, 1), (431, 1), (434, 1), (452, 1), (465, 1), (474, 1), (480, 1), (485, 2), (489, 1), (500, 1), (504, 1), (520, 1), (523, 1), (534, 1), (538, 2), (546, 1), (547, 9), (552, 1), (596, 1), (597, 3), (599, 1), (619, 2), (676, 1), (845, 1), (904, 1), (914, 2), (927, 1), (932, 1), (967, 1), (1032, 1), (1038, 2), (1165, 2), (1174, 1), (1283, 1), (1397, 1), (1398, 8), (1399, 1), (1400, 1), (1401, 3), (1402, 1), (1403, 1), (1404, 2), (1405, 4), (1406, 1), (1407, 2), (1408, 3), (1409, 1), (1410, 1), (1411, 1), (1412, 3), (1413, 1), (1414, 4), (1415, 2), (1416, 1), (1417, 1), (1418, 1), (1419, 1), (1420, 1), (1421, 1), (1422, 1), (1423, 1), (1424, 2), (1425, 1), (1426, 1), (1427, 1), (1428, 1), (1429, 1), (1430, 1), (1431, 4), (1432, 1), (1433, 1), (1434, 1), (1435, 1), (1436, 1), (1437
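
The `(word_id, count)` pairs above are a bag-of-words encoding. The following is a minimal sketch of the idea in plain Python, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are invented, and the ids gensim assigns to words may differ from the first-seen order used here.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign each distinct token an integer id in first-seen order."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (word_id, count) pairs; tokens outside the vocabulary are skipped."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy corpus of already tokenized and stemmed documents (invented for illustration)
toy_docs = [["main", "tanding", "main"], ["persen", "uang"]]
vocab = build_vocab(toy_docs)
print(vocab)  # {'main': 0, 'tanding': 1, 'persen': 2, 'uang': 3}

# "gol" was never seen while building the vocabulary, so it is dropped
print(to_bow(["main", "main", "uang", "gol"], vocab))  # [(0, 2), (3, 1)]
```

Word order is discarded in this representation; only how often each known word occurs is kept, which is all that LDA and LSI need.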

In [9]:
NUM_TOPICS = 4

# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha='auto', eval_every=5)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

The cell above fits two models, LDA and LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis); implementations of both are included in the gensim package.

Let’s now display the topics the two models have inferred:

In [10]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of the topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of the topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

LDA Model:
Topic #0: 0.013*"persen" + 0.007*"indonesia" + 0.007*"dunia" + 0.006*"diskon" + 0.006*"uang" + 0.005*"dolar" + 0.005*"jakarta" + 0.005*"tabung" + 0.004*"tampil" + 0.004*"posisi"
Topic #1: 0.025*"persen" + 0.006*"indonesia" + 0.006*"novel" + 0.006*"duga" + 0.005*"indeks" + 0.005*"lemah" + 0.005*"jakarta" + 0.005*"cnn" + 0.005*"oknum" + 0.004*"uang"
Topic #2: 0.012*"persen" + 0.007*"balap" + 0.006*"jakarta" + 0.005*"indonesia" + 0.005*"poin" + 0.005*"cnn" + 0.005*"dunia" + 0.005*"hasil" + 0.005*"main" + 0.005*"tanding"
Topic #3: 0.009*"main" + 0.008*"persen" + 0.007*"jakarta" + 0.006*"tanding" + 0.005*"lapor" + 0.005*"aman" + 0.005*"dunia" + 0.004*"menteri" + 0.004*"lawan" + 0.004*"belanja"
LSI Model:
Topic #0: 0.731*"persen" + 0.216*"lemah" + 0.171*"dolar" + 0.165*"minus" + 0.133*"indeks" + 0.128*"bunga" + 0.116*"kuat" + 0.111*"uang" + 0.104*"as" + 0.103*"dagang"
Topic #1: -0.385*"novel" + -0.311*"oknum" + -0.300*"duga" + -0.271*"jenderal" + -0.165*"kpk" + -0.142*"main" + -0.

Let’s now put the models to work and transform unseen documents to their topic distribution:

In [11]:
text = "Pertandingan berjalan dengan seru. Tim lawan berhasil dikalahkan dengan skor 1-0."
bow = dictionary.doc2bow(tokenize_and_stem(text))

print(lda_model[bow]) 
print(lsi_model[bow])
print(bow)

[(0, 0.0583168), (1, 0.054374084), (2, 0.8333789), (3, 0.053930227)]
[(0, 0.13315407974835505), (1, -0.35909145512635077), (2, 0.8065685310165082), (3, 0.03966512669010389)]
[(19, 1), (46, 1), (75, 1), (137, 1), (404, 1), (454, 1), (930, 1)]


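The LDA result can be interpreted as a distribution over topics: each pair is a topic id and the share of the document attributed to that topic, and here topic 2 dominates with about 0.83. The LSI values are coordinates in the latent space; they can be negative and are not probabilities.
Gensim also offers a simple way of performing similarity queries using topic models.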

In [12]:
from gensim import similarities

# Index every document in the corpus by its LDA topic vector
lda_index = similarities.MatrixSimilarity(lda_model[corpus])

# Query the index with the topic vector of the unseen document
# (named `sims` so the imported `similarities` module is not overwritten)
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])

# The ten most similar documents as (document_id, similarity) pairs
print(sims[:10])

# Show the start of the most similar document
document_id, similarity = sims[0]
print(article[document_id][:1000])

[(2, 0.99338967), (10, 0.99338967), (18, 0.99338967), (20, 0.99338967), (27, 0.99338967), (25, 0.9510065), (4, 0.6282035), (31, 0.5958772), (1, 0.19155347), (15, 0.13292967)]



Jakarta, CNN Indonesia -- Kapolres Jakarta Selatan, Komisaris Besar Indra Jafar mengatakan pihaknya bakal menindaklanjuti laporan seorang warga, Ronny Yuniarto terkait kasus pemukulan dan pengeroyokan yang diduga dilakukan oleh Politisi PDIP Herman Hery.

Indra mengatakan pihaknya masih mengumpulkan keterangan polisi lalu lintas (Polantas) yang menjadi saksi di lokasi kejadian serta hasil visum dari korban untuk menindaklanjuti laporan tersebut.

"Proses tetap kita lanjutkan, kita masih minta hasil visum rerhadap korban, kita minta di salah satu rumah sakit rujukan, selain itu juga masih dilakukan penyelidikan yang lain untuk mencari saksi-saksi," ujar Indra kepada wartawan di Mapolres Jakarta Selatan, Kamis (21/6).

Indra juga mengatakan bakal mendalami keterangan saksi dari polisi yang dinilai melakukan pembi
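
`MatrixSimilarity` scores documents by cosine similarity between vectors, here the LDA topic vectors. A minimal standard-library sketch of that ranking follows. The query vector is the LDA output printed for the unseen document earlier; the three indexed documents (`doc_a`, `doc_b`, `doc_c`) and their topic vectors are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# LDA topic weights of the unseen document, copied from the output above
query = [0.0583168, 0.054374084, 0.8333789, 0.053930227]

# Hypothetical topic vectors for three indexed documents
docs = {
    "doc_a": [0.01, 0.01, 0.97, 0.01],  # almost entirely topic 2
    "doc_b": [0.90, 0.05, 0.03, 0.02],  # almost entirely topic 0
    "doc_c": [0.25, 0.25, 0.25, 0.25],  # evenly mixed
}

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

With only four topics the vectors are short, so many documents that share a dominant topic end up with nearly identical scores; this would be consistent with the tie at 0.99338967 in the output above.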

Notice how the LDA weights printed for the unseen document (the first line of output from cell 11) add up to 1. That’s not a coincidence: LDA models each document as a mixture of topics, and its purpose is to estimate how much of the document was generated by each topic.
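
A quick check of that claim, using the two result lists printed for the unseen document (values copied verbatim from the output of cell 11):

```python
# LDA and LSI results for the unseen document, copied from the output above
lda_topics = [(0, 0.0583168), (1, 0.054374084), (2, 0.8333789), (3, 0.053930227)]
lsi_topics = [(0, 0.13315407974835505), (1, -0.35909145512635077),
              (2, 0.8065685310165082), (3, 0.03966512669010389)]

lda_total = sum(weight for _, weight in lda_topics)
lsi_total = sum(weight for _, weight in lsi_topics)

# The topic with the largest share of the document
dominant_topic, dominant_weight = max(lda_topics, key=lambda pair: pair[1])

print(round(lda_total, 4))   # 1.0
print(round(lsi_total, 4))   # 0.6203
print(dominant_topic)        # 2
```

Only the LDA weights form a probability distribution. Note also that gensim omits topics whose weight falls below the model's `minimum_probability` threshold, so for other documents the printed LDA weights can sum to slightly less than 1.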

LDA is fitted iteratively. A simplified view of the procedure has two main steps (gensim’s `LdaModel` uses online variational Bayes rather than this exact loop, but the intuition carries over):

   - In the initialization stage, each word is assigned to a random topic.
   - Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration:
        - the probability that the word belongs to the topic
        - the probability that the document was generated by the topic
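
The two steps in the list can be sketched as a small collapsed Gibbs sampler. This is an illustrative sketch under stated assumptions, not what gensim runs: the function `gibbs_lda`, the hyperparameter values `alpha` and `beta`, and the toy corpus of word ids are all invented for this example.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assignments = []

    # Step 1: assign every word occurrence to a random topic
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly revisit each word and resample its topic
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # (probability of the word under topic t) * (weight of topic t in the document)
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distributions; each row sums to 1
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, assignments

# Toy corpus: documents 0-1 use word ids 0-2, documents 2-3 use word ids 3-5
toy_docs = [[0, 1, 0, 2], [1, 0, 2, 0], [3, 4, 5, 3], [4, 5, 3, 4]]
theta, assignments = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 3) for p in row])
```

The two factors multiplied inside `weights` correspond to the two sub-bullets above: how strongly the topic favors the word, and how strongly the document favors the topic.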

Because every document is a mixture of topics and every topic is a weighted list of words, LDA results lend themselves to visualization. We’re going to use a specialized tool called pyLDAvis:

In [13]:
# Newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
panel