## Using Gensim for Topic Modeling

We’ll start with the gensim implementations because they offer more functionality out of the box, and then replicate that functionality with sklearn. Let’s first prepare the dataset we’ll be working with.


In [1]:
!pip install sastrawi
!pip install pyldavis
!pip install gensim==3.8.0

import nltk
from bs4 import BeautifulSoup
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
import re
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')



True

In [2]:
!pip install gensim==3.8.0
import pkg_resources
pkg_resources.get_distribution("gensim").version




'3.6.0'

In [3]:
!git clone https://github.com/project303/dataset.git
  
article = open('dataset/Berita.txt', encoding="utf8").read().split('BERHENTI DISINI')
len(article)

Cloning into 'dataset'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (5/5), done.


32

Strip the HTML tags from the data with ``BeautifulSoup``:

In [4]:
article_clean = []
for text in article:
    text = BeautifulSoup(text, 'html.parser').getText()
    article_clean.append(text)
article = article_clean
print(article[0][:100])



Kroasia: Melawan Argentina adalah Pertandingan Termudah





Jakarta, CNN Indonesia -- Agung Rahma


Tokenize the text, remove stopwords, and stem the remaining words:

In [0]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

In [0]:
def tokenize_and_stem(text):
    # Use a set for fast membership tests
    stop_words = set(nltk.corpus.stopwords.words('indonesian'))
    # First tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep only tokens that contain letters (drops numeric tokens and raw punctuation) and are not stopwords
    filtered_tokens = [token for token in tokens
                       if re.search('[a-zA-Z]', token) and token not in stop_words]
    # Reduce each remaining token to its stem
    return [stemmer.stem(t) for t in filtered_tokens]

In [7]:
# For gensim we need to tokenize the data and filter out stopwords
tokenized_data = []
for text in article:
    tokenized_data.append(tokenize_and_stem(text))

# Build a Dictionary: a mapping from each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Transform the collection of texts to a numerical form
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Have a look at what document 20 looks like: [(word_id, count), ...]
print(corpus[20])


[(2, 1), (3, 1), (11, 1), (19, 4), (22, 2), (24, 1), (26, 1), (44, 1), (50, 1), (164, 2), (183, 9), (196, 1), (223, 2), (230, 1), (252, 4), (274, 1), (280, 1), (284, 1), (309, 1), (314, 1), (335, 2), (341, 1), (404, 1), (431, 1), (434, 1), (452, 1), (465, 1), (474, 1), (480, 1), (485, 2), (489, 1), (500, 1), (504, 1), (520, 1), (523, 1), (534, 1), (538, 2), (546, 1), (547, 9), (552, 1), (596, 1), (597, 3), (599, 1), (619, 2), (676, 1), (845, 1), (904, 1), (914, 2), (927, 1), (932, 1), (967, 1), (1032, 1), (1038, 2), (1165, 2), (1174, 1), (1283, 1), (1397, 1), (1398, 8), (1399, 1), (1400, 1), (1401, 3), (1402, 1), (1403, 1), (1404, 2), (1405, 4), (1406, 1), (1407, 2), (1408, 3), (1409, 1), (1410, 1), (1411, 1), (1412, 3), (1413, 1), (1414, 4), (1415, 2), (1416, 1), (1417, 1), (1418, 1), (1419, 1), (1420, 1), (1421, 1), (1422, 1), (1423, 1), (1424, 2), (1425, 1), (1426, 1), (1427, 1), (1428, 1), (1429, 1), (1430, 1), (1431, 4), (1432, 1), (1433, 1), (1434, 1), (1435, 1), (1436, 1), (1437
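
To make the `(word_id, count)` format concrete, here is a minimal pure-Python sketch of what a word-to-id mapping plus a bag-of-words conversion does. The helper names `build_vocab` and `to_bow` and the toy tokens are hypothetical, and the sketch assumes ids are handed out in order of first appearance; it is not gensim's implementation, and gensim's actual ids may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Map each distinct token to an integer id (order of first appearance)."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (word_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Hypothetical, already-stemmed toy documents
toy_docs = [["main", "lawan", "main"], ["dolar", "lemah", "dolar", "persen"]]
toy_vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_bow)
```

Tokens that are missing from the vocabulary are dropped here, which mirrors the default behavior of `doc2bow` for words the dictionary has never seen.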

In [0]:
NUM_TOPICS = 4

# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary,
                            alpha='auto', eval_every=5)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

The cell above trains an LDA model and an LSI (Latent Semantic Indexing, also known as Latent Semantic Analysis) model, both of which are implemented in the gensim package.

Let’s now display the topics the two models have inferred:

In [9]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for this topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words for this topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

LDA Model:
Topic #0: 0.013*"persen" + 0.007*"indonesia" + 0.006*"jakarta" + 0.006*"duga" + 0.005*"diskon" + 0.005*"polisi" + 0.005*"laku" + 0.005*"lapor" + 0.005*"uang" + 0.004*"cnn"
Topic #1: 0.025*"persen" + 0.008*"main" + 0.006*"dunia" + 0.006*"tanding" + 0.006*"indonesia" + 0.006*"lawan" + 0.006*"lemah" + 0.006*"jakarta" + 0.005*"indeks" + 0.005*"dolar"
Topic #2: 0.006*"duga" + 0.005*"cnn" + 0.005*"lapor" + 0.005*"jakarta" + 0.005*"persen" + 0.005*"indonesia" + 0.005*"dunia" + 0.004*"namun" + 0.004*"orang" + 0.004*"laku"
Topic #3: 0.012*"persen" + 0.006*"oknum" + 0.006*"duga" + 0.006*"novel" + 0.006*"indonesia" + 0.005*"jakarta" + 0.005*"diskon" + 0.004*"minus" + 0.004*"jenderal" + 0.004*"aman"
LSI Model:
Topic #0: 0.731*"persen" + 0.216*"lemah" + 0.171*"dolar" + 0.165*"minus" + 0.133*"indeks" + 0.128*"bunga" + 0.116*"kuat" + 0.111*"uang" + 0.104*"as" + 0.103*"dagang"
Topic #1: -0.385*"novel" + -0.311*"oknum" + -0.300*"duga" + -0.271*"jenderal" + -0.165*"kpk" + -0.142*"main" + -0.1

Let’s now put the models to work and transform unseen documents to their topic distribution:

In [10]:
text = "Pertandingan berjalan dengan seru. Tim lawan berhasil dikalahkan dengan skor 1-0."
bow = dictionary.doc2bow(tokenize_and_stem(text))

print(lda_model[bow]) 
print(lsi_model[bow])
print(bow)

[(0, 0.05583), (1, 0.8432569), (2, 0.053605795), (3, 0.047307327)]
[(0, 0.13315407974835333), (1, -0.3590914551263508), (2, 0.8065685310165087), (3, 0.039665126690103754)]
[(19, 1), (46, 1), (75, 1), (137, 1), (404, 1), (454, 1), (930, 1)]


The LDA result can be interpreted as a distribution over topics.
Gensim offers a simple way of performing similarity queries using topic models.

In [11]:
from gensim import similarities
 
# Index every document in the corpus by its LDA topic vector
lda_index = similarities.MatrixSimilarity(lda_model[corpus])
 
# Query the index with the unseen document
# (named `sims` so the imported `similarities` module is not shadowed)
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# The 10 most similar documents
print(sims[:10])
 
# Show the start of the most similar document
document_id, similarity = sims[0]
print(article[document_id][:1000])

[(1, 0.99426746), (4, 0.99426746), (5, 0.99426746), (6, 0.99426746), (7, 0.99426746), (10, 0.99426746), (12, 0.99426746), (13, 0.99426746), (14, 0.99426746), (18, 0.99426746)]



Jakarta, CNN Indonesia -- Nilai tukar mata uang negara-negara di kawasan Asia terpantau melemah di hadapan dolar AS pada hari ini, Rabu (13/6), tepat sehari sebelum The Federal Reserve, bank sentral Amerika Serikat (AS), mengumumkan keputusan tingkat suku bunga acuannya.

Sebelumnya, The Fed memberi sinyal akan menaikkan tingkat suku bunga acuannya sebanyak tiga kali pada tahun ini dan kenaikan kedua diperkirakan terjadi pada Juni ini. 

Pelemahan tertinggi terjadi pada mata uang Korea Selatan, yaitu won hingga 0,82 persen. Diikuti, rupee India minus 0,19 persen, yen Jepang minus 0,18 persen, dan baht Thailand minus 0,13 persen. 

Lalu, ringgit Malaysia melemah 0,12 persen, peso Filipina minus 0,12 persen, renmimbi China minus 0,04 persen, dolar Hong Kong minus 0,02 persen, dan dolar Singapura minus 0,01 perse
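
`MatrixSimilarity` scores documents by cosine similarity between their topic vectors. The sketch below recomputes one such score by hand in plain Python. The query vector is copied from the LDA output for the unseen sentence; the document vector `pure_topic_1` is an assumption (a document whose stored vector puts all of its weight on topic 1), not something the output above confirms.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Topic weights of the query, copied from the LDA output above.
query = [0.05583, 0.8432569, 0.053605795, 0.047307327]
# ASSUMPTION: a document whose stored vector is entirely topic 1.
pure_topic_1 = [0.0, 1.0, 0.0, 0.0]

score = cosine(query, pure_topic_1)
print(round(score, 4))
```

Under that assumption the score comes out at about 0.994, which is consistent with the identical scores in the top-10 list. One plausible reading is that many documents are represented almost entirely by a single topic, so the query cannot tell them apart. With only four topics and 32 short articles, topic-level similarity is a coarse signal.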

Notice how the LDA weights for the unseen document add up to 1. That’s not a coincidence: LDA treats each document as being generated by a mixture of topics, and its purpose is to compute how much of the document was generated by each topic.
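
As a quick check, the snippet below sums the `(topic_id, weight)` pairs copied from the LDA output for the unseen sentence and picks the dominant topic. The variable names are only illustrative.

```python
# (topic_id, weight) pairs copied from the LDA output for the unseen sentence.
lda_result = [(0, 0.05583), (1, 0.8432569), (2, 0.053605795), (3, 0.047307327)]

# The weights form a probability distribution, so they should sum to about 1.
total = sum(weight for _, weight in lda_result)

# The dominant topic is the one with the largest weight.
dominant_topic, dominant_weight = max(lda_result, key=lambda pair: pair[1])

print(f"total weight: {total:.4f}")
print(f"dominant topic: {dominant_topic} ({dominant_weight:.1%})")
```

The LSI scores for the same sentence do not behave this way: they can be negative and are not normalized, so they should not be read as probabilities.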

LDA is an iterative algorithm. Here are the two main steps:

   - In the initialization stage, each word is assigned to a random topic.
   - Iteratively, the algorithm goes through each word and reassigns the word to a topic taking into consideration:
        - What’s the probability of the word belonging to a topic
        - What’s the probability of the document to be generated by a topic
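
The two steps above correspond to a sampling scheme usually called collapsed Gibbs sampling. The sketch below is a toy version of it in plain Python, written only to illustrate the loop. The function name `gibbs_lda`, the hyperparameter values, and the toy documents are all assumptions. Note that gensim's `LdaModel` uses a different inference method (online variational Bayes), so this is not the code that runs when the model above is trained.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assignments = []

    # Initialization: assign every word occurrence to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight of topic t = P(word | topic) * P(topic | document).
                weights = [
                    (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the new assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic distributions; each row sums to 1.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]

# Hypothetical toy corpus: three documents over a vocabulary of four word ids.
toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
theta = gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
for row in theta:
    print([round(p, 3) for p in row])
```

Each printed row is one document's topic mixture, which is the same kind of object that `lda_model[bow]` returns.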

Because LDA expresses both documents and topics as probability distributions, its results are easy to visualize. We’re going to use a specialized tool called pyLDAvis:

In [12]:
import pyLDAvis.gensim
 
pyLDAvis.enable_notebook()
panel = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary)
panel

