<a href="https://colab.research.google.com/github/ianz88/text-mining/blob/master/Belajar_Text_Mining_Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting to Know Topic Modelling

In this notebook we will learn to build LDA and LSI topic models for Indonesian text with Gensim.

## Environment Setup

Install the libraries and packages this project needs (run in Google Colab).

In [None]:
# Indonesian stemming library (Sastrawi)
!pip install sastrawi

# Visualization library
!pip install pyldavis

# Topic modelling library
!pip install gensim==3.8.0
import pkg_resources
# Print the installed version; a bare expression in the middle of a cell is not displayed
print(pkg_resources.get_distribution("gensim").version)

# Natural Language Toolkit (NLTK)
import nltk

from bs4 import BeautifulSoup

# Python regular expressions
import re

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from gensim import models, corpora
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')


## Preprocessing Setup

Functions used to prepare the documents (text) that will be processed.

In [None]:
# Function to split a document into lowercase word tokens
def tokenize_clean(text):
    
    # Tokenize: split into sentences, then into lowercase words
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word
        in nltk.word_tokenize(sent)]
    
    # Drop tokens without any letter (numbers, punctuation) and stop words.
    # `stopwords` is the collection built in the next cell, which rebinds
    # the name imported from nltk.corpus.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token) and token not in stopwords:
            filtered_tokens.append(token)
            
    return filtered_tokens
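
The filter in `tokenize_clean` keeps only tokens that contain at least one Latin letter and are not stop words. The following is a minimal sketch of that rule, assuming a plain regex split (`simple_tokenize`, a hypothetical stand-in for `nltk.word_tokenize`) and a tiny illustrative stop-word set (`demo_stopwords`), so that it runs without downloading any NLTK data. It is not the notebook's tokenizer, only an illustration of the filtering step.

```python
import re

# Hypothetical stand-in for nltk.word_tokenize: word runs and single punctuation marks
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Tiny illustrative stop-word set (not the notebook's full list)
demo_stopwords = {"yg", "dan"}

def keep_token(token):
    # Keep tokens with at least one Latin letter that are not stop words
    return bool(re.search("[a-zA-Z]", token)) and token not in demo_stopwords

sample = "Harga naik 20% dan stok yg ada habis!"
kept = [t for t in simple_tokenize(sample) if keep_token(t)]
print(kept)  # ['harga', 'naik', 'stok', 'ada', 'habis']
```

The number `20`, the `%` sign, and the exclamation mark are dropped because they contain no letter; `dan` and `yg` are dropped as stop words.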

In [None]:
# Stop-word list: NLTK's Indonesian list plus informal and slang words
stopwords_all = nltk.corpus.stopwords.words('indonesian')
stopwords_tambahan = {"ya","yak","iya","yg","ga","gak","gk","udh","sdh","udah","dah","nih","ini","deh","sih","dong","donk",
                 "sm","knp","utk","yaa","tdk","gini","gitu","bgt","gt","nya","kalo","cb","jg","jgn","gw","ge",
                 "sy","min","mas","mba","mbak","pak","kak","trus","trs","bs","bisa","aja","saja","no",
                 "w","g","gua","gue","emang","emg","wkwk","dr","kau","dg","gimana","apapun","apa",
                 "klo","yah","banget","pake","terus","krn","jadi","jd","mu","ku","si","hehe",
                 "tp","pa","lu","lo","lw","tw","tau","karna","kayak","ky","lg","untuk","tuk","dg","dgn"
                }
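stopwords_all.extend(stopwords_tambahan)
# Use a set: duplicates are removed and `token in stopwords` lookups are fast.
# This rebinds the `stopwords` name imported from nltk.corpus.
stopwords = set(stopwords_all)
print(len(stopwords))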

In [None]:
# Function to remove stop words from a list of tokens
def remove_stopwords(tokenized_text):
    
    cleaned_token = []
    for token in tokenized_text:
        if token not in stopwords:
            cleaned_token.append(token)
            
    return cleaned_token

In [None]:
# Function to reduce words to their root form (Indonesian)
# Create the Sastrawi stemmer once; rebuilding it on every call is slow
stemmer = StemmerFactory().create_stemmer()

def stemming_text(tokenized_text):
    
    # Stem each token with the Sastrawi stemmer
    stems = []
    for token in tokenized_text:
        stems.append(stemmer.stem(token))

    return stems

In [None]:
# Full preprocessing pipeline: tokenize, remove stop words, stem
def text_preprocessing(text):
    
    prep01 = tokenize_clean(text)
    prep02 = remove_stopwords(prep01)
    prep03 = stemming_text(prep02)
    
    return prep03
    

## Step 01 : Choose the Data Set

In [None]:
# Read the data set; each line of the file is treated as one document
with open('sample_data/HCBPC.txt', encoding="utf8") as f:
    article = f.read().split('\n')
len(article)

## Step 02 : Build the Corpus

In [None]:
# Gensim needs each document as a list of tokens with stop words removed
tokenized_data = []
for text in article:
    tokenized_data.append(text_preprocessing(text))

# Build a Dictionary: a mapping from each word to a numeric id
dictionary = corpora.Dictionary(tokenized_data)
 
# Convert each document to a bag-of-words vector
corpus = [dictionary.doc2bow(text) for text in tokenized_data]
 
# Look at the document at index 20 as a list of (word_id, count) pairs
print(corpus[20])
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
# Print every tokenized document (the output can be long)
print(tokenized_data)
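
To make the `(word_id, count)` representation concrete, here is a simplified pure-Python sketch of the idea behind `Dictionary` and `doc2bow`. The helpers `build_vocab` and `to_bow` and the two toy documents are hypothetical; Gensim's own implementation differs (for example in how ids are assigned), so treat this only as an illustration.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    # Assign each distinct token an integer id in order of first appearance
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    # Count known tokens; tokens missing from the vocabulary are ignored
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Toy documents, already tokenized (hypothetical data)
docs = [["harga", "naik", "harga"], ["stok", "habis"]]
vocab = build_vocab(docs)

print(vocab)                            # {'harga': 0, 'naik': 1, 'stok': 2, 'habis': 3}
print(to_bow(docs[0], vocab))           # [(0, 2), (1, 1)]
print(to_bow(["stok", "baru"], vocab))  # [(2, 1)]
```

The last call shows why a new document can end up with a short or even empty vector: words that never appeared in the training documents have no id and are skipped.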

## Step 03 : Build the LDA and LSI Models

We will build an LDA model and an LSI model (Latent Semantic Indexing, also known as Latent Semantic Analysis), each with 4 topics.

In [None]:
NUM_TOPICS = 4

# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary, alpha='auto', eval_every=5)
 
# Build the LSI model
lsi_model = models.LsiModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)

Let's look at the topics the models produced.

In [None]:
print("LDA Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lda_model.print_topic(idx, 10))
 
print("=" * 20)
 
print("LSI Model:")
 
for idx in range(NUM_TOPICS):
    # Print the 10 most representative words of each topic
    print("Topic #%s:" % idx, lsi_model.print_topic(idx, 10))
 
print("=" * 20)

## Step 04 : Transform a New Document

Now we apply the LDA and LSI models to a new document.

In [None]:
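# "Happy birthday, sir, wishing you good health and success always."
text = "Selamat ulang tahun pak, semoga sehat dan sukses selalu."
# Use the same preprocessing pipeline that was applied to the corpus
bow = dictionary.doc2bow(text_preprocessing(text))

print(lda_model[bow]) 
print(lsi_model[bow])
print(bow)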
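
`lda_model[bow]` returns a list of `(topic_id, probability)` pairs. A minimal sketch of how the dominant topic could be read from such a list is shown below; the values in `doc_topics` are hypothetical, not output from this notebook's model.

```python
# Hypothetical result of lda_model[bow]: (topic_id, probability) pairs
doc_topics = [(0, 0.10), (1, 0.62), (2, 0.05), (3, 0.23)]

# The dominant topic is the pair with the highest probability
dominant_topic, probability = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, probability)  # 1 0.62

# Rank all topics from most to least probable
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The LSI output has the same shape, but its values are signed weights rather than probabilities, so they should not be read as a distribution.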

## Step 05 : Compare with the Corpus Documents

The LDA output can be interpreted as a distribution over topics. With Gensim we can easily query for the corpus documents that are most similar to the new document.

In [None]:
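from gensim import similarities
 
# Index the topic vectors of every corpus document;
# num_features fixes the vector length to the number of topics
lda_index = similarities.MatrixSimilarity(lda_model[corpus], num_features=NUM_TOPICS)
 
# Query the index with the new document's topic vector.
# Use a separate name so the imported `similarities` module is not overwritten.
sims = lda_index[lda_model[bow]]
# Sort by similarity, highest first
sims = sorted(enumerate(sims), key=lambda item: -item[1])
 
# The 10 most similar documents as (document_id, similarity) pairs
print(sims[:10])
 
# Show the first 1000 characters of the most similar document
document_id, similarity = sims[0]
print(article[document_id][:1000])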
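
`MatrixSimilarity` scores documents by cosine similarity between their vectors. The following pure-Python sketch shows that computation on sparse `(index, value)` vectors; the `cosine` helper and the topic distributions are hypothetical and only illustrate the ranking step, not Gensim's internals.

```python
import math

def cosine(a, b):
    # a and b are sparse vectors given as lists of (index, value) pairs
    da, db = dict(a), dict(b)
    dot = sum(v * db.get(i, 0.0) for i, v in da.items())
    norm_a = math.sqrt(sum(v * v for v in da.values()))
    norm_b = math.sqrt(sum(v * v for v in db.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical topic distributions for a query and three corpus documents
query = [(0, 0.7), (1, 0.3)]
corpus_topics = [
    [(0, 0.7), (1, 0.3)],  # same mix as the query
    [(2, 1.0)],            # shares no topic with the query
    [(0, 0.3), (1, 0.7)],  # same topics, different proportions
]

# Score every document and sort from most to least similar
scores = sorted(
    ((doc_id, cosine(query, doc)) for doc_id, doc in enumerate(corpus_topics)),
    key=lambda item: -item[1],
)
print(scores)
```

Document 0 ranks first, document 2 second, and document 1 last with a score of zero because it has no topic in common with the query.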

## Step 06 : Visualize the Corpus Topics

In [None]:
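# pyLDAvis 3.3+ renamed its Gensim helper module to gensim_models;
# older releases expose it as pyLDAvis.gensim
import pyLDAvis
try:
    import pyLDAvis.gensim_models as gensimvis
except ImportError:
    import pyLDAvis.gensim as gensimvis
 
pyLDAvis.enable_notebook()
panel = gensimvis.prepare(lda_model, corpus, dictionary)
panel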