<a href="https://colab.research.google.com/github/ianz88/text-mining/blob/master/Belajar_Text_Mining_TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting to Know TF-IDF

We will learn to:

1.   Preprocess Indonesian text
2.   Compute TF-IDF with TfidfVectorizer
3.   Inspect the data with a Pandas DataFrame
4.   Fine-tune the stopword list
5.   Find the important terms in a document






## Environment Setup

Install the libraries and packages the project needs (run in Google Colab).

In [None]:
# Indonesian stemmer library (Sastrawi)
!pip install sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Natural Language Toolkit (NLTK)
import nltk
nltk.download('stopwords')
nltk.download('punkt')  # recent NLTK releases may also require 'punkt_tab'

# Python regex
import re

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer

## Preprocessing Setup

Functions used to prepare the documents (text) for processing.



In [None]:
# Split a document into tokens (a list with one element per word)
def tokenize_clean(text):

    # Tokenize: split into sentences, then words, and lowercase each word
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word
        in nltk.word_tokenize(sent)]

    # Drop tokens without any letter (numbers, punctuation) and stopwords.
    # `stopwords` is the global list defined in the next cell.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token) and token not in stopwords:
            filtered_tokens.append(token)

    return filtered_tokens

In [None]:
# Stopword list: NLTK's Indonesian stopwords plus informal chat abbreviations
stopwords_all = nltk.corpus.stopwords.words('indonesian')
stopwords_tambahan = {"ya","yak","iya","yg","ga","gak","gk","udh","sdh","udah","dah","nih","ini","deh","sih","dong","donk",
                 "sm","knp","utk","yaa","tdk","gini","gitu","bgt","gt","nya","kalo","cb","jg","jgn","gw","ge",
                 "sy","min","mas","mba","mbak","pak","kak","trus","trs","bs","bisa","aja","saja","no",
                 "w","g","gua","gue","emang","emg","wkwk","dr","kau","dg","gimana","apapun","apa",
                 "klo","yah","banget","pake","terus","krn","jadi","jd","mu","ku","si","hehe",
                 "tp","pa","lu","lo","lw","tw","tau","karna","kayak","ky","lg","untuk","tuk","dg","dgn"
                }
stopwords_all.extend(stopwords_tambahan)
stopwords = stopwords_all  # same list object, not a copy
print(len(stopwords))

In [None]:
# Remove stopwords from a list of tokens
# (punctuation was already dropped in tokenize_clean)
def remove_stopwords(tokenized_text):

    cleaned_token = []
    for token in tokenized_text:
        if token not in stopwords:
            cleaned_token.append(token)

    return cleaned_token

In [None]:
# Reduce each word to its root form (Indonesian)
def stemming_text(tokenized_text):

    # Stem with the Sastrawi stemmer
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()

    stems = []
    for token in tokenized_text:
        stems.append(stemmer.stem(token))

    return stems

In [None]:
# Full preprocessing pipeline: tokenize, remove stopwords, stem
def text_preprocessing(text):

    prep01 = tokenize_clean(text)
    prep02 = remove_stopwords(prep01)
    prep03 = stemming_text(prep02)

    return prep03
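
The order of the three stages can be tried without NLTK or Sastrawi. The sketch below is only an illustration: `demo_stopwords`, `demo_stems`, and `demo_preprocess` are hypothetical stand-ins (a regex instead of the NLTK tokenizer, a tiny lookup table instead of the Sastrawi stemmer), not the functions defined above.

```python
import re

# Hypothetical stand-ins so the sketch runs with the standard library only.
demo_stopwords = {"jika", "bisa"}
demo_stems = {"dipermudah": "mudah", "prosesnya": "proses"}  # toy lookup, not Sastrawi

def demo_preprocess(text):
    # 1. Tokenize and clean: lowercase, keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stopwords.
    tokens = [t for t in tokens if t not in demo_stopwords]
    # 3. "Stem" by table lookup; unknown words pass through unchanged.
    return [demo_stems.get(t, t) for t in tokens]

result = demo_preprocess("Update aplikasi jika bisa dipermudah, prosesnya 2 langkah.")
print(result)
```

The number and the punctuation disappear in stage 1, the stopwords in stage 2, and only then are the remaining words reduced to root forms.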
    

## Step 01 : Define the Dataset

In [None]:
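# Load the corpus: each line of the file is one document
with open('sample_data/diarium3.txt', encoding="utf8") as f:
    files = f.read().split('\n')

# Alternatively, start from files = [] and add a few sample documents by hand: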
#files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam bambu yang ia genggam di tangan.")
#files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu  saja untuk meratapi erupsi.")
#files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang berwisata.")
#files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi sendiri dan orang lain.")
#files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen  siap membantu dengan siaga 24 jam")

len(files)

## Step 02 : Build the Corpus

In [None]:
# Prepare the corpus: load the documents into a dictionary keyed by "file<N>"
token_dict = {}
for i, t in enumerate(files):
    token_dict["file" + str(i)] = t

len(token_dict)

In [None]:
token_dict.values()

In [None]:
token_dict['file0']

## Step 03 : Compute TF-IDF

TF-IDF (term frequency–inverse document frequency) is a statistic that reflects how important a word is to one document, relative to the whole collection of documents.

A term's TF-IDF weight in a document is small when the term is rare in that document or appears in many documents across the corpus. This suggests the term is less important.

The weight is large when the term occurs often in that document but in few other documents. Such a term is likely to be an important topic.
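
The weighting can be worked through by hand on a toy corpus. The sketch below assumes the `TfidfVectorizer` defaults (smoothed IDF, raw term counts, L2 normalisation per document); the three tokenized documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (hypothetical data for illustration only).
docs = [
    ["update", "aplikasi", "lambat"],
    ["update", "aplikasi", "mudah"],
    ["update", "login", "gagal", "login"],
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw term count times IDF, then L2-normalised per document.
    counts = Counter(doc)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[2])
for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:8s} idf={idf(term):.3f} tfidf={w:.3f}")
```

In the third document, "update" gets the lowest weight because it appears in every document, while "login" gets the highest because it is repeated there and appears nowhere else.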

In [None]:
# Fit a TF-IDF vectorizer on unigrams and bigrams
tfidf = TfidfVectorizer(max_df=0.8,             # drop terms found in more than 80% of documents (too common)
                        min_df=0.01,            # drop terms found in fewer than 1% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words=stopwords,   # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenization step with text_preprocessing
                        ngram_range=(1,2))      # n-gram sizes from 1 to 2 (unigrams and bigrams)

tfs = tfidf.fit_transform(token_dict.values())
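
When `min_df` and `max_df` are floats, they are read as proportions of the number of documents. The following standard-library sketch imitates that filter on a made-up corpus; `filter_by_document_frequency` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.01, max_df=0.8):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Float thresholds are read as proportions of the corpus size.
    """
    n_docs = len(docs)
    # Count each term once per document, however often it repeats there.
    df = Counter(term for doc in docs for term in set(doc))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(t for t, count in df.items() if low <= count <= high)

# Hypothetical tokenized documents for illustration only.
docs = [
    ["update", "aplikasi"],
    ["update", "login"],
    ["update", "aplikasi", "lambat"],
    ["update", "cuti"],
    ["update", "login"],
]
kept = filter_by_document_frequency(docs, min_df=0.3, max_df=0.8)
print(kept)
```

Here "update" is dropped as too common (5 of 5 documents), and "lambat" and "cuti" as too rare (1 of 5 documents).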

In [None]:
# Fit a TF-IDF vectorizer on bigrams only
tfidf22 = TfidfVectorizer(max_df=0.8,             # drop terms found in more than 80% of documents (too common)
                        min_df=0.008,           # drop terms found in fewer than 0.8% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words=stopwords,   # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenization step with text_preprocessing
                        ngram_range=(2,2))      # n-gram size 2 only (bigrams)

tfs22 = tfidf22.fit_transform(token_dict.values())

# Fit a TF-IDF vectorizer on bigrams and trigrams
tfidf23 = TfidfVectorizer(max_df=0.8,             # drop terms found in more than 80% of documents (too common)
                        min_df=0.006,           # drop terms found in fewer than 0.6% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words=stopwords,   # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenization step with text_preprocessing
                        ngram_range=(2,3))      # n-gram sizes from 2 to 3 (bigrams and trigrams)

tfs23 = tfidf23.fit_transform(token_dict.values())
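
To see what the different `ngram_range` settings produce, here is a minimal sketch of word n-gram construction from a token list. `word_ngrams` is a hypothetical helper written for this illustration; it mirrors the idea, not scikit-learn's internal code.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Join every run of n consecutive tokens, for each n in the range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

tokens = ["update", "aplikasi", "mudah"]  # output of preprocessing, made up here
print(word_ngrams(tokens, (1, 2)))
print(word_ngrams(tokens, (2, 2)))
print(word_ngrams(tokens, (2, 3)))
```

Because n-grams are built after stopword removal and stemming, a bigram can join two words that were not adjacent in the original sentence.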

In [None]:
# Check the result matrix: rows = number of documents, columns = number of terms (n-grams)
print("tfs12 : ",tfs.shape)

In [None]:
# Inspect the fitted vocabulary
feature_names = tfidf.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
print('Number of n-grams kept: ', len(feature_names))
print('n-grams found: ', feature_names)

In [None]:
import pandas as pd

# idf_ holds one IDF weight per term (not a per-document TF-IDF score)
df_idf = pd.DataFrame(tfidf.idf_, index=feature_names, columns=["idf"])

# Sort ascending: terms found in the most documents come first
df_idf = df_idf.sort_values(by=['idf'])

print(df_idf)
#print(df_idf.head())
#print(df_idf.tail(10))

In [None]:
#with pd.option_context('display.max_rows', None, 'display.max_columns', None):
# print(df_idf)

Compare with the results for bigrams only and for bigrams plus trigrams.

In [None]:
# Check the result matrices: rows = number of documents, columns = number of terms (n-grams)
print("tfs12 : ",tfs.shape)
print("tfs22 : ",tfs22.shape)
print("tfs23 : ",tfs23.shape)

In [None]:
# Inspect the fitted vocabularies
feature_names22 = tfidf22.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
print('Number of bigrams kept: ', len(feature_names22))
print('Bigrams found: ', feature_names22)

feature_names23 = tfidf23.get_feature_names_out()
print('\nNumber of bigrams/trigrams kept: ', len(feature_names23))
print('Bigrams/trigrams found: ', feature_names23)

In [None]:
# idf_ holds one IDF weight per term (not a per-document TF-IDF score)
df_idf22 = pd.DataFrame(tfidf22.idf_, index=feature_names22, columns=["idf"])

# Sort ascending: terms found in the most documents come first
df_idf22 = df_idf22.sort_values(by=['idf'])

print("IDF of bigrams:")
print(df_idf22)

In [None]:
# idf_ holds one IDF weight per term (not a per-document TF-IDF score)
df_idf23 = pd.DataFrame(tfidf23.idf_, index=feature_names23, columns=["idf"])

# Sort ascending: terms found in the most documents come first
df_idf23 = df_idf23.sort_values(by=['idf'])

print("IDF of bigrams/trigrams:")
print(df_idf23)

## Step 04 : Fine-Tuning Stopwords

We can use the results of the first TF-IDF calculation to improve the next pass so that its output is more relevant.

In [None]:
print("Before fine-tuning : ", len(stopwords))
stopwords_finetune = {"diarium","fitur","telkom","ok","dll"}

# stopwords is the same list object as stopwords_all, so extending one updates both
stopwords_all.extend(stopwords_finetune)
print("After fine-tuning : ", len(stopwords))
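
The cell above works because `stopwords = stopwords_all` binds a second name to the same list rather than copying it. A minimal sketch of that behaviour, with made-up names (`base`, `alias`, `snapshot`):

```python
base = ["yang", "dan"]
alias = base             # second name for the same list, like stopwords = stopwords_all
snapshot = list(base)    # independent copy, for contrast

base.extend({"diarium"})

print(alias is base)     # both names point at one list object
print(len(alias))        # the alias sees the new entry
print(len(snapshot))     # the copy does not
```

One consequence is that running the fine-tuning cell a second time appends the same words again, so the printed length grows even though no new stopword was added.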

Test the fine-tuning result

In [None]:
# Refit the bigram TF-IDF vectorizer with the fine-tuned stopword list
tfidf22 = TfidfVectorizer(max_df=0.8,             # drop terms found in more than 80% of documents (too common)
                        min_df=0.006,           # drop terms found in fewer than 0.6% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words=stopwords,   # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenization step with text_preprocessing
                        ngram_range=(2,2))      # n-gram size 2 only (bigrams)

tfs22 = tfidf22.fit_transform(token_dict.values())

# Check the result matrix: rows = number of documents, columns = number of terms (n-grams)
print("tfs22 shape : ",tfs22.shape)


In [None]:
# Inspect the fitted vocabulary
feature_names22 = tfidf22.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0
print('Number of bigrams kept: ', len(feature_names22))
print('Bigrams found: ', feature_names22)

In [None]:
# idf_ holds one IDF weight per term (not a per-document TF-IDF score)
df_idf22 = pd.DataFrame(tfidf22.idf_, index=feature_names22, columns=["idf"])

# Sort ascending: terms found in the most documents come first
df_idf22 = df_idf22.sort_values(by=['idf'])

print("IDF of bigrams:")
print(df_idf22)

## Step 05 : TF-IDF Transformation

We can use the fitted model to check whether the important terms it found also appear in other documents.

In [None]:
str1 = 'Update aplikasi jika bisa dipermudah. Kalau bisa hanya dengan satu tombol update. Saat ini agak repot jika untuk update aplikasi prosesnya seperti download - install ulang.'
response = tfidf22.transform([str1])
# Show the result: each vocabulary bigram found in str1 and its TF-IDF weight
print('\nTerms found in str1:')
for col in response.nonzero()[1]:
    print (feature_names22[col], ' - ', response[0, col])
print('\nResponse shape:', response.shape)
print('Preprocessed str1: ', text_preprocessing(str1))
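
What `transform` does with a new document can be approximated by hand. The sketch below assumes the vectorizer defaults (raw counts times IDF, then L2 normalisation); the vocabulary and IDF values in `idf_by_term` are invented for illustration and do not come from the fitted model.

```python
import math
from collections import Counter

# Hypothetical fitted state: vocabulary bigrams with made-up IDF weights.
idf_by_term = {"update aplikasi": 1.5, "install ulang": 2.0, "user friendly": 2.5}

def bigrams(tokens):
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def transform(tokens):
    # Count only bigrams seen during fitting; unknown bigrams are ignored.
    counts = Counter(b for b in bigrams(tokens) if b in idf_by_term)
    raw = {b: c * idf_by_term[b] for b, c in counts.items()}
    # L2-normalise so that documents of different lengths are comparable.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {b: w / norm for b, w in raw.items()} if norm else {}

# Preprocessed tokens of a new document (made up for this sketch).
tokens = ["update", "aplikasi", "mudah", "update", "aplikasi", "install", "ulang"]
scores = transform(tokens)
for term, score in scores.items():
    print(term, " - ", round(score, 3))
```

Bigrams that were not in the vocabulary at fit time, such as "aplikasi mudah" here, contribute nothing, which is why a new document can come back with only a few non-zero entries or none at all.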



In [None]:
str2 = '1. Info rekan = plis searching nya dipermudah (membaca keyword nya) 2. kalau update tolong yang lebih user friendly tanpa harus uninstall APK yang eksisting 3. HC Wiki = keyword nya diperbanyak'
response2 = tfidf22.transform([str2])

print('\nTerms found in str2:')
for col in response2.nonzero()[1]:
    print (feature_names22[col], ' - ', response2[0, col])
print('\nResponse shape:', response2.shape)
print('Preprocessed str2: ', text_preprocessing(str2))