<a href="https://colab.research.google.com/github/ianz88/text-mining/blob/master/Belajar_Text_Mining_TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting to Know TF-IDF

We will learn to:

1.   Preprocess Indonesian-language text
2.   Compute TF-IDF with TfidfVectorizer
3.   Inspect data with a Pandas DataFrame
4.   Fine-tune the stopword list
5.   Find the important terms in a document

## Environment setup

Install the libraries and packages the project needs (run in Google Colab).

In [124]:
# Indonesian stemming library (Sastrawi)
!pip install sastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Natural Language Tool Kit (NLTK)
import nltk
nltk.download('stopwords')
nltk.download('punkt')

# Python Regex
import re

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Preprocessing setup

Helper functions that prepare the documents (text) for processing.



In [125]:
# Split a document into tokens (a list of lowercase words)
def tokenize_clean(text):

    # Tokenize: split into sentences, then into words, and lowercase each word
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word
        in nltk.word_tokenize(sent)]

    # Drop tokens that contain no letters (numbers, punctuation) and stopwords.
    # `stopwords` is the global list defined in the next cell.
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token) and token not in stopwords:
            filtered_tokens.append(token)

    return filtered_tokens
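
To see what the letter filter in `tokenize_clean` keeps and drops, here is a minimal standalone sketch. It assumes a plain regex split in place of NLTK's tokenizers and a tiny hypothetical stopword set (`demo_stopwords`), so token boundaries may differ from the real function.

```python
import re

# Hypothetical mini stopword set, for illustration only
demo_stopwords = {"yg", "nya", "lagi"}

def simple_tokenize_clean(text, stopword_set):
    # Stand-in for nltk.word_tokenize: take runs of word characters, lowercased
    tokens = re.findall(r"\w+", text.lower())
    # Keep tokens that contain at least one letter and are not stopwords
    return [t for t in tokens if re.search("[a-zA-Z]", t) and t not in stopword_set]

demo_tokens = simple_tokenize_clean("Tampilan UI nya buat yg lebih bagus lagi 123", demo_stopwords)
print(demo_tokens)
```

The purely numeric token `123` is dropped because it contains no letters, and the three stopwords are filtered out in the same pass.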

In [126]:
# Stopword list: NLTK's Indonesian stopwords plus informal words and abbreviations
stopwords_all = nltk.corpus.stopwords.words('indonesian')
stopwords_tambahan = {"ya","yak","iya","yg","ga","gak","gk","udh","sdh","udah","dah","nih","ini","deh","sih","dong","donk",
                 "sm","knp","utk","yaa","tdk","gini","gitu","bgt","gt","nya","kalo","cb","jg","jgn","gw","ge",
                 "sy","min","mas","mba","mbak","pak","kak","trus","trs","bs","bisa","aja","saja","no",
                 "w","g","gua","gue","emang","emg","wkwk","dr","kau","dg","gimana","apapun","apa",
                 "klo","yah","banget","pake","terus","krn","jadi","jd","mu","ku","si","hehe",
                 "tp","pa","lu","lo","lw","tw","tau","karna","kayak","ky","lg","untuk","tuk","dg","dgn"
                }
stopwords_all.extend(stopwords_tambahan)
stopwords = stopwords_all
print(len(stopwords))

844


In [127]:
# Remove stopwords from a list of tokens
def remove_stopwords(tokenized_text):
    
    cleaned_token = []
    for token in tokenized_text:
        if token not in stopwords:
            cleaned_token.append(token)
            
    return cleaned_token
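
Because `stopwords` is a list, every `token not in stopwords` check scans it linearly. A possible alternative, sketched below with hypothetical data, is to test membership against a set. The names `remove_stopwords_fast` and `demo_stopword_list` are illustrative and not part of the notebook.

```python
def remove_stopwords_fast(tokenized_text, stopword_set):
    # Set membership is O(1) on average, versus O(n) for a list
    return [token for token in tokenized_text if token not in stopword_set]

# Hypothetical example data
demo_stopword_list = ["yg", "nya", "yg"]      # a list may hold duplicates
demo_stopword_set = set(demo_stopword_list)   # the set keeps unique entries only

print(len(demo_stopword_list), len(demo_stopword_set))
print(remove_stopwords_fast(["tampilan", "nya", "bagus"], demo_stopword_set))
```

The set is only used for filtering here. The `stop_words` argument of `TfidfVectorizer` is documented as taking a list, so the original list is still what gets passed to the vectorizer.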

In [128]:
# Reduce each token to its root word (Indonesian stemming)
def stemming_text(tokenized_text):

    # Stem using the Sastrawi StemmerFactory
    factory = StemmerFactory()
    stemmer = factory.create_stemmer()

    stems = []
    for token in tokenized_text:
        stems.append(stemmer.stem(token))

    return stems

In [129]:
# Full preprocessing pipeline: tokenize, remove stopwords, then stem
def text_preprocessing(text):
    
    prep01 = tokenize_clean(text)
    prep02 = remove_stopwords(prep01)
    prep03 = stemming_text(prep02)
    
    return prep03
    

## Step 01 : Choose the Data Set

In [130]:
# Read the corpus: one document per line
with open('sample_data/diarium3.txt', encoding="utf8") as f:
    files = f.read().split('\n')

#files.append("Sekelompok ibu dan kaum perempuan duduk beralaskan rumput lapangan sambil fokus menganyam bambu yang ia genggam di tangan.")
#files.append("Sebagian besar masyarakat rupanya tak mau melewatkan waktu begitu  saja untuk meratapi erupsi.")
#files.append("Lombok memang memiliki sejuta pesona yang mampu menyedot perhatian orang untuk datang berwisata.")
#files.append("Perempuan yang bergelut di dunia kerelawanan akan belajar caranya bertanggung jawab bagi sendiri dan orang lain.")
#files.append("Kami berkoordinasi dan melapor pada posko relawan, kami berkomitmen  siap membantu dengan siaga 24 jam")

len(files)

678

## Step 02 : Build the Corpus

In [131]:
# Prepare the corpus: load each document into a dictionary
token_dict = {}
i = 0
for t in files:
    filename = "file" + str(i)
    token_dict[filename] = t
    i = i + 1

len(token_dict)

678
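
The manual counter in the loop above can also be written with `enumerate`. This is a minimal sketch on a hypothetical two-document list (`demo_files`) standing in for the lines read from the input file.

```python
# Hypothetical two-document corpus standing in for the lines of the input file
demo_files = ["Tampilan UI nya buat yg lebih bagus lagi",
              "sangat bermanfaat secara sistem"]

# enumerate() yields (index, document) pairs, so no manual counter is needed
demo_token_dict = {f"file{i}": text for i, text in enumerate(demo_files)}

print(demo_token_dict["file0"])
print(len(demo_token_dict))
```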

In [132]:
token_dict.values()

dict_values(['Tampilan UI nya buat yg lebih bagus lagi', 'agar diarium mobile ini tdk terjadi sering eror', 'notifikasi untuk check in sebelum pukul 8 pagi. tidak hanya check out', 'Kehandalan sistem perlu ditingkatkan agar tidak sering terjadi error atau lambat pada saat user mengakses bersamaan.', 'update info2 bisnis dan produk telkom, arahan Mgt', 'Sudah sangat bagus, lbh ditingkatkan lagi agar warna warna diaroumm lebh besar semangatt dan cantik.', 'Terkadang feed / status dari user lain tidak terupdate di halaman depan', 'Lebih lancar lagi jangan sering error. NDE sering tidak bisa di akses.', 'sangat bermanfaat secara sistem', 'Akan lebih baik jika update otomatis tanpa harus delete apk dulu baru install ulang.', 'Notifikasi ulang tahun pada Diarium tidak bisa dihilangkan tanda serunya walaupun sudah dilihat notifikasinya, harus menuliskan ucapan baru tanda seru nya bisa hilang. Saran saja, untuk pengembangan ke depan saat notif sudah dilihat / dibaca, tanda seru nya juga bisa h

In [133]:
token_dict['file0']

'Tampilan UI nya buat yg lebih bagus lagi'

## Step 03 : Computing TF-IDF

TF-IDF (term frequency–inverse document frequency) is a statistic that reflects how important a word is to a document, relative to the whole collection of documents.

The more documents a word appears in, the lower its IDF weight. Words that occur in most documents therefore score low and are usually less important.

The fewer documents a word appears in, the higher its IDF weight. A word that is frequent in one document but rare across the collection gets a high TF-IDF score and is likely an important topic of that document.
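
As a worked illustration, scikit-learn's `TfidfVectorizer` with its default `smooth_idf=True` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below applies that formula to a 678-document corpus. The document frequencies 78 and 7 are assumptions, back-calculated to be consistent with the IDF values the notebook reports for its most and least widespread terms; they are not read from the data.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1, scikit-learn's default (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 678  # corpus size in this notebook

# Assumed document frequencies (back-calculated, not read from the data)
idf_common = smoothed_idf(n_docs, 78)  # a widespread term gets a low IDF
idf_rare = smoothed_idf(n_docs, 7)     # a rare term gets a high IDF

print(round(idf_common, 6))
print(round(idf_rare, 6))
```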

In [134]:
# Perform TF-IDF vectorization (unigrams and bigrams)
tfidf = TfidfVectorizer(max_df=0.8,             # drop terms that appear in more than 80% of documents (too common)
                        min_df=0.01,            # drop terms that appear in fewer than 1% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words = stopwords, # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenizer with the text_preprocessing function
                        ngram_range=(1,2))      # extract unigrams and bigrams (unigram=1, bigram=2, trigram=3)

tfs = tfidf.fit_transform(token_dict.values())
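
When `min_df` and `max_df` are floats, scikit-learn treats them as proportions of the corpus. The small sketch below converts the thresholds used above into document counts for this 678-document corpus; the variable names are illustrative.

```python
import math

n_docs = 678          # number of documents in this notebook's corpus
min_df, max_df = 0.01, 0.8

# A term is kept when min_df * n_docs <= df <= max_df * n_docs
min_count = math.ceil(min_df * n_docs)    # smallest integer document count that passes
max_count = math.floor(max_df * n_docs)   # largest integer document count that passes

print(min_count, max_count)
```

In other words, with these settings a term has to occur in at least 7 documents and in at most 542 documents to enter the vocabulary.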

In [135]:
# Perform TF-IDF vectorization (bigrams only)
tfidf22 = TfidfVectorizer(max_df=0.8,             # drop terms that appear in more than 80% of documents (too common)
                        min_df=0.008,           # drop terms that appear in fewer than 0.8% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words = stopwords, # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenizer with the text_preprocessing function
                        ngram_range=(2,2))      # extract bigrams only (unigram=1, bigram=2, trigram=3)

tfs22 = tfidf22.fit_transform(token_dict.values())

# Perform TF-IDF vectorization (bigrams and trigrams)
tfidf23 = TfidfVectorizer(max_df=0.8,             # drop terms that appear in more than 80% of documents (too common)
                        min_df=0.006,           # drop terms that appear in fewer than 0.6% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words = stopwords, # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenizer with the text_preprocessing function
                        ngram_range=(2,3))      # extract bigrams and trigrams (unigram=1, bigram=2, trigram=3)

tfs23 = tfidf23.fit_transform(token_dict.values())

In [136]:
# Check the matrix shape: rows = number of documents, columns = number of terms
print("tfs12 : ",tfs.shape)

tfs12 :  (678, 99)


In [137]:
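# Inspect the result.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# is its replacement, and .tolist() keeps the plain Python list used below.
feature_names = tfidf.get_feature_names_out().tolist()
print('Jumlah n-gram relevan: ', len(feature_names))   # number of relevant n-grams
print('n-gram temuan: ', feature_names)                # n-grams found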

Jumlah n-gram relevan:  99
n-gram temuan:  ['absen', 'absensi', 'activity', 'akses', 'android', 'aplikasi', 'apps', 'bagus', 'bantu', 'beda', 'bug', 'buka', 'butuh', 'cepat', 'check', 'check in', 'check out', 'data', 'desain', 'diarium', 'diarium mobile', 'dinas', 'dll', 'eror', 'error', 'event', 'fitur', 'fitur2', 'foto', 'friendly', 'fungsi', 'ganti', 'halaman', 'hang', 'hc', 'hp', 'in', 'info', 'info rekan', 'informasi', 'integrasi', 'ios', 'kadang', 'kait', 'kali', 'karyawan', 'kembang', 'kerja', 'klik', 'langsung', 'lbh', 'lengkap', 'loading', 'login', 'lot', 'manfaat', 'masuk', 'menu', 'mobile', 'moga', 'mohon', 'mudah', 'muncul', 'notifikasi', 'ok', 'optimal', 'otomatis', 'out', 'pilih', 'portal', 'presensi', 'rekan', 'responsif', 'server', 'sesuai', 'sesuai butuh', 'simple', 'sistem', 'smooth', 'stabil', 'suka', 'tahan', 'tampil', 'tampil tarik', 'tarik', 'telkom', 'terkadang', 'tingkat', 'tolong', 'ui', 'ui ux', 'ulang', 'update', 'update versi', 'user', 'user friendly', 'ux',

In [138]:
import pandas as pd

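# Print the IDF values. Note: tfidf.idf_ holds the corpus-wide IDF weight of each
# term, not per-document TF-IDF scores, even though the column is labelled "tf-idf".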
df_idf = pd.DataFrame(tfidf.idf_, index=feature_names,columns=["tf-idf"])

# sort ascending
df_idf = df_idf.sort_values(by=['tf-idf'])

print(df_idf)
#print(df_idf.head())
#print(df_idf.tail(10))

                tf-idf
diarium       3.151173
aplikasi      3.257941
fitur         3.477570
tampil        3.495269
update        3.495269
...                ...
sesuai butuh  5.441180
pilih         5.441180
optimal       5.441180
dinas         5.441180
langsung      5.441180

[99 rows x 1 columns]


In [139]:
# Show every row and column instead of the truncated default view
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df_idf)

                  tf-idf
diarium         3.151173
aplikasi        3.257941
fitur           3.477570
tampil          3.495269
update          3.495269
tingkat         3.608598
bagus           3.628801
menu            3.691980
mudah           3.883035
user            3.994261
karyawan        4.054885
akses           4.054885
tarik           4.119424
error           4.119424
kadang          4.153325
versi           4.153325
cepat           4.153325
check           4.188417
kembang         4.224784
moga            4.262525
out             4.342567
info            4.385127
muncul          4.429579
sesuai          4.429579
in              4.429579
kerja           4.429579
activity        4.524889
fungsi          4.524889
check in        4.576182
telkom          4.630249
otomatis        4.687408
lengkap         4.687408
masuk           4.687408
friendly        4.687408
mobile          4.687408
user friendly   4.687408
data            4.748032
rekan           4.748032
presensi        4.748032


Compare with the results for bigrams and trigrams.

In [140]:
# Check the matrix shapes: rows = number of documents, columns = number of terms
print("tfs12 : ",tfs.shape)
print("tfs22 : ",tfs22.shape)
print("tfs23 : ",tfs23.shape)

tfs12 :  (678, 99)
tfs22 :  (678, 11)
tfs23 :  (678, 16)


In [141]:
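# Inspect the results.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# is its replacement, and .tolist() keeps the plain Python lists used below.
feature_names22 = tfidf22.get_feature_names_out().tolist()
print('Jumlah bigram relevan: ', len(feature_names22))        # number of relevant bigrams
print('bigram temuan: ', feature_names22)                     # bigrams found

feature_names23 = tfidf23.get_feature_names_out().tolist()
print('\nJumlah bi/trigram relevan: ', len(feature_names23))  # number of relevant bi/trigrams
print('bi/trigram temuan: ', feature_names23)                 # bi/trigrams found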

Jumlah bigram relevan:  11
bigram temuan:  ['butuh karyawan', 'cepat akses', 'check in', 'check out', 'diarium mobile', 'info rekan', 'sesuai butuh', 'tampil tarik', 'update aplikasi', 'update versi', 'user friendly']

Jumlah bi/trigram relevan:  16
bi/trigram temuan:  ['butuh karyawan', 'cepat akses', 'check in', 'check in check', 'check out', 'diarium mobile', 'in check out', 'info rekan', 'menu diarium', 'sesuai butuh', 'tampil tarik', 'ui ux', 'update aplikasi', 'update otomatis', 'update versi', 'user friendly']
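
To see where terms such as `check in check` and `in check out` come from, here is a minimal sketch of word n-gram generation. The function `word_ngrams` and the token list are hypothetical; the vectorizer's own implementation is not shown in this notebook.

```python
def word_ngrams(tokens, min_n, max_n):
    # Build all contiguous n-grams for n in [min_n, max_n], joined by a space
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

demo_tokens = ["check", "in", "check", "out"]   # hypothetical preprocessed tokens
demo_ngrams = word_ngrams(demo_tokens, 2, 3)
print(demo_ngrams)
```

A document that mentions "check in" and "check out" back to back therefore contributes both overlapping trigrams, which is why they appear together in the bi/trigram vocabulary.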


In [142]:
# print idf values
df_idf22 = pd.DataFrame(tfidf22.idf_, index=feature_names22,columns=["tf-idf"])

# sort ascending
df_idf22 = df_idf22.sort_values(by=['tf-idf'])

print("TF_IDF bigram:")
print(df_idf22)

TF_IDF bigram:
                   tf-idf
check in         4.576182
user friendly    4.687408
check out        4.812571
info rekan       4.881564
tampil tarik     5.035714
diarium mobile   5.218036
update versi     5.323397
sesuai butuh     5.441180
butuh karyawan   5.574711
cepat akses      5.574711
update aplikasi  5.574711


In [143]:
# print idf values
df_idf23 = pd.DataFrame(tfidf23.idf_, index=feature_names23,columns=["tf-idf"])

# sort ascending
df_idf23 = df_idf23.sort_values(by=['tf-idf'])

print("TF_IDF bi/trigram:")
print(df_idf23)

TF_IDF bi/trigram:
                   tf-idf
check in         4.576182
user friendly    4.687408
check out        4.812571
info rekan       4.881564
tampil tarik     5.035714
diarium mobile   5.218036
update versi     5.323397
sesuai butuh     5.441180
butuh karyawan   5.574711
cepat akses      5.574711
update aplikasi  5.574711
check in check   5.728862
in check out     5.728862
menu diarium     5.728862
ui ux            5.728862
update otomatis  5.728862


## Step 04 : Fine Tuning Stopwords

We can use the results of the initial TF-IDF calculation to refine the next pass so that its output is more relevant.

In [144]:
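print("Sebelum fine tune : ", len(stopwords))   # before fine-tuning
# Domain words that are too generic to be informative in this corpus
stopwords_finetune = {"diarium","fitur","telkom","ok","dll"}

# stopwords and stopwords_all refer to the same list, so extending one updates both
stopwords_all.extend(stopwords_finetune)
print("Setelah fine tune : ", len(stopwords))   # after fine-tuning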

Sebelum fine tune :  844
Setelah fine tune :  849
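
One possible way to shortlist stopword candidates is to look at the terms with the lowest IDF, since those are the most widespread. The sketch below uses a handful of IDF values copied from the tables above; `lowest_idf_terms` is a hypothetical helper, and the final choice remains a judgment call (the notebook removes `diarium` and `fitur` but keeps `aplikasi`).

```python
# IDF values copied from the tables above
idf_values = {
    "diarium": 3.151173,
    "aplikasi": 3.257941,
    "fitur": 3.477570,
    "tampil": 3.495269,
    "check in": 4.576182,
    "user friendly": 4.687408,
}

def lowest_idf_terms(idf_by_term, k):
    # Sort by IDF ascending: the most widespread terms come first
    return sorted(idf_by_term, key=idf_by_term.get)[:k]

candidates = lowest_idf_terms(idf_values, 3)
print(candidates)
```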


Test the result of the fine-tuning.

In [145]:
# Perform TF-IDF vectorization again (bigrams only) with the fine-tuned stopword list
tfidf22 = TfidfVectorizer(max_df=0.8,             # drop terms that appear in more than 80% of documents (too common)
                        min_df=0.006,           # drop terms that appear in fewer than 0.6% of documents (too rare)
                        max_features=200000,    # keep at most 200,000 terms, ranked by term frequency across the corpus
                        stop_words = stopwords, # stopword list
                        use_idf=True,           # enable inverse-document-frequency reweighting
                        tokenizer=text_preprocessing, # replace the default tokenizer with the text_preprocessing function
                        ngram_range=(2,2))      # extract bigrams only (unigram=1, bigram=2, trigram=3)

tfs22 = tfidf22.fit_transform(token_dict.values())

# Check the matrix shape: rows = number of documents, columns = number of terms
print("tfs22 shape : ",tfs22.shape)


tfs22 shape :  (678, 12)


In [146]:
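# Inspect the result.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# is its replacement, and .tolist() keeps the plain Python list used below.
feature_names22 = tfidf22.get_feature_names_out().tolist()
print('Jumlah bigram relevan: ', len(feature_names22))   # number of relevant bigrams
print('bigram temuan: ', feature_names22)                # bigrams found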

Jumlah bigram relevan:  12
bigram temuan:  ['butuh karyawan', 'cepat akses', 'check in', 'check out', 'info rekan', 'sesuai butuh', 'tampil tarik', 'ui ux', 'update aplikasi', 'update otomatis', 'update versi', 'user friendly']


In [147]:
# print idf values
df_idf22 = pd.DataFrame(tfidf22.idf_, index=feature_names22,columns=["tf-idf"])

# sort ascending
df_idf22 = df_idf22.sort_values(by=['tf-idf'])

print("TF_IDF bigram:")
print(df_idf22)

TF_IDF bigram:
                   tf-idf
check in         4.576182
user friendly    4.687408
check out        4.812571
info rekan       4.881564
tampil tarik     5.035714
update versi     5.323397
sesuai butuh     5.441180
butuh karyawan   5.574711
cepat akses      5.574711
update aplikasi  5.574711
ui ux            5.728862
update otomatis  5.728862


## Step 05 : TF-IDF Transformation

We can use the fitted model to check whether the important terms it found also appear in other documents.

In [163]:
str1 = 'Update aplikasi jika bisa dipermudah. Kalau bisa hanya dengan satu tombol update. Saat ini agak repot jika untuk update aplikasi prosesnya seperti download - install ulang.'

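# Transform the new document with the fitted bigram model
response = tfidf22.transform([str1])

# Show the result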
print('\nHasil temuan str1:')
for col in response.nonzero()[1]:
    print (feature_names22[col], ' - ', response[0, col])
print('\nHasil response :', response.shape)
print('Hasil preprocess str1: ', text_preprocessing(str1))




Hasil temuan str1:
update aplikasi  -  1.0

Hasil response : (1, 12)
Hasil preprocess str1:  ['update', 'aplikasi', 'mudah', 'tombol', 'update', 'repot', 'update', 'aplikasi', 'proses', 'download', 'install', 'ulang']


In [164]:
str2 = '1. Info rekan = plis searching nya dipermudah (membaca keyword nya) 2. kalau update tolong yang lebih user friendly tanpa harus uninstall APK yang eksisting 3. HC Wiki = keyword nya diperbanyak'
response2 = tfidf22.transform([str2])

print('\nHasil temuan str2:')
for col in response2.nonzero()[1]:
    print (feature_names22[col], ' - ', response2[0, col])
print('\nHasil response:', response2.shape)
print('Hasil preprocess str2: ', text_preprocessing(str2))


Hasil temuan str2:
user friendly  -  0.6926169104272608
info rekan  -  0.7213056324403656

Hasil response: (1, 12)
Hasil preprocess str2:  ['info', 'rekan', 'plis', 'searching', 'mudah', 'baca', 'keyword', 'update', 'tolong', 'user', 'friendly', 'uninstall', 'apk', 'eksisting', 'hc', 'wiki', 'keyword', 'banyak']
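
The scores above can be reproduced by hand. By default `TfidfVectorizer` multiplies each term count by its IDF and then scales the document vector to unit Euclidean length (`norm='l2'`). The sketch below does this for the two bigrams matched in `str2`, using IDF values copied from the fine-tuned bigram table; the variable names are illustrative.

```python
import math

# IDF weights of the two matched bigrams, copied from the fine-tuned table above
idf = {"user friendly": 4.687408, "info rekan": 4.881564}
counts = {"user friendly": 1, "info rekan": 1}   # each bigram occurs once in str2

# Raw weight = term count * IDF, then divide by the vector's Euclidean length
raw = {term: counts[term] * idf[term] for term in idf}
norm = math.sqrt(sum(w * w for w in raw.values()))
scores = {term: w / norm for term, w in raw.items()}

for term, score in scores.items():
    print(term, round(score, 4))
```

The same normalization explains the score of 1.0 for `str1`: only one vocabulary bigram (`update aplikasi`) matches, so after scaling to unit length that single entry is 1.0 regardless of how often it occurs.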
