# **TEXT MINING 6 - TERM FREQUENCY & INVERSE DOCUMENT FREQUENCY**

Build Date | April 18th, 2020
--- | ---
Male Tutor | Gustian Herlambang & Pahmi Alifya Bahri
Female Tutor | Siti Rahmah & Imelda Putri Anggraini

## **A. PROCESS**

### **1. Import Library**

This time we will use three libraries:
1. `re`
2. `Sastrawi`, and
3. `Scikit-Learn`

The classes we will use are:
1. `CountVectorizer`,
2. `TfidfVectorizer`,
3. `StemmerFactory`, and
4. `StopWordRemoverFactory`.

*Note: Python naming and code-layout conventions are described in `PEP 8`; read more at https://www.python.org/dev/peps/pep-0008/*

In [1]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

### **2. Open Text**

Before running the code below, make sure you have already scraped your text documents, saved each one as a `txt` file, and uploaded them to the `Jupyter Notebook` working directory. This time we will use 5 documents, which the code expects to be named `source1.txt` through `source5.txt`.

How to prepare the documents:
1. Open a web browser.
2. Pick a topic, for example *the internet*.
3. Find any article on that topic, **copy** its paragraphs, and **paste** them into Notepad.
4. Save the file as `txt`.
5. Upload all the `txt` files to Jupyter.
6. Only then run the code below.
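
Once the files are uploaded, they can also be read in a loop rather than one at a time. The following is a minimal sketch, assuming the files match the pattern `source*.txt` and are UTF-8 encoded; the helper name `load_documents` and the temporary demo folder are hypothetical and only there so the example runs on its own.

```python
import tempfile
from pathlib import Path


def load_documents(folder, pattern="source*.txt"):
    """Return the text of every matching file, ordered by file name."""
    paths = sorted(Path(folder).glob(pattern))
    return [p.read_text(encoding="utf-8") for p in paths]


# Demo with two throwaway files instead of the real scraped articles.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "source1.txt").write_text("Internet adalah jaringan.", encoding="utf-8")
    Path(tmp, "source2.txt").write_text("Akses internet semakin cepat.", encoding="utf-8")
    demo_docs = load_documents(tmp)

print(len(demo_docs))  # 2
```

File names are sorted as plain strings, so with ten or more files `source10.txt` would come before `source2.txt`; for the five documents used here the order is unaffected.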


In [2]:
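# Split on runs of whitespace so repeated spaces or newlines do not yield empty tokens
clear_character = re.compile(r"\s+")
def tokenize(text):
    # Lowercase each token and drop empty strings left at the start or end
    return [token.lower() for token in clear_character.split(text) if token]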
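
To see what a whitespace tokenizer of this kind returns, here is a small standalone sketch. The name `tokenize_demo` and the sample sentence are made up for illustration. It also shows why splitting on a single whitespace character can leave empty strings in the result.

```python
import re

whitespace_run = re.compile(r"\s+")


def tokenize_demo(text):
    # Split on runs of whitespace, lowercase, and drop empty strings
    return [tok.lower() for tok in whitespace_run.split(text) if tok]


sample = "Internet  adalah\njaringan komputer "
demo_tokens = tokenize_demo(sample)
print(demo_tokens)
# ['internet', 'adalah', 'jaringan', 'komputer']

# Splitting on ONE whitespace character keeps an empty string between two spaces
single_split = re.split(r"\s", "a  b")
print(single_split)
# ['a', '', 'b']
```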

In [3]:
def read_text(path):
    # Read the whole file; the with-block closes it even if reading fails
    with open(path, "r") as file:
        return file.read()

doc0 = read_text("source1.txt")
doc1 = read_text("source2.txt")
doc2 = read_text("source3.txt")
doc3 = read_text("source4.txt")
doc4 = read_text("source5.txt")

### **3. Case Folding**

In [4]:
# remove punctuation (each mark is deleted, so words joined only by "-" or "." get merged)
tandabaca = [".",",","-","%","(",")","?"]
for td in tandabaca:
    doc0 = doc0.replace(td, "")
    doc1 = doc1.replace(td, "")
    doc2 = doc2.replace(td, "")
    doc3 = doc3.replace(td, "")
    doc4 = doc4.replace(td, "")

In [5]:
# remove numbers
docs0 = re.sub(r"\d+", " ", doc0)
docs1 = re.sub(r"\d+", " ", doc1)
docs2 = re.sub(r"\d+", " ", doc2)
docs3 = re.sub(r"\d+", " ", doc3)
docs4 = re.sub(r"\d+", " ", doc4)
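
The two cells above strip punctuation and digits; the lowercasing itself happens later, inside `tokenize`. As a sketch of how the three steps could be bundled into one helper, the example below uses a hypothetical function `clean_text`. Note that it differs from the cells above in two assumed ways: it covers every character in `string.punctuation`, and it replaces punctuation with a space rather than deleting it, so hyphenated words are split instead of merged.

```python
import re
import string

# Map every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text):
    """Lowercase, then replace punctuation and digits with spaces."""
    text = text.lower()                       # case folding
    text = text.translate(PUNCT_TO_SPACE)     # punctuation -> space
    text = re.sub(r"\d+", " ", text)          # digits -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces


cleaned = clean_text("E-Government 2020: akses (cepat), 100% online.")
print(cleaned)
# e government akses cepat online
```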

### **4. Stopwords**

In [6]:
#stopwords

factory = StopWordRemoverFactory()
stopword = factory.create_stop_word_remover()

stopdocs0 = stopword.remove(docs0)
stopdocs1 = stopword.remove(docs1)
stopdocs2 = stopword.remove(docs2)
stopdocs3 = stopword.remove(docs3)
stopdocs4 = stopword.remove(docs4)
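
Conceptually, stopword removal drops every word that appears in a list of very common words. The sketch below illustrates the idea in plain Python. `DEMO_STOPWORDS` is a tiny hand-picked list assumed for the example, not the list that Sastrawi ships, and `remove_stopwords` is a hypothetical helper.

```python
# Tiny illustrative list; NOT Sastrawi's real stopword list.
DEMO_STOPWORDS = {"yang", "dan", "di", "adalah", "dengan"}


def remove_stopwords(text, stopwords=DEMO_STOPWORDS):
    """Drop every whitespace-separated word found in the stopword set."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


filtered = remove_stopwords("internet adalah jaringan yang luas dan cepat")
print(filtered)
# internet jaringan luas cepat
```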

### **5. Stemming**

In [7]:
# stemming 

factory = StemmerFactory()
stemmer = factory.create_stemmer()

stem1 = stemmer.stem(stopdocs0)
stem2 = stemmer.stem(stopdocs1)
stem3 = stemmer.stem(stopdocs2)
stem4 = stemmer.stem(stopdocs3)
stem5 = stemmer.stem(stopdocs4)

### **6. Tokenization**

In [8]:
#tokenization
train_set = [stem1,stem2,stem3,stem4,stem5]

In [9]:
count_vectorizer = CountVectorizer(tokenizer=tokenize)
data = count_vectorizer.fit_transform(train_set).toarray()
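# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
vocab = count_vectorizer.get_feature_names_out().tolist()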
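
What `CountVectorizer` builds here is a document-term count matrix: one row per document, one column per vocabulary term. A minimal plain-Python sketch of the same idea follows, using three made-up toy documents and whitespace splitting; the names `demo_docs`, `vocab_demo`, and `counts` are illustrative only.

```python
from collections import Counter

demo_docs = ["data internet data", "akses internet", "data akses akses"]

# Sorted vocabulary of all distinct tokens, one column per term
vocab_demo = sorted({tok for doc in demo_docs for tok in doc.split()})

# One row per document, one count per vocabulary term
counts = []
for doc in demo_docs:
    freq = Counter(doc.split())
    counts.append([freq[term] for term in vocab_demo])

print(vocab_demo)  # ['akses', 'data', 'internet']
print(counts)      # [[0, 2, 1], [1, 0, 1], [2, 1, 0]]
```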

### **7. TF - IDF**

In [10]:
print("Term Frequency Counts")
print(data)

Term Frequency Counts
[[1 0 3 ... 0 1 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 5 0 0]
 [0 3 0 ... 0 0 1]]


In [11]:
print("Feature Vector")
print(vocab)

Feature Vector
['accumulative', 'aceh', 'ada', 'adalah', 'adi', 'adu', 'ahp', 'aju', 'akan', 'akhir', 'akses', 'aksi', 'akurasi', 'alternatif', 'ambil', 'analisa', 'analytical', 'anggap', 'antara', 'apache', 'aparatur', 'atas', 'awal', 'badan', 'bagi', 'baik', 'bakat', 'balap', 'banding', 'bangun', 'bantu', 'banyak', 'bas', 'basis', 'bayes', 'beberapa', 'beda', 'belimbing', 'benar', 'beri', 'besar', 'biasa', 'bidang', 'bisa', 'bkd', 'blue', 'bobot', 'borda', 'buah', 'buka', 'bukti', 'burundi', 'butuh', 'calon', 'cctv', 'change', 'cipta', 'circuit', 'ciri', 'citra', 'closed', 'cocok', 'coratcoret', 'coretcoret', 'cropping', 'daerah', 'dalam', 'dan', 'dapat', 'dari', 'dasar', 'data', 'database', 'dekat', 'dengan', 'detection', 'deteksi', 'di', 'diamana', 'differences', 'dinding', 'dindingdinding', 'diri', 'dukung', 'dunia', 'e', 'efektif', 'efisien', 'egovernment', 'ekspektasi', 'ekstraksi', 'elektronik', 'end', 'fitur', 'front', 'fungsi', 'gabung', 'gam', 'gatewayhasil', 'gdss', 'gera'

In [12]:
tfidf = TfidfVectorizer().fit_transform(train_set)
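
With its default settings (`smooth_idf=True`, `norm="l2"`), `TfidfVectorizer` weights each raw count by `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`, and then scales every row to unit Euclidean length. The sketch below reproduces that arithmetic in plain Python on three made-up documents; all names in it are illustrative. Note also that the cell above calls `TfidfVectorizer()` without the custom `tokenize` function, so its column indices can differ from those of `vocab`.

```python
import math
from collections import Counter

demo_docs = ["data internet data", "akses internet", "internet cepat"]
tokens = [doc.split() for doc in demo_docs]
vocab_demo = sorted({t for doc in tokens for t in doc})
n_docs = len(tokens)

# Document frequency: in how many documents each term occurs
df = {term: sum(term in doc for doc in tokens) for term in vocab_demo}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab_demo}


def tfidf_row(doc_tokens):
    """Raw count times IDF, then scale the row to unit Euclidean length."""
    tf = Counter(doc_tokens)
    raw = [tf[term] * idf[term] for term in vocab_demo]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


tfidf_demo = [tfidf_row(doc) for doc in tokens]
for row in tfidf_demo:
    print([round(v, 3) for v in row])
```

Because "internet" occurs in every toy document, its IDF is exactly 1, the lowest value possible, while terms found in only one document receive a higher weight.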

In [13]:
print("TF-IDF Weights")
print(tfidf)

TF-IDF Weights
  (0, 12)	0.048398203571365365
  (0, 273)	0.04017488095286213
  (0, 39)	0.05998832584107542
  (0, 7)	0.05998832584107542
  (0, 107)	0.05998832584107542
  (0, 164)	0.05998832584107542
  (0, 250)	0.05998832584107542
  (0, 241)	0.05998832584107542
  (0, 193)	0.05998832584107542
  (0, 36)	0.05998832584107542
  (0, 289)	0.05998832584107542
  (0, 144)	0.05998832584107542
  (0, 35)	0.05998832584107542
  (0, 278)	0.04017488095286213
  (0, 220)	0.048398203571365365
  (0, 28)	0.05998832584107542
  (0, 64)	0.05998832584107542
  (0, 259)	0.05998832584107542
  (0, 75)	0.05998832584107542
  (0, 55)	0.05998832584107542
  (0, 119)	0.05998832584107542
  (0, 115)	0.05998832584107542
  (0, 4)	0.05998832584107542
  (0, 116)	0.05998832584107542
  (0, 79)	0.05998832584107542
  :	:
  (4, 52)	0.07117616412913944
  (4, 165)	0.03558808206456972
  (4, 204)	0.07117616412913944
  (4, 221)	0.07117616412913944
  (4, 30)	0.03558808206456972
  (4, 14)	0.10676424619370914
  (4, 136)	0.059082645833188785

In [14]:
# compare document 1 and document 5
from sklearn.metrics.pairwise import manhattan_distances
doc1_vect = tfidf[0].reshape(1, -1) # document 1
doc5_vect = tfidf[4].reshape(1, -1) # document 5

# compute the Manhattan (L1) distance; this is a distance, not a similarity percentage
distance = manhattan_distances(doc1_vect, doc5_vect)
print("Manhattan distance between doc1 and doc5: {:.4f}".format(distance.item(0)))

Manhattan distance between doc1 and doc5: 12.5454
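
Manhattan distance grows as two documents differ and has no fixed upper bound, so it cannot be read directly as a similarity percentage. One common bounded alternative is cosine similarity, which for non-negative TF-IDF vectors lies between 0 and 1. The sketch below compares the two measures in plain Python on two made-up vectors; it is an illustration of the difference, not the method used in this notebook. scikit-learn provides `cosine_similarity` in `sklearn.metrics.pairwise` for the same purpose.

```python
import math


def manhattan_distance(a, b):
    """Sum of absolute differences; 0 means identical vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product over the product of lengths; 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vec_a = [0.0, 0.6, 0.8]
vec_b = [0.6, 0.0, 0.8]

dist = manhattan_distance(vec_a, vec_b)
sim = cosine_similarity(vec_a, vec_b)
print("distance: {:.2f}".format(dist))   # distance: 1.20
print("similarity: {:.2%}".format(sim))  # similarity: 64.00%
```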


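## **C. Conclusion**

The Manhattan distance between the TF-IDF vectors of document 1 and document 5 is about **12.55**. This value is a distance, not a similarity percentage: the larger it is, the less alike the two documents are, and it is not bounded by 100%.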