# **Assignment 8 - Implementing Document Search Based on Latent Semantic Analysis (LSA) with Cosine Similarity**

Name : Isnita Widyur Rahmah
Student ID (NIM) : 220411100048
Class : IF 7A

Project Link : https://github.com/nittyaa99/ppw

## Install Library

In [None]:
!pip install pandas numpy tqdm nltk Sastrawi scikit-learn

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


## Import Library

In [None]:
# Data processing
import pandas as pd
import numpy as np

# Text operations and progress bars
import re
from tqdm import tqdm

# Indonesian stopword removal and stemming
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# NLTK stopword list
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Text weighting and dimensionality reduction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Similarity computation
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Load Data

In [None]:
data = pd.read_csv('crawl_berita.csv')

data

Unnamed: 0,Judul,Isi,Tanggal,Kategori
0,Pertemuan Sri Mulyani-Prabowo Tak Banyak Bahas...,Wakil Menteri Keuangan II Thomas Djiwandono me...,"Rabu, 11 Sep 2024 18:10 WIB",Ekonomi
1,Pebisnis Minta Jokowi Cabut Larangan Jual Roko...,Gabungan pengusaha rokok dan petani tembakau m...,"Rabu, 11 Sep 2024 17:31 WIB",Ekonomi
2,IHSG Melemah Tipis ke 7.760 Sore Ini,Indeks Harga Saham Gabungan (IHSG) ditutup di ...,"Rabu, 11 Sep 2024 16:37 WIB",Ekonomi
3,Rupiah Menguat Rp15.402 per Dolar AS Usai Deba...,Nilai tukar rupiah berada di level Rp15.402 pe...,"Rabu, 11 Sep 2024 16:24 WIB",Ekonomi
4,Sri Mulyani Usai Nonton Timnas-Australia: Teri...,Menteri Keuangan Sri Mulyani berkomentar soal...,"Rabu, 11 Sep 2024 15:47 WIB",Ekonomi
...,...,...,...,...
95,Hasil Liga 1: PSM vs Persib Sama Kuat,PSM Makassar harus puas berbagi satu angka usa...,"Rabu, 11 Sep 2024 17:25 WIB",Olahraga
96,"Jokowi Beri Bonus Rp36,25 Miliar ke Peraih Med...",Presiden Joko Widodo (Jokowi) menyerahkan bonu...,"Rabu, 11 Sep 2024 17:13 WIB",Olahraga
97,Megawati Ungkap Target di Liga Korea: Jadi Pem...,Megawati Hangestri Pertiwi mengungkapkan targe...,"Rabu, 11 Sep 2024 16:49 WIB",Olahraga
98,Media Vietnam: Indonesia Buat Kejutan Besar La...,Media Vietnam memuji performa Timnas Indonesia...,"Rabu, 11 Sep 2024 16:24 WIB",Olahraga


## Combining Title and Content
The title is prepended to the article body so that title terms are part of each document's text. The goal is to return more relevant results when a user searches for documents by keyword.

In [None]:
titles = data['Judul']
contents = data['Isi']

data['Isi'] = titles + " " + contents

data['Isi']

Unnamed: 0,Isi
0,Pertemuan Sri Mulyani-Prabowo Tak Banyak Bahas...
1,Pebisnis Minta Jokowi Cabut Larangan Jual Roko...
2,IHSG Melemah Tipis ke 7.760 Sore Ini Indeks Ha...
3,Rupiah Menguat Rp15.402 per Dolar AS Usai Deba...
4,Sri Mulyani Usai Nonton Timnas-Australia: Teri...
...,...
95,Hasil Liga 1: PSM vs Persib Sama Kuat PSM Maka...
96,"Jokowi Beri Bonus Rp36,25 Miliar ke Peraih Med..."
97,Megawati Ungkap Target di Liga Korea: Jadi Pem...
98,Media Vietnam: Indonesia Buat Kejutan Besar La...


## Converting All Uppercase Letters to Lowercase

In [None]:
def clean_lower(text):
    if isinstance(text, str):
        return text.lower()
    return text

data['lower case'] = data['Isi'].apply(clean_lower)
casefolding = pd.DataFrame(data['lower case'])

data['lower case']

Unnamed: 0,lower case
0,pertemuan sri mulyani-prabowo tak banyak bahas...
1,pebisnis minta jokowi cabut larangan jual roko...
2,ihsg melemah tipis ke 7.760 sore ini indeks ha...
3,rupiah menguat rp15.402 per dolar as usai deba...
4,sri mulyani usai nonton timnas-australia: teri...
...,...
95,hasil liga 1: psm vs persib sama kuat psm maka...
96,"jokowi beri bonus rp36,25 miliar ke peraih med..."
97,megawati ungkap target di liga korea: jadi pem...
98,media vietnam: indonesia buat kejutan besar la...


## Removing Symbols and Digits from the Text

In [None]:
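# Every character that is not a lowercase ASCII letter or a space
# (digits, punctuation, other symbols) is replaced with a space.
# The pattern is compiled once instead of on every call.
CLEAN_PATTERN = re.compile(r'[^a-z ]')

def clean_punct(text):
    if isinstance(text, str):
        return CLEAN_PATTERN.sub(' ', text)
    return text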

data['tanda baca'] = data['lower case'].apply(clean_punct)

data['tanda baca']

Unnamed: 0,tanda baca
0,pertemuan sri mulyani prabowo tak banyak bahas...
1,pebisnis minta jokowi cabut larangan jual roko...
2,ihsg melemah tipis ke sore ini indeks harga sa...
3,rupiah menguat rp per dolar as usai debat trum...
4,sri mulyani usai nonton timnas australia terim...
...,...
95,hasil liga psm vs persib sama kuat psm makassa...
96,jokowi beri bonus rp miliar ke peraih medali p...
97,megawati ungkap target di liga korea jadi pema...
98,media vietnam indonesia buat kejutan besar law...


## Collapsing Repeated Whitespace and Trimming Leading and Trailing Spaces

In [None]:
def _normalize_whitespace(text):
    if isinstance(text, str):
        corrected = re.sub(r'\s+', ' ', text)
        return corrected.strip()
    return text

data['spasi'] = data['tanda baca'].apply(_normalize_whitespace)
data['spasi']

Unnamed: 0,spasi
0,pertemuan sri mulyani prabowo tak banyak bahas...
1,pebisnis minta jokowi cabut larangan jual roko...
2,ihsg melemah tipis ke sore ini indeks harga sa...
3,rupiah menguat rp per dolar as usai debat trum...
4,sri mulyani usai nonton timnas australia terim...
...,...
95,hasil liga psm vs persib sama kuat psm makassa...
96,jokowi beri bonus rp miliar ke peraih medali p...
97,megawati ungkap target di liga korea jadi pema...
98,media vietnam indonesia buat kejutan besar law...
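
The lowercasing, symbol removal, and whitespace steps above can be illustrated on a single headline from the dataset. This is a minimal sketch that re-implements the same regular expressions in isolation so it runs without pandas; the helper name `clean_and_normalize` is hypothetical and not part of the notebook.

```python
import re

# Same idea as clean_punct: keep only lowercase ASCII letters and spaces.
NON_LETTER = re.compile(r'[^a-z ]')
# Same idea as _normalize_whitespace: collapse any run of whitespace.
MULTI_SPACE = re.compile(r'\s+')

def clean_and_normalize(text):
    """Hypothetical helper: lowercase, drop symbols/digits, tidy whitespace."""
    text = text.lower()
    text = NON_LETTER.sub(' ', text)            # digits and punctuation become spaces
    return MULTI_SPACE.sub(' ', text).strip()   # leftover gaps are collapsed

sample = "Rupiah Menguat Rp15.402 per Dolar AS"
print(clean_and_normalize(sample))  # rupiah menguat rp per dolar as
```

Note that a number such as `15.402` disappears completely, so a query containing only digits would end up empty after preprocessing.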


## Removing Stopwords to Reduce the Number of Words per Document

In [None]:
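# Build the NLTK Indonesian stopword set once instead of on every call
INDONESIAN_STOPWORDS = set(stopwords.words('indonesian'))

def clean_stopwords(text):
    if isinstance(text, str):
        # Keep only the tokens that are not stopwords
        return ' '.join(word for word in text.split() if word not in INDONESIAN_STOPWORDS)
    return text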

data['stopwords'] = data['spasi'].apply(clean_stopwords)
data['stopwords']

Unnamed: 0,stopwords
0,pertemuan sri mulyani prabowo bahas makan berg...
1,pebisnis jokowi cabut larangan jual rokok mete...
2,ihsg melemah tipis sore indeks harga saham gab...
3,rupiah menguat rp dolar as debat trump harris ...
4,sri mulyani nonton timnas australia terima kas...
...,...
95,hasil liga psm vs persib kuat psm makassar pua...
96,jokowi bonus rp miliar peraih medali paralimpi...
97,megawati target liga korea pemain asing terbai...
98,media vietnam indonesia kejutan lawan raksasa ...
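
The filtering step can be sketched without NLTK by using a tiny hand-picked stopword set. The set below is an assumption for illustration only (the notebook uses NLTK's full Indonesian list), and `remove_stopwords` is a hypothetical name.

```python
# Tiny illustrative stopword set (an assumption, not NLTK's actual list).
STOPWORDS = {"tak", "banyak", "ke", "ini", "per", "usai"}

def remove_stopwords(text, stopword_set=STOPWORDS):
    # Set membership is O(1) per token, so the set is built once and reused.
    return ' '.join(w for w in text.split() if w not in stopword_set)

print(remove_stopwords("pertemuan sri mulyani prabowo tak banyak bahas"))
# pertemuan sri mulyani prabowo bahas
```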


## Reducing Words to Their Base Form (Stemming)

In [None]:
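# Create the Sastrawi stemmer once; building it for every document is slow
stemmer = StemmerFactory().create_stemmer()

def sastrawistemmer(text):
    if isinstance(text, str):
        # Stem each token; tqdm prints one progress bar per document
        return ' '.join(stemmer.stem(word) for word in tqdm(text.split()))
    return text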

data['stemming'] = data['stopwords'].apply(sastrawistemmer)

data['stemming']

100%|██████████| 218/218 [00:05<00:00, 43.02it/s]


Unnamed: 0,stemming
0,temu sri mulyani prabowo bahas makan gizi grat...
1,bisnis jokowi cabut larang jual rokok meter se...
2,ihsg lemah tipis sore indeks harga saham gabun...
3,rupiah kuat rp dolar as debat trump harris nil...
4,sri mulyani nonton timnas australia terima kas...
...,...
95,hasil liga psm vs persib kuat psm makassar pua...
96,jokowi bonus rp miliar raih medali paralimpiad...
97,megawati target liga korea main asing baik meg...
98,media vietnam indonesia kejut lawan raksasa as...


## Transforming Text into a TF-IDF Matrix with TfidfVectorizer
TF-IDF (term frequency-inverse document frequency) is a weighting scheme used in NLP to measure how important a word is to one document relative to all the other documents in the corpus.

In [None]:
tfidf_vectorizer = TfidfVectorizer()

corpus = data['stemming'].tolist()
x_tfidf = tfidf_vectorizer.fit_transform(corpus)

df_tfidf = pd.DataFrame(x_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf

Unnamed: 0,abroad,absolut,acara,achmad,acu,adab,adam,adaptif,adb,adi,...,yoppy,yuan,yudha,yuran,yusuf,zayana,zona,zonasi,zulhas,zulkifli
0,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
1,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.053357,0.0,0.0
2,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.043733,0.000000,0.0,0.0
3,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.048855,0.000000,0.000000,0.0,0.000000,0.078973,0.000000,0.0,0.0
4,0.0,0.052684,0.0,0.0,0.0,0.0,0.052684,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.060259,0.0,0.000000,0.000000,0.000000,0.0,0.0
96,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.039231,0.000000,0.0,0.078461,0.000000,0.000000,0.0,0.0
97,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
98,0.0,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.000000,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0
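
To make the weights in the matrix above less opaque, the following sketch computes TF-IDF by hand on three made-up mini documents. It assumes the smoothed IDF formula and L2 normalization that scikit-learn's `TfidfVectorizer` uses by default; the documents and the helper name `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Three made-up, already preprocessed mini documents.
docs = [
    "rupiah kuat dolar",
    "ihsg lemah saham",
    "rupiah lemah dolar dolar",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    tf = Counter(tokens)  # raw term counts
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}  # L2-normalized row

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, `kuat` occurs in only one document of the toy corpus, so it receives a higher weight than `rupiah` and `dolar`, which each occur in two.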


## Reducing the Dimensionality of Text Data with SVD
Latent Semantic Analysis (LSA) is applied here by running Singular Value Decomposition (SVD) on the TF-IDF matrix computed in the previous step. LSA reduces the dimensionality of the text data and uncovers hidden (latent) relationships between words and documents in a corpus.

In [None]:
# Apply truncated SVD for LSA
n_components = 100  # Adjust as needed
svd = TruncatedSVD(n_components=n_components, random_state=42)
lsa_matrix = svd.fit_transform(x_tfidf)

# Extract the term-side factor and the singular values
v_matrix = svd.components_  # V^T: one row per component, one column per term
singular_values = svd.singular_values_  # Singular values (diagonal of Sigma)

# Number of leading components of V^T and Sigma to keep
partial_n = 50  # For example, keep only the top 50 components

# Slice V^T and Sigma accordingly
partial_v_matrix = v_matrix[:partial_n]  # First 50 rows of V^T
partial_sigma = np.diag(singular_values[:partial_n])  # Diagonal matrix of the 50 largest singular values

In [None]:
data_svd = pd.DataFrame(lsa_matrix)

In [None]:
data_svd

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.099578,0.589178,-0.018286,-0.044352,-0.007629,-0.024987,-0.006220,-0.014348,0.022595,0.036825,...,-1.067939e-17,-2.512639e-17,-1.496199e-17,-2.797242e-17,-2.217193e-17,-2.406929e-17,6.884684e-18,4.998172e-17,1.355253e-18,-8.917563e-18
1,0.089575,0.148609,0.020872,0.199788,-0.244637,-0.122914,-0.239428,0.064962,0.450071,0.283107,...,-8.565197e-18,7.155734e-18,-5.312591e-18,2.200930e-17,-2.818926e-18,-2.097931e-17,1.816039e-17,1.680513e-18,-2.439455e-18,-8.619407e-17
2,0.056072,0.040851,0.037627,0.592620,0.402106,-0.131675,0.241172,-0.041677,-0.113225,0.003894,...,-1.096399e-17,1.642566e-17,-2.764716e-17,-2.303930e-18,1.517883e-18,1.951564e-18,-6.938894e-18,1.379647e-17,-1.355253e-17,-2.507506e-16
3,0.100437,0.062139,0.013606,0.612722,0.429134,-0.092178,0.108567,-0.030567,-0.094085,-0.040650,...,-4.255494e-17,8.768485e-18,9.540979e-18,-1.104531e-17,1.433857e-17,1.023216e-18,1.767250e-17,-1.426403e-17,2.981556e-18,1.590321e-16
4,0.612329,-0.003007,-0.017408,-0.085261,-0.111041,-0.021704,0.425391,0.012077,0.167411,-0.002976,...,-7.177418e-17,-2.059984e-18,-3.230922e-17,2.794531e-17,1.050321e-17,-4.092863e-17,-6.356135e-17,-3.573801e-17,7.100169e-17,-1.721442e-16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.253708,-0.032547,0.616369,-0.090914,0.147317,0.053425,-0.069635,-0.308917,0.290630,-0.383066,...,8.131516e-19,1.029992e-18,-2.428613e-17,-4.130810e-17,-1.084202e-19,-2.168404e-18,3.672735e-18,8.510987e-18,9.161508e-18,4.109126e-17
96,0.115517,0.035815,0.061040,0.149445,-0.202105,-0.236757,-0.160461,-0.427468,0.107241,0.185644,...,-9.676504e-18,-2.706440e-17,1.214306e-17,1.336279e-17,2.786400e-17,-9.161508e-18,-6.776264e-19,-8.510987e-18,-4.092863e-18,1.460827e-16
97,0.158220,-0.018249,0.241176,0.014005,0.091386,0.022298,-0.129911,-0.028739,-0.040590,0.215609,...,5.204170e-18,-3.523657e-19,-3.903128e-18,2.969359e-17,7.209944e-18,3.577867e-17,1.707618e-18,3.783866e-17,-3.388132e-18,-6.850802e-17
98,0.620968,-0.102329,-0.228453,-0.011914,0.194992,0.262029,-0.379488,0.017257,-0.000655,0.034103,...,-4.119968e-18,1.376937e-17,-8.077306e-18,-2.209062e-17,6.613633e-18,-1.160096e-17,-1.084202e-17,6.938894e-18,-3.355606e-17,-1.436568e-16
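
The last columns of the output above are on the order of 1e-17, which is numerically zero. A plausible explanation is that an SVD has at most min(number of documents, number of terms) non-zero singular values, and the row index of this corpus only runs from 0 to 98. The sketch below shows the same effect with NumPy on a made-up 3 x 5 matrix, together with one possible way to pick the number of components from the singular values; the 80% threshold is an arbitrary assumption.

```python
import numpy as np

# Toy document-term matrix: 3 documents x 5 terms (values are made up).
X = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(s.shape)  # at most min(3, 5) = 3 singular values

# Share of the total squared singular values captured by the first k components.
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.80) + 1)  # smallest k that reaches 80%

# Document coordinates in the k-dimensional latent space (U_k scaled by Sigma_k).
docs_k = U[:, :k] * s[:k]
print(k, docs_k.shape)
```

In the notebook, the same cumulative ratio could be computed from `svd.singular_values_` to choose `partial_n` instead of fixing it at 50.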


## Save Data

In [None]:
data_svd.to_csv('data_svd.csv', index=False)

## Preprocessing the Query

In [None]:
def preprocess_query(query):
    query = clean_lower(query)
    query = clean_punct(query)
    query = _normalize_whitespace(query)
    query = clean_stopwords(query)
    query = sastrawistemmer(query)

    return query

## Similarity-Based Document Search
Cosine similarity measures how similar two vectors are in a vector space. It is computed as the cosine of the angle between the two vectors.
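
For two vectors u and v, the score is the dot product divided by the product of their lengths, cos(θ) = (u · v) / (‖u‖ ‖v‖). The following is a minimal pure-Python sketch of that formula; the function name `cosine` and the choice to return 0.0 for a zero vector are assumptions made for this illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine([1, 0], [0, 3]))   # 0.0  (orthogonal)
print(cosine([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

Because LSA coordinates can be negative, scores in the latent space can fall below zero, which is why filtering on a score greater than 0 discards some documents entirely.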

In [None]:
def search_documents(query):
    # Apply the same preprocessing to the query as to the documents
    query = preprocess_query(query)

    # Project the query into the LSA space
    query_tfidf = tfidf_vectorizer.transform([query])
    query_lsa = svd.transform(query_tfidf)

    # Cosine similarity between every document and the query,
    # using only the first partial_n LSA components
    similarities = cosine_similarity(lsa_matrix[:, :partial_n], query_lsa[:, :partial_n]).flatten()

    # Keep only the documents with a similarity score > 0
    top_indices = [i for i, score in enumerate(similarities) if score > 0]

    # Build a DataFrame with the search results
    results = pd.DataFrame({
        'Judul': titles.iloc[top_indices].values,
        'Isi': contents.iloc[top_indices].values,
        'Skor Kemiripan': similarities[top_indices]
    })

    # Sort by similarity score, highest first
    results = results.sort_values(by='Skor Kemiripan', ascending=False).reset_index(drop=True)

    return results

In [None]:
query = "sepak bola"
results = search_documents(query)

results

100%|██████████| 2/2 [00:00<00:00, 611.10it/s]


Unnamed: 0,Judul,Isi,Skor Kemiripan
0,Netizen Australia Kecewa Berat setelah Ditahan...,Jika suporter Indonesia begitu gegap gempita m...,0.8064478
1,Netizen Australia Kecewa Berat setelah Ditahan...,Jika suporter Indonesia begitu gegap gempita m...,0.8064478
2,Hasil Liga 1: Bali United vs Arema Tanpa Pemenang,Bali United dan Arema FC bermain imbang dalam ...,0.2251146
3,Hasil Liga 1: Bali United vs Arema Tanpa Pemenang,Bali United dan Arema FC bermain imbang dalam ...,0.2251146
4,Hasil Liga 1: Bali United vs Arema Tanpa Pemenang,Bali United dan Arema FC bermain imbang dalam ...,0.2251146
5,5 Fakta Maarten Paes Kawal Gawang Indonesia da...,Maarten Paes tampil gemilang di bawah mistar T...,0.2193764
6,5 Fakta Maarten Paes Kawal Gawang Indonesia da...,Maarten Paes tampil gemilang di bawah mistar T...,0.2193764
7,5 Fakta Maarten Paes Kawal Gawang Indonesia da...,Maarten Paes tampil gemilang di bawah mistar T...,0.2193764
8,Sri Mulyani Usai Nonton Timnas-Australia: Teri...,Menteri Keuangan Sri Mulyani berkomentar soal...,0.1621123
9,Sri Mulyani Usai Nonton Timnas-Australia: Teri...,Menteri Keuangan Sri Mulyani berkomentar soal...,0.1621123
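
The results above list the same article several times with identical scores, which suggests that the crawled CSV contains duplicated rows (an assumption, since the full file is not shown here). One possible remedy is to drop duplicates right after loading the data, before any preprocessing. The sketch below uses a made-up toy frame in place of `crawl_berita.csv`.

```python
import pandas as pd

# Toy frame standing in for crawl_berita.csv (rows are made up).
data = pd.DataFrame({
    'Judul': ['A', 'A', 'B'],
    'Isi': ['text a', 'text a', 'text b'],
})

# Keep the first occurrence of each (title, body) pair and renumber the rows
# so that positional indices still line up with the rows of the TF-IDF matrix.
deduped = data.drop_duplicates(subset=['Judul', 'Isi']).reset_index(drop=True)
print(len(data), len(deduped))  # 3 2
```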
