# Midterm Assignment (UTS): Keyword Extraction on a News Article

This midterm (UTS) assignment asks for a graph structure to be built in order to find the most frequently occurring words (keyword extraction) in a single news article.
Created by:

*   Name: Sabil Ahmad Hidayat
*   NIM (student ID): 220411100058
*   Class: PPW A

Code link: https://colab.research.google.com/drive/1J7JcHiGjk45g-yObNPwWZ1YyeJwogw6J?usp=sharing

GitHub link: https://github.com/meinhere/ppw/tree/master/publish/tugas-uts

## Import Libraries

In [1]:
!pip install -q Sastrawi




In [2]:
# crawling tools
from urllib.request import urlopen
from bs4 import BeautifulSoup

# core libraries
import pandas as pd
import numpy as np

# preprocessing
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import sent_tokenize

# libraries for centrality
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# progress monitoring
from tqdm import tqdm

# plotting libraries
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns

## Prepare Data

### Crawl Data

Crawl a single online news article on which keyword extraction will be performed. The article is taken from the KOMPAS.com website.

In [None]:
# URL of the page to crawl
url = 'https://money.kompas.com/read/2024/10/10/071000426/sebut-kondisi-ekonomi-ri-positif-prabowo--kita-sering-kurang-bersyukur-'

html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

div = soup.find("div", {"class": "read__content"})
paragraf = div.find_all("p")

### Text Preprocessing Functions

In [None]:
# Case folding
def clean_lower(lwr):
    lwr = lwr.lower()  # convert the text to lowercase
    return lwr

# Remove punctuation, digits, and symbols
def clean_punct(text):
    clean_spcl = re.compile(r'[/(){}\[\]\|@,;_]')
    clean_symbol = re.compile(r'[^0-9a-z]')
    clean_number = re.compile(r'[0-9]')
    text = clean_spcl.sub('', text)     # delete these characters outright
    text = clean_symbol.sub(' ', text)  # replace any other non-alphanumeric character with a space
    text = clean_number.sub('', text)   # delete digits
    return text

# Collapse repeated whitespace characters
def _normalize_whitespace(text):
    corrected = str(text)
    corrected = re.sub(r"//t", r"\t", corrected)  # replace a literal "//t" sequence with a tab
    corrected = re.sub(r"( )\1+", r"\1", corrected)
    corrected = re.sub(r"(\n)\1+", r"\1", corrected)
    corrected = re.sub(r"(\r)\1+", r"\1", corrected)
    corrected = re.sub(r"(\t)\1+", r"\1", corrected)
    return corrected.strip(" ")

# Remove stopwords
def clean_stopwords(text):
    stopword = set(stopwords.words('indonesian'))
    text = ' '.join(word for word in text.split() if word not in stopword)  # drop stopwords from the text
    return text

# Stemming with Sastrawi
def sastrawistemmer(text):
    factory = StemmerFactory()
    st = factory.create_stemmer()
    text = ' '.join(st.stem(word) for word in tqdm(text.split()))  # stem every word, with a progress bar
    return text

The **clean_lower** function converts all text to lowercase.

The **clean_punct** function removes punctuation, symbols, and digits.

The **_normalize_whitespace** function collapses runs of two or more identical whitespace characters into a single one.

The **clean_stopwords** function removes uninformative words (conjunctions, filler words, and so on).

The **sastrawistemmer** function performs stemming (reducing each word to its root form).
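To make the effect of the cleaning steps concrete, here is a minimal sketch that chains case folding, symbol and digit removal, whitespace normalization, and stopword removal on one sentence. The sample sentence, the `simple_clean` helper, and the tiny `SAMPLE_STOPWORDS` set are assumptions made for this illustration. The notebook itself uses NLTK's Indonesian stopword list and the Sastrawi stemmer, which are left out here so the snippet runs without extra downloads.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's Indonesian list).
SAMPLE_STOPWORDS = {"yang", "di", "dan", "pada"}

def simple_clean(text, stopword_set=SAMPLE_STOPWORDS):
    text = text.lower()                       # case folding
    text = re.sub(r"[^a-z]", " ", text)       # replace digits, punctuation, symbols with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return " ".join(w for w in text.split() if w not in stopword_set)

# Made-up sample sentence, used only for illustration.
sample = "Ekonomi Indonesia tumbuh 5 persen pada kuartal II, dan inflasi terjaga."
cleaned = simple_clean(sample)
print(cleaned)
```

Stemming would then be applied to the cleaned string as the final step, as in `sastrawistemmer`.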

### Preprocessing the Text

In [None]:
# Apply all preprocessing functions to each paragraph
cleaned_paragraf = []
for p in paragraf:
    text = p.get_text()
    text = clean_lower(text)
    text = clean_punct(text)
    text = _normalize_whitespace(text)
    text = clean_stopwords(text)
    text = sastrawistemmer(text)
    cleaned_paragraf.append(text)



## Building the DataFrame

### Preparing Sentences and Words

In [None]:
# Build the list of sentences
sentences = []
for p in cleaned_paragraf:
  sentences.extend(sent_tokenize(p))

# Build the set of unique words
vocabulary = set()
for sentence in sentences:
  for word in sentence.split():
    vocabulary.add(word)

print(sentences)
print(vocabulary)

['jakarta kompas com presiden pilih prabowo subianto nilai kembang kondisi ekonomi indonesia positif tanda laju inflasi tumbuh produk domestik bruto pdb jaga', 'menteri tahan bilang laju tumbuh ekonomi nasional banding negara negara kondisi global tantang', 'data anyar tumbuh ekonomi indonesia capai persen kuartal ii tumbuh ekonomi negara g periode', 'iring tumbuh ekonomi stabil kisar persen pdb indonesia capai triliun dollar as jadi indonesia negara ekonomi besar dunia', 'syukur terima kasih harga capai prabowo bni investor daily summit jakarta rabu', 'baca faisal basri utang perintah potensi tembus rp triliun prabowo', 'prabowo bilang kondisi ekonomi global tantang pandemi covid utang negara bengkak rasio utang pdb lonjak', 'prabowo contoh perancis rasio utang pdb capai persen', 'bayar utang', 'baca prabowo rencana turun tarif pph badan', 'prabowo sadar kondisi global warna tingkat tensi geopolitik tantang sendiri ekonomi nasional', 'turut hati hati ambil putus jaga damai negara anta
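Note that `clean_punct` has already removed every period by the time `sent_tokenize` runs, so each non-empty cleaned paragraph comes back as a single sentence, as the output above shows. If true sentence-level rows are wanted, one option is to split first and clean afterwards. The sketch below illustrates the difference with a simple regex splitter standing in for `sent_tokenize`. The `split_sentences` and `strip_non_letters` helpers and the sample paragraph are assumptions for this example, not part of the notebook.

```python
import re

def split_sentences(paragraph):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # (A stand-in for nltk's sent_tokenize, which handles abbreviations better.)
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def strip_non_letters(sentence):
    # Same spirit as the notebook's cleaning: lowercase, keep letters only.
    return re.sub(r"\s+", " ", re.sub(r"[^a-z]", " ", sentence.lower())).strip()

# Made-up sample paragraph, used only for illustration.
paragraph = "Ekonomi tumbuh stabil. Utang negara membengkak! Apa langkah berikutnya?"

# Split into sentences first, then clean each sentence.
split_then_clean = [strip_non_letters(s) for s in split_sentences(paragraph)]
# Clean first, then split: the punctuation is gone, so only one "sentence" remains.
clean_then_split = split_sentences(strip_non_letters(paragraph))

print(split_then_clean)
print(clean_then_split)
```

Either ordering yields the same vocabulary, so the final word counts are unaffected; only the number of rows in the table changes.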

### Populating the DataFrame

In [None]:
# Create an empty DataFrame with one column per vocabulary word and one row per sentence
pd.set_option('future.no_silent_downcasting', True)
df = pd.DataFrame(columns=list(vocabulary), index=sentences)

# Fill the empty cells with 0
df = df.fillna(0)

# Count how often each word occurs in each sentence
for sentence in sentences:
  for word in sentence.split():
    df.loc[sentence, word] += 1

# Display the DataFrame
df

sentence,tarif,capai,rp,basri,terima,presiden,nilai,geopolitik,produk,periode,...,rencana,sendiri,tunggu,persen,perintah,daily,pandemi,bayar,tanda,putus
jakarta kompas com presiden pilih prabowo subianto nilai kembang kondisi ekonomi indonesia positif tanda laju inflasi tumbuh produk domestik bruto pdb jaga,0,0,0,0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
menteri tahan bilang laju tumbuh ekonomi nasional banding negara negara kondisi global tantang,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
data anyar tumbuh ekonomi indonesia capai persen kuartal ii tumbuh ekonomi negara g periode,0,1,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
iring tumbuh ekonomi stabil kisar persen pdb indonesia capai triliun dollar as jadi indonesia negara ekonomi besar dunia,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
syukur terima kasih harga capai prabowo bni investor daily summit jakarta rabu,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
baca faisal basri utang perintah potensi tembus rp triliun prabowo,0,0,1,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
prabowo bilang kondisi ekonomi global tantang pandemi covid utang negara bengkak rasio utang pdb lonjak,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
prabowo contoh perancis rasio utang pdb capai persen,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
bayar utang,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
baca prabowo rencana turun tarif pph badan,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
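The same sentence-by-word count table can also be built without updating one cell at a time. The following is a minimal sketch using only `collections.Counter`. The two toy sentences are taken from the cleaned output above, and the `build_count_rows` helper name is hypothetical.

```python
from collections import Counter

def build_count_rows(sentences):
    # A sorted vocabulary gives a stable column order across runs.
    vocab = sorted({w for s in sentences for w in s.split()})
    rows = []
    for s in sentences:
        counts = Counter(s.split())
        rows.append([counts[w] for w in vocab])  # Counter returns 0 for missing words
    return vocab, rows

toy_sentences = ["bayar utang", "prabowo contoh perancis rasio utang pdb capai persen"]
vocab, rows = build_count_rows(toy_sentences)

print(vocab)
for sentence, row in zip(toy_sentences, rows):
    print(row, sentence)
```

If pandas is available, `pd.DataFrame(rows, index=toy_sentences, columns=vocab)` produces a frame with the same layout as `df`.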


## Keyword Extraction

In [None]:
# Sum the occurrences of each word across all sentences
word_frequencies = df.sum(axis=0)

# Sort the words by number of occurrences
sorted_word_frequencies = word_frequencies.sort_values(ascending=False)

# Show the top-ranked words
rank = 5

for i, (word, freq) in enumerate(sorted_word_frequencies.head(rank).items()):
    print(f"Rank {i+1}: {word} (Frequency: {freq})")

Rank 1: prabowo (Frequency: 9)
Rank 2: ekonomi (Frequency: 9)
Rank 3: negara (Frequency: 6)
Rank 4: tumbuh (Frequency: 6)
Rank 5: utang (Frequency: 5)
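The ranking above is based purely on term frequency. Since the assignment is framed around a graph structure, one possible graph-based variant is sketched below: words are nodes, two words are linked when they appear in the same sentence, and words are ranked by their number of distinct neighbours (degree). This is an illustrative assumption, not the method used in this notebook. The `cooccurrence_degree` helper is hypothetical, and the three toy sentences are taken from the cleaned output shown earlier.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_degree(sentences):
    # neighbours[w] holds every distinct word that shares a sentence with w
    neighbours = defaultdict(set)
    for s in sentences:
        words = set(s.split())
        for a, b in combinations(sorted(words), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    # Degree = number of distinct neighbours; sort by degree, then alphabetically
    return sorted(((w, len(n)) for w, n in neighbours.items()),
                  key=lambda item: (-item[1], item[0]))

toy_sentences = [
    "bayar utang",
    "prabowo contoh perancis rasio utang pdb capai persen",
    "baca prabowo rencana turun tarif pph badan",
]
ranking = cooccurrence_degree(toy_sentences)
for word, degree in ranking[:3]:
    print(word, degree)
```

On these toy sentences, `prabowo` ranks first because it connects the words of two different sentences. A fuller version could build the same edges in `networkx`, which the notebook already imports, and apply a centrality measure such as PageRank.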
