# Web Search and Mining - Assignment 3: Applying the VSM Output with Logistic Regression

In Assignment 3, the task is to build a model from the previously created VSM data using the Logistic Regression algorithm.

Created by:

*   Name : Sabil Ahmad Hidayat
*   NIM : 220411100058
*   Class : PPW A

Project link : https://github.com/meinhere/ppw



# Import Library

In [60]:
!pip install -q Sastrawi

In [61]:
# core libraries for numeric work and text handling
import numpy as np
import re
import pandas as pd

# crawling tools
from urllib.request import urlopen
from bs4 import BeautifulSoup

# progress monitoring
from tqdm import tqdm

# libraries for text preprocessing
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# libraries for modeling
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# libraries for model evaluation
from sklearn.metrics import classification_report, confusion_matrix

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# save model
import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**preprocessing** is used here to encode the labels.

**train_test_split** splits the dataset into training and testing data.

**LogisticRegression** is the scikit-learn estimator used for the modeling stage.

**classification_report** and **confusion_matrix** are used to report and inspect the evaluation results after training.

**matplotlib** and **seaborn** are used for plotting.

**pickle** is used to save the trained model and load it again later.
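As a rough illustration of the pickle round trip described above, the sketch below writes an object to a temporary file and reads it back. The dictionary `fake_model` is a hypothetical stand-in for a trained model, not the model used in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model (any picklable object works).
fake_model = {"coef": [0.5, -1.2], "intercept": 0.1}

# Save the object to a temporary .sav file.
path = os.path.join(tempfile.mkdtemp(), "model.sav")
with open(path, "wb") as f:
    pickle.dump(fake_model, f)

# Load it back; pickle.load accepts any binary file-like object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fake_model)  # True
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.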

# Data Preparation

## Load the Dataset

In [62]:
main_df = pd.read_csv('https://raw.githubusercontent.com/meinhere/ppw/master/publish/tugas-2/data_berita.csv', delimiter=',')
main_df

Unnamed: 0,No,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,1,Simak Jadwal dan Lokasi SIM Keliling di Jakart...,"JAKARTA, KOMPAS.com - Surat Izin Mengemudi (S...",07/09/2024,OTOMOTIF
1,2,[POPULER OTOMOTIF] Diskon Motor Honda Septembe...,"JAKARTA, KOMPAS.com - Banyak pembaca yang ingi...",07/09/2024,OTOMOTIF
2,3,"Cek Saldo Minimal BRI, BNI, BCA, Mandiri, dan BSI","JAKARTA, KOMPAS.com - Penting bagi calon nasab...",06/09/2024,MONEY
3,4,"KAI Uji Coba Teknologi ""Face Recognition Board...",KOMPAS.com - PT Kereta Api Indonesia (KAI) Div...,06/09/2024,MONEY
4,5,OJK Blokir 10.890 Entitas Keuangan Ilegal Seja...,"JAKARTA, KOMPAS.com - Otoritas Jasa Keuangan (...",06/09/2024,MONEY
...,...,...,...,...,...
95,96,Waspada Masalah yang Timbul akibat Telat Ganti...,"JAKARTA, KOMPAS.com - Oli mesin pada mobil den...",06/09/2024,OTOMOTIF
96,97,"Sosok Faisal Basri di Mata Para Tokoh, Ekonom ...","JAKARTA, KOMPAS.com - Ekonom senior Faisal Bas...",06/09/2024,MONEY
97,98,"Pendaftaran CPNS Diperpanjang 4 Hari, Pelamar ...","JAKARTA, KOMPAS.com - Pemerintah telah memperp...",06/09/2024,MONEY
98,99,"Harga Emas Terbaru Pegadaian, Jumat 6 Septembe...","JAKARTA, KOMPAS.com - Pegadaian menyediakan be...",06/09/2024,MONEY


## Defining the Crawling Helper Functions

In [63]:
# collect the pagination links from the index page to be crawled
def extract_urls(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    urls = soup.find_all("a", {"class": "paging__link"})
    urls = [url.get('href') for url in urls]

    return urls

# fetch the body text of a single news article
def get_content(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    div = soup.find("div", {"class": "read__content"})
    paragraf = div.find_all("p")

    content = ''
    for p in paragraf:
        content += p.text

    return content


# main crawling function
def crawl(link = "https://indeks.kompas.com", max_money = 1, max_otomotif = 1, allow_category = ["OTOMOTIF", "MONEY"], is_train = True, title_old = []):
    # list that collects the crawled articles
    news_data = []

    # read the last pagination link to find out how many index pages exist
    last_url = extract_urls(link).pop()
    page = last_url.split('=').pop() # number of pages, detected automatically
    # page = 1 # number of pages, set manually

    # build the list of index pages to crawl
    urls = [link + '/?page=' + str(a) for a in range(1, int(page) + 1)]
    count_money = 0
    count_otomotif = 0

    # walk through every index page
    for idx, url in enumerate(urls):
        # stop once both category quotas are filled
        if (len(news_data) == max_money + max_otomotif):
            break

        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')

        # pull the required elements out of the HTML structure
        links       = soup.find_all("a", {"class": "article-link"})
        titles      = soup.find_all("h2", {"class": "articleTitle"})
        dates       = soup.find_all("div", {"class": "articlePost-date"})
        categories  = soup.find_all("div", {"class": "articlePost-subtitle"})

        news_per_page = len(links) # number of articles listed on this page

        # append matching articles to the list
        for elem in tqdm(range(news_per_page), desc=f"Crawling page {idx+1}"):
            news = {}
            category = categories[elem].text
            title = titles[elem].text

            if (category in allow_category):
                # accept an article only while its category quota is not yet full
                cond = (category == "MONEY" and count_money < max_money) or (category == "OTOMOTIF" and count_otomotif < max_otomotif)
                if (not is_train):
                    # for test data, also skip titles already present in the old dataset;
                    # the original expression applied this check to OTOMOTIF only,
                    # because `and` binds tighter than `or`
                    cond = cond and title not in title_old

                if (cond):
                    news['No'] = len(news_data) + 1
                    news['Judul Berita']     = title
                    news['Isi Berita']       = get_content(links[elem].get("href"))
                    news['Tanggal Berita']   = dates[elem].text
                    news['Kategori Berita']  = category
                    news_data.append(news)

                    if (category == "MONEY"):
                        count_money += 1
                    else:
                        count_otomotif += 1

        print(f"=======> Money: {count_money} | Otomotif: {count_otomotif} | Total: {count_money + count_otomotif}")

    return news_data

The **extract_urls** function extracts the pagination links from the first index page, which gives the URLs that lead to the next or previous pages.

The **get_content** function builds the article body by joining the text of every paragraph found at the given article link.
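The way `crawl` turns the last pagination link into a list of page URLs can be sketched without any network access. The value of `last_url` below is a made-up example of what `extract_urls` might return; the real link depends on the live site.

```python
# Hypothetical last pagination link, as extract_urls(link).pop() might return it.
link = "https://indeks.kompas.com"
last_url = "https://indeks.kompas.com/?page=8"

# The page count is whatever follows the final '='.
page = int(last_url.split("=").pop())

# Build one URL per index page, mirroring the list comprehension in crawl().
urls = [f"{link}/?page={n}" for n in range(1, page + 1)]

print(page)      # 8
print(urls[0])   # https://indeks.kompas.com/?page=1
print(urls[-1])  # https://indeks.kompas.com/?page=8
```

This approach assumes the page number is the last query parameter in the link; a URL with additional parameters after `page` would need proper query-string parsing instead.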

## Fetching New Data

In [64]:
title_old = main_df["Judul Berita"].tolist()

test_news = crawl(max_money=5, max_otomotif=5, is_train=False, title_old=title_old)

Crawling page 1: 100%|██████████| 15/15 [00:01<00:00, 14.34it/s]
Crawling page 2: 100%|██████████| 15/15 [00:00<00:00, 44.94it/s]
Crawling page 3: 100%|██████████| 15/15 [00:00<00:00, 18120.55it/s]
Crawling page 4: 100%|██████████| 15/15 [00:02<00:00,  7.22it/s]
Crawling page 5: 100%|██████████| 15/15 [00:01<00:00, 10.78it/s]
Crawling page 6: 100%|██████████| 15/15 [00:00<00:00, 22.18it/s]
Crawling page 7: 100%|██████████| 15/15 [00:00<00:00, 23.57it/s]
Crawling page 8: 100%|██████████| 15/15 [00:00<00:00, 28.56it/s]


In [65]:
main_df = pd.DataFrame(test_news)
main_df

Unnamed: 0,No,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,1,KSPI Khawatir Kisruh Kadin Berdampak pada Nasi...,"JAKARTA, KOMPAS.com - Serikat Pekerja Indonesi...",18/09/2024,MONEY
1,2,Alasan Berdikari Insurance Dapat Sanksi Pembat...,"JAKARTA, KOMPAS.com - Otoritas Jasa Keuangan (...",18/09/2024,MONEY
2,3,Jack Miller Diduga Isi Bangku Pramac Yamaha Mu...,"JAKARTA, KOMPAS.com - Pebalap Red Bull KTM Jac...",18/09/2024,OTOMOTIF
3,4,Jumlah Penumpang DAMRI Tembus 58.000 Orang sel...,"JAKARTA, KOMPAS.com – DAMRI mencatat jumlah pe...",18/09/2024,OTOMOTIF
4,5,Ekonom: Sudah Waktunya BI Turunkan Suku Bunga ...,"JAKARTA, KOMPAS.con - Bank Indonesia (BI) dipe...",18/09/2024,MONEY
5,6,Petani Hanya Perlu Bayar Rp 36.000 per Hektare...,"JAKARTA, KOMPAS.com - PT Asuransi Jasa Indones...",18/09/2024,MONEY
6,7,Langkah Penting Merencanakan Perjalanan Jarak ...,"JAKARTA, KOMPAS.com – Berkendara jarak jauh me...",18/09/2024,OTOMOTIF
7,8,"Mau Beli Emas di Pegadaian, Cek Dulu Harganya ...",Harga emas batangan Antam dengan berat 1 gram ...,18/09/2024,MONEY
8,9,PO Bimo Rilis 3 Bus Baru Pakai Jetbus 5 Edisi ...,"JAKARTA, KOMPAS.com - PO Bimo Transport merili...",18/09/2024,OTOMOTIF
9,10,Sektor UMKM Dongkrak Populasi Daihatsu Gran Max,"JAKARTA, KOMPAS.com - Sejak pertama kali dilun...",18/09/2024,OTOMOTIF


## Text Preprocessing

### Defining the Functions

In [66]:
# Case folding
def clean_lower(lwr):
    lwr = lwr.lower() # lowercase text
    return lwr

# Remove punctuation, digits, and symbols
def clean_punct(text):
    # raw strings avoid invalid-escape warnings for \[ and \]
    clean_spcl = re.compile(r'[/(){}\[\]\|@,;_]')
    clean_symbol = re.compile(r'[^0-9a-z]')
    clean_number = re.compile(r'[0-9]')
    text = clean_spcl.sub('', text)
    text = clean_symbol.sub(' ', text)
    text = clean_number.sub('', text)
    return text

# Collapse repeated whitespace
def _normalize_whitespace(text):
    corrected = str(text)
    corrected = re.sub(r"//t",r"\t", corrected)
    corrected = re.sub(r"( )\1+",r"\1", corrected)
    corrected = re.sub(r"(\n)\1+",r"\1", corrected)
    corrected = re.sub(r"(\r)\1+",r"\1", corrected)
    corrected = re.sub(r"(\t)\1+",r"\1", corrected)
    return corrected.strip(" ")

# Remove stopwords
def clean_stopwords(text):
    stopword = set(stopwords.words('indonesian'))
    text = ' '.join(word for word in text.split() if word not in stopword) # drop stopwords from the text
    return text

# Stemming with Sastrawi
def sastrawistemmer(text):
    factory = StemmerFactory()
    st = factory.create_stemmer()
    # the original `if word in text` filter was always true, so it is dropped
    text = ' '.join(st.stem(word) for word in tqdm(text.split()))
    return text

The **clean_lower** function converts every character to lowercase.

The **clean_punct** function removes punctuation, symbols, and digits.

The **_normalize_whitespace** function collapses runs of two or more whitespace characters into one.

The **clean_stopwords** function removes words that carry little meaning (conjunctions, filler words, and so on).

The **sastrawistemmer** function performs stemming, reducing each word to its root form.
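To make the effect of these steps concrete, here is a minimal standard-library sketch that chains case folding, punctuation and digit removal, whitespace normalization, and stopword removal on one sentence. The `STOPWORDS` set and the `sample` sentence are hypothetical; the notebook itself uses NLTK's Indonesian stopword list, and stemming is left out because it needs Sastrawi.

```python
import re

# Tiny hypothetical stopword list; the notebook uses NLTK's Indonesian list instead.
STOPWORDS = {"dan", "yang", "di"}

def preprocess(text):
    text = text.lower()                           # case folding
    text = re.sub(r"[/(){}\[\]|@,;_]", "", text)  # drop selected punctuation
    text = re.sub(r"[^0-9a-z]", " ", text)        # other symbols become spaces
    text = re.sub(r"[0-9]", "", text)             # drop digits
    text = re.sub(r" +", " ", text).strip()       # collapse repeated spaces
    return " ".join(w for w in text.split() if w not in STOPWORDS)

sample = "JAKARTA, KOMPAS.com - Harga emas dan perak naik 5 persen!"
print(preprocess(sample))
# jakarta kompas com harga emas perak naik persen
```

Note that the order matters: digits are removed after symbols are turned into spaces, so a value such as `5` leaves a double space behind that the whitespace step then collapses.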

### Clean Lower

In [67]:
# Add a column holding the article text after case folding
main_df['lwr'] = main_df['Isi Berita'].apply(clean_lower)
casefolding=pd.DataFrame(main_df['lwr'])
casefolding

Unnamed: 0,lwr
0,"jakarta, kompas.com - serikat pekerja indonesi..."
1,"jakarta, kompas.com - otoritas jasa keuangan (..."
2,"jakarta, kompas.com - pebalap red bull ktm jac..."
3,"jakarta, kompas.com – damri mencatat jumlah pe..."
4,"jakarta, kompas.con - bank indonesia (bi) dipe..."
5,"jakarta, kompas.com - pt asuransi jasa indones..."
6,"jakarta, kompas.com – berkendara jarak jauh me..."
7,harga emas batangan antam dengan berat 1 gram ...
8,"jakarta, kompas.com - po bimo transport merili..."
9,"jakarta, kompas.com - sejak pertama kali dilun..."


### Clean Punct

In [68]:
# Add a column holding the text after punctuation removal
main_df['clean_punct'] = main_df['lwr'].apply(clean_punct)
main_df['clean_punct']

Unnamed: 0,clean_punct
0,jakarta kompas com serikat pekerja indonesia...
1,jakarta kompas com otoritas jasa keuangan oj...
2,jakarta kompas com pebalap red bull ktm jack...
3,jakarta kompas com damri mencatat jumlah pen...
4,jakarta kompas con bank indonesia bi diperki...
5,jakarta kompas com pt asuransi jasa indonesi...
6,jakarta kompas com berkendara jarak jauh men...
7,harga emas batangan antam dengan berat gram d...
8,jakarta kompas com po bimo transport merilis...
9,jakarta kompas com sejak pertama kali dilunc...


### Normalize Whitespace

In [69]:
main_df['clean_double_ws'] = main_df['clean_punct'].apply(_normalize_whitespace)
main_df['clean_double_ws']

Unnamed: 0,clean_double_ws
0,jakarta kompas com serikat pekerja indonesia k...
1,jakarta kompas com otoritas jasa keuangan ojk ...
2,jakarta kompas com pebalap red bull ktm jack m...
3,jakarta kompas com damri mencatat jumlah penum...
4,jakarta kompas con bank indonesia bi diperkira...
5,jakarta kompas com pt asuransi jasa indonesia ...
6,jakarta kompas com berkendara jarak jauh mengg...
7,harga emas batangan antam dengan berat gram di...
8,jakarta kompas com po bimo transport merilis t...
9,jakarta kompas com sejak pertama kali diluncur...


### Clean Stopwords

In [70]:
# Add a column holding the text after stopword removal
main_df['clean_sw'] = main_df['clean_double_ws'].apply(clean_stopwords)
main_df['clean_sw']

Unnamed: 0,clean_sw
0,jakarta kompas com serikat pekerja indonesia k...
1,jakarta kompas com otoritas jasa keuangan ojk ...
2,jakarta kompas com pebalap red bull ktm jack m...
3,jakarta kompas com damri mencatat penumpang ko...
4,jakarta kompas con bank indonesia bi menurunka...
5,jakarta kompas com pt asuransi jasa indonesia ...
6,jakarta kompas com berkendara jarak sepeda mot...
7,harga emas batangan antam berat gram dibandero...
8,jakarta kompas com po bimo transport merilis b...
9,jakarta kompas com kali diluncurkan indonesia ...


### Stemming with Sastrawi

In [71]:
# Add a column holding the text after stemming
main_df['desc_clean_stem'] = main_df['clean_sw'].apply(sastrawistemmer)
main_df['desc_clean_stem']

100%|██████████| 190/190 [00:12<00:00, 14.97it/s]
100%|██████████| 195/195 [00:08<00:00, 22.49it/s]
100%|██████████| 120/120 [00:05<00:00, 20.43it/s]
100%|██████████| 186/186 [00:05<00:00, 31.56it/s]
100%|██████████| 173/173 [00:04<00:00, 37.83it/s]
100%|██████████| 193/193 [00:04<00:00, 45.52it/s]
100%|██████████| 100/100 [00:01<00:00, 54.85it/s]
100%|██████████| 79/79 [00:01<00:00, 61.33it/s]
100%|██████████| 136/136 [00:05<00:00, 26.93it/s]
100%|██████████| 267/267 [00:07<00:00, 36.17it/s]


Unnamed: 0,desc_clean_stem
0,jakarta kompas com serikat kerja indonesia ksp...
1,jakarta kompas com otoritas jasa uang ojk sank...
2,jakarta kompas com balap red bull ktm jack mil...
3,jakarta kompas com damri catat tumpang kota pr...
4,jakarta kompas con bank indonesia bi turun suk...
5,jakarta kompas com pt asuransi jasa indonesia ...
6,jakarta kompas com kendara jarak sepeda motor ...
7,harga emas batang antam berat gram banderol ha...
8,jakarta kompas com po bimo transport rilis bus...
9,jakarta kompas com kali luncur indonesia popul...


## Building the VSM

In [72]:
# Load the saved TF-IDF vectorizer from the repository
filename = 'https://raw.githubusercontent.com/meinhere/ppw/master/publish/tugas-3/model/tfidf_vectorizer.sav'
# open() cannot read a URL, so fetch the file with urlopen instead
tfidf_vectorizer = pickle.load(urlopen(filename))

In [73]:
corpus = main_df['desc_clean_stem']
tfidf = tfidf_vectorizer.transform(corpus)

tfidf.shape

(10, 3555)
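The second dimension of this shape comes from the vocabulary the vectorizer learned when it was fitted, not from the new articles. A small sketch with a toy corpus shows the same behavior; `train_corpus` and `new_docs` are hypothetical and unrelated to the saved vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (hypothetical); the real vectorizer was fitted earlier.
train_corpus = ["harga emas naik", "motor baru rilis", "harga motor turun"]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_corpus)

# New document containing one unseen word ("saham").
new_docs = ["harga saham naik"]
matrix = vectorizer.transform(new_docs)

# The column count equals the fitted vocabulary size; "saham" is ignored.
print(matrix.shape[1] == len(vectorizer.vocabulary_))  # True
print("saham" in vectorizer.vocabulary_)               # False
```

Calling `transform` rather than `fit_transform` on new data is what keeps the feature columns aligned with the ones the classifier was trained on.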

In [74]:
vocabulary = tfidf_vectorizer.get_feature_names_out().tolist()

tfidf_df = pd.DataFrame(tfidf.toarray(), columns=vocabulary)
tfidf_df.insert(0, 'Kategori Berita', main_df['Kategori Berita'])
tfidf_df

Unnamed: 0,Kategori Berita,aaion,aali,abadi,abai,abenkh,abnormal,absurd,ac,acapkali,...,za,zad,zag,zaman,zarco,zenix,zero,zig,zigzag,zona
0,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,OTOMOTIF,0.0,0.0,0.0,0.081094,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Preparing Data

### Encode Label
The **[Kategori Berita]** column is encoded at this stage. Its values are still categorical (words), so they must be converted to numbers before they can be used with the model.

In [75]:
# use label_encoder to convert the category names into integers
label_encoder = preprocessing.LabelEncoder()
tfidf_df['Kategori Berita'] = label_encoder.fit_transform(tfidf_df['Kategori Berita'])

tfidf_df

Unnamed: 0,Kategori Berita,aaion,aali,abadi,abai,abenkh,abnormal,absurd,ac,acapkali,...,za,zad,zag,zaman,zarco,zenix,zero,zig,zigzag,zona
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1,0.0,0.0,0.0,0.081094,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
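The mapping seen in the table (MONEY becomes 0, OTOMOTIF becomes 1) follows from how `LabelEncoder` orders its classes. A short sketch on a hand-written label list:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["MONEY", "MONEY", "OTOMOTIF", "OTOMOTIF", "MONEY"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# Classes are sorted alphabetically, so MONEY -> 0 and OTOMOTIF -> 1.
print(encoder.classes_.tolist())  # ['MONEY', 'OTOMOTIF']
print(encoded.tolist())           # [0, 0, 1, 1, 0]

# inverse_transform maps the integers back to the original category names.
print(encoder.inverse_transform([1, 0]).tolist())  # ['OTOMOTIF', 'MONEY']
```

Because the encoder is fitted again on the new data, the integers match the training labels only when the same set of categories appears in both. Reusing the encoder fitted during training would likely be the safer choice if a category could be missing from a new batch.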


# Testing Data

In [76]:
# Load the saved Logistic Regression model from the repository
filename = 'https://raw.githubusercontent.com/meinhere/ppw/master/publish/tugas-3/model/lr_model.sav'
# open() cannot read a URL, so fetch the file with urlopen instead
lr_model = pickle.load(urlopen(filename))

In [77]:
y_test = tfidf_df['Kategori Berita']
x_test = tfidf_df.drop(['Kategori Berita'], axis=1)
y_pred = lr_model.predict(x_test)

print(y_pred)

[0 0 1 0 0 0 1 0 1 0]
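Each 0 or 1 in this output is, roughly speaking, the result of thresholding a logistic (sigmoid) probability at 0.5. The sketch below shows that idea for a single document in plain Python. The `weights`, `bias`, and feature values are hypothetical and are not taken from `lr_model`.

```python
import math

def sigmoid(z):
    # Squash any real-valued score into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(features, weights, bias, threshold=0.5):
    # Linear score followed by the logistic (sigmoid) function.
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability

# Hypothetical weights for a 3-term vocabulary; not taken from lr_model.
weights = [1.5, -2.0, 0.5]
bias = -0.1

label, prob = predict([0.4, 0.0, 0.2], weights, bias)
print(label, round(prob, 3))  # 1 0.646
```

In the notebook the feature vector has 3555 TF-IDF values per article, but the computation has the same shape: a weighted sum, a sigmoid, and a cutoff.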


In [78]:
# compare the actual and predicted values
a = pd.DataFrame({'Actual value': y_test, 'Predicted value':y_pred})
a

Unnamed: 0,Actual value,Predicted value
0,0,0
1,0,0
2,1,1
3,1,0
4,0,0
5,0,0
6,1,1
7,0,0
8,1,1
9,1,0
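The comparison above can be summarized with the evaluation helpers from `sklearn.metrics`. The sketch below copies the ten actual and predicted values from the table and computes a confusion matrix and the accuracy; it is a summary of the shown output, not an additional experiment.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Values copied from the Actual/Predicted table above (0 = MONEY, 1 = OTOMOTIF).
y_test = [0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

print(cm.tolist())  # [[5, 0], [2, 3]]
print(acc)          # 0.8
```

All five MONEY articles are classified correctly, while two of the five OTOMOTIF articles (rows 3 and 9) are predicted as MONEY, giving 8 correct predictions out of 10.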
