# Web Search and Mining - Assignment 3: Applying the VSM Output with the Logistic Regression Algorithm

In Assignment 3, the task is to build a model from the previously created VSM data using the Logistic Regression algorithm.

Created by:

*   Name: Sabil Ahmad Hidayat
*   NIM: 220411100058
*   Class: PPW A

Project link: https://github.com/meinhere/ppw



# Import Library

In [1]:
!pip install -q Sastrawi

In [2]:
# core libraries for computation and text handling
import numpy as np
import re
import pandas as pd

# crawling tools
from urllib.request import urlopen
from bs4 import BeautifulSoup

# progress monitoring
from tqdm import tqdm

# libraries for text preprocessing
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# libraries for modeling
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# libraries for model evaluation
from sklearn.metrics import classification_report, confusion_matrix

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

# saving and loading the model
import pickle

**preprocessing** is used here to encode the labels.

**train_test_split** splits a dataset into training and testing data.

**LogisticRegression** is the scikit-learn estimator used in the modeling stage.

**classification_report** and **confusion_matrix** are used to view the report and the evaluation results after training.

**matplotlib** and **seaborn** are used for plotting charts.

**pickle** is used to save the trained model and to load it back for testing.

# Data Preparation

## Load Data Model

In [55]:
main_df = pd.read_csv('https://raw.githubusercontent.com/meinhere/ppw/master/publish/tugas-2/data_berita.csv', delimiter=',')
main_df

Unnamed: 0,No,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,1,Simak Jadwal dan Lokasi SIM Keliling di Jakart...,"JAKARTA, KOMPAS.com - Surat Izin Mengemudi (S...",07/09/2024,OTOMOTIF
1,2,[POPULER OTOMOTIF] Diskon Motor Honda Septembe...,"JAKARTA, KOMPAS.com - Banyak pembaca yang ingi...",07/09/2024,OTOMOTIF
2,3,"Cek Saldo Minimal BRI, BNI, BCA, Mandiri, dan BSI","JAKARTA, KOMPAS.com - Penting bagi calon nasab...",06/09/2024,MONEY
3,4,"KAI Uji Coba Teknologi ""Face Recognition Board...",KOMPAS.com - PT Kereta Api Indonesia (KAI) Div...,06/09/2024,MONEY
4,5,OJK Blokir 10.890 Entitas Keuangan Ilegal Seja...,"JAKARTA, KOMPAS.com - Otoritas Jasa Keuangan (...",06/09/2024,MONEY
...,...,...,...,...,...
95,96,Waspada Masalah yang Timbul akibat Telat Ganti...,"JAKARTA, KOMPAS.com - Oli mesin pada mobil den...",06/09/2024,OTOMOTIF
96,97,"Sosok Faisal Basri di Mata Para Tokoh, Ekonom ...","JAKARTA, KOMPAS.com - Ekonom senior Faisal Bas...",06/09/2024,MONEY
97,98,"Pendaftaran CPNS Diperpanjang 4 Hari, Pelamar ...","JAKARTA, KOMPAS.com - Pemerintah telah memperp...",06/09/2024,MONEY
98,99,"Harga Emas Terbaru Pegadaian, Jumat 6 Septembe...","JAKARTA, KOMPAS.com - Pegadaian menyediakan be...",06/09/2024,MONEY


## Defining the Helper Functions for Crawling

In [56]:
# collect the pagination links that will be crawled
def extract_urls(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    anchors = soup.find_all("a", {"class": "paging__link"})
    return [anchor.get('href') for anchor in anchors]

# fetch the body text of a news article
def get_content(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    div = soup.find("div", {"class": "read__content"})
    if div is None:  # the page does not contain the expected container
        return ''

    # concatenate the text of every paragraph in the article body
    return ''.join(p.text for p in div.find_all("p"))


# main crawling function
def crawl(link="https://indeks.kompas.com", max_money=1, max_otomotif=1,
          allow_category=("OTOMOTIF", "MONEY"), is_train=True, title_old=()):
    # container for the collected articles
    news_data = []

    # the page count is read from the last pagination link
    last_url = extract_urls(link).pop()
    page = last_url.split('=').pop()  # number of pages, detected automatically
    # page = 1  # number of pages, set manually

    # build the list of index pages to crawl
    urls = [link + '/?page=' + str(a) for a in range(1, int(page) + 1)]
    count_money = 0
    count_otomotif = 0

    # visit every index page
    for idx, url in enumerate(urls):
        # stop once both category quotas are filled
        if len(news_data) == max_money + max_otomotif:
            break

        html = urlopen(url).read()
        soup = BeautifulSoup(html, 'html.parser')

        # pick the required elements out of the HTML structure
        links      = soup.find_all("a", {"class": "article-link"})
        titles     = soup.find_all("h2", {"class": "articleTitle"})
        dates      = soup.find_all("div", {"class": "articlePost-date"})
        categories = soup.find_all("div", {"class": "articlePost-subtitle"})

        news_per_page = len(links)  # number of articles listed on this page

        # append the matching articles to the list
        for elem in tqdm(range(news_per_page), desc=f"Crawling page {idx+1}"):
            news = {}
            category = categories[elem].text
            title = titles[elem].text

            if category in allow_category:
                # each category has its own quota
                cond = (category == "MONEY" and count_money < max_money) or (category == "OTOMOTIF" and count_otomotif < max_otomotif)
                # for test data, also skip titles already present in the old dataset;
                # applying the check separately makes it cover both categories
                if not is_train:
                    cond = cond and title not in title_old

                if cond:
                    news['No'] = len(news_data) + 1
                    news['Judul Berita']     = title
                    news['Isi Berita']       = get_content(links[elem].get("href"))
                    news['Tanggal Berita']   = dates[elem].text
                    news['Kategori Berita']  = category
                    news_data.append(news)

                    if category == "MONEY":
                        count_money += 1
                    else:
                        count_otomotif += 1

        print(f"=======> Money: {count_money} | Otomotif: {count_otomotif} | Total: {count_money + count_otomotif}")

    return news_data

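The **extract_urls** function extracts the pagination links from the first index page, yielding the URLs that lead to the next or previous pages.

The **get_content** function retrieves the article body from a given news link.

The **crawl** function walks through the index pages and collects articles until the per-category quotas (`max_money`, `max_otomotif`) are met. With `is_train=False` it additionally checks titles against `title_old` to avoid collecting articles that are already in the old dataset.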
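
The rule that decides whether a crawled article is kept can be isolated from the network code. The sketch below is a minimal illustration, not the notebook's own function: `should_keep`, `quotas`, `counts`, and the sample titles are hypothetical names introduced here. Because `and` binds more tightly than `or` in Python, the duplicate-title check has to be applied to the whole quota result rather than appended to the end of an `or` chain.

```python
def should_keep(category, title, counts, quotas, seen_titles=()):
    """Return True when the article fits its category quota and is not a duplicate."""
    if category not in quotas:
        return False
    under_quota = counts.get(category, 0) < quotas[category]
    # the duplicate check applies to every category, not only the last one tested
    return under_quota and title not in seen_titles


# hypothetical quotas, counters, and previously collected titles
quotas = {"MONEY": 1, "OTOMOTIF": 1}
counts = {"MONEY": 0, "OTOMOTIF": 0}
old_titles = ["Harga Emas Hari Ini"]

duplicate = should_keep("MONEY", "Harga Emas Hari Ini", counts, quotas, old_titles)
fresh = should_keep("MONEY", "Judul Baru", counts, quotas, old_titles)
other_category = should_keep("BOLA", "Judul Lain", counts, quotas, old_titles)

print(duplicate, fresh, other_category)
```

Keeping the quotas in a dictionary is a design choice of this sketch; it avoids one counter variable per category if more categories are added later.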

## Collecting New Data

In [57]:
title_old = main_df["Judul Berita"].tolist()

test_news = crawl(max_money=5, max_otomotif=5, is_train=False, title_old=title_old)

Crawling page 1: 100%|██████████| 15/15 [00:04<00:00,  3.06it/s]
Crawling page 2: 100%|██████████| 15/15 [00:01<00:00,  7.69it/s]
Crawling page 3: 100%|██████████| 15/15 [00:00<00:00, 76.12it/s]
Crawling page 4: 100%|██████████| 15/15 [00:00<00:00, 59409.41it/s]
Crawling page 5: 100%|██████████| 15/15 [00:00<00:00, 60669.78it/s]
Crawling page 6: 100%|██████████| 15/15 [00:00<00:00, 58.51it/s]
Crawling page 7: 100%|██████████| 15/15 [00:00<00:00, 59634.65it/s]
Crawling page 8: 100%|██████████| 15/15 [00:00<00:00, 125.20it/s]

In [58]:
main_df = pd.DataFrame(test_news)
main_df

Unnamed: 0,No,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,1,"Tupperware Ajukan Bangkrut, Imbas Permintaan T...","JAKARTA, KOMPAS.com - Perusahaan yang memprodu...",18/09/2024,MONEY
1,2,Strategi Jobstreet Siapkan 1 Juta Lowongan Kerja,"JAKARTA, KOMPAS.com - Platform ketenagakerjaan...",18/09/2024,MONEY
2,3,"Promo Tiket Bus DAMRI Beli 3 Gratis 1, Simak W...","JAKARTA, KOMPAS.com – Dalam rangka memperingat...",18/09/2024,OTOMOTIF
3,4,Begini Cara Bayar Pajak Kendaraan yang Masih K...,"JAKARTA, KOMPAS.com - Setiap pemilik kendaraan...",18/09/2024,OTOMOTIF
4,5,"Apa Itu Pasar Modal: Pengertian, Fungsi, Jenis...","JAKARTA, KOMPAS.com - Pasar modal memegang per...",18/09/2024,MONEY
5,6,"Bandung Diguncang 8 Kali Gempa, KCIC Periksa S...","JAKARTA, KOMPAS.com - PT KCIC menyampaikan per...",18/09/2024,MONEY
6,7,"Pelabuhan Tuas, Proyek Reklamasi Singapura yan...",KOMPAS.com - Negara tetangga Indonesia di sebe...,18/09/2024,MONEY
7,8,Benarkah Oli Habis Bisa Bikin Motor Mogok di J...,"KLATEN, KOMPAS.com - Motor mogok karena kehabi...",18/09/2024,OTOMOTIF
8,9,Jack Miller Diduga Isi Bangku Pramac Yamaha Mu...,"JAKARTA, KOMPAS.com - Pebalap Red Bull KTM Jac...",18/09/2024,OTOMOTIF
9,10,Jumlah Penumpang DAMRI Tembus 58.000 Orang sel...,"JAKARTA, KOMPAS.com – DAMRI mencatat jumlah pe...",18/09/2024,OTOMOTIF


## Text Preprocessing

### Defining the Functions

In [59]:
# Case folding
def clean_lower(lwr):
    lwr = lwr.lower() # lowercase the text
    return lwr

# Remove punctuation, digits, and symbols
def clean_punct(text):
    clean_spcl = re.compile(r'[/(){}\[\]\|@,;_]')
    clean_symbol = re.compile(r'[^0-9a-z]')
    clean_number = re.compile(r'[0-9]')
    text = clean_spcl.sub('', text)     # drop these characters outright
    text = clean_symbol.sub(' ', text)  # replace any other non-alphanumeric character with a space
    text = clean_number.sub('', text)   # drop digits
    return text

# Collapse repeated whitespace
def _normalize_whitespace(text):
    corrected = str(text)
    corrected = re.sub(r"//t",r"\t", corrected)     # a literal "//t" sequence becomes a tab
    corrected = re.sub(r"( )\1+",r"\1", corrected)  # repeated spaces
    corrected = re.sub(r"(\n)\1+",r"\1", corrected) # repeated newlines
    corrected = re.sub(r"(\r)\1+",r"\1", corrected) # repeated carriage returns
    corrected = re.sub(r"(\t)\1+",r"\1", corrected) # repeated tabs
    return corrected.strip(" ")

# Remove stopwords
def clean_stopwords(text):
    stopword = set(stopwords.words('indonesian'))
    text = ' '.join(word for word in text.split() if word not in stopword) # keep only non-stopword tokens
    return text

# Stemming with Sastrawi
def sastrawistemmer(text):
    factory = StemmerFactory()
    st = factory.create_stemmer()
    # the original "if word in text" filter was always true, so it is dropped
    text = ' '.join(st.stem(word) for word in tqdm(text.split()))
    return text

The **clean_lower** function converts every character to lowercase.

The **clean_punct** function removes punctuation, symbols, and digits.

The **_normalize_whitespace** function collapses runs of two or more whitespace characters into one.

The **clean_stopwords** function removes words that carry little meaning (conjunctions, filler words, and so on).

The **sastrawistemmer** function performs stemming, reducing each word to its root form.
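
To see how the first four steps chain together, the sketch below applies them to a single made-up sentence using only the standard library. It rests on assumptions: `preprocess`, `STOPWORDS`, and `sample` are hypothetical, the tiny stopword set stands in for NLTK's Indonesian list, and stemming is left out because it needs Sastrawi.

```python
import re

# hypothetical mini stopword list; the notebook uses NLTK's Indonesian list instead
STOPWORDS = {"yang", "di", "dan"}


def preprocess(text):
    text = text.lower()                            # case folding
    text = re.sub(r"[/(){}\[\]|@,;_]", "", text)   # drop selected punctuation
    text = re.sub(r"[^0-9a-z]", " ", text)         # other symbols become spaces
    text = re.sub(r"[0-9]", "", text)              # drop digits
    text = re.sub(r" +", " ", text).strip()        # collapse repeated spaces
    return " ".join(w for w in text.split() if w not in STOPWORDS)


# made-up sentence in the style of the crawled articles
sample = "JAKARTA, KOMPAS.com - Harga BBM naik 10% di 2024 (resmi)."
result = preprocess(sample)
print(result)  # jakarta kompas com harga bbm naik resmi
```

The order matters: lowercasing has to come first because the symbol pattern only keeps `a-z`, so uppercase letters would otherwise be replaced with spaces.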

### Clean Lower

In [60]:
# Add a column holding the article text after case folding
main_df['lwr'] = main_df['Isi Berita'].apply(clean_lower)
casefolding=pd.DataFrame(main_df['lwr'])
casefolding

Unnamed: 0,lwr
0,"jakarta, kompas.com - perusahaan yang memprodu..."
1,"jakarta, kompas.com - platform ketenagakerjaan..."
2,"jakarta, kompas.com – dalam rangka memperingat..."
3,"jakarta, kompas.com - setiap pemilik kendaraan..."
4,"jakarta, kompas.com - pasar modal memegang per..."
5,"jakarta, kompas.com - pt kcic menyampaikan per..."
6,kompas.com - negara tetangga indonesia di sebe...
7,"klaten, kompas.com - motor mogok karena kehabi..."
8,"jakarta, kompas.com - pebalap red bull ktm jac..."
9,"jakarta, kompas.com – damri mencatat jumlah pe..."


### Clean Punct

In [61]:
# Add a column holding the article text after punctuation removal
main_df['clean_punct'] = main_df['lwr'].apply(clean_punct)
main_df['clean_punct']

Unnamed: 0,clean_punct
0,jakarta kompas com perusahaan yang memproduk...
1,jakarta kompas com platform ketenagakerjaan ...
2,jakarta kompas com dalam rangka memperingati...
3,jakarta kompas com setiap pemilik kendaraan ...
4,jakarta kompas com pasar modal memegang pera...
5,jakarta kompas com pt kcic menyampaikan perm...
6,kompas com negara tetangga indonesia di sebe...
7,klaten kompas com motor mogok karena kehabis...
8,jakarta kompas com pebalap red bull ktm jack...
9,jakarta kompas com damri mencatat jumlah pen...


### Normalize Whitespace

In [62]:
main_df['clean_double_ws'] = main_df['clean_punct'].apply(_normalize_whitespace)
main_df['clean_double_ws']

Unnamed: 0,clean_double_ws
0,jakarta kompas com perusahaan yang memproduksi...
1,jakarta kompas com platform ketenagakerjaan jo...
2,jakarta kompas com dalam rangka memperingati h...
3,jakarta kompas com setiap pemilik kendaraan be...
4,jakarta kompas com pasar modal memegang peran ...
5,jakarta kompas com pt kcic menyampaikan permoh...
6,kompas com negara tetangga indonesia di sebera...
7,klaten kompas com motor mogok karena kehabisan...
8,jakarta kompas com pebalap red bull ktm jack m...
9,jakarta kompas com damri mencatat jumlah penum...


### Clean Stopwords

In [63]:
# Add a column holding the article text after stopword removal
main_df['clean_sw'] = main_df['clean_double_ws'].apply(clean_stopwords)
main_df['clean_sw']

Unnamed: 0,clean_sw
0,jakarta kompas com perusahaan memproduksi wada...
1,jakarta kompas com platform ketenagakerjaan jo...
2,jakarta kompas com rangka memperingati perhubu...
3,jakarta kompas com pemilik kendaraan bermotor ...
4,jakarta kompas com pasar modal memegang peran ...
5,jakarta kompas com pt kcic permohonan maaf pem...
6,kompas com negara tetangga indonesia seberang ...
7,klaten kompas com motor mogok kehabisan bensin...
8,jakarta kompas com pebalap red bull ktm jack m...
9,jakarta kompas com damri mencatat penumpang ko...


### Stemming with Sastrawi

In [64]:
# Add a column holding the article text after stemming
main_df['desc_clean_stem'] = main_df['clean_sw'].apply(sastrawistemmer)
main_df['desc_clean_stem']

100%|██████████| 152/152 [00:05<00:00, 27.31it/s]
100%|██████████| 157/157 [00:11<00:00, 13.88it/s]
100%|██████████| 149/149 [00:11<00:00, 12.49it/s]
100%|██████████| 175/175 [00:04<00:00, 36.78it/s]
100%|██████████| 205/205 [00:03<00:00, 51.58it/s]
100%|██████████| 162/162 [00:05<00:00, 31.50it/s]
100%|██████████| 256/256 [00:04<00:00, 51.38it/s]
100%|██████████| 142/142 [00:02<00:00, 61.08it/s]
100%|██████████| 120/120 [00:06<00:00, 19.56it/s]
100%|██████████| 186/186 [00:05<00:00, 33.47it/s]


Unnamed: 0,desc_clean_stem
0,jakarta kompas com usaha produksi wadah makan ...
1,jakarta kompas com platform ketenagakerjaan jo...
2,jakarta kompas com rangka ingat hubung nasiona...
3,jakarta kompas com milik kendara motor wajib b...
4,jakarta kompas com pasar modal pegang peran ek...
5,jakarta kompas com pt kcic mohon maaf batal ja...
6,kompas com negara tetangga indonesia seberang ...
7,klaten kompas com motor mogok habis bensin mud...
8,jakarta kompas com balap red bull ktm jack mil...
9,jakarta kompas com damri catat tumpang kota pr...


## Building the VSM

In [65]:
# Load the saved TF-IDF vectorizer from file
filename = 'tfidf_vectorizer.sav'
with open(filename, 'rb') as f:
    tfidf_vectorizer = pickle.load(f)

In [66]:
corpus = main_df['desc_clean_stem']
tfidf = tfidf_vectorizer.transform(corpus)

tfidf.shape

(10, 3555)

In [67]:
vocabulary = tfidf_vectorizer.get_feature_names_out().tolist()

tfidf_df = pd.DataFrame(tfidf.toarray(), columns=vocabulary)
tfidf_df.insert(0, 'Kategori Berita', main_df['Kategori Berita'])
tfidf_df

Unnamed: 0,Kategori Berita,aaion,aali,abadi,abai,abenkh,abnormal,absurd,ac,acapkali,...,za,zad,zag,zaman,zarco,zenix,zero,zig,zigzag,zona
0,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,OTOMOTIF,0.0,0.0,0.0,0.079912,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,MONEY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,OTOMOTIF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Preparing Data

### Encode Label
The **[Kategori Berita]** column is encoded at this stage because its values are still categorical (words), so they must be converted to numbers before they can be passed to the model.

In [68]:
# use label_encoder to convert the category names into numbers
label_encoder = preprocessing.LabelEncoder()
tfidf_df['Kategori Berita'] = label_encoder.fit_transform(tfidf_df['Kategori Berita'])

tfidf_df

Unnamed: 0,Kategori Berita,aaion,aali,abadi,abai,abenkh,abnormal,absurd,ac,acapkali,...,za,zad,zag,zaman,zarco,zenix,zero,zig,zigzag,zona
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,0.0,0.0,0.0,0.079912,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
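
Comparing the two tables shows that MONEY became 0 and OTOMOTIF became 1. The sketch below, using a hypothetical `labels` list, shows how that mapping can be read from the encoder and reversed. Note the assumption involved: an encoder fitted on the test labels matches the one used in training only if both see the same set of classes, since `LabelEncoder` assigns codes in sorted order of the classes it is fitted on.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical labels mirroring the two categories in this notebook
labels = ["MONEY", "OTOMOTIF", "MONEY", "OTOMOTIF"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels).tolist()

# classes_ is sorted, so its positions are the integer codes
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}

# turn integer predictions back into category names
decoded = encoder.inverse_transform([0, 1]).tolist()

print(encoded)
print(mapping)
print(decoded)
```

If the training notebook saved its fitted encoder, loading that object instead of fitting a new one would remove the assumption about matching classes.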


# Testing Data

In [69]:
# Load the saved Logistic Regression model from file
filename = 'lr_model.sav'
with open(filename, 'rb') as f:
    lr_model = pickle.load(f)

In [70]:
y_test = tfidf_df['Kategori Berita']
x_test = tfidf_df.drop(['Kategori Berita'], axis=1)
y_pred = lr_model.predict(x_test)

print(y_pred)

[0 0 0 0 0 1 0 1 1 0]


In [71]:
# compare the actual and predicted values
a = pd.DataFrame({'Actual value': y_test, 'Predicted value':y_pred})
a

Unnamed: 0,Actual value,Predicted value
0,0,0
1,0,0
2,1,0
3,1,0
4,0,0
5,0,1
6,0,0
7,1,1
8,1,1
9,1,0
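
The table above can be summarised with a few numbers. The following dependency-free sketch copies the ten actual and predicted labels from that table (0 = MONEY, 1 = OTOMOTIF) and derives a confusion matrix, the accuracy, and the per-class recall. The variable names are introduced here for illustration only.

```python
# values copied from the "Actual value" / "Predicted value" table
# (0 = MONEY, 1 = OTOMOTIF)
y_true = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]

# confusion matrix as nested lists: rows are actual labels, columns are predictions
matrix = [[0, 0], [0, 0]]
for actual, predicted in zip(y_true, y_hat):
    matrix[actual][predicted] += 1

accuracy = sum(a == p for a, p in zip(y_true, y_hat)) / len(y_true)

# recall per class: share of each actual class that was predicted correctly
recall_money = matrix[0][0] / sum(matrix[0])
recall_otomotif = matrix[1][1] / sum(matrix[1])

print(matrix)                         # [[4, 1], [3, 2]]
print(accuracy)                       # 0.6
print(recall_money, recall_otomotif)  # 0.8 0.4
```

On these ten articles the model labels 6 correctly and misses three of the five OTOMOTIF articles. Passing `y_test` and `y_pred` to the `confusion_matrix` and `classification_report` functions imported at the top should report the same counts.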
