# **Proyek Analisis Sentimen Review Aplikasi Coinbase**
- **Nama:** Harry Mardika
- **Email:** harrymardika48@gmail.com
- **ID Dicoding:** hkacode

## Alur Kerja:
- Setup & Impor Library
- Fase 1: Akuisisi Data (Scraping) & Pembersihan Awal
- Fase 2: Pelabelan Sentimen & Penyeimbangan Dataset
- Fase 3: Pembersihan Data Mendalam
- Fase 4: Pembagian Data (Train, Validation, Test)
- Fase 5: Ekstraksi Fitur (TF-IDF & Tokenization/Padding untuk Deep Learning)
- Fase 6: Definisi & Pelatihan Model
    - Eksperimen 1: Artificial Neural Network (ANN) - PyTorch DirectML
    - Eksperimen 2: Convolutional Neural Network (CNN) - PyTorch DirectML
    - Eksperimen 3: Random Forest (RF) - Scikit-learn
    - Eksperimen 4: XGBoost - Scikit-learn
    - Eksperimen 5: LightGBM - Scikit-learn
- Fase 7: Evaluasi & Pemilihan Model Terbaik
- Fase 8: Inference (Pengujian pada Data Baru)


## Setup & Impor Library

In [1]:
# !pip install google-play-scraper pandas numpy nltk scikit-learn xgboost lightgbm matplotlib seaborn tensorflow torch torchvision torchaudio torch-directml

In [2]:
print("Tahap 1: Setup & Impor Library")

import pandas as pd
import numpy as np
import re
import nltk
import time
import os
import warnings
import random

# Scraping
from google_play_scraper import reviews_all, Sort, reviews

# Text Processing & Labeling
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Pastikan resource NLTK sudah diunduh
try:
    nltk.data.find('sentiment/vader_lexicon.zip')
except:
    print("Mengunduh resource NLTK (vader_lexicon)...")
    nltk.download('vader_lexicon')
try:
    nltk.data.find('corpora/stopwords')
except:
    print("Mengunduh resource NLTK (stopwords)...")
    nltk.download('stopwords')
try:
    nltk.data.find('corpora/wordnet')
except:
    print("Mengunduh resource NLTK (wordnet)...")
    nltk.download('wordnet')


# Machine Learning & Deep Learning
import fasttext
import fasttext.util
import xgboost as xgb
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# PyTorch & DirectML
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Coba impor torch_directml
try:
    import torch_directml
    if torch_directml.is_available():
        DEVICE = torch_directml.device()
        print(f"Menggunakan backend DirectML pada perangkat: {torch_directml.device_name(0)}")
    else:
        DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"DirectML tidak tersedia atau tidak terdeteksi. Menggunakan backend PyTorch default: {DEVICE}")
except ImportError:
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"torch_directml tidak terinstal. Menggunakan backend PyTorch default: {DEVICE}")


# Tokenization & Padding (untuk Deep Learning)
from tensorflow.keras.preprocessing.text import Tokenizer # Bisa pakai Keras tokenizer untuk kemudahan
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Visualisasi
import matplotlib.pyplot as plt
import seaborn as sns

# Pengaturan Lain
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', 150) # Tampilkan teks lebih panjang di Pandas
random.seed(42) # Seed untuk reproduktifitas
np.random.seed(42)
torch.manual_seed(42)
if DEVICE.type == 'cuda':
    torch.cuda.manual_seed_all(42)

print("Setup selesai.")

Tahap 1: Setup & Impor Library
Mengunduh resource NLTK (wordnet)...


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\harry\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Menggunakan backend DirectML pada perangkat: AMD Radeon RX 6800S
Setup selesai.


## Fase 1: Akuisisi Data (Scraping) & Pembersihan Awal

In [3]:
print("\nTahap 2: Akuisisi Data & Pembersihan Awal")

# Konfigurasi
APP_ID = 'com.coinbase.android'
LANG = 'en'
COUNTRY = 'us'
NUM_REVIEWS_TARGET = 300000
RAW_CSV_PATH = 'data/reviews_raw.csv'
LIGHT_CLEAN_CSV_PATH = 'data/reviews_lightly_cleaned.csv'


Tahap 2: Akuisisi Data & Pembersihan Awal


In [4]:
# --- Scraping ---
if not os.path.exists(RAW_CSV_PATH):
    print(f"Memulai scraping review untuk {APP_ID}...")
    start_time = time.time()
    try:
        all_reviews = reviews(
            APP_ID,
            lang=LANG,
            country=COUNTRY,
            sort=Sort.NEWEST,
            count=NUM_REVIEWS_TARGET
        )
        
        df_raw = pd.DataFrame(all_reviews[0])
        print(f"Scraping selesai. Mendapatkan {len(df_raw)} review.")
        # Ambil kolom yang relevan saja
        df_raw = df_raw[['reviewId', 'userName', 'content', 'score', 'thumbsUpCount', 'reviewCreatedVersion', 'at']]
        df_raw.rename(columns={'content': 'review_text', 'score': 'rating'}, inplace=True)
        df_raw.to_csv(RAW_CSV_PATH, index=False)
        print(f"Data mentah disimpan ke {RAW_CSV_PATH}")
    except Exception as e:
        print(f"Terjadi error saat scraping: {e}")
        df_raw = pd.DataFrame() # Buat dataframe kosong jika gagal
    end_time = time.time()
    print(f"Waktu scraping: {end_time - start_time:.2f} detik.")
else:
    print(f"Memuat data mentah dari {RAW_CSV_PATH}...")
    df_raw = pd.read_csv(RAW_CSV_PATH)
    print(f"Berhasil memuat {len(df_raw)} review mentah.")

Memuat data mentah dari data/reviews_raw.csv...
Berhasil memuat 172223 review mentah.


In [5]:
# --- Pembersihan Awal (Ringan) ---
if not df_raw.empty and not os.path.exists(LIGHT_CLEAN_CSV_PATH):
    print("Memulai pembersihan awal...")
    df_light_clean = df_raw.copy()

    # Hapus duplikat berdasarkan review_text (kadang ada review sama persis)
    df_light_clean.drop_duplicates(subset=['review_text'], inplace=True)

    # Hapus baris dengan review kosong atau NaN
    df_light_clean.dropna(subset=['review_text'], inplace=True)
    df_light_clean = df_light_clean[df_light_clean['review_text'].str.strip() != '']

    # Konversi tipe data jika perlu (rating harus integer)
    df_light_clean['rating'] = df_light_clean['rating'].astype(int)

    # Pembersihan teks ringan: lowercase
    df_light_clean['review_text_cleaned'] = df_light_clean['review_text'].str.lower()

    # (Opsional) Hapus URL sederhana jika ada
    df_light_clean['review_text_cleaned'] = df_light_clean['review_text_cleaned'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+', '', x, flags=re.MULTILINE))
    # (Opsional) Hapus mention @username jika ada
    df_light_clean['review_text_cleaned'] = df_light_clean['review_text_cleaned'].apply(lambda x: re.sub(r'\@\w+', '', x))

    print(f"Pembersihan awal selesai. Jumlah data setelah pembersihan: {len(df_light_clean)}")
    df_light_clean.to_csv(LIGHT_CLEAN_CSV_PATH, index=False)
    print(f"Data hasil pembersihan awal disimpan ke {LIGHT_CLEAN_CSV_PATH}")

elif os.path.exists(LIGHT_CLEAN_CSV_PATH):
     print(f"Memuat data hasil pembersihan awal dari {LIGHT_CLEAN_CSV_PATH}...")
     df_light_clean = pd.read_csv(LIGHT_CLEAN_CSV_PATH)
     # Pastikan kolom utama ada dan tipe data sesuai
     if 'review_text_cleaned' not in df_light_clean.columns and 'review_text' in df_light_clean.columns:
         df_light_clean['review_text_cleaned'] = df_light_clean['review_text'].str.lower() # Buat jika belum ada
     df_light_clean.dropna(subset=['review_text_cleaned'], inplace=True)
     df_light_clean['rating'] = df_light_clean['rating'].astype(int)
     print(f"Berhasil memuat {len(df_light_clean)} review hasil pembersihan awal.")
else:
    print("Tidak ada data mentah untuk diproses.")

Memuat data hasil pembersihan awal dari data/reviews_lightly_cleaned.csv...
Berhasil memuat 138244 review hasil pembersihan awal.


## Fase 2: Pelabelan Sentimen & Penyeimbangan Dataset

In [6]:
print("\nTahap 3: Pelabelan Sentimen & Penyeimbangan Dataset")

BALANCED_DATA_COUNT = 30000 # Jumlah sampel per kelas
FINAL_CLEAN_CSV_PATH = 'data/reviews_final_cleaned_balanced.csv'


Tahap 3: Pelabelan Sentimen & Penyeimbangan Dataset


In [7]:
# --- Pelabelan Sentimen (menggunakan VADER) ---
if 'sentiment_label' not in df_light_clean.columns:
    print("Melakukan pelabelan sentimen menggunakan VADER...")
    analyzer = SentimentIntensityAnalyzer()

    def get_vader_sentiment(text):
        vs = analyzer.polarity_scores(str(text)) # Pastikan input string
        return vs['compound'] # Menggunakan compound score

    df_light_clean['polarity_score'] = df_light_clean['review_text_cleaned'].apply(get_vader_sentiment)
    print("Pelabelan selesai.")

    # --- Penyeimbangan Dataset ---
    print(f"Menyeimbangkan dataset: {BALANCED_DATA_COUNT} sampel per kelas (Positif, Negatif, Netral)...")

    # Kriteria:
    # Positif: Polarity > 0.1 DAN rating >= 4 (prioritaskan rating tinggi)
    # Negatif: Polarity < -0.1 DAN rating <= 2 (prioritaskan rating rendah)
    # Netral: Polarity antara -0.1 dan 0.1 ATAU rating == 3 (utamakan rating 3 untuk netral)

    # Urutkan data untuk memudahkan pengambilan sampel ekstrem
    df_light_clean_sorted = df_light_clean.sort_values(by=['rating', 'polarity_score'], ascending=[True, True])

    # Ambil data Negatif
    df_neg = df_light_clean_sorted[
        (df_light_clean_sorted['polarity_score'] < -0.1) & (df_light_clean_sorted['rating'] <= 2)
    ].head(BALANCED_DATA_COUNT)
    df_neg['sentiment_label'] = 'negative'
    print(f"Data negatif ditemukan dan diambil: {len(df_neg)}")

    # Ambil data Positif (urutkan terbalik)
    df_light_clean_sorted = df_light_clean.sort_values(by=['rating', 'polarity_score'], ascending=[False, False])
    df_pos = df_light_clean_sorted[
        (df_light_clean_sorted['polarity_score'] > 0.1) & (df_light_clean_sorted['rating'] >= 4)
    ].head(BALANCED_DATA_COUNT)
    df_pos['sentiment_label'] = 'positive'
    print(f"Data positif ditemukan dan diambil: {len(df_pos)}")

    # Ambil data Netral (prioritaskan rating 3, lalu skor polaritas mendekati 0)
    # Urutkan berdasarkan kedekatan polaritas ke 0 untuk rating 3, lalu untuk rating lain
    df_light_clean['abs_polarity'] = df_light_clean['polarity_score'].abs()
    df_light_clean_sorted = df_light_clean.sort_values(by=['rating', 'abs_polarity'], ascending=[True, True]) # Sort rating ascending
    
    # Prioritas utama: rating 3
    df_neu_rating3 = df_light_clean_sorted[df_light_clean_sorted['rating'] == 3].head(BALANCED_DATA_COUNT)
    
    # Jika rating 3 kurang, tambahkan dari yang polaritasnya mendekati nol terlepas dari rating lain
    remaining_neutral_needed = BALANCED_DATA_COUNT - len(df_neu_rating3)
    df_neu_other = pd.DataFrame() # Inisialisasi df kosong
    if remaining_neutral_needed > 0:
        # Ambil data di luar yang sudah dipilih (negatif/positif/rating3)
        ids_to_exclude = set(df_neg['reviewId']) | set(df_pos['reviewId']) | set(df_neu_rating3['reviewId'])
        df_potential_neutral = df_light_clean[~df_light_clean['reviewId'].isin(ids_to_exclude)]
        # Urutkan berdasarkan kedekatan polaritas ke 0
        df_potential_neutral = df_potential_neutral.sort_values(by=['abs_polarity'], ascending=True)
        df_neu_other = df_potential_neutral.head(remaining_neutral_needed)

    df_neu = pd.concat([df_neu_rating3, df_neu_other], ignore_index=True).head(BALANCED_DATA_COUNT)
    df_neu['sentiment_label'] = 'neutral'
    print(f"Data netral ditemukan dan diambil: {len(df_neu)}")

    # Gabungkan semua data
    df_balanced = pd.concat([df_pos, df_neg, df_neu], ignore_index=True)

    # Periksa jumlah akhir per kelas
    print("\nJumlah data per kelas setelah penyeimbangan:")
    print(df_balanced['sentiment_label'].value_counts())

    # Acak dataset
    df_final = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

    # Pilih kolom yang relevan untuk disimpan
    df_final = df_final[['reviewId', 'review_text', 'review_text_cleaned', 'rating', 'polarity_score', 'sentiment_label']]

    # Simpan hasil sementara sebelum cleaning mendalam (jika perlu checkpoint)
    df_final.to_csv('data/reviews_balanced_uncleaned.csv', index=False)
    print("Data balanced (belum cleaned mendalam) disimpan.")

else:
    print("Kolom 'sentiment_label' sudah ada atau proses balancing sudah dilakukan.")
    print(f"Memuat data dari file yang mungkin sudah ada ({FINAL_CLEAN_CSV_PATH})...")
    if os.path.exists(FINAL_CLEAN_CSV_PATH):
        df_final = pd.read_csv(FINAL_CLEAN_CSV_PATH)
        print(f"Berhasil memuat {len(df_final)} data dari {FINAL_CLEAN_CSV_PATH}")
        # Pastikan kolom ada
        if 'review_text_cleaned' not in df_final.columns and 'review_text' in df_final.columns:
             df_final['review_text_cleaned'] = df_final['review_text'] # Gunakan teks asli jika cleaned belum ada
        if 'sentiment_label' not in df_final.columns:
            print("WARNING: Kolom 'sentiment_label' tidak ditemukan di file CSV. Proses pelabelan mungkin perlu diulang.")
            # Opsional: bisa tambahkan kode untuk melabel ulang di sini jika diperlukan
    else:
        print(f"File {FINAL_CLEAN_CSV_PATH} tidak ditemukan. Perlu menjalankan langkah pelabelan dan balancing.")

# Pastikan df_final terdefinisi sebelum lanjut
if 'df_final' not in locals():
    print("ERROR: Dataframe 'df_final' tidak terdefinisi. Silakan jalankan ulang bagian sebelumnya.")

Melakukan pelabelan sentimen menggunakan VADER...
Pelabelan selesai.
Menyeimbangkan dataset: 30000 sampel per kelas (Positif, Negatif, Netral)...
Data negatif ditemukan dan diambil: 28132
Data positif ditemukan dan diambil: 30000
Data netral ditemukan dan diambil: 30000

Jumlah data per kelas setelah penyeimbangan:
sentiment_label
positive    30000
neutral     30000
negative    28132
Name: count, dtype: int64
Data balanced (belum cleaned mendalam) disimpan.


## Fase 3: Pembersihan Data Mendalam

In [8]:
print("\nTahap 4: Pembersihan Data Mendalam")


Tahap 4: Pembersihan Data Mendalam


In [9]:
# Cek apakah pembersihan mendalam sudah pernah dilakukan (misal, ada kolom 'review_text_deep_cleaned')
# Jika belum atau jika file final belum ada, lakukan pembersihan
NEEDS_DEEP_CLEANING = True
if os.path.exists(FINAL_CLEAN_CSV_PATH):
     # Coba baca header saja untuk cek kolom
     try:
         header_check = pd.read_csv(FINAL_CLEAN_CSV_PATH, nrows=0)
         if 'review_text_deep_cleaned' in header_check.columns:
             print("Kolom 'review_text_deep_cleaned' sudah ada. Melewati pembersihan mendalam.")
             df_final = pd.read_csv(FINAL_CLEAN_CSV_PATH) # Muat ulang data yang sudah bersih
             # Pastikan tidak ada NaN di kolom teks yang akan digunakan
             df_final.dropna(subset=['review_text_deep_cleaned'], inplace=True)
             NEEDS_DEEP_CLEANING = False
         else:
             print("Kolom 'review_text_deep_cleaned' tidak ditemukan. Melakukan pembersihan mendalam...")
     except Exception as e:
         print(f"Error saat memeriksa file {FINAL_CLEAN_CSV_PATH}: {e}. Melakukan pembersihan mendalam...")

if NEEDS_DEEP_CLEANING:
    print("Memulai pembersihan data mendalam...")
    start_time_clean = time.time()

    # Inisialisasi lemmatizer dan stopwords
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # Tambahkan stopwords custom jika perlu (misal, nama aplikasi)
    custom_stopwords = {'coinbase', 'app', 'application'}
    stop_words.update(custom_stopwords)

    def deep_clean_text(text):
        if not isinstance(text, str):
            return "" # Kembalikan string kosong jika bukan string

        text = text.lower() # Lowercase (sudah dilakukan di light clean, tapi pastikan lagi)
        text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Hapus URL
        text = re.sub(r'\@\w+', '', text) # Hapus mention
        text = re.sub(r'#\w+', '', text) # Hapus hashtag
        text = re.sub(r'[^\w\s]', '', text) # Hapus Punctuation
        text = re.sub(r'\d+', '', text) # Hapus angka
        text = re.sub(r'\s+', ' ', text).strip() # Hapus spasi berlebih

        # Tokenisasi (opsional di sini, bisa juga saat feature extraction)
        # words = text.split()

        # Hapus stopwords dan lakukan lemmatisasi
        words = text.split() # Tokenisasi sederhana berdasarkan spasi
        cleaned_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and len(word) > 1] # Hanya kata > 1 huruf

        return ' '.join(cleaned_words)

    # Terapkan fungsi cleaning ke kolom 'review_text_cleaned' (hasil light clean)
    # atau ke 'review_text' jika 'review_text_cleaned' tidak ada
    source_column = 'review_text_cleaned' if 'review_text_cleaned' in df_final.columns else 'review_text'
    df_final['review_text_deep_cleaned'] = df_final[source_column].apply(deep_clean_text)

    # Hapus baris yang teksnya menjadi kosong setelah cleaning
    df_final = df_final[df_final['review_text_deep_cleaned'].str.strip() != '']
    df_final.dropna(subset=['review_text_deep_cleaned'], inplace=True) # Pastikan tidak ada NaN

    print(f"Pembersihan mendalam selesai. Jumlah data setelah pembersihan: {len(df_final)}")
    end_time_clean = time.time()
    print(f"Waktu pembersihan mendalam: {end_time_clean - start_time_clean:.2f} detik.")

    # Simpan hasil akhir yang sudah bersih dan seimbang
    df_final.to_csv(FINAL_CLEAN_CSV_PATH, index=False)
    print(f"Data final (cleaned & balanced) disimpan ke {FINAL_CLEAN_CSV_PATH}")

# Tampilkan beberapa contoh hasil cleaning
print("\nContoh hasil pembersihan mendalam:")
print(df_final[['review_text', 'review_text_deep_cleaned', 'sentiment_label']].head())

# Cek kembali jumlah data per kelas setelah deep cleaning (mungkin ada yg hilang)
print("\nJumlah data per kelas setelah pembersihan mendalam:")
print(df_final['sentiment_label'].value_counts())

Kolom 'review_text_deep_cleaned' sudah ada. Melewati pembersihan mendalam.

Contoh hasil pembersihan mendalam:
                                                                                                                                             review_text  \
0                       If you cant verified some customers name and address through utility bill just close this stupid app because is a waste of time.   
1                                                                                                                                                   Uwuu   
2                                                                                                                        4 star for too much network fee   
3  Great app , to the point . Buy Crypto. Easily create account, and in 1,2,3 your buying your crypto. If new to crypto, the app may require addition...   
4                                                                                                            

## Fase 4: Pembagian Data (Train, Validation, Test)

In [10]:
print("\nTahap 5: Pembagian Data")

# Pastikan df_final sudah siap dan memiliki kolom yang benar
if 'review_text_deep_cleaned' not in df_final.columns or 'sentiment_label' not in df_final.columns:
    print("ERROR: Kolom 'review_text_deep_cleaned' atau 'sentiment_label' tidak ditemukan. Proses tidak dapat dilanjutkan.")
    exit()


Tahap 5: Pembagian Data


In [11]:
# Pisahkan fitur (X) dan target (y)
X = df_final['review_text_deep_cleaned']
y = df_final['sentiment_label']

In [12]:
# Encode label menjadi numerik (Penting untuk sebagian besar model ML/DL)
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
# Simpan mapping label untuk interpretasi nanti
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(f"Mapping Label: {label_mapping}") # Misal: {'negative': 0, 'neutral': 1, 'positive': 2}

Mapping Label: {'negative': 0, 'neutral': 1, 'positive': 2}


In [13]:
# Pembagian data: 70% Train, 20% Validation, 10% Test
# Stratify=y_encoded untuk menjaga proporsi kelas di setiap set
TRAIN_SIZE = 0.70
VALIDATION_SIZE = 0.20
TEST_SIZE = 0.10

In [14]:
# Bagi data menjadi Train (70%) dan Temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y_encoded,
    test_size=(1 - TRAIN_SIZE),
    random_state=42,
    stratify=y_encoded
)

In [15]:
# Bagi data Temp menjadi Validation (20% dari total) dan Test (10% dari total)
# Perhatikan perhitungan test_size di sini relatif terhadap ukuran temp
relative_test_size = TEST_SIZE / (VALIDATION_SIZE + TEST_SIZE)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=relative_test_size,
    random_state=42,
    stratify=y_temp
)

print(f"Ukuran Data Latih (Train): {len(X_train)} sampel")
print(f"Ukuran Data Validasi (Validation): {len(X_val)} sampel")
print(f"Ukuran Data Uji (Test): {len(X_test)} sampel")

Ukuran Data Latih (Train): 61339 sampel
Ukuran Data Validasi (Validation): 17526 sampel
Ukuran Data Uji (Test): 8763 sampel


In [16]:
# Verifikasi proporsi kelas di setiap set (opsional tapi bagus)
print("\nProporsi kelas di set Latih:")
print(pd.Series(y_train).value_counts(normalize=True))
print("\nProporsi kelas di set Validasi:")
print(pd.Series(y_val).value_counts(normalize=True))
print("\nProporsi kelas di set Uji:")
print(pd.Series(y_test).value_counts(normalize=True))


Proporsi kelas di set Latih:
2    0.342360
1    0.336686
0    0.320954
Name: proportion, dtype: float64

Proporsi kelas di set Validasi:
2    0.342349
1    0.336700
0    0.320952
Name: proportion, dtype: float64

Proporsi kelas di set Uji:
2    0.342349
1    0.336643
0    0.321009
Name: proportion, dtype: float64


## Fase 5: Ekstraksi Fitur

In [17]:
print("\nTahap 6: Ekstraksi Fitur")


Tahap 6: Ekstraksi Fitur


In [18]:
# --- 6.1 Ekstraksi Fitur TF-IDF (untuk RF, XGBoost, LightGBM) ---
print("Melakukan ekstraksi fitur TF-IDF...")
MAX_FEATURES_TFIDF = 15000 # Batasi jumlah fitur untuk efisiensi

tfidf_vectorizer = TfidfVectorizer(
    max_features=MAX_FEATURES_TFIDF,
    ngram_range=(1, 2),
    stop_words='english',
    lowercase=True,
    max_df=0.8,
    min_df=0.003 
)

# Fit hanya pada data latih
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
# Transform data validasi dan test
X_val_tfidf = tfidf_vectorizer.transform(X_val)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"Bentuk matriks TF-IDF Latih: {X_train_tfidf.shape}")
print(f"Bentuk matriks TF-IDF Validasi: {X_val_tfidf.shape}")
print(f"Bentuk matriks TF-IDF Uji: {X_test_tfidf.shape}")

Melakukan ekstraksi fitur TF-IDF...
Bentuk matriks TF-IDF Latih: (61339, 563)
Bentuk matriks TF-IDF Validasi: (17526, 563)
Bentuk matriks TF-IDF Uji: (8763, 563)


In [19]:
# --- 6.2 Tokenisasi dan Padding (untuk ANN/CNN PyTorch) ---
print("\nMelakukan tokenisasi dan padding untuk model Deep Learning...")

# Parameter Tokenizer dan Padding
MAX_WORDS = 15000
MAX_LEN = 128

# Inisialisasi dan fit Tokenizer Keras pada data latih
tokenizer = Tokenizer(num_words=MAX_WORDS, oov_token='<OOV>') # OOV token untuk kata tak dikenal
tokenizer.fit_on_texts(X_train)

# Konversi teks ke sekuens integer
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_val_seq = tokenizer.texts_to_sequences(X_val)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Padding sekuens agar panjangnya sama
X_train_pad = pad_sequences(X_train_seq, maxlen=MAX_LEN, padding='post', truncating='post')
X_val_pad = pad_sequences(X_val_seq, maxlen=MAX_LEN, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=MAX_LEN, padding='post', truncating='post')

# Ukuran vocabulary aktual (+1 untuk padding/unknown jika 0 tidak dipakai OOV)
vocab_size = len(tokenizer.word_index) + 1
print(f"Ukuran Vocabulary: {vocab_size}")
print(f"Panjang Sekuens (Max Len): {MAX_LEN}")
print(f"Bentuk data Latih (padded): {X_train_pad.shape}")
print(f"Bentuk data Validasi (padded): {X_val_pad.shape}")
print(f"Bentuk data Uji (padded): {X_test_pad.shape}")


Melakukan tokenisasi dan padding untuk model Deep Learning...
Ukuran Vocabulary: 23639
Panjang Sekuens (Max Len): 128
Bentuk data Latih (padded): (61339, 128)
Bentuk data Validasi (padded): (17526, 128)
Bentuk data Uji (padded): (8763, 128)


In [None]:
# --- 6.3 FastText Embedding (untuk ANN/CNN PyTorch) ---
# Lokasi untuk menyimpan model FastText yang diunduh
FASTTEXT_MODEL_PATH = 'cc.en.300.bin' 
EMBEDDING_DIM = 300

In [24]:
# Fungsi untuk mengunduh dan memuat model FastText
def load_fasttext_model():
    if not os.path.exists(FASTTEXT_MODEL_PATH):
        print(f"Mengunduh model FastText...")
        fasttext.util.download_model('en', if_exists='ignore')
        os.rename('cc.en.300.bin', FASTTEXT_MODEL_PATH)
    
    print(f"Memuat model FastText dari {FASTTEXT_MODEL_PATH}...")
    ft_model = fasttext.load_model(FASTTEXT_MODEL_PATH)
    return ft_model

# Muat model FastText
ft_model = load_fasttext_model()

Mengunduh model FastText...
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
Memuat model FastText dari cc.id.300.bin...




In [None]:
# Mengubah ukuran model FastText untuk efisiensi
fasttext.util.reduce_model(ft_model, 128)
ft_model.get_dimension()

128

In [25]:
# Buat matrix embedding berdasarkan vocabulary yang sudah dibangun
def create_embedding_matrix(ft_model, word_index, embedding_dim):
    # Inisialisasi matrix embedding dengan nilai random kecil
    embedding_matrix = np.random.uniform(-0.25, 0.25, (len(word_index) + 1, embedding_dim))
    
    # Set embedding untuk padding token ke 0
    embedding_matrix[0] = np.zeros(embedding_dim)
    
    # Isi matrix dengan vector FastText untuk setiap kata dalam vocabulary
    print("Membangun embedding matrix berdasarkan FastText...")
    found_words = 0
    for word, i in word_index.items():
        try:
            embedding_vector = ft_model.get_word_vector(word.lower())  # Lowercase for better matching
            embedding_matrix[i] = embedding_vector
            found_words += 1
        except:
            # Jika kata tidak ditemukan, gunakan vector random yang sudah diinisialisasi
            pass
    
    print(f"Embedding matrix selesai dibuat dengan dimensi: {embedding_matrix.shape}")
    print(f"Kata yang ditemukan dalam FastText: {found_words}/{len(word_index)} ({found_words/len(word_index)*100:.2f}%)")
    return embedding_matrix

# Buat embedding matrix
embedding_matrix = create_embedding_matrix(ft_model, tokenizer.word_index, EMBEDDING_DIM)

Membangun embedding matrix berdasarkan FastText...
Embedding matrix selesai dibuat dengan dimensi: (23639, 300)
Kata yang ditemukan dalam FastText: 23638/23638 (100.00%)


## Fase 6: Definisi & Pelatihan Model

### Persiapan Sebelum Training

In [26]:
print("\nTahap 7: Definisi & Pelatihan Model")

NUM_CLASSES = len(label_mapping)
history = {} # Dictionary untuk menyimpan riwayat pelatihan


Tahap 7: Definisi & Pelatihan Model


In [27]:
# --- Helper Function untuk Plotting Hasil Training Deep Learning ---
def plot_history(hist, title):
    plt.figure(figsize=(12, 4))

    # Plot Akurasi
    plt.subplot(1, 2, 1)
    plt.plot(hist['epoch'], hist['train_accuracy'], label='Train Accuracy')
    plt.plot(hist['epoch'], hist['val_accuracy'], label='Validation Accuracy')
    plt.title(f'{title} - Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)

    # Plot Loss
    plt.subplot(1, 2, 2)
    plt.plot(hist['epoch'], hist['train_loss'], label='Train Loss')
    plt.plot(hist['epoch'], hist['val_loss'], label='Validation Loss')
    plt.title(f'{title} - Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.show()

# --- Helper Function untuk Training Loop PyTorch ---
def train_pytorch_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, model_name):
    print(f"\nMemulai pelatihan model: {model_name}")
    model.to(DEVICE) # Pindahkan model ke device (CPU/GPU/DirectML)
    best_val_accuracy = 0.0
    train_losses, val_losses = [], []
    train_accuracies, val_accuracies = [], []
    epochs_list = []

    for epoch in range(num_epochs):
        model.train() # Set model ke mode training
        running_loss = 0.0
        correct_train = 0
        total_train = 0

        start_epoch_time = time.time()
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)

            optimizer.zero_grad() # Reset gradients
            outputs = model(inputs) # Forward pass
            loss = criterion(outputs, labels) # Hitung loss
            loss.backward() # Backward pass
            optimizer.step() # Update weights

            running_loss += loss.item() * inputs.size(0)
            _, predicted = torch.max(outputs.data, 1)
            total_train += labels.size(0)
            correct_train += (predicted == labels).sum().item()

            # Cetak progres mini-batch (opsional)
            # if (i+1) % 100 == 0:
            #     print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

        epoch_loss = running_loss / len(train_loader.dataset)
        epoch_acc = correct_train / total_train
        train_losses.append(epoch_loss)
        train_accuracies.append(epoch_acc)

        # Validasi setelah setiap epoch
        model.eval() # Set model ke mode evaluasi
        val_loss = 0.0
        correct_val = 0
        total_val = 0
        with torch.no_grad(): # Tidak perlu hitung gradient saat validasi
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(DEVICE), labels.to(DEVICE)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs.data, 1)
                total_val += labels.size(0)
                correct_val += (predicted == labels).sum().item()

        epoch_val_loss = val_loss / len(val_loader.dataset)
        epoch_val_acc = correct_val / total_val
        val_losses.append(epoch_val_loss)
        val_accuracies.append(epoch_val_acc)
        epochs_list.append(epoch + 1)

        end_epoch_time = time.time()
        print(f"Epoch [{epoch+1}/{num_epochs}] - Time: {end_epoch_time - start_epoch_time:.2f}s - "
              f"Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.4f} - "
              f"Val Loss: {epoch_val_loss:.4f}, Val Acc: {epoch_val_acc:.4f}")

        # Simpan model jika akurasi validasi membaik
        if epoch_val_acc > best_val_accuracy:
            best_val_accuracy = epoch_val_acc
            # torch.save(model.state_dict(), f'{model_name}_best.pth')
            # print(f"Model terbaik disimpan dengan Val Acc: {best_val_accuracy:.4f}")

    print(f"Pelatihan {model_name} selesai.")
    return {
        'epoch': epochs_list,
        'train_loss': train_losses, 'train_accuracy': train_accuracies,
        'val_loss': val_losses, 'val_accuracy': val_accuracies,
        'model': model # Kembalikan model yang sudah terlatih
    }

In [None]:
# --- PyTorch Dataset Class ---
class SentimentDataset(Dataset):
    def __init__(self, sequences, labels):
        self.sequences = torch.tensor(sequences, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

# --- Buat DataLoaders untuk PyTorch ---
BATCH_SIZE = 128

train_dataset = SentimentDataset(X_train_pad, y_train)
val_dataset = SentimentDataset(X_val_pad, y_val)
test_dataset = SentimentDataset(X_test_pad, y_test)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)

In [29]:
# --- Persiapan untuk Model Scikit-learn ---
# Konversi sparse matrix TF-IDF ke format yang diterima XGBoost/LGBM jika perlu
X_train_tfidf_dense = X_train_tfidf.toarray()
X_val_tfidf_dense = X_val_tfidf.toarray()
X_test_tfidf_dense = X_test_tfidf.toarray()

### Model 1: Artificial Neural Network (ANN)

In [37]:
# --- Eksperimen 1: Artificial Neural Network (ANN) - PyTorch ---
print("\n--- Eksperimen 1: ANN dengan FastText Embeddings ---")

# Parameter ANN
EMBEDDING_DIM_ANN = 128
HIDDEN_DIM_ANN = 256
DROPOUT_ANN = 0.4
LEARNING_RATE_ANN = 1e-3
EPOCHS_ANN = 10

class FastTextANN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, embedding_matrix, hidden_dim, output_dim, dropout):
        super().__init__()
        
        # Inisialisasi embedding layer dengan bobot FastText
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Set embedding weights dengan nilai FastText dan freeze (tidak diupdate saat training)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float))
        self.embedding.weight.requires_grad = False  # Freeze embeddings
        
        self.fc1 = nn.Linear(embedding_dim, hidden_dim)
        self.leaky_relu = nn.LeakyReLU()
        self.dropout = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        # Tambahan: layer normalization untuk stabilitas
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, text):
        # text shape: (batch_size, seq_len)
        embedded = self.embedding(text)
        # embedded shape: (batch_size, seq_len, embedding_dim)
        
        # Rata-ratakan embedding sepanjang sequence length
        pooled = embedded.mean(dim=1)
        # pooled shape: (batch_size, embedding_dim)
        
        hidden = self.leaky_relu(self.fc1(pooled))
        hidden = self.layer_norm(hidden)
        hidden_drop = self.dropout(hidden)
        output = self.fc2(hidden_drop)
        # output shape: (batch_size, output_dim)
        return output

model_fasttext_ann = FastTextANN(vocab_size, EMBEDDING_DIM, embedding_matrix, HIDDEN_DIM_ANN, NUM_CLASSES, DROPOUT_ANN)
optimizer_ann = optim.AdamW(model_fasttext_ann.parameters(), lr=LEARNING_RATE_ANN, weight_decay=1e-5)
criterion_ann = nn.CrossEntropyLoss()

history['fasttext_ann'] = train_pytorch_model(model_fasttext_ann, train_loader, val_loader, criterion_ann, optimizer_ann, EPOCHS_ANN, "FastText-ANN")
plot_history(history['fasttext_ann'], "FastText ANN Training History")


--- Eksperimen 1: ANN dengan FastText Embeddings ---

Memulai pelatihan model: FastText-ANN
Epoch [1/10] - Time: 3.98s - Train Loss: 1.2681, Train Acc: 0.3386 - Val Loss: 1.2002, Val Acc: 0.3426
Epoch [2/10] - Time: 3.85s - Train Loss: 1.2623, Train Acc: 0.3428 - Val Loss: 1.2002, Val Acc: 0.3426
Epoch [3/10] - Time: 4.08s - Train Loss: 1.2635, Train Acc: 0.3382 - Val Loss: 1.2002, Val Acc: 0.3426


KeyboardInterrupt: 

### Model 2: Convolutional Neural Network (CNN)

In [None]:
# --- Eksperimen 2: Convolutional Neural Network (CNN) - PyTorch ---
print("\n--- Eksperimen 2: CNN dengan FastText Embeddings ---")

# Parameter CNN
EMBEDDING_DIM_CNN = 128
NUM_FILTERS = 256
FILTER_SIZES = [2, 3, 4, 5]
DROPOUT_CNN = 0.4
LEARNING_RATE_CNN = 1e-4
EPOCHS_CNN = 15

class FastTextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, embedding_matrix, n_filters, filter_sizes, output_dim, dropout, pad_idx=0):
        super().__init__()
        
        # Inisialisasi embedding layer dengan bobot FastText
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float))
        self.embedding.weight.requires_grad = False  # Freeze embeddings
        
        # Buat layer Conv1d untuk setiap ukuran filter
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=embedding_dim,
                     out_channels=n_filters,
                     kernel_size=fs)
            for fs in filter_sizes
        ])
        
        # Tambahan: Batch Normalization setelah konvolusi
        self.batch_norms = nn.ModuleList([
            nn.BatchNorm1d(n_filters)
            for _ in filter_sizes
        ])
        
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.leaky_relu = nn.LeakyReLU(0.1)

    def forward(self, text):
        # text shape: (batch_size, seq_len)
        embedded = self.dropout(self.embedding(text))
        # embedded shape: (batch_size, seq_len, embedding_dim)

        # Conv1d mengharapkan input (batch_size, channels, seq_len)
        embedded = embedded.permute(0, 2, 1)
        # embedded shape: (batch_size, embedding_dim, seq_len)

        # Terapkan konvolusi, batch norm, dan pooling
        conved = [self.leaky_relu(bn(conv(embedded))) 
                 for conv, bn in zip(self.convs, self.batch_norms)]
        
        # Max pooling over time
        pooled = [torch.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        
        # Gabungkan hasil pooling dari semua filter
        cat = self.dropout(torch.cat(pooled, dim=1))
        
        output = self.fc(cat)
        return output

# Tentukan pad_idx jika Tokenizer Keras menggunakan 0 untuk padding
pad_token_index = 0
model_fasttext_cnn = FastTextCNN(vocab_size, EMBEDDING_DIM, embedding_matrix, NUM_FILTERS, FILTER_SIZES, NUM_CLASSES, DROPOUT_CNN, pad_idx=pad_token_index)
optimizer_cnn = optim.AdamW(model_fasttext_cnn.parameters(), lr=LEARNING_RATE_CNN, weight_decay=1e-5)
criterion_cnn = nn.CrossEntropyLoss()

history['fasttext_cnn'] = train_pytorch_model(model_fasttext_cnn, train_loader, val_loader, criterion_cnn, optimizer_cnn, EPOCHS_CNN, "FastText-CNN")
plot_history(history['fasttext_cnn'], "FastText CNN Training History")


--- Eksperimen 2: CNN dengan FastText Embeddings ---

Memulai pelatihan model: FastText-CNN
Epoch [1/15] - Time: 58.53s - Train Loss: 3.7532, Train Acc: 0.3283 - Val Loss: 2.6515, Val Acc: 0.3400
Epoch [2/15] - Time: 57.10s - Train Loss: 3.7369, Train Acc: 0.3266 - Val Loss: 2.6956, Val Acc: 0.3407


KeyboardInterrupt: 

### Model 3: Random Forest (RF)

In [33]:
# --- Eksperimen 3: Random Forest (RF) ---
print("\n--- Eksperimen 3: Random Forest (RF) ---")
model_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,   
    min_samples_split=5, 
    min_samples_leaf=2,   
    random_state=42,
    n_jobs=-1             
)

print("Melatih Random Forest...")
start_time_rf = time.time()
model_rf.fit(X_train_tfidf, y_train)
end_time_rf = time.time()
print(f"Pelatihan RF selesai. Waktu: {end_time_rf - start_time_rf:.2f} detik.")

# Evaluasi cepat di data latih & validasi (untuk perbandingan kasar)
acc_train_rf = model_rf.score(X_train_tfidf, y_train)
acc_val_rf = model_rf.score(X_val_tfidf, y_val)
print(f"Akurasi RF - Train: {acc_train_rf:.4f}, Validation: {acc_val_rf:.4f}")
history['rf'] = {'model': model_rf, 'train_accuracy': acc_train_rf, 'val_accuracy': acc_val_rf}


--- Eksperimen 3: Random Forest (RF) ---
Melatih Random Forest...
Pelatihan RF selesai. Waktu: 14.33 detik.
Akurasi RF - Train: 0.9048, Validation: 0.8309


### Model 4: XGBoost

In [None]:
# --- Eksperimen 4: XGBoost ---
print("\n--- Eksperimen 4: XGBoost ---")

# Parameter XGBoost
params = {
    'objective': 'multi:softmax',
    'num_class': NUM_CLASSES,
    'max_depth': 7,
    'eta': 0.1,  # learning rate
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'tree_method': 'hist',
    'eval_metric': 'mlogloss',
    'random_state': 42
}

print("Melatih XGBoost...")
start_time_xgb = time.time()

# Konversi data ke format DMatrix XGBoost
dtrain = xgb.DMatrix(X_train_tfidf, label=y_train)
dval = xgb.DMatrix(X_val_tfidf, label=y_val)

# Train dengan early stopping
model_xgb = xgb.train(
    params,
    dtrain,
    num_boost_round=300,
    evals=[(dtrain, 'train'), (dval, 'validation')],
    early_stopping_rounds=20,
    verbose_eval=False  # Set True untuk lihat progres per iterasi
)

end_time_xgb = time.time()
print(f"Pelatihan XGBoost selesai. Waktu: {end_time_xgb - start_time_xgb:.2f} detik.")

# Evaluasi cepat
preds_train = model_xgb.predict(dtrain)
preds_val = model_xgb.predict(dval)
acc_train_xgb = (preds_train == y_train).mean()
acc_val_xgb = (preds_val == y_val).mean()
print(f"Akurasi XGBoost - Train: {acc_train_xgb:.4f}, Validation: {acc_val_xgb:.4f}")
history['xgb'] = {'model': model_xgb, 'train_accuracy': acc_train_xgb, 'val_accuracy': acc_val_xgb}


--- Eksperimen 4: XGBoost (Native API) ---
Melatih XGBoost (Native API)...
Pelatihan XGBoost (Native API) selesai. Waktu: 51.39 detik.
Akurasi XGBoost (Native API) - Train: 0.8776, Validation: 0.8386


### Model 5: LightGBM

In [None]:
# --- Eksperimen 5: LightGBM ---
print("\n--- Eksperimen 5: LightGBM ---")
model_lgbm = lgbm.LGBMClassifier(
    objective='multiclass',
    num_class=NUM_CLASSES,
    n_estimators=300,
    learning_rate=0.1,
    num_leaves=31, # Kontrol kompleksitas pohon (default)
    max_depth=-1,  # Default (-1 = tanpa batas)
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1,
    # Untuk GPU (jika terinstal versi GPU LightGBM & ada driver):
    # device = 'gpu',
    # gpu_platform_id = 0, # Sesuaikan jika perlu
    # gpu_device_id = 0    # Sesuaikan jika perlu
)

print("Melatih LightGBM...")
start_time_lgbm = time.time()
# Gunakan data validasi untuk early stopping
eval_set_lgbm = [(X_val_tfidf, y_val)]
callbacks = [
    # Perhatikan nama parameter early stopping berbeda di LightGBM
    # Gunakan lgbm.early_stopping jika mengimport lightgbm langsung
    # Di scikit-learn API, gunakan parameter fit
    # early_stopping_rounds=20 # Tidak langsung di constructor, tapi di fit
]

model_lgbm.fit(X_train_tfidf, y_train,
               eval_set=eval_set_lgbm,
               eval_metric='multi_logloss',
               # Callback untuk early stopping dalam API scikit-learn
               callbacks=[lgbm.early_stopping(stopping_rounds=20, verbose=False)]
              )
end_time_lgbm = time.time()
print(f"Pelatihan LightGBM selesai. Waktu: {end_time_lgbm - start_time_lgbm:.2f} detik.")

# Evaluasi cepat
acc_train_lgbm = model_lgbm.score(X_train_tfidf, y_train)
acc_val_lgbm = model_lgbm.score(X_val_tfidf, y_val)
print(f"Akurasi LightGBM - Train: {acc_train_lgbm:.4f}, Validation: {acc_val_lgbm:.4f}")
history['lgbm'] = {'model': model_lgbm, 'train_accuracy': acc_train_lgbm, 'val_accuracy': acc_val_lgbm}


--- Eksperimen 5: LightGBM ---
Melatih LightGBM...
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.145992 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 84737
[LightGBM] [Info] Number of data points in the train set: 61339, number of used features: 563
[LightGBM] [Info] Start training from score -1.136457
[LightGBM] [Info] Start training from score -1.088604
[LightGBM] [Info] Start training from score -1.071893
Pelatihan LightGBM selesai. Waktu: 10.34 detik.
Akurasi LightGBM - Train: 0.8831, Validation: 0.8460
