
# Fake News Detection (Bahasa Indonesia) — Baseline Notebook

This notebook sets up an end-to-end pipeline for **Fake News Detection** on Indonesian-language text:
- Load the data (`Data_latih.csv`, `Data_uji.csv`)
- Preprocessing (lowercasing, removing URLs and punctuation, Indonesian stopword removal, stemming with **Sastrawi**)
- **TF–IDF** feature extraction
- Training baseline models: **Multinomial Naive Bayes**, **Logistic Regression**, **Linear SVM**
- Evaluation (Accuracy, Precision, Recall, F1, Confusion Matrix, ROC-AUC for probabilistic models)
- Saving artifacts: vectorizer and model (`models/` folder)

> Note: Make sure the files **Data_latih.csv** and **Data_uji.csv** are in the same folder as this notebook and **have the columns** `text` and `label`.
> If the column names differ, adjust the `TEXT_COL` and `LABEL_COL` variables below.


## 0. Library Setup (run once in a local/Colab environment)

In [None]:
# If needed on Colab:
# !pip install pandas numpy scikit-learn nltk Sastrawi matplotlib seaborn wordcloud


## 1. Import Libraries

In [1]:
import os, re, json, joblib
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from wordcloud import WordCloud

import nltk
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_recall_fscore_support, roc_auc_score, roc_curve)

nltk.download('stopwords')
print('NLTK stopwords downloaded.')

NLTK stopwords downloaded.


[nltk_data] Downloading package stopwords to C:\Users\MSI
[nltk_data]     ID\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Column and File Path Configuration

In [3]:
# Column configuration (change if your dataset uses different names)
TEXT_COL = 'text'   # change if needed
LABEL_COL = 'label' # change if needed

TRAIN_PATH = 'Data_latih.csv'
TEST_PATH  = 'Data_uji.csv'

assert os.path.exists(TRAIN_PATH), f'File not found: {TRAIN_PATH}'
assert os.path.exists(TEST_PATH),  f'File not found: {TEST_PATH}'
print('OK: training and test data files found')

OK: training and test data files found


## 3. Load Data

In [5]:
df_train = pd.read_csv(TRAIN_PATH)
df_test  = pd.read_csv(TEST_PATH)

# Auto-detect columns if TEXT_COL/LABEL_COL are missing
def auto_detect_columns(df):
    cols = [c.lower() for c in df.columns]
    mapping = {}
    # Heuristics: match common English/Indonesian column names
    text_candidates = [c for c in cols if 'text' in c or 'berita' in c or 'konten' in c or 'isi' in c or 'judul' in c]
    label_candidates = [c for c in cols if 'label' in c or 'target' in c or 'tag' in c or 'kelas' in c or 'hoax' in c]
    # NOTE: falls back to the first/last column when no candidate matches,
    # so a file without labels gets an unrelated column mapped as its label.
    mapping['text'] = df.columns[cols.index(text_candidates[0])] if text_candidates else df.columns[0]
    mapping['label'] = df.columns[cols.index(label_candidates[0])] if label_candidates else df.columns[-1]
    return mapping

for name, df in [('train', df_train), ('test', df_test)]:
    missing = [c for c in [TEXT_COL, LABEL_COL] if c not in df.columns]
    if missing:
        md = auto_detect_columns(df)
        print(f'Columns {missing} not found in {name}. Auto-mapping to: text -> {md["text"]}, label -> {md["label"]}')
        if name == 'train':
            df_train = df_train.rename(columns={md['text']: TEXT_COL, md['label']: LABEL_COL})
        else:
            df_test = df_test.rename(columns={md['text']: TEXT_COL, md['label']: LABEL_COL})

print('Train preview:')
display(df_train.head())
print('Train label distribution:')
print(df_train[LABEL_COL].value_counts())

Columns ['text'] not found in train. Auto-mapping to: text -> judul, label -> label
Columns ['text', 'label'] not found in test. Auto-mapping to: text -> judul, label -> nama file gambar
Train preview:


Unnamed: 0,ID,label,tanggal,text,narasi,nama file gambar
0,71,1,17-Aug-20,Pemakaian Masker Menyebabkan Penyakit Legionna...,A caller to a radio talk show recently shared ...,71.jpg
1,461,1,17-Jul-20,Instruksi Gubernur Jateng tentang penilangan ...,Yth.Seluruh Anggota Grup Sesuai Instruksi Gube...,461.png
2,495,1,13-Jul-20,Foto Jim Rohn: Jokowi adalah presiden terbaik ...,Jokowi adalah presiden terbaik dlm sejarah ban...,495.png
3,550,1,8-Jul-20,"ini bukan politik, tapi kenyataan Pak Jokowi b...","Maaf Mas2 dan Mbak2, ini bukan politik, tapi k...",550.png
4,681,1,24-Jun-20,Foto Kadrun kalo lihat foto ini panas dingin,Kadrun kalo lihat foto ini panas dingin . .,681.jpg


Train label distribution:
label
1    3465
0     766
Name: count, dtype: int64


## 4. Indonesian Text Preprocessing

In [6]:
stop_words = set(stopwords.words('indonesian'))
stemmer = StemmerFactory().create_stemmer()

def clean_text(text: str) -> str:
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"@[\w_]+", "", text)      # mention
    text = re.sub(r"#\w+", "", text)         # hashtag
    text = re.sub(r"[^a-z0-9 ]+", " ", text)  # keep alnum+space
    tokens = [w for w in text.split() if w not in stop_words and len(w) > 2]
    text = " ".join(tokens)
    text = stemmer.stem(text)
    return text

df_train['clean_text'] = df_train[TEXT_COL].astype(str).apply(clean_text)
df_test['clean_text']  = df_test[TEXT_COL].astype(str).apply(clean_text)

display(df_train[[TEXT_COL,'clean_text']].head())

Unnamed: 0,text,clean_text
0,Pemakaian Masker Menyebabkan Penyakit Legionna...,pakai masker sebab sakit legionnaires
1,Instruksi Gubernur Jateng tentang penilangan ...,instruksi gubernur jateng tilang masker muka 1...
2,Foto Jim Rohn: Jokowi adalah presiden terbaik ...,foto jim rohn jokowi presiden baik dlm sejarah...
3,"ini bukan politik, tapi kenyataan Pak Jokowi b...",politik nyata jokowi hasil pulang 000 triliun ...
4,Foto Kadrun kalo lihat foto ini panas dingin,foto kadrun kalo lihat foto panas dingin
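
Row 3 of the preview contains `000 triliun`. The most likely cause is that replacing punctuation with spaces splits a number such as `1.000` into `1` and `000`, after which the `len(w) > 2` filter drops the short first part. The sketch below reproduces this using only the regex steps of `clean_text` (stopword removal and stemming are omitted so it runs without NLTK or Sastrawi) and shows one possible workaround. The sample sentence, the function name `basic_clean`, and the `keep_numbers` flag are assumptions for illustration.

```python
import re

def basic_clean(text, keep_numbers=False):
    """Regex steps of clean_text only (no stopword removal, no stemming)."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # URLs
    text = re.sub(r"@[\w_]+", "", text)          # mentions
    text = re.sub(r"#\w+", "", text)             # hashtags
    if keep_numbers:
        # Assumed workaround: drop thousands separators so 1.000 -> 1000
        text = re.sub(r"(?<=\d)[.,](?=\d{3}\b)", "", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)     # keep alphanumerics and spaces
    return " ".join(w for w in text.split() if len(w) > 2)

sample = "Utang Rp 1.000 triliun https://example.com @user #hoax"
print(basic_clean(sample))                     # utang 000 triliun
print(basic_clean(sample, keep_numbers=True))  # utang 1000 triliun
```

Whether intact numbers help the classifier is an open question; the point is only that the current cleaning changes them silently.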


## 5. TF–IDF Feature Extraction & Model Training
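
As a reminder of what the vectorizer computes, the sketch below reproduces TF-IDF by hand on three toy documents. It assumes scikit-learn's default settings (raw term counts, `smooth_idf=True`, `norm='l2'`) and unigrams only, whereas the notebook also uses bigrams. The toy documents are made up for illustration.

```python
import math
from collections import Counter

# Toy, already-cleaned documents (made up for illustration).
docs = [
    "masker sebab sakit",
    "vaksin masker",
    "vaksin chip vaksin",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF as in scikit-learn's defaults: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}

def tfidf_vector(tokens):
    """Raw term count times IDF, then L2-normalised."""
    tf = Counter(tokens)
    weights = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in a single document (`chip`, `sebab`, `sakit`) receive a higher IDF than terms shared by two documents (`masker`, `vaksin`).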

In [8]:
# Data_uji.csv has no label column: the auto-mapping in section 3 assigned
# 'nama file gambar' (image file names) as its label, so scoring against it
# raised "ValueError: Mix of label input types (string and number)".
# Evaluate on a stratified hold-out split of the labeled training data instead.
# test_size=0.2 and random_state=42 are arbitrary choices, not tuned values.
X_text = df_train['clean_text'].values
y = df_train[LABEL_COL].values

X_train_text, X_test_text, y_train, y_test = train_test_split(
    X_text, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the vectorizer on the training part only to avoid leaking hold-out vocabulary
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1,2))
Xtr = tfidf.fit_transform(X_train_text)
Xte = tfidf.transform(X_test_text)

models = {
    'NaiveBayes': MultinomialNB(),
    'LogReg': LogisticRegression(max_iter=2000, n_jobs=None),
    'LinearSVM': LinearSVC()
}

results = {}
for name, clf in models.items():
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)
    acc = accuracy_score(y_test, pred)
    p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average='weighted', zero_division=0)
    results[name] = {'accuracy': acc, 'precision': p, 'recall': r, 'f1': f1}
    print(f'[{name}] acc={acc:.4f} | P={p:.4f} R={r:.4f} F1={f1:.4f}\n')

# Pick the model with the highest weighted F1 on the hold-out split
best_model_name = max(results, key=lambda k: results[k]['f1'])
best_model = models[best_model_name]
print('Best model:', best_model_name, results[best_model_name])
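
The training labels are imbalanced (3465 vs 766 in section 3), which matters when interpreting accuracy and weighted F1: a classifier that always predicts the majority class already reaches about 82% accuracy. The sketch below works through that arithmetic and the weights that scikit-learn's `class_weight='balanced'` heuristic, `n_samples / (n_classes * count)`, would assign. Whether reweighting helps on this dataset has not been tested here.

```python
# Label counts taken from the train label distribution in section 3.
counts = {1: 3465, 0: 766}
n_samples = sum(counts.values())
n_classes = len(counts)

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(counts.values()) / n_samples
print(f"majority-class accuracy: {majority_baseline:.4f}")  # 0.8190

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
class_weight = {label: n_samples / (n_classes * c) for label, c in counts.items()}
for label, w in class_weight.items():
    print(f"class {label}: weight {w:.3f}")  # class 1: 0.611, class 0: 2.762
```

`LogisticRegression` and `LinearSVC` accept `class_weight='balanced'` directly; `MultinomialNB` has no such parameter.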

## 6. Detailed Evaluation & Confusion Matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.metrics import ConfusionMatrixDisplay

pred_best = best_model.predict(Xte)
print(classification_report(y_test, pred_best))
cm = confusion_matrix(y_test, pred_best)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(xticks_rotation=45)
plt.title(f'Confusion Matrix - {best_model_name}')
plt.tight_layout()
plt.show()
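
ROC-AUC needs a continuous score rather than hard predictions. A minimal sketch, assuming binary 0/1 labels: use `predict_proba` where the model offers it and fall back to `decision_function` for `LinearSVC`, which has no probability output. The toy texts, their labels, and the helper name `positive_scores` are made up for illustration, and the score is computed on the training texts only to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def positive_scores(model, X):
    """Continuous score for the positive class (label 1)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)

# Toy data, made up for illustration; the meaning of labels 1/0 is an assumption.
texts = [
    "vaksin chip tubuh manusia", "masker sebab sakit paru",
    "foto editan presiden viral", "pesan berantai tilang masker",
    "pemerintah umum jadwal vaksin", "gubernur resmi buka jalan tol",
    "menteri lantik pejabat baru", "bank sentral tahan suku bunga",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(texts)
aucs = {}
for name, model in [("NaiveBayes", MultinomialNB()),
                    ("LogReg", LogisticRegression(max_iter=2000)),
                    ("LinearSVM", LinearSVC())]:
    model.fit(X, labels)
    aucs[name] = roc_auc_score(labels, positive_scores(model, X))
    print(f"{name}: ROC-AUC = {aucs[name]:.3f}")
```

In the notebook itself the same helper would be applied to `best_model` and `Xte`, with `y_test` as the reference labels.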

## 7. Word Cloud (optional)

In [None]:
def plot_wordcloud(texts, title):
    wc = WordCloud(width=900, height=500, background_color='white').generate(" ".join(texts))
    plt.figure(figsize=(10,5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()

if LABEL_COL in df_train.columns:
    for lbl in df_train[LABEL_COL].unique():
        plot_wordcloud(df_train[df_train[LABEL_COL]==lbl]['clean_text'].tolist(), f'Wordcloud - {lbl}')

## 8. Save Artifacts (Model & Vectorizer)

In [None]:
os.makedirs('models', exist_ok=True)
joblib.dump(tfidf, 'models/tfidf_vectorizer.pkl')
joblib.dump(best_model, f'models/{best_model_name}_model.pkl')
print('Model & vectorizer saved to the models/ folder')
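
Saving the vectorizer and the model as two files means they must always be loaded and applied together. One alternative, sketched below under the assumption that a single artifact is acceptable, is to wrap both in a scikit-learn `Pipeline` and dump that. The toy data, the step names, and the file name `baseline_pipeline.pkl` are made up for illustration, and the sketch writes to a temporary directory rather than `models/`.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy, already-cleaned texts (made up for illustration).
texts = ["vaksin chip tubuh", "masker sebab sakit",
         "jadwal vaksin resmi", "gubernur buka jalan"]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=2000)),
])
pipe.fit(texts, labels)

# Round-trip through disk: one file holds both vectorizer and classifier.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "baseline_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = list(restored.predict(texts)) == list(pipe.predict(texts))
print("round-trip predictions match:", same)
```

Input text would still need to pass through `clean_text` before being handed to the restored pipeline.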

## 9. Prediction Function for a Single Text

In [None]:
def predict_text(text: str):
    ct = clean_text(text)
    X = tfidf.transform([ct])
    return best_model.predict(X)[0]

print(predict_text("Breaking: vaksin menyebabkan chip 5G di tubuh manusia, ini hoax?"))