
# Liputan6: Pre-processing for Summarization (Version 5, Linked to EDA)

This notebook is **linked directly** to the EDA output (v4.2a).  
It automatically reads the dataset from:
`/content/drive/MyDrive/Tugas/Liputan6/EDA_Outputs/liputan6_dataset.csv`

Steps:
1) Mount Drive & load the dataset produced by the EDA  
2) Select the relevant columns  
3) Remove duplicates (based on article content)  
4) Clean the "Liputan6.com, Jakarta : …" boilerplate  
5) Remove common noise (`Baca juga:` "Read also", `Simak video:` "Watch the video", `Reporter:`, `Penulis:` "Writer")  
6) Basic normalization (whitespace/punctuation/lowercase)  
7) Filter out texts that are too short  
8) Summarize the cleaning results  
9) Save the final dataset to `Outputs/liputan6_clean_ready.csv`  
10) Preview random examples


## Step 1: Mount Google Drive & Load Dataset (from EDA_Outputs)

In [1]:
# 🚀 Google Drive Mount + PATH Setup (Simplified)
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

from pathlib import Path
import pandas as pd

# === Base folder for your project ===
BASE_DIR = Path("/content/drive/MyDrive/Tugas/Liputan6")  #@param {type:"string"}

# === Define paths derived from BASE_DIR ===
DATA_PATH = BASE_DIR / "Data" / "liputan6_clean_ready.csv"  # clean dataset
OUTPUT_DIR = BASE_DIR / "Outputs"                            # outputs folder

# Ensure output folder exists
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("Using DATA_PATH:", DATA_PATH)
print("Using OUTPUT_DIR:", OUTPUT_DIR)

# Optional: quick sanity check to load a small sample
# (reports an error if nothing exists at DATA_PATH yet)
try:
    _df_sample = pd.read_csv(DATA_PATH, nrows=3, low_memory=False)
    print("Sample loaded OK. Columns:", list(_df_sample.columns))
except Exception as e:
    print("⚠️ Cannot read DATA_PATH: please verify path exists.")
    print(e)


Mounted at /content/drive
Using DATA_PATH: /content/drive/MyDrive/Tugas/Liputan6/Data/liputan6_clean_ready.csv
Using OUTPUT_DIR: /content/drive/MyDrive/Tugas/Liputan6/Outputs
⚠️ Cannot read DATA_PATH: please verify path exists.
[Errno 2] No such file or directory: '/content/drive/MyDrive/Tugas/Liputan6/Data/liputan6_clean_ready.csv'
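
The run above reports an error because no file exists at `DATA_PATH`. One way to guard against this is to check a list of candidate locations and read only the first one that exists. The sketch below is a minimal illustration under stated assumptions: `first_existing` is a hypothetical helper that is not part of the notebook, and a temporary directory stands in for `BASE_DIR`.

```python
from pathlib import Path
from typing import Iterable, Optional
import tempfile


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, else None (hypothetical helper)."""
    for path in candidates:
        if path.is_file():
            return path
    return None


# A temporary directory stands in for BASE_DIR (assumption for the demo).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    eda_csv = base / "EDA_Outputs" / "liputan6_dataset.csv"
    eda_csv.parent.mkdir(parents=True)
    eda_csv.write_text("id,clean_article_text,clean_summary_text\n")

    # This path is never created, mirroring the failing DATA_PATH above.
    missing = base / "Data" / "liputan6_clean_ready.csv"

    chosen = first_existing([missing, eda_csv])
    chosen_name = chosen.name if chosen else None
    nothing = first_existing([missing])

print(chosen_name)
```

Returning `None` instead of raising lets the calling cell decide whether a missing file is fatal or merely worth a warning.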


In [3]:
# Load the dataset from the specified path
DATASET_PATH = BASE_DIR / "EDA_Outputs" / "liputan6_dataset.csv"

try:
    df = pd.read_csv(DATASET_PATH, low_memory=False)
    print("Dataset loaded successfully from:", DATASET_PATH)
    print("Number of rows:", len(df))
    print("Columns:", list(df.columns))
except Exception as e:
    print("⚠️ Error loading dataset from:", DATASET_PATH)
    print(e)
    df = pd.DataFrame()  # Create an empty DataFrame to avoid errors in subsequent cells

Dataset loaded successfully from: /content/drive/MyDrive/Tugas/Liputan6/EDA_Outputs/liputan6_dataset.csv
Number of rows: 193883
Columns: ['id', 'url', 'clean_article_text', 'clean_summary_text', 'extractive_summary_indices']


## Step 2: Select Relevant Columns

In [4]:

# Use the cleaned columns when available
article_col = 'clean_article_text' if 'clean_article_text' in df.columns else ('article' if 'article' in df.columns else None)
summary_col = 'clean_summary_text' if 'clean_summary_text' in df.columns else ('summary' if 'summary' in df.columns else None)

if article_col is None or summary_col is None:
    raise KeyError("Article/summary column not found. Check your CSV header.")

df = df[[article_col, summary_col]].dropna().copy()
df.rename(columns={article_col: 'clean_article_text', summary_col: 'clean_summary_text'}, inplace=True)

print("Columns used:", list(df.columns))
print("Rows after dropping NA:", len(df))
df.head(2)


Columns used: ['clean_article_text', 'clean_summary_text']
Rows after dropping NA: 193883


|   | clean_article_text | clean_summary_text |
|---|---|---|
| 0 | Liputan6 . com , Jakarta : Presiden Susilo Bam... | Menurut Presiden Susilo Bambang Yudhoyono , ke... |
| 1 | Liputan6 . com , Jakarta : Perdana Menteri Jep... | Pada masa silam Jepang terlalu ambisius untuk ... |
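
The column selection in Step 2 uses nested conditional expressions to fall back from the cleaned column names to the raw ones. The same fallback can be written as a small lookup function, which may be easier to extend. This is only a sketch: `pick_column` is a hypothetical helper, and the toy frame below is made up for illustration.

```python
import pandas as pd


def pick_column(df: pd.DataFrame, candidates):
    """Return the first name in `candidates` that is a column of `df` (hypothetical helper)."""
    for name in candidates:
        if name in df.columns:
            return name
    raise KeyError(f"None of {list(candidates)} found in columns {list(df.columns)}")


# Toy frame with raw column names and one missing article (made-up data).
toy = pd.DataFrame({
    "article": ["a b c", None],
    "summary": ["a", "b"],
    "url": ["u1", "u2"],
})

article_col = pick_column(toy, ["clean_article_text", "article"])
summary_col = pick_column(toy, ["clean_summary_text", "summary"])

# Keep the two text columns, drop rows with missing values, use canonical names.
selected = (
    toy[[article_col, summary_col]]
    .dropna()
    .rename(columns={article_col: "clean_article_text",
                     summary_col: "clean_summary_text"})
)
print(list(selected.columns), len(selected))
```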


## Step 3: Remove Duplicates (based on article content)

In [5]:

before = len(df)
df = df.drop_duplicates(subset=['clean_article_text'])
print(f"Duplicates removed: {before - len(df)} rows. Remaining: {len(df)}")


Duplicates removed: 871 rows. Remaining: 193012
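
Before dropping duplicates it can be useful to look at the duplicate groups themselves, for example to see whether repeated articles carry different summaries. The sketch below uses made-up rows. `duplicated(keep=False)` flags every member of a group, while `drop_duplicates` with its default `keep='first'` behaves as in the cell above.

```python
import pandas as pd

# Made-up rows: the first and third articles are identical.
toy = pd.DataFrame({
    "clean_article_text": ["berita a", "berita b", "berita a", "berita c"],
    "clean_summary_text": ["ringkasan 1", "ringkasan 2", "ringkasan 3", "ringkasan 4"],
})

# keep=False marks every row of a duplicate group, not only the repeats.
dup_mask = toy.duplicated(subset=["clean_article_text"], keep=False)
dup_groups = toy[dup_mask].sort_values("clean_article_text")

# Same call as in the notebook: the first occurrence of each article is kept.
deduped = toy.drop_duplicates(subset=["clean_article_text"])
removed = len(toy) - len(deduped)
print(removed, list(deduped.index))
```

Note that only the first summary of each duplicated article survives; whether that is the preferred one is a judgment call for the dataset at hand.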


## Step 4: Clean the "Liputan6.com, Jakarta : …" Boilerplate

In [6]:
import re
import pandas as pd

def clean_boilerplate(text: str) -> str:
    """Strip the leading "Liputan6.com" prefix and tidy spacing around punctuation."""
    if pd.isna(text):
        return text
    s = str(text)

    # Common variants of the "Liputan6.com, Jakarta :" prefix.
    # \u2013 and \u2014 are the en dash and the em dash.
    patterns = [
        r'(?i)^\s*liputan\s*6\s*\.\s*com\s*[,:\-\u2013\u2014]*\s*',
        r'(?i)^\s*liputan6\s*\.\s*com\s*,?\s*(jakarta|[A-Za-z\s]+)?\s*[:\-\u2013\u2014]*\s*',
    ]
    # The first pattern already consumes the prefix, so the dateline city
    # (e.g. "Jakarta:") stays in the text, as the preview below shows.
    for p in patterns:
        s = re.sub(p, '', s).strip()

    # Normalize whitespace and punctuation spacing right away
    s = re.sub(r'\s+', ' ', s)
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)
    s = re.sub(r'([,.;:!?])(?!\s|$)', r'\1 ', s)
    return s.strip()

# Before/after preview (5 examples)
sample_idx = df.index[:5]
preview = pd.DataFrame({
    "before": df.loc[sample_idx, 'clean_article_text'].astype(str).str.slice(0, 140),
    "after_preview": df.loc[sample_idx, 'clean_article_text'].astype(str).apply(clean_boilerplate).str.slice(0, 140)
})
display(preview)

# Apply to the whole column
df['clean_article_text'] = df['clean_article_text'].astype(str).apply(clean_boilerplate)
print("Boilerplate cleaned.")

|   | before | after_preview |
|---|---|---|
| 0 | Liputan6 . com , Jakarta : Presiden Susilo Bam... | Jakarta: Presiden Susilo Bambang Yudhoyono men... |
| 1 | Liputan6 . com , Jakarta : Perdana Menteri Jep... | Jakarta: Perdana Menteri Jepang Junichiro Koiz... |
| 2 | Liputan6 . com , Kutai : Banjir dengan ketingg... | Kutai: Banjir dengan ketinggian dua meter di K... |
| 3 | Liputan6 . com , Jakarta : Presiden Susilo Bam... | Jakarta: Presiden Susilo Bambang Yudhoyono men... |
| 4 | Liputan6 . com , Solok : Warga Kampung Batu Da... | Solok: Warga Kampung Batu Dalam, Kecamatan Dan... |


Boilerplate cleaned.
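
The preview above shows that the dateline city ("Jakarta:", "Kutai:") survives the cleanup. If removing the dateline as well is desired, one possible variant is sketched below. `PREFIX_AND_CITY` is a hypothetical pattern, not the notebook's method, and it assumes the dateline is at most three alphabetic words followed by a colon. On articles without a dateline it could strip a legitimate leading phrase that ends in a colon, so it should be checked against real samples first.

```python
import re

# First pattern from clean_boilerplate, with the dashes written as escapes.
PREFIX = re.compile(
    r'^\s*liputan\s*6\s*\.\s*com\s*[,:\-\u2013\u2014]*\s*',
    re.IGNORECASE,
)

# Hypothetical variant: additionally drop a short "City :" dateline
# (one to three alphabetic words followed by a colon).
PREFIX_AND_CITY = re.compile(
    r'^\s*liputan\s*6\s*\.\s*com\s*,?\s*(?:[A-Za-z]+(?:\s[A-Za-z]+){0,2})\s*:\s*',
    re.IGNORECASE,
)

raw = "Liputan6 . com , Kutai : Banjir dengan ketinggian dua meter"

kept_city = PREFIX.sub('', raw)
dropped_city = PREFIX_AND_CITY.sub('', raw)

print(kept_city)
print(dropped_city)
```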


## Step 5: Remove Common Noise (Baca juga, Simak video, Reporter, Penulis)

In [7]:

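# "Baca juga" = "Read also", "Simak video" = "Watch the video",
# "Lihat juga" = "See also", "Penulis" = "Writer".
# [:\uFF1A] accepts an ASCII colon or a full-width colon.
# clean_boilerplate already collapsed all whitespace to single spaces, so no
# newline remains and each match runs from the marker to the end of the article.
patterns_noise = [
    r'Baca juga[:\uFF1A].*?($|\n)',
    r'Simak video.*?($|\n)',
    r'Lihat juga[:\uFF1A].*?($|\n)',
    r'Reporter[:\uFF1A].*?($|\n)',
    r'Penulis[:\uFF1A].*?($|\n)',
]
for p in patterns_noise:
    df['clean_article_text'] = df['clean_article_text'].str.replace(p, '', regex=True)

print("Common noise cleaned (if any).")


Common noise cleaned (if any).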
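
Because the article text no longer contains newlines at this point, a pattern ending in `.*?($|\n)` can only stop at the end of the string. The sketch below shows that behavior on a made-up sentence and contrasts it with a hypothetical variant that stops at the first period. The variant is an illustration only: it would cut too early on link titles that themselves contain a period.

```python
import re

# Made-up single-line text with a "Baca juga:" marker in the middle.
text = ("Harga beras naik. Baca juga: Harga cabai turun. "
        "Pemerintah menyiapkan operasi pasar.")

# Pattern as in the notebook: with no newline in the text, the lazy match
# only stops at the end of the string.
to_end = re.sub(r'Baca juga[:\uFF1A].*?($|\n)', '', text)

# Hypothetical variant: stop at the first sentence-ending period instead.
one_sentence = re.sub(r'Baca juga[:\uFF1A][^.]*\.\s*', '', text)

print(repr(to_end))
print(repr(one_sentence))
```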


## Step 6: Basic Normalization (whitespace, punctuation, lowercase)

In [8]:

def normalize_text(s: str) -> str:
    """Collapse whitespace, drop spaces before punctuation, then lowercase."""
    if pd.isna(s):
        return s
    s = str(s)
    s = re.sub(r'\s+', ' ', s)                       # collapse repeated whitespace
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)           # no space before punctuation
    s = s.strip().lower()                             # trim + lowercase
    return s

df['clean_article_text'] = df['clean_article_text'].apply(normalize_text)
df['clean_summary_text'] = df['clean_summary_text'].apply(normalize_text)
print("Normalization done.")


Normalization done.
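
A quick way to sanity-check a normalizer like this is to run it on a small hand-written string and confirm that applying it twice changes nothing. The sketch below repeats the three regex steps from the cell above without the NaN guard; the input sentence is made up in the tokenized style of the dataset.

```python
import re


def normalize_text(s: str) -> str:
    """Same three steps as the notebook cell, without the NaN guard."""
    s = re.sub(r'\s+', ' ', s)              # collapse repeated whitespace
    s = re.sub(r'\s+([,.;:!?])', r'\1', s)  # no space before punctuation
    return s.strip().lower()                # trim + lowercase


# Made-up input in the tokenized style of the dataset.
raw = "  Presiden  Susilo Bambang Yudhoyono , Rabu ( 9/6 ) , tiba di Jakarta .\n"

once = normalize_text(raw)
twice = normalize_text(once)  # a second pass should be a no-op

print(once)
```

Parentheses keep their surrounding spaces because the punctuation class only covers `, . ; : ! ?`.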


## Step 7: Filter Out Texts That Are Too Short

In [9]:

def wc(x):
    """Count whitespace-separated words."""
    return len(str(x).split())

before = len(df)
df = df[df['clean_article_text'].apply(wc) >= 30]      # articles: at least 30 words
df = df[df['clean_summary_text'].apply(wc) >= 5]       # summaries: at least 5 words
print(f"Short-text filter: {before - len(df)} rows removed. Remaining: {len(df)} rows.")


Short-text filter: 1 rows removed. Remaining: 193011 rows.
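
The filter above reports a single total. When tuning the thresholds it can help to know how many rows each criterion rejects on its own. The sketch below computes both masks once on made-up rows. The names `MIN_ARTICLE_WORDS` and `MIN_SUMMARY_WORDS` are introduced here for illustration, with the values 30 and 5 taken from the cell above.

```python
import pandas as pd

MIN_ARTICLE_WORDS = 30  # threshold used in the notebook cell
MIN_SUMMARY_WORDS = 5   # threshold used in the notebook cell

# Made-up rows: one passes, one has a short article, one a short summary.
toy = pd.DataFrame({
    "clean_article_text": ["kata " * 40, "kata " * 10, "kata " * 35],
    "clean_summary_text": [
        "satu dua tiga empat lima",
        "satu dua tiga empat lima enam",
        "satu dua",
    ],
})

# Vectorized word counts: split on whitespace, then take the list length.
art_words = toy["clean_article_text"].str.split().str.len()
sum_words = toy["clean_summary_text"].str.split().str.len()

short_article = art_words < MIN_ARTICLE_WORDS
short_summary = sum_words < MIN_SUMMARY_WORDS

filtered = toy[~short_article & ~short_summary]
report = {
    "short_article": int(short_article.sum()),
    "short_summary": int(short_summary.sum()),
    "kept": len(filtered),
}
print(report)
```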


## Step 8: Summary of Cleaning Results

In [11]:
import numpy as np
import pandas as pd

# Word counts per row
art_len = df['clean_article_text'].astype(str).apply(lambda x: len(x.split()))
sum_len = df['clean_summary_text'].astype(str).apply(lambda x: len(x.split()))
# Summary-to-article length ratio; zero-length articles become NaN and are dropped
ratio = (sum_len / art_len.replace(0, np.nan)).dropna()

summary_stats = {
    "rows_final": int(len(df)),
    "article_words_mean": float(art_len.mean()),
    "summary_words_mean": float(sum_len.mean()),
    "compression_ratio_median": float(np.median(ratio)),
}
pd.Series(summary_stats, dtype='object')

|   | 0 |
|---|---|
| rows_final | 193011.0 |
| article_words_mean | 202.049681 |
| summary_words_mean | 27.243162 |
| compression_ratio_median | 0.151261 |
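
The reported median compression ratio (about 0.151) is the median of per-row ratios, which is not the same quantity as the ratio of the two means (27.243162 / 202.049681, roughly 0.135). The sketch below reproduces the per-row computation on made-up rows and adds quantiles as one possible way to look at the spread; the chosen quantile levels are an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Made-up rows with 100/200/50-word articles and 20/20/10-word summaries.
toy = pd.DataFrame({
    "clean_article_text": ["a " * 100, "a " * 200, "a " * 50],
    "clean_summary_text": ["s " * 20, "s " * 20, "s " * 10],
})

art_len = toy["clean_article_text"].str.split().str.len()
sum_len = toy["clean_summary_text"].str.split().str.len()

# Guard against empty articles, as in the notebook cell.
ratio = (sum_len / art_len.replace(0, np.nan)).dropna()

median_ratio = float(ratio.median())              # median of per-row ratios
ratio_of_means = float(sum_len.mean() / art_len.mean())
quantiles = ratio.quantile([0.05, 0.5, 0.95])     # spread, levels chosen arbitrarily

print(median_ratio, ratio_of_means)
```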


## Step 9: Save the Final Dataset

In [13]:
CLEAN_PATH = OUTPUT_DIR / "liputan6_clean_ready.csv"
df.to_csv(CLEAN_PATH, index=False)
print("✅ Clean dataset saved to:", CLEAN_PATH)

✅ Clean dataset saved to: /content/drive/MyDrive/Tugas/Liputan6/Outputs/liputan6_clean_ready.csv
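
After writing a CSV it is cheap to read it back and confirm that nothing was lost to quoting or delimiter issues, since article text contains commas and quotation marks. The sketch below does this round trip on made-up rows in a temporary directory; it is an optional check and not part of the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows containing a comma and embedded double quotes.
toy = pd.DataFrame({
    "clean_article_text": ["jakarta: teks artikel, dengan koma",
                           'teks "berkutip" lain'],
    "clean_summary_text": ["ringkasan satu", "ringkasan dua"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "liputan6_clean_ready.csv"
    toy.to_csv(path, index=False)   # same call as in the notebook
    reloaded = pd.read_csv(path)

same_shape = reloaded.shape == toy.shape
same_articles = reloaded["clean_article_text"].tolist() == toy["clean_article_text"].tolist()
same_summaries = reloaded["clean_summary_text"].tolist() == toy["clean_summary_text"].tolist()

print(same_shape, same_articles, same_summaries)
```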


## Step 10: Preview Examples (Random)

In [14]:

df[['clean_article_text','clean_summary_text']].sample(3, random_state=42)


|   | clean_article_text | clean_summary_text |
|---|---|---|
| 181438 | jakarta: husein mutahar, rabu ( 9/6 ), pukul 1... | mutahar meninggal dunia pada usia 88 tahun set... |
| 80343 | manchester city berhasil mempertahankan rekor ... | manchester city mempertahankan rekor tak terka... |
| 43115 | denpasar: pemilihan langsung gubernur bali aka... | persiapan pemilihan gubernur bali yang akan di... |
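
Fixing `random_state` makes the preview reproducible: the same seed on the same frame selects the same rows. The sketch below checks this on made-up rows and adds a hypothetical `truncate` helper for shortening long text before display; neither the helper nor the 80-character width comes from the notebook.

```python
import pandas as pd

# Made-up frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "clean_article_text": [f"artikel {i}" for i in range(100)],
    "clean_summary_text": [f"ringkasan {i}" for i in range(100)],
})

# Same seed, same frame: the two samples select identical rows.
first = toy[["clean_article_text", "clean_summary_text"]].sample(3, random_state=42)
second = toy[["clean_article_text", "clean_summary_text"]].sample(3, random_state=42)
same_rows = list(first.index) == list(second.index)


def truncate(text: str, width: int = 80) -> str:
    """Cut text to `width` characters, ending in '...' when shortened (hypothetical helper)."""
    return text if len(text) <= width else text[: width - 3] + "..."


# Apply the helper cell by cell for a compact preview.
preview = first.apply(lambda col: col.map(truncate))
print(same_rows, preview.shape)
```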
