# Datavidia ISPU Prediction - Robust Version

This notebook fixes the **ID not found** problem by merging the submission template into the main dataframe before the feature engineering step.

### Why Were Your IDs Missing?
The previous code only built features for rows present in the `merged` file. The IDs in `sample_submission` cover September to November 2025, while `merged` ends in August 2025, so those IDs had no feature rows at all.

### The Solution in This Notebook:
1. **Append Submission**: Add the submission IDs to the main dataframe.
2. **Time-Series Alignment**: Compute the lag and rolling-mean features on the combined frame, so that August 2025 data automatically becomes "yesterday" for September 2025.
3. **Calendar Features**: Rely on month and day-of-week features for long-horizon prediction, because sensor pollutant readings are necessarily empty for future dates.
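
The three steps can be illustrated on a toy frame. This is a minimal sketch, not the notebook's pipeline: the station name `DKI1`, the dates, and the `pm_sepuluh` values are made up, and only the concat, `shift(1)`, and `ffill` pattern is taken from the cells in this notebook.

```python
import pandas as pd

# Toy history: three days of readings for one hypothetical station.
history = pd.DataFrame({
    "tanggal": pd.to_datetime(["2025-08-29", "2025-08-30", "2025-08-31"]),
    "lokasi_clean": "DKI1",
    "pm_sepuluh": [40.0, 50.0, 60.0],
})

# Toy submission template: future days with an id but no sensor readings.
future = pd.DataFrame({
    "id": ["2025-09-01_DKI1", "2025-09-02_DKI1"],
    "tanggal": pd.to_datetime(["2025-09-01", "2025-09-02"]),
    "lokasi_clean": "DKI1",
})

# 1. Append the submission rows to the history and sort into one timeline.
full = (
    pd.concat([history, future], axis=0)
    .sort_values(["lokasi_clean", "tanggal"])
    .reset_index(drop=True)
)

# 2. Lag per station: 31 August becomes "yesterday" for 1 September.
full["pm_sepuluh_lag_1"] = full.groupby("lokasi_clean")["pm_sepuluh"].shift(1)
#    Later future days have no reading for "yesterday", so carry the last known lag forward.
full["pm_sepuluh_lag_1"] = full.groupby("lokasi_clean")["pm_sepuluh_lag_1"].ffill()

# 3. Calendar features exist for every date, including future ones.
full["month"] = full["tanggal"].dt.month

future_rows = full[full["id"].notna()]
print(future_rows[["id", "pm_sepuluh_lag_1", "month"]])
```

Both future rows end up with the 31 August reading as their lag. That is the intended behaviour here, but it also means the lag grows staler the further a row sits from the end of the history.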

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from pathlib import Path
from sklearn.metrics import classification_report, f1_score

# =========================
# 1. LOAD DATA
# =========================
NA_VALUES = ["---", "--", "", " ", "NA", "N/A"]
LABEL_MAP = {"BAIK": 0, "SEDANG": 1, "TIDAK SEHAT": 2}
INV_LABEL_MAP = {v: k for k, v in LABEL_MAP.items()}

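def find_file(name):
    """Return the first file called `name` under the working directory or any of its parents."""
    # Note: rglob on a high-level parent walks the whole tree and can be slow.
    for path in [Path.cwd(), *Path.cwd().parents]:
        match = next(path.rglob(name), None)
        if match is not None:
            return match
    # Fail here with a clear message instead of passing None to pd.read_csv
    raise FileNotFoundError(f"{name} not found under {Path.cwd()} or its parents")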

path_main = find_file("merged_cuaca_ndvi_ispu.csv")
path_sub = find_file("sample_submission.csv")

df = pd.read_csv(path_main, na_values=NA_VALUES)
sub = pd.read_csv(path_sub)

print(f"History rows: {len(df)}")
print(f"Submission rows: {len(sub)}")

# Preprocess the history
df["tanggal"] = pd.to_datetime(df["tanggal"])
# Fold the two most severe categories into "TIDAK SEHAT" so the target has three classes
df["kategori"] = df["kategori"].replace({"SANGAT TIDAK SEHAT": "TIDAK SEHAT", "BERBAHAYA": "TIDAK SEHAT"})
df["y"] = df["kategori"].map(LABEL_MAP)

# Preprocess the submission template so it can be concatenated with the history.
# Each id is split on "_": the first part is parsed as the date, the second as the location.
sub_df = sub.copy()
sub_df["tanggal"] = pd.to_datetime(sub_df["id"].str.split("_").str[0])
sub_df["lokasi_clean"] = sub_df["id"].str.split("_").str[1]

# Concatenate history + submission so features are computed over one timeline
full_df = pd.concat([df, sub_df], axis=0).sort_values(["lokasi_clean", "tanggal"]).reset_index(drop=True)

History rows: 15257
Submission rows: 455
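
Before building features on the concatenated frame, it may be worth confirming that the submission really continues the history. The sketch below uses toy frames and a hypothetical helper, `check_alignment`, which is not part of the notebook. It assumes ids look like `<date>_<location>`, which is what the `str.split("_")` parsing in the cell above implies.

```python
import pandas as pd

def check_alignment(history, submission):
    """Return (unknown_locations, overlapping_ids) for a parsed submission frame.

    Hypothetical helper: assumes both frames have `tanggal` and `lokasi_clean`.
    """
    # Locations that appear in the submission but never in the history
    unknown = sorted(set(submission["lokasi_clean"]) - set(history["lokasi_clean"]))
    # Submission rows dated on or before the last historical date of their location
    last_seen = history.groupby("lokasi_clean")["tanggal"].max()
    cutoff = submission["lokasi_clean"].map(last_seen)
    overlapping = submission.loc[submission["tanggal"] <= cutoff, "id"].tolist()
    return unknown, overlapping

# Made-up data: one known station, one unknown station, one overlapping date.
history = pd.DataFrame({
    "tanggal": pd.to_datetime(["2025-08-30", "2025-08-31"]),
    "lokasi_clean": ["DKI1", "DKI1"],
})
submission = pd.DataFrame({"id": ["2025-08-31_DKI1", "2025-09-01_DKI1", "2025-09-01_DKI9"]})
parts = submission["id"].str.split("_")
submission["tanggal"] = pd.to_datetime(parts.str[0])
submission["lokasi_clean"] = parts.str[1]

unknown, overlapping = check_alignment(history, submission)
print("Unknown locations:", unknown)
print("Overlapping ids:", overlapping)
```

An unknown location would leave every lag and rolling feature empty for its rows, and an overlapping date would mix a template row into the labelled period, so either list being non-empty is a reason to inspect the inputs.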


In [2]:
# =========================
# 2. FEATURE ENGINEERING
# =========================
def create_features(data):
    data = data.copy()
    data["tanggal"] = pd.to_datetime(data["tanggal"])
    
    # Calendar features (very important for long-horizon/future prediction)
    data["month"] = data["tanggal"].dt.month
    data["day_of_week"] = data["tanggal"].dt.dayofweek
    data["is_weekend"] = data["day_of_week"].isin([5, 6]).astype(int)
    
    # Sensor and weather columns
    COLS = [
        "pm_sepuluh", "sulfur_dioksida", "karbon_monoksida", "ozon", "nitrogen_dioksida",
        "temperature_2m_mean (°C)", "relative_humidity_2m_mean (%)", "ndvi"
    ]
    
    for col in COLS:
        if col in data.columns:
            # Lag and rolling mean. shift(1) keeps each row from seeing its own value (prevents leakage).
            data[f"{col}_lag_1"] = data.groupby("lokasi_clean")[col].shift(1)
            data[f"{col}_roll7"] = data.groupby("lokasi_clean")[col].transform(lambda x: x.shift(1).rolling(7, min_periods=3).mean())

            # For the distant future rows (Sept-Nov), most Lag-1 values are NaN,
            # so forward-fill the lag features per location to give the model
            # the last known reading as a reference.
            data[f"{col}_lag_1"] = data.groupby("lokasi_clean")[f"{col}_lag_1"].ffill()
            data[f"{col}_roll7"] = data.groupby("lokasi_clean")[f"{col}_roll7"].ffill()
            
    return data

print("🔨 Building features for all dates...")
full_df = create_features(full_df)

# Select the final feature set
FEATURES = [c for c in full_df.columns if "_lag_" in c or "_roll" in c or c in ["month", "day_of_week", "is_weekend"]]
print(f"Total features: {len(FEATURES)}")

🔨 Building features for all dates...
Total features: 19
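
The `shift(1)` before `rolling(7)` in `create_features` matters: without it, the rolling mean of a row would include that row's own reading, which is not known at prediction time. A small sketch on made-up numbers, with the window shortened to 3 purely for readability:

```python
import pandas as pd

# Made-up readings; the last value is a deliberate spike.
values = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])

# Leaky: the window ending at row t includes the value at row t itself.
leaky = values.rolling(3, min_periods=2).mean()

# Safe: shift first, so the window ending at row t only covers rows t-3 .. t-1.
safe = values.shift(1).rolling(3, min_periods=2).mean()

print(pd.DataFrame({"value": values, "leaky": leaky, "safe": safe}))
```

The spike in the last row pulls the leaky mean far upward, while the shifted version for that row still averages only the three earlier readings.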


In [3]:
# =========================
# 3. TRAINING
# =========================
# Train only on rows that have a label
train_data = full_df[full_df["y"].notna()].copy()

# Time-based train/validation split: everything from SPLIT_DATE onward is held out
# (July 2024 through the end of the history, not only the last six months of 2024)
SPLIT_DATE = "2024-07-01"
X_train = train_data[train_data["tanggal"] < SPLIT_DATE][FEATURES]
y_train = train_data[train_data["tanggal"] < SPLIT_DATE]["y"]
X_valid = train_data[train_data["tanggal"] >= SPLIT_DATE][FEATURES]
y_valid = train_data[train_data["tanggal"] >= SPLIT_DATE]["y"]

print(f"Train set: {len(X_train)} rows")

model = lgb.LGBMClassifier(
    objective="multiclass",
    num_class=3,
    n_estimators=1000,
    learning_rate=0.03,
    class_weight={0: 1.0, 1: 0.8, 2: 4.5},
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)]
)

Train set: 13136 rows
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002334 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3014
[LightGBM] [Info] Number of data points in the train set: 13136, number of used features: 19
[LightGBM] [Info] Start training from score -2.298530
[LightGBM] [Info] Start training from score -1.036706
[LightGBM] [Info] Start training from score -0.607019
Training until validation scores don't improve for 50 rounds


Early stopping, best iteration is:
[394]	valid_0's multi_logloss: 0.733032
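
Cell 1 imports `classification_report` and `f1_score`, but the training cell only reports the multiclass log loss. Below is a sketch of how the held-out predictions could also be scored per class. The two arrays are made-up stand-ins for `y_valid` and `model.predict(X_valid)`, and the choice of macro F1 is an assumption, not a metric stated anywhere in this notebook.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

LABELS = ["BAIK", "SEDANG", "TIDAK SEHAT"]

# Made-up stand-ins for y_valid and model.predict(X_valid)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Macro F1 averages the per-class F1 scores, so the rare class counts as much as the others.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro_f1:.4f}")

# Per-class precision, recall, and F1 in one table
print(classification_report(y_true, y_pred, target_names=LABELS))
```

A per-class view is useful here because the model up-weights class 2 through `class_weight`, and log loss alone does not show whether that class is actually being recovered.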


In [4]:
# =========================
# 4. INFERENCE & SUBMISSION
# =========================
print("🚀 Predicting for submission...")

# Keep only the rows whose id appears in the sample submission
test_data = full_df[full_df["id"].isin(sub["id"])].copy()

if len(test_data) == len(sub):
    preds = model.predict(test_data[FEATURES])
    # "y" is stored as float after the concat, so predictions may be floats; int() keeps the lookup explicit
    test_data["category"] = [INV_LABEL_MAP[int(p)] for p in preds]

    # Make sure the id order matches the submission template
    final_sub = sub[["id"]].merge(test_data[["id", "category"]], on="id", how="left")
    final_sub.to_csv("submission.csv", index=False)
    print("✅ submission.csv created with 100% of IDs filled!")
else:
    print(f"⚠️ Still a mismatch: test data ({len(test_data)}) vs submission ({len(sub)})")
    missing = set(sub["id"]) - set(test_data["id"])
    print("Sample of missing IDs:", sorted(missing)[:3])

🚀 Predicting for submission...
✅ submission.csv created with 100% of IDs filled!
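
The length check in the inference cell could in principle pass even if one id were duplicated while another was missing, and a left merge would then silently add a row or leave a `NaN`. Below is a hedged sketch of two extra guards on toy data. `validate="one_to_one"` is an existing `DataFrame.merge` option, while the frames and the helper name `build_submission` are invented for illustration.

```python
import pandas as pd

def build_submission(template, predictions):
    """Left-join predictions onto the template and fail loudly on gaps or duplicates."""
    merged = template[["id"]].merge(
        predictions[["id", "category"]],
        on="id",
        how="left",
        validate="one_to_one",  # raises pandas.errors.MergeError on duplicated ids
    )
    if merged["category"].isna().any():
        missing = merged.loc[merged["category"].isna(), "id"].tolist()
        raise ValueError(f"No prediction for ids: {missing[:3]}")
    return merged

# Toy template and predictions, deliberately in different orders
template = pd.DataFrame({"id": ["2025-09-02_DKI1", "2025-09-01_DKI1"]})
predictions = pd.DataFrame({
    "id": ["2025-09-01_DKI1", "2025-09-02_DKI1"],
    "category": ["BAIK", "SEDANG"],
})

final = build_submission(template, predictions)
print(final)
```

The left merge keeps the template's row order, so the output can be written to CSV directly once both guards pass.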
