# Predicting Poverty Levels in Indonesia

**Short description:** This notebook walks step by step through *data mining* and building a model that predicts poverty levels in Indonesia by combining three Kaggle datasets.

**Datasets used (place the CSV files in the `data/` folder)**:
- `data/klasifikasi_kemiskinan.csv`  (target dataset)
- `data/socio_economic_2021.csv`     (socio-economic features)
- `data/pendidikan_provinsi_2023.csv` (education indicators per province)

> If your CSV files have different names, change the paths in the `FILE PATHS` cell below.

---


## 0. Setup & Install (optional)

Some packages may need to be installed. Run this cell (uncomment the `!pip` line) if they are not yet available in your environment. Installing `geopandas` and `folium` sometimes requires system dependencies; if installation fails, you can skip the map section and run only the model analysis.


In [1]:
# !pip install pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm shap folium geopandas plotly category_encoders imbalanced-learn joblib
# If installation fails for some packages (e.g. geopandas), skip the map and run the model.
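Before uncommenting the install line, it can help to see which optional packages are actually missing. The sketch below is one way to check; the helper name `missing_packages` and the list of packages treated as optional are assumptions for illustration.

```python
import importlib.util

# Packages the notebook treats as optional (assumed list; adjust as needed).
OPTIONAL_PACKAGES = ["xgboost", "shap", "geopandas", "folium"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(OPTIONAL_PACKAGES)
print("Missing optional packages:", missing or "none")
```

Only the packages reported as missing then need to be passed to `pip install`.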


## 1. Import libraries

In [2]:
import os
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Optional / advanced
try:
    import xgboost as xgb
except Exception as e:
    xgb = None

try:
    import shap
except Exception:
    shap = None

import joblib

print('libraries loaded')


libraries loaded


## 2. FILE PATHS

Make sure the CSV files have been downloaded from Kaggle and placed in the `data/` folder. Otherwise, change the paths below to match where your files are.

In [3]:
DATA_DIR = 'data'
Poverty_path = os.path.join(DATA_DIR, 'klasifikasi_kemiskinan.csv')
Socio_path   = os.path.join(DATA_DIR, 'socio_economic_2021.csv')
Edu_path     = os.path.join(DATA_DIR, 'pendidikan_provinsi_2023.csv')

print('Expected file paths:')
print(Poverty_path)
print(Socio_path)
print(Edu_path)


Expected file paths:
data/klasifikasi_kemiskinan.csv
data/socio_economic_2021.csv
data/pendidikan_provinsi_2023.csv
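Printing the expected paths does not tell us whether the files exist. As a minimal sketch, the check below lists the files that are missing from the data folder; the helper name `find_missing_files` is hypothetical.

```python
from pathlib import Path


def find_missing_files(data_dir, filenames):
    """Return the file names from `filenames` that do not exist under `data_dir`."""
    base = Path(data_dir)
    return [name for name in filenames if not (base / name).is_file()]


expected = [
    "klasifikasi_kemiskinan.csv",
    "socio_economic_2021.csv",
    "pendidikan_provinsi_2023.csv",
]
missing_files = find_missing_files("data", expected)
if missing_files:
    print("Missing from data/:", missing_files)
```

Running this before the loading cell makes a wrong path visible early instead of several cells later.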


## 3. Helper functions: safe CSV loader & column normalizer

In [4]:
def safe_read_csv(path):
    """Try to read CSV with common encodings; return DataFrame or None."""
    if not os.path.exists(path):
        print(f'File not found: {path}')
        return None
    # latin1 can decode any byte sequence, so it goes last as the fallback
    encodings = ['utf-8', 'cp1252', 'latin1']
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
        except Exception as e:
            # parse errors, empty files, etc. are not encoding problems
            print(f'Failed to read {path}: {e}')
            return None
    return None


def normalize_columns(df):
    df = df.copy()
    df.columns = (
        df.columns.str.strip()
                  .str.lower()
                  .str.replace(' ', '_')
                  .str.replace('-','_')
                  .str.replace('\n','_')
    )
    return df

print('helpers ready')


helpers ready


## 4. Load datasets (preview)

In [5]:
poverty = safe_read_csv(Poverty_path)
socio = safe_read_csv(Socio_path)
edu = safe_read_csv(Edu_path)

# show what was loaded
for name, df in [('poverty', poverty), ('socio', socio), ('edu', edu)]:
    if df is None:
        print(f"{name} -> NOT LOADED")
    else:
        print(f"{name} -> loaded, shape: {df.shape}")
        display(df.head())


File not found: data/klasifikasi_kemiskinan.csv
File not found: data/socio_economic_2021.csv
File not found: data/pendidikan_provinsi_2023.csv
poverty -> NOT LOADED
socio -> NOT LOADED
edu -> NOT LOADED


## 5. Inspect and normalize column names

We normalize the column names so the datasets are easier to merge and process.

In [6]:
for var_name, df in [('poverty', poverty), ('socio', socio), ('edu', edu)]:
    if df is None:
        continue
    print('\n---', var_name, 'columns ---')
    df_norm = normalize_columns(df)
    print(df_norm.columns.tolist()[:40])
    # replace in-place reference
    if var_name == 'poverty':
        poverty = df_norm
    elif var_name == 'socio':
        socio = df_norm
    elif var_name == 'edu':
        edu = df_norm
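To make the effect of the normalization concrete, here is a small standalone sketch on made-up column names. It uses a single regular expression instead of the chained `str.replace` calls in `normalize_columns`; that is a variation for illustration, not the helper itself.

```python
import pandas as pd

# Made-up raw headers with padding, mixed case, a hyphen and a line break.
raw = pd.DataFrame(columns=[" Nama Provinsi ", "Kode-Prov", "Tingkat\nKemiskinan"])

normalized = (
    raw.columns.str.strip()
       .str.lower()
       .str.replace(r"[\s\-]+", "_", regex=True)  # whitespace and hyphens become "_"
)
print(normalized.tolist())
# ['nama_provinsi', 'kode_prov', 'tingkat_kemiskinan']
```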


## 6. Quick missing-value check and column suggestions

Review the summary of missing values and data types. If the target column is not named exactly `tingkat_kemiskinan`, section 9 tries to detect it automatically.

In [7]:
def df_summary(df, name='df'):
    if df is None:
        print(f'{name} is None')
        return
    print(f"{name}: shape={df.shape}")
    df.info()  # prints directly; wrapping it in display() would also show "None"
    display(pd.DataFrame({
        'n_missing': df.isnull().sum(),
        'pct_missing': df.isnull().mean()*100
    }).sort_values('pct_missing', ascending=False).head(20))

for name, df in [('poverty', poverty), ('socio', socio), ('edu', edu)]:
    df_summary(df, name)


poverty is None
socio is None
edu is None


## 7. Find a suitable join key (provinsi / province / kode_prov)

We look for a column that allows all three datasets to be merged. If the `poverty` dataset is at household level, we still attach the province-level features from the education dataset by mapping them onto each row.

In [8]:
def find_possible_join_cols(df):
    candidates = ['provinsi','province','kode_prov','kode_provinsi','nama_provinsi','nama_prov','kabupaten','district']
    cols = set(df.columns)
    found = [c for c in candidates if c in cols]
    # also check substring matching
    for col in df.columns:
        for cand in candidates:
            if cand in col:
                if col not in found:
                    found.append(col)
    return found

print('poverty join cols:', find_possible_join_cols(poverty) if poverty is not None else [])
print('socio   join cols:', find_possible_join_cols(socio) if socio is not None else [])
print('edu     join cols:', find_possible_join_cols(edu) if edu is not None else [])


poverty join cols: []
socio   join cols: []
edu     join cols: []


## 8. Merge datasets

Strategy:
- If all datasets have a `provinsi` column (or similar), we merge at province level.
- If `poverty` is household-level (more granular), we merge `socio` into `poverty` on the region column and attach the `edu` (province) data by mapping province → education indicators.

Note: adjust the `join_key` column name if the notebook finds a different one.
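The second strategy can be sketched on toy data as follows. All column names and values here are hypothetical, not the real Kaggle schema. The sketch also cleans the province names first, since differences in case or whitespace would otherwise silently produce unmatched rows.

```python
import pandas as pd


def clean_key(s):
    """Normalize province names so case and stray whitespace do not break the join."""
    return s.astype(str).str.strip().str.lower().str.replace(r"\s+", " ", regex=True)


# Hypothetical household-level rows and province-level education indicators.
households = pd.DataFrame({
    "provinsi": ["Jawa Barat", "jawa barat ", "Papua", "Bali"],
    "penghasilan": [3.1, 2.4, 1.2, 4.0],
})
edu_demo = pd.DataFrame({
    "provinsi": ["JAWA BARAT", "Papua"],
    "rata_lama_sekolah": [8.8, 7.0],
})

households["prov_key"] = clean_key(households["provinsi"])
edu_demo["prov_key"] = clean_key(edu_demo["provinsi"])

merged_demo = households.merge(
    edu_demo.drop(columns="provinsi"),
    on="prov_key",
    how="left",
    validate="m:1",   # fail loudly if a province appears twice in edu_demo
    indicator=True,   # adds a `_merge` column: 'both' or 'left_only'
)
unmatched = merged_demo.loc[merged_demo["_merge"] == "left_only", "provinsi"].unique().tolist()
print("Provinces without education data:", unmatched)
```

The `_merge` column makes it easy to see which provinces need a manual name fix before modeling.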


In [9]:
# attempt to find join key automatically
possible_keys = ['provinsi','province','kode_prov','nama_provinsi','nama_prov','prov']

def find_join_for_all(dfs, keys):
    for k in keys:
        if all((df is not None) and any(k == c or k in c for c in df.columns) for df in dfs):
            # return the actual column name from first df (match)
            actual_cols = []
            for df in dfs:
                col = next((c for c in df.columns if (k == c or k in c)), None)
                actual_cols.append(col)
            return actual_cols
    return None

dfs = [poverty, socio, edu]
join_cols = find_join_for_all(dfs, possible_keys)
print('join columns found:', join_cols)

# If we didn't find a common key, try merging socio+edu on provinsi and then attach socio to poverty where possible.
merged = None
if join_cols is not None:
    # rename each dataset's join column to the common name 'provinsi'
    for df, actual_col in zip(dfs, join_cols):
        if df is None or actual_col is None:
            continue
        df.rename(columns={actual_col: 'provinsi'}, inplace=True)
    try:
        merged = poverty.merge(socio, on='provinsi', how='left')
        merged = merged.merge(edu, on='provinsi', how='left')
        print('merged shape:', merged.shape)
    except Exception as e:
        print('Merge error:', e)
else:
    print('No single join key found across all datasets. Merging socio+edu by provinsi if possible.')
    # try merge socio + edu
    socio_cols = find_possible_join_cols(socio) if socio is not None else []
    edu_cols = find_possible_join_cols(edu) if edu is not None else []
    common = set(socio_cols).intersection(edu_cols)
    if common:
        common_col = sorted(common)[0]  # sorted so the choice is deterministic
        print('Merging socio and edu on', common_col)
        merged_se = socio.merge(edu, on=common_col, how='left')
        # then try attaching merged_se to poverty based on a shared column
        shared = set(merged_se.columns).intersection(poverty.columns) if poverty is not None else set()
        if shared:
            # prefer a known region column over an arbitrary shared name
            shared_col = next((c for c in find_possible_join_cols(poverty) if c in shared), sorted(shared)[0])
            print('Merging poverty with socio+edu on', shared_col)
            merged = poverty.merge(merged_se, on=shared_col, how='left')
            print('merged shape:', merged.shape)
        else:
            print('No shared column to merge poverty with socio+edu automatically. You might need to merge manually based on your data schema.')

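# if merge didn't run, keep merged = poverty for further steps (so notebook remains runnable)
if merged is None and poverty is not None:
    merged = poverty.copy()

# stop with a clear message when nothing was loaded
if merged is None:
    raise FileNotFoundError('No data loaded. Place the CSV files in data/ (see section 2) and rerun from section 4.')

# keep a copy
df = merged.copy()
print('\nResulting dataframe ready for preprocessing. Shape:', df.shape)


join columns found: None
No single join key found across all datasets. Merging socio+edu by provinsi if possible.


FileNotFoundError: No data loaded. Place the CSV files in data/ (see section 2) and rerun from section 4.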

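## 9. Detecting the target column (`tingkat_kemiskinan`) automatically

If the target column has a different name, the notebook looks for a column whose name contains `kemiskin`, `poverty`, `level`, or `status` and takes the first match. Check the printed candidates, because `level` and `status` can also match unrelated columns.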


In [None]:
def detect_target_column(df):
    candidates = [c for c in df.columns if 'kemiskin' in c or 'poverty' in c or 'level' in c or 'status' in c]
    if len(candidates)>0:
        print('Possible target columns:', candidates)
        return candidates[0]
    # fallback: ask user to set target manually by editing the variable `TARGET_COL` below
    return None

TARGET_COL = detect_target_column(df)
print('Detected target col:', TARGET_COL)

if TARGET_COL is None:
    print('\n*** WARNING: target column not detected automatically. Please edit TARGET_COL to the correct column name from df.columns list below:')
    print(df.columns.tolist())


## 10. Simple cleaning & feature selection

- Select numeric and categorical features
- Simple imputation
- Create a few example derived features (if the source columns are available)


In [None]:
# If TARGET_COL is None, set it manually here (edit as needed). E.g. TARGET_COL = 'tingkat_kemiskinan'
# TARGET_COL = 'tingkat_kemiskinan'

if TARGET_COL is None:
    TARGET_COL = 'tingkat_kemiskinan'  # try default -- edit if necessary

# Ensure TARGET_COL exists
if TARGET_COL not in df.columns:
    print('Target column not in df.columns; available cols:')
    print(df.columns.tolist()[:80])
    # we'll try to continue but user should edit the target if running locally

# quick feature engineering examples (only if columns exist)
# per-capita income if income and household size exist;
# both spellings of the household-size column are accepted
size_col = next((c for c in ('jumlah_anggota_keluarga', 'jumlah_angota_keluarga') if c in df.columns), None)
if 'penghasilan' in df.columns and size_col is not None:
    df['penghasilan_perkapita'] = df['penghasilan'] / df[size_col].replace(0, 1)

# drop obvious identifier columns if present (before listing the features)
df = df.drop(columns=[c for c in ['id','id_rumah_tangga','no','index'] if c in df.columns])

# generic numeric / categorical split, excluding the target
num_cols = [c for c in df.select_dtypes(include=[np.number]).columns if c != TARGET_COL]
cat_cols = [c for c in df.select_dtypes(include=['object','category']).columns if c != TARGET_COL]

print('Numerical cols (sample):', num_cols[:20])
print('Categorical cols (sample):', cat_cols[:20])


## 11. Train/Test split

We use a stratified split when a target is detected. If the target is categorical, make sure it is already in label form (0/1/2...).

In [None]:
if TARGET_COL not in df.columns:
    raise ValueError('Please set TARGET_COL to a valid column name in the dataframe before running modeling cells.')

# drop rows with missing target
df_model = df.copy()
df_model = df_model[df_model[TARGET_COL].notnull()].reset_index(drop=True)

# if target is string labels, convert to categorical codes
if df_model[TARGET_COL].dtype == 'object' or str(df_model[TARGET_COL].dtype).startswith('category'):
    df_model[TARGET_COL] = df_model[TARGET_COL].astype('category').cat.codes

X = df_model.drop(columns=[TARGET_COL])
y = df_model[TARGET_COL]

# keep the baseline manageable: cap the number of numeric features
MAX_FEATURES = 200
if X.shape[1] > MAX_FEATURES:
    print(f"Warning: {X.shape[1]} features detected. Selecting top {MAX_FEATURES} numeric features by variance for baseline.")
    # numeric_only=True: variance is undefined for text columns
    variances = X.var(numeric_only=True).sort_values(ascending=False)
    keep = variances.index[:MAX_FEATURES].tolist()
    # also keep categorical cols
    cat_keep = X.select_dtypes(include=['object','category']).columns.tolist()
    keep = list(dict.fromkeys(keep + cat_keep))
    X = X[keep]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print('Train/Test shapes:', X_train.shape, X_test.shape)
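The split cell converts string labels with `.cat.codes`, which numbers the categories in alphabetical order. If the poverty labels have a natural order, an explicit category order may be preferable. The sketch below assumes hypothetical label names (`rendah`, `sedang`, `tinggi`); replace them with the values found in your target column.

```python
import pandas as pd

# Hypothetical ordered labels, from lowest to highest poverty level.
order = ["rendah", "sedang", "tinggi"]
labels = pd.Series(["tinggi", "rendah", "sedang", "rendah", None])

cat = pd.Categorical(labels, categories=order, ordered=True)
codes = pd.Series(cat.codes, index=labels.index)

print(codes.tolist())                   # missing labels are encoded as -1
print(dict(enumerate(cat.categories)))  # code -> label lookup for reports
```

Because missing labels become `-1`, rows with a missing target should be dropped before encoding, as the split cell does.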


## 12. Preprocessing pipelines (numerical & categorical)

In [None]:
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object','category']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

print('Preprocessor ready')


## 13. Modeling: baseline models (Logistic Regression, RandomForest, XGBoost if available)

In [None]:
models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
}
if xgb is not None:
    # use_label_encoder is deprecated in recent XGBoost releases and no longer needed
    models['XGBoost'] = xgb.XGBClassifier(eval_metric='logloss', random_state=42)

results = {}
for name, model in models.items():
    pipe = Pipeline(steps=[('preproc', preprocessor), ('model', model)])
    print('\nTraining', name)
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print('Classification report for', name)
    print(classification_report(y_test, y_pred))
    if hasattr(pipe.named_steps['model'], 'predict_proba'):
        try:
            y_proba = pipe.predict_proba(X_test)
            if y_proba.shape[1] == 2:
                auc = roc_auc_score(y_test, y_proba[:, 1])
            else:
                # multiclass: one-vs-rest AUC, macro-averaged
                auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
            print('ROC AUC:', auc)
        except Exception as e:
            print('ROC AUC not computed:', e)
    results[name] = pipe

print('\nModels trained: ', list(results.keys()))


## 14. Model comparison & selection

Choose the best model based on a metric (for example F1-score / ROC AUC).

In [None]:
# cross_val_score and StratifiedKFold were already imported in section 1
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    pipe = Pipeline(steps=[('preproc', preprocessor), ('model', model)])
    try:
        sc = cross_val_score(pipe, X_train, y_train, cv=cv, scoring='f1_macro')
        print(f"{name} CV f1_macro: mean={sc.mean():.4f}, std={sc.std():.4f}")
    except Exception as e:
        print('Cross-val error for', name, e)
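The loop above prints one score per model but does not pick a winner. A minimal sketch of a metric-based selection, using `cross_validate` with two metrics on synthetic data, could look like this. The synthetic data stands in for `X_train` / `y_train`, and names such as `candidates` and `summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, est in candidates.items():
    scores = cross_validate(est, X_demo, y_demo, cv=cv_demo, scoring=["f1_macro", "accuracy"])
    rows.append({
        "model": name,
        "f1_macro": scores["test_f1_macro"].mean(),
        "accuracy": scores["test_accuracy"].mean(),
    })

# Rank by mean macro-F1 and take the top row as the selected model.
summary = pd.DataFrame(rows).sort_values("f1_macro", ascending=False).reset_index(drop=True)
best_name = summary.loc[0, "model"]
print(summary)
print("Best by CV f1_macro:", best_name)
```

In the notebook, the estimators would be the full pipelines (preprocessor plus model) rather than bare models.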


## 15. Interpretability: Feature importance & SHAP (if available)

For tree-based models we can inspect feature importance. For deeper analysis, use SHAP if the package is installed.

In [None]:
# Prefer tree-based models here because they expose feature_importances_;
# note that this choice ignores the CV scores from section 14.
best_model_name = next(
    (n for n in ('XGBoost', 'RandomForest') if n in results),
    list(results.keys())[0],
)
print('Model used for interpretation:', best_model_name)

best_pipe = results[best_model_name]

# feature importance for tree models
try:
    model = best_pipe.named_steps['model']
    # get feature names after preprocessing
    # numeric feature names are numeric_features, categorical become onehot names
    cat_ohe_cols = []
    if len(categorical_features) > 0:
        ohe = best_pipe.named_steps['preproc'].named_transformers_['cat'].named_steps['onehot']
        cat_ohe_cols = ohe.get_feature_names_out(categorical_features).tolist()
    feature_names = numeric_features + cat_ohe_cols

    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        fi = pd.Series(importances, index=feature_names).sort_values(ascending=False)
        display(fi.head(30))
    else:
        print('Model has no attribute feature_importances_')
except Exception as e:
    print('Feature importance error:', e)

# SHAP (optional)
if shap is None:
    print('\nSHAP is not installed. To run SHAP, install it with `pip install shap`')
else:
    try:
        # get transformed training data for SHAP
        X_train_trans = best_pipe.named_steps['preproc'].transform(X_train)
        explainer = shap.Explainer(best_pipe.named_steps['model'])
        shap_values = explainer(X_train_trans)
        shap.summary_plot(shap_values, X_train_trans, feature_names=feature_names)
    except Exception as e:
        print('SHAP error (common if model or data shape mismatches):', e)
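When SHAP is not installed, or the chosen model has no `feature_importances_`, permutation importance from scikit-learn is a model-agnostic alternative. The sketch below runs it on synthetic data; it is a different technique from the two used in the cell above, shown here only as an option.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with named columns.
cols = [f"f{i}" for i in range(6)]
X_arr, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the drop in macro-F1.
result = permutation_importance(
    clf, X_te, y_te, scoring="f1_macro", n_repeats=10, random_state=0
)
perm = pd.Series(result.importances_mean, index=cols).sort_values(ascending=False)
print(perm)
```

Passing a whole fitted pipeline and the raw test frame instead of a bare model would give one importance value per original column rather than per one-hot column.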


## 16. Save best model

Save the complete pipeline (preprocessing + model) for use in production or further evaluation.

In [None]:
out_model_path = 'model_prediksi_kemiskinan_pipeline.joblib'
joblib.dump(best_pipe, out_model_path)
print('Saved pipeline to', out_model_path)
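A saved pipeline is reloaded with `joblib.load` and used directly on raw rows. The sketch below shows the round trip with a tiny stand-in pipeline and made-up columns; in the notebook the object would be `best_pipe`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline trained on made-up data.
train = pd.DataFrame({"penghasilan": [1.0, 2.0, 8.0, 9.0], "anggota": [6, 5, 2, 3]})
target = [1, 1, 0, 0]
pipe_demo = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe_demo.fit(train, target)

# Save to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pipeline.joblib")
    joblib.dump(pipe_demo, path)
    loaded = joblib.load(path)

# New rows must use the same column names the pipeline was trained on.
new_rows = pd.DataFrame({"penghasilan": [1.5, 8.5], "anggota": [5, 2]})
preds = loaded.predict(new_rows)
print(preds)
```

Joblib files are generally tied to the library versions that wrote them, so it is safer to load the pipeline with the same scikit-learn version used for training.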


## 17. (Optional) Building a choropleth map per province

To build the map you need a GeoJSON file or shapefile of Indonesia's provincial boundaries. The example code below assumes you have `indonesia_prov.geojson` with a `provinsi` or `nama_prov` property that matches the `provinsi` column in your data.


In [None]:
try:
    import geopandas as gpd
    import folium
    has_geo = True
except Exception:
    has_geo = False

if not has_geo:
    print('geopandas or folium not installed. Skip mapping. To enable mapping, install geopandas and folium.')
else:
    # example: aggregate predictions per province and merge with geojson
    # 1) read geojson
    # geo = gpd.read_file('data/indonesia_prov.geojson')
    # 2) create predictions per province (this part depends on your dataframe schema)
    # agg = df_model.copy()
    # agg['pred_proba'] = best_pipe.predict_proba(X)[:,1] if hasattr(best_pipe, 'predict_proba') else None
    # agg_prov = agg.groupby('provinsi')['pred_proba'].mean().reset_index()
    # geo = geo.merge(agg_prov, left_on='provinsi', right_on='provinsi')
    # 3) create folium map
    # m = folium.Map(location=[-2.5489, 118.0149], zoom_start=5)
    # folium.Choropleth(geo_data=geo, data=geo, columns=['provinsi','pred_proba'], key_on='feature.properties.provinsi').add_to(m)
    # display(m)
    print('Mapping cell provided as an example. Please supply a geojson file and adapt the merging keys before running.')


## 18. Conclusion & Next Steps

- This notebook provides a complete flow: load → merge → preprocessing → baseline models → evaluation → interpretation → save model.
- Recommended next steps:
  - Hyperparameter tuning (RandomizedSearchCV / Optuna)
  - Handling class imbalance (SMOTE / class weights)
  - Geographic validation: train on a subset of provinces and test on different provinces
  - Deployment: build an API (FastAPI) or a dashboard (Streamlit)
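Two of these steps, class weights and geographic validation, can be sketched together with `GroupKFold`, which keeps all rows of one group in the same fold. The data below is synthetic and the province ids are random placeholders; in the notebook the groups would come from the `provinsi` column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Imbalanced synthetic data (about 80% / 20%).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=42
)

# Hypothetical province id per row (10 provinces).
rng = np.random.default_rng(42)
groups = rng.integers(0, 10, size=len(y_demo))

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)

# Each fold is tested on provinces that were not seen during training.
scores = cross_val_score(
    clf, X_demo, y_demo, groups=groups, cv=GroupKFold(n_splits=5), scoring="f1_macro"
)
print(f"Grouped CV f1_macro: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Scores from grouped folds are usually a more honest estimate of how the model would perform on a province it has never seen than scores from a random split.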

---

*Notebook generated by ChatGPT.*