# TFX Final Project: Income Prediction

### 📂 Dataset: Adult Income  

The **Adult Income** dataset (also known as the Census Income Dataset) contains demographic and employment data on individuals from the 1994 US Census.
The goal is to predict whether a person earns more than \$50K per year, based on their personal and professional attributes.

**Dataset Details**  
- Source: [Kaggle – Adult Income Prediction Dataset](https://www.kaggle.com/datasets/mosapabdelghany/adult-income-prediction-dataset)
- Rows: **32,561**
- Columns: **15**
- Target: `income` (`<=50K` or `>50K`)

**Columns:**
- `age` – Age of the individual
- `workclass` – Employment class (Private, Self-emp, Government, etc.)
- `fnlwgt` – Final weight, the census sampling weight used for population-level estimates
- `education` – Highest level of education attained
- `education.num` – Numeric encoding of the education level
- `marital.status` – Marital status
- `occupation` – Type of occupation
- `relationship` – Relationship status within the household
- `race` – Race of the individual
- `sex` – Sex
- `capital.gain` – Reported capital gain
- `capital.loss` – Reported capital loss
- `hours.per.week` – Hours worked per week
- `native.country` – Country of origin
- `income` – Income level (target)

The dataset is stored as a CSV file and is read by the **CsvExampleGen** component, which is the entry point of the pipeline.

In [None]:
# Install dependencies
!pip install -r requirements.txt

## 1. Import Libraries and Define Paths

In [6]:
import os

from absl import logging
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Pipeline name and input data location
PIPELINE_NAME = "jasmeinalbaar-pipeline"
DATA_ROOT = "data"

# User-code modules consumed by the Transform, Tuner, and Trainer components
TRANSFORM_MODULE_FILE = "modules/transform.py"
TUNER_MODULE_FILE = "modules/tuner.py"
TRAINER_MODULE_FILE = "modules/trainer.py"

# Output locations: pushed models, pipeline artifacts, and the metadata store
OUTPUT_BASE = "output"
SERVING_MODEL_DIR = os.path.join(OUTPUT_BASE, "serving_model")
PIPELINE_ROOT = os.path.join(OUTPUT_BASE, PIPELINE_NAME)
METADATA_PATH = os.path.join(PIPELINE_ROOT, "metadata.sqlite")

## 2. Initialize the Pipeline Components

In [7]:
from modules.components import init_components

components = init_components(
    DATA_ROOT,
    training_module=TRAINER_MODULE_FILE,
    tuner_module=TUNER_MODULE_FILE,
    transform_module=TRANSFORM_MODULE_FILE,
    training_steps=5000,
    eval_steps=1000,
    serving_model_dir=SERVING_MODEL_DIR,
)

## 3. Run the Pipeline

In [8]:
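def init_local_pipeline(components, pipeline_root):
    """Build a local TFX pipeline that runs on the Apache Beam direct runner."""
    logging.info("Pipeline root set to: %s", pipeline_root)
    beam_args = [
        # Run Beam workers as separate processes instead of threads.
        "--direct_running_mode=multi_processing",
        # 0 lets Beam pick the worker count from the available CPU cores.
        "--direct_num_workers=0",
    ]
    return pipeline.Pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=pipeline_root,
        components=components,
        # Reuse cached outputs of components whose inputs have not changed.
        enable_cache=True,
        # Track artifacts and executions in a local SQLite metadata store.
        metadata_connection_config=metadata.sqlite_metadata_connection_config(
            METADATA_PATH
        ),
        beam_pipeline_args=beam_args,
    )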

p = init_local_pipeline(components, PIPELINE_ROOT)
BeamDagRunner().run(pipeline=p)

Trial 5 Complete [00h 03m 24s]
val_accuracy: 0.8491874933242798

Best val_accuracy So Far: 0.8519218564033508
Total elapsed time: 00h 12m 50s
INFO:tensorflow:Oracle triggered exit

Results summary
Results in output/jasmeinalbaar-pipeline/Tuner/.system/executor_execution/7/.temp/7/adult_income_tuning
Showing 10 best trials
<keras_tuner.engine.objective.Objective object at 0x13d2ce6a0>
Trial summary
Hyperparameters:
learning_rate: 0.0001
units_1: 128
units_2: 16
Score: 0.8519218564033508
Trial summary
Hyperparameters:
learning_rate: 0.01
units_1: 160
units_2: 80
Score: 0.851437509059906
Trial summary
Hyperparameters:
learning_rate: 0.001
units_1: 160
units_2: 80
Score: 0.8491874933242798
Trial summary
Hyperparameters:
learning_rate: 0.001
units_1: 160
units_2: 112
Score: 0.8487656116485596
Trial summary
Hyperparameters:
learning_rate: 0.01
units_1: 128
units_2: 32
Score: 0.848562479019165




Processing ./output/jasmeinalbaar-pipeline/_wheels/tfx_user_code_Trainer-0.0+cd76b9019fda91ebf769532b6cb4ab1426603ab2724e10ac27eefe41c15abe74-py3-none-any.whl
Installing collected packages: tfx-user-code-trainer
Successfully installed tfx-user-code-trainer-0.0+cd76b9019fda91ebf769532b6cb4ab1426603ab2724e10ac27eefe41c15abe74
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 workclass_xf (InputLayer)      [(None, 10)]         0           []                               
                                                                                                  
 education_xf (InputLayer)      [(None, 17)]         0           []                               
                                                                                                  
 marital.status_xf (InputLayer)  [(None, 8)]         0         

INFO:tensorflow:Assets written to: output/jasmeinalbaar-pipeline/Trainer/model/8/Format-Serving/assets


You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.
Model plot saved to: output/jasmeinalbaar-pipeline/Trainer/model/8/images/model_plot.png


2025-09-28 20:31:23.426258: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`



## 🧩 Pipeline Stages

### 1️⃣ ExampleGen
- Reads the CSV dataset from `data_dir`.
- Automatically splits the data into two subsets: **80% for train** and **20% for eval** (configured through `split_config`).
- Output: **Examples**, which the downstream components consume.
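
The 80/20 split can be pictured as hash bucketing: each record is hashed into one of ten buckets, eight of which go to train and two to eval. The sketch below is a plain-Python illustration of that idea, not the code ExampleGen actually runs; the `assign_split` helper, the bucket counts, and the choice of MD5 are assumptions made for the example.

```python
import hashlib


def assign_split(record: str, train_buckets: int = 8, eval_buckets: int = 2) -> str:
    """Deterministically map a record to 'train' or 'eval' by hashing it."""
    total = train_buckets + eval_buckets
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total
    return "train" if bucket < train_buckets else "eval"


# Hypothetical records; in the pipeline these would be rows of the CSV file.
records = [f"row-{i}" for i in range(10_000)]
splits = [assign_split(r) for r in records]
train_fraction = splits.count("train") / len(splits)
print(f"train fraction: {train_fraction:.3f}")
```

Because the assignment depends only on the record's content, rerunning the split on the same data puts every record in the same subset.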

---

### 2️⃣ StatisticsGen
- Computes **descriptive statistics** for every feature in the dataset.
- Helps reveal data distributions, missing values, and outliers.
- Output: **Statistics artifact**.
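
To make "descriptive statistics" concrete, here is a minimal standard-library sketch of the kind of per-feature summary involved. It is an illustration only, not what StatisticsGen executes; the sample rows are made up, and treating `"?"` as the missing-value marker is an assumption of this sketch.

```python
import statistics

# Made-up rows for illustration; "?" marks a missing value in this sketch.
rows = [
    {"age": 39, "workclass": "State-gov"},
    {"age": 50, "workclass": "?"},
    {"age": 38, "workclass": "Private"},
    {"age": 53, "workclass": "Private"},
]


def describe(rows, feature, missing="?"):
    """Summarize one feature: missing count plus numeric or categorical stats."""
    values = [r[feature] for r in rows]
    present = [v for v in values if v != missing]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if all(isinstance(v, (int, float)) for v in present):
        summary.update(min=min(present), max=max(present), mean=statistics.mean(present))
    else:
        summary["unique"] = len(set(present))
    return summary


age_stats = describe(rows, "age")
workclass_stats = describe(rows, "workclass")
print(age_stats)
print(workclass_stats)
```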

---

### 3️⃣ SchemaGen
- Automatically generates a **schema** from the dataset statistics.
- The schema records data types, value ranges, and other integrity constraints.
- Output: **Schema artifact**.
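
Schema inference can be sketched as deriving, for each feature, either a numeric range or a set of allowed categories. The `infer_schema` helper and the dictionary layout below are hypothetical and far simpler than a real Schema artifact; the sample rows are made up.

```python
def infer_schema(rows):
    """Infer a minimal schema: a numeric range or a domain of allowed categories."""
    schema = {}
    for feature in rows[0]:
        values = [r[feature] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            schema[feature] = {"type": "numeric", "min": min(values), "max": max(values)}
        else:
            schema[feature] = {"type": "categorical", "domain": sorted(set(values))}
    return schema


# Made-up rows for illustration.
rows = [
    {"age": 39, "sex": "Male"},
    {"age": 28, "sex": "Female"},
    {"age": 52, "sex": "Male"},
]
schema = infer_schema(rows)
print(schema)
```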

---

### 4️⃣ ExampleValidator
- Detects **data anomalies** (missing values, out-of-range values, and so on) by comparing the data against the schema.
- Helps ensure the data is clean before it moves on to the transformation stage.
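
The comparison against the schema can be illustrated with a small checker that reports missing, out-of-range, and unexpected values. This is a simplified sketch, not ExampleValidator's logic; the schema is hand-written for the example and `find_anomalies` is a hypothetical helper.

```python
# Hypothetical schema, hand-written for this sketch.
schema = {
    "age": {"type": "numeric", "min": 17, "max": 90},
    "sex": {"type": "categorical", "domain": ["Female", "Male"]},
}


def find_anomalies(row, schema):
    """Return human-readable anomaly messages for a single row."""
    anomalies = []
    for feature, spec in schema.items():
        if feature not in row or row[feature] is None:
            anomalies.append(f"{feature}: missing value")
            continue
        value = row[feature]
        if spec["type"] == "numeric" and not spec["min"] <= value <= spec["max"]:
            anomalies.append(f"{feature}: {value} outside [{spec['min']}, {spec['max']}]")
        elif spec["type"] == "categorical" and value not in spec["domain"]:
            anomalies.append(f"{feature}: unexpected value {value!r}")
    return anomalies


clean = find_anomalies({"age": 39, "sex": "Male"}, schema)
dirty = find_anomalies({"age": 150, "sex": None}, schema)
print(clean)
print(dirty)
```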

---

### 5️⃣ Transform
- Runs **preprocessing** (for example normalization and one-hot encoding) using the Python module file passed as `transform_module`.
- Output:
  - `transformed_examples` – the preprocessed data
  - `transform_graph` – the transformation graph, reused by the Trainer and at serving time
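
The reason a transformation graph is saved and reused is that preprocessing has two phases: parameters are learned once from the training data, then applied unchanged everywhere else. The plain-Python sketch below illustrates this with z-score scaling and one-hot encoding. It is not the contents of `modules/transform.py`; the helper names and sample rows are assumptions, and the `_xf` suffix simply mirrors the input names in the model summary above.

```python
import statistics


def fit_transform_params(train_rows, numeric_feature, categorical_feature):
    """Analyze step: learn mean/std and the vocabulary from training data only."""
    values = [r[numeric_feature] for r in train_rows]
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "vocab": sorted({r[categorical_feature] for r in train_rows}),
    }


def apply_transform(row, params, numeric_feature, categorical_feature):
    """Apply step: reuse the learned parameters at training and at serving time."""
    scaled = (row[numeric_feature] - params["mean"]) / params["std"]
    one_hot = [1.0 if row[categorical_feature] == v else 0.0 for v in params["vocab"]]
    return {f"{numeric_feature}_xf": scaled, f"{categorical_feature}_xf": one_hot}


# Made-up training rows for illustration.
train_rows = [
    {"age": 20, "workclass": "Private"},
    {"age": 40, "workclass": "State-gov"},
    {"age": 60, "workclass": "Private"},
]
params = fit_transform_params(train_rows, "age", "workclass")
transformed = apply_transform({"age": 40, "workclass": "Private"}, params, "age", "workclass")
print(transformed)
```

Keeping the learned parameters separate from the function that applies them is what prevents training/serving skew: a serving request is scaled with the training mean, not with its own.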

---

### 6️⃣ Tuner
- Searches for the **best hyperparameters** for the model.
- Uses `tuner_module`, which defines the hyperparameter search space.
- Output: **best_hyperparameters**.
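
Picking the winner amounts to taking the trial with the highest objective score. The sketch below does this for the five trials reported in the tuner log in section 3 (scores are `val_accuracy`); the list-of-dicts layout is an assumption of this sketch, not the tuner's internal format.

```python
# Trial results copied from the tuner log in section 3 of this notebook.
trials = [
    {"learning_rate": 0.0001, "units_1": 128, "units_2": 16, "score": 0.8519218564033508},
    {"learning_rate": 0.01, "units_1": 160, "units_2": 80, "score": 0.851437509059906},
    {"learning_rate": 0.001, "units_1": 160, "units_2": 80, "score": 0.8491874933242798},
    {"learning_rate": 0.001, "units_1": 160, "units_2": 112, "score": 0.8487656116485596},
    {"learning_rate": 0.01, "units_1": 128, "units_2": 32, "score": 0.848562479019165},
]

# Higher val_accuracy is better, so the best trial is the one with the max score.
best_trial = max(trials, key=lambda t: t["score"])
best_hyperparameters = {k: v for k, v in best_trial.items() if k != "score"}
print(best_hyperparameters)
```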

---

### 7️⃣ Trainer
- Trains the model on the transformed data.
- Uses `training_module`, which defines the model architecture.
- **Hyperparameters** are taken from the Tuner output.
- Output: **Model artifact**.

---

### 8️⃣ Resolver
- Fetches the best model that was already *blessed* (passed evaluation) in previous pipeline runs and uses it as the baseline for evaluating the new model.

---

### 9️⃣ Evaluator
- Evaluates the new model with **TensorFlow Model Analysis (TFMA)**.
- Metrics: **AUC, Precision, Recall, ExampleCount**, and **BinaryAccuracy** (which must meet a minimum threshold of 0.5).
- Includes a *fairness* evaluation sliced by the `sex` and `race` features (*slicing spec*).
- Output: a **blessing artifact** that determines whether the model may be *pushed*.
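
The following standard-library sketch shows what sliced evaluation and a blessing check compute: the same metrics overall and per `sex` slice, then a comparison of overall accuracy with the 0.5 threshold. It is an illustration, not TFMA; the labels and predictions are made up, and `binary_metrics` and `metrics_for` are hypothetical helpers.

```python
def binary_metrics(labels, predictions):
    """Compute example count, accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "example_count": len(pairs),
        "binary_accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up examples for illustration: (sex, true label, predicted label).
examples = [
    ("Male", 1, 1), ("Male", 0, 0), ("Male", 0, 1), ("Male", 1, 1),
    ("Female", 0, 0), ("Female", 1, 0), ("Female", 0, 0), ("Female", 1, 1),
]


def metrics_for(examples, sex=None):
    """Metrics over all examples, or over one slice when `sex` is given."""
    rows = [e for e in examples if sex is None or e[0] == sex]
    return binary_metrics([e[1] for e in rows], [e[2] for e in rows])


overall = metrics_for(examples)
by_sex = {s: metrics_for(examples, s) for s in ("Male", "Female")}

# The model is blessed only if overall accuracy reaches the minimum threshold.
blessed = overall["binary_accuracy"] >= 0.5
print(overall, by_sex, blessed)
```

In this toy data the overall accuracy is acceptable, yet recall differs between the two slices, which is exactly the kind of gap a slicing spec is meant to surface.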

---

### 🔟 Pusher
- If the model passes evaluation (receives a *blessing*), it is saved to the `serving_model_dir` directory.
- The model is then ready for **serving** (*deployment*).
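
A conditional push can be sketched as copying the trained model into a new version folder only when it is blessed. The example below runs in a temporary directory with a placeholder file instead of a real SavedModel. The `push_model` helper is hypothetical, and naming the version folder after the current Unix timestamp is an assumption, based on the numeric folder name seen in `output/serving_model/`.

```python
import shutil
import tempfile
import time
from pathlib import Path


def push_model(model_dir: Path, serving_model_dir: Path, blessed: bool):
    """Copy the model into a timestamp-named version folder if it is blessed."""
    if not blessed:
        return None
    version_dir = serving_model_dir / str(int(time.time()))
    shutil.copytree(model_dir, version_dir)
    return version_dir


# Demo in a temporary directory, using a placeholder file as the "model".
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    model_dir.mkdir()
    (model_dir / "saved_model.pb").write_bytes(b"placeholder")
    serving_dir = Path(tmp) / "serving_model"

    pushed = push_model(model_dir, serving_dir, blessed=True)
    skipped = push_model(model_dir, serving_dir, blessed=False)
    pushed_files = sorted(p.name for p in pushed.iterdir())
    version_name = pushed.name

print(version_name, pushed_files, skipped)
```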

## 4. Check the Saved Model

In [9]:
!ls output/serving_model/

1759066329
