## L1 - Bronze Stage: Raw Data Ingestion

**Notebook Purpose:**

This notebook is the first stage of the ETL (Extract, Transform, Load) pipeline. Its main job is to pull raw data from an external source (here, CSV files in a GitHub repository) and store it in the *data lakehouse* as Delta tables in the **Bronze** layer.

**Processing Steps:**
1.  **Extract**: Read the CSV files from raw GitHub URLs, keeping every value as a string.
2.  **Minimal Transform**:
    - Add a `load_date` metadata column to track when the data was ingested.
    - Standardize column names from `CAPITAL_CASE` to `snake_case` for consistency, expanding terse names where needed (for example, `FNAME` becomes `first_name`).
3.  **Load**: Save the minimally processed data in Delta Lake format using `overwrite` mode. This mode replaces the table contents on every run, so each Bronze table always mirrors the latest source data.
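
The mechanical part of the column standardization in step 2 can be sketched as a small helper. This is an illustration only, not the notebook's method: the notebook renames each column explicitly, and `to_snake_case` is a hypothetical name. The helper only works for names that already separate words with underscores or spaces. A name such as `DISEASEID` has no word boundary to detect, so it still needs an explicit mapping to become `disease_id`.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a CAPITAL_CASE column name and normalize its separators.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    # Collapse any run of non-alphanumeric characters into a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    # Drop underscores left at either end, then lowercase.
    return cleaned.strip("_").lower()

columns = ["CLAIM_ID", "SIMILARITY_SCORE", "DISEASEID"]
renamed = [to_snake_case(c) for c in columns]
print(renamed)
```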

In [8]:
# Data source config
OWNER_GITHUB = "harissabil"
REPO = "dave-datasource"
BRANCH = "main"
BASE_RAW = f"https://raw.githubusercontent.com/{OWNER_GITHUB}/{REPO}/{BRANCH}/"

TABLES = [
    "claim","claim_similarity","disease","disease_ontology","has_disease",
    "incharge","incharge_of_claim","insured_of_claim","patient","policyholder",
    "policyholder_connection","policyholder_of_claim","service",
]



In [9]:
from pyspark.sql import functions as F
from pyspark.sql import types as T
import re, io, requests, pandas as pd

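def read_csv_as_pandas(url: str) -> pd.DataFrame:
    """Download a CSV and return it as a pandas DataFrame of trimmed strings."""
    # A timeout makes a stalled connection fail instead of hanging the notebook.
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    # dtype=str disables type inference; keep_default_na=False keeps empty
    # cells as "" instead of NaN, so the Bronze layer stores the raw text.
    pdf = pd.read_csv(io.StringIO(r.text), dtype=str, keep_default_na=False)
    for c in pdf.columns:
        pdf[c] = pdf[c].astype(str).str.strip()
    return pdf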
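
To make the effect of `dtype=str` and `keep_default_na=False` concrete without a network call, the sketch below parses a small in-memory CSV twice: once with pandas defaults and once with the options used in `read_csv_as_pandas`. The sample data is made up for illustration.

```python
import io
import pandas as pd

# Made-up sample: a leading zero, padded whitespace, and an empty cell are deliberate.
sample = "CLAIM_ID,CHARGE\n007, 12.50 \n8,\n"

# Default parsing infers types: "007" becomes the integer 7,
# and the empty CHARGE cell becomes NaN.
inferred = pd.read_csv(io.StringIO(sample))

# The options used in read_csv_as_pandas keep every cell as raw text.
raw = pd.read_csv(io.StringIO(sample), dtype=str, keep_default_na=False)
for c in raw.columns:
    raw[c] = raw[c].str.strip()

print(inferred["CLAIM_ID"].tolist())
print(raw["CLAIM_ID"].tolist(), raw["CHARGE"].tolist())
```

Keeping identifiers as text preserves values such as leading zeros. Type casting can then be done deliberately in a later stage.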
    


In [10]:
dfs = {}

for t in TABLES:
    url = BASE_RAW + f"{t.upper()}.csv"
    print(f"→ loading {t} from {url}")
    pdf = read_csv_as_pandas(url)

    df = (
        spark.createDataFrame(pdf)
        .select([F.trim(F.col(c)).alias(c) for c in pdf.columns])
        .withColumn("load_date", F.current_timestamp())
    )

    dfs[t] = df
    print(f"  {t}: rows={df.count()} cols={len(df.columns)}")
    


→ loading claim from https://raw.githubusercontent.com/harissabil/dave-datasource/main/CLAIM.csv
  claim: rows=100001 cols=10
→ loading claim_similarity from https://raw.githubusercontent.com/harissabil/dave-datasource/main/CLAIM_SIMILARITY.csv
  claim_similarity: rows=15279 cols=4
→ loading disease from https://raw.githubusercontent.com/harissabil/dave-datasource/main/DISEASE.csv
  disease: rows=397 cols=3
→ loading disease_ontology from https://raw.githubusercontent.com/harissabil/dave-datasource/main/DISEASE_ONTOLOGY.csv
  disease_ontology: rows=448 cols=3
→ loading has_disease from https://raw.githubusercontent.com/harissabil/dave-datasource/main/HAS_DISEASE.csv
  has_disease: rows=446 cols=3
→ loading incharge from https://raw.githubusercontent.com/harissabil/dave-datasource/main/INCHARGE.csv
  incharge: rows=10001 cols=6
→ loading incharge_of_claim from https://raw.githubusercontent.com/harissabil/dave-datasource/main/INCHARGE_OF_CLAIM.csv
  incharge_of_claim: rows=100001 cols=3


In [11]:
bronze_base = "Tables/Bronze"

def getdf(name: str):
    if name not in dfs:
        raise KeyError(f"DataFrame '{name}' is not in dfs. Make sure the read_csv_as_pandas step succeeded.")
    return dfs[name]

# CLAIM
df = getdf("claim")
df = (
    df.withColumnRenamed("CLAIM_ID", "claim_id")
      .withColumnRenamed("CHARGE", "charge")
      .withColumnRenamed("CLAIM_DATE", "claim_date")
      .withColumnRenamed("DURATION", "duration")
      .withColumnRenamed("INSURED_ID", "insured_id")
      .withColumnRenamed("DIAGNOSIS", "diagnosis")
      .withColumnRenamed("PERSON_INCHARGE_ID", "person_incharge_id")
      .withColumnRenamed("TYPE", "claim_type")
      .withColumnRenamed("POLICYHOLDER_ID", "policyholder_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/claim")
print("✓ Bronze/claim saved")

# CLAIM_SIMILARITY
df = getdf("claim_similarity")
df = (
    df.withColumnRenamed("CLAIM_ID", "claim_id")
      .withColumnRenamed("SIM_CLAIM_ID", "sim_claim_id")
      .withColumnRenamed("SIMILARITY_SCORE", "similarity_score")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/claim_similarity")
print("✓ Bronze/claim_similarity saved")

# DISEASE
df = getdf("disease")
df = (
    df.withColumnRenamed("DISEASEID", "disease_id")
      .withColumnRenamed("CONCEPT_NAME", "concept_name")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/disease")
print("✓ Bronze/disease saved")

# DISEASE_ONTOLOGY
df = getdf("disease_ontology")
df = (
    df.withColumnRenamed("PARENTDISEASE", "parent_disease_id")
      .withColumnRenamed("CHILDDISEASE", "child_disease_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/disease_ontology")
print("✓ Bronze/disease_ontology saved")

# HAS_DISEASE
df = getdf("has_disease")
df = (
    df.withColumnRenamed("PATIENT_ID", "patient_id")
      .withColumnRenamed("DISEASEID", "disease_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/has_disease")
print("✓ Bronze/has_disease saved")

# INCHARGE
df = getdf("incharge")
df = (
    df.withColumnRenamed("INCHARGE_ID", "incharge_id")
      .withColumnRenamed("FNAME", "first_name")
      .withColumnRenamed("LNAME", "last_name")
      .withColumnRenamed("RISK_SCORE", "risk_score")
      .withColumnRenamed("SERVICE_ID", "service_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/incharge")
print("✓ Bronze/incharge saved")

# INCHARGE_OF_CLAIM
df = getdf("incharge_of_claim")
df = (
    df.withColumnRenamed("CLAIM_ID", "claim_id")
      .withColumnRenamed("PERSON_INCHARGE_ID", "person_incharge_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/incharge_of_claim")
print("✓ Bronze/incharge_of_claim saved")

# INSURED_OF_CLAIM
df = getdf("insured_of_claim")
df = (
    df.withColumnRenamed("CLAIM_ID", "claim_id")
      .withColumnRenamed("PATIENT_ID", "patient_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/insured_of_claim")
print("✓ Bronze/insured_of_claim saved")

# PATIENT
df = getdf("patient")
df = (
    df.withColumnRenamed("PATIENT_ID", "patient_id")
      .withColumnRenamed("SUBSCRIPTION_ID", "subscription_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/patient")
print("✓ Bronze/patient saved")

# POLICYHOLDER
df = getdf("policyholder")
df = (
    df.withColumnRenamed("POLICYHOLDER_ID", "policyholder_id")
      .withColumnRenamed("FNAME", "first_name")
      .withColumnRenamed("LNAME", "last_name")
      .withColumnRenamed("RISK_SCORE", "risk_score")
      .withColumnRenamed("HIGH_RISK", "high_risk")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/policyholder")
print("✓ Bronze/policyholder saved")

# POLICYHOLDER_CONNECTION
df = getdf("policyholder_connection")
df = (
    df.withColumnRenamed("POLICYHOLDER_ID", "policyholder_id")
      .withColumnRenamed("POLICYHOLDER_ASSOCIATE_ID", "policyholder_associate_id")
      .withColumnRenamed("LEVEL", "level")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/policyholder_connection")
print("✓ Bronze/policyholder_connection saved")

# POLICYHOLDER_OF_CLAIM
df = getdf("policyholder_of_claim")
df = (
    df.withColumnRenamed("CLAIM_ID", "claim_id")
      .withColumnRenamed("POLICYHOLDER_ID", "policyholder_id")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/policyholder_of_claim")
print("✓ Bronze/policyholder_of_claim saved")

# SERVICE
df = getdf("service")
df = (
    df.withColumnRenamed("SERVICE_ID", "service_id")
      .withColumnRenamed("SERVICE_NAME", "service_name")
      .withColumnRenamed("RISK_SCORE", "risk_score")
)
df.write.format("delta").mode("overwrite").save(f"{bronze_base}/service")
print("✓ Bronze/service saved")



✓ Bronze/claim saved
✓ Bronze/claim_similarity saved
✓ Bronze/disease saved
✓ Bronze/disease_ontology saved
✓ Bronze/has_disease saved
✓ Bronze/incharge saved
✓ Bronze/incharge_of_claim saved
✓ Bronze/insured_of_claim saved
✓ Bronze/patient saved
✓ Bronze/policyholder saved
✓ Bronze/policyholder_connection saved
✓ Bronze/policyholder_of_claim saved
✓ Bronze/service saved
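
The per-table blocks above all follow the same pattern, so the rename rules could also be kept as data and validated before writing. The sketch below is one possible arrangement, not the notebook's code: `RENAMES` and `plan_renames` are hypothetical names, and only two of the thirteen mappings are copied in. The check is useful because Spark's `withColumnRenamed` is a no-op when the source column does not exist, so a typo in a column name would otherwise pass silently.

```python
from typing import Dict, List

# Hypothetical registry; the two entries are copied from the rename blocks above.
RENAMES: Dict[str, Dict[str, str]] = {
    "disease": {"DISEASEID": "disease_id", "CONCEPT_NAME": "concept_name"},
    "service": {
        "SERVICE_ID": "service_id",
        "SERVICE_NAME": "service_name",
        "RISK_SCORE": "risk_score",
    },
}

def plan_renames(columns: List[str], mapping: Dict[str, str]) -> List[str]:
    """Return the renamed column list, failing loudly on unknown source columns."""
    missing = sorted(set(mapping) - set(columns))
    if missing:
        raise KeyError(f"columns not found in source: {missing}")
    # Columns without a mapping (such as load_date) pass through unchanged.
    return [mapping.get(c, c) for c in columns]

print(plan_renames(["DISEASEID", "CONCEPT_NAME", "load_date"], RENAMES["disease"]))
```

With a Spark DataFrame, the returned list could be applied in one call as `df.toDF(*new_names)`.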


In [13]:
for t in TABLES:
    df = spark.read.format("delta").load(f"{bronze_base}/{t}")
    print(f"\n=== {t} ===")
    df.printSchema()
    df.show(5, truncate=False)




=== claim ===
root
 |-- claim_id: string (nullable = true)
 |-- charge: string (nullable = true)
 |-- claim_date: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- insured_id: string (nullable = true)
 |-- diagnosis: string (nullable = true)
 |-- person_incharge_id: string (nullable = true)
 |-- claim_type: string (nullable = true)
 |-- policyholder_id: string (nullable = true)
 |-- load_date: timestamp (nullable = true)

+--------+---------+-------------------+--------+----------+------------+------------------+----------+---------------+--------------------------+
|claim_id|charge   |claim_date         |duration|insured_id|diagnosis   |person_incharge_id|claim_type|policyholder_id|load_date                 |
+--------+---------+-------------------+--------+----------+------------+------------------+----------+---------------+--------------------------+
|C7520   |61454.81 |2014-07-05 00:00:00|24      |626       |no exception|PI18755           |services  |PH208    