# Insurance Fraud Detection Pipeline (Kubeflow)

This notebook defines and compiles a Kubeflow pipeline to detect insurance fraud.  
It consists of three components—**preprocess**, **train**, and **eval**—plus a pipeline definition and YAML compilation.

---

## Cell 1: Imports

**What**  
We import the Kubeflow Pipelines SDK (`kfp`), data libraries (pandas, numpy), ML utilities (scikit-learn), and artifact handling (pickle).

**Why**  
These libraries let us define pipeline components, manipulate tabular data, split datasets, train models, and measure accuracy.

**What will happen**  
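After this cell runs, notebook-level code can reference these modules. Each component still repeats its own imports inside the function body, because KFP executes every component in a separate container.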


In [56]:
# Cell 1: Imports
import kfp
from kfp import dsl
import pickle
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


## Cell 2: Preprocessing Component

**What**  
Declare a Kubeflow component `preprocess_op` that:  
1. Reads the raw CSV  
2. Cleans missing values and drops irrelevant columns  
3. Writes a cleaned CSV artifact  
4. Uploads it to MinIO  
5. Builds and fits a scikit-learn `ColumnTransformer` for numeric and categorical features  
6. Saves the fitted preprocessor to MinIO and as an output artifact  
7. Transforms, splits into train/test sets, and serializes them
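
Step 2 of the list above can be sketched on a toy frame. The values here are invented and only the column names echo the notebook, so treat this as an illustration of the blank-to-NaN replacement and the guarded column drop, not as the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are made up for illustration.
raw = pd.DataFrame({
    "policy_number": [101, 102, 103],
    "collision_type": ["Rear", "  ", ""],
    "total_claim_amount": [5000, 7200, 6100],
})

# Empty or whitespace-only strings become NaN so they count as missing.
cleaned = raw.replace(r"^\s*$", np.nan, regex=True)

# Drop only the columns that are actually present, as the component does.
drop_cols = ["policy_number", "_c39"]
cleaned = cleaned.drop(columns=[c for c in drop_cols if c in cleaned.columns])

print(cleaned)
print("missing per column:", cleaned.isna().sum().to_dict())
```

The list comprehension keeps the drop from failing when a column such as `_c39` is absent from a given export.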

**Why**  
Real-world insurance data has missing values and mixed types. Cleaning and feature engineering turn raw claims into numeric arrays suitable for ML models.

**What will happen**  
This component outputs four artifacts:  
- `clean_csv` (cleaned dataset)  
- `prep_joblib` (fitted preprocessor)  
- `train_path` & `test_path` (pickled train/test splits)
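
The pickled splits are plain `(X, y)` tuples. A minimal sketch of the split-and-serialize round trip, using synthetic arrays in place of the transformed claims and a hypothetical file name, might look like this:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed features: 100 rows, 25% positive labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = np.array([1] * 25 + [0] * 75)

# stratify=y keeps the positive rate the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Write the training split as a tuple, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump((X_train, y_train), f)
    with open(path, "rb") as f:
        X_loaded, y_loaded = pickle.load(f)

print(X_loaded.shape, y_loaded.mean(), y_test.mean())
```

Downstream components only need to know the tuple order, which is why `train_op` and `eval_op` unpack two values from each file.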


In [57]:
@dsl.component(
    base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3",
    packages_to_install=["pandas","numpy","scikit-learn","joblib","minio"]
)
def preprocess_op(
    raw_csv:  dsl.Input[dsl.Artifact],
    clean_csv: dsl.Output[dsl.Artifact],
    prep_joblib: dsl.Output[dsl.Artifact],
    train_path: dsl.Output[dsl.Artifact],
    test_path:  dsl.Output[dsl.Artifact],
):
    import os
    import pandas as pd
    import numpy as np
    import pickle
    import joblib
    from minio import Minio
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # 1️⃣ Load & basic clean
    df = pd.read_csv(raw_csv.path)
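    # Treat empty or whitespace-only strings as missing
    df = df.replace(r'^\s*$', np.nan, regex=True)
    # Fill only non-numeric columns; numeric gaps are left for the median imputer,
    # which also keeps numeric columns from turning into mixed-type object columns
    text_cols = df.select_dtypes(exclude='number').columns
    df[text_cols] = df[text_cols].fillna('NA')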
    drop_cols = [
        'policy_number',
        'policy_bind_date',
        'incident_date',
        'incident_location',
        '_c39'
    ]
    df = df.drop(columns=[c for c in drop_cols if c in df.columns])

    # 2️⃣ Persist cleaned CSV artifact (ephemeral)
    df.to_csv(clean_csv.path, index=False)

    # 3️⃣ Push cleaned CSV to stable MinIO path
    client = Minio(
        endpoint="minio-service.kubeflow:9000",
        # Default Kubeflow MinIO credentials; load these from a secret outside demos
        access_key="minio",
        secret_key="minio123",
        secure=False
    )
    client.fput_object(
        "mlpipeline",
        "insurance/insurance_fraud_cleaned.csv",
        clean_csv.path
    )

    # 4️⃣ Build & fit feature‐engineering pipeline
    X = df.drop('fraud_reported', axis=1)
    y = df['fraud_reported'].map({'Y':1,'N':0})

    numeric_cols = X.select_dtypes(include=['int64','float64']).columns.tolist()
    categorical_cols = X.select_dtypes(include=['object','category']).columns.tolist()
    if 'insured_zip' in numeric_cols:
        numeric_cols.remove('insured_zip')
        categorical_cols.append('insured_zip')

    num_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler',  StandardScaler())
    ])
    cat_pipe = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
        ('ohe',      OneHotEncoder(handle_unknown='ignore'))
    ])
    fe = ColumnTransformer([
        ('num', num_pipe, numeric_cols),
        ('cat', cat_pipe, categorical_cols)
    ])

    fe.fit(X, y)

    # 5️⃣ Persist preprocessor to stable MinIO path & as component output
    fe_path = prep_joblib.path
    joblib.dump(fe, fe_path)
    client.fput_object(
        "mlpipeline",
        "insurance/preprocessor.joblib",
        fe_path
    )

    # 6️⃣ Transform, split, and serialize train/test sets
    X_fe = fe.transform(X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_fe, y, test_size=0.2, stratify=y, random_state=42
    )

    with open(train_path.path, 'wb') as f:
        pickle.dump((X_train, y_train), f)
    with open(test_path.path,  'wb') as f:
        pickle.dump((X_test,  y_test),  f)


## Cell 3: Training Component

**What**  
Define `train_op` that loads the serialized training split, fits a `RandomForestClassifier`, and outputs a pickled model.

**Why**  
Random forests are robust classifiers for tabular data and require minimal tuning to get reasonable performance.

**What will happen**  
This component writes out `model_output`, a serialized scikit-learn model artifact.
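
To see what the serialized artifact contains, the sketch below trains the same classifier configuration on synthetic data (an assumption, standing in for the pickled training split) and checks that a pickled copy predicts identically.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the pickled training split.
X_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
clf.fit(X_train, y_train)

# Serialize and restore in memory; the component writes the same kind of
# pickle payload to model_output.path instead.
restored = pickle.loads(pickle.dumps(clf))

same = bool((restored.predict(X_train) == clf.predict(X_train)).all())
print("restored model matches:", same)
```

Because the model is a pickle, the evaluation step should load it with a scikit-learn version compatible with the one used for training. Here all three components share one base image, which keeps that consistent.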


In [None]:
@dsl.component(base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3")
def train_op(
    train_data: dsl.Input[dsl.Artifact],
    model_output: dsl.Output[dsl.Model],
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier

    # Load numeric arrays
    with open(train_data.path, 'rb') as f:
        X_train, y_train = pickle.load(f)

    # Train
    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    clf.fit(X_train, y_train)

    # Dump
    with open(model_output.path, 'wb') as f:
        pickle.dump(clf, f)


## Cell 4: Evaluation Component

**What**  
Define `eval_op` that loads the test split and trained model, computes accuracy, and prints it.

**Why**  
Scoring the model on the held-out test split shows whether it generalizes beyond the training data. Accuracy is a coarse first check, since fraud is typically the minority class.

**What will happen**  
When this component runs, you’ll see “Model accuracy: 0.xxxx” in the logs.
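
When reading the logged number, it helps to compare it against a trivial baseline. The sketch below uses hypothetical labels (25 fraud cases out of 100, an assumed ratio) and a "model" that never flags fraud. It is not part of the pipeline.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 25 fraudulent claims out of 100.
y_true = [1] * 25 + [0] * 75
# A useless "model" that never flags fraud.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist.
prec = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f}")
# prints: accuracy=0.75 recall=0.00 precision=0.00
```

An accuracy near the majority-class rate therefore says little on its own. Adding `recall_score` or `precision_score` to `eval_op` would be one way to make the log line more informative.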

In [59]:
# Cell 4: Eval component
@dsl.component(base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3")
def eval_op(
    test_data:  dsl.Input[dsl.Artifact],
    model_input: dsl.Input[dsl.Model],
):
    import pickle
    from sklearn.metrics import accuracy_score

    # 1️⃣ Load test split & model
    with open(test_data.path,  'rb') as f:
        X_test, y_test = pickle.load(f)
    with open(model_input.path, 'rb') as f:
        clf = pickle.load(f)

    # 2️⃣ Compute & print accuracy
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"Model accuracy: {acc:.4f}")


## Cell 5: Pipeline Definition

**What**  
Assemble the three components into a `@dsl.pipeline` named **insurance-fraud-detection-v2**. The raw CSV enters the pipeline through KFP's `importer`, and the components are then chained as `preprocess_op → train_op → eval_op`.

**Why**  
Kubeflow Pipelines needs a single pipeline function to orchestrate component execution order and data flow.

**What will happen**  
Running this cell only defines the pipeline function; nothing executes yet. Compiling it in the next cell yields a static YAML spec you can upload to your Kubeflow cluster.
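
Execution order follows from which outputs each task consumes. The toy sketch below reproduces that idea with the standard library's `graphlib`. It is an analogy for the dependency graph, not KFP's actual scheduler, and the task names are illustrative.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks whose outputs it consumes.
consumes = {
    "importer": set(),
    "preprocess_op": {"importer"},
    "train_op": {"preprocess_op"},
    "eval_op": {"preprocess_op", "train_op"},
}

# A topological sort yields an order in which every input exists before use.
order = list(TopologicalSorter(consumes).static_order())
print(order)
# prints: ['importer', 'preprocess_op', 'train_op', 'eval_op']
```

`eval_op` depends on both the test split and the trained model, so it can only run last.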

In [60]:
# Cell 5: Define the pipeline
@dsl.pipeline(name="insurance-fraud-detection-v2")
def fraud_pipeline(
    raw_csv: str = "minio://mlpipeline/insurance_claims.csv"
):
    # Import raw CSV from MinIO (or UI-uploaded artifact) as a Dataset artifact.
    # `dsl` is already imported in Cell 1; the `kfp.v2` namespace is deprecated.
    raw_data = dsl.importer(
        artifact_uri=raw_csv,
        artifact_class=dsl.Dataset,
        reimport=True,
    )

    # 1) clean & split
    prep = preprocess_op(raw_csv=raw_data.output)

    # 2) train
    train = train_op(train_data=prep.outputs["train_path"])

    # 3) eval
    eval_op(
        test_data=prep.outputs["test_path"],
        model_input=train.outputs["model_output"]
    )


## Cell 6: Compile to YAML

**What**  
Use the KFP compiler to generate `insurance_fraud_pipeline_v12.yaml` from our pipeline function.

**Why**  
The Kubeflow Pipelines UI and API ingest a YAML definition when creating or updating pipelines.

**What will happen**  
You’ll see a confirmation message and the YAML file appear in your working directory.

In [61]:
# Cell 6: Compile to YAML
from kfp import compiler  # the kfp.v2 namespace is deprecated in KFP SDK 2.x
compiler.Compiler().compile(
    pipeline_func=fraud_pipeline,
    package_path="insurance_fraud_pipeline_v12.yaml"
)
print("✅ Generated insurance_fraud_pipeline_v12.yaml")


✅ Generated insurance_fraud_pipeline_v12.yaml
