### 📦 Setup: Import the Libraries
We import the Kubeflow Pipelines SDK (`kfp`) and the scikit-learn utilities used in this notebook. Each component below repeats its own imports inside the function body, because every component runs in its own container.

In [1]:
import kfp
from kfp import dsl
import pickle
import os
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


### 📥 Step 1: Load and Split the Iris Dataset
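We use scikit-learn's `load_iris()` to load a small, well-known dataset of flower measurements (150 samples, 4 features, 3 species). We then hold out 20% of the samples for testing (`test_size=0.2`) and write the training and test sets to separate pickled artifacts for the later steps.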

In [2]:
@dsl.component(base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3")
def preprocess_op(train_path: dsl.Output[dsl.Artifact],
                  test_path: dsl.Output[dsl.Artifact]):
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    import pickle

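    # Load the features (X) and species labels (y) as NumPy arrays.
    X, y = load_iris(return_X_y=True)
    # Hold out 20% for testing; a fixed random_state makes the split repeatable.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)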

    with open(train_path.path, 'wb') as f:
        pickle.dump((X_train, y_train), f)
    with open(test_path.path, 'wb') as f:
        pickle.dump((X_test, y_test), f)
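
The split and the file handoff can be tried locally without Kubeflow. The sketch below is a simplified stand-in, not scikit-learn's implementation: `split_indices` is a hypothetical helper that shuffles row indices and holds out a fraction of them, and a temporary file plays the role of the artifact's `.path`.

```python
import os
import pickle
import random
import tempfile


def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and hold out a `test_size` fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


# Iris has 150 samples, so a 0.2 test fraction holds out 30 of them.
train_idx, test_idx = split_indices(150, test_size=0.2)

# Write the training indices to a file and read them back, mirroring how
# one component writes to an artifact path and the next one reads from it.
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact_path = os.path.join(tmp_dir, "train.pkl")  # assumed file name
    with open(artifact_path, "wb") as f:
        pickle.dump(train_idx, f)
    with open(artifact_path, "rb") as f:
        restored_idx = pickle.load(f)

print(len(train_idx), len(test_idx))
```

Because components exchange data only through files like this, anything passed between them has to be serializable; pickle is what this notebook uses.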


### 🌲 Step 2: Train the Model
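We train a Random Forest classifier, an ensemble that fits many decision trees on random subsets of the data and combines their predictions, which usually generalizes better than a single tree. The fitted model is pickled to `model_output` so the evaluation step can load it.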

In [3]:
@dsl.component(base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3")
def train_op(train_data: dsl.Input[dsl.Artifact],
             model_output: dsl.Output[dsl.Model]):
    import pickle
    from sklearn.ensemble import RandomForestClassifier

    with open(train_data.path, 'rb') as f:
        X_train, y_train = pickle.load(f)

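    # Fit a forest with default hyperparameters; random_state fixes the seed
    # so repeated runs produce the same model.
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)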

    with open(model_output.path, 'wb') as f:
        pickle.dump(clf, f)
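
To make the combining step concrete: scikit-learn's random forest averages the class probabilities predicted by each tree and picks the class with the highest mean. The sketch below reproduces only that averaging step in plain Python. The per-tree probabilities are made-up numbers, and `forest_predict` is a hypothetical helper, not part of scikit-learn.

```python
def forest_predict(tree_probas):
    """Average per-tree class probabilities and return (winning class, means)."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    mean = [sum(p[c] for p in tree_probas) / n_trees for c in range(n_classes)]
    best_class = max(range(n_classes), key=mean.__getitem__)
    return best_class, mean


# Three imaginary trees scoring one flower against the three Iris classes.
tree_probas = [
    [1.0, 0.0, 0.0],  # tree 1 is sure it is class 0
    [0.0, 1.0, 0.0],  # tree 2 is sure it is class 1
    [0.0, 0.6, 0.4],  # tree 3 leans toward class 1
]

label, mean = forest_predict(tree_probas)
print(label, mean)
```

Here class 1 wins because its averaged probability is the largest, even though one tree voted confidently for class 0.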

### 📊 Step 3: Evaluate the Model
We test how well the model predicts unseen data using `accuracy_score`, which returns the fraction of correct predictions as a value between 0 and 1 (multiply by 100 for a percentage).

In [4]:
@dsl.component(base_image="quay.io/jupyter/scipy-notebook:lab-4.4.3")
def eval_op(test_data: dsl.Input[dsl.Artifact],
            model_input: dsl.Input[dsl.Model]):
    import pickle
    from sklearn.metrics import accuracy_score

    with open(test_data.path, 'rb') as f:
        X_test, y_test = pickle.load(f)
    with open(model_input.path, 'rb') as f:
        clf = pickle.load(f)

    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"Model accuracy: {acc}")
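
For intuition, the accuracy reported above is just the number of matching labels divided by the total. The following is a minimal plain-Python sketch of that calculation; `accuracy` is a hypothetical helper and the labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [0, 1, 2, 2, 1]  # made-up true species labels
y_pred = [0, 1, 2, 1, 1]  # made-up predictions with one mistake

acc = accuracy(y_true, y_pred)
print(f"Model accuracy: {acc}")  # 4 of 5 correct, so 0.8
```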


### 🔗 Step 4: Define the Pipeline Structure
We chain the components together using `@dsl.pipeline`. Passing one task's output as another task's input defines both the data flow and the execution order.

In [5]:
@dsl.pipeline(name="iris-classification-v2")
def iris_pipeline():
    # Produces the train_path and test_path artifacts.
    preprocess_task = preprocess_op()
    # Consumes train_path, so it runs after preprocessing.
    train_task = train_op(train_data=preprocess_task.outputs["train_path"])
    # Consumes the test split and the trained model, so it runs last.
    eval_task = eval_op(
        test_data=preprocess_task.outputs["test_path"],
        model_input=train_task.outputs["model_output"]
    )


### 🛠️ Step 5: Compile the Pipeline
We compile the pipeline into a YAML file that Kubeflow Pipelines can run.

In [6]:
from kfp import compiler
compiler.Compiler().compile(
    pipeline_func=iris_pipeline,
    package_path="iris_pipeline_v4.yaml"
)
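
The execution order is never written out explicitly in `iris_pipeline`; it follows from which outputs each task consumes. As a rough illustration only (this is not how Kubeflow schedules tasks internally, and the `dependencies` mapping is hand-written here rather than read from the compiled YAML), the standard library's `graphlib` module (Python 3.9+) derives the same order from those dependencies:

```python
from graphlib import TopologicalSorter

# Hand-written for illustration: each task maps to the tasks whose
# outputs it consumes in iris_pipeline.
dependencies = {
    "preprocess_op": set(),
    "train_op": {"preprocess_op"},
    "eval_op": {"preprocess_op", "train_op"},
}

# A topological sort lists every task after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

With these three tasks there is only one valid order: preprocessing, then training, then evaluation.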

### 🧪 Running a Code Block
This block executes part of our pipeline-building logic.