# Profiling scikit-learn Pipelines with Stripje

This notebook shows how to measure per-step timings for a fitted scikit-learn `Pipeline` using `PipelineProfiler` from `stripje.profiling`.

In [6]:
import time

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from stripje.profiling import PipelineProfiler

class SleepTransformer(TransformerMixin, BaseEstimator):
    """Transformer that sleeps before passing data through unchanged."""
    def __init__(self, sleep_seconds: float) -> None:
        self.sleep_seconds = sleep_seconds

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Simulate an expensive step so it stands out in the profile.
        time.sleep(self.sleep_seconds)
        return X

In [7]:
# Build a mixed-type dataset
rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        "age": rng.integers(18, 70, size=200),
        "income": rng.normal(55000, 15000, size=200),
        "city": rng.choice(["NY", "SF", "LA"], size=200),
        "owns_home": rng.choice(["yes", "no"], size=200),
    }
)
target = (df["income"] > 60000).astype(int)

In [8]:
# Define a pipeline with a ColumnTransformer and an estimator
numeric_features = ["age", "income"]
categorical_features = ["city", "owns_home"]

slow_numeric = Pipeline([
    ("sleep", SleepTransformer(0.5)),
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

slow_categorical = Pipeline([
    ("sleep", SleepTransformer(0.3)),
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", slow_numeric, numeric_features),
        ("cat", slow_categorical, categorical_features),
    ],
    remainder="drop",
    n_jobs=2,
)

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=500)),
])
model.fit(df, target)

In [9]:
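# Run the profiler for prediction timings
profiler = PipelineProfiler(model, mode="predict", repetitions=2, warmup=1)
# Profile the fitted pipeline on the full DataFrame
report = profiler.run(df)
# Profile the compiled path on a single row (the first one)
compiled_report = profiler.run_compiled(df.iloc[0])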
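
The `warmup` and `repetitions` arguments suggest the usual benchmarking pattern: run the call a few times untimed, then record several timed runs. How `PipelineProfiler` implements this internally is not shown in this notebook, so the sketch below is only a generic, standard-library illustration; `time_call` is a hypothetical helper, not part of Stripje.

```python
import time
from statistics import mean


def time_call(func, *args, repetitions=2, warmup=1):
    """Discard `warmup` calls, then return the durations of `repetitions` calls."""
    for _ in range(warmup):
        func(*args)  # untimed: lets caches and lazy initialisation settle
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return durations


# Demo with a trivial callable so the example runs instantly.
calls = []
durations = time_call(calls.append, 1, repetitions=3, warmup=2)
print(f"calls made: {len(calls)}, timed runs: {len(durations)}")
print(f"mean duration: {mean(durations) * 1e3:.6f} ms")
```

Only the timed runs contribute to the reported figure, which is why raising either argument tends to reduce the influence of one-off start-up costs.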

In [10]:
# Helper to print the profiling tree in a readable form
def print_report(report, title):
    print(title)
    print("-" * len(title))
    def recurse(node, indent=0):
        duration = node.last_duration_display
        print(" " * indent + f"{node.name} ({node.kind}) - {duration}")
        for child in node.children:
            recurse(child, indent + 2)
    recurse(report.root)

print_report(report, "Full pipeline timings")
print()
print_report(compiled_report, "Compiled single-row timings")

Full pipeline timings
---------------------
pipeline (Pipeline) - 518.956 ms
  preprocess (ColumnTransformer) - 518.426 ms
    num (Pipeline) - 504.991 ms
      sleep (SleepTransformer) - 500.754 ms
      impute (SimpleImputer) - 3.117 ms
      scale (StandardScaler) - 0.577 ms
    cat (Pipeline) - 304.335 ms
      sleep (SleepTransformer) - 300.433 ms
      impute (SimpleImputer) - 2.272 ms
      encode (OneHotEncoder) - 1.127 ms
  classifier (LogisticRegression) - 0.478 ms

Compiled single-row timings
---------------------------
compiled_pipeline (callable) - 803.572 ms
  preprocess (compiled_step) - 803.476 ms
  classifier (compiled_step) - 0.065 ms
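
The two totals differ in a way that is worth checking by hand. One plausible reading, assuming `n_jobs=2` lets the `ColumnTransformer` run its two branches concurrently while the compiled path executes them one after another, is that the full-pipeline `preprocess` time should sit near the slower branch and the compiled time near the sum of both. The arithmetic below uses only the numbers printed above; the compiled run processes a single row, so the comparison is approximate.

```python
# Durations in milliseconds, copied from the report output above.
num_branch = 504.991
cat_branch = 304.335
full_preprocess = 518.426
compiled_preprocess = 803.476

# If the branches overlap, the slower one dominates.
parallel_estimate = max(num_branch, cat_branch)
# If the branches run back to back, their costs add up.
sequential_estimate = num_branch + cat_branch

print(f"parallel estimate:   {parallel_estimate:.3f} ms (measured {full_preprocess} ms)")
print(f"sequential estimate: {sequential_estimate:.3f} ms (measured {compiled_preprocess} ms)")
```

Both measured values land within a few percent of the corresponding estimate, and the compiled figure is close to the 0.5 s + 0.3 s of injected sleep.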

## Next steps

- Increase `repetitions` or `warmup` for more stable measurements.
- Use `report.to_dict()` to export results for further analysis.
- Compare different pipeline configurations by profiling each variant.
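
For comparing variants, it can help to flatten the report tree into rows. The sketch below relies only on the node attributes that `print_report` already uses (`name`, `kind`, `children`, `last_duration_display`); the structure returned by `report.to_dict()` is not shown in this notebook, so it is not used here. `FakeNode` and `flatten` are hypothetical names introduced so the example runs without Stripje installed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in exposing the attributes print_report relies on."""
    name: str
    kind: str
    last_duration_display: str
    children: list = field(default_factory=list)


def flatten(node, path=()):
    """Yield one dict per node, with a slash-joined path from the root."""
    full_path = path + (node.name,)
    yield {
        "path": "/".join(full_path),
        "kind": node.kind,
        "duration": node.last_duration_display,
    }
    for child in node.children:
        yield from flatten(child, full_path)


# A small tree using values from the output above.
root = FakeNode("pipeline", "Pipeline", "518.956 ms", [
    FakeNode("preprocess", "ColumnTransformer", "518.426 ms"),
    FakeNode("classifier", "LogisticRegression", "0.478 ms"),
])
rows = list(flatten(root))
for row in rows:
    print(row)
```

With a real report, `pd.DataFrame(list(flatten(report.root)))` should give one table per variant, assuming the report nodes expose the same attributes as in `print_report`.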