---
title: PARAFAC2 Pipeline Orchestrator Demonstration
description: A prototype for an sklearn-based PARAFAC2 pipeline orchestrator with preprocessing, pipeline, postprocessing and results demonstration
project: parafac2
status: closed
conclusion: "orchestrator successfully completes the PARAFAC2 pipeline and displays results, storing them in a database in a datamart-like model"
---

# Introduction

Multistage preprocessing of data prior to modeling can quickly become difficult to manage: keeping track of the order in which the stages execute, modifying stages, and inspecting the data between stages all need to be handled. One popular tool is the sklearn `Pipeline` object, which can be used to organise and inspect the individual stages. Its most powerful feature, however, is that it works with sklearn's grid search, allowing the user to **test** different hyperparameter combinations across the stages. For example, you could see how a more aggressive baseline correction strategy affects the binning of the signal, and thus the decomposition result. It does require a modicum of work to set up, but the payoff is large. In this notebook we set up a pipeline framework that lets us organise the stages of the pipeline and inspect it if an error is encountered.
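
As a rough illustration of the grid search idea, the sketch below builds a two-stage sklearn `Pipeline` and searches over hyperparameters of both stages at once. It is not the pipeline used in this notebook: `subtract_baseline` is a hypothetical stand-in for a baseline correction stage, `PCA` is swapped in for the PARAFAC2 decomposition because it ships with sklearn and provides a `score` method, and the data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def subtract_baseline(X, quantile=0.0):
    """Hypothetical baseline correction: subtract a per-row quantile."""
    return X - np.quantile(X, quantile, axis=1, keepdims=True)


# synthetic stand-in for signal data: 40 samples, 12 features, offset baseline
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12)) + 5.0

pipe = Pipeline(
    steps=[
        ("baseline", FunctionTransformer(subtract_baseline)),
        ("decomp", PCA()),  # assumption: PCA stands in for PARAFAC2 here
    ]
)

# parameters are addressed as <step name>__<parameter name>
param_grid = {
    "baseline__kw_args": [{"quantile": 0.0}, {"quantile": 0.1}],
    "decomp__n_components": [2, 3],
}

# with no explicit scorer, the final step's own score method is used
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X)

print(search.best_params_)

# inspect the data as it leaves the first stage of the best pipeline
intermediate = search.best_estimator_.named_steps["baseline"].transform(X)
print(intermediate.shape)
```

The `named_steps` lookup at the end is the inspection hook: any fitted stage can be pulled out and applied on its own to see what the next stage receives.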

# Setup


In [None]:
%reload_ext autoreload
%autoreload 2

# get the test data as two tables: metadata and a samplewise stacked img table
import logging

import duckdb as db
import polars as pl

from pca_analysis.notebooks.experiments.parafac2_pipeline.parafac2_decomposition import (
    get_data_run_pipeline,
)
from pca_analysis.notebooks.experiments.parafac2_pipeline.parafac2_viz import (
    Parafac2Viz,
)
from tests.test_definitions import TEST_DB_PATH

logger = logging.getLogger(__name__)

logger.setLevel(logging.DEBUG)

# collect the distinct run ids; the context manager closes the connection on exit
with db.connect(TEST_DB_PATH) as conn:
    ids = [x[0] for x in conn.execute("select distinct runid from inc_chm").fetchall()]

ids


# Define Test Data

We want a representative dataset small enough to enable quick iterations.

# Decomposition


Visual inspection hints at 8 peaks, which translates to a rank of 8 + 1, with the extra component accounting for noise.


In [None]:
testdata_filter_expr = pl.col("mins").is_between(0, 2) & pl.col("nm").is_between(
    230, 270
)

df = get_data_run_pipeline(TEST_DB_PATH, ids, testdata_filter_expr)
df.head()


## Viz

A demonstration of one visualisation of the decomposition results: the components overlaid on the input signal for a single run at one wavelength.


In [None]:
wavelength = 256
runid = df.get_column("runid")[0]

pv = Parafac2Viz()

pv.overlay_components_input_signal(rectified_df=df, wavelength=wavelength, runid=runid)
