---
title: PARAFAC2 Pipeline Orchestrator Demonstration
description: A prototype of an sklearn-based PARAFAC2 pipeline orchestrator covering preprocessing, pipeline, postprocessing and results demonstration
project: parafac2
status: closed
conclusion: "orchestrator successfully completes the PARAFAC2 pipeline and displays results, storing them in a database in a datamart-like model"
---

# Introduction

Multistage preprocessing of data prior to modeling can quickly become difficult to manage: you need to keep track of the order in which the stages execute, modify individual stages, and inspect the data between stages. One popular tool for this is the sklearn `Pipeline` object, which can be used to organise and inspect the individual stages. Its most powerful feature, however, is that it works with sklearn's grid search, allowing the user to **test** different hyperparameter combinations across the different stages. For example, you could see how a more aggressive baseline correction strategy would affect the binning of the signal, and thus the decomposition result. It does require a modicum of work to set up, but the payoff is large. In this notebook we will set up a pipeline framework that enables us to organise the stages of the pipeline and inspect it if an error is encountered.
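
To make the grid search idea concrete, here is a minimal sketch on synthetic data. It is not the orchestrator's pipeline: `BaselineSubtractor` is a hypothetical stand-in for a baseline correction stage, and `PCA` stands in for the decomposition because it exposes a `score` method that `GridSearchCV` can use without labels.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class BaselineSubtractor(BaseEstimator, TransformerMixin):
    """Hypothetical baseline correction: subtract a per-row quantile."""

    def __init__(self, quantile=0.05):
        self.quantile = quantile

    def fit(self, X, y=None):
        # Stateless stage, nothing to learn.
        return self

    def transform(self, X):
        baseline = np.quantile(X, self.quantile, axis=1, keepdims=True)
        return X - baseline


# Synthetic stand-in for a (samples x signal points) matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 12)) + 5.0

pipe = Pipeline(
    [
        ("baseline", BaselineSubtractor()),
        ("decomp", PCA()),
    ]
)

# Parameters of any stage are addressed as "<step name>__<parameter>".
param_grid = {
    "baseline__quantile": [0.05, 0.25],
    "decomp__n_components": [2, 4],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X)
print(search.best_params_)

# Inspect the data as it leaves an intermediate stage of the best pipeline.
intermediate = search.best_estimator_.named_steps["baseline"].transform(X)
```

The same `named_steps` lookup is what makes it possible to inspect the data between stages when something goes wrong further down the pipeline.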

# Setup


In [None]:
%reload_ext autoreload
%autoreload 2

# get the test data as two tables: metadata and a samplewise stacked img table

import duckdb as db
from pca_analysis.definitions import DB_PATH_UV
from pca_analysis.code.get_sample_data import get_ids_by_varietal

import polars as pl

from pca_analysis.notebooks.experiments.parafac2_pipeline.orchestrator import (
    Orchestrator,
)
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

con = db.connect(DB_PATH_UV)
ids = get_ids_by_varietal(con=con, varietal="shiraz")


# Define Test Data

We want a representative dataset small enough to enable quick iterations.

In [None]:
testdata_filter_expr = pl.col("mins").is_between(0.7, 1.39) & pl.col("nm").is_between(
    240, 270
)

orc = Orchestrator()
orc.load_data(con=con, runids=ids, filter_expr=testdata_filter_expr)
orc.input_data.plot_3d()


# Decomposition


Visual inspection hints at 8 peaks, which translates to a rank of 8 + 1, with the extra component for noise.


In [None]:
results = orc.run_pipeline()


## Viz

A demonstration of some visualisations of the decomposition results.


### The Reconstruction in 3D


In [None]:
results.viz_recon_3d()


### Overlay Components and Input


In [None]:
results.viz_overlay_curve_components(sample=0, wavelength=0)


### Overlay Input and Recon 2D

In [None]:
results.viz_recon_input_overlay(sample=7, wavelength=10)


### Facet of Recon and Input 2D By Sample


In [None]:
results.viz_recon_input_overlay_facet(wavelengths=10, facet_col="sample")


### Facet by Wavelength


In [None]:
results.viz_recon_input_overlay_facet(samples=2, facet_col="wavelength_point")


## Test: Is my Reconstruction Equal to the Tensorly Implementation?

As a sanity check, we verify whether the reconstruction computed here matches the one produced by Tensorly.

### Verifying the Horizontal Sum

To reconstruct a slice we sum the components, since by definition their sum is the model's reconstruction of the input X. This is currently achieved by converting the individual component columns into a list and summing the list. One possible cause of the discrepancy is that duckdb handles floating-point values (or list sums) slightly differently from numpy. To test this, we can get the component columns as a numpy array, sum them, and compare the result to the duckdb result.

In [None]:
results._check_computations_match_tly()


OK, that proves it. Any variation is caused by either polars or duckdb rather than by a computation error on my part.


# Conclusion

A bare-bones orchestrator with preprocessing, pipeline, postprocessing and results display is complete.
