# Three Levels of Workflows

In the previous notebook, we introduced a type of workflow called `Classification` that is a *specialized workflow*. These workflows are built for a specific use case and come with a predefined sequence of steps. While a few parts can be customized, the overall structure is fixed. This makes them ideal for end users who may not be familiar with all the tools available, but still need a reliable solution for a given task.

However, specialized workflows are just one way to define a process. In this tutorial, we'll explore other levels of workflow design that offer more flexibility and control.

## Reminder: The Classification Specialized Workflow

Let’s briefly return to the classification specialized workflow. It is a ready-to-use pipeline tailored to classification tasks. It follows a fixed sequence of steps: preprocessing, training, and evaluation, with only minimal customization required. It works out of the box.

In [1]:
from neuralk_foundry_ce.workflow.use_cases import Classification
from sklearn.datasets import load_iris


X, y = load_iris(return_X_y=True, as_frame=True)

use_case = Classification('best_buy_simple_categ')
use_case.notebook_display()
data, metrics = use_case.run(X, y)
print(f'Final test ROC AUC {metrics["xgboost-classifier"]["test_roc_auc"]}')

Final test ROC AUC 1.0




## Understanding and Recoding the Specialized Classification Workflow

Let’s take a closer look at the classification specialized workflow. When we display it, we see that it's made up of a clear sequence of steps:

* **StratifiedShuffleSplitter**: This step splits the dataset into training, validation, and test sets while preserving the balance of the target classes.
* **ColumnTypeDetection**: Automatically detects which columns are categorical, numerical, or text.
* **CategoricalPreprocessor**: Handles missing values and encodes categorical columns into numbers the model can use.
* **NumericalPreprocessor**: Fills in missing values and optionally standardizes or normalizes numerical columns.
* **TfidfVectorizer**: Transforms text columns into numerical vectors using the TF-IDF method.
* **XGBoostClassifier**: Trains a gradient boosting model to perform the classification.
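
To make "preserving the balance of the target classes" concrete, here is a minimal sketch of a stratified split in plain Python. It is not Foundry's implementation: the function `stratified_split` and its parameters are hypothetical, and it only produces a train/test split, whereas the step above also produces a validation set.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Sample the same fraction from every class
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced target: 80 samples of class "a", 20 of class "b"
labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both subsets keep the original 4:1 ratio between the two classes, which a purely random split would not guarantee on small or imbalanced datasets.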

This workflow is specialized because these steps are predefined and tightly linked to the classification task. However, we can recreate the exact same structure as a regular workflow in Foundry if we want more control or customization. Let’s see how.


In [2]:
from neuralk_foundry_ce.sample_selection.splitter import StratifiedShuffleSplitter
from neuralk_foundry_ce.feature_engineering.preprocessing import (
    ColumnTypeDetection,
    CategoricalPreprocessor,
    NumericalPreprocessor,
)
from neuralk_foundry_ce.feature_engineering.vectorizer import TfidfVectorizer
from neuralk_foundry_ce.models.classifier import XGBoostClassifier
from neuralk_foundry_ce.workflow import WorkFlow


steps = [
    StratifiedShuffleSplitter(),
    ColumnTypeDetection(),
    CategoricalPreprocessor(imputation="most_frequent", encoding="onehot"),
    NumericalPreprocessor(imputation="mean", scaling="standard"),
    TfidfVectorizer(),
    XGBoostClassifier(),
]

# Create the workflow and display it
workflow = WorkFlow(steps, cache_dir=None)
workflow.display()

We immediately notice something important: although the structure is similar to the specialized workflow, this regular workflow displays all steps at the same level, without any internal grouping.

So is a workflow just a sequence of steps? Not quite. In Foundry, a workflow also acts as an experiment runner. It manages the execution of steps, handles all the underlying logistics, and adds helpful features like caching intermediate results to avoid recomputing the same things twice.
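
The experiment-runner role described above can be sketched as follows. This is an illustration under stated assumptions, not Foundry's actual code: `ToyStep` and `ToyRunner` are hypothetical classes, and the cache is a plain in-memory dictionary keyed by the step name and a hash of the data the step can see.

```python
import hashlib
import pickle


class ToyStep:
    """Hypothetical step: a name plus a function from memory dict to new outputs."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.calls = 0  # counts real executions, to make cache hits visible

    def run(self, memory):
        self.calls += 1
        return self.func(memory)


class ToyRunner:
    """Runs steps in order over a shared dict, caching each step's outputs."""

    def __init__(self, steps):
        self.steps = steps
        self.cache = {}

    def run(self, init_data):
        memory = dict(init_data)
        for step in self.steps:
            # Cache key: step name plus a hash of everything the step can see
            payload = pickle.dumps(sorted(memory.items()))
            key = (step.name, hashlib.sha256(payload).hexdigest())
            if key not in self.cache:
                self.cache[key] = step.run(memory)
            memory.update(self.cache[key])
        return memory


double = ToyStep("double", lambda m: {"doubled": [2 * v for v in m["values"]]})
total = ToyStep("total", lambda m: {"total": sum(m["doubled"])})
runner = ToyRunner([double, total])

first = runner.run({"values": [1, 2, 3]})
second = runner.run({"values": [1, 2, 3]})  # same inputs: served from the cache
print(first["total"], double.calls, total.calls)
```

The second call returns the same result without executing either step again, which is the behavior that makes caching useful when only the last steps of a pipeline change between experiments.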

This raises a natural question: why do both specialized and regular workflows exist?

The answer lies in the balance between simplicity and flexibility. Specialized workflows are designed for common tasks and provide a streamlined experience with minimal configuration—ideal for non-expert users or quick prototypes. Regular workflows, on the other hand, offer full control over each step and are better suited for advanced use cases, experimentation, or research.

Let’s now run the regular workflow on the same example. In the specialized workflow, the `run` method only takes `X` and `y`, because its fixed structure assumes a feature table and a target. The regular workflow, on the other hand, assumes nothing about the data, so we need to provide an `init_data` dictionary with all required inputs. To help with this, regular workflows offer a `check_consistency` method that verifies whether the sequence of steps can run with a given set of input keys.

In [3]:
workflow.check_consistency(init_keys={'X', 'y'})




False

We see here that providing only `X` and `y` is not enough, since the hyperparameter optimization step in the classifier also requires a metric to optimize. More generally, it’s good practice to define a clear objective before running a task.

The `Classification` specialized workflow handles this automatically behind the scenes. With a regular workflow, we need to specify it explicitly.

In [4]:
data, metrics = workflow.run(init_data={'X': X, 'y': y, 'metric_to_optimize': 'roc_auc'})
print(f'Final test ROC AUC {metrics["xgboost-classifier"]["test_roc_auc"]}')

Final test ROC AUC 1.0
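
One way to picture what a consistency check does is the sketch below. It assumes each step declares the keys it requires and the keys it produces. The function `toy_check_consistency` and the step declarations are hypothetical and only mirror the behavior observed above, where `X` and `y` alone were rejected.

```python
def toy_check_consistency(steps, init_keys):
    """Return True if every step's required inputs are available when it runs."""
    available = set(init_keys)
    for name, requires, produces in steps:
        if not set(requires) <= available:
            return False
        # Outputs of this step become available to the following steps
        available |= set(produces)
    return True


# Hypothetical declarations: (step name, required keys, produced keys)
toy_steps = [
    ("splitter", {"X", "y"}, {"splits"}),
    ("classifier", {"X", "y", "splits", "metric_to_optimize"}, {"y_pred"}),
]

missing_metric = toy_check_consistency(toy_steps, {"X", "y"})
with_metric = toy_check_consistency(toy_steps, {"X", "y", "metric_to_optimize"})
print(missing_metric, with_metric)
```

The first call fails because no step produces `metric_to_optimize` and it is not among the initial keys. Adding it to the initial keys makes the sequence runnable.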




The workflow provides several useful features: a visual display of steps, automatic caching of intermediate results, and step-wise metrics returned automatically. However, if needed, steps can also be run manually, allowing you to inject custom logic between them.

## Running It Step by Step

The sequence of steps we defined earlier can also be executed manually, one step at a time. The only requirement is to maintain a shared memory that tracks inputs and outputs across steps. In the example below, we display the output and metrics for each step as we go.

In [5]:
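# Shared memory: starts with the same inputs as `init_data` above
memory = {'X': X, 'y': y, 'metric_to_optimize': 'roc_auc'}

for step in steps:
    # Each step reads its inputs from memory and returns only its new outputs
    new_memory = step.run(memory)
    metrics = step.logged_metrics

    print(f'Step {step.name}:')
    print(f'* Outputs: {list(new_memory.keys())}')
    print(f'* Metrics: {list(metrics.keys())}')
    print()

    # Merge the outputs so that later steps can use them
    memory.update(new_memory)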

Step stratified-shuffle-split:
* Outputs: ['splits']
* Metrics: ['num_samples', 'num_columns', 'num_categorical', 'num_numerical', 'num_boolean', 'num_datetime', 'missing_values_ratio', 'columns_with_missing', 'high_cardinality_columns', 'constant_columns']

Step column-type-detection:
* Outputs: ['numerical_features', 'categorical_features', 'text_features', 'date_features']
* Metrics: []

Step categorical-preprocessing:
* Outputs: ['X']
* Metrics: []

Step numerical-preprocessing:
* Outputs: ['X']
* Metrics: []

Step tfidf-vectorizer:
* Outputs: ['X']
* Metrics: []

Step xgboost-classifier:
* Outputs: ['y_score', 'y_pred']
* Metrics: ['test_cross_entropy', 'test_roc_auc', 'test_precision', 'test_recall', 'test_hinge_loss', 'test_f1_score', 'test_accuracy', 'best_hyperopt_params', 'best_hyperopt_score', 'metric_to_optimize', 'mem_usage', 'time_usage']
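
Running steps by hand is also what makes it possible to inject custom logic between them. The sketch below shows the pattern with two hypothetical toy steps written as plain functions. With real Foundry steps, the same kind of check would sit inside the loop shown earlier, after `memory.update(new_memory)`.

```python
def scale(memory):
    """Toy step: rescale X by its largest value."""
    top = max(memory["X"])
    return {"X": [x / top for x in memory["X"]]}


def mean(memory):
    """Toy step: compute the mean of X."""
    return {"mean": sum(memory["X"]) / len(memory["X"])}


toy_steps = [("scale", scale), ("mean", mean)]
memory = {"X": [2.0, 4.0, 8.0]}

for name, run_step in toy_steps:
    memory.update(run_step(memory))

    # Custom logic injected between steps: validate and annotate after scaling
    if name == "scale":
        assert all(0.0 <= x <= 1.0 for x in memory["X"]), "scaling failed"
        memory["n_samples"] = len(memory["X"])

print(memory)
```

Because the shared memory is an ordinary dictionary, the injected code can inspect intermediate results, stop early, or add new keys for later steps.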



