# Caching the dataframe(s) in case of a pipeline failure

This notebook shows a couple of features that help debug a broken pipeline.

Note that this debugging workflow is intended for notebooks: it relies on nebula storage, which lives inside the Python kernel and therefore cannot be used in a recipe or in Airflow.

There are two main situations in which a pipeline may break:
- A transformer fails.
- In a split pipeline, the split dataframes cannot be merged because their schemas or columns differ.

In the first case, nebula stores the input dataframe of the failed transformer. In the second, it retains all the dataframes that should have been merged, so the user can retrieve them and address the issue.
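The first mechanism can be pictured with a minimal, library-independent sketch. It is not nebula's implementation: `run_with_fail_cache`, `STORAGE`, and the key layout used here are hypothetical stand-ins, and plain lists take the place of dataframes.

```python
# Illustrative sketch only: NOT nebula's implementation.
STORAGE = {}  # hypothetical in-memory store; it lives only as long as the kernel


def run_with_fail_cache(steps, data):
    """Apply each step in order; on failure, keep the failing step's input."""
    for step in steps:
        try:
            data = step(data)
        except Exception as exc:
            key = f"FAIL_DF_{step.__name__}"  # hypothetical key layout
            STORAGE[key] = data  # `data` is still the input of the failed step
            raise RuntimeError(
                f"Input cached under {key!r}. Original error: {exc}"
            ) from exc
    return data


def double(rows):
    return [2 * r for r in rows]


def broken(rows):
    raise ValueError("Broken transformer")


try:
    run_with_fail_cache([double, broken], [1, 2])
except RuntimeError as err:
    message = str(err)

# The cached object is the output of `double`, i.e. the input of `broken`.
cached_input = STORAGE["FAIL_DF_broken"]
```

Because the store is an ordinary in-process object, it disappears when the process ends, which is why this approach only suits an interactive session.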

In [1]:
import polars as pl

from nebula import nebula_storage as ns
from nebula import TransformerPipeline
from nebula.base import Transformer
from nebula.transformers import (
    AssertNotEmpty,
    DropColumns,
    RenameColumns,
    SelectColumns,
)

In [2]:

class ThisTransformerIsBroken(Transformer):
    @staticmethod
    def _transform_nw(df):
        raise ValueError("Broken transformer")


df_input = pl.DataFrame({
    "c1": [1, 2],
    "c2": [3, 4],
    "c3": [5, 6],
})

In [3]:
ns.clear()
ns.set("AASASDS", 123)
ns.set("AASAsdSDS", 12653)
# _FAIL_CACHE.clear()

pipe = TransformerPipeline([ThisTransformerIsBroken(), AssertNotEmpty()])

pipe.run(df_input)


2025-12-25 22:29:32,424 | [INFO]: Nebula Storage: clear. 
2025-12-25 22:29:32,428 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-25 22:29:32,429 | [INFO]: Nebula Storage: setting an object (<class 'int'>) with the key "AASASDS". 
2025-12-25 22:29:32,429 | [INFO]: Nebula Storage: setting an object (<class 'int'>) with the key "AASAsdSDS". 
2025-12-25 22:29:32,429 | [INFO]: Starting pipeline 
2025-12-25 22:29:32,429 | [INFO]: Running 'ThisTransformerIsBroken' ... 
2025-12-25 22:29:32,439 | [ERROR]: Error at node ThisTransformerIsBroken_b01eb6@1: Broken transformer 
2025-12-25 22:29:32,439 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "FAIL_DF_transformer:ThisTransformerIsBroken". 


ValueError: Get the dataframe(s) before the failure in the nebula storage with the keys: ['transformer:ThisTransformerIsBroken']
Original Error:
Broken transformer

In [None]:
ns.list_keys()

In [4]:

# name = _PREFIX_FAIL_CACHE + "ThisTransformerIsBroken"
# print(ns.list_keys())

# df_chk = ns.get(name)

In [5]:
ns.clear()

def _split_function(df):
    cond = pl.col("cad1") < 2
    return {
        "low": df.filter(cond),
        "hi": df.filter(~cond),
    }
    
data = {
    "low": [DropColumns(columns="c2")],
    "hi": [DropColumns(columns="c3")],
}
pipe = TransformerPipeline(data, split_function=_split_function)

pipe.run(df_input)

2025-12-25 22:29:34,843 | [INFO]: Nebula Storage: clear. 
2025-12-25 22:29:34,844 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-25 22:29:34,844 | [INFO]: Starting pipeline 
2025-12-25 22:29:34,845 | [INFO]: Entering split 
2025-12-25 22:29:34,845 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "FAIL_DF_fork:split". 


ColumnNotFoundError: Get the dataframe(s) before the failure in the nebula storage with the keys: ['fork:split']
Original Error:
unable to find column "cad1"; valid columns: ["c1", "c2", "c3"]

In [None]:
ns.list_keys()

In [None]:
data = [
    [0.1234, "a", "b"],
    [4.1234, "", ""],
    [5.1234, None, None],
    [6.1234, "", None],
    [8.1234, "a", None],
    [9.1234, "a", ""],
    [10.1234, "", "b"],
    [11.1234, "a", None],
    [12.1234, None, "b"],
    [14.1234, "", None],
]

df_input = pl.DataFrame(data, orient="row", schema=["c1", "c2", "c3"])
print(df_input.schema)
df_input

In [None]:
def _split_function_with_null(df: pl.DataFrame) -> dict[str, pl.DataFrame]:
    """Split dataframe into 'low', 'hi', and 'null' subsets."""
    ret = _split_function(df)
    # Include both actual nulls and NaN values in the 'null' split
    cond_null = pl.col("c1").is_null() | pl.col("c1").is_nan()
    return {**ret, "null": df.filter(cond_null)}


In [None]:
dict_transformers = {"low": [], "hi": []}

pipe = TransformerPipeline(
    dict_transformers,
    split_function=_split_function_with_null,
    splits_no_merge={"hi"},
)

pipe.run(df_input)

In [None]:
dir(pipe)

In [None]:
pipe._ir.steps

In [None]:
callable(lambda x: x)

In [None]:
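# Spark variant of the example; assumes an active SparkSession named `spark`.
from pyspark.sql.types import FloatType, StringType, StructField, StructType

schema = [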
    StructField("c1", FloatType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True),
]

data = [
    [0.1234, "a", "b"],
    [0.1234, "a", "b"],
    [0.1234, "a", "b"],
    [1.1234, "a", "  b"],
    [2.1234, "  a  ", "  b  "],
    [3.1234, "", ""],
    [4.1234, "   ", "   "],
    [5.1234, None, None],
    [6.1234, " ", None],
    [7.1234, "", None],
    [8.1234, "a", None],
    [9.1234, "a", ""],
    [10.1234, "   ", "b"],
    [11.1234, "a", None],
    [12.1234, None, "b"],
    [13.1234, None, "b"],
    [14.1234, None, None],
]

df_input = spark.createDataFrame(data, schema=StructType(schema)).cache()
df_input.show()

## Transformer failure

In [None]:
class ThisTransformerIsBroken:
    @staticmethod
    def transform(df):
        """Public transform method w/o parent class."""        
        return df.select("wrong")


# clear the cache
ns.clear()

pipe = TransformerPipeline([
    NanToNull(input_columns="*"),
    ThisTransformerIsBroken(),
    Distinct(),
])

pipe.show_pipeline(add_transformer_params=True)

### Retrieve the input dataframe of the failed transformer after the pipeline breaks

The error message lists the key(s) under which that dataframe is stored.

The original exception is reported a few lines above it.

In [None]:
pipe.run(df_input)

In [None]:
# Use the key reported in the error message (`ns.list_keys()` shows all stored keys).
ns.get("FAIL_DF_ThisTransformerIsBroken").show()

## Unable to merge splits

In this example the transformers work properly, but they modify the dataframes in a way that makes it impossible to merge them back.

To help with this, all the dataframes are stored right before the union step, allowing the user to investigate the problem.

In this example one split drops the column `c2`, the other one the column `c3`, hence they cannot be merged.
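The mismatch can be made concrete by comparing the column lists of the splits. The sketch below uses plain Python lists in place of each dataframe's `columns` attribute; `diff_columns` is a hypothetical helper, not part of nebula.

```python
def diff_columns(splits):
    """Return, for each split, the columns that other splits have but it lacks."""
    all_cols = set()
    for cols in splits.values():
        all_cols.update(cols)
    return {name: sorted(all_cols - set(cols)) for name, cols in splits.items()}


# Column lists as they look after each split has dropped one column.
splits = {
    "low": ["c1", "c3"],  # c2 dropped
    "hi": ["c1", "c2"],   # c3 dropped
}

missing = diff_columns(splits)
for name, cols in missing.items():
    print(f"{name}: missing {cols}")
```

The same comparison can be applied to the column lists of the dataframes retrieved from the storage after the failure.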

In [None]:
from pyspark.sql import functions as F


def my_split_function(df):
    cond = F.col("c1") < 10
    return {
        "low": df.filter(cond),
        "hi": df.filter(~cond),
    }


dict_transf = {
    "low": [DropColumns(columns="c2")],
    "hi": [DropColumns(columns="c3")],
}

# clear the cache
ns.clear()

pipe = TransformerPipeline(dict_transf, split_function=my_split_function)

pipe.show_pipeline(add_transformer_params=True)

In [None]:
_ = pipe.run(df_input)

In [None]:
ns.get("FAIL_DF_low").show()

In [None]:
ns.get("FAIL_DF_hi").show()

### Overwriting keys associated with the failed dataframes

The keys used for storing dataframes are generated so that existing entries are never overwritten: a numerical suffix is appended when a key is already in use, so the user does not need to worry about it.

When the same broken pipeline is rerun without clearing the cache, the keys of the newly failed dataframes do not overwrite the previous ones.

In [None]:
_ = pipe.run(df_input)

The keys associated with the failed dataframes are now:
- `FAIL_DF_low_0`
- `FAIL_DF_hi_0`
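One way such collision-free keys could be generated is sketched below. This is an assumption about the general technique, not nebula's actual code: `unique_key` is a hypothetical name, and the suffix numbering simply mirrors the keys listed above.

```python
def unique_key(base, existing):
    """Return `base` if unused, otherwise the first free `base_0`, `base_1`, ..."""
    if base not in existing:
        return base
    i = 0
    while f"{base}_{i}" in existing:
        i += 1
    return f"{base}_{i}"


store = {}
for _ in range(3):  # three failing runs without clearing the store
    key = unique_key("FAIL_DF_low", store)
    store[key] = "dataframe placeholder"

keys = list(store)
```

Each run gets a fresh key, so earlier failed dataframes remain available until the store is cleared explicitly.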