# 04 - Storage & Runtime Dynamics

This notebook covers Nebula's runtime features for managing state, debugging, and handling failures:

| Part | Topic | Description |
|------|-------|-------------|
| **1** | Nebula Storage | Store and retrieve objects during pipeline execution |
| **2** | Pipeline Storage Keywords | Declarative storage operations in pipelines |
| **3** | LazyWrapper | Runtime parameter resolution from storage |
| **4** | Interleaved Transformers | Inject debug transformers between steps |
| **5** | Failure Recovery | Retrieve DataFrames when pipelines fail |

In [1]:
import polars as pl

from nebula import TransformerPipeline
from nebula.storage import nebula_storage as ns
from nebula.base import Transformer, LazyWrapper
from nebula.transformers import (
    AddLiterals,
    AssertNotEmpty,
    DropColumns,
    DropNulls,
    Filter,
    SelectColumns,
)

In [2]:
# Sample e-commerce data
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer": ["alice", "bob", "alice", "carol", "bob"],
    "amount": [150.0, 75.0, 200.0, 50.0, 300.0],
    "status": ["completed", "completed", "pending", "completed", "pending"],
})
orders

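shape: (5, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed │
│ 2        ┆ bob      ┆ 75.0   ┆ completed │
│ 3        ┆ alice    ┆ 200.0  ┆ pending   │
│ 4        ┆ carol    ┆ 50.0   ┆ completed │
│ 5        ┆ bob      ┆ 300.0  ┆ pending   │
└──────────┴──────────┴────────┴───────────┘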


---
## Part 1: Nebula Storage

Nebula storage is an in-memory key-value store that lives within your Python process. It's useful for:

- **Passing data between transformers** (e.g., a DataFrame for joining)
- **Storing intermediate results** for debugging
- **Recovering from failures** (automatic - see Part 5)

### 1.1 Basic API

In [3]:
# Always start fresh
ns.clear()

# Store any Python object
ns.set("my_number", 42)
ns.set("my_list", [1, 2, 3])
ns.set("my_df", orders)

# Retrieve objects
print(f"Number: {ns.get('my_number')}")
print(f"List: {ns.get('my_list')}")
print(f"DataFrame rows: {len(ns.get('my_df'))}")

2025-12-26 01:09:25,726 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:25,745 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:25,748 | [INFO]: Nebula Storage: setting an object (<class 'int'>) with the key "my_number". 
2025-12-26 01:09:25,750 | [INFO]: Nebula Storage: setting an object (<class 'list'>) with the key "my_list". 
2025-12-26 01:09:25,751 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "my_df". 


Number: 42
List: [1, 2, 3]
DataFrame rows: 5


In [4]:
# Inspect storage state
print(f"Keys: {ns.list_keys()}")
print(f"Count: {ns.count_objects()}")
print(f"'my_number' exists: {ns.isin('my_number')}")
print(f"'unknown' exists: {ns.isin('unknown')}")

Keys: ['my_df', 'my_list', 'my_number']
Count: 3
'my_number' exists: True
'unknown' exists: False


In [5]:
# Clear specific keys
ns.clear("my_number")
print(f"After clearing 'my_number': {ns.list_keys()}")

# Clear multiple keys
ns.clear(["my_list", "my_df"])
print(f"After clearing list: {ns.list_keys()}")

# Clear everything
ns.set("temp", "value")
ns.clear()  # No argument = clear all
print(f"After clear(): {ns.list_keys()}")

2025-12-26 01:09:27,257 | [INFO]: Nebula Storage: clear key "my_number". 
2025-12-26 01:09:27,258 | [INFO]: Nebula Storage: 2 keys remained after clearing. 
2025-12-26 01:09:27,259 | [INFO]: Nebula Storage: clear user-defined keys. 
2025-12-26 01:09:27,261 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:27,262 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "temp". 
2025-12-26 01:09:27,262 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:27,263 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


After clearing 'my_number': ['my_df', 'my_list']
After clearing list: []
After clear(): []
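
To make these semantics concrete, here is a minimal sketch of a process-local key-value store. `TinyStorage` is a hypothetical stand-in written for illustration, not Nebula's implementation. It only mirrors the behavior visible in the outputs above: sorted key listing, and `clear` accepting no argument, a single key, or a list of keys.

```python
class TinyStorage:
    """Hypothetical dict-backed stand-in for an in-memory key-value store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]  # in this sketch, unknown keys raise KeyError

    def isin(self, key):
        return key in self._data

    def list_keys(self):
        return sorted(self._data)

    def clear(self, keys=None):
        if keys is None:           # no argument: drop everything
            self._data.clear()
            return
        if isinstance(keys, str):  # a single key
            keys = [keys]
        for key in keys:
            self._data.pop(key, None)


store = TinyStorage()
store.set("my_number", 42)
store.set("my_list", [1, 2, 3])
store.clear("my_number")
print(store.list_keys())
```

Because the store is just an object held in memory, anything stored in it disappears when the Python process ends.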


### 1.2 Overwriting Control

By default, storage allows overwriting. You can disable this to catch accidental key collisions:

In [6]:
ns.clear()

# Default: overwriting allowed
ns.set("key", "first")
ns.set("key", "second")  # Silently overwrites
print(f"Value: {ns.get('key')}")

# Disable overwriting
ns.disallow_overwriting()
try:
    ns.set("key", "third")  # Raises KeyError
except KeyError as e:
    print(f"KeyError: {e}")

# Re-enable for rest of notebook
ns.allow_overwriting()
ns.clear()

2025-12-26 01:09:28,898 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:28,900 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:28,903 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "key". 
2025-12-26 01:09:28,905 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "key". 
2025-12-26 01:09:28,907 | [INFO]: Nebula Storage: disallow overwriting. 
2025-12-26 01:09:28,907 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "key". 
2025-12-26 01:09:28,908 | [INFO]: Nebula Storage: allow overwriting. 
2025-12-26 01:09:28,909 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:28,910 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


Value: second
KeyError: 'Nebula Storage: key "key" already exists and overwriting is disabled.'


### 1.3 Debug Mode

Debug mode lets you store objects conditionally. When debug mode is **off**, calls to `ns.set(..., debug=True)` are skipped without raising an error (an INFO log line records the skip). This lets you leave debugging stores in the code and switch them off for production:

In [7]:
ns.clear()

# Debug mode OFF (default)
ns.allow_debug(False)
ns.set("regular", "always stored")
ns.set("debug_data", "skipped when debug off", debug=True)

print(f"Debug OFF - keys: {ns.list_keys()}")

# Debug mode ON
ns.allow_debug(True)
ns.set("debug_data", "now it's stored", debug=True)

print(f"Debug ON - keys: {ns.list_keys()}")
print(f"Debug mode active: {ns.is_debug_mode}")

ns.allow_debug(False)
ns.clear()

2025-12-26 01:09:30,293 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:30,296 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:30,298 | [INFO]: Nebula Storage: deactivate debug storage. 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "regular". 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: asked to set "debug_data" in debug mode but the storage debug is not active. The object will not be stored. 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: activate debug storage. 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "debug_data". 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: deactivate debug storage. 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:30,299 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


Debug OFF - keys: ['regular']
Debug ON - keys: ['debug_data', 'regular']
Debug mode active: True
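
The two switches from sections 1.2 and 1.3 can be modeled with a pair of flags checked inside `set`. The sketch below is an illustration under assumptions: `GuardedStorage`, its attribute names, and the order of the two checks are hypothetical and are not taken from Nebula's source.

```python
class GuardedStorage:
    """Hypothetical store with an overwrite guard and a debug gate."""

    def __init__(self):
        self._data = {}
        self.overwriting_allowed = True
        self.debug_active = False

    def set(self, key, value, debug=False):
        # Debug-only writes are dropped unless debug mode is active.
        if debug and not self.debug_active:
            return False
        # Refuse to replace an existing key when overwriting is disabled.
        if key in self._data and not self.overwriting_allowed:
            raise KeyError(f'key "{key}" already exists')
        self._data[key] = value
        return True


store = GuardedStorage()
stored = store.set("debug_data", "skipped", debug=True)   # debug off: dropped
store.debug_active = True
stored_now = store.set("debug_data", "kept", debug=True)  # debug on: stored

store.overwriting_allowed = False
try:
    store.set("debug_data", "again")
    collided = False
except KeyError:
    collided = True

print(stored, stored_now, collided)
```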


---
## Part 2: Pipeline Storage Keywords

Instead of writing custom transformers to store data, use pipeline keywords: single-key dictionaries placed between transformers in the pipeline definition. They are not counted as transformations in `pipe.show()`.

| Keyword | Description |
|---------|-------------|
| `{"store": "key"}` | Store current DataFrame |
| `{"store_debug": "key"}` | Store only if debug mode is active |
| `{"storage_debug_mode": True/False}` | Toggle debug mode |
| `{"replace_with_stored_df": "key"}` | Replace current DataFrame with stored one |

In [8]:
ns.clear()

pipe = TransformerPipeline([
    Filter(input_col="status", perform="keep", operator="eq", value="completed"),
    {"store": "completed_orders"},  # Store after filtering
    SelectColumns(columns=["order_id", "amount"]),
    {"store": "final_output"},
])

pipe.show()

2025-12-26 01:09:31,870 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:31,872 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


*** Pipeline *** (2 transformations)
 - Filter
   --> Store df with key "completed_orders"
 - SelectColumns
   --> Store df with key "final_output"


In [9]:
result = pipe.run(orders)

print(f"\nStored keys: {ns.list_keys()}")
print(f"\nCompleted orders (before select):")
print(ns.get("completed_orders"))
print(f"\nFinal output:")
print(ns.get("final_output"))

2025-12-26 01:09:32,612 | [INFO]: Starting pipeline 
2025-12-26 01:09:32,613 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:32,619 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:32,619 | [INFO]:    --> Store df with key "completed_orders" 
2025-12-26 01:09:32,619 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "completed_orders". 
2025-12-26 01:09:32,619 | [INFO]: Running 'SelectColumns' ... 
2025-12-26 01:09:32,619 | [INFO]: Completed 'SelectColumns' in 0.0s 
2025-12-26 01:09:32,631 | [INFO]:    --> Store df with key "final_output" 
2025-12-26 01:09:32,632 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "final_output". 
2025-12-26 01:09:32,633 | [INFO]: Pipeline completed in 0.0s 



Stored keys: ['completed_orders', 'final_output']

Completed orders (before select):
shape: (3, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed │
│ 2        ┆ bob      ┆ 75.0   ┆ completed │
│ 4        ┆ carol    ┆ 50.0   ┆ completed │
└──────────┴──────────┴────────┴───────────┘

Final output:
shape: (3, 2)
┌──────────┬────────┐
│ order_id ┆ amount │
│ ---      ┆ ---    │
│ i64      ┆ f64    │
╞══════════╪════════╡
│ 1        ┆ 150.0  │
│ 2        ┆ 75.0   │
│ 4        ┆ 50.0   │
└──────────┴────────┘


### 2.1 Debug Storage in Pipelines

In [10]:
ns.clear()

pipe = TransformerPipeline([
    {"storage_debug_mode": False},           # Start with debug OFF
    Filter(input_col="amount", perform="keep", operator="gt", value=100),
    {"store_debug": "step1_skipped"},        # Skipped (debug off)
    {"storage_debug_mode": True},            # Turn debug ON
    {"store_debug": "step2_stored"},         # Stored (debug on)
    {"storage_debug_mode": False},           # Turn debug OFF
])

pipe.show()

2025-12-26 01:09:34,098 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:34,099 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


*** Pipeline *** (1 transformation)
   --> Deactivate storage debug mode
 - Filter
   --> Store df (debug) with key "step1_skipped"
   --> Activate storage debug mode
   --> Store df (debug) with key "step2_stored"
   --> Deactivate storage debug mode


In [11]:
result = pipe.run(orders)

print(f"Stored keys: {ns.list_keys()}")
# Note: 'step1_skipped' is NOT in the list

2025-12-26 01:09:34,832 | [INFO]: Starting pipeline 
2025-12-26 01:09:34,835 | [INFO]:    --> Deactivate storage debug mode 
2025-12-26 01:09:34,838 | [INFO]: Nebula Storage: deactivate debug storage. 
2025-12-26 01:09:34,839 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:34,844 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:34,844 | [INFO]:    --> Store df (debug) with key "step1_skipped" 
2025-12-26 01:09:34,844 | [INFO]: Nebula Storage: asked to set "step1_skipped" in debug mode but the storage debug is not active. The object will not be stored. 
2025-12-26 01:09:34,844 | [INFO]:    --> Activate storage debug mode 
2025-12-26 01:09:34,844 | [INFO]: Nebula Storage: activate debug storage. 
2025-12-26 01:09:34,844 | [INFO]:    --> Store df (debug) with key "step2_stored" 
2025-12-26 01:09:34,849 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "step2_stored". 
2025-12-26 01:09:34,850 | [INFO]:    --> Deactivate storage deb

Stored keys: ['step2_stored']


### 2.2 Replacing the DataFrame Mid-Pipeline

In [12]:
ns.clear()

# Pre-store a lookup table
customer_tiers = pl.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "tier": ["gold", "silver", "bronze"],
})
ns.set("customer_tiers", customer_tiers)

# Pipeline that swaps to a different DataFrame
pipe = TransformerPipeline([
    Filter(input_col="amount", perform="keep", operator="gt", value=100),
    {"store": "filtered_orders"},
    {"replace_with_stored_df": "customer_tiers"},  # Switch to tiers table
    Filter(input_col="tier", perform="keep", operator="eq", value="gold"),
])

result = pipe.run(orders)
print("Result (gold tier customers):")
print(result)

2025-12-26 01:09:36,274 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:36,276 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:36,278 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "customer_tiers". 
2025-12-26 01:09:36,280 | [INFO]: Starting pipeline 
2025-12-26 01:09:36,280 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:36,284 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:36,284 | [INFO]:    --> Store df with key "filtered_orders" 
2025-12-26 01:09:36,284 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "filtered_orders". 
2025-12-26 01:09:36,285 | [INFO]:    --> Load df from key "customer_tiers" 
2025-12-26 01:09:36,285 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:36,287 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:36,287 | [INFO]: Pipeline completed in 0.0s 


Result (gold tier customers):
shape: (1, 2)
┌──────────┬──────┐
│ customer ┆ tier │
│ ---      ┆ ---  │
│ str      ┆ str  │
╞══════════╪══════╡
│ alice    ┆ gold │
└──────────┴──────┘
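
One way to picture how keyword dictionaries fit into a pipeline is a loop that applies callables and treats single-key dictionaries as storage instructions. The sketch below is a simplified model, not Nebula's runner: `run_steps` is a hypothetical name, a plain `dict` stands in for the storage, lists of dicts stand in for DataFrames, and only the `store` and `replace_with_stored_df` keywords are handled.

```python
def run_steps(df, steps, storage):
    """Apply callables in order; single-key dicts act as storage keywords."""
    for step in steps:
        if callable(step):
            df = step(df)
            continue
        [(keyword, key)] = step.items()  # exactly one entry per keyword dict
        if keyword == "store":
            storage[key] = df
        elif keyword == "replace_with_stored_df":
            df = storage[key]
        else:
            raise ValueError(f"unknown keyword: {keyword}")
    return df


orders_rows = [
    {"order_id": 1, "status": "completed"},
    {"order_id": 3, "status": "pending"},
]
storage = {"customer_tiers": [{"customer": "alice", "tier": "gold"}]}

result = run_steps(
    orders_rows,
    [
        lambda rows: [r for r in rows if r["status"] == "completed"],
        {"store": "completed_orders"},
        {"replace_with_stored_df": "customer_tiers"},
    ],
    storage,
)
print(result)
print(storage["completed_orders"])
```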


---
## Part 3: LazyWrapper - Runtime Parameter Resolution

Sometimes transformer parameters aren't known until runtime - they depend on values computed by earlier pipeline steps. `LazyWrapper` defers transformer instantiation until `transform()` is called, resolving parameters from storage at that moment.
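
The deferral idea can be sketched in a few lines of plain Python. `LazyStep` and `KeepAbove` below are hypothetical illustrations, and a plain `dict` stands in for the storage. The point is only that the parameter is read when `transform()` runs, not when the step is declared.

```python
class LazyStep:
    """Hypothetical wrapper that builds the real step inside transform()."""

    def __init__(self, factory, **params):
        self._factory = factory
        self._params = params

    @staticmethod
    def _resolve(value):
        # A (mapping, key) tuple is read from the mapping at call time.
        if isinstance(value, tuple) and len(value) == 2 and isinstance(value[0], dict):
            storage, key = value
            return storage[key]
        return value

    def transform(self, data):
        params = {name: self._resolve(v) for name, v in self._params.items()}
        return self._factory(**params).transform(data)


class KeepAbove:
    """Toy step: keep numbers strictly greater than a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, values):
        return [v for v in values if v > self.threshold]


storage = {}
step = LazyStep(KeepAbove, threshold=(storage, "computed_threshold"))
storage["computed_threshold"] = 155.0  # written after the step was declared
kept = step.transform([150.0, 75.0, 200.0, 50.0, 300.0])
print(kept)
```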

### 3.1 Basic LazyWrapper

In [13]:
ns.clear()

# Simulate an earlier step storing a computed value
ns.set("discount_label", "holiday_promo")

# LazyWrapper resolves (ns, "key") at transform time
pipe = TransformerPipeline([
    LazyWrapper(
        AddLiterals,
        data=[{"alias": "promo_code", "value": (ns, "discount_label")}]
    ),
])

pipe.show(add_params=True)

2025-12-26 01:09:37,874 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:37,876 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:37,878 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "discount_label". 


*** Pipeline *** (1 transformation)
 - (Lazy) AddLiterals -> PARAMS: data=[{'alias': 'promo_code', 'value': 'ns.get("discount_label")'}]


In [14]:
result = pipe.run(orders)
print(result)

2025-12-26 01:09:38,800 | [INFO]: Starting pipeline 
2025-12-26 01:09:38,802 | [INFO]: Running '(Lazy) AddLiterals' ... 
2025-12-26 01:09:38,804 | [INFO]: Completed '(Lazy) AddLiterals' in 0.0s 
2025-12-26 01:09:38,807 | [INFO]: Pipeline completed in 0.0s 


shape: (5, 5)
┌──────────┬──────────┬────────┬───────────┬───────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ promo_code    │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---           │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ str           │
╞══════════╪══════════╪════════╪═══════════╪═══════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed ┆ holiday_promo │
│ 2        ┆ bob      ┆ 75.0   ┆ completed ┆ holiday_promo │
│ 3        ┆ alice    ┆ 200.0  ┆ pending   ┆ holiday_promo │
│ 4        ┆ carol    ┆ 50.0   ┆ completed ┆ holiday_promo │
│ 5        ┆ bob      ┆ 300.0  ┆ pending   ┆ holiday_promo │
└──────────┴──────────┴────────┴───────────┴───────────────┘


### 3.2 Dynamic Parameters from Earlier Steps

The main use case: one transformer computes a value and stores it, and a later transformer reads it at run time:

In [15]:
import narwhals as nw

ns.clear()

class ComputeThreshold(Transformer):
    """Compute and store a threshold based on data."""
    def _transform_nw(self, df):
        # In reality, this might be a complex calculation
        avg = df.select(nw.col("amount").mean()).to_native().item()
        ns.set("computed_threshold", avg)
        print(f"Computed threshold: {avg}")
        return df


pipe = TransformerPipeline([
    ComputeThreshold(),
    # Filter uses the threshold computed above
    LazyWrapper(
        Filter,
        input_col="amount",
        perform="keep",
        operator="gt",
        value=(ns, "computed_threshold"),
    ),
])

result = pipe.run(orders)
print(f"\nOrders above average amount:")
print(result)

2025-12-26 01:09:40,439 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:40,440 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:40,443 | [INFO]: Starting pipeline 
2025-12-26 01:09:40,443 | [INFO]: Running 'ComputeThreshold' ... 
2025-12-26 01:09:40,448 | [INFO]: Nebula Storage: setting an object (<class 'float'>) with the key "computed_threshold". 
2025-12-26 01:09:40,450 | [INFO]: Completed 'ComputeThreshold' in 0.0s 
2025-12-26 01:09:40,450 | [INFO]: Running '(Lazy) Filter' ... 
2025-12-26 01:09:40,453 | [INFO]: Completed '(Lazy) Filter' in 0.0s 
2025-12-26 01:09:40,455 | [INFO]: Pipeline completed in 0.0s 


Computed threshold: 155.0

Orders above average amount:
shape: (2, 4)
┌──────────┬──────────┬────────┬─────────┐
│ order_id ┆ customer ┆ amount ┆ status  │
│ ---      ┆ ---      ┆ ---    ┆ ---     │
│ i64      ┆ str      ┆ f64    ┆ str     │
╞══════════╪══════════╪════════╪═════════╡
│ 3        ┆ alice    ┆ 200.0  ┆ pending │
│ 5        ┆ bob      ┆ 300.0  ┆ pending │
└──────────┴──────────┴────────┴─────────┘


### 3.3 Nested Lazy Parameters

Lazy references work at any nesting depth within parameter structures:

In [16]:
ns.clear()
ns.set("col1_value", "computed_at_runtime")
ns.set("col2_value", 999)

# Lazy references nested inside list of dicts
pipe = TransformerPipeline([
    LazyWrapper(
        AddLiterals,
        data=[
            {"alias": "static_col", "value": "hardcoded"},
            {"alias": "dynamic_col1", "value": (ns, "col1_value")},
            {"alias": "dynamic_col2", "value": (ns, "col2_value")},
        ]
    ),
])

result = pipe.run(orders.head(2))
print(result)

2025-12-26 01:09:42,867 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:42,869 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:42,869 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "col1_value". 
2025-12-26 01:09:42,871 | [INFO]: Nebula Storage: setting an object (<class 'int'>) with the key "col2_value". 
2025-12-26 01:09:42,873 | [INFO]: Starting pipeline 
2025-12-26 01:09:42,874 | [INFO]: Running '(Lazy) AddLiterals' ... 
2025-12-26 01:09:42,876 | [INFO]: Completed '(Lazy) AddLiterals' in 0.0s 
2025-12-26 01:09:42,876 | [INFO]: Pipeline completed in 0.0s 


shape: (2, 7)
┌──────────┬──────────┬────────┬───────────┬────────────┬─────────────────────┬──────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ static_col ┆ dynamic_col1        ┆ dynamic_col2 │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---        ┆ ---                 ┆ ---          │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ str        ┆ str                 ┆ i32          │
╞══════════╪══════════╪════════╪═══════════╪════════════╪═════════════════════╪══════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed ┆ hardcoded  ┆ computed_at_runtime ┆ 999          │
│ 2        ┆ bob      ┆ 75.0   ┆ completed ┆ hardcoded  ┆ computed_at_runtime ┆ 999          │
└──────────┴──────────┴────────┴───────────┴────────────┴─────────────────────┴──────────────┘
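
Resolving references at any depth amounts to a recursive walk over the parameter structure. The function below is a sketch under assumptions: `resolve` is a hypothetical helper, a plain `dict` stands in for the storage, and Nebula's own traversal may differ.

```python
def resolve(value, storage):
    """Recursively replace (storage, key) tuples with the stored object."""
    if isinstance(value, tuple) and len(value) == 2 and value[0] is storage:
        return storage[value[1]]
    if isinstance(value, dict):
        return {k: resolve(v, storage) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return type(value)(resolve(v, storage) for v in value)
    return value


storage = {"col1_value": "computed_at_runtime", "col2_value": 999}
params = {
    "data": [
        {"alias": "static_col", "value": "hardcoded"},
        {"alias": "dynamic_col1", "value": (storage, "col1_value")},
        {"alias": "dynamic_col2", "value": (storage, "col2_value")},
    ]
}

resolved = resolve(params, storage)
print([item["value"] for item in resolved["data"]])
```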


### 3.4 JSON/YAML Syntax: `__ns__` Prefix

In JSON/YAML configs, use the `__ns__` prefix instead of `(ns, "key")` tuples:

```yaml
- transformer: AddLiterals
  lazy: true
  params:
    data:
      - alias: dynamic_col
        value: "__ns__my_storage_key"
```

The `lazy: true` flag tells the loader to wrap the transformer in `LazyWrapper`.

In [17]:
from nebula.pipelines.pipeline_loader import load_pipeline

ns.clear()
ns.set("runtime_value", "loaded_from_storage")

config = {
    "pipeline": [
        {
            "transformer": "AddLiterals",
            "lazy": True,
            "params": {
                "data": [
                    {"alias": "new_col", "value": "__ns__runtime_value"}
                ]
            }
        }
    ]
}

pipe = load_pipeline(config)
result = pipe.run(orders.head(2))
print(result)

2025-12-26 01:09:49,262 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:09:49,263 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:09:49,264 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "runtime_value". 
2025-12-26 01:09:49,267 | [INFO]: Starting pipeline 
2025-12-26 01:09:49,267 | [INFO]: Running '(Lazy) AddLiterals' ... 
2025-12-26 01:09:49,272 | [INFO]: Completed '(Lazy) AddLiterals' in 0.0s 
2025-12-26 01:09:49,272 | [INFO]: Pipeline completed in 0.0s 


shape: (2, 5)
┌──────────┬──────────┬────────┬───────────┬─────────────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ new_col             │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---                 │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ str                 │
╞══════════╪══════════╪════════╪═══════════╪═════════════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed ┆ loaded_from_storage │
│ 2        ┆ bob      ┆ 75.0   ┆ completed ┆ loaded_from_storage │
└──────────┴──────────┴────────┴───────────┴─────────────────────┘
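
A loader could support the prefix by rewriting matching strings into `(storage, key)` references before wrapping the transformer. The sketch below shows that idea only. `to_lazy_refs` and `NS_PREFIX` are hypothetical names, a plain `dict` stands in for the storage, and the source does not show how `load_pipeline` handles the prefix internally.

```python
NS_PREFIX = "__ns__"


def to_lazy_refs(value, storage):
    """Turn "__ns__<key>" strings into (storage, "<key>") tuples, recursively."""
    if isinstance(value, str) and value.startswith(NS_PREFIX):
        return (storage, value[len(NS_PREFIX):])
    if isinstance(value, dict):
        return {k: to_lazy_refs(v, storage) for k, v in value.items()}
    if isinstance(value, list):
        return [to_lazy_refs(v, storage) for v in value]
    return value


storage = {"runtime_value": "loaded_from_storage"}
params = {"data": [{"alias": "new_col", "value": "__ns__runtime_value"}]}

converted = to_lazy_refs(params, storage)
ref_storage, ref_key = converted["data"][0]["value"]
print(ref_key, ref_storage[ref_key])
```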


---
## Part 4: Interleaved Transformers

Interleaved transformers are injected between consecutive steps of your pipeline, which is useful for debugging, logging, or validation during development.

### 4.1 The `interleaved` Parameter

In [18]:
class LogRowCount(Transformer):
    """Debug transformer that logs row count."""
    def _transform_nw(self, df):
        print(f"  [DEBUG] Row count: {len(df)}")
        return df


pipe = TransformerPipeline(
    [
        Filter(input_col="status", perform="keep", operator="eq", value="completed"),
        Filter(input_col="amount", perform="keep", operator="gt", value=50),
        SelectColumns(columns=["order_id", "amount"]),
    ],
    interleaved=[LogRowCount()],  # Injected between consecutive steps (not after the last one by default)
)

pipe.show()

*** Pipeline *** (5 transformations)
 - Filter
 - LogRowCount
 - Filter
 - LogRowCount
 - SelectColumns


In [19]:
result = pipe.run(orders)

2025-12-26 01:09:57,742 | [INFO]: Starting pipeline 
2025-12-26 01:09:57,744 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:57,748 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:57,750 | [INFO]: Running 'LogRowCount' ... 
2025-12-26 01:09:57,753 | [INFO]: Completed 'LogRowCount' in 0.0s 
2025-12-26 01:09:57,753 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:57,753 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:57,753 | [INFO]: Running 'LogRowCount' ... 
2025-12-26 01:09:57,753 | [INFO]: Completed 'LogRowCount' in 0.0s 
2025-12-26 01:09:57,753 | [INFO]: Running 'SelectColumns' ... 
2025-12-26 01:09:57,753 | [INFO]: Completed 'SelectColumns' in 0.0s 
2025-12-26 01:09:57,753 | [INFO]: Pipeline completed in 0.0s 


  [DEBUG] Row count: 3
  [DEBUG] Row count: 2


### 4.2 Prepend and Append Options

Use `prepend_interleaved` and `append_interleaved` to also run the interleaved transformers before the first step and after the last one:

In [20]:
pipe = TransformerPipeline(
    [
        Filter(input_col="amount", perform="keep", operator="gt", value=100),
        SelectColumns(columns=["order_id", "amount"]),
    ],
    interleaved=[LogRowCount()],
    prepend_interleaved=True,   # Also run BEFORE first transformer
    append_interleaved=True,    # Also run AFTER last transformer
)

print("With prepend and append:")
result = pipe.run(orders)

2025-12-26 01:09:59,029 | [INFO]: Starting pipeline 
2025-12-26 01:09:59,029 | [INFO]: Running 'LogRowCount' ... 
2025-12-26 01:09:59,029 | [INFO]: Completed 'LogRowCount' in 0.0s 
2025-12-26 01:09:59,029 | [INFO]: Running 'Filter' ... 
2025-12-26 01:09:59,029 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:09:59,029 | [INFO]: Running 'LogRowCount' ... 
2025-12-26 01:09:59,029 | [INFO]: Completed 'LogRowCount' in 0.0s 
2025-12-26 01:09:59,029 | [INFO]: Running 'SelectColumns' ... 
2025-12-26 01:09:59,029 | [INFO]: Completed 'SelectColumns' in 0.0s 
2025-12-26 01:09:59,029 | [INFO]: Running 'LogRowCount' ... 
2025-12-26 01:09:59,029 | [INFO]: Completed 'LogRowCount' in 0.0s 
2025-12-26 01:09:59,045 | [INFO]: Pipeline completed in 0.0s 


With prepend and append:
  [DEBUG] Row count: 5
  [DEBUG] Row count: 3
  [DEBUG] Row count: 3


### 4.3 Runtime Injection: `force_interleaved_transformer`

Inject a transformer at runtime without modifying the pipeline definition:

In [21]:
# Pipeline defined without interleaved
pipe = TransformerPipeline([
    Filter(input_col="status", perform="keep", operator="eq", value="completed"),
    SelectColumns(columns=["order_id", "customer", "amount"]),
])

# Inject at runtime for this specific run
print("Running with injected debug transformer:")
result = pipe.run(orders, force_interleaved_transformer=LogRowCount())

2025-12-26 01:10:00,554 | [INFO]: Starting pipeline 
2025-12-26 01:10:00,557 | [INFO]: Running 'Filter' ... 
2025-12-26 01:10:00,562 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:10:00,563 | [INFO]: Running 'SelectColumns' ... 
2025-12-26 01:10:00,567 | [INFO]: Completed 'SelectColumns' in 0.0s 
2025-12-26 01:10:00,569 | [INFO]: Pipeline completed in 0.0s 


Running with injected debug transformer:
  [DEBUG] Row count: 3
  [DEBUG] Row count: 3


---
## Part 5: Failure Recovery

When a pipeline fails, Nebula automatically caches the DataFrame(s) that the failing step was about to process. This lets you inspect the data and debug without re-running expensive earlier steps.

### 5.1 Transformer Failure

In [22]:
ns.clear()

class BrokenTransformer(Transformer):
    """A transformer that always fails."""
    def _transform_nw(self, df):
        raise ValueError("Something went wrong!")


pipe = TransformerPipeline([
    Filter(input_col="amount", perform="keep", operator="gt", value=100),
    BrokenTransformer(),  # This will fail
    SelectColumns(columns=["order_id"]),
])

2025-12-26 01:10:02,266 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:10:02,269 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


In [23]:
try:
    pipe.run(orders)
except ValueError as e:
    print(f"Pipeline failed: {e}")

2025-12-26 01:10:03,099 | [INFO]: Starting pipeline 
2025-12-26 01:10:03,102 | [INFO]: Running 'Filter' ... 
2025-12-26 01:10:03,106 | [INFO]: Completed 'Filter' in 0.0s 
2025-12-26 01:10:03,106 | [INFO]: Running 'BrokenTransformer' ... 
2025-12-26 01:10:03,106 | [ERROR]: Error at node BrokenTransformer_2d10d9@2: Something went wrong! 
2025-12-26 01:10:03,106 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "FAIL_DF_transformer:BrokenTransformer". 


Pipeline failed: Get the dataframe(s) before the failure in the nebula storage with the key(s): ['transformer:BrokenTransformer']
Original Error:
Something went wrong!


In [24]:
# Check what was cached
print(f"Cached keys: {ns.list_keys()}")

# Retrieve the DataFrame that was about to be processed
failed_df = ns.get("FAIL_DF_transformer:BrokenTransformer")
print(f"\nDataFrame before failure:")
print(failed_df)

Cached keys: ['FAIL_DF_transformer:BrokenTransformer']

DataFrame before failure:
shape: (3, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ completed │
│ 3        ┆ alice    ┆ 200.0  ┆ pending   │
│ 5        ┆ bob      ┆ 300.0  ┆ pending   │
└──────────┴──────────┴────────┴───────────┘
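
The caching behavior can be pictured as a `try`/`except` around each step that saves the step's input before re-raising. The sketch below is a simplified model: `run_with_failure_cache` is a hypothetical function, a plain `dict` stands in for the storage, and lists of numbers stand in for DataFrames. Only the `FAIL_DF_transformer:` key format is taken from the output above.

```python
FAIL_PREFIX = "FAIL_DF_"


def run_with_failure_cache(df, steps, storage):
    """Run (name, func) steps; on error, cache the step's input and re-raise."""
    for name, func in steps:
        try:
            df = func(df)
        except Exception as exc:
            key = f"{FAIL_PREFIX}transformer:{name}"
            storage[key] = df  # df still holds the input of the failing step
            raise ValueError(f"input cached under {key!r}: {exc}") from exc
    return df


def broken(rows):
    raise ValueError("Something went wrong!")


storage = {}
try:
    run_with_failure_cache(
        [150.0, 75.0, 200.0],
        [
            ("KeepAbove100", lambda rows: [r for r in rows if r > 100]),
            ("BrokenTransformer", broken),
        ],
        storage,
    )
except ValueError as exc:
    print(exc)

print(storage)
```

The cached object is the output of the last successful step, so work can resume from that point instead of from the raw input.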


### 5.2 Split Pipeline Merge Failure

When split branches can't be merged, for example because of a schema mismatch, Nebula caches all branch DataFrames:

In [25]:
ns.clear()

def split_by_amount(df):
    """Split into low and high value orders."""
    return {
        "low": df.filter(pl.col("amount") < 100),
        "high": df.filter(pl.col("amount") >= 100),
    }


# Each branch drops a DIFFERENT column - merge will fail!
pipe = TransformerPipeline(
    {
        "low": [DropColumns(columns="customer")],   # Drops 'customer'
        "high": [DropColumns(columns="status")],    # Drops 'status'
    },
    split_function=split_by_amount,
    # allow_missing_columns=False (default) - will fail on merge
)

2025-12-26 01:10:04,966 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:10:04,969 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


In [26]:
try:
    pipe.run(orders)
except Exception as e:
    print(f"Pipeline failed: {type(e).__name__}")
    print(f"Message: {str(e)[:100]}...")

2025-12-26 01:10:05,549 | [INFO]: Starting pipeline 
2025-12-26 01:10:05,556 | [INFO]: Entering split 
2025-12-26 01:10:05,559 | [INFO]: Running 'DropColumns' ... 
2025-12-26 01:10:05,562 | [INFO]: Completed 'DropColumns' in 0.0s 
2025-12-26 01:10:05,562 | [INFO]: Running 'DropColumns' ... 
2025-12-26 01:10:05,566 | [INFO]: Completed 'DropColumns' in 0.0s 
2025-12-26 01:10:05,568 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "FAIL_DF_high-df-before-appending:append". 
2025-12-26 01:10:05,569 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "FAIL_DF_low-df-before-appending:append". 


Pipeline failed: ValueError
Message: Get the dataframe(s) before the failure in the nebula storage with the key(s): ['high-df-before-appe...


In [27]:
# Check cached DataFrames
print(f"Cached keys: {ns.list_keys()}")

# Retrieve each branch's result to debug
for key in ns.list_keys():
    if key.startswith("FAIL_DF_"):
        df = ns.get(key)
        print(f"\n{key}:")
        print(f"  Columns: {df.columns}")
        print(f"  Rows: {len(df)}")

Cached keys: ['FAIL_DF_high-df-before-appending:append', 'FAIL_DF_low-df-before-appending:append']

FAIL_DF_high-df-before-appending:append:
  Columns: ['order_id', 'customer', 'amount']
  Rows: 3

FAIL_DF_low-df-before-appending:append:
  Columns: ['order_id', 'amount', 'status']
  Rows: 2


### 5.3 Failure Cache Key Patterns

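The failure cache uses predictable key patterns. The keys listed in the exception message omit the `FAIL_DF_` prefix, so prepend it when calling `ns.get`: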

| Failure Type | Key Pattern |
|--------------|-------------|
| Transformer | `FAIL_DF_transformer:TransformerName` |
| Function | `FAIL_DF_function:function_name` |
| Split fork | `FAIL_DF_fork:split` |
| Before append | `FAIL_DF_{branch}-df-before-appending:append` |
| Before join | `FAIL_DF_join-left-df:join`, `FAIL_DF_join-right-df:join` |
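
Since every failure key shares the `FAIL_DF_` prefix, the cached objects can be collected with a simple filter. `failure_entries` below is a hypothetical helper operating on a plain `dict`. With Nebula you would build that mapping from `ns.list_keys()` and `ns.get()`, as in the loop in section 5.2.

```python
def failure_entries(storage, prefix="FAIL_DF_"):
    """Return {key without prefix: object} for every failure-cache entry."""
    return {
        key[len(prefix):]: value
        for key, value in storage.items()
        if key.startswith(prefix)
    }


storage = {
    "FAIL_DF_low-df-before-appending:append": ["low rows"],
    "FAIL_DF_high-df-before-appending:append": ["high rows"],
    "customer_tiers": ["unrelated"],
}

entries = failure_entries(storage)
print(sorted(entries))
```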

---
## Summary

| Feature | Use Case |
|---------|----------|
| **nebula_storage** | Share data between transformers, store intermediates |
| **Debug mode** | Conditional storage for dev vs prod |
| **Pipeline keywords** | Declarative storage in pipeline definition |
| **LazyWrapper** | Runtime parameter resolution from storage |
| **Interleaved** | Debug/logging transformers between steps |
| **Failure cache** | Automatic DataFrame recovery on errors |

These features are particularly useful during development and debugging in notebooks, where you can inspect stored DataFrames interactively.

In [28]:
# Cleanup
ns.clear()
ns.allow_debug(False)
ns.allow_overwriting()

2025-12-26 01:10:07,576 | [INFO]: Nebula Storage: clear. 
2025-12-26 01:10:07,578 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2025-12-26 01:10:07,580 | [INFO]: Nebula Storage: deactivate debug storage. 
2025-12-26 01:10:07,583 | [INFO]: Nebula Storage: allow overwriting. 
