# Nebula Storage

The Nebula storage is an object that can hold any kind of data (lists, integers, dataframes, etc.) in memory within the Python process, without writing anything to disk.

Its utility spans several purposes:
- passing data and dataframes across transformers: a transformer inherently takes a single dataframe as input, so to perform a join, for example, you need a way to pass in the second one
- storing data and intermediate dataframes to aid debugging
- helping the developer debug a broken pipeline within a notebook (see the next notebook, number `05`)
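
The join use case can be sketched without Nebula itself. In the snippet below everything is hypothetical and for illustration only: `ToyStorage` stands in for the Nebula storage (only `set` and `get` are mimicked), the "dataframes" are plain lists of dictionaries, and `StoreLookup` / `JoinWithLookup` are made-up transformer names.

```python
class ToyStorage:
    """Hypothetical stand-in for the Nebula storage: a dict kept in memory."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


toy_ns = ToyStorage()


class StoreLookup:
    """Put the second table in the storage and pass the first one through."""

    def __init__(self, lookup):
        self._lookup = lookup

    def transform(self, rows):
        toy_ns.set("lookup", self._lookup)
        return rows


class JoinWithLookup:
    """Read the second table back and join it to the incoming rows."""

    def transform(self, rows):
        lookup = toy_ns.get("lookup")
        # Left join on "c2": rows without a match get a None label
        return [{**row, "label": lookup.get(row["c2"])} for row in rows]


rows = [{"c1": 0.1234, "c2": "a"}, {"c1": 4.1234, "c2": ""}]
for step in (StoreLookup({"a": "alpha"}), JoinWithLookup()):
    rows = step.transform(rows)

print(rows)
```

Each `transform` still receives and returns a single table; the second table travels through the shared in-memory object.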

In [None]:
import polars as pl

from nebula.transformers import *
from nebula.base import Transformer
from nebula.pipelines.pipeline import TransformerPipeline
from nebula.pipelines.pipeline_loader import load_pipeline
from nebula.storage import nebula_storage as ns

In [None]:
data = [
    [0.1234, "a", "b"],
    [4.1234, "", ""],
    [5.1234, None, None],
    [6.1234, "", None],
    [8.1234, "a", None],
    [9.1234, "a", ""],
    [10.1234, "", "b"],
    [11.1234, "a", None],
    [12.1234, None, "b"],
    [14.1234, "", None],
]

df_input = pl.DataFrame(data, orient="row", schema=["c1", "c2", "c3"])
print(df_input.schema)
df_input

## Create a pipeline with two custom transformers using Python

- `SetToNebulaStorage`: sets some values in nebula storage
- `ReadFromNebulaStorage`: reads from nebula storage

Values can be stored either as standard values or as debug values by adding the parameter `debug=True`.

In the latter case, the debug values are stored only if the debug mode is active. It can be activated / deactivated by calling `nebula_storage.allow_debug(True / False)`.

This lets users store extensive data for debugging and then stop storing it when moving the code to production, simply by turning off the debug mode through `nebula_storage.allow_debug(False)`, without modifying the rest of the code.

It's important to note that the storage for standard values and debug values is shared, so careful attention is needed to avoid unintentional overrides.

Additionally, overwriting can be allowed or disallowed. When overwriting is disallowed, storing a value under a key that already exists (without clearing it first) raises a `KeyError`.
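
The rules above (debug gating, shared storage, overwriting mode) can be illustrated with a small stand-in class. This is a sketch of the described behaviour, not Nebula's implementation: `ToyStorage` is hypothetical, and its initial state (debug mode off, overwriting allowed) is an assumption made for the example.

```python
class ToyStorage:
    """Hypothetical stand-in mimicking the behaviour described above."""

    def __init__(self):
        self._data = {}
        self._debug = False     # assumption: debug mode starts off
        self._overwrite = True  # assumption: overwriting starts allowed

    def allow_debug(self, active):
        self._debug = bool(active)

    def disallow_overwriting(self):
        self._overwrite = False

    def set(self, key, value, debug=False):
        if debug and not self._debug:
            return  # debug value while debug mode is off: skip it
        if key in self._data and not self._overwrite:
            raise KeyError(f"'{key}' already exists, overwriting is disallowed")
        self._data[key] = value  # standard and debug values share one dict

    def get(self, key):
        return self._data[key]

    def list_keys(self):
        return sorted(self._data)


toy = ToyStorage()
toy.set("k", 1)
toy.set("dbg", "skipped", debug=True)    # ignored: debug mode is off
toy.allow_debug(True)
toy.set("k", "debug value", debug=True)  # overrides the standard value of "k"

toy.disallow_overwriting()
try:
    toy.set("k", 2)
    overwrite_error = None
except KeyError as exc:
    overwrite_error = exc

print(toy.list_keys(), toy.get("k"), type(overwrite_error).__name__)
```

The third `set` call shows why shared storage needs care: a debug value written under an existing key replaces the standard value.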

In [None]:
from nebula.storage import nebula_storage as ns


class SetToNebulaStorage:
    @staticmethod
    def transform(df):
        ns.set("this_key", 10)
        
        # Deactivate the debug storage
        ns.allow_debug(False)
        ns.set("debug_value_1", "value_1", debug=True)  # This value will not be stored
        
        ns.allow_debug(True)
        ns.set("debug_value_2", "value_2", debug=True)
        return df


class ReadFromNebulaStorage:
    @staticmethod
    def transform(df):
        value = ns.get("this_key")
        print(f"------- read: {value} -------")
        return df


pipe = TransformerPipeline([
    SetToNebulaStorage(),
    ReadFromNebulaStorage(),
])

pipe.show_pipeline(add_transformer_params=True)

In [None]:
pipe.plot()

In [None]:
df_out = pipe.run(df_input)

### Nebula storage methods and properties

- `is_overwriting_allowed` (_property_): return whether overwriting is allowed
- `is_debug_mode` (_property_): return whether the debug mode is active
- `allow_overwriting()` (_method_): allow the overwriting mode
- `disallow_overwriting()` (_method_): disallow the overwriting mode
- `allow_debug(bool)` (_method_): allow / disallow debug mode
- `list_keys()` (_method_): return the current keys as a sorted list
- `count_objects()` (_method_): return the number of stored objects
- `clear(str | list(str) | None)` (_method_): clear the whole storage or remove only the specified key(s)
- `get(str)` (_method_): return the object stored with the provided key
- `isin(str)` (_method_): check if the provided key exists
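
The three argument forms accepted by `clear` can be sketched on a plain dictionary. The helper `toy_clear` below is hypothetical and only mirrors the signature listed above; in particular, silently ignoring missing keys is an assumption, not documented Nebula behaviour.

```python
def toy_clear(store, keys=None):
    """Hypothetical helper mirroring clear(str | list(str) | None)."""
    if keys is None:
        store.clear()  # no argument: drop everything
        return
    if isinstance(keys, str):
        keys = [keys]  # a single key is treated as a one-element list
    for key in keys:
        store.pop(key, None)  # assumption: missing keys are ignored


store = {"a": 1, "b": 2, "c": 3, "d": 4}

toy_clear(store, "a")
after_single = sorted(store)

toy_clear(store, ["b", "c"])
after_list = sorted(store)

toy_clear(store)
after_all = sorted(store)

print(after_single, after_list, after_all)
```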

In [None]:
print(f"overwriting mode: {ns.is_overwriting_allowed}")
print(f"debug mode: {ns.is_debug_mode}")
print(f"current keys: {ns.list_keys()}")
print(f"number of stored objects: {ns.count_objects()}")

#### Note that `debug_value_1` is not stored because the debug mode was not active when the attempt to store it was made

### Storage requests can also be inserted between transformers

This allows the pipeline dataframe to be stored between steps, enabling later reuse for debugging or in other transformers.

A storage request is a single-key dictionary, such as:
- `{"store": "key_x"}`: store the intermediate dataframe with the key `key_x`
- `{"store_debug": "key_y"}`: store the intermediate dataframe in debug mode with the key `key_y`
- `{"storage_debug_mode": True}`: activate the debug mode
- `{"storage_debug_mode": False}`: deactivate the debug mode
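
To make the semantics of these requests concrete, here is a toy interpreter for a list of steps. It is a sketch, not how `TransformerPipeline` is implemented: `run_steps` is a hypothetical function, the storage is a plain dictionary, transformers are plain callables, and the debug mode is assumed to start off.

```python
def run_steps(df, steps, storage):
    """Hypothetical runner: apply callables, interpret single-key dict requests."""
    debug_active = False  # assumption: debug mode starts off
    for step in steps:
        if isinstance(step, dict):
            [(request, arg)] = step.items()  # a request has exactly one key
            if request == "storage_debug_mode":
                debug_active = bool(arg)
            elif request == "store":
                storage[arg] = df
            elif request == "store_debug":
                if debug_active:  # skipped while debug mode is off
                    storage[arg] = df
            else:
                raise ValueError(f"Unknown storage request: {request!r}")
        else:
            df = step(df)  # a regular transformation step
    return df


storage = {}
steps = [
    lambda rows: rows[:5],
    {"store": "only-5-rows"},
    lambda rows: rows[:3],
    {"store_debug": "skipped"},       # debug mode is still off here
    {"storage_debug_mode": True},
    {"store_debug": "only-3-rows"},
    {"storage_debug_mode": False},
]
out = run_steps(list(range(10)), steps, storage)

print(sorted(storage))
```

Storage requests never modify the data flowing through the pipeline; they only snapshot it or toggle the debug mode.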

In [None]:
ns.clear()

pipe = TransformerPipeline([
    Limit(n=5),
    {"storage_debug_mode": False},
    {"store": "only-5-rows"},  # Store the dataframe
    Limit(n=3),
    {"store_debug": "this-key-will-be-skipped"},  # Store the dataframe in debug mode, but the debug mode is not active yet
    {"storage_debug_mode": True},  # Turn on debug mode
    {"store_debug": "only-3-rows"},  # Store the dataframe in debug mode
    {"storage_debug_mode": False},  # Turn off debug mode
])

pipe.show_pipeline(add_transformer_params=True)

In [None]:
pipe.plot_dag()