# Caching the dataframe(s) in case of a pipeline failure

This notebook shows two features that help the user debug a broken pipeline.

Please note that this debugging process is intended for a notebook: it relies on the nebula storage, which lives in the Python kernel and cannot be used in a recipe or in Airflow.

There are two main situations in which a pipeline may break:
- A transformer fails.
- In a split pipeline, the split dataframes cannot be merged because their schemas or columns differ.

In the first case, nebula stores the input dataframe of the failed transformer. In the second, it retains all the dataframes that should have been merged, so the user can retrieve them and address the issue.
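The first case can be pictured with a minimal pure-Python sketch. It is not nebula's implementation: the `storage` dict, the `run_pipeline` helper and the toy `Upper` and `Broken` steps are hypothetical stand-ins, and lists of strings replace dataframes. Only the `FAIL_DF_` key prefix is taken from the logs shown in this notebook.

```python
# Hypothetical sketch: a plain dict stands in for the nebula storage.
storage = {}


class Upper:
    @staticmethod
    def transform(data):
        return [s.upper() for s in data]


class Broken:
    @staticmethod
    def transform(data):
        raise ValueError("something went wrong")


def run_pipeline(steps, data):
    """Run each step in order; on failure keep the failing step's input."""
    for step in steps:
        try:
            data = step.transform(data)
        except Exception as exc:
            name = type(step).__name__
            key = f"FAIL_DF_{name}"
            storage[key] = data  # the input of the failed step, not its output
            raise RuntimeError(
                f'Input of "{name}" stored under the key "{key}".'
            ) from exc
    return data


try:
    run_pipeline([Upper(), Broken()], ["a", "b"])
except RuntimeError as err:
    print(err)

# The stored object is the output of Upper, i.e. the input of Broken.
print(storage["FAIL_DF_Broken"])
```

The point of the design is that the stored object is what the failed step received, so the step can be rerun on it in isolation.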

In [1]:
import sys

from pyspark.sql import functions as F
from pyspark.sql.types import *

import yaml

from nlsn.nebula.spark_transformers import *
from nlsn.nebula.base import Transformer
from nlsn.nebula.pipelines.pipelines import TransformerPipeline
from nlsn.nebula.pipelines.pipeline_loader import load_pipeline
from nlsn.nebula.storage import nebula_storage as ns

py_version = ".".join(map(str, (sys.version_info[0:2])))
print("python version:", py_version)

python version: 3.9


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/16 14:05:08 WARN [Thread-3] Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
schema = [
    StructField("c1", FloatType(), True),
    StructField("c2", StringType(), True),
    StructField("c3", StringType(), True),
]

data = [
    [0.1234, "a", "b"],
    [0.1234, "a", "b"],
    [0.1234, "a", "b"],
    [1.1234, "a", "  b"],
    [2.1234, "  a  ", "  b  "],
    [3.1234, "", ""],
    [4.1234, "   ", "   "],
    [5.1234, None, None],
    [6.1234, " ", None],
    [7.1234, "", None],
    [8.1234, "a", None],
    [9.1234, "a", ""],
    [10.1234, "   ", "b"],
    [11.1234, "a", None],
    [12.1234, None, "b"],
    [13.1234, None, "b"],
    [14.1234, None, None],
]

df_input = spark.createDataFrame(data, schema=StructType(schema)).cache()
df_input.show()


+-------+-----+-----+
|     c1|   c2|   c3|
+-------+-----+-----+
| 0.1234|    a|    b|
| 0.1234|    a|    b|
| 0.1234|    a|    b|
| 1.1234|    a|    b|
| 2.1234|  a  |  b  |
| 3.1234|     |     |
| 4.1234|     |     |
| 5.1234| null| null|
| 6.1234|     | null|
| 7.1234|     | null|
| 8.1234|    a| null|
| 9.1234|    a|     |
|10.1234|     |    b|
|11.1234|    a| null|
|12.1234| null|    b|
|13.1234| null|    b|
|14.1234| null| null|
+-------+-----+-----+



                                                                                

## Transformer failure

In [4]:
class ThisTransformerIsBroken:
    @staticmethod
    def transform(df):
        """Public transform method w/o parent class."""
        # The column "wrong" does not exist, so this select raises an AnalysisException.
        return df.select("wrong")


# clear the cache
ns.clear()

pipe = TransformerPipeline([
    NanToNull(input_columns="*"),
    ThisTransformerIsBroken(),
    Distinct(),
])

pipe.show_pipeline(add_transformer_params=True)

2024-05-16 14:05:16,654 | storage.py:108 [INFO]: Nebula Storage: clear. 
2024-05-16 14:05:16,655 | storage.py:118 [INFO]: Nebula Storage: 0 keys remained after clearing. 


*** TransformerPipeline *** (3 transformers)
 - NanToNull -> PARAMS: input_columns="*"
 - ThisTransformerIsBroken
 - Distinct


### Retrieve the input dataframe of the failed transformer when the pipeline breaks

The error message contains the key(s) under which that dataframe is stored.

The original exception is reported in the same traceback.
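When the key is needed programmatically rather than by copy and paste, it can be pulled out of the message text. The sketch below assumes the key always appears in double quotes and starts with `FAIL_DF_`, as in the messages printed in this notebook. The helper name `extract_fail_keys` is hypothetical and not part of nebula.

```python
import re


def extract_fail_keys(message):
    """Return every double-quoted FAIL_DF_* key found in an error message."""
    return re.findall(r'"(FAIL_DF_[^"]+)"', message)


# Message text copied from the output of this notebook.
message = (
    'Find the input dataframe of the failed transformer "ThisTransformerIsBroken" '
    'in the nebula storage by using the key "FAIL_DF_ThisTransformerIsBroken".'
)

keys = extract_fail_keys(message)
print(keys)
```

In a notebook, the message could be obtained by wrapping `pipe.run(df_input)` in a `try`/`except` block and converting the caught exception to a string. Each extracted key can then be passed to `ns.get`.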

In [5]:
pipe.run(df_input)

2024-05-16 14:05:16,765 | pipelines.py:516 [INFO]: Running *** TransformerPipeline *** (3 transformers) 
2024-05-16 14:05:16,766 | pipelines.py:283 [INFO]: Running NanToNull -> PARAMS: input_columns="*" ... 
2024-05-16 14:05:16,950 | pipelines.py:297 [INFO]: Execution time for NanToNull: 0.2s 
2024-05-16 14:05:16,950 | pipelines.py:283 [INFO]: Running ThisTransformerIsBroken ... 
2024-05-16 14:05:17,079 | storage.py:124 [INFO]: Nebula Storage: setting an object (<class 'pyspark.sql.dataframe.DataFrame'>) with the key "FAIL_DF_ThisTransformerIsBroken". 


AnalysisException: Find the input dataframe of the failed transformer "ThisTransformerIsBroken" in the nebula storage by using the key "FAIL_DF_ThisTransformerIsBroken".
cannot resolve 'wrong' given input columns: [c1, c2, c3];
'Project ['wrong]
+- Project [CASE WHEN isnan(c1#0) THEN cast(null as float) ELSE c1#0 END AS c1#94, c2#1, c3#2]
   +- LogicalRDD [c1#0, c2#1, c3#2], false


In [6]:
ns.get("FAIL_DF_ThisTransformerIsBroken").show()

+-------+-----+-----+
|     c1|   c2|   c3|
+-------+-----+-----+
| 0.1234|    a|    b|
| 0.1234|    a|    b|
| 0.1234|    a|    b|
| 1.1234|    a|    b|
| 2.1234|  a  |  b  |
| 3.1234|     |     |
| 4.1234|     |     |
| 5.1234| null| null|
| 6.1234|     | null|
| 7.1234|     | null|
| 8.1234|    a| null|
| 9.1234|    a|     |
|10.1234|     |    b|
|11.1234|    a| null|
|12.1234| null|    b|
|13.1234| null|    b|
|14.1234| null| null|
+-------+-----+-----+



## Unable to merge splits

In this example the transformers work properly, but they modify the dataframes in such a way that the splits can no longer be merged.

To help with this, all the dataframes are stored before the union, allowing the user to investigate the problem.

Here one split drops the column `c2` and the other drops the column `c3`, so they cannot be merged.
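A quick way to see why the splits cannot be merged is to compare their column lists. The helper below is a hypothetical sketch, not a nebula function. It works on plain lists of column names, which in a notebook could come from the `columns` attribute of each retrieved dataframe.

```python
def schema_mismatches(columns_by_split):
    """Map each split name to the columns it lacks, relative to all splits."""
    all_columns = set().union(*columns_by_split.values())
    mismatches = {}
    for name, cols in columns_by_split.items():
        missing = sorted(all_columns - set(cols))
        if missing:
            mismatches[name] = missing
    return mismatches


# Column lists matching this example: "low" drops c2, "hi" drops c3.
columns_by_split = {"low": ["c1", "c3"], "hi": ["c1", "c2"]}

missing = schema_mismatches(columns_by_split)
print(missing)
```

An empty result means every split exposes the same column names. It says nothing about column types, which can also prevent a union.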

In [7]:
def my_split_function(df):
    cond = F.col("c1") < 10
    return {
        "low": df.filter(cond),
        "hi": df.filter(~cond),
    }


dict_transf = {
    "low": [DropColumns(columns="c2")],
    "hi": [DropColumns(columns="c3")],
}

# clear the cache
ns.clear()

pipe = TransformerPipeline(dict_transf, split_function=my_split_function)

pipe.show_pipeline(add_transformer_params=True)

2024-05-16 14:05:25,993 | storage.py:108 [INFO]: Nebula Storage: clear. 
2024-05-16 14:05:25,994 | storage.py:118 [INFO]: Nebula Storage: 0 keys remained after clearing. 


*** TransformerPipeline *** (2 transformers)
SPLIT <<< hi >>>:
     - DropColumns -> PARAMS: columns="c3"
SPLIT <<< low >>>:
     - DropColumns -> PARAMS: columns="c2"
MERGE SPLITS:
   - <<< hi >>>
   - <<< low >>>


In [8]:
_ = pipe.run(df_input)

2024-05-16 14:05:27,069 | pipelines.py:523 [INFO]: Running *** TransformerPipeline *** (2 transformers) 
2024-05-16 14:05:27,164 | pipelines.py:556 [INFO]: Running SPLIT <<< hi >>> 
2024-05-16 14:05:27,165 | pipelines.py:283 [INFO]: Running DropColumns -> PARAMS: columns="c3" ... 
2024-05-16 14:05:27,249 | pipelines.py:297 [INFO]: Execution time for DropColumns: 0.1s 
2024-05-16 14:05:27,249 | pipelines.py:556 [INFO]: Running SPLIT <<< low >>> 
2024-05-16 14:05:27,250 | pipelines.py:283 [INFO]: Running DropColumns -> PARAMS: columns="c2" ... 
2024-05-16 14:05:27,255 | pipelines.py:297 [INFO]: Execution time for DropColumns: 0.0s 
2024-05-16 14:05:27,256 | pipelines.py:570 [INFO]: Merging splits: hi, low ... 
2024-05-16 14:05:27,347 | storage.py:124 [INFO]: Nebula Storage: setting an object (<class 'pyspark.sql.dataframe.DataFrame'>) with the key "FAIL_DF_hi". 
2024-05-16 14:05:27,348 | storage.py:124 [INFO]: Nebula Storage: setting an object (<class 'pyspark.sql.dataframe.DataFrame'>) 

AnalysisException: Unable to perform 'unionByName' and append the dataframes. Find the output dataframes of each single split "hi", "low" in the nebula storage by using the keys "FAIL_DF_hi", "FAIL_DF_low" respectively.
Cannot resolve column name "c2" among (c1, c3)

In [9]:
ns.get("FAIL_DF_low").show()

+------+-----+
|    c1|   c3|
+------+-----+
|0.1234|    b|
|0.1234|    b|
|0.1234|    b|
|1.1234|    b|
|2.1234|  b  |
|3.1234|     |
|4.1234|     |
|5.1234| null|
|6.1234| null|
|7.1234| null|
|8.1234| null|
|9.1234|     |
+------+-----+



In [10]:
ns.get("FAIL_DF_hi").show()

+-------+----+
|     c1|  c2|
+-------+----+
|10.1234|    |
|11.1234|   a|
|12.1234|null|
|13.1234|null|
|14.1234|null|
+-------+----+



### Overwriting keys associated with the failed dataframes

The keys used for storing dataframes are generated so that existing keys are never overwritten: a numerical suffix is added when a key is already taken, so the user does not need to worry about this.

When the same broken pipeline is rerun without clearing the cache, the new keys do not overwrite the previous ones.

In [11]:
_ = pipe.run(df_input)

2024-05-16 14:05:30,470 | pipelines.py:523 [INFO]: Running *** TransformerPipeline *** (2 transformers) 
2024-05-16 14:05:30,541 | pipelines.py:556 [INFO]: Running SPLIT <<< hi >>> 
2024-05-16 14:05:30,542 | pipelines.py:283 [INFO]: Running DropColumns -> PARAMS: columns="c3" ... 
2024-05-16 14:05:30,547 | pipelines.py:297 [INFO]: Execution time for DropColumns: 0.0s 
2024-05-16 14:05:30,547 | pipelines.py:556 [INFO]: Running SPLIT <<< low >>> 
2024-05-16 14:05:30,548 | pipelines.py:283 [INFO]: Running DropColumns -> PARAMS: columns="c2" ... 
2024-05-16 14:05:30,552 | pipelines.py:297 [INFO]: Execution time for DropColumns: 0.0s 
2024-05-16 14:05:30,552 | pipelines.py:570 [INFO]: Merging splits: hi, low ... 
2024-05-16 14:05:30,559 | storage.py:124 [INFO]: Nebula Storage: setting an object (<class 'pyspark.sql.dataframe.DataFrame'>) with the key "FAIL_DF_hi_0". 
2024-05-16 14:05:30,559 | storage.py:124 [INFO]: Nebula Storage: setting an object (<class 'pyspark.sql.dataframe.DataFrame'>

AnalysisException: Unable to perform 'unionByName' and append the dataframes. Find the output dataframes of each single split "hi", "low" in the nebula storage by using the keys "FAIL_DF_hi_0", "FAIL_DF_low_0" respectively.
Cannot resolve column name "c2" among (c1, c3)

The keys associated with the failed dataframes are now:
- `FAIL_DF_low_0`
- `FAIL_DF_hi_0`
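The suffixing behaviour can be illustrated with a short sketch. This is an assumption about the general idea, not nebula's actual code: the helper `unique_key` is hypothetical, and only the first two keys it produces (`FAIL_DF_hi`, then `FAIL_DF_hi_0`) are confirmed by the logs in this notebook. The continuation with `_1` is a guess.

```python
def unique_key(base, existing):
    """Return base if it is free, otherwise base plus the first free numeric suffix."""
    if base not in existing:
        return base
    i = 0
    while f"{base}_{i}" in existing:
        i += 1
    return f"{base}_{i}"


# A plain dict stands in for the nebula storage (assumption).
storage = {}
for _ in range(3):
    key = unique_key("FAIL_DF_hi", storage)
    storage[key] = object()  # placeholder for the stored dataframe

print(list(storage))
```

Because no key is ever reused, the dataframes from every failed run remain available until the storage is cleared with `ns.clear()`.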