# TemporalScope Tutorial: Backend-Agnostic Functions Using Narwhals

### Purpose
This tutorial demonstrates how **TemporalScope** can leverage **Narwhals** to support backend-agnostic data operations. Functions written once against the Narwhals API run on multiple popular data processing libraries, including **Pandas**, **Modin**, **Polars**, **PyArrow**, and **Dask**.

### Key Steps

1. **Create a Sample Dataset**: We begin by creating a synthetic dataset in Pandas, which we will transform and test for compatibility across Modin, Polars, and PyArrow.
2. **Implement a Narwhals-Decorated Function**: With Narwhals’ `@narwhalify` decorator, we can create a backend-agnostic function that performs simple operations like aggregation or column transformations without needing to rewrite the logic for each backend.
3. **Run Compatibility Tests**: Finally, we test the function across all supported backends to verify smooth execution across Pandas, Modin, Polars, and PyArrow.

### Supported TemporalScope Backends
The TemporalScope core API is designed to be compatible with a wide range of popular DataFrame backends. Here are the currently supported backends:

- **Pandas**: General-purpose data processing in Python, compatible with Narwhals.
- **Modin**: Parallelized Pandas-like library for distributed data processing.
- **Polars**: Rust-based, highly efficient DataFrame library for analytics.
- **PyArrow**: Apache Arrow-based columnar tables (`pyarrow.Table`) for large, in-memory data processing.
- **Dask**: Distributed DataFrame library for parallel computation on large datasets.

These backends allow TemporalScope users to scale or optimize their workflows without modifying code.

### Advantages of Using Narwhals

- **Uniform API**: The `@narwhalify` decorator from Narwhals converts supported native DataFrames passed to a function into Narwhals objects and converts the result back to the original backend, so the same function body runs across multiple DataFrame libraries.
- **Enhanced Compatibility**: Narwhals translates a common, Polars-like expression API into each backend's native operations, so functions written against it rely only on syntax and operations that each compatible backend supports.
- **Simplified Codebase**: By using Narwhals for backend-agnostic functions, TemporalScope’s core logic can remain generalized, reducing code duplication and maintenance.


### Example 0: Backend Compatibility with TemporalScope and Narwhals

This example demonstrates a basic setup to ensure compatibility across various DataFrame backends using TemporalScope’s backend utilities and Narwhals.

1. **Check Supported Backends**:
   - Retrieve and validate the supported backends with `get_temporalscope_backends()` and `is_valid_temporal_backend()` for `pandas`, `modin`, `polars`, and `pyarrow`.
2. **Run Narwhals-Compatible Operation**:
   - Define a backend-agnostic function with `@narwhalify` to aggregate basic column statistics, showcasing Narwhals’ compatibility layer across supported backends.
3. **Compare Results Across Backends**:
   - Execute the Narwhals-compatible function on DataFrames from different backends and compare the results to ensure consistency across implementations.


In [1]:
import pandas as pd
import numpy as np
import narwhals as nw
from temporalscope.core.core_utils import (
    get_temporalscope_backends,
    get_narwhals_backends,
)
from narwhals.typing import FrameT

# Constants
NUM_ROWS = 200
SEED = 42

# Step 1: Generate sample time-series data in Pandas
np.random.seed(SEED)
date_range = pd.date_range(start="2023-01-01", periods=NUM_ROWS, freq="D")
data = pd.DataFrame({
    "datetime": date_range,
    "feature_1": np.random.rand(NUM_ROWS),
    "feature_2": np.random.randn(NUM_ROWS)
})

# Step 2: Define a Narwhals-compatible function using @narwhalify
@nw.narwhalify
def test_narwhal_conversion(df: FrameT) -> FrameT:
    """Perform Narwhals operations on a compatible DataFrame."""
    return df.select(
        feature_1_sum=nw.col("feature_1").sum(),
        feature_2_mean=nw.col("feature_2").mean()
    )

if __name__ == "__main__":
    # Run the Narwhals operation
    print("Original DataFrame Type:", type(data))
    result_df = test_narwhal_conversion(data)
    print("Result of Narwhals operation:\n", result_df)
    
    # Step 3: Check backend support using TemporalScope core_utils
    print("\n== TemporalScope and Narwhals Backend Checks ==")
    temporalscope_backends = get_temporalscope_backends()
    print("Supported TemporalScope Backends:", temporalscope_backends)

    narwhals_backends = get_narwhals_backends()
    print("Supported Narwhals Backends:", narwhals_backends)

    # Validate if 'pandas' backend is supported
    try:
        is_valid_temporal_backend('pandas')
        print("Pandas backend validation: PASSED")
    except Exception as e:
        print("Pandas backend validation: FAILED", e)


Original DataFrame Type: <class 'pandas.core.frame.DataFrame'>
Result of Narwhals operation:
    feature_1_sum  feature_2_mean
0      96.801247        0.067414

== TemporalScope and Narwhals Backend Checks ==
Supported TemporalScope Backends: ['pandas', 'modin', 'pyarrow', 'polars', 'dask']
Supported Narwhals Backends: ['pandas', 'modin', 'cudf', 'pyarrow', 'polars', 'dask', 'unknown']
Pandas backend validation: FAILED name 'is_valid_temporal_backend' is not defined


In [2]:
import pandas as pd
import polars as pl
import numpy as np
import narwhals as nw
from narwhals.typing import FrameT

# Constants
NUM_ROWS = 100
SEED = 42

# Step 1: Generate sample time-series data in Pandas
np.random.seed(SEED)
date_range = pd.date_range(start="2023-01-01", periods=NUM_ROWS, freq="D")
data_pandas = pd.DataFrame({
    "datetime": date_range,
    "feature_1": np.random.rand(NUM_ROWS),
    "feature_2": np.random.randn(NUM_ROWS)
})

# Convert Pandas DataFrame to Polars
data_polars = pl.DataFrame(data_pandas)

# Utility Functions
def is_timestamp_column(df: FrameT, col: str) -> bool:
    """Check if a column is timestamp-like."""
    if col not in df.columns:
        raise ValueError(f"Column '{col}' does not exist in the DataFrame.")
    
    try:
        if isinstance(df, pd.DataFrame):  # Pandas check
            return pd.api.types.is_datetime64_any_dtype(df[col])
        elif isinstance(df, pl.DataFrame):  # Polars check
            return str(df.schema[col]).startswith("Datetime")
        else:
            raise ValueError(f"Unsupported DataFrame type: {type(df)}")
    except Exception as e:
        raise ValueError(f"Error while checking column type: {e}")

def get_backend_info(df: FrameT) -> str:
    """Determine the original backend of the DataFrame."""
    if isinstance(df, pd.DataFrame):
        return "pandas"
    elif isinstance(df, pl.DataFrame):
        return "polars"
    else:
        return "unknown"

# Narwhals Function
@nw.narwhalify
def test_narwhal_conversion(df: FrameT) -> FrameT:
    """Perform Narwhals operations on a compatible DataFrame."""
    return df.select(
        feature_1_sum=nw.col("feature_1").sum(),
        feature_2_mean=nw.col("feature_2").mean()
    )

# Main Test Logic
def run_tests(data: FrameT, backend_name: str):
    print(f"\n=== Testing {backend_name} Backend ===")
    
    # Original Backend
    try:
        original_backend = get_backend_info(data)
        print("Original DataFrame Backend:", original_backend)
    except Exception as e:
        print("Error determining backend:", e)

    # Timestamp Check
    try:
        is_timestamp = is_timestamp_column(data, "datetime")
        print("Is 'datetime' timestamp-like?", is_timestamp)
    except Exception as e:
        print("Timestamp check failed:", e)

    # Narwhals Operation
    try:
        result_df = test_narwhal_conversion(data)
        print("\nResult of Narwhals operation:\n", result_df)
    except Exception as e:
        print("Narwhals operation failed:", e)

if __name__ == "__main__":
    # Test with Pandas
    run_tests(data_pandas, "Pandas")

    # Test with Polars
    run_tests(data_polars, "Polars")



=== Testing Pandas Backend ===
Original DataFrame Backend: pandas
Is 'datetime' timestamp-like? True

Result of Narwhals operation:
    feature_1_sum  feature_2_mean
0      47.018074        -0.00108

=== Testing Polars Backend ===
Original DataFrame Backend: polars
Is 'datetime' timestamp-like? True

Result of Narwhals operation:
 shape: (1, 2)
┌───────────────┬────────────────┐
│ feature_1_sum ┆ feature_2_mean │
│ ---           ┆ ---            │
│ f64           ┆ f64            │
╞═══════════════╪════════════════╡
│ 47.018074     ┆ -0.00108       │
└───────────────┴────────────────┘


### Example 1: Using Narwhals for Backend-Agnostic Null Check

Narwhals enables robust, backend-agnostic data processing, which can streamline workflows across Pandas, Modin, Polars, and PyArrow backends. Here’s how to create a simple backend-agnostic null-check function and test it across multiple frameworks.

1. **Create a Synthetic DataFrame**:
   - Use a function like `generate_data()` to create a manageable dataset, with defaults that allow flexibility for larger data sizes.

2. **Define a Narwhals-Compatible Function**:
   - Use the `@narwhalify` decorator to convert standard DataFrame operations to backend-agnostic ones.
   - For example, `check_nulls_nw()` checks for null values without being tied to a specific backend.

3. **Test Across Multiple Backends**:
   - Convert the DataFrame to various backends and execute the Narwhals function on each one to verify compatibility.


In [3]:
import pandas as pd
import modin.pandas as mpd
import polars as pl
import pyarrow as pa
import dask.dataframe as dd
import narwhals as nw
import numpy as np
from narwhals.typing import FrameT
from temporalscope.core.core_utils import TEMPORALSCOPE_CORE_BACKEND_TYPES, SupportedTemporalDataFrame

NUM_ROWS = 100
SEED = 42

def generate_data(num_rows: int = NUM_ROWS, backend: str = "pandas") -> SupportedTemporalDataFrame:
    """Generates a time-series DataFrame with specified backend, adding features and a target."""
    np.random.seed(SEED)
    data = pd.DataFrame({
        "datetime": pd.date_range(start="2023-01-01", periods=num_rows, freq="D")
    })
    
    # Generate 5 features
    for i in range(5):
        data[f"feature_{i+1}"] = np.random.randn(num_rows)
    data["target"] = np.random.rand(num_rows)
    data.loc[0, 'feature_1'] = None  # Inject a null value for testing

    # Convert to specified backend
    if backend == "modin":
        return mpd.DataFrame(data)
    elif backend == "polars":
        return pl.DataFrame(data)
    elif backend == "pyarrow":
        return pa.Table.from_pandas(data)
    elif backend == "dask":
        return dd.from_pandas(data, npartitions=2)
    return data

@nw.narwhalify
def check_nulls_nw(df: FrameT) -> FrameT:
    """Checks for null values in 'feature_1' in a backend-agnostic way using Narwhals."""
    return df.select(
        has_nulls=nw.col("feature_1").is_null().any()
    )

def test_backends():
    """Tests `check_nulls_nw` on all supported TemporalScope backends."""
    results = []
    for backend_name in TEMPORALSCOPE_CORE_BACKEND_TYPES.keys():
        data_df = generate_data(backend=backend_name)
        try:
            has_nulls = check_nulls_nw(data_df)
            results.append({
                "Backend": backend_name,
                "Has Nulls": has_nulls,
                "Executed Successfully": True
            })
            print(f"{backend_name} -> Executed Successfully, Has Nulls: {has_nulls}\n")
        except Exception as e:
            results.append({
                "Backend": backend_name,
                "Executed Successfully": False,
                "Error": str(e)
            })
            print(f"{backend_name} -> Failed with error: {e}\n")

    return results

if __name__ == "__main__":
    backend_results = test_backends()
    print("\n--- Summary of Backend Compatibility ---")
    for result in backend_results:
        print(result)


pandas -> Executed Successfully, Has Nulls:    has_nulls
0       True



2024-11-11 11:34:11,220	INFO worker.py:1816 -- Started a local Ray instance.


modin -> Executed Successfully, Has Nulls:    has_nulls
0       True

pyarrow -> Executed Successfully, Has Nulls: pyarrow.Table
has_nulls: bool
----
has_nulls: [[true]]

polars -> Executed Successfully, Has Nulls: shape: (1, 1)
┌───────────┐
│ has_nulls │
│ ---       │
│ bool      │
╞═══════════╡
│ true      │
└───────────┘

dask -> Executed Successfully, Has Nulls: Dask DataFrame Structure:
              has_nulls
npartitions=1          
                   bool
                    ...
Dask Name: to_frame, 11 expressions
Expr=ToFrame(frame=RenameSeries(frame=ScalarToSeries(frame=(RenameSeries(frame=ScalarToSeries(frame=(RenameSeries(frame=IsNa(frame=df['feature_1']), index='feature_1')).any()), index='feature_1'))[0]), index='has_nulls'))


--- Summary of Backend Compatibility ---
{'Backend': 'pandas', 'Has Nulls':    has_nulls
0       True, 'Executed Successfully': True}
{'Backend': 'modin', 'Has Nulls':    has_nulls
0       True, 'Executed Successfully': True}
{'Backend': 'pyarrow',

### Example 2: Calculating Summary Statistics in a Backend-Agnostic Manner

This example demonstrates **Narwhals**' ability to calculate summary statistics (mean, sum, standard deviation) across TemporalScope’s supported backends—**Pandas**, **Modin**, **Polars**, **PyArrow**, and **Dask**—with a backend-agnostic function that leverages Narwhals’ compatibility layer. 

### Key Steps

1. **Generate Synthetic Data Across Backends**:
   - Using `generate_data()`, we create a synthetic time-series DataFrame with multiple features and a target column. This dataset is compatible with multiple backends and provides the basis for backend-agnostic operations.

2. **Define a Narwhals-Compatible Summary Function**:
   - `calculate_summaries_nw()` uses the `@narwhalify` decorator to compute mean, sum, and standard deviation for each feature in a backend-agnostic manner.

3. **Understanding Lazy vs. Eager Evaluation**:
   - **Lazy Evaluation** (Polars `LazyFrame`, Dask): Operations are delayed until results are explicitly requested, for example with `.collect()` in Polars or `.compute()` in Dask, which lets the backend optimize the whole query. Narwhals supports lazy frames for such backends. Note that the `pl.DataFrame` objects created in this example are eager.
   - **Eager Evaluation** (Pandas, Modin, PyArrow, Polars `DataFrame`): Calculates results immediately, useful for smaller datasets or immediate result retrieval.
   - Narwhals preserves the evaluation mode of its input: an eager frame yields an eager result and a lazy frame yields a lazy result, so a Dask input produces an uncomputed Dask DataFrame until `.compute()` is called.

4. **Test Across Multiple Backends**:
   - The function is tested across the TemporalScope-supported backends, with each backend returning summary statistics in its native structure.
   - Results vary slightly in format: Pandas and Modin return DataFrames, PyArrow returns a `Table`, Polars provides its own optimized `DataFrame` structure, and Dask shows results as a Dask DataFrame structure.

This example showcases Narwhals’ seamless support for multi-backend compatibility, optimizing data operations across frameworks without modifying the core logic.

In [4]:
import pandas as pd
import modin.pandas as mpd
import polars as pl
import pyarrow as pa
import dask.dataframe as dd
import narwhals as nw
from narwhals.typing import FrameT
import numpy as np
from temporalscope.core.core_utils import TEMPORALSCOPE_CORE_BACKEND_TYPES, SupportedTemporalDataFrame

# Constants
NUM_ROWS = 100
SEED = 42

# Data generator function
def generate_data(num_rows: int = NUM_ROWS, backend: str = "pandas") -> SupportedTemporalDataFrame:
    """Generates a time-series DataFrame with multiple features and a target column."""
    np.random.seed(SEED)
    data = pd.DataFrame({
        "datetime": pd.date_range(start="2023-01-01", periods=num_rows, freq="D"),
        **{f"feature_{i}": np.random.rand(num_rows) for i in range(1, 6)},  # Generating 5 random features
        "target": np.random.randint(0, 100, num_rows)  # Random integer target
    })
    data.loc[0, 'feature_1'] = None  # Injecting a null value for testing
    
    # Convert to the specified backend
    if backend == "modin":
        return mpd.DataFrame(data)
    elif backend == "polars":
        return pl.DataFrame(data)
    elif backend == "pyarrow":
        return pa.Table.from_pandas(data)
    elif backend == "dask":
        return dd.from_pandas(data, npartitions=2)
    return data  # Default to Pandas DataFrame

# Narwhals-compatible function for summary statistics
@nw.narwhalify
def calculate_summaries_nw(df: FrameT) -> FrameT:
    """Calculates mean, sum, and standard deviation for each feature using Narwhals."""
    expressions = []
    for col in df.columns:
        if col.startswith("feature_"):
            expressions.extend([
                nw.col(col).mean().alias(f"{col}_mean"),
                nw.col(col).sum().alias(f"{col}_sum"),
                nw.col(col).std().alias(f"{col}_std")
            ])
    return df.select(*expressions)

# Testing function across all backends
def test_backends():
    """Tests `calculate_summaries_nw` on all supported TemporalScope backends."""
    results = []
    for backend_name in TEMPORALSCOPE_CORE_BACKEND_TYPES.keys():
        data_df = generate_data(backend=backend_name)
        try:
            summaries = calculate_summaries_nw(data_df)
            results.append({
                "Backend": backend_name,
                "Summaries": summaries,
                "Executed Successfully": True
            })
            print(f"{backend_name} -> Executed Successfully, Summaries:\n{summaries}\n")
        except Exception as e:
            results.append({
                "Backend": backend_name,
                "Executed Successfully": False,
                "Error": str(e)
            })
            print(f"{backend_name} -> Failed with error: {e}\n")

    return results

if __name__ == "__main__":
    # Run the backend compatibility tests
    backend_results = test_backends()
    print("\n--- Summary of Backend Compatibility ---")
    for result in backend_results:
        print(result)


pandas -> Executed Successfully, Summaries:
   feature_1_mean  feature_1_sum  feature_1_std  feature_2_mean  \
0        0.471147      46.643534       0.298846        0.497832   

   feature_2_sum  feature_2_std  feature_3_mean  feature_3_sum  feature_3_std  \
0      49.783172       0.293111        0.517601      51.760133       0.293426   

   feature_4_mean  feature_4_sum  feature_4_std  feature_5_mean  \
0        0.491149      49.114894       0.293452        0.516046   

   feature_5_sum  feature_5_std  
0      51.604582       0.318601  

modin -> Executed Successfully, Summaries:
   feature_1_mean  feature_1_sum  feature_1_std  feature_2_mean  \
0        0.471147      46.643534       0.298846        0.497832   

   feature_2_sum  feature_2_std  feature_3_mean  feature_3_sum  feature_3_std  \
0      49.783172       0.293111        0.517601      51.760133       0.293426   

   feature_4_mean  feature_4_sum  feature_4_std  feature_5_mean  \
0        0.491149      49.114894       0.29345

### Example 3: Scaling and Lagging Features in a Backend-Agnostic Manner

In this example, we demonstrate **Narwhals**’ capability to apply scaling and lag transformations across TemporalScope's supported backends. This approach is beneficial for time-series analysis, where lagging can reveal important sequential dependencies. Narwhals allows us to scale and lag feature columns consistently across multiple frameworks, maintaining a unified codebase.

### Key Steps

1. **Generate Synthetic Data Across Backends**:
   - Using `generate_data()`, we create a synthetic DataFrame with multiple features and a target column, which can be applied to each backend for testing.

2. **Define a Narwhals-Compatible Scaling and Lagging Function**:
   - `scale_and_lag_features_nw()` applies scaling and lagging transformations to each feature column in a backend-agnostic manner. Scaling standardizes each feature, and lagging shifts values by a defined number of steps, revealing sequential dependencies.

3. **Test Across Multiple Backends**:
   - Run the function on each backend, returning results in the native format for each. This verifies compatibility and consistent application of transformations across Pandas, Modin, Polars, PyArrow, and Dask backends.


In [5]:
import pandas as pd
import modin.pandas as mpd
import polars as pl
import pyarrow as pa
import dask.dataframe as dd
import narwhals as nw
from narwhals.typing import FrameT
import numpy as np
from temporalscope.core.core_utils import TEMPORALSCOPE_CORE_BACKEND_TYPES, SupportedTemporalDataFrame

# Constants
NUM_ROWS = 100
SEED = 42

# Data generator function
def generate_data(num_rows: int = NUM_ROWS, backend: str = "pandas") -> SupportedTemporalDataFrame:
    """Generates a time-series DataFrame with multiple features and a target column."""
    np.random.seed(SEED)
    data = pd.DataFrame({
        "datetime": pd.date_range(start="2023-01-01", periods=num_rows, freq="D"),
        **{f"feature_{i}": np.random.rand(num_rows) for i in range(1, 6)},  # Generating 5 random features
        "target": np.random.randint(0, 100, num_rows)  # Random integer target
    })
    
    # Convert to the specified backend
    if backend == "modin":
        return mpd.DataFrame(data)
    elif backend == "polars":
        return pl.DataFrame(data)
    elif backend == "pyarrow":
        return pa.Table.from_pandas(data)
    elif backend == "dask":
        return dd.from_pandas(data, npartitions=2)
    return data  # Default to Pandas DataFrame

# Narwhals-compatible function for scaling and lagging features
@nw.narwhalify
def scale_and_lag_features_nw(df: FrameT, lag_steps: int = 3) -> FrameT:
    """Applies scaling and lagging to each feature in a backend-agnostic way using Narwhals."""
    transformations = []
    for col in df.columns:
        if col.startswith("feature_"):
            transformations.extend([
                ((nw.col(col) - nw.col(col).mean()) / nw.col(col).std()).alias(f"{col}_scaled"),
                nw.col(col).shift(lag_steps).alias(f"{col}_lag{lag_steps}")
            ])
    return df.with_columns(*transformations)

# Testing function across all backends
def test_backends():
    """Tests `scale_and_lag_features_nw` on all supported TemporalScope backends."""
    results = []
    for backend_name in TEMPORALSCOPE_CORE_BACKEND_TYPES.keys():
        data_df = generate_data(backend=backend_name)
        try:
            transformed_df = scale_and_lag_features_nw(data_df)
            results.append({
                "Backend": backend_name,
                "Transformed Data": transformed_df,
                "Executed Successfully": True
            })
            print(f"{backend_name} -> Executed Successfully, Transformed Data:\n{transformed_df}\n")
        except Exception as e:
            results.append({
                "Backend": backend_name,
                "Executed Successfully": False,
                "Error": str(e)
            })
            print(f"{backend_name} -> Failed with error: {e}\n")

    return results

if __name__ == "__main__":
    # Run the backend compatibility tests
    backend_results = test_backends()
    print("\n--- Summary of Backend Compatibility ---")
    for result in backend_results:
        print(result)


pandas -> Executed Successfully, Transformed Data:
     datetime  feature_1  feature_2  feature_3  feature_4  feature_5  target  \
0  2023-01-01   0.374540   0.031429   0.642032   0.051682   0.103124      62   
1  2023-01-02   0.950714   0.636410   0.084140   0.531355   0.902553      16   
2  2023-01-03   0.731994   0.314356   0.161629   0.540635   0.505252      72   
3  2023-01-04   0.598658   0.508571   0.898554   0.637430   0.826457      32   
4  2023-01-05   0.156019   0.907566   0.606429   0.726091   0.320050      83   
..        ...        ...        ...        ...        ...        ...     ...   
95 2023-04-06   0.493796   0.349210   0.522243   0.930757   0.353352      63   
96 2023-04-07   0.522733   0.725956   0.769994   0.858413   0.583656      97   
97 2023-04-08   0.427541   0.897110   0.215821   0.428994   0.077735      37   
98 2023-04-09   0.025419   0.887086   0.622890   0.750871   0.974395      49   
99 2023-04-10   0.107891   0.779876   0.085347   0.754543   0.986211 

Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.


modin -> Executed Successfully, Transformed Data:
     datetime  feature_1  feature_2  feature_3  feature_4  feature_5  target  \
0  2023-01-01   0.374540   0.031429   0.642032   0.051682   0.103124      62   
1  2023-01-02   0.950714   0.636410   0.084140   0.531355   0.902553      16   
2  2023-01-03   0.731994   0.314356   0.161629   0.540635   0.505252      72   
3  2023-01-04   0.598658   0.508571   0.898554   0.637430   0.826457      32   
4  2023-01-05   0.156019   0.907566   0.606429   0.726091   0.320050      83   
..        ...        ...        ...        ...        ...        ...     ...   
95 2023-04-06   0.493796   0.349210   0.522243   0.930757   0.353352      63   
96 2023-04-07   0.522733   0.725956   0.769994   0.858413   0.583656      97   
97 2023-04-08   0.427541   0.897110   0.215821   0.428994   0.077735      37   
98 2023-04-09   0.025419   0.887086   0.622890   0.750871   0.974395      49   
99 2023-04-10   0.107891   0.779876   0.085347   0.754543   0.986211  