# Tutorial 6: Time Series Operations with Narwhals

This notebook demonstrates essential patterns for working with time series data using Narwhals, focusing on:

1. **Group-by Time Series Operations**
   - Efficient grouping by ID columns
   - Temporal aggregations within groups
   - Mixed frequency handling

2. **Time Series Validation**
   - Temporal uniqueness checks
   - Group-level validation
   - Data quality assurance

## Pattern 1: Time Series Group-by Operations

**NOTICE**: The following examples are provided AS-IS for academic purposes and are subject to the user's specific requirements and use cases. Always refer to the latest API documentation for production use.

Common time series tasks require grouping by an ID column and performing temporal operations within each group. These patterns are essential for ML workflows in various domains.

The following table shows key use cases where these patterns are critical:

| Use Case | Description |
|----------|-------------|
| Healthcare Analytics | Patient vitals monitoring over time (patient_id), treatment response analysis, longitudinal health studies |
| Quantitative Finance | Multi-asset portfolio analysis (ticker_id), market regime detection, cross-sectional momentum studies |
| Signal Processing | Multi-sensor data fusion (sensor_id), anomaly detection across sensor networks, distributed system monitoring |

Key Narwhals patterns for time series operations:

| Operation | ✅ Good Pattern | ❌ Bad Pattern |
|-----------|----------------|----------------|
| Group By | `df.group_by(id_col).agg([nw.col("value").mean()])` | `df.groupby(id_col).agg({"value": "mean"})` |
| Rolling | `df.with_columns([nw.col("value").rolling_mean(2)])` | `df["value"].rolling(2).mean()` |
| Time Sort | `df.sort([time_col])` | `df.sort_values(time_col)` |
| Null Check | `nw.col("value").is_null().sum()` | `df["value"].isnull().sum()` |

Implementation considerations for robust time series processing:

| Consideration | Description |
|---------------|-------------|
| Lazy Evaluation | • Chain operations for optimization<br>• Let Narwhals handle backend-specific optimizations<br>• Avoid unnecessary materializations |
| Temporal Ordering | • Ensure proper time-based sorting<br>• Handle mixed frequencies<br>• Maintain group boundaries |
| Backend Compatibility | • Use elementary aggregations<br>• Follow Narwhals patterns for operations<br>• Let Narwhals handle backend-specific details |
| Data Quality | • Validate temporal uniqueness<br>• Handle missing values<br>• Check frequency consistency |

These patterns keep time series processing robust across different ML scenarios and DataFrame backends. Note that some DataFrame implementations, such as Dask, have specific requirements (e.g., known divisions for rolling operations) because of their distributed nature. Narwhals abstracts most backend differences, but a requirement like this can still surface as an error, so consult the latest API documentation if you need to customize error handling or implement special cases.

**Best Practices Summary:**
1. ✅ Use elementary operations that work across all backends
2. ✅ Let Narwhals handle backend-specific optimizations
3. ✅ Pre-compute complex operations before grouping
4. ✅ Document backend-specific requirements in docstrings
5. ❌ Don't try to handle backend-specific details in your code
6. ❌ Don't build complex expressions inside a group-by aggregation; pre-compute them as columns first
7. ❌ Don't use backend-specific operations

The recommended approach is to use Narwhals patterns and let it manage backend-specific optimizations and limitations. This ensures your code remains maintainable and works consistently across different DataFrame implementations.


In [1]:
# Import required libraries
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
import dask.dataframe as dd
from typing import Dict, List, Optional, Union, Any, Literal


@nw.narwhalify
def compute_group_metrics(df: FrameT, id_col: str, time_col: str, value_col: str) -> FrameT:
    """Compute time series metrics by group.

    This function demonstrates proper Narwhals patterns for universal backend support:
    1. Uses only elementary aggregations (min, max, mean, count, sum)
    2. Works consistently across Pandas, Polars, and Dask
    3. Properly handles missing values and time ranges

    Returns a DataFrame with metrics per group:
    - start_time: First timestamp in group
    - end_time: Last timestamp in group
    - avg_value: Mean value in group
    - total_records: Count of non-null values (count() skips nulls)
    - missing_records: Count of nulls
    """
    # Pre-compute null indicators
    df_prep = df.with_columns([nw.col(value_col).is_null().alias("is_null")])

    # Use elementary aggregations
    return df_prep.group_by(id_col).agg(
        [
            # Time range metrics (elementary)
            nw.col(time_col).min().alias("start_time"),
            nw.col(time_col).max().alias("end_time"),
            # Value statistics (elementary)
            nw.col(value_col).mean().alias("avg_value"),
            # Count metrics (elementary)
            nw.col(value_col).count().alias("total_records"),
            nw.col("is_null").sum().alias("missing_records"),
        ]
    )


@nw.narwhalify
def compute_rolling_stats(df: FrameT, id_col: str, time_col: str, value_col: str, window: int) -> FrameT:
    """Compute rolling statistics within groups.

    This function demonstrates proper handling of complex operations:
    1. Pre-computes rolling means before grouping
    2. Uses elementary aggregations for group operations
    3. Lets Narwhals handle backend-specific details

    Note: Rolling operations have backend-specific requirements:
    - Works with Pandas and Polars out of the box
    - Dask requires known divisions (see Dask documentation)
    - An unmet requirement surfaces as an error (a ValueError in the test below) for the caller to handle

    Returns a DataFrame with rolling metrics per group:
    - rolling_mean_{window}: Mean of rolling window values
    """
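    # Sort by group and time, then compute rolling means.
    # NOTE: the window runs over the whole sorted frame, so the first
    # (window - 1) rows of every group after the first mix in values
    # from the preceding group.
    df_prep = df.sort([id_col, time_col]).with_columns(
        [nw.col(value_col).rolling_mean(window).alias("rolling_value")]
    )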

    # Then do elementary group-by aggregation
    return df_prep.group_by(id_col).agg([nw.col("rolling_value").mean().alias(f"rolling_mean_{window}")])


# Test data
test_data = {
    "id": [1, 1, 2, 2, 3],
    "timestamp": pd.date_range("2023-01-01", periods=5, freq="D"),
    "value": [10.0, 20.0, 30.0, None, 50.0],
}

# Create DataFrames
df_pd = pd.DataFrame(test_data)
df_pl = pl.DataFrame(test_data)
df_dd = dd.from_pandas(df_pd, npartitions=2)

# Test functions
print("Testing compute_group_metrics:")
print(compute_group_metrics(nw.from_native(df_pd), "id", "timestamp", "value"))
print(compute_group_metrics(nw.from_native(df_pl), "id", "timestamp", "value"))
print(compute_group_metrics(nw.from_native(df_dd), "id", "timestamp", "value"))

print("\nTesting compute_rolling_stats:")
window = 2
print(compute_rolling_stats(nw.from_native(df_pd), "id", "timestamp", "value", window))
print(compute_rolling_stats(nw.from_native(df_pl), "id", "timestamp", "value", window))
try:
    print(compute_rolling_stats(nw.from_native(df_dd), "id", "timestamp", "value", window))
except ValueError as e:
    print(f"\nDask rolling operations require known divisions:\n{e}")

Testing compute_group_metrics:
   id start_time   end_time  avg_value  total_records  missing_records
0   1 2023-01-01 2023-01-02       15.0              2                0
1   2 2023-01-03 2023-01-04       30.0              1                1
2   3 2023-01-05 2023-01-05       50.0              1                0
shape: (3, 6)
┌─────┬─────────────────────┬─────────────────────┬───────────┬───────────────┬─────────────────┐
│ id  ┆ start_time          ┆ end_time            ┆ avg_value ┆ total_records ┆ missing_records │
│ --- ┆ ---                 ┆ ---                 ┆ ---       ┆ ---           ┆ ---             │
│ i64 ┆ datetime[ns]        ┆ datetime[ns]        ┆ f64       ┆ u32           ┆ u32             │
╞═════╪═════════════════════╪═════════════════════╪═══════════╪═══════════════╪═════════════════╡
│ 2   ┆ 2023-01-03 00:00:00 ┆ 2023-01-04 00:00:00 ┆ 30.0      ┆ 1             ┆ 1               │
│ 1   ┆ 2023-01-01 00:00:00 ┆ 2023-01-02 00:00:00 ┆ 15.0      ┆ 2             ┆ 0  

## Pattern 2: Time Series Validation

Time series data often requires validation to ensure quality and consistency. The cell below shows two validation patterns: a temporal uniqueness check and a per-group time span summary.


In [2]:
# Import required libraries
import narwhals as nw
from narwhals.typing import FrameT
import pandas as pd
import polars as pl
import dask.dataframe as dd
from typing import Dict, List, Optional, Union, Any, Literal


@nw.narwhalify
def validate_temporal_uniqueness(df: FrameT, id_col: str, time_col: str) -> FrameT:
    """Validate and report temporal uniqueness.

    This demonstrates lazy evaluation for validation:
    - Group and count operations can be optimized
    - Return DataFrame for further processing
    - Let caller decide when to materialize
    """
    # Stage 1: Group by ID and time
    counts = df.group_by([id_col, time_col]).agg([nw.col(time_col).count().alias("count")])

    # Stage 2: Filter for duplicates
    return counts.filter(nw.col("count") > 1)


@nw.narwhalify
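def validate_uniform_frequency(df: FrameT, id_col: str, time_col: str) -> FrameT:
    """Report per-group time span and point count for frequency validation.

    The result does not flag irregular spacing by itself; the caller compares
    span and point count to judge uniformity.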

    This demonstrates lazy evaluation for validation:
    - Group and compute time spans
    - Return DataFrame for further processing
    - Let caller decide when to materialize
    """
    # Stage 1: Group by ID and compute time spans
    return df.group_by(id_col).agg(
        [
            nw.col(time_col).min().alias("start_time"),
            nw.col(time_col).max().alias("end_time"),
            nw.col(time_col).count().alias("points"),
        ]
    )


# Test data
test_data = {
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": [
        pd.Timestamp("2023-01-01"),
        pd.Timestamp("2023-01-01"),  # Duplicate
        pd.Timestamp("2023-01-02"),
        pd.Timestamp("2023-01-01"),
        pd.Timestamp("2023-01-02"),
        pd.Timestamp("2023-01-04"),  # Non-uniform gap
    ],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
}

# Test with different backends
print("Testing temporal uniqueness validation:")
print("-" * 50)

for backend, df in [
    ("Pandas", pd.DataFrame(test_data)),
    ("Polars", pl.DataFrame(test_data)),
    ("Dask", dd.from_pandas(pd.DataFrame(test_data), npartitions=2)),
]:
    print(f"\n{backend} Result:")
    try:
        result = validate_temporal_uniqueness(nw.from_native(df), "id", "timestamp")
        if hasattr(result, "compute"):
            result = result.compute()
        print(result)
    except Exception as e:
        print(f"Error: {e}")

print("\nTesting uniform frequency validation:")
print("-" * 50)

for backend, df in [
    ("Pandas", pd.DataFrame(test_data)),
    ("Polars", pl.DataFrame(test_data)),
    ("Dask", dd.from_pandas(pd.DataFrame(test_data), npartitions=2)),
]:
    print(f"\n{backend} Result:")
    try:
        result = validate_uniform_frequency(nw.from_native(df), "id", "timestamp")
        if hasattr(result, "compute"):
            result = result.compute()
        print(result)
    except Exception as e:
        print(f"Error: {e}")

Testing temporal uniqueness validation:
--------------------------------------------------

Pandas Result:
   id  timestamp  count
0   1 2023-01-01      2

Polars Result:
shape: (1, 3)
┌─────┬─────────────────────┬───────┐
│ id  ┆ timestamp           ┆ count │
│ --- ┆ ---                 ┆ ---   │
│ i64 ┆ datetime[μs]        ┆ u32   │
╞═════╪═════════════════════╪═══════╡
│ 1   ┆ 2023-01-01 00:00:00 ┆ 2     │
└─────┴─────────────────────┴───────┘

Dask Result:
   id  timestamp  count
0   1 2023-01-01      2

Testing uniform frequency validation:
--------------------------------------------------

Pandas Result:
   id start_time   end_time  points
0   1 2023-01-01 2023-01-02       3
1   2 2023-01-01 2023-01-04       3

Polars Result:
shape: (2, 4)
┌─────┬─────────────────────┬─────────────────────┬────────┐
│ id  ┆ start_time          ┆ end_time            ┆ points │
│ --- ┆ ---                 ┆ ---                 ┆ ---    │
│ i64 ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32    │

## Key Takeaways - Time Series Validation Patterns

This example demonstrates robust patterns for validating time series data across different DataFrame backends.

### Key Patterns

1. **Keep Operations Lazy**
   - ✅ Return DataFrames for further processing
   - ✅ Use elementary operations (group_by, agg)
   - ✅ Let caller decide when to materialize
   - ❌ Don't extract scalars in validation functions
   - ❌ Don't use over() for window operations

2. **Backend Compatibility**
   - ✅ Use group_by and agg for aggregations
   - ✅ Handle compute() at caller level
   - ✅ Use only Narwhals operations
   - ❌ Don't use backend-specific operations
   - ❌ Don't assume eager evaluation

3. **Validation Results**
   - ✅ Return DataFrames with validation details
   - ✅ Include all relevant information
   - ✅ Allow further processing if needed
   - ❌ Don't force immediate materialization
   - ❌ Don't return only boolean results

### Example Results

1. **Temporal Uniqueness**
```
   id  timestamp  count
0   1 2023-01-01      2  # Shows duplicate timestamps
```

2. **Frequency Validation**
```
   id start_time   end_time  points
0   1 2023-01-01 2023-01-02       3  # 3 points over a 1-day span: includes a duplicate
1   2 2023-01-01 2023-01-04       3  # 3 points over a 3-day span: gap between Jan 2 and Jan 4
```

### Why This Pattern Works

1. **Universal Backend Support**
   - Works with Pandas, Polars, and Dask
   - No backend-specific code needed
   - Consistent results across backends

2. **Efficient Processing**
   - Lazy evaluation enables optimization
   - No unnecessary materializations
   - Chainable with other operations

3. **Rich Validation Results**
   - Full details for analysis
   - Supports further processing
   - Clear validation outcomes


## Pattern 3: Mixed Frequency Handling

Time series data often has mixed frequencies that need special handling: