# SignalFlow Tutorial

**SignalFlow** is a modular quantitative trading framework built on Polars for high-performance data processing.

This tutorial walks through the complete pipeline:

```
Data Sources  -->  Raw Data  -->  Features  -->  Signals  -->  Labels  -->  Validation  -->  Backtest
  (Exchange)      (DuckDB)     (Pipeline)    (Detector)   (Labeler)   (Meta-label)     (Strategy)
```

### Table of Contents

1. [Setup & Imports](#1-setup--imports)
2. [Data Layer](#2-data-layer)  
3. [Feature Engineering](#3-feature-engineering)  
4. [Signal Detection](#4-signal-detection)  
5. [Signal Labeling](#5-signal-labeling)  
6. [Signal Validation (Meta-Labeling)](#6-signal-validation-meta-labeling)  
7. [Backtesting](#7-backtesting)  
8. [Visualization](#8-visualization)  
9. [Architecture & Next Steps](#9-architecture--next-steps)

## 1. Setup & Imports

In [1]:
import signalflow as sf
import polars as pl
from pathlib import Path
from datetime import datetime
from dataclasses import dataclass



## 2. Data Layer

SignalFlow's data layer consists of three components:

| Component | Role | Examples |
|-----------|------|----------|
| **Data Source** | Downloads OHLCV from exchanges | `BinanceSpotLoader`, `BybitSpotLoader`, `OkxSpotLoader` |
| **Raw Data Store** | Persists candles to disk | `DuckDbSpotStore`, `SqliteSpotStore`, `PgSpotStore` |
| **RawDataFactory** | Loads stored data into memory | `RawDataFactory.from_duckdb_spot_store()` |

For this tutorial we use `VirtualDataProvider` to generate synthetic data, so **no network access is required**. The exchange download in section 2.2 is optional and is the only step that needs a connection.

### 2.1 Data Store & Synthetic Data Generation

`VirtualDataProvider` generates realistic OHLCV candles using a geometric random walk with configurable volatility and trend. It is a drop-in replacement for exchange loaders and writes directly to a `RawDataStore`.
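To make the idea of a geometric random walk concrete, here is a minimal standalone sketch. It is not `VirtualDataProvider`'s implementation: the function name `synthetic_candles`, the way wicks and volume are drawn, and the plain-dict row format are all assumptions chosen for illustration. Only the `volatility`, `trend`, and `seed` parameters mirror the ones used in this tutorial.

```python
import math
import random
from datetime import datetime, timedelta


def synthetic_candles(base_price, n_bars, start, volatility=0.003, trend=0.00005, seed=42):
    """Generate 1-minute OHLCV candles from a geometric random walk (illustrative only)."""
    rng = random.Random(seed)  # seeded generator for reproducible output
    candles = []
    open_price = base_price
    for i in range(n_bars):
        # Per-bar log return: constant drift plus Gaussian noise.
        log_ret = rng.gauss(trend, volatility)
        close_price = open_price * math.exp(log_ret)
        # Assumed wick model: extend past the open/close range by a small random fraction.
        high = max(open_price, close_price) * (1 + abs(rng.gauss(0, volatility / 2)))
        low = min(open_price, close_price) * (1 - abs(rng.gauss(0, volatility / 2)))
        candles.append(
            {
                "timestamp": start + timedelta(minutes=i),
                "open": open_price,
                "high": high,
                "low": low,
                "close": close_price,
                "volume": rng.uniform(100, 3000),  # assumed volume range
            }
        )
        open_price = close_price  # the next bar opens at the previous close
    return candles


candles = synthetic_candles(42_000.0, n_bars=5, start=datetime(2025, 1, 1))
```

Because returns are applied multiplicatively, prices stay positive and volatility scales with the price level, which is why the same `volatility` value can be shared across pairs with very different base prices.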

In [2]:
from signalflow.data.raw_store import DuckDbSpotStore
from signalflow.data.source import VirtualDataProvider

PAIRS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
N_BARS = 10_000  # ~7 days of 1-minute candles per pair
START = datetime(2025, 1, 1)

# 1) Create a DuckDB-backed store for raw OHLCV data
spot_store = DuckDbSpotStore(db_path=Path("tutorial.duckdb"))

# 2) Generate synthetic data with realistic base prices
provider = VirtualDataProvider(
    store=spot_store,
    base_prices={"BTCUSDT": 42_000.0, "ETHUSDT": 2_200.0, "SOLUSDT": 100.0},
    volatility=0.003,  # per-bar return std deviation
    trend=0.00005,  # slight uptrend drift per bar
    seed=42,  # reproducible results
)

provider.download(pairs=PAIRS, n_bars=N_BARS, start=START)

# Verify what we stored
spot_store.get_stats()

2026-02-05 22:45:34.638 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:201 - Database initialized: tutorial.duckdb (timeframe=1m)
2026-02-05 22:45:34.715 | DEBUG    | signalflow.data.raw_store.duckdb_stores:insert_klines:318 - Inserted 10,000 rows for BTCUSDT
2026-02-05 22:45:34.716 | INFO     | signalflow.data.source.virtual:download:255 - VirtualDataProvider: generated 10000 bars for BTCUSDT
2026-02-05 22:45:34.778 | DEBUG    | signalflow.data.raw_store.duckdb_stores:insert_klines:318 - Inserted 10,000 rows for ETHUSDT
2026-02-05 22:45:34.778 | INFO     | signalflow.data.source.virtual:download:255 - VirtualDataProvider: generated 10000 bars for ETHUSDT

pair,rows,first_candle,last_candle,total_volume
str,i64,datetime[μs],datetime[μs],f64
"""BTCUSDT""",524160,2025-01-01 00:00:00,2025-12-31 00:00:00,754430000000.0
"""ETHUSDT""",524160,2025-01-01 00:00:00,2025-12-31 00:00:00,625160000000.0
"""SOLUSDT""",10000,2025-01-01 00:00:00,2025-01-07 22:39:00,17964000.0


### 2.2 Loading Data from Exchanges (Optional)

SignalFlow supports three exchanges out of the box. Each loader is **async** and handles pagination, rate limits, and gap detection automatically.

| Exchange | Spot Loader | Futures Loader |
|----------|-------------|----------------|
| **Binance** | `BinanceSpotLoader` | `BinanceFuturesUsdtLoader`, `BinanceFuturesCoinLoader` |
| **Bybit** | `BybitSpotLoader` | `BybitFuturesLoader` |
| **OKX** | `OkxSpotLoader` | `OkxFuturesLoader` |

All sources normalize timestamps to **candle close time** (open time + 1 timeframe).
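The close-time convention can be illustrated with a few lines of standard-library Python. This is a sketch of the arithmetic only; the helper `to_close_time` and the `TIMEFRAME_DELTAS` mapping are hypothetical names, not part of SignalFlow.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from timeframe strings to their duration.
TIMEFRAME_DELTAS = {
    "1m": timedelta(minutes=1),
    "15m": timedelta(minutes=15),
    "1h": timedelta(hours=1),
}


def to_close_time(open_time: datetime, timeframe: str = "1m") -> datetime:
    """Shift a candle's open time forward by one timeframe to get its close time."""
    return open_time + TIMEFRAME_DELTAS[timeframe]


# A 1-minute candle that opens at 00:00 is stamped with its close at 00:01.
close_ts = to_close_time(datetime(2025, 1, 1, 0, 0), "1m")
```

Stamping candles with their close time means a row's timestamp is the earliest moment at which all of its values are known, which helps avoid look-ahead when features and signals are joined on `timestamp`.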

In [3]:
# === Binance (requires network) ===
from signalflow.data.source import BinanceSpotLoader

loader = BinanceSpotLoader(store=spot_store, timeframe="1m")
await loader.download(
    pairs=["BTCUSDT", "ETHUSDT"],
    start=datetime(2025, 12, 1),
    end=datetime(2025, 12, 31),
)

# # === Bybit ===
# from signalflow.data.source import BybitSpotLoader

# loader = BybitSpotLoader(store=spot_store, timeframe="1m")
# await loader.download(
#     pairs=["BTCUSDT"],
#     start=datetime(2025, 12, 1),
#     end=datetime(2025, 12, 31),
# )

# # === OKX ===
# from signalflow.data.source import OkxSpotLoader

# loader = OkxSpotLoader(store=spot_store, timeframe="1m")
# await loader.download(
#     pairs=["BTCUSDT"],  # auto-converted to "BTC-USDT" for OKX API
#     start=datetime(2025, 12, 1),
#     end=datetime(2025, 12, 31),
# )

print("Uncomment the examples above to download real exchange data.")

2026-02-05 22:45:56.108 | INFO     | signalflow.data.source.binance:download_pair:367 - Processing BTCUSDT from 2025-12-01 00:00:00 to 2025-12-31 00:00:00
2026-02-05 22:45:56.137 | INFO     | signalflow.data.source.binance:download_pair:367 - Processing ETHUSDT from 2025-12-01 00:00:00 to 2025-12-31 00:00:00


Uncomment the examples above to download real exchange data.


### 2.3 RawDataFactory & RawDataView

- **`RawData`** — immutable in-memory container holding Polars DataFrames keyed by type (e.g. `"spot"`). Created via `RawDataFactory`.
- **`RawDataView`** — adapter that provides zero-copy Polars access (`to_polars()`) and optional Pandas conversion (`to_pandas()`).

`RawDataFactory` validates the schema, removes duplicates, normalizes timestamps, and sorts by `(pair, timestamp)`.
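The cleaning steps listed above can be sketched in plain Python. SignalFlow itself operates on Polars DataFrames, so this is only an illustration of the logic: the function `clean_rows`, the `candle` helper, the required-column set, and the keep-last policy for duplicates are assumptions, not the factory's documented behaviour.

```python
from datetime import datetime

# Assumed minimal schema for a spot candle row.
REQUIRED_COLUMNS = {"pair", "timestamp", "open", "high", "low", "close", "volume"}


def clean_rows(rows):
    """Validate columns, drop duplicate (pair, timestamp) keys, and sort by that key."""
    deduped = {}
    for row in rows:
        missing = REQUIRED_COLUMNS - set(row)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        # Later rows overwrite earlier ones with the same key (keep-last is an assumption).
        deduped[(row["pair"], row["timestamp"])] = row
    return [deduped[key] for key in sorted(deduped)]


def candle(pair, minute, close):
    """Build a flat test candle; all prices are set to `close` for brevity."""
    ts = datetime(2025, 1, 1, 0, minute)
    return {"pair": pair, "timestamp": ts, "open": close, "high": close,
            "low": close, "close": close, "volume": 1.0}


rows = [
    candle("ETHUSDT", 0, 2200.0),
    candle("BTCUSDT", 1, 42010.0),
    candle("BTCUSDT", 0, 42000.0),
    candle("BTCUSDT", 0, 42005.0),  # duplicate key, replaces the previous row
]
cleaned = clean_rows(rows)
```

Sorting by `(pair, timestamp)` keeps each pair's candles contiguous and chronological, which is what per-pair rolling computations downstream rely on.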

In [None]:
from signalflow.data import RawDataFactory

raw_data = RawDataFactory.from_duckdb_spot_store(
    spot_store_path=Path("tutorial.duckdb"),
    pairs=PAIRS,
    start=START,
    end=datetime(2025, 1, 8),
    data_types=["spot"],
)

raw_data_view = sf.RawDataView(raw=raw_data)

# Access the spot DataFrame
spot_df = raw_data_view.to_polars("spot")
print(f"Shape: {spot_df.shape}")
print(f"Pairs: {spot_df['pair'].unique().sort().to_list()}")
print(f"Columns: {spot_df.columns}")
print(f"Date range: {spot_df['timestamp'].min()} -> {spot_df['timestamp'].max()}")
spot_df.head(5)

2026-02-05 22:34:36.222 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:201 - Database initialized: tutorial.duckdb (timeframe=1m)


Shape: (30160, 8)
Pairs: ['BTCUSDT', 'ETHUSDT', 'SOLUSDT']
Columns: ['pair', 'timestamp', 'open', 'high', 'low', 'close', 'volume', 'trades']
Date range: 2025-01-01 00:00:00 -> 2025-01-08 00:00:00


pair,timestamp,open,high,low,close,volume,trades
str,datetime[μs],f64,f64,f64,f64,f64,i32
"""BTCUSDT""",2025-01-01 00:00:00,42000.0,42012.065495,41908.831169,41923.909648,2018.88,166
"""BTCUSDT""",2025-01-01 00:01:00,41923.909648,42042.646805,41917.558362,41971.777169,763.05,18
"""BTCUSDT""",2025-01-01 00:02:00,41971.777169,42068.255139,41961.938797,42027.086472,1059.6,93
"""BTCUSDT""",2025-01-01 00:03:00,42027.086472,42193.185723,42027.086472,42130.140474,1581.45,64
"""BTCUSDT""",2025-01-01 00:04:00,42130.140474,42252.648048,42113.145756,42206.409039,2386.56,361


## 3. Feature Engineering

Features transform raw OHLCV data into numerical indicators for signal detectors. SignalFlow provides two base classes:

| Base Class | Scope | Override |
|------------|-------|----------|
| `Feature` | Per-pair (grouped by pair) | `compute_pair(df)` |
| `GlobalFeature` | Cross-pair (all data at once) | `compute(df)` |

Each feature declares `requires` (input columns) and `outputs` (produced columns) with `{param}` template support.

`FeaturePipeline` orchestrates multiple features — it batches consecutive per-pair features into a single `group_by` call for performance, and validates that all column dependencies are satisfied at construction time.
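The following sketch shows how `{param}` templates and construction-time dependency checks could work in principle. `FeatureSpec` and `validate_dependencies` are hypothetical stand-ins written for this illustration; they are not SignalFlow classes, and the real pipeline may resolve and validate columns differently.

```python
from dataclasses import dataclass


@dataclass
class FeatureSpec:
    """Hypothetical stand-in for a feature's declared column contract."""

    requires: list
    outputs: list
    params: dict

    def required_cols(self):
        # Fill "{param}" placeholders with this instance's parameter values.
        return [template.format(**self.params) for template in self.requires]

    def output_cols(self):
        return [template.format(**self.params) for template in self.outputs]


def validate_dependencies(specs, available):
    """Raise if a feature needs a column that neither the raw data nor an earlier feature provides."""
    known = set(available)
    for spec in specs:
        missing = [col for col in spec.required_cols() if col not in known]
        if missing:
            raise ValueError(f"unsatisfied columns: {missing}")
        known.update(spec.output_cols())
    return known


rsi = FeatureSpec(["{price_col}"], ["rsi_{period}"], {"price_col": "close", "period": 14})
global_mean = FeatureSpec(["rsi_{period}"], ["global_mean_rsi_{period}"], {"period": 14})

columns = validate_dependencies([rsi, global_mean], available=["close"])
```

Checking dependencies in declaration order means a misordered pipeline (for example, a global mean of RSI placed before the RSI feature itself) fails immediately rather than midway through a long run.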

### 3.1 Built-in Features

| Feature | Class | Output | Key Params |
|---------|-------|--------|------------|
| RSI | `ExampleRsiFeature` | `rsi_{period}` | `period`, `price_col`, `normalized` |
| SMA | `ExampleSmaFeature` | `sma_{period}` | `period`, `price_col`, `normalized` |
| Global Mean RSI | `ExampleGlobalMeanRsiFeature` | `global_mean_rsi_{period}` | `period`, `add_diff` |
| Linear Regression | `LinRegForecastFeature` | forecast values | various |

In [None]:
from signalflow.feature import ExampleRsiFeature, ExampleSmaFeature

# Compute RSI-14 for a single pair
btc_df = spot_df.filter(pl.col("pair") == "BTCUSDT").sort("timestamp")

rsi = ExampleRsiFeature(period=14)
btc_with_rsi = rsi.compute_pair(btc_df)

print(f"Output columns: {rsi.output_cols()}")
print(f"Required columns: {rsi.required_cols()}")
print(f"Warmup period: {rsi.warmup} bars")
btc_with_rsi.select(["timestamp", "close", "rsi_14"]).tail(5)

Output columns: ['rsi_14']
Required columns: ['close']
Warmup period: 42 bars


timestamp,close,rsi_14
datetime[μs],f64,f64
2025-01-07 23:56:00,96935.49,32.851531
2025-01-07 23:57:00,96927.9,34.427177
2025-01-07 23:58:00,96943.43,46.995598
2025-01-07 23:59:00,96922.0,45.188423
2025-01-08 00:00:00,96954.61,52.007603


### 3.2 Custom Feature

To create a custom feature:
1. Inherit from `Feature` (per-pair) or `GlobalFeature` (cross-pair)
2. Declare `requires` and `outputs` (both support `{param}` templates)
3. Implement `compute_pair()` for a `Feature`, or `compute()` for a `GlobalFeature`; `compute_pair()` must return a DataFrame with the **same row count** as its input
4. Optionally decorate with `@sf_component(name=...)` to register the class in the component registry

In [None]:
from signalflow import sf_component
from signalflow.feature.base import Feature


@dataclass
@sf_component(name="custom/log_return")
class CustomLogReturnFeature(Feature):
    """Logarithmic return: ln(P_t / P_{t-n})."""

    price_col: str = "close"
    period: int = 1

    requires = ["{price_col}"]
    outputs = ["log_ret_{period}"]

    def compute_pair(self, df: pl.DataFrame) -> pl.DataFrame:
        col_name = f"log_ret_{self.period}"
        return df.with_columns(pl.col(self.price_col).log().diff(n=self.period).alias(col_name))


# Verify it works
log_ret = CustomLogReturnFeature(period=60)
print(f"Requires: {log_ret.required_cols()}")
print(f"Outputs:  {log_ret.output_cols()}")

# Test on single pair
test_result = log_ret.compute_pair(btc_df)
test_result.select(["timestamp", "close", "log_ret_60"]).tail(3)

Requires: ['close']
Outputs:  ['log_ret_60']


timestamp,close,log_ret_60
datetime[μs],f64,f64
2025-01-07 23:58:00,96943.43,-0.001306
2025-01-07 23:59:00,96922.0,-0.002503
2025-01-08 00:00:00,96954.61,-0.002157


### 3.3 FeaturePipeline

`FeaturePipeline` groups consecutive per-pair features into optimized batches (single `group_by` call). Global features are separated and applied between batches.
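The batching rule described above can be sketched with `itertools.groupby`, which groups only adjacent items that share a key. The `(name, scope)` tuples and the `plan_batches` function are assumptions made for this example; they do not reflect how `FeaturePipeline` represents features internally.

```python
from itertools import groupby


def plan_batches(features):
    """Group consecutive features of the same scope into batches, preserving order."""
    return [
        (scope, [name for name, _ in group])
        for scope, group in groupby(features, key=lambda item: item[1])
    ]


# Each feature is described as (output name, scope); scope is "pair" or "global".
declared = [
    ("rsi_14", "pair"),
    ("sma_20", "pair"),
    ("global_mean_rsi_14", "global"),
    ("log_ret_60", "pair"),
]
batches = plan_batches(declared)
# batches == [("pair", ["rsi_14", "sma_20"]),
#             ("global", ["global_mean_rsi_14"]),
#             ("pair", ["log_ret_60"])]
```

Note that a per-pair feature declared after a global one starts a new batch. Under this rule, listing all per-pair features first, as the pipeline in this section does, keeps them in a single batch.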

In [None]:
from signalflow.feature import (
    FeaturePipeline,
    ExampleRsiFeature,
    ExampleSmaFeature,
    ExampleGlobalMeanRsiFeature,
    OffsetFeature,
)

pipeline = FeaturePipeline(
    features=[
        # Per-pair features (batched into a single group_by)
        ExampleRsiFeature(period=14),
        ExampleRsiFeature(period=60),
        ExampleSmaFeature(period=20),
        ExampleSmaFeature(period=50),
        CustomLogReturnFeature(period=60),
        # Global feature (computed across all pairs per timestamp)
        ExampleGlobalMeanRsiFeature(period=14, add_diff=True),
    ]
)

features_df = pipeline.run(raw_data_view)
print(f"Pipeline outputs: {pipeline.output_cols()}")
print(f"Features shape: {features_df.shape}")
features_df.select(["pair", "timestamp", "rsi_14", "sma_20", "log_ret_60", "global_mean_rsi_14"]).tail(5)

Pipeline outputs: ['rsi_14', 'rsi_60', 'sma_20', 'sma_50', 'log_ret_60', 'global_mean_rsi_14', 'rsi_14_diff']
Features shape: (30160, 15)


pair,timestamp,rsi_14,sma_20,log_ret_60,global_mean_rsi_14
str,datetime[μs],f64,f64,f64,f64
"""SOLUSDT""",2025-01-07 22:35:00,58.507933,226.192168,0.026846,56.067575
"""SOLUSDT""",2025-01-07 22:36:00,56.308376,226.248188,0.021725,52.890512
"""SOLUSDT""",2025-01-07 22:37:00,52.601801,226.352034,0.02361,52.100117
"""SOLUSDT""",2025-01-07 22:38:00,66.288052,226.509135,0.025958,56.597117
"""SOLUSDT""",2025-01-07 22:39:00,54.161074,226.612431,0.025069,52.546148


### 3.4 Multi-Timeframe Features: OffsetFeature

`OffsetFeature` computes a registered feature on resampled (e.g. 15-minute) bars using **all possible offset alignments**. This captures multi-timeframe information without losing 1-minute resolution.

How it works:
1. Resample 1m OHLCV into `window`-minute bars for each possible offset (0 .. window-1)
2. Compute the base feature on each resampled series
3. Map results back to the original 1m timestamps

The `feature_name` parameter references a component registered via `@sf_component(name=...)`.
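The three steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not `OffsetFeature`'s implementation: bars are plain dicts, the offset of a resampled bar is taken as `end_index % window`, and the base feature is any function that maps a list of resampled closes to a list of the same length. All names (`resample_with_offsets`, `offset_feature`, `diff`) are hypothetical.

```python
def resample_with_offsets(bars, window):
    """Return {offset: [(end_index, aggregated_bar), ...]} for every alignment of `window`-bar bins."""
    series = {offset: [] for offset in range(window)}
    for end in range(window - 1, len(bars)):
        chunk = bars[end - window + 1 : end + 1]
        aggregated = {
            "open": chunk[0]["open"],
            "high": max(bar["high"] for bar in chunk),
            "low": min(bar["low"] for bar in chunk),
            "close": chunk[-1]["close"],
        }
        # Bins that end on indices with the same remainder form one non-overlapping series.
        series[end % window].append((end, aggregated))
    return series


def offset_feature(bars, window, feature):
    """Compute `feature` on each offset series and map the values back to 1-bar resolution."""
    values = [None] * len(bars)
    for resampled in resample_with_offsets(bars, window).values():
        closes = [aggregated["close"] for _, aggregated in resampled]
        for (end, _), value in zip(resampled, feature(closes)):
            values[end] = value
    return values


def diff(closes):
    """Toy base feature: change in close between consecutive resampled bars."""
    return [None] + [b - a for a, b in zip(closes, closes[1:])]


bars = [{"open": float(i), "high": i + 1.0, "low": i - 1.0, "close": float(i)} for i in range(9)]
values = offset_feature(bars, window=3, feature=diff)
```

Every 1-bar timestamp receives a value computed from the series whose latest bin ends exactly at that timestamp, so the higher-timeframe feature updates on every bar instead of once per window.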

In [None]:
offset_pipeline = FeaturePipeline(
    features=[
        ExampleRsiFeature(period=60),
        OffsetFeature(
            feature_name="example/rsi",  # registered name of ExampleRsiFeature
            feature_params={"period": 14},  # params for the base feature
            window=15,  # 15-minute resampling window
            prefix="ofs_",  # output column prefix
        ),
    ]
)

offset_df = offset_pipeline.run(raw_data_view)
print(f"Offset outputs: {offset_pipeline.output_cols()}")
offset_df.filter(pl.col("pair") == "BTCUSDT").select(["timestamp", "rsi_60", "ofs_rsi_14", "offset"]).tail(5)

2026-02-05 22:34:36.331 | DEBUG    | signalflow.core.registry:_discover_internal_packages:152 - autodiscover: failed to import signalflow.detector.adapter


Offset outputs: ['rsi_60', 'ofs_rsi_14', 'offset']


timestamp,rsi_60,ofs_rsi_14,offset
datetime[μs],f64,f64,u8
2025-01-07 23:56:00,47.216776,94.507957,10
2025-01-07 23:57:00,47.469725,94.954005,11
2025-01-07 23:58:00,47.590824,95.373589,12
2025-01-07 23:59:00,45.247385,96.486539,13
2025-01-08 00:00:00,45.953113,96.330167,14


## 4. Signal Detection

A `SignalDetector` generates trading signals from raw data. The pipeline:

```
RawDataView  -->  preprocess()  -->  detect()  -->  validate  -->  Signals
                  (features)         (logic)        (schema)       (pair, timestamp, signal_type, signal)
```

Each signal has:
- `signal_type`: `"rise"` (bullish), `"fall"` (bearish), or `"none"`
- `signal`: numeric value (typically +1 or -1)
- optionally `probability`: confidence score

The base class handles timezone normalization, schema validation, and duplicate detection.

### 4.1 Built-in: SMA Cross Detector

`ExampleSmaCrossDetector` generates signals on SMA crossovers:
- **RISE**: fast SMA crosses above slow SMA
- **FALL**: fast SMA crosses below slow SMA

It automatically creates its own `FeaturePipeline` with two `ExampleSmaFeature` instances in `__post_init__`.
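The crossover rule itself is small enough to sketch without the framework. The code below is an illustration of the RISE/FALL logic on a plain list of closes, with hypothetical helper names `sma` and `sma_cross_signals`; how `ExampleSmaCrossDetector` handles ties or warmup bars is not shown in this tutorial, so the choices here are assumptions.

```python
def sma(values, period):
    """Simple moving average; None until `period` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < period:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - period : i + 1]) / period)
    return out


def sma_cross_signals(closes, fast_period, slow_period):
    """Return 'rise', 'fall', or 'none' per bar based on fast/slow SMA crossovers."""
    fast, slow = sma(closes, fast_period), sma(closes, slow_period)
    signals = ["none"] * len(closes)
    for i in range(1, len(closes)):
        if None in (fast[i - 1], slow[i - 1], fast[i], slow[i]):
            continue  # still inside the warmup period
        was_above = fast[i - 1] > slow[i - 1]
        is_above = fast[i] > slow[i]
        if is_above and not was_above:
            signals[i] = "rise"  # fast SMA crossed above slow SMA
        elif was_above and not is_above:
            signals[i] = "fall"  # fast SMA crossed below slow SMA
    return signals


closes = [5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2, 1]
cross = sma_cross_signals(closes, fast_period=2, slow_period=4)
```

A signal fires only on the bar where the relationship flips, so a sustained trend produces a single signal rather than one per bar.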

In [None]:
from signalflow.detector import ExampleSmaCrossDetector

sma_detector = ExampleSmaCrossDetector(fast_period=20, slow_period=50)
sma_signals = sma_detector.run(raw_data_view)

# The detector returns all rows including "none" - filter to actual crossovers
active_sma = sma_signals.value.filter(pl.col("signal_type") != "none")
print(f"Total crossovers detected: {active_sma.height}")
print(f"  Rise: {active_sma.filter(pl.col('signal_type') == 'rise').height}")
print(f"  Fall: {active_sma.filter(pl.col('signal_type') == 'fall').height}")
active_sma.head(10)

Total crossovers detected: 717
  Rise: 358
  Fall: 359


pair,timestamp,signal_type,signal
str,datetime[μs],str,i32
"""BTCUSDT""",2025-01-01 01:16:00,"""fall""",-1
"""BTCUSDT""",2025-01-01 01:46:00,"""rise""",1
"""BTCUSDT""",2025-01-01 02:22:00,"""fall""",-1
"""BTCUSDT""",2025-01-01 02:26:00,"""rise""",1
"""BTCUSDT""",2025-01-01 02:38:00,"""fall""",-1
"""BTCUSDT""",2025-01-01 03:00:00,"""rise""",1
"""BTCUSDT""",2025-01-01 03:49:00,"""fall""",-1
"""BTCUSDT""",2025-01-01 04:28:00,"""rise""",1
"""BTCUSDT""",2025-01-01 05:14:00,"""fall""",-1
"""BTCUSDT""",2025-01-01 06:30:00,"""rise""",1


### 4.2 Custom Signal Detector

To create a custom detector:
1. Inherit from `SignalDetector`
2. Set `self.feature_pipeline` in `__post_init__()` for automatic feature extraction
3. Implement `detect(features, context)` → return a `Signals` container

The detector below fires when the absolute 60-bar log return exceeds a threshold: `"rise"` above `+threshold`, `"fall"` below `-threshold`. Unlike the SMA cross detector, it **filters out** `"none"` signals in `detect()` for a cleaner output.

In [None]:
from signalflow.core import Signals, SignalType
from signalflow.detector import SignalDetector
from signalflow.feature import FeaturePipeline


@dataclass
@sf_component(name="momentum_breakout")
class MomentumBreakoutDetector(SignalDetector):
    """Detects large price moves based on log return thresholds."""

    threshold: float = 0.02
    price_col: str = "close"
    period: int = 60

    def __post_init__(self):
        self.feature_col = f"log_ret_{self.period}"
        self.feature_pipeline = FeaturePipeline(
            features=[CustomLogReturnFeature(price_col=self.price_col, period=self.period)]
        )

    def detect(self, features: pl.DataFrame, context: dict | None = None) -> Signals:
        feat = pl.col(self.feature_col)

        out = features.select(
            [
                self.pair_col,
                self.ts_col,
                pl.when(feat > self.threshold)
                .then(pl.lit(SignalType.RISE.value))
                .when(feat < -self.threshold)
                .then(pl.lit(SignalType.FALL.value))
                .otherwise(pl.lit(SignalType.NONE.value))
                .alias("signal_type"),
                pl.when(feat > self.threshold)
                .then(1)
                .when(feat < -self.threshold)
                .then(-1)
                .otherwise(0)
                .alias("signal"),
            ]
        ).filter(pl.col("signal_type") != SignalType.NONE.value)

        return Signals(out)

In [None]:
detector = MomentumBreakoutDetector(threshold=0.02, period=60)
signals = detector.run(raw_data_view)

print(f"Detected {signals.value.height} momentum signals:")
print(f"  Rise: {signals.value.filter(pl.col('signal_type') == 'rise').height}")
print(f"  Fall: {signals.value.filter(pl.col('signal_type') == 'fall').height}")
signals.value.head(10)

Detected 10968 momentum signals:
  Rise: 6677
  Fall: 4291


pair,timestamp,signal_type,signal
str,datetime[μs],str,i32
"""BTCUSDT""",2025-01-01 01:00:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:01:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:02:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:03:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:04:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:05:00,"""rise""",1
"""BTCUSDT""",2025-01-01 01:06:00,"""rise""",1
"""BTCUSDT""",2025-01-01 02:12:00,"""rise""",1
"""BTCUSDT""",2025-01-01 02:22:00,"""rise""",1
"""BTCUSDT""",2025-01-01 02:23:00,"""rise""",1


## 5. Signal Labeling

**Labelers** assign forward-looking labels to historical data: given a signal at time `t`, what happened to the price afterwards?

These labels are used to train the **validator** (meta-labeler), which predicts signal quality.

| Labeler | Strategy | Key Params |
|---------|----------|------------|
| `FixedHorizonLabeler` | Return after `N` bars | `horizon`, `price_col` |
| `TripleBarrierLabeler` | First hit of profit/loss/time barrier (Numba-accelerated) | `vol_window`, `lookforward_window`, `profit_multiplier` |
| `StaticTripleBarrierLabeler` | Fixed-percentage barriers | varies |

Labels are computed on the full price series but can be **masked** to signal timestamps only. To enable masking, pass `data_context={"signal_keys": ...}` with a DataFrame of `(pair, timestamp)` rows to label. Non-signal rows then get `label="none"`, so only the detected signals receive meaningful labels.
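A fixed-horizon label with masking can be sketched as follows. This is a plain-Python illustration, not `FixedHorizonLabeler`'s code: the function names are hypothetical, and the choices to label a zero return as `"fall"` and to emit `"none"` when fewer than `horizon` future bars exist are assumptions.

```python
import math


def fixed_horizon_labels(closes, horizon):
    """Label each bar by the sign of the log return `horizon` bars ahead."""
    labels = []
    for i, price in enumerate(closes):
        if i + horizon >= len(closes):
            labels.append(("none", None))  # not enough future data to label
            continue
        ret = math.log(closes[i + horizon] / price)
        labels.append(("rise" if ret > 0 else "fall", ret))
    return labels


def mask_to_signals(labels, signal_indices):
    """Keep labels only at signal rows; every other row becomes 'none'."""
    keep = set(signal_indices)
    return [label if i in keep else ("none", None) for i, label in enumerate(labels)]


closes = [100.0, 101.0, 99.0, 102.0, 98.0]
all_labels = fixed_horizon_labels(closes, horizon=2)
masked = mask_to_signals(all_labels, signal_indices=[1, 4])
```

Masking leaves the output aligned row-for-row with the price series, so it can still be joined back on `(pair, timestamp)` while only signal rows carry a usable label.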

In [None]:
from signalflow.target import FixedHorizonLabeler

labeler = FixedHorizonLabeler(
    price_col="close",
    horizon=60,  # look 60 bars (minutes) ahead
    include_meta=True,  # include t1 (future timestamp) and ret (log return)
)

# Extract signal timestamps for masking
signal_keys = signals.value.select(["pair", "timestamp"])

labeled_df = labeler.compute(
    df=raw_data_view.to_polars("spot"),
    signals=signals,
    data_context={"signal_keys": signal_keys},  # mask labels to signal timestamps
)

# Show labeled signal rows (non-signal rows have label="none")
labeled_signals = labeled_df.filter(pl.col("label") != "none")
print(f"Total labeled signals: {labeled_signals.height}")
print(f"\nLabel distribution:")
display(labeled_signals.group_by("label").len().sort("label"))
print(f"\nSample:")
labeled_signals.head(5)

Total labeled signals: 10891

Label distribution:


label,len
str,u32
"""fall""",4685
"""rise""",6206



Sample:


pair,timestamp,label,t1,ret
str,datetime[μs],str,datetime[μs],f64
"""BTCUSDT""",2025-01-01 01:00:00,"""rise""",2025-01-01 02:00:00,0.019763
"""BTCUSDT""",2025-01-01 01:01:00,"""rise""",2025-01-01 02:01:00,0.014369
"""BTCUSDT""",2025-01-01 01:02:00,"""rise""",2025-01-01 02:02:00,0.014624
"""BTCUSDT""",2025-01-01 01:03:00,"""rise""",2025-01-01 02:03:00,0.00938
"""BTCUSDT""",2025-01-01 01:04:00,"""rise""",2025-01-01 02:04:00,0.00256


## 6. Signal Validation (Meta-Labeling)

The **validator** is a machine learning model that predicts the probability of each signal being correct. It works as a "meta-labeler":

1. **Train** on historical features + labels (from the labeler)
2. **Predict** class probabilities for each signal (`probability_rise`, `probability_fall`, `probability_none`)
3. The strategy can then **filter** or **size** positions based on confidence

`SklearnSignalValidator` supports: `random_forest`, `lightgbm`, `xgboost`, `logistic_regression`, `svm`, and `auto` (cross-validation model selection).
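Step 3 above, filtering or sizing by confidence, might look like the sketch below. The linear sizing formula (`base_size * probability`) and the function `size_signals` are assumptions for illustration; the tutorial does not show how SignalFlow's entry rules turn probabilities into position sizes.

```python
def size_signals(signals, base_size=1000.0, min_probability=0.5):
    """Keep signals whose own-direction probability clears the threshold, then size them."""
    sized = []
    for sig in signals:
        # Look up the probability that matches the signal's direction,
        # e.g. "probability_rise" for a "rise" signal.
        prob = sig[f"probability_{sig['signal_type']}"]
        if prob < min_probability:
            continue  # the validator disagrees with the detector: skip
        sized.append({**sig, "size": base_size * prob})  # assumed linear scaling
    return sized


validated = [
    {"pair": "BTCUSDT", "signal_type": "rise", "probability_rise": 0.75, "probability_fall": 0.25},
    {"pair": "ETHUSDT", "signal_type": "fall", "probability_rise": 0.75, "probability_fall": 0.25},
]
orders = size_signals(validated)
```

In this example the ETHUSDT `"fall"` signal is dropped because the model assigns only 0.25 to a fall, which is the meta-labeling idea in miniature: the detector proposes, the validator decides how much to trust it.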

In [None]:
from signalflow.validator import SklearnSignalValidator

# 1) Get features at ALL timestamps from the detector's pipeline
all_features = detector.preprocess(raw_data_view)

# 2) Join features with labels on (pair, timestamp)
train_df = all_features.join(
    labeled_df,
    on=["pair", "timestamp"],
    how="inner",
)

# 3) Filter to signal rows only (recommended by validator docs)
train_df = train_df.filter(pl.col("label") != "none")
print(f"Training samples (signal rows only): {train_df.height}")

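# 4) Sequential train/test split (80/20). Rows keep the feature frame's order,
#    so sort by "timestamp" first if a strictly chronological split across pairs is needed.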
split_idx = int(train_df.height * 0.8)

# X must contain pair, timestamp, and feature columns
feature_cols = ["pair", "timestamp", "log_ret_60"]
X_train = train_df.slice(0, split_idx).select(feature_cols)
X_test = train_df.slice(split_idx).select(feature_cols)
y_train = train_df.slice(0, split_idx).select("label")
y_test = train_df.slice(split_idx).select("label")

# 5) Train the validator
validator = SklearnSignalValidator(
    model_type="random_forest",
    model_params={"n_estimators": 100, "max_depth": 5, "random_state": 42},
)
validator.fit(X_train, y_train)
print(f"Validator trained on {X_train.height} samples.")

Training samples (signal rows only): 10891
Validator trained on 8712 samples.


In [None]:
# validate_signals() adds probability columns to the Signals container
validated_signals = validator.validate_signals(signals, all_features)

print(f"Validated {validated_signals.value.height} signals.")
print(f"\nTop signals by rise probability:")
validated_signals.value.select(["pair", "timestamp", "signal_type", "probability_rise", "probability_fall"]).sort(
    "probability_rise", descending=True
).head(10)

Validated 10968 signals.

Top signals by rise probability:


pair,timestamp,signal_type,probability_rise,probability_fall
str,datetime[μs],str,f64,f64
"""ETHUSDT""",2025-01-01 09:16:00,"""fall""",0.898607,0.101393
"""ETHUSDT""",2025-01-01 09:17:00,"""fall""",0.898607,0.101393
"""ETHUSDT""",2025-01-01 09:18:00,"""fall""",0.898607,0.101393
"""ETHUSDT""",2025-01-01 09:21:00,"""fall""",0.898607,0.101393
"""ETHUSDT""",2025-01-01 09:22:00,"""fall""",0.816095,0.183905
"""ETHUSDT""",2025-01-01 09:20:00,"""fall""",0.802929,0.197071
"""SOLUSDT""",2025-01-05 10:36:00,"""fall""",0.758762,0.241238
"""ETHUSDT""",2025-01-01 09:15:00,"""fall""",0.741143,0.258857
"""BTCUSDT""",2025-01-03 14:43:00,"""rise""",0.683112,0.316888
"""BTCUSDT""",2025-01-03 14:45:00,"""rise""",0.683112,0.316888


## 7. Backtesting

The backtesting engine simulates strategy execution bar-by-bar:

```
For each timestamp:
  1. Mark open positions to current prices
  2. Compute metrics (equity, drawdown, sharpe, etc.)
  3. Check exit rules → submit close orders
  4. Check entry rules → submit open orders
```
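The loop above can be made concrete with a deliberately small, single-pair, long-only sketch. Everything here is hypothetical and simplified: `run_backtest` is not a SignalFlow function, slippage is ignored, fees are charged as a flat fraction of notional, and equity is the only metric recorded.

```python
def run_backtest(prices, entries, capital=10_000.0, size=1_000.0,
                 take_profit=0.02, stop_loss=0.02, fee_rate=0.001):
    """Bar-by-bar loop: mark positions, record equity, check exits, then check entries."""
    cash, position, equity_curve = capital, None, []
    for i, price in enumerate(prices):
        # 1) Mark the open position to the current price.
        position_value = position["qty"] * price if position else 0.0
        # 2) Compute metrics (only equity in this sketch).
        equity_curve.append(cash + position_value)
        # 3) Exit rule: close at take-profit or stop-loss.
        if position:
            change = price / position["entry_price"] - 1
            if change >= take_profit or change <= -stop_loss:
                cash += position_value * (1 - fee_rate)
                position = None
        # 4) Entry rule: open a position if a signal fires and none is open.
        if position is None and i in entries and cash >= size * (1 + fee_rate):
            cash -= size * (1 + fee_rate)
            position = {"qty": size / price, "entry_price": price}
    return equity_curve, cash, position


prices = [100.0, 100.0, 103.0, 103.0]
equity_curve, cash, position = run_backtest(prices, entries={0})
```

Because exits are evaluated before entries within a bar, a position can be closed and a new one opened at the same timestamp.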

### Key Components

| Component | Class | Role |
|-----------|-------|------|
| **Entry Rule** | `SignalEntryRule` | Opens positions on validated signals, sizes by probability |
| | `FixedSizeEntryRule` | Opens fixed-size positions |
| **Exit Rule** | `TakeProfitStopLossExit` | Closes at TP/SL percentages |
| **Executor** | `VirtualSpotExecutor` | Simulates fills with fees + slippage |
| **Broker** | `BacktestBroker` | Manages orders, positions, and state |
| **Metrics** | `TotalReturnMetric`, `DrawdownMetric`, etc. | Computed every bar |
| **Runner** | `OptimizedBacktestRunner` | Pre-builds lookups for faster iteration |

### 7.1 Setting Up Strategy Components

In [None]:
from signalflow.strategy.broker import BacktestBroker
from signalflow.strategy.broker.executor import VirtualSpotExecutor
from signalflow.data.strategy_store import DuckDbStrategyStore
from signalflow.strategy.runner import OptimizedBacktestRunner
from signalflow.strategy.component.entry import SignalEntryRule
from signalflow.strategy.component.exit import TakeProfitStopLossExit
from signalflow.analytic.strategy import (
    TotalReturnMetric,
    BalanceAllocationMetric,
    DrawdownMetric,
    WinRateMetric,
    SharpeRatioMetric,
)

INITIAL_CAPITAL = 10_000.0

# Strategy state persistence
strategy_store = DuckDbStrategyStore("tutorial_strategy.duckdb")
strategy_store.init()

# Order execution with fees and slippage
executor = VirtualSpotExecutor(fee_rate=0.001, slippage_pct=0.001)
broker = BacktestBroker(executor=executor, store=strategy_store)

# Entry rule: open positions on validated signals
entry_rule = SignalEntryRule(
    base_position_size=1000.0,  # base size in quote currency
    use_probability_sizing=True,  # scale size by signal probability
    min_probability=0.5,  # ignore signals below this confidence
    max_positions_per_pair=1,  # no stacking positions
    max_total_positions=20,
    allow_shorts=False,  # long only
)

# Exit rule: symmetric take-profit and stop-loss
exit_rule = TakeProfitStopLossExit(
    take_profit_pct=0.02,  # +2% take profit
    stop_loss_pct=0.02,  # -2% stop loss
)

# Performance metrics (computed every bar)
metrics = [
    TotalReturnMetric(initial_capital=INITIAL_CAPITAL),
    BalanceAllocationMetric(initial_capital=INITIAL_CAPITAL),
    DrawdownMetric(),
    WinRateMetric(),
    SharpeRatioMetric(initial_capital=INITIAL_CAPITAL, window_size=100),
]

print("Strategy components ready.")

Strategy components ready.


### 7.2 Running the Backtest

In [None]:
runner = OptimizedBacktestRunner(
    strategy_id="tutorial_momentum",
    broker=broker,
    entry_rules=[entry_rule],
    exit_rules=[exit_rule],
    metrics=metrics,
    initial_capital=INITIAL_CAPITAL,
    data_key="spot",
)

# run() iterates over every timestamp in the raw data
final_state = runner.run(raw_data, validated_signals)

Backtesting: 100%|██████████| 10080/10080 [00:05<00:00, 1798.81it/s]


### 7.3 Analyzing Results

In [None]:
results = runner.get_results()

print("=" * 50)
print("BACKTEST RESULTS")
print("=" * 50)
print(f"  Initial Capital:  ${INITIAL_CAPITAL:,.2f}")
print(f"  Final Equity:     ${results.get('final_equity', 0):,.2f}")
print(f"  Total Return:     {results.get('final_return', 0) * 100:.2f}%")
print(f"  Max Drawdown:     {results.get('max_drawdown', 0) * 100:.2f}%")
print(f"  Win Rate:         {results.get('win_rate', 0) * 100:.1f}%")
print(f"  Sharpe Ratio:     {results.get('sharpe_ratio', 0):.3f}")
print(f"  Total Trades:     {results.get('total_trades', 0)}")
print(f"    Entries:        {results.get('entry_count', 0)}")
print(f"    Exits:          {results.get('exit_count', 0)}")
print("=" * 50)

# Show recent trades
trades_df = results["trades_df"]
if trades_df.height > 0:
    print(f"\nRecent trades ({trades_df.height} total):")
    display(trades_df.tail(10))

# Metrics time series
metrics_df = results["metrics_df"]
print(f"\nMetrics time series: {metrics_df.shape}")
metrics_df.tail(3)

BACKTEST RESULTS
  Initial Capital:  $10,000.00
  Final Equity:     $9,261.86
  Total Return:     -7.38%
  Max Drawdown:     8.15%
  Win Rate:         51.5%
  Sharpe Ratio:     0.033
  Total Trades:     585
    Entries:        294
    Exits:          291

Recent trades (585 total):


id,position_id,pair,side,ts,price,qty,fee,meta
str,str,str,str,datetime[μs],f64,f64,f64,struct[3]
"""90dcfa51-f6e8-40e1-9f35-fef5d8…","""cf038cb9-01dc-4e38-a904-73b7ca…","""SOLUSDT""","""SELL""",2025-01-07 20:42:00,218.727022,4.664514,1.020255,"{""exit"",{null,null,null,null,""take_profit"",214.598982,218.945968},1.0}"
"""a2dc6991-0d94-4f85-8747-d73536…","""0d6f8376-a003-4da9-a147-758b80…","""SOLUSDT""","""BUY""",2025-01-07 21:00:00,218.317319,4.585069,1.001,"{""entry"",{""rise"",1.0,2025-01-07 21:00:00,1000.0,null,null,null},1.0}"
"""786fa664-045a-475f-95bd-6e7917…","""0d6f8376-a003-4da9-a147-758b80…","""SOLUSDT""","""SELL""",2025-01-07 21:22:00,222.472564,4.585069,1.020052,"{""exit"",{null,null,null,null,""take_profit"",218.317319,222.695259},1.0}"
"""a57a804c-5729-482e-a9fa-ccc8b3…","""73a668ad-307b-442d-bdf6-a7bb99…","""SOLUSDT""","""BUY""",2025-01-07 21:22:00,222.917954,4.490441,1.001,"{""entry"",{""rise"",1.0,2025-01-07 21:22:00,1000.0,null,null,null},1.0}"
"""b6f5ebb0-03c0-4a1d-a5e0-61e8cf…","""ca5379b4-6192-4c36-a819-920705…","""BTCUSDT""","""SELL""",2025-01-07 21:29:00,59087.475744,0.016548,0.977801,"{""exit"",{null,null,null,null,""stop_loss"",60489.38647,59146.622366},1.0}"
"""8cc8a1d4-bc99-4fba-8aca-d50342…","""1e3657d0-6364-4f99-8a34-8b9a8a…","""ETHUSDT""","""SELL""",2025-01-07 21:34:00,3382.791004,0.301685,1.020537,"{""exit"",{null,null,null,null,""take_profit"",3318.030543,3386.177181},1.0}"
"""bf63ff9e-5156-4f66-a29c-962b6f…","""74da5108-0839-4160-a9ef-d3f711…","""ETHUSDT""","""BUY""",2025-01-07 21:34:00,3389.563358,0.295318,1.001,"{""entry"",{""rise"",1.0,2025-01-07 21:34:00,1000.0,null,null,null},1.0}"
"""2f6bcd10-cdee-4382-8fa5-179b4b…","""73a668ad-307b-442d-bdf6-a7bb99…","""SOLUSDT""","""SELL""",2025-01-07 22:08:00,227.228343,4.490441,1.020356,"{""exit"",{null,null,null,null,""take_profit"",222.917954,227.455799},1.0}"
"""6e0b0a4f-2f62-40cf-8742-9b085a…","""40709b30-e94e-44bb-87c7-856cc4…","""SOLUSDT""","""BUY""",2025-01-07 22:08:00,227.683255,4.396459,1.001,"{""entry"",{""rise"",1.0,2025-01-07 22:08:00,1000.0,null,null,null},1.0}"
"""fdb8e886-194b-4767-bfd8-e8e27e…","""d7b41855-f142-477a-8ac9-b9854f…","""BTCUSDT""","""BUY""",2025-01-07 22:41:00,97277.00983,0.01029,1.001,"{""entry"",{""rise"",1.0,2025-01-07 22:41:00,1000.0,null,null,null},1.0}"



Metrics time series: (10080, 20)


timestamp,equity,cash,total_return,realized_pnl,unrealized_pnl,total_fees,open_positions,closed_positions,capital_utilization,free_balance_pct,allocated_value,allocation_vs_initial,current_drawdown,max_drawdown,peak_equity,win_rate,winning_trades,losing_trades,sharpe_ratio
f64,f64,f64,f64,f64,f64,f64,i64,i64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,f64
1736300000.0,9261.798894,6266.72155,-0.07382,-144.838289,-7.922656,585.440162,3,291,0.32338,0.67662,2995.077344,0.299508,0.074629,0.081461,10008.741733,0.515464,150,141,0.035769
1736300000.0,9261.61086,6266.72155,-0.073839,-144.838289,-8.11069,585.440162,3,291,0.323366,0.676634,2994.88931,0.299489,0.074648,0.081461,10008.741733,0.515464,150,141,0.022177
1736300000.0,9261.860781,6266.72155,-0.073814,-144.838289,-7.860769,585.440162,3,291,0.323384,0.676616,2995.139231,0.299514,0.074623,0.081461,10008.741733,0.515464,150,141,0.03311
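
Several columns in the metrics table are linked by simple accounting identities. The sketch below checks a few of them against the first printed row. The relations are inferred from the numbers shown here, not taken from SignalFlow's source, and `INITIAL_CAPITAL = 10_000` is an assumption implied by the `allocation_vs_initial` column.

```python
import math

# First row of the metrics table above (values copied as printed).
row = {
    "equity": 9261.798894,
    "cash": 6266.72155,
    "total_return": -0.07382,
    "realized_pnl": -144.838289,
    "unrealized_pnl": -7.922656,
    "total_fees": 585.440162,
    "capital_utilization": 0.32338,
    "free_balance_pct": 0.67662,
    "allocated_value": 2995.077344,
    "allocation_vs_initial": 0.299508,
    "current_drawdown": 0.074629,
    "peak_equity": 10008.741733,
    "win_rate": 0.515464,
    "winning_trades": 150,
    "losing_trades": 141,
}

# Assumption: starting capital implied by allocated_value / allocation_vs_initial.
INITIAL_CAPITAL = 10_000.0

# Each entry pairs a value recomputed from other columns with the printed value.
relations = {
    "equity = cash + allocated_value": (
        row["cash"] + row["allocated_value"],
        row["equity"],
    ),
    "equity = initial + realized + unrealized - fees": (
        INITIAL_CAPITAL + row["realized_pnl"] + row["unrealized_pnl"] - row["total_fees"],
        row["equity"],
    ),
    "total_return = equity / initial - 1": (
        row["equity"] / INITIAL_CAPITAL - 1,
        row["total_return"],
    ),
    "current_drawdown = 1 - equity / peak_equity": (
        1 - row["equity"] / row["peak_equity"],
        row["current_drawdown"],
    ),
    "capital_utilization = allocated_value / equity": (
        row["allocated_value"] / row["equity"],
        row["capital_utilization"],
    ),
    "free_balance_pct = cash / equity": (
        row["cash"] / row["equity"],
        row["free_balance_pct"],
    ),
    "win_rate = wins / (wins + losses)": (
        row["winning_trades"] / (row["winning_trades"] + row["losing_trades"]),
        row["win_rate"],
    ),
}

# The table is printed with about six significant decimals, so compare loosely.
checks = {
    name: math.isclose(computed, printed, abs_tol=1e-5)
    for name, (computed, printed) in relations.items()
}

for name, ok in checks.items():
    print(f"{'OK ' if ok else 'BAD'} {name}")
```

Under these relations, `realized_pnl` is gross of fees: equity only reconciles once `total_fees` is subtracted separately.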


## 8. Visualization

### 8.1 Signals on Price Chart

In [None]:
import plotly.graph_objects as go


def plot_signals(raw_df: pl.DataFrame, signals_df: pl.DataFrame, pair: str = "BTCUSDT"):
    """Plot price with buy/sell signal markers."""
    price = raw_df.filter(pl.col("pair") == pair).sort("timestamp")
    price = price.with_columns(pl.col("timestamp").cast(pl.Datetime("us")))

    sigs = signals_df.filter(pl.col("pair") == pair)
    sigs = sigs.with_columns(pl.col("timestamp").cast(pl.Datetime("us")))
    sigs = sigs.join(price.select(["timestamp", "close"]), on="timestamp", how="inner")

    df_plot = price.to_pandas()
    sig_plot = sigs.to_pandas()

    fig = go.Figure()
    fig.add_trace(
        go.Scatter(
            x=df_plot["timestamp"],
            y=df_plot["close"],
            mode="lines",
            name=f"{pair} Price",
            line=dict(color="#2E86C1", width=1.5),
        )
    )

    buys = sig_plot[sig_plot["signal"] == 1]
    if not buys.empty:
        fig.add_trace(
            go.Scatter(
                x=buys["timestamp"],
                y=buys["close"],
                mode="markers",
                name="Rise Signal",
                marker=dict(symbol="triangle-up", size=12, color="#00CC96", line=dict(width=1, color="black")),
            )
        )

    sells = sig_plot[sig_plot["signal"] == -1]
    if not sells.empty:
        fig.add_trace(
            go.Scatter(
                x=sells["timestamp"],
                y=sells["close"],
                mode="markers",
                name="Fall Signal",
                marker=dict(symbol="triangle-down", size=12, color="#EF553B", line=dict(width=1, color="black")),
            )
        )

    fig.update_layout(
        title=f"SignalFlow: {pair} Signals",
        xaxis_title="Date",
        yaxis_title="Price",
        template="plotly_white",
        height=500,
        hovermode="x unified",
    )
    return fig


for pair in PAIRS:
    fig = plot_signals(spot_df, signals.value, pair=pair)
    fig.show()

### 8.2 Validated Signals with Confidence

Visualize how the meta-labeler scores signals. For signals with a predicted probability of at least 0.5, marker size scales with confidence; small gray markers indicate low-confidence signals (probability below 0.5) that would be **ignored** by the entry rule.

In [None]:
def plot_validated_signals(raw_df: pl.DataFrame, val_signals: sf.Signals, pair: str = "BTCUSDT"):
    """Plot signals colored and sized by validation probability."""
    price = raw_df.filter(pl.col("pair") == pair).sort("timestamp")
    price = price.with_columns(pl.col("timestamp").cast(pl.Datetime("us")))

    sigs = val_signals.value.filter(pl.col("pair") == pair)
    sigs = sigs.with_columns(pl.col("timestamp").cast(pl.Datetime("us")))
    merged = sigs.join(price.select(["timestamp", "close"]), on="timestamp", how="inner").to_pandas()

    price_pd = price.to_pandas()  # convert once and reuse for both axes

    fig = go.Figure()
    fig.add_trace(
        go.Scatter(
            x=price_pd["timestamp"],
            y=price_pd["close"],
            mode="lines",
            name="Price",
            line=dict(color="#2962FF", width=1.5),
        )
    )

    for sig_type, prob_col, color_hi, color_lo, sym, label in [
        ("rise", "probability_rise", "#00C853", "#B0BEC5", "triangle-up", "Rise"),
        ("fall", "probability_fall", "#C62828", "#B0BEC5", "triangle-down", "Fall"),
    ]:
        subset = merged[merged["signal_type"] == sig_type]
        if subset.empty:
            continue

        # Low confidence (ignored by strategy)
        low = subset[subset[prob_col] < 0.5]
        if not low.empty:
            fig.add_trace(
                go.Scatter(
                    x=low["timestamp"],
                    y=low["close"],
                    mode="markers",
                    name=f"{label} (low conf)",
                    marker=dict(symbol=sym, size=7, color=color_lo),
                )
            )

        # High confidence (acted on by strategy)
        high = subset[subset[prob_col] >= 0.5]
        if not high.empty:
            sizes = 10 + (high[prob_col] * 15)
            fig.add_trace(
                go.Scatter(
                    x=high["timestamp"],
                    y=high["close"],
                    mode="markers",
                    name=f"{label} (high conf)",
                    marker=dict(symbol=sym, size=sizes, color=color_hi, line=dict(width=1, color="black")),
                    text=[f"{p:.2f}" for p in high[prob_col]],
                    hovertemplate=f"<b>{label}</b><br>Price: %{{y:.2f}}<br>Conf: %{{text}}<extra></extra>",
                )
            )

    fig.update_layout(
        title=f"Validated Signals: {pair}",
        template="plotly_white",
        height=550,
        hovermode="x unified",
        legend=dict(orientation="h", y=1.02),
    )
    return fig


fig = plot_validated_signals(spot_df, validated_signals, pair="BTCUSDT")
fig.show()

### 8.3 Backtest Performance

In [None]:
from plotly.subplots import make_subplots


def plot_backtest_performance(results: dict):
    """Build a 3-panel chart (return, positions, drawdown); return None if there are no metrics."""
    metrics_df = results.get("metrics_df")
    if metrics_df is None or metrics_df.height == 0:
        print("No metrics to plot.")
        return

    if "timestamp" in metrics_df.columns:
        ts = (
            metrics_df.select(pl.from_epoch(pl.col("timestamp").cast(pl.Int64), time_unit="s").alias("dt"))
            .get_column("dt")
            .to_list()
        )
    else:
        ts = list(range(metrics_df.height))

    fig = make_subplots(
        rows=3,
        cols=1,
        shared_xaxes=True,
        vertical_spacing=0.06,
        subplot_titles=("Strategy Return (%)", "Open / Closed Positions", "Drawdown (%)"),
        row_heights=[0.4, 0.3, 0.3],
    )

    # Row 1: Total return
    if "total_return" in metrics_df.columns:
        ret_pct = (metrics_df.get_column("total_return") * 100).to_list()
        fig.add_trace(
            go.Scatter(
                x=ts,
                y=ret_pct,
                mode="lines",
                name="Return",
                line=dict(color="#1E88E5", width=2),
            ),
            row=1,
            col=1,
        )
    fig.add_hline(y=0, line_dash="dash", line_color="gray", row=1, col=1)

    # Row 2: Positions
    if "open_positions" in metrics_df.columns:
        fig.add_trace(
            go.Scatter(
                x=ts,
                y=metrics_df.get_column("open_positions").to_list(),
                mode="lines",
                name="Open",
                fill="tozeroy",
                line=dict(color="#43A047", width=1.5),
                fillcolor="rgba(67, 160, 71, 0.15)",
            ),
            row=2,
            col=1,
        )
    if "closed_positions" in metrics_df.columns:
        fig.add_trace(
            go.Scatter(
                x=ts,
                y=metrics_df.get_column("closed_positions").to_list(),
                mode="lines",
                name="Closed",
                line=dict(color="#8E24AA", width=1.5, dash="dot"),
            ),
            row=2,
            col=1,
        )

    # Row 3: Drawdown
    if "current_drawdown" in metrics_df.columns:
        dd_pct = [-d * 100 for d in metrics_df.get_column("current_drawdown").to_list()]
        fig.add_trace(
            go.Scatter(
                x=ts,
                y=dd_pct,
                mode="lines",
                name="Drawdown",
                line=dict(color="#E53935", width=2),
                fill="tozeroy",
                fillcolor="rgba(229, 57, 53, 0.15)",
            ),
            row=3,
            col=1,
        )

    final_return = results.get("final_return", 0) * 100
    fig.update_layout(
        title=f"Backtest Results | Return: {final_return:.2f}%",
        template="plotly_white",
        height=800,
        hovermode="x unified",
        legend=dict(orientation="h", y=1.02),
    )
    fig.update_yaxes(title_text="Return (%)", row=1, col=1)
    fig.update_yaxes(title_text="Count", row=2, col=1)
    fig.update_yaxes(title_text="Drawdown (%)", row=3, col=1)
    fig.update_xaxes(title_text="Date", row=3, col=1)
    return fig


fig = plot_backtest_performance(results)
if fig is not None:  # the function returns None when there are no metrics to plot
    fig.show()

## 9. Architecture & Next Steps

### Data Flow

```
Exchange APIs ─── BinanceSpotLoader ──┐
                  BybitSpotLoader  ───┤
                  OkxSpotLoader  ─────┤
                  VirtualDataProvider ┘
                         │
                    RawDataStore (DuckDB / SQLite / PostgreSQL)
                         │
                    RawDataFactory
                         │
                ┌── RawData / RawDataView ──┐
                │                           │
         FeaturePipeline              SignalDetector
          (Feature,                     (detect)
           GlobalFeature,                  │
           OffsetFeature)                  │
                │                      Signals
                │                          │
                └───── Labeler ────────────┘
                         │
                  SklearnSignalValidator
                         │
                  Validated Signals
                         │
                ┌── BacktestRunner ──┐
                │   (per-bar loop)   │
                │                    │
           EntryRules          ExitRules
                │                    │
                └── BacktestBroker ──┘
                     (executor)
                         │
                   StrategyState
                    (portfolio,
                     positions,
                     metrics)
```

### Key Design Decisions

- **Component Registry**: All classes decorated with `@sf_component(name=...)` are discoverable at runtime via `sf.get_component(type, name)`. This enables declarative config-driven pipelines.
- **Immutability**: Core containers (`RawData`, `Signals`, `Trade`, `OrderFill`) are frozen dataclasses for reproducibility.
- **Polars-first**: All internal data processing uses Polars; Pandas is available for visualization via `RawDataView.to_pandas()`.
- **Store Backends**: Choose `DuckDB` (fast, default), `SQLite` (zero extra deps), or `PostgreSQL` (multi-user, remote).
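
The first two decisions can be illustrated with a minimal, standard-library sketch. This is not SignalFlow's implementation: `register_component`, `get_component`, the `"detector"` kind, and `SmaCrossDetector` are hypothetical names that only mimic the pattern of a decorator-based registry combined with frozen dataclasses.

```python
from dataclasses import FrozenInstanceError, dataclass

# Hypothetical registry keyed by (component kind, component name).
_REGISTRY: dict[tuple[str, str], type] = {}


def register_component(kind: str, name: str):
    """Class decorator that records the class under (kind, name)."""

    def decorator(cls: type) -> type:
        key = (kind, name)
        if key in _REGISTRY:
            raise ValueError(f"component already registered: {key}")
        _REGISTRY[key] = cls
        return cls

    return decorator


def get_component(kind: str, name: str) -> type:
    """Look up a registered class; raises KeyError if it is unknown."""
    return _REGISTRY[(kind, name)]


@register_component("detector", "sma_cross")
@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class SmaCrossDetector:
    fast: int = 10
    slow: int = 30


# A declarative config can now select and parameterize a component by name.
config = {"type": "detector", "name": "sma_cross", "params": {"fast": 5, "slow": 20}}
detector = get_component(config["type"], config["name"])(**config["params"])

# Attempting to mutate a frozen dataclass raises FrozenInstanceError.
try:
    detector.fast = 99
    mutated = True
except FrozenInstanceError:
    mutated = False

print(detector, "mutated:", mutated)
```

Because the config carries only strings and plain parameters, it can live in a YAML or JSON file, and the frozen instance guarantees that a component's parameters stay the same for the whole run.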

### Next Steps

1. **Real exchange data** — Replace `VirtualDataProvider` with `BinanceSpotLoader` / `BybitSpotLoader` / `OkxSpotLoader`
2. **Custom features** — Build domain-specific indicators by extending `Feature` or `GlobalFeature`
3. **Triple barrier labeling** — Use `TripleBarrierLabeler` for volatility-aware, adaptive labels
4. **Model tuning** — Use `validator.tune()` with Optuna for hyperparameter optimization
5. **Live / paper trading** — Use `RealtimeRunner` with real executors for live execution
6. **Signal composition** — Combine detectors using `Signals.__add__()` to merge signal sets with priority logic
7. **Save / load models** — Use `validator.save(path)` and `SklearnSignalValidator.load(path)` for persistence
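
To make step 3 more concrete, here is a minimal pure-Python sketch of the general triple-barrier idea: a profit barrier and a stop barrier scaled by recent volatility, plus a time limit. It is an illustration of the technique only and does not reflect how `TripleBarrierLabeler` is implemented or parameterized; the function names, the `pt_mult` / `sl_mult` multipliers, and the sample prices are all assumptions.

```python
from statistics import pstdev


def rolling_volatility(prices: list[float], end: int, window: int) -> float:
    """Population std of simple returns over the `window` bars ending at index `end`."""
    start = max(1, end - window + 1)
    returns = [prices[i] / prices[i - 1] - 1 for i in range(start, end + 1)]
    return pstdev(returns) if len(returns) > 1 else 0.0


def triple_barrier_label(
    prices: list[float],
    entry: int,
    horizon: int,
    vol: float,
    pt_mult: float = 2.0,
    sl_mult: float = 2.0,
) -> int:
    """Return +1 if the profit barrier is hit first, -1 for the stop barrier, 0 on timeout."""
    if vol <= 0:
        return 0  # no volatility estimate, so no meaningful barriers
    entry_price = prices[entry]
    upper = entry_price * (1 + pt_mult * vol)  # profit-taking barrier
    lower = entry_price * (1 - sl_mult * vol)  # stop-loss barrier
    # Walk forward bar by bar until a barrier is touched or the horizon expires.
    for price in prices[entry + 1 : entry + 1 + horizon]:
        if price >= upper:
            return 1
        if price <= lower:
            return -1
    return 0  # vertical (time) barrier reached first


# Illustrative prices, not taken from the tutorial data.
prices = [100.0, 101.0, 100.0, 101.0, 100.0, 103.0, 104.0, 99.0]
entry = 4
vol = rolling_volatility(prices, end=entry, window=4)
label = triple_barrier_label(prices, entry=entry, horizon=3, vol=vol)
print(f"vol={vol:.5f} label={label}")
```

Because the barrier widths scale with `vol`, the same multipliers produce wide barriers in turbulent periods and tight ones in calm periods, which is what makes the labels adaptive.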

### Cleanup

In [None]:
spot_store.close()
strategy_store.close()

# Optionally remove tutorial databases:
# Path("tutorial.duckdb").unlink(missing_ok=True)
# Path("tutorial_strategy.duckdb").unlink(missing_ok=True)