# Data Loading & Resampling

This notebook demonstrates how to load market data from DuckDB stores and use
SignalFlow's OHLCV resampling utilities to work with multiple timeframes.

**What you'll learn:**
- Generate synthetic OHLCV data with `VirtualDataProvider`
- Load data using `RawDataFactory` and the `sf.load()` shortcut
- Detect the timeframe of existing data automatically
- Resample between timeframes (e.g. 1m to 1h, 1m to 4h)
- Check exchange-specific timeframe support

**SignalFlow version:** 0.5.0

## 1. Generate Synthetic Data

We use `VirtualDataProvider` to create realistic OHLCV bars via a geometric
random walk. This lets us explore the data loading and resampling APIs without
needing exchange credentials or real market data.

In [11]:
from datetime import datetime
from pathlib import Path

import signalflow as sf
from signalflow.data.source import VirtualDataProvider
from signalflow.data.raw_store import DuckDbSpotStore
from signalflow.data import RawDataFactory

# Create a temporary DuckDB store
db_path = Path("/tmp/data_loading_demo.duckdb")
store = DuckDbSpotStore(db_path=db_path)

# Generate 10,000 one-minute bars for each of 3 pairs
provider = VirtualDataProvider(store=store, seed=42)
provider.download(
    pairs=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
    n_bars=10_000,
)

print(f"Store created at: {db_path}")
print(f"Store stats:\n{store.get_stats()}")

2026-02-15 00:50:32.143 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:153 - Database initialized: /tmp/data_loading_demo.duckdb (data_type=spot, timeframe=1m)
2026-02-15 00:50:32.201 | DEBUG    | signalflow.data.raw_store.duckdb_stores:insert_klines:220 - Inserted 10,000 rows for BTCUSDT
2026-02-15 00:50:32.202 | INFO     | signalflow.data.source.virtual:download:255 - VirtualDataProvider: generated 10000 bars for BTCUSDT
2026-02-15 00:50:32.263 | DEBUG    | signalflow.data.raw_store.duckdb_stores:insert_klines:220 - Inserted 10,000 rows for ETHUSDT
2026-02-15 00:50:32.264 | INFO     | signalflow.data.source.virtual:download:255 - VirtualDataProvider: generated 10000 bars for ETHUSDT

Store created at: /tmp/data_loading_demo.duckdb
Store stats:
shape: (3, 5)
┌─────────┬───────┬─────────────────────┬─────────────────────┬──────────────┐
│ pair    ┆ rows  ┆ first_candle        ┆ last_candle         ┆ total_volume │
│ ---     ┆ ---   ┆ ---                 ┆ ---                 ┆ ---          │
│ str     ┆ i64   ┆ datetime[μs]        ┆ datetime[μs]        ┆ f64          │
╞═════════╪═══════╪═════════════════════╪═════════════════════╪══════════════╡
│ BTCUSDT ┆ 10000 ┆ 2024-01-01 00:00:00 ┆ 2024-01-07 22:39:00 ┆ 1.8066e7     │
│ ETHUSDT ┆ 10000 ┆ 2024-01-01 00:00:00 ┆ 2024-01-07 22:39:00 ┆ 1.7957e7     │
│ SOLUSDT ┆ 10000 ┆ 2024-01-01 00:00:00 ┆ 2024-01-07 22:39:00 ┆ 1.7978e7     │
└─────────┴───────┴─────────────────────┴─────────────────────┴──────────────┘
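To make the "geometric random walk" idea concrete, here is a minimal standard-library sketch. It is not `VirtualDataProvider`'s implementation: the function name `synthetic_bars`, the starting price, the volatility `sigma`, and the volume range are all assumptions chosen for illustration.

```python
import math
import random
from datetime import datetime, timedelta


def synthetic_bars(n_bars, start=datetime(2024, 1, 1), start_price=100.0,
                   sigma=0.001, seed=42):
    """Generate 1-minute OHLCV bars from a geometric random walk (illustrative)."""
    rng = random.Random(seed)
    bars = []
    price = start_price
    for i in range(n_bars):
        open_ = price
        # Multiplicative step: the log-return of each bar is normally distributed.
        close = open_ * math.exp(rng.gauss(0.0, sigma))
        # Widen the range so high >= max(open, close) and low <= min(open, close).
        high = max(open_, close) * (1.0 + abs(rng.gauss(0.0, sigma)))
        low = min(open_, close) * (1.0 - abs(rng.gauss(0.0, sigma)))
        bars.append({
            "timestamp": start + timedelta(minutes=i),
            "open": open_, "high": high, "low": low, "close": close,
            "volume": rng.uniform(1_000.0, 2_000.0),  # arbitrary volume range
        })
        price = close  # the next bar opens at this bar's close
    return bars
```

With `n_bars=10_000` and a start of 2024-01-01 00:00, the last timestamp is 2024-01-07 22:39, which matches the `last_candle` column in the store stats above.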


## 2. Load Data with RawDataFactory

`RawDataFactory.from_duckdb_spot_store()` gives you full control over data
loading: pair selection, date range filtering, schema validation, deduplication,
and optional auto-resampling.

In [12]:
raw_data = RawDataFactory.from_duckdb_spot_store(
    spot_store_path=db_path,
    pairs=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
    start=datetime(2020, 1, 1),
    end=datetime(2030, 1, 1),
    data_types=["spot"],
)

spot_df = raw_data.get("spot")
print(f"Shape: {spot_df.shape}")
print(f"Pairs: {spot_df['pair'].unique().sort().to_list()}")
print(f"Time range: {spot_df['timestamp'].min()} .. {spot_df['timestamp'].max()}")
print(f"Columns: {spot_df.columns}")

2026-02-15 00:50:32.348 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:153 - Database initialized: /tmp/data_loading_demo.duckdb (data_type=spot, timeframe=1m)


Shape: (30000, 8)
Pairs: ['BTCUSDT', 'ETHUSDT', 'SOLUSDT']
Time range: 2024-01-01 00:00:00 .. 2024-01-07 22:39:00
Columns: ['pair', 'timestamp', 'open', 'high', 'low', 'close', 'volume', 'trades']


## 3. Load with sf.load() Shortcut

For quick exploration, `sf.load()` wraps the factory method in a single call.
It accepts a path to a `.duckdb` file and returns a `RawData` container.

In [13]:
raw_quick = sf.load(
    db_path,
    pairs=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
    start="2024-01-01",
    timeframe="1m",
)

print(f"Loaded pairs: {raw_quick.pairs}")
print(f"Spot shape: {raw_quick.get('spot').shape}")

2026-02-15 00:50:32.381 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:153 - Database initialized: /tmp/data_loading_demo.duckdb (data_type=spot, timeframe=1m)


Loaded pairs: ['BTCUSDT', 'ETHUSDT', 'SOLUSDT']
Spot shape: (30000, 8)


## 4. Detect Timeframe

`detect_timeframe()` computes the most common timestamp delta across all pairs
and maps it to the nearest known timeframe string.

In [14]:
from signalflow.data.resample import detect_timeframe

df = raw_data.get("spot")
detected_tf = detect_timeframe(df)
print(f"Detected timeframe: {detected_tf}")

Detected timeframe: 1m
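The "most common delta" approach can be sketched in plain Python as follows. This is an illustration of the idea rather than SignalFlow's code: `detect_timeframe_sketch`, the `(pair, timestamp)` input shape, and the abbreviated `KNOWN_TIMEFRAMES` table are assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

# Abbreviated timeframe table (minutes); an assumption for this sketch.
KNOWN_TIMEFRAMES = {"1m": 1, "5m": 5, "15m": 15, "1h": 60, "4h": 240, "1d": 1440}


def detect_timeframe_sketch(rows):
    """rows: iterable of (pair, timestamp) tuples. Returns a timeframe string."""
    by_pair = {}
    for pair, ts in rows:
        by_pair.setdefault(pair, []).append(ts)

    deltas = Counter()
    for timestamps in by_pair.values():
        timestamps.sort()
        # Measure gaps within each pair only, never across pairs.
        for prev, cur in zip(timestamps, timestamps[1:]):
            deltas[(cur - prev).total_seconds() / 60.0] += 1

    if not deltas:
        raise ValueError("need at least two timestamps for one pair")

    # The modal delta is robust to occasional gaps in the data.
    mode_minutes = deltas.most_common(1)[0][0]
    # Map the modal delta to the closest known timeframe.
    return min(KNOWN_TIMEFRAMES,
               key=lambda tf: abs(KNOWN_TIMEFRAMES[tf] - mode_minutes))
```

Using the mode rather than the minimum or mean delta means a handful of missing candles does not change the result.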


## 5. OHLCV Resampling

`resample_ohlcv()` aggregates candles from a smaller timeframe into a larger one
using the standard OHLCV aggregation rules:

| Column   | Aggregation |
|----------|-------------|
| `open`   | first       |
| `high`   | max         |
| `low`    | min         |
| `close`  | last        |
| `volume` | sum         |
| `trades` | sum         |

In [15]:
from signalflow.data.resample import resample_ohlcv

df_1m = raw_data.get("spot")
print(f"Original (1m): {df_1m.shape}")

df_1h = resample_ohlcv(df_1m, source_tf="1m", target_tf="1h")
print(f"Resampled (1h): {df_1h.shape}")

df_4h = resample_ohlcv(df_1m, source_tf="1m", target_tf="4h")
print(f"Resampled (4h): {df_4h.shape}")

Original (1m): (30000, 8)
Resampled (1h): (501, 8)
Resampled (4h): (126, 8)
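The aggregation rules in the table can be sketched for a single pair using only the standard library. This is not how `resample_ohlcv()` is implemented; `resample_sketch`, the dict-per-bar representation, and the epoch-aligned bucketing are assumptions made for illustration. The input must already be in time order so that "first" and "last" are meaningful.

```python
from datetime import datetime, timedelta


def resample_sketch(bars, target_minutes):
    """Aggregate time-ordered bars of one pair into target_minutes buckets."""
    epoch = datetime(1970, 1, 1)
    width = timedelta(minutes=target_minutes)
    buckets = {}
    for bar in bars:
        # Floor the timestamp to the start of its bucket.
        start = epoch + ((bar["timestamp"] - epoch) // width) * width
        agg = buckets.get(start)
        if agg is None:
            # First bar in the bucket supplies `open` (copied, not mutated).
            buckets[start] = {**bar, "timestamp": start}
        else:
            agg["high"] = max(agg["high"], bar["high"])  # high: max
            agg["low"] = min(agg["low"], bar["low"])     # low: min
            agg["close"] = bar["close"]                  # close: last
            agg["volume"] += bar["volume"]               # volume: sum
            agg["trades"] += bar["trades"]               # trades: sum
    return [buckets[k] for k in sorted(buckets)]


bars = [
    {"timestamp": datetime(2024, 1, 1) + timedelta(minutes=i),
     "open": 100.0, "high": 101.0, "low": 99.0, "close": 100.5,
     "volume": 1.0, "trades": 2}
    for i in range(10_000)
]
print(len(resample_sketch(bars, 60)))   # 167 (166 full hours + one partial)
print(len(resample_sketch(bars, 240)))  # 42
```

These counts are consistent with the shapes printed above, assuming the final partial bucket is kept: 10,000 one-minute bars span 167 hourly buckets per pair (3 × 167 = 501) and 42 four-hour buckets per pair (3 × 42 = 126).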


## 6. Auto-Detect and Resample

`align_to_timeframe()` combines detection and resampling in one step: it
auto-detects the source timeframe and resamples to the target if possible.
If resampling is not possible (e.g. the target is not a multiple of the
source), the data is returned unchanged with a warning.

In [16]:
from signalflow.data.resample import align_to_timeframe

df_auto_1h = align_to_timeframe(df_1m, target_tf="1h")
print(f"Auto-resampled to 1h: {df_auto_1h.shape}")

Auto-resampled to 1h: (501, 8)
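The fallback behaviour described above (return the data unchanged with a warning) might look roughly like the sketch below. It is a hypothetical illustration, not `align_to_timeframe()` itself: detection is assumed to have already produced `source_tf`, the `resample` callable is supplied by the caller, and treating equal timeframes as a silent no-op is an assumption.

```python
import warnings

# Abbreviated timeframe table (minutes); an assumption for this sketch.
MINUTES = {"1m": 1, "5m": 5, "15m": 15, "1h": 60, "4h": 240, "6h": 360}


def align_sketch(bars, source_tf, target_tf, resample):
    """Resample when target is a whole multiple of source; otherwise pass through."""
    src, dst = MINUTES[source_tf], MINUTES[target_tf]
    if dst == src:
        return bars  # already at the target timeframe
    if dst < src or dst % src != 0:
        warnings.warn(
            f"cannot resample {source_tf} -> {target_tf}; returning data unchanged"
        )
        return bars
    return resample(bars, dst)
```

Returning the input instead of raising keeps a pipeline running, but callers that need the target timeframe should check the result rather than assume the resample happened.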


## 7. Auto-Resampling During Data Loading

Both `RawDataFactory.from_duckdb_spot_store()` and `RawDataFactory.from_stores()`
accept a `target_timeframe` parameter. When it is set, the data is resampled
automatically after loading, so no separate resampling step is needed.

In [17]:
raw_1h = RawDataFactory.from_duckdb_spot_store(
    spot_store_path=db_path,
    pairs=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
    start=datetime(2020, 1, 1),
    end=datetime(2030, 1, 1),
    data_types=["spot"],
    target_timeframe="1h",
)
print(f"Auto-resampled spot: {raw_1h.get('spot').shape}")

2026-02-15 00:50:32.488 | INFO     | signalflow.data.raw_store.duckdb_stores:_ensure_tables:153 - Database initialized: /tmp/data_loading_demo.duckdb (data_type=spot, timeframe=1m)


Auto-resampled spot: (501, 8)


## 8. Exchange Timeframe Support

Not every exchange supports every timeframe. SignalFlow ships with
`EXCHANGE_TIMEFRAMES` (a mapping of exchange name to supported timeframes)
and helper functions for choosing a timeframe an exchange can serve and for
checking whether one timeframe can be resampled into another.

In [18]:
from signalflow.data.resample import (
    select_best_timeframe,
    can_resample,
    EXCHANGE_TIMEFRAMES,
    TIMEFRAME_MINUTES,
)

# Show standard timeframes
print("Standard timeframes:")
for tf, minutes in TIMEFRAME_MINUTES.items():
    print(f"  {tf:>4s} = {minutes:>5} min")

print("\nExchange support:")
for exchange, tfs in EXCHANGE_TIMEFRAMES.items():
    # Sort by duration for readability
    sorted_tfs = sorted(tfs, key=lambda t: TIMEFRAME_MINUTES[t])
    print(f"  {exchange:>15s}: {', '.join(sorted_tfs)}")

Standard timeframes:
    1m =     1 min
    3m =     3 min
    5m =     5 min
   15m =    15 min
   30m =    30 min
    1h =    60 min
    2h =   120 min
    4h =   240 min
    6h =   360 min
    8h =   480 min
   12h =   720 min
    1d =  1440 min

Exchange support:
          binance: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, 1d
            bybit: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d
              okx: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d
      kraken_spot: 1m, 5m, 15m, 30m, 1h, 4h, 1d
   kraken_futures: 1m, 5m, 15m, 30m, 1h, 4h, 12h, 1d
          deribit: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 12h, 1d
      hyperliquid: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 8h, 12h, 1d
         whitebit: 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, 1d


In [19]:
# Find best timeframe for an exchange
# Bybit does not support 8h natively, so select_best_timeframe picks
# the largest supported TF that evenly divides 8h.
best = select_best_timeframe("bybit", target_tf="8h")
print(f"Best Bybit TF for 8h target: {best}")

# Binance supports 8h directly
best_binance = select_best_timeframe("binance", target_tf="8h")
print(f"Best Binance TF for 8h target: {best_binance}")

# Check if resampling is possible
print(f"\nCan resample 1m -> 1h? {can_resample('1m', '1h')}")
print(f"Can resample 1h -> 1m? {can_resample('1h', '1m')}")
print(f"Can resample 1h -> 4h? {can_resample('1h', '4h')}")
print(f"Can resample 1h -> 3h? {can_resample('1h', '3h')}")

Best Bybit TF for 8h target: 4h
Best Binance TF for 8h target: 8h

Can resample 1m -> 1h? True
Can resample 1h -> 1m? False
Can resample 1h -> 4h? True
Can resample 1h -> 3h? False
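The results above are consistent with a simple divisibility rule, sketched below. This is a guess at the logic, not the library's source: the `_sketch` function names are hypothetical, the two tables are copied from the listings printed earlier in this section, and returning `False` for equal timeframes is an assumption.

```python
# Minutes per timeframe, copied from the TIMEFRAME_MINUTES listing printed above.
TF_MINUTES = {
    "1m": 1, "3m": 3, "5m": 5, "15m": 15, "30m": 30, "1h": 60,
    "2h": 120, "4h": 240, "6h": 360, "8h": 480, "12h": 720, "1d": 1440,
}

# Bybit's supported timeframes, copied from the exchange listing printed above.
BYBIT = ["1m", "3m", "5m", "15m", "30m", "1h", "2h", "4h", "6h", "12h", "1d"]


def can_resample_sketch(source_tf, target_tf):
    """True when both timeframes are known and target is a larger whole multiple."""
    if source_tf not in TF_MINUTES or target_tf not in TF_MINUTES:
        return False  # e.g. "3h" is not in the standard timeframe table
    src, dst = TF_MINUTES[source_tf], TF_MINUTES[target_tf]
    return dst > src and dst % src == 0


def select_best_sketch(supported, target_tf):
    """Return target_tf if supported, else the largest supported TF dividing it."""
    if target_tf in supported:
        return target_tf
    target = TF_MINUTES[target_tf]
    candidates = [tf for tf in supported
                  if TF_MINUTES[tf] < target and target % TF_MINUTES[tf] == 0]
    if not candidates:
        raise ValueError(f"no supported timeframe divides {target_tf}")
    return max(candidates, key=TF_MINUTES.__getitem__)


print(select_best_sketch(BYBIT, "8h"))  # 4h
```

Bybit's 6h is larger than 4h but is skipped because 480 minutes is not a multiple of 360. Picking the largest divisor minimises the number of candles that must be downloaded before resampling.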


## 9. Summary

| Function | Purpose |
|----------|--------|
| `sf.load()` | Quick data loading from DuckDB |
| `RawDataFactory.from_duckdb_spot_store()` | Full-control data loading with validation |
| `detect_timeframe()` | Auto-detect timeframe from data |
| `resample_ohlcv()` | Resample OHLCV between timeframes |
| `align_to_timeframe()` | Auto-detect source TF + resample |
| `select_best_timeframe()` | Find best exchange TF for a target |
| `can_resample()` | Check if resampling is possible |

## Cleanup

In [20]:
store.close()
db_path.unlink(missing_ok=True)
print("Temporary DuckDB file removed. Done!")

Temporary DuckDB file removed. Done!


## Next Steps

- [01 - Quick Start](01_quickstart.ipynb): Run your first backtest in 5 minutes
- [02 - Custom Detector](02_custom_detector.ipynb): Create your own signal detector
- [04 - Pipeline Visualization](04_visualization.ipynb): Visualize your strategy pipeline
- [05 - Advanced Strategies](05_advanced_strategies.ipynb): Multi-detector ensembles