# Binance Coin-Margined Futures Data Exploration

This notebook demonstrates how to load historical Binance Coin-Margined (CM) futures market data from `/data/binance_cm/` with pandas and build a few example visualizations for quick exploratory analysis.

## 1. Environment setup

We import the core Python libraries used for data manipulation and visualization.
- **pandas**: tabular data handling
- **numpy**: numerical helpers
- **matplotlib**/**seaborn**: charting utilities
- **pathlib**: convenient filesystem navigation

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting aesthetics
sns.set_theme(style='darkgrid', context='talk')
plt.rcParams['figure.figsize'] = (14, 6)

## 2. Locate available market data files

The dataset directory may contain multiple symbols (e.g., BTCUSD_PERP, ETHUSD_PERP) and multiple file formats (CSV or Parquet).
The snippet below recursively discovers all supported data files so that we can inspect what is available.

In [None]:
DATA_DIR = Path('/data/binance_cm')

if not DATA_DIR.exists():
    raise FileNotFoundError(f'Data directory {DATA_DIR} not found. Ensure the path is mounted inside the environment before running the notebook.')

# Collect both Parquet and CSV files
parquet_files = sorted(DATA_DIR.rglob('*.parquet'))
csv_files = sorted(DATA_DIR.rglob('*.csv'))
all_files = parquet_files + csv_files

print(f'Found {len(all_files)} data file(s) under {DATA_DIR}.')

# Preview the first few entries grouped by symbol
from collections import defaultdict
files_by_symbol = defaultdict(list)
for path in all_files:
    symbol = path.parent.name
    files_by_symbol[symbol].append(path)

preview = {symbol: paths[:3] for symbol, paths in files_by_symbol.items()}
for symbol, paths in preview.items():
    print(f"\nSymbol: {symbol}")
    for path in paths:
        print(f"  - {path.name}")
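
As a complement to the printed preview, the discovered files can also be tabulated per symbol and format. The following is a minimal sketch, assuming the `<DATA_DIR>/<symbol>/<file>` layout used above; `summarize_files` and the demo file names are hypothetical and not part of the dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def summarize_files(data_dir: Path) -> pd.DataFrame:
    """Count files and total bytes per (symbol, format) under data_dir."""
    records = [
        {
            "symbol": path.parent.name,  # assumes <data_dir>/<symbol>/<file>
            "format": path.suffix.lstrip("."),
            "size_bytes": path.stat().st_size,
        }
        for pattern in ("*.parquet", "*.csv")
        for path in data_dir.rglob(pattern)
    ]
    if not records:
        return pd.DataFrame(columns=["symbol", "format", "n_files", "size_bytes"])
    return (
        pd.DataFrame(records)
        .groupby(["symbol", "format"], as_index=False)
        .agg(n_files=("size_bytes", "count"), size_bytes=("size_bytes", "sum"))
    )


# Demonstration on a throwaway directory with made-up file names.
demo_files = [
    ("BTCUSD_PERP", "2024-01.csv"),
    ("BTCUSD_PERP", "2024-02.csv"),
    ("ETHUSD_PERP", "2024-01.csv"),
]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for symbol, name in demo_files:
        (root / symbol).mkdir(exist_ok=True)
        (root / symbol / name).write_text("timestamp,price\n")
    summary = summarize_files(root)

print(summary)
```

In the notebook itself, `summarize_files(DATA_DIR)` would give the same table for the real dataset.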

## 3. Helper to load a symbol's data

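The helper picks the pandas reader that matches the file format (Parquet is preferred, CSV is the fallback), parses timestamps as UTC, and returns a chronologically sorted, time-indexed DataFrame. The optional `limit` argument caps the number of rows returned, and gaps in numeric columns are forward-filled.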

In [None]:
def load_symbol_data(symbol: str, limit: int | None = None) -> pd.DataFrame:
    symbol_dir = DATA_DIR / symbol
    if not symbol_dir.exists():
        raise FileNotFoundError(f'Could not find directory for symbol {symbol!r} under {DATA_DIR}.')

    # Prefer parquet (more efficient); fall back to CSV
    files = sorted(symbol_dir.glob('*.parquet'))
    reader = pd.read_parquet
    if not files:
        files = sorted(symbol_dir.glob('*.csv'))
        # Timestamps are parsed below, so files with a `time` column load too
        reader = pd.read_csv

    if not files:
        raise FileNotFoundError(f'No parquet or CSV files found for symbol {symbol!r}.')

    frames: list[pd.DataFrame] = []
    rows_loaded = 0
    for path in files:
        df = reader(path)
        # Standardize timestamp parsing
        if 'timestamp' in df.columns:
            df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True, errors='coerce')
        elif 'time' in df.columns:
            df['timestamp'] = pd.to_datetime(df['time'], unit='ms', utc=True, errors='coerce')
        else:
            raise KeyError('Expected a timestamp or time column in the data files.')

        df = df.set_index('timestamp').sort_index()
        frames.append(df)

        rows_loaded += len(df)
        if limit is not None and rows_loaded >= limit:
            break

    data = pd.concat(frames).sort_index()

    if limit is not None:
        data = data.iloc[:limit]

    # Forward-fill to reduce missing values for numeric columns
    numeric_cols = data.select_dtypes(include=[np.number]).columns
    data[numeric_cols] = data[numeric_cols].ffill()

    return data
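
The helper accepts two timestamp layouts: a `timestamp` column with parseable date strings, or a `time` column holding epoch milliseconds. The sketch below uses small made-up frames to show that both paths produce the same UTC values, and that `errors='coerce'` silently turns unparseable entries into `NaT`. The sample values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Layout 1: ISO-like strings in a `timestamp` column
iso = pd.DataFrame({
    "timestamp": ["2024-01-01 00:00:00", "2024-01-01 00:00:01"],
    "price": [1.0, 2.0],
})
# Layout 2: epoch milliseconds in a `time` column
epoch = pd.DataFrame({
    "time": [1704067200000, 1704067201000],
    "price": [1.0, 2.0],
})

iso_ts = pd.to_datetime(iso["timestamp"], utc=True, errors="coerce")
epoch_ts = pd.to_datetime(epoch["time"], unit="ms", utc=True, errors="coerce")
print((iso_ts == epoch_ts).all())

# Unparseable values become NaT rather than raising
bad = pd.to_datetime(
    pd.Series(["2024-01-01 00:00:00", "not a date"]), utc=True, errors="coerce"
)
print(bad.isna().sum())
```

Because coerced values end up as `NaT` in the index, it may be worth checking `data.index.isna().sum()` after loading.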

## 4. Load a sample symbol

By default, `sample_symbol` is the first symbol discovered in section 2; set it to any symbol present in your dataset. The cell loads that symbol's data and displays the first rows.

In [None]:
sample_symbol = next(iter(files_by_symbol)) if files_by_symbol else None
if sample_symbol is None:
    raise RuntimeError('No symbols detected. Ensure the dataset directory contains files.')

cm_df = load_symbol_data(sample_symbol)
cm_df.head()

### Summary statistics

In [None]:
cm_df.describe().T
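
Summary statistics do not reveal duplicated timestamps or gaps in the time index. A quick check along the following lines can help; it is a sketch only, where `timestamp_report`, its `expected_step` default, and the demo frame are hypothetical.

```python
import pandas as pd


def timestamp_report(df: pd.DataFrame, expected_step: str = "1s") -> dict:
    """Summarize duplicates and gaps in a sorted DatetimeIndex."""
    idx = df.index
    deltas = idx.to_series().diff().dropna()
    return {
        "rows": len(df),
        "duplicate_timestamps": int(idx.duplicated().sum()),
        "largest_gap": deltas.max() if not deltas.empty else pd.Timedelta(0),
        "gaps_over_expected": int((deltas > pd.Timedelta(expected_step)).sum()),
    }


# Made-up data: one duplicated timestamp and one 4-second gap
demo_index = pd.to_datetime(
    [
        "2024-01-01 00:00:00",
        "2024-01-01 00:00:01",
        "2024-01-01 00:00:01",
        "2024-01-01 00:00:05",
    ],
    utc=True,
)
demo = pd.DataFrame({"price": [100.0, 100.5, 100.5, 101.0]}, index=demo_index)
report = timestamp_report(demo)
print(report)
```

Applied to the loaded frame, this would be `timestamp_report(cm_df)`, with `expected_step` set to the sampling interval of your data.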

## 5. Resample to higher timeframes

High-frequency data can be noisy. We aggregate it into 1-minute candles (open, high, low, close) plus summed traded volume, which makes the plots easier to read. Intervals with no observations are dropped.

In [None]:
price_candidates = ['price', 'close', 'last_price', 'mark_price']
volume_candidates = ['volume', 'qty', 'quantity', 'base_volume']

price_col = next((col for col in price_candidates if col in cm_df.columns), None)
volume_col = next((col for col in volume_candidates if col in cm_df.columns), None)

if price_col is None:
    raise KeyError('Could not identify a price-like column. Please update `price_candidates`.')
if volume_col is None:
    raise KeyError('Could not identify a volume-like column. Please update `volume_candidates`.')

resampled = cm_df[[price_col, volume_col]].rename(columns={price_col: 'price', volume_col: 'volume'})
resampled = resampled.resample('1min').agg({
    'price': ['first', 'max', 'min', 'last'],
    'volume': 'sum'
}).dropna()

# Flatten column MultiIndex
resampled.columns = ['open', 'high', 'low', 'close', 'volume']
resampled.head()
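
To make the aggregation concrete, here is a small worked example on made-up ticks. It uses pandas' built-in `Resampler.ohlc()` as an alternative to the `agg` dictionary above; the tick values are illustrative only.

```python
import pandas as pd

# Five synthetic ticks: three in minute 00:00, none in 00:01, two in 00:02
ticks = pd.DataFrame(
    {
        "price": [100.0, 101.0, 99.5, 100.5, 102.0],
        "volume": [1.0, 2.0, 1.5, 0.5, 3.0],
    },
    index=pd.to_datetime(
        [
            "2024-01-01 00:00:05",
            "2024-01-01 00:00:20",
            "2024-01-01 00:00:50",
            "2024-01-01 00:02:10",
            "2024-01-01 00:02:40",
        ],
        utc=True,
    ),
)

# ohlc() yields open/high/low/close columns directly
candles = ticks["price"].resample("1min").ohlc()
candles["volume"] = ticks["volume"].resample("1min").sum()

# The empty 00:01 bin has NaN prices (and zero volume); drop it
candles = candles.dropna(subset=["close"])
print(candles)
```

The first candle opens at 100.0, peaks at 101.0, and closes at its low of 99.5 with a total volume of 4.5; the empty minute is removed, leaving two rows.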

## 6. Visualizations

We plot both price action and traded volume over time. Adjust the resampling window or columns to suit your analysis needs.

In [None]:
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(16, 10))

resampled['close'].plot(ax=axes[0], color='dodgerblue')
axes[0].set_title(f'{sample_symbol} Close Price (1-minute)')
axes[0].set_ylabel('Price')

resampled['volume'].plot(ax=axes[1], color='salmon')
axes[1].set_title(f'{sample_symbol} Volume (1-minute)')
axes[1].set_ylabel('Volume')
axes[1].set_xlabel('Timestamp (UTC)')

plt.tight_layout()

### Rolling volatility

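Rolling volatility (the standard deviation of returns over a trailing window) is a useful diagnostic for regime changes. Here the window is 30 one-minute bars, and the per-minute figure is multiplied by √30 to express it over a 30-minute horizon.
Adjust the window size to match your strategy horizon.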

In [None]:
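# Simple returns between consecutive 1-minute closes
returns = resampled['close'].pct_change()
# Per-minute standard deviation, scaled by sqrt(30) to a 30-minute horizon
rolling_vol = returns.rolling(window=30, min_periods=15).std() * np.sqrt(30)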

ax = rolling_vol.plot(color='mediumseagreen')
ax.set_title(f'{sample_symbol} Rolling Volatility (30-minute window)')
ax.set_ylabel('Volatility (σ)')
ax.set_xlabel('Timestamp (UTC)')
plt.tight_layout()
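
The cell above scales per-minute volatility to a 30-minute horizon. If an annualized figure is more convenient for comparison, one possible variant is sketched below. It assumes continuous 24/7 trading (so 365 × 24 × 60 minutes per year), uses log returns instead of simple returns, and runs on a synthetic price series; none of these choices come from the original notebook.

```python
import numpy as np
import pandas as pd

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: the market trades around the clock

# Synthetic 1-minute close prices (random walk in log space)
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=500, freq="1min", tz="UTC")
close = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0, 1e-4, size=500))), index=index
)

log_returns = np.log(close).diff()
per_minute_vol = log_returns.rolling(window=30, min_periods=15).std()
annualized_vol = per_minute_vol * np.sqrt(MINUTES_PER_YEAR)

print(annualized_vol.dropna().describe())
```

With the notebook's data, `close` would be replaced by `resampled['close']`. The square-root-of-time scaling treats returns as uncorrelated across minutes, which is a simplification.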

## 7. Next steps

- Compare multiple symbols by repeating the workflow for each directory.
- Join with funding-rate or open-interest data to understand market structure.
- Export aggregated DataFrames with `DataFrame.to_parquet()` for downstream backtesting pipelines.
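
As an illustration of the funding-rate join suggested above, `pandas.merge_asof` can attach the most recent funding observation to each candle. This is a sketch under stated assumptions: the column names `funding_time` and `funding_rate`, the timestamps, and the rates are all hypothetical placeholders.

```python
import pandas as pd

# Hypothetical 1-minute candles (only the close column is shown)
candles = pd.DataFrame(
    {"close": [100.0, 100.5, 101.0, 100.8]},
    index=pd.to_datetime(
        [
            "2024-01-01 07:59",
            "2024-01-01 08:00",
            "2024-01-01 08:01",
            "2024-01-01 16:00",
        ],
        utc=True,
    ),
)
candles.index.name = "timestamp"

# Hypothetical funding-rate observations
funding = pd.DataFrame(
    {
        "funding_time": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 08:00", "2024-01-01 16:00"], utc=True
        ),
        "funding_rate": [0.0001, 0.0002, -0.0001],
    }
)

# For each candle, take the latest funding row at or before its timestamp.
# Both inputs must be sorted by their key columns.
joined = pd.merge_asof(
    candles.reset_index(),
    funding,
    left_on="timestamp",
    right_on="funding_time",
    direction="backward",
).set_index("timestamp")

print(joined[["close", "funding_rate"]])
```

Using `direction="backward"` avoids look-ahead: a candle never sees a funding rate published after its own timestamp.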

This notebook can serve as a starting point for deeper feature engineering and signal research on Binance Coin-Margined futures markets.