# Squeeze Analytics - EDA (SQLite)

This notebook does initial exploratory data analysis (EDA) directly from your local `ohlc.sqlite3` file.

Focus:
- Load key tables (`ohlc`, `alerts`, `trade_plans`)
- Normalize epoch timestamps (ms)
- Convert all timestamps to **Australia/Sydney** (AEDT/AEST automatically depending on date)
- Basic sanity checks, missing data, distributions

Notes:
- SQLite has no dedicated datetime type; this database stores timestamps as integers (epoch milliseconds).
- `Australia/Sydney` handles DST transitions for you.
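If you are not sure whether a timestamp column holds seconds or milliseconds, a magnitude check is a quick way to guess. The sketch below is only a heuristic, and `guess_epoch_unit` is a hypothetical helper (it is not part of this notebook's helpers). It assumes present-day timestamps, where epoch seconds are around 1.7e9 and epoch milliseconds around 1.7e12.

```python
import pandas as pd


def guess_epoch_unit(values) -> str:
    """Guess 's' or 'ms' from the magnitude of the median value (hypothetical helper)."""
    s = pd.to_numeric(pd.Series(values), errors='coerce').dropna()
    if s.empty:
        raise ValueError('No numeric timestamps to inspect')
    # Assumption: 1e11 read as milliseconds is in the early 1970s, and read as
    # seconds is thousands of years away, so it separates the two units for modern data.
    return 'ms' if s.abs().median() >= 1e11 else 's'


sample_ms = [1768313070803, 1768313130803]
sample_s = [1768313070, 1768313130]
print(guess_epoch_unit(sample_ms), guess_epoch_unit(sample_s))
```

Treat the result as a hint and confirm it by converting a few rows and checking that the dates look plausible.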


In [None]:
import sqlite3
from pathlib import Path

import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 140)

DB_PATH = Path('ohlc.sqlite3')
assert DB_PATH.exists(), f'Missing DB file: {DB_PATH.resolve()}'
TZ = 'Australia/Sydney'


## Helpers: timestamp normalization + timezone conversion

Your DB uses integers like `1768313070803` which are **epoch milliseconds**.

We convert to UTC first (`utc=True`), then convert to Sydney time.
Sydney will automatically show AEST vs AEDT depending on the date.
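As a quick check of the conversion described above, this small sketch converts two epoch-millisecond values and shows the UTC offset pandas picks for each. The first value is the example from this section; the second (`1751328000000`, i.e. 2025-07-01 00:00:00 UTC) is an assumed mid-winter value added only for contrast.

```python
import pandas as pd

# First value: the example from the text. Second value: assumed winter timestamp.
epoch_ms = pd.Series([1768313070803, 1751328000000])

utc = pd.to_datetime(epoch_ms, unit='ms', utc=True)
syd = utc.dt.tz_convert('Australia/Sydney')

demo = pd.DataFrame({
    'epoch_ms': epoch_ms,
    'utc': utc,
    'sydney': syd,
    # UTC offset in hours for each converted timestamp
    'offset_hours': [ts.utcoffset().total_seconds() / 3600 for ts in syd],
})
print(demo)
```

The January row should show a +11:00 offset (AEDT) and the July row +10:00 (AEST), with no manual DST handling.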

In [None]:
def to_utc_datetime(ts: pd.Series | np.ndarray | list, unit: str = 'ms') -> pd.Series:
    """Convert an epoch timestamp series to timezone-aware UTC datetimes.

    unit: 'ms' for epoch milliseconds, 's' for seconds.
    """
    s = pd.Series(ts)
    # best-effort numeric coercion
    s = pd.to_numeric(s, errors='coerce')
    return pd.to_datetime(s, unit=unit, utc=True)


def utc_to_sydney(dt_utc: pd.Series) -> pd.Series:
    """Convert a tz-aware UTC datetime series to Australia/Sydney time."""
    if getattr(dt_utc.dt, 'tz', None) is None:
        raise ValueError('Expected a tz-aware series (UTC). Build it with to_utc_datetime(...).')
    return dt_utc.dt.tz_convert(TZ)


def add_sydney_time(df: pd.DataFrame, col: str, unit: str = 'ms', prefix: str | None = None) -> pd.DataFrame:
    """Add UTC + Sydney datetime columns derived from an epoch timestamp column.

    Adds: <prefix>_dt_utc, <prefix>_dt_syd
    """
    if prefix is None:
        prefix = col
    out = df.copy()
    out[f'{prefix}_dt_utc'] = to_utc_datetime(out[col], unit=unit)
    out[f'{prefix}_dt_syd'] = utc_to_sydney(out[f'{prefix}_dt_utc'])
    return out


## List tables and row counts

In [None]:
with sqlite3.connect(DB_PATH) as conn:
    tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name", conn)
    display(tables)

    counts = []
    for t in tables['name'].tolist():
        # Quote the table name so names containing spaces or keywords still work
        n = pd.read_sql_query(f'SELECT COUNT(*) AS n FROM "{t}"', conn)['n'].iloc[0]
        counts.append((t, int(n)))
    counts_df = pd.DataFrame(counts, columns=['table','rows']).sort_values('rows', ascending=False)
    display(counts_df)
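One detail worth knowing: `with sqlite3.connect(...) as conn` commits or rolls back the transaction, but it does not close the connection. The sketch below wraps the same table-count idea in `contextlib.closing` so the file handle is released. `table_row_counts` and the throwaway demo database (including its `trade plans` table name) are hypothetical, used here only to make the example runnable.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def table_row_counts(db_path: str) -> pd.DataFrame:
    """Return (table, rows) for every table in a SQLite file (hypothetical helper)."""
    # closing() guarantees conn.close(); sqlite3's own context manager does not.
    with closing(sqlite3.connect(db_path)) as conn:
        names = [r[0] for r in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        rows = []
        for name in names:
            # Double-quote the identifier and escape embedded quotes
            quoted = '"' + name.replace('"', '""') + '"'
            n = conn.execute(f'SELECT COUNT(*) FROM {quoted}').fetchone()[0]
            rows.append((name, int(n)))
    return pd.DataFrame(rows, columns=['table', 'rows']).sort_values('rows', ascending=False)


# Demo on a temporary database with made-up tables
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.sqlite3')
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute('CREATE TABLE ohlc (open_time INTEGER)')
        conn.execute('CREATE TABLE "trade plans" (ts INTEGER)')
        conn.executemany('INSERT INTO ohlc VALUES (?)', [(1,), (2,), (3,)])
        conn.commit()
    counts_demo = table_row_counts(demo_path)

print(counts_demo)
```

For a one-off notebook the unclosed connection rarely matters, but closing it explicitly avoids lingering file locks if another process writes to the same file.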


## Load OHLC sample

The `ohlc` table usually contains `open_time` and `close_time` in epoch ms.

In [None]:
with sqlite3.connect(DB_PATH) as conn:
    ohlc = pd.read_sql_query("SELECT * FROM ohlc LIMIT 50000", conn)

ohlc.head()


In [None]:
# Normalize numeric columns
for c in ['open','high','low','close','volume']:
    if c in ohlc.columns:
        ohlc[c] = pd.to_numeric(ohlc[c], errors='coerce')

# Add timezone-aware datetimes
if 'open_time' in ohlc.columns:
    ohlc = add_sydney_time(ohlc, 'open_time', unit='ms', prefix='open')
if 'close_time' in ohlc.columns:
    ohlc = add_sydney_time(ohlc, 'close_time', unit='ms', prefix='close')

ohlc[['exchange','symbol','interval','open_time','open_dt_utc','open_dt_syd','close_time','close_dt_syd']].head()


### Basic sanity checks

In [None]:
ohlc.isna().mean().sort_values(ascending=False).head(20)


In [None]:
ohlc[['open','high','low','close','volume']].describe(percentiles=[0.01,0.05,0.5,0.95,0.99]).T
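Beyond missing values and summary statistics, candles have a few structural invariants that are cheap to test. The sketch below is one possible set of checks, not a definitive list. `ohlc_violations` is a hypothetical helper, and the toy frame is made-up data that reuses the column names from the `ohlc` table above.

```python
import pandas as pd


def ohlc_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows that break basic candle invariants, with one flag column per check."""
    checks = pd.DataFrame({
        # high must be at least as large as both open and close
        'high_below_body': df['high'] < df[['open', 'close']].max(axis=1),
        # low must be no larger than both open and close
        'low_above_body': df['low'] > df[['open', 'close']].min(axis=1),
        'negative_volume': df['volume'] < 0,
        # a candle should close strictly after it opens
        'close_before_open': df['close_time'] <= df['open_time'],
    })
    return df[checks.any(axis=1)].join(checks)


# Made-up candles: row 1 has high < open, row 2 has close_time == open_time
toy = pd.DataFrame({
    'open_time': [0, 60_000, 120_000],
    'close_time': [59_999, 119_999, 120_000],
    'open': [10.0, 11.0, 12.0],
    'high': [11.0, 10.5, 12.5],
    'low': [9.5, 10.0, 11.5],
    'close': [10.5, 10.8, 12.2],
    'volume': [100.0, 50.0, 75.0],
})
bad = ohlc_violations(toy)
print(bad)
```

Note that comparisons against NaN evaluate to False, so rows with missing prices are not flagged here; the missing-data check above covers those.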


## Load Alerts + Trade Plans
The `alerts` and `trade_plans` tables are often the best starting point for strategy exploration.

In [None]:
with sqlite3.connect(DB_PATH) as conn:
    alerts = pd.read_sql_query("SELECT * FROM alerts LIMIT 200000", conn)
    trade_plans = pd.read_sql_query("SELECT * FROM trade_plans LIMIT 200000", conn)

alerts.shape, trade_plans.shape


In [None]:
# Convert timestamp columns found in these tables
for col in ['ts','created_ts']:
    if col in alerts.columns:
        alerts = add_sydney_time(alerts, col, unit='ms', prefix=col)

if 'ts' in trade_plans.columns:
    trade_plans = add_sydney_time(trade_plans, 'ts', unit='ms', prefix='ts')

alerts[['exchange','symbol','signal','source_tf','ts','ts_dt_syd']].head()


### Alert distributions

In [None]:
alerts['signal'].value_counts(dropna=False).head(30)


In [None]:
alerts['source_tf'].value_counts(dropna=False).head(30)


### Time-of-day / day-of-week in Sydney time
To study session effects, compute hour and day-of-week features *after* timezone conversion: the same instant can fall on a different hour, and even a different weekday, in UTC than in Sydney.

In [None]:
alerts['hour_syd'] = alerts['ts_dt_syd'].dt.hour
alerts['dow_syd'] = alerts['ts_dt_syd'].dt.day_name()

display(alerts['dow_syd'].value_counts())
display(alerts['hour_syd'].value_counts().sort_index().head(24))
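To see why the order matters, this small sketch takes the example timestamp from the helpers section (`1768313070803`) and compares the hour and weekday derived from UTC with those derived from Sydney time. It uses only standard pandas calls, and the `compare` frame is purely illustrative.

```python
import pandas as pd

ts_utc = pd.to_datetime(pd.Series([1768313070803]), unit='ms', utc=True)
ts_syd = ts_utc.dt.tz_convert('Australia/Sydney')

# Same instant, two clocks: the derived features differ
compare = pd.DataFrame({
    'clock': ['UTC', 'Australia/Sydney'],
    'hour': [int(ts_utc.dt.hour.iloc[0]), int(ts_syd.dt.hour.iloc[0])],
    'day': [ts_utc.dt.day_name().iloc[0], ts_syd.dt.day_name().iloc[0]],
})
print(compare)
```

Here the UTC view is Tuesday afternoon while the Sydney view is early Wednesday morning, so bucketing by UTC would assign the alert to the wrong local session.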


## Optional: Spark / Databricks SQL examples (Sydney time)
If you end up loading these tables into Databricks, here are equivalent conversions.

In [None]:
# Spark example (Databricks / PySpark)
# from pyspark.sql import functions as F
# df = spark.table('ohlc')
# df = df.withColumn('open_ts_utc', (F.col('open_time')/1000).cast('timestamp'))
# # Interpret as UTC then convert to Sydney
# df = df.withColumn('open_time_syd', F.from_utc_timestamp(F.col('open_ts_utc'), 'Australia/Sydney'))
# display(df.select('exchange','symbol','interval','open_time','open_time_syd').limit(10))

# SQL example:
# %sql
# SELECT
#   exchange, symbol, interval,
#   open_time,
#   -- timestamp_millis (Spark 3.1+) takes epoch milliseconds directly
#   from_utc_timestamp(timestamp_millis(open_time), 'Australia/Sydney') AS open_time_syd
# FROM ohlc
# LIMIT 10;
