# Data Exploration

Goal: establish what data we have, how complete it is, and the basic characteristics of each orderbook.

**Unit of analysis**: The orderbook (each token_id represents one orderbook)

**Key questions**:
1. What data do we have? (inventory)
2. What is the data quality? (completeness, gaps)
3. What are the characteristics of each orderbook?

In [21]:
import pandas as pd
import numpy as np
import json
from datetime import datetime, timedelta
from sqlalchemy import create_engine
import plotly.express as px
import plotly.graph_objects as go

pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_rows', 50)

engine = create_engine('postgresql://admin:quest@localhost:8812/qdb')

## 1. Data Inventory

What tables do we have and what's in them?

In [22]:
# Table-level statistics
inventory = []

for table, ts_col in [('trades', 'timestamp'), ('orderbook_snapshots', 'timestamp'), ('markets', 'creation_time')]:
    try:
        q = f"SELECT count() as n, min({ts_col}) as first_ts, max({ts_col}) as last_ts FROM {table}"
        row = pd.read_sql(q, engine).iloc[0]
        inventory.append({
            'table': table,
            'records': int(row['n']),
            'first': row['first_ts'],
            'last': row['last_ts']
        })
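    except Exception as e:
        # Report the failure instead of silently recording an empty table
        print(f"Could not query {table}: {e}")
        inventory.append({'table': table, 'records': 0, 'first': None, 'last': None})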

df_inventory = pd.DataFrame(inventory)
df_inventory['duration'] = (df_inventory['last'] - df_inventory['first']).apply(
    lambda x: f"{x.days}d {x.seconds//3600}h" if pd.notna(x) else None
)
display(df_inventory)

Unnamed: 0,table,records,first,last,duration
0,trades,29165,2025-02-05 23:41:11.000000,2026-01-19 11:11:54.000000,347d 11h
1,orderbook_snapshots,38379,2026-01-19 07:12:35.433000,2026-01-19 11:12:21.376000,0d 3h
2,markets,14,2025-01-05 01:34:48.860057,2026-01-04 18:56:17.032325,364d 17h


In [23]:
# Count unique orderbooks (token_ids)
n_orderbooks = pd.read_sql("SELECT count(distinct token_id) as n FROM orderbook_snapshots", engine).iloc[0]['n']
n_markets = pd.read_sql("SELECT count(distinct market_id) as n FROM markets", engine).iloc[0]['n']

print(f"Unique orderbooks (token_ids): {n_orderbooks}")
print(f"Unique markets: {n_markets}")
print("Expected orderbooks per market: ~2 (Yes/No tokens)")

Unique orderbooks (token_ids): 28
Unique markets: 14
Expected orderbooks per market: ~2 (Yes/No tokens)
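
The 28-to-14 ratio matches two tokens per market in aggregate, but the same totals would also appear if one market had three tokens and another had one. A minimal sketch of a per-market check follows, assuming `orderbook_snapshots` carries `market_id` alongside `token_id` (the catalog query later in this notebook selects both). The `tokens_per_market` helper and the sample frame are hypothetical.

```python
import pandas as pd

def tokens_per_market(df: pd.DataFrame) -> pd.Series:
    """Count distinct token_ids for each market_id."""
    return df.groupby('market_id')['token_id'].nunique()

# Hypothetical sample standing in for a (market_id, token_id) extract
sample = pd.DataFrame({
    'market_id': ['m1', 'm1', 'm2', 'm2', 'm3'],
    'token_id': ['a', 'b', 'c', 'd', 'e'],
})

counts = tokens_per_market(sample)
unpaired = counts[counts != 2]  # markets that do not have exactly two tokens
print(unpaired.to_dict())
```

An empty result on the real data would confirm that every market has exactly one Yes and one No orderbook.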


## 2. Orderbook Catalog

**Master reference**: Every orderbook with its market context and current state.

This is the foundation for all later analysis: we need to know what each token_id represents.

In [24]:
# Build token_id -> (market_slug, outcome_name) mapping from markets metadata
markets_df = pd.read_sql("SELECT market_id, market_slug, question, metadata FROM markets", engine)

token_info = {}
for _, row in markets_df.iterrows():
    try:
        meta = json.loads(row['metadata']) if isinstance(row['metadata'], str) else row['metadata']
        outcomes = meta.get('outcomes', ['Yes', 'No'])
        token_ids = meta.get('clob_token_ids', [])
        
        for i, tid in enumerate(token_ids):
            token_info[tid] = {
                'market_id': row['market_id'],
                'market_slug': row['market_slug'],
                'question': row['question'][:80],
                'outcome': outcomes[i] if i < len(outcomes) else f'Outcome_{i}'
            }
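    except (TypeError, ValueError, AttributeError) as e:
        # Skip markets whose metadata is missing or malformed, but report them
        print(f"Skipping market {row['market_id']}: {e}")
        continue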

print(f"Token mappings built: {len(token_info)}")

Token mappings built: 28
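
Having 28 mappings and 28 orderbooks does not by itself show that they are the same 28 identifiers. A small sketch of a coverage check is shown here. The `mapping_coverage` helper is hypothetical, and it compares identifiers as strings on the assumption that the two sources may not share a type.

```python
def mapping_coverage(snapshot_ids, token_info):
    """Compare token_ids seen in snapshots with those present in the mapping."""
    snapshot_ids = {str(t) for t in snapshot_ids}
    mapped_ids = {str(t) for t in token_info}
    return {
        'unmapped': sorted(snapshot_ids - mapped_ids),  # in snapshots, no metadata
        'unused': sorted(mapped_ids - snapshot_ids),    # in metadata, no snapshots
    }

# Hypothetical identifiers for illustration
sample_snapshot_ids = ['101', '102', '103']
sample_token_info = {'101': {'outcome': 'Yes'}, '102': {'outcome': 'No'}, '104': {'outcome': 'Yes'}}

coverage = mapping_coverage(sample_snapshot_ids, sample_token_info)
print(coverage)
```

Any `unmapped` entry would surface later as an `'Unknown'` outcome in the catalog.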


In [25]:
# Get latest snapshot for each orderbook to see current state
latest_query = """
SELECT 
    token_id,
    market_id,
    last(mid_price) as last_price,
    last(spread_bps) as last_spread_bps,
    last(total_bid_volume) as last_bid_vol,
    last(total_ask_volume) as last_ask_vol,
    count() as n_snapshots,
    max(timestamp) as last_seen
FROM orderbook_snapshots
"""

df_orderbooks = pd.read_sql(latest_query, engine)

# Enrich with market context
df_orderbooks['outcome'] = df_orderbooks['token_id'].map(lambda x: token_info.get(x, {}).get('outcome', 'Unknown'))
df_orderbooks['question'] = df_orderbooks['token_id'].map(lambda x: token_info.get(x, {}).get('question', 'Unknown'))

# Classify by price regime
def price_regime(p):
    if p < 0.10: return 'Low (<10%)'
    elif p > 0.90: return 'High (>90%)'
    else: return 'Mid (10-90%)'

df_orderbooks['price_regime'] = df_orderbooks['last_price'].apply(price_regime)

# Sort by last price descending
df_orderbooks = df_orderbooks.sort_values('last_price', ascending=False)

print(f"\nOrderbook Catalog ({len(df_orderbooks)} orderbooks):")
display(df_orderbooks[['question', 'outcome', 'last_price', 'last_spread_bps', 'price_regime', 'n_snapshots']].head(30))


Orderbook Catalog (28 orderbooks):


Unnamed: 0,question,outcome,last_price,last_spread_bps,price_regime,n_snapshots
16,Fed increases interest rates by 25+ bps after January 2026 meeting?,No,0.9995,10.005003,High (>90%),1372
1,Will Slavia Pragu win the 2025–26 Champions League?,No,0.9985,10.015023,High (>90%),1371
10,Will Leeds win the 2025–26 English Premier League?,No,0.9985,10.015023,High (>90%),1372
14,Fed decreases interest rates by 50+ bps after January 2026 meeting?,No,0.9975,10.025063,High (>90%),1372
22,Will Chelsea Clinton win the 2028 Democratic presidential nomination?,No,0.9965,10.035123,High (>90%),1371
3,Will Elon cut the budget by at least 5% in 2025?,No,0.996,20.080321,High (>90%),1372
24,Will MrBeast win the 2028 Democratic presidential nomination?,No,0.9955,10.045203,High (>90%),1371
27,Will Oprah Winfrey win the 2028 Democratic presidential nomination?,No,0.9945,10.055304,High (>90%),1371
26,Will Andrew Yang win the 2028 Democratic presidential nomination?,No,0.9945,10.055304,High (>90%),1371
5,Will Elon and DOGE cut more than $250b in federal spending in 2025?,No,0.988,40.48583,High (>90%),1372


In [26]:
# Price regime distribution
regime_counts = df_orderbooks['price_regime'].value_counts()
print("\nPrice Regime Distribution:")
for regime, count in regime_counts.items():
    pct = count / len(df_orderbooks) * 100
    print(f"  {regime}: {count} orderbooks ({pct:.0f}%)")

# Visual
fig = px.pie(values=regime_counts.values, names=regime_counts.index, 
             title='Orderbooks by Price Regime',
             color_discrete_sequence=['#ff6b6b', '#4ecdc4', '#45b7d1'])
fig.update_layout(height=350)
fig.show()


Price Regime Distribution:
  High (>90%): 14 orderbooks (50%)
  Low (<10%): 14 orderbooks (50%)
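
An even 14/14 split is what one would expect if every market's Yes and No mid-prices roughly sum to 1, so that one side sits above 90% whenever the other sits below 10%. The sketch below checks that assumption per market. The `complement_check` helper, the 0.02 tolerance, and the sample prices are all hypothetical.

```python
import pandas as pd

def complement_check(df: pd.DataFrame, tol: float = 0.02) -> pd.DataFrame:
    """Return two-token markets whose prices deviate from summing to 1 by more than tol."""
    per_market = df.groupby('market_id')['last_price'].agg(['sum', 'count'])
    per_market['deviation'] = (per_market['sum'] - 1.0).abs()
    return per_market[(per_market['count'] == 2) & (per_market['deviation'] > tol)]

# Hypothetical prices: m1 is complementary, m2 is not
sample = pd.DataFrame({
    'market_id': ['m1', 'm1', 'm2', 'm2'],
    'last_price': [0.9995, 0.0005, 0.935, 0.10],
})

flagged = complement_check(sample)
print(flagged)
```

Markets flagged this way would usually have a wide spread or a stale book on one side.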


## 3. Data Quality

How complete is our data? Are there gaps?

In [27]:
# Snapshot frequency per orderbook
freq_query = """
SELECT 
    token_id,
    count() as n_snapshots,
    min(timestamp) as first_snapshot,
    max(timestamp) as last_snapshot
FROM orderbook_snapshots
WHERE timestamp > dateadd('d', -3, now())
"""

df_freq = pd.read_sql(freq_query, engine)
df_freq['hours_covered'] = (df_freq['last_snapshot'] - df_freq['first_snapshot']).dt.total_seconds() / 3600
df_freq['snapshots_per_hour'] = df_freq['n_snapshots'] / df_freq['hours_covered'].replace(0, np.nan)
df_freq['outcome'] = df_freq['token_id'].map(lambda x: token_info.get(x, {}).get('outcome', '?'))
df_freq['question'] = df_freq['token_id'].map(lambda x: token_info.get(x, {}).get('question', '?')[:50])

print("Snapshot frequency (last 3 days):")
display(df_freq[['question', 'outcome', 'n_snapshots', 'snapshots_per_hour']].sort_values('snapshots_per_hour', ascending=False))

Snapshot frequency (last 3 days):


Unnamed: 0,question,outcome,n_snapshots,snapshots_per_hour
14,Will Oprah Winfrey win the 2028 Democratic preside,No,1373,364.277434
2,Will Elon cut the budget by at least 5% in 2025?,No,1374,362.789944
22,Will Elon cut the budget by at least 5% in 2025?,Yes,1374,362.789944
1,Will MrBeast win the 2028 Democratic presidential,No,1373,362.028573
23,Will MrBeast win the 2028 Democratic presidential,Yes,1373,362.028573
19,Will Andrew Yang win the 2028 Democratic president,Yes,1373,361.294661
25,Will Andrew Yang win the 2028 Democratic president,No,1373,361.294661
8,Will Oprah Winfrey win the 2028 Democratic preside,Yes,1374,360.662711
16,Will Chelsea Clinton win the 2028 Democratic presi,No,1373,358.248003
11,Will Chelsea Clinton win the 2028 Democratic presi,Yes,1373,358.248003


In [28]:
# Check for data gaps - count snapshots per hour
gaps_query = """
SELECT 
    timestamp,
    count() as n_snapshots
FROM orderbook_snapshots
WHERE timestamp > dateadd('d', -3, now())
SAMPLE BY 1h
ALIGN TO CALENDAR
"""

df_gaps = pd.read_sql(gaps_query, engine)

# Identify low-data hours
median_per_hour = df_gaps['n_snapshots'].median()
low_data_hours = df_gaps[df_gaps['n_snapshots'] < median_per_hour * 0.5]

print(f"Median snapshots per hour: {median_per_hour:.0f}")
print(f"Hours with <50% of median: {len(low_data_hours)}")

# Visualize
fig = go.Figure()
fig.add_trace(go.Bar(x=df_gaps['timestamp'], y=df_gaps['n_snapshots'], name='Snapshots'))
fig.add_hline(y=median_per_hour, line_dash='dash', line_color='red', 
              annotation_text=f'Median: {median_per_hour:.0f}')
fig.update_layout(title='Data Collection Completeness (snapshots/hour)', height=350,
                  xaxis_title='Time', yaxis_title='Snapshots')
fig.show()

Median snapshots per hour: 9952
Hours with <50% of median: 1
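
One caveat with the check above: as far as I understand QuestDB's `SAMPLE BY`, a bucket with no rows is omitted when no `FILL` clause is given, so an hour with zero snapshots would be missing from `df_gaps` rather than counted as low. A sketch of restoring those hours in pandas is shown here, with a hypothetical `fill_missing_hours` helper and sample counts.

```python
import pandas as pd

def fill_missing_hours(df_gaps: pd.DataFrame) -> pd.Series:
    """Reindex hourly counts onto a complete hourly range, using 0 for absent hours."""
    counts = df_gaps.set_index('timestamp')['n_snapshots']
    full_range = pd.date_range(counts.index.min(), counts.index.max(),
                               freq=pd.Timedelta(hours=1))
    return counts.reindex(full_range, fill_value=0)

# Hypothetical hourly counts with the 09:00 bucket absent
sample = pd.DataFrame({
    'timestamp': pd.to_datetime(['2026-01-19 07:00', '2026-01-19 08:00', '2026-01-19 10:00']),
    'n_snapshots': [100, 120, 110],
})

filled = fill_missing_hours(sample)
print(filled)
```

Running the median and low-data comparison on the filled series would then count fully missing hours as well.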


## 4. Trading Activity

Which orderbooks have actual trading activity?

In [29]:
# Trades per orderbook
trades_query = """
SELECT 
    token_id,
    market_id,
    count() as n_trades,
    sum(value) as total_volume,
    avg(size) as avg_trade_size
FROM trades
WHERE timestamp > dateadd('d', -7, now())
"""

df_trades = pd.read_sql(trades_query, engine)
df_trades['outcome'] = df_trades['token_id'].map(lambda x: token_info.get(x, {}).get('outcome', '?'))
df_trades['question'] = df_trades['token_id'].map(lambda x: token_info.get(x, {}).get('question', '?')[:50])

# Sort by volume
df_trades = df_trades.sort_values('total_volume', ascending=False)

print("\nTrading Activity by Orderbook (last 7 days):")
display(df_trades[['question', 'outcome', 'n_trades', 'total_volume', 'avg_trade_size']].head(20))


Trading Activity by Orderbook (last 7 days):


Unnamed: 0,question,outcome,n_trades,total_volume,avg_trade_size
18,Fed increases interest rates by 25+ bps after Janu,No,1803,5127792.0,2848.082613
24,Fed decreases interest rates by 50+ bps after Janu,No,1866,4947141.0,2659.288388
27,No change in Fed interest rates after January 2026,Yes,2053,1379746.0,699.684239
3,Will Chelsea Clinton win the 2028 Democratic presi,No,1436,1359076.0,949.764271
12,Will Oprah Winfrey win the 2028 Democratic preside,No,1269,1001536.0,792.667204
16,Will Leeds win the 2025–26 English Premier League?,No,1864,612244.5,328.93944
0,Will MrBeast win the 2028 Democratic presidential,No,1020,534071.5,525.959201
22,Khamenei out as Supreme Leader of Iran by January,No,1032,363841.0,380.563464
5,Will Andrew Yang win the 2028 Democratic president,No,891,349001.7,393.857701
20,Fed decreases interest rates by 25 bps after Janua,No,1358,321525.8,245.316234


In [30]:
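# Activity distribution. df_trades only contains token_ids with at least one
# trade in the 7-day window, so the first count equals its row count.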
active_orderbooks = len(df_trades[df_trades['n_trades'] > 0])
high_activity = len(df_trades[df_trades['n_trades'] >= 100])

print("\nActivity Summary:")
print(f"  Orderbooks with any trades: {active_orderbooks}")
print(f"  Orderbooks with 100+ trades: {high_activity}")
print(f"  Total volume (7d): ${df_trades['total_volume'].sum():,.0f}")


Activity Summary:
  Orderbooks with any trades: 28
  Orderbooks with 100+ trades: 24
  Total volume (7d): $16,501,979
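
In the table above, the two largest orderbooks (both Fed markets) account for roughly 61% of the seven-day volume: 5,127,792 + 4,947,141 out of 16,501,979. A small sketch for computing that concentration directly follows; the `volume_share` helper and the sample volumes are hypothetical.

```python
import pandas as pd

def volume_share(df: pd.DataFrame, top_n: int = 2) -> float:
    """Fraction of total volume contributed by the top_n orderbooks."""
    ranked = df['total_volume'].sort_values(ascending=False)
    return ranked.head(top_n).sum() / ranked.sum()

# Hypothetical volumes for illustration
sample = pd.DataFrame({'total_volume': [50.0, 30.0, 15.0, 5.0]})

share = volume_share(sample, top_n=2)
print(f"Top-2 share: {share:.0%}")
```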


In [31]:
# Trade value distribution. LIMIT returns the first 10,000 matching rows,
# not a random sample of the 7-day window.
size_query = """
SELECT size, value FROM trades 
WHERE timestamp > dateadd('d', -7, now())
LIMIT 10000
"""
df_sizes = pd.read_sql(size_query, engine)

if len(df_sizes) > 0:
    print(f"\nTrade Size Distribution (n={len(df_sizes)}):")
    print(df_sizes['value'].describe().round(2))
    
    fig = px.histogram(df_sizes, x='value', nbins=50, title='Trade Value Distribution',
                       labels={'value': 'Trade Value (USD)'})
    fig.update_layout(height=350)
    fig.update_yaxes(type='log')
    fig.show()


Trade Size Distribution (n=10000):
count    10000.00
mean       475.59
std       1607.41
min          0.00
25%          6.00
50%         51.42
75%        334.19
max      36317.30
Name: value, dtype: float64
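
The mean (475.59) is about nine times the median (51.42), which points to a heavy right tail that linear bins compress into the first few bars. A sketch of logarithmic binning is shown below. It sets aside zero-valued trades, since the minimum above is 0.00 and zero has no logarithm. The `log_binned_counts` helper and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def log_binned_counts(values: pd.Series, bins_per_decade: int = 1) -> pd.DataFrame:
    """Histogram positive values on log10-spaced bin edges."""
    positive = values[values > 0]
    lo = np.floor(np.log10(positive.min()))
    hi = np.ceil(np.log10(positive.max()))
    hi = max(hi, lo + 1)  # guarantee at least one bin
    edges = np.logspace(lo, hi, int((hi - lo) * bins_per_decade) + 1)
    counts, _ = np.histogram(positive, bins=edges)
    return pd.DataFrame({'bin_start': edges[:-1], 'bin_end': edges[1:], 'count': counts})

# Hypothetical trade values, including one zero that is excluded
sample_values = pd.Series([0.0, 1.5, 6.0, 51.42, 334.19, 36317.30])

binned = log_binned_counts(sample_values)
print(binned)
```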


## 5. Summary

A per-orderbook summary that combines the catalog with 7-day trading activity.

In [32]:
# Create summary dataframe combining orderbook info with trading activity
df_summary = df_orderbooks[['token_id', 'question', 'outcome', 'last_price', 'last_spread_bps', 'price_regime', 'n_snapshots']].copy()

# Merge with trading data
trade_cols = df_trades[['token_id', 'n_trades', 'total_volume']].copy()
df_summary = df_summary.merge(trade_cols, on='token_id', how='left')
df_summary['n_trades'] = df_summary['n_trades'].fillna(0).astype(int)
df_summary['total_volume'] = df_summary['total_volume'].fillna(0)

# Classify activity level
def activity_level(n):
    if n == 0: return 'Inactive'
    elif n < 50: return 'Low'
    elif n < 500: return 'Medium'
    else: return 'High'

df_summary['activity'] = df_summary['n_trades'].apply(activity_level)

print("\nComplete Orderbook Summary:")
display(df_summary.sort_values('total_volume', ascending=False))


Complete Orderbook Summary:


Unnamed: 0,token_id,question,outcome,last_price,last_spread_bps,price_regime,n_snapshots,n_trades,total_volume,activity
0,42139849929574046088630785796780813725435914859433767469767950066058132350666,Fed increases interest rates by 25+ bps after January 2026 meeting?,No,0.9995,10.005003,High (>90%),1372,1803,5127792.0,High
3,71478852790279095447182996049071040792010759617668969799049179229104800573786,Fed decreases interest rates by 50+ bps after January 2026 meeting?,No,0.9975,10.025063,High (>90%),1372,1866,4947141.0,High
12,112838095111461683880944516726938163688341306245473734071798778736646352193304,No change in Fed interest rates after January 2026 meeting?,Yes,0.9615,10.400416,High (>90%),1372,2053,1379746.0,High
4,76057920052421891902791411567177996435483774677774664174053982044923692373687,Will Chelsea Clinton win the 2028 Democratic presidential nomination?,No,0.9965,10.035123,High (>90%),1371,1436,1359076.0,High
7,44046525152074436753629616217525652949736205933417533647338274649282385796755,Will Oprah Winfrey win the 2028 Democratic presidential nomination?,No,0.9945,10.055304,High (>90%),1371,1269,1001536.0,High
2,79742853692640441630014447218057231927327132180925275737649345065452061252209,Will Leeds win the 2025–26 English Premier League?,No,0.9985,10.015023,High (>90%),1372,1864,612244.5,High
6,93910251767329056141706007839545721784238465183449597462421810827705938093892,Will MrBeast win the 2028 Democratic presidential nomination?,No,0.9955,10.045203,High (>90%),1371,1020,534071.5,High
13,62595435619678438799673612599999067112702849851098967060818869994133628780778,Khamenei out as Supreme Leader of Iran by January 31?,No,0.935,106.951872,High (>90%),1372,1032,363841.0,High
8,21234886083000978203440582985665409733123024063268807484684316771005736994028,Will Andrew Yang win the 2028 Democratic presidential nomination?,No,0.9945,10.055304,High (>90%),1371,891,349001.7,High
11,48193521645113703700467246669338225849301704920590102230072263970163239985027,Fed decreases interest rates by 25 bps after January 2026 meeting?,No,0.9645,10.368066,High (>90%),1372,1358,321525.8,High
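
The left merge above assumes one row per `token_id` on each side; a duplicate on either side would silently multiply rows. A sketch using the `validate` and `indicator` arguments of `DataFrame.merge` to make that assumption explicit is shown here, with hypothetical sample frames.

```python
import pandas as pd

# Hypothetical stand-ins for df_orderbooks and df_trades
orderbooks = pd.DataFrame({'token_id': ['a', 'b', 'c'], 'last_price': [0.99, 0.01, 0.50]})
trades = pd.DataFrame({'token_id': ['a', 'b'], 'n_trades': [1803, 12]})

merged = orderbooks.merge(
    trades, on='token_id', how='left',
    validate='one_to_one',  # raises MergeError if token_id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Orderbooks with no matching trades row
no_trades = merged.loc[merged['_merge'] == 'left_only', 'token_id'].tolist()
merged['n_trades'] = merged['n_trades'].fillna(0).astype(int)
print(no_trades)
```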


In [None]:
# Export for use in other notebooks
df_summary.to_csv('orderbook_summary.csv', index=False)
print("Saved orderbook_summary.csv for use in liquidity analysis")
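
The `token_id` values are integers of around 77 digits, far beyond the 64-bit range, so a notebook that reads this file back may want to request them as strings instead of relying on type inference. A minimal round-trip sketch follows, using an in-memory buffer in place of the file.

```python
import io
import pandas as pd

# One token_id taken from the summary table above
df = pd.DataFrame({
    'token_id': ['42139849929574046088630785796780813725435914859433767469767950066058132350666'],
    'last_price': [0.9995],
})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Read token_id explicitly as text so every digit is preserved
restored = pd.read_csv(buf, dtype={'token_id': str})
print(restored['token_id'].iloc[0] == df['token_id'].iloc[0])
```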

In [33]:
engine.dispose()
print("Done.")

Done.
