# Kalshi NYC Temperature Market — Raw Data Exploration

This notebook helps you understand how the **raw Kalshi data** for NYC high-temperature prediction markets works. We'll load the JSON files, parse their structure, and visualize what the market is telling us.

## What is Kalshi?

Kalshi is a CFTC-regulated prediction market where people trade contracts on real-world outcomes. For weather, you're trading on questions like *"Will the high temp in NYC be between 32°–33° on Feb 4, 2026?"* Each contract pays $1 if the outcome is **Yes**, and $0 if **No**. Prices are quoted in **cents** (1–99), so a price of 45¢ means the market implies a probability of roughly 45%.
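
The price-to-probability reading above can be made concrete with a little arithmetic. This is a minimal sketch that ignores trading fees (an assumption); the function names are hypothetical and are not part of any Kalshi API.

```python
from typing import Tuple


def implied_probability(price_cents: int) -> float:
    """Read a Yes price in cents as an implied probability."""
    if not 1 <= price_cents <= 99:
        raise ValueError("prices are expected to lie between 1 and 99 cents")
    return price_cents / 100


def yes_payoffs(price_cents: int) -> Tuple[int, int]:
    """Return (profit if Yes, profit if No) in cents for one Yes contract.

    Assumption: fees are ignored, so the contract costs exactly its price
    and pays 100 cents on Yes or 0 cents on No.
    """
    return 100 - price_cents, -price_cents


def expected_profit_cents(price_cents: int, believed_prob: float) -> float:
    """Expected profit of buying one Yes contract given your own probability."""
    win, loss = yes_payoffs(price_cents)
    return believed_prob * win + (1 - believed_prob) * loss


print(implied_probability(45))
print(yes_payoffs(45))
print(expected_profit_cents(45, 0.50))
```

Under these assumptions, buying Yes only has positive expected value when your own probability estimate is above the price expressed as a fraction.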

## 1. Raw Data File Structure

The `data/raw/kalshi/` folder contains two types of files:

| File Type | Naming Pattern | Contents |
|-----------|----------------|----------|
| **Event Markets** | `KXHIGHNY-26FEB04_20260212_021232_event_markets.json` | Event metadata + list of all markets (one per temp range) |
| **Orderbook** | `KXHIGHNY-26FEB04-B32.5_20260212_021232_orderbook.json` | Bids/asks for a single market ticker |
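
The naming patterns in the table can be split into their parts with a regular expression. The sketch below assumes the layout is `<ticker>_<YYYYMMDD>_<HHMMSS>_<kind>.json`, that tickers never contain underscores, and that the two numeric groups are the capture date and time of the snapshot (their timezone is not stated in the filenames). `parse_snapshot_name` and `SnapshotName` are hypothetical helpers.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed layout: <ticker>_<YYYYMMDD>_<HHMMSS>_<event_markets|orderbook>.json
FILENAME_RE = re.compile(
    r"^(?P<ticker>[^_]+)_(?P<date>\d{8})_(?P<time>\d{6})"
    r"_(?P<kind>event_markets|orderbook)\.json$"
)


class SnapshotName(NamedTuple):
    ticker: str            # event ticker or market ticker
    captured_at: datetime  # naive datetime; timezone is unknown
    kind: str              # "event_markets" or "orderbook"


def parse_snapshot_name(filename: str) -> SnapshotName:
    """Split a raw data filename into ticker, capture time, and file kind."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename!r}")
    captured_at = datetime.strptime(match["date"] + match["time"], "%Y%m%d%H%M%S")
    return SnapshotName(match["ticker"], captured_at, match["kind"])


print(parse_snapshot_name("KXHIGHNY-26FEB04_20260212_021232_event_markets.json"))
```

Parsing the timestamp explicitly makes it possible to sort snapshots by capture time rather than relying on filename ordering.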

**Ticker naming:**
- `KXHIGHNY-26FEB04` = NYC high temp event for Feb 4, 2026
- `B32.5` = "Between" 32–33° (the decimal is the midpoint)
- `T36` = "Top" — 36° or above
- `T30` = "Top" of low range — 29° or below (less than 30°)
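
The naming rules above can be sketched as a small parser. It assumes tickers always have three hyphen-separated parts and that the two-digit year means 20xx; `parse_market_ticker` and `ParsedTicker` are hypothetical names. Note that the `T` prefix alone does not say whether the range is open above or below (compare `T36` and `T30`), so the sketch returns only the threshold and leaves the direction to the market's `strike_type` or `subtitle`.

```python
import math
from datetime import date
from typing import NamedTuple, Optional, Tuple

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}


class ParsedTicker(NamedTuple):
    series: str                           # e.g. "KXHIGHNY"
    event_date: date                      # date the event resolves for
    kind: str                             # "B" (between) or "T" (open-ended)
    strike: float                         # number that follows the letter
    bin_range: Optional[Tuple[int, int]]  # inclusive bounds for "B", else None


def parse_market_ticker(ticker: str) -> ParsedTicker:
    """Split a market ticker such as KXHIGHNY-26FEB04-B32.5 into its parts."""
    series, date_code, strike_code = ticker.split("-")
    # Assumption: the date code is YYMMMDD with a 20xx year.
    event_date = date(2000 + int(date_code[:2]),
                      MONTHS[date_code[2:5]],
                      int(date_code[5:]))
    kind, strike = strike_code[0], float(strike_code[1:])
    if kind == "B":
        # The decimal is the midpoint, so 32.5 covers 32 and 33.
        bin_range = (math.floor(strike), math.ceil(strike))
    elif kind == "T":
        bin_range = None  # direction is not encoded in the ticker
    else:
        raise ValueError(f"unknown strike prefix in {ticker!r}")
    return ParsedTicker(series, event_date, kind, strike, bin_range)


print(parse_market_ticker("KXHIGHNY-26FEB04-B32.5"))
```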

In [None]:
import json
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

# Run this notebook from the exploration/ directory
KALSHI_DIR = Path(".") / "data" / "raw" / "kalshi"

# List files by type
event_files = sorted(KALSHI_DIR.glob("*_event_markets.json"))
orderbook_files = sorted(KALSHI_DIR.glob("*_orderbook.json"))
print(f"Event market files: {len(event_files)}")
print(f"Orderbook files: {len(orderbook_files)}")
print("\nSample event files:")
for f in event_files[:5]:
    print(f"  {f.name}")

## 2. Event Markets JSON Structure

Each `*_event_markets.json` has:
- `event_ticker` — e.g. KXHIGHNY-26FEB04
- `date` — the date the market resolves for
- `markets` — list of individual markets, one per temperature range

Each **market** has:
- `ticker` — e.g. KXHIGHNY-26FEB04-B32.5
- `subtitle` — human-readable range (e.g. "32° to 33°")
- `yes_bid` / `yes_ask` — best bid/ask in cents (implied probability)
- `last_price` — last trade price
- `status` — `active` or `finalized`
- `result` — `yes` or `no` when finalized
- `expiration_value` — the actual high temp (when resolved)
- `volume`, `open_interest`, `liquidity` — trading activity

In [None]:
# Load one event file
sample_event = event_files[0]
with open(sample_event) as f:
    event = json.load(f)

print(f"Event: {event['event_ticker']}")
print(f"Date: {event['date']}")
print(f"Markets: {len(event['markets'])}")

# Show first market
m = event['markets'][0]
print("\n--- First market keys ---")
print(list(m.keys())[:20])

print("\n--- First market (sample) ---")
print(json.dumps({k: m[k] for k in ['ticker', 'subtitle', 'yes_bid', 'yes_ask', 'last_price', 'status', 'result', 'expiration_value']}, indent=2))

## 3. Build a DataFrame of All Markets

We'll extract key fields from each event file and build a table for analysis.

In [None]:
def load_all_event_markets(event_files):
    """Load all event files and flatten into a DataFrame."""
    rows = []
    for path in event_files:
        with open(path) as f:
            event = json.load(f)
        event_ticker = event["event_ticker"]
        date = event["date"]
        # Append every snapshot here; older duplicates are dropped after loading
        for m in event["markets"]:
            rows.append({
                "event_ticker": event_ticker,
                "date": date,
                "ticker": m["ticker"],
                "subtitle": m["subtitle"],
                "yes_bid": m.get("yes_bid"),
                "yes_ask": m.get("yes_ask"),
                "last_price": m.get("last_price"),
                "status": m.get("status"),
                "result": m.get("result", ""),
                "expiration_value": m.get("expiration_value", ""),
                "volume": m.get("volume"),
                "open_interest": m.get("open_interest"),
                "liquidity": m.get("liquidity"),
                "strike_type": m.get("strike_type"),
                "source_file": path.name,
            })
    return pd.DataFrame(rows)

df = load_all_event_markets(event_files)

# Deduplicate: keep latest snapshot per ticker (by source file timestamp)
df = df.sort_values("source_file", ascending=False).drop_duplicates(subset=["ticker"], keep="first")
df = df.sort_values(["date", "ticker"])

print(f"Total rows: {len(df)}")
print(f"Unique dates: {df['date'].nunique()}")
df.head(10)

## 4. Interpret the Columns

| Column | Meaning |
|--------|--------|
| `yes_bid` | Highest price a buyer will currently pay for a "Yes" contract (1–99 cents). Implies "market thinks probability is at least X%" |
| `yes_ask` | Lowest price a seller will currently accept for a "Yes" contract. Implies "market thinks probability is at most X%" |
| `last_price` | Last traded price. Used in this notebook as a fallback estimate when bid or ask is missing |
| `result` | "yes" or "no" — only set when the market is finalized |
| `expiration_value` | Actual NWS high temp (when resolved) — used to show which market won |
| `volume` | Total contracts traded |
| `open_interest` | Contracts still held (not closed) |
| `liquidity` | Total $ available at best bid/ask |
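
The bid and ask columns above combine into a spread and a midpoint. The sketch below uses a small synthetic table (illustrative values, not real Kalshi data) and a hypothetical helper `add_quote_columns`; it falls back to `last_price` when either side of the quote is missing, which is one reasonable choice rather than the only one.

```python
import pandas as pd

# Synthetic quotes for illustration only (not real Kalshi data).
quotes = pd.DataFrame({
    "ticker": ["BIN-A", "BIN-B", "BIN-C"],
    "yes_bid": [10, 44, None],
    "yes_ask": [14, 47, None],
    "last_price": [12, 45, 3],
})


def add_quote_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Add spread, mid, and implied_prob columns to a copy of the frame."""
    out = frame.copy()
    # Spread in cents; NaN when either side of the quote is missing.
    out["spread"] = out["yes_ask"] - out["yes_bid"]
    # Midpoint of bid and ask, falling back to the last traded price.
    mid = (out["yes_bid"] + out["yes_ask"]) / 2
    out["mid"] = mid.fillna(out["last_price"])
    # Convert cents to a probability between 0 and 1.
    out["implied_prob"] = out["mid"] / 100
    return out


print(add_quote_columns(quotes))
```

A wide spread signals that the midpoint is a rough estimate, so it can be worth inspecting `spread` alongside `implied_prob`.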

In [None]:
# Show finalized markets and their outcome
finalized = df[df["status"] == "finalized"].copy()
winner = finalized[finalized["result"] == "yes"]
print("\n--- Finalized markets: which contract won? ---")
for _, row in winner.iterrows():
    print(f"  {row['date']}: {row['subtitle']} (actual high: {row['expiration_value']}°F)")

## 5. Visualization: Implied Probability by Temperature Range

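For each day, we plot the market's implied probability for each temperature bin, using the bid/ask midpoint and falling back to `last_price` when a quote is missing. Bars are green for the winning bin, gray for losing bins, and blue while the market is still unresolved. Once a market is finalized, the winning bin should have a probability near 1.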
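
Because the bins for one date are mutually exclusive, their implied probabilities should sum to roughly 1, but midpoints computed from separate quotes will rarely add up exactly. The sketch below, using synthetic values and a hypothetical `normalize_by_date` helper, reports the raw total per date and a rescaled probability. It assumes every date has a positive total.

```python
import pandas as pd

# Synthetic midpoints in cents for illustration only (not real Kalshi data).
bins = pd.DataFrame({
    "date": ["2026-02-04"] * 4 + ["2026-02-05"] * 3,
    "subtitle": ["bin 1", "bin 2", "bin 3", "bin 4", "bin 1", "bin 2", "bin 3"],
    "mid_prob": [5, 30, 50, 20, 10, 60, 25],
})


def normalize_by_date(frame: pd.DataFrame, prob_col: str = "mid_prob") -> pd.DataFrame:
    """Add the raw per-date total and a rescaled probability that sums to 1."""
    out = frame.copy()
    # Sum of midpoints (in cents) across all bins of the same date.
    totals = out.groupby("date")[prob_col].transform("sum")
    out["raw_total"] = totals / 100
    # Assumption: totals are positive, otherwise this division is undefined.
    out["norm_prob"] = out[prob_col] / totals
    return out


normalized = normalize_by_date(bins)
print(normalized.groupby("date")[["raw_total"]].first())
```

A raw total far from 1 is a hint that some quotes are stale or that some bins have very wide spreads.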

In [None]:
def plot_implied_probs(df, date_filter=None):
    """Plot implied probability (mid of bid/ask) by temperature range for each date."""
    subset = df.copy()
    if date_filter:
        subset = subset[subset["date"] == date_filter]
    
    # Compute mid probability
    def mid_prob(row):
        b, a = row["yes_bid"], row["yes_ask"]
        if pd.notna(b) and pd.notna(a):
            return (b + a) / 2
        return row["last_price"] if pd.notna(row["last_price"]) else None
    
    subset = subset.assign(mid_prob=subset.apply(mid_prob, axis=1))
    subset = subset.assign(mid_prob=subset["mid_prob"].fillna(subset["last_price"]))
    subset = subset.dropna(subset=["mid_prob"])
    
    dates = subset["date"].unique()
    if len(dates) == 0:
        print("No data for plotting")
        return
    
    fig, axes = plt.subplots(len(dates), 1, figsize=(10, 4 * len(dates)), squeeze=False)
    for i, date in enumerate(dates):
        ax = axes[i, 0]
        d = subset[subset["date"] == date]
        colors = ["green" if r == "yes" else "gray" if r else "steelblue" for r in d["result"]]
        bars = ax.bar(d["subtitle"], d["mid_prob"] / 100, color=colors, alpha=0.8)
        ax.set_ylabel("Implied probability")
        ax.set_title(f"{date}" + (f" (actual: {d['expiration_value'].iloc[0]}°F)" if d["expiration_value"].iloc[0] else ""))
        ax.set_ylim(0, 1.05)
        ax.tick_params(axis="x", rotation=45)
    plt.tight_layout()
    plt.show()

# Plot a single date (pass date_filter=None to plot every date)
plot_implied_probs(df, date_filter="2026-02-04")

## 6. Orderbook Structure

The `*_orderbook.json` files contain a snapshot of order depth at capture time:
- `yes` / `no` — lists of `[price_cents, quantity]`
- `yes` = people willing to buy/sell "Yes" contracts
- `no` = people willing to buy/sell "No" contracts

When a market is **finalized**, the orderbook is often empty (null). For **active** markets, you see depth.
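
One way to turn the two lists into a best Yes bid and ask is sketched below. It rests on an assumption that should be checked against Kalshi's API documentation: that both lists hold resting bids only, so a "No" bid at price p is equivalent to a "Yes" ask at 100 − p. `best_yes_quotes` is a hypothetical helper, and the sample levels are synthetic.

```python
from typing import List, Optional, Sequence, Tuple

Level = Sequence[int]  # [price_cents, quantity]


def best_yes_quotes(
    yes_levels: Optional[List[Level]],
    no_levels: Optional[List[Level]],
) -> Tuple[Optional[int], Optional[int]]:
    """Return (best Yes bid, implied best Yes ask) in cents.

    Assumption: each list contains bids only. The highest No bid then
    implies the lowest Yes ask as 100 minus that price.
    """
    yes_levels = yes_levels or []  # null orderbooks become empty lists
    no_levels = no_levels or []
    best_bid = max((price for price, _ in yes_levels), default=None)
    best_no_bid = max((price for price, _ in no_levels), default=None)
    best_ask = None if best_no_bid is None else 100 - best_no_bid
    return best_bid, best_ask


# Synthetic levels for illustration only.
print(best_yes_quotes([[40, 10], [42, 5]], [[55, 7], [50, 3]]))
print(best_yes_quotes(None, None))
```

Taking the maximum rather than the first or last element avoids depending on how the levels happen to be sorted in the file.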

In [None]:
# Find an orderbook with data (active market)
ob_with_data = None
for p in orderbook_files:
    with open(p) as f:
        ob = json.load(f)
    ob_data = ob.get("orderbook", ob)
    yes_ob = ob_data.get("yes") or ob_data.get("yes_dollars")
    no_ob = ob_data.get("no") or ob_data.get("no_dollars")
    if yes_ob or no_ob:
        ob_with_data = (p, ob)
        break

if ob_with_data:
    path, ob = ob_with_data
    print(f"Orderbook: {path.name}")
    ob_data = ob.get("orderbook", ob)
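    # Apply the same *_dollars fallback as the search loop so the counts match
    yes_ob = ob_data.get("yes") or ob_data.get("yes_dollars")
    no_ob = ob_data.get("no") or ob_data.get("no_dollars")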
    print(f"  yes levels: {len(yes_ob) if yes_ob else 0}")
    print(f"  no levels: {len(no_ob) if no_ob else 0}")
    if no_ob:
        print("  Sample no (price, qty):", no_ob[:5])
        print("  ...")
else:
    print("No orderbook with non-null data found (all may be finalized)")

## 7. Orderbook Depth Visualization

When orderbook data is available, we can plot the depth on each side. Each row is `[price_cents, quantity]`. On the "no" side, a higher price means someone is willing to pay more for "No", i.e. they are more confident the outcome won't happen.
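
Depth is often easier to read as a running total than as per-level quantities. The sketch below sorts one side from the highest price down and accumulates quantity, so each entry answers "how many contracts are bid at this price or better?". `cumulative_depth` is a hypothetical helper, the levels are synthetic, and the sketch assumes integer cent prices in the `[price_cents, quantity]` format.

```python
from itertools import accumulate
from typing import List, Optional, Sequence, Tuple


def cumulative_depth(levels: Optional[List[Sequence[int]]]) -> List[Tuple[int, int]]:
    """Return (price, cumulative quantity) pairs from highest price to lowest."""
    # Treat a null side as empty, then order from the most aggressive price down.
    ordered = sorted(levels or [], key=lambda level: level[0], reverse=True)
    # Running total of quantity across the ordered levels.
    running = accumulate(qty for _, qty in ordered)
    return [(price, total) for (price, _), total in zip(ordered, running)]


# Synthetic levels for illustration only.
for price, total in cumulative_depth([[50, 3], [55, 7], [52, 5]]):
    print(f"{price}¢ or better: {total} contracts")
```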

In [None]:
def plot_orderbook(ob_path):
    with open(ob_path) as f:
        ob = json.load(f)
    ob_data = ob.get("orderbook", ob)
    yes_ob = ob_data.get("yes")
    no_ob = ob_data.get("no")
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    if no_ob:
        prices, qtys = zip(*no_ob)
        axes[0].barh(range(len(prices)), qtys, color="coral", alpha=0.7)
        axes[0].set_yticks(range(len(prices)))
        axes[0].set_yticklabels([f"{p}¢" for p in prices])
        axes[0].set_xlabel("Quantity")
        axes[0].set_title("No side (buy No = bet against outcome)")
    
    if yes_ob:
        prices, qtys = zip(*yes_ob)
        axes[1].barh(range(len(prices)), qtys, color="teal", alpha=0.7)
        axes[1].set_yticks(range(len(prices)))
        axes[1].set_yticklabels([f"{p}¢" for p in prices])
        axes[1].set_xlabel("Quantity")
        axes[1].set_title("Yes side (buy Yes = bet outcome happens)")
    
    if not yes_ob and not no_ob:
        axes[0].text(0.5, 0.5, "No orderbook data", ha="center", va="center")
    plt.tight_layout()
    plt.show()

if ob_with_data:
    plot_orderbook(ob_with_data[0])
else:
    print("Skipping orderbook plot (no data)")

## 8. Summary: What You're Seeing

### Event Markets
- **One event per date** (e.g. Feb 4, 2026)
- **One market per temperature bin** (e.g. 29° or below, 30–31°, 32–33°, …)
- **Exactly one** market resolves **Yes** for each date (the bin containing the actual high)
- `yes_bid` / `yes_ask` = implied probability that this bin wins
- `expiration_value` = actual NWS high when resolved
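
The "exactly one Yes per date" property is easy to check on the flattened table. The sketch below uses a synthetic frame with the same `date`, `status`, and `result` columns as `df`; `winners_per_date` and `dates_without_single_winner` are hypothetical helpers.

```python
import pandas as pd

# Synthetic rows for illustration only (not real Kalshi data).
sample = pd.DataFrame({
    "date": ["2026-02-04", "2026-02-04", "2026-02-05", "2026-02-05", "2026-02-06"],
    "status": ["finalized", "finalized", "finalized", "finalized", "active"],
    "result": ["no", "yes", "no", "no", ""],
})


def winners_per_date(frame: pd.DataFrame) -> pd.Series:
    """Count markets that resolved Yes for each finalized date."""
    finalized = frame[frame["status"] == "finalized"]
    return (finalized["result"] == "yes").groupby(finalized["date"]).sum()


def dates_without_single_winner(frame: pd.DataFrame) -> list:
    """Return finalized dates whose Yes count is not exactly one."""
    counts = winners_per_date(frame)
    return sorted(counts[counts != 1].index)


print(winners_per_date(sample))
print(dates_without_single_winner(sample))
```

A date that shows up in the second list usually points to a missing bin in the snapshot or to a file captured while only part of the event had been finalized.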

### Orderbooks
- **One orderbook per market ticker** (e.g. KXHIGHNY-26FEB04-B32.5)
- `yes` / `no` = price levels and quantities
- Empty when market is finalized (no more trading)

### Resolution
Kalshi uses the **NWS Climatological Report (Daily)** for Central Park (KNYC). The official high is rounded to the nearest whole °F. See `project_rules/weather_prediction_rules.md` for details.