# Kalshi Market Extractor

This notebook is a proof-of-concept for extracting series, event, and market data from the Kalshi prediction market (hereafter referred to as "market data").

This is the first step toward building models that backtest event contracts on the Kalshi prediction market against real-world news events, creating automated trading systems, etc.

This notebook pulls only the markets that are open each day.

The architecture will be as follows:
* pull data for all series from Kalshi
* filter series for politics, economics, financials, and tech & science
* extract events from selected series
* extract event title and summary text
* extract intraday price movement data
* create features using text chunking, entity extraction, etc.

The following steps require further design work:
* create headline aggregator
* select specific contracts for testing based on the availability of key dates where intraday price movements exceed one standard deviation from the daily distribution
* parse headlines from headline aggregator from those dates
* use contract features from selected contracts to test against headlines from selected high-movement days
* run same analyses across multiple dates and across multiple contracts
* run statistical analysis to determine factors that can be used to create forecasting model
* backtest model against open-ended range of dates of collected headlines to predict price movements, and compare against actual prices
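
The contract-selection step above can be sketched as follows. This is only one possible reading of "exceed one standard deviation from the daily distribution": it measures each day's intraday range (high minus low) and flags days whose range is more than one standard deviation above the mean range. The function name `high_movement_days`, the `timestamp` and `price` column names, and the sample prices are all hypothetical and are not part of the Kalshi API.

```python
import pandas as pd


def high_movement_days(prices: pd.DataFrame, n_std: float = 1.0) -> pd.Series:
    """Return the intraday range of each day that exceeds mean + n_std * std.

    Assumes `prices` has a datetime `timestamp` column and a numeric `price`
    column for a single market (hypothetical column names).
    """
    by_day = prices.groupby(prices["timestamp"].dt.date)["price"]
    daily_range = by_day.max() - by_day.min()
    threshold = daily_range.mean() + n_std * daily_range.std()
    return daily_range[daily_range > threshold]


# Made-up prices: four quiet days and one day with a large swing
prices = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2025-05-01 10:00", "2025-05-01 15:00",
        "2025-05-02 10:00", "2025-05-02 15:00",
        "2025-05-03 10:00", "2025-05-03 15:00",
        "2025-05-04 10:00", "2025-05-04 15:00",
        "2025-05-05 10:00", "2025-05-05 15:00",
    ]),
    "price": [50, 52, 50, 52, 50, 52, 50, 52, 40, 60],
})

flagged = high_movement_days(prices)
print(flagged)
```

Other definitions of "movement" (for example close-to-close change, or a rolling rather than global threshold) may suit the final model better; the sketch only fixes the shape of the computation.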


| series        |  event                  |  market |
|---------------|-------------------------|----------------------|
| KXFEDDECISION |  KXFEDDECISION-25MAY    |  KXFEDDECISION-25MAY-H0 |
| Fed Decision  | Fed Decision - May 2025 |  Fed Decision - May 2025 - No Change |

In [23]:
from datetime import datetime
import requests
import time
import pandas as pd
import sys
import spacy
import subprocess

## From API

This cell takes care of setting constants, reusable parameters, and other code preparation steps.

This cell calls the events API, requesting up to 200 open events per page along with their nested markets.

In [2]:
url = "https://api.elections.kalshi.com/trade-api/v2/events?limit=200&status=open&with_nested_markets=true"

headers = {"accept": "application/json"}

response = requests.get(url, headers=headers)

response_json = response.json()

This cell saves all events into a list.

In [3]:
events = []

# Use a distinct loop variable so the `response` object is not overwritten
for event in response_json['events']:
    events.append(event)

This cell uses the cursor returned by the API to determine whether there are additional pages of records. While a cursor is present, it keeps calling the API; an empty cursor means there are no further records.

In [4]:
# Keep requesting pages until the API stops returning a cursor
while response_json.get('cursor'):
    cursor = response_json['cursor']
    url = 'https://api.elections.kalshi.com/trade-api/v2/events?limit=200&cursor=' + cursor + '&status=open&with_nested_markets=true'
    response = requests.get(url, headers=headers)
    response_json = response.json()
    # Use a distinct loop variable so the `response` object is not overwritten
    for event in response_json['events']:
        events.append(event)
    time.sleep(.10)  # brief pause between requests
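
The cursor loop can also be written as a small generator that is independent of the HTTP layer, which makes it easy to exercise without network access. The sketch below is an illustration, not the notebook's method: `iter_events`, `fake_fetch`, and the fake page data are hypothetical names, and it assumes only what the cells above show, namely that each page is a dict with an `events` list and a `cursor` that is empty or missing on the last page.

```python
from typing import Callable, Iterator, Optional


def iter_events(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield events page by page until no cursor is returned."""
    cursor = None
    while True:
        page = fetch_page(cursor)          # first call is made without a cursor
        yield from page.get("events", [])
        cursor = page.get("cursor")
        if not cursor:                     # empty or missing cursor: last page
            break


# Hypothetical stand-in for the HTTP call, keyed by the cursor that is passed in
_fake_pages = {
    None: {"events": [{"event_ticker": "A"}, {"event_ticker": "B"}], "cursor": "p2"},
    "p2": {"events": [{"event_ticker": "C"}], "cursor": ""},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return _fake_pages[cursor]


all_events = list(iter_events(fake_fetch))
print(len(all_events))
```

With `requests`, `fetch_page` could be a small function that calls `requests.get` with the notebook's URL and headers, adds the `cursor` query parameter when one is given, and returns `response.json()`.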

In [5]:
print(len(events))

1314


In [6]:
events_df = pd.DataFrame(events)

# Flatten the nested 'markets' field
markets_df = pd.json_normalize(
    events,
    record_path=['markets'],
    meta=['event_ticker', 'series_ticker', 'title'],
    record_prefix='market_',
    meta_prefix='event_'
)

# Merge the events and markets DataFrames on the matching event ticker
df = markets_df.merge(
    events_df,
    how='left',
    left_on='event_event_ticker',
    right_on='event_ticker',
    suffixes=('_market', '_event')
)
# Reorder columns: title, sub_title, and the market rules first, followed by the rest
front_cols = ['title', 'sub_title', 'market_rules_primary', 'market_rules_secondary']
cols = front_cols + [col for col in df.columns if col not in front_cols]

df = df[cols]

# Drop columns that start with 'market_custom_strike'
df = df.loc[:, ~df.columns.str.startswith('market_custom_strike')]
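
To see why the merge above joins on a column named `event_event_ticker`, it can help to run the same `pd.json_normalize` call on a tiny hand-made payload. The sketch below uses made-up tickers and only two market fields; the real API response carries many more. It shows that `record_prefix` is applied to every field of the nested market records and `meta_prefix` to every event-level field listed in `meta`.

```python
import pandas as pd

# Hypothetical payload shaped like one event with two nested markets
toy_events = [
    {
        "event_ticker": "EVT-1",
        "series_ticker": "EVT",
        "title": "Toy event",
        "markets": [
            {"ticker": "EVT-1-A", "status": "active"},
            {"ticker": "EVT-1-B", "status": "closed"},
        ],
    }
]

toy_markets = pd.json_normalize(
    toy_events,
    record_path=["markets"],                          # one output row per market
    meta=["event_ticker", "series_ticker", "title"],  # event fields copied onto each row
    record_prefix="market_",
    meta_prefix="event_",
)

print(toy_markets.columns.tolist())
```

The event-level `event_ticker` therefore becomes `event_event_ticker`, while the market's own `ticker` becomes `market_ticker`.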


In [7]:
df.shape

(9169, 54)

In [8]:
df.head()

Unnamed: 0,title,sub_title,market_rules_primary,market_rules_secondary,market_ticker,market_event_ticker,market_market_type,market_title,market_subtitle,market_yes_sub_title,...,event_series_ticker,event_title,event_ticker,series_ticker,collateral_return_type,mutually_exclusive,category,markets,strike_date,strike_period
0,Will a humanoid robot walk on Mars before a hu...,Before 2035,If a humanoid robot walks on Mars before a hum...,,KXROBOTMARS-35,KXROBOTMARS-35,binary,,,Before 2035,...,KXROBOTMARS,Will a humanoid robot walk on Mars before a hu...,KXROBOTMARS-35,KXROBOTMARS,,False,Science and Technology,"[{'ticker': 'KXROBOTMARS-35', 'event_ticker': ...",,
1,Which of these Latin America leaders will leav...,Before 2035,If President of Venezuela is the first leader ...,An announcement that a leader will leave their...,KXLALEADEROUT-35-NM,KXLALEADEROUT-35,binary,,:: President of Venezuela,Nicolás Maduro,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
2,Which of these Latin America leaders will leav...,Before 2035,If President of El Salvador is the first leade...,An announcement that a leader will leave their...,KXLALEADEROUT-35-NB,KXLALEADEROUT-35,binary,,:: President of El Salvador,Nayib Bukele,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
3,Which of these Latin America leaders will leav...,Before 2035,If the President of Brazil is the first leader...,An announcement that a leader will leave their...,KXLALEADEROUT-35-LULA,KXLALEADEROUT-35,binary,,:: President of Brazil,Luiz Inácio Lula da Silva,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
4,Which of these Latin America leaders will leav...,Before 2035,If President of the Dominican Republic is the ...,An announcement that a leader will leave their...,KXLALEADEROUT-35-LA,KXLALEADEROUT-35,binary,,:: President of the Dominican Republic,Luis Abinader,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,


This cell searches for duplicate markets by `market_ticker`; none are expected.

In [9]:
duplicates = df[df.duplicated(subset='market_ticker', keep=False)]

if len(duplicates) > 0:
    print(duplicates)
else:
    print("Successful data import.")


Successful data import.


The following cells list the `category` values and remove categories that do not fit our model. They also remove markets whose status is `finalized`, `inactive`, `closed`, or `settled`.

In [10]:
unique_categories = df['category'].unique()

print(unique_categories)

['Science and Technology' 'Politics' 'Elections' 'Climate and Weather'
 'Economics' 'Sports' 'Health' 'Social' 'Companies' 'Entertainment'
 'World' 'Financials' 'Education' 'Crypto' 'Transportation']


In [11]:
# List of categories to keep
categories_to_keep = [
    'Politics', 'Social', 'Transportation',
    'Science and Technology', 'Economics', 'Companies', 'Elections',
    'Health', 'Financials', 'Climate and Weather', 'World'
]

filtered_df = df[df['category'].isin(categories_to_keep)]

In [12]:
unique_statuses = df['market_status'].unique()

print(unique_statuses)

['active' 'finalized' 'inactive' 'closed' 'settled' 'initialized']


In [13]:
# List of market statuses to exclude
statuses_to_exclude = ['finalized', 'inactive', 'closed', 'settled']

# Filter out rows with these statuses
filtered_df = filtered_df[~filtered_df['market_status'].isin(statuses_to_exclude)]

In [14]:
filtered_df.shape

(3596, 54)

In [15]:
filtered_df.head()

Unnamed: 0,title,sub_title,market_rules_primary,market_rules_secondary,market_ticker,market_event_ticker,market_market_type,market_title,market_subtitle,market_yes_sub_title,...,event_series_ticker,event_title,event_ticker,series_ticker,collateral_return_type,mutually_exclusive,category,markets,strike_date,strike_period
0,Will a humanoid robot walk on Mars before a hu...,Before 2035,If a humanoid robot walks on Mars before a hum...,,KXROBOTMARS-35,KXROBOTMARS-35,binary,,,Before 2035,...,KXROBOTMARS,Will a humanoid robot walk on Mars before a hu...,KXROBOTMARS-35,KXROBOTMARS,,False,Science and Technology,"[{'ticker': 'KXROBOTMARS-35', 'event_ticker': ...",,
1,Which of these Latin America leaders will leav...,Before 2035,If President of Venezuela is the first leader ...,An announcement that a leader will leave their...,KXLALEADEROUT-35-NM,KXLALEADEROUT-35,binary,,:: President of Venezuela,Nicolás Maduro,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
2,Which of these Latin America leaders will leav...,Before 2035,If President of El Salvador is the first leade...,An announcement that a leader will leave their...,KXLALEADEROUT-35-NB,KXLALEADEROUT-35,binary,,:: President of El Salvador,Nayib Bukele,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
3,Which of these Latin America leaders will leav...,Before 2035,If the President of Brazil is the first leader...,An announcement that a leader will leave their...,KXLALEADEROUT-35-LULA,KXLALEADEROUT-35,binary,,:: President of Brazil,Luiz Inácio Lula da Silva,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,
4,Which of these Latin America leaders will leav...,Before 2035,If President of the Dominican Republic is the ...,An announcement that a leader will leave their...,KXLALEADEROUT-35-LA,KXLALEADEROUT-35,binary,,:: President of the Dominican Republic,Luis Abinader,...,KXLALEADEROUT,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,KXLALEADEROUT,MECNET,True,Politics,"[{'ticker': 'KXLALEADEROUT-35-NM', 'event_tick...",,


This cell saves all remaining markets (status `active` or `initialized`, and categorized as 'Politics', 'Social', 'Transportation', 'Science and Technology', 'Economics', 'Companies', 'Elections', 'Health', 'Financials', 'Climate and Weather', or 'World') to a CSV file in the `data/output/1_market_extractor` directory.

In [34]:
today = datetime.today().strftime('%Y-%m-%d')
filtered_df.to_csv(f'data/output/1_market_extractor/events_{today}.csv', index=False)
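
`DataFrame.to_csv` does not create missing parent directories, so the save above fails on a fresh checkout unless `data/output/1_market_extractor` already exists. A minimal sketch of a more defensive save, assuming the same file naming as the cell above, is shown below. The helper name `save_snapshot` and the temporary demo directory are hypothetical.

```python
import tempfile
from datetime import date
from pathlib import Path

import pandas as pd


def save_snapshot(df: pd.DataFrame, out_dir, day=None) -> Path:
    """Write df to <out_dir>/events_<YYYY-MM-DD>.csv, creating out_dir if needed."""
    day = day or date.today()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    out_path = out_dir / f"events_{day:%Y-%m-%d}.csv"
    df.to_csv(out_path, index=False)
    return out_path


# Demo against a throwaway directory rather than the real project tree
demo_root = Path(tempfile.mkdtemp())
saved = save_snapshot(
    pd.DataFrame({"market_ticker": ["KXROBOTMARS-35"]}),
    demo_root / "data" / "output" / "1_market_extractor",
    day=date(2025, 5, 1),
)
print(saved.name)
```

In the notebook this would be called as `save_snapshot(filtered_df, 'data/output/1_market_extractor')`.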

The following cell collapses the collected markets to one row per Kalshi event, keeping the primary rules of the first market listed for each event.

In [20]:
# Select only the relevant columns and drop duplicate combinations
unique_events = filtered_df[['market_event_ticker', 'event_title']].drop_duplicates()

# Select the primary rules for the rows kept in unique_events
rules_subset = filtered_df.loc[unique_events.index, ['market_rules_primary']]

# Merge the rules into unique_events on index
unique_events = unique_events.merge(rules_subset, left_index=True, right_index=True)

unique_events.shape

(877, 3)

In [21]:
unique_events.head()

Unnamed: 0,market_event_ticker,event_title,market_rules_primary
0,KXROBOTMARS-35,Will a humanoid robot walk on Mars before a hu...,If a humanoid robot walks on Mars before a hum...
1,KXLALEADEROUT-35,Which of these Latin America leaders will leav...,If President of Venezuela is the first leader ...
11,KXBRUVSEAT-35,Will Andrew Tate's party win a seat in the nex...,"If Britain Restoring Underlying Values (""BRUV""..."
12,KXAFRICALEADEROUT-35,Which of these African leaders will leave offi...,If the President of Ghana is the first leader ...
22,EUCLIMATE,EU meets its 2030 climate goals?,If the EU has reduced greenhouse gas emissions...


The following cells begin feature extraction based on the event title and the primary market rules.

In [29]:
# Load the model; download if not already present
try:
    nlp = spacy.load("en_core_web_md")
    print("spaCy model 'en_core_web_md' is already installed.")
except OSError:
    print("Downloading spaCy model 'en_core_web_md'...")
    try:
        # Install spaCy only (no transformers)
        subprocess.run([sys.executable, "-m", "pip", "install", "-U", "spacy"], check=True)

        # Download the medium English model
        subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_md"], check=True)

        nlp = spacy.load("en_core_web_md")
        print("spaCy basic model loaded successfully!")

    except subprocess.CalledProcessError as e:
        print(f"Installation error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

spaCy model 'en_core_web_md' is already installed.


In [30]:
# Start from unique_events and remove rows with missing or empty titles
unique_event_titles = unique_events[unique_events['event_title'].notna()]
unique_event_titles = unique_event_titles[unique_event_titles['event_title'].str.strip() != ""]

def extract_features(event_title, market_rules_primary):
    # Combine text safely
    combined_text = f"{event_title or ''}. {market_rules_primary or ''}"
    doc = nlp(combined_text)

    # Collect every token index covered by a DATE entity so those tokens can be excluded
    date_spans = {i for ent in doc.ents if ent.label_ == 'DATE' for i in range(ent.start, ent.end)}

    # Custom exclusions
    custom_exclusions = {'resolve', 'yes'}

    # Filter tokens
    filtered_tokens = [
        token for i, token in enumerate(doc)
        if token.is_alpha and not token.is_stop and i not in date_spans
    ]

    # Deduplicated and normalized lemmas
    lemmas_set = {
        token.lemma_.lower()
        for token in filtered_tokens
        if token.lemma_.lower() not in custom_exclusions
    }
    normalized_lemmas = ' '.join(sorted(lemmas_set))

    # Extract entity text values
    entity_texts = [ent.text for ent in doc.ents]

    return {
        'lemmas': normalized_lemmas,
        'named_entities': entity_texts
    }

# Apply the feature extraction to each row in your DataFrame
features = unique_event_titles.apply(
    lambda row: extract_features(row['event_title'], row['market_rules_primary']),
    axis=1,
    result_type='expand'
)
features_df = pd.concat([unique_event_titles, features], axis=1)

# Reorder columns: 'event_title' and 'market_event_ticker' first
cols = ['event_title', 'market_event_ticker'] + [col for col in features_df.columns if col not in ['event_title', 'market_event_ticker']]
features_df = features_df[cols]

In [31]:
features_df.shape

(877, 5)

In [32]:
features_df.head()

Unnamed: 0,event_title,market_event_ticker,market_rules_primary,lemmas,named_entities
0,Will a humanoid robot walk on Mars before a hu...,KXROBOTMARS-35,If a humanoid robot walks on Mars before a hum...,human humanoid market mars robot walk,"[Mars, Mars, 2035]"
1,Which of these Latin America leaders will leav...,KXLALEADEROUT-35,If President of Venezuela is the first leader ...,america latin leader leave market office presi...,"[Latin America, Venezuela, first]"
11,Will Andrew Tate's party win a seat in the nex...,KXBRUVSEAT-35,"If Britain Restoring Underlying Values (""BRUV""...",andrew britain bruv commons election general h...,"[Will Andrew Tate's, UK, Britain, BRUV, the Ho..."
12,Which of these African leaders will leave offi...,KXAFRICALEADEROUT-35,If the President of Ghana is the first leader ...,african ghana leader leave market office presi...,"[African, Ghana, first]"
22,EU meets its 2030 climate goals?,EUCLIMATE,If the EU has reduced greenhouse gas emissions...,climate compare emission eu gas goal greenhous...,"[EU, 2030, EU, 55%, 1990, 2030]"
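
As a pointer toward the planned step of testing contract features against headlines, the space-separated `lemmas` column can be compared with a headline's lemmas using a simple set overlap. The sketch below uses Jaccard similarity, which is one possible choice and not something the notebook prescribes. The function name `lemma_overlap` and the headline lemmas are hypothetical; the event lemmas are taken from the first row above.

```python
def lemma_overlap(event_lemmas: str, headline_lemmas: str) -> float:
    """Jaccard similarity between two space-separated lemma strings."""
    a = set(event_lemmas.split())
    b = set(headline_lemmas.split())
    if not a or not b:
        return 0.0  # treat an empty side as no overlap
    return len(a & b) / len(a | b)


# Event lemmas from the first features_df row; headline lemmas are made up
score = lemma_overlap(
    "human humanoid market mars robot walk",
    "humanoid robot mars mission",
)
print(round(score, 3))
```

Here three lemmas are shared out of seven distinct ones, so the score is 3/7. A real comparison would need the headline text to pass through the same spaCy preprocessing as `extract_features`.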
