# News Event Notification Service

This notebook ingests metadata (titles, descriptions, etc.) from selected financial markets and uses a sentence-embedding model to convert that text into embedding vectors. It then compares those vectors against embeddings of headlines and other metadata collected by the news media aggregators to compute a similarity score between each news item and each selected market.

The goal is a curated news feed that alerts the end user whenever a news item relevant to one of the identified markets is posted.

In [88]:
import pandas as pd
import os
from datetime import datetime
from sentence_transformers import SentenceTransformer
from tqdm import tqdm
import numpy as np

First, we load the CSV files to be compared: the financial markets we are tracking and the input data from our news aggregators.

In [89]:
today = datetime.today().strftime('%Y_%m_%d')

# Kalshi prediction markets
markets_file = os.path.join('../../../..','Financial Market Engines/Prediction Market Engines/Kalshi Prediction Market Engine/KalshiPredictionMarketEngine/data/output/2_market_selector', f'selected_markets_{today}.csv')

# Bluesky posts
posts_file = os.path.join('../../../..','News Media Aggregators/Social Media Aggregators/Bluesky Aggregator/BlueskyAggregator/data/output/1_author_feed_extractor', 'collected_feeds.csv')

# Load the CSVs into DataFrames
posts_df = pd.read_csv(posts_file)
all_markets_df = pd.read_csv(markets_file)

# Fill missing values with empty string (so that concatenation won't produce 'nan')
posts_df.fillna('', inplace=True)
all_markets_df.fillna('', inplace=True)

# Reset index to ensure 0..N-1 indexing
posts_df.reset_index(drop=True, inplace=True)
all_markets_df.reset_index(drop=True, inplace=True)

# (Optional) View basic info about the data
print("Bluesky posts loaded:", posts_df.shape, "records")
print("Kalshi markets loaded:", all_markets_df.shape, "records")

Bluesky posts loaded: (310, 9) records
Kalshi markets loaded: (9, 54) records




In [90]:
posts_df.head()

Unnamed: 0,title,text,author,displayName,timestamp,external_url,uuid,page_title,meta_description
0,"Miley Cyrus says she has Reinke's edema, a dis...",Miley Cyrus has opened up about her experience...,cbsnews.com,CBS News,2025-05-28 01:00:00.970000+00:00,https://cbsn.ws/43zE5p6,at://did:plc:3bxtpdpr73tf7tldv5q4oyqc/app.bsky...,"Miley Cyrus says she has Reinke's edema, a dis...",Miley Cyrus has opened up about her experience...
1,New student visas paused as State Dept. plans ...,The State Department on Tuesday suspended fore...,washingtonpost.com,The Washington Post,2025-05-28 00:52:20.780000+00:00,https://www.washingtonpost.com/education/2025/...,at://did:plc:k5nskatzhyxersjilvtnz4lh/app.bsky...,,
2,SpaceX megarocket gets farther in test than la...,UPDATE: SpaceX's Starship travels farther in i...,cnn.com,CNN,2025-05-28 00:52:04.026000+00:00,https://www.cnn.com/science/live-news/spacex-s...,at://did:plc:dzezcmpb3fhcpns4n4xm4ur5/app.bsky...,SpaceX launches Starship test flight 9: Live u...,SpaceX has launched a ninth uncrewed test flig...
3,"Trump touts free ""Golden Dome"" for Canada, as ...",Canadian PM Mark Carney told CBC today he want...,axios.com,Axios,2025-05-28 00:51:43.197000+00:00,https://www.axios.com/2025/05/28/trump-canada-...,at://did:plc:f6avy7jkujdhusski5n64joj/app.bsky...,,
4,"Former Times reporter sues Villanueva, L.A Cou...","Former Times reporter sues Villanueva, L.A Cou...",latimes.com,Los Angeles Times,2025-05-28 00:51:03.498000+00:00,https://www.latimes.com/california/story/2025-...,at://did:plc:d2jith367s6ybc3ldsusgdae/app.bsky...,"Former Times reporter sues Villanueva, L.A Cou...",Former Times reporter Maya Lau has filed a law...


In [91]:
all_markets_df

Unnamed: 0,title,sub_title,market_rules_primary,market_rules_secondary,market_ticker,market_event_ticker,market_market_type,market_title,market_subtitle,market_yes_sub_title,...,event_series_ticker,event_title,event_ticker,series_ticker,collateral_return_type,mutually_exclusive,category,markets,strike_date,strike_period
0,How many Starship launches will reach space th...,In 2025,If above 24 Starship launches reach Space in 2...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-24,KXSTARSHIPSPACE-25,binary,,,25 or above,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
1,How many Starship launches will reach space th...,In 2025,If between 7 to 9 Starship launches reach Spac...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-8.0,KXSTARSHIPSPACE-25,binary,,,7 to 9,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
2,How many Starship launches will reach space th...,In 2025,If between 4 to 6 Starship launches reach Spac...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-5.0,KXSTARSHIPSPACE-25,binary,,,4 to 6,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
3,How many Starship launches will reach space th...,In 2025,If between 22 to 24 Starship launches reach Sp...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-23.0,KXSTARSHIPSPACE-25,binary,,,22 to 24,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
4,How many Starship launches will reach space th...,In 2025,If between 19 to 21 Starship launches reach Sp...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-20.0,KXSTARSHIPSPACE-25,binary,,,19 to 21,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
5,How many Starship launches will reach space th...,In 2025,If between 16 to 18 Starship launches reach Sp...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-17.0,KXSTARSHIPSPACE-25,binary,,,16 to 18,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
6,How many Starship launches will reach space th...,In 2025,If between 13 to 15 Starship launches reach Sp...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-14.0,KXSTARSHIPSPACE-25,binary,,,13 to 15,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
7,How many Starship launches will reach space th...,In 2025,If between 10 to 12 Starship launches reach Sp...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-11.0,KXSTARSHIPSPACE-25,binary,,,10 to 12,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,
8,How many Starship launches will reach space th...,In 2025,If below 4 Starship launches reach Space in 20...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-4,KXSTARSHIPSPACE-25,binary,,,3 or below,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,


In [92]:
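# Keep one row per event; copy and reset the index so later column assignment and positional .loc[j] lookups work
markets_df = all_markets_df.drop_duplicates(subset='market_event_ticker').copy().reset_index(drop=True)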
markets_df

Unnamed: 0,title,sub_title,market_rules_primary,market_rules_secondary,market_ticker,market_event_ticker,market_market_type,market_title,market_subtitle,market_yes_sub_title,...,event_series_ticker,event_title,event_ticker,series_ticker,collateral_return_type,mutually_exclusive,category,markets,strike_date,strike_period
0,How many Starship launches will reach space th...,In 2025,If above 24 Starship launches reach Space in 2...,A Starship will be considered to have reached ...,KXSTARSHIPSPACE-25-24,KXSTARSHIPSPACE-25,binary,,,25 or above,...,KXSTARSHIPSPACE,How many Starship launches will reach space th...,KXSTARSHIPSPACE-25,KXSTARSHIPSPACE,MECNET,True,Science and Technology,"[{'ticker': 'KXSTARSHIPSPACE-25-24', 'event_ti...",,


Next, we combine the text fields of each record into a single string. This string is fed into the embedding model so that the resulting embedding captures the content of all the fields together.

In [93]:
# Combine relevant text fields for each Bluesky post into one string
posts_df['combined_text'] = (
    posts_df['title'].astype(str).str.strip() + " " +
    posts_df['text'].astype(str).str.strip() + " " +
    posts_df['page_title'].astype(str).str.strip() + " " +
    posts_df['meta_description'].astype(str).str.strip()
)

# Combine relevant text fields for each Kalshi market into one string
markets_df['combined_text'] = (
    markets_df['title'].astype(str).str.strip() + " " +
    markets_df['sub_title'].astype(str).str.strip() + " " +
    markets_df['market_rules_primary'].astype(str).str.strip() + " " +
    markets_df['market_rules_secondary'].astype(str).str.strip()
)

# Create Python lists of the combined text for each dataset (for easy iteration or embedding input)
posts_texts = posts_df['combined_text'].tolist()
markets_texts = markets_df['combined_text'].tolist()

# (Optional) Show an example of combined text from each dataset
print("Example combined Bluesky post text:\n", posts_texts[0][:100], "...\n")
print("Example combined Kalshi market text:\n", markets_texts[0][:100], "...")

Example combined Bluesky post text:
 Miley Cyrus says she has Reinke's edema, a disorder that makes her voice "super unique" Miley Cyrus  ...

Example combined Kalshi market text:
 How many Starship launches will reach space this year? In 2025 If above 24 Starship launches reach S ...
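
The concatenation above inserts a separator space even when a field is empty, so records with missing fields end up with doubled or trailing spaces. As a minimal sketch of one alternative, the hypothetical helper `combine_text_fields` below (not part of the original notebook) joins only the non-empty fields. The sample rows are made up for illustration.

```python
import pandas as pd

def combine_text_fields(df: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Join the given text columns row by row, skipping empty values."""
    # Treat missing values as empty strings and trim surrounding whitespace
    cleaned = df[columns].fillna("").astype(str).apply(lambda col: col.str.strip())
    # Keep only non-empty parts so the result has no doubled or trailing spaces
    return cleaned.apply(lambda row: " ".join(part for part in row if part), axis=1)

# Hypothetical two-row sample using the same column names as the posts data
sample_posts = pd.DataFrame({
    "title": ["Starship launch ", "Visa pause"],
    "text": ["SpaceX launched Starship.", "New student visas paused."],
    "page_title": ["", None],
    "meta_description": ["Ninth test flight", ""],
})
combined = combine_text_fields(
    sample_posts, ["title", "text", "page_title", "meta_description"]
)
print(combined.tolist())
```

The same helper could be applied to the market columns (`title`, `sub_title`, `market_rules_primary`, `market_rules_secondary`). Whether the cleaner strings change the embeddings noticeably is not something this sketch establishes.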




In the following cell, we convert each combined text string into a semantic embedding vector. The result is one embedding matrix per input, with one row per post and one row per market.

In [94]:
# Load a pretrained sentence transformer model for embeddings
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

# Encode the combined texts to get embeddings
# show_progress_bar=True will display a progress bar for the encoding process
post_embeddings = model.encode(posts_texts, show_progress_bar=True)
market_embeddings = model.encode(markets_texts, show_progress_bar=True)

print("Embeddings generated:")
print(" - Bluesky posts embeddings shape:", post_embeddings.shape)
print(" - Kalshi markets embeddings shape:", market_embeddings.shape)

Embeddings generated:
 - Bluesky posts embeddings shape: (310, 384)
 - Kalshi markets embeddings shape: (1, 384)


The following cell sets the similarity threshold and the similarity method.
- Similarity Threshold: A floating-point value (0.3 in the cell below) at or above which we consider a post and a market to be a “match”. Raise it for stricter filtering or lower it for looser filtering.
- Similarity Method: We can choose between cosine similarity and dot-product similarity.
    - Cosine similarity measures the cosine of the angle between two vectors and ranges from -1 to 1. It compares the direction of the vectors and ignores their magnitude.
    - Dot-product similarity multiplies the vector components pairwise and sums the results, so it reflects both direction and magnitude. (If the embeddings are not normalized, a larger-magnitude vector can yield a higher dot product even if the semantic content isn’t more similar.)

In many text applications, cosine similarity is preferred because it factors out vector magnitude and focuses on semantic alignment. The dot product can be useful in certain cases, but its scale is different: it grows with vector magnitude.

If using cosine, we compute similarity as (A·B) / (|A|*|B|); if using dot, we compute just A·B, which is unbounded and can exceed 1 when the vectors are not unit length. The threshold applies to whichever metric is chosen: for cosine, 0.6 is a moderate similarity, while for a raw dot product you may need to adjust the threshold based on typical magnitudes.

Note: Cosine similarity is the dot product of the vectors after normalizing them to unit length. It yields 1 for vectors that point in the same direction and 0 for orthogonal vectors, which typically indicates unrelated content. The dot product, on the other hand, is higher for larger-magnitude vectors even if the content is the same, so use it with care (or normalize the embeddings first if you want it to be comparable to cosine).
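
To make the difference concrete, here is a small sketch using toy vectors rather than real embeddings. The helper name `cosine_similarity` and the vectors `a`, `b`, `c` are illustrative assumptions. It shows that scaling a vector changes the dot product but not the cosine, and that the two agree once both vectors are unit length.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0:
        return 0.0
    return float(np.dot(a, b) / norm_product)

# Toy vectors (not real embeddings): b points the same way as a but is 3x longer
a = np.array([1.0, 2.0, 2.0])
b = 3.0 * a
c = np.array([2.0, -1.0, 0.0])  # orthogonal to a

print("dot(a, b)    =", float(np.dot(a, b)))      # grows with magnitude
print("cosine(a, b) =", cosine_similarity(a, b))  # depends on direction only
print("cosine(a, c) =", cosine_similarity(a, c))  # orthogonal vectors

# After scaling both vectors to unit length, dot product and cosine agree
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print("dot of unit vectors =", float(np.dot(a_unit, b_unit)))
```

In practice this means a threshold tuned for cosine similarity only carries over to the dot product if the embeddings are already unit length.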


In [95]:
# User-configurable parameters
SIMILARITY_THRESHOLD = 0.3      # default threshold for considering a match (for cosine similarity)
SIMILARITY_METHOD = 'cosine'    # 'cosine' or 'dot'

# Validate similarity method choice
SIMILARITY_METHOD = SIMILARITY_METHOD.lower()
if SIMILARITY_METHOD not in ('cosine', 'dot'):
    raise ValueError("SIMILARITY_METHOD must be 'cosine' or 'dot'")

print(f"Using similarity method: {SIMILARITY_METHOD}")
print(f"Similarity threshold: {SIMILARITY_THRESHOLD}")

Using similarity method: cosine
Similarity threshold: 0.3


Next, we compute the pairwise semantic similarities between our markets and our news posts. The code iterates over every post and every market and computes a similarity score for each post-market pair.

For each pair, the code:
- Computes the similarity score (cosine or dot).
- If the score meets or exceeds our threshold, it records this as a potential match.

It does this by:
- Looping over each post (outer loop) and each market (inner loop).
  - For cosine similarity, we compute the norm of the post vector (|A|) and use precomputed norms for market vectors (|B|). The similarity `sim` is computed as the dot product divided by the product of norms. If either vector has zero length (which could happen if a text was empty, resulting in a zero embedding), we define the similarity as 0 to be safe.
  - For dot product similarity, we just compute the dot product `np.dot(post_vec, market_vec)` directly.
- If the similarity meets the threshold, we append a dictionary to `matches` containing the similarity score (rounded to 4 decimal places for readability) and the corresponding post title/text and market title/subtitle. We use `posts_df.loc[i, '...']` to retrieve the original text fields for the i-th post (and similarly `markets_df.loc[j, '...']` for the j-th market) to include in the results.
- After looping, we convert the list of matches to a DataFrame `matches_df` for convenient viewing and sort it by similarity score in descending order so that the highest-similarity pairs come first.
- Finally, we print the total number of matches found and display the resulting matches (if any).

In [96]:
# todo: if the dataset is very large, the double loop approach could be slow. For a more optimized approach, one might compute all pairwise similarities in a vectorized manner or use approximate nearest neighbor search.

matches = []  # to store matches above threshold

# If using cosine, precompute norms of all market embeddings (to avoid repetitive computation)
if SIMILARITY_METHOD == 'cosine':
    market_norms = np.linalg.norm(market_embeddings, axis=1)

# Iterate over each Bluesky post embedding with a progress bar
for i, post_vec in enumerate(tqdm(post_embeddings, desc="Bluesky Posts")):
    # Compute norm of the post vector once (for cosine similarity)
    if SIMILARITY_METHOD == 'cosine':
        post_norm = np.linalg.norm(post_vec)
    for j, market_vec in enumerate(market_embeddings):
        if SIMILARITY_METHOD == 'cosine':
            # Compute cosine similarity = (A · B) / (|A| * |B|)
            # Handle zero-norm cases to avoid division by zero
            if post_norm == 0 or market_norms[j] == 0:
                sim = 0.0
            else:
                sim = np.dot(post_vec, market_vec) / (post_norm * market_norms[j])
        else:  # dot-product similarity
            sim = np.dot(post_vec, market_vec)
        # Check if the similarity is above the threshold
        if sim >= SIMILARITY_THRESHOLD:
            # Record the match with relevant details
            matches.append({
                'similarity_score': round(float(sim), 4),  # round for readability
                'post_title': posts_df.loc[i, 'title'],
                'post_text': posts_df.loc[i, 'text'],
                'market_title': markets_df.loc[j, 'title'],
                'market_subtitle': markets_df.loc[j, 'sub_title']
            })

Bluesky Posts: 100%|██████████| 310/310 [00:00<00:00, 87990.41it/s]
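
The double loop above is fast for a few hundred posts, but the TODO notes it may not scale. One possible vectorized form, sketched under assumptions: the function name `pairwise_similarity` and the toy arrays are hypothetical stand-ins for `post_embeddings` and `market_embeddings`, and 0.3 mirrors `SIMILARITY_THRESHOLD`. It computes all pairwise scores with a single matrix product and keeps the same zero-norm convention (similarity 0).

```python
import numpy as np

def pairwise_similarity(posts: np.ndarray, markets: np.ndarray,
                        method: str = "cosine") -> np.ndarray:
    """Return a (num_posts, num_markets) matrix of similarity scores."""
    if method not in ("cosine", "dot"):
        raise ValueError("method must be 'cosine' or 'dot'")
    posts = np.asarray(posts, dtype=float)
    markets = np.asarray(markets, dtype=float)
    if method == "cosine":
        # Scale each row to unit length; all-zero rows stay zero, so they score 0
        post_norms = np.linalg.norm(posts, axis=1, keepdims=True)
        market_norms = np.linalg.norm(markets, axis=1, keepdims=True)
        posts = np.divide(posts, post_norms,
                          out=np.zeros_like(posts), where=post_norms != 0)
        markets = np.divide(markets, market_norms,
                            out=np.zeros_like(markets), where=market_norms != 0)
    # One matrix product yields every post-market dot product at once
    return posts @ markets.T

# Toy stand-ins for the embedding matrices (3 posts, 1 market, 2 dimensions)
toy_posts = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
toy_markets = np.array([[2.0, 0.0]])

scores = pairwise_similarity(toy_posts, toy_markets, method="cosine")
# Row/column indices of every pair at or above the threshold
post_idx, market_idx = np.nonzero(scores >= 0.3)
matched_pairs = list(zip(post_idx.tolist(), market_idx.tolist()))
print(scores.round(4))
print(matched_pairs)
```

The index pairs in `matched_pairs` play the role of `(i, j)` in the loop, so the same `posts_df.loc[i, ...]` and `markets_df.loc[j, ...]` lookups could build the match records. For much larger collections, an approximate nearest-neighbor index would be the next step, as the TODO suggests.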


In [97]:
# Convert the matches list into a DataFrame for easy viewing/manipulation
matches_df = pd.DataFrame(matches)
print(f"\nFound {len(matches_df)} matches with similarity >= {SIMILARITY_THRESHOLD}")
# If any matches were found, sort them so the highest similarity comes first
if not matches_df.empty:
    # Sorting is skipped for an empty result, which has no 'similarity_score' column
    matches_df.sort_values('similarity_score', ascending=False, inplace=True)
matches_df


Found 13 matches with similarity >= 0.3


Unnamed: 0,similarity_score,post_title,post_text,market_title,market_subtitle
4,0.4954,SpaceX launches another Starship mega rocket i...,SpaceX has launched its Starship mega rocket a...,How many Starship launches will reach space th...,In 2025
11,0.4759,SpaceX’s Starship test flight loses control 30...,The ninth launch of Elon Musk’s futuristic Spa...,How many Starship launches will reach space th...,In 2025
1,0.4394,SpaceX is poised to launch the ninth test flig...,SpaceX set to launch another test of the most ...,How many Starship launches will reach space th...,In 2025
8,0.4274,SpaceX’s Starship test flight loses control 30...,"SpaceX’s ninth Starship test flight, led by El...",How many Starship launches will reach space th...,In 2025
3,0.4203,SpaceX Starship launch live updates: What time...,Live updates: SpaceX is aiming to launch Stars...,How many Starship launches will reach space th...,In 2025
0,0.414,SpaceX megarocket gets farther in test than la...,UPDATE: SpaceX's Starship travels farther in i...,How many Starship launches will reach space th...,In 2025
2,0.4027,SpaceX launching Super Heavy-Starship on 9th t...,SpaceX launching Super Heavy-Starship on 9th t...,How many Starship launches will reach space th...,In 2025
7,0.3862,What happened on last Starship flight? SpaceX ...,SpaceX has released the findings of its invest...,How many Starship launches will reach space th...,In 2025
12,0.3771,SpaceX loses another Starship on test flight a...,"After launching into space, SpaceX loses anoth...",How many Starship launches will reach space th...,In 2025
6,0.3586,SpaceX Starship Launches on Test Flight After ...,LATEST: SpaceX’s colossal Starship rocket suff...,How many Starship launches will reach space th...,In 2025
