# Matching Bets Using Efficient Text Embeddings with Instructions
Simplifying Semantic Search and Similarity Tasks.


*Resources:*
- https://huggingface.co/dunzhang/stella_en_1.5B_v5
- https://www.sbert.net/examples/applications/retrieve_rerank/README.html
- https://huggingface.co/spaces/mteb/leaderboard
- https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb

# Requirements

In [None]:
!pip install sentence_transformers flash_attn gdown  -q
!gdown 1-HlaTL7Xlrm00cFu5QZADVUxuJvuASFA
!gdown 11WasQroWXaagm7bJ6ti27_VNnfZyN68d


# Example of similarity matching

In [None]:
from sentence_transformers import SentenceTransformer

# This model supports two prompts: "s2p_query" and "s2s_query" for sentence-to-passage and sentence-to-sentence tasks, respectively.
# They are defined in `config_sentence_transformers.json`
query_prompt_name = "s2p_query"
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
# docs do not need any prompts
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# Note: the default embedding dimension is 1024. For a different dimension, clone the model and edit `modules.json`, replacing `2_Dense_1024` with another size, e.g. `2_Dense_256` or `2_Dense_8192`.
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
query_embeddings = model.encode(queries, prompt_name=query_prompt_name)
doc_embeddings = model.encode(docs)
print(query_embeddings.shape, doc_embeddings.shape)
# (2, 1024) (2, 1024)

similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
# tensor([[0.8179, 0.2958],
#         [0.3194, 0.7854]])


(2, 1024) (2, 1024)
tensor([[0.8179, 0.2958],
        [0.3194, 0.7854]])
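
The score matrix above has one row per query and one column per document. As a rough illustration of what such a matrix contains, the sketch below computes cosine similarity with plain NumPy on toy vectors. It assumes the model uses the Sentence Transformers default similarity function (cosine); `cosine_similarity_matrix` and the toy arrays are hypothetical names introduced only for this example.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return the (len(a), len(b)) matrix of cosine similarities between rows."""
    # Scale every row to unit length, then a dot product equals the cosine.
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
toy_queries = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
toy_docs = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 3.0]])

toy_scores = cosine_similarity_matrix(toy_queries, toy_docs)
print(toy_scores)
```

Entry `[i, j]` compares query `i` with document `j`, so the best match for each query is the largest value in its row.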


# Matching bets with embeddings

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
pd.options.display.max_columns = 200

# Models to use: dunzhang/stella_en_1.5B_v5, dunzhang/stella_en_400M_v5
model = SentenceTransformer("dunzhang/stella_en_1.5B_v5", trust_remote_code=True).cuda()
# Load data
kalshi = pd.read_json("kalshi_markets.json")
polymarket_markets = pd.read_json("polymarket_markets.json")

# Create a text column used for retrieval. close_time is cast element-wise:
# str(series) would paste the repr of the whole Series into every row.
kalshi["bet_description"] = kalshi["title"] + " " + kalshi["subtitle"] + "\n" + kalshi["rules_primary"] + "\nEnd date: " + kalshi["close_time"].astype(str)
polymarket_markets["bet_description"] = polymarket_markets["question"] + "\n" + polymarket_markets["description"] + "\nEnd date: " + polymarket_markets["end_date_iso"]
polymarket_subset = polymarket_markets.dropna(subset=["bet_description"])
# Keep one Kalshi market per event, without in-place edits on a derived frame.
kalshi_subset = kalshi.dropna(subset=["bet_description"]).drop_duplicates(subset=["event_ticker"])
print("Polymarkets", len(polymarket_subset))
print("Kalshi markets", len(kalshi_subset))


Polymarkets 10009
Kalshi markets 13599
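
When a date column is concatenated into the description text, the conversion to string has to happen per row. The sketch below contrasts `str(series)` with `series.astype(str)` on a made-up two-row frame; `toy_markets` and its values are hypothetical and only illustrate the difference.

```python
import pandas as pd

# Hypothetical two-row frame standing in for a market table.
toy_markets = pd.DataFrame(
    {
        "title": ["Will it rain?", "Will it snow?"],
        "close_time": pd.to_datetime(["2024-09-01", "2024-12-31"]),
    }
)

# str() on a Series returns ONE string describing the whole Series
# (index, all values, dtype), so every row would receive the same text.
whole_series_text = str(toy_markets["close_time"])

# .astype(str) converts each value separately, so each row keeps its own date.
toy_markets["bet_description"] = (
    toy_markets["title"] + "\nEnd date: " + toy_markets["close_time"].astype(str)
)
print(toy_markets["bet_description"].tolist())
```

With the per-row cast, each description carries only its own end date, which matters here because the retrieval instruction asks the model to match on date.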


In [None]:
polymarket_subset.iloc[9307]

Unnamed: 0,10759
enable_order_book,True
active,True
closed,False
archived,False
accepting_orders,True
accepting_order_timestamp,2024-08-02T21:09:42Z
minimum_order_size,5.0
minimum_tick_size,0.01
condition_id,0x287691ef015e94db805a833d9a51101fd2a8199f6b59...
question_id,0x06656db52a0bf3b67f78e676f19efe04be7dafb27d6d...


# Start inference

In [None]:
!mkdir -p embeddings/

In [None]:
doc_embeddings = model.encode(polymarket_subset["bet_description"].tolist())
# save
np.save("embeddings/poly_doc_embeddings.npy", doc_embeddings)
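
Saving the embeddings means the expensive encoding step does not need to be repeated in a later session. A minimal sketch of the save and reload round trip, using a small random array and a temporary directory instead of the real files (all names below are illustrative):

```python
import os
import tempfile

import numpy as np

# Small random array standing in for real embeddings (illustrative only).
toy_embeddings = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_embeddings.npy")
    np.save(path, toy_embeddings)  # writes shape, dtype and data
    reloaded = np.load(path)       # reads the full array back into memory

print(reloaded.shape, reloaded.dtype)
```

The `.npy` format stores shape and dtype alongside the data, so the reloaded array can be passed to the similarity step directly.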

In [None]:
query_embeddings = model.encode(
    ("""Instruct: Given a prediction market event, retrieve the exact matching prediction based on date and rules.\nQuery:  """ + kalshi_subset["bet_description"]
     ).tolist())
# save
np.save("embeddings/kalshi_query_embeddings.npy", query_embeddings)
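
The query side is encoded with an instruction prefix of the form `Instruct: ...\nQuery: ...`, while the documents are encoded without one. The sketch below isolates that string construction in a small helper so the prefix can be inspected before encoding. `add_instruction` is a hypothetical helper, not part of Sentence Transformers, and the example query is made up.

```python
def add_instruction(texts, instruction):
    """Prefix every text with an 'Instruct: ... / Query: ...' header (hypothetical helper)."""
    prefix = f"Instruct: {instruction}\nQuery: "
    return [prefix + text for text in texts]


toy_instruction = (
    "Given a prediction market event, retrieve the exact matching "
    "prediction based on date and rules."
)
toy_queries = ["Will the Fed cut rates in September?"]  # made-up example query

prompted = add_instruction(toy_queries, toy_instruction)
print(prompted[0])
```

Recent Sentence Transformers versions also accept a `prompt` string argument in `encode`, which may be a tidier way to pass the same prefix.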

In [None]:
similarities = model.similarity(query_embeddings, doc_embeddings)
# Keep only the 5 best Polymarket candidates per Kalshi market (scores and indices).
top_5_prob, top_5 = similarities.topk(5, dim=1)
top_5_prob[:15]

tensor([[0.7805, 0.7726, 0.7697, 0.7685, 0.7664],
        [0.7132, 0.7088, 0.7030, 0.7019, 0.6997],
        [0.7118, 0.7109, 0.7103, 0.7101, 0.7093],
        [0.7066, 0.7047, 0.7035, 0.7010, 0.7005],
        [0.7724, 0.7354, 0.7346, 0.7286, 0.7190],
        [0.6693, 0.6653, 0.6652, 0.6643, 0.6637],
        [0.7195, 0.7153, 0.7144, 0.7104, 0.7087],
        [0.7579, 0.7491, 0.7465, 0.7464, 0.7461],
        [0.7493, 0.7205, 0.7199, 0.7199, 0.7190],
        [0.7338, 0.7301, 0.7185, 0.7097, 0.7083],
        [0.7400, 0.6555, 0.6513, 0.6507, 0.6477],
        [0.6814, 0.6795, 0.6652, 0.6645, 0.6641],
        [0.8501, 0.8497, 0.8472, 0.8455, 0.8447],
        [0.7316, 0.7239, 0.7099, 0.7023, 0.6967],
        [0.8415, 0.8299, 0.8032, 0.8030, 0.8026]])
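
Each row above lists the five highest scores for one query, best first. The same selection can be sketched on a toy score matrix with NumPy, including the lookup from column indices back to document labels. `top_k_matches`, `toy_scores` and `toy_docs` are hypothetical names used only for illustration.

```python
import numpy as np


def top_k_matches(scores: np.ndarray, k: int):
    """Return (top_scores, top_indices), each of shape (n_queries, k), best first."""
    # Sorting the negated scores ascending gives descending similarity order.
    order = np.argsort(-scores, axis=1)[:, :k]
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order


toy_scores = np.array(
    [
        [0.10, 0.80, 0.30, 0.55],
        [0.70, 0.20, 0.65, 0.05],
    ]
)
toy_docs = np.array(["doc A", "doc B", "doc C", "doc D"])  # placeholder labels

top_scores, top_idx = top_k_matches(toy_scores, k=2)
top_labels = toy_docs[top_idx]  # integer-array indexing maps indices to labels
print(top_scores)
print(top_labels)
```

The gap between the first and second score in a row gives a rough sense of how unambiguous a match is: a row of nearly equal scores suggests several similar candidates rather than one clear counterpart.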

In [None]:
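# Map each retrieved index to its Polymarket question (indices follow polymarket_subset row order).
poly_questions = polymarket_subset["question"].to_numpy()
search = pd.DataFrame(poly_questions[top_5[:, :5].cpu().numpy()])
# The "question" column holds the Kalshi title that was used as the query.
search["question"] = kalshi_subset["title"].tolist()
search.head(20)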

Unnamed: 0,0,1,2,3,4,question
0,US adds more than 300k jobs in August?,US adds between 250k and 300k jobs in August?,US adds between 200k and 250k jobs in August?,US adds between 150k and 200k jobs in August?,US adds between 100k and 150k jobs in August?,"Initial jobless claims from Aug 22-28, 2021?"
1,Will the European Parliament pass the AI Act b...,Will Grune receive more than 14% of votes?,Will the EPP win 150-159 seats in the European...,Will the EPP win 200 or more seats in the Euro...,Will the EPP win 180-189 seats in the European...,EU meets its 2030 climate goals?
2,Will the NDA win by 5%-0%?,Will the NDA win by 20%-15%?,Will the NDA win by 15%-10%?,Indian Election: Modi reelected?,Will the NDA win less than 300 seats?,India meets its 2030 climate goals?
3,Will the final global heat increase be 1.05 or...,Will the final global heat increase be 1.08 or...,Will the final global heat increase be 1.13 or...,Will the final global heat increase be 1.18 or...,Will the final global heat increase be 1.15 or...,US meets its climate goals?
4,OpenAI announces it has achieved AGI in 2024?,Will AI be the 2023 TIME Person of the Year?,Will an AI win the $5 million AI Math Olympiad...,Will Sam Altman testify before congress by May...,OpenAI renamed to ClosedAI in March?,AI passes Turing test before 2030?
5,"Will BTC hit $50,000 by Jan 31?","ETH above $3,000 on March 1?","Will BTC hit $50,000 in 2023?","ETH above $2,000 at end of September?",Another Tesla recall before July?,EV market share in 2030?
6,US congress stock trading ban by June 30?,Congress passes bill banning TikTok by April 30?,US congress stock trading ban by August 31?,Supreme Court term limits in 2024?,Will PredictIt still allow trading through Apr...,Non-compete ban overturned?
7,New EU country recognizes Palestine before July?,Will Sweden join NATO by December 31?,Will NATO expand by March 31?,Will Sweden join NATO by January 31?,Will Sweden join NATO by March 31?,EU has a new member before 2030?
8,Andrew Tate flee the EU?,Will NATO expand by March 31?,New EU country recognizes Palestine before July?,UK election called by end of year?,Will the EPP win less than 150 seats in the Eu...,EU loses a member before 2030?
9,Will China invade Taiwan in 2023?,Will China invade Taiwan in 2024?,Will China unban Bitcoin in 2024?,Will China invade Taiwan in May?,US travel ban for China in 2023?,China overtakes USA’s economy by 2030?


Unnamed: 0,question
0,[Single Market] Will Ron DeSantis win the U.S....
1,[Single Market] Will Donald J. Trump win the U...
2,[Single Market] Will Nikki Haley win the U.S. ...
3,[Single Market] Will Joe Biden win the U.S. 20...
4,[Single Market] Will Kamala Harris win the U.S...
...,...
11512,"Will Kamala Harris say ""Jew"" during DNC speech?"
11513,"Will Kamala Harris say ""Unburdened"" during DNC..."
11517,"Will Kamala Harris say ""Not going back"" during..."
11523,"Will Kamala Harris say ""Monkeypox"" during DNC ..."
