# HyDE Table Retrieval

Hypothetical Document Embeddings (HyDE) applied to table search:
1. Generate a natural-language description of each table with an LLM
2. Generate a hypothetical table description for each query with an LLM
3. Encode both with all-MiniLM-L6-v2 and retrieve with FAISS

## Setup

In [None]:
import pandas as pd
import numpy as np
import json
import faiss
from sentence_transformers import SentenceTransformer
from collections import defaultdict
from tqdm import tqdm

from openai import OpenAI
from functools import reduce
import time

# Omitting api_key makes the client read the OPENAI_API_KEY environment variable instead
client = OpenAI(api_key="<insert key here>")




## Load Data

In [2]:
# Load tables
tables_df = pd.read_csv('data/wikitables_mini.csv')
print(f"Loaded {len(tables_df)} tables")
print(f"Columns: {list(tables_df.columns)}")
tables_df.head(2)

Loaded 2932 tables
Columns: ['table_id', 'page_title', 'section_title', 'table_caption', 'headers', 'sample_data']


Unnamed: 0,table_id,page_title,section_title,table_caption,headers,sample_data
0,table-0001-249,Auburn Tigers swimming and diving,Summer Olympic Games Beijing 2008,Summer Olympic Games Beijing 2008,"[""Athlete"", ""Nation"", ""Total"", ""Gold"", ""Silver...","[[""[Fr\u00e9d\u00e9rick_Bousquet|Fr\u00e9d\u00..."
1,table-0001-400,Bisphenol A,Low-dose exposure in animals,Low-dose exposure in animals,"[""Dose (\u00b5g/kg/day)"", ""[Environmental_Work...","[[""0.025"", ""\""Permanent changes to genital tra..."


## Generate Table Descriptions (HyDE)

Use an LLM to generate a natural-language description of each table from its metadata (caption, page and section titles, column headers, and a sample row).

In [None]:
def generate_table_description_with_llm(row, max_chars=256):
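    """
    Generate a short natural-language description of a table with an LLM.

    Builds a prompt from the row's caption, page title, section title,
    up to 10 column headers, and the first 5 cells of the first sample row,
    then truncates the model's answer to max_chars characters.
    """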
    
    # Prepare table metadata as prompt context
    metadata = []
    if pd.notna(row['table_caption']):
        metadata.append(f"Caption: {row['table_caption']}")
    if pd.notna(row['page_title']):
        metadata.append(f"Page: {row['page_title']}")
    if pd.notna(row['section_title']):
        metadata.append(f"Section: {row['section_title']}")
    
    try:
        headers = json.loads(row['headers'])
        if headers:
            metadata.append(f"Columns: {', '.join([str(h) for h in headers[:10]])}")
    except Exception:
        # Skip missing or malformed headers; unlike a bare except, Ctrl+C still works
        pass
    
    try:
        sample_data = json.loads(row['sample_data'])
        if sample_data and len(sample_data) > 0:
            sample_str = str(sample_data[0][:5])
            metadata.append(f"Sample: {sample_str}")
    except Exception:
        # Skip missing or malformed sample data; unlike a bare except, Ctrl+C still works
        pass
    
    metadata_str = '\n'.join(metadata)

    # Crude rate limiting: pause for one second before every API call
    sleep_timer = 1
    time.sleep(sleep_timer)

    system_msg = 'You are a helpful assistant.'

    user_msg = f"""
Given the following table metadata, write a natural 
description of what this table contains in 1-2 sentences.
Keep it under {max_chars} characters.\n\n{metadata_str}
    """

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        max_tokens=4000,
        temperature=0.0,
    )

    hypothetical_description = response.choices[0].message.content
    
    return hypothetical_description[:max_chars]

# Test
print("Example LLM-generated descriptions:")
print("=" * 80)
for i in range(3):
    desc = generate_table_description_with_llm(tables_df.iloc[i])
    print(f"\n{i+1}. {desc}")

Example LLM-generated descriptions:

1. The table lists Auburn Tigers swimmers and divers who participated in the 2008 Beijing Summer Olympics, detailing each athlete's nation, total medals won, and the count of gold, silver, and bronze medals across various events.

2. The table details the effects of low-dose Bisphenol A exposure in animals, listing the dose in µg/kg/day, observed effects, and the study year.

3. This table lists players from the Charlotte Bobcats' all-time roster, highlighting those selected in the 2004 NBA Expansion Draft and indicating which players are currently on the roster.


In [4]:
# Generate descriptions for all tables
print("Generating LLM descriptions for all tables...")
print("NOTE: This will make ~3K LLM API calls. Consider batch processing or caching.")
print()

table_descriptions = []
table_ids = []

for idx, row in tqdm(tables_df.iterrows(), total=len(tables_df)):
    table_descriptions.append(generate_table_description_with_llm(row))
    table_ids.append(row['table_id'])

print(f"Generated {len(table_descriptions)} descriptions")
print(f"Length stats - Mean: {np.mean([len(d) for d in table_descriptions]):.1f}, Max: {max([len(d) for d in table_descriptions])}")

Generating LLM descriptions for all tables...
NOTE: This will make ~3K LLM API calls. Consider batch processing or caching.



100%|██████████| 2932/2932 [2:17:44<00:00,  2.82s/it]  

Generated 2932 descriptions
Length stats - Mean: 173.8, Max: 256
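
The run above needed more than two hours for 2,932 sequential calls, so an interruption is costly. The following is a minimal sketch of the caching idea mentioned in the note, not the notebook's own code: `describe_with_cache`, the JSON cache layout, and the file name `descriptions_cache.json` are all hypothetical, and `generate_fn` stands in for `generate_table_description_with_llm`.

```python
import json
import os
import tempfile


def describe_with_cache(rows, generate_fn, cache_path):
    """Return {table_id: description}, calling generate_fn only for uncached ids.

    rows: iterable of dict-like rows that have a 'table_id' key.
    generate_fn: stand-in for the LLM call (hypothetical; any callable row -> str).
    cache_path: JSON file used as the cache (hypothetical name and format).
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)

    for row in rows:
        table_id = row["table_id"]
        if table_id in cache:
            continue  # already generated in an earlier run
        cache[table_id] = generate_fn(row)
        # Rewrite the cache after every new entry so a crash loses at most one call.
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return cache


# Toy demonstration with a fake generator that records its calls.
calls = []


def fake_generate(row):
    calls.append(row["table_id"])
    return f"Description of {row['table_id']}"


demo_rows = [{"table_id": "t1"}, {"table_id": "t2"}]
demo_path = os.path.join(tempfile.mkdtemp(), "descriptions_cache.json")

first = describe_with_cache(demo_rows, fake_generate, demo_path)
second = describe_with_cache(demo_rows, fake_generate, demo_path)  # served from cache
print(f"Generator was called {len(calls)} times for two passes")
```

In the notebook, the rows could come from `(row for _, row in tables_df.iterrows())`. Rewriting the whole file after each entry is quadratic in the number of tables, which is acceptable for a few thousand short strings but would need a different store for much larger collections.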





## Encode Tables

In [5]:
# Load encoder
model_name = 'all-MiniLM-L6-v2'
print(f"Loading {model_name}...")
encoder = SentenceTransformer(model_name)
print(f"Dimension: {encoder.get_sentence_embedding_dimension()}")

Loading all-MiniLM-L6-v2...
Dimension: 384


In [6]:
# Encode descriptions
print("Encoding...")
table_embeddings = encoder.encode(
    table_descriptions, batch_size=32, show_progress_bar=True,
    convert_to_numpy=True, normalize_embeddings=True
)
print(f"Shape: {table_embeddings.shape}")

Encoding...


Batches: 100%|██████████| 92/92 [00:06<00:00, 14.72it/s]

Shape: (2932, 384)





## Build FAISS Index

In [7]:
# Build FAISS index
print("Building index...")
index = faiss.IndexFlatIP(encoder.get_sentence_embedding_dimension())
index.add(table_embeddings.astype('float32'))
print(f"✓ Index built with {index.ntotal} tables")

Building index...
✓ Index built with 2932 tables


### Test Custom Queries

In [None]:
import re

# Test custom queries - change test_query and run
test_query = "olympic medals table"
top_k = 5

print(f"Query: '{test_query}'")
print("=" * 80)

sleep_timer = 1
time.sleep(sleep_timer)

system_msg = 'You are a helpful assistant.'

user_msg = f"""
Given the search query: '{test_query}', generate a description 
of what a relevant table would contain. Describe the table structure, 
columns, and type of data it would have. Keep it under 256 characters.

"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    max_tokens=4000,
    temperature=0.0,
)

hypothetical_description = response.choices[0].message.content


# Encode and search
test_emb = encoder.encode([hypothetical_description], convert_to_numpy=True, normalize_embeddings=True).astype('float32')
scores_test, indices_test = index.search(test_emb, top_k)

for rank in range(top_k):
    idx = indices_test[0][rank]
    table_id = table_ids[idx]
    row = tables_df[tables_df['table_id'] == table_id].iloc[0]
    
    print(f"\n{rank + 1}. {table_id} (score: {scores_test[0][rank]:.4f})")
    print(f"   Page: {row['page_title']}")
    print(f"   Caption: {row['table_caption']}")
    print(f"   Description: {table_descriptions[idx][:120]}...")
    print("-" * 80)

Query: 'olympic medals table'

1. table-0529-770 (score: 0.7789)
   Page: Lists of Olympic medalists
   Caption: Summer Olympic Games
   Description: This table provides information about the Summer Olympic Games, including the year, host city, number of medal events, t...
--------------------------------------------------------------------------------

2. table-0529-771 (score: 0.7686)
   Page: Lists of Olympic medalists
   Caption: Winter Olympic Games
   Description: This table provides information on the Winter Olympic Games, including the year, links to medal winners and medal tables...
--------------------------------------------------------------------------------

3. table-0952-496 (score: 0.7506)
   Page: List of multiple Olympic gold medalists at a single Games
   Caption: Timeline
   Description: This table lists athletes who won multiple Olympic gold medals at a single Games, detailing the number of golds, the yea...
---------------------------------------------------------
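
The query-side HyDE step (build a prompt, ask the LLM, use the answer as the text to embed) appears both here and in the evaluation loop. One possible way to factor it out is sketched below. The names `build_hyde_prompt` and `hyde_query_text` are hypothetical, `generate_fn` is an assumed callable that takes a prompt and returns text, and the prompt wording is joined without the notebook's line breaks, so model outputs may differ slightly.

```python
def build_hyde_prompt(query, max_chars=256):
    """Build the prompt asking the LLM to describe a table relevant to the query."""
    return (
        f"Given the search query: '{query}', generate a description "
        "of what a relevant table would contain. Describe the table structure, "
        f"columns, and type of data it would have. Keep it under {max_chars} characters."
    )


def hyde_query_text(query, generate_fn, max_chars=256):
    """Return the hypothetical table description that is embedded instead of the raw query.

    generate_fn: assumed callable prompt -> str. In the notebook it could wrap the
    existing client.chat.completions.create call and return the message content.
    """
    return generate_fn(build_hyde_prompt(query, max_chars))


# Stub generator so the sketch runs without an API key.
def stub_generate(prompt):
    return "A table of Olympic medal counts with columns Country, Gold, Silver, Bronze, Total."


text = hyde_query_text("olympic medals table", stub_generate)
print(text)
```

The returned text would then go through `encoder.encode(...)` and `index.search(...)` exactly as in the cell above.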

## Evaluation

### Load Queries and Relevance Judgments

In [9]:
# Load queries
queries = {}
with open('data/queries.txt', 'r') as f:
    for line in f:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            query_id, query_text = parts
            queries[query_id] = query_text

print(f"Loaded {len(queries)} queries")
print("Examples:", list(queries.items())[:3])

# Load qrels (relevance judgments)
qrels = defaultdict(dict)
with open('data/qrels.txt', 'r') as f:
    for line in f:
        parts = line.strip().split()
        if len(parts) >= 4:
            query_id, table_id, relevance = parts[0], parts[2], int(parts[3])
            qrels[query_id][table_id] = relevance

qrels = dict(qrels)
print(f"Loaded qrels for {len(qrels)} queries")

Loaded 60 queries
Examples: [('1', 'world interest rates table'), ('2', '2008 beijing olympics'), ('3', 'fast cars')]
Loaded qrels for 60 queries
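
The qrels loader reads fields 0, 2, and 3 of each whitespace-separated line, which appears to follow the common TREC qrels layout (query id, unused field, document id, relevance grade). The sketch below isolates that parsing in a hypothetical `parse_qrels` function and runs it on made-up sample lines; the ids `table-A`, `table-B`, and `table-C` are invented for illustration and do not come from `data/qrels.txt`.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines into {query_id: {table_id: relevance}}."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.strip().split()
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        # Field 1 is ignored, mirroring the loader above.
        query_id, table_id, relevance = parts[0], parts[2], int(parts[3])
        qrels[query_id][table_id] = relevance
    return dict(qrels)


# Made-up sample lines (not taken from the real qrels file).
sample = [
    "1 0 table-A 2",
    "1 0 table-B 0",
    "",
    "2 0 table-C 1",
]
parsed = parse_qrels(sample)
print(parsed)
```

Note that a grade of 0 is stored like any other judgment; the evaluation later treats only grades above 0 as relevant.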


---

### Generate Hypothetical Descriptions from Queries (HyDE)

In [10]:
query_ids = list(queries.keys())
query_texts = [queries[qid] for qid in query_ids]

print(f"Generating hypothetical table descriptions for {len(query_texts)} queries...")
print("NOTE: This will make ~60 LLM API calls.")
print()

hypothetical_descriptions = []

for query_text in tqdm(query_texts):
    sleep_timer = 1
    time.sleep(sleep_timer)

    system_msg = 'You are a helpful assistant.'

    user_msg = f"""
    Given the search query: '{query_text}', generate a description 
    of what a relevant table would contain. Describe the table structure, 
    columns, and type of data it would have. Keep it under 256 characters.

    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
        max_tokens=4000,
        temperature=0.0,
    )

    hypothetical_description = response.choices[0].message.content

    
    print(f"Hypothetical table description: {hypothetical_description}\n")

    hypothetical_descriptions.append(hypothetical_description)

print(f"Generated {len(hypothetical_descriptions)} hypothetical descriptions")
print(f"\nExamples:")
for i in range(3):
    print(f"  Query: {query_texts[i]}")
    print(f"  Hypothetical: {hypothetical_descriptions[i]}")
    print()

Generating hypothetical table descriptions for 60 queries...
NOTE: This will make ~60 LLM API calls.



  2%|▏         | 1/60 [00:02<02:03,  2.10s/it]

Hypothetical table description: A relevant table would list countries, their central bank interest rates, and the date of the last update. Columns: Country, Interest Rate (%), Last Updated. Data includes country names, numerical rates, and date stamps.



  3%|▎         | 2/60 [00:04<01:59,  2.06s/it]

Hypothetical table description: A relevant table would include columns for Event, Date, Venue, Gold Medalist, Silver Medalist, Bronze Medalist, and Country. It would contain data on the events held, winners, and their respective countries.



  5%|▌         | 3/60 [00:06<02:05,  2.21s/it]

Hypothetical table description: A relevant table for 'fast cars' would include columns like Model, Manufacturer, Top Speed (mph), 0-60 mph Time (seconds), Horsepower, and Price (USD). It would contain data on various high-performance car models and their specifications.



  7%|▋         | 4/60 [00:09<02:10,  2.33s/it]

Hypothetical table description: A relevant table for 'clothing sizes' would include columns for Size (e.g., XS, S, M, L, XL), Measurements (e.g., chest, waist, hip in inches/cm), Gender (e.g., Men, Women, Unisex), and Region (e.g., US, EU, UK). Data is text and numeric.



  8%|▊         | 5/60 [00:11<02:06,  2.30s/it]

Hypothetical table description: A table on 'phases of the moon' would include columns: Date (YYYY-MM-DD), Phase (New, Waxing Crescent, First Quarter, etc.), Illumination (%) and Visibility (Yes/No). It would contain data on the moon's phase and visibility for each date.



 10%|█         | 6/60 [00:13<01:59,  2.21s/it]

Hypothetical table description: A relevant table would list U.S. states in one column and their respective populations in another. It would include columns for "State" (text) and "Population" (numeric), with rows for each state, providing a clear overview of population distribution.



 12%|█▏        | 7/60 [00:15<01:54,  2.16s/it]

Hypothetical table description: The table lists Prime Ministers of the UK, with columns for Name, Term Start, Term End, Political Party, and Notable Achievements. Data includes names, dates, party affiliations, and key accomplishments during their tenure.



 13%|█▎        | 8/60 [00:17<01:52,  2.16s/it]

Hypothetical table description: A relevant table for 'iPod models' would include columns for Model Name, Release Year, Storage Capacity, Color Options, and Discontinued Year. It would contain text for names and colors, integers for years and capacity, and dates for discontinuation.



 15%|█▌        | 9/60 [00:19<01:51,  2.20s/it]

Hypothetical table description: A relevant table for 'bittorrent clients' would include columns like Name (text), Version (text), Platform (text), License (text), Features (text), and Website (URL). It would list client names, their versions, supported platforms, license types, key features, and official websites.



 17%|█▋        | 10/60 [00:21<01:47,  2.16s/it]

Hypothetical table description: A relevant table for 'Olympus digital SLRs' would include columns like Model, Megapixels, Sensor Type, ISO Range, Price, and Release Date. It would contain data on camera specifications, pricing, and launch details.



 18%|█▊        | 11/60 [00:23<01:44,  2.13s/it]

Hypothetical table description: A table on the sun's composition would include columns for Element, Percentage by Mass, and Percentage by Number. Data would list elements like Hydrogen and Helium, with their respective mass and number percentages.



 20%|██        | 12/60 [00:26<01:46,  2.22s/it]

Hypothetical table description: The table for 'running shoes' would include columns like Brand (text), Model (text), Price (decimal), Size (integer), Color (text), Material (text), Weight (decimal), Cushioning (text), and Rating (decimal). It stores detailed product info for comparison.



 22%|██▏       | 13/60 [00:28<01:48,  2.30s/it]

Hypothetical table description: A relevant table for 'fuel consumption' would include columns like Vehicle ID, Fuel Type, Distance Traveled (km), Fuel Consumed (liters), and Date. It would contain numerical data for distance and fuel, categorical data for fuel type, and date entries.



 23%|██▎       | 14/60 [00:31<01:46,  2.31s/it]

Hypothetical table description: A stock quote table includes columns for Ticker, Company Name, Last Price, Change, % Change, Volume, Market Cap, and P/E Ratio. It provides real-time data on stock performance, helping investors track market trends and make informed decisions.



 25%|██▌       | 15/60 [00:33<01:42,  2.28s/it]

Hypothetical table description: A relevant table would include columns: Rank (integer), Title (string), Year (integer), Gross Revenue (currency), and Studio (string). It would list movies by their box office earnings, providing a snapshot of top-grossing films.



 27%|██▋       | 16/60 [00:35<01:42,  2.33s/it]

Hypothetical table description: A relevant table for 'nutrition values' would include columns for Food Item, Serving Size, Calories, Protein (g), Carbohydrates (g), Fats (g), Fiber (g), and Vitamins/Minerals. Each row would list specific foods with their corresponding nutritional data.



 28%|██▊       | 17/60 [00:38<01:40,  2.33s/it]

Hypothetical table description: The table includes columns: "State" (state name), "Capital" (capital city), "Largest City" (largest city by population), and "Population" (population of the largest city). It provides a quick reference for comparing capitals and largest cities.



 30%|███       | 18/60 [00:40<01:33,  2.24s/it]

Hypothetical table description: A relevant table for "professional wrestlers" would include columns like Name (text), Ring Name (text), Weight (numeric), Height (numeric), Debut Year (numeric), Nationality (text), and Championships Won (numeric).



 32%|███▏      | 19/60 [00:42<01:29,  2.19s/it]

Hypothetical table description: A relevant table would include columns for Year, Revenue, Cost of Goods Sold, Gross Profit, Operating Expenses, Operating Income, Net Income, and Earnings Per Share, with rows detailing financial figures for each fiscal year.



 33%|███▎      | 20/60 [00:44<01:30,  2.27s/it]

Hypothetical table description: A relevant table for 'dog breeds' would include columns like Breed Name, Size, Coat Type, Temperament, Lifespan, and Origin. It would contain textual data describing each breed's characteristics, size category, coat description, typical behavior, average lifespan, and country of origin.



 35%|███▌      | 21/60 [00:47<01:39,  2.54s/it]

Hypothetical table description: The table would include columns like Model, Series, Type (electric/acoustic), Price, Features, and Availability. It would contain data on various Ibanez guitar models, their specifications, pricing, and stock status.



 37%|███▋      | 22/60 [00:50<01:35,  2.50s/it]

Hypothetical table description: A relevant table for 'used cellphones' would include columns like Model, Brand, Condition, Price, Storage, Color, Carrier, and Seller Location. It would contain data on phone specifications, pricing, and seller details to aid in purchasing decisions.



 38%|███▊      | 23/60 [00:52<01:30,  2.44s/it]

Hypothetical table description: A table on 'world religions' would include columns like Religion Name, Origin Country, Founding Year, Number of Adherents, Major Beliefs, and Sacred Texts. It would contain textual and numerical data, providing a concise overview of each religion.



 40%|████      | 24/60 [00:54<01:27,  2.42s/it]

Hypothetical table description: A relevant table for 'stocks' would include columns like Ticker, Company Name, Current Price, Market Cap, P/E Ratio, 52-Week High/Low, Volume, and Dividend Yield. It would contain numerical and text data, providing a snapshot of stock performance.



 42%|████▏     | 25/60 [00:57<01:22,  2.37s/it]

Hypothetical table description: A relevant table for 'academy awards' would include columns like Year, Category, Winner, Nominees, and Film. It would contain data such as the year of the award, the award category, the winner's name, other nominees, and the film title.



 43%|████▎     | 26/60 [00:59<01:19,  2.33s/it]

Hypothetical table description: The table lists 2008 Olympic gold medalists, with columns for Sport, Event, Athlete/Team, Country, and Medal Count. Data includes sport names, event details, athlete/team names, representing countries, and number of gold medals won.



 45%|████▌     | 27/60 [01:01<01:17,  2.34s/it]

Hypothetical table description: The table lists countries and their currencies. Columns include "Country" (name of the country), "Currency" (name of the currency), and "Currency Code" (ISO 4217 code). Data types are text for all columns.



 47%|████▋     | 28/60 [01:04<01:13,  2.30s/it]

Hypothetical table description: A relevant table for 'science discoveries' would include columns like "Discovery Name," "Scientist(s)," "Year," "Field," and "Impact." It would contain data on notable scientific breakthroughs, their discoverers, the year of discovery, the scientific field, and their significance.



 48%|████▊     | 29/60 [01:06<01:11,  2.32s/it]

Hypothetical table description: A relevant table for 'PGA leaderboard' would include columns for Position, Player Name, Country, Round Scores (R1, R2, R3, R4), Total Score, and Earnings. It would display player rankings, scores per round, cumulative scores, and prize money.



 50%|█████     | 30/60 [01:08<01:11,  2.37s/it]

Hypothetical table description: A relevant table for 'pain medications' would include columns for Medication Name, Type (e.g., NSAID, Opioid), Dosage Form (e.g., tablet, liquid), Typical Dosage, Side Effects, and Prescription Status. Data would be text and numerical values.



 52%|█████▏    | 31/60 [01:11<01:06,  2.30s/it]

Hypothetical table description: A relevant table would list football clubs and their cities. Columns: "Club Name" (text), "City" (text), "Country" (text), "Founded Year" (integer), "Stadium" (text). It provides club details and their geographical locations.



 53%|█████▎    | 32/60 [01:13<01:02,  2.22s/it]

Hypothetical table description: A relevant table would include columns for Food Item, Price, Serving Size, Calories, Nutrients, and Store/Location. It would contain data on various healthy foods, their costs, nutritional information, and where to purchase them.



 55%|█████▌    | 33/60 [01:15<00:59,  2.20s/it]

Hypothetical table description: The table lists world capitals and their top attractions. Columns include "Capital" (city name), "Country" (nation name), "Attraction" (site name), "Type" (e.g., museum, park), and "Description" (brief details). Data is text-based.



 57%|█████▋    | 34/60 [01:17<00:56,  2.19s/it]

Hypothetical table description: A relevant table would include columns for Disease Name, Mortality Rate (%), Year, Region, and Total Deaths. It would contain data on various diseases, their mortality rates, and the number of deaths by year and region.



 58%|█████▊    | 35/60 [01:20<00:59,  2.38s/it]

Hypothetical table description: A relevant table would include columns for Brand Name, Market Share (%), Region, and Year. It would contain data on various cigarette brands, their market share percentages, the regions they operate in, and the year of the data.



 60%|██████    | 36/60 [01:22<00:57,  2.39s/it]

Hypothetical table description: A relevant table would include columns for Year, Company, Market Share (%), and Region. It would contain data on Apple's market share compared to competitors, with percentages indicating their share in different regions over various years.



 62%|██████▏   | 37/60 [01:25<00:55,  2.43s/it]

Hypothetical table description: A relevant table would include columns for Food Item, Calories, Protein (g), Carbs (g), Fats (g), Vitamins, and Minerals. Each row would list a food item with its corresponding nutritional values and micronutrient content.



 63%|██████▎   | 38/60 [01:27<00:53,  2.42s/it]

Hypothetical table description: A relevant table would include columns like Hormone Name, Function, Source Gland, Target Organs, Effects, and Imbalance Symptoms. It would contain text data describing each hormone's role, origin, impact on the body, and symptoms of excess or deficiency.



 65%|██████▌   | 39/60 [01:30<00:51,  2.46s/it]

Hypothetical table description: A relevant table would list household chemicals, with columns for Chemical Name, Strength/Concentration, Usage, Safety Precautions, and Storage Instructions. Data includes text for names and usage, numerical values for strength, and text for safety and storage.



 67%|██████▋   | 40/60 [01:32<00:49,  2.48s/it]

Hypothetical table description: A relevant table would list lakes with columns for Lake Name, Altitude (meters), Location (Country/Region), Surface Area (sq km), and Depth (meters). It would contain data on each lake's elevation above sea level and geographical details.



 68%|██████▊   | 41/60 [01:34<00:45,  2.41s/it]

Hypothetical table description: A relevant table would list laptops with columns for Model, CPU Brand, CPU Model, Cores, Threads, Base Clock Speed (GHz), Turbo Clock Speed (GHz), and Price. It would contain data on CPU specifications and pricing for comparison.



 70%|███████   | 42/60 [01:37<00:42,  2.34s/it]

Hypothetical table description: The table lists Asian countries with columns for "Country Name," "Currency Name," and "Currency Code." It includes data like "Japan," "Yen," "JPY," providing a quick reference for each country's official currency and its ISO code.



 72%|███████▏  | 43/60 [01:39<00:39,  2.33s/it]

Hypothetical table description: A relevant table would list diseases and associated risk factors. Columns: Disease Name (text), Risk Factor (text), Risk Level (low/medium/high), Age Group (text), and Prevalence (%) (numeric). Data includes disease names, risk factors, and statistics.



 73%|███████▎  | 44/60 [01:41<00:37,  2.32s/it]

Hypothetical table description: A relevant table would list external drives with columns for Brand, Model, Capacity (GB/TB), Type (HDD/SSD), Interface (USB/Thunderbolt), Price, and Rating. It would contain text, numerical, and currency data for easy comparison.



 75%|███████▌  | 45/60 [01:43<00:34,  2.31s/it]

Hypothetical table description: The table lists baseball teams and their captains. Columns include "Team Name" (text), "Captain Name" (text), "Captain Since" (year), and "League" (text). It provides a quick reference to team leadership across leagues.



 77%|███████▋  | 46/60 [01:46<00:31,  2.26s/it]

Hypothetical table description: A relevant table would list Maryland counties with columns for "County Name," "Population," and "Year." It would contain data on each county's population figures for a specific year, allowing for easy comparison and analysis.



 78%|███████▊  | 47/60 [01:48<00:28,  2.22s/it]

Hypothetical table description: The table would list countries and their capitals. It would have two columns: "Country" and "Capital." Each row would contain the name of a country and its corresponding capital city, both as text data.



 80%|████████  | 48/60 [01:50<00:26,  2.22s/it]

Hypothetical table description: A relevant table would include columns for Disease Name, Incidence Rate, Year, Region, and Population. It would contain data on the frequency of diseases per 100,000 people, categorized by year and geographic area.



 82%|████████▏ | 49/60 [01:52<00:23,  2.17s/it]

Hypothetical table description: The table lists EU countries with columns: "Country" (name of the country), "Year Joined" (year they joined the EU). Data includes country names and corresponding years of accession.



 83%|████████▎ | 50/60 [01:55<00:24,  2.44s/it]

Hypothetical table description: The table lists Irish counties with columns: "County Name" (text), "Area (km²)" (numeric), and "Rank by Area" (numeric). It provides each county's name, its area in square kilometers, and its rank in size compared to other counties.



 85%|████████▌ | 51/60 [01:57<00:21,  2.39s/it]

Hypothetical table description: The table would list cereals with columns for Name, Serving Size (g), Calories, Protein (g), Carbs (g), Sugars (g), Fiber (g), and Fat (g). Each row provides nutritional values per serving, helping compare different cereals.



 87%|████████▋ | 52/60 [02:00<00:18,  2.33s/it]

Hypothetical table description: The table would include columns for ERP System Name, Vendor, Price Range, Licensing Model, Features, and User Reviews. It would contain data on system names, vendor details, cost estimates, licensing types, key features, and average user ratings.



 88%|████████▊ | 53/60 [02:03<00:17,  2.53s/it]

Hypothetical table description: A relevant table would include columns for "Breed", "Average Lifespan (years)", "Health Factors", and "Care Tips". It would contain data on various cat breeds, their typical lifespan, common health issues, and tips for extending their life.



 90%|█████████ | 54/60 [02:05<00:14,  2.38s/it]

Hypothetical table description: A relevant table would include columns: "Musical Title" (text), "Director" (text), "Year" (integer), "Composer" (text), and "Opening Date" (date). It would list Broadway musicals, their directors, and related details.



 92%|█████████▏| 55/60 [02:07<00:11,  2.33s/it]

Hypothetical table description: A relevant table would include columns for Infection Type, Symptoms, Treatment Options, Medication, Dosage, Duration, and Success Rate. It would contain data on various infections, their symptoms, recommended treatments, and effectiveness.



 93%|█████████▎| 56/60 [02:09<00:09,  2.44s/it]

Hypothetical table description: A relevant table for 'food type' would include columns like "Food ID" (integer), "Name" (text), "Category" (text, e.g., fruit, vegetable), "Calories" (integer), and "Nutritional Info" (text). It categorizes foods by type and provides basic nutritional data.



 95%|█████████▌| 57/60 [02:12<00:07,  2.40s/it]

Hypothetical table description: A relevant table would list board games with columns for "Game Name," "Min Players," "Max Players," and "Recommended Players." It would contain text for names and numerical data for player counts, helping users find games suitable for their group size.



 97%|█████████▋| 58/60 [02:15<00:05,  2.57s/it]

Hypothetical table description: A relevant table would include columns for Product Name, Review Score, Number of Reviews, Review Summary, and Review Date. It would contain data such as product names, average scores, total reviews, brief summaries, and dates of reviews.



 98%|█████████▊| 59/60 [02:17<00:02,  2.49s/it]

Hypothetical table description: The table would list constellations with columns for Constellation Name, Right Ascension, Declination, and Distance to Earth. Data includes constellation names, celestial coordinates, and proximity in light-years.



100%|██████████| 60/60 [02:19<00:00,  2.33s/it]

Hypothetical table description: A relevant table for 'games age' would include columns: Game Title (text), Release Year (integer), Age Rating (text), Genre (text), and Platform (text). It would list video games, their release years, age ratings, genres, and platforms.

Generated 60 hypothetical descriptions

Examples:
  Query: world interest rates table
  Hypothetical: A relevant table would list countries, their central bank interest rates, and the date of the last update. Columns: Country, Interest Rate (%), Last Updated. Data includes country names, numerical rates, and date stamps.

  Query: 2008 beijing olympics
  Hypothetical: A relevant table would include columns for Event, Date, Venue, Gold Medalist, Silver Medalist, Bronze Medalist, and Country. It would contain data on the events held, winners, and their respective countries.

  Query: fast cars
  Hypothetical: A relevant table for 'fast cars' would include columns like Model, Manufacturer, Top Speed (mph), 0-60 mph Time (se




### Encode Hypothetical Descriptions and Retrieve

In [11]:
# Encode hypothetical descriptions (NOT raw queries)
print(f"Encoding {len(hypothetical_descriptions)} hypothetical descriptions...")
query_embeddings = encoder.encode(
    hypothetical_descriptions, batch_size=32, show_progress_bar=True,
    convert_to_numpy=True, normalize_embeddings=True
)

# Search top 100
k = 100
print(f"Searching top-{k}...")
scores, indices = index.search(query_embeddings.astype('float32'), k)

results = {qid: [table_ids[idx] for idx in indices[i]] for i, qid in enumerate(query_ids)}
print(f"✓ Retrieved {len(results)} query results")

# Show examples
print("\n" + "=" * 80)
print("EXAMPLE RESULTS (HyDE)")
print("=" * 80)
for qid in list(queries.keys())[:3]:
    print(f"\nQuery {qid}: '{queries[qid]}'")
    print(f"Hypothetical: {hypothetical_descriptions[query_ids.index(qid)][:80]}...")
    for rank, tid in enumerate(results[qid][:3], 1):
        rel = qrels.get(qid, {}).get(tid, 0)
        score_val = scores[query_ids.index(qid)][rank-1]
        page = tables_df[tables_df['table_id'] == tid].iloc[0]['page_title']
        print(f"  {rank}. {tid} (score: {score_val:.3f}, rel: {rel}) - {page}")
print("=" * 80)

Encoding 60 hypothetical descriptions...


Batches: 100%|██████████| 2/2 [00:00<00:00,  3.33it/s]

Searching top-100...
✓ Retrieved 60 query results

EXAMPLE RESULTS (HyDE)

Query 1: 'world interest rates table'
Hypothetical: A relevant table would list countries, their central bank interest rates, and th...
  1. table-0552-511 (score: 0.607, rel: 0) - Single deposit
  2. table-0370-614 (score: 0.596, rel: 2) - Eurozone
  3. table-0610-21 (score: 0.543, rel: 0) - Currencies of the European Union

Query 2: '2008 beijing olympics'
Hypothetical: A relevant table would include columns for Event, Date, Venue, Gold Medalist, Si...
  1. table-0109-237 (score: 0.697, rel: 0) - Athletics at the 2007 All-Africa Games
  2. table-0831-590 (score: 0.682, rel: 0) - List of multi-sport events
  3. table-0713-460 (score: 0.681, rel: 0) - List of Olympic medalists in baseball

Query 3: 'fast cars'
Hypothetical: A relevant table for 'fast cars' would include columns like Model, Manufacturer,...
  1. table-1574-853 (score: 0.656, rel: 0) - Suzuki Carry
  2. table-0990-862 (score: 0.619, rel: 1) - Spee




### Calculate Metrics

In [12]:
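# Evaluation functions
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant table ids (a set) that appear in the top-k retrieved ids."""
    if len(relevant) == 0:
        return 0.0
    retrieved_at_k = set(retrieved[:k])
    return len(retrieved_at_k & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """nDCG@k with linear gain; `relevance` maps table id -> graded relevance."""
    if len(relevance) == 0:
        return 0.0
    # DCG of the actual ranking: unjudged tables count as relevance 0
    dcg = sum(relevance.get(retrieved[i], 0) / np.log2(i + 2) for i in range(min(k, len(retrieved))))
    # Ideal DCG: the judged tables sorted by relevance, best first
    ideal_rels = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0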
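
To make the two metrics concrete, here is a small worked example on made-up data. The functions are restated from the cell above (using `math.log2` instead of `np.log2` so the snippet runs on its own). The table ids and relevance grades are hypothetical and not taken from the dataset.

```python
import math

def recall_at_k(retrieved, relevant, k):
    if len(relevant) == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    if len(relevance) == 0:
        return 0.0
    dcg = sum(relevance.get(retrieved[i], 0) / math.log2(i + 2)
              for i in range(min(k, len(retrieved))))
    ideal_rels = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judgments for one query: t1 is highly relevant, t2 somewhat, t3 not.
relevance = {"t1": 2, "t2": 1, "t3": 0}
relevant = {tid for tid, rel in relevance.items() if rel > 0}

# Hypothetical ranking returned by the retriever (t9 has no judgment).
retrieved = ["t3", "t1", "t9", "t2"]

for k in (1, 2, 4):
    print(k, recall_at_k(retrieved, relevant, k), round(ndcg_at_k(retrieved, relevance, k), 4))
```

At k=2 only `t1` of the two relevant tables has been found, so recall is 0.5. nDCG@2 is lower than that because `t1` sits at rank 2 rather than rank 1, and the ideal ranking would place `t1` and `t2` in the first two positions.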

In [13]:
# Evaluate
k_values = [1, 5, 10, 20]
metrics = defaultdict(list)

for query_id, retrieved in results.items():
    if query_id not in qrels:
        continue
    relevance = qrels[query_id]
    relevant = set(tid for tid, rel in relevance.items() if rel > 0)
    
    for k in k_values:
        metrics[f'Recall@{k}'].append(recall_at_k(retrieved, relevant, k))
        metrics[f'nDCG@{k}'].append(ndcg_at_k(retrieved, relevance, k))

# Print results
print("\n" + "="*60)
print("HYDE EVALUATION RESULTS")
print("="*60)
print("\nRecall:")
for k in k_values:
    print(f"  Recall@{k:2d}: {np.mean(metrics[f'Recall@{k}']):.4f}")
print("\nnDCG:")
for k in k_values:
    print(f"  nDCG@{k:2d}  : {np.mean(metrics[f'nDCG@{k}']):.4f}")
print("="*60)


HYDE EVALUATION RESULTS

Recall:
  Recall@ 1: 0.0939
  Recall@ 5: 0.2381
  Recall@10: 0.3520
  Recall@20: 0.5134

nDCG:
  nDCG@ 1  : 0.4917
  nDCG@ 5  : 0.4839
  nDCG@10  : 0.4917
  nDCG@20  : 0.5207


## Inspect Results

In [14]:
# Inspect specific query results
query_id = '1'

print(f"Query {query_id}: {queries[query_id]}")
print(f"Hypothetical description: {hypothetical_descriptions[query_ids.index(query_id)]}")
print("="*80)

for i, table_id in enumerate(results[query_id][:5], 1):
    row = tables_df[tables_df['table_id'] == table_id].iloc[0]
    rel = qrels.get(query_id, {}).get(table_id, 0)
    score_val = scores[query_ids.index(query_id)][i-1]
    
    print(f"\n{i}. {table_id} (score: {score_val:.4f}, relevance: {rel})")
    print(f"   Page: {row['page_title']}")
    print(f"   Section: {row['section_title']}")
    print(f"   Caption: {row['table_caption']}")
    
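    try:
        headers = json.loads(row['headers'])
        print(f"   Headers: {headers[:5]}{'...' if len(headers) > 5 else ''}")
    except (TypeError, ValueError):
        pass  # headers missing (NaN) or not valid JSON
    
    try:
        sample = json.loads(row['sample_data'])
        print(f"   Sample ({len(sample)} rows): {sample[0][:3]}...")
    except (TypeError, ValueError, IndexError):
        pass  # sample data missing, not valid JSON, or empty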
    
    print(f"   LLM Description: {table_descriptions[table_ids.index(table_id)]}")
    print("-" * 80)

Query 1: world interest rates table
Hypothetical description: A relevant table would list countries, their central bank interest rates, and the date of the last update. Columns: Country, Interest Rate (%), Last Updated. Data includes country names, numerical rates, and date stamps.

1. table-0552-511 (score: 0.6070, relevance: 0)
   Page: Single deposit
   Section: Real World Example
   Caption: Real World Example
   Headers: ['Interest Accrued', 'Maturity Amount', 'Gain']
   Sample (1 rows): ['737.62 USD', '5,737.62 USD', '14.75%']...
   LLM Description: This table displays financial data for a single deposit, including the interest accrued, the total maturity amount, and the percentage gain.
--------------------------------------------------------------------------------

2. table-0370-614 (score: 0.5960, relevance: 2)
   Page: Eurozone
   Section: Interest rates
   Caption: Interest rates
   Headers: ['Date', 'Deposit facility', 'Main refinancing operations', 'Marginal lending facil