# Formula 1 Knowledge Base with RAG Architecture

This notebook implements a Retrieval-Augmented Generation (RAG) system for answering Formula 1 questions. RAG combines two steps:

1. **Retrieval**: Finding relevant information from a knowledge base
2. **Generation**: Producing natural answers based on retrieved information

## GitHub

A considerably larger (though somewhat less complete) version of this project is available in my GitHub repo: https://github.com/ibanrohazz/F1RAG

## How This System Works

1. Load Formula 1 race data spanning 1950 to 2024 from https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
2. Create a structured knowledge base of racing facts
3. Embed these facts using transformer models
4. Build a semantic search index with FAISS
5. Implement a chatbot that combines:
   - Semantic search retrieval
   - Knowledge-based answer generation
   - Context-aware follow-up question handling
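
As a rough orientation, the retrieve-then-generate loop can be sketched in a few lines of plain Python. This toy version replaces the transformer embeddings and FAISS index with simple word overlap, and the names `toy_facts`, `tokens`, `retrieve`, and `answer` are illustrative only; they are not part of the notebook's code.

```python
import re

# Toy knowledge base; the sentences follow the fact format used in this notebook
toy_facts = [
    "In 2019, Lewis Hamilton won the Monaco Grand Prix driving for Mercedes.",
    "In 2014, Nico Rosberg won the Austrian Grand Prix driving for Mercedes.",
    "In 2008, Felipe Massa won the Bahrain Grand Prix driving for Ferrari.",
]

def tokens(text):
    """Lowercased word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, facts, k=1):
    """Retrieval step: rank facts by how many words they share with the question."""
    q = tokens(question)
    ranked = sorted(facts, key=lambda fact: len(q & tokens(fact)), reverse=True)
    return ranked[:k]

def answer(question, facts):
    """Generation step: here just a template wrapped around the retrieved facts."""
    hits = retrieve(question, facts)
    return "Here's what I found:\n" + "\n".join(f"- {hit}" for hit in hits)

print(answer("Who won the Monaco Grand Prix in 2019?", toy_facts))
```

The real system keeps the same two-step shape but scores candidates with dense vectors rather than shared words, which is what lets it match questions and facts that are phrased differently.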

Let's start by loading the Formula 1 dataset!

In [1]:
# Install all necessary libraries
%pip install pandas torch scikit-learn transformers faiss-cpu tqdm kagglehub

Note: you may need to restart the kernel to use updated packages.





In [2]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
import pandas as pd

# Load all necessary files into separate DataFrames
races = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rohanrao/formula-1-world-championship-1950-2020",
    "races.csv"
)
drivers = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rohanrao/formula-1-world-championship-1950-2020",
    "drivers.csv"
)
results = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rohanrao/formula-1-world-championship-1950-2020",
    "results.csv"
)
constructors = kagglehub.load_dataset(
    KaggleDatasetAdapter.PANDAS,
    "rohanrao/formula-1-world-championship-1950-2020",
    "constructors.csv"
)

# Now you can access the data directly:
print("Loaded:", len(races), "races,", len(drivers), "drivers,", len(results), "results.")

# If you want to combine them into a single DataFrame:
# Create a 'table' column for each DataFrame
races['table'] = 'races'
drivers['table'] = 'drivers'
results['table'] = 'results'
constructors['table'] = 'constructors'

# Concatenate the DataFrames
f1_df = pd.concat([races, drivers, results, constructors], ignore_index=True)



Loaded: 1125 races, 861 drivers, 26759 results.


In [3]:
# Assuming your new DataFrame is called 'f1_df'
import pandas as pd
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from sklearn.model_selection import train_test_split
import faiss
from tqdm import tqdm

# No need to load from CSV anymore
# races = pd.read_csv('races.csv')
# drivers = pd.read_csv('drivers.csv')
# results = pd.read_csv('results.csv')
# constructors = pd.read_csv('constructors.csv')

# Access data directly from the new DataFrame
races = f1_df[f1_df['table'] == 'races']  # Assuming 'table' column identifies data type
drivers = f1_df[f1_df['table'] == 'drivers']
results = f1_df[f1_df['table'] == 'results']
constructors = f1_df[f1_df['table'] == 'constructors']

print("Loaded:", len(races), "races,", len(drivers), "drivers,", len(results), "results.")

Loaded: 1125 races, 861 drivers, 26759 results.


# Building the Formula 1 Knowledge Base

A knowledge base is the foundation of any RAG system. For our Formula 1 assistant, we'll create a structured repository of racing facts that can be efficiently searched.

## Process:
1. Merge race results with driver and constructor data
2. Extract only winning results (`positionOrder` = 1)
3. Format each win as a natural language fact
4. These facts will become our searchable knowledge units

This approach creates a clean, structured knowledge base that's both human-readable and machine-searchable.
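
The join-filter-format steps above can be sketched without pandas. The miniature tables below are hypothetical: the IDs are made up, the second result row is only an illustrative non-winning entry, and `build_facts` is a name introduced here rather than a function from the notebook.

```python
# Hypothetical miniature tables keyed by ID (IDs are invented for illustration)
toy_races = {1: {"year": 2019, "name": "Monaco Grand Prix"}}
toy_drivers = {
    10: {"forename": "Lewis", "surname": "Hamilton"},
    11: {"forename": "Sebastian", "surname": "Vettel"},
}
toy_constructors = {100: {"name": "Mercedes"}, 101: {"name": "Ferrari"}}
toy_results = [
    {"raceId": 1, "driverId": 10, "constructorId": 100, "positionOrder": 1},
    {"raceId": 1, "driverId": 11, "constructorId": 101, "positionOrder": 2},
]

def build_facts(results, races, drivers, constructors):
    """Join each winning result to its race, driver, and constructor, then format it."""
    facts = []
    for row in results:
        if row["positionOrder"] != 1:  # keep winners only
            continue
        race = races[row["raceId"]]
        driver = drivers[row["driverId"]]
        team = constructors[row["constructorId"]]
        facts.append(
            f"In {race['year']}, {driver['forename']} {driver['surname']} "
            f"won the {race['name']} driving for {team['name']}."
        )
    return facts

print(build_facts(toy_results, toy_races, toy_drivers, toy_constructors))
```

Each fact is a self-contained sentence, so it can be embedded and retrieved on its own without needing the surrounding table for context.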

In [4]:
f1_data = results.merge(races, on='raceId', suffixes=('_results', '_races'))
f1_data = f1_data.merge(drivers, left_on='driverId_results', right_on='driverId', suffixes=('_f1data', '_drivers'))
f1_data = f1_data.merge(constructors, left_on='constructorId_results', right_on='constructorId', suffixes=('_f1data', '_constructors'))

# Only winning results
winners = f1_data[f1_data['positionOrder_results'] == 1].copy()

# Inspect columns to determine correct driver and constructor column names
print(winners.columns.tolist())

# Use the correct columns for driver names (update as needed after inspecting columns)
winners.loc[:, 'fact'] = winners.apply(
    lambda row: f"In {row['year_races']:.0f}, {row['forename_f1data']} {row['surname_f1data']} won the {row['name_races']} driving for {row['name_constructors']}.",
    axis=1
)
f1_facts = winners[['fact']].reset_index(drop=True)

print("Sample Fact:", f1_facts.iloc[0]['fact'])

['raceId_f1data', 'year_results', 'round_results', 'circuitId_results', 'name_results', 'date_results', 'time_results', 'url_results', 'fp1_date_results', 'fp1_time_results', 'fp2_date_results', 'fp2_time_results', 'fp3_date_results', 'fp3_time_results', 'quali_date_results', 'quali_time_results', 'sprint_date_results', 'sprint_time_results', 'table_results', 'driverId_results', 'driverRef_results', 'number_results', 'code_results', 'forename_results', 'surname_results', 'dob_results', 'nationality_results', 'resultId_results', 'constructorId_results', 'grid_results', 'position_results', 'positionText_results', 'positionOrder_results', 'points_results', 'laps_results', 'milliseconds_results', 'fastestLap_results', 'rank_results', 'fastestLapTime_results', 'fastestLapSpeed_results', 'statusId_results', 'constructorRef_results', 'year_races', 'round_races', 'circuitId_races', 'name_races', 'date_races', 'time_races', 'url_races', 'fp1_date_races', 'fp1_time_races', 'fp2_date_races', 'fp2

# Embedding the Knowledge Base

To make our knowledge searchable, we need to convert text facts into vector representations (embeddings) that capture semantic meaning.

## The Embedding Process:
1. Load a pre-trained language model optimized for semantic similarity
2. Transform each Formula 1 fact into a dense vector (embedding)
3. These vectors position similar facts closer together in vector space
4. When a user asks a question, we'll embed it using the same model and find the closest facts

The embedding model we're using is specifically trained to preserve semantic relationships, allowing us to find relevant information even when phrasing differs between questions and facts.
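
To make "turn token vectors into one sentence vector, then compare by angle" concrete, here is a small sketch in plain Python. The embedding cell in this notebook takes the first token's hidden state as the sentence vector; mean pooling over non-padding tokens, shown here, is a common alternative and is the pooling illustrated on the model card for `all-MiniLM-L6-v2`. The vectors are made-up 2-d values, and `mean_pool`, `l2_normalize`, and `cosine` are illustrative names.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Dot product of unit vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

# Three made-up token vectors; the last position is padding and must be ignored
token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]

sentence_vec = mean_pool(token_vectors, attention_mask)
print(sentence_vec)                                # [0.5, 0.5]
print(round(cosine(sentence_vec, [1.0, 1.0]), 6))  # 1.0
```

Because the vectors are normalized to unit length, an inner-product index returns the same ranking as cosine similarity.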

In [5]:
# Load model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
bert = AutoModel.from_pretrained(model_name)

# Embedding function
def compute_embeddings(texts, batch_size=32):
    all_embeddings = []
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():
            outputs = bert(**inputs)
            embeddings = outputs.last_hidden_state[:, 0]
            embeddings = F.normalize(embeddings, dim=1)
            all_embeddings.append(embeddings.cpu())
    return torch.cat(all_embeddings, dim=0)

# Compute
fact_embeddings = compute_embeddings(f1_facts['fact'].tolist())
print("Embedded", fact_embeddings.shape[0], "facts.")


100%|██████████| 36/36 [00:03<00:00,  9.46it/s]

Embedded 1128 facts.





# Designing an Efficient Retrieval System

Effective retrieval is critical for any question-answering system. Our retrieval component uses FAISS (Facebook AI Similarity Search) for high-performance vector similarity search.

## Enhanced Retrieval Features:
1. **Metadata filtering** - Quickly narrow down search space using years, circuits, drivers
2. **Query intent analysis** - Understand what type of information the user is looking for
3. **Entity recognition** - Identify key F1 entities mentioned in questions
4. **Hybrid search** - Combine exact metadata matching with semantic similarity

This multi-stage approach balances precision and recall, enabling both exact matches for specific queries and more flexible semantic matching for exploratory questions.
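
The hybrid idea (exact metadata filter first, similarity ranking second) can be sketched in plain Python. This is a simplified illustration, not the notebook's retriever: the vectors are made-up 2-d unit vectors, only a `year` field is indexed, and `build_index` and `hybrid_search` are hypothetical names.

```python
def build_index(facts_meta, field):
    """Map each metadata value (e.g. a year) to the positions of matching facts."""
    index = {}
    for pos, meta in enumerate(facts_meta):
        index.setdefault(meta[field], set()).add(pos)
    return index

def hybrid_search(query_vec, query_filters, vectors, indexes, k=2):
    """Filter candidates by exact metadata, then rank survivors by inner product."""
    candidates = set(range(len(vectors)))
    for field, value in query_filters.items():
        candidates &= indexes[field].get(value, set())
    if not candidates:  # nothing survived the filters: search everything instead
        candidates = set(range(len(vectors)))
    scored = [
        (sum(q * v for q, v in zip(query_vec, vectors[pos])), pos)
        for pos in candidates
    ]
    scored.sort(reverse=True)
    return [pos for _, pos in scored[:k]]

# Made-up metadata and unit vectors for three facts
toy_meta = [{"year": "2019"}, {"year": "2014"}, {"year": "2014"}]
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_indexes = {"year": build_index(toy_meta, "year")}

print(hybrid_search([1.0, 0.0], {"year": "2014"}, toy_vectors, toy_indexes))  # [1, 2]
```

Filtering first keeps the highest-scoring vector (position 0) out of the result when it belongs to the wrong year, which is the precision gain that metadata filtering is meant to provide.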

In [6]:
class F1FAISSRetriever:
    def __init__(self, facts, embeddings):
        self.facts = facts
        dim = embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dim)
        self.index.add(embeddings.numpy())
        
        # Create dictionaries for metadata filtering
        self.facts_by_year = {}
        self.facts_by_circuit = {}
        self.facts_by_driver = {}
        self.facts_by_constructor = {}
        
        # Extract metadata from facts for faster filtering
        import re
        for i, fact in enumerate(facts):
            # Extract year
            year_match = re.search(r"In (\d{4}),", fact)
            if year_match:
                year = year_match.group(1)
                if year not in self.facts_by_year:
                    self.facts_by_year[year] = []
                self.facts_by_year[year].append(i)
            
            # Extract circuit names
            circuits = ["Monaco", "Silverstone", "Abu Dhabi", "Bahrain", "Malaysia", 
                       "Las Vegas", "Qatar", "São Paulo", "Mexico City", "Austrian", 
                       "German", "South African", "Hungarian", "Italian", "British", 
                       "Belgian", "Spanish", "Canadian", "Australian", "Japanese", "French"]
            for circuit in circuits:
                if circuit in fact:
                    if circuit not in self.facts_by_circuit:
                        self.facts_by_circuit[circuit] = []
                    self.facts_by_circuit[circuit].append(i)
            
            # Extract driver names: capture everything between the year and "won the"
            # so accented, hyphenated, and multi-part names are matched too
            driver_match = re.search(r"^In \d{4}, (.+?) won the ", fact)
            if driver_match:
                driver = driver_match.group(1)
                if driver not in self.facts_by_driver:
                    self.facts_by_driver[driver] = []
                self.facts_by_driver[driver].append(i)
            
            # Extract constructor names (full name up to the final period, e.g. "Red Bull")
            constructor_match = re.search(r"driving for (.+)\.$", fact)
            if constructor_match:
                constructor = constructor_match.group(1)
                if constructor not in self.facts_by_constructor:
                    self.facts_by_constructor[constructor] = []
                self.facts_by_constructor[constructor].append(i)

    def analyze_query(self, query):
        """Analyze the query to understand user intent and extract key entities"""
        import re
        
        entities = {
            'year': None,
            'circuit': None,
            'driver': None,
            'constructor': None,
            'query_type': 'general'  # default query type
        }
        
        # Detect year mentions
        year_match = re.search(r"\b(19|20)\d{2}\b", query)
        if year_match:
            entities['year'] = year_match.group(0)
        
        # Detect circuit mentions
        circuits = ["Monaco", "Silverstone", "Abu Dhabi", "Bahrain", "Malaysia", 
                   "Las Vegas", "Qatar", "São Paulo", "Mexico City", "Austrian", 
                   "German", "South African", "Hungarian", "Italian", "British", 
                   "Belgian", "Spanish", "Canadian", "Australian", "Japanese", "French"]
        for circuit in circuits:
            if circuit.lower() in query.lower():
                entities['circuit'] = circuit
                break
        
        # Detect query types
        if any(word in query.lower() for word in ['win', 'won', 'winner', 'victory']):
            entities['query_type'] = 'win'
        elif any(word in query.lower() for word in ['champion', 'championship']):
            entities['query_type'] = 'championship'
        elif 'constructor' in query.lower() or 'team' in query.lower():
            entities['query_type'] = 'constructor'
        
        # Find drivers in query (simplistic approach)
        for driver, indices in self.facts_by_driver.items():
            if driver.lower() in query.lower():
                entities['driver'] = driver
                break
        
        # Find constructors in query
        for constructor, indices in self.facts_by_constructor.items():
            if constructor.lower() in query.lower():
                entities['constructor'] = constructor
                break
        
        return entities

    def retrieve(self, query, k=5):
        """Enhanced retrieval with metadata filtering and query analysis"""
        # Analyze query to understand intent
        query_analysis = self.analyze_query(query)
        
        # Start with all facts
        candidate_indices = set(range(len(self.facts)))
        filtered_facts = self.facts
        filtered_indices = list(candidate_indices)
        
        # Apply metadata filters based on query analysis
        if query_analysis['year'] and query_analysis['year'] in self.facts_by_year:
            year_indices = set(self.facts_by_year[query_analysis['year']])
            candidate_indices = candidate_indices.intersection(year_indices)
        
        if query_analysis['circuit'] and query_analysis['circuit'] in self.facts_by_circuit:
            circuit_indices = set(self.facts_by_circuit[query_analysis['circuit']])
            candidate_indices = candidate_indices.intersection(circuit_indices)
            
        if query_analysis['driver'] and query_analysis['driver'] in self.facts_by_driver:
            driver_indices = set(self.facts_by_driver[query_analysis['driver']])
            candidate_indices = candidate_indices.intersection(driver_indices)
            
        if query_analysis['constructor'] and query_analysis['constructor'] in self.facts_by_constructor:
            constructor_indices = set(self.facts_by_constructor[query_analysis['constructor']])
            candidate_indices = candidate_indices.intersection(constructor_indices)
        
        # If we have filtered indices, use them
        if candidate_indices:
            filtered_indices = list(candidate_indices)
            filtered_facts = [self.facts[i] for i in filtered_indices]
        
        # If filtering results in no facts, fall back to all
        if not filtered_facts:
            filtered_facts = self.facts
            filtered_indices = list(range(len(self.facts)))
        
        # Reuse the fact embeddings already stored in the FAISS index instead of
        # re-encoding every candidate fact on each query
        fact_embs = torch.stack([
            torch.from_numpy(self.index.reconstruct(int(i))) for i in filtered_indices
        ])
        
        # Embed query
        q_inputs = tokenizer(query, return_tensors='pt', truncation=True, padding=True)
        with torch.no_grad():
            q_output = bert(**q_inputs)
            query_emb = q_output.last_hidden_state[:, 0]
            query_emb = F.normalize(query_emb, dim=1)
        
        # Compute similarity and get top-k results
        sims = torch.matmul(query_emb, fact_embs.T).squeeze(0)
        topk = torch.topk(sims, min(k, len(filtered_facts)))
        
        # Map back to original indices and return with scores
        results = []
        for rank, idx in enumerate(topk.indices):
            original_idx = filtered_indices[idx]
            score = float(sims[idx])
            results.append((self.facts[original_idx], score, query_analysis))
        
        return results

retriever = F1FAISSRetriever(f1_facts['fact'].tolist(), fact_embeddings)

# Conversational Formula 1 Assistant

The final component of our system is a conversational interface that ties retrieval and answer formatting together.

## Key Capabilities:
1. **Specialized handlers** for common question types:
   - Driver comparisons and statistics
   - Constructor/team performance analysis
   - Championship season summaries
   - Race-specific results

2. **Context maintenance** for follow-up questions:
   - Remembers previous questions and retrieved facts
   - Understands references like "How about in 2010?" or "What about Ferrari?"
   - Preserves context when the conversation topic evolves

3. **Enhanced answer formatting**:
   - Dynamically structures responses based on query type
   - Includes relevant statistics and comparative data
   - Provides attribution to source facts

This design creates a more natural conversational experience while ensuring answers remain grounded in the factual knowledge base.
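
Context carry-over for follow-ups can be illustrated with a small sketch. It assumes a reduced set of cue phrases and tracks only the year; `FOLLOWUP_CUES`, `extract_year`, and `resolve_followup` are illustrative names, not the chatbot's actual methods.

```python
import re

FOLLOWUP_CUES = ("what about", "how about")  # assumed subset of cue phrases

def extract_year(text):
    """Return the first four-digit year (19xx or 20xx) in the text, if any."""
    match = re.search(r"\b(?:19|20)\d{2}\b", text)
    return match.group(0) if match else None

def resolve_followup(query, previous_query):
    """Append the previous query's year to a follow-up that does not name one."""
    if previous_query is None:
        return query
    if not any(cue in query.lower() for cue in FOLLOWUP_CUES):
        return query  # not phrased as a follow-up: leave untouched
    prev_year = extract_year(previous_query)
    if prev_year and not extract_year(query):
        return f"{query} in {prev_year}"
    return query

previous = "Who won the Monaco Grand Prix in 2019?"
print(resolve_followup("What about Ferrari?", previous))  # What about Ferrari? in 2019
print(resolve_followup("How about in 2010?", previous))   # How about in 2010?
```

The expanded query is not grammatical, but it only needs to carry the right entities into the retriever, which works on extracted years and embeddings rather than sentence structure.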

In [7]:
class F1Chatbot:
    def __init__(self, retriever):
        self.retriever = retriever
        self.chat_history = []
        # Extract all available variables/columns for flexible querying
        self.available_columns = list(winners.columns)
        self.available_columns_lower = [col.lower() for col in self.available_columns]
        # Store the last query and results for follow-up questions
        self.last_query = None
        self.last_results = None

    def extract_drivers(self, query):
        # Simple extraction: look for known driver names in the query
        driver_names = winners['forename_f1data'] + ' ' + winners['surname_f1data']
        found = []
        for name in driver_names.unique():
            if isinstance(name, str) and name.lower() in query.lower():
                found.append(name)
        return list(set(found))

    def compare_drivers(self, driver1, driver2):
        # Aggregate wins for each driver
        d1_wins = winners[
            (winners['forename_f1data'] + ' ' + winners['surname_f1data'] == driver1)
        ]
        d2_wins = winners[
            (winners['forename_f1data'] + ' ' + winners['surname_f1data'] == driver2)
        ]
        
        # More comprehensive comparison
        d1_first_win = d1_wins.sort_values('year_races').iloc[0] if not d1_wins.empty else None
        d2_first_win = d2_wins.sort_values('year_races').iloc[0] if not d2_wins.empty else None
        
        d1_teams = d1_wins['name_constructors'].unique()
        d2_teams = d2_wins['name_constructors'].unique()
        
        # Grands Prix (matched by name, across all seasons) that both drivers have won
        races_both = set(d1_wins['name_races']).intersection(set(d2_wins['name_races']))
        
        answer = (
            f"{driver1} has {len(d1_wins)} wins"
            + (f" (first in {d1_first_win['year_races']:.0f} at {d1_first_win['name_races']})" if d1_first_win is not None else "") +
            f" driving for {', '.join(d1_teams)}.\n" +
            f"{driver2} has {len(d2_wins)} wins"
            + (f" (first in {d2_first_win['year_races']:.0f} at {d2_first_win['name_races']})" if d2_first_win is not None else "") +
            f" driving for {', '.join(d2_teams)}.\n" +
            f"Races both have won: {', '.join(races_both) if races_both else 'None'}."
        )
        return answer

    def extract_constructor(self, query):
        # Look for known constructor names in the query
        constructor_names = winners['name_constructors'].dropna().unique()
        found = []
        for name in constructor_names:
            if isinstance(name, str) and name.lower() in query.lower():
                found.append(name)
        return list(set(found))

    def constructor_wins(self, constructor):
        # Aggregate wins for the constructor
        c_wins = winners[winners['name_constructors'] == constructor]
        if c_wins.empty:
            return f"No wins found for {constructor}."
            
        # Group by year to show progression
        wins_by_year = c_wins.groupby('year_races').size()
        total_wins = len(c_wins)
        
        # Get unique drivers for this constructor
        drivers = c_wins['forename_f1data'] + ' ' + c_wins['surname_f1data']
        unique_drivers = drivers.unique()
        
        # Get most successful seasons
        best_year = wins_by_year.idxmax()
        best_year_wins = wins_by_year.max()
        
        facts = c_wins['fact'].tolist()
        answer = (
            f"{constructor} has {total_wins} race wins across {len(wins_by_year)} seasons.\n"
            f"Most successful year: {best_year:.0f} with {best_year_wins:.0f} wins.\n"
            f"Winning drivers: {', '.join(unique_drivers)}.\n\n"
            f"Recent wins:\n" + "\n".join([f"- {fact}" for fact in sorted(facts, reverse=True)[:3]])
        )
        return answer

    def get_champion_for_year(self, year):
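        # Find the driver with the most wins in the given year. This approximates
        # the champion from race wins only; it is not the official points-based title
        # (for 2008 it reports Felipe Massa, while Lewis Hamilton won the championship)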
        year_mask = winners['year_races'] == float(year)
        year_winners = winners[year_mask]
        if year_winners.empty:
            return f"No race winners found for {year}."
        # Count wins per driver
        driver_names = year_winners['forename_f1data'] + ' ' + year_winners['surname_f1data']
        win_counts = driver_names.value_counts()
        champion = win_counts.idxmax()
        wins = win_counts.max()
        
        # Calculate total races that year
        total_races = len(races[races['year'] == float(year)])
        
        champion_team = year_winners[
            (year_winners['forename_f1data'] + ' ' + year_winners['surname_f1data']) == champion
        ]['name_constructors'].mode().iloc[0]
        
        # Get the constructor champion (most constructor wins)
        constructor_wins = year_winners['name_constructors'].value_counts()
        constructor_champion = constructor_wins.idxmax()
        constructor_win_count = constructor_wins.max()
        
        answer = (
            f"Formula 1 {year} Season Summary:\n\n"
            f"Driver with most wins: {champion} ({wins} out of {total_races} races, driving for {champion_team}).\n"
            f"Constructor with most wins: {constructor_champion} ({constructor_win_count} wins).\n\n"
            f"{champion}'s race wins in {year}:\n" +
            "\n".join([
                f"- {row['fact']}"
                for _, row in year_winners[
                    (year_winners['forename_f1data'] + ' ' + year_winners['surname_f1data']) == champion
                ].iterrows()
            ])
        )
        return answer

    def extract_column_from_query(self, query):
        # Try to match a column/variable from the query
        for col, col_lower in zip(self.available_columns, self.available_columns_lower):
            if col_lower in query.lower():
                return col
        # Try to match by keywords (e.g., "year", "driver", "constructor", etc.)
        keywords = {
            "year": "year_races",
            "driver": "forename_f1data",
            "constructor": "name_constructors",
            "race": "name_races",
            "circuit": "name_races",
            "nationality": "nationality_f1data",
            "points": "points_results",
            "laps": "laps_results",
            "position": "position_results"
        }
        for key, col in keywords.items():
            if key in query.lower():
                return col
        return None
    
    def process_followup_question(self, query):
        """Handle follow-up questions by using context from previous exchanges"""
        if not self.last_query or not self.last_results:
            return None
            
        # Check if this is a follow-up question
        followup_indicators = ["what about", "how about", "and what", "what of", "tell me about"]
        is_followup = any(indicator in query.lower() for indicator in followup_indicators)
        
        if not is_followup:
            return None
            
        # Extract entities from previous query analysis
        if len(self.last_results) > 0 and len(self.last_results[0]) > 2:
            prev_analysis = self.last_results[0][2]  # Get the analysis from the first result
            
            # Look for new entities in the follow-up
            new_analysis = self.retriever.analyze_query(query)
            
            # Merge previous context with new query
            combined_query = query
            
            # If the follow-up doesn't specify a year but previous query did
            if not new_analysis['year'] and prev_analysis['year']:
                combined_query += f" in {prev_analysis['year']}"
                
            # If the follow-up doesn't specify a circuit but previous query did
            if not new_analysis['circuit'] and prev_analysis['circuit']:
                combined_query += f" at {prev_analysis['circuit']}"
                    
            return combined_query
        return None

    def get_column_values(self, column, filter_query=None, limit=5):
        # Optionally filter by a keyword in the query
        df = winners
        if filter_query:
            df = df[df.apply(lambda row: filter_query.lower() in str(row).lower(), axis=1)]
        values = df[column].dropna().unique()
        return values[:limit]

    def chat(self, query, top_k=3):
        # Process potential follow-up questions
        followup_query = self.process_followup_question(query)
        if followup_query:
            query = followup_query
            
        # Detect driver comparison
        drivers = self.extract_drivers(query)
        if len(drivers) == 2 and any(word in query.lower() for word in ["compare", "vs", "versus", "against", "or"]):
            answer = self.compare_drivers(drivers[0], drivers[1])
            self.chat_history.append({"user": query})
            self.chat_history.append({"bot": answer})
            print(f"\nUser: {query}")
            print(f"Bot:\n{answer}\n")
            return

        # Detect constructor win queries
        constructors = self.extract_constructor(query)
        if constructors and ("win" in query.lower() or "victor" in query.lower()):
            answer = self.constructor_wins(constructors[0])
            self.chat_history.append({"user": query})
            self.chat_history.append({"bot": answer})
            print(f"\nUser: {query}")
            print(f"Bot:\n{answer}\n")
            return

        # Detect champion for year queries
        import re
        match = re.search(r'(?:champion|winner).*?(\d{4})', query.lower())
        if match:
            year = match.group(1)
            answer = self.get_champion_for_year(year)
            self.chat_history.append({"user": query})
            self.chat_history.append({"bot": answer})
            print(f"\nUser: {query}")
            print(f"Bot:\n{answer}\n")
            return

        # Variable extraction
        col = self.extract_column_from_query(query)
        if col:
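            # Do not filter rows by the full question text: it never appears verbatim
            # in a row, so the filter would always return an empty result
            values = self.get_column_values(col)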
            answer = f"Sample values for '{col}': {', '.join(map(str, values))}"
            self.chat_history.append({"user": query})
            self.chat_history.append({"bot": answer})
            print(f"\nUser: {query}")
            print(f"Bot:\n{answer}\n")
            return

        # Use the retriever for general questions
        self.chat_history.append({"user": query})
        self.last_query = query
        self.last_results = self.retriever.retrieve(query, k=top_k)
        
        # Format better answers based on query analysis
        query_analysis = self.last_results[0][2] if self.last_results else None
        
        if query_analysis and query_analysis['query_type'] == 'win':
            if query_analysis['year'] and query_analysis['circuit']:
                answer = f"Results for the {query_analysis['circuit']} Grand Prix in {query_analysis['year']}:\n\n"
            elif query_analysis['circuit']:
                answer = f"Race winners at {query_analysis['circuit']}:\n\n"
            elif query_analysis['year']:
                answer = f"Race winners in {query_analysis['year']}:\n\n"
            else:
                answer = "Race winners:\n\n"
        else:
            answer = "Here's what I found:\n\n"
            
        # Add the retrieved facts with relevance scores
        for i, (fact, score, _) in enumerate(self.last_results):
            answer += f"[{score:.2f}] {fact}\n"
            
        self.chat_history.append({"bot": answer})
        print(f"\nUser: {query}")
        print(f"Bot:\n{answer}\n")

    def show_history(self):
        for turn in self.chat_history:
            for speaker, text in turn.items():
                print(f"{speaker.capitalize()}: {text}\n")

f1_chatbot = F1Chatbot(retriever)

# Testing Our Formula 1 Knowledge Assistant

Let's see the system in action! We'll test it with several question types:

1. **Specific race results** - "Who won Monaco in 2019?"
2. **Follow-up questions** - "How about Silverstone in 2014?"
3. **Championship queries** - "Who was champion in 2008?"
4. **Driver comparisons** - "Compare Lewis Hamilton and Sebastian Vettel"
5. **Constructor performance** - "How many wins does Ferrari have?"

The system should handle these diverse questions while maintaining context between related queries.

In [8]:
# Test a variety of questions to showcase the enhanced system
f1_chatbot.chat("Who won the Monaco Grand Prix in 2019?")
f1_chatbot.chat("How about Silverstone in 2014?")
f1_chatbot.chat("Who was champion in 2008?")
f1_chatbot.chat("Compare Lewis Hamilton and Sebastian Vettel")
f1_chatbot.chat("How many wins does Ferrari have?")


User: Who won the Monaco Grand Prix in 2019?
Bot:
Results for the Monaco Grand Prix in 2019:

[0.86] In 2019, Lewis Hamilton won the Monaco Grand Prix driving for Mercedes.



User: How about Silverstone in 2014?
Bot:
Here's what I found:

[0.66] In 2014, Nico Rosberg won the Austrian Grand Prix driving for Mercedes.
[0.65] In 2014, Nico Rosberg won the German Grand Prix driving for Mercedes.
[0.65] In 2014, Nico Rosberg won the Australian Grand Prix driving for Mercedes.



User: Who was champion in 2008?
Bot:
Formula 1 2008 Season Summary:

Driver with most wins: Felipe Massa (6 out of 18 races, driving for Ferrari).
Constructor with most wins: Ferrari (8 wins).

Felipe Massa's race wins in 2008:
- In 2008, Felipe Massa won the Bahrain Grand Prix driving for Ferrari.
- In 2008, Felipe Massa won the Turkish Grand Prix driving for Ferrari.
- In 2008, Felipe Massa won the French Grand Prix driving for Ferrari.
- In 2008, Felipe Massa won the European Grand Prix driving for Ferrari.
- I

# Enhanced F1 RAG System Architecture

This notebook demonstrates a complete Retrieval-Augmented Generation system for Formula 1 knowledge, combining several key technologies:

## System Components:

### 1. Knowledge Base Construction
- **Data sources**: Comprehensive F1 race data (1950-2024)
- **Knowledge units**: Structured facts about race wins
- **Metadata extraction**: Years, circuits, drivers, constructors

### 2. Neural Retrieval Engine
- **Embedding model**: Sentence-transformer for semantic encoding
- **Vector database**: FAISS for efficient similarity search
- **Query analysis**: Intent classification and entity extraction
- **Filtering system**: Metadata-based pre-filtering

### 3. Answer Generation
- **Specialized handlers**: Custom logic for common question types
- **Context management**: Support for follow-up questions
- **Response formatting**: Structured answers with statistics

### 4. User Interface
- **Interactive chat**: Real-time question answering
- **Example questions**: Quick access to demonstration queries
- **Visual styling**: Clean presentation of responses

This architecture balances factual accuracy with conversational fluidity, enabling both precise answers to specific questions and more exploratory dialogue about Formula 1 topics.
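
The way specialized handlers take priority over general retrieval can be sketched as an ordered rule list. This is a simplified illustration: `KNOWN_CONSTRUCTORS` is an assumed two-entry set, and `route` returns handler names as strings instead of calling the chatbot's methods.

```python
import re

KNOWN_CONSTRUCTORS = {"ferrari", "mercedes"}  # assumed subset for illustration

def route(query):
    """Return the name of the first handler whose rule matches the query."""
    q = query.lower()
    rules = [
        ("compare_drivers", lambda: re.search(r"\b(compare|vs|versus|against)\b", q)),
        ("constructor_wins", lambda: "win" in q and any(c in q for c in KNOWN_CONSTRUCTORS)),
        ("season_summary", lambda: re.search(r"(?:champion|winner).*?\d{4}", q)),
    ]
    for name, matches in rules:
        if matches():
            return name
    return "semantic_retrieval"  # default: fall through to vector search

for question in [
    "Compare Lewis Hamilton and Sebastian Vettel",
    "How many wins does Ferrari have?",
    "Who was champion in 2008?",
    "Who won the Monaco Grand Prix in 2019?",
]:
    print(f"{question!r} -> {route(question)}")
```

Ordering matters: the first matching rule wins, so more specific handlers should be listed before broader ones, with semantic retrieval as the catch-all.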

In [9]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Create a more visually appealing chat interface
title = widgets.HTML(value="<h2 style='text-align:center'>Formula 1 Knowledge Assistant</h2>")

input_box = widgets.Text(
    value='',
    placeholder='Ask a Formula 1 question...',
    description='Question:',
    disabled=False,
    layout=widgets.Layout(width='90%')
)

# Add some example questions as buttons
example_questions = [
    "Who won Monaco in 2019?",
    "Compare Lewis Hamilton and Sebastian Vettel",  # full names are needed for driver matching
    "Ferrari wins in 2004",
    "Who was champion in 2008?"
]

buttons = [widgets.Button(description=q, layout=widgets.Layout(width='auto')) for q in example_questions]
button_box = widgets.HBox(buttons, layout=widgets.Layout(justify_content='center', padding='10px'))

# Create a styled output area with a border
output_area = widgets.Output(layout=widgets.Layout(
    border='1px solid #ddd',
    padding='10px',
    width='90%',
    height='300px',
    overflow_y='auto'
))

def on_submit(change):
    # Fires when the committed value changes (Enter or focus loss, because
    # continuous_update is disabled below); ignore the reset to an empty string
    question = change['new'].strip()
    if not question:
        return
    with output_area:
        clear_output(wait=True)
        f1_chatbot.chat(question)
    input_box.value = ''

def on_button_click(b):
    with output_area:
        clear_output(wait=True)
        f1_chatbot.chat(b.description)

# Text.on_submit is deprecated in recent ipywidgets releases; observe the value instead
input_box.continuous_update = False
input_box.observe(on_submit, names='value')
for button in buttons:
    button.on_click(on_button_click)

# Display all widgets
display(title)
display(widgets.Label(value="Try one of these example questions:"))
display(button_box)
display(widgets.Label(value="Or ask your own question:"))
display(input_box)
display(output_area)
