# MongoDB Atlas Vector Search with VoyageAI Embeddings for Sports Scores and Stories

This notebook demonstrates how to use VoyageAI embeddings with MongoDB Atlas Vector Search for retrieving relevant sports scores and stories based on user queries.

## Overview

In this tutorial, we'll learn how to:

1. Connect to MongoDB Atlas and retrieve sports data
2. Generate embeddings using VoyageAI's embedding models
3. Store these embeddings in MongoDB
4. Create and use a vector search index for semantic similarity search
5. Use hybrid search to tune results
6. Implement a RAG (Retrieval-Augmented Generation) system to answer questions about sports teams and matches
7. Show how agentic RAG changes the results by exposing hybrid search as a tool for an AI agent built with the OpenAI Agents SDK

This approach combines the power of vector embeddings with natural language processing to provide relevant sports information based on user queries.

## Setup and Configuration

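First, let's install and import the libraries we need and set up our environment: `voyageai` for embeddings, `pymongo` for MongoDB connectivity, `python-dotenv` for loading configuration, and `openai` for the generation step. The install command also pulls in `scikit-learn`.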

In [None]:
%pip install voyageai pymongo scikit-learn python-dotenv openai

Collecting voyageai
  Downloading voyageai-0.3.2-py3-none-any.whl.metadata (2.6 kB)
Collecting pymongo
  Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting aiolimiter (from voyageai)
  Downloading aiolimiter-1.2.1-py3-none-any.whl.metadata (4.5 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading voyageai-0.3.2-py3-none-any.whl (25 kB)
Downloading pymongo-4.11.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Downloading aiolimiter-1.2.1-py3-none-any.whl (6.7 kB)
Installing collected packages: dnspython, aiolimiter, pymongo, voy

In [None]:
import logging
import os
from datetime import datetime, timedelta

import voyageai
from dotenv import load_dotenv
from openai import OpenAI
from pymongo import MongoClient

# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

# Load environment variables
load_dotenv()

False

### Environment Variables

We'll use environment variables to store sensitive information like API keys and connection strings. These should be stored in a `.env` file in the same directory as this notebook.

Example `.env` file content:
```
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
VOYAGE_API_KEY=your_voyage_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```

In [3]:
import getpass

# Read each secret from the environment (populated from .env by load_dotenv above);
# fall back to an interactive prompt when a variable is not set.
# MongoDB connection string
MONGODB_URI = os.getenv("MONGODB_URI") or getpass.getpass(
    "Enter your MongoDB connection string: "
)
# VoyageAI API key for embeddings
VOYAGE_API_KEY = os.getenv("VOYAGE_API_KEY") or getpass.getpass(
    "Enter your VoyageAI API key: "
)
# OpenAI API key for RAG
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or getpass.getpass(
    "Enter your OpenAI API key: "
)

# Check that every credential is available
if not MONGODB_URI or not VOYAGE_API_KEY or not OPENAI_API_KEY:
    print("Error: MONGODB_URI, VOYAGE_API_KEY, and OPENAI_API_KEY must all be set")
    print("Please add them to a .env file or enter them at the prompts")
else:
    print("Environment variables loaded successfully")

Enter your MongoDB connection string: ··········
Enter your VoyageAI API key: ··········
Enter your OpenAI API key: ··········
Environment variables loaded successfully


### MongoDB Configuration

Now let's set up our MongoDB connection and define the database and collections we'll be using.

In [6]:
# MongoDB configuration
DB_NAME = "sports_demo"
COLLECTION_NAME = "matches"
TEAMS_COLLECTION = "teams"
NEWS_COLLECTION = "news"
VECTOR_COLLECTION = "vector_features"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "voyage_vector_index"

# Initialize MongoDB client
client = MongoClient(MONGODB_URI, appname="voyageai.mongodb.sports_scores_demo")

# Access collections
matches_collection = client[DB_NAME][COLLECTION_NAME]
teams_collection = client[DB_NAME][TEAMS_COLLECTION]
news_collection = client[DB_NAME][NEWS_COLLECTION]
vector_collection = client[DB_NAME][VECTOR_COLLECTION]

# Test the connection
try:
    # The ping command is cheap and does not require auth
    # (ismaster is deprecated on current MongoDB servers)
    client.admin.command("ping")
    print("MongoDB connection successful")
except Exception as e:
    print(f"MongoDB connection failed: {e}")

MongoDB connection successful


## VoyageAI Embeddings

Next, we'll create a class to handle generating embeddings using VoyageAI's API. Embeddings are vector representations of text that capture semantic meaning, allowing us to perform operations like similarity search.

In [7]:
class VoyageAIEmbeddings:
    """Custom VoyageAI embeddings class"""

    def __init__(self, api_key, model="voyage-3"):
        self.api_key = api_key
        self.model = model
        os.environ["VOYAGE_API_KEY"] = api_key
        self.client = voyageai.Client(api_key=api_key)

    def embed_text(self, text):
        """Embed a single text using VoyageAI"""
        response = self.client.embed([text], model=self.model, input_type="document")
        return response.embeddings[0]

    def embed_batch(self, texts, batch_size=20):
        """Embed a batch of texts efficiently"""
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i : i + batch_size]
            response = self.client.embed(batch, model=self.model, input_type="document")
            embeddings.extend(response.embeddings)
            print(f"Processed {i+len(batch)}/{len(texts)} embeddings")
        return embeddings

### Understanding Embeddings

Embeddings are dense vector representations of text that capture semantic meaning. The VoyageAI model we're using (`voyage-3`) generates 1024-dimensional vectors for each text input. These vectors have several important properties:

1. **Semantic similarity**: Texts with similar meanings will have embeddings that are close to each other in the vector space
2. **Dimensionality**: The high-dimensional space allows for capturing complex relationships between concepts
3. **Language understanding**: The model has been trained on vast amounts of text data to understand language nuances

In our case, we'll use these embeddings to represent sports data in a way that captures the semantic meaning of team names, match descriptions, and news stories.
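
To make the first property concrete, here is a minimal sketch of cosine similarity, the metric the vector search index in this notebook is configured with. The three-dimensional vectors are made-up toy values for illustration, not real `voyage-3` output, and `cosine_similarity` is a helper defined here rather than part of the VoyageAI client.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (hypothetical values; real vectors have 1024 dimensions)
red_devils = [0.9, 0.1, 0.2]
man_united = [0.8, 0.2, 0.1]
transfer_fee = [0.1, 0.9, 0.3]

similar = cosine_similarity(red_devils, man_united)
dissimilar = cosine_similarity(red_devils, transfer_fee)
print(f"similar pair: {similar:.3f}, dissimilar pair: {dissimilar:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated vectors score much lower. That ordering is what a similarity search ranks on.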

## Sample Data Generation

For demonstration purposes, let's create some sample sports data. In a real-world scenario, this data would come from an API or another data source.

In [None]:
def generate_sample_data():
    """Generate sample sports data for demonstration purposes"""
    print("Generating sample sports data...")

    # Sample teams with nicknames
    teams = [
        {
            "team_id": "MNU",
            "name": "Manchester United",
            "nicknames": ["Red Devils", "United"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "MNC",
            "name": "Manchester City",
            "nicknames": ["Citizens", "City"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "LIV",
            "name": "Liverpool",
            "nicknames": ["Reds", "The Kop"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "CHE",
            "name": "Chelsea",
            "nicknames": ["Blues", "The Pensioners"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "ARS",
            "name": "Arsenal",
            "nicknames": ["Gunners", "The Arsenal"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "TOT",
            "name": "Tottenham Hotspur",
            "nicknames": ["Spurs", "Lilywhites"],
            "league": "Premier League",
            "country": "England",
        },
        {
            "team_id": "BAR",
            "name": "Barcelona",
            "nicknames": ["Barça", "Blaugrana"],
            "league": "La Liga",
            "country": "Spain",
        },
        {
            "team_id": "RMA",
            "name": "Real Madrid",
            "nicknames": ["Los Blancos", "Merengues"],
            "league": "La Liga",
            "country": "Spain",
        },
        {
            "team_id": "ATM",
            "name": "Atletico Madrid",
            "nicknames": ["Atleti", "Colchoneros"],
            "league": "La Liga",
            "country": "Spain",
        },
        {
            "team_id": "BAY",
            "name": "Bayern Munich",
            "nicknames": ["Die Roten", "Bavarians"],
            "league": "Bundesliga",
            "country": "Germany",
        },
        {
            "team_id": "BVB",
            "name": "Borussia Dortmund",
            "nicknames": ["BVB", "Die Schwarzgelben"],
            "league": "Bundesliga",
            "country": "Germany",
        },
        {
            "team_id": "JUV",
            "name": "Juventus",
            "nicknames": ["Old Lady", "Bianconeri"],
            "league": "Serie A",
            "country": "Italy",
        },
        {
            "team_id": "INT",
            "name": "Inter Milan",
            "nicknames": ["Nerazzurri", "La Beneamata"],
            "league": "Serie A",
            "country": "Italy",
        },
        {
            "team_id": "ACM",
            "name": "AC Milan",
            "nicknames": ["Rossoneri", "Diavolo"],
            "league": "Serie A",
            "country": "Italy",
        },
        {
            "team_id": "PSG",
            "name": "Paris Saint-Germain",
            "nicknames": ["Les Parisiens", "PSG"],
            "league": "Ligue 1",
            "country": "France",
        },
    ]

    # Generate sample matches (recent results)
    now = datetime.now()
    matches = []

    # Premier League matches
    matches.extend(
        [
            {
                "match_id": "PL2023-001",
                "home_team": "MNU",
                "away_team": "LIV",
                "home_score": 2,
                "away_score": 1,
                "date": (now - timedelta(days=2)).strftime("%Y-%m-%d"),
                "competition": "Premier League",
                "season": "2023-2024",
                "stadium": "Old Trafford",
                "summary": "Manchester United secured a thrilling 2-1 victory over Liverpool at Old Trafford. Bruno Fernandes opened the scoring with a penalty in the 34th minute, before Marcus Rashford doubled the lead with a brilliant solo effort in the 67th minute. Mohamed Salah pulled one back for Liverpool in the 85th minute, but United held on for a crucial win.",
            },
            {
                "match_id": "PL2023-002",
                "home_team": "ARS",
                "away_team": "MNC",
                "home_score": 1,
                "away_score": 1,
                "date": (now - timedelta(days=3)).strftime("%Y-%m-%d"),
                "competition": "Premier League",
                "season": "2023-2024",
                "stadium": "Emirates Stadium",
                "summary": "Arsenal and Manchester City played out an entertaining 1-1 draw at the Emirates Stadium. Erling Haaland gave City the lead in the 23rd minute with a powerful header, but Bukayo Saka equalized for the Gunners in the 59th minute with a well-placed shot from the edge of the box.",
            },
            {
                "match_id": "PL2023-003",
                "home_team": "CHE",
                "away_team": "TOT",
                "home_score": 3,
                "away_score": 0,
                "date": (now - timedelta(days=1)).strftime("%Y-%m-%d"),
                "competition": "Premier League",
                "season": "2023-2024",
                "stadium": "Stamford Bridge",
                "summary": "Chelsea dominated Tottenham in a 3-0 London derby win at Stamford Bridge. Cole Palmer scored twice in the first half, and Nicolas Jackson added a third in the 78th minute to complete the rout. Spurs struggled to create chances throughout the match.",
            },
        ]
    )

    # La Liga matches
    matches.extend(
        [
            {
                "match_id": "LL2023-001",
                "home_team": "BAR",
                "away_team": "RMA",
                "home_score": 3,
                "away_score": 2,
                "date": (now - timedelta(days=4)).strftime("%Y-%m-%d"),
                "competition": "La Liga",
                "season": "2023-2024",
                "stadium": "Camp Nou",
                "summary": "Barcelona edged Real Madrid 3-2 in an exciting El Clásico at Camp Nou. Robert Lewandowski scored twice for Barça, while Lamine Yamal added another. Vinícius Júnior and Jude Bellingham scored for Real Madrid, but it wasn't enough to prevent defeat.",
            },
            {
                "match_id": "LL2023-002",
                "home_team": "ATM",
                "away_team": "BAR",
                "home_score": 1,
                "away_score": 2,
                "date": (now - timedelta(days=11)).strftime("%Y-%m-%d"),
                "competition": "La Liga",
                "season": "2023-2024",
                "stadium": "Metropolitano",
                "summary": "Barcelona came from behind to beat Atletico Madrid 2-1 at the Metropolitano. Antoine Griezmann gave Atletico the lead in the first half, but goals from Pedri and Robert Lewandowski in the second half secured the win for Barcelona.",
            },
        ]
    )

    # Other league matches
    matches.extend(
        [
            {
                "match_id": "BL2023-001",
                "home_team": "BAY",
                "away_team": "BVB",
                "home_score": 4,
                "away_score": 0,
                "date": (now - timedelta(days=5)).strftime("%Y-%m-%d"),
                "competition": "Bundesliga",
                "season": "2023-2024",
                "stadium": "Allianz Arena",
                "summary": "Bayern Munich thrashed Borussia Dortmund 4-0 in Der Klassiker at the Allianz Arena. Harry Kane scored a hat-trick, while Leroy Sané added another as Bayern dominated from start to finish.",
            },
            {
                "match_id": "SA2023-001",
                "home_team": "JUV",
                "away_team": "INT",
                "home_score": 1,
                "away_score": 1,
                "date": (now - timedelta(days=6)).strftime("%Y-%m-%d"),
                "competition": "Serie A",
                "season": "2023-2024",
                "stadium": "Allianz Stadium",
                "summary": "Juventus and Inter Milan shared the points in a 1-1 draw in the Derby d'Italia. Dusan Vlahovic put Juventus ahead in the first half, but Lautaro Martínez equalized for Inter in the second half.",
            },
        ]
    )

    # Generate sample news stories
    news = [
        {
            "news_id": "NEWS001",
            "title": "Manchester United's Bruno Fernandes wins Player of the Month",
            "date": (now - timedelta(days=1)).strftime("%Y-%m-%d"),
            "content": "Manchester United captain Bruno Fernandes has been named Premier League Player of the Month for his outstanding performances. The Portuguese midfielder scored 4 goals and provided 3 assists in 5 matches, helping United climb up the table. This is Fernandes' 5th Player of the Month award since joining United in January 2020.",
            "teams": ["MNU"],
            "players": ["Bruno Fernandes"],
            "category": "Award",
        },
        {
            "news_id": "NEWS002",
            "title": "Liverpool suffer injury blow as Salah ruled out for three weeks",
            "date": now.strftime("%Y-%m-%d"),
            "content": "Liverpool have been dealt a major injury blow with the news that Mohamed Salah will be sidelined for three weeks with a hamstring strain. The Egyptian forward picked up the injury during Liverpool's 2-1 defeat to Manchester United and is expected to miss crucial matches against Arsenal and Manchester City. Manager Jürgen Klopp described the injury as 'unfortunate timing' as Liverpool enter a busy period of fixtures.",
            "teams": ["LIV", "MNU"],
            "players": ["Mohamed Salah"],
            "category": "Injury",
        },
        {
            "news_id": "NEWS003",
            "title": "Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer",
            "date": (now - timedelta(days=4)).strftime("%Y-%m-%d"),
            "content": "Barcelona wonderkid Lamine Yamal has made history by becoming the youngest ever goalscorer in El Clásico at just 16 years and 107 days old. The Spanish teenager scored a spectacular long-range goal in Barcelona's 3-2 victory over Real Madrid at Camp Nou. 'It's a dream come true,' said Yamal after the match. 'I've been watching El Clásico since I was a child, and to score in this fixture is incredible.'",
            "teams": ["BAR", "RMA"],
            "players": ["Lamine Yamal"],
            "category": "Record",
        },
        {
            "news_id": "NEWS004",
            "title": "Manchester City's Erling Haaland on track to break Premier League scoring record",
            "date": (now - timedelta(days=2)).strftime("%Y-%m-%d"),
            "content": "Manchester City striker Erling Haaland is on course to break his own Premier League scoring record this season. The Norwegian has already netted 15 goals in just 10 matches, putting him ahead of his record-breaking pace from last season when he scored 36 goals. Pep Guardiola praised Haaland's incredible form: 'What he's doing is remarkable. His hunger for goals is insatiable.'",
            "teams": ["MNC"],
            "players": ["Erling Haaland"],
            "category": "Performance",
        },
        {
            "news_id": "NEWS005",
            "title": "Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker",
            "date": (now - timedelta(days=5)).strftime("%Y-%m-%d"),
            "content": "Harry Kane scored a perfect hat-trick (right foot, left foot, header) as Bayern Munich demolished Borussia Dortmund 4-0 in Der Klassiker. The England captain has made a sensational start to his Bundesliga career since his summer move from Tottenham Hotspur. 'I'm loving my time here in Munich,' said Kane. 'The team is incredible and we're playing some fantastic football.'",
            "teams": ["BAY", "BVB"],
            "players": ["Harry Kane"],
            "category": "Performance",
        },
    ]

    # Clear existing data
    teams_collection.delete_many({})
    matches_collection.delete_many({})
    news_collection.delete_many({})

    # Insert sample data
    teams_collection.insert_many(teams)
    matches_collection.insert_many(matches)
    news_collection.insert_many(news)

    print(
        f"Inserted {len(teams)} teams, {len(matches)} matches, and {len(news)} news stories"
    )

    return teams, matches, news


# Generate sample data
teams, matches, news = generate_sample_data()

Generating sample sports data...
Inserted 15 teams, 7 matches, and 5 news stories


## Data Processing and Embedding Generation

Now let's define functions to process our sports data and generate embeddings.

In [None]:
def generate_text_for_embedding(item, item_type):
    """Create a text representation for embedding based on the item type"""
    if item_type == "match":
        # Get team names for readability
        home_team = next(
            (team["name"] for team in teams if team["team_id"] == item["home_team"]),
            item["home_team"],
        )
        away_team = next(
            (team["name"] for team in teams if team["team_id"] == item["away_team"]),
            item["away_team"],
        )

        text_parts = [
            f"Match: {home_team} vs {away_team}",
            f"Score: {item['home_score']}-{item['away_score']}",
            f"Competition: {item['competition']} {item['season']}",
            f"Date: {item['date']}",
            f"Stadium: {item['stadium']}",
            f"Summary: {item['summary']}",
        ]
        return " ".join(text_parts)

    elif item_type == "team":
        text_parts = [
            f"Team: {item['name']}",
            f"Also known as: {', '.join(item['nicknames'])}",
            f"League: {item['league']}",
            f"Country: {item['country']}",
        ]
        return " ".join(text_parts)

    elif item_type == "news":
        text_parts = [
            f"Title: {item['title']}",
            f"Date: {item['date']}",
            f"Category: {item['category']}",
            f"Content: {item['content']}",
        ]
        return " ".join(text_parts)

    return ""


def create_and_save_embeddings():
    """Generate and save embeddings for all sports data"""
    print("Generating embeddings for sports data...")

    # Initialize VoyageAI embeddings
    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)

    # Clear existing vector data
    vector_collection.delete_many({})

    # Process teams
    team_texts = [generate_text_for_embedding(team, "team") for team in teams]
    team_embeddings = voyage_embeddings.embed_batch(team_texts)

    # Process matches
    match_texts = [generate_text_for_embedding(match, "match") for match in matches]
    match_embeddings = voyage_embeddings.embed_batch(match_texts)

    # Process news
    news_texts = [generate_text_for_embedding(news_item, "news") for news_item in news]
    news_embeddings = voyage_embeddings.embed_batch(news_texts)

    # Create records with embeddings
    vector_records = []

    # Add team embeddings
    for i, team in enumerate(teams):
        vector_records.append(
            {
                "object_id": team["team_id"],
                "object_type": "team",
                "name": team["name"],
                "league": team["league"],
                "country": team["country"],
                "embedding": team_embeddings[i],
                "data": team,
            }
        )

    # Add match embeddings
    for i, match in enumerate(matches):
        vector_records.append(
            {
                "object_id": match["match_id"],
                "object_type": "match",
                "home_team": match["home_team"],
                "away_team": match["away_team"],
                "competition": match["competition"],
                "date": match["date"],
                "embedding": match_embeddings[i],
                "data": match,
            }
        )

    # Add news embeddings
    for i, news_item in enumerate(news):
        vector_records.append(
            {
                "object_id": news_item["news_id"],
                "object_type": "news",
                "title": news_item["title"],
                "date": news_item["date"],
                "category": news_item["category"],
                "embedding": news_embeddings[i],
                "data": news_item,
            }
        )

    # Insert all records
    vector_collection.insert_many(vector_records)
    print(f"Saved {len(vector_records)} embedding records to MongoDB")

    return vector_records

In [None]:
def create_vector_search_index():
    """Create a vector search index in MongoDB Atlas"""

    print("Setting up Vector Search Index in MongoDB Atlas...")
    print("Note: To create the vector search index in MongoDB Atlas:")
    print("1. Go to the MongoDB Atlas dashboard")
    print("2. Select your cluster")
    print("3. Go to the 'Search' tab")
    print(
        f"4. Create a new index on '{VECTOR_COLLECTION}' with the following configuration:"
    )
    print("""
   {
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
    """)
    print(f"Name the index: {ATLAS_VECTOR_SEARCH_INDEX_NAME}")
    print("5. Apply the index to the vector_features collection")


def perform_vector_search(query_text, k=5):
    """Perform a vector search query using VoyageAI embeddings"""
    print(f"Performing vector search for: {query_text}")

    # Generate embedding for the query
    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)
    query_embedding = voyage_embeddings.client.embed(
        [query_text], model=voyage_embeddings.model, input_type="query"
    ).embeddings[0]

    # Perform vector search
    vector_search_results = vector_collection.aggregate(
        [
            {
                "$vectorSearch": {
                    "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
                    "path": "embedding",
                    "queryVector": query_embedding,
                    "numCandidates": 100,
                    "limit": k,
                }
            },
            {
                "$project": {
                    "object_id": 1,
                    "object_type": 1,
                    "name": 1,
                    "title": 1,
                    "competition": 1,
                    "date": 1,
                    "data": 1,
                    "score": {"$meta": "vectorSearchScore"},
                }
            },
        ]
    )

    results = list(vector_search_results)

    print(f"Found {len(results)} relevant items:")
    for i, result in enumerate(results):
        if result["object_type"] == "team":
            print(
                f"{i+1}. Team: {result.get('name', 'Unknown')} (Score: {result.get('score', 0):.4f})"
            )
        elif result["object_type"] == "match":
            home = result.get("data", {}).get("home_team", "Unknown")
            away = result.get("data", {}).get("away_team", "Unknown")
            score = f"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}"
            print(
                f"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('score', 0):.4f})"
            )
        elif result["object_type"] == "news":
            print(
                f"{i+1}. News: {result.get('title', 'Unknown')} (Score: {result.get('score', 0):.4f})"
            )

    return results

In [None]:
# Create embeddings and save them to MongoDB
vector_records = create_and_save_embeddings()

# Create a vector search index (this will provide instructions -
# actual index creation must be done in MongoDB Atlas UI)
create_vector_search_index()

Generating embeddings for sports data...
Processed 15/15 embeddings
Processed 7/7 embeddings
Processed 5/5 embeddings
Saved 27 embedding records to MongoDB
Setting up Vector Search Index in MongoDB Atlas...
Note: To create the vector search index in MongoDB Atlas:
1. Go to the MongoDB Atlas dashboard
2. Select your cluster
3. Go to the 'Search' tab
4. Create a new index on 'vector_features' with the following configuration:

   {
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1024,
      "similarity": "cosine"
    }
  ]
}
    
Name the index: voyage_vector_index
5. Apply the index to the vector_features collection
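
The same index can probably be created from code instead of the Atlas UI. The sketch below assumes PyMongo 4.7 or later, which provides `SearchIndexModel` with a `type` argument alongside `Collection.create_search_index`. The names `build_vector_index_definition` and `create_index_programmatically` are helpers introduced here for illustration; they are not part of the notebook or of PyMongo.

```python
def build_vector_index_definition(path="embedding", dimensions=1024, similarity="cosine"):
    """Build the Atlas Vector Search index definition as a Python dict."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dimensions,
                "similarity": similarity,
            }
        ]
    }


def create_index_programmatically(collection, index_name):
    """Create the vector index on a pymongo Collection and return the index name."""
    # Imported here so the definition helper above works without PyMongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=build_vector_index_definition(),
        name=index_name,
        type="vectorSearch",  # assumption: requires PyMongo 4.7+
    )
    return collection.create_search_index(model=model)


definition = build_vector_index_definition()
print(definition)
```

With the notebook's objects, the call would be `create_index_programmatically(vector_collection, ATLAS_VECTOR_SEARCH_INDEX_NAME)`. Index builds are asynchronous, so queries may return no results until Atlas reports the index as ready.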


In [None]:
# Example search queries to test our vector search
example_queries = [
    "Recent Manchester United games",
    "The Red Devils, how did they do?",
    "Who won El Clasico?",
    "Premier League match results",
    "Player injuries news",
    "Bayern Munich performance",
]

print("Testing vector search with example queries:")
for query in example_queries:
    print("\n" + "=" * 50)
    print(f"QUERY: {query}")
    print("=" * 50)
    results = perform_vector_search(query, k=10)

Testing vector search with example queries:

QUERY: Recent Manchester United games
Performing vector search for: Recent Manchester United games
Found 10 relevant items:
1. Team: Manchester United (Score: 0.7876)
2. Match: MNU vs LIV (2-1) (Score: 0.7315)
3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.7312)
4. Team: Manchester City (Score: 0.7214)
5. Team: Chelsea (Score: 0.6717)
6. News: Manchester City's Erling Haaland on track to break Premier League scoring record (Score: 0.6715)
7. Match: ARS vs MNC (1-1) (Score: 0.6690)
8. Team: Tottenham Hotspur (Score: 0.6638)
9. Team: Atletico Madrid (Score: 0.6635)
10. Team: Arsenal (Score: 0.6631)

QUERY: The Red Devils, how did they do?
Performing vector search for: The Red Devils, how did they do?
Found 10 relevant items:
1. Team: Manchester United (Score: 0.6628)
2. Team: Borussia Dortmund (Score: 0.6567)
3. Team: Juventus (Score: 0.6364)
4. Match: JUV vs INT (1-1) (Score: 0.6277)
5. Team: Bayern Munich (Sc

## Hybrid Search

[Hybrid Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/) combines full-text search, which matches the literal tokens in a query, with vector search, which matches by semantic similarity. The two ranked result lists are then merged into a single ranking.
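
The linked tutorial merges the two result lists with reciprocal rank fusion (RRF): a document scores `weight / (rank + 60)` in each list it appears in, and the per-list scores are summed. A minimal pure-Python sketch of that scoring, using made-up document IDs and a hypothetical helper named `reciprocal_rank_fusion`, might look like this:

```python
def reciprocal_rank_fusion(ranked_lists, weights, k=60):
    """Fuse ranked ID lists: score(doc) = sum over lists of weight / (rank + k).

    Ranks are zero-based, matching MongoDB's `includeArrayIndex`.
    """
    scores = {}
    for doc_ids, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(doc_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + k)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result IDs from each search, best match first
vector_hits = ["team:MNU", "match:PL2023-001", "news:NEWS001"]
text_hits = ["match:PL2023-001", "news:NEWS002"]

fused = reciprocal_rank_fusion([vector_hits, text_hits], weights=[0.5, 0.5])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

A document found by both searches accumulates a score from each list, which is why it can outrank the top hit of either search alone. Raising one weight relative to the other shifts the final ranking toward that search.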

In [None]:
# Create the full-text search (FTS) index


def create_full_search_index():
    """Create a full-text search index in MongoDB Atlas"""

    print("Setting up Search Index in MongoDB Atlas...")
    print("Note: To create the full-text search index in MongoDB Atlas:")
    print("1. Go to the MongoDB Atlas dashboard")
    print("2. Select your cluster")
    print("3. Go to the 'Search' tab")
    print(
        f"4. Create a new 'Search' index on '{VECTOR_COLLECTION}' with the following configuration:"
    )
    print("""
{
  "mappings": {
    "dynamic": true
  }
}
    """)
    print("Name the index: default")
    print("5. Apply the index to the vector_features collection")

In [8]:
def hybrid_search(query, limit=5, vector_weight=0.5, full_text_weight=0.5):
    """Perform a hybrid search using vector search and full-text search."""

    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)
    query_embedding = voyage_embeddings.client.embed(
        [query], model=voyage_embeddings.model, input_type="query"
    ).embeddings[0]

    pipeline = [
        {
            "$vectorSearch": {
                "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": limit * 2,  # Get more results for potential ranking
            }
        },
        {"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
        {"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
        {
            "$addFields": {
                "vs_score": {
                    "$multiply": [
                        vector_weight,
                        {
                            "$divide": [
                                1.0,
                                {
                                    "$add": ["$rank", 60]  # Adjust ranking
                                },
                            ]
                        },
                    ]
                }
            }
        },
        {
            "$project": {
                "vs_score": 1,
                "_id": "$docs._id",
                "title": "$docs.title",
                "object_type": "$docs.object_type",
                "data": "$docs.data",
            }
        },
        {
            "$unionWith": {
                "coll": VECTOR_COLLECTION,
                "pipeline": [
                    {
                        "$search": {
                            "index": "default",
                            "compound": {
                                "must": [
                                    {
                                        "text": {
                                            "query": query,
                                            "path": {"wildcard": "*"},
                                            "fuzzy": {},
                                        }
                                    }
                                ]
                            },
                        }
                    },
                    {"$limit": limit * 2},
                    {"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
                    {"$unwind": {"path": "$docs", "includeArrayIndex": "fts_rank"}},
                    {
                        "$addFields": {
                            "fts_score": {
                                "$multiply": [
                                    full_text_weight,
                                    {"$divide": [1.0, {"$add": ["$fts_rank", 60]}]},
                                ]
                            }
                        }
                    },
                    {
                        "$project": {
                            "fts_score": 1,
                            "_id": "$docs._id",
                            "title": "$docs.title",
                            "object_type": "$docs.object_type",
                            "data": "$docs.data",
                        }
                    },
                ],
            }
        },
        {
            "$addFields": {
                "final_score": {
                    "$add": [
                        {"$ifNull": ["$vs_score", 0]},  # Handle missing vs_score
                        {"$ifNull": ["$fts_score", 0]},  # Handle missing fts_score
                    ]
                }
            }
        },
        {"$sort": {"final_score": -1}},
        {"$limit": limit},
    ]

    results = list(vector_collection.aggregate(pipeline))

    print(f"Found {len(results)} relevant items:")
    for i, result in enumerate(results):
        if result["object_type"] == "team":
            print(
                f"{i+1}. Team: {result.get('data', {}).get('name', 'Unknown')} (Score: {result.get('final_score', 0):.4f})"
            )
        elif result["object_type"] == "match":
            home = result.get("data", {}).get("home_team", "Unknown")
            away = result.get("data", {}).get("away_team", "Unknown")
            score = f"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}"
            print(
                f"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('final_score', 0):.4f})"
            )
        elif result["object_type"] == "news":
            print(
                f"{i+1}. News: {result.get('data', {}).get('title', 'Unknown')} (Score: {result.get('final_score', 0):.4f})"
            )

    return results
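
The pipeline above scores each document as `weight * 1 / (rank + 60)`, a weighted form of reciprocal rank fusion (RRF) with zero-based ranks. With the default weight of 0.5, the top document of either list scores 0.5 / 60 ≈ 0.0083. Because there is no stage that groups by `_id` after `$unionWith`, a document found by both searches stays as two separate rows, which appears to be why the same item can be listed twice in the results. The sketch below shows the same arithmetic in plain Python and merges the two lists by id; `weighted_rrf` and the sample ids are hypothetical names used only for illustration, not part of the notebook.

```python
from collections import defaultdict

RRF_CONSTANT = 60  # same constant the pipeline adds to each rank


def weighted_rrf(vector_ids, text_ids, vector_weight=0.5, full_text_weight=0.5):
    """Fuse two ranked lists of document ids with weighted reciprocal rank fusion."""
    scores = defaultdict(float)
    # Ranks are zero-based, matching $unwind's includeArrayIndex.
    for rank, doc_id in enumerate(vector_ids):
        scores[doc_id] += vector_weight / (rank + RRF_CONSTANT)
    for rank, doc_id in enumerate(text_ids):
        scores[doc_id] += full_text_weight / (rank + RRF_CONSTANT)
    # Highest fused score first; ties keep first-seen order (the sort is stable).
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ids: "team_mnu" is returned by both searches.
fused = weighted_rrf(["team_mnu", "match_mnu_liv"], ["team_mnu", "news_bruno"])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

A document that ranks first in both lists ends up with 0.5 / 60 + 0.5 / 60 ≈ 0.0167, so it sorts ahead of documents found by only one search. In the aggregation pipeline, a similar merge could be approximated by grouping on `_id` and summing the two scores before the final sort.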

In [9]:
# Example search queries to test our hybrid search
example_queries = [
    "Recent Manchester United games",
    "The Red Devils, how did they do?",
    "Who won El Clasico?",
    "Premier League match results",
    "Player injuries news",
    "Bayern Munich performance",
]

print("Testing hybrid search with default weights:")
for query in example_queries:
    print("\n" + "=" * 50)
    print(f"QUERY: {query}")
    print("=" * 50)
    results = hybrid_search(query, limit=5)

# Printed once, outside the loop above, before the second pass
print("\nTesting hybrid search with weights favoring vector search:")
for query in example_queries:
    print("\n" + "=" * 50)
    print(f"QUERY: {query}")
    print("=" * 50)
    results = hybrid_search(query, limit=5, vector_weight=0.9, full_text_weight=0.1)

Testing hybrid search with default weights:

QUERY: Recent Manchester United games
Found 5 relevant items:
1. Team: Manchester United (Score: 0.0083)
2. Team: Manchester United (Score: 0.0083)
3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0082)
4. Match: MNU vs LIV (2-1) (Score: 0.0082)
5. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0081)

QUERY: The Red Devils, how did they do?
Found 5 relevant items:
1. Team: Chelsea (Score: 0.0083)
2. Team: Manchester United (Score: 0.0083)
3. Team: Liverpool (Score: 0.0082)
4. Team: Borussia Dortmund (Score: 0.0082)
5. Team: Juventus (Score: 0.0081)

QUERY: Who won El Clasico?
Found 5 relevant items:
1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)
2. News: Barcelona's Lamine Yamal becomes youn

## RAG with OpenAI

Retrieval-Augmented Generation (RAG) retrieves context relevant to a question, here through vector similarity search or hybrid search, and passes that context to an LLM so the model can ground its answer in it.

In [None]:
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY)


def generate_response_with_hybrid_search(query, limit=5):
    """Generate a response with OpenAI's Chat Completions API, using hybrid search for context."""

    # 1. Perform hybrid search to retrieve relevant documents
    search_results = hybrid_search(query, limit=limit)

    # 2. Format search results for OpenAI API
    context = ""
    for result in search_results:
        if result["object_type"] == "team":
            context += f"Team: {result.get('data', {}).get('name', 'Unknown')}\n"
        elif result["object_type"] == "match":
            home = result.get("data", {}).get("home_team", "Unknown")
            away = result.get("data", {}).get("away_team", "Unknown")
            score = f"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}"
            context += f"Match: {home} vs {away} ({score})\n"
        elif result["object_type"] == "news":
            context += f"News: {result.get('data', {}).get('title', 'Unknown')}\n{result.get('data', {}).get('content', '')}\n"

    # 3. Call OpenAI API to generate response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful sports assistant. Answer the user's query using the provided context.",
            },
            {"role": "user", "content": f"{query}\n\nContext:\n{context}"},
        ],
    )

    return response.choices[0].message.content


def generate_response_with_vector_search(query, limit=5):
    """Generate a response with OpenAI's Chat Completions API, using vector search for context."""

    # 1. Perform vector search to retrieve relevant documents
    search_results = perform_vector_search(query, k=limit)

    # 2. Format search results for OpenAI API
    context = ""
    for result in search_results:
        if result["object_type"] == "team":
            context += f"Team: {result.get('name', 'Unknown')}\n"
        elif result["object_type"] == "match":
            home = result.get("data", {}).get("home_team", "Unknown")
            away = result.get("data", {}).get("away_team", "Unknown")
            score = f"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}"
            context += f"Match: {home} vs {away} ({score})\n"
        elif result["object_type"] == "news":
            context += f"News: {result.get('title', 'Unknown')}\n{result.get('data', {}).get('content', '')}\n"

    # 3. Call OpenAI API to generate response
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful sports assistant. Answer the user's query using the provided context.",
            },
            {"role": "user", "content": f"{query}\n\nContext:\n{context}"},
        ],
    )

    return response.choices[0].message.content
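
Both functions above build the context string with nearly identical loops. As a rough sketch, that step could live in one helper that also skips documents returned twice. `format_context` is a hypothetical name, and the sketch assumes the result shape produced by `hybrid_search` (fields nested under `data`); the vector search results may use a different shape.

```python
def format_context(results):
    """Turn search results into a plain-text context block for the prompt."""
    lines = []
    seen_ids = set()
    for result in results:
        # Skip documents already added (the same item can be returned twice).
        doc_id = result.get("_id")
        if doc_id is not None and doc_id in seen_ids:
            continue
        seen_ids.add(doc_id)

        data = result.get("data", {})
        object_type = result.get("object_type")
        if object_type == "team":
            lines.append(f"Team: {data.get('name', 'Unknown')}")
        elif object_type == "match":
            home = data.get("home_team", "Unknown")
            away = data.get("away_team", "Unknown")
            score = f"{data.get('home_score', 0)}-{data.get('away_score', 0)}"
            lines.append(f"Match: {home} vs {away} ({score})")
        elif object_type == "news":
            lines.append(f"News: {data.get('title', 'Unknown')}")
            lines.append(data.get("content", ""))
    return "\n".join(lines)


# Made-up sample results, including one duplicate.
sample_results = [
    {
        "_id": 1,
        "object_type": "match",
        "data": {"home_team": "BAR", "away_team": "RMA", "home_score": 3, "away_score": 2},
    },
    {
        "_id": 1,
        "object_type": "match",
        "data": {"home_team": "BAR", "away_team": "RMA", "home_score": 3, "away_score": 2},
    },
    {"_id": 2, "object_type": "team", "data": {"name": "Real Madrid"}},
]
context = format_context(sample_results)
print(context)
```

Dropping duplicates keeps the prompt shorter and avoids giving a repeated document extra weight in the model's answer.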

In [None]:
query = "Who won El Clasico?"

# Using hybrid search
print("Testing hybrid search with example queries:")
print("=" * 50)
response_hybrid = generate_response_with_hybrid_search(query)

print("=" * 20 + "Hybrid RAG" + "=" * 20)
print("Response (Hybrid Search):", response_hybrid)

# Using vector search
print("\nTesting vector search with example queries:")
print("=" * 50)
response_vector = generate_response_with_vector_search(query)
print("=" * 20 + "Vector RAG" + "=" * 20)
print("Response (Vector Search):", response_vector)

Testing hybrid search with example queries:
Found 5 relevant items:
1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)
2. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)
3. Match: CHE vs TOT (3-0) (Score: 0.0082)
4. Match: BAR vs RMA (3-2) (Score: 0.0082)
5. Team: Real Madrid (Score: 0.0081)
Response (Hybrid Search): Barcelona won El Clásico, defeating Real Madrid with a score of 3-2 at Camp Nou.

Testing vector search with example queries:
Performing vector search for: Who won El Clasico?
Found 5 relevant items:
1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.7120)
2. Match: BAR vs RMA (3-2) (Score: 0.7113)
3. Team: Real Madrid (Score: 0.6963)
4. Team: Atletico Madrid (Score: 0.6953)
5. Match: ATM vs BAR (1-2) (Score: 0.6768)
Response (Vector Search): Barcelona won El Clásico against Real Madrid with a 3-2 victory at Camp Nou.


## Agentic RAG with Hybrid Search

Here we use the [openai-agents](https://openai.github.io/openai-agents-python/) SDK to expose the `hybrid_search` function as a tool. The agent decides which search term and parameters to pass to the tool, so it can rephrase the user's question before searching and carry out multi-step tasks.

In [1]:
!pip install -Uq openai-agents

In [10]:
OPENAI_MODEL = "gpt-4o"

In [13]:
from agents.tool import function_tool


@function_tool
def hybrid_search(
    query: str, limit: int, vector_weight: float, full_text_weight: float
) -> list:
    """Perform a hybrid search using vector search and full-text search."""

    voyage_embeddings = VoyageAIEmbeddings(api_key=VOYAGE_API_KEY)
    query_embedding = voyage_embeddings.client.embed(
        [query], model=voyage_embeddings.model, input_type="query"
    ).embeddings[0]

    pipeline = [
        {
            "$vectorSearch": {
                "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,
                "limit": limit * 2,  # Get more results for potential ranking
            }
        },
        {"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
        {"$unwind": {"path": "$docs", "includeArrayIndex": "rank"}},
        {
            "$addFields": {
                "vs_score": {
                    "$multiply": [
                        vector_weight,
                        {
                            "$divide": [
                                1.0,
                                {
                                    "$add": ["$rank", 60]  # Reciprocal rank fusion: 1 / (rank + 60)
                                },
                            ]
                        },
                    ]
                }
            }
        },
        {
            "$project": {
                "vs_score": 1,
                "_id": "$docs._id",
                "title": "$docs.title",
                "object_type": "$docs.object_type",
                "data": "$docs.data",
            }
        },
        {
            "$unionWith": {
                "coll": VECTOR_COLLECTION,
                "pipeline": [
                    {
                        "$search": {
                            "index": "default",
                            "compound": {
                                "must": [
                                    {
                                        "text": {
                                            "query": query,
                                            "path": {"wildcard": "*"},
                                            "fuzzy": {},
                                        }
                                    }
                                ]
                            },
                        }
                    },
                    {"$limit": limit * 2},
                    {"$group": {"_id": None, "docs": {"$push": "$$ROOT"}}},
                    {"$unwind": {"path": "$docs", "includeArrayIndex": "fts_rank"}},
                    {
                        "$addFields": {
                            "fts_score": {
                                "$multiply": [
                                    full_text_weight,
                                    {"$divide": [1.0, {"$add": ["$fts_rank", 60]}]},
                                ]
                            }
                        }
                    },
                    {
                        "$project": {
                            "fts_score": 1,
                            "_id": "$docs._id",
                            "title": "$docs.title",
                            "object_type": "$docs.object_type",
                            "data": "$docs.data",
                        }
                    },
                ],
            }
        },
        {
            "$addFields": {
                "final_score": {
                    "$add": [
                        {"$ifNull": ["$vs_score", 0]},  # Handle missing vs_score
                        {"$ifNull": ["$fts_score", 0]},  # Handle missing fts_score
                    ]
                }
            }
        },
        {"$sort": {"final_score": -1}},
        {"$limit": limit},
    ]

    results = list(vector_collection.aggregate(pipeline))

    print(f"Found {len(results)} relevant items:")
    for i, result in enumerate(results):
        if result["object_type"] == "team":
            print(
                f"{i+1}. Team: {result.get('data', {}).get('name', 'Unknown')} (Score: {result.get('final_score', 0):.4f})"
            )
        elif result["object_type"] == "match":
            home = result.get("data", {}).get("home_team", "Unknown")
            away = result.get("data", {}).get("away_team", "Unknown")
            score = f"{result.get('data', {}).get('home_score', 0)}-{result.get('data', {}).get('away_score', 0)}"
            print(
                f"{i+1}. Match: {home} vs {away} ({score}) (Score: {result.get('final_score', 0):.4f})"
            )
        elif result["object_type"] == "news":
            print(
                f"{i+1}. News: {result.get('data', {}).get('title', 'Unknown')} (Score: {result.get('final_score', 0):.4f})"
            )

    return results
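
Unlike the earlier version, the tool signature has no default values, so the model supplies `limit` and both weights on every call. The scores in the first agent run (0.0117 for the top item) are consistent with a weight of about 0.7, since 0.7 / 60 ≈ 0.0117, although the notebook does not log the arguments. A minimal sketch of guarding against unusual values before they reach the pipeline follows; `normalize_weights` and `clamp_limit` are hypothetical helpers, and the bounds are arbitrary assumptions.

```python
def normalize_weights(vector_weight, full_text_weight):
    """Clamp both weights to be non-negative and rescale them to sum to 1."""
    vector_weight = max(0.0, float(vector_weight))
    full_text_weight = max(0.0, float(full_text_weight))
    total = vector_weight + full_text_weight
    if total == 0:
        # Fall back to an even split when both weights are zero or negative.
        return 0.5, 0.5
    return vector_weight / total, full_text_weight / total


def clamp_limit(limit, low=1, high=20):
    """Keep the requested result count inside an assumed sane range."""
    return max(low, min(high, int(limit)))


print(normalize_weights(3, 1))
print(normalize_weights(-1, 0))
print(clamp_limit(500))
```

Calling these at the top of the tool body would keep the fused scores comparable across calls, whatever values the model picks.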

In [22]:
import os

from agents import Agent, Runner

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
virtual_primary_care_assistant = Agent(
    name="Sports Assistant specialised on sports queries",
    model=OPENAI_MODEL,
    instructions="""
      You can search for information using the hybrid_search tool. Be excited, like you are a fan!
    """,
    tools=[hybrid_search],
)

example_queries = [
    "Recent Manchester United games",
    "The Red Devils, how did they do?",
    "Who won El Clasico?",
    "Premier League match results",
    "Player injuries news",
    "Bayern Munich performance",
]

# run_result_with_tools = await Runner.run(virtual_primary_care_assistant, input = "Who won El claisco you know?")

print("Testing agentic hybrid search with example queries:")
print("=" * 50)

for query in example_queries:
    print("\n" + "=" * 50)
    print(f"QUERY: {query}")
    print("=" * 50)
    run_result_with_tools = await Runner.run(
        virtual_primary_care_assistant, input=query
    )
    print(run_result_with_tools.final_output)
    print("=" * 50)
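
The loop above uses top-level `await`, which works in a notebook because an event loop is already running. In a plain Python script, the same loop would need to live inside a coroutine that is started with `asyncio.run`. The sketch below shows that shape; `answer_query` is a hypothetical stand-in for the `Runner.run(...)` call so the example runs without API keys.

```python
import asyncio


async def answer_query(query: str) -> str:
    """Hypothetical stand-in for the agent call; returns a canned reply."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"(answer for: {query})"


async def main(queries):
    answers = []
    for query in queries:
        # In the notebook, this is where Runner.run(agent, input=query) is awaited.
        answers.append(await answer_query(query))
    return answers


# asyncio.run starts an event loop; call it from a script, not from a notebook cell.
answers = asyncio.run(main(["Who won El Clasico?", "Bayern Munich performance"]))
print(answers)
```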

Testing agentic hybrid search with example queries:

QUERY: Recent Manchester United games
Found 5 relevant items:
1. Team: Manchester United (Score: 0.0117)
2. Match: MNU vs LIV (2-1) (Score: 0.0115)
3. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0113)
4. Team: Manchester City (Score: 0.0111)
5. Team: Chelsea (Score: 0.0109)

Here are some of the recent Manchester United games:

1. **Against Liverpool**  
   Date: March 24, 2025  
   Competition: Premier League  
   Score: Manchester United 2 - 1 Liverpool  
   **Summary:** Manchester United secured a thrilling 2-1 victory over Liverpool at Old Trafford. Bruno Fernandes opened the scoring with a penalty in the 34th minute, before Marcus Rashford doubled the lead with a brilliant solo effort. Mohamed Salah pulled one back for Liverpool, but United held on for a crucial win.

Bruno Fernandes has also been in sizzling form, winning the Premier League Player of the Month award for March. He scored 4 goals and provided 3 assists in 5 matches. Go Bruno! 🎉

Would you like to know more about any specific game or player? 😊

QUERY: The Red Devils, how did they do?
Found 5 relevant items:
1. Team: Manchester United (Score: 0.0083)
2. Team: Manchester United (Score: 0.0083)
3. Match: BAR vs RMA (3-2) (Score: 0.0082)
4. Team: Borussia Dortmund (Score: 0.0082)
5. Team: L

I couldn't find the latest match results for the Red Devils (Manchester United). However, they are known as one of the top teams in the Premier League! Would you like more info or try a different search? ⚽

QUERY: Who won El Clasico?
Found 1 relevant items:
1. News: Barcelona's Lamine Yamal becomes youngest El Clásico goalscorer (Score: 0.0083)
Barcelona won the latest El Clásico against Real Madrid with a score of 3-2! Lamine Yamal made history by becoming the youngest goalscorer at just 16 years and 107 days old. How amazing is that? 🎉⚽🎉

QUERY: Premier League match results
Found 5 relevant items:
1. News: Manchester City's Erling Haaland on track to break Premier League scoring record (Score: 0.0083)
2. Team: Tottenham Hotspur (Score: 0.0083)
3. Match: CHE vs TOT (3-0) (Score: 0.0082)
4. Team: Chelsea (Score: 0.0082)
5. Team: Manchester City (Score: 0.0081)
Here's an exciting recent Premier League match result for you:

- **Chelsea vs Tottenham Hotspur**
  - **Date**: March 25, 2025
  - **Stadium**: Stamford Bridge
  - **Result**: Chelsea 3-0 Tottenham Hotspur
  - **Summary**: Chelsea dominated the London derby with a 3-0 victory at Stamford Bridge. Cole Palmer scored twice in the first half, and Nicolas Jackson added a third goal in the 78th minute. Spurs found it difficult to create any clear chances throughout the match.

If you want more match results or details, just let me know! 🎉⚽

QUERY: Player injuries news
Found 5 relevant items:
1. News: Manchester United's Bruno Fernandes wins Player of the Month (Score: 0.0083)
2. News: Liverpool suffer injury blow as Salah ruled out for three weeks (Score: 0.0083)
3. Match: ARS vs MNC (1-1) (Score: 0.0082)
4. Team: Inter Milan (Score: 0.0082)
5. Team: Manchester United (Score: 0.0081)

Here's some fresh injury news from the world of sports:

### Liverpool:

- **Mohamed Salah** is facing a setback! 😢 The star forward has been ruled out for three weeks due to a hamstring strain. He sustained the injury during Liverpool's recent match against Manchester United. This comes at a bad time as Liverpool prepares to face off against Arsenal and Manchester City. Manager Jürgen Klopp described the situation as "unfortunate timing." 

Stay tuned for more updates! ⚽🔍

QUERY: Bayern Munich performance
Found 5 relevant items:
1. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0083)
2. Team: Bayern Munich (Score: 0.0083)
3. Team: Bayern Munich (Score: 0.0082)
4. News: Bayern Munich's Harry Kane scores perfect hat-trick in Der Klassiker (Score: 0.0082)
5. Match: BAY vs BVB (4-0) (Score: 0.0081)
Bayern Munich is on fire! 🎉

1. **Harry Kane's Hat-Trick Magic**: Harry Kane recently scored a *perfect hat-trick* (right foot, left foot, and header) as Bayern Munich crushed Borussia Dortmund 4-0 in Der Klassiker. Kane, who joined from Tottenham, is thriving in the Bundesliga, saying he's loving his time in Munich and the fantastic football they're playing!

2. **Match Details**: In that same match, apart from Kane's brilliant performance, Leroy Sané also got on the scoresheet, leading Bayern to a dominant victory at the Allianz Arena.

Bayern Munich is clearly playing some dazzling football right now! ⚽🥳
