## Module 3 Assignment - Building RAG systems with a Vector Database

---

In this assignment you will work with [Weaviate API](https://weaviate.io/) to build a tiny RAG system. You will:

- Load a [collection](https://weaviate.io/developers/weaviate/manage-data/collections) for BBC News data.
- Use the Weaviate API to retrieve documents from the vector database.
- Create functions to retrieve data based on Semantic Search, BM25 and Hybrid Search (using RRF) using the Weaviate API.
- Use an LLM to generate responses.

**IMPORTANT**: This assignment assumes you know how to handle simple tasks with collections in the Weaviate API. If you are not familiar with it yet, please read the Ungraded Lab on the Weaviate API! Furthermore, the data you will be working here is already *chunked*. You can get more hands-on experience on chunking reading the Ungraded Lab on chunking!


# Table of Contents
- [ 1 - Loading the libraries](#1)
- [ 2 - Setting up the Weaviate Client and loading the data](#2)
  - [ 2.1 Loading the Weaviate Client](#2-1)
  - [ 2.2 Loading the data](#2-2)
- [ 3 - Loading the Collection](#3)
  - [ 3.1 Metadata filtering](#3-1)
    - [ Exercise 1](#ex01)
  - [ 3.2 Semantic search](#3-2)
    - [ Exercise 2](#ex02)
  - [ 3.3 BM25 Serach](#3-3)
    - [ Exercise 3](#ex03)
  - [ 3.4 Hybrid search](#3-4)
    - [ Exercise 4](#ex04)
    - [ Exercise 5](#ex05)
- [ 4 - Incorporating the Weaviate API into our previous schema](#4)
  - [ 4.1 Generating the final prompt](#4-1)
  - [ 4.2 LLM call](#4-2)
- [ 5 - Experimenting with Your RAG System](#5)


---
<h4 style="color:black; font-weight:bold;">USING THE TABLE OF CONTENTS</h4>

JupyterLab provides an easy way for you to navigate through your assignment. It's located under the Table of Contents tab, found in the left panel, as shown in the picture below.

![TOC Location](images/toc.png)

---

<h4 style="color:green; font-weight:bold;">TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:</h4>

- All cells are frozen except for the ones where you need to submit your solutions or when explicitly mentioned you can interact with it.

- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.

- Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.

- - To submit your notebook for grading, first save it by clicking the 💾 icon on the top left of the page and then click on the <span style="background-color: blue; color: white; padding: 3px 5px; font-size: 16px; border-radius: 5px;">Submit assignment</span> button on the top right of the page.
---

<a id='1'></a>
## 1 - Loading the libraries

---

Run the cell below to load the necessary libraries for this assignment.

In [1]:
import joblib
import weaviate
from weaviate.classes.query import (
    Filter, 
    Rerank
)

In [None]:
# import flask_app
# import weaviate_server
# from utils import (
#     generate_with_single_input,
#     print_object_properties,
#     display_widget
# )
# import unittests

ModuleNotFoundError: No module named 'FlagEmbedding'

<a id='2'></a>
## 2 - Setting up the Weaviate Client and loading the data

---

In this section, you will set up the Weaviate client and load the data, which consists of the [BBC news dataset](https://www.kaggle.com/datasets/gpreda/bbc-news) adapted from Kaggle.

<a id='2-1'></a>
### 2.1 Loading the Weaviate Client

Let's connect the Weaviate client to begin working with the Weaviate API. The server is already running on the backend. 

**Troubleshooting:**

- If you encounter issues loading the next cell, try restarting your kernel, clicking in the circled arrow in the panel above.

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()
print(os.getenv("OPENAI_API_KEY",""))

client = weaviate.connect_to_custom(
    http_host="cappu",
    http_port=8080,
    http_secure=False,
    grpc_host="cappu",
    grpc_port=50051,
    grpc_secure=False,
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY","")  # Or any other inference API keys
    }
)
# Test the connection
print(client.is_ready())

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
True


In [6]:
client.close()
# Test the connection
print(client.is_ready())

The `WeaviateClient` is closed. Run `client.connect()` to (re)connect!
False


<a id='2-2'></a>
### 2.2 Loading the data

Now, let's load the data. The dataset is structured with the following fields:

- **`title`**: The headline of the article.
- **`pubDate`**: The publication date and time of the article.
- **`guid`**: A unique identifier for the article, commonly used for listing.
- **`link`**: A URL link to access the full article online.
- **`description`**: A brief summary or teaser of the article's content.
- **`article_content`**: The complete text of the article, providing detailed information.

In [9]:
bbc_data = joblib.load('./data/bbc_data.joblib')

In [15]:
from pprint import pprint

pprint(bbc_data[0])

{'article_content': "Justin Welby speaks on BBC Radio 4's Today programme as "
                    'part of a special show guest edited by Dame Emma Warmsley '
                    'The Archbishop of Canterbury has urged politicians not to '
                    'treat their opponents as enemies but fellow human beings. '
                    'Speaking to the BBC, the Most Rev Justin Welby warned '
                    "Britain's leaders to avoid divisive topics. But he said "
                    'our capacity "to disagree deeply and not destructively" '
                    "is cause for hope. Later, he will deliver a new year's "
                    'message reflecting on global conflicts and his wishes for '
                    'a "peaceful 2024". The archbishop\'s intervention came '
                    "during an interview for BBC Radio 4's Today programme, "
                    'which is being guest edited by Dame Emma Walmsley, chief '
                    'executive of pharmaceutical

In [None]:
collection_name = "bbc_collection"


In [5]:
# Delete collection if it exists (optional)
try:
    client.collections.delete(collection_name)
except:
    pass

In [32]:
from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer('BAAI/bge-base-en-v1.5') # Using a smaller model for speed
documents = [
    "The iPhone 15 Pro features a new titanium chassis.",
    "The MacBook Pro is powered by the M3 chip for incredible performance.",
    "The iPad Pro has a stunning Liquid Retina XDR display.",
    "Apple Watch Series 9 introduces a new double-tap gesture."
]
embeddings = model.encode(documents).tolist()
embeddings

[[-0.039039578288793564,
  0.006337731145322323,
  0.03933665528893471,
  0.029621649533510208,
  0.03714470937848091,
  0.020457863807678223,
  -0.017526894807815552,
  0.010904021561145782,
  -0.012459068559110165,
  -0.044477012008428574,
  0.025661488994956017,
  0.015852829441428185,
  -0.018254993483424187,
  -0.0026079716626554728,
  -0.025835074484348297,
  0.02053203247487545,
  0.011544320732355118,
  -0.01366666704416275,
  0.010960279032588005,
  0.050548356026411057,
  -0.05564916506409645,
  0.03157968819141388,
  0.05416615679860115,
  0.021283570677042007,
  -0.025976009666919708,
  -0.006601659115403891,
  0.02718396484851837,
  0.030530864372849464,
  -0.0417865514755249,
  0.03686646372079849,
  -0.001532791182398796,
  -0.049070704728364944,
  -0.03615816310048103,
  -0.06011253595352173,
  -0.024427663534879684,
  -0.017198581248521805,
  0.0021442649886012077,
  -0.008716326206922531,
  -0.0750696137547493,
  -0.016504736617207527,
  -0.012818743474781513,
  0.032

In [11]:
import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.init import Auth
from sentence_transformers import SentenceTransformer
import json

def load_list_to_weaviate(data_list, collection_name="bbc_collection", host="cappu", port=8080, text_field="content"):
    """
    Load a list of dictionaries to Weaviate v4 with custom embeddings
    
    Args:
        data_list: List of dictionaries containing your BBC data
        collection_name: Name for the Weaviate collection
        host: Weaviate host
        port: Weaviate port
        text_field: Field name to use for generating embeddings (default: "content")
    """
    
    # Load the embedding model
    print("Loading BAAI/bge-base-en-v1.5 model...")
    model = SentenceTransformer('BAAI/bge-base-en-v1.5')
    
    # Connect to Weaviate
    client = weaviate.connect_to_custom(
        http_host=host,
        http_port=port,
        http_secure=False,  # Set to True if using HTTPS
        grpc_host=host,
        grpc_port=50051,    # Default gRPC port
        grpc_secure=False   # Set to True if using secure gRPC
    )
    
    try:
        # Check if collection exists and delete it (optional)
        if client.collections.exists(collection_name):
            client.collections.delete(collection_name)
            print(f"Deleted existing collection: {collection_name}")
        
        # Analyze the data structure from first item
        if not data_list:
            print("Error: Data list is empty")
            return
            
        sample_item = data_list[0]
        print(f"Sample data structure: {sample_item}")
        
        # Create collection schema based on your data structure
        # Adjust these properties based on your actual data structure
        properties = []
        
        for key, value in sample_item.items():
            if isinstance(value, str):
                properties.append(Property(
                    name=key,
                    data_type=DataType.TEXT
                ))
            elif isinstance(value, (int, float)):
                properties.append(Property(
                    name=key,
                    data_type=DataType.NUMBER
                ))
            elif isinstance(value, bool):
                properties.append(Property(
                    name=key,
                    data_type=DataType.BOOLEAN
                ))
            elif isinstance(value, list):
                properties.append(Property(
                    name=key,
                    data_type=DataType.TEXT_ARRAY
                ))
            else:
                # Default to TEXT for unknown types
                properties.append(Property(
                    name=key,
                    data_type=DataType.TEXT
                ))
        
        # Create the collection
        collection = client.collections.create(
            name=collection_name,
            properties=properties,
            # No automatic vectorizer - we'll provide custom embeddings
            vectorizer_config=Configure.Vectorizer.none()
        )
        
        print(f"Created collection: {collection_name}")
        
        # Prepare data for insertion with custom embeddings
        clean_data = []
        texts_for_embedding = []
        
        for i, item in enumerate(data_list):
            clean_item = {}
            for key, value in item.items():
                # Handle None/null values
                if value is None:
                    continue
                
                # Convert data types as needed
                if isinstance(value, (list, dict)):
                    # Convert complex types to JSON strings
                    clean_item[key] = json.dumps(value) if isinstance(value, dict) else value
                else:
                    clean_item[key] = value
            
            clean_data.append(clean_item)
            
            # Extract text for embedding (use specified field or fallback)
            text_for_embed = ""
            if text_field in clean_item:
                text_for_embed = str(clean_item[text_field])
            elif "content" in clean_item:
                text_for_embed = str(clean_item["content"])
            elif "title" in clean_item:
                text_for_embed = str(clean_item["title"])
            else:
                # Use all text fields combined
                text_for_embed = " ".join([str(v) for v in clean_item.values() if isinstance(v, str)])
            
            texts_for_embedding.append(text_for_embed)
            
            if i % 1000 == 0:
                print(f"Prepared {i} items...")
        
        # Generate embeddings for all texts
        print(f"Generating embeddings for {len(texts_for_embedding)} texts...")
        embeddings = model.encode(texts_for_embedding, show_progress_bar=True).tolist()
        
        # Batch insert data with custom embeddings
        print(f"Inserting {len(clean_data)} objects with embeddings...")
        
        # Insert in batches
        batch_size = 100
        for i in range(0, len(clean_data), batch_size):
            batch_data = clean_data[i:i + batch_size]
            batch_embeddings = embeddings[i:i + batch_size]
            
            # Prepare objects with custom vectors
            objects_with_vectors = []
            for data, embedding in zip(batch_data, batch_embeddings):
                objects_with_vectors.append({
                    "properties": data,
                    "vector": embedding
                })
            
            response = collection.data.insert_many(objects_with_vectors)
            
            # Check for errors
            if response.has_errors:
                print(f"Batch {i//batch_size + 1} had errors:")
                for error in response.errors:
                    print(f"  - {error}")
            else:
                print(f"Successfully inserted batch {i//batch_size + 1} ({len(batch_data)} items)")
        
        # Verify insertion
        total_count = collection.aggregate.over_all(total_count=True)
        print(f"Total objects in collection: {total_count.total_count}")
        
        # Sample query to verify data
        response = collection.query.fetch_objects(limit=3)
        print("\nSample inserted objects:")
        for obj in response.objects:
            print(f"UUID: {obj.uuid}")
            print(f"Properties: {obj.properties}")
            print("---")
            
    except Exception as e:
        print(f"Error: {e}")
    
    finally:
        client.close()



In [18]:
load_list_to_weaviate(bbc_data[:10], collection_name="bbc_collection")


Loading BAAI/bge-base-en-v1.5 model...
Deleted existing collection: bbc_collection
Sample data structure: {'title': 'Justin Welby: Political leaders should treat opponents as human beings', 'pubDate': Timestamp('2024-01-01 00:00:04'), 'guid': 'https://www.bbc.co.uk/news/uk-67844356', 'link': 'https://www.bbc.co.uk/news/uk-67844356?at_medium=RSS&at_campaign=KARANGA', 'description': 'The Archbishop of Canterbury urges politicians to "forswear wedge issues" and avoid divisive topics.', 'article_content': 'Justin Welby speaks on BBC Radio 4\'s Today programme as part of a special show guest edited by Dame Emma Warmsley The Archbishop of Canterbury has urged politicians not to treat their opponents as enemies but fellow human beings. Speaking to the BBC, the Most Rev Justin Welby warned Britain\'s leaders to avoid divisive topics. But he said our capacity "to disagree deeply and not destructively" is cause for hope. Later, he will deliver a new year\'s message reflecting on global conflicts

            Use the `vector_config` argument instead.
            


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inserting 10 objects with embeddings...
Error: It is forbidden to insert `id` or `vector` inside properties: {'properties': {'title': 'Justin Welby: Political leaders should treat opponents as human beings', 'pubDate': Timestamp('2024-01-01 00:00:04'), 'guid': 'https://www.bbc.co.uk/news/uk-67844356', 'link': 'https://www.bbc.co.uk/news/uk-67844356?at_medium=RSS&at_campaign=KARANGA', 'description': 'The Archbishop of Canterbury urges politicians to "forswear wedge issues" and avoid divisive topics.', 'article_content': 'Justin Welby speaks on BBC Radio 4\'s Today programme as part of a special show guest edited by Dame Emma Warmsley The Archbishop of Canterbury has urged politicians not to treat their opponents as enemies but fellow human beings. Speaking to the BBC, the Most Rev Justin Welby warned Britain\'s leaders to avoid divisive topics. But he said our capacity "to disagree deeply and not destructively" is cause for hope. Later, he will deliver a new year\'s message reflecting o

<a id='3'></a>
## 3 - Loading the Collection

---

In this section, you will load the collection containing the BBC News dataset.

In [16]:
collection = client.collections.get("bbc_collection")

In [17]:
print(f"The number of elements in the collection is: {len(collection)}")

WeaviateClosedClientError: The `WeaviateClient` is closed. Run `client.connect()` to (re)connect!

Let's fetch one example of object in this collection.

In [8]:
object = collection.query.fetch_objects(limit = 1, include_vector = True).objects[0]
print("Printing the properties (some will be truncated due to size)")
print_object_properties(object.properties)
print("Vector: (truncated)",object.vector['main_vector'][0:15])
print("Vector length: ", len(object.vector['main_vector']))

Printing the properties (some will be truncated due to size)
article_content: US Vice-President Kamala Harris has gone on the offensive against Donald Trump in the first rally of...(truncated)
chunk: US Vice-President Kamala Harris has gone on the offensive against Donald Trump in the first rally of...(truncated)
chunk_index: 0
description: The Democratic White House candidate highlights her Republican opponent's criminal conviction, while he portrays her as "radical left".
link: https://www.bbc.com/news/articles/cn053pnv0k1o
pubDate: 2024-07-24 00:14:06+00:00
title: Kamala Harris slams Trump at first rally as he hits back

Vector: (truncated) [-0.05203712731599808, -0.05209054797887802, 0.008704056963324547, -0.012111086398363113, 0.009252868592739105, -0.027179855853319168, 0.020350266247987747, -0.012469162233173847, 0.012519430369138718, -0.028637126088142395, 0.03203899413347244, 0.009703295305371284, -0.08515051752328873, 0.06591801345348358, 0.07906429469585419]
Vector length:  

The vector length is `768`. So every chunk in the vector database is mapped into a 768 dimension vector. This is the vector the Weaviate API uses to perform semantic search.

<a id='ex01'></a>

<a id='3-1'></a>
### 3.1 Metadata filtering

<a id='ex01'></a>
### Exercise 1

In this exercise, you will implement a metadata filtering function. This function will take several inputs: a property (such as `article_content`, `title`, `pubDate`, etc.), the values you want to filter by, the collection you want to search in, and the number of items you want to retrieve.

<details>
<summary style="color: green;">Hint 1</summary>
<p>Remember that to perform filtering based only on metadata, the appropriate method to use is <code>collection.query.fetch_objects</code>.</p>
</details>
<details>
<summary style="color: green;">Hint 2</summary>
<p>When using <code>collection.query.fetch_objects</code>, you must provide the <code>metadata_property</code> as the <code>property</code> and the corresponding <code>Filter</code> object.</p>
</details>
<details>
<summary style="color: green;">Hint 3</summary>
<p>The filter object should be used with the method <code>.by_property</code> for the appropriate property, and <code>.contains_any</code> with the relevant values. A typical call would be <code>Filter.by_property(metadata_property).contains_any(values)</code>.</p>
<p>To limit the results, use <code>limit=limit</code> within the <code>.fetch_objects</code> method.</p>
</details>

In [13]:
# GRADED CELL 

def filter_by_metadata(metadata_property: str, 
                       values: list[str], 
                       collection: "weaviate.collections.collection.sync.Collection" , 
                       limit: int = 5) -> list:
    """
    Retrieves objects from a specified collection based on metadata filtering criteria.

    This function queries a collection within the specified client to fetch objects that match 
    certain metadata criteria. It uses a filter to find objects whose specified 'property' contains 
    any of the given 'values'. The number of objects retrieved is limited by the 'limit' parameter.

    Args:
    metadata_property (str): The name of the metadata property to filter on.
    values (List[str]): A list of values to be matched against the specified property.
    collection_name (weaviate.collections.collection.sync.Collection): The collection to query.
    limit (int, optional): The maximum number of objects to retrieve. Defaults to 5.

    Returns:
    List[Object]: A list of objects from the collection that match the filtering criteria.
    """
    ### START CODE HERE ###
    
    # Retrieve using collection.query.fetch_objects
    
    response = collection.query.fetch_objects(limit = limit, include_vector = True, filters=Filter.by_property(metadata_property).contains_any(values))

    ### END CODE HERE ###
    
    response_objects = [x.properties for x in response.objects]
    
    return response_objects

In [14]:
# Let's get an example
res = filter_by_metadata('title', ['Taylor Swift'], collection, limit = 2)
for x in res:
    print_object_properties(x)

article_content: The 2024 awards season kicked off in style at the Golden Globes - the first major red carpet event o...(truncated)
chunk: some of his previous get-ups. The Bear's Jeremy Allen White - who recently became the new face (and ...(truncated)
chunk_index: 4
description: Stars including Margot Robbie and Taylor Swift arrived in a variety of eye-catching outfits.
link: https://www.bbc.co.uk/news/entertainment-arts-67908727?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-01-08 03:23:58+00:00
title: Margot Robbie, Taylor Swift and more on Golden Globes red carpet

article_content: The 2024 awards season kicked off in style at the Golden Globes - the first major red carpet event o...(truncated)
chunk: headpiece - not entirely a fashion choice. She says the "protective veil" is because she hurt her fa...(truncated)
chunk_index: 5
description: Stars including Margot Robbie and Taylor Swift arrived in a variety of eye-catching outfits.
link: https://www.bbc.co.uk/news/entertainment-

**Expected output**
```
article_content: The 2024 awards season kicked off in style at the Golden Globes - the first major red carpet event o...(truncated)
chunk: some of his previous get-ups. The Bear's Jeremy Allen White - who recently became the new face (and ...(truncated)
chunk_index: 4
description: Stars including Margot Robbie and Taylor Swift arrived in a variety of eye-catching outfits.
link: https://www.bbc.co.uk/news/entertainment-arts-67908727?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-01-08 03:23:58+00:00
title: Margot Robbie, Taylor Swift and more on Golden Globes red carpet

article_content: The 2024 awards season kicked off in style at the Golden Globes - the first major red carpet event o...(truncated)
chunk: headpiece - not entirely a fashion choice. She says the "protective veil" is because she hurt her fa...(truncated)
chunk_index: 5
description: Stars including Margot Robbie and Taylor Swift arrived in a variety of eye-catching outfits.
link: https://www.bbc.co.uk/news/entertainment-arts-67908727?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-01-08 03:23:58+00:00
title: Margot Robbie, Taylor Swift and more on Golden Globes red carpet
```

In [15]:
# Test your solution!
unittests.test_filter_by_metadata(filter_by_metadata, client)

[92m All tests passed!


<a id='ex02'></a>

<a id='3-2'></a>
### 3.2 Semantic search

<a id='ex02'></a>
### Exercise 2

In this exercise, you will implement a semantic search retrieval, similar to the one you created in the previous assignment, but this time utilizing the Weaviate API.

<details>
<summary style="color: green;">Hint</summary>
<p>Remember that to perform semantic search, you should use the method <code>collection.query.near_text</code>.</p>
<p>The <code>top_k</code> parameter in the function dictates how many results to retrieve. In Weaviate, this is referred to as <code>limit</code>. Adjust this parameter as needed.</p>
</details>

In [18]:
# GRADED CELL 

def semantic_search_retrieve(query: str,
                             collection: "weaviate.collections.collection.sync.Collection" , 
                             top_k: int = 5) -> list:
    """
    Performs a semantic search on a collection and retrieves the top relevant chunks.

    This function executes a semantic search query on a specified collection to find text chunks 
    that are most relevant to the input 'query'. The search retrieves a limited number of top 
    matching objects, as specified by 'top_k'. The function returns the 'chunk' property of 
    each of the top matching objects.

    Args:
    query (str): The search query used to find relevant text chunks.
    collection (weaviate.collections.collection.sync.Collection): The collection in which the semantic search is performed.
    top_k (int, optional): The number of top relevant objects to retrieve. Defaults to 5.

    Returns:
    List[str]: A list of text chunks that are most relevant to the given query.
    """
    ### START CODE HERE ###

    # Retrieve using collection.query.near_text
    response = collection.query.near_text(query, limit=top_k)

    ### END CODE HERE ###
    
    response_objects = [x.properties for x in response.objects]
    
    return response_objects

In [19]:
# Let's have an example!
print_object_properties(semantic_search_retrieve(query = 'Tell me about the last Taylor Swift show', collection = collection, top_k = 2))

article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: size crowd at all". At an earlier show in Liverpool, she had also called the Eras Tour the “most exh...(truncated)
chunk_index: 10
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6epvo
pubDate: 2024-08-21 03:02:08+00:00
title: 'I've never had it this good' - Taylor Swift thanks fans after new Wembley record

article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: regular part of the setlist. Last week, the star was joined by Ed Sheeran to play the songs Endgame ...(truncated)
chunk_index: 4
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6epvo
pubDate: 2024-08-2

**Expected Output**
```
article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: size crowd at all". At an earlier show in Liverpool, she had also called the Eras Tour the “most exh...(truncated)
chunk_index: 10
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6epvo
pubDate: 2024-08-21 03:02:08+00:00
title: 'I've never had it this good' - Taylor Swift thanks fans after new Wembley record

article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: regular part of the setlist. Last week, the star was joined by Ed Sheeran to play the songs Endgame ...(truncated)
chunk_index: 4
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6epvo
pubDate: 2024-08-21 03:02:08+00:00
title: 'I've never had it this good' - Taylor Swift thanks fans after new Wembley record

```

In [20]:
unittests.test_semantic_search_retrieve(semantic_search_retrieve, client)

[92m All tests passed!


<a id='ex03'></a>

<a id='3-3'></a>
### 3.3 BM25 Serach

<a id='ex03'></a>
### Exercise 3

In this exercise, you will implement a BM25 retrieval, similar to the one you created in the previous assignment, but now using the Weaviate API.
<details>
<summary style="color: green;">Hint</summary>
<p>To perform a BM25 search, use the method <code>collection.query.bm25</code>.</p>
<p>The <code>top_k</code> parameter in the function specifies how many results to retrieve. In Weaviate, this parameter is referred to as <code>limit</code>. Adjust this accordingly.</p>
</details>

In [21]:
# GRADED CELL 

def bm25_retrieve(query: str, 
                  collection: "weaviate.collections.collection.sync.Collection" , 
                  top_k: int = 5) -> list:
    """
    Performs a BM25 search on a collection and retrieves the top relevant chunks.

    This function executes a BM25-based search query on a specified collection to identify text 
    chunks that are most relevant to the provided 'query'. It retrieves a limited number of the 
    top matching objects, as specified by 'top_k', and returns the 'chunk' property of these objects.

    Args:
    query (str): The search query used to find relevant text chunks.
    collection (weaviate.collections.collection.sync.Collection): The collection in which the BM25 search is performed.
    top_k (int, optional): The number of top relevant objects to retrieve. Defaults to 5.

    Returns:
    List[str]: A list of text chunks that are most relevant to the given query.
    """
    
    ### START CODE HERE ###

    # Retrieve using collection.query.bm25
    response = collection.query.bm25(query,limit=top_k)

    ### END CODE HERE ### 
    
    response_objects = [x.properties for x in response.objects]
    return response_objects 

In [22]:
print_object_properties(bm25_retrieve('Tell me about the last Taylor Swift show', collection, top_k = 2))

article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: police brutality and systemic racism. He was a highly visible supporter of Bernie Sanders' two campa...(truncated)
chunk_index: 4
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us-canada-68201021?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-02-05 23:27:08+00:00
title: Killer Mike dismisses arrest at Grammys as 'speed bump'

article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: Nicki Minaj. He also won a third award for best rap album with his album Michael. "You cannot tell m...(truncated)
chunk_index: 3
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us

**Expected Output**
```
article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: police brutality and systemic racism. He was a highly visible supporter of Bernie Sanders' two campa...(truncated)
chunk_index: 4
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us-canada-68201021?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-02-05 23:27:08+00:00
title: Killer Mike dismisses arrest at Grammys as 'speed bump'

article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: Nicki Minaj. He also won a third award for best rap album with his album Michael. "You cannot tell m...(truncated)
chunk_index: 3
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us-canada-68201021?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-02-05 23:27:08+00:00
title: Killer Mike dismisses arrest at Grammys as 'speed bump'
```

In [23]:
unittests.test_bm25_retrieve(bm25_retrieve, client)

[92m All tests passed!


<a id='ex04'></a>

<a id='3-4'></a>
### 3.4 Hybrid search

<a id='ex04'></a>
### Exercise 4

In this exercise, you will implement a Reciprocal Rank Fusion (RRF) retrieval system using the Weaviate API. To achieve this, you will need to use the `collection.query.hybrid` method.



<details>
<summary style="color: green;">Hint</summary>
<p>To perform a hybrid search, use the method <code>collection.query.hybrid</code>.</p>
<p>The <code>top_k</code> parameter in the function specifies how many results to retrieve. In Weaviate, this is referred to as <code>limit</code>. Make sure to also include the <code>alpha</code>.</p>
</details>

In [26]:
# GRADED CELL 

def hybrid_retrieve(query: str, 
                    collection: "weaviate.collections.collection.sync.Collection" , 
                    alpha: float = 0.5,
                    top_k: int = 5
                   ) -> list:
    """
    Performs a hybrid search on a collection and retrieves the top relevant chunks.

    This function executes a hybrid search that combines semantic vector search and traditional 
    keyword-based search on a specified collection to find text chunks most relevant to the 
    input 'query'. The relevance of results is influenced by 'alpha', which balances the weight 
    between vector and keyword matches. It retrieves a limited number of top matching objects, 
    as specified by 'top_k', and returns the 'chunk' property of these objects.

    Args:
    query (str): The search query used to find relevant text chunks.
    collection (weaviate.collections.collection.sync.Collection): The collection in which the hybrid search is performed.
    alpha (float, optional): A weighting factor that balances the contribution of semantic 
    and keyword matches. Defaults to 0.5.
    top_k (int, optional): The number of top relevant objects to retrieve. Defaults to 5.

    Returns:
    List[str]: A list of text chunks that are most relevant to the given query.
    """
    ### START CODE HERE ### 

    # Retrieve using collection.query.hybrid
    response = collection.query.hybrid(query=query,alpha=alpha,limit=top_k)

    ### END CODE HERE ###
    
    response_objects = [x.properties for x in response.objects]
    
    return response_objects 

In [27]:
print_object_properties(hybrid_retrieve('Tell me about the last Taylor Swift show', collection, top_k = 2))

article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: police brutality and systemic racism. He was a highly visible supporter of Bernie Sanders' two campa...(truncated)
chunk_index: 4
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us-canada-68201021?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-02-05 23:27:08+00:00
title: Killer Mike dismisses arrest at Grammys as 'speed bump'

article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: size crowd at all". At an earlier show in Liverpool, she had also called the Eras Tour the “most exh...(truncated)
chunk_index: 10
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6e

**Expected Output**
```
article_content: Rapper Killer Mike won three Grammys in the rap category - best rap song, best rap performance and b...(truncated)
chunk: police brutality and systemic racism. He was a highly visible supporter of Bernie Sanders' two campa...(truncated)
chunk_index: 4
description: The 48-year-old was detained on a misdemeanour charge after winning three awards in the rap category.
link: https://www.bbc.co.uk/news/world-us-canada-68201021?at_medium=RSS&at_campaign=KARANGA
pubDate: 2024-02-05 23:27:08+00:00
title: Killer Mike dismisses arrest at Grammys as 'speed bump'

article_content: Taylor Swift has finished the European leg of her Eras Tour with a record-breaking show at Wembley S...(truncated)
chunk: size crowd at all". At an earlier show in Liverpool, she had also called the Eras Tour the “most exh...(truncated)
chunk_index: 10
description: The star is joined by Florence + The Machine and sings So Long, London at her final UK show.
link: https://www.bbc.com/news/articles/cr5nr3n6epvo
pubDate: 2024-08-21 03:02:08+00:00
title: 'I've never had it this good' - Taylor Swift thanks fans after new Wembley record
```

In [28]:
unittests.test_hybrid_retrieve(hybrid_retrieve, client)

[92m All tests passed!


### 5 - Reranking

<a id='ex05'></a>

<a id='ex05'></a>
### Exercise 5

In this section, you will create a new version of `semantic_search` that allows reranking of the results. This new function must support using a different query for reranking or reranking based on a specific document property (e.g., reranking using only the title).

Your task is to add the `rerank` parameter to the `collection.query.near_text` call.

<details>
<summary style="color: green;">Hint 1</summary>
<p>Remember that <code>collection.query.near_text</code> takes a query, a limit (i.e., <code>top_k</code>), and now also requires the <code>rerank</code> parameter.</p>
</details>

<details>
<summary style="color: green;">Hint 2</summary>
<p>The <code>Rerank</code> object is already loaded into memory. It takes two parameters: the query and the document property to use for ranking—<code>query</code> and <code>prop</code>, respectively.</p>
</details>

<details>
<summary style="color: green;">Hint 3</summary>
<p>Define the reranker as <code>reranker = Reranker(appropriate_parameters)</code>. Don’t forget: the query for the reranker should be <code>rerank_query</code>!</p>
</details>

In [33]:
# GRADED CELL 

def semantic_search_with_reranking(query: str, 
                                   rerank_property: str,
                                   collection: "weaviate.collections.collection.sync.Collection" , 
                                   rerank_query: str = None,
                                   top_k: int = 5
                                   ) -> list:
    """
    Performs a semantic search and reranks the results based on a specified property.

    Args:
        query (str): The search query to perform the initial search.
        rerank_property (str): The property used for reranking the search results.
        collection (weaviate.collections.collection.sync.Collection): The collection to search within.
        rerank_query (str, optional): The query to use specifically for reranking. If not provided, 
                                      the original query is used for reranking.
        top_k (int, optional): The maximum number of top results to return. Defaults to 5.

    Returns:
        list: A list of properties from the reranked search results, where each item corresponds to 
              an object in the collection.
    """
    ### START CODE HERE ### 

    # Set the rerank_query to be the same as the query if rerank_query is not passed (don't change this line)
    if rerank_query is None: 
        rerank_query = query 
        
    # Define the reranker with rerank_query and rerank_property
    reranker = Rerank(prop=rerank_property, query=rerank_query)

    # Retrieve using collection.query.near_text with the appropriate parameters (do not forget the rerank!)
    response = collection.query.near_text(query, limit=top_k, rerank=reranker)

    ### END CODE HERE ###
    
    response_objects = [x.properties for x in response.objects]
    
    return response_objects 

The reranker model receives a query and a passage (in our case, a chunk of the result) to compute a similarity score.

In [34]:
# Set a query
query = 'Tell me about the conflicts in Latin America'
# Get the results from a search (in this case the hybrid search)
results = semantic_search_with_reranking(query, collection = collection, top_k = 2, rerank_property = 'chunk')

In [35]:
print_object_properties(results)

article_content: A huge diplomatic row has erupted after Spain's transport minister suggested Argentina's president h...(truncated)
chunk: weeks' to attend the launch of Vox's European election campaign, newspaper El Pais reported. Mr Mile...(truncated)
chunk_index: 3
description: A row breaks out after Spain's transport minister suggests Argentina's president has taken drugs.
link: https://www.bbc.com/news/articles/czd8qzvpl4lo
pubDate: 2024-05-04 15:56:45+00:00
title: Spain-Argentina row over drug-use accusation

article_content: Opposition supporters have gathered across Venezuela to protest against Nicolás Maduro's disputed vi...(truncated)
chunk: the world, from Australia to Spain and also in the United Kingdom, Canada, Colombia, Mexico and Arge...(truncated)
chunk_index: 4
description: Opposition leader María Corina Machado joined thousands of demonstrators in the capital Caracas.
link: https://www.bbc.com/news/articles/cgedgqqy7x9o
pubDate: 2024-08-17 23:19:48+00:00
title: Prote

**Expected Results**
```
article_content: A huge diplomatic row has erupted after Spain's transport minister suggested Argentina's president h...(truncated)
chunk: weeks' to attend the launch of Vox's European election campaign, newspaper El Pais reported. Mr Mile...(truncated)
chunk_index: 3
description: A row breaks out after Spain's transport minister suggests Argentina's president has taken drugs.
link: https://www.bbc.com/news/articles/czd8qzvpl4lo
pubDate: 2024-05-04 15:56:45+00:00
title: Spain-Argentina row over drug-use accusation

article_content: Opposition supporters have gathered across Venezuela to protest against Nicolás Maduro's disputed vi...(truncated)
chunk: the world, from Australia to Spain and also in the United Kingdom, Canada, Colombia, Mexico and Arge...(truncated)
chunk_index: 4
description: Opposition leader María Corina Machado joined thousands of demonstrators in the capital Caracas.
link: https://www.bbc.com/news/articles/cgedgqqy7x9o
pubDate: 2024-08-17 23:19:48+00:00
title: Protests across Venezuela as election dispute goes on
```

In [36]:
# Test your function!
unittests.test_semantic_search_with_reranking(semantic_search_with_reranking, client)

[92m All tests passed!


<a id='4'></a>
## 4 - Incorporating the Weaviate API into our previous schema
---

This section is not graded. Here, you will revisit the functions used throughout the assignments to integrate the Weaviate API into your existing schema. Once integrated, you will be able to run prompts and test your new RAG system!

<a id='4-1'></a>
### 4.1 Generating the final prompt


In [37]:
def generate_final_prompt(query: str, 
                          top_k: int, 
                          retrieve_function: callable,
                          rerank_query: str = None, 
                          rerank_property: str = None, 
                          use_rerank: bool = False, 
                          use_rag: bool = True) -> str:
    """
    Generates a final prompt by optionally retrieving and formatting relevant documents using retrieval-augmented generation (RAG).

    Args:
        query (str): The initial query to be used for document retrieval.
        top_k (int): The number of top documents to retrieve and use for generating the prompt.
        retrieve_function (callable): The function used to retrieve documents based on the query.
        rerank_query (str, optional): The query used specifically for reranking documents if reranking is enabled.
        rerank_property (str, optional): The property used for reranking. Required if 'use_rerank' is True.
        use_rerank (bool, optional): Flag to denote whether to use reranking in document retrieval. Defaults to False.
        use_rag (bool, optional): Flag to determine whether to use retrieval-augmented generation. Defaults to True.

    Returns:
        str: A constructed prompt that includes the original query and formatted retrieved documents if 'use_rag' is True.
             Otherwise, it returns the original query.
    """
    # If no rag, return the query
    if not use_rag:
        return query
    
    if use_rerank:
        if rerank_property is None:
            raise ValueError('rerank_property must be set if use_rerank = True')
        top_k_documents = retrieve_function(query=query, top_k=top_k, collection = collection, rerank_property = rerank_property, rerank_query = rerank_query)
    else:
        top_k_documents = retrieve_function(query=query, top_k=top_k, collection = collection)
    
    # Initialize an empty string to store the formatted data.
    formatted_data = ""
    
    # Iterate over each retrieved document.
    for document in top_k_documents:
        # Format each document into a structured string.
        document_layout = (
            f"Title: {document['title']}, Chunk: {document['chunk']}, "
            f"Published at: {document['pubDate']}\nURL: {document['link']}"
        )
        # Append the formatted string to the main data string with a newline for separation.
        formatted_data += document_layout + "\n"
    
    # If use_rag flag is True, construct the enhanced prompt with the augmented data.
    retrieve_data_formatted = formatted_data  # Store formatted data.
    prompt = (
        f"Answer the user query below. There will be provided additional information for you to compose your answer. "
        f"The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, "
        f"you should not rely only on this information to answer the query, but add it to your overall knowledge."
        f"The news data is ordered by relevance."
        f"Query: {query}\n"
        f"2024 News: {retrieve_data_formatted}"
    )
    
    return prompt

In [38]:
prompt = generate_final_prompt("Tell me the economic situation of the US in 2024.", top_k = 5, retrieve_function = semantic_search_retrieve, use_rerank = False, rerank_property = 'title')

In [39]:
print(prompt)

Answer the user query below. There will be provided additional information for you to compose your answer. The relevant information provided is from 2024 and it should be added as your overall knowledge to answer the query, you should not rely only on this information to answer the query, but add it to your overall knowledge.The news data is ordered by relevance.Query: Tell me the economic situation of the US in 2024.
2024 News: Title: What is a recession and how could one affect me?, Chunk: which was much better than expected. That put the US at 2.5% over 2023 as a whole, the best performance of all other advanced economies. It is also expected to outperform the rest of the G7 in 2024. In October 2023, the International Monetary Fund (IMF) predicted that the UK would grow by just 0.6% in 2024. The independent Office for Budget Responsibility (OBR) expects the UK economy to grow by 0.7% in 2024, but that is less than half of its earlier prediction of 1.8% growth. How will the UK econom

<a id='4-2'></a>
### 4.2 LLM call

Let's revisit the `llm_call` function, now adapted to this assignment.

In [40]:
def llm_call(query: str, 
             retrieve_function: callable = None, 
             top_k: int = 5, 
             use_rag: bool = True, 
             use_rerank: bool = False, 
             rerank_property: str = None, 
             rerank_query: str = None) -> str:
    """
    Simulates a call to a language model by generating a prompt and using it to produce a response.

    Args:
        query (str): The initial query for which a response is sought.
        retrieve_function (callable, optional): The function used to retrieve documents related to the query.
        top_k (int, optional): The number of top documents to retrieve and use for generating the prompt. Defaults to 5.
        use_rag (bool, optional): Indicates whether to use retrieval-augmented generation. Defaults to True.
        use_rerank (bool, optional): Indicates whether to apply reranking to the retrieved documents. Defaults to False.
        rerank_property (str, optional): The property to use for reranking. Required if 'use_rerank' is True.
        rerank_query (str, optional): The query used specifically for reranking documents if reranking is enabled.

    Returns:
        str: The generated response content after processing the prompt with a language model.
    """
    
    # Get the prompt
    PROMPT = generate_final_prompt(query, top_k = top_k, retrieve_function = retrieve_function, use_rag = use_rag, use_rerank = use_rerank, rerank_property = rerank_property, rerank_query = rerank_query)
    
    generated_response = generate_with_single_input(PROMPT)

    generated_message = generated_response['content']
    
    return generated_message

In [41]:
query = "Tell me about United States and Brazil's relationship over the course of 2024. Provide links for the resources you use in the answer."

In [42]:
# Result with reranked results
print(llm_call(query = query, 
               top_k = 5, 
               retrieve_function = hybrid_retrieve, 
               ))

The relationship between the United States and Brazil in 2024 is complex and multifaceted. While there are areas of cooperation, there are also areas of tension and disagreement.

One positive aspect of the relationship is the growing economic ties between the two countries. Brazil is a significant market for U.S. goods and services, and the U.S. is a major investor in Brazil's economy. The two countries have also cooperated on issues such as trade, security, and climate change.

However, there are also areas of tension in the relationship. The U.S. has expressed concerns about Brazil's human rights record, particularly with regards to the treatment of indigenous peoples and the environment. Brazil has also been critical of U.S. policies on issues such as immigration and trade.

In terms of specific events, in March 2024, French President Emmanuel Macron visited Brazil and was warmly received by President Luiz Inacio Lula da Silva. The two leaders discussed a range of issues, including

<a id='5'></a>
## 5 - Experimenting with Your RAG System

Now it is time for you to experiment with the system! Run the next cell to load a widget that will input a query, a rerank property, and output five different LLM responses:

1. With semantic search
2. With semantic search and reranking
3. With BM25 search
4. With hybrid search
5. Without RAG


In [43]:
display_widget(llm_call, semantic_search_retrieve, bm25_retrieve, hybrid_retrieve, semantic_search_with_reranking)

HTML(value='\n    <style>\n        .custom-output {\n            background-color: #f9f9f9;\n            color…

HBox(children=(Label(value='Query:', layout=Layout(width='10%')), Text(value="Tell me about United States and …

IntSlider(value=5, description='Top K:', max=20, min=1, style=SliderStyle(description_width='initial'))

Dropdown(description='Rerank Property:', options=('title', 'chunk'), style=DescriptionStyle(description_width=…

Button(description='Get Responses', style=ButtonStyle(button_color='#eee'))

Output()

HBox(children=(VBox(children=(Label(value='Semantic Search'), Output(layout=Layout(border_bottom='1px solid #c…

HBox(children=(VBox(children=(Label(value='Hybrid Search'), Output(layout=Layout(border_bottom='1px solid #ccc…

Congratulations on finishing this assignment!

In [None]:
!tar czvf .