# Hybrid Search Exercise

In this notebook, we will delve into the implementation details of the hybrid search using practical tools in the RAG application. You will have the opportunity to implement it using build-in sparse vector support in a vector database (Qdrant), and a popular reranking model from Cohere.



### Visual improvements

In [4]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("../themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

## Hybrid Search - Sparse Index

You will use sparse vector supported database to complement the semantic search (dense vector). You will calculate the sparse and dense vectors using sparse and dense encoders. For the sparse encoder you will use Splade.

### Loading complex documents

You will read a couple of the legal documents, we used in the previous exercise. For simplicity the chunking will be based on the document pages, and not the more time consuming semantic chunking, you used in the previous exercise. 

In [2]:
import pymupdf4llm

import requests
import os

start_doc_number = 193
end_doc_number = 194

doc_urls = [
    f"https://static.case.law/wash-app/{doc_number}.pdf" 
    for doc_number in range(start_doc_number, end_doc_number + 1)
]

md_texts = []  # Initialize an empty list to store md_text for each file
local_data_folder = "data"
import os
if not os.path.exists(local_data_folder):
    os.makedirs(local_data_folder)

for url in doc_urls:
    response = requests.get(url)
    # Download a local copy of the PDF to the local data folder
    local_pdf_path = os.path.join(local_data_folder, url.split('/')[-1])
    with open(local_pdf_path, 'wb') as f:
        f.write(response.content)
    
    # Use pymupdf4llm functions on the local file
    md_text = pymupdf4llm.to_markdown(local_pdf_path, page_chunks=True)
    md_texts.append(md_text)  # Append the md_text to the list

Processing data/193.pdf...
Processing data/194.pdf...


### Chunking the documents 

You will create chunks based on the pages of the documents. Let's create the metadata of the chunks with the page number, and document URL

In [3]:
console.print(md_texts[0][100])

Let's check how many pages are in the first document:

In [None]:
number_of_pages = ### YOUR CODE HERE ###
console.print(number_of_pages)

### Creating the Sparse Index

We will use an advanced sparse encoder, [SPLADE](https://github.com/naver/splade). This method uses a pre-trained model to tokenize and calcualte the score for each token, as well as expanding tokens, not found in the document. For example, "efficient neural search" will generate the tokens: "efficient", "neural", "bert", and "search", with their relevant weights or scores. 

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

### Calculating Sparse Vector

The following function compute the sparse vector of a given text using the Splade encoder.

In [5]:
import torch


def compute_vector(text):
    """
    Computes a vector from logits and attention mask using ReLU, log, and max operations.
    """
    tokens = tokenizer(text, return_tensors="pt")
    output = model(**tokens)
    logits, attention_mask = output.logits, tokens.attention_mask
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    max_val, _ = torch.max(weighted_log, dim=1)
    vec = max_val.squeeze()

    return vec, tokens

test_text = """
Arthur Robert Ashe Jr. (July 10, 1943 – February 6, 1993) was an American professional tennis player. 
He won three Grand Slam titles in singles and two in doubles.
"""
vec, tokens = compute_vector(test_text)
console.print(vec.shape)

### Explore Sparse Vector 

The following function calculates the tokens and their weights from the sparse vector and returns a dictionary with textual tokens as keys.

In [6]:
def extract_and_map_sparse_vector(vector, tokenizer):
    """
    Extracts non-zero elements from a given vector and maps these elements to their human-readable tokens using a tokenizer. The function creates and returns a sorted dictionary where keys are the tokens corresponding to non-zero elements in the vector, and values are the weights of these elements, sorted in descending order of weights.

    This function is useful in NLP tasks where you need to understand the significance of different tokens based on a model's output vector. It first identifies non-zero values in the vector, maps them to tokens, and sorts them by weight for better interpretability.

    Args:
    vector (torch.Tensor): A PyTorch tensor from which to extract non-zero elements.
    tokenizer: The tokenizer used for tokenization in the model, providing the mapping from tokens to indices.

    Returns:
    dict: A sorted dictionary mapping human-readable tokens to their corresponding non-zero weights.
    """

    # Extract indices and values of non-zero elements in the vector
    cols = vector.nonzero().squeeze().cpu().tolist()
    weights = vector[cols].cpu().tolist()

    # Map indices to tokens and create a dictionary
    idx2token = {idx: token for token, idx in tokenizer.get_vocab().items()}
    token_weight_dict = {
        idx2token[idx]: round(weight, 2) for idx, weight in zip(cols, weights)
    }

    # Sort the dictionary by weights in descending order
    sorted_token_weight_dict = {
        k: v
        for k, v in sorted(
            token_weight_dict.items(), key=lambda item: item[1], reverse=True
        )
    }

    return sorted_token_weight_dict


# Usage example
sorted_tokens = extract_and_map_sparse_vector(vec, tokenizer)
console.print(sorted_tokens)

In [None]:
len(sorted_tokens)

### Dense Encoder

You will use an encoder that was pre-trained on legal documents.

In [8]:
from dotenv import load_dotenv

load_dotenv()

True

In [None]:
from openai import OpenAI
import openai
embedding_client = OpenAI()

dense_vector_dimension = 1024
def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return embedding_client.embeddings.create(
            input = [text], 
            model=model,
            # Reduce the size of the embedding vector to 1024
            ### YOUR CODE HERE ###
        ).data[0].embedding


### Create the vector collection



In [10]:
from qdrant_client import QdrantClient
from qdrant_client.http import models

# Qdrant client setup
vector_db_client = QdrantClient(":memory:")

# Define collection name
COLLECTION_NAME = "sparse_legal"

In [11]:
vector_db_client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "text-dense": models.VectorParams(
            size=dense_vector_dimension,  # we will compress OpenAI Embeddings
            distance=models.Distance.COSINE,
        )
    },
    sparse_vectors_config={
        "text-sparse": models.SparseVectorParams(
            index=models.SparseIndexParams(
                on_disk=False,
            )
        )
    },
)

True

In [13]:
import uuid

def generate_uuid(doc_url, i):
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_url}/{i}"))

from tqdm import tqdm

chunk_num = sum(len(md_text) for md_text in md_texts)
state = "Washington"

corpus_json = []
for index, md_text in enumerate(md_texts):
    doc_url = doc_urls[index]
    for i, chunk in tqdm(enumerate(md_text), total=len(md_text), desc=f"Processing {index+1}/{len(md_texts)} documents"):
        chunk_text = chunk['text']
        title = chunk['metadata']['title']
        # Extract the page number from the metadata
        page = chunk['metadata']['page']
        chunk_uuid = generate_uuid(doc_url, page)
        enriched_chunk = f"{title} \n {chunk_text}"
        sparse_vector, tokens = compute_vector([enriched_chunk[:512]])
        doc_indices = sparse_vector.nonzero().numpy().flatten()
        doc_values = sparse_vector.detach().numpy()[doc_indices]
        vector_db_client.upsert(
            collection_name=COLLECTION_NAME,
            points=[
                models.PointStruct(
                    id=chunk_uuid,
                    payload={
                        "document": chunk_text,
                            "metadata" : {
                            "title": title,
                            "state": state,
                            "doc_url": doc_url,
                            "page": page,
                        }
                    }, 
                    vector={
                        "text-dense": ### YOUR CODE HERE ###
                        "text-sparse": models.SparseVector(
                            indices=doc_indices.tolist(), values=doc_values.tolist()
                        )
                    },
                )
            ],
        )


Processing 1/2 documents: 100%|██████████| 1041/1041 [08:49<00:00,  1.97it/s] 
Processing 2/2 documents: 100%|██████████| 1048/1048 [07:47<00:00,  2.24it/s]


### Searching the Sparse Index

As we are doing in the dense index, we need to tokenize and encode the query text:

In [18]:
# Preparing a query vector

query_text = "Please find me cases that are talking about bank foreclosure of mortage defaults"
query_vec, query_tokens = compute_vector(query_text)

query_indices = query_vec.nonzero().numpy().flatten()
query_values = query_vec.detach().numpy()[query_indices]

console.print(extract_and_map_sparse_vector(query_vec, tokenizer))

And use the encoded query to search the sparse index:

In [17]:
# Searching for similar documents
result = vector_db_client.search(
    collection_name=COLLECTION_NAME,
    query_vector=models.NamedSparseVector(
        name="text-sparse",
        vector=models.SparseVector(
            indices=query_indices,
            values=query_values,
        ),
    ),
    with_vectors=True,
)

result

[ScoredPoint(id='37417cdc-3b52-59c6-9080-9f659f69e69e', version=0, score=12.318807601928711, payload={'document': 'tators’ discussions about the discharge ofdeeds oftrust and\n\nmortgages do not support this result.11\n\n      - 1 Rather, the trial court’s ruling fails to recognize that\n\nenforcement of a promissory note and foreclosure of a deed\n\nof trust securing that note are separate remedies of a\n\ncreditor in the event of a borrower’s default.12 The inability\n\nto pursue one remedy does not bar the other.\n\n     - 2 The trial court’s ruling in this case has a practical\n\neffect. That effect is that the Edmundsons retain ownership\n\nof property without repaying the loan used to purchase it.\n\nThe loss shifts to the lender because the Edmundsons no\nlonger have any personal obligation on the promissory note\n\ndue to the discharge in bankruptcy. Under the trial court’s\n\nruling, the lender also has no right to realize on the\n\ncollateral for the loan. Neither the equity 

### Hybrid Search of Sparse and Dense vectors

You will now search both indexes using `search_batch`. Please note to use the correct encoders for each index.

In [19]:
# Compute the dense embedding vector of the query
query_dense_vector = get_embedding(query_text)

hits = vector_db_client.search_batch(
    collection_name=COLLECTION_NAME,
    requests=[
        models.SearchRequest(
            vector=models.NamedVector(
                name="text-dense",
                vector=query_dense_vector,
            ),
            limit=10,
        ),
        models.SearchRequest(
            vector=models.NamedSparseVector(
                name="text-sparse",
                vector=models.SparseVector(
                    indices=query_indices,
                    values=query_values,
                ),
            ),
            limit=10,
        ),
    ],
)

Create a set (remove duplicates) of chunk ids from the two set of results.

In [20]:
unique_ids = set([hit.id for hit in hits[0] + hits[1]])

len(unique_ids)

16

In [21]:
result_chunks = vector_db_client.retrieve(
    collection_name=COLLECTION_NAME,
    ids=list(unique_ids)
)

In [22]:
console.print(result_chunks[0])

### Chunk augmentation

Since the page breaks are arbitrary, you want to concatenate the previous and next page for each of the pages in the results. You will need to use the function to generate the UUID of the chunk in the database, with the metadata of each result.

In [23]:
for result in result_chunks:
    doc_title = result.payload['metadata']['title']
    doc_url = result.payload['metadata']['doc_url']
    page = result.payload['metadata']['page']
    previous_page_uuid = generate_uuid(doc_url, page-1)
    next_page_uuid = generate_uuid(doc_url, page+1)
    
    previous_page_result = vector_db_client.retrieve(
        collection_name=COLLECTION_NAME,
        ids=[previous_page_uuid]
    )
    next_page_result = vector_db_client.retrieve(
        collection_name=COLLECTION_NAME,
        ids=[next_page_uuid]
    )
    
    if previous_page_result:
        previous_page_text = f"<page-break-{page-1}> {previous_page_result[0].payload['document']}"
    else:
        previous_page_text = ""
        
    if next_page_result:
        next_page_text = f"<page-break-{page+1}> {next_page_result[0].payload['document']}"
    else:
        next_page_text = ""
    
    result.payload['document'] = f"""
        Document title: {title}, Document URL: {doc_url}
        {previous_page_text} 
        <page-break-{page}> {result.payload['document']} 
        {next_page_text}"""

In [24]:
console.print(result_chunks[0])

## Hybrid Search - Merging Results by Reranking

You will use a popular [reranking service by Cohere](https://docs.cohere.com/reference/rerank)

In [25]:
import cohere

co = cohere.Client()

docs = ### YOUR CODE HERE ###

rerank_response = co.rerank(
    model="rerank-english-v3.0",
    query=query_text,
    documents=docs,
    top_n=3,
)

## Using reranked results to generate a reply

We can now take the reranked results and call the LLM to generate the reply to the user's query.

In [32]:
extracted_documents = [
    result_chunks[result.index].payload['document'] 
    for result in rerank_response.results
]

In [33]:
# Now time to connect to the large language model
from openai import OpenAI
from rich.text import Text

client = OpenAI()
system_message = """
You are a paralegal specialist. 
Your top priority is to help lawyers find information in a large legal corpus and help them with their requests. 
Please include citation to the document and the page number.
"""
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": query_text},
        {"role": "assistant", "content": str(extracted_documents)}
    ]
)

In [34]:
from rich.panel import Panel

response_text = Text(completion.choices[0].message.content)
styled_panel = Panel(
    response_text,
    title=f"Reply to '{query_text}'",
    expand=False,
    border_style="bright_yellow",
    padding=(1, 1)
)

console.print(styled_panel)