<a href="https://colab.research.google.com/github/niyazkzubair/pythonprojects/blob/master/RAG_experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

<br>
<br>
<br>

# Retrieval Augmented Generation

<br>

* the main steps of a RAG pipeline:
    <br>
    
    * data preprocessing
    * vector database creation
    * information retrieval
    * prompting the LLM

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<br>
<br>
<br>
<br>
<br>

## Data preprocessing

<br>

* the first thing we need to do when preparing data is to make sure that we have an efficient way of loading it
    <br>
    
    * the loading process can be handled using the various data loaders available in LangChain
    * in this example, I will be working with PDF files, so I will use the **PyPDFDirectoryLoader**

<br>

* after loading the data, we next need to split it into chunks
    <br>
    
    * instead of creating a vector for each individual word, which is inefficient, we will generate a vector for each chunk of text
    * there are many approaches to splitting text into chunks, but in this example I will use the **recursive splitter** from LangChain
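
* to make the idea of chunk size and chunk overlap concrete, here is a minimal sketch of a fixed-window splitter; it is a simplified stand-in and not the recursive algorithm used by LangChain, and the function name `split_with_overlap` is hypothetical

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=80):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the recursive splitter also tries to
    break on separators such as paragraphs and sentences.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Each chunk repeats the last character of the previous one
example_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

* the overlap means that a sentence cut at a chunk boundary still appears whole, or nearly whole, in one of the two neighbouring chunks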

<br>

In [None]:
pip install langchain-community langchain-core


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader

In [None]:
pip install chromadb


In [None]:
# Create a class to load in our PDF files and split the text
# into chunks

class DocumentProcessor:
    """
    A class to load and split documents using a recursive character splitter.

    Attributes:
        data_path (str): The path to the directory containing the documents.
        splitter_type (str): The type of splitter to use ("recursive").
        chunk_size (int): The size of each chunk (default is 800 for recursive).
        chunk_overlap (int): The overlap between chunks (default is 80 for recursive).

    Methods:
        load_documents():
            Loads documents from the specified directory.
        split_documents(documents):
            Splits the provided documents using the specified splitter type.
        process_documents():
            Loads and splits documents, returning the split document chunks.
    """
    def __init__(self, data_path, splitter_type="recursive", chunk_size=800, chunk_overlap=80):
        """
        Initializes the DocumentProcessor with the specified parameters.
        The parameters are the directory path, splitter type, chunk size, and chunk overlap.

        Args:
            data_path (str): The path to the directory containing the documents.
            splitter_type (str): The type of splitter to use ("recursive").
            chunk_size (int): The size of each chunk (default is 800 for recursive).
            chunk_overlap (int): The overlap between chunks (default is 80 for recursive).
        """
        self.data_path = data_path
        self.splitter_type = splitter_type
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def load_documents(self):
        """
        Loads the PDF documents from the specified directory.

        Returns:
            iterator: An iterator over the loaded documents (one per PDF page).
        """
        print("Loading documents from ", self.data_path)

        loader = PyPDFDirectoryLoader(self.data_path)

        # Load once and reuse the result; calling load() and then lazy_load()
        # would parse every PDF twice
        docs = loader.load()
        print("Loaded files:")
        for doc in docs:
            # Printed once per page, so a file name repeats for multi-page PDFs
            print(doc.metadata["source"])

        return iter(docs)

    def split_documents(self, documents):
        """
        Splits the provided documents into chunks using the specified splitter type.

        Args:
            documents (list[Document]): A list of Document objects to be split.

        Yields:
            Document: Chunks of the original documents after splitting.

        Raises:
            ValueError: If an unsupported splitter type is specified.
        """
        if self.splitter_type == "recursive":
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=self.chunk_size,
                chunk_overlap=self.chunk_overlap,
                length_function=len,
                is_separator_regex=False,
            )
        else:
            raise ValueError("Unsupported splitter type.")

        for document in documents:
            for chunk in text_splitter.split_documents([document]):
                yield chunk

    def process_documents(self):
        """
        Unified method for loading documents, splitting them, and returning chunks of data.

        Returns:
            generator: A generator yielding split document chunks.
        """

        for document in self.load_documents():
            yield from self.split_documents([document])

<br>
<br>
<br>
<br>
<br>

## Assigning unique identifiers to chunks

<br>

* after chunking our documents, we need to assign a unique ID to each chunk that we created

* this will prevent duplicates in the database and ensure that, regardless of how many times the user updates it, the database will only contain the files from the specified source directory
    <br>
    
    * this approach will also handle scenarios where a new version of a file replaces an older version

* we will create a function that ensures that each ID is unique by combining several elements:
    <br>
    
    * metadata of the chunk collected from the PDF from which the chunk originates
    * hash of the chunk content, created using a cryptographic hash algorithm such as SHA-256
    * a generated UUID (Universally Unique Identifier), which is a 128-bit number used to uniquely identify information in computer systems
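
* a minimal sketch of how such an ID behaves; the helper names `make_chunk_id` and `stable_prefix` are hypothetical and only illustrate the `source:page:hash:uuid` layout described above

```python
import hashlib
import uuid


def make_chunk_id(source, page, content):
    """Build an ID of the form source:page:content_hash:uuid."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"{source}:{page}:{content_hash}:{uuid.uuid4()}"


def stable_prefix(chunk_id):
    """Return the deterministic part of the ID (everything except the UUID)."""
    return ":".join(chunk_id.split(":")[:3])


id_first_run = make_chunk_id("rules.pdf", 0, "Roll both dice.")
id_second_run = make_chunk_id("rules.pdf", 0, "Roll both dice.")

# The full IDs differ because of the UUID, but the prefixes match
print(id_first_run != id_second_run)
print(stable_prefix(id_first_run) == stable_prefix(id_second_run))
```

* the deterministic prefix is what allows unchanged chunks to be recognised on a later update, while the UUID keeps the full ID unique; note that this layout assumes the source path itself contains no `:` character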

<br>

In [None]:
import hashlib
import uuid

<br>

In [None]:
# Create a function that generates a unique ID
# and assigns it to a chunk

def chunk_id(chunk):
    """
    Generate a unique identifier for a chunk based on its source, page, content, and a UUID.

    The page ID is built from the chunk's "source" and "page" metadata. The full chunk ID
    combines this page ID, a SHA-256 hash of the chunk's content, and a randomly generated
    UUID, in the form "source:page:content_hash:uuid".

    The first three parts are deterministic, so the same chunk of the same file always
    produces the same prefix. Only the UUID suffix changes between runs.

    Args:
        chunk (Document): The chunk object containing metadata and content.

    Returns:
        tuple: Contains the newly created chunk ID and the current page ID.
    """
    source = chunk.metadata.get("source")
    page = chunk.metadata.get("page")
    current_page_id = f"{source}:{page}"
    content = chunk.page_content

    content_hash = hashlib.sha256(content.encode('utf-8')).hexdigest()
    unique_suffix = uuid.uuid4()
    full_chunk_id = f"{current_page_id}:{content_hash}:{unique_suffix}"

    print("full_chunk_id: ",full_chunk_id," current_page_id:", current_page_id)

    return full_chunk_id, current_page_id

<br>

In [None]:
def process_chunks(chunks):
    """
    Assign a unique identifier to each chunk and store it in the chunk's metadata.

    This function iterates through the chunks and generates an identifier for each one from
    its metadata, a hash of its content, and a UUID. The identifier is stored under
    metadata["id"], and a copy of the chunk text is stored under metadata["content"].

    Args:
        chunks (iterable): Chunk objects, where each chunk has metadata including "source" and "page",
                           and content accessible via `chunk.page_content`.

    Yields:
        object: The chunk with updated metadata containing a unique identifier.
    """

    for chunk in chunks:
        # chunk_id also returns the page ID, which is not needed here
        chunk_id_str, _ = chunk_id(chunk)
        chunk.metadata["id"] = chunk_id_str
        chunk.metadata["content"] = chunk.page_content
        yield chunk

<br>
<br>
<br>
<br>
<br>

## Creating a vector database

<br>

* a vector database is a specialized type of database designed to store and manage data in vector format

* these databases are optimized for efficiently handling high-dimensional data and are built to scale with large datasets
    <br>
    
    * performance-oriented, designed to facilitate rapid data retrieval
    * the system quickly finds vectors that are most similar to a given query vector by calculating metrics such as the Euclidean distance or cosine similarity

* there are many vector databases to choose from, but I will use **Chroma**
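
* as a small illustration of the two similarity metrics mentioned above, the sketch below computes both on toy 2-dimensional vectors; the vectors and function names are made up for the example, and real embeddings have hundreds of dimensions

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # points the same way as the query, but is longer
nearby_point = [1.0, 1.0]     # close in space, but points in a different direction

for name, vector in [("same_direction", same_direction), ("nearby_point", nearby_point)]:
    print(name, cosine_similarity(query, vector), euclidean_distance(query, vector))
```

* the two metrics can rank results differently: cosine similarity ignores vector length and prefers `same_direction`, while Euclidean distance prefers `nearby_point`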

<br>

### Chroma

<br>


* open-source vector database

* it gives users the tools they need to:
    <br>
    
    * embed documents and queries
    * store embeddings together with their metadata
    * search for embeddings similar to an input query embedding

* how it works:
    <br>
    
    * we select an embedding model and use it to convert our text into vectors
    * we store those vectors in the Chroma database
    * when a user submits a query, we use the same embedding model to turn the query into a vector, and then search for similar vectors inside the database
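
* the embed, store, and search flow above can be sketched with a toy in-memory store; this is not the Chroma API, and `toy_embed` (a bag-of-words counter) is a hypothetical stand-in for a real embedding model

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(text.lower().split())


def sparse_cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (id, vector, text) triples and searches them by similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []

    def add(self, doc_id, text):
        # The same embed function is used for documents and for queries
        self.items.append((doc_id, self.embed(text), text))

    def query(self, text, k=1):
        query_vector = self.embed(text)
        ranked = sorted(self.items, key=lambda item: sparse_cosine(query_vector, item[1]), reverse=True)
        return [(doc_id, doc_text) for doc_id, _, doc_text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add("monopoly-1", "roll both dice and move your token")
store.add("poker-1", "the pot is the money bet by players")

top_match = store.query("which dice do you roll", k=1)
print(top_match)
```

* a real vector database replaces the linear scan in `query` with an index built for fast nearest-neighbour search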
    

<br>

In [None]:
import os
import logging
import json
import random
from langchain_community.embeddings.ollama import OllamaEmbeddings
from tqdm import tqdm
from langchain_community.vectorstores import Chroma

In [None]:
# Chroma is already imported in the previous cell
from langchain_core.documents import Document
from langchain.prompts import ChatPromptTemplate
from langchain_community.llms.ollama import Ollama
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

In [None]:
!apt-get install net-tools # If not already installed

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  net-tools
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 204 kB of archives.
After this operation, 819 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 net-tools amd64 1.60+git20181103.0eebece-1ubuntu5 [204 kB]
Fetched 204 kB in 1s (294 kB/s)
Selecting previously unselected package net-tools.
(Reading database ... 124561 files and directories currently installed.)
Preparing to unpack .../net-tools_1.60+git20181103.0eebece-1ubuntu5_amd64.deb ...
Unpacking net-tools (1.60+git20181103.0eebece-1ubuntu5) ...
Setting up net-tools (1.60+git20181103.0eebece-1ubuntu5) ...
Processing triggers for man-db (2.10.2-1) ...


In [None]:
import subprocess

def get_ollama_port():
  """
  Gets the port number that the Ollama instance is running on.

  Returns:
    The port number (as an integer) or None if it couldn't be determined.
  """
  try:
    # Run netstat and capture the output
    process = subprocess.run(['netstat', '-tulnp'], capture_output=True, text=True)
    output = process.stdout

    # Search for the line containing 'ollama' and extract the port number
    for line in output.split('\n'):
      print(line)
      if 'ollama' in line:
        parts = line.split()
        # The port is usually the last part of the 'Local Address' field
        local_address = parts[3]
        port = int(local_address.split(':')[-1])
        return port
  except Exception as e:
    print(f"Error finding Ollama port: {e}")
    return None

In [None]:
print(get_ollama_port())

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 172.28.0.12:6000        0.0.0.0:*               LISTEN      12/kernel_manager_p 
tcp        0      0 127.0.0.1:39501         0.0.0.0:*               LISTEN      1103/python3        
tcp        0      0 127.0.0.1:40885         0.0.0.0:*               LISTEN      1138/python3        
tcp        0      0 127.0.0.11:41935        0.0.0.0:*               LISTEN      -                   
tcp        0      0 172.28.0.12:9000        0.0.0.0:*               LISTEN      90/python3          
tcp        0      0 127.0.0.1:58823         0.0.0.0:*               LISTEN      1138/python3        
tcp        0      0 127.0.0.1:3453          0.0.0.0:*               LISTEN      69/python3          
tcp6       0      0 :::8080                 :::*                    LISTEN      7/node              
udp        0      0 127.0.0.11:57295        0.0.

<br>

In [None]:
# Prepare a model that will convert text into embeddings

def get_embeddings(embedding_model="nomic-embed-text"):
    """
    Initialize and return an instance of the OllamaEmbeddings class.

    Args:
        embedding_model (str): Name of the Ollama embedding model to use
            (default is "nomic-embed-text").

    Returns:
        OllamaEmbeddings: An embeddings client that sends requests to the
        Ollama server at `ollama_base_url`.
    """
    # The Ollama server must already be running and reachable at this URL
    ollama_base_url = 'http://localhost:11434'  # Update with your Ollama base URL
    return OllamaEmbeddings(base_url=ollama_base_url, model=embedding_model)

<br>

In [None]:
# Get the directory of the current notebook
##current_dir = os.getcwd()

# Define the path to the JSON file
##json_file_path = os.path.join(current_dir, 'config.json')

# Load the JSON file
##with open(json_file_path, 'r') as config_file:
##    config = json.load(config_file)

# Configure logging from JSON file
##logging.basicConfig(level=getattr(logging, config["logging"]["level"]),
                    #format=config["logging"]["format"])
##logger = logging.getLogger(__name__)

In [None]:
# Create a database class

class ChromaDatabase:
    """
    A class to interact with the Chroma database for storing and updating document chunks.

    Attributes:
        database_path (str): The path to the Chroma database.
        db (Chroma): An instance of the Chroma database with embeddings initialized.

    Methods:
        update_database():
            Processes documents and updates the database with new chunks, removing duplicates and
            persisting changes.
    """

    def __init__(self, database_path):
        """
        Initialize the ChromaDatabase with a specified database path.

        Args:
            database_path (str): The file path to the Chroma database.
        """
        self.database_path = database_path  # Set the database path
        # Initialize the Chroma database with the given path and embeddings function
        self.db = Chroma(persist_directory=self.database_path, embedding_function=get_embeddings())
        print ("INIT COMPLETED")

    def update_database(self, data_path):
        """
        Processes documents and updates the ChromaDatabase with the resulting chunks.
        Removes old chunks if it runs into new version of a particular file.
        Then, it verifies the content of the database by printing the number of documents
        after the update.

        Args:
            data_path (str): The path to the folder containing the PDF files.
        """

        print ("UPDATE DATABASE - STEP 1")
        # Process documents to get chunks
        document_processor = DocumentProcessor(data_path=data_path)
        chunks = document_processor.process_documents()

        # Process chunks to generate unique IDs
        chunks_with_ids = process_chunks(chunks)

        print ("UPDATE DATABASE - STEP 2")
        # Retrieve existing items in the database and extract their IDs
        existing_items = self.db.get(include=[])  # IDs are always included by default
        existing_ids = set(':'.join(item.split(':')[0:3]) for item in existing_items["ids"])
        print("Number of existing documents in database: ",len(existing_ids))

        new_chunks = []
        prefixes_to_clear = set()

        print ("UPDATE DATABASE - STEP 3")
        # Identify new chunks by checking if their IDs (without UUID) are not in the existing IDs
        for chunk in chunks_with_ids:
            id_without_uuid = ':'.join(chunk.metadata["id"].split(':')[0:3])
            if id_without_uuid not in existing_ids:
                new_chunks.append(chunk)
                prefix = chunk.metadata["id"].split(':')[0]
                prefixes_to_clear.add(prefix)

        print ("UPDATE DATABASE - STEP 4")
        # Remove all existing chunks from the database that start with any of the prefixes in prefixes_to_clear
        if prefixes_to_clear:
            chunks_to_remove = [chunk_id for chunk_id in existing_items["ids"] if chunk_id.split(':')[0] in prefixes_to_clear]
            if chunks_to_remove:
                #logger.info(f"Removing {len(chunks_to_remove)} existing documents with matching prefixes")
                self.db.delete(ids=chunks_to_remove)

        print ("UPDATE DATABASE - STEP 5")
        print("total new_chunks: ",len(new_chunks))
        # Add new chunks to database
        if new_chunks:
            #logger.info(f"Adding {len(new_chunks)} new documents")

            new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
            print ("UPDATE DATABASE - STEP 51")
            for chunk, chunk_id in tqdm(zip(new_chunks, new_chunk_ids), total=len(new_chunks), desc="Adding documents"):
                # Add each chunk to the database
                print("\nchunk: ",chunk)
                print ("\nUPDATE DATABASE - STEP 52")
                self.db.add_documents([chunk], ids=[chunk_id])
                print("Added -- chunk_id: ",chunk_id)
                print ("\nUPDATE DATABASE - STEP 53")

            # Persist the changes after all documents are added
            self.db.persist()
        #else:
            #logger.info("No new documents to add")

        print ("UPDATE DATABASE - STEP 6")
        # Verify the database content by printing the number of stored chunks
        existing_items = self.db.get(include=[])
        print("Documents in database after update: ", len(existing_items["ids"]))

In [None]:
# Create database instance

db = ChromaDatabase(database_path="CHROMA")

INIT COMPLETED


In [None]:
pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [None]:
import pypdf

In [None]:
# Update database with data

db.update_database("/content/drive/MyDrive/RAG_data")



UPDATE DATABASE - STEP 1
UPDATE DATABASE - STEP 2
Number of existing documents in database:  0
UPDATE DATABASE - STEP 3
Loading documents from  /content/drive/MyDrive/RAG_data




Loaded files:
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/poker.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
/content/drive/MyDrive/RAG_data/monopoly.pdf
full_chunk_id:  /content/drive/MyDrive/RAG_data/poker.pdf:0:380dcd47e5979aa1ae12b5a0c92cf93c38f904d409d994cd32881f5f2afeb18c:29fcac05-9603-416d-b96a-6577a770e8c1  c

Adding documents:   0%|          | 0/53 [00:00<?, ?it/s]


chunk:  page_content='Learn How To Play Poker Like A Pro 
 
If you’ve ever wanted to learn how to play poker, then this great e-
report will give you the ideal introduction to playing, winning plus lots 
of resources to help you improve your game! 
 
 
Official poker rules 
Basic Poker Terms: 
Hand:  Hand represents to the particular combination of cards held by 
the player.  
 
Play:  A single game, from one shuffle to the next is called a play. 
 
Pot:  The accumulation or pool of money bet by players during the 
game is referred to as pot. The game is a contest for a pot of money, 
which builds in the course of play of each hand. 
 
Hand Tie:  If two players have the same hand then they divide the pot 
between them. When the pool is not exactly divisible then the left over' metadata={'source': '/content/drive/MyDrive/RAG_data/poker.pdf', 'page': 0, 'page_label': '1', 'id': '/content/drive/MyDrive/RAG_data/poker.pdf:0:380dcd47e5979aa1ae12b5a0c92cf93c38f904d409d994cd32881f5f2afeb18c:




ValueError: Error raised by inference endpoint: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7e0926504550>: Failed to establish a new connection: [Errno 111] Connection refused'))
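
* the `Connection refused` error above means that nothing is listening on `localhost:11434`; the cells shown here do not start an Ollama server, so the embedding request cannot be delivered
* a minimal sketch of a reachability check that could be run before updating the database; the function name `is_ollama_reachable` is hypothetical, and how Ollama is installed and started in your environment (for example with `ollama serve` and `ollama pull nomic-embed-text`) is outside this sketch

```python
import urllib.request


def is_ollama_reachable(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error responses
        return False


if is_ollama_reachable():
    print("Ollama server is reachable")
else:
    print("No server is answering at http://localhost:11434; start Ollama first")
```

* this only confirms that a server responds at the URL; it does not check that the embedding model has been downloaded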

<br>
<br>
<br>
<br>
<br>

## Set up the retrieval algorithm

<br>

* a retrieval engine, often also called a retriever, is the part of a RAG system that handles user queries

* its operation can be broken down into a few key steps:
    <br>
    
    * first, the user submits a query for the RAG system to process
    * next, the query is encoded using the same model that was used to create the embeddings stored in the vector database
    * then, the query's vector is compared to the vectors stored in the database to identify the most relevant information, and that information is added to the original query
    * finally, the enriched query is used as a prompt for an LLM

* the steps above describe a naive retriever; modern retrievers implement additional strategies such as:
    <br>
    
    * query rewriting
    * multi-aspect retrieval
    * hybrid retrieval

* in this example, we will focus on hybrid retrieval

<br>
<br>

### Hybrid retrieval

<br>

* searching for data in vector databases is based on the principle that texts with similar meanings will produce similar embeddings when converted using the same embedding model
    <br>
    
    * this is often called performing a semantic search

* however, we can also perform a so-called sparse search, which is akin to a keyword search but uses more advanced algorithms for faster and more accurate word or phrase matching within the data

* although sparse search might seem less effective than semantic search, both methods have their strengths and weaknesses
    <br>
    
    * semantic search excels at finding information with similar meanings, even if synonyms or different spellings are used in the prompt
    * sparse search is sometimes more effective because it looks for the exact phrase as entered

* to combine the two searches, we will build an **ensemble retriever**
    <br>
    
    * a retriever that will combine the results of our semantic search together with the results of performing a so-called **BM25** sparse search

* combining the two is achieved via a technique known as **Reciprocal Rank Fusion (RRF)**

<br>

**BM25**

* also known as Okapi BM25
* builds upon the traditional TF-IDF algorithm
* the scoring equation is fairly involved, so we won't go through it in detail here
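
* for reference, a minimal sketch of the standard Okapi BM25 scoring idea: each query term contributes its inverse document frequency, scaled by a term frequency that saturates and is normalised by document length; the defaults `k1=1.5` and `b=0.75` and the whitespace tokenisation are assumptions made for this example, not necessarily what the retriever used later applies

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    if not tokenized_docs:
        return []

    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Number of documents that contain each term
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalised
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "roll two dice".split(),
    "poker hand pot".split(),
    "dice dice doubles roll".split(),
]
scores = bm25_scores("dice doubles".split(), docs)
print(scores)
```

* unlike plain TF-IDF, repeating a term many times gives diminishing returns, and the `b` parameter controls how strongly document length is taken into account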

**Reciprocal Rank Fusion (RRF)**

* method used to combine rankings from multiple retrieval models into a single, unified ranking
* adds extra steps to the information retrieval stage of the RAG system:
    <br>
    
    * the user submits a query
    * the query is processed by multiple retrieval models
    * each model generates a ranking of relevant documents
    * these rankings are combined using the RRF formula
    * a single, unified ranking is produced based on the RRF scores
    * the generative model uses the top-ranked documents to formulate the final answer
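
* the fusion step can be sketched as follows: each document receives `weight / (k + rank)` from every ranking it appears in, and the sums decide the final order; the constant `k=60` is a commonly used default and is an assumption here, as are the function name and the example chunk IDs

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("weights and rankings must have the same length")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks start at 1, so the top document contributes weight / (k + 1)
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["chunk_b", "chunk_a", "chunk_c"]
semantic_ranking = ["chunk_a", "chunk_d", "chunk_b"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking], weights=[0.4, 0.6])
print(fused)
```

* documents that appear near the top of several rankings end up first, and only ranks are used, so the raw scores of the individual retrievers do not need to be comparable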

<br>
<br>

<br>

In [None]:
PROMPT_TEMPLATE = """
Answer the question based only on the following context:

{context}

---

Answer the question based on the above context: {question}
"""

In [None]:
query_text = "Which dice are used when determining if you rolled doubles in Monopoly?"
database_path = "CHROMA/"

In [None]:
# Run an enriched query using a combination of semantic and BM25 search.
# The run_query(query_text, database_path) wrapper is commented out, so the
# steps below run at the top level, one cell at a time.
#
# Inputs:
#     database_path (str): The path to the database.
#     query_text (str): The query text to search for in the database.
# Result:
#     response_text (str): The generated response text from the model.

# Initialize the embedding function
embedding_function = get_embeddings()

# Initialize the Chroma database with the embeddings function
db = Chroma(persist_directory=database_path, embedding_function=embedding_function)

In [None]:
# Initialize the semantic retriever
semantic_retriever = db.as_retriever(search_kwargs={"k": 10})

In [None]:
# Perform the semantic search to get relevant documents
# (invoke() replaces the deprecated get_relevant_documents())
retrieved_docs = semantic_retriever.invoke(query_text)


ValueError: Error raised by inference endpoint: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7e092483cb90>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [None]:
# Extract the sources from the retrieved documents
sources = {doc.metadata['source'] for doc in retrieved_docs}

# Construct the where clause using the $in operator
where_clause = {"source": {"$in": list(sources)}}
bm25_collection = db.get(where=where_clause, include=["documents"])
bm25_chunks = [Document(page_content=text) for text in bm25_collection['documents']]

# Initialize the keyword retriever and keep its top 3 results
# (BM25Retriever depends on the rank_bm25 package: pip install rank_bm25)
bm25_retriever = BM25Retriever.from_documents(bm25_chunks)
bm25_retriever.k = 3

# Combine the results of semantic retrieval and BM25 search
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, semantic_retriever], weights=[0.4, 0.6])

# Retrieve the top-k relevant documents for the query text
results = ensemble_retriever.invoke(query_text)


# Create the enriched prompt by joining the content of the retrieved documents
context_text = "\n\n---\n\n".join([doc.page_content for doc in results])
# Initialize the prompt template and format it with the context and query
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query_text)

NameError: name 'retrieved_docs' is not defined

In [None]:
# Initialize the model and invoke it with the formatted prompt
model = Ollama(base_url='http://localhost:11434',model="llama3.1:8b")
response_text = model.invoke(prompt)

port = get_ollama_port()
if port:
  print(f"Ollama is running on port: {port}")
else:
  print("Could not determine Ollama's port.")

# Extract the IDs of the chunks that were used as context
# An ID can be None for chunks supplied by the BM25 retriever, because those
# Documents were rebuilt from raw text without their metadata
response_sources = [doc.metadata.get("id") for doc in results if doc.metadata.get("id") is not None]

#return response_text, response_sources

NameError: name 'prompt' is not defined

In [None]:
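# Collect the results of the query
# (run_query is commented out above, so reuse the values computed step by step)

answer, sources = response_text, response_sources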

  retrieved_docs = semantic_retriever.get_relevant_documents(query_text)


ValueError: Error raised by inference endpoint: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/embeddings (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x79306d6dda10>: Failed to establish a new connection: [Errno 111] Connection refused'))

In [None]:
# Display the query answer

answer

In [None]:
# Display the sources used to generate the answer

sources

<br>
<br>
<br>
