# Semantic Search & RAG with LlamaIndex & HyDE prompt
## ABB #5 - Session 3

Code authored by: Shaw Talebi

Modified by: Barry Mukund

### imports

In [1]:
from IPython.display import display, Markdown
from bs4 import BeautifulSoup
from typing import List, Dict, Any

In [2]:
# Workaround for ImportError: cannot import name 'Mapping' from 'collections'
# This error is often caused by dependencies that expect 'Mapping' in 'collections' (pre-3.10),
# but in Python 3.10+ it is in 'collections.abc'. If you encounter this error, 
# ensure all dependencies are up to date. If not possible, patch before import.

import collections
import collections.abc
import sys

# Patch 'collections' to have 'Mapping' if missing (for legacy dependencies)
if not hasattr(collections, 'Mapping'):
    collections.Mapping = collections.abc.Mapping

from llama_index.core.indices.vector_store import VectorStoreIndex
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.settings import Settings
from llama_index.core.schema import TextNode
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

In [3]:
from dotenv import load_dotenv
import os

# load the OpenAI secret key from the .env file
load_dotenv()
my_sk = os.getenv("OPENAI_API_KEY")

#### Setup embedding model

In [5]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# changing embedding model
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    show_progress_bar=False
)

#### Use Ollama as the LLM, since OpenAI requests were rate-limited (IP blocked)

In [6]:
# changing the global LLM
from llama_index.llms.ollama import Ollama

Settings.llm = Ollama(model="llama3.2:latest")
print(f"LLM: {Settings.llm.model}")

LLM: llama3.2:latest


### 1) chunk articles

In [7]:
articles_file_path='/Users/barry/ai_tut/cohort/AI-Builders-Bootcamp-5/session-3/articles'

In [9]:
# List every file in the articles directory (non-HTML files are skipped below)
filename_list = [f"{articles_file_path}/{f}" for f in os.listdir(articles_file_path)]

chunk_list = []
for filename in filename_list:
    # only process .html files
    if filename.lower().endswith('.html'):
        # read html file
        with open(filename, 'r', encoding='utf-8') as file:
            html_content = file.read()
    
        # Parse HTML
        soup = BeautifulSoup(html_content, 'html.parser')
        
        # Get article title
        article_title = soup.find('title').get_text().strip() if soup.find('title') else "Untitled"
        
        # Initialize variables
        article_content = []
        current_section = "Main"  # Default section if no headers found
        
        # Find all headers and text content
        content_elements = soup.find_all(['h1', 'h2', 'h3', 'p', 'ul', 'ol'])
    
        # iterate through elements and extract text with metadata
        for element in content_elements:
            if element.name in ['h1', 'h2', 'h3']:
                current_section = element.get_text().strip()
            elif element.name in ['p', 'ul', 'ol']:
                text = element.get_text().strip()
                # Only add non-empty content that's at least 30 characters long
                if text and len(text) >= 30:
                    article_content.append({
                        'article_title': article_title,
                        'section': current_section,
                        'text': text
                    })
    
        # add article content to list
        chunk_list.extend(article_content)

#### Create LlamaIndex nodes from the chunked text above

In [10]:
# wrap each chunk in a LlamaIndex TextNode, keeping article title and section as metadata
node_list = []
for i, chunk in enumerate(chunk_list):
    node_list.append(
        TextNode(
            id_=str(i), 
            text=chunk["text"], 
            metadata = {
                "article":chunk["article_title"],
                "section":chunk["section"]
            }
        )
    )

print(len(node_list))

778


### 2) create index

In [11]:
index = VectorStoreIndex(node_list)

print(f"Embedding Model: {index._embed_model.model_name}")
print(f"Index Size: {len(index.vector_store.data.embedding_dict)}")
# single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Embedding Size: {len(index.vector_store.data.embedding_dict['0'])}")

Embedding Model: BAAI/bge-small-en-v1.5
Index Size: 778
Embedding Size: 384


#### Setup basic functions for HyDE prompting
To make RAG retrieval more accurate, we pass the user prompt through an LLM (Ollama in this case) to generate a hypothetical document that elaborates on the user query. This document is then embedded, and the articles in the vector store closest to it are retrieved.

I compared plain retrieval based solely on the prompt with retrieval based on the hypothetical document and found that the latter gave higher similarity scores: for the example query below, the top-10 scores were roughly 0.79 to 0.81 for plain retrieval versus 0.84 to 0.88 for HyDE-based retrieval.

This concept is discussed here: https://aclanthology.org/2023.acl-long.99/

Python notebook: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
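
As a rough intuition for why a hypothetical document can land closer to the stored chunks than the raw question does, the toy sketch below compares both against the same passage. It uses a bag-of-words cosine similarity as a stand-in for a real embedding model, and the passage, question, and hypothetical answer are made-up illustrations, not outputs of this notebook.

```python
import math
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative inputs (assumptions for this sketch)
passage = "Fine-tuning adapts a pre-trained model to a particular use case through additional training."
question = "When do I perform fine-tuning?"
hypothetical_answer = (
    "You perform fine-tuning when a pre-trained model needs additional "
    "training to fit a particular use case."
)

score_question = cosine(bag_of_words(question), bag_of_words(passage))
score_hyde = cosine(bag_of_words(hypothetical_answer), bag_of_words(passage))

print(f"question vs passage:            {score_question:.3f}")
print(f"hypothetical answer vs passage: {score_hyde:.3f}")
```

The hypothetical answer shares far more vocabulary with the passage than the six-word question does, so its score is higher. A dense embedding model captures meaning rather than exact words, but the same effect is what HyDE relies on.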


In [12]:
from langchain_core.prompts import PromptTemplate
from langchain_ollama import OllamaLLM
from typing import Callable

def create_hyde_prompt(question: str) -> str:
    """
    Generates a HyDE-style prompt for a given question using LangChain's PromptTemplate.
    Pure function: only formats a string, no LLM call.
    """
    template = (
        "You are given a user question. Carefully analyze its intent and semantic meaning. "
        "Generate a detailed, plausible answer that directly addresses the question, "
        "using relevant terminology and context. This hypothetical answer should be as informative and specific as possible, "
        "to maximize the chance of retrieving documents that truly match the user's information need.\n"
        "Question: {question}\n"
        "Hypothetical Answer:"
    )
    prompt = PromptTemplate(
        input_variables=["question"],
        template=template
    )
    return prompt.format(question=question)

def get_ollama_llm(model: str = "llama3.2:latest") -> Callable[[str], str]:
    """
    Factory function to create an OllamaLLM instance with the given model.
    Returns a function that takes a prompt and returns the LLM's response.
    """
    llm = OllamaLLM(model=model)
    def invoke(prompt: str) -> str:
        # Use the new .invoke method as per deprecation warning
        return llm.invoke(prompt)
    return invoke

def generate_hypothetical_document(question: str, model: str = "llama3.2:latest") -> str:
    """
    Uses OllamaLLM to generate a hypothetical document for the given question.
    Not a pure function: it calls the local Ollama model, so output can vary between runs.
    """
    prompt = create_hyde_prompt(question)
    #print(prompt)
    ollama_invoke = get_ollama_llm(model)
    return ollama_invoke(prompt)

### 3) RAG retrieval using semantic search (embeddings)

In [13]:
# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

#### RAG retrieval with HyDE: generate a hypothetical document, embed it, and retrieve similar chunks from the vector store

In [14]:
def retrieve_with_hyde(
    query: str,
    retriever,
    hyde_llm: str = "llama3.2:latest"
) -> List[Any]:
    """
    Given a query, generate a hypothetical document using HyDE and retrieve relevant documents using the retriever.
    Returns the retrieval results.
    """
    # Step 1: Generate hypothetical document
    hypothetical_doc = generate_hypothetical_document(query, model=hyde_llm)
    #print("Hypothetical Document Generated:\n", hypothetical_doc)

    # Step 2: Retrieve relevant documents using the hypothetical document as the query
    results = retriever.retrieve(hypothetical_doc)
    return results

In [15]:
def display_retrieved_results(results) -> None:
    """
    Display retrieved results in markdown format.
    """
    if not results:
        print("No results retrieved.")
        return
    # show the top hit in LlamaIndex's default format
    print(results[0])
    # format results in markdown
    results_markdown = ""
    for i, result in enumerate(results, start=1):
        # single quotes inside the f-strings keep this valid on Python versions before 3.12
        results_markdown += f"{i}. **Article title:** {result.metadata['article']}  \n"
        results_markdown += f"   **Section:** {result.metadata['section']}  \n"
        results_markdown += f"   **Snippet:** {result.text} \n\n"
        results_markdown += f"   **Score:** {result.score} \n\n"
    display(Markdown(results_markdown))

#### Non-HyDE RAG retrieval

In [16]:
results = retriever.retrieve("When do I perform fine-tuning?")
display_retrieved_results(results)

Node ID: 155
Text: This is not to say that fine-tuning is useless. A central
benefit of fine-tuning an AI assistant is lowering inference costs
[3].
Score:  0.811



1. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When do I Fine-tune?  
   **Snippet:** This is not to say that fine-tuning is useless. A central benefit of fine-tuning an AI assistant is lowering inference costs [3]. 

   **Score:** 0.8114657628676825 

2. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When NOT to Fine-tune  
   **Snippet:** The effectiveness of any approach will depend on the details of the use case. For example, fine-tuning is less effective than retrieval augmented generation (RAG) to provide LLMs with specialized knowledge [1]. 

   **Score:** 0.8002938091806874 

3. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** How to Prepare Data for Fine-tuning?  
   **Snippet:** For example, if I wanted to fine-tune an LLM to respond to viewer questions on YouTube, I would need to gather a set of comments with questions and my associated responses. For a concrete example of this, check out the code walk-through on YouTube. 

   **Score:** 0.7996616011957226 

4. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** When do I Fine-tune?  
   **Snippet:** Fine-tuning, on the other hand, can compress prompt sizes by directly training the model on examples. Shorter prompts mean fewer tokens at inference, leading to lower compute costs and faster model responses [3]. For instance, after fine-tuning, the above prompt could be compressed to the following. 

   **Score:** 0.7995040812236383 

5. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** RAG vs Fine-tuning?  
   **Snippet:** We’ve already mentioned situations where RAG and fine-tuning perform well. However, since this is such a common question, it’s worth reemphasizing when each approach works best. 

   **Score:** 0.7930143949465129 

6. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** 3 Ways to Fine-tune  
   **Snippet:** The next, and perhaps most popular, way to fine-tune a model is via supervised learning. This involves training a model on input-output pairs for a particular task. An example is instruction tuning, which aims to improve model performance in answering questions or responding to user prompts [1,3]. 

   **Score:** 0.7919754263499469 

7. **Article title:** How to Improve LLMs with RAG  
   **Section:** Why we care  
   **Snippet:** Previous articles in this series discussed fine-tuning, which adapts an existing model for a particular use case. While this is an alternative way to endow an LLM with specialized knowledge, empirically, fine-tuning seems to be less effective than RAG at doing this [1]. 

   **Score:** 0.7899394659656693 

8. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** What is Fine-tuning?  
   **Snippet:** Fine-tuning is taking a pre-trained model and training at least one internal model parameter (i.e. weights). In the context of LLMs, what this typically accomplishes is transforming a general-purpose base model (e.g. GPT-3) into a specialized model for a particular use case (e.g. ChatGPT) [1]. 

   **Score:** 0.7895567848964524 

9. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** What’s Next?  
   **Snippet:** Here, I summarized the most common fine-tuning questions I’ve received over the past 12 months. While fine-tuning is not a panacea for all LLM use cases, it has key benefits. 

   **Score:** 0.7862102117718668 

10. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** What is Fine-tuning?  
   **Snippet:** I like to define fine-tuning as taking an existing (pre-trained) model and training at least 1 model parameter to adapt it to a particular use case. 

   **Score:** 0.785435251961344 



#### HyDE-based document retrieval

In [17]:
results = retrieve_with_hyde("When do I perform fine-tuning?", retriever)
display_retrieved_results(results)

Node ID: 1
Text: Fine-tuning involves adapting a pre-trained model to a
particular use case through additional training.
Score:  0.880



1. **Article title:** Fine-Tuning BERT for Text Classification  
   **Section:** Fine-tuning  
   **Snippet:** Fine-tuning involves adapting a pre-trained model to a particular use case through additional training. 

   **Score:** 0.8800494856674508 

2. **Article title:** LLM Fine-tuning — FAQs  
   **Section:** Advanced Fine-tuning  
   **Snippet:** Another way we can fine-tune language models is for classification tasks, such as classifying support ticket tiers, detecting spam emails, or determining the sentiment of a customer review. A classic fine-tuning approach for this is called transfer learning, where we replace the head of a language model to perform a new classification task. 

   **Score:** 0.8694429632303379 

3. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** What is Fine-tuning?  
   **Snippet:** Fine-tuning is taking a pre-trained model and training at least one internal model parameter (i.e. weights). In the context of LLMs, what this typically accomplishes is transforming a general-purpose base model (e.g. GPT-3) into a specialized model for a particular use case (e.g. ChatGPT) [1]. 

   **Score:** 0.8673302403269113 

4. **Article title:** Fine-Tuning BERT for Text Classification  
   **Section:** Conclusion  
   **Snippet:** Fine-tuning pre-trained models is a powerful paradigm for developing better models at a lower cost than training them from scratch. Here, we saw how to do this with BERT using the Hugging Face Transformers library. 

   **Score:** 0.8644245090324031 

5. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** 3 Ways to Fine-tune  
   **Snippet:** The next, and perhaps most popular, way to fine-tune a model is via supervised learning. This involves training a model on input-output pairs for a particular task. An example is instruction tuning, which aims to improve model performance in answering questions or responding to user prompts [1,3]. 

   **Score:** 0.8635334401448006 

6. **Article title:** Fine-Tuning BERT for Text Classification  
   **Section:** Fine-tuning  
   **Snippet:** Pre-trained models are developed via unsupervised learning, which precludes the need for large-scale labeled datasets. Fine-tuned models can then exploit pre-trained model representations to significantly reduce training costs and improve model performance compared to training from scratch [1]. 

   **Score:** 0.8591626484758503 

7. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** What is Fine-tuning?  
   **Snippet:** The key upside of this approach is that models can achieve better performance while requiring (far) fewer manually labeled examples compared to models that solely rely on supervised training. 

   **Score:** 0.8545674847194933 

8. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** Conclusions  
   **Snippet:** While fine-tuning an existing model requires more computational resources and technical expertise than using one out-of-the-box, (smaller) fine-tuned models can outperform (larger) pre-trained base models for a particular use case, even when employing clever prompt engineering strategies. Furthermore, with all the open-source LLM resources available, it’s never been easier to fine-tune a model for a custom application. 

   **Score:** 0.8485122703198279 

9. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** Supervised Fine-tuning Steps (High-level)  
   **Snippet:** Choose fine-tuning task (e.g. summarization, question answering, text classification)Prepare training dataset i.e. create (100–10k) input-output pairs and preprocess data (i.e. tokenize, truncate, and pad text).Choose a base model (experiment with different models and choose one that performs best on the desired task).Fine-tune model via supervised learningEvaluate model performance 

   **Score:** 0.845559047908106 

10. **Article title:** Fine-Tuning Large Language Models (LLMs)  
   **Section:** 3 Ways to Fine-tune  
   **Snippet:** Generate high-quality prompt-response pairs and fine-tune a pre-trained model using supervised learning. (~13k training prompts) Note: One can (alternatively) skip to step 2 with the pre-trained model [3].Use the fine-tuned model to generate completions and have human-labelers rank responses based on their preferences. Use these preferences to train the reward model. (~33k training prompts)Use the reward model and an RL algorithm (e.g. PPO) to fine-tune the model further. (~31k training prompts) 

   **Score:** 0.8441634112610668 
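
To put the two runs side by side, the sketch below summarizes the scores from the two result lists above. The values are copied by hand from the printed output and rounded to four decimals; the `summarize` helper is a name introduced here for illustration.

```python
from statistics import mean

# Scores copied from the two result lists above, rounded to 4 decimals
plain_scores = [0.8115, 0.8003, 0.7997, 0.7995, 0.7930,
                0.7920, 0.7899, 0.7896, 0.7862, 0.7854]
hyde_scores = [0.8800, 0.8694, 0.8673, 0.8644, 0.8635,
               0.8592, 0.8546, 0.8485, 0.8456, 0.8442]

def summarize(label: str, scores: list) -> None:
    """Print the mean, minimum, and maximum of a list of similarity scores."""
    print(f"{label:<6} mean={mean(scores):.3f}  min={min(scores):.3f}  max={max(scores):.3f}")

summarize("plain", plain_scores)
summarize("HyDE", hyde_scores)
```

A higher score means the query embedding sits closer to the chunk embedding, which is not the same as the chunk answering the question better, so it is worth reading the snippets as well as comparing the numbers.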



### 4) RAG pipeline: putting the flow together
1. Set up how the output is generated using response_synthesizer
2. Build a pipeline from the retriever, the response synthesizer, and a similarity cut-off
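
The similarity cut-off step can be pictured as a simple filter over scored chunks. The sketch below is a plain-Python illustration of that idea, not LlamaIndex's implementation; `ScoredChunk` and `apply_similarity_cutoff` are hypothetical names, and the example scores are made up.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredChunk:
    """Hypothetical stand-in for a retrieved node and its similarity score."""
    text: str
    score: float

def apply_similarity_cutoff(chunks: List[ScoredChunk], cutoff: float) -> List[ScoredChunk]:
    """Keep only the chunks whose score is at least the cutoff."""
    return [chunk for chunk in chunks if chunk.score >= cutoff]

# Made-up retrieval results for illustration
retrieved = [
    ScoredChunk("chunk A", 0.81),
    ScoredChunk("chunk B", 0.72),
    ScoredChunk("chunk C", 0.65),
]

kept = apply_similarity_cutoff(retrieved, cutoff=0.7)
for chunk in kept:
    print(chunk.text, chunk.score)
```

Only the chunks that pass the filter are handed to the response synthesizer, so a cut-off that is too high can leave the LLM with no context at all.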

In [18]:
# configure response synthesizer
response_synthesizer = get_response_synthesizer()

#### RAG retriever flow

In [19]:
# assemble query engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)

#### Now query using the pipeline

In [20]:
response = query_engine.query("When do I perform fine-tuning?")
print(response)

Fine-tuning typically occurs when you want to take an existing (pre-trained) model and train at least one internal model parameter to adapt it to a particular use case. This process transforms a general-purpose base model into a specialized model for that specific purpose, resulting in compressed prompt sizes and lower inference costs.


#### HyDE-based retriever pipeline
To plug HyDE retrieval into the same pipeline, we subclass BaseRetriever and override _retrieve(), which the query engine calls to fetch the retrieved chunks.

In [21]:
# assemble query engine with hyde
from typing import Any, List
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import QueryBundle, NodeWithScore

class HydeRetriever(BaseRetriever):
    def __init__(self, retriever_func, base_retriever):
        # initialize BaseRetriever's own state (e.g. callback manager) first
        super().__init__()
        self._retriever_func = retriever_func
        self._base_retriever = base_retriever

    def _retrieve(self, query_bundle: QueryBundle, **kwargs: Any) -> List[NodeWithScore]:
        # delegate to the provided function, which should return List[NodeWithScore]
        return self._retriever_func(query_bundle.query_str, self._base_retriever)

# Wrap retrieve_with_hyde so the query engine can call it through .retrieve()
hyde_retriever = HydeRetriever(retrieve_with_hyde, retriever)



#### Now query using the HyDE-based retriever

In [22]:
query_engine_hyde = RetrieverQueryEngine(
    retriever=hyde_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],
)
response = query_engine_hyde.query("When do I perform fine-tuning?")
print(response)

Fine-tuning typically involves adapting a pre-trained model to a particular use case through additional training. This is often done when you want to transform a general-purpose base model into a specialized model for a specific task or application. You may need to perform fine-tuning in situations where you require better performance on a particular task while requiring fewer manually labeled examples than traditional supervised training methods.


#### Simple RAG query without a custom pipeline

In [23]:
# simpler way to make query engine
query_engine = index.as_query_engine()
response = query_engine.query("When do I perform fine-tuning?")
print(response)

You typically perform fine-tuning when you need to adapt a pre-trained AI assistant to a specific task or domain where its existing capabilities are not sufficient. This can help improve the model's performance and efficiency. Fine-tuning is often used in situations where the goal is to optimize the model for a particular use case, such as lowering inference costs.
