# Graph-Based Hybrid RAG: Graph-Enhanced Retrieval-Augmented Generation

In this notebook, I implement a hybrid (mixed) RAG system that combines the strengths of two approaches:

**A) Graph RAG - a technique that enhances traditional RAG systems by organizing knowledge as a connected graph rather than a flat collection of documents.**

**B) Semantic vector search - context retrieval based on traditional vector-search RAG.**

## Key Benefits of Graph RAG

- Preserves relationships between pieces of information
- Decomposes knowledge into structure (order of words) and signal (features), which is what makes graphs so powerful
- Enables traversal through connected concepts to find relevant context
- Improves handling of complex, multi-part queries
- Provides better explainability through visualized knowledge paths
- Lets the system navigate related concepts and retrieve more contextually relevant information than standard vector-similarity approaches

## Setting Up the Environment
We begin by importing necessary libraries.

In [None]:
import os
import numpy as np
import json
import fitz  # PyMuPDF
from openai import OpenAI
from typing import List, Dict, Tuple, Any
import networkx as nx
import matplotlib.pyplot as plt
import heapq
from collections import defaultdict
import re
from PIL import Image
import io
from sklearn.metrics.pairwise import cosine_similarity

## Setting Up the OpenAI API Client
We initialize the OpenAI client to generate embeddings and responses.

- First, obtain an API key from https://platform.openai.com/ and make it available as the `OPENAI_API_KEY` environment variable

In [None]:
# Initialize the OpenAI client with the base URL and API key

client = OpenAI(
    base_url="https://api.openai.com/v1/",
    api_key=os.getenv("OPENAI_API_KEY")  # Retrieve the API key from environment variables
)

## Document Processing Functions

### Content extraction pipeline

- Page-wise content extraction

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Extract text content from a PDF file.
    
    Args:
        pdf_path (str): Path to the PDF file
        
    Returns:
        str: Extracted text content
    """
    print(f"Extracting text from {pdf_path}...")  # Print the path of the PDF being processed
    pdf_document = fitz.open(pdf_path)  # Open the PDF file using PyMuPDF
    text = ""  # Initialize an empty string to store the extracted text
    
    # Iterate through each page in the PDF
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]  # Get the page object
        text += page.get_text()  # Extract text from the page and append to the text string
    
    pdf_document.close()  # Close the document to release the file handle
    return text  # Return the extracted text content

### Text cleaning

- Basic cleaning of text

In [None]:
def clean_text(text):
    
    """
    Clean text by removing extra whitespace and special characters.
    
    Args:
        text (str): Input text
        
    Returns:
        str: Cleaned text
    """
    
    # Replace multiple whitespace characters (including newlines and tabs) with a single space
    text = re.sub(r'\s+', ' ', text)
    
    # Replace literal escape sequences (a backslash followed by "t" or "n") that some extraction tools leave behind
    text = text.replace('\\t', ' ')
    text = text.replace('\\n', ' ')
    
    # Collapse any spaces introduced above and strip leading or trailing whitespace
    text = ' '.join(text.split())
    
    return text
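
To see what the whitespace normalization does in practice, here is a small stand-alone sketch. The helper name `normalize_whitespace` is hypothetical; it only mirrors the regex step of `clean_text` so that it runs without the rest of the notebook.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# A made-up sample with double spaces, a tab, and blank lines
raw = "Graph  RAG\n\tcombines\n\ngraphs   and vectors. "
cleaned = normalize_whitespace(raw)
print(repr(cleaned))
```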

## Chunking the Extracted Text
Once we have the extracted text, we divide it into smaller, overlapping chunks - more manageable pieces to improve retrieval accuracy and reduce computational overhead.

In [None]:
def chunk_text(text, chunk_size=1000, overlap=200):
    """
    Split text into overlapping chunks.
    
    Args:
        text (str): Input text to chunk
        chunk_size (int): Size of each chunk in characters
        overlap (int): Overlap between chunks in characters
        
    Returns:
        List[Dict]: List of chunks with metadata
    """
    # A non-positive step would make range() fail or yield nothing
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []  # Initialize an empty list to store the chunks
    
    # Iterate over the text with a step size of (chunk_size - overlap)
    for i in range(0, len(text), chunk_size - overlap):
        # Extract a raw chunk of text from the current position
        raw_chunk = text[i:i + chunk_size]
        # Clean the chunk (stored under a new name to avoid shadowing the function)
        cleaned = clean_text(raw_chunk)
        # Ensure we don't add empty chunks
        if cleaned:
            # Append the chunk with its metadata to the list
            chunks.append({
                "text": cleaned,  # The cleaned chunk of text
                "index": len(chunks),  # The index of the chunk
                "start_pos": i,  # The starting position of the chunk in the original text
                "end_pos": i + len(raw_chunk)  # The ending position, measured on the uncleaned text
            })
    
    # Print the number of chunks created
    print(f"Created {len(chunks)} text chunks")
    
    return chunks  # Return the list of chunks
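
To make the windowing concrete, the sketch below computes only the character spans that such a chunker would produce. `chunk_spans` is a hypothetical helper, not part of the notebook, and the 2,500-character length is an arbitrary example.

```python
def chunk_spans(text_length: int, chunk_size: int = 1000, overlap: int = 200):
    """Return the (start, end) character span of every chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    return [(start, min(start + chunk_size, text_length))
            for start in range(0, text_length, step)]

spans = chunk_spans(2500)
print(spans)
```

With these settings the final span is a 100-character tail that is already fully covered by the previous chunk. That is harmless for retrieval but does add a near-duplicate chunk, so it may be worth filtering out in larger documents.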

## Creating Embeddings for Text Chunks
Embeddings transform text into numerical vectors, which allow for efficient similarity search.

In [None]:
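def create_embeddings(texts, model="text-embedding-3-small"):
    """
    Create embeddings for the given texts.
    
    Args:
        texts (List[str] or str): Input texts; a single string is treated as a one-item list
        model (str): Embedding model name (must be served by the endpoint configured in `client`)
        
    Returns:
        List[List[float]]: Embedding vectors, one per input text
    """
    # Handle empty input
    if not texts:
        return []

    # Wrap a bare string so it is not sliced character by character below
    if isinstance(texts, str):
        texts = [texts]
        
    # Process in batches to stay within API request limits
    batch_size = 100
    all_embeddings = []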
    
    # Iterate over the input texts in batches
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]  # Get the current batch of texts
        
        # Create embeddings for the current batch
        response = client.embeddings.create(
            model=model,
            input=batch
        )
        
        # Extract embeddings from the response
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)  # Add the batch embeddings to the list
    
    return all_embeddings  # Return all embeddings

## Creating Vector Store

- A simplified, in-memory vector store implementation. A production application would typically need a persistent vector database holding the embeddings and lookup tables of the entire knowledge base, for faster and more efficient retrieval

- The `SimpleVectorStore` object consists of
   - chunk embeddings
   - contents/texts
   - metadata of elements


- The `SimpleVectorStore` object provides a `similarity_search_with_scores` method that finds the items most similar to a query embedding based on **cosine similarity**

In [None]:
class SimpleVectorStore:
    """
    A simple vector store implementation using NumPy.
    """
    def __init__(self):
        self.vectors = []  # List to store embedding vectors
        self.texts = []  # List to store text content
        self.metadata = []  # List to store metadata
    
    def add_item(self, text, embedding, metadata=None):
        """
        Add an item to the vector store.
        
        Args:
            text (str): The text content
            embedding (List[float]): The embedding vector
            metadata (Dict, optional): Additional metadata
        """
        self.vectors.append(np.array(embedding))  # Append the embedding vector
        self.texts.append(text)  # Append the text content
        self.metadata.append(metadata or {})  # Append the metadata (or empty dict if None)
    
    def add_items(self, items, embeddings):
        """
        Add multiple items to the vector store.
        
        Args:
            items (List[Dict]): List of text items
            embeddings (List[List[float]]): List of embedding vectors
        """
        for i, (item, embedding) in enumerate(zip(items, embeddings)):
            self.add_item(
                text=item["text"],  # Extract text from item
                embedding=embedding,  # Use corresponding embedding
                metadata={**item.get("metadata", {}), "index": i}  # Merge item metadata with index
            )
    
    def similarity_search_with_scores(self, query_embedding, k=5):
        """
        Find the most similar items to a query embedding with similarity scores.
        
        Args:
            query_embedding (List[float]): Query embedding vector
            k (int): Number of results to return
            
        Returns:
            List[Dict]: Top k most similar items, each with its text, metadata, and similarity score
        """
        if not self.vectors:
            return []  # Return empty list if no vectors are stored
        
        # Convert the query embedding to a 2D array, as scikit-learn's cosine_similarity expects
        query_vector = np.atleast_2d(query_embedding)
        
        # Calculate similarities using cosine similarity
        similarities = []
        for i, vector in enumerate(self.vectors):
            similarity = cosine_similarity(query_vector, vector.reshape(1, -1))[0][0]  # Compute cosine similarity
            similarities.append((i, similarity))  # Append index and similarity score
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top k results with scores
        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],  # Retrieve text by index
                "metadata": self.metadata[idx],  # Retrieve metadata by index
                "similarity": float(score)  # Add similarity score
            })
        
        return results

    def get_all_documents(self):
        """
        Get all documents in the store.
        
        Returns:
            List[Dict]: All documents
        """
        return [{"text": text, "metadata": meta} for text, meta in zip(self.texts, self.metadata)]  # Combine texts and metadata
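
As a dependency-free illustration of the ranking idea behind the store, the sketch below scores a few made-up 3-dimensional "embeddings" against a query with cosine similarity and keeps the top two. The vectors and names are invented for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made up for illustration)
docs = {
    "graphs": [1.0, 0.0, 0.0],
    "vectors": [0.0, 1.0, 0.0],
    "hybrid": [1.0, 1.0, 0.0],
}
query = [2.0, 0.1, 0.0]

# Rank document names by similarity to the query, highest first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
top_2 = ranked[:2]
print(top_2)
```

Because cosine similarity ignores vector length, the query's larger magnitude does not affect the ranking; only its direction matters.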

## Knowledge Graph Construction

### Extract important concepts as key terms and entities

In [None]:
def extract_concepts(text):
    """
    Extract key concepts from text using OpenAI's API.
    
    Args:
        text (str): Text to extract concepts from
        
    Returns:
        List[str]: List of concepts
    """
    # System message to instruct the model on what to do
    system_message = """Extract key concepts and entities from the provided text.
Return ONLY a list of 5-10 key terms, entities, or concepts that are most important in this text.
Format your response as a JSON object with a single key "concepts" whose value is an array of strings."""

    # Make a request to the OpenAI API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Extract key concepts from:\n\n{text[:3000]}"}  # Limit for API
        ],
        temperature=0.0,
        response_format={"type": "json_object"}
    )
    
    try:
        # Parse concepts from the response
        concepts_json = json.loads(response.choices[0].message.content)
        concepts = concepts_json.get("concepts", [])
        if not concepts and "concepts" not in concepts_json:
            # Try to get any array in the response
            for key, value in concepts_json.items():
                if isinstance(value, list):
                    concepts = value
                    break
        return concepts
    except (json.JSONDecodeError, AttributeError):
        # Fallback if JSON parsing fails
        content = response.choices[0].message.content
        # Try to extract anything that looks like a list
        matches = re.findall(r'\[(.*?)\]', content, re.DOTALL)
        if matches:
            items = re.findall(r'"([^"]*)"', matches[0])
            return items
        return []
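
The parsing logic can be exercised without calling the API. The following is a minimal sketch, assuming the model reply arrives as a plain string; `parse_concepts` is a hypothetical helper that follows the same idea (JSON first, regex fallback) and the sample replies are invented.

```python
import json
import re

def parse_concepts(raw: str) -> list:
    """Pull a list of concept strings out of a model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: take the quoted strings inside the first bracketed span
        match = re.search(r"\[(.*?)\]", raw, re.DOTALL)
        return re.findall(r'"([^"]*)"', match.group(1)) if match else []
    if isinstance(data, list):
        return [str(item) for item in data]
    if isinstance(data, dict):
        # Prefer the "concepts" key, otherwise use the first list-valued field
        for value in [data.get("concepts")] + list(data.values()):
            if isinstance(value, list):
                return [str(item) for item in value]
    return []

print(parse_concepts('{"concepts": ["RAG", "knowledge graph"]}'))
print(parse_concepts('{"key_terms": ["embedding"]}'))
print(parse_concepts('Here you go: ["BFS", "heap"]'))
```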

## Build knowledge graph

- For each pair of chunks that share extracted concepts, the pairwise embedding similarity is calculated
- Knowledge graph construction: each chunk becomes a node, and each edge weight combines the semantic similarity (70%) with the concept-overlap score (30%); only edges with a weight above 0.6 are kept

In [None]:
def build_knowledge_graph(chunks):
    """
    Build a knowledge graph from text chunks.
    
    Args:
        chunks (List[Dict]): List of text chunks with metadata
        
    Returns:
        Tuple[nx.Graph, List[np.ndarray]]: The knowledge graph and chunk embeddings
    """
    print("Building knowledge graph...")
    
    # Create a graph
    graph = nx.Graph()
    
    # Extract chunk texts
    texts = [chunk["text"] for chunk in chunks]
    
    # Create embeddings for all chunks
    print("Creating embeddings for chunks...")
    embeddings = create_embeddings(texts)
    
    # Add nodes to the graph
    print("Adding nodes to the graph...")
    for i, chunk in enumerate(chunks):
        # Extract concepts from the chunk
        print(f"Extracting concepts for chunk {i+1}/{len(chunks)}...")
        concepts = extract_concepts(chunk["text"])
        print(f"Concepts for chunk {i+1}: {concepts}")
        # Add node with attributes
        graph.add_node(i, 
                      text=chunk["text"], 
                      concepts=concepts,
                      embedding=embeddings[i])
    
    # Connect nodes based on shared concepts
    print("Creating edges between nodes...")
    for i in range(len(chunks)):
        node_concepts = set(graph.nodes[i]["concepts"])
        for j in range(i + 1, len(chunks)):
            # Calculate concept overlap
            other_concepts = set(graph.nodes[j]["concepts"])
            shared_concepts = node_concepts.intersection(other_concepts)
            
            # If they share concepts, add an edge
            if shared_concepts:
                # Calculate semantic similarity using embeddings
                similarity = np.dot(embeddings[i], embeddings[j]) / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
                
                # Calculate edge weight based on concept overlap and semantic similarity
                concept_score = len(shared_concepts) / min(len(node_concepts), len(other_concepts))
                edge_weight = 0.7 * similarity + 0.3 * concept_score
                
                # Only add edges with significant relationship
                if edge_weight > 0.6:
                    graph.add_edge(i, j, 
                                  weight=edge_weight,
                                  similarity=similarity,
                                  shared_concepts=list(shared_concepts))
    
    print(f"Knowledge graph built with {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges")
    return graph, embeddings
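
A worked example of the edge-weight rule may help. The sketch below reuses the 0.7 / 0.3 blend and the 0.6 threshold from `build_knowledge_graph`; the function name `edge_weight`, the parameter `sim_share`, the concept sets, and the similarity values are made up for illustration.

```python
def edge_weight(similarity: float, concepts_a: set, concepts_b: set,
                sim_share: float = 0.7, threshold: float = 0.6):
    """Return (weight, keep) for a candidate edge between two chunks."""
    shared = concepts_a & concepts_b
    if not shared:
        return 0.0, False  # no shared concept, so no edge is considered
    # Overlap relative to the smaller concept set
    concept_score = len(shared) / min(len(concepts_a), len(concepts_b))
    weight = sim_share * similarity + (1 - sim_share) * concept_score
    return weight, weight > threshold

a = {"attention", "transformer", "encoder", "decoder"}
b = {"attention", "transformer", "BLEU"}

# Two of three concepts overlap: 0.7 * 0.8 + 0.3 * (2 / 3) = 0.76
weight, keep = edge_weight(0.8, a, b)
print(round(weight, 2), keep)
```

With the same concept sets but an embedding similarity of 0.5, the weight drops to 0.55 and the edge is discarded, which shows that concept overlap alone is not enough to connect two chunks.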

## Graph Traversal and Query Processing

- Identify the starting nodes for traversal based on the similarity between the query embedding and the node embeddings

- Traverse the constructed graph with a best-first variant of breadth-first search (BFS): a priority queue always expands the highest-scoring node next

In [None]:
def traverse_graph(query, graph, embeddings, top_k=5, max_depth=3):
    """
    Traverse the knowledge graph to find relevant information for the query.
    
    Args:
        query (str): The user's question
        graph (nx.Graph): The knowledge graph
        embeddings (List): List of node embeddings
        top_k (int): Number of initial nodes to consider
        max_depth (int): Maximum traversal depth
        
    Returns:
        Tuple[List[Dict], List[int]]: Relevant chunks from graph traversal and the order in which nodes were visited
    """
    print(f"Traversing graph for query: {query}")
    
    # Get the query embedding: pass a one-item list and take the single vector back out
    query_embedding = np.array(create_embeddings([query])[0])
    
    # Calculate cosine similarity between the query and all nodes
    similarities = []
    for i, node_embedding in enumerate(embeddings):
        similarity = np.dot(query_embedding, node_embedding) / (np.linalg.norm(query_embedding) * np.linalg.norm(node_embedding))
        similarities.append((i, similarity))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    # Get top-k most similar nodes as starting points
    starting_nodes = [node for node, _ in similarities[:top_k]]
    print(f"Starting traversal from {len(starting_nodes)} nodes")
    
    # Initialize traversal
    visited = set()  # Set to keep track of visited nodes
    traversal_path = []  # List to store the traversal path
    results = []  # List to store the results
    
    # Use a priority queue for traversal; entries are (-score, node, depth)
    queue = []
    for node, score in similarities[:top_k]:
        # `similarities` is sorted, so take each score from its own pair rather than indexing by node id.
        # Negate the score because heapq is a min-heap; starting nodes sit at depth 0.
        heapq.heappush(queue, (-score, node, 0))
    
    # Traverse the graph using a modified breadth-first search with priority
    while queue and len(results) < (top_k * 3):  # Limit results to top_k * 3
        _, node, depth = heapq.heappop(queue)
        
        if node in visited:
            continue
        
        # Mark as visited
        visited.add(node)
        traversal_path.append(node)
        
        # Add current node's text to results
        results.append({
            "text": graph.nodes[node]["text"],
            "concepts": graph.nodes[node]["concepts"],
            "node_id": node
        })
        
        # Explore neighbors if we haven't reached max depth (hops from a starting node)
        if depth < max_depth:
            neighbors = [(neighbor, graph[node][neighbor]["weight"]) 
                        for neighbor in graph.neighbors(node)
                        if neighbor not in visited]
            
            # Add neighbors to the queue, prioritized by edge weight
            for neighbor, weight in neighbors:
                heapq.heappush(queue, (-weight, neighbor, depth + 1))
    
    print(f"Graph traversal found {len(results)} relevant chunks")
    return results, traversal_path
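
The traversal idea can be tried on a toy graph without embeddings or networkx. This is a minimal sketch under stated assumptions: the graph is a plain adjacency dict with invented weights, `best_first_traversal` is a hypothetical name, and depth is counted as hops from a starting node.

```python
import heapq

def best_first_traversal(adjacency, seeds, max_depth=2, max_results=10):
    """Visit nodes by descending priority, expanding at most max_depth hops from a seed."""
    # Heap entries are (-priority, node, depth); negation turns heapq's min-heap into a max-heap
    heap = [(-score, node, 0) for node, score in seeds.items()]
    heapq.heapify(heap)
    visited, order = set(), []
    while heap and len(order) < max_results:
        _, node, depth = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if depth < max_depth:
            for neighbor, weight in adjacency.get(node, {}).items():
                if neighbor not in visited:
                    heapq.heappush(heap, (-weight, neighbor, depth + 1))
    return order

# Toy undirected graph: node -> {neighbor: edge weight}
adjacency = {
    0: {1: 0.9, 2: 0.7},
    1: {0: 0.9, 3: 0.8},
    2: {0: 0.7},
    3: {1: 0.8, 4: 0.65},
    4: {3: 0.65},
}
seeds = {0: 0.95}  # starting node with its query similarity

order = best_first_traversal(adjacency, seeds, max_depth=2)
print(order)
```

Node 3 is visited before node 2 because its incoming edge (0.8) outranks the edge to node 2 (0.7), even though node 2 is closer to the seed. Node 4 is three hops away, so it is only reached when `max_depth` is raised to 3.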

## Response Generation

- Generate the final answer from the query, using the retrieved results as context.
- gpt-4o is used as the LLM; _Meta’s Llama family models_ are also worth exploring.
- Temperature also plays a crucial role in response generation: it controls the balance between exploration (randomness) and exploitation (determinism). I used a low temperature of 0.2.

In [None]:
def create_context_string(retrieved_docs):
    """
    Combine retrieved documents into a single context string for the query.

    Args:
        retrieved_docs (List[Dict]): List of retrieved documents with text and metadata

    Returns:
        str: Combined context string for the query
    """
    # Join the chunk texts with a visible separator
    
    context_texts = [chunk["text"] for chunk in retrieved_docs]
    combined_context = "\n\n---\n\n".join(context_texts)

    return combined_context


In [None]:
def generate_response(query, combined_context):
    """
    Generate a response using the retrieved context.
    
    Args:
        query (str): The user's question
        combined_context (str): Retrieved context, already joined into a single string
        
    Returns:
        str: Generated response
    """
    
    # Maximum context length in characters (a rough guard to stay within the model's context window)
    max_context = 14000
    
    # Truncate the combined context if it exceeds the maximum length
    if len(combined_context) > max_context:
        combined_context = combined_context[:max_context] + "... [truncated]"
    
    # Define the system message to guide the AI assistant
    system_message = """You are a helpful AI assistant. Answer the user's question based on the provided context.
If the information is not in the context, say so. Refer to specific parts of the context in your answer when possible."""

    # Generate the response using the OpenAI API
    response = client.chat.completions.create(
        model="gpt-4o",  # Specify the model to use
        messages=[
            {"role": "system", "content": system_message},  # System message to guide the assistant
            {"role": "user", "content": f"Context:\n{combined_context}\n\nQuestion: {query}"}  # User message with context and query
        ],
        temperature=0.2  # Set the temperature for response generation
    )
    
    # Return the generated response content
    return response.choices[0].message.content
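
The prompt assembly can be inspected without an API call. The sketch below is an illustration only: `build_messages` is a hypothetical helper, the system prompt is shortened, and `max_chars=12` is deliberately tiny so that the truncation is visible.

```python
def build_messages(query, chunks, max_chars=14000, separator="\n\n---\n\n"):
    """Join chunk texts, cap the context length, and wrap everything in chat messages."""
    context = separator.join(chunks)
    if len(context) > max_chars:
        # Character-based cut: simple, but it may split a word or sentence
        context = context[:max_chars] + "... [truncated]"
    return [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]

messages = build_messages("What is Graph RAG?", ["chunk one", "chunk two"], max_chars=12)
print(messages[1]["content"])
```

A character cap is only a proxy for the model's token limit. Counting tokens with a tokenizer would be more precise, at the cost of an extra dependency.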

## Visualization

- Visualize the knowledge graph and the traversal path

In [None]:
def visualize_graph_trarsveal(graph, traversal_path):
    """
    Visualize the knowledge graph and the traversal path.
    
    Args:
        graph (nx.Graph): The knowledge graph
        traversal_path (List): List of nodes in traversal order
    """
    plt.figure(figsize=(12, 10))  # Set the figure size
    
    # Define node colors, default to light blue
    node_color = ['lightblue'] * graph.number_of_nodes()
    
    # Highlight traversal path nodes in light green
    for node in traversal_path:
        node_color[node] = 'lightgreen'
    
    # Highlight start node in green and end node in red
    if traversal_path:
        node_color[traversal_path[0]] = 'green'
        node_color[traversal_path[-1]] = 'red'
    
    # Create positions for all nodes using spring layout
    pos = nx.spring_layout(graph, k=0.5, iterations=50, seed=42)
    
    # Draw the graph nodes
    nx.draw_networkx_nodes(graph, pos, node_color=node_color, node_size=500, alpha=0.8)
    
    # Draw edges with width proportional to weight
    for u, v, data in graph.edges(data=True):
        weight = data.get('weight', 1.0)
        nx.draw_networkx_edges(graph, pos, edgelist=[(u, v)], width=weight*2, alpha=0.6)
    
    # Connect consecutively visited nodes with red dashed lines (visit order, not necessarily graph edges)
    traversal_edges = [(traversal_path[i], traversal_path[i+1]) 
                      for i in range(len(traversal_path)-1)]
    
    nx.draw_networkx_edges(graph, pos, edgelist=traversal_edges, 
                          width=3, alpha=0.8, edge_color='red', 
                          style='dashed', arrows=True)
    
    # Add labels with the first concept for each node
    labels = {}
    for node in graph.nodes():
        concepts = graph.nodes[node]['concepts']
        label = concepts[0] if concepts else f"Node {node}"
        labels[node] = f"{node}: {label}"
    
    nx.draw_networkx_labels(graph, pos, labels=labels, font_size=8)
    
    plt.title("Knowledge Graph with Traversal Path")  # Set the plot title
    plt.axis('off')  # Turn off the axis
    plt.tight_layout()  # Adjust layout
    plt.show()  # Display the plot

## Complete Hybrid RAG Pipeline

- The complete RAG pipeline supports the following 3 context-generation methods
    - pure vector search (**vector_only_rag**)
    - pure graph-based retrieval (**graph_only_rag**)
    - combined context from both approaches (**mixed_rag**)

In [None]:
def complete_rag_pipeline(pdf_path, query, chunk_size=1000, chunk_overlap=200, top_k=3, method="vector_only_rag"):
    """
    Complete RAG pipeline from document to answer.
    
    Args:
        pdf_path (str): Path to the PDF document
        query (str): The user's question
        chunk_size (int): Size of text chunks
        chunk_overlap (int): Overlap between chunks
        top_k (int): Number of top nodes to consider for traversal
        method (str): Retrieval method to use ("vector_only_rag", "graph_only_rag", or "mixed_rag")
        
    Returns:
        Dict: Results including the answer, relevant chunks, traversal path, and graph
    """
    combined_context = ""  # Initialize combined context string
    # Extract text from the PDF document
    text = extract_text_from_pdf(pdf_path)
    
    # Split the extracted text into overlapping chunks
    chunks = chunk_text(text, chunk_size, chunk_overlap)
    
    if method == "vector_only_rag" or method == "mixed_rag":
        
        embeddings = create_embeddings([chunk["text"] for chunk in chunks])
        # Vector only rag context text
        retrieved_docs_vector = vector_only_rag(query,chunks, embeddings, k=5)
        # Extract text from the retrieved documents
        combined_context = create_context_string(retrieved_docs_vector)
        retrieved_docs = retrieved_docs_vector  # Use vector results as retrieved documents
        traversal_path = []  # No traversal path for vector-only method
        graph = None  # No graph for vector-only method

    if method == "graph_only_rag" or method == "mixed_rag":
        # Build a knowledge graph from the text chunks
        graph, embeddings = build_knowledge_graph(chunks)
        # Traverse the knowledge graph to find relevant information for the query
        retrieved_docs_graph, traversal_path = traverse_graph(query, graph, embeddings, top_k)
        combined_context = create_context_string(retrieved_docs_graph)
        retrieved_docs = retrieved_docs_graph  # Use graph results as retrieved documents
        # Visualize the graph traversal path
        visualize_graph_trarsveal(graph, traversal_path)

    if method == 'mixed_rag':
        # Combine the context from both methods, dropping duplicate chunks
        # while keeping vector results first and preserving retrieval order
        vector_texts = [chunk["text"] for chunk in retrieved_docs_vector]
        graph_texts = [chunk["text"] for chunk in retrieved_docs_graph]
        retrieved_docs = list(dict.fromkeys(vector_texts + graph_texts))
        combined_context = "\n\n---\n\n".join(retrieved_docs)

    # Generate a response based on the query and the relevant chunks
    response = generate_response(query, combined_context)
    
    # Return the query, response, relevant chunks, traversal path, and the graph
    return {
        "query": query,
        "response": response,
        "relevant_chunks": retrieved_docs,
        "traversal_path": traversal_path,
        "graph": graph
    }

In [None]:
def generate_hybrid_rag_response(vector_result, graph_result):
    """
    Generate a hybrid RAG response by merging the contexts of the vector-only and graph-only results.
    
    Args:
        vector_result (Dict): Output of complete_rag_pipeline with method="vector_only_rag"
        graph_result (Dict): Output of complete_rag_pipeline with method="graph_only_rag"
        
    Returns:
        Dict: Results including answer and graph visualization data
    """
    
    
    # Extract relevant chunks from the vector result

    if 'relevant_chunks' not in vector_result:
        vector_result['relevant_chunks'] = []
    vector_chunks = vector_result['relevant_chunks']
    # Extract relevant chunks from the graph result
    if 'relevant_chunks' not in graph_result:
        graph_result['relevant_chunks'] = []
    graph_chunks = graph_result['relevant_chunks']

    if not graph_chunks and not vector_chunks:
        return {
            "query": vector_result['query'],
            "response": "No relevant information found in the document.",
            "relevant_chunks": [],
            "traversal_path": [],
            "graph": None
        }
    elif not graph_chunks:
        return {
            "query": vector_result['query'],
            "response": vector_result['response'],
            "relevant_chunks": vector_chunks,
            "traversal_path": [],
            "graph": None
        }
    elif not vector_chunks:
        return {
            "query": graph_result['query'],
            "response": graph_result['response'],
            "relevant_chunks": graph_chunks,
            "traversal_path": graph_result['traversal_path'],
            "graph": graph_result['graph']
        }    
    
    else:
        # Merge both contexts, dropping duplicates while preserving retrieval order
        retrieved_docs_vector = [chunk["text"] for chunk in vector_chunks]
        retrieved_docs_graph = [chunk["text"] for chunk in graph_chunks]
        context_text_all = list(dict.fromkeys(retrieved_docs_vector + retrieved_docs_graph))
        combined_context = "\n\n---\n\n".join(context_text_all)

    response = generate_response(vector_result['query'], combined_context)
    # Return the query, response, relevant chunks, traversal path, and the graph
    return {
        "query": vector_result['query'],
        "response": response,
        "relevant_chunks": vector_chunks + graph_chunks,
        "traversal_path": graph_result['traversal_path'],
        "graph": graph_result['graph']
    }    
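
When merging the two result lists, it can help to keep the ranking order while removing duplicates, so that the context string is the same on every run. A minimal sketch, with a hypothetical helper `merge_contexts` and invented chunk texts:

```python
def merge_contexts(vector_texts, graph_texts):
    """Merge two ranked lists of chunk texts, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(vector_texts + graph_texts))

vector_texts = ["chunk A", "chunk B", "chunk C"]
graph_texts = ["chunk B", "chunk D"]

merged = merge_contexts(vector_texts, graph_texts)
print(merged)
```

Putting the vector results first is a design choice made for this example; interleaving the two lists by score would be an equally reasonable alternative.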

## Comparing Retrieval Methods

### I compared 3 retrieval methods to assess the pros and cons of each

- `vector_only_rag` - text-only vector store and retrieval without graph exploration

- `graph_only_rag` - response from graph-only RAG

- `hybrid_response` - response using contexts from both vector-only and graph-based RAG

In [None]:
def vector_only_rag(query,chunks, embeddings, k=5):
    """
    Retrieve the most relevant chunks for a query using only vector similarity search.
    
    Args:
        query (str): User query
        chunks (List[Dict]): Text chunks with metadata
        embeddings (List[List[float]]): Embedding vectors for the chunks
        k (int): Number of documents to retrieve
        
    Returns:
        List[Dict]: Retrieved chunks with text, metadata, and similarity score
    """
    # Create the query embedding (pass a one-item list so the query is embedded as a whole)
    query_embedding = create_embeddings([query])
    
    vector_store = SimpleVectorStore()   
    # Add the chunks and their embeddings to the vector store
    vector_store.add_items(chunks, embeddings)
    print(f"Added {len(chunks)} items to vector store")
    
    # Retrieve documents using vector-based similarity search
    retrieved_docs = vector_store.similarity_search_with_scores(query_embedding, k=k)
    return retrieved_docs

In [None]:
def compare_retrieval_methods(query, pdf_path, k=5, reference_answer=None):
    """
    Compare different retrieval methods for a query.
    
    Args:
        query (str): User query
        pdf_path (str): Path to the PDF document
        k (int): Number of documents to retrieve
        reference_answer (str, optional): Reference answer for comparison
        
    Returns:
        Dict: Comparison results
    """
    print(f"\n=== Comparing retrieval methods for query: {query} ===\n")
    
    # Run vector-only RAG
    print("\nRunning vector-only RAG...")
    vector_result = complete_rag_pipeline(pdf_path, query, chunk_size=1000, chunk_overlap=200, top_k=3, method="vector_only_rag")
    
    # Run graph RAG
    print("\nRunning graph RAG...")
    graph_result = complete_rag_pipeline(pdf_path, query, chunk_size=1000, chunk_overlap=200, top_k=3, method="graph_only_rag")
    
    # Run hybrid RAG
    print("\nRunning hybrid RAG...")
    hybrid_result = generate_hybrid_rag_response(vector_result, graph_result)
    
    # Compare responses from different retrieval methods
    print("\nComparing responses...")
    comparison = evaluate_responses(
        query, 
        vector_result["response"],
        graph_result["response"],
        hybrid_result["response"],
        reference_answer
    )
    
    # Return the comparison results
    return {
        "query": query,
        "vector_result": vector_result,
        "graph_result": graph_result,
        "hybrid_result": hybrid_result,
        "comparison": comparison
    }

## Evaluation Function

### Evaluates the merits of the 3 retrieval strategies described above.

- There are several industry-standard metrics in NLP: BLEU, ROUGE, MRR, BERTScore, etc. However, for LLM-generated answers a qualitative evaluation is more relevant
- A manual review of all the generated answers and retrieved documents would be too time-consuming and error-prone
- I created the following framework for qualitative evaluation
  
  -  __LLM as Generator__: Replace human effort by having the LLM generate ___n___ question-answer pairs from the PDF document, using structured prompts, as validation data
  
  (** The system may generate a different set of validation question-answer pairs each time the generator function runs. This is by design: it randomizes the validation set and guards against possible memorization.)
  
  -  __LLM as Evaluator__: Replace the human effort of qualitative evaluation by having the LLM itself evaluate the question-answer pairs using well-designed prompts. The generated evaluation describes the merits and drawbacks of each answer along dimensions such as __Relevance__, __Factual correctness__, __Completeness__, and __Clarity and coherence__, with respect to the corresponding reference answer.

In [None]:

def generate_validation_questions_answers(pdf_path, num_question_answers=10):
    """
    Generate factual and analytical question-answer pairs based on a PDF document using the gpt-4o model.

    Args:
        pdf_path (str): Path to the PDF file.
        num_question_answers (int): Number of question-answer pairs to generate.

    Returns:
        List[Dict]: A list of dictionaries with question-answer pairs and their types.
    """
    
    # SYSTEM PROMPT
    system_prompt = """
You are an expert in Retrieval-Augmented Generation (RAG) systems and LLM safety. Based on the given document content,
generate question-answer pairs:
- 5 factual questions (strictly based on the text, including one on hallucination in LLMs),
- 5 analytical questions (that require critical thinking). Analytical questions should not be taken directly from the text,
but should require reasoning or inference based on the content; most should start with 'why' and be difficult.
At least one analytical question should require a mathematical explanation based on the document, involving the attention concept
(Attention Is All You Need).
For each analytical question, provide an elaborate answer based on the document content, with critical thinking.
Classify each as "factual" or "analytical".
    """

    # Extract text from the PDF
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        full_text += page.get_text()
    doc.close()

    truncated_text = full_text[:4000]  # Keep within token limit

    # USER PROMPT
    user_prompt = f"""
Document Text:
{truncated_text}

Based on the above, generate {num_question_answers} question-answer pairs.
Each question should be clearly labeled and categorized.
Use the following format:

### Factual Questions
1. **Question:** ...
   - **Answer:** ...

### Analytical Questions
6. **Question:** ...
   - **Answer:** ...
"""
     
    # gpt-4o model call
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.3,
        max_tokens=1024,
    )

    content = response.choices[0].message.content
    qa_list = []

    # Parsing generated content
    sections = re.split(r"### (Factual|Analytical) Questions", content)

    for i in range(1, len(sections), 2):
        question_type = sections[i].strip().lower()
        section_text = sections[i + 1]

        qa_pairs = re.findall(
            r"\*\*Question:\*\*\s*(.*?)\n\s*-\s*\*\*Answer:\*\*\s*(.*?)(?=\n\d+\.|\Z)",
            section_text,
            flags=re.DOTALL
        )

        for question, answer in qa_pairs:
            qa_list.append({
                "question": question.strip(),
                "answer": answer.strip(),
                "type": question_type
            })

    # Save results (create the output directory if it does not exist yet)
    os.makedirs("data", exist_ok=True)
    with open("data/val.json", "w") as f:
        json.dump(qa_list, f, indent=4)

    return qa_list
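
The parsing step above can be checked without calling the API. The sketch below repeats the same two regular expressions inside a hypothetical helper, `parse_qa_sections`, and runs them on a hand-written sample that follows the format requested in the user prompt. The sample text is invented for illustration and is not real model output.

```python
import re


def parse_qa_sections(content: str) -> list:
    """Split a formatted response into question-answer dictionaries."""
    qa_list = []

    # The capturing group keeps the section name, so the result alternates:
    # [preamble, "Factual", <section text>, "Analytical", <section text>, ...]
    sections = re.split(r"### (Factual|Analytical) Questions", content)

    for i in range(1, len(sections), 2):
        question_type = sections[i].strip().lower()
        qa_pairs = re.findall(
            r"\*\*Question:\*\*\s*(.*?)\n\s*-\s*\*\*Answer:\*\*\s*(.*?)(?=\n\d+\.|\Z)",
            sections[i + 1],
            flags=re.DOTALL,
        )
        for question, answer in qa_pairs:
            qa_list.append({
                "question": question.strip(),
                "answer": answer.strip(),
                "type": question_type,
            })
    return qa_list


sample = """### Factual Questions
1. **Question:** What is RAG?
   - **Answer:** Retrieval-Augmented Generation.
2. **Question:** What does RAG reduce?
   - **Answer:** Hallucination.

### Analytical Questions
3. **Question:** Why use a graph?
   - **Answer:** It preserves relationships.
"""

parsed = parse_qa_sections(sample)
for item in parsed:
    print(item)
```

If the model answers in a different layout, neither pattern matches and the list comes back empty without raising an error, so it is worth checking `len(qa_list)` before relying on the saved file.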

In [None]:
def generate_validation_questions_answers(pdf_path, num_question_answers=10):
    
    """
    Generate question-answer pairs by analyzing the content of a PDF document.
    
    Args:
        pdf_path (str): path to the PDF file
        num_question_answers (int): Number of question-answer pairs to generate
        
    Returns:
        List[Dict]: List of question-answer pairs
    """
    
    system_prompt = """You are an expert on Retrieval-Augmented Generation (RAG) systems. Based on the provided contents from the document,
    generate a set of question-answer pairs. The questions should be relevant to the content and cover various aspects including the main idea, reasoning, and image interpretation.
    The questions should be strictly based on the content of the document. 

    Generate 10 question-answer pairs:
    - 5 factual questions (one about hallucination in LLMs),
    - 5 analytical questions (critical thinking),
    Also categorize each question-answer pair as "factual" or "analytical".
    Answers should be brief and to the point, and image questions should refer to the figures explicitly."""

    # Extract the text of every page
    doc = fitz.open(pdf_path)
    full_text = ""

    for page in doc:
        full_text += page.get_text()

    doc.close()

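    # Truncate to the first 4000 characters to fit the prompt length
    truncated_text = full_text[:4000]

    user_prompt = f"""Document Text:
    {truncated_text}

    Generate {num_question_answers} question-answer pairs based on the above content.
    Use the following format so that the output can be parsed:

    ### Factual Questions
    1. **Question:** ...
       - **Answer:** ...

    ### Analytical Questions
    6. **Question:** ...
       - **Answer:** ...
    """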
   
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.1
    )
    
    # Extract the content from the response
    content = response.choices[0].message.content
    question_type = None
    qa_list = []

    sections = re.split(r"### (Factual|Analytical) Questions", content)

    # Skip the initial empty or heading text
    for i in range(1, len(sections), 2):
        question_type = sections[i].strip().lower()
        section_text = sections[i + 1]

        # Find question-answer pairs
        qa_pairs = re.findall(
            r"\*\*Question:\*\*\s*(.*?)\n\s*-\s*\*\*Answer:\*\*\s*(.*?)(?=\n\d+\.|\Z)",
            section_text,
            flags=re.DOTALL
        )

        for question, answer in qa_pairs:
            qa_list.append({
                "question": question.strip(),
                "answer": answer.strip(),
                "type": question_type
            })
        
    # Save the question-answer pairs to a JSON file (create the directory if needed)
    os.makedirs("data", exist_ok=True)
    with open("data/val.json", "w") as f:
        json.dump(qa_list, f, indent=4)
    return qa_list

In [None]:
def evaluate_responses(query, vector_response, graph_response, hybrid_response, reference_answer=None):
    """
    Evaluate the responses from different retrieval methods.
    
    Args:
        query (str): User query
        vector_response (str): Response from vector-only RAG
        graph_response (str): Response from graph-only RAG
        hybrid_response (str): Response from hybrid approach RAG
        reference_answer (str, optional): Reference answer
        
    Returns:
        str: Evaluation of responses
    """
    # System prompt for the evaluator to guide the evaluation process
    system_prompt = """You are an expert evaluator of RAG systems. Compare responses from three different retrieval approaches:
    1. Vector-based retrieval: Uses semantic similarity for document retrieval
    2. Graph-based retrieval: Uses a knowledge graph to traverse and find relevant information
    3. Hybrid approach: Combines both vector and graph methods


    Evaluate the responses based on:
    - Relevance to the query
    - Factual correctness
    - Comprehensiveness
    - Reasoning capability
    - Clarity and coherence"""

    # User prompt containing the query and responses
    user_prompt = f"""Query: {query}

    Vector-based response:
    {vector_response}

    Graph-based response:
    {graph_response}

    Hybrid response:
    {hybrid_response}

    """

    # Add reference answer to the prompt if provided
    if reference_answer:
        user_prompt += f"""
            Reference answer:
            {reference_answer}
        """

    # Add instructions for detailed comparison to the user prompt
    user_prompt += """
    Please provide a detailed comparison of these three responses. Which approach performed best for this query and why?
    Be specific about the strengths and weaknesses of each approach for this particular query.
    """

    # Generate the evaluation using gpt-4o
    response = client.chat.completions.create(
        model="gpt-4o",  # Specify the model to use
        messages=[
            {"role": "system", "content": system_prompt},  # System message to guide the evaluator
            {"role": "user", "content": user_prompt}  # User message with query and responses
        ],
        temperature=0  # Set the temperature for response generation
    )
    
    # Return the generated evaluation content
    return response.choices[0].message.content
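
Because the evaluator's verdict depends on the prompt, it can help to build the prompt in a separate function and inspect it before spending tokens. The following is a minimal sketch under that assumption: `build_evaluation_prompt` is a hypothetical helper, not part of the notebook, and it mirrors the prompt assembled in `evaluate_responses`, including the optional reference answer.

```python
from typing import Optional


def build_evaluation_prompt(
    query: str,
    vector_response: str,
    graph_response: str,
    hybrid_response: str,
    reference_answer: Optional[str] = None,
) -> str:
    """Assemble the evaluator's user prompt without calling any API."""
    parts = [
        f"Query: {query}",
        f"Vector-based response:\n{vector_response}",
        f"Graph-based response:\n{graph_response}",
        f"Hybrid response:\n{hybrid_response}",
    ]

    # The reference answer is optional, as in evaluate_responses.
    if reference_answer:
        parts.append(f"Reference answer:\n{reference_answer}")

    parts.append(
        "Please provide a detailed comparison of these three responses. "
        "Which approach performed best for this query and why?\n"
        "Be specific about the strengths and weaknesses of each approach "
        "for this particular query."
    )
    return "\n\n".join(parts)


demo_prompt = build_evaluation_prompt(
    query="What problem does RAG address?",
    vector_response="It reduces hallucination.",
    graph_response="It grounds answers in linked facts.",
    hybrid_response="It grounds answers and reduces hallucination.",
    reference_answer="RAG grounds generation in retrieved knowledge.",
)
print(demo_prompt)
```

Keeping the string construction free of side effects also makes it easy to confirm that the reference answer is included only when one is supplied.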

In [None]:
def compare_with_reference(response, reference, query):
    """
    Compare generated response with reference answer.
    
    Args:
        response (str): Generated response
        reference (str): Reference answer
        query (str): Original query
        
    Returns:
        str: Comparison analysis
    """
    # System message to instruct the model on how to compare the responses
    system_message = """Compare the AI-generated response with the reference answer.
Evaluate based on: correctness, completeness, and relevance to the query.
Provide a brief analysis (2-3 sentences) of how well the generated response matches the reference."""

    # Construct the prompt with the query, AI-generated response, and reference answer
    prompt = f"""
Query: {query}

AI-generated response:
{response}

Reference answer:
{reference}

How well does the AI response match the reference?
"""

    # Make a request to the OpenAI API to generate the comparison analysis
    comparison = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_message},  # System message to guide the assistant
            {"role": "user", "content": prompt}  # User message with the prompt
        ],
        temperature=0.0  # Set the temperature for response generation
    )
    
    # Return the generated comparison analysis
    return comparison.choices[0].message.content

## Evaluation of Graph RAG on a Sample PDF Document

In [None]:
# Path to the PDF document containing AI information
pdf_path = "/Users/aifora/Desktop/personal_laptop/learning_practice/git_projects/Multimodal-Fusion-RAG/data/attention_is_all_you_need.pdf"

# Define an AI-related query for testing Graph RAG
# query = "What are the key applications of transformers in natural language processing?"
query = "How does the encoding help to solve the lack of order of the input sequence?"

# Execute the Graph RAG pipeline to process the document and answer the query
results = complete_rag_pipeline(pdf_path, query, chunk_size=1000, chunk_overlap=200, top_k=3, method="graph_only_rag")

# Print the response generated from the Graph RAG system
print("\n=== ANSWER ===")
print(results["response"])

# Define a test query and reference answer for formal evaluation
# test_queries = [
#     "How do transformers handle sequential data compared to RNNs?"
# ]

# # Reference answer for evaluation purposes
# reference_answers = [
#     "Transformers handle sequential data differently from RNNs by using self-attention mechanisms instead of recurrent connections. This allows transformers to process all tokens in parallel rather than sequentially, capturing long-range dependencies more efficiently and enabling better parallelization during training. Unlike RNNs, transformers don't suffer from vanishing gradient problems with long sequences."
# ]

# # Run formal evaluation of the Graph RAG system with the test query
# evaluation = evaluate_graph_rag(pdf_path, test_queries, reference_answers)

# # Print evaluation summary statistics
# print("\n=== EVALUATION SUMMARY ===")
# print(f"Graph nodes: {evaluation['graph_stats']['nodes']}")
# print(f"Graph edges: {evaluation['graph_stats']['edges']}")
# for i, result in enumerate(evaluation['results']):
#     print(f"\nQuery {i+1}: {result['query']}")
#     print(f"Path length: {result['traversal_path_length']}")
#     print(f"Chunks used: {result['relevant_chunks_count']}")

## Complete Evaluation Pipeline
- For each test query, we generate structured, comparative evaluation results for the 4 different retrieval strategies.

In [None]:
def evaluate_hybrid_retrieval(pdf_path, test_queries, reference_answers=None, k=5):
    """
    Evaluate hybrid retrieval compared to other methods.
    
    Args:
        pdf_path (str): Path to the PDF file
        test_queries (List[str]): List of test queries
        reference_answers (List[str], optional): Reference answers
        k (int): Number of documents to retrieve
        
    Returns:
        Dict: Evaluation results
    """
    print("=== EVALUATING DIFFERENT RETRIEVAL STRATEGIES ===\n")
    
    
    # Initialize a list to store results for each query
    results = []
    
    # Iterate over each test query
    for i, query in enumerate(test_queries):
        print(f"\n\n=== Evaluating Query {i+1}/{len(test_queries)} ===")
        print(f"Query: {query}")
        
        # Get the reference answer if available
        reference = None
        if reference_answers and i < len(reference_answers):
            reference = reference_answers[i]
        
        # Compare retrieval methods for the current query
        comparison = compare_retrieval_methods(
            query, 
            pdf_path=pdf_path, 
            k=k, 
            reference_answer=reference
        )
        
        # Append the comparison results to the results list
        results.append(comparison)
        
        # Print the responses from different retrieval methods
        print("\n=== Vector-based Response ===")
        print(comparison["vector_result"]["response"])
        
        print("\n=== Graph-based Response ===")
        print(comparison["graph_result"]["response"])
        
        print("\n=== Hybrid Response ===")
        print(comparison["hybrid_result"]["response"])
        
        print("\n=== Comparison ===")
        print(comparison["comparison"])
    
    # Generate an overall analysis of the hybrid retrieval performance
    overall_analysis = generate_overall_analysis(results)
    
    # Return the results and overall analysis
    return {
        "results": results,
        "overall_analysis": overall_analysis
    }

In [None]:
def generate_overall_analysis(results):
    """
    Generate an overall analysis of hybrid retrieval.
    
    Args:
        results (List[Dict]): Results from evaluating queries
        
    Returns:
        str: Overall analysis
    """
    # System prompt to guide the evaluation process
    system_prompt = """You are an expert at evaluating information retrieval systems. 
    Based on multiple test queries, provide an overall analysis comparing three retrieval approaches:
    1. Vector-based retrieval (semantic similarity)
    2. Graph-based retrieval (knowledge graph traversal)
    3. Hybrid approach (combination of vector and graph methods)

    Focus on:
    1. Types of queries where each approach performs best
    2. Overall strengths and weaknesses of each approach
    3. Is graph-based retrieval significantly better than vector-based retrieval?
    4. When is graph-based retrieval advantageous over vector-based retrieval?
    5. How hybrid retrieval, combining vector-based and graph-based methods, balances the trade-offs
    6. Recommendations for when to use each approach
    
    At the end, summarize the analysis in a table that gives an approximate score for each method on the
    following dimensions, together with a generic qualitative observation, and indicates which retrieval
    approach is best for most of the questions. Use these column names:

    | Retrieval Method | Relevance | Factual Correctness | Reasoning Capability | Clarity and Coherence | Overall Score | Observation |
    Give a higher score to the method that is best for most of the questions.
    """

    # Create a summary of evaluations for each query
    evaluations_summary = ""
    for i, result in enumerate(results):
        evaluations_summary += f"Query {i+1}: {result['query']}\n"
        evaluations_summary += f"Comparison Summary: {result['comparison'][:200]}...\n\n"

    # User prompt containing the evaluations summary
    user_prompt = f"""Based on the following evaluations of different retrieval methods across {len(results)} queries, 
    provide an overall analysis comparing these three approaches:

    {evaluations_summary}

    Please provide a comprehensive analysis of the vector-based and graph-based retrieval approaches,
    highlighting when and why graph retrieval provides advantages over the vector-store-based retrieval method."""

    # Generate the overall analysis using gpt-4o
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )
    
    # Return the generated analysis content
    return response.choices[0].message.content

## Evaluating Hybrid Retrieval

- Running the evaluator for all the reference queries and respective answers
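
As a minimal sketch of how the saved validation file could feed the evaluator, the hypothetical helper `load_validation_set` below reads a file in the same shape as `data/val.json` (a list of records with `question`, `answer`, and `type` keys) and returns parallel lists of queries and reference answers. The optional `question_type` filter is an assumption added for illustration. The demo writes made-up records to a temporary file so it runs without the real data.

```python
import json
import os
import tempfile


def load_validation_set(val_path, question_type=None):
    """Return (queries, references) from a val.json-style file."""
    with open(val_path, "r") as f:
        records = json.load(f)

    # Optionally keep only one category, e.g. "factual" or "analytical".
    if question_type is not None:
        records = [
            r for r in records
            if r.get("type", "").lower() == question_type.lower()
        ]

    queries = [r["question"] for r in records]
    references = [r["answer"] for r in records]
    return queries, references


# Made-up records in the same shape the generator function saves.
sample_records = [
    {"question": "What is RAG?", "answer": "Retrieval-Augmented Generation.", "type": "factual"},
    {"question": "Why use a graph?", "answer": "It preserves relationships.", "type": "analytical"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "val.json")
    with open(demo_path, "w") as f:
        json.dump(sample_records, f)

    all_queries, all_references = load_validation_set(demo_path)
    why_queries, why_references = load_validation_set(demo_path, question_type="analytical")

print(len(all_queries), "queries in total;", len(why_queries), "analytical")
```

Returning the two lists together keeps each query aligned with its reference answer by index, which is how `evaluate_hybrid_retrieval` looks them up.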

In [None]:
# Path to the PDF document to be evaluated
# pdf_path = "data/RAG_white_papers_articles.pdf"
pdf_path = "/Users/aifora/Desktop/personal_laptop/learning_practice/git_projects/Multimodal-Fusion-RAG/data/RAG_white_papers_articles.pdf"

# Path to the validation file with generated questions and answers
val_path = "data/val.json"
# generate_validation_questions_answers(pdf_path, num_question_answers=10)
with open(val_path, "r") as f:
    validation_doc = json.load(f)

# test_queries = [item['question'] for item in validation_doc]

## Quick test queries
test_queries = [
    
    # "How does the encoding help to solve the lack of order of the input sequence?",
    # "Why might Modular RAG offer a significant advantage over Naive or Advanced RAG in real-world applications?",
    # "What is a primary challenge that LLMs face which RAG is designed to solve?"
]

# Optional reference answer

# reference_answers = [item['answer'] for item in validation_doc]

reference_answers = [
    #  "This is done by adding positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings in the Transformer model is to inject information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimensionality as the embeddings, so they can be summed together. By using sine and cosine functions of different frequencies, the positional encodings form a geometric progression from 2π to 10000 · 2π. This allows the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.",
    # "Modular RAG provides flexibility by allowing modules to be rearranged or replaced, enabling better adaptation to diverse tasks, reducing redundancy, and supporting more dynamic interaction flows",
    # "Hallucination—producing content not grounded in factual sources—is a key challenge RAG addresses by grounding generation in retrieved external knowledge"

]
k = 4

# Run evaluation
evaluation_results = evaluate_hybrid_retrieval(
    pdf_path=pdf_path,
    test_queries=test_queries,
    reference_answers=reference_answers,
    k=k,
)

# Print overall analysis
print("\n\n=== OVERALL ANALYSIS ===\n")
print(evaluation_results["overall_analysis"])