# RAG Pipeline Demonstration

This notebook walks through a retrieval-augmented generation (RAG) pipeline using LLaMA for Q&A on an HR policy PDF. We will:

1. **Load and chunk** the PDF.
2. **Embed** the chunks and upsert them into **Qdrant**.
3. **Perform** a vector search and cross-encoder **reranking**.
4. **Optionally** expand the context with less-similar but unique chunks.
5. **Query** the LLaMA model to get a final answer.

---

## 1. Imports & Setup

Below, we import the necessary libraries:

- **Qdrant** client for storing/retrieving vector embeddings.
- **LangChain** (or `langchain_community`) for loading PDFs, splitting text, and embedding.
- **Transformers** for cross-encoder reranking and LLaMA model usage.
- **Torch** for GPU support (if available).


In [1]:
import uuid

import numpy as np
import torch
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

print("Imports done.")

Imports done.


## 2. LLaMA Model Initialization

Here, we load the smaller Llama 3.2 3B Instruct model. Adjust the following as needed:
- **model_name** to a checkpoint you have access to.
- The Torch settings if you are on CPU only (remove `bfloat16` if it is not supported).


In [2]:
model_name = "meta-llama/Llama-3.2-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Loaded LLaMA model on device: {device}")


Loaded LLaMA model on device: cuda


## 3. Qdrant Client & Embeddings

- We connect to a remote Qdrant instance (edit the URL/API key as needed).
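- We delete and recreate a collection called `hr_policy_docs`, so each run starts from an empty index.
- We load a Hugging Face embedding model (`sentence-transformers/all-MiniLM-L6-v2`), whose 384-dimensional vectors match the collection's `size=384`.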
- We also prepare a cross-encoder pipeline for reranking (`ms-marco-MiniLM-L-6-v2`).
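
Rather than editing the URL and API key in the cell itself, the connection settings could be read from the environment. The following is a minimal sketch, not part of the original notebook: the variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helper `load_qdrant_settings` are assumptions chosen for illustration.

```python
import os


def load_qdrant_settings(env=None):
    """Return (url, api_key) from an environment-style mapping.

    QDRANT_URL and QDRANT_API_KEY are hypothetical variable names.
    """
    env = os.environ if env is None else env
    url = env.get("QDRANT_URL")
    if not url:
        raise RuntimeError("Set QDRANT_URL before connecting to Qdrant.")
    # The key is optional: a local, unauthenticated instance does not need one.
    api_key = env.get("QDRANT_API_KEY") or None
    return url, api_key


# Example with an explicit mapping instead of the real environment.
url, api_key = load_qdrant_settings({"QDRANT_URL": "http://localhost:6333"})
print(url, api_key)
```

The returned values can then be passed to `QdrantClient(url=url, api_key=api_key)`, which keeps credentials out of the notebook file.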


In [3]:
qdrant_client = QdrantClient(
    url="YOUR_QDRANT_URL",  # placeholder: replace with your Qdrant endpoint
    api_key="YOUR_API_KEY",
)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

reranker = pipeline(
    "text-classification",
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",
    device=0 if torch.cuda.is_available() else -1
)

collection_name = "hr_policy_docs"

# Delete and recreate the collection
try:
    qdrant_client.delete_collection(collection_name=collection_name)
    print("Collection deleted successfully.")
except Exception as e:
    print(f"Collection deletion failed: {e}")

qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE)
)

print(f"Created collection '{collection_name}'.")


Collection deleted successfully.
Created collection 'hr_policy_docs'.


## 4. PDF Loading & Splitting

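We load a local PDF (`HR-Policy-Document.pdf`) with `PyPDFLoader`, which returns one document per page, then use `CharacterTextSplitter` to break the pages into overlapping chunks for finer-grained retrieval. Note that `CharacterTextSplitter` splits only on its separator (`"\n\n"` by default), so a page without that separator stays a single chunk even if it exceeds `chunk_size`; the output below (53 pages, 53 chunks) suggests that is what happened here.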


In [4]:
pdf_path = "HR-Policy-Document.pdf"  # Adjust if needed
loader = PyPDFLoader(pdf_path)
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
texts = [doc.page_content for doc in docs]

print(f"Loaded {len(documents)} pages and split into {len(texts)} chunks.")

Loaded 53 pages and split into 53 chunks.
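
To make the idea of overlapping chunks concrete, here is a minimal sliding-window sketch in plain Python. It is an illustration only and not the algorithm `CharacterTextSplitter` uses (which splits on a separator and then merges pieces); the function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij"
print(chunk_with_overlap(sample, chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the role `chunk_overlap=100` plays at a larger scale: a sentence that straddles a boundary still appears intact in at least one chunk.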


## 5. Embedding & Upsert to Qdrant

We embed each chunk using `embedding_model.embed_documents(...)` and upsert them into Qdrant with payloads containing the chunk text.


In [5]:
text_embeddings = embedding_model.embed_documents(texts)
points = []
for i, embed in enumerate(text_embeddings):
    points.append(PointStruct(
        id=str(uuid.uuid4()),
        vector=embed,
        payload={"text": texts[i]}
    ))

qdrant_client.upsert(collection_name=collection_name, points=points)
print(f"Indexed {len(points)} chunks into Qdrant.")


Indexed 53 chunks into Qdrant.
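
The cell above assigns a random `uuid.uuid4()` to every chunk, so indexing the same PDF again into an existing collection would add new points instead of overwriting the old ones. One possible alternative, sketched here under assumptions, is to derive the ID from the chunk itself with `uuid.uuid5`. The namespace value, the `source` and `chunk_index` payload fields, and the helper names are all hypothetical, and plain dicts stand in for `PointStruct` so the example runs without Qdrant.

```python
import uuid

# Hypothetical namespace for this collection; any fixed UUID would work.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "hr_policy_docs")


def chunk_id(source, index, text):
    """Deterministic ID: the same source, position, and text give the same UUID."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{source}:{index}:{text}"))


def build_points(source, texts, embeddings):
    """Pair each chunk with its vector; dicts mirror PointStruct's id/vector/payload."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    return [
        {
            "id": chunk_id(source, i, text),
            "vector": list(vector),
            "payload": {"text": text, "source": source, "chunk_index": i},
        }
        for i, (text, vector) in enumerate(zip(texts, embeddings))
    ]


points_a = build_points("HR-Policy-Document.pdf", ["first", "second"], [[0.1, 0.2], [0.3, 0.4]])
points_b = build_points("HR-Policy-Document.pdf", ["first", "second"], [[0.1, 0.2], [0.3, 0.4]])
print(points_a[0]["id"] == points_b[0]["id"])  # True: IDs are stable across runs
```

Because an upsert replaces a point whose ID already exists, stable IDs make re-indexing idempotent. In this notebook the collection is recreated on every run, so the random IDs do no harm here.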


## 6. Query & Vector Search

We pick a sample user query, embed it with the same embedding model used for the chunks, then call `qdrant_client.search(...)` to retrieve the 10 nearest chunks by cosine similarity.
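
Conceptually, the search ranks stored vectors by their cosine similarity to the query vector and returns the best `limit` matches. The brute-force sketch below illustrates that ranking in plain Python; the two-dimensional vectors and labels are made up for the example, and Qdrant itself typically answers such queries with an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Score every (text, vector) pair against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: invented vectors, not real embeddings.
stored = [
    ("alcohol policy", [1.0, 0.0]),
    ("safety rules", [0.6, 0.8]),
    ("group insurance", [0.0, 1.0]),
]
for score, text in top_k([1.0, 0.0], stored, k=2):
    print(f"{score:.2f}  {text}")
```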


In [6]:
query_text = "What is the termination policy for drug use?"
query_embedding = embedding_model.embed_query(query_text)

search_results = qdrant_client.search(
    collection_name=collection_name,
    query_vector=query_embedding,
    limit=10
)

print("--- Vector Store Results ---")
for i, result in enumerate(search_results):
    print(f"Result {i+1}:")
    print(f"Score: {result.score}")
    snippet = result.payload.get('text', '')[:200]
    print(f"Text: {snippet}...\n")


--- Vector Store Results ---
Result 1:
Score: 0.55006707
Text: Use of Alcohol:
The Foundation does not provide funds for the purchaseof alcohol at Foundation sponsored activities. However,employees may consume alcohol if they so choose underthe following guidelin...

Result 2:
Score: 0.49573123
Text: ● They are required to refrain from any unsafe practices or hazardous actions and to exercise due careand diligence.
● Any unsafe conditions, materials or equipment andall accidents or injuries must b...

Result 3:
Score: 0.3858205
Text: Procedures:
1. The employee’s supervisor will implement progressivediscipline when addressing performance issues.
2. If, through the application of progressive discipline, conduct or performance probl...

Result 4:
Score: 0.371737
Text: SECTION 8
Change of Status
Resignation & Termination
Having clear processes for when employees leave yourFoundation can mitigate legal risks andnegative feelings.
Policy and Procedure Statement
Employ...

Result 5:
Score: 0

## 7. Cross-Encoder Reranking

We define a function `rerank_results` that uses the cross-encoder pipeline to reorder the top chunks by reading the query and each chunk text together.
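
The two-stage pattern is easier to see with the model removed. In the sketch below, a toy word-overlap function stands in for the cross-encoder; it is purely an assumption for illustration and says nothing about how `ms-marco-MiniLM-L-6-v2` scores text. What carries over is the flow: score each (query, chunk) pair jointly, keep the first-stage score for reference, and sort by the new score.

```python
def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn):
    """Re-score first-stage candidates with score_fn and sort best first."""
    reranked = [
        {
            "text": cand["text"],
            "original_score": cand["original_score"],
            "score": score_fn(query, cand["text"]),
        }
        for cand in candidates
    ]
    reranked.sort(key=lambda item: item["score"], reverse=True)
    return reranked


# Invented candidates, ordered by a made-up vector-search score.
candidates = [
    {"text": "alcohol may be consumed at events", "original_score": 0.55},
    {"text": "termination policy for drug use at work", "original_score": 0.37},
]
for item in rerank("termination policy drug use", candidates, overlap_score):
    print(f"{item['score']:.2f} (was {item['original_score']:.2f})  {item['text']}")
```

The second candidate moves to the top even though its vector-search score was lower, which is the behaviour a reranker is meant to provide.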


In [7]:
def rerank_results(query, results, reranker, max_length=512, batch_size=8):
    """
    Rerank results using a cross-encoder reranker model with batch processing and truncation.
    """
    ranked_results = []
    batched_inputs = []
    original_data = []

    for result in results:
        text = result.payload.get("text", "")
        combined_input = f"{query} [SEP] {text}"
        batched_inputs.append(combined_input)
        original_data.append({
            "original_score": result.score,
            "text": text
        })

    for i in range(0, len(batched_inputs), batch_size):
        batch = batched_inputs[i : i + batch_size]
        try:
            scores = reranker(batch, truncation=True, max_length=max_length)
            for j, score in enumerate(scores):
                ranked_results.append({
                    "score": score["score"],
                    "text": original_data[i + j]["text"],
                    "original_score": original_data[i + j]["original_score"],
                })
        except Exception as e:
            print(f"Error reranking batch {i // batch_size}: {str(e)}")

    ranked_results.sort(key=lambda x: x["score"], reverse=True)
    return ranked_results

reranked_results = rerank_results(query_text, search_results, reranker)
print("--- Reranked Top 3 ---")
for item in reranked_results[:3]:
    print(f"Score: {item['score']:.3f}, Original: {item['original_score']:.3f}")
    print(f"Text: {item['text'][:200]}...\n")


--- Reranked Top 3 ---
Score: 0.007, Original: 0.365
Text: SECTION 7
Problem Resolution
Progressive Discipline
Employee discipline is a necessary evil for most employers. It’s never fun to do and can lead tounhappy employees.  Moreover, there is always theris...

Score: 0.005, Original: 0.345
Text: SECTION 5
Health and Safety
Health and Safety
Canadian health and safety legislation requires employersto have a health and safety program in theirworkplace. A written policy helps to promote an eﬀect...

Score: 0.003, Original: 0.550
Text: Use of Alcohol:
The Foundation does not provide funds for the purchaseof alcohol at Foundation sponsored activities. However,employees may consume alcohol if they so choose underthe following guidelin...



## 8. Expanded Context Filter

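We define a small function that keeps the top N reranked chunks, then scans the remaining chunks and adds up to `max_expansion` of them whose cosine similarity to every chunk already kept is below `enrichment_threshold`, so that the context gains distinct material rather than near-duplicates.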
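
The selection rule can be sketched without an embedding model. The example below uses made-up two-dimensional vectors and a hypothetical helper `select_with_novelty`; it follows the same idea as the notebook function (keep the top N, then admit only candidates that are dissimilar to everything kept so far) but is a simplified illustration rather than a drop-in replacement.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_with_novelty(ranked, top_n, max_extra, threshold):
    """ranked: list of (text, vector), best first. Returns the kept (text, vector) pairs."""
    selected = list(ranked[:top_n])
    for text, vec in ranked[top_n:]:
        if len(selected) >= top_n + max_extra:
            break
        # Admit the candidate only if it is dissimilar to every chunk kept so far.
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in selected):
            selected.append((text, vec))
    return selected


# Invented vectors: "A-dup" nearly duplicates "A"; "B" and "C" point elsewhere.
ranked = [
    ("A", [1.0, 0.0]),
    ("A-dup", [0.99, 0.1]),
    ("B", [0.0, 1.0]),
    ("C", [0.7, 0.7]),
]
kept = select_with_novelty(ranked, top_n=1, max_extra=2, threshold=0.85)
print([text for text, _ in kept])
# ['A', 'B', 'C']
```

A lower threshold admits fewer, more distinct chunks; a threshold near 1.0 admits almost everything until `max_extra` is reached.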


In [8]:
def expanded_context_filter(reranked_results, num_top_chunks=5, max_expansion=3, enrichment_threshold=0.85):
    """
    Select the top N chunks, then add up to max_expansion lower-ranked chunks
    whose cosine similarity to every chunk already selected is below enrichment_threshold.
    """
    if not reranked_results:
        return []

    top_chunks = reranked_results[:num_top_chunks]
    expanded_context = top_chunks[:]
    seen_texts = [chunk['text'] for chunk in top_chunks]
    initial_embeddings = embedding_model.embed_documents(seen_texts)

    for result in reranked_results[num_top_chunks:]:
        if len(expanded_context) >= num_top_chunks + max_expansion:
            break

        new_text = result['text']
        new_embedding = embedding_model.embed_query(new_text)
        similarities = cosine_similarity([new_embedding], initial_embeddings).flatten()

        if np.all(similarities < enrichment_threshold):
            expanded_context.append(result)
            initial_embeddings = np.vstack([initial_embeddings, new_embedding])

    expanded_context = sorted(expanded_context, key=lambda x: x['score'], reverse=True)
    return expanded_context

final_context = expanded_context_filter(reranked_results, num_top_chunks=5, max_expansion=3, enrichment_threshold=0.85)
print(f"Final context size: {len(final_context)}")
for i, item in enumerate(final_context):
    print(f"[{i+1}] Score: {item['score']:.3f}, Text snippet: {item['text'][:150]}...\n")


Final context size: 8
[1] Score: 0.007, Text snippet: SECTION 7
Problem Resolution
Progressive Discipline
Employee discipline is a necessary evil for most employers. It’s never fun to do and can lead toun...

[2] Score: 0.005, Text snippet: SECTION 5
Health and Safety
Health and Safety
Canadian health and safety legislation requires employersto have a health and safety program in theirwor...

[3] Score: 0.003, Text snippet: Use of Alcohol:
The Foundation does not provide funds for the purchaseof alcohol at Foundation sponsored activities. However,employees may consume alc...

[4] Score: 0.003, Text snippet: ● They are required to refrain from any unsafe practices or hazardous actions and to exercise due careand diligence.
● Any unsafe conditions, material...

[5] Score: 0.002, Text snippet: requested on this basis, the Foundation may require the employee to temporarily transfer to an alternative positionwith equivalent pay and beneﬁts whi...

[6] Score: 0.002, Text snippet: Group Insur

## 9. Building a System Prompt & Querying LLaMA

We define a helper that formats the selected context chunks into a single prompt, then calls the LLaMA model to get an answer. The prompt instructs the model to answer only from the given context, or to state that the answer cannot be found in the document.
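
A decoder-only model's `generate` returns the prompt tokens followed by the newly generated tokens, so decoding the whole sequence prints the prompt before the answer. One way to return only the answer is to drop the prompt-length prefix before decoding. The sketch below shows the slicing on plain lists of token IDs; the IDs and the helper name `split_generated` are invented for illustration.

```python
def split_generated(prompt_ids, output_ids):
    """Return only the tokens that follow the prompt in a generated sequence."""
    prompt_len = len(prompt_ids)
    if output_ids[:prompt_len] != prompt_ids:
        raise ValueError("output does not start with the prompt tokens")
    return output_ids[prompt_len:]


# Made-up token IDs: four prompt tokens followed by three generated ones.
prompt_ids = [101, 7, 8, 9]
output_ids = [101, 7, 8, 9, 42, 43, 2]
print(split_generated(prompt_ids, output_ids))
# [42, 43, 2]
```

With the tensors used in the cell below, the equivalent slice would be `outputs[0][inputs["input_ids"].shape[1]:]`, passed to `tokenizer.decode(..., skip_special_tokens=True)`.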


In [9]:
from typing import List

SYSTEM_PROMPT = """
You are an expert AI assistant. Utilize the most relevant context sources provided to answer the user's question. Each source is separated by section headings or line breaks.
If you did not find anything to answer with from the context provided, state that the answer cannot be found in the document.
Answer the following question directly and informatively. Provide no more than one response.

Context:
{context}

Question: {question}
Answer:
"""

def format_documents(final_context: List[dict]) -> str:
    return "\n\n".join([doc['text'] for doc in final_context])

def query_llama(model, tokenizer, context: List[dict], question: str) -> str:
    formatted_context = format_documents(context)
    prompt = SYSTEM_PROMPT.format(context=formatted_context, question=question)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=400,
        num_return_sequences=1,
        temperature=0.3,
        top_p=0.9
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

question = "What is the termination policy for drug use?"
response = query_llama(model, tokenizer, final_context, question)
print("LLaMA Response:\n")
print(response)


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


LLaMA Response:


You are an expert AI assistant. Utilize the most relevant context sources provided to answer the user's question. Each source is separated by section headings or line breaks.
If you did not find anything to answer with from the context provided, state that the answer cannot be found in the document.
Answer the following question directly and informatively. Provide no more than one response.

Context:
SECTION 7
Problem Resolution
Progressive Discipline
Employee discipline is a necessary evil for most employers. It’s never fun to do and can lead tounhappy employees.  Moreover, there is always therisk that discipline will lead to discrimination orother work-related claims, whether merited or not. But the long-term consequences of neglectingemployee discipline can soon outweigh the short-termdiscomfort of doing so.  Although there is not away to make employee discipline completely pain free, implementing a progressive discipline policycan help alleviate some of the discom

## Conclusion

We’ve shown an entire pipeline, from PDF ingestion to LLaMA Q&A. You can adapt or extend it further:

- Use additional retrieval strategies.
- Switch to a different LLM if needed (OpenAI, local GPT-J, etc.).

Note that conversation memory for a full chat experience is added in the Flask app.

Happy experimenting!
