# __Demo: RAG Implementation from scratch__

### Steps to be followed:

1. Install and import the dependencies
2. Load the document
3. Split the document into chunks
4. Generate embeddings for each chunk
5. Build the FAISS vector store and create a retriever
6. Design a prompt template for the language model
7. Load and configure a quantized language model
8. Set up the generation pipeline and chain the components
9. Invoke the pipeline with a query

# **Step 1: Install and import the dependencies**

In [2]:
# !pip install --upgrade huggingface_hub sentence-transformers bitsandbytes

In [1]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import os


# **Step 2: Load the document**

Load the document that will be used as the knowledge source.

**Knowledge base**: The text document serves as the underlying knowledge base. Later, when a query is made, relevant parts of this document will be retrieved to augment the LLM's response.






In [13]:
text_loader = TextLoader("state_of_union.txt")
text_document = text_loader.load()
# print(text_document[0].page_content[:100])  # Prints the first 100 characters of the loaded document



# **Step 3: Split the document into chunks**

Break down the large document into manageable pieces.

**Fine-Grained Retrieval**: Smaller chunks allow the retriever to more precisely locate the context relevant to the query, enhancing the generation step with focused context.
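
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch does not attempt. The helper name `chunk_text` and the sample string are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.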

In [15]:
doc_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
split_texts = doc_splitter.split_documents(text_document)
print(len(split_texts))  # Prints the number of chunks the document has been split into


53


# **Step 4: Generate embeddings for each chunk**

Convert text chunks into numerical vectors (embeddings) that capture semantic meaning.

**Semantic Search**: Embeddings allow the FAISS vector store to perform similarity searches, ensuring that the most relevant context is retrieved for any given query.

**Verification**: Printing the length of the embedding vector confirms the transformation was successful.
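
As a rough illustration of what "similarity" means for embeddings, the sketch below computes cosine similarity on tiny hand-made vectors. The three-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings produced in this step, and the helper names are assumptions for this example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 for same direction, 0 for orthogonal."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up toy vectors, not real embeddings
query = [1.0, 2.0, 0.0]
similar = [2.0, 4.0, 0.0]    # points in the same direction as the query
unrelated = [0.0, 0.0, 5.0]  # orthogonal to the query

print(cosine_similarity(query, similar))
print(cosine_similarity(query, unrelated))
# After normalization, the plain inner product gives the same score as cosine similarity
print(dot(l2_normalize(query), l2_normalize(similar)))
```

This is why an inner-product index can serve as a cosine-similarity index when the stored vectors are unit length.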

In [17]:
MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
hf_embed = HuggingFaceEmbeddings(model_name=MODEL_NAME)
text = split_texts[0].page_content
hf_embed_result = hf_embed.embed_documents([text])
print(len(hf_embed_result[0]))  # Prints the length of the first embedded document

768


#### To quickly inspect what the embeddings for all chunks look like, embed every chunk as follows

In [18]:
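# embed_documents is the batch method intended for stored text; embed_query is meant for search queries
embedded_chunks = hf_embed.embed_documents([chunk.page_content for chunk in split_texts])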

In [21]:
import numpy as np

# FAISS expects a 2-D float32 array of shape (num_chunks, dimension)
embedded_chunks = np.array(embedded_chunks).astype('float32')
dimension = embedded_chunks.shape[1]  # embedding size, needed when creating the FAISS index
type(embedded_chunks)

numpy.ndarray

# **Step 5: Build the FAISS vector store and create a retriever**

Build an index (FAISS) for the document embeddings and create a retriever.

**Retrieval step**: The retriever is responsible for fetching the most relevant chunks from the document based on the query. These retrieved contexts will later be fed into the generation step to produce an informed answer.
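
Conceptually, a flat inner-product index scores the query against every stored vector and keeps the best matches. The sketch below shows that exhaustive search in plain Python on made-up two-dimensional unit vectors. It is an illustration of the idea, not FAISS's implementation, and `top_k_inner_product` is a hypothetical helper name.

```python
def top_k_inner_product(query, vectors, k):
    """Score every stored vector against the query and return the k best (score, index) pairs."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return scored[:k]


# Made-up unit-length vectors standing in for chunk embeddings
stored = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
]
query_vec = [0.8, 0.6]

hits = top_k_inner_product(query_vec, stored, k=2)
for score, idx in hits:
    print(idx, round(score, 2))
```

The indices returned by such a search are what the retriever maps back to the original text chunks.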


In [25]:
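import faiss
# IndexFlatIP ranks by inner product, which equals cosine similarity only for unit-length vectors;
# if the embedding model does not normalize its output, call faiss.normalize_L2(embedded_chunks) first
index = faiss.IndexFlatIP(dimension)
index.add(embedded_chunks)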

In [None]:
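import faiss

# IVF variant: a flat inner-product index serves as the coarse quantizer over 4 clusters
quantizer = faiss.IndexFlatIP(dimension)
# Pass the metric explicitly; IndexIVFFlat defaults to L2 distance otherwise
index = faiss.IndexIVFFlat(quantizer, dimension, 4, faiss.METRIC_INNER_PRODUCT)
index.train(embedded_chunks)  # learn the cluster centroids before adding vectors
index.add(embedded_chunks)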

In [28]:
# dir(index)

In [30]:
faiss.write_index(index, "cosine_Similarity_index_vectordatabase.faiss")

In [None]:
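i1 = faiss.read_index("cosine_Similarity_index_vectordatabase.faiss")  # same file name as written above
# The search call is cut off in the source; as an assumed example, fetch the 3 nearest neighbours of the first chunk's embedding
scores, ids = i1.search(embedded_chunks[:1], 3)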

In [None]:
vectorstore=FAISS.from_documents(split_texts, hf_embed)

# This embeds the chunks with the same model as shown above and creates a vector database from them, which is temporary, i.e. non-persistent

#### Let's see if the retriever works

In [None]:
retriever=vectorstore.as_retriever()

In [None]:
# The way the retriever works

query = "What are the key points from the State Of The Union"
docs = retriever.get_relevant_documents(query)
for doc in docs:
    print(doc.page_content)

In [None]:
query2 = "How is the United States supporting Ukraine economically and militarily?"

In [None]:
docs = retriever.get_relevant_documents(query2)
for doc in docs:
    print(doc.page_content)

# **Step 6: Design a prompt template for the language model**
Establish a prompt that instructs the LLM on how to utilize the retrieved context to generate a concise answer.

**Guiding Generation**: The prompt template bridges retrieval and generation by ensuring the LLM uses the provided context (from the retriever) to answer the query accurately.
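
Filling the template amounts to string substitution: the retrieved chunks are joined into one context string and placed next to the question. The sketch below shows this with plain `str.format`. The shortened template, the `build_prompt` helper, and the placeholder chunks are assumptions for illustration, not LangChain's internals.

```python
# Shortened stand-in for the notebook's template
TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, chunks):
    """Join retrieved chunks into a single context string and fill the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(question=question, context=context)


# Placeholder chunks standing in for retriever output
example_chunks = ["Chunk A text.", "Chunk B text."]
filled = build_prompt("What does chunk A say?", example_chunks)
print(filled)
```

The model only ever sees this final string, so anything the retriever misses is invisible to the generation step.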

In [None]:
from langchain.prompts import ChatPromptTemplate

In [None]:
template="""You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use one sentence and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""

In [None]:
prompt=ChatPromptTemplate.from_template(template)

In [None]:
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

In [None]:
output_parser=StrOutputParser()

# **Step 7: Load and configure a quantized language model**

Load a quantized version of a large language model (Falcon3-1B-Base) for efficient and cost-effective text generation.

**Generation Step**: This model is responsible for generating the final answer. It takes the prompt (which includes the retrieved context) and produces a response, completing the RAG pipeline.

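**Efficiency**: 4-bit quantization stores the model weights in 4 bits instead of 16 or 32, which sharply reduces memory usage, usually at the cost of a small loss in output quality. This makes it practical to run RAG systems on modest hardware.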
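
A back-of-the-envelope calculation shows where the savings come from. The sketch below estimates weight storage only, using a round figure of one billion parameters as an assumption (not the exact parameter count of the model used here). Real memory use is higher because of quantization metadata, layers kept in higher precision, and activations.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


ASSUMED_PARAMS = 1_000_000_000  # assumption: round figure for a ~1B-parameter model

fp16_gb = weight_memory_gb(ASSUMED_PARAMS, 16)
int4_gb = weight_memory_gb(ASSUMED_PARAMS, 4)
print(f"fp16: {fp16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
# → fp16: 2.00 GB, 4-bit: 0.50 GB
```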

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

MODEL_NAME = "tiiuae/Falcon3-1B-Base"

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()


In [None]:
model.eval()
generation_config = model.generation_config
# Sampling temperature: values closer to 0 make responses more deterministic (only applies when do_sample=True)
generation_config.temperature = 0.8
# Set number of returned sequences to 1
generation_config.num_return_sequences = 1
# Set maximum new tokens per response
generation_config.max_new_tokens = 256
# Disable token caching
generation_config.use_cache = False
# Penalize tokens that have already appeared, to reduce repetitive output
generation_config.repetition_penalty = 1.7
# Define pad and EOS token IDs
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id

In [None]:
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
    pipeline,
)

# **Step 8: Set up the generation pipeline and chain the components**

Build an end-to-end pipeline that seamlessly connects document retrieval with text generation.

**Integration**: The chain uses the retriever to fetch context, applies the prompt template to integrate the query with the retrieved context, and then passes the final prompt to the LLM for answer generation.

**Pipeline composition**: Using the pipe operator (|), the components are elegantly chained together to perform a complete RAG operation in one go.
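
To show how a pipe operator can chain steps, the toy sketch below overloads `|` through Python's `__or__` method. The `Step` class and the fake components are hypothetical stand-ins written for this example. They are not LangChain's classes, though they mimic the `invoke` call used in this notebook.

```python
class Step:
    """Wraps a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new Step that runs a, then feeds its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Fake components with placeholder behaviour
fake_retriever = Step(lambda q: {"question": q, "context": "placeholder context"})
fake_prompt = Step(lambda d: f"Context: {d['context']}\nQuestion: {d['question']}\nAnswer:")
fake_llm = Step(lambda p: p + " placeholder answer")
fake_parser = Step(lambda text: text.split("Answer:")[-1].strip())

toy_chain = fake_retriever | fake_prompt | fake_llm | fake_parser
toy_result = toy_chain.invoke("What is RAG?")
print(toy_result)
# → placeholder answer
```

Each `|` only builds a larger callable; nothing runs until `invoke` is called with a query.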

In [None]:
from langchain.llms import HuggingFacePipeline # Import HuggingFacePipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Create the HuggingFacePipeline object
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

In [None]:
rag_chain = (
    {"context": retriever,  "question": RunnablePassthrough()}
    | prompt
    | llm_pipeline
    | output_parser
)

# **Step 9: Invoke the pipeline with a query**

Execute the entire RAG pipeline with a sample query.

**Final output**: The pipeline retrieves relevant chunks from the document, forms a context-rich prompt, and the LLM generates a concise answer based on that context.

**End-to-end flow**: This step demonstrates the full RAG cycle in action: retrieval, prompt augmentation, and generation.

In [None]:
result = rag_chain.invoke("How is the United States supporting Ukraine economically and militarily?")

In [None]:
result

# Conclusion

This RAG (retrieval-augmented generation) pipeline shows how to combine retrieval with a generative model to produce answers grounded in a source document. Setting up the environment, loading and splitting the document, generating embeddings, building a FAISS vector store, and creating a retriever give the system a way to locate the chunks most relevant to a query. The prompt template directs the language model to answer from that retrieved context, and the quantized model in the end-to-end chain keeps generation inexpensive. The same structure can be adapted to other documents and domains by swapping the loader, the embedding model, or the language model.