# 1. Document Preprocessing and Chunking

### Install Required Libraries

In [1]:
!pip install langchain==0.1.13 langchain-community langchain-google-genai sentence-transformers chromadb pypdf reportlab



### Download PDFs from Google Drive

In [2]:
import os

os.makedirs("/content/papers", exist_ok=True)

pdf_files = {
    "FewShot_Learning.pdf":"1PN3vhHXgdtmujKFN3jOML-q_Pwbg_LnB",
    "RAG_Overview.pdf":"1HYAaMKCP5WG0zaNDcDPzOGEhZMUJjOnv",
    "Transformer_Architecture.pdf":"1VlJxGYBC6ovYPi6L9dT7_KAzZ_nlqsyJ"
}

for filename, file_id in pdf_files.items():
  url = f"https://drive.google.com/uc?export=download&id={file_id}"
  output_path = f"/content/papers/{filename}"
  !wget -q --show-progress "{url}" -O "{output_path}"

print("Files downloaded successfully")

Files downloaded successfully
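
The success message above is printed unconditionally, and Google Drive can return an HTML page instead of the file (for example when a file is not shared publicly). A minimal sketch of a sanity check follows. It assumes that a valid download begins with the `%PDF-` header; `looks_like_pdf` and `find_invalid_pdfs` are hypothetical helper names, not part of the original notebook.

```python
import os

PDF_MAGIC = b"%PDF-"  # header bytes expected at the start of a PDF file

def looks_like_pdf(path):
    """Return True if the file exists and begins with the PDF header bytes."""
    try:
        with open(path, "rb") as f:
            return f.read(len(PDF_MAGIC)) == PDF_MAGIC
    except OSError:
        return False

def find_invalid_pdfs(directory):
    """Return the sorted names of .pdf files in `directory` that fail the check."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        name for name in os.listdir(directory)
        if name.lower().endswith(".pdf")
        and not looks_like_pdf(os.path.join(directory, name))
    )

bad = find_invalid_pdfs("/content/papers")
if bad:
    print("These downloads do not look like PDFs:", bad)
else:
    print("No invalid PDF files found")
```

If any file is reported, re-check its sharing settings on Drive before loading the folder.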


### Load and Chunk PDF Documents

In [3]:
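import os

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF in the folder; the loader returns one Document per page.
loader = PyPDFDirectoryLoader("/content/papers")
docs = loader.load()

# Normalize metadata so each chunk can later be cited by file name and page.
# Note: the 'page' value set by the PyPDF loader is a zero-based page index.
for i, doc in enumerate(docs):
  source = doc.metadata.get('source', f'doc_{i}.pdf')
  doc.metadata['filename'] = os.path.basename(source)
  doc.metadata['page'] = doc.metadata.get('page', i + 1)

# chunk_size and chunk_overlap are measured in characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(docs)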

print(f"Loaded {len(docs)} pages and split into {len(chunks)} chunks.")

Loaded 109 pages and split into 1422 chunks.
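
To make the effect of `chunk_size=300` and `chunk_overlap=50` concrete, here is a simplified sketch of overlapping character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs, lines, and spaces; `sliding_window_chunks` is a hypothetical helper used only for illustration.

```python
def sliding_window_chunks(text, chunk_size=300, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

# 700 characters of repeating alphabet as stand-in text
sample = "".join(chr(ord("a") + i % 26) for i in range(700))
pieces = sliding_window_chunks(sample)
print([len(p) for p in pieces])           # [300, 300, 200]
print(pieces[0][-50:] == pieces[1][:50])  # True: neighbors share 50 characters
```

The shared 50 characters mean that a sentence cut at a chunk boundary still appears whole in at least one of the neighboring chunks, as long as it is shorter than the overlap.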


# 2. Create Embeddings and Vector Store

In [4]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma

# Embed each chunk with a small sentence-transformers model and index it in Chroma.
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)

# For each query, return the 5 most similar chunks.
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
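
The retriever returns the `k` chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea on toy 3-dimensional vectors. Cosine similarity is an assumption made for illustration (the notebook does not configure or show the metric Chroma uses), and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_vecs, k=5):
    """Return the k (label, score) pairs most similar to query_vec, best first."""
    scored = [(label, cosine_similarity(query_vec, vec))
              for label, vec in indexed_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([label for label, _ in results])  # ['chunk_a', 'chunk_b']
```

A vector store does the same ranking at scale, typically with an approximate nearest-neighbor index rather than a full scan.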

# 3. End-to-End QA Chatbot

### Set Google Gemini API Key

In [5]:
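import os
from getpass import getpass

# Never hard-code an API key in a notebook; prompt for it at runtime instead.
os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google Gemini API key: ")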

### Load Gemini Model and Build RAG Chain

In [6]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.5)

def prepare_context_with_sources(documents):
  context_blocks = []
  source_citations = set()

  for doc in documents:
    filename = doc.metadata.get("filename", "unknown_file")
    page = doc.metadata.get("page", "N/A")
    content = doc.page_content.strip().replace("\n", " ")

    context_blocks.append(f"[{filename}, Page {page}]: {content}")
    source_citations.add((filename, page))

  return "\n\n".join(context_blocks), source_citations

template = """
<context>
{context}
</context>

You are an AI assistant answering questions based on academic papers.
Answer the following question truthfully and clearly using only the above context.
Do not hallucinate or make up information.

Question: {query}
"""

prompt = ChatPromptTemplate.from_template(template)

qa_history = []

def rag_with_sources(query):
  # Retrieve the top-k chunks and format them together with their citations.
  docs = retriever.invoke(query)
  context, sources = prepare_context_with_sources(docs)

  # Fill the prompt template and ask the model.
  inputs = {"context": context, "query": query}
  answer = llm.invoke(prompt.format_prompt(**inputs).to_messages())

  formatted_sources = [f"{file}, Page {page}" for file, page in sources]
  qa_entry = {
      "question": query,
      "answer": answer.content.strip(),
      "sources": formatted_sources
  }
  qa_history.append(qa_entry)

  return qa_entry
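
Because `source_citations` is a `set`, the order of the printed sources is not tied to retrieval rank and may differ between runs. A minimal sketch of an order-preserving alternative follows. `FakeDoc` is a hypothetical stand-in for a LangChain `Document` so the example runs on its own, and `ordered_unique_sources` is likewise a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document, used only in this demo."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def ordered_unique_sources(documents):
    """Return (filename, page) pairs in retrieval order, without duplicates."""
    seen = {}
    for doc in documents:
        key = (doc.metadata.get("filename", "unknown_file"),
               doc.metadata.get("page", "N/A"))
        seen.setdefault(key, None)  # dicts preserve insertion order
    return list(seen)

retrieved = [
    FakeDoc("first hit", {"filename": "RAG_Overview.pdf", "page": 9}),
    FakeDoc("second hit", {"filename": "RAG_Overview.pdf", "page": 1}),
    FakeDoc("third hit", {"filename": "RAG_Overview.pdf", "page": 9}),
    FakeDoc("fourth hit", {}),
]
print(ordered_unique_sources(retrieved))
```

With this approach the first listed source is always the best-ranked chunk, which makes the citations easier to compare across runs.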

# 4. Test the RAG System

### Sample Questions

In [7]:
sample_questions = [
    "What are the main components of a RAG model, and how do they interact?",
    "What are the two sub-layers in each encoder layer of the Transformer model?",
    "Explain how positional encoding is implemented in Transformers and why it is necessary.",
    "Describe the concept of multi-head attention in the Transformer architecture. Why is it beneficial?",
    "What is few-shot learning, and how does GPT-3 implement it during inference?"
]

for q in sample_questions:
    result = rag_with_sources(q)
    print(f"\n {result['question']}\n {result['answer']}\n Sources: {', '.join(result['sources'])}\n")


 What are the main components of a RAG model, and how do they interact?
 Based on the provided text, RAG models use an input sequence (x) to retrieve text documents (z).  These retrieved documents (z) are then used as additional context when generating the target sequence (y).  The paper mentions  pη and pθ components, but does not detail their specific roles in the interaction.
 Sources: RAG_Overview.pdf, Page 9, RAG_Overview.pdf, Page 1, RAG_Overview.pdf, Page 2


 What are the two sub-layers in each encoder layer of the Transformer model?
 The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
 Sources: Transformer_Architecture.pdf, Page 7, Transformer_Architecture.pdf, Page 2


 Explain how positional encoding is implemented in Transformers and why it is necessary.
 The Transformer architecture lacks recurrence and convolution, meaning it inherently doesn't understand the order of the input sequ

### Your Questions

In [8]:
# Simple interactive loop; type 'exit' to stop.
while True:
    user_input = input("Ask a question (or type 'exit'): ").strip()
    if user_input.lower() == "exit":
        print("Exiting Q&A.")
        break
    if not user_input:
        continue
    result = rag_with_sources(user_input)
    print(f"\n {result['question']}\n {result['answer']}\n Sources: {', '.join(result['sources'])}\n")

Ask a question (or type 'exit'): what is rag

 what is rag
 Based on the provided text, RAG refers to Retrieval Augmented Generation.  It is a method that produces more factual generations and offers more control and interpretability than methods without access to an external knowledge source.  The provided text also indicates that RAG models achieve state-of-the-art results on several open question datasets.  Code to run RAG models has been open-sourced.
 Sources: RAG_Overview.pdf, Page 9, RAG_Overview.pdf, Page 6, RAG_Overview.pdf, Page 1

Ask a question (or type 'exit'): exit
Exiting Q&A.


# 5. Download Results

### Export Q&A as PDF

In [9]:
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.pagesizes import A4

pdf_path = "/content/RAG_QA_Summary.pdf"
doc = SimpleDocTemplate(pdf_path, pagesize=A4)
styles = getSampleStyleSheet()
elements = []

for i, item in enumerate(qa_history, 1):
    elements.append(Paragraph(f"<b>Q{i}: {item['question']}</b>", styles['Heading4']))
    elements.append(Spacer(1, 6))
    elements.append(Paragraph(f"<b>Answer:</b> {item['answer']}", styles['BodyText']))
    elements.append(Spacer(1, 6))
    elements.append(Paragraph("<b>Sources:</b>", styles['BodyText']))
    for src in item["sources"]:
        elements.append(Paragraph(f"• {src}", styles['BodyText']))
    elements.append(Spacer(1, 12))

doc.build(elements)
print(f"PDF saved to: {pdf_path}")

PDF saved to: /content/RAG_QA_Summary.pdf
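
ReportLab's `Paragraph` interprets its text as XML-style markup (which is why `<b>` works above), so a question or answer containing characters such as `<` or `&` may be misparsed or raise an error. A minimal sketch of escaping the dynamic text first, using the standard library's `xml.sax.saxutils.escape`; `to_paragraph_markup` is a hypothetical helper, not part of the original notebook.

```python
from xml.sax.saxutils import escape

def to_paragraph_markup(label, text):
    """Build '<b>label</b> text' with both parts escaped for XML-style markup."""
    return f"<b>{escape(label)}</b> {escape(text)}"

markup = to_paragraph_markup("Answer:", "Cost grows when n < 512 & d is fixed")
print(markup)
# <b>Answer:</b> Cost grows when n &lt; 512 &amp; d is fixed
```

The returned string can be passed to `Paragraph(...)` in place of the f-strings used in the export cell.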


### Download the PDF in Colab

In [10]:
from google.colab import files
files.download(pdf_path)
