In [1]:
!pip install pypdf sentence-transformers faiss-cpu transformers

Collecting pypdf
  Downloading pypdf-6.1.0-py3-none-any.whl.metadata (7.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading pypdf-6.1.0-py3-none-any.whl (322 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
Installing collected packages: pypdf, faiss-cpu
Successfully installed faiss-cpu-1.12.0 pypdf-6.1.0


# Step 1: Load and Split the Document

Download the PDF and use pypdf to extract the raw text. Then split the text into fixed-size chunks of 500 characters. Chunking is a crucial step: it lets retrieval return only the passages relevant to a query, and it keeps each piece small enough to fit in the LLM's context window.

In [2]:
import requests
from pypdf import PdfReader
import io

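# Download the document
paper_url = "https://arxiv.org/pdf/1706.03762.pdf"
response = requests.get(paper_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
pdf_bytes = io.BytesIO(response.content)
reader = PdfReader(pdf_bytes)

# Extract all text; `or ""` guards against pages with no extractable text
text = "".join(page.extract_text() or "" for page in reader.pages)

# Chunk the text into fixed-size pieces of 500 characters (no overlap)
chunk_size = 500
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]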
print(f"Split document into {len(chunks)} chunks.")

Split document into 80 chunks.
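
A fixed-size split like the one above can cut a sentence in half at a chunk boundary. One common variation, which this notebook does not use, is to let neighbouring chunks overlap by a few characters so that text near a boundary appears whole in at least one chunk. The sketch below is only an illustration: the helper name `chunk_with_overlap` and the `overlap` parameter are assumptions, not part of the original code.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces


sample = "abcdefghij" * 3  # 30 characters of stand-in text
pieces = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(len(pieces), pieces[:2])
```

Setting `overlap=0` reproduces the non-overlapping split used in the cell above.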


# Step 2: Vectorization and Indexing (The "R" in RAG)
Next, we convert each text chunk into a numerical representation called an embedding. These embeddings are then stored in a vector index for fast semantic search.

## Create Embeddings:
Use a pre-trained SentenceTransformer model to convert the text chunks into vectors. The model maps semantically similar text to vectors that are close to each other in a high-dimensional space.
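
To make "close to each other" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors and their labels are invented for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions and come from the model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (made-up values, not model output)
toy_vectors = {
    "attention layer": [0.9, 0.1, 0.0],
    "self-attention mechanism": [0.8, 0.2, 0.1],
    "training schedule": [0.1, 0.2, 0.9],
}

reference = toy_vectors["attention layer"]
for label, vec in toy_vectors.items():
    print(f"{label}: {cosine_similarity(reference, vec):.3f}")
```

In this toy setup, the two attention-related vectors score much higher against each other than either does against the unrelated one, which is the property retrieval relies on.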



In [3]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load an embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
chunk_embeddings = embedding_model.encode(chunks)


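## Build a Vector Index:
Use FAISS to build an index over the embeddings so that the chunks closest to a query can be looked up quickly. IndexFlatL2 performs an exact, exhaustive search by L2 distance, which is fast enough for the 80 chunks here; for much larger collections, FAISS also provides approximate indexes such as IndexIVFFlat and IndexHNSWFlat.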

In [5]:
# Create a FAISS index
index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
index.add(np.array(chunk_embeddings).astype('float32'))
print("FAISS index created successfully. ✅")

FAISS index created successfully. ✅
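
As a rough mental model of what a flat L2 index does at query time, the sketch below compares the query against every stored vector and keeps the k smallest squared L2 distances. It is a plain-Python illustration on made-up 2-D vectors, not FAISS's implementation; `flat_l2_search` and `toy_index` are hypothetical names.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, ids) of the k stored vectors nearest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared L2 distance between the stored vector and the query
        sq_dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((sq_dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Made-up 2-D vectors standing in for chunk embeddings
toy_index = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 0.0]]
toy_distances, toy_ids = flat_l2_search(toy_index, query=[0.9, 0.2], k=2)
print(toy_ids, toy_distances)
```

The returned ids play the same role as the indices that `index.search` returns: they are positions in the list of stored vectors, which is why they can be used to look up the matching entries in `chunks`.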


# Step 3: Retrieval and Generation (The "A" and "G" in RAG)
This is the final step, where you combine the retrieval and generation components to answer a question.

## Load the LLM:
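Use a transformers pipeline to load a small LLM for text generation. The code below uses distilgpt2, a distilled version of GPT-2. An extractive Q&A model such as distilbert-base-cased-distilled-squad would need the question-answering pipeline instead, so it cannot be dropped into the text-generation pipeline used here.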

In [8]:
from transformers import pipeline

# Load a pre-trained LLM for text generation
# distilgpt2 is a causal language model, so it works with the text-generation pipeline
generator = pipeline("text-generation", model="distilgpt2")


Device set to use cuda:0


## Build the RAG Pipeline:
Create a function that:

1. Takes a user query.
2. Retrieves the top k most similar chunks from your FAISS index.
3. Augments a prompt by adding the retrieved chunks as context.
4. Generates a final answer using the LLM and the augmented prompt.

In [9]:
def rag_query(query_text, k=2):
    # 1. Retrieval
    query_embedding = embedding_model.encode([query_text]).astype('float32')
    _, indices = index.search(query_embedding, k)
    retrieved_chunks = [chunks[i] for i in indices[0]]

    # 2. Augmentation & Prompting
    context = "\n\n".join(retrieved_chunks)
    prompt = f"Based on the following context, answer the question.\n\nContext:\n{context}\n\nQuestion: {query_text}\nAnswer:"

    # 3. Generation
    # max_new_tokens caps the length of the generated continuation
    response = generator(prompt, max_new_tokens=256, truncation=True)
    return response[0]['generated_text']

# Example Usage
query = "What is the main purpose of the self-attention mechanism?"
answer = rag_query(query)
print(f"Question: {query}\nAnswer:\n{answer}")



Question: What is the main purpose of the self-attention mechanism?
Answer:
Based on the following context, answer the question.

Context:
ate to sequence lengths longer than the ones encountered
during training.
4 Why Self-Attention
In this section we compare various aspects of self-attention layers to the recurrent and convolu-
tional layers commonly used for mapping one variable-length sequence of symbol representations
(x1, ..., xn) to another sequence of equal length (z1, ..., zn), with xi, zi ∈ Rd, such as a hidden
layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we
consider th

ub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-
wise fully connected feed-forward network. We employ a residual connection [11] around each of
the two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is
LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function im