
In [None]:
!pip install faiss-cpu transformers torch numpy

In [9]:
corpus = [
    "Isaac Newton formulated the laws of motion and universal gravitation.",
    "The capital of France is Paris, known for the Eiffel Tower.",
    "Python is a popular programming language used in AI and web development.",
    "Mahatma Gandhi led India to independence through non-violent resistance.",
    "The Moon orbits the Earth and influences ocean tides."
]

print("Dummy corpus loaded:", len(corpus), "documents")


Dummy corpus loaded: 5 documents


In [10]:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import torch
import faiss
import numpy as np

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

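def encode_passages(passages, batch_size=16):
    """Embed passages with the DPR context encoder; returns a (len(passages), dim) array."""
    embeddings = []
    ctx_encoder.eval()  # inference mode: disables dropout
    with torch.no_grad():  # no gradients are needed for encoding
        for i in range(0, len(passages), batch_size):
            batch = passages[i:i+batch_size]
            # Pad to the longest passage in the batch; truncate to the model's maximum length.
            inputs = ctx_tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
            outputs = ctx_encoder(**inputs).pooler_output
            embeddings.append(outputs.cpu().numpy())
    return np.vstack(embeddings)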

passage_embeddings = encode_passages(corpus)

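dim = passage_embeddings.shape[1]
# Exact (brute-force) search over all vectors, ranked by L2 distance.
# DPR is trained with inner-product similarity, so faiss.IndexFlatIP(dim) is the matching metric;
# rankings under the two metrics can differ.
index = faiss.IndexFlatL2(dim)
index.add(passage_embeddings)
print("FAISS index built with", index.ntotal, "vectors")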

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification mod

FAISS index built with 5 vectors
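
The index above ranks by Euclidean distance, while DPR encoders are trained to score passages by inner product. The two metrics can order the same passages differently when vector norms vary. The following is a minimal pure-Python sketch of that effect, using made-up 2-D vectors rather than real DPR embeddings; `rank_by_inner_product` and `rank_by_l2` are hypothetical helper names, not FAISS functions.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rank_by_inner_product(query, passages):
    """Passage indices, highest inner product first."""
    return sorted(range(len(passages)), key=lambda i: -inner_product(query, passages[i]))

def rank_by_l2(query, passages):
    """Passage indices, smallest squared L2 distance first."""
    return sorted(range(len(passages)), key=lambda i: squared_l2(query, passages[i]))

# Illustrative vectors only (assumption): not taken from the encoders above.
query = [1.0, 0.0]
passages = [
    [3.0, 0.0],  # same direction as the query, large norm
    [1.0, 0.5],  # close to the query in space, smaller norm
]

ip_order = rank_by_inner_product(query, passages)
l2_order = rank_by_l2(query, passages)
print("inner product order:", ip_order)  # [0, 1]
print("L2 order:", l2_order)             # [1, 0]
```

In FAISS, `faiss.IndexFlatIP(dim)` is the inner-product counterpart of `IndexFlatL2`; whether switching changes the results for this small corpus has not been checked here.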


In [12]:
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer

question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

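def retrieve_documents(question, k=3):
    """Return up to k passages from corpus nearest to the question embedding."""
    inputs = question_tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        question_emb = question_encoder(**inputs).pooler_output.cpu().numpy()
    # search returns (distances, indices), each of shape (1, k); only the indices are used here.
    _, indices = index.search(question_emb, k)
    # FAISS fills missing results with -1 when k exceeds the index size; skip those slots.
    return [corpus[i] for i in indices[0] if i != -1]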


Some weights of the model checkpoint at facebook/dpr-question_encoder-single-nq-base were not used when initializing DPRQuestionEncoder: ['question_encoder.bert_model.pooler.dense.bias', 'question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRQuestionEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRQuestionEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [13]:

question = "Who discovered gravity?"
top_docs = retrieve_documents(question)
for i, doc in enumerate(top_docs):
    print(f"[Doc {i+1}]:", doc[:200], "\n")

[Doc 1]: Isaac Newton formulated the laws of motion and universal gravitation. 

[Doc 2]: Python is a popular programming language used in AI and web development. 

[Doc 3]: Mahatma Gandhi led India to independence through non-violent resistance. 
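
With k=3 and a five-document corpus, only the first hit is relevant to the question; the other two are returned simply because k results were requested. `index.search` also returns distances, which `retrieve_documents` discards. One possible way to use them, sketched below under assumptions, is to keep only hits whose distance is close to the best one. The function name `filter_hits`, the `max_ratio` knob, and the distance values are all hypothetical and chosen for illustration.

```python
def filter_hits(docs, distances, max_ratio=1.2):
    """Keep hits whose distance is within max_ratio of the best (smallest) distance.

    max_ratio is a hypothetical tuning knob, not a FAISS parameter.
    """
    if not docs:
        return []
    best = min(distances)
    return [doc for doc, dist in zip(docs, distances) if dist <= best * max_ratio]

# Illustrative values only (assumption): not measured from the index above.
docs = ["newton passage", "python passage", "gandhi passage"]
distances = [60.0, 95.0, 97.0]

kept = filter_hits(docs, distances)
print(kept)  # ['newton passage']
```

A suitable cutoff depends on the encoder and the corpus, so any threshold would need to be tuned on real queries.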



In [15]:
from transformers import BartTokenizer, BartForConditionalGeneration

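# facebook/bart-large is the pretrained denoising checkpoint; it is not fine-tuned for question answering.
bart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")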
bart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

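def generate_answer(question, documents):
    """Generate text from the question and the retrieved documents."""
    # Pair the question with each document, then join the pairs into one sequence.
    # Note that the question is repeated once per document.
    inputs = [question + " </s> " + doc for doc in documents]
    input_text = " </s> ".join(inputs)
    input_tokens = bart_tokenizer([input_text], return_tensors="pt", truncation=True, padding=True, max_length=1024)

    with torch.no_grad():
        # Beam search with 4 beams, capped at 100 output tokens.
        outputs = bart_model.generate(**input_tokens, max_length=100, num_beams=4, early_stopping=True)

    return bart_tokenizer.decode(outputs[0], skip_special_tokens=True)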


In [17]:
answer = generate_answer(question, top_docs)
print("\n Final Answer:\n", answer)


 Final Answer:
 Who discovered gravity?  Isaac Newton formulated the laws of motion and universal gravitation.  Who invented the computer? 
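
The output above mostly echoes the input rather than answering the question. Two likely contributors are that the generator was only pretrained to reconstruct text, and that the input repeats the question once per document. A minimal sketch of an alternative input layout follows: it states the question once and numbers the retrieved passages. The function name `build_prompt`, the template wording, and the `max_chars_per_doc` parameter are assumptions for illustration, not part of the notebook.

```python
def build_prompt(question, documents, max_chars_per_doc=200):
    """Lay out numbered context passages followed by a single copy of the question."""
    context_lines = [
        f"[{i}] {doc[:max_chars_per_doc]}"
        for i, doc in enumerate(documents, start=1)
    ]
    return "\n".join(["Context:", *context_lines, f"Question: {question}", "Answer:"])

prompt = build_prompt(
    "Who discovered gravity?",
    [
        "Isaac Newton formulated the laws of motion and universal gravitation.",
        "The Moon orbits the Earth and influences ocean tides.",
    ],
)
print(prompt)
```

A layout like this only helps if the generator has been trained to answer from context, for example a checkpoint fine-tuned for question answering; which checkpoint to use is left open here.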
