# **Retrieval-Augmented Generation Architecture Implementation on Lightning**

A RAG model encapsulates two core components: a question encoder and a generator.
During a forward pass, the question encoder embeds the input, and the retriever uses
that embedding to fetch relevant context documents. The documents are then prepended to the input,
and the resulting contextualized inputs are passed to the generator.
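
The flow described above can be sketched with a toy example. This is only an illustration under loud assumptions: `encode` is a bag-of-words stand-in for the neural question encoder, `PASSAGES` is a hypothetical three-passage knowledge base, and the `" // "` separator is an arbitrary choice. No generator is called; the sketch stops at the contextualized input that a generator would receive.

```python
from collections import Counter

# Hypothetical three-passage knowledge base, for illustration only.
PASSAGES = [
    "Pneumonia is an infection of the lungs.",
    "Malaria is spread by mosquito bites.",
    "Regular exercise supports heart health.",
]

def encode(text):
    """Toy stand-in for a neural encoder: a bag-of-words count vector."""
    words = text.lower().replace(".", " ").replace("?", " ").split()
    return Counter(words)

def score(query_vec, passage_vec):
    """Inner product between two sparse count vectors."""
    return sum(count * passage_vec[word] for word, count in query_vec.items())

def retrieve(question, passages, k=1):
    """Return the k passages with the highest inner-product score."""
    query_vec = encode(question)
    ranked = sorted(passages, key=lambda p: score(query_vec, encode(p)), reverse=True)
    return ranked[:k]

def build_generator_input(question, passages, k=1):
    """Prepend the retrieved passages to the question."""
    context = " ".join(retrieve(question, passages, k))
    return f"{context} // {question}"

generator_input = build_generator_input("What is pneumonia?", PASSAGES)
print(generator_input)
```

In the real model the count vectors are replaced by dense DPR embeddings and the sorted scan by a FAISS index lookup, but the order of operations is the same.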

In [1]:
#from google.colab import drive
#drive.mount('/content/drive')

## Setting Up the Environment

In [2]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 16.3 gigabytes of available RAM

Not using a high-RAM runtime


In [3]:
import os

# Path to the repository directory
repo_directory = "/teamspace/studios/this_studio/RAG-end2end"

# Change the current working directory to the cloned repository directory
os.chdir(repo_directory)

In [4]:
%pwd

/teamspace/studios/this_studio/RAG-end2end


In [39]:
!huggingface-cli scan-cache -vvv

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


REPO ID                                REPO TYPE REVISION                                 SIZE ON DISK NB FILES LAST_MODIFIED  REFS LOCAL PATH                                                                                                                           
-------------------------------------- --------- ---------------------------------------- ------------ -------- -------------- ---- ------------------------------------------------------------------------------------------------------------------------------------ 
facebook/dpr-ctx_encoder-multiset-base model     fdb3d46584386d2f20aa00724ae31cebc348d16b       438.7M        5 23 minutes ago main /home/zeus/.cache/huggingface/hub/models--facebook--dpr-ctx_encoder-multiset-base/snapshots/fdb3d46584386d2f20aa00724ae31cebc348d16b 
facebook/rag-token-nq                  model     c269b105d2322e9386b629a0a8663d20863a5167         2.1G        9 19 minutes ago main /home/zeus/.cache/huggingface/hub/models--facebook--rag-token-nq/snaps
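
The `huggingface/tokenizers` warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming the cell is run near the top of the notebook, before `transformers` or `tokenizers` is imported:

```python
import os

# Set this before importing transformers or tokenizers so it takes effect
# ahead of any fork triggered by shell commands run from the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` trades some tokenization speed for silence and safety around forks; `"true"` is the other accepted value.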

# RAG End-to-End Model Implementation

In [None]:
%pip install -r rag-end2end-retriever/requirements.txt

In [17]:
%cat rag-end2end-retriever/requirements.txt

faiss-cpu >= 1.7.2
datasets 
psutil >= 5.9.1
torch == 1.11.0

torchtext == 0.12.0

pytorch-lightning == 1.6.4
nvidia-ml-py3 == 7.352.0
ray >=  1.13.0
GitPython

transformers


In [18]:
!ls -la rag-end2end-retriever/

total 148
drwxr-xr-x 1 oscarkaruga1 oscarkaruga1  4096 May 28 09:22 .
drwxr-xr-x 1 oscarkaruga1 oscarkaruga1  4096 May 28 09:22 ..
drwxr-xr-x 4 oscarkaruga1 oscarkaruga1  4096 May 28 09:22 Health-data
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1  3519 May 27 20:49 README.md
drwxr-xr-x 1 oscarkaruga1 oscarkaruga1  4096 May 28 09:21 __pycache__
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1  4473 May 27 20:49 callbacks_rag.py
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1  8211 May 27 20:49 distributed_ray_retriever.py
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1 11211 May 27 20:49 eval_rag.py
drwxr-xr-x 2 oscarkaruga1 oscarkaruga1  4096 May 28 09:22 evaluation
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1 33787 May 27 20:49 finetune_rag.py
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1  2069 May 27 20:49 finetune_rag_ray_end2end.sh
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1  3178 May 27 20:49 kb_encode_utils.py
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1 16091 May 27 20:49 lightning_base.py
-rwxr--r-- 1 oscarkaruga1 oscarkaruga1   174 Ma

In [21]:
!ls rag-end2end-retriever/Health-data/health-data/

NishauriGPT-Data.csv  my_knowledge_dataset
NishauriGPT-Data.tsv  my_knowledge_dataset_hnsw_index.faiss


### Setting up the data for the model, i.e. indexing

In [20]:
%%bash
python rag-end2end-retriever/use_own_knowledge_dataset.py \
    --csv_path rag-end2end-retriever/Health-data/health-data/NishauriGPT-Data.tsv \
    --output_dir rag-end2end-retriever/Health-data/health-data

INFO:__main__:Step 1 - Create the dataset
Generating train split: 93 examples [00:00, 3322.18 examples/s]
Map: 100%|██████████| 93/93 [00:00<00:00, 16995.04 examples/s]
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected
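
"Step 1 - Create the dataset" in the log above turns the TSV into passages before they are embedded and indexed. The sketch below shows one plausible version of that step. It assumes the TSV has two tab-separated columns (title, text) and that long texts are split into passages of 100 words, as the upstream Hugging Face example script does; the script in this repository may differ. The TSV contents are hypothetical stand-ins for `NishauriGPT-Data.tsv`.

```python
import csv
import io

def split_text(text, n=100, sep=" "):
    """Split text into chunks of at most n words."""
    words = text.split(sep)
    return [sep.join(words[i:i + n]).strip() for i in range(0, len(words), n)]

def split_documents(rows, n=100):
    """Expand (title, text) rows into one entry per passage."""
    titles, texts = [], []
    for title, text in rows:
        for passage in split_text(text, n):
            titles.append(title)
            texts.append(passage)
    return {"title": titles, "text": texts}

# Hypothetical two-row TSV: one long document and one short one.
tsv = "Pneumonia\t" + " ".join(["word"] * 250) + "\nMalaria\tA short passage.\n"
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))

passages = split_documents(rows)
print(len(passages["text"]), passages["title"])
```

Each passage would then be embedded with the DPR context encoder and stored in the `embeddings` column that the FAISS index is built on.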

In [25]:
!ls rag-end2end-retriever/Health-data/health-training-data

test.source  test.target  train.source	train.target  val.source  val.target


In [26]:
!ls rag-end2end-retriever/Health-data/health-data/

NishauriGPT-Data.csv  my_knowledge_dataset
NishauriGPT-Data.tsv  my_knowledge_dataset_hnsw_index.faiss


### Final Finetune Script

In [None]:
%%bash
# Start a single-node Ray cluster.
ray start --head

# Fine-tune the RAG model end to end

python rag-end2end-retriever/finetune_rag.py \
    --data_dir rag-end2end-retriever/Health-data/health-training-data \
    --output_dir rag-end2end-retriever/model_checkpoints2 \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --index_name custom \
    --passages_path rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset \
    --index_path rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss \
    --csv_path rag-end2end-retriever/Health-data/health-data/NishauriGPT-Data.tsv \
    --distributed_retriever ray \
    --gpus 1  \
    --context_encoder_name facebook/dpr-ctx_encoder-multiset-base \
    --fp16 \
    --profile \
    --do_train \
    --end2end \
    --do_predict \
    --n_val -1  \
    --train_batch_size 1 \
    --eval_batch_size 1 \
    --max_source_length 128 \
    --max_target_length 25 \
    --val_max_target_length 25 \
    --test_max_target_length 25 \
    --label_smoothing 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --weight_decay 0.001 \
    --adam_epsilon 1e-08 \
    --max_grad_norm 0.1 \
    --learning_rate 3e-05 \
    --num_train_epochs 10 \
    --warmup_steps 500 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler polynomial \
    --indexing_freq 5 \
    --gpu_order [5,6,7,8,9,0,1,2,3,4] \
    --index_gpus 1

# Stop the ray cluster
ray stop

Note: `--gradient_accumulation_steps` was reduced from 8 to 4. With `--train_batch_size 1` on a single GPU, the effective batch size is therefore 4.
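
To make the effect of the accumulation setting concrete, the arithmetic can be sketched as follows. The batch size, GPU count, and accumulation values come from the script above; `NUM_EXAMPLES` is a hypothetical training-set size used only for illustration, and the step count is an approximation.

```python
import math

def effective_batch_size(train_batch_size, gradient_accumulation_steps, gpus=1):
    """Number of examples contributing to each optimizer step."""
    return train_batch_size * gradient_accumulation_steps * gpus

def optimizer_steps_per_epoch(num_examples, train_batch_size, gradient_accumulation_steps, gpus=1):
    """Approximate optimizer updates per epoch (a partial last batch is rounded up)."""
    per_step = effective_batch_size(train_batch_size, gradient_accumulation_steps, gpus)
    return math.ceil(num_examples / per_step)

NUM_EXAMPLES = 1000  # hypothetical training-set size, not taken from the notebook

for accumulation in (8, 4):
    print(
        accumulation,
        effective_batch_size(1, accumulation),
        optimizer_steps_per_epoch(NUM_EXAMPLES, 1, accumulation),
    )
```

Halving the accumulation steps halves the effective batch size and roughly doubles the number of optimizer updates per epoch, which also changes how quickly the `--warmup_steps 500` budget is consumed.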

## Retrieval-Only Evaluation

### Health Data Evaluation

In [None]:
!ls rag-end2end-retriever/Health-data/health-training-data/

test.source  test.target  train.source	train.target  val.source  val.target


In [None]:
%%bash
# --evaluation_set takes the questions (.source); --gold_data_path takes the references (.target)
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/Health-data/health-training-data/train.source \
    --gold_data_path rag-end2end-retriever/Health-data/health-training-data/train.target \
    --predictions_path evaluation/output/retrieval_preds.tsv \
    --eval_mode retrieval \
    --k 1
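
The `--k 1` flag controls how many retrieved passages are scored per question. A minimal sketch of a precision@k metric, assuming the evaluation compares retrieved passage titles against gold titles (as the upstream Hugging Face `eval_rag.py` does in retrieval mode); the function names and the sample titles are hypothetical:

```python
def precision_at_k(retrieved_titles, gold_titles, k):
    """Fraction of the top-k retrieved titles that appear in the gold set."""
    top_k = set(retrieved_titles[:k])
    return len(top_k & set(gold_titles)) / k

def mean_precision_at_k(all_retrieved, all_gold, k):
    """Average precision@k over all questions."""
    scores = [precision_at_k(r, g, k) for r, g in zip(all_retrieved, all_gold)]
    return sum(scores) / len(scores)

# Hypothetical retrieval results for two questions.
all_retrieved = [["Pneumonia", "Malaria"], ["Diabetes", "Malaria"]]
all_gold = [["Pneumonia"], ["Malaria"]]

print(mean_precision_at_k(all_retrieved, all_gold, k=1))
print(mean_precision_at_k(all_retrieved, all_gold, k=2))
```

With `k=1` only the single best passage counts, so the score is the share of questions whose top hit is a gold passage.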

## End-to-End Evaluation

In [None]:
%%bash
# --evaluation_set takes the questions (.source); --gold_data_path takes the gold answers (.target)
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/Health-data/health-training-data/train.source \
    --gold_data_path rag-end2end-retriever/Health-data/health-training-data/train.target \
    --predictions_path rag-end2end-retriever/evaluation/e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode ans \
    --n_docs 5 \
    --print_predictions \
    --recalculate
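
End-to-end evaluation of generated answers is commonly reported as exact match (EM) and token-level F1. The sketch below shows SQuAD-style versions of both metrics as an illustration; it is an assumption that `eval_rag.py` normalizes answers in this way, and the example strings are hypothetical.

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction/gold pair.
print(exact_match("The lungs.", "lungs"))
print(f1_score("infection of the lungs", "lung infection"))
```

EM is strict and rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is useful when generated answers are short fragments.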

# Testing the Model

## Out of the Box Implementation

In [None]:
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)

model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# prepare_seq2seq_batch is deprecated; calling the tokenizer directly encodes the question
input_dict = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

# should give michael phelps => sounds reasonable

## Retriever component

In [None]:
from transformers import RagRetriever, DPRQuestionEncoderTokenizer, DPRQuestionEncoder

# Paths to the knowledge dataset and its FAISS index
dataset_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset"
index_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset_hnsw_index.faiss"

# Build the retriever from the base RAG checkpoint; the keyword arguments
# override the checkpoint's RagConfig so that it points at the custom index
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-base",
    index_name="custom",
    passages_path=dataset_path,
    index_path=index_path,
)

# Create a query
query = "What does Moses' rod turn into ?"

# Load the question encoder and its tokenizer
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

# Embed the query
inputs = tokenizer(query, return_tensors="pt")
question_embeddings = question_encoder(**inputs).pooler_output

# Retrieve the top passages; retrieve() expects a NumPy array of query embeddings
doc_embeds, doc_ids, doc_dicts = retriever.retrieve(question_embeddings.detach().numpy(), n_docs=5)

# Inspect the results (one dict of titles and texts per query)
for title, text in zip(doc_dicts[0]["title"], doc_dicts[0]["text"]):
    print(title, "->", text)


## Generator Component

In [None]:
from transformers import AutoTokenizer, RagRetriever, RagTokenForGeneration
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
# initialize with RagRetriever to do everything in one forward call
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("How many people live in Paris?", return_tensors="pt")

targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")

input_ids = inputs["input_ids"]
labels = targets["input_ids"]

outputs = model(input_ids=input_ids, labels=labels)

In [29]:
# To decode the generated response (if the model is in generation mode, not training):
generated_ids = model.generate(input_ids)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

generated_text

' 270,000,000'

## Own Knowledge Base Test

In [30]:
%ls rag-end2end-retriever/Health-data/health-data


NishauriGPT-Data.csv  my_knowledge_dataset
NishauriGPT-Data.tsv  my_knowledge_dataset_hnsw_index.faiss


In [31]:
import os
import logging
from pathlib import Path

import faiss
from datasets import load_from_disk
from transformers import (
    RagRetriever,
    RagSequenceForGeneration,
    RagTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Define the paths to the dataset and the index for health data

# dataset_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset"
# index_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss"

# # Define the test data
# dataset_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset"
# index_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset_hnsw_index.faiss"

In [33]:
!ls rag-end2end-retriever/model_checkpoints2/


ls: cannot access 'rag-end2end-retriever/model_checkpoints2/': No such file or directory


In [34]:
# Define the paths to the dataset and the index for health data

dataset_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset"
index_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss"

# checkpoint_path = "rag-end2end-retriever/model_checkpoints2/checkpoint83"
# dpr_checkpoint_path = "rag-end2end-retriever/model_checkpoints2/dpr_ctx_checkpoint/checkpoint83"

In [35]:
# Step 1: Load the dataset from disk
logger.info("Loading dataset from disk")
dataset = load_from_disk(dataset_path)

# Step 2: Load the Faiss index and attach it to the dataset
logger.info("Loading and attaching Faiss index")
dataset.load_faiss_index("embeddings", index_path)

INFO:__main__:Loading dataset from disk
INFO:__main__:Loading and attaching Faiss index


In [36]:
dataset

Dataset({
    features: ['text', 'title', 'embeddings'],
    num_rows: 94
})

In [None]:
# Step 3: Initialize the RAG Retriever with the loaded dataset and index
logger.info("Initializing RAG Retriever and Model")
retriever = RagRetriever.from_pretrained("Oscar066/RAG-end2end-Model", index_name="custom", indexed_dataset=dataset)

model = RagSequenceForGeneration.from_pretrained("Oscar066/RAG-end2end-Model", retriever=retriever)

tokenizer = RagTokenizer.from_pretrained("Oscar066/RAG-end2end-Model")

In [None]:
# Load the question encoder and its tokenizer
# question_encoder = DPRQuestionEncoder.from_pretrained("Oscar066/RAG-end2end-Model")
# question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("Oscar066/RAG-end2end-Model")

In [None]:
# The RAG checkpoint already bundles the question encoder and its tokenizer, so reuse them.
# The local dpr_ctx_checkpoint directory holds a context encoder, not a question encoder,
# and model_checkpoints2/ is not present in this environment.
question_encoder = model.rag.question_encoder
question_tokenizer = tokenizer.question_encoder

In [25]:
# Example usage
question = "What is Bacterial pneumonia?"
input_ids = question_tokenizer(question, return_tensors="pt")["input_ids"]
#input_ids = tokenizer.question_encoder(question, return_tensors="pt")["input_ids"]

In [26]:
# Generate answer using the model
generated = model.generate(input_ids)
generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

In [27]:
print("Q: " + question)
print("A: " + generated_string)

Q: What is Bacterial pneumonia?
A: B pneumonia. Bacterial pneumonia            
