# Retrieval-Augmented Generation Architecture Implementation 1

A RAG model encapsulates two core components: a question encoder and a generator.
During a forward pass, the input is encoded with the question encoder and the encoding is passed
to the retriever, which extracts relevant context documents. The documents are then prepended to the input,
and these contextualized inputs are passed to the generator.
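
The flow described above can be sketched with a toy example. This is a minimal illustration, not the Hugging Face implementation: `embed` is a bag-of-words stand-in for the DPR encoders, the three documents are made up, and `TITLE_SEP` is an assumed separator (`DOC_SEP` matches the `doc_sep` value in the config dumps further down).

```python
from collections import Counter

DOC_SEP = " // "   # same value as "doc_sep" in the RagConfig output in this notebook
TITLE_SEP = " / "  # assumed separator between a document title and its text


def embed(text):
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def score(query_vec, doc_vec):
    """Inner product between two sparse count vectors."""
    return sum(count * doc_vec[token] for token, count in query_vec.items())


def retrieve(question, docs, n_docs=2):
    """Return the n_docs highest-scoring documents for the question."""
    query_vec = embed(question)
    ranked = sorted(docs, key=lambda d: score(query_vec, embed(d["text"])), reverse=True)
    return ranked[:n_docs]


def contextualize(question, retrieved):
    """Prepend each retrieved document to the question (one generator input per document)."""
    return [d["title"] + TITLE_SEP + d["text"] + DOC_SEP + question for d in retrieved]


# Made-up knowledge base, for illustration only
docs = [
    {"title": "Thrush", "text": "Candidiasis is a fungal infection caused by yeast"},
    {"title": "Pneumonia", "text": "Bacterial pneumonia is a lung infection caused by bacteria"},
    {"title": "Malaria", "text": "Malaria is spread by mosquitoes"},
]
question = "What is bacterial pneumonia?"
generator_inputs = contextualize(question, retrieve(question, docs))
for text in generator_inputs:
    print(text)
```

In the real model the generator then scores or generates an answer for each contextualized input and combines the results using the retrieval scores.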


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Setting Up the Environment

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


In [None]:
import os

# Path to the cloned repository directory
repo_directory = "/content/drive/MyDrive/RAG-end2end"

# Change the current working directory to the cloned repository directory
os.chdir(repo_directory)

In [None]:
!pwd

/content/drive/MyDrive/RAG-end2end


# RAG End-to-End Model Implementation

In [None]:
!pip install -r rag-end2end-retriever/requirements.txt



In [None]:
%ls -la rag-end2end-retriever/

total 122
-rw------- 1 root root  4473 May 20 09:42 callbacks_rag.py
-rw------- 1 root root  8211 May  8 09:46 distributed_ray_retriever.py
-rw------- 1 root root 11211 May  8 09:46 eval_rag.py
drwx------ 2 root root  4096 May 26 12:03 evaluation
-rw------- 1 root root 33787 May 21 13:37 finetune_rag.py
-rw------- 1 root root  2069 May  8 09:46 finetune_rag_ray_end2end.sh
drwx------ 3 root root  4096 May 24 10:31 Health-data
-rw------- 1 root root  3178 May  8 09:46 kb_encode_utils.py
-rw------- 1 root root 16091 May 20 09:47 lightning_base.py
drwx------ 2 root root  4096 May  8 12:31 model_checkpoints
drwx------ 2 root root  4096 May 24 10:51 model_checkpoints2
drwx------ 2 root root  4096 May  8 12:25 __pycache__
-rw------- 1 root root  3519 May  8 09:46 README.md
-rw------- 1 root root   160 May 20 09:57 requirements.txt
drwx------ 2 root root  4096 May  8 09:46 test_run
-rw------- 1 root root  6986 May  8 09:46 use_own_knowledge_dataset.py
-rw------- 1 root root  8107 May  8 09:46 

In [None]:
%ls rag-end2end-retriever/Health-data/health-data/

my_knowledge_dataset		       NishauriGPT-Data.csv
my_knowledge_dataset_hnsw_index.faiss  NishauriGPT-Data.tsv


### Setting Up the Data for the Model (Indexing)

In [None]:
%%bash
python rag-end2end-retriever/use_own_knowledge_dataset.py \
    --csv_path rag-end2end-retriever/Health-data/health-data/NishauriGPT-Data.tsv \
    --output_dir rag-end2end-retriever/Health-data/health-data

INFO:__main__:Step 1 - Create the dataset
Generating train split: 0 examples [00:00, ? examples/s]Generating train split: 93 examples [00:00, 1778.49 examples/s]
Map:   0%|          | 0/93 [00:00<?, ? examples/s]Map: 100%|██████████| 93/93 [00:00<00:00, 24050.20 examples/s]
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load fr
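
Before encoding, the indexing script is understood to read the TSV as `title<TAB>text` rows and split each text into passages of roughly 100 words. The sketch below mirrors that idea with the standard library only. `load_passages` and the sample rows are hypothetical, and the real script additionally embeds each passage with the DPR context encoder and builds the FAISS HNSW index.

```python
import csv
import io


def split_text(text, n=100, sep=" "):
    """Split text into chunks of at most n words."""
    words = text.split(sep)
    return [sep.join(words[i:i + n]).strip() for i in range(0, len(words), n)]


def load_passages(tsv_file, n=100):
    """Read title<TAB>text rows and return one record per passage."""
    passages = []
    for title, text in csv.reader(tsv_file, delimiter="\t"):
        for chunk in split_text(text, n):
            passages.append({"title": title, "text": chunk})
    return passages


# Hypothetical two-row knowledge base: one long document, one short one
long_text = " ".join(f"w{i}" for i in range(250))
sample_tsv = io.StringIO(f"Doc A\t{long_text}\nDoc B\tshort text\n")

passages = load_passages(sample_tsv)
for p in passages:
    print(p["title"], len(p["text"].split()))
```

A long row therefore contributes several passages to the knowledge base, so the number of indexed rows can exceed the number of rows in the TSV.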

In [None]:
%ls rag-end2end-retriever/Health-data/health-training-data

test.source  test.target  train.source	train.target  val.source  val.target



In [None]:
%ls rag-end2end-retriever/test_run/dummy-kb

my_knowledge_dataset  my_knowledge_dataset.csv	my_knowledge_dataset_hnsw_index.faiss


### Final Finetune Script

In [None]:
%%bash
# Start a single-node Ray cluster.
ray start --head

# finetuning the RAG

python rag-end2end-retriever/finetune_rag.py \
    --data_dir rag-end2end-retriever/Health-data/health-training-data \
    --output_dir rag-end2end-retriever/model_checkpoints2 \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --index_name custom \
    --passages_path rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset \
    --index_path rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss \
    --csv_path rag-end2end-retriever/Health-data/health-data/NishauriGPT-Data.tsv \
    --distributed_retriever ray \
    --gpus 1  \
    --context_encoder_name facebook/dpr-ctx_encoder-multiset-base \
    --fp16 \
    --profile \
    --do_train \
    --end2end \
    --do_predict \
    --n_val -1  \
    --train_batch_size 1 \
    --eval_batch_size 1 \
    --max_source_length 128 \
    --max_target_length 25 \
    --val_max_target_length 25 \
    --test_max_target_length 25 \
    --label_smoothing 0.1 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --weight_decay 0.001 \
    --adam_epsilon 1e-08 \
    --max_grad_norm 0.1 \
    --learning_rate 3e-05 \
    --num_train_epochs 10 \
    --warmup_steps 500 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler polynomial \
    --indexing_freq 5 \
    --gpu_order [5,6,7,8,9,0,1,2,3,4] \
    --index_gpus 1 \

# Stop the ray cluster
ray stop

2024-05-26 08:08:12,286	INFO usage_lib.py:467 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2024-05-26 08:08:12,286	INFO scripts.py:764 -- Local node IP: 172.28.0.12
2024-05-26 08:08:18,486	SUCC scripts.py:801 -- --------------------
2024-05-26 08:08:18,486	SUCC scripts.py:802 -- Ray runtime started.
2024-05-26 08:08:18,486	SUCC scripts.py:803 -- --------------------
2024-05-26 08:08:18,486	INFO scripts.py:805 -- Next steps
2024-05-26 08:08:18,486	INFO scripts.py:808 -- To add another node to this Ray cluster, run
2024-05-26 08:08:18,486	INFO scripts.py:811 --   ray start --address='172.28.0.12:6379'
2024-05-26 08:08:18,487	INFO scripts.py:820 -- To connect to

Adjusted the gradient accumulation steps from 8 to 4
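
To make the effect of that change concrete, the arithmetic can be sketched as follows. The batch size, accumulation steps, and GPU count come from the command above. `NUM_EXAMPLES` is a hypothetical figure, since the size of the training split is not shown, and the step count is only an approximation of what the trainer does.

```python
import math


def effective_batch_size(train_batch_size, gradient_accumulation_steps, gpus=1):
    """Number of examples contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps * gpus


def optimizer_steps_per_epoch(num_examples, train_batch_size, gradient_accumulation_steps, gpus=1):
    """Approximate number of optimizer updates per epoch."""
    per_update = effective_batch_size(train_batch_size, gradient_accumulation_steps, gpus)
    return math.ceil(num_examples / per_update)


NUM_EXAMPLES = 1000  # hypothetical; the real size of train.source is not shown here

for accumulation in (8, 4):
    size = effective_batch_size(1, accumulation, gpus=1)
    steps = optimizer_steps_per_epoch(NUM_EXAMPLES, 1, accumulation, gpus=1)
    print(f"accumulation={accumulation}: effective batch size={size}, about {steps} updates per epoch")
```

Halving the accumulation steps halves the effective batch size and roughly doubles the number of optimizer updates per epoch.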

## Retrieval-Only Evaluation
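
In retrieval mode the evaluation script is understood to report precision@k: the share of the top-k retrieved titles that appear in the gold titles, averaged over all questions. The sketch below is an assumption about that metric, not a copy of `eval_rag.py`. The function name and the sample titles are hypothetical, and titles are assumed to be tab-separated on each line.

```python
def precision_at_k(predictions, references, k=1):
    """Average share of the top-k predicted titles found in the gold titles, in percent."""
    total = 0.0
    for hypo, ref in zip(predictions, references):
        hypo_titles = set(hypo.split("\t")[:k])  # top-k retrieved titles
        ref_titles = set(ref.split("\t"))        # gold titles for this question
        total += len(hypo_titles & ref_titles) / k
    return 100.0 * total / len(predictions)


# Hypothetical retrieved titles (best first) and gold titles for two questions
predictions = ["Pneumonia\tThrush", "Malaria\tThrush"]
references = ["Pneumonia", "Thrush"]

print(precision_at_k(predictions, references, k=1))
print(precision_at_k(predictions, references, k=2))
```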

### Health Data Evaluation

In [None]:
!ls rag-end2end-retriever/Health-data/health-training-data/

test.source  test.target  train.source	train.target  val.source  val.target


In [None]:
%%bash
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/Health-data/health-training-data/train.target \
    --gold_data_path rag-end2end-retriever/Health-data/health-training-data/train.source \
    --predictions_path evaluation/output/retrieval_preds.tsv \
    --eval_mode retrieval \
    --k 1

INFO:__main__:Evaluate the following checkpoints: ['rag-end2end-retriever/model_checkpoints2/checkpoint83']
INFO:__main__:***** Running evaluation for rag-end2end-retriever/model_checkpoints2/checkpoint83 *****
INFO:__main__:  Batch size = 8
INFO:__main__:  Predictions will be stored under evaluation/output/retrieval_preds.tsv
loading configuration file rag-end2end-retriever/model_checkpoints2/checkpoint83/config.json
Model config RagConfig {
  "_name_or_path": "rag-end2end-retriever/model_checkpoints2/checkpoint83",
  "architectures": [
    "RagTokenForGeneration"
  ],
  "dataset": "wiki_dpr",
  "dataset_revision": null,
  "dataset_split": "train",
  "do_deduplication": true,
  "do_marginalize": false,
  "doc_sep": " // ",
  "exclude_bos_score": false,
  "forced_eos_token_id": 2,
  "generator": {
    "_name_or_path": "",
    "_num_labels": 3,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_bias_logits": false,
    "add_cross_attention": false,
    "add_final

### Dummy data Evaluation

In [None]:
%%bash
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints/checkpoint481 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/test_run/dummy-train-data/train.target \
    --gold_data_path rag-end2end-retriever/test_run/dummy-train-data/train.source \
    --predictions_path rag-end2end-retriever/evaluation/test_e2e_preds.txt \
    --eval_mode retrieval \
    --k 1


INFO:__main__:Evaluate the following checkpoints: ['rag-end2end-retriever/model_checkpoints/checkpoint481']
INFO:__main__:***** Running evaluation for rag-end2end-retriever/model_checkpoints/checkpoint481 *****
INFO:__main__:  Batch size = 8
INFO:__main__:  Predictions will be stored under rag-end2end-retriever/evaluation/test_e2e_preds.txt
loading configuration file rag-end2end-retriever/model_checkpoints/checkpoint481/config.json
Model config RagConfig {
  "_name_or_path": "facebook/rag-token-base",
  "architectures": [
    "RagTokenForGeneration"
  ],
  "dataset": "wiki_dpr",
  "dataset_revision": null,
  "dataset_split": "train",
  "do_deduplication": true,
  "do_marginalize": false,
  "doc_sep": " // ",
  "exclude_bos_score": false,
  "forced_eos_token_id": 2,
  "generator": {
    "_name_or_path": "",
    "_num_labels": 3,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_bias_logits": false,
    "add_cross_attention": false,
    "add_final_layer_norm": fa

## Retrieval End to End Evaluation
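
End-to-end evaluation compares generated answers with gold answers, typically through exact match (EM) and token-level F1 after SQuAD-style normalization. The following sketch shows that scoring under the assumption that the script uses the usual normalization (lowercasing, stripping punctuation and articles). The gold answer in the example is hypothetical, while the prediction is the string this notebook's model produces later on.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """True when both answers are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(gold)


def f1(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "B pneumonia. Bacterial pneumonia"
gold = "Bacterial pneumonia is a lung infection"  # hypothetical gold answer

print(exact_match(prediction, gold))
print(round(f1(prediction, gold), 3))
```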

In [None]:
%%bash
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints2/checkpoint83 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/Health-data/health-training-data/train.source \
    --gold_data_path rag-end2end-retriever/Health-data/health-training-data/train.target \
    --predictions_path rag-end2end-retriever/evaluation/e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode ans \
    --n_docs 5 \
    --print_predictions \
    --recalculate \

INFO:__main__:Evaluate the following checkpoints: ['rag-end2end-retriever/model_checkpoints2/checkpoint83']
INFO:__main__:***** Running evaluation for rag-end2end-retriever/model_checkpoints2/checkpoint83 *****
INFO:__main__:  Batch size = 8
INFO:__main__:  Predictions will be stored under rag-end2end-retriever/evaluation/e2e_preds.txt
loading configuration file rag-end2end-retriever/model_checkpoints2/checkpoint83/config.json
Model config RagConfig {
  "_name_or_path": "rag-end2end-retriever/model_checkpoints2/checkpoint83",
  "architectures": [
    "RagTokenForGeneration"
  ],
  "dataset": "wiki_dpr",
  "dataset_revision": null,
  "dataset_split": "train",
  "do_deduplication": true,
  "do_marginalize": false,
  "doc_sep": " // ",
  "exclude_bos_score": false,
  "forced_eos_token_id": 2,
  "generator": {
    "_name_or_path": "",
    "_num_labels": 3,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_bias_logits": false,
    "add_cross_attention": false,
    "

In [None]:
!ls rag-end2end-retriever/test_run/dummy-train-data

test.source  test.target  train.source	train.target  val.source  val.target


In [None]:
%%bash
python rag-end2end-retriever/eval_rag.py \
    --model_name_or_path rag-end2end-retriever/model_checkpoints/checkpoint481 \
    --model_type rag_token \
    --evaluation_set rag-end2end-retriever/test_run/dummy-train-data/train.source \
    --gold_data_path rag-end2end-retriever/test_run/dummy-train-data/train.target \
    --predictions_path rag-end2end-retriever/evaluation/test_e2e_preds.txt \
    --eval_mode e2e \
    --gold_data_mode ans \
    --n_docs 5 \
    --print_predictions \
    --recalculate \

INFO:__main__:Evaluate the following checkpoints: ['rag-end2end-retriever/model_checkpoints/checkpoint481']
INFO:__main__:***** Running evaluation for rag-end2end-retriever/model_checkpoints/checkpoint481 *****
INFO:__main__:  Batch size = 8
INFO:__main__:  Predictions will be stored under rag-end2end-retriever/evaluation/test_e2e_preds.txt
loading configuration file rag-end2end-retriever/model_checkpoints/checkpoint481/config.json
Model config RagConfig {
  "_name_or_path": "facebook/rag-token-base",
  "architectures": [
    "RagTokenForGeneration"
  ],
  "dataset": "wiki_dpr",
  "dataset_revision": null,
  "dataset_split": "train",
  "do_deduplication": true,
  "do_marginalize": false,
  "doc_sep": " // ",
  "exclude_bos_score": false,
  "forced_eos_token_id": 2,
  "generator": {
    "_name_or_path": "",
    "_num_labels": 3,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "add_bias_logits": false,
    "add_cross_attention": false,
    "add_final_layer_norm": fa

# Testing the Model

## Out of the Box Implementation

In [None]:
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)

model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# Call the tokenizer directly; prepare_seq2seq_batch is deprecated
input_dict = tokenizer("who holds the record in 100m freestyle", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])

print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

# should give michael phelps => sounds reasonable


## Retriever component

In [None]:
from transformers import RagRetriever, DPRQuestionEncoderTokenizer, DPRQuestionEncoder

# Paths to the dummy knowledge base and its FAISS index
dataset_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset"
index_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset_hnsw_index.faiss"

# Initialize the retriever from a pretrained RAG checkpoint and point it at the custom index.
# The first argument must be a model name or path, not a RagConfig object.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",
    index_name="custom",
    passages_path=dataset_path,
    index_path=index_path,
)

# Create a query
query = "What does Moses' rod turn into ?"

# Tokenize and encode the query
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

inputs = tokenizer(query, return_tensors="pt")
question_embeddings = question_encoder(**inputs).pooler_output

# Retrieve top passages; the retriever takes the token ids and the embeddings as NumPy arrays
retrieved_results = retriever(
    inputs["input_ids"].numpy(),
    question_embeddings.detach().numpy(),
    return_tensors="pt",
)

# Inspect the results
for key, value in retrieved_results.items():
    print(key, value.shape)


In [None]:
from transformers import RagRetriever, RagTokenizer, RagConfig, DPRQuestionEncoderTokenizer, DPRQuestionEncoder

# Paths to your dataset and FAISS index
dataset_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset"
index_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset_hnsw_index.faiss"

# Initialize RAG configuration with the required parameters
config = RagConfig.from_pretrained(
    "facebook/rag-token-base",
    question_encoder="facebook/dpr-question_encoder-single-nq-base",
    generator="facebook/bart-large"
)

# Initialize the retriever with the specified configuration
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-base",
    config=config,
    index_name="custom",
    passages_path=dataset_path,
    index_path=index_path,
)

# Create a query
query = "What does Moses' rod turn into ?"

# Tokenize the query
tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

inputs = tokenizer(query, return_tensors="pt")
question_embeddings = question_encoder(**inputs).pooler_output

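# Retrieve top passages; the retriever takes the token ids and the embeddings as NumPy arrays
retrieved_results = retriever(
    inputs["input_ids"].numpy(),
    question_embeddings.detach().numpy(),
    return_tensors="pt",
)

# Inspect the results
for key, value in retrieved_results.items():
    print(key, value.shape)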


OSError: Could not locate question_encoder_tokenizer/config.json inside facebook/rag-token-base.

## Generator Component

In [None]:
from transformers import AutoTokenizer, RagRetriever, RagTokenForGeneration
import torch

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
# initialize with RagRetriever to do everything in one forward call
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

inputs = tokenizer("How many people live in Paris?", return_tensors="pt")

targets = tokenizer(text_target="In Paris, there are 10 million people.", return_tensors="pt")

input_ids = inputs["input_ids"]
labels = targets["input_ids"]

outputs = model(input_ids=input_ids, labels=labels)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# To decode the generated response (if the model is in generation mode, not training):
generated_ids = model.generate(input_ids)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

generated_text

' 270,000,000'

In [None]:
# or use retriever separately
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", use_dummy_dataset=True)

# 1. Encode
question_hidden_states = model.question_encoder(input_ids)[0]


# 2. Retrieve
docs_dict = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
doc_scores = torch.bmm(
    question_hidden_states.unsqueeze(1), docs_dict["retrieved_doc_embeds"].float().transpose(1, 2)
).squeeze(1)


# 3. Forward to generator
outputs = model(
    context_input_ids=docs_dict["context_input_ids"],
    context_attention_mask=docs_dict["context_attention_mask"],
    doc_scores=doc_scores,
    decoder_input_ids=labels,
)

# or directly generate
generated = model.generate(
    context_input_ids=docs_dict["context_input_ids"],
    context_attention_mask=docs_dict["context_attention_mask"],
    doc_scores=doc_scores,
)
generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)

Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.bias', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
generated_string

[' 270,000,000']

In [None]:
doc_scores

tensor([[66.5352, 65.4459, 64.8436, 64.5702, 63.6988]],
       grad_fn=<SqueezeBackward1>)
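
The scores above are inner products between the question embedding and each retrieved document embedding. RAG turns them into a distribution over documents with a softmax before marginalizing over the documents, which the following standard-library sketch reproduces on the printed values (the helper name `softmax` is ours, not part of the library).

```python
import math

# Values copied from the doc_scores tensor printed above
doc_scores = [66.5352, 65.4459, 64.8436, 64.5702, 63.6988]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


doc_probs = softmax(doc_scores)
for rank, (s, p) in enumerate(zip(doc_scores, doc_probs), start=1):
    print(f"doc {rank}: score={s:.4f}  p={p:.3f}")
```

Because the softmax depends only on score differences, a gap of about one point between the first and second document already gives the top document more than half of the probability mass.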

In [None]:
print(docs_dict["context_input_ids"])

tensor([[   0, 4688,  415,  ...,    1,    1,    1],
        [   0,  347, 8810,  ...,    1,    1,    1],
        [   0, 4688,  415,  ...,    1,    1,    1],
        [   0,  347, 8810,  ...,    1,    1,    1],
        [   0,  347, 8810,  ...,    1,    1,    1]])


## Own Knowledge Test

In [None]:
%%bash
python rag-end2end-retriever/use_own_knowledge_dataset.py \
    --csv_path rag-end2end-retriever/Health-data/health-data/NishauriGPT-Data.tsv \
    --output_dir rag-end2end-retriever/Health-data/health-data \
    --question "What is Candidiasis (thrush)?" \
    --rag_model_name facebook/rag-token-nq \

INFO:__main__:Step 1 - Create the dataset
Some weights of the model checkpoint at facebook/dpr-ctx_encoder-multiset-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is

In [None]:
!ls rag-end2end-retriever/Health-data/health-data

my_knowledge_dataset		       NishauriGPT-Data.csv
my_knowledge_dataset_hnsw_index.faiss  NishauriGPT-Data.tsv


In [None]:
import os
import logging
from pathlib import Path

import faiss
from datasets import load_from_disk
from transformers import (
    RagRetriever,
    RagSequenceForGeneration,
    RagTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Define the paths to the dataset and the index for health data

# dataset_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset"
# index_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss"

# # Define the test data
# dataset_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset"
# index_path = "rag-end2end-retriever/test_run/dummy-kb/my_knowledge_dataset_hnsw_index.faiss"

In [None]:
!ls rag-end2end-retriever/test_run/dummy-kb

my_knowledge_dataset  my_knowledge_dataset.csv	my_knowledge_dataset_hnsw_index.faiss


In [None]:
!ls rag-end2end-retriever/model_checkpoints2/

checkpoint83  dpr_ctx_checkpoint  git_log.json	hparams.pkl  metrics.json


In [None]:
# Define the paths to the dataset and the index for health data

dataset_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset"
index_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss"

checkpoint_path = "rag-end2end-retriever/model_checkpoints2/checkpoint83"
dpr_checkpoint_path = "rag-end2end-retriever/model_checkpoints2/dpr_ctx_checkpoint/checkpoint83"

In [None]:
# Step 1: Load the dataset from disk
logger.info("Loading dataset from disk")
dataset = load_from_disk(dataset_path)

# Step 2: Load the Faiss index and attach it to the dataset
logger.info("Loading and attaching Faiss index")
dataset.load_faiss_index("embeddings", index_path)

In [None]:
import os

from datasets import load_from_disk
from huggingface_hub import HfApi

# Read the Hub token from the environment instead of hard-coding a secret in the notebook
hf_token = os.environ["HF_TOKEN"]
repo_name = 'Oscar066/health-dataset'

# Step 1: Load the dataset from disk
dataset_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset"
dataset = load_from_disk(dataset_path)

# Step 2: Drop the Faiss index if one is attached.
# A dataset freshly loaded from disk has no index attached, so an unguarded
# dataset.drop_index("embeddings") raises KeyError: 'embeddings'.
if "embeddings" in dataset.list_indexes():
    dataset.drop_index("embeddings")

# Step 3: (Optional) Perform any transformations on the dataset here

# Step 4: Load the Faiss index and attach it to the dataset again
index_path = "rag-end2end-retriever/Health-data/health-data/my_knowledge_dataset_hnsw_index.faiss"
dataset.load_faiss_index("embeddings", index_path)

# Step 5: Push the dataset to the Hugging Face Hub
dataset.push_to_hub(repo_name, token=hf_token)

In [None]:
dataset

Dataset({
    features: ['text', 'title', 'embeddings'],
    num_rows: 94
})

In [None]:
# Step 3: Initialize the RAG Retriever with the loaded dataset and index
logger.info("Initializing RAG Retriever and Model")
retriever = RagRetriever.from_pretrained("Oscar066/RAG-end2end-Model", index_name="custom", indexed_dataset=dataset)

model = RagSequenceForGeneration.from_pretrained("Oscar066/RAG-end2end-Model", retriever=retriever)

tokenizer = RagTokenizer.from_pretrained("Oscar066/RAG-end2end-Model")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
# Load the question encoder and its tokenizer
# question_encoder = DPRQuestionEncoder.from_pretrained("Oscar066/RAG-end2end-Model")
# question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("Oscar066/RAG-end2end-Model")

You are using a model of type rag to instantiate a model of type dpr. This is not supported for all configurations of models and can yield errors.


TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)


In [None]:
# Load the question encoder and its tokenizer.
# Caution: dpr_ctx_checkpoint presumably stores DPRContextEncoder weights, so DPRQuestionEncoder
# finds no matching parameters and initializes them randomly, as the warning below reports.
question_encoder = DPRQuestionEncoder.from_pretrained(dpr_checkpoint_path)
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(dpr_checkpoint_path)

Some weights of DPRQuestionEncoder were not initialized from the model checkpoint at rag-end2end-retriever/model_checkpoints2/dpr_ctx_checkpoint/checkpoint83 and are newly initialized: ['bert_model.embeddings.LayerNorm.bias', 'bert_model.embeddings.LayerNorm.weight', 'bert_model.embeddings.position_embeddings.weight', 'bert_model.embeddings.token_type_embeddings.weight', 'bert_model.embeddings.word_embeddings.weight', 'bert_model.encoder.layer.0.attention.output.LayerNorm.bias', 'bert_model.encoder.layer.0.attention.output.LayerNorm.weight', 'bert_model.encoder.layer.0.attention.output.dense.bias', 'bert_model.encoder.layer.0.attention.output.dense.weight', 'bert_model.encoder.layer.0.attention.self.key.bias', 'bert_model.encoder.layer.0.attention.self.key.weight', 'bert_model.encoder.layer.0.attention.self.query.bias', 'bert_model.encoder.layer.0.attention.self.query.weight', 'bert_model.encoder.layer.0.attention.self.value.bias', 'bert_model.encoder.layer.0.attention.self.value.weigh

In [None]:
# Example usage
question = "What is Bacterial pneumonia?"
input_ids = question_tokenizer(question, return_tensors="pt")["input_ids"]
#input_ids = tokenizer.question_encoder(question, return_tensors="pt")["input_ids"]

In [None]:
# Generate answer using the model
generated = model.generate(input_ids)
generated_string = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

In [None]:
print("Q: " + question)
print("A: " + generated_string)

Q: What is Bacterial pneumonia?
A: B pneumonia. Bacterial pneumonia            


In [None]:
# Get the question hidden states
question_hidden_states = question_encoder(input_ids).pooler_output

# Retrieve passages and scores
retrieved_results = retriever(question_hidden_states)

retrieved_docs = retrieved_results["retrieved_doc_embeds"]  # Document embeddings
doc_scores = retrieved_results["doc_scores"]  # Retrieval scores

TypeError: RagRetriever.__call__() missing 1 required positional argument: 'question_hidden_states'

In [None]:
!ls rag-end2end-retriever/model_checkpoints/checkpoint481

config.json		generator_tokenizer  question_encoder_tokenizer
generation_config.json	model.safetensors


In [None]:
for i, score in enumerate(doc_scores[0]):
    doc_id = int(retrieved_results["doc_ids"][0][i])  # The retriever returns document IDs, not titles
    logger.info(f"Document {i + 1} - Score: {score:.4f}, Doc ID: {doc_id}")

In [None]:
for i, score in enumerate(doc_scores[0]):
    doc_id = retrieved_results["doc_ids"][0][i]  # Document ID (the retriever's output key is "doc_ids")
    title = dataset[int(doc_id)]["title"]  # Retrieve title from the dataset using the doc_id
    logger.info(f"Document {i + 1} - Score: {score:.4f}, Title: {title}")

In [None]:
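import torch

# Get the question hidden states
question_hidden_states = question_encoder(input_ids).pooler_output

# Retrieve passages; the retriever takes the token ids and the embeddings as NumPy arrays
retrieved_results = retriever(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
retrieved_docs = retrieved_results["retrieved_doc_embeds"]  # Document embeddings

# The retriever returns no scores; compute them as inner products, as in the Generator Component section
doc_scores = torch.bmm(question_hidden_states.unsqueeze(1), retrieved_docs.float().transpose(1, 2)).squeeze(1)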