# **Retrieval-Augmented Generation (RAG) System with GPT-2: Document Retrieval, Text Generation, and Query Response**


This notebook demonstrates a Retrieval-Augmented Generation (RAG) system built on GPT-2. It encodes a small set of documents, retrieves the one most relevant to a query, and passes it as context to a fine-tuned GPT-2 model that generates the response.

A **RAG (Retrieval-Augmented Generation)** system combines two components: information retrieval and text generation. It answers a question by first retrieving relevant documents from a knowledge base and then generating an answer conditioned on that retrieved text.
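
The two stages can be sketched as a small function that takes a retriever and a generator as plain callables. This is an illustrative outline only: `rag_answer`, `toy_retrieve`, and `toy_generate` are hypothetical names, and the word-overlap retriever and echo-style generator are stand-ins for the embedding model and GPT-2 used later in this notebook.

```python
from typing import Callable, List


def rag_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve context for the query, then generate an answer from it."""
    context_docs = retrieve(query)
    # Put the retrieved text in front of the question so the generator can use it.
    prompt = "\n".join(context_docs) + "\n" + query
    return generate(prompt)


# Hypothetical stand-ins so the sketch runs without any model.
KNOWLEDGE_BASE = [
    "The Eiffel Tower is in Paris.",
    "The Berlin Wall once divided East and West Berlin.",
]


def toy_retrieve(query: str) -> List[str]:
    # Pick the document that shares the most lowercase words with the query.
    query_words = set(query.lower().split())
    best = max(
        KNOWLEDGE_BASE,
        key=lambda doc: len(query_words & set(doc.lower().rstrip(".").split())),
    )
    return [best]


def toy_generate(prompt: str) -> str:
    # A real generator would continue the prompt; this one returns the context line.
    return prompt.splitlines()[0]


answer = rag_answer("what is the berlin wall", toy_retrieve, toy_generate)
print(answer)
```

Passing the retriever and generator as arguments keeps the two stages independent, so either one can be swapped out without touching the other.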

### Install Libraries

In [1]:
!pip install sentence-transformers -q
!pip install transformers -q
!pip install torch -q


### Importing Libraries

In [2]:
from sentence_transformers import SentenceTransformer, util
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer



In [3]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Loading pre-trained embedding model and GPT-2 model

The cell below loads a pre-trained sentence-embedding model for retrieval and a GPT-2 model that has been fine-tuned to answer questions from a provided context.

In [4]:
# Initialize the SentenceTransformer model for generating embeddings from text.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Load the tokenizer for GPT-2 from the specified directory.
tokenizer = GPT2Tokenizer.from_pretrained('/content/drive/MyDrive/ChatBotRag/gpt2-finetuned')

# Load the fine-tuned GPT-2 model from the specified directory.
gpt2_model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/ChatBotRag/gpt2-finetuned/checkpoint-3000')




### Setting the pad token to be the same as the EOS token

In [5]:
# GPT-2 has no padding token by default, so reuse the end-of-sequence (EOS) token.
# This lets the tokenizer pad inputs during batching or text generation.
tokenizer.pad_token = tokenizer.eos_token
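
To see why a pad token is needed, consider batching prompts of different lengths: shorter sequences must be filled to a common length, and an attention mask marks which positions are real. The sketch below is a plain-Python illustration, not the tokenizer's implementation. `pad_batch` is a hypothetical helper, the token IDs other than 50256 (GPT-2's EOS ID) are made up, and left padding is an assumption chosen here because it keeps the real tokens next to the position where generation continues.

```python
from typing import List, Tuple

GPT2_EOS_ID = 50256  # GPT-2's end-of-text token ID, reused here as the pad ID


def pad_batch(
    sequences: List[List[int]], pad_id: int = GPT2_EOS_ID
) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token ID lists to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        # 0 marks padding (ignored by the model), 1 marks real tokens.
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths.
ids, mask = pad_batch([[10, 11, 12], [20]])
print(ids)
print(mask)
```

Because the pad ID and the EOS ID are the same number, the attention mask is what tells the model which positions are filler.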

### Example documents

In [6]:
documents = [
    "The Eiffel Tower is in Paris.",
    "The Great Wall of China is in China.",
    "Mount Everest is the highest mountain in the world.",
    "Praveen is a machine learning engineer.",
    "Mount Everest, the highest peak in the world.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "The Amazon Rainforest is located in South America.",
    "The Pacific Ocean is the largest ocean on Earth.",
    "Albert Einstein developed the theory of relativity.",
    "The Pyramids of Giza are one of the Seven Wonders of the Ancient World.",
    "The Sahara Desert is the largest hot desert in the world.",
    "Vincent van Gogh was a Dutch post-impressionist painter.",
    "The Taj Mahal is located in Agra, India.",
    "The Grand Canyon is a large canyon in the state of Arizona, USA.",
    "Shakespeare wrote many famous plays including 'Romeo and Juliet'.",
    "The Colosseum is an ancient amphitheater located in Rome, Italy.",
    "The Galápagos Islands are known for their unique wildlife.",
    "Leonardo da Vinci was also an inventor and scientist.",
    "The Berlin Wall once divided East and West Berlin.",
    "The Great Barrier Reef is the largest coral reef system in the world."
]


### Generating embeddings for documents

In [7]:
# Encode the list of documents into vectors (embeddings) using the embedding model.
doc_embeddings = embedding_model.encode(documents, convert_to_tensor=True)

### Function to retrieve documents

In [8]:
def retrieve_documents(query, top_k=1):
    # Encode the query into a vector (embedding) using the embedding model.
    query_embedding = embedding_model.encode(query, convert_to_tensor=True)

    # Compute the cosine similarity between the query embedding and the document embeddings.
    similarities = util.pytorch_cos_sim(query_embedding, doc_embeddings)

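    # Get the scores and indices of the top_k most similar documents.
    # Clamp k so that asking for more documents than exist does not raise an error.
    top_results = torch.topk(similarities, k=min(top_k, len(documents)))

    # similarities has shape (1, num_documents), so take row 0 of the indices.
    return [documents[int(idx)] for idx in top_results.indices[0]]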

In [26]:
relevant_docs = retrieve_documents("what is Berlin Wall")
relevant_docs

['The Berlin Wall once divided East and West Berlin.']
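
For intuition, the retrieval step computes a cosine similarity between the query vector and each document vector and keeps the highest-scoring entries. A minimal pure-Python sketch of that computation follows. The three-dimensional vectors are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_similarity` and `top_k_indices` are hypothetical helpers rather than library functions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Made-up embeddings for three documents and one query.
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
toy_query_vector = [1.0, 0.0, 0.0]

scores = [cosine_similarity(toy_query_vector, d) for d in toy_doc_vectors]
best = top_k_indices(scores, k=2)
print(best)
```

Cosine similarity depends only on the angle between vectors, not their magnitude, which is why the third toy document still scores well despite being longer than the query vector.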

### Function to generate a response using GPT-2

In [11]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

def load_model_and_tokenizer(checkpoint_path, tokenizer_path):
    # Load the fine-tuned model and tokenizer from the checkpoint
    tokenizer = GPT2Tokenizer.from_pretrained(tokenizer_path)
    model = GPT2LMHeadModel.from_pretrained(checkpoint_path)
    return tokenizer, model

# Example usage
checkpoint_path = "/content/drive/MyDrive/ChatBotRag/gpt2-finetuned/checkpoint-3000"  # Path to the fine-tuned model checkpoint
tokenizer_path = "/content/drive/MyDrive/ChatBotRag/gpt2-finetuned"

# Load the model and tokenizer
tokenizer, model = load_model_and_tokenizer(checkpoint_path, tokenizer_path)

In [38]:
from transformers import TextStreamer, StoppingCriteriaList

# Define the stopping criteria function.
# It receives the token IDs produced so far and the scores;
# **kwargs absorbs any extra keyword arguments that may be forwarded to it.
def stop_generation(input_ids, scores, **kwargs):
    # Iterate through each sequence in the batch
    for batch_id in range(len(input_ids)):
        # Decode the tokens of the current sequence (prompt plus generated text)
        decoded_tokens = tokenizer.decode(input_ids[batch_id].tolist())

        # Check if '<stop>' is present in the decoded text
        if '<stop>' in decoded_tokens:
            return True  # Stop generation if '<stop>' appears in any sequence
    return False  # Continue generation otherwise

def stream(user_prompt):
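    runtimeFlag = "cpu"  # Device for the input tensors; the model is loaded on the CPU by default

    # Retrieve the most relevant document and place it before the question as context.
    relevant_docs = retrieve_documents(user_prompt)
    prompt = f"{relevant_docs[0]}\n<que> {user_prompt}"
    print(prompt)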

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Create a stopping criteria list with the defined function
    stopping_criteria = StoppingCriteriaList()
    stopping_criteria.append(stop_generation)

    # Call the generate method while passing the stopping criteria
    stream_out = model.generate(
        **inputs,
        streamer=streamer,
        max_new_tokens=512,
        stopping_criteria=stopping_criteria,
        do_sample=True,
        top_k=2,
        pad_token_id=tokenizer.eos_token_id
    )
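
The prompt layout and the stop marker can also be handled with plain string operations, which is useful when the answer is needed as a value rather than streamed to the screen. The sketch below mirrors the `<que>` tag used in `stream` and the `<ans>`/`<stop>` tags that appear in this notebook's example output. `build_prompt` and `extract_answer` are hypothetical helpers, and treating everything between `<ans>` and `<stop>` as the answer is an assumption about the fine-tuning format.

```python
def build_prompt(context: str, question: str) -> str:
    """Place the retrieved context first, then the question after a <que> tag."""
    return f"{context}\n<que> {question}"


def extract_answer(generated_text: str) -> str:
    """Return the text between <ans> and <stop>, or '' if there is no <ans> tag."""
    _, found, after_ans = generated_text.partition("<ans>")
    if not found:
        return ""
    answer, _, _ = after_ans.partition("<stop>")
    return answer.strip()


prompt = build_prompt(
    "Mount Everest, the highest peak in the world.",
    "what is mount everest ?",
)
# Example model output written in the same tag format as the notebook's output.
raw_output = prompt + "\n\n<ans> highest peak\n<stop>"
print(extract_answer(raw_output))
```

If the model never emits `<stop>`, `extract_answer` returns everything after `<ans>`, so a length limit such as `max_new_tokens` still matters.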


In [39]:
stream("what is mount everest ?")

Mount Everest, the highest peak in the world.
<que> what is mount everest ?

<ans> highest peak
<stop>

