### Practical Example: Semantic Search with Retrieval-Augmented Generation (RAG)

This example builds a simple semantic search system from two pieces: the question encoder of a Retrieval-Augmented Generation (RAG) model from Hugging Face's Transformers library, and FAISS (Facebook AI Similarity Search) for nearest-neighbor search. A full RAG pipeline combines a retriever with a generator; this example uses only the encoder to embed text and returns the closest document rather than generating a response.

#### Pre-requisites

1. **Install Required Libraries**
   First, make sure the necessary libraries are installed. You need `transformers`, `faiss-cpu`, `numpy`, and `torch`. Install them using pip:


In [None]:
pip install transformers faiss-cpu torch numpy

#### Code Breakdown

In [None]:
import torch
import faiss
import numpy as np
from transformers import RagTokenizer, RagTokenForGeneration

- **Imports**: 
  - `torch` for PyTorch tensor operations.
  - `faiss` for efficient similarity search.
  - `numpy` for array manipulations.
  - `RagTokenizer` and `RagTokenForGeneration` from `transformers` to use the RAG model.

In [None]:
# Load pre-trained tokenizer and model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq")

- **Model and Tokenizer**:
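  - `RagTokenizer` wraps the question-encoder tokenizer and the generator tokenizer. Called directly, as in this example, it tokenizes text with the question-encoder tokenizer.
  - `RagTokenForGeneration` loads the pre-trained RAG model, which contains a question encoder and a generator. The retriever is a separate component (`RagRetriever`) and is not loaded here.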

In [None]:
# Define some example documents
documents = [
    "To reset your SmartWatch, press and hold the power button for 10 seconds until the logo appears.",
    "If your SmartWatch is unresponsive, try performing a hard reset by pressing and holding the power and home buttons simultaneously.",
    "Check the battery level of your SmartWatch if it is not turning on."
]

- **Documents**:
  - A list of example documents that will be indexed and used for similarity search.


In [None]:
# Tokenize documents
inputs = tokenizer(documents, return_tensors="pt", padding=True, truncation=True, max_length=512)

- **Tokenization**:
  - The documents are tokenized into tensors suitable for the model. Padding and truncation ensure uniform input size.
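
To make the effect of `padding=True` and `truncation=True` concrete, the following pure-Python sketch mimics what those options do to a batch of token-ID lists. The helper `pad_and_truncate` and the pad ID `0` are assumptions made for illustration and are not part of the `transformers` API; the real tokenizer also adds special tokens, which this sketch omits.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in batch]
    width = max(len(seq) for seq in truncated)
    # Pad short sequences on the right so every row has the same length.
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(ids)   # rows now share one length
print(mask)  # zeros mark the padded positions
```

The attention mask is what lets the encoder ignore the padded positions, which is why the notebook passes both `input_ids` and `attention_mask` to the encoder.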

In [None]:
# Generate document embeddings
with torch.no_grad():
    # Use the encoder from the RAG model to encode the documents
    encoder = model.rag.question_encoder
    encoder_outputs = encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    
    # Take the first output (the pooled embedding).
    # Indexing with [0] works for both tuple and ModelOutput return types.
    encoder_outputs = encoder_outputs[0]
    
    # Check the shape of encoder_outputs
    print(f"Shape of encoder_outputs: {encoder_outputs.shape}")
    
    # Ensure proper dimensions
    if encoder_outputs.dim() == 3:
        doc_embeddings = encoder_outputs[:, 0, :].numpy()
    elif encoder_outputs.dim() == 2:
        # Handle case where the output tensor is already 2D (batch_size, hidden_size)
        doc_embeddings = encoder_outputs.numpy()
    else:
        raise ValueError("Unexpected number of dimensions in encoder_outputs.")
    
    # Ensure the array is C-contiguous
    doc_embeddings = np.ascontiguousarray(doc_embeddings)

- **Embedding Generation**:
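  - The model’s question encoder produces one embedding per document. DPR, which RAG builds on, trains a separate context encoder for passages, so encoding documents with the question encoder is a simplification.
  - If the output is a 3D tensor (batch, sequence, hidden), the first token’s vector is taken for each document; a 2D tensor (batch, hidden) is used as is.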
  - The embeddings are made C-contiguous to be compatible with FAISS.
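
The shape handling and the contiguity step can be tried in isolation with NumPy alone. The sketch below is illustrative, not the notebook's code: `to_faiss_matrix` is a hypothetical helper, and the input is a made-up array standing in for encoder output. It also casts to `float32`, the dtype FAISS expects.

```python
import numpy as np


def to_faiss_matrix(outputs):
    """Return a 2D, float32, C-contiguous matrix from 2D or 3D encoder output."""
    arr = np.asarray(outputs, dtype=np.float32)
    if arr.ndim == 3:
        # (batch, sequence, hidden): keep the first token's vector per item.
        arr = arr[:, 0, :]
    elif arr.ndim != 2:
        raise ValueError("Expected a 2D or 3D array.")
    return np.ascontiguousarray(arr)


# Made-up stand-in for encoder output: 2 items, 3 tokens, 4 hidden units.
hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
sliced = hidden[:, 0, :]
print(sliced.flags["C_CONTIGUOUS"])  # slicing yields a strided view

matrix = to_faiss_matrix(hidden)
print(matrix.shape, matrix.dtype, matrix.flags["C_CONTIGUOUS"])
```

Slicing out the first token gives a strided view of the original array, which is why the explicit `np.ascontiguousarray` call matters in the 3D case.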

In [None]:
# Create FAISS index
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)

- **FAISS Index Creation**:
  - A FAISS index is created over the document embeddings. The index type `IndexFlatL2` performs exact (brute-force) search using L2 (Euclidean) distance.
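
As a rough picture of what a flat L2 index computes, the sketch below does the same exhaustive comparison in plain NumPy. It is not FAISS code: `flat_l2_search` is a hypothetical helper and the vectors are toy data. Like `IndexFlatL2`, it reports squared L2 distances.

```python
import numpy as np


def flat_l2_search(doc_vectors, query_vectors, top_k):
    """Exhaustive nearest-neighbor search using squared L2 distance."""
    # Pairwise differences: (n_queries, n_docs, dim).
    diffs = query_vectors[:, None, :] - doc_vectors[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Sort each query's distances and keep the top_k closest documents.
    order = np.argsort(sq_dists, axis=1)[:, :top_k]
    return np.take_along_axis(sq_dists, order, axis=1), order


docs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
queries = np.array([[0.9, 0.1]], dtype=np.float32)

distances, indices = flat_l2_search(docs, queries, top_k=2)
print(indices)    # document positions, closest first
print(distances)  # matching squared distances
```

Because every query is compared against every stored vector, the cost grows linearly with the number of documents. That is acceptable for the three documents here.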

In [None]:
# Define a search function
def search(query, top_k=1):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        query_embedding = encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
        
        # Take the first output (the pooled embedding).
        # Indexing with [0] works for both tuple and ModelOutput return types.
        query_embedding = query_embedding[0]
        
        # Check the shape of query_embedding
        print(f"Shape of query_embedding: {query_embedding.shape}")
        
        # Ensure proper dimensions
        if query_embedding.dim() == 3:
            query_embedding = query_embedding[:, 0, :].numpy()
        elif query_embedding.dim() == 2:
            # Handle case where the output tensor is already 2D (batch_size, hidden_size)
            query_embedding = query_embedding.numpy()
        else:
            raise ValueError("Unexpected number of dimensions in query_embedding.")
        
        # Ensure the array is C-contiguous
        query_embedding = np.ascontiguousarray(query_embedding)
    
    distances, indices = index.search(query_embedding, top_k)
    return indices[0]

- **Search Function**:
  - Tokenizes the query and generates its embedding.
  - The embedding is converted to a C-contiguous NumPy array.
  - The FAISS index is queried to find the closest document embeddings.
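
The search function returns only indices and discards the distances. When `top_k` exceeds the number of indexed vectors, FAISS fills the unused slots with the index `-1`. The sketch below shows one way to pair results with their documents while skipping those slots. `collect_hits` is a hypothetical helper, and the distances and indices are made-up values standing in for one row of `index.search` output.

```python
def collect_hits(distances, indices, documents):
    """Pair each valid result index with its document text and distance."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            # -1 marks a slot with no result (fewer than top_k vectors indexed).
            continue
        hits.append((documents[int(idx)], float(dist)))
    return hits


docs = ["reset guide", "hard reset guide", "battery check"]
# Made-up result row: two real hits followed by one empty slot.
hits = collect_hits([0.12, 0.40, 3.4e38], [1, 0, -1], docs)
for text, dist in hits:
    print(f"{dist:.2f}  {text}")
```

Keeping the distance next to each document makes it possible to apply a relevance threshold later instead of always trusting the top result.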

In [None]:
# Perform search
query = "How do I reset my SmartWatch?"
if query:
    top_indices = search(query, top_k=1)
    top_documents = [documents[idx] for idx in top_indices]
    print("Response (on Top document):", top_documents[0])
else:
    print("Query is empty.")

- **Search Execution**:
  - Performs a search with a sample query and prints the most relevant document from the index.

#### Complete Code Example

Here is the complete code snippet for semantic search using RAG:

In [None]:
import torch
import faiss
import numpy as np
from transformers import RagTokenizer, RagTokenForGeneration

# Load pre-trained tokenizer and model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq")

# Define some example documents
documents = [
    "To reset your SmartWatch, press and hold the power button for 10 seconds until the logo appears.",
    "If your SmartWatch is unresponsive, try performing a hard reset by pressing and holding the power and home buttons simultaneously.",
    "Check the battery level of your SmartWatch if it is not turning on."
]

# Tokenize documents
inputs = tokenizer(documents, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Generate document embeddings
with torch.no_grad():
    # Use the encoder from the RAG model to encode the documents
    encoder = model.rag.question_encoder
    encoder_outputs = encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
    
    # Take the first output (the pooled embedding).
    # Indexing with [0] works for both tuple and ModelOutput return types.
    encoder_outputs = encoder_outputs[0]
    
    # Check the shape of encoder_outputs
    print(f"Shape of encoder_outputs: {encoder_outputs.shape}")
    
    # Ensure proper dimensions
    if encoder_outputs.dim() == 3:
        doc_embeddings = encoder_outputs[:, 0, :].numpy()
    elif encoder_outputs.dim() == 2:
        # Handle case where the output tensor is already 2D (batch_size, hidden_size)
        doc_embeddings = encoder_outputs.numpy()
    else:
        raise ValueError("Unexpected number of dimensions in encoder_outputs.")
    
    # Ensure the array is C-contiguous
    doc_embeddings = np.ascontiguousarray(doc_embeddings)

# Create FAISS index
dimension = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(doc_embeddings)

# Define a search function
def search(query, top_k=1):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        query_embedding = encoder(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
        
        # Take the first output (the pooled embedding).
        # Indexing with [0] works for both tuple and ModelOutput return types.
        query_embedding = query_embedding[0]
        
        # Check the shape of query_embedding
        print(f"Shape of query_embedding: {query_embedding.shape}")
        
        # Ensure proper dimensions
        if query_embedding.dim() == 3:
            query_embedding = query_embedding[:, 0, :].numpy()
        elif query_embedding.dim() == 2:
            # Handle case where the output tensor is already 2D (batch_size, hidden_size)
            query_embedding = query_embedding.numpy()
        else:
            raise ValueError("Unexpected number of dimensions in query_embedding.")
        
        # Ensure the array is C-contiguous
        query_embedding = np.ascontiguousarray(query_embedding)
    
    distances, indices = index.search(query_embedding, top_k)
    return indices[0]

# Perform search
query = "How do I reset my SmartWatch?"
if query:
    top_indices = search(query, top_k=1)
    top_documents = [documents[idx] for idx in top_indices]
    print("Response (on Top document):", top_documents[0])
else:
    print("Query is empty.")


### Summary
This example shows how to use the question encoder of a RAG model to convert text documents into embeddings and then search them with FAISS. The key steps are tokenizing the input documents, generating embeddings with the encoder, building a FAISS index, and querying that index for the documents most similar to a given query. `IndexFlatL2` compares the query against every stored vector, which works well for small collections; FAISS also provides approximate index types for larger ones.