## Benchmarking a RAG pipeline with Redis and an LLM using LangSmith
This notebook walks through benchmarking a RAG pipeline with LangSmith. The pipeline uses Redis as the vector database and the llama2-70b-chat-hf model as the LLM, served from a Hugging Face TGI endpoint (a vLLM endpoint can be used instead).<br>
LangSmith documentation: https://docs.smith.langchain.com/

In [1]:
#All imports
import os
import uuid
from operator import itemgetter
from typing import Sequence

from langchain_benchmarks import clone_public_dataset, registry
from langchain_community.embeddings import HuggingFaceEmbeddings, HuggingFaceHubEmbeddings
from langchain_community.vectorstores import Redis
from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.chat_models import ChatOpenAI  # needed for the "vllm-openai" method
from langchain.prompts import ChatPromptTemplate
from langchain.schema.document import Document
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable.passthrough import RunnableAssign
from transformers import AutoTokenizer

from langsmith.client import Client
from langchain_benchmarks.rag import get_eval_config


### Configuration parameters

In [None]:
#Configuration parameters

os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "add-your-langsmith-key"  # Your API key

#Vector DB configuration
EMBED_MODEL = "" #Huggingface sentencetransformer model that you want to use. ex. "BAAI/bge-base-en-v1.5"
REDIS_INDEX_NAME = ""  #Name of the index to be created in DB
REDIS_SERVER_URL = ""  #Specify url of your redis server
REDIS_INDEX_SCHEMA = "" #Path to the Redis schema YAML file. The schema defines how text, vectors and the desired metadata are stored for every entry

#Endpoints
TEI_ENDPOINT = "Add your TEI endpoint" #Huggingface TEI endpoint url for Embedding model serving. Make sure TEI is serving the same EMBED_MODEL specified above
TGI_ENDPOINT = "Add your TGI endpoint" #Huggingface TGI endpoint url for LLM serving
VLLM_ENDPOINT = "Add your VLLM endpoint" #vLLM server endpoint (either this or TGI_ENDPOINT should be specified)
METHOD = "" #Set to "tgi-gaudi" to use TGI_ENDPOINT or "vllm-openai" to use VLLM_ENDPOINT

#Test parameters
LANGSMITH_PROJECT_NAME = "" #Test results are displayed in LangSmith under this project name with a short unique suffix appended
CONCURRENCY_LEVEL = 16 #Number of concurrent queries to be sent to RAG chain
LANGCHAIN_DATASET_NAME = 'LangChain Docs Q&A' #Specify the Langchain dataset name (if using a dataset from langchain)

#LLM parameters
LLM_MODEL_NAME = "meta-llama/Llama-2-70b-chat-hf"
MAX_OUTPUT_TOKENS = 128
PROMPT_TOKENS_LEN=214 # Magic number for prompt template tokens. This changes if prompt changes
MAX_INPUT_TOKENS=1024 #Use this and PROMPT_TOKENS_LEN if there is a need to limit input tokens.
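
The `METHOD` setting decides which of the two LLM endpoints is used, and `MAX_INPUT_TOKENS` together with `PROMPT_TOKENS_LEN` determines how many tokens remain for retrieved context. The following is a minimal sketch of both ideas. `resolve_llm_endpoint` is a hypothetical helper that is not part of the notebook, and the URL check is only an assumption about what a usable endpoint looks like.

```python
# Hypothetical helper (not part of the notebook): pick the endpoint that
# METHOD selects and fail early if the setting or the URL is unusable.
def resolve_llm_endpoint(method: str, tgi_endpoint: str, vllm_endpoint: str) -> str:
    endpoints = {"tgi-gaudi": tgi_endpoint, "vllm-openai": vllm_endpoint}
    if method not in endpoints:
        raise ValueError(f"METHOD must be one of {sorted(endpoints)}, got {method!r}")
    url = endpoints[method]
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Endpoint for {method!r} does not look like a URL: {url!r}")
    return url


# Token budget left for retrieved context, using the values configured above.
MAX_INPUT_TOKENS = 1024
PROMPT_TOKENS_LEN = 214
context_budget = MAX_INPUT_TOKENS - PROMPT_TOKENS_LEN
print(context_budget)  # 810
```

Running a check like this before building the chain surfaces a misconfigured `METHOD` immediately, rather than partway through a benchmark run.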

### Selecting dataset
The section below selects and uses the LangChain Docs Q&A dataset.
It can be modified to use any other dataset.

In [3]:
#Langchain supported datasets for Retrieval task
registry = registry.filter(Type="RetrievalTask")
registry

Name,Type,Dataset ID,Description
LangChain Docs Q&A,RetrievalTask,452ccafc-18e1-4314-885b-edd735f17b9d,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Semi-structured Reports,RetrievalTask,c47d9617-ab99-4d6e-a6e6-92b8daf85a7d,Questions and answers based on PDFs containing tables and charts. The task provides the raw documents as well as factory methods to easily index them and create a retriever. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Multi-modal slide decks,RetrievalTask,40afc8e7-9d7e-44ed-8971-2cae1eb59731,This public dataset is a work-in-progress and will be extended over time.  Questions and answers based on slide decks containing visual tables and charts. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer.


In [4]:
#Let's use the LangChain Docs Q&A dataset for our benchmark
langchain_docs = registry[LANGCHAIN_DATASET_NAME]
langchain_docs

0,1
Name,LangChain Docs Q&A
Type,RetrievalTask
Dataset ID,452ccafc-18e1-4314-885b-edd735f17b9d
Description,Questions and answers based on a snapshot of the LangChain python docs. The environment provides the documents and the retriever information. Each example is composed of a question and reference answer. Success is measured based on the accuracy of the answer relative to the reference answer. We also measure the faithfulness of the model's response relative to the retrieved documents (if any).
Retriever Factories,"basic, parent-doc, hyde"
Architecture Factories,conversational-retrieval-qa
get_docs,


In [5]:
#Clone the public dataset into your LangSmith account
clone_public_dataset(langchain_docs.dataset_id, dataset_name=langchain_docs.name)

Dataset LangChain Docs Q&A already exists. Skipping.
You can access the dataset at https://smith.langchain.com/o/9534e90b-1d2b-55ed-bf79-31dc5ff16722/datasets/3ce3b4a1-0640-4fbf-925e-2c03caceb5ac.


### Ingesting data into Redis vector DB
This section only needs to be run if the Redis server does not already contain the data.

In [None]:
#Embedding model for ingestion 
embedder = HuggingFaceEmbeddings(model_name=EMBED_MODEL)

In [None]:
#Ingest the dataset into vector DB
#`docs` was not defined in the original cell; this assumes the documents
#are loaded with the get_docs loader of the selected benchmark task
docs = list(langchain_docs.get_docs())
_ = Redis.from_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
    embedding=embedder,
    index_name=REDIS_INDEX_NAME,
    index_schema=REDIS_INDEX_SCHEMA,
    redis_url=REDIS_SERVER_URL,
)

### RAG pipeline
Initialize each component of the RAG pipeline and set up the chain.

In [6]:
#Use the TEI endpoint for embeddings to sustain high-throughput queries
embedder =  HuggingFaceHubEmbeddings(model=TEI_ENDPOINT)

In [7]:
#Initialize retriever to be added in langchain RAG chain.
vectorstore = Redis.from_existing_index(
    embedding=embedder, index_name=REDIS_INDEX_NAME, schema=REDIS_INDEX_SCHEMA, redis_url=REDIS_SERVER_URL
)
retriever = vectorstore.as_retriever()
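
Conceptually, the retriever embeds the question and returns the stored entries whose vectors are closest to it. The toy sketch below illustrates that idea in plain Python with cosine similarity. It is not how Redis executes the search (Redis uses the index and distance metric defined in the schema), and the vectors, texts, and `k` value are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "index": (text, embedding) pairs with made-up 3-dimensional vectors.
toy_index = [
    ("doc about retrievers", [0.9, 0.1, 0.0]),
    ("doc about prompts", [0.1, 0.9, 0.0]),
    ("doc about agents", [0.0, 0.2, 0.9]),
]

print(top_k([1.0, 0.2, 0.0], toy_index, k=2))
# ['doc about retrievers', 'doc about prompts']
```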

**Note:** The prompt is specific to the dataset. Modify it to match the dataset selected.<br>
The prompt below is for the LangChain Docs Q&A dataset.

In [8]:
#Setup prompt

#helper function to crop input tokens from retrieved doc from vector DB
#This can be used in format_docs function if there is a need to make sure
#number of input tokens doesn't exceed certain limit
tokenizer = AutoTokenizer.from_pretrained(LLM_MODEL_NAME)
def crop_tokens(prompt, max_len):
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs_cropped = inputs['input_ids'][0:,0:max_len]
    prompt_cropped=tokenizer.batch_decode(inputs_cropped, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return prompt_cropped

# After the retriever fetches documents, this
# function formats them in a string to present for the LLM
def format_docs(docs: Sequence[Document]) -> str:
    formatted_docs = []
    for i, doc in enumerate(docs):
        doc_string = (
            f"<document index='{i}'>\n"
            f"<source>{doc.metadata.get('source')}</source>\n"
            f"<doc_content>{doc.page_content}</doc_content>\n"
            "</document>"
        )
        # Truncate each retrieved document so it fits the input token budget
        cropped = crop_tokens(doc_string, MAX_INPUT_TOKENS - PROMPT_TOKENS_LEN)  # append doc_string instead if there is no need to limit input tokens to the LLM
        formatted_docs.append(cropped)
    formatted_str = "\n".join(formatted_docs)
    return f"<documents>\n{formatted_str}\n</documents>"

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an AI assistant answering questions about LangChain."
            "\n{context}\n"
            "Respond solely based on the document content.",
        ),
        ("human", "{question}"),
    ]
)
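
To make the shape of the `{context}` string concrete, here is a self-contained sketch that reproduces the formatting without LangChain or the Llama tokenizer. `FakeDocument` and `crop_words` are hypothetical stand-ins: the former mimics the two attributes that `format_docs` reads, and the latter crops on whitespace-separated words rather than real tokens. Unlike the cell above, this variant crops only the document body so the closing tags stay intact, which is a design choice of the sketch rather than the notebook's behavior.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    # Stand-in for langchain's Document (page_content + metadata only).
    page_content: str
    metadata: dict = field(default_factory=dict)


def crop_words(text: str, max_len: int) -> str:
    # Stand-in for crop_tokens: whitespace "tokens" instead of tokenizer ids.
    return " ".join(text.split()[:max_len])


def format_docs_sketch(docs, max_len):
    formatted = []
    for i, doc in enumerate(docs):
        body = crop_words(doc.page_content, max_len)
        formatted.append(
            f"<document index='{i}'>\n"
            f"<source>{doc.metadata.get('source')}</source>\n"
            f"<doc_content>{body}</doc_content>\n"
            "</document>"
        )
    return "<documents>\n" + "\n".join(formatted) + "\n</documents>"


docs = [
    FakeDocument("LCEL composes runnables with the pipe operator", {"source": "a.md"})
]
print(format_docs_sketch(docs, max_len=3))
# <documents>
# <document index='0'>
# <source>a.md</source>
# <doc_content>LCEL composes runnables</doc_content>
# </document>
# </documents>
```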

In [9]:
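#Setup LLM
#Note: the match statement requires Python 3.10 or newer

llm = None
match METHOD:
    case "tgi-gaudi":
        llm = HuggingFaceEndpoint(
            endpoint_url=TGI_ENDPOINT,
            max_new_tokens=MAX_OUTPUT_TOKENS,
            top_k=10,
            top_p=0.95,
            typical_p=0.95,
            temperature=1.0,
            repetition_penalty=1.03,
            streaming=False,
            truncate=1024,
        )
    case "vllm-openai":
        #Requires ChatOpenAI to be imported (e.g. from langchain_community.chat_models)
        llm = ChatOpenAI(
            model=LLM_MODEL_NAME,
            openai_api_key="EMPTY",
            openai_api_base=VLLM_ENDPOINT,
            max_tokens=MAX_OUTPUT_TOKENS,
            temperature=1.0,
            top_p=0.95,
            streaming=False,
            frequency_penalty=1.03,
        )
    case _:
        #Fail early instead of building a chain around llm=None
        raise ValueError(f'Unsupported METHOD: {METHOD!r}; expected "tgi-gaudi" or "vllm-openai"')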

response_generator = (prompt | llm | StrOutputParser()).with_config(
    run_name="GenerateResponse",
)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
# This is the final response chain.
# It fetches the "question" key from the input dict,
# passes it to the retriever, then formats as a string.

chain = (
    RunnableAssign(
        {
            "context": (itemgetter("question") | retriever | format_docs).with_config(
                run_name="FormatDocs"
            )
        }
    )
    # The "RunnableAssign" above returns a dict with keys
    # question (from the original input) and
    # context: the string-formatted docs.
    # This is passed to the response_generator above
    | response_generator
)
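
The data flow through this chain can be pictured with ordinary functions: the input dict keeps its `question` key, gains a `context` key, and the combined dict is handed to the response generator. The sketch below mirrors that flow with hypothetical stand-ins (`fake_retrieve`, `fake_format`, `fake_generate`). It does not use LangChain and is only meant to show which keys exist at each step.

```python
def assign_context(inputs: dict, retrieve, format_docs) -> dict:
    # Conceptually what RunnableAssign does here: keep the incoming keys
    # and add a computed "context" key.
    context = format_docs(retrieve(inputs["question"]))
    return {**inputs, "context": context}


def fake_retrieve(question: str) -> list:
    # Stand-in for the retriever: returns one made-up passage.
    return [f"passage related to: {question}"]


def fake_format(docs: list) -> str:
    # Stand-in for format_docs: join passages into one string.
    return "\n".join(docs)


def fake_generate(payload: dict) -> str:
    # Stand-in for prompt | llm | parser: shows which keys it receives.
    return f"Q={payload['question']} | CTX={payload['context']}"


result = fake_generate(
    assign_context({"question": "What is LCEL?"}, fake_retrieve, fake_format)
)
print(result)
# Q=What is LCEL? | CTX=passage related to: What is LCEL?
```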

### Set up and run the LangSmith benchmark

In [11]:
#Initialize LangSmith client
client = Client()

In [None]:
# Generate a unique run ID for this experiment
run_uid = uuid.uuid4().hex[:6]

#Run the test
test_run = client.run_on_dataset(
    dataset_name=LANGCHAIN_DATASET_NAME,
    llm_or_chain_factory=chain,
    evaluation=None,
    project_name=LANGSMITH_PROJECT_NAME+'_'+run_uid,
    project_metadata={
        "index_method": "basic",
    },
    concurrency_level=CONCURRENCY_LEVEL,
    verbose=True,
)
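
`CONCURRENCY_LEVEL` caps how many dataset examples are sent to the chain at the same time. The sketch below illustrates that idea with a thread pool and a stand-in chain, and shows how the project name receives its six-character suffix. It is not how LangSmith implements `run_on_dataset` internally. `fake_chain`, `run_concurrently`, and the `rag-benchmark` prefix are hypothetical names used only for illustration.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


def fake_chain(example: dict) -> str:
    # Stand-in for the RAG chain: returns a canned answer.
    return f"answer to {example['question']}"


def run_concurrently(examples, chain_fn, concurrency_level):
    # At most `concurrency_level` examples are processed at once;
    # pool.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=concurrency_level) as pool:
        return list(pool.map(chain_fn, examples))


# Project name with a short unique suffix, as in the cell above.
project_name = "rag-benchmark" + "_" + uuid.uuid4().hex[:6]

examples = [{"question": f"q{i}"} for i in range(5)]
answers = run_concurrently(examples, fake_chain, concurrency_level=2)
print(project_name)
print(answers)
```

Because the suffix is regenerated on every run, repeated benchmarks with the same `LANGSMITH_PROJECT_NAME` do not collide in LangSmith.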