In this notebook, we will learn how to:

💻 Develop a retrieval augmented generation (RAG) based LLM application from scratch.

🚀 Scale the major workloads (load, chunk, embed, index, serve, etc.) across multiple workers with different compute resources.

✅ Evaluate different configurations of our application to optimize for both per-component (ex. retrieval_score) and overall performance (quality_score).

🔀 Implement a hybrid agent routing approach between OSS and closed LLMs to create the most performant and cost-effective application.

📦 Serve the application in a highly scalable and available manner.

💡 Learn how methods like fine-tuning, prompt engineering, lexical search, reranking, data flywheel, etc. impact our application's performance.

### An overview of RAG

Retrieval-Augmented Generation can vary in its implementation, but at a conceptual level, using RAG in an AI-based application involves the following steps:

* The user inputs a question.

* The system searches for relevant documents that might answer the question. These documents often consist of proprietary data, and are stored in some kind of document index.

* The system creates an LLM prompt that combines the user input, the related documents, and instructions for the LLM to answer the user’s question using the documents provided.

* The system sends the prompt to an LLM.

* The LLM returns an answer to the user’s question, grounded by the context we provided. This is the output of our system.

Here’s a diagram of this general idea:

In this guide, we're going to build a RAG-based LLM application where we will incorporate external data sources to augment our LLM’s capabilities. Specifically, we will be building an assistant that can answer questions about our own products. The goal here is to make it easier for employees and customers to ask questions about specific products. We’ll also share challenges we faced along the way and how we overcame them.

**Note:** We have generalized this entire guide so that it can easily be extended to build RAG-based LLM applications on top of your own data. Here’s an overview of the architecture presented in the paper:

<p align="center">
  <img src="./pics/bea-rag1.png" alt="drawing" width="600"/>
</p>

We’ll look at each piece of this architecture in depth throughout this post. At a high level, the proposed structure is composed of two components: a `retriever` and a `generator`. The retriever component transforms the input text into a sequence of floating point numbers (a vector) using the query encoder, transforms each document in the same way using the document encoder, and stores the document encodings in a search index. It then searches the search index for document vectors that are related to the input vector, converts the document vectors back into text, and returns their text as output. The generator then takes the user’s input text and matched documents, combines them into a prompt, and asks an LLM for a reply to the user’s input given the information in the documents. The output of that LLM is the output of the system.
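As a rough illustration of how the retriever and the generator fit together, here is a minimal, self-contained sketch. It is not the implementation used later in this notebook: `embed` is a toy bag-of-words encoder standing in for a real embedding model, the sample documents are made up, and the prompt is only built, not sent to an LLM.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Retriever: encode the query and the documents, return the k most similar texts."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Generator input: instructions, retrieved documents, and the user question."""
    sources = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the sources below.\n"
        f"Sources:\n{sources}\n"
        f"Question: {query}\nAnswer:"
    )


# Made-up sample documents, for illustration only.
documents = [
    "The backpack has a capacity of 60 liters.",
    "The tent sleeps four people.",
    "Our hiking boots are waterproof.",
]
query = "What is the capacity of the backpack?"
top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

In a real system the prompt would then be passed to an LLM, and its reply would be the output of the application.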

### Types of search
#### Keyword search
The simplest way to find documents that are related to a user query is to do a “keyword search” (also known as “full-text search”). Keyword search uses the exact terms in the user input text to search an index for documents with matching text. The matching is done based on text only, with no vectors involved. Even though this technique has been around for a while, it’s still relevant today. This type of search is very useful when you’re searching for user IDs, product codes, addresses, and any other data that requires high precision matches. Here’s a high-level diagram of this implementation:

<p align="center">
  <img src="./pics/bea-rag2.png" alt="drawing" width="600"/>
</p>
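The sketch below shows one simple way keyword search can work: an inverted index that maps each term to the documents containing it. It is only an illustration under simplifying assumptions (made-up documents, plain term counting instead of a production ranking function such as BM25).

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def keyword_search(query: str, index: dict[str, set[str]]) -> list[str]:
    """Rank documents by how many distinct query terms they contain."""
    hits: dict[str, int] = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, set()):
            hits[doc_id] += 1
    # Most matching terms first; ties are broken by document id.
    return sorted(hits, key=lambda d: (-hits[d], d))


# Made-up documents, for illustration only.
documents = {
    "doc1": "Order A1234 shipped from warehouse 7",
    "doc2": "Order B5678 is awaiting payment",
    "doc3": "Spacious home with ocean view",
}
index = build_index(documents)

exact_hits = keyword_search("status of order A1234", index)
no_hits = keyword_search("apartment close to the sea", index)
print(exact_hits)  # ['doc1', 'doc2']
print(no_hits)     # []
```

The second query returns nothing because none of its words occur literally in the documents, even though `doc3` is related in meaning.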


#### Vector search

“Vector search” (also known as “dense retrieval”) differs from keyword search in the sense that it can find matches when no search terms are present in the documents, but the general ideas are similar. For example, suppose you’re building a chatbot to support a property rental site. If the user asks, “Do you have recommendations for a spacious apartment close to the sea?” and the document for a particular property contains the text “4000 sq ft home with ocean view,” keyword search won’t identify that as a match, but vector search will. Vector search works best when we’re searching unstructured text for general ideas, rather than precise keywords.

Here’s a high-level overview of RAG with vector search:

<p align="center">
  <img src="./pics/bea-rag3.png" alt="drawing" width="600"/>
</p>
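To make the idea concrete, here is a small sketch that ranks documents by cosine similarity between vectors. The three-dimensional vectors are invented by hand for this example; a real system would obtain them from an embedding model, with far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings; the axes loosely mean (size, near water, price).
doc_vectors = {
    "4000 sq ft home with ocean view": [0.9, 0.8, 0.3],
    "Small studio in the city centre": [0.1, 0.0, 0.5],
    "Budget room near the train station": [0.2, 0.1, 0.9],
}
# Hypothetical embedding of "spacious apartment close to the sea".
query_vector = [0.8, 0.9, 0.2]

ranked = sorted(
    doc_vectors,
    key=lambda doc: cosine_similarity(query_vector, doc_vectors[doc]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)  # 4000 sq ft home with ocean view
```

The query and the best match share no keywords; they are close only because their vectors point in a similar direction.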

#### Hybrid search

Hybrid search consists of the simultaneous use of keyword and vector search. For example, let’s consider a scenario where you have a customer ID and a text input query, and you want to do a search that captures both the high-precision of the customer ID and the general meaning of the user text. This is the perfect scenario for hybrid search. Hybrid search performs the two types of search separately, and then it combines the outputs using an algorithm that selects the best results of each technique. This method is used often in the industry, especially in more complex applications.
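The text above does not name the algorithm used to combine the two result lists. One common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption rather than as the method used in this notebook. Each document earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The document ids and the constant `k = 60` are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the two searches, best match first.
keyword_results = ["doc_a", "doc_b", "doc_c"]
vector_results = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents found by both searches (`doc_a`, `doc_c`) rise above documents found by only one.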

#### Semantic ranking

Semantic ranking (also known as “rerank”) is an optional step that follows the retrieval of documents. The retrieval step does its best to rank the returned documents based on how relevant they are to the user’s query, but the semantic ranking step can often improve on that result. It takes a subset of the documents returned by the retrieval, computes higher quality relevance scores using an LLM that is trained just for that task, and reranks the documents based on those scores.

<p align="center">
  <img src="./pics/bea-rag4.png" alt="drawing" width="600"/>
</p>

In the image above I show semantic ranking combined with vector search, but you could easily combine it with keyword search instead. I decided to show semantic ranking with vector search in the diagram because that’s the solution I’ve implemented in the code for this post, which we’ll look at next.
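The two-stage shape of retrieval followed by reranking can be sketched as follows. The scorer here is a deliberately naive placeholder (word overlap) standing in for the trained ranking model described above; the candidate texts are made up.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 2,
) -> list[str]:
    """Rescore the retrieved candidates and keep the top_n best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)


# Pretend these came back from the retrieval step, in retrieval order.
candidates = [
    "tents for family camping",
    "large backpack with 60 liter capacity",
    "small backpack for day hikes",
]
reranked = rerank("large backpack", candidates, overlap_score, top_n=2)
print(reranked)
```

Swapping `overlap_score` for a call to a real ranking model would leave the rest of the function unchanged.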

### RAG Implementation 

First, load the required libraries and set the configuration needed to initialize an Azure Cognitive Search index with our custom data, using vector search and semantic ranking. To run this code, you must already have an Azure Cognitive Search resource and an Azure OpenAI resource created in Azure.

In [2]:
import os
from dotenv import load_dotenv
from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from langchain.vectorstores.utils import Document
from langchain.text_splitter import TokenTextSplitter
from langchain.document_loaders import TextLoader

# Load settings from a local .env file into the environment, if one exists.
load_dotenv()

# Config for Azure Search.
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_KEY = os.getenv("AZURE_SEARCH_KEY")
AZURE_SEARCH_INDEX_NAME = "products-index-1"

# Config for Azure OpenAI.
OPENAI_API_TYPE = "azure"
OPENAI_API_BASE = os.getenv("OPENAI_API_BASE")
OPENAI_API_VERSION = "2023-07-01-preview"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

DATA_DIR = "data/"

In [3]:
print(AZURE_SEARCH_ENDPOINT)
# Do not print AZURE_SEARCH_KEY: cell outputs are saved with the notebook and would expose the secret.

https://bea-azsearch.search.windows.net


In [4]:
# print(OPENAI_API_VERSION)
print(OPENAI_API_BASE)
# print(OPENAI_API_KEY)


https://trefoil.openai.azure.com/


Instantiate the Azure OpenAI models and test the connection to the service.

In [None]:
# Import Azure OpenAI
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage, SystemMessage, AIMessage
from langchain.embeddings import OpenAIEmbeddings

llm4 = AzureChatOpenAI(
    openai_api_base= OPENAI_API_BASE,
    openai_api_version= OPENAI_API_VERSION,
    deployment_name="gpt-4-32k",
    temperature=0,
    openai_api_key= OPENAI_API_KEY ,
    openai_api_type = OPENAI_API_TYPE,
)

llm3 = AzureChatOpenAI(
    openai_api_base= OPENAI_API_BASE,
    openai_api_version= OPENAI_API_VERSION,
    deployment_name="gpt-35-turbo-16k",
    temperature=0,
    openai_api_key= OPENAI_API_KEY,
    openai_api_type = OPENAI_API_TYPE,
)

llm = AzureOpenAI(
    openai_api_base= OPENAI_API_BASE,
    openai_api_version= OPENAI_API_VERSION,
    deployment_name="gpt-35-turbo-instruct",
    temperature=0,
    openai_api_key= OPENAI_API_KEY,
    openai_api_type = OPENAI_API_TYPE,
)

# Embedding model (OpenAIEmbeddings is already imported at the top of this cell).
# No temperature is set here: sampling temperature does not apply to embeddings.
embeddings = OpenAIEmbeddings(
    openai_api_base=OPENAI_API_BASE,
    openai_api_version=OPENAI_API_VERSION,
    deployment_name="text-embedding-ada-002",
    openai_api_key=OPENAI_API_KEY,
    openai_api_type=OPENAI_API_TYPE,
    chunk_size=1,  # number of texts sent per embedding request
)
#llm("What day comes after Friday?")


In [6]:
llm4(
    [
        SystemMessage(content="You are a nice AI bot that helps a user figure out what to eat in one short sentence"),
        HumanMessage(content="I like tomatoes, what should I eat?")
    ]
)


AIMessage(content='How about a fresh Caprese salad with ripe tomatoes, mozzarella, and basil?')

In [7]:
llm3(
    [
        SystemMessage(content="You are a nice AI bot that helps a user figure out where to travel in one short sentence"),
        HumanMessage(content="I like the beaches where should I go?"),
        AIMessage(content="You should go to Nice, France"),
        HumanMessage(content="What else should I do when I'm there?")
    ]
)

AIMessage(content='You should explore the charming Old Town, visit the stunning Promenade des Anglais, and indulge in delicious French cuisine.')

In [8]:

#llm("What day comes after Friday?")
llm("Tell me a joke")

"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired."

In [9]:
text = "This is just a test"
e = embeddings.embed_query(text)
print(e)
print(len(e))

[-0.010031012400679239, 0.002363352705139523, 0.009305801786561171, -0.0045502160644454335, -0.020395751193564623, 0.013875271547451488, -0.03111346681547216, -0.01193709714941775, -0.006584657082058519, -0.03855810898977913, 0.01369557286415934, 0.003988659308971903, -0.0036902316706888885, 0.0028944251785389965, 0.004373726718611821, 0.0013196590040055094, 0.021371255670514764, -0.004973790401313851, -0.0070274846497105515, -0.008432982138446184, 0.005487214079828343, 0.001197720827971425, -0.01513315953652893, 0.031909274471775224, -0.032422697684628445, -0.015030474707693779, 0.015197336972474317, -0.028726047571853118, 0.012245151635923206, -0.006369660808996407, 0.014991967314804013, -0.00795806538716019, 0.015312857288498545, -0.034348035664150574, -0.004717078794887239, -0.018406233915452038, -0.009902656597465932, -0.022385266609032135, 0.01358005254813511, -0.017713112019306664, -0.0020167919898974704, 0.024118070418073033, -0.00788746927629773, -0.028212624358999894, 0.00401

#### Connect to Azure Cognitive Search

In [10]:
# Connect to Azure Cognitive Search
acs = AzureSearch(azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
                 azure_search_key=AZURE_SEARCH_KEY,
                 index_name=AZURE_SEARCH_INDEX_NAME,
                 embedding_function=embeddings.embed_query)

#### Chunk data

In [11]:
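# Load every Markdown file in ./data/ as plain text
loader = DirectoryLoader('./data/', glob="*.md", loader_cls=TextLoader, loader_kwargs={'autodetect_encoding': True})
documents = loader.load()
# Split into chunks of up to 6000 tokens, with 100 tokens shared between neighbouring chunks
text_splitter = TokenTextSplitter(chunk_size=6000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
# Embed the chunks and add them to Azure Search; returns the ids of the indexed chunks
acs.add_documents(documents=docs)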

['ODZiZGY2NzEtMDNiZC00OTc2LWE0MGUtMGI5NzMwOGNkZGEw',
 'NGNmYzFhZGMtZWVjYS00Yjg5LThiOTItMGVkMmFlYTI0ZTli',
 'NWY4MTdlMDYtNzA0Ni00M2I3LTg0NzItNjFlNDFhNTNiMTlj',
 'MjI4MzA2MzAtNDRjYy00NjYwLTk0ZDUtYjhkNzBlY2ZhZWE4',
 'NTgwM2ViODMtMWEyZS00NzhhLTkzZTctZGNkNjcxNjhlZmQ1',
 'MGUzYzc5ZWMtODJjNC00ZmQwLWI5YTYtZjFlYTU0ZWYxZWFi',
 'OTE0N2M2MmEtMmQ4Mi00NzZmLTgzMzMtOTYyYmZmMzc0MTRh',
 'YTRhYWE4N2UtYzkwNS00MGNiLWJjYjUtMGNhZGFlMDk0Njcy',
 'Zjc1YjViN2ItNTliOS00OGU1LWJhYjgtZTgxZDJmN2EyMmZl',
 'MWQ2YzU2MzQtMDFlOC00MzAzLWJkNzItNzUyNGEzOWNiYjUx',
 'OWNhNTMwOTEtMWZlMS00NDEwLWI5NDYtM2RhY2Y4YzU5YThm',
 'MGVmZWViMzYtNzE2Mi00YmIxLWI0YTktZDA3MTU3MGZmZjFm',
 'MTM4NWQxYTQtZDJhZC00NTNmLTgyNjEtMzYyNjRkNDE1MzAy',
 'NzY3MmI0MTMtNzE2Mi00M2VmLWFiNGQtZjk2Y2VhOGE1NDFl',
 'ZjYzODZhYTItNDMzNy00MmQwLWFmMjUtNDVhZjc0MWJiODdk',
 'YWRjYTBkMWQtNjY5ZS00YmJjLWIzZmYtZThiYTg0NWYxZGZj',
 'NTVkYTg0NjktYmY2Ny00MjczLWJlNDItZjgyODNmZTY2YWI1',
 'ZDA1OTNjNzctMWEzNS00M2RiLTljZWItMjFhOWZiZjY4MGJm',
 'ODdjZGZkNmUtOTBiYi00M2YxLThhZDctMjZiZTY1YzBj
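To show what `chunk_size` and `chunk_overlap` mean, here is a small sliding-window sketch. It is not LangChain's implementation: it splits on whitespace instead of using a real tokenizer, and `split_with_overlap` is a hypothetical helper written for this example.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace-separated words stand in for tokenizer tokens.
tokens = "a b c d e f g h i j".split()
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary is less likely to lose its context.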

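#### Embed, index, and store the data

In [12]:
# Caution: the previous cell already added these chunks to the index. Running
# add_documents again indexes every chunk a second time, which is the likely
# reason each source is listed twice in the query results further down.
acs.add_documents(documents=docs)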

['ZWI2OGY4NTctNzQxNy00YWYxLTkxYTQtMGJhZmI4YWJiODVl',
 'OGZhMjU1YTItNGQ1Zi00MGI3LWEzOWMtN2EyMjk2OThiYzc1',
 'OGJkN2U4ZDYtZjZmYy00OTBiLWE1NWYtNGZkNzg3NDZmYjNl',
 'OWVkMTZlZGMtYWFmOC00NjcxLWFlNDQtYTE1NmY5N2IwOWQz',
 'NTg1ZmM1NGItNTM4Mi00MzRiLThlYzUtYmUzY2I4MGExYWZj',
 'Mjg4MTY4MDQtMDI1My00MmE2LWE4NjUtN2ZjYmZiNTkyNDFh',
 'ZjY2YmJkZDctNTFlNC00Nzc5LTljYjMtNjUwZGUyZmMzYWNk',
 'MjM1OWIxZDMtMDFhZC00ZDE1LTk5ODAtNGM3ZjFiOTBiNzI0',
 'M2U5MmRiMTgtY2IwNC00YWM0LTlmN2ItMTIxNzA5YTVhZmI2',
 'Zjg4MDUzNmYtNzY3My00Y2NiLTg1ODEtMjkyNzk5Zjk5ZGUy',
 'NDM2MDU5M2YtNDg0ZS00ZWE1LWI4MmQtYWRlNTliOGFmNmMy',
 'ZTA0ZjgxZDgtOGYxNS00MjU2LTljMWMtNjM0ZTg0MjhjMGYw',
 'NGZkZjNhYTEtOWNhMC00NjAwLWIzNjYtOGEyMjFkYzA3ZWVi',
 'MTZjMDE1OWMtNjM3My00NzNiLTkzZDUtMDlhOWEzNDZmM2Nk',
 'M2M2NjE2MTgtZjA1OC00NTgyLTk1ZjktYjJkMzQ2M2ViYTQ3',
 'NjYwMDY5OGYtOWM1Zi00YzRiLThmNzYtM2E0YzNlNDA3YzY1',
 'ZTA3MTM0YzctZTFiZS00ZjQ0LWJhMGQtMmFmODVkMzk5MDFh',
 'ZTc0MDRlZDYtODUyOS00OTdmLWI4ZjEtZWQzNzc4NWUxYzE0',
 'NDQ1ODc0ZDktODE3MC00ZGE5LThhNDEtMDMwOWZlOThm

#### Make a retriever

In [None]:
retriever = acs.as_retriever()
testdocs = retriever.get_relevant_documents(query="What is the best way to cook a steak?")
len(testdocs)

In [15]:
retriever = acs.as_retriever(search_kwargs={"k": 2})

In [16]:
retriever.search_type

'similarity'

In [17]:
retriever.search_kwargs

{'k': 2}

In [18]:
from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import PromptTemplate

# Adapt if needed
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template("""Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:""")

qa = ConversationalRetrievalChain.from_llm(llm=llm,
                                           retriever=retriever,
                                           condense_question_prompt=CONDENSE_QUESTION_PROMPT,
                                           return_source_documents=True,
                                           verbose=False)

In [19]:
## Cite sources
def process_llm_response(llm_response):
    print(llm_response['result'])
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])


In [20]:
from langchain.chains import RetrievalQA

query = "I need a large backpack. Which one do you recommend?"

# chain_type="stuff" places all retrieved chunks into a single prompt
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=acs.as_retriever(),
                                       return_source_documents=True,
                                       verbose=False)

# process_llm_response is defined in the previous cell
llm_response = qa_chain(query)
process_llm_response(llm_response)


 Based on the information provided, the SummitClimber Backpack (item_number: 9) would be a suitable option as it has a capacity of 60 liters and multiple compartments for organized storage. It also has an ergonomic design for comfortable carrying and is made of durable nylon material.


Sources:
data\product_info_2.md
data\product_info_2.md
data\product_info_9.md
data\product_info_9.md


### Using the ConversationalRetrievalChain from LangChain

In [21]:
# Adapt if needed
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template("""Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:""")

qa = ConversationalRetrievalChain.from_llm(llm=llm,
                                           retriever=acs.as_retriever(),
                                           condense_question_prompt=CONDENSE_QUESTION_PROMPT,
                                           return_source_documents=True,
                                           verbose=False)

query = "I need a large backpack. Which one do you recommend?"
chat_history = []
result = qa({"question": query, "chat_history": chat_history})

print("Question:", query)
print("Answer:", result["answer"])


Question: I need a large backpack. Which one do you recommend?
Answer:  Based on the information provided, the SummitClimber Backpack (item_number: 9) would be a suitable option as it has a capacity of 60 liters and multiple compartments for organized storage. It also has an ergonomic design for comfortable carrying and is made of durable nylon material.
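The condense step only matters once there is a chat history. The sketch below fills the same template with plain Python string formatting to show what the LLM would see for a follow-up question. The `format_history` helper, its `Human:`/`Assistant:` labels, and the follow-up question are illustrative assumptions, not LangChain internals.

```python
# Same wording as CONDENSE_QUESTION_PROMPT above, as a plain Python string.
CONDENSE_TEMPLATE = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""


def format_history(chat_history: list[tuple[str, str]]) -> str:
    """Render (question, answer) pairs as text; the speaker labels are illustrative."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


chat_history = [
    (
        "I need a large backpack. Which one do you recommend?",
        "The SummitClimber Backpack, which has a capacity of 60 liters.",
    )
]
follow_up = "Is it made of a durable material?"

condense_prompt = CONDENSE_TEMPLATE.format(
    chat_history=format_history(chat_history),
    question=follow_up,
)
print(condense_prompt)
```

With the chain above, appending `(query, result["answer"])` to `chat_history` before the next call should give it this kind of history to work with.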


### Evaluation
So far, we've chosen typical/arbitrary values for the various parts of our RAG application. But if we change something, such as our chunking logic, embedding model, or LLM, how can we know that the new configuration is better than the previous one? A generative task like this is very difficult to assess quantitatively, so we need to develop reliable ways to do so.

<p align="center">
  <img src="./pics/rag-concept.png" alt="drawing" width="500"/>
</p>


In general, because RAG apps have several moving parts (see the previous picture), we need to perform both *unit/component* and *end-to-end evaluation*. *Component-wise* evaluation can involve evaluating our retrieval in isolation (is the best source in our set of retrieved chunks?) and evaluating our LLM's response (given the best source, is the LLM able to produce a quality answer?). For *end-to-end evaluation*, we can assess the quality of the entire system (given the data sources, what is the quality of the response?).


<p align="center">
  <img src="./pics/rag-eval.png" alt="drawing" width="500"/>
  <center><em>Component evaluations (left) of retrieval system and LLM. Overall evaluation (right).</em></center>
  
</p>


####  How to create an evaluator
Given a response to a query and relevant context, an evaluator should be a trusted way to score/assess the quality of the response. But before we can determine our evaluator, we need a dataset of questions and the source where the answer comes from. We can use this dataset to ask our different evaluators to provide an answer and then rate their answer (ex. score between 1-5). We can then inspect this dataset to determine if our evaluator is unbiased and has sound reasoning for the scores that are assigned.

**Note**: Here, we’re evaluating the ability of our LLM to generate a response given the relevant context. This is a component-level evaluation (`quality_score` (LLM)) because we aren’t using retrieval to fetch the relevant context.

The evaluation process starts with the manual creation of a validation dataset. Ideally, the validation dataset consists of a list of queries, each paired with the ideal source for answering it.
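Given such a dataset, a retrieval score can be computed as the fraction of queries for which the ideal source appears among the retrieved sources. The sketch below assumes a hypothetical dataset layout (`question` and `source` keys) and uses a fake retriever so that it runs on its own; in the notebook, the retriever call would go to the search index instead.

```python
from typing import Callable


def retrieval_score(validation_set: list[dict], retrieve_sources: Callable[[str], list[str]]) -> float:
    """Fraction of questions whose ideal source is among the retrieved sources."""
    hits = 0
    for example in validation_set:
        sources = retrieve_sources(example["question"])
        if example["source"] in sources:
            hits += 1
    return hits / len(validation_set)


# Hypothetical validation examples: a query and the ideal source to answer it.
validation_set = [
    {"question": "Which backpack is large?", "source": "product_info_9.md"},
    {"question": "Which tent sleeps four?", "source": "product_info_1.md"},
]


def fake_retriever(question: str) -> list[str]:
    """Stand-in retriever that always returns the same two sources."""
    return ["product_info_2.md", "product_info_9.md"]


score = retrieval_score(validation_set, fake_retriever)
print(score)  # 0.5
```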

##### Evaluator set
We may not always have a prepared dataset of questions and the best source to answer each question readily available. To address this cold start problem, we could use an LLM to look at our text chunks and generate questions that the specific chunk would answer. This provides us with quality questions and the exact source the answer is in. However, this dataset generation method could be a bit noisy. The generated questions may not always align closely with what our users may ask. And the information in the chunk we label as the best source may also appear in other chunks. Nonetheless, this is a great way to start our development process while we collect and manually label a high-quality dataset.
<p align="center">
  <img src="./pics/rag-eval-qa.png" alt="drawing" width="500"/>
  </p>
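A minimal sketch of this generation loop is shown below. The prompt wording, the `build_eval_set` helper, and the chunk layout (`text` and `source` keys) are assumptions made for illustration, and `fake_llm` replaces a real model so the example runs offline. Any callable that maps a prompt string to a completion string could be passed in its place.

```python
from typing import Callable

# Hypothetical prompt; adjust the wording for your own data.
QUESTION_PROMPT = (
    "Write one question that can be answered using only the text below.\n"
    "Text:\n{chunk}\nQuestion:"
)


def build_eval_set(chunks: list[dict], generate: Callable[[str], str]) -> list[dict]:
    """Pair each generated question with the source of the chunk it came from."""
    eval_set = []
    for chunk in chunks:
        question = generate(QUESTION_PROMPT.format(chunk=chunk["text"])).strip()
        eval_set.append({"question": question, "source": chunk["source"]})
    return eval_set


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; always returns the same question."""
    return " What is the capacity of the backpack?\n"


chunks = [
    {"text": "The backpack has a capacity of 60 liters.", "source": "product_info_9.md"},
]
eval_set = build_eval_set(chunks, fake_llm)
print(eval_set)
```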

##### Experiments

With our evaluator set, you can start experimenting with the various components in your LLM application. While we could run this as one large hyperparameter tuning experiment that searches across promising combinations of values/decisions, you can instead start by evaluating one decision at a time and carry the best value forward into the next experiment.
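The one-decision-at-a-time strategy can be sketched as a simple greedy loop. Everything here is hypothetical: the parameter names, the candidate values, and `fake_evaluate`, which stands in for running the application on the evaluator set and returning a score.

```python
from typing import Callable


def tune_one_at_a_time(
    search_space: dict[str, list],
    defaults: dict,
    evaluate: Callable[[dict], float],
) -> dict:
    """Tune each setting in turn, keeping the best value found so far."""
    best = dict(defaults)
    for name, options in search_space.items():
        scored = [(evaluate({**best, name: option}), option) for option in options]
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best


# Hypothetical decisions and candidate values.
search_space = {"chunk_size": [300, 500, 700], "k": [1, 3, 5]}
defaults = {"chunk_size": 300, "k": 1}


def fake_evaluate(config: dict) -> float:
    """Made-up score that peaks at chunk_size=500 and k=3."""
    return -abs(config["chunk_size"] - 500) / 100 - abs(config["k"] - 3)


best_config = tune_one_at_a_time(search_space, defaults, fake_evaluate)
print(best_config)  # {'chunk_size': 500, 'k': 3}
```

This runs one evaluation per candidate value (six here) instead of one per combination (nine here), at the cost of possibly missing interactions between settings.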