# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the RAGAS framework for our evaluations today, as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [None]:
%pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

In [5]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Data Collection

We're going to use a legal contract (a draft stock purchase agreement) as our context.

In [3]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/Raptor Contract.pdf")
base_docs = loader.load()


In [4]:
for doc in base_docs:
  print(doc.metadata)

{'source': '../data/Raptor Contract.pdf', 'page': 0}
{'source': '../data/Raptor Contract.pdf', 'page': 1}
{'source': '../data/Raptor Contract.pdf', 'page': 2}
{'source': '../data/Raptor Contract.pdf', 'page': 3}
{'source': '../data/Raptor Contract.pdf', 'page': 4}
{'source': '../data/Raptor Contract.pdf', 'page': 5}
{'source': '../data/Raptor Contract.pdf', 'page': 6}
{'source': '../data/Raptor Contract.pdf', 'page': 7}
{'source': '../data/Raptor Contract.pdf', 'page': 8}
{'source': '../data/Raptor Contract.pdf', 'page': 9}
{'source': '../data/Raptor Contract.pdf', 'page': 10}
{'source': '../data/Raptor Contract.pdf', 'page': 11}
{'source': '../data/Raptor Contract.pdf', 'page': 12}
{'source': '../data/Raptor Contract.pdf', 'page': 13}
{'source': '../data/Raptor Contract.pdf', 'page': 14}
{'source': '../data/Raptor Contract.pdf', 'page': 15}
{'source': '../data/Raptor Contract.pdf', 'page': 16}
{'source': '../data/Raptor Contract.pdf', 'page': 17}
{'source': '../data/Raptor Contract.pd

### Creating an Index

Let's use a naive index creation strategy of just using `RecursiveCharacterTextSplitter` on our documents and embedding each into our `VectorStore` using `OpenAIEmbeddings()`.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)
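
To make the chunking step concrete, here is a minimal pure-Python sketch of fixed-size character chunking with optional overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` is implemented (the real splitter tries to break on separators such as newlines and spaces first); the `split_text` helper and the filler text are hypothetical names used only for illustration.

```python
def split_text(text: str, chunk_size: int = 250, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "word " * 120  # 600 characters of filler text (illustrative only)
chunks = split_text(sample, chunk_size=250, chunk_overlap=50)
print(len(chunks), max(len(c) for c in chunks))
```

Smaller chunks give more precise matches but less surrounding context per hit, which is the trade-off the later retrievers in this notebook try to address.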

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())



In [7]:
len(docs)

4234

In [8]:
print(max([len(chunk.page_content) for chunk in docs]))

250


In [10]:
docs[0].page_content

'[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAGREEMENT\nBY\nAND\nAMONG\n[BUYER],\n[TARGET\nCOMP ANY],\nTHE\nSELLERS\nLISTED\nON\nSCHEDULE\nI\nHERET O\nAND\nTHE\nSELLERS’\nREPRESENT ATIVE\nNAMED\nHEREIN\nDated\nas\nof\n[●]\n[This\ndocument\nis\nintended\nsolely\nto\nfacilitate'

Let's convert our `Chroma` vectorstore into a retriever with the `.as_retriever()` method.

In [11]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})
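
Conceptually, a dense retriever with `k=2` scores every indexed chunk against the query embedding and returns the two best matches. The sketch below shows that idea with hand-written toy vectors. It assumes cosine similarity as the scoring function (the actual distance metric depends on how the vector store is configured), and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names, not part of Chroma or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the texts of the k entries most similar to the query.

    indexed is a list of (text, vector) pairs.
    """
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings"; real embeddings have far more dimensions.
toy_index = [
    ("escrow clause", [0.9, 0.1, 0.0]),
    ("tax clause", [0.0, 1.0, 0.0]),
    ("escrow agent", [0.8, 0.0, 0.2]),
]
hits = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(hits)
```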

Now to give it a test!

In [15]:
relevant_docs = base_retriever.get_relevant_documents("How much is the escrow amount?")

In [16]:
len(relevant_docs)

2

In [18]:
relevant_docs[1]

Document(page_content='release\nthe\nEscrow\nAmount\nto\nCompany\nSecurityholders\nin\naccordance\nwith \nthe\nEscrow\nAgreement\nor\n(ii)\nthe\namount,\nif\nany,\nby\nwhich\nsuch\nestimated\nPurchase\nPrice\npaid\nat \nClosing\nin\naccordance\nwith\nSection 2.05(a)(i)\nand\nSection\n2.07(a)\nexceeds\nsuch', metadata={'page': 25, 'source': '../data/Raptor Contract.pdf'})

## Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [19]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
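
To see what the LLM will actually receive, here is a rough sketch that fills the same template with plain `str.format`. It assumes the retrieved chunks are joined with blank lines; how the real chain stringifies the retrieved `Document` objects may differ. `TEMPLATE` and `build_prompt` are hypothetical names for illustration.

```python
TEMPLATE = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""


def build_prompt(chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    [
        "Retention Amount means an amount equal to $5,000,000.",
        "Section 102 means section 102 of the Income Tax Ordinance.",
    ],
    "How much is the retention amount?",
)
print(prompt_text)
```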

### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!


In [20]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign forwards the incoming dict and (re)assigns "context" from its own
    #              "context" key, so both "question" and "context" reach the next step unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)



Let's test it out!

In [22]:
question = "How much is the retention amount?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

{'response': AIMessage(content='Retention Amount is $5,000,000.'), 'context': [Document(page_content='or\nother\nrepresentative\nof\nsuch\nPerson,\nincluding\nlegal \ncounsel,\naccountants,\nand\nfinancial\nadvisors.\n“\nRetention\nAmount\n”\nmeans\nan\namount\nequal\nto\n$5,000,000.\n“\nSection\n102\n”\nmeans\nsection\n102\nof\nthe\nIncome\nTax\nOrdinance.\n“\nSection\n102\nOptions\n”', metadata={'page': 15, 'source': '../data/Raptor Contract.pdf'}), Document(page_content='“\nRetention\nAmount\n”\nmeans\nan\namount\nequal\nto\n$5,000,000.\n“\nSection\n102\n”\nmeans\nsection\n102\nof\nthe\nIncome\nTax\nOrdinance.\n“\nSection\n102\nOptions\n”\nmeans\nOptions\ngranted\nand\nsubject\nto\ntax\npursuant\nto\nSection \n102(b)(2)\nof\nthe\nOrdinance.\n“\nSection\n102', metadata={'page': 15, 'source': '../data/Raptor Contract.pdf'})]}


### Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!

In [23]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [24]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()
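
The parser's job is to turn the model's text reply into a Python dict with the expected keys. Below is a simplified stand-in for that step, not `StructuredOutputParser`'s actual implementation: it assumes the reply is a JSON object, optionally wrapped in a markdown code fence. `parse_json_response` and `FENCE` are hypothetical names.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_response(text: str, expected_keys: list[str]) -> dict:
    """Extract a JSON object from an LLM reply and check its keys."""
    match = FENCED_BLOCK.search(text)
    payload = match.group(1) if match else text
    data = json.loads(payload.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


fenced = FENCE + 'json\n{"question": "What is the escrow amount?"}\n' + FENCE
print(parse_json_response(fenced, ["question"]))
```

A reply that cannot be parsed raises an exception, which is why the generation loops in this section wrap the parse call in `try`/`except` and skip the failing chunk.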

In [25]:
question_generation_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)

In [26]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)

In [27]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What is the purpose of the STOCK PURCHASE AGREEMENT?
context
page_content='[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAGREEMENT\nBY\nAND\nAMONG\n[BUYER],\n[TARGET\nCOMP ANY],\nTHE\nSELLERS\nLISTED\nON\nSCHEDULE\nI\nHERET O\nAND\nTHE\nSELLERS’\nREPRESENT ATIVE\nNAMED\nHEREIN\nDated\nas\nof\n[●]\n[This\ndocument\nis\nintended\nsolely\nto\nfacilitate' metadata={'source': '../data/Raptor Contract.pdf', 'page': 0}


In [None]:
%pip install -q -U tqdm

In [28]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages})
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 10/10 [00:20<00:00,  2.05s/it]


In [29]:
qac_triples[5]

{'question': 'What is the intention of the document and the discussions mentioned in the context?',
 'context': Document(page_content='identified\nherein. \nNeither\nthis\ndocument\nnor\nsuch\ndiscussions\nare\nintended\nto\ncreate,\nnor\nwill\neither\nor\nboth\nbe \ndeemed\nto\ncreate,\na\nlegally\nbinding\nor\nenforceable\noffer\nor\nagreement\nof\nany\ntype\nor\nnature, \nunless\nand\nuntil\na\ndefinitive\nwritten', metadata={'source': '../data/Raptor Contract.pdf', 'page': 0})}

In [30]:
answer_generation_llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:
answer

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)

In [31]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
The purpose of the document is to outline the terms and conditions of a stock purchase transaction between the buyer, the target company, the sellers listed on Schedule I, and the sellers' representative. It serves as a formal agreement that specifies the obligations and rights of all parties involved in the sale and purchase of stock.
question
What is the purpose of the document mentioned in the context?
context
page_content='[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAGREEMENT\nBY\nAND\nAMONG\n[BUYER],\n[TARGET\nCOMPANY],\nTHE\nSELLERS\nLISTED\nON\nSCHEDULE\nI\nHERETO\nAND\nTHE\nSELLERS’\nREPRESENTATIVE\nNAMED\nHEREIN\nDated\nas\nof\n[●]\n[This\ndocument\nis\nintended\nsolely\nto\nfacilitate' metadata={'source': '../data/Raptor Contract.pdf', 'page': 0}


In [32]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 9/9 [00:33<00:00,  3.69s/it]


In [34]:
%pip install -q -U datasets

Note: you may need to restart the kernel to use updated packages.


In [35]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)



In [36]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

In [40]:
eval_dataset[3]

{'question': 'What is the purpose of this document?',
 'context': 'NAMED\nHEREIN\nDated\nas\nof\n[●]\n[This\ndocument\nis\nintended\nsolely\nto\nfacilitate\ndiscussions\namong\nthe\nparties\nidentified\nherein. \nNeither\nthis\ndocument\nnor\nsuch\ndiscussions\nare\nintended\nto\ncreate,\nnor\nwill\neither\nor\nboth\nbe \ndeemed\nto\ncreate,\na\nlegally',
 'ground_truth': 'The purpose of this document is to serve as a preliminary discussion tool among the parties named within. It is not intended to create any legal obligations or to be deemed as a legally binding contract.'}

In [41]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 69.12ba/s]


5374

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly - uncomment the code below.

If you're using Colab to do this notebook - please ensure you add it to your session files.

In [None]:
# from datasets import Dataset
# eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

In [42]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.

In [44]:
%pip install ragas

Collecting ragas
  Downloading ragas-0.1.2-py3-none-any.whl.metadata (4.7 kB)
Collecting pysbd>=0.3.4 (from ragas)
  Using cached pysbd-0.3.4-py3-none-any.whl (71 kB)
Collecting appdirs (from ragas)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Downloading ragas-0.1.2-py3-none-any.whl (66 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 kB 455.3 kB/s eta 0:00:00
Installing collected packages: appdirs, pysbd, ragas
Successfully installed appdirs-1.4.4 pysbd-0.3.4 ragas-0.1.2
Note: you may need to restart the kernel to use updated packages.


In [45]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"].content,
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity
    ],
  )
  return result
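
To build some intuition for two of the retrieval metrics listed above, here is a toy sketch. RAGAS obtains its relevance judgments from an LLM; in this sketch they are hand-supplied booleans, and the formulas reflect an assumed reading of the metric definitions (a rank-aware average of precision@k for context precision, and the fraction of supported ground-truth statements for context recall) that may differ between RAGAS versions. Both function names are hypothetical.

```python
def context_precision_at_k(relevance: list[bool]) -> float:
    """Average of precision@k over the ranks that hold a relevant chunk.

    relevance[i] says whether the chunk at rank i + 1 was judged relevant.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def context_recall(supported: list[bool]) -> float:
    """Fraction of ground-truth statements supported by the retrieved context."""
    return sum(supported) / len(supported) if supported else 0.0


# A relevant chunk ranked first scores higher than the same chunk ranked second.
print(context_precision_at_k([True, False]))
print(context_precision_at_k([False, True]))
print(context_recall([True, True, False]))
```

The takeaway is that context precision rewards putting relevant chunks near the top of the ranking, while context recall asks whether the retrieved chunks cover what the ground-truth answer needs.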

Let's create our dataset first:

In [46]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:13<00:00,  1.45s/it]


In [47]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 9
})

In [48]:
basic_qa_ragas_dataset[0]

{'question': 'What is the purpose of the document mentioned in the context?',
 'answer': "I don't know.",
 'contexts': ['to \nwriting,\nthe\nplan\ndocument\ntogether\nwith\nall\namendments\nthereto,\n(ii) if\nthe\nplan\nhas\nnot\nbeen \nreduced\nto\nwriting,\na\nwritten\nsummary\nof\nall\nmaterial\nplan\nterms,\n(iii) if\napplicable,\nany\ntrust \nagreements,\ncustodial\nagreements,\nnon-standard',
  'in\nthe\nOrganizational\nDocuments\nof\nany\nAcquired\nCompany\nwhich\nobligates\nan \nAcquired\nCompany\nto\npurchase,\nredeem\nor\notherwise\nacquire,\nor\nmake\nany\npayment\n(including \nany\ndividend\nor\ndistribution)\nin\nrespect\nof,\nany\nEquity\nInterest\nin\nany\nAcquired'],
 'ground_truths': ["The purpose of the document is to outline the terms and conditions of a stock purchase transaction between the buyer, the target company, and the sellers listed in the agreement. It serves as a legally binding contract that specifies the obligations and rights of all parties involved in 

Save it for later:

In [49]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 349.06ba/s]


9565

And finally - evaluate how it did!

In [51]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`


Evaluating:  52%|█████▏    | 33/63 [00:14<00:13,  2.24it/s]
Exception in thread Thread-7:
Traceback (most recent call last):
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/site-packages/ragas/executor.py", line 58, in _aresults
    r = await future
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/asyncio/tasks.py", line 571, i

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

### Testing Other Retrievers

Now we can test how changing our retriever impacts our RAGAS evaluation!

We'll build a simple `create_qa_chain` factory that produces standardized QA chains in which the only component that differs is the retriever.

In [52]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks, embed those, and then at query time return the larger parent chunk that "surrounds" each match, so the LLM sees significantly more context.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Group the child documents by parent: children that share a parent are merged into a single entry.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.
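
The steps above can be sketched in plain Python. This is a toy model and not LangChain's `ParentDocumentRetriever`: it splits on fixed character counts, and a simple word-match count stands in for dense vector retrieval. All function names and the sample text are hypothetical.

```python
def split(text, size):
    """Cut text into consecutive pieces of at most size characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_stores(documents, parent_size, child_size):
    """Index small child chunks, each remembering the id of its parent chunk."""
    parent_store = {}  # parent_id -> parent text (the "docstore")
    child_index = []   # (child text, parent_id) pairs (the "vectorstore")
    for doc in documents:
        for parent in split(doc, parent_size):
            parent_id = len(parent_store)
            parent_store[parent_id] = parent
            for child in split(parent, child_size):
                child_index.append((child, parent_id))
    return parent_store, child_index


def score(query, text):
    """Toy relevance: how many query words occur in the text."""
    return sum(1 for word in set(query.lower().split()) if word in text.lower())


def retrieve_parents(query, parent_store, child_index, k=2):
    """Find the top-k children, then return their de-duplicated parents."""
    ranked = sorted(child_index, key=lambda item: score(query, item[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # children with the same parent are merged
            parent_ids.append(parent_id)
    return [parent_store[pid] for pid in parent_ids]


parent_a = "escrow funds are held by the agent".ljust(40)
parent_b = "taxes follow section 102 rules".ljust(40)
parent_store, child_index = build_stores([parent_a + parent_b], parent_size=40, child_size=10)
print(retrieve_parents("escrow", parent_store, child_index, k=2))
```

Here both of the top two children come from the same parent, so only one parent chunk is returned.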

In [53]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())

store = InMemoryStore()

In [54]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [55]:
parent_document_retriever.add_documents(base_docs)

Let's create, test, and then evaluate our new chain!

In [56]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [58]:
parent_document_retriever_qa_chain.invoke({"question" : "What is the retention amount"})["response"].content

'Retention Amount means an amount equal to $5,000,000.'

In [59]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:12<00:00,  1.40s/it]


In [60]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 311.43ba/s]


13421

In [61]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:  32%|███▏      | 20/63 [00:14<00:31,  1.35it/s]
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/site-packages/ragas/executor.py", line 75, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/Users/abdi/miniconda3/envs/legal_contract_advisor/lib/python3.10/site-packages/ragas/executor.py", line 63, in _aresults
    raise e
  File "/Users/abdi/miniconda

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect and "fuse" the retrieved docs based on their weighting using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm into a single ranked list.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!
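
To illustrate the fusion step, here is a minimal sketch of weighted Reciprocal Rank Fusion. It assumes each document earns `weight / (rank + c)` from every ranked list it appears in, with the smoothing constant `c = 60` taken from the RRF paper; whether `EnsembleRetriever` uses exactly this form is an assumption. `weighted_rrf` and the document ids are hypothetical.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            # Higher-ranked documents (small rank) contribute more.
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b"]             # sparse retriever, best first
dense_ranking = ["doc_b", "doc_c", "doc_a"]   # dense retriever, best first

print(weighted_rrf([bm25_ranking, dense_ranking], weights=[0.75, 0.25]))
print(weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5]))
```

With the sparse retriever weighted at 0.75 its top hit wins; with equal weights the document that ranks well in both lists moves to the front.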

In [62]:
%pip install -q -U rank_bm25

Note: you may need to restart the kernel to use updated packages.


In [63]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

In [64]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [65]:
ensemble_retriever_qa_chain.invoke({"question" : "What the escrow amount?"})["response"].content

"I don't know."

In [66]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:11<00:00,  1.26s/it]


In [67]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 230.36ba/s]


20684

In [68]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating: 100%|██████████| 63/63 [00:19<00:00,  3.17it/s]


In [69]:
ensemble_qa_result

{'context_precision': 0.7418, 'faithfulness': 0.9444, 'answer_relevancy': 0.2018, 'context_recall': 0.8889, 'context_relevancy': 0.0120, 'answer_correctness': 0.6349, 'answer_similarity': 0.8618}

### Conclusion

Observe your results in a table!

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

In [70]:
ensemble_qa_result

{'context_precision': 0.7418, 'faithfulness': 0.9444, 'answer_relevancy': 0.2018, 'context_recall': 0.8889, 'context_relevancy': 0.0120, 'answer_correctness': 0.6349, 'answer_similarity': 0.8618}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [71]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [72]:
ensemble_qa_result_df

Unnamed: 0,question,answer,contexts,ground_truths,ground_truth,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
0,What is the purpose of the document mentioned ...,I don't know.,"[Sections\n2.06\n,\neach\nof\nthe\nparties\nto...",[The purpose of the document is to outline the...,The purpose of the document is to outline the ...,0.0,,0.0,0.0,0.014981,0.173315,0.693062
1,Who are the parties identified in the document?,I don't know.,[Section\nIV.10\nSpecific\nPerformance\n.\nEac...,[The parties identified in the document are th...,The parties identified in the document are the...,0.7,,0.0,1.0,0.004016,0.933875,0.735499
2,What is the purpose of this document and the d...,The purpose of this document and the discussio...,[[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAG...,[The purpose of this document is to serve as a...,The purpose of this document is to serve as a ...,1.0,0.666667,0.0,1.0,0.003717,0.527882,0.911529
3,What is the purpose of this document?,I don't know,"[Sections\n2.06\n,\neach\nof\nthe\nparties\nto...",[The purpose of this document is to serve as a...,The purpose of this document is to serve as a ...,0.416667,,0.0,1.0,0.026217,0.181838,0.727351
4,What is the purpose of this document and the d...,The purpose of this document and the discussio...,[[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAG...,[The purpose of this document is to serve as a...,The purpose of this document is to serve as a ...,1.0,1.0,0.0,1.0,0.003906,0.842327,0.96931
5,What is the intention of the document and the ...,Answer: The intention of the document and disc...,[[R&G\nDraft\n12.__.2021]\nSTOCK\nPURCHASE\nAG...,[The intention of the document and the discuss...,The intention of the document and the discussi...,0.804167,1.0,0.0,1.0,0.015385,0.612931,0.951724
6,"According to the context, when will discussion...",Discussions will be deemed to create a legally...,"[or\nboth\nbe \ndeemed\nto\ncreate,\na\nlegall...",[Discussions will not be deemed to create a le...,Discussions will not be deemed to create a leg...,1.0,1.0,0.97591,1.0,0.004545,0.742807,0.971226
7,What must be executed and delivered by each of...,A definitive written agreement.,"[or\nboth\nbe \ndeemed\nto\ncreate,\na\nlegall...",[To create a legally binding offer or agreemen...,To create a legally binding offer or agreement...,1.0,1.0,0.0,1.0,0.030702,0.975697,0.902787
8,What is required for the document to be kept c...,A definitive written agreement executed and de...,"[or\nboth\nbe \ndeemed\nto\ncreate,\na\nlegall...","[For the document to be kept confidential, a d...","For the document to be kept confidential, a de...",0.755556,1.0,0.840736,1.0,0.004082,0.723481,0.893806


We'll also combine the results into a single table so we can compare the pipelines side by side!

In [None]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name" : pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [None]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [None]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [None]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [None]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [None]:
results_df.sort_values("answer_correctness", ascending=False)

| | name | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|
| 2 | ensemble_rag | 0.885833 | 0.7 | 0.891845 | 0.98 | 0.019158 | 0.775 | 1.0 |
| 0 | basic_rag | 0.5 | 0.4 | 0.953475 | 1.0 | 0.055904 | 0.616667 | 1.0 |
| 1 | pdr_rag | 0.697222 | 0.35 | 0.943909 | 1.0 | 0.013386 | 0.6 | 1.0 |


### ❓QUESTION❓

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

In [None]:
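from langchain.schema.runnable import RunnableParallel

# StrOutputParser (imported earlier) converts the chat message into a plain string
parser = StrOutputParser()

retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        # Pass only the question string forward, not the whole input dict
        'question': itemgetter('question')
    }) | {
        # "response" is now a str, so downstream code should not call .content on it
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)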