# Evaluation with Ragas and Advanced Retrieval Methods Using LangChain

We will use the Ragas framework for our evaluations, as it is becoming a standard method for evaluating RAG systems (at least at a general level).

[Ragas Repository](https://github.com/explodinggradients/ragas)

[Ragas Documentation](https://docs.ragas.io/en/latest/)

In [None]:
%pip install langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

In [2]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key


### Data collection

We will use arXiv papers as context.

These documents can be collected directly with LangChain's `ArxivLoader` document loader.

In [3]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query = "Retrieval Augmented Generation", load_max_docs = 5).load()
len(base_docs)

5

In [4]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.'}
{'Published': '2023-12-09', 'Title': 'Context Tuning for Retrieval Augmented Generation', 'Autho

### Creating an index

We will use a simple indexing strategy: split our documents with [`RecursiveCharacterTextSplitter`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html), then embed each chunk into our `VectorStore` using `OpenAIEmbeddings()`.

In [5]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 250)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())



In [6]:
len(docs)

4899

In [7]:
print(max([len(chunk.page_content) for chunk in docs]))

249
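
The length check above shows that no chunk exceeds `chunk_size`. As a rough illustration of how recursive splitting can enforce that bound, here is a simplified pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` and the separator order are assumptions made for illustration, and features such as chunk overlap are left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper: tries coarse separators first and falls back to
    finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging pieces while the chunk still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
for chunk in recursive_split(sample, chunk_size=10):
    print(repr(chunk), len(chunk))
```

Smaller chunks give more precise matches at retrieval time but carry less surrounding context, which is the trade-off the later retrievers in this notebook try to address.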


Let's convert our `Chroma` vectorstore into a retriever using the `.as_retriever()` method.

In [8]:
base_retriever = vectorstore.as_retriever(search_kwargs = {"k": 2})

Let's test it.

In [9]:
relevant_docs = base_retriever.get_relevant_documents("What is Retrieval Augmented Generation?")

In [10]:
len(relevant_docs)

2

## Creating a prompt template for RAG

Set up a prompt template that provides the LLM with the necessary contexts, the user's query, and the instructions.

In [11]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

### Setting up the question-answering chain

Now we can instantiate our basic RAG chain.

In [12]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name = "gpt-3.5-turbo", temperature = 0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : RunnablePassthrough.assign passes the dict through unchanged and re-assigns "context"
    #              to the value it already holds (effectively a no-op, kept here for explicitness)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)



Test:

In [13]:
question = "What is RAG?"

result = retrieval_augmented_qa_chain.invoke({"question": question})

print(result)

{'response': AIMessage(content='RAG stands for Retrieval-Augmented Generation.'), 'context': [Document(page_content='seamlessly coupled with various RAG-based\napproaches.\nExperiments on four datasets\ncovering short- and long-form generation tasks\nshow that CRAG can significantly improve the\nperformance of RAG-based approaches.1\n1\nIntroduction', metadata={'Authors': 'Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, Zhen-Hua Ling', 'Published': '2024-02-16', 'Summary': 'Large language models (LLMs) inevitably exhibit hallucinations since the\naccuracy of generated texts cannot be secured solely by the parametric\nknowledge they encapsulate. Although retrieval-augmented generation (RAG) is a\npracticable complement to LLMs, it relies heavily on the relevance of retrieved\ndocuments, raising concerns about how the model behaves if retrieval goes\nwrong. To this end, we propose the Corrective Retrieval Augmented Generation\n(CRAG) to improve the robustness of generation. Specifically, a lightweight

### Creating a ground-truth dataset using gpt-3.5-turbo and gpt-4

The idea is to use LangChain to generate questions from our contexts and then answer those questions.

In [14]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name = "question",
    description = "A question about the context."
)

question_response_schemas = [
    question_schema,
]

In [15]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [16]:
question_generation_llm = ChatOpenAI(model = "gpt-3.5-turbo-16k")

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template = bare_prompt_template)

In [17]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template = qa_template)

messages = prompt_template.format_messages(
    context = docs[0],
    format_instructions = format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content": messages})
output_dict = question_output_parser.parse(response.content)

In [18]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What is the main focus of the paper 'A Survey on Retrieval-Augmented Text Generation'?
context
{'page_content': 'A Survey on Retrieval-Augmented Text Generation\nHuayang Li♥,∗\nYixuan Su♠,∗\nDeng Cai♦,∗\nYan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab', 'metadata': {'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented

In [19]:
%pip install tqdm

In [20]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[: 10]):
  messages = prompt_template.format_messages(
      context = text,
      format_instructions = format_instructions
  )
  response = question_generation_chain.invoke({"content": messages})
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 10/10 [00:32<00:00,  3.24s/it]


In [21]:
qac_triples[5]

{'question': 'What are the advantages of retrieval-augmented text generation compared to conventional generation models?',
 'context': Document(page_content='lemaoliu@gmail.com\nAbstract\nRecently, retrieval-augmented text generation\nattracted increasing attention of the compu-\ntational linguistics community.\nCompared\nwith conventional generation models, retrieval-', metadata={'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-aug

In [22]:
answer_generation_llm = ChatOpenAI(model = "gpt-4-1106-preview", temperature = 0)

answer_schema = ResponseSchema(
    name = "answer",
    description = "An answer to the question."
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:
answer

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template = qa_template)

messages = prompt_template.format_messages(
    context = qac_triples[0]["context"],
    question = qac_triples[0]["question"],
    format_instructions = format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content": messages})
output_dict = answer_output_parser.parse(response.content)

In [23]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
This paper aims to conduct a survey about retrieval-augmented text generation, highlighting the generic paradigm of this approach and reviewing notable methods across various NLP tasks such as dialogue response generation, machine translation, and other generation tasks, while also identifying important future research directions.


In [24]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context = triple["context"],
      question = triple["question"],
      format_instructions = format_instructions
  )
  response = answer_generation_chain.invoke({"content": messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 9/9 [02:19<00:00, 15.52s/it]


In [25]:
%pip install datasets

In [26]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns = {"answer": "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)



In [27]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

In [28]:
eval_dataset[0]

{'question': 'What does this paper aim to conduct a survey about?',
 'context': 'A Survey on Retrieval-Augmented Text Generation\nHuayang Li♥,∗\nYixuan Su♠,∗\nDeng Cai♦,∗\nYan Wang♣,∗\nLemao Liu♣,∗\n♥Nara Institute of Science and Technology\n♠University of Cambridge\n♦The Chinese University of Hong Kong\n♣Tencent AI Lab',
 'ground_truth': 'This paper aims to conduct a survey about retrieval-augmented text generation, highlighting the generic paradigm of this approach, reviewing notable methods across various tasks such as dialogue response generation and machine translation, and identifying important future research directions.'}

In [29]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 125.08ba/s]


7533

### Evaluating the RAG pipelines

To avoid regenerating the dataset, you can load the `.csv` file directly by uncommenting the code below.

In [30]:
# from datasets import Dataset
# eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

In [31]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

### Evaluation using Ragas

We need to create a dataset with our generated answers and our retrieved contexts, then evaluate it with the framework.

In [32]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question": row["question"]})
    rag_dataset.append(
        {"question": row["question"],
         "answer": answer["response"].content,
         "contexts": [context.page_content for context in answer["context"]],
         "ground_truths": [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics = [
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity
    ],
  )
  return result

First we need to create the dataset:

In [33]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:28<00:00,  3.11s/it]


In [34]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truths'],
    num_rows: 9
})

In [35]:
basic_qa_ragas_dataset[0]

{'question': 'What does this paper aim to conduct a survey about?',
 'answer': "I don't know.",
 'contexts': ['retrieval. Acm Computing Surveys (CSUR), 44(1):1–\n50.\nDanqi Chen, Adam Fisch, Jason Weston, and Antoine\nBordes. 2017. Reading Wikipedia to Answer Open-\nDomain Questions. In Proceedings of the 55th An-',
  'generation), PubHealth (Zhang et al., 2023a) (true-\nor-false question), and Arc-Challenge (Bhaktha-\nvatsalam et al., 2021) (multiple-choice question).\n2In this study, Google Search API is utilized for searching.\nPopQA\nBio\nPub\nARC\nMethod\n(Accuracy)'],
 'ground_truths': ['This paper aims to conduct a survey about retrieval-augmented text generation, highlighting the generic paradigm of this approach, reviewing notable methods across various tasks such as dialogue response generation and machine translation, and identifying important future research directions.']}

Save:

In [36]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 45.45ba/s]


11368

Evaluate:

In [37]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating: 100%|██████████| 63/63 [01:14<00:00,  1.19s/it]


In [38]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.8750, 'answer_relevancy': 0.8866, 'context_recall': 0.9352, 'context_relevancy': 0.2088, 'answer_correctness': 0.4933, 'answer_similarity': 0.9212}
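
The deprecation warning printed during evaluation asks for a `ground_truth` column holding a single string instead of a `ground_truths` list. A minimal sketch of that conversion on plain row dictionaries follows. The helper name `to_single_ground_truth` is hypothetical, and it assumes one reference answer per question, which is how `create_ragas_dataset` builds its rows.

```python
def to_single_ground_truth(rows):
    """Return copies of rows with 'ground_truths' (list) replaced by 'ground_truth' (str)."""
    converted = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "ground_truths"}
        if "ground_truth" not in new_row:
            truths = row.get("ground_truths") or []
            # Assumption: exactly one reference answer per question, so take the first.
            new_row["ground_truth"] = truths[0] if truths else ""
        converted.append(new_row)
    return converted


rows = [
    {
        "question": "What is RAG?",
        "answer": "Retrieval-Augmented Generation.",
        "contexts": ["RAG couples a retriever with a generator."],
        "ground_truths": ["RAG stands for Retrieval-Augmented Generation."],
    }
]
print(to_single_ground_truth(rows)[0].keys())
```

The converted rows can then be wrapped in a `pd.DataFrame` and passed to `Dataset.from_pandas`, as `create_ragas_dataset` already does.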

### Trying other retrievers

We can test how changing our retriever affects the evaluation.

We will build a simple qa_chain factory to create standardized QA chains in which the only component that differs is the retriever.

In [39]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name = "gpt-3.5-turbo", temperature = 0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context = itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent document retriever

One of the simplest ways to improve a retriever is to embed our documents as small chunks and then retrieve a significant amount of additional context that "surrounds" the matched chunk.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever).

The basic outline of this retrieval method is as follows:

1. Get the user's question.
2. Retrieve child documents using dense vector retrieval.
3. Merge the child documents by parent: children that share a parent are merged.
4. Replace the child documents with their respective parent documents from an in-memory store.
5. Use the parent documents to augment generation.
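
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`split_fixed`, `build_index`, `retrieve_parents`) are hypothetical, fixed-width slicing stands in for the text splitters, and a word-overlap count stands in for dense vector similarity.

```python
def split_fixed(text, size):
    """Toy splitter: fixed-width character slices (stand-in for a real text splitter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents, parent_size, child_size):
    parents = {}   # parent_id -> parent text (plays the role of the in-memory store)
    children = []  # (child_text, parent_id) pairs (plays the role of the vector store)
    for doc in documents:
        for parent in split_fixed(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent
            for child in split_fixed(parent, child_size):
                children.append((child, parent_id))
    return parents, children


def overlap_score(query, text):
    """Toy relevance score: number of lowercase words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parents, children, k=2):
    # Step 2: rank the small child chunks against the query.
    ranked = sorted(children, key=lambda c: overlap_score(query, c[0]), reverse=True)
    # Steps 3 and 4: map the top-k children to parents, merging duplicates.
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


doc_a = "Retrieval augmented generation grounds answers in retrieved text. It reduces hallucination."
doc_b = "Bananas are yellow fruit. They grow in tropical climates around the world."
parents, children = build_index([doc_a, doc_b], parent_size=200, child_size=40)
print(retrieve_parents("what is retrieval augmented generation", parents, children))
```

Matching happens on the small chunks, but the generator receives the larger parent text, so the number of contexts returned can be smaller than `k` when several matched children share a parent.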

In [40]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size = 1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size = 200)

vectorstore = Chroma(collection_name = "split_parents", embedding_function = OpenAIEmbeddings())

store = InMemoryStore()

In [41]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore = vectorstore,
    docstore = store,
    child_splitter = child_splitter,
    parent_splitter = parent_splitter,
)

In [42]:
parent_document_retriever.add_documents(base_docs)

Create, test, and then evaluate our new chain.

In [43]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [44]:
parent_document_retriever_qa_chain.invoke({"question": "What is RAG?"})["response"].content

'Answer: Retrieval-augmented generation (RAG) is a practicable complement to large language models (LLMs) that relies heavily on the relevance of retrieved documents to improve the robustness of generation.'

In [45]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:20<00:00,  2.30s/it]


In [46]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 330.52ba/s]


45260

In [47]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating: 100%|██████████| 63/63 [00:50<00:00,  1.26it/s]


In [48]:
pdr_qa_result

{'context_precision': 0.8951, 'faithfulness': 0.9444, 'answer_relevancy': 0.9590, 'context_recall': 0.9722, 'context_relevancy': 0.0275, 'answer_correctness': 0.5036, 'answer_similarity': 0.9386}

#### Ensemble Retrieval

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Get the user's question.
2. Use the pair of retrievers:
    - Retrieve documents with BM25 sparse vector retrieval
    - Retrieve documents with a dense vector retrieval method
3. Collect and "fuse" the retrieved documents according to their weighting, using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm, into a single ranked list.
4. Use those documents to augment our generation.

Make sure your `weights` list (the relative weighting of each retriever) sums to 1!
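
To make the fusion step concrete, here is a small sketch of weighted Reciprocal Rank Fusion over two ranked lists. The function name `weighted_rrf` is hypothetical, the constant `c = 60` follows the value suggested in the RRF paper, and the scoring rule `weight / (rank + c)` is an assumption about how the weights enter the formula rather than a copy of LangChain's code.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives weight / (rank + c) from every list that contains it,
    with ranks starting at 1; documents are returned by descending total score.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("one weight is required per ranked list")
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


bm25_ranking = ["doc_a", "doc_b"]             # toy sparse (BM25) results
dense_ranking = ["doc_b", "doc_c", "doc_a"]   # toy dense results

print(weighted_rrf([bm25_ranking, dense_ranking], weights=[0.75, 0.25]))
print(weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5]))
```

With the BM25-heavy weights `doc_a` comes first, while equal weights put `doc_b` first, so the `weights` list directly changes which contexts reach the LLM.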

In [49]:
%pip install rank_bm25

In [50]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap = 75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs = {"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers = [bm25_retriever, chroma_retriever], weights = [0.75, 0.25])

In [51]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [52]:
ensemble_retriever_qa_chain.invoke({"question": "What is RAG?"})["response"].content

'RAG stands for Retrieval Augmented Generation.'

In [53]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:18<00:00,  2.06s/it]


In [54]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 330.78ba/s]


21354

In [62]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

# This sometimes fails; rerun the cell until it succeeds.

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating: 100%|██████████| 63/63 [02:01<00:00,  1.93s/it]


In [63]:
ensemble_qa_result

{'context_precision': 0.6755, 'faithfulness': 1.0000, 'answer_relevancy': 0.9949, 'context_recall': 0.9167, 'context_relevancy': 0.1470, 'answer_correctness': 0.5536, 'answer_similarity': 0.9425}

### Conclusion

Let's look at the results in a table.

In [64]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.8750, 'answer_relevancy': 0.8866, 'context_recall': 0.9352, 'context_relevancy': 0.2088, 'answer_correctness': 0.4933, 'answer_similarity': 0.9212}

In [65]:
pdr_qa_result

{'context_precision': 0.8951, 'faithfulness': 0.9444, 'answer_relevancy': 0.9590, 'context_recall': 0.9722, 'context_relevancy': 0.0275, 'answer_correctness': 0.5036, 'answer_similarity': 0.9386}

In [66]:
ensemble_qa_result

{'context_precision': 0.6755, 'faithfulness': 1.0000, 'answer_relevancy': 0.9949, 'context_recall': 0.9167, 'context_relevancy': 0.1470, 'answer_correctness': 0.5536, 'answer_similarity': 0.9425}

We can expand each result to find specific information about each question and answer.

In [67]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [68]:
ensemble_qa_result_df

| | question | answer | contexts | ground_truths | ground_truth | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What does this paper aim to conduct a survey a... | Answer: This paper aims to conduct a survey ab... | `[attracted increasing attention of the compu-\...` | [This paper aims to conduct a survey about ret... | This paper aims to conduct a survey about retr... | 0.804167 | 1.0 | 0.967399 | 0.25 | 0.4 | 0.540072 | 0.960286 |
| 1 | What is the main focus of the paper 'A Survey ... | The main focus of the paper 'A Survey on Retri... | `[A Survey on Retrieval-Augmented Text Generati...` | [The main focus of the paper 'A Survey on Retr... | The main focus of the paper 'A Survey on Retri... | 0.679167 | 1.0 | 1.0 | 1.0 | 0.111111 | 0.621921 | 0.98762 |
| 2 | What is the aim of this paper? | The aim of this paper is to propose Corrective... | `[and main intent within questions.\nquestion: ...` | [The aim of this paper is to conduct a compreh... | The aim of this paper is to conduct a comprehe... | 0.0 | 1.0 | 0.989808 | 1.0 | 0.088889 | 0.596534 | 0.886143 |
| 3 | What is the aim of this paper? | The aim of this paper is to propose Corrective... | `[and main intent within questions.\nquestion: ...` | [The aim of this paper is to conduct a survey ... | The aim of this paper is to conduct a survey a... | 0.0 | 1.0 | 0.996603 | 1.0 | 0.088889 | 0.472811 | 0.891278 |
| 4 | What is the focus of this paper? | The focus of this paper is to improve the robu... | `[and main intent within questions.\nquestion: ...` | [The focus of this paper is on retrieval-augme... | The focus of this paper is on retrieval-augmen... | 1.0 | 1.0 | 1.0 | 1.0 | 0.155556 | 0.433662 | 0.877506 |
| 5 | What are the advantages of retrieval-augmented... | Advantages of retrieval-augmented text generat... | `[attracted increasing attention of the compu-\...` | [The advantages of retrieval-augmented text ge... | The advantages of retrieval-augmented text gen... | 1.0 | 1.0 | 1.0 | 1.0 | 0.153846 | 0.576429 | 0.972384 |
| 6 | What are the advantages of retrieval-augmented... | Advantages of retrieval-augmented text generat... | `[attracted increasing attention of the compu-\...` | [The advantages of retrieval-augmented text ge... | The advantages of retrieval-augmented text gen... | 1.0 | 1.0 | 1.0 | 1.0 | 0.153846 | 0.574617 | 0.965136 |
| 7 | What are the advantages of retrieval-augmented... | Advantages of retrieval-augmented text generat... | `[attracted increasing attention of the compu-\...` | [Retrieval-augmented text generation models ha... | Retrieval-augmented text generation models hav... | 0.916667 | 1.0 | 1.0 | 1.0 | 0.153846 | 0.470002 | 0.956948 |
| 8 | What is the focus of the paper 'A Survey on Re... | The focus of the paper 'A Survey on Retrieval-... | `[A Survey on Retrieval-Augmented Text Generati...` | [The focus of the paper 'A Survey on Retrieval... | The focus of the paper 'A Survey on Retrieval-... | 0.679167 | 1.0 | 1.0 | 1.0 | 0.016949 | 0.696202 | 0.984808 |


Next, we will combine the results and display them in a single table so that we can draw inferences from them.

In [69]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name": pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [70]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [71]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [72]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [73]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [74]:
results_df.sort_values("answer_correctness", ascending = False)

| | name | context_precision | faithfulness | answer_relevancy | context_recall | context_relevancy | answer_correctness | answer_similarity |
|---|---|---|---|---|---|---|---|---|
| 2 | ensemble_rag | 0.675463 | 1.0 | 0.994868 | 0.916667 | 0.146992 | 0.553583 | 0.942457 |
| 1 | pdr_rag | 0.895062 | 0.944444 | 0.959008 | 0.972222 | 0.027478 | 0.50358 | 0.938622 |
| 0 | basic_rag | 0.5 | 0.875 | 0.886624 | 0.935185 | 0.20881 | 0.493326 | 0.921228 |


In [4]:
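from operator import itemgetter

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel

# Relies on `base_retriever`, `prompt`, and `primary_qa_llm` from the earlier cells.
# Assumption: `parser` was meant to be a StrOutputParser, which makes "response" a plain string.
parser = StrOutputParser()

retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        # Extract the question string; RunnablePassthrough() would forward the whole input dict.
        'question': itemgetter('question')
    }) | {
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)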