# Bonus Track - Example of using RAGAs to evaluate the quality of Generative AI solutions

# Step 1 - Install the dependencies needed for the demo

The **requirements.txt** file defines the Python dependencies needed to run the demo.

To install them, run the following command: *pip install -r requirements.txt*

# Step 2 - Define the environment variables needed for the demo

Edit the _.env_ file and define the following environment variable:
-  **OPENAI_API_KEY**: API key used to connect to OpenAI, the LLM provider used in the demo.
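
A missing key usually only surfaces later, when the first model call fails. As a minimal sketch, a small check like the one below could be run right after `load_dotenv()` to fail early with a clear message. The helper name `require_env` is hypothetical and not part of any library used in the demo.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this demo; it only reads os.environ.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; define it in the .env file"
        )
    return value
```

For example, calling `require_env("OPENAI_API_KEY")` after the environment has been loaded stops the notebook before any request is sent if the key is absent.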

# Step 3 - Import the Python libraries needed for the demo

In [None]:
import os
import json
import random
import pandas as pd

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

from langchain.chat_models import ChatOpenAI
from datasets import Dataset, load_dataset

from dotenv import load_dotenv

# Load environment variables defined in .env
load_dotenv()

# Step 4 - Load a test dataset from Hugging Face (explodinggradients/fiqa)

This demo uses an example dataset from one of the baselines built for the Financial Opinion Mining and Question Answering (FiQA) dataset. The dataset has the following columns:

-  question: list[str] - These are the questions your RAG pipeline will be evaluated on.
-  answer: list[str] - The answer generated from the RAG pipeline and given to the user.
-  contexts: list[list[str]] - The contexts which were passed into the LLM to answer the question.
-  ground_truths: list[list[str]] - The ground truth answer to the questions. (only required if you are using context_recall)
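
To make the column layout above concrete, here is a minimal sketch of a check that could be run on a plain dictionary before passing it to `Dataset.from_dict`. The helper name `validate_ragas_columns` and the sample row are illustrative only; they are not part of RAGAs or of the FiQA dataset. The check treats `ground_truths` as mandatory because this demo uses `context_recall`.

```python
def validate_ragas_columns(data: dict) -> int:
    """Check the column layout described above and return the number of rows.

    Hypothetical helper: it inspects a plain dict of lists, nothing else.
    """
    required = ("question", "answer", "contexts", "ground_truths")
    missing = [col for col in required if col not in data]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Every column must describe the same number of rows.
    lengths = {col: len(data[col]) for col in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Columns differ in length: {lengths}")

    # question / answer are list[str].
    for col in ("question", "answer"):
        if not all(isinstance(value, str) for value in data[col]):
            raise TypeError(f"Column {col!r} must be list[str]")

    # contexts / ground_truths are list[list[str]].
    for col in ("contexts", "ground_truths"):
        for row in data[col]:
            if not isinstance(row, list) or not all(isinstance(v, str) for v in row):
                raise TypeError(f"Column {col!r} must be list[list[str]]")

    return lengths["question"]


# Made-up sample row, for illustration only.
sample = {
    "question": ["What is an index fund?"],
    "answer": ["A fund that tracks a market index."],
    "contexts": [["An index fund is built to follow a market index."]],
    "ground_truths": [["A fund designed to track a market index."]],
}
n_rows = validate_ragas_columns(sample)
```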

In [None]:
# Load the dataset; we will only work with the first 5 questions
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval_5_questions = fiqa_eval["baseline"].select(range(5))

# Show the dataset definition in terms of features and rows
print(fiqa_eval_5_questions)

# Show the dataset content
for item in fiqa_eval_5_questions:
    print(item)

# Step 5 - First RAG test with the GPT-3.5 Turbo model

In [None]:
# We load the ChatOpenAI Langchain utility to interact with the model
chat_gpt_3_llm = ChatOpenAI(model="gpt-3.5-turbo-0125",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2)

# Invoke GPT-3.5 Turbo with each dataset question to get its answer
answers = []
for item in fiqa_eval_5_questions:
    model_answer = chat_gpt_3_llm.invoke(item['question']).content
    answers.append(model_answer)

print(answers)

# Prepare a dataset to evaluate with RAGAs with the questions and answers
ragas_dataset = Dataset.from_dict({
    'question': fiqa_eval_5_questions['question'],
    'answer': answers,
    'contexts': fiqa_eval_5_questions['contexts'],
    'ground_truths': fiqa_eval_5_questions['ground_truths']
})

# Evaluate the results with RAGAs
result = evaluate(
    ragas_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
)

# Print them
df = result.to_pandas()
df.head()
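
Note that the loop above sends only the question to the model, so the retrieved contexts never reach it, while `faithfulness` is scored against those same contexts. As a minimal sketch of one possible alternative, the contexts could be folded into the prompt before calling the model. The helper name `build_rag_prompt` and the prompt wording are assumptions made for illustration, not something defined by RAGAs or LangChain.

```python
def build_rag_prompt(question: str, contexts: list[str]) -> str:
    """Build a single prompt string that includes the retrieved contexts.

    Hypothetical helper; the instruction wording is an arbitrary choice.
    """
    # Number the passages so they are easy to tell apart in the prompt.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}"
    )
```

Inside the loop, the call would then become `chat_gpt_3_llm.invoke(build_rag_prompt(item['question'], item['contexts'])).content`, which makes the generated answers depend on the same contexts that the metrics are evaluated against.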

# Step 6 - Same exercise, but with GPT-4o

In [None]:
# We load the ChatOpenAI Langchain utility to interact with the model
chat_gpt_4o_llm = ChatOpenAI(model="gpt-4o",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2)

# Invoke GPT-4o with each dataset question to get its answer
answers = []
for item in fiqa_eval_5_questions:
    model_answer = chat_gpt_4o_llm.invoke(item['question']).content
    answers.append(model_answer)

# Prepare a dataset to evaluate with RAGAs with the questions and answers
ragas_dataset = Dataset.from_dict({
    'question': fiqa_eval_5_questions['question'],
    'answer': answers,
    'contexts': fiqa_eval_5_questions['contexts'],
    'ground_truths': fiqa_eval_5_questions['ground_truths']
})

# Evaluate the results with RAGAs
result = evaluate(
    ragas_dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall
    ],
)

# Print them
df = result.to_pandas()
df.head()
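
Both steps store their output in `result` and `df`, so the second run overwrites the first. To compare the two models side by side, one option is to keep each DataFrame under its own name and average every metric. The sketch below works on plain row dictionaries (for example, the output of `df.to_dict("records")`) and assumes the columns are named after the metrics; `mean_scores` is a hypothetical helper, not a RAGAs function.

```python
import math


def mean_scores(rows: list[dict], metrics: list[str]) -> dict:
    """Average each metric over the rows, skipping missing or NaN values.

    Hypothetical helper for comparing two evaluation runs.
    """
    summary = {}
    for metric in metrics:
        values = [
            row[metric]
            for row in rows
            if row.get(metric) is not None and not math.isnan(row[metric])
        ]
        # A metric with no usable values is reported as NaN rather than 0.
        summary[metric] = sum(values) / len(values) if values else float("nan")
    return summary
```

With the Step 5 DataFrame kept as, say, `df_gpt35` and the Step 6 one as `df_gpt4o` (both names are assumptions), calling `mean_scores(df_gpt35.to_dict("records"), ["faithfulness", "answer_relevancy", "context_precision", "context_recall"])` and the same for `df_gpt4o` gives two dictionaries that can be compared metric by metric.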