# Evaluate Relevance Between Ground Truth Responses and Retrieved Contexts

**Authors**: 
- Komang Elang Surya Prawira (komang.e.s.prawira@gdplabs.id)
- Surya Mahadi (made.r.s.mahadi@gdplabs.id)

**Reviewers**: 
- Novan Parmonangan Simanjuntak (novan.p.simanjuntak@gdplabs.id)

## References
[1] [GDP Labs GenAI SDK - Evaluate Relevance Between Ground Truth Responses and Retrieved Contexts](#) \
[2] [LlamaIndex - Faithfulness Evaluator](https://docs.llamaindex.ai/en/stable/examples/evaluation/faithfulness_eval.html) \
[3] [Ragas - Evaluation](https://github.com/explodinggradients/ragas/blob/main/src/ragas/evaluation.py) \
[4] [Ragas - Context Recall](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_context_recall.py) \
[5] [LangChain - OpenAI](https://python.langchain.com/docs/integrations/chat/openai)

## Description

In this notebook, we explore how to evaluate retrieval performance using the retrieved contexts and the ground truth responses as references. We use an LLM to evaluate the retrieved contexts. The evaluation needs the following data:
1. Question: Query used to get the retrieved contexts.
2. Retrieved Contexts: Contexts retrieved for each question.
3. Ground Truth Responses: Ground truth responses for each question.


We use two metrics, one from LlamaIndex [2] and one from Ragas [4], to calculate the score:
1. **Faithfulness** measures the extent to which the retrieved contexts are correct based on the ground truth responses. 
2. **Context Recall** measures the extent to which the ground truth responses are reflected (mentioned) in the retrieved contexts.
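To make the two definitions concrete, the sketch below computes both scores on toy data. It is an illustration only: `is_supported` is a hypothetical stand-in for the LLM judgment (here a naive case-insensitive substring check), and none of these function names belong to LlamaIndex, Ragas, or the SDK.

```python
from typing import Callable, List


def is_supported(statement: str, contexts: List[str]) -> bool:
    """Hypothetical stand-in for the LLM judge: naive substring match."""
    return any(statement.lower() in context.lower() for context in contexts)


def faithfulness_pass_rate(
    responses: List[str],
    contexts_per_response: List[List[str]],
    judge: Callable[[str, List[str]], bool] = is_supported,
) -> float:
    """Fraction of response-context pairs that the judge marks as passing."""
    verdicts = [judge(r, c) for r, c in zip(responses, contexts_per_response)]
    return sum(verdicts) / len(verdicts)


def context_recall(
    ground_truth_statements: List[str],
    contexts: List[str],
    judge: Callable[[str, List[str]], bool] = is_supported,
) -> float:
    """Fraction of ground truth statements that can be attributed to the contexts."""
    attributed = [judge(s, contexts) for s in ground_truth_statements]
    return sum(attributed) / len(attributed)


toy_contexts = ["AI stands for Artificial Intelligence.", "Today AI is used everywhere."]
recall = context_recall(["artificial intelligence", "a field of computer science"], toy_contexts)
pass_rate = faithfulness_pass_rate(["artificial intelligence", "quantum"], [toy_contexts, toy_contexts])
print(recall, pass_rate)
```

In the real metrics, the substring check is replaced by an LLM call, so a paraphrased statement can still count as supported.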

# Prepare Environment

Before we start, ensure you have a GitHub account with access to the GDP Labs GenAI SDK GitHub repository. Then, follow these steps to create a personal access token:
1. Log in to your [GitHub](https://github.com/) account.
2. Navigate to the [Personal Access Tokens](https://github.com/settings/tokens) page.
3. Select the `Generate new token` option. You can use the classic version instead of the beta version.
4. Fill in the required information, ensuring that you've checked the `repo` option to grant access to private repositories.
5. Save the newly generated token.

In [None]:
import getpass
import subprocess
import sys

def install_sdk_library() -> None:
    """Installs the `gdplabs_gen_ai` library from a private GitHub repository using a Personal Access Token.

    This function prompts the user to input their Personal Access Token for GitHub authentication. It then constructs
    the repository URL with the provided token and executes a subprocess to install the library via pip from the
    specified repository.

    Raises:
        subprocess.CalledProcessError: If the installation process returns a non-zero exit code.

    Note:
        The function utilizes `getpass.getpass()` to securely receive the Personal Access Token without echoing it.
    """
    token = getpass.getpass("Input Your Personal Access Token: ")
    repo_url_with_token = f"https://{token}@github.com/GDP-ADMIN/gen-ai-internal.git@f/retrieval_evaluator"
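    # Use the running interpreter's pip so the package is installed into the notebook's environment.
    cmd = [sys.executable, "-m", "pip", "install", "-e", f"git+{repo_url_with_token}#egg=gdplabs_gen_ai[eval]"]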

    try:
        with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              text=True, bufsize=1, universal_newlines=True) as process:
            for line in process.stdout:
                sys.stdout.write(line)

            process.wait()  # Wait for the process to complete.
            if process.returncode != 0:
                raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
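    except subprocess.CalledProcessError as e:
        # Report only the exit code: the command line embeds the access token.
        print(f"Installation failed with exit code {e.returncode}.")
    except Exception as e:
        print(f"An error occurred: {e}.")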

install_sdk_library()

<b>Warning:</b>
After running the cell above, restart the runtime in Google Colab for the changes to take effect. Otherwise, the newly installed libraries might not be recognized.

To restart the runtime in Google Colab:
- Click on the `Runtime` menu.
- Select `Restart runtime`.

Once you have completed the previous step, you are ready to start the evaluation.

# Faithfulness Evaluation

In [2]:
from gdplabs_gen_ai.evaluation import FaithfulnessEvaluator
from llama_index import ServiceContext
from llama_index.llms import OpenAI

## Prepare Data
You need to prepare your data in the following format.

In [3]:
ground_truth_responses = [
    "AI is artificial intelligence",
    "Car is a transportation",
]
retrieved_contexts = [
    ["Today AI is used everywhere.", "AI was first developed on 1970, AI stands for Artificial Intelligence."],
    ["Toyota is a car factory that success in Japan.", "Today lot of people use car as their main transportation."]
]

## Set Up LLM and Evaluator
Next, define the LLM. In this example, we use `GPT-4`. Remember to set `OPENAI_API_KEY` as an environment variable, for example through `os.environ`.
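One possible way to set the key from inside the notebook is sketched below. The helper name `ensure_openai_api_key` is hypothetical and not part of the SDK. It prompts without echoing the key and leaves an existing value untouched.

```python
import getpass
import os
from typing import Callable


def ensure_openai_api_key(prompt: Callable[[str], str] = getpass.getpass) -> None:
    """Set OPENAI_API_KEY for this process if it is not already set."""
    if not os.environ.get("OPENAI_API_KEY"):
        os.environ["OPENAI_API_KEY"] = prompt("Input your OpenAI API key: ")
```

Call `ensure_openai_api_key()` once before creating the LLM. The value only lives in the current process, so it needs to be set again after a runtime restart.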

In [4]:
# Create service context.
gpt4 = OpenAI(temperature=0, model="gpt-4")
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)

# Create evaluator.
evaluator_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)

## Calculate the Score
Finally, you can calculate the `Faithfulness` score using the following code.

In [5]:
scores = []

for ground_truth_response, retrieved_context in zip(ground_truth_responses, retrieved_contexts):
    # The evaluator exposes two APIs: sync and async.
    # This example uses the async version.
    result = await evaluator_gpt4.aevaluate(response=ground_truth_response, contexts=retrieved_context)
    # To use the sync version (outside Jupyter Notebook), use the following line instead.
    # result = evaluator_gpt4.evaluate(response=ground_truth_response, contexts=retrieved_context)

    scores.append(int(result.passing))

print(f"Score: {sum(scores)/len(scores)}")

Score: 1.0


The example above evaluates each ground truth response-context pair as either `PASS` or `NOT PASS`, and then calculates the mean score.
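The mean alone hides which pairs failed. As a minimal sketch, the hypothetical helper below (not part of the SDK) takes the per-pair pass flags, such as the `result.passing` values collected in the loop, and returns both the score and the indices of the pairs that did not pass.

```python
from typing import List, Sequence, Tuple


def summarize_passing(passing: Sequence[bool]) -> Tuple[float, List[int]]:
    """Return the mean pass rate and the indices of the pairs that did not pass."""
    if not passing:
        raise ValueError("No evaluation results to summarize.")
    score = sum(int(p) for p in passing) / len(passing)
    failed_indices = [i for i, p in enumerate(passing) if not p]
    return score, failed_indices


score, failed = summarize_passing([True, False, True, True])
print(f"Score: {score}, failed pairs: {failed}")  # Score: 0.75, failed pairs: [1]
```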

**Note:** This example runs in Jupyter Notebook, which [already runs an event loop internally](https://blog.jupyter.org/ipython-7-0-async-repl-a35ce050f7f7), so you can only use the async version here. To use the sync version, run it outside Jupyter Notebook.
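Because the async API is available, the pairs can also be evaluated concurrently instead of one by one. The sketch below shows the pattern with `asyncio.gather`. `StubEvaluator` is a hypothetical stand-in that returns a plain boolean, whereas the real evaluator returns a result object with a `passing` attribute.

```python
import asyncio
from typing import List


class StubEvaluator:
    """Hypothetical stand-in for an evaluator exposing an async `aevaluate` method."""

    async def aevaluate(self, response: str, contexts: List[str]) -> bool:
        await asyncio.sleep(0)  # Placeholder for the LLM call.
        return any(response.lower() in c.lower() for c in contexts)


async def evaluate_all(evaluator, responses, contexts_per_response) -> List[bool]:
    """Run one evaluation per pair concurrently and keep the input order."""
    tasks = [
        evaluator.aevaluate(response=r, contexts=c)
        for r, c in zip(responses, contexts_per_response)
    ]
    return list(await asyncio.gather(*tasks))


def run_outside_notebook(evaluator, responses, contexts_per_response) -> List[bool]:
    """Entry point for a plain script, where no event loop is running yet."""
    return asyncio.run(evaluate_all(evaluator, responses, contexts_per_response))
```

In a notebook cell, use `await evaluate_all(...)` directly. In a standalone script, call `run_outside_notebook(...)`, since `asyncio.run` cannot be called while an event loop is already running.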

# Context Recall Evaluation

In [6]:
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM

from gdplabs_gen_ai.evaluation import evaluate, ContextRecall
from gdplabs_gen_ai.evaluation.utility import convert_to_hf_dataset

## Prepare Data
You need to prepare your data in the following format.

In [7]:
# Define your data here before converting it into a Hugging Face's `Dataset` object.
retrieved_contexts = [["AI is dangerous", "Artificial Intelligence is hard to master"], 
                      ["Elon Musk is rich", "CEO of SpaceX is Elon Musk"]]
questions = ["What is AI?", "Who is Elon Musk?"]
ground_truth_responses = [["A field of computer science"],
                          ["An entrepreneur and business magnate"]]

dataset = convert_to_hf_dataset(retrieved_contexts, questions=questions, ground_truth_responses=ground_truth_responses)
print(dataset)

Dataset({
    features: ['retrieved_contexts', 'questions', 'ground_truth_responses'],
    num_rows: 2
})
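The three lists are index-aligned: position `i` in each list belongs to the same question, which is why the dataset reports two rows. A small check like the one below can catch a mismatch before the conversion. `check_alignment` is a hypothetical helper and not part of the SDK.

```python
from typing import List


def check_alignment(
    retrieved_contexts: List[List[str]],
    questions: List[str],
    ground_truth_responses: List[List[str]],
) -> int:
    """Return the number of rows, or raise if the three lists are not aligned."""
    lengths = {
        "retrieved_contexts": len(retrieved_contexts),
        "questions": len(questions),
        "ground_truth_responses": len(ground_truth_responses),
    }
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Input lists have different lengths: {lengths}")
    return len(questions)


num_rows = check_alignment(
    [["AI is dangerous"], ["Elon Musk is rich"]],
    ["What is AI?", "Who is Elon Musk?"],
    [["A field of computer science"], ["An entrepreneur and business magnate"]],
)
print(num_rows)  # 2
```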


## Set Up LLM and Evaluator
Next, define the LLM. In this example, we use `GPT-4`. Remember to set `OPENAI_API_KEY` as an environment variable, for example through `os.environ`.

In [8]:
gpt4 = ChatOpenAI(model_name="gpt-4")
gpt4_wrapper = LangchainLLM(llm=gpt4)

context_recall = ContextRecall(
    batch_size=10
)
context_recall.llm = gpt4_wrapper

## Calculate the Score
Finally, you can calculate the `ContextRecall` score using the following code.

In [9]:
score_gpt4 = evaluate(
    dataset,
    metrics=[context_recall],
    column_map={"contexts": "retrieved_contexts", "question": "questions", "ground_truths": "ground_truth_responses"},
)

print(score_gpt4)

evaluating with [context_recall]


100%|██████████| 1/1 [00:10<00:00, 10.34s/it]


{'context_recall': 0.0000}
