# Example of Evaluating Relevance Between Retrieved Contexts and Ground Truth Contexts

**Authors**: 
- Novan Parmonangan Simanjuntak (novan.p.simanjuntak@gdplabs.id)

**Reviewers**: 
- Komang Elang Surya Prawira (komang.e.s.prawira@gdplabs.id)
- Surya Mahadi (made.r.s.mahadi@gdplabs.id)

## References
[1] [GDP Labs GenAI SDK - Evaluate Relevance Between Retrieved Contexts and Ground Truth Contexts](https://docs.glair.ai/generative-internal/modules/evaluator/cookbook/retrieval-evaluator/retrieval-evaluation-methods/evaluate-relevance-between-retrieved-contexts-and-ground-truth-contexts) \
[2] [BEIR](https://github.com/beir-cellar/beir)

# Prepare Environment

Before we start, ensure you have a GitHub account with access to the GDP Labs GenAI SDK GitHub repository. Then, follow these steps to create a personal access token:
1. Log in to your [GitHub](https://github.com/) account.
2. Navigate to the [Personal Access Tokens](https://github.com/settings/tokens) page.
3. Select the `Generate new token` option. You can use the classic version instead of the beta version.
4. Fill in the required information, ensuring that you've checked the `repo` option to grant access to private repositories.
5. Save the newly generated token.

In [None]:
import getpass
import subprocess
import sys

def install_sdk_library() -> None:
    """Installs the `gdplabs_gen_ai` library from a private GitHub repository using a Personal Access Token.

    This function prompts the user to input their Personal Access Token for GitHub authentication. It then constructs
    the repository URL with the provided token and executes a subprocess to install the library via pip from the
    specified repository.

    Raises:
        subprocess.CalledProcessError: If the installation process returns a non-zero exit code.

    Note:
        The function utilizes `getpass.getpass()` to securely receive the Personal Access Token without echoing it.
    """
    token = getpass.getpass("Input Your Personal Access Token: ")
    repo_url_with_token = f"https://{token}@github.com/GDP-ADMIN/gen-ai-internal.git"
    # Run pip through the current interpreter so the package is installed into this environment.
    cmd = [sys.executable, "-m", "pip", "install", f"gdplabs_gen_ai[eval] @ git+{repo_url_with_token}"]

    # Stream pip's output line by line while the installation runs.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as process:
        for line in process.stdout:
            sys.stdout.write(line)

        process.wait()  # Wait for the process to complete.

    if process.returncode != 0:
        # Redact the token so the raised error does not expose it.
        safe_cmd = [arg.replace(token, "***") if token else arg for arg in cmd]
        raise subprocess.CalledProcessError(returncode=process.returncode, cmd=safe_cmd)

install_sdk_library()

<b>Warning:</b>
After running the command above, you need to restart the runtime in Google Colab for the changes to take effect. Not doing so might lead to the newly installed libraries not being recognized.

To restart the runtime in Google Colab:
- Click on the `Runtime` menu.
- Select `Restart runtime`.

Once you have completed the previous step, you are ready to start the evaluation.

## Evaluate Contexts and Ground Truth Contexts
Start by importing the required class.

In [1]:
from typing import Dict, List

from gdplabs_gen_ai.evaluation import EvaluateRetrieval

Then prepare your dataset.

In [2]:
# Ground truth contexts for each query with relevance scores. 
# Use `query_id` and `context_id` to represent the query and the contexts.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1}
}

# Retrieved contexts for each query with similarity scores. They do not need to be sorted by score.
# In this example, we retrieve the top 5 contexts for each query.
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2}
}
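
Because the retrieved contexts do not need to be sorted, it can help to look at the ranking that the similarity scores imply before running the evaluation. The sketch below is a plain-Python illustration and not part of the SDK; `rank_contexts` is a hypothetical helper name, and ties are assumed to keep their insertion order (the evaluator may break ties differently, although that does not matter for this dataset because the only tie is between two non-relevant contexts).

```python
from typing import Dict, List

# Same toy dataset as in the cell above.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1},
}
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2},
}


def rank_contexts(scores: Dict[str, float]) -> List[str]:
    """Return context IDs ordered from highest to lowest similarity score."""
    return sorted(scores, key=scores.get, reverse=True)


for query_id, scores in retrieved_contexts.items():
    relevant = ground_truth_contexts[query_id]
    # Mark ground truth contexts with an asterisk.
    ranking = [doc + "*" if relevant.get(doc, 0) > 0 else doc for doc in rank_contexts(scores)]
    print(query_id, ranking)
```

For `query2`, for instance, the two ground truth contexts (`doc2` and `doc4`) end up at ranks 4 and 5.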

Prepare the list of k values that you want to evaluate (NDCG@k, MAP@k, Recall@k, Precision@k).

In [3]:
# List of k values for evaluation.
k_values: List[int] = [1, 3, 5]

After we prepare the dataset and k values, we can start the evaluation.

In [4]:
# Evaluate retrieval performance using NDCG@k, MAP@k, Recall@k, and Precision@k.
ndcg, map_score, recall, precision = EvaluateRetrieval.evaluate(
    ground_truth_contexts, retrieved_contexts, k_values
)

# Display the evaluation metrics.
print(f"NDCG: {ndcg}")
print(f"MAP: {map_score}")
print(f"Recall: {recall}")
print(f"Precision: {precision}")

NDCG: {'NDCG@1': 0.25, 'NDCG@3': 0.38268, 'NDCG@5': 0.68384}
MAP: {'MAP@1': 0.08333, 'MAP@3': 0.34722, 'MAP@5': 0.57222}
Recall: {'Recall@1': 0.08333, 'Recall@3': 0.41667, 'Recall@5': 1.0}
Precision: {'P@1': 0.25, 'P@3': 0.41667, 'P@5': 0.5}


## Analyze Results

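The output summarizes the performance of the retrieval system, with each metric averaged over the four queries. Because every ground truth relevance score in this dataset is 1, relevance is binary. Let's break down each metric and interpret what it signifies:  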

Normalized Discounted Cumulative Gain (NDCG):  
NDCG@1: 0.25  
NDCG@3: 0.38268  
NDCG@5: 0.68384  
Interpretation: NDCG measures ranking quality by rewarding relevant documents more the higher they are ranked, normalized by the best possible ordering. The score rises as k grows from 1 to 5, and 0.68384 at NDCG@5 indicates that the relevant documents appear within the top 5 but not always in the best positions. Because relevance is binary in this dataset, NDCG@1 of 0.25 means that only one of the four queries has a relevant document at rank 1.  
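
To make the NDCG numbers concrete, the following sketch recomputes them in plain Python. It assumes linear gain and a log2 rank discount, averaged over queries; this is an illustration and not the SDK's implementation, and `ndcg_at_k` and `mean_ndcg` are hypothetical helper names. Under those assumptions it reproduces the values reported above.

```python
import math
from typing import Dict

# Same toy dataset as in the evaluation cells above.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1},
}
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2},
}


def ndcg_at_k(relevance: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """NDCG@k for one query (assumed: linear gain, log2 discount)."""
    ranking = sorted(scores, key=scores.get, reverse=True)[:k]
    # DCG: each retrieved context contributes its relevance, discounted by its rank.
    dcg = sum(relevance.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranking, start=1))
    # IDCG: the DCG of the best possible ordering of the ground truth contexts.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def mean_ndcg(k: int) -> float:
    """Average NDCG@k over all queries."""
    per_query = [ndcg_at_k(ground_truth_contexts[q], retrieved_contexts[q], k)
                 for q in ground_truth_contexts]
    return sum(per_query) / len(per_query)


for k in [1, 3, 5]:
    print(f"NDCG@{k}: {mean_ndcg(k):.5f}")
```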


Mean Average Precision (MAP):  
MAP@1: 0.08333  
MAP@3: 0.34722  
MAP@5: 0.57222  
Interpretation: MAP is the mean over all queries of average precision, which averages the precision measured at the rank of each relevant document. The scores follow the same trend as NDCG and rise as k grows. The low MAP@1 reflects that only one query has a relevant document at rank 1, while considering up to 5 documents brings in the relevant documents that are ranked lower.  
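
The MAP values can be checked the same way. The sketch below assumes that average precision at k sums the precision at each relevant rank within the top k and divides by the total number of relevant contexts for the query. It is an illustration rather than the SDK's code, and `average_precision_at_k` and `mean_average_precision` are hypothetical names. With that definition it matches the reported MAP@1, MAP@3, and MAP@5.

```python
from typing import Dict

# Same toy dataset as in the evaluation cells above.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1},
}
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2},
}


def average_precision_at_k(relevance: Dict[str, int], scores: Dict[str, float], k: int) -> float:
    """Average precision for one query, truncated at rank k."""
    ranking = sorted(scores, key=scores.get, reverse=True)[:k]
    num_relevant = sum(1 for rel in relevance.values() if rel > 0)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if relevance.get(doc, 0) > 0:
            hits += 1
            # Precision at this rank, counted only where a relevant context appears.
            precision_sum += hits / rank
    return precision_sum / num_relevant if num_relevant else 0.0


def mean_average_precision(k: int) -> float:
    """Average the per-query average precision over all queries."""
    per_query = [average_precision_at_k(ground_truth_contexts[q], retrieved_contexts[q], k)
                 for q in ground_truth_contexts]
    return sum(per_query) / len(per_query)


for k in [1, 3, 5]:
    print(f"MAP@{k}: {mean_average_precision(k):.5f}")
```

MAP@1 is lower than P@1 here because the single hit at rank 1 (for `query3`) is divided by that query's three relevant contexts.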


Recall:  
Recall@1: 0.08333  
Recall@3: 0.41667  
Recall@5: 1.0  
Interpretation: Recall measures the fraction of relevant documents that are retrieved within the top k. The perfect score at Recall@5 indicates that every relevant document appears within the top 5. The lower scores at Recall@1 and Recall@3 show that many relevant documents are ranked fourth or fifth rather than at the very top.  


Precision:  
P@1: 0.25  
P@3: 0.41667  
P@5: 0.5  
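Interpretation: Precision measures the fraction of the top k retrieved documents that are relevant. P@5 of 0.5 means that half of the top 5 documents are relevant, which is the maximum achievable here because each query has only two or three relevant documents. P@1 of 0.25 means that only one of the four queries has a relevant document at rank 1.  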
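
Precision@k and Recall@k follow directly from counting relevant contexts in the top k. The sketch below is a plain-Python illustration, not the SDK's implementation; `precision_recall_at_k` and `mean_at_k` are hypothetical helper names. It also computes the highest mean P@5 that this dataset permits, which puts the reported 0.5 in context.

```python
from typing import Dict, Tuple

# Same toy dataset as in the evaluation cells above.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1},
}
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2},
}


def precision_recall_at_k(relevance: Dict[str, int], scores: Dict[str, float],
                          k: int) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for one query."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(1 for doc in top_k if relevance.get(doc, 0) > 0)
    num_relevant = sum(1 for rel in relevance.values() if rel > 0)
    precision = hits / k
    recall = hits / num_relevant if num_relevant else 0.0
    return precision, recall


def mean_at_k(k: int) -> Tuple[float, float]:
    """Average precision@k and recall@k over all queries."""
    pairs = [precision_recall_at_k(ground_truth_contexts[q], retrieved_contexts[q], k)
             for q in ground_truth_contexts]
    mean_precision = sum(p for p, _ in pairs) / len(pairs)
    mean_recall = sum(r for _, r in pairs) / len(pairs)
    return mean_precision, mean_recall


for k in [1, 3, 5]:
    precision, recall = mean_at_k(k)
    print(f"P@{k}: {precision:.5f}  Recall@{k}: {recall:.5f}")

# Upper bound on mean P@5: a query with n relevant contexts can reach at most min(n, 5) / 5.
# All relevance scores in this dataset are 1, so len(rel) is the number of relevant contexts.
best_p5 = sum(min(len(rel), 5) / 5 for rel in ground_truth_contexts.values()) / len(ground_truth_contexts)
print(f"Best achievable P@5: {best_p5:.2f}")
```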

## Actions Based on Metrics

1. **Focus on Improving Top-Ranked Results**: The lower scores at NDCG@1 and MAP@1 suggest a need to enhance the algorithm to ensure the most relevant documents are ranked higher.  
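2. **Balance Between Precision and Recall**: Recall@5 is perfect while P@5 is 0.5. With only two or three relevant documents per query, 0.5 is the highest P@5 this dataset allows, so the room for improvement lies at smaller k (P@1 and P@3). Aim to raise precision there without significantly compromising recall.  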
3. **Analyze Individual Queries**: Investigate queries where the retrieval system underperforms, especially those contributing to lower scores at lower 'k' values.  
4. **Refine Retrieval Algorithms**: Consider adjusting or fine-tuning the retrieval algorithm, possibly integrating more sophisticated techniques like semantic search or machine learning models that can better understand query intent.  
5. **Iterative Testing and Evaluation**: Continuously evaluate and refine the system using different datasets and queries to ensure robustness and versatility.  
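
As a starting point for analyzing individual queries, one simple option is to look at the rank of the first relevant context for each query. The sketch below is an illustration only; `first_relevant_rank` is a hypothetical helper, and the cutoff of rank 3 for flagging a query is an arbitrary assumption that should be adapted to the use case.

```python
from typing import Dict, Optional

# Same toy dataset as in the evaluation cells above.
ground_truth_contexts: Dict[str, Dict[str, int]] = {
    "query1": {"doc1": 1, "doc3": 1, "doc5": 1},
    "query2": {"doc2": 1, "doc4": 1},
    "query3": {"doc6": 1, "doc8": 1, "doc10": 1},
    "query4": {"doc7": 1, "doc9": 1},
}
retrieved_contexts: Dict[str, Dict[str, float]] = {
    "query1": {"doc1": 0.3, "doc2": 0.9, "doc3": 0.8, "doc4": 0.2, "doc5": 0.7},
    "query2": {"doc1": 0.8, "doc2": 0.3, "doc3": 0.6, "doc4": 0.2, "doc5": 0.4},
    "query3": {"doc6": 0.5, "doc7": 0.3, "doc8": 0.7, "doc9": 0.2, "doc10": 0.6},
    "query4": {"doc1": 0.8, "doc3": 0.8, "doc5": 0.5, "doc7": 0.3, "doc9": 0.2},
}


def first_relevant_rank(relevance: Dict[str, int], scores: Dict[str, float]) -> Optional[int]:
    """Return the 1-based rank of the first relevant context, or None if there is none."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    for rank, doc in enumerate(ranking, start=1):
        if relevance.get(doc, 0) > 0:
            return rank
    return None


ranks = {q: first_relevant_rank(ground_truth_contexts[q], retrieved_contexts[q])
         for q in ground_truth_contexts}

# Flag queries whose first relevant context is missing or ranked below the assumed cutoff.
CUTOFF = 3  # hypothetical threshold
weak_queries = [q for q, rank in ranks.items() if rank is None or rank > CUTOFF]

print(ranks)
print("Queries to investigate:", weak_queries)
```

On this dataset the check flags `query2` and `query4`, whose relevant contexts only appear at ranks 4 and 5.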