## 📚 Prerequisites

Before executing this notebook, make sure you have properly set up your Azure Services, created your Conda environment, and configured your environment variables as per the instructions provided in the [README.md](README.md) file.


## Understanding Quantitative Measures of Relevance

In order to evaluate the effectiveness of our search system, we use several metrics that measure the relevance of the results it returns. These metrics help us understand how well our system is performing and guide us in tuning it for better performance.

- **NDCG@10**: Normalized Discounted Cumulative Gain at 10 (NDCG@10) measures how well a retrieval system finds and correctly orders the top 10 documents. The score ranges from 0 to 1 (it is often reported on a 0 to 100 scale), with higher scores indicating that the system's ordering closely matches the ideal order. This metric is widely used because it rewards both returning relevant results and placing them in the right order.

- **NDCG@3**: NDCG@3 is similar to NDCG@10, but it focuses on the top 3 documents. This metric is particularly relevant in contexts where it's crucial to have the highest accuracy in the topmost results.

### Deep Dive into Discounted Cumulative Gain (DCG)

Discounted Cumulative Gain (DCG) is a metric used to measure the effectiveness of ranking algorithms, especially in information retrieval tasks like search engine result ranking. It assesses the quality of the ranking by considering both the relevance of the documents and their positions in the ranking list.

#### Calculation

In the context of Azure AI Search, suppose the system returns five chunks or documents (C1, C2, C3, C4, C5) in a specific order for a user query. Each chunk or document is assigned a relevance score on a scale from 0 to 3, where:

- 0 indicates the chunk/document is not relevant.
- 1-2 indicates the chunk/document is somewhat relevant.
- 3 indicates the chunk/document is completely relevant.

For example, let's say we have the following relevance scores:

- C1: 3
- C2: 2
- C3: 0
- C4: 0
- C5: 1

The Cumulative Gain (CG) is the sum of these relevance scores:

```
CG5 = Σ(rel_i) = 3 + 2 + 0 + 0 + 1 = 6
```

The Discounted Cumulative Gain (DCG) is a measure that discounts or reduces the relevance scores based on their position in the result set. It's calculated using the formula:

```
DCGp = Σ(rel_i / log2(i + 1)), summed over positions i = 1..p
```

For the given example:

```
DCG5 = (3 / log2(2)) + (2 / log2(3)) + (0 / log2(4)) + (0 / log2(5)) + (1 / log2(6))
DCG5 ≈ 3 + 1.26 + 0 + 0 + 0.39 ≈ 4.65
```

This DCG score indicates the overall relevance of the search results, taking into account both the relevance of each individual chunk/document and its position in the result set. The higher the DCG score, the better the search results are in terms of relevance.
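
The CG and DCG arithmetic for this example can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library; the function names `cumulative_gain` and `dcg` are illustrative and not part of any Azure or scikit-learn API.

```python
import math

# Relevance scores for C1..C5, in the order the system returned them.
relevances = [3, 2, 0, 0, 1]

def cumulative_gain(rels):
    """Sum of relevance scores, ignoring position."""
    return sum(rels)

def dcg(rels):
    """Discounted cumulative gain: rel_i / log2(i + 1), with positions starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

cg = cumulative_gain(relevances)
dcg5 = dcg(relevances)
print(f"CG = {cg}, DCG@5 = {dcg5:.2f}")  # CG = 6, DCG@5 = 4.65
```

Because the discount grows with position, moving the relevant chunk C5 higher in the list would raise the DCG even though the CG stays at 6.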

#### Ideal Discounted Cumulative Gain (IDCG)

To calculate the IDCG, we reorder the chunks/documents in descending order of relevance and compute the DCG. This gives us the best possible DCG for a given set of chunks/documents.

For the example given:
```
IDCG5 = (3 / log2(2)) + (2 / log2(3)) + (1 / log2(4)) + (0 / log2(5)) + (0 / log2(6))
IDCG5 ≈ 4.76
```
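
As a sketch of the same idea in code, the ideal ordering can be obtained by sorting the relevance scores in descending order before applying the DCG formula. The helper names `dcg` and `ideal_dcg` are hypothetical and chosen only for this illustration.

```python
import math

def dcg(rels):
    """Discounted cumulative gain with positions starting at 1."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels, start=1))

def ideal_dcg(rels):
    """DCG of the best possible ordering: most relevant items first."""
    return dcg(sorted(rels, reverse=True))

relevances = [3, 2, 0, 0, 1]  # C1..C5 as returned by the system
idcg5 = ideal_dcg(relevances)
print(f"IDCG@5 = {idcg5:.2f}")  # IDCG@5 = 4.76
```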

#### Normalized Discounted Cumulative Gain (nDCG)

Normalized DCG (nDCG) is obtained by dividing the DCG by the IDCG. This normalization helps in comparing the performance of different ranking algorithms across different datasets.

For the example given:
```
nDCG = DCG5 / IDCG5
nDCG ≈ 0.98
```
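
The normalization step, together with the top-k cutoff used by NDCG@3 and NDCG@10, can be sketched as follows. The function names are illustrative, and returning 0.0 when no result is relevant is a convention assumed here rather than something stated in this notebook.

```python
import math

def dcg_at_k(rels, k=None):
    """DCG over the first k positions (all positions when k is None)."""
    top = rels if k is None else rels[:k]
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(top, start=1))

def ndcg_at_k(rels, k=None):
    """nDCG = DCG / IDCG; defined here as 0.0 when the ideal DCG is zero."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

relevances = [3, 2, 0, 0, 1]  # C1..C5 as returned by the system
ndcg5 = ndcg_at_k(relevances)
ndcg3 = ndcg_at_k(relevances, k=3)
print(f"nDCG@5 = {ndcg5:.3f}")  # nDCG@5 = 0.976
print(f"nDCG@3 = {ndcg3:.3f}")  # nDCG@3 = 0.895
```

The score drops at k=3 because the somewhat relevant chunk C5 falls outside the top three, while the ideal ordering would have placed it third.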

DCG, IDCG, and nDCG capture ranking quality by accounting for both relevance and position. They are valuable metrics for evaluating and improving Azure AI Search performance in the context of the RAG pattern: by understanding them, we can tune our search systems to deliver the most relevant chunks and documents for each query.

## Python Implementation Using sklearn.metrics.ndcg_score

The `sklearn.metrics.ndcg_score` function computes the Normalized Discounted Cumulative Gain (NDCG) for one or more ranked lists, which makes it a convenient way to measure ranking quality in information retrieval tasks.

Here's how you can leverage `ndcg_score` for measuring ranking and relevance:

### Data Preparation

Ensure you have two arrays:

- `y_true`: This array contains the true relevance scores of items (ground truth).
- `y_score`: This array contains the predicted scores of items.

### Calculate NDCG

Call the `ndcg_score` function with appropriate parameters:

- `y_true`: Array-like of shape (n_samples, n_labels) containing true relevance scores.
- `y_score`: Array-like of shape (n_samples, n_labels) containing predicted scores.

Optionally, you can specify:

- `k`: Consider only the highest k scores in the ranking.
- `sample_weight`: Apply sample weights if necessary.
- `ignore_ties`: Assume there are no ties in `y_score` (likely when the scores are continuous) for efficiency gains.
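
A short sketch of the optional parameters, using made-up relevance and score values: `k` truncates the ranking before scoring, and `ignore_ties=True` should give the same result as the default whenever `y_score` contains no ties.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query with five candidate documents (illustrative values).
y_true = np.asarray([[10, 0, 0, 1, 5]])          # ground-truth relevance
y_score = np.asarray([[0.1, 0.2, 0.3, 4, 70]])   # scores produced by the ranker

ndcg_all = ndcg_score(y_true, y_score)                     # full ranking
ndcg_top3 = ndcg_score(y_true, y_score, k=3)               # top 3 results only
ndcg_top1 = ndcg_score(y_true, y_score, k=1)               # top result only
ndcg_fast = ndcg_score(y_true, y_score, ignore_ties=True)  # valid here: no tied scores

print("NDCG (all):", ndcg_all)
print("NDCG@3:", ndcg_top3)
print("NDCG@1:", ndcg_top1)  # 0.5: the top-ranked item has relevance 5, the ideal top item has 10
print("NDCG (ignore_ties):", ndcg_fast)
```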

### Interpret Results

The function returns a value between 0 and 1, representing the averaged NDCG scores for all samples. A higher NDCG score indicates better ranking performance, where items with higher true relevance scores are ranked higher in `y_score`.

Here's a simplified example:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Example data
true_relevance = np.asarray([[10, 0, 0, 1, 5]])  # True relevance scores
predicted_scores = np.asarray([[.1, .2, .3, 4, 70]])  # Predicted scores

# Calculate NDCG
ndcg = ndcg_score(true_relevance, predicted_scores)

print("NDCG Score:", ndcg)
```
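
When several queries are scored at once, each row of the input arrays is one query and the returned value is the mean of the per-query NDCG scores. The sketch below illustrates this with two hypothetical queries that share the relevance labels from the C1..C5 example; the score values are assumptions chosen only to encode a result order.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Two hypothetical queries, five results each (rows = queries).
y_true = np.asarray([[3, 2, 0, 0, 1],
                     [3, 2, 0, 0, 1]])
y_score = np.asarray([[5, 4, 3, 2, 1],   # ranks the results in the order C1..C5
                      [1, 2, 3, 4, 5]])  # ranks the same results in reverse order

# Score each query on its own, then all queries together.
per_query = [ndcg_score(y_true[i:i + 1], y_score[i:i + 1]) for i in range(len(y_true))]
overall = ndcg_score(y_true, y_score)

print("Per-query NDCG:", per_query)
print("Averaged NDCG:", overall)
```

The first row reproduces the hand-computed nDCG of about 0.98, and the reversed ranking in the second row scores lower, which pulls the average down.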


```python
import numpy as np
from sklearn.metrics import ndcg_score

# Example data
true_relevance = np.asarray([[10, 5, 5, 1, 5]])  # True relevance scores
predicted_scores = np.asarray([[8, 10, 0, 0, 0]])  # Predicted scores

# Calculate NDCG
ndcg = ndcg_score(true_relevance, predicted_scores, k=5)

print("NDCG Score:", ndcg)
```

```
NDCG Score: 0.887075631680536
```



## LLM-Driven Evaluation

```python
from src.evaluators.pf_tester import PromptFlowManagerEvaluator

# Path to the relevance evaluation flow definition (Windows-style separators)
eval_flow = "src\\evaluators\\promptflow_util\\relevance\\flow.dag.yaml"
evaluator = PromptFlowManagerEvaluator(eval_flow=eval_flow)
```

```python
# Inputs for a single evaluation run
chat_history = []
# No trailing comma after the string: a trailing comma would turn `question` into a one-element tuple
question = "How does the 'Input Range Hi and Lo' setting correlate with valve travel range and Zero Power Condition in the FIELDVUE DVC6200 HW2, and what are the implications for valve calibration?"
source = "Figure 3-1 and related text detailing calibration related to Zero Power Condition."
# Note: this answer is unrelated to the question
answer = "The band The Beatles began their journey in London, England, and they changed the history of music."
```

```python
evaluator.run_promptflow_evaluations(chat_history, question=question, context=source, answer=answer)
```

```
2024-02-13 18:36:30 -0600   12808 execution.flow     INFO     Start executing nodes in thread pool mode.
2024-02-13 18:36:30 -0600   12808 execution.flow     INFO     Start to run 2 nodes with concurrency level 16.
2024-02-13 18:36:30 -0600   12808 execution.flow     INFO     Executing node relevance_score. node run id: 7a1813e2-5bff-465f-b82b-c355f5e8d9d4_relevance_score_0
2024-02-13 18:36:31 -0600   12808 execution.flow     INFO     Node relevance_score completes.
2024-02-13 18:36:31 -0600   12808 execution.flow     INFO     Executing node concat_scores. node run id: 7a1813e2-5bff-465f-b82b-c355f5e8d9d4_concat_scores_0
2024-02-13 18:36:31 -0600   12808 execution.flow     INFO     Node concat_scores completes.
2024-02-13 18:36:31 -0600   12808 execution.flow     INFO     Start to run 1 nodes with concurrency level 16.
2024-02-13 18:36:31 -0600   12808 execution.flow     INFO     Executing node aggregate_variants_results. node run id: 7a1813e2-5bff-465f-b82b-c355f5e8d9d4_aggregate_vari

2024-02-13 18:36:31,988 - micro - MainProcess - INFO     Test result: {'gpt_relevance': 1.0} (pf_tester.py:run_promptflow_evaluations:100)

{'gpt_relevance': 1.0}
```