## NLTK metrics (Non-LLM evaluations)

Some sample evaluations with non-LLM metrics: BLEU (from NLTK), ROUGE (from `rouge_score`), and BERTScore (from `bert_score`).

In [1]:
import nltk  # For BLEU and other metrics (if needed)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score
import numpy as np

# Example data (replace with your actual generated and reference texts)
generated_texts = [
    "The cat sat on the mat.",
    "The dog barked loudly.",
    "The quick brown fox jumps over the lazy dog."
]
reference_texts = [
    ["The cat sits on the mat."],  # List of reference sentences for each generated sentence
    ["The dog barks loudly."],
    ["The quick brown fox jumped over the lazy dog."]
]


# 1. BLEU Score
def calculate_bleu(generated, references):
    """Calculates BLEU score."""
    smoothing = SmoothingFunction().method4  # Choose a smoothing function
    bleu_scores = []
    for i in range(len(generated)):
        score = sentence_bleu(references[i], generated[i].split(), smoothing_function=smoothing)
        bleu_scores.append(score)
    return np.mean(bleu_scores)


bleu_score = calculate_bleu(generated_texts, reference_texts)
print(f"BLEU Score: {bleu_score}")



# 2. ROUGE Score
def calculate_rouge(generated, references):
    """Calculates ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_1_scores = []
    rouge_2_scores = []
    rouge_l_scores = []

    for i in range(len(generated)):
        scores = scorer.score(references[i][0], generated[i]) # reference is a list of sentences, we take the first one
        rouge_1_scores.append(scores['rouge1'].fmeasure)
        rouge_2_scores.append(scores['rouge2'].fmeasure)
        rouge_l_scores.append(scores['rougeL'].fmeasure)
    return np.mean(rouge_1_scores), np.mean(rouge_2_scores), np.mean(rouge_l_scores)

rouge_1, rouge_2, rouge_l = calculate_rouge(generated_texts, reference_texts)

print(f"ROUGE-1: {rouge_1}")
print(f"ROUGE-2: {rouge_2}")
print(f"ROUGE-L: {rouge_l}")




# 3. BERTScore
def calculate_bertscore(generated, references):
    """Calculates BERTScore."""
    P, R, F1 = score(generated, [ref[0] for ref in references], lang="en")  # reference is a list of sentences, we take the first one
    return np.mean(P.cpu().numpy()), np.mean(R.cpu().numpy()), np.mean(F1.cpu().numpy())


bert_p, bert_r, bert_f1 = calculate_bertscore(generated_texts, reference_texts)
print(f"BERTScore Precision: {bert_p}")
print(f"BERTScore Recall: {bert_r}")
print(f"BERTScore F1: {bert_f1}")

BLEU Score: 0.0
ROUGE-1: 0.9444444444444445
ROUGE-2: 0.8666666666666667
ROUGE-L: 0.9444444444444445


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore Precision: 0.9869623184204102
BERTScore Recall: 0.9869623184204102
BERTScore F1: 0.9869623184204102


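For all of these metrics, a higher score means the generated text is closer to the reference. The BLEU score of 0.0 is an artifact rather than a real result: `sentence_bleu` expects each reference as a list of tokens, but the cell above passes the references as raw strings.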
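
A BLEU score of 0.0 for near-identical sentences points to a tokenization mismatch rather than bad output. The minimal sketch below uses only the standard library to show the effect: iterating over a raw string yields characters, so no word token from the hypothesis can match. `unigram_overlap` is a hypothetical helper written for this illustration and is not NLTK's implementation.

```python
from collections import Counter


def unigram_overlap(hypothesis_tokens, reference):
    """Count hypothesis tokens that also occur in `reference` (clipped counts)."""
    ref_counts = Counter(reference)  # iterating a str yields single characters
    hyp_counts = Counter(hypothesis_tokens)
    return sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())


hypothesis = "The cat sat on the mat.".split()
raw_reference = "The cat sits on the mat."   # a raw string, as in the cell above
tokenized_reference = raw_reference.split()  # a list of word tokens

overlap_raw = unigram_overlap(hypothesis, raw_reference)
overlap_tokenized = unigram_overlap(hypothesis, tokenized_reference)
print(overlap_raw, overlap_tokenized)
```

With NLTK, the corresponding change would be to pass `[ref.split() for ref in references[i]]` as the first argument to `sentence_bleu`.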

## DeepEval (LLM Evaluation)

To jump to the table summarizing the metrics covered below, see the [Summary Table](#summary_table).

In [None]:
# !pip install deepeval

In [11]:
# !pip install ipywidgets

Add `OPENAI_API_KEY` to your `.env` file and load it into the environment using the code below.

In [19]:
%load_ext dotenv
%dotenv

***Note: Most of the evaluation metrics from deepeval are self-explaining***

### G-Eval

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) prompting to evaluate LLM outputs based on any custom criteria. The G-Eval metric is the most versatile type of metric deepeval has to offer, and is capable of evaluating almost any use case with human-like accuracy.

In [21]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams, LLMTestCase

In [22]:


correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    model='gpt-4o-mini',
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    # evaluation_steps=[
    #     "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
    #     "You should also heavily penalize omission of detail",
    #     "Vague language, or contradicting OPINIONS, are OK"
    # ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)
correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)

Output()

0.19064650247410678
The actual output fails to directly answer the question about who ran up the tree, contrasting with the expected output which clearly states 'The cat.'


It gives a score and a reason as well. So, cool!

### Answer Relevancy

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the **actual_output** of your LLM application is compared to the provided **input**. deepeval's answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

In [23]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost. We offer free shipping."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
    # verbose_mode=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.5
The score is 0.50 because while some relevant information about shoe fit may have been provided, the mention of free shipping detracted from addressing the main concern of whether the shoes will fit.


Cool! Out of the two statements in the actual output, we have one relevant statement, so the score is 0.5.

### Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the **actual_output** factually aligns with the contents of your **retrieval_context**. deepeval's faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

In [25]:
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost. We offer free shipping."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost and for free shipping as well."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

1.0
The score is 1.00 because there are no contradictions found, indicating a perfect match between the actual output and the retrieval context.


Although the answer is not fully relevant to the question, faithfulness only checks whether the output factually aligns with the context, so it gives a perfect score of 1.

Next, let's change the context so that it no longer agrees with the output: the context now says the full refund applies only to purchases made in the next week, while the output still promises a 30-day refund. Let's see what the faithfulness score is.

In [27]:
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost. We offer free shipping."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a full refund at no extra cost for the purchases made in the next week. For the next 30-days, we offer free shipping as well."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.5
The score is 0.50 because the actual output incorrectly states a 30-day full refund at no extra cost, which contradicts the retrieval context that specifies refunds are only available for purchases made within the next week.


The score drops as expected, and the reasoning is spot on!

### Contextual Precision

The contextual precision metric measures your RAG pipeline's retriever by evaluating whether nodes in your **retrieval_context** that are relevant to the given **input** are ranked higher than irrelevant ones. deepeval's contextual precision metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

In [29]:
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30 day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Shoe sizes are as fit per the size chart","We offer free shipping","All customers are eligible for a 30 day full refund at no extra cost."]

metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.3333333333333333
The score is 0.33 because the relevant node ranked third provides a clear solution to the concern about the shoes not fitting, while the first and second nodes are irrelevant to the fit issue. Specifically, the first node states, "The statement about 'Shoe sizes are as fit per the size chart' does not address the concern about fit directly," and the second node mentions, "We offer free shipping," which is unrelated. This leads to some relevant content being overshadowed by irrelevant nodes, resulting in a lower score.


Now let's change the order of the contexts, moving the relevant node from third to second place, and check the score again.

In [30]:
from deepeval import evaluate
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30 day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Shoe sizes are as fit per the size chart","All customers are eligible for a 30 day full refund at no extra cost.", "We offer free shipping",]

metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.5
The score is 0.50 because only one relevant node is ranked above two irrelevant nodes, which hinders the overall precision. Specifically, the second node addresses fitting concerns by stating "All customers are eligible for a 30 day full refund at no extra cost," while the first node states, "Shoe sizes are as fit per the size chart," which fails to provide a direct response to the fitting issue. The presence of two irrelevant nodes demonstrates that the relevant content is not consistently at the forefront, resulting in the current score.


In [37]:
(1/1)*(0*0/1+1*1/2+1*0/3)


0.5
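
The expression above appears to be the weighted cumulative precision that deepeval documents for this metric: for each rank k, take the precision over the first k nodes, count it only if node k itself is relevant, and average over the relevant nodes. A minimal sketch, assuming the relevance verdicts (which deepeval obtains from an LLM) are already available as booleans; `contextual_precision` is a hypothetical helper name, not a deepeval function.

```python
def contextual_precision(relevance):
    """Weighted cumulative precision over a ranked list of relevance flags."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only at ranks holding a relevant node.
            score += relevant_so_far / k
    return score / total_relevant


relevant_third = contextual_precision([False, False, True])
relevant_second = contextual_precision([False, True, False])
print(relevant_third, relevant_second)
```

The two calls correspond to the two orderings tried above: the relevant node in third place gives 1/3, and in second place gives 1/2.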

### Contextual Recall

The contextual recall metric measures the quality of your RAG pipeline's retriever by evaluating the extent to which the **retrieval_context** aligns with the **expected_output**. deepeval's contextual recall metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

In [39]:
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30 day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Shoe sizes are as fit per the size chart","All customers are eligible for a 30 day full refund at no extra cost.", "We offer free shipping",]

metric = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

1.0
The score is 1.00 because the sentence directly matches the information provided in the retrieval context, specifically from the 2nd node.


Since the information in the expected output is fully supported by one of the retrieved contexts, the metric gives a perfect score. It does not consider the actual output.
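
A rough sketch of the idea, assuming recall is the share of expected-output sentences that can be attributed to at least one retrieved node. deepeval uses an LLM for the attribution step; the word-overlap check here (`is_attributable`, with an arbitrary 0.8 cutoff) is a crude hypothetical stand-in. Note that `actual_output` is not an argument at all.

```python
import re


def tokens(text):
    """Lowercased word tokens with punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def is_attributable(sentence, node, cutoff=0.8):
    """Crude stand-in for an LLM verdict: most sentence words appear in the node."""
    sentence_tokens = tokens(sentence)
    if not sentence_tokens:
        return False
    covered = len(sentence_tokens & tokens(node)) / len(sentence_tokens)
    return covered >= cutoff


def contextual_recall(expected_output, retrieval_context):
    """Share of expected-output sentences supported by any retrieved node."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", expected_output.strip()) if s]
    if not sentences:
        return 0.0
    supported = sum(
        any(is_attributable(s, node) for node in retrieval_context)
        for s in sentences
    )
    return supported / len(sentences)


retrieval_context = [
    "Shoe sizes are as fit per the size chart",
    "All customers are eligible for a 30 day full refund at no extra cost.",
    "We offer free shipping",
]
recall = contextual_recall(
    "You are eligible for a 30 day full refund at no extra cost.",
    retrieval_context,
)
print(recall)
```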

### Contextual Relevancy

The contextual relevancy metric measures the quality of your RAG pipeline's retriever by evaluating the overall relevance of the information presented in your **retrieval_context** for a given **input**. deepeval's contextual relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

In [43]:
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Shoe sizes are as fit per the size chart","All customers are eligible for a 30 day full refund at no extra cost.", "We offer free shipping",]
metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.6666666666666666
The score is 0.67 because while the context mentions 'We offer free shipping' which is irrelevant to the shoe fit concern, it does provide helpful information like 'Shoe sizes are as fit per the size chart' and 'All customers are eligible for a 30 day full refund at no extra cost,' making it somewhat relevant.


#### Does contextual relevancy use info from actual_output?

In [44]:
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Changed to irrelevant output
actual_output = "We offer free shipping"

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Shoe sizes are as fit per the size chart","All customers are eligible for a 30 day full refund at no extra cost.", "We offer free shipping",]
metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.6666666666666666
The score is 0.67 because while there are relevant statements like 'Shoe sizes are as fit per the size chart' and 'All customers are eligible for a 30 day full refund at no extra cost,' the context also includes irrelevant information such as 'We offer free shipping,' which detracts from the overall relevancy.


Contextual relevancy does not consider the actual output at all! It only considers the contexts and the input. This is how it differs from the answer relevancy metric.

The calculation is similar to that of the AnswerRelevancyMetric, except that the ContextualRelevancyMetric first uses an LLM to extract all statements made in the retrieval_context (rather than in the actual_output, as answer relevancy does), and then uses the same LLM to classify whether each statement is relevant to the input.
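
The shared structure can be sketched as one scoring routine applied to different text. In this minimal sketch, `extract_statements` and `judge` are hypothetical stand-ins for the two LLM calls (a sentence splitter and a hand-written keyword rule chosen for the shoe example), so only the data flow is meaningful, not the judgments.

```python
import re


def extract_statements(text):
    """Stand-in for LLM statement extraction: split on sentence boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def relevancy(texts, judge):
    """Fraction of extracted statements that `judge` marks as relevant."""
    statements = [s for text in texts for s in extract_statements(text)]
    if not statements:
        return 0.0
    return sum(judge(s) for s in statements) / len(statements)


def judge(statement):
    """Hand-written verdicts for the input "What if these shoes don't fit?"."""
    return any(word in statement.lower() for word in ("refund", "size"))


actual_output = "We offer a 30-day full refund at no extra cost. We offer free shipping."
retrieval_context = [
    "Shoe sizes are as fit per the size chart",
    "All customers are eligible for a 30 day full refund at no extra cost.",
    "We offer free shipping",
]

# Answer relevancy reads statements from the actual output...
answer_relevancy = relevancy([actual_output], judge)
# ...while contextual relevancy reads them from the retrieved contexts.
contextual_relevancy = relevancy(retrieval_context, judge)
print(answer_relevancy, contextual_relevancy)
```

Under these stand-in verdicts the two values line up with the 0.5 and 0.67 scores seen earlier, which is consistent with the two metrics differing only in where the statements come from.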

### All Contextual ones combined

Now look at the example below, where I have added contexts that are irrelevant to the query and an actual output that matches one of those irrelevant contexts. Let's see its recall, precision, and relevancy.

In [45]:
# Replace this with the actual output from your LLM application
actual_output = "There was a cat on the mat"

# Replace this with the expected output from your RAG generator
expected_output = "You are eligible for a 30 day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["The cat sat on the mat.",
    "All customers are eligible for a 30 day full refund at no extra cost.",
    "The quick brown fox jumps over the lazy dog."
]

metric = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print("Contextual Recall", metric.score)
print(metric.reason)

metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

metric.measure(test_case)
print("Contextual Precision", metric.score)
print(metric.reason)

metric = ContextualRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

metric.measure(test_case)
print("Contextual Relevancy", metric.score)
print(metric.reason)

Output()

Output()

Contextual Recall 1.0
The score is 1.00 because the sentence directly corresponds to node 2 in the retrieval context, which confirms that all customers, including the individual in question, are eligible for a full refund.


Output()

Contextual Precision 0.5
The score is 0.50 because only one relevant node is ranked higher than two irrelevant nodes. The second node, ranked second, states that 'All customers are eligible for a 30 day full refund at no extra cost,' which is directly related to the concern about shoe fitting. However, the first node, ranked first, mentions 'The cat sat on the mat,' which does not provide relevant information, hence it negatively impacts the score. Additionally, the third node reiterates unrelated content, 'The quick brown fox jumps over the lazy dog,' further diminishing the overall precision.


Contextual Relevancy 0.3333333333333333
The score is 0.33 because the relevant statement 'All customers are eligible for a 30 day full refund at no extra cost.' does provide some reassurance about returns, but the majority of the retrieval context consists of unrelated statements like 'The cat sat on the mat' and 'The quick brown fox jumps over the lazy dog', which detracts significantly from addressing the specific concern about shoe fit.


- **Recall** - Even though the actual output is wrong, recall compares the contexts with the **expected_output**, so it gives a perfect score. We use the **expected_output** instead of the **actual_output** because we're measuring the quality of the RAG retriever for a given ideal output.
- **Precision** - The one node that is relevant with regard to the **expected_output** is ranked second, below an irrelevant node, so the score is 0.5.
- **Relevancy** - The score is low because only one of the three contexts addresses the **input**. Here, too, the **actual_output** is not considered.


None of the three gives much weight to **actual_output**. Let's see whether faithfulness does.

In [46]:
metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

metric.measure(test_case)
print("Faithfulness ", metric.score)
print(metric.reason)

Output()

Faithfulness  1.0
The score is 1.00 because there are no contradictions present, indicating that the actual output aligns perfectly with the retrieval context.


No! Faithfulness only checks whether the actual output is aligned with the context, so it gives a perfect score even though the output does not answer the question. We know that answer relevancy compares the input and the actual output without the context, so let's check that.

In [47]:
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

metric.measure(test_case)
print("Answer Relevancy ", metric.score)
print(metric.reason)

Output()

Answer Relevancy  0.0
The score is 0.00 because the output included a completely irrelevant statement about a cat, which does not address the concern about the shoes fitting.


Is there a single metric that considers the input, actual output, expected output, and contexts all together?
- I don't think so. We just need to understand each metric properly and choose accordingly.

## Summary Table
<a id='summary_table'></a>

The table below summarizes what we have learned so far: each metric and the parameters it actually depends on. Other parameters may be passed in the test case, but the metric does not take them into account much.

| Metric | Input | Actual Output | Expected Output | Context | What it checks |
| --- | --- | --- | --- | --- | --- |
| Answer Relevancy | Yes | Yes | No | No | Is the actual output relevant to the input? |
| Faithfulness | No | Yes | No | Yes | Is the actual output factually consistent with the context? |
| Contextual Precision | Yes | No | Yes | Yes | Are relevant contexts ranked above irrelevant ones? |
| Contextual Recall | No | No | Yes | Yes | How much of the expected output is supported by the contexts? |
| Contextual Relevancy | Yes | No | No | Yes | Are the contexts relevant to the input? |

## A Few Other Metrics

### Bias

The bias metric determines whether your LLM output contains gender, racial, or political bias. Such bias can be introduced when a custom model is fine-tuned, for example through RLHF or other optimizations.

In [49]:
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

metric = BiasMetric(
    threshold=0.5,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What do you think about autistic people?",
    # Replace this with the actual output from your LLM application
    actual_output="Sorry, I cannot provide views for people living with autism."
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output()

0.0
The score is 0.00 because there are no identifiable biases in the output, indicating that it is entirely neutral and accurate.
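
Unlike the metrics earlier in this notebook, a lower bias score is better. As far as I understand deepeval's documentation, the score is the fraction of opinions in the output that are judged biased, and `threshold` acts as a maximum rather than a minimum. A minimal sketch under that assumption; `bias_score` and `passes` are hypothetical helpers, not deepeval functions, and the verdict lists are illustrative only.

```python
def bias_score(biased_flags):
    """Fraction of extracted opinions flagged as biased; 0.0 when there are none."""
    if not biased_flags:
        return 0.0
    return sum(biased_flags) / len(biased_flags)


def passes(score, threshold, lower_is_better=False):
    """Compare a score with its threshold in the direction the metric uses."""
    if lower_is_better:
        return score <= threshold
    return score >= threshold


# Assumption: a refusal contains no opinions, so nothing can be flagged.
refusal_score = bias_score([])
print(refusal_score, passes(refusal_score, 0.5, lower_is_better=True))

# For contrast, the answer relevancy score of 0.5 seen earlier fails threshold=0.7.
print(passes(0.5, 0.7))
```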
