# Evaluation Metrics with LastMile AI

In this example notebook, we showcase how to measure the quality of your LLM applications (particularly RAG-based systems) with the **LastMile Evals** library.




## Notebook Outline
* [Introduction](#intro)
* [Setup](#setup)
* [Part 1: RAG Evaluators](#rag_metrics)
  * [p-faithful](#p-faithful)
  * [Q/A on Retrieved Data](#qa-score)
  * [AI vs Human (Ground Truth)](#ai-human-score)
* [Part 2: Generic Evaluators](#generic-evaluators)
  * [BLEU Score](#bleu)
  * [ROUGE Score](#rouge)
  * [Exact Match Score](#exact-match)
  * [Summarization Score](#summarization)
  * [Relevance Score](#relevance)
  * [Toxicity Score](#toxicity)
  * [Custom Semantic Similarity Score](#custom-semantic-similarity)




<a name="intro"></a>
# Introduction

Evaluation is a crucial part of LLM development. To improve the performance of your LLM app, you must have a way to measure it. Evaluation metrics (aka evaluators) allow you to measure the quality of LLM-generated results. Evaluators can take in various inputs including the generated response, ground truth data, context, etc. and typically output a numeric score from 0 to 1.

The **LastMile Evals library** provides a suite of evaluators for simple, fast, and accurate LLM-based evaluations. This notebook showcases how you can easily use the suite of evaluators to measure the quality of your LLM application.

<a name="setup"></a>

# Setup

To begin, we need to install the lastmile-eval library.

In [None]:
!pip install lastmile-eval

Next, we import all the modules used in this tutorial.

In [8]:
from google.colab import userdata
from textwrap import dedent
import pandas as pd
from tabulate import tabulate

from lastmile_eval.rag import get_rag_eval_scores
import lastmile_eval.text as lm_eval_text

Before we start this tutorial, we need the following tokens/keys:

* LastMile AI API Token: Go to the [LastMile Settings page](https://lastmileai.dev/settings?page=tokens). You will need to first create a LastMile AI account.
* OpenAI API Key: Go to the [OpenAI API Keys page](https://platform.openai.com/account/api-keys) to create and access your OpenAI API Key.

We're using Google Colab's Secret Manager to set our tokens in this notebook.



In [61]:
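import os
# Keep the LastMile token in a variable and export the OpenAI key for the LLM-based evaluators.
api_token = userdata.get('LASTMILE_API_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')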


<a name="rag_metrics"></a>
# Part 1: RAG Evaluators
RAG evaluators are helpful for measuring RAG systems. These metrics are specifically used to evaluate the quality of a model's response by assessing its relevance to the retrieved context, the original user query, and ground truth if available.

First, let's set our model-generated responses, retrieved contexts, ground truth texts, and user queries.

In [55]:
model_response_texts = [
        "The quick brown fox jumps over the lazy dog.",
        "The fox is gold",
    ]
retrieved_contexts = [
        "The quick brown fox jumps over the lazy dog.",
        "The swift brown fox leaps over the lazy dog.",
    ]
ground_truth_texts = [
        "The quick brown fox jumps over the lazy dog.",
        "The fox is yellow",
    ]
user_queries = [
        "What does the animal do",
        "Describe the fox"
    ]
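The evaluators below appear to pair inputs by position: the first response goes with the first context and the first query, and so on. Under that assumption, a quick length check before calling them can catch misaligned data early. This is a minimal sketch; `check_aligned` is a hypothetical helper and not part of the lastmile-eval library.

```python
def check_aligned(**named_lists):
    """Raise ValueError unless every list has the same length; return that length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Input lists differ in length: {lengths}")
    # All lengths are equal here (or no lists were passed at all).
    return next(iter(lengths.values()), 0)


# Small stand-in inputs mirroring the lists defined above.
n_examples = check_aligned(
    model_response_texts=["The quick brown fox jumps over the lazy dog.", "The fox is gold"],
    user_queries=["What does the animal do", "Describe the fox"],
)
print(n_examples)
```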

<a name="p-faithful"></a>
### p-faithful score
The p-faithful score, computed by an LLM, evaluates the faithfulness of responses generated by a RAG system. It assesses how well the LLM's output aligns with the retrieved data, given a user query. By considering the triplet of user query, retrieved data, and LLM response, the p-faithful score checks that the generated output is grounded in the provided context and addresses the user's question. The score ranges from 0 to 1, with higher values indicating a response that is more faithful to the retrieved data.

Read more about p-faithful [here](https://blog.lastmileai.dev/harder-better-faster-stronger-llm-hallucination-detection-for-real-world-rag-part-i-949248f0ad94).


In [69]:
api_token = userdata.get('LASTMILE_API_TOKEN')

result_dict = get_rag_eval_scores(
    user_queries,
    retrieved_contexts,
    model_response_texts,
    api_token,
)

ReadTimeout: HTTPSConnectionPool(host='lastmileai.dev', port=443): Read timed out. (read timeout=60)

<a name="qa-score"></a>
### Q/A on Retrieved Data
This metric evaluates whether the system answered a question correctly based on the retrieved data. The score is 1 if the answer is correct, and 0 if the model answered the question incorrectly or only partially.



In [56]:
qa = lm_eval_text.calculate_qa_score(model_response_texts, retrieved_contexts, user_queries, model_name="gpt-3.5-turbo")




In [57]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], retrieved_contexts[i], user_queries[i], qa[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Retrieved Context', 'User Query', 'Q/A Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+-------------------------+-----------+
| Example |             Model Response Text              |              Retrieved Context               |       User Query        | Q/A Score |
+---------+----------------------------------------------+----------------------------------------------+-------------------------+-----------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. | What does the animal do |    1.0    |
|    2    |               The fox is gold                | The swift brown fox leaps over the lazy dog. |    Describe the fox     |    0.0    |
+---------+----------------------------------------------+----------------------------------------------+-------------------------+-----------+
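The table-building loop above is repeated for every metric in this notebook. As one possible simplification, the rows can be assembled with `zip` and `enumerate` instead of index arithmetic. This is only a sketch; `build_rows` is a hypothetical helper, and the sample values are copied from the table above.

```python
def build_rows(*columns):
    """Return rows [example_number, col1[i], col2[i], ...] with 1-based numbering."""
    return [[i, *values] for i, values in enumerate(zip(*columns), start=1)]


rows = build_rows(
    ["The quick brown fox jumps over the lazy dog.", "The fox is gold"],
    ["What does the animal do", "Describe the fox"],
    [1.0, 0.0],
)
for row in rows:
    print(row)
```

The resulting `rows` list has the same shape as `data` in the cells of this notebook, so it can be passed to `tabulate(rows, headers=[...], tablefmt='pretty')` unchanged.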


<a name="ai-human-score"></a>
### AI vs Human (Ground Truth)
The AI vs Human Score, calculated by an LLM, compares AI-generated answers to a golden dataset of human-authored question-answer pairs. It assigns a score of 1 (correct) if the AI answer matches the human answer or captures its main idea, and 0 (incorrect) otherwise. This metric ensures that the AI system provides accurate and comprehensive responses, mirroring the quality of human-generated answers.

In [58]:
ai_vs_human = lm_eval_text.calculate_human_vs_ai_score(model_response_texts, ground_truth_texts, user_queries, model_name="gpt-3.5-turbo")




In [59]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], ground_truth_texts[i], user_queries[i], ai_vs_human[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Ground Truth Text', 'User Query', 'AI vs Human Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+-------------------------+-------------------+
| Example |             Model Response Text              |              Ground Truth Text               |       User Query        | AI vs Human Score |
+---------+----------------------------------------------+----------------------------------------------+-------------------------+-------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. | What does the animal do |        1.0        |
|    2    |               The fox is gold                |              The fox is yellow               |    Describe the fox     |        1.0        |
+---------+----------------------------------------------+----------------------------------------------+-------------------------+-------------------+


<a name="generic-evaluators"></a>
# Part 2: Generic Evaluators
Generic evaluators are metrics used to assess the performance of NLP models (including LLMs) on tasks such as text generation, summarization, and translation, often by comparing the model's output to reference data.

First, let's set our model-generated responses and human-labeled reference data.

In [72]:
model_response_texts = [
        "The quick brown fox jumps over the lazy dog.",
        "The quick brown fox jumps over the lazy dog.",
    ]

reference_texts = [
        "The quick brown fox jumps over the lazy dog.",
        "The swift brown fox leaps over the lazy dog.",
    ]

<a name="bleu"></a>
### BLEU Score
BLEU (Bilingual Evaluation Understudy) measures the n-gram overlap between the model-generated response and the human-labeled reference text. It was originally designed for evaluating machine translation. The score ranges from 0 to 1, with higher values indicating closer agreement with the reference: an output identical to the reference scores 1, while an output that shares no n-grams with the reference scores 0.

In [7]:
bleu = lm_eval_text.calculate_bleu_score(model_response_texts, reference_texts)

In [21]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], bleu[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Human Labeled Reference Text', 'BLEU Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+--------------------+
| Example |             Model Response Text              |         Human Labeled Reference Text         |     BLEU Score     |
+---------+----------------------------------------------+----------------------------------------------+--------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |        1.0         |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. | 0.4671379777282001 |
+---------+----------------------------------------------+----------------------------------------------+--------------------+
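To see where a value like the second row's score can come from, here is a from-scratch sketch of sentence-level BLEU: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is an illustration of the standard formula, not the library's implementation. The tokenization (punctuation split off as separate tokens, case preserved) and the lack of smoothing are assumptions; with them, the second example comes out at roughly 0.467.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: words and punctuation marks are separate, case-sensitive tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    cand, ref = tokenize(candidate), tokenize(reference)
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # No smoothing in this sketch.
        log_precision_sum += math.log(matches / total) / max_n
    # Brevity penalty: only candidates shorter than the reference are penalized.
    brevity_penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)


identical = sentence_bleu("The quick brown fox jumps over the lazy dog.",
                          "The quick brown fox jumps over the lazy dog.")
paraphrase = sentence_bleu("The quick brown fox jumps over the lazy dog.",
                           "The swift brown fox leaps over the lazy dog.")
print(identical, round(paraphrase, 4))
```

Only two words differ between the sentences, but each substitution breaks every n-gram that spans it, which is why the score drops much further than the unigram overlap alone would suggest.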


<a name="rouge"></a>
### ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the similarity between a machine-generated summary and one or more human-created reference summaries. The score ranges from 0 to 1, with higher values indicating better summarization quality. A perfect summary would have a ROUGE score of 1, meaning it captures all the important information from the reference summaries, while a completely irrelevant summary would have a ROUGE score of 0. This notebook computes ROUGE-1, the variant based on unigram (single-word) overlap.

In [15]:
rouge1 = lm_eval_text.calculate_rouge1_score(model_response_texts, reference_texts)

In [20]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], rouge1[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Human Labeled Reference Text', 'ROUGE Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+--------------------+
| Example |             Model Response Text              |         Human Labeled Reference Text         |    ROUGE Score     |
+---------+----------------------------------------------+----------------------------------------------+--------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |        1.0         |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. | 0.7777777777777778 |
+---------+----------------------------------------------+----------------------------------------------+--------------------+
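The ROUGE-1 value in the second row can be reproduced by hand. The sketch below computes a unigram-overlap F1 score, assuming lowercased, alphanumeric-only tokens; that tokenization is an assumption chosen for illustration and the library's own preprocessing may differ. `rouge1_f1` is a hypothetical helper.

```python
import re
from collections import Counter


def rouge1_f1(candidate, reference):
    # Assumption: lowercase the text and keep only alphanumeric tokens.
    cand = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("The quick brown fox jumps over the lazy dog.",
                  "The swift brown fox leaps over the lazy dog.")
print(round(score, 4))
```

Seven of the nine words are shared, so precision and recall are both 7/9, and so is their harmonic mean (about 0.778).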


<a name="exact-match"></a>

### Exact Match Score
Exact match is a binary metric where a given model-generated text receives an exact match score of 1 if it is identical to its reference string, and 0 otherwise.


In [18]:
exact_match = lm_eval_text.calculate_exact_match_score(model_response_texts, reference_texts)


In [22]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], exact_match[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Reference Text', 'Exact Match Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+-------------------+
| Example |             Model Response Text              |                Reference Text                | Exact Match Score |
+---------+----------------------------------------------+----------------------------------------------+-------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |        1.0        |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. |        0.0        |
+---------+----------------------------------------------+----------------------------------------------+-------------------+
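In its strictest form, the definition above reduces to a plain string comparison. The following sketch assumes no normalization of case, whitespace, or punctuation; `exact_match_scores` is a hypothetical helper, not the library function.

```python
def exact_match_scores(predictions, references):
    """Return 1.0 where prediction and reference are identical strings, else 0.0."""
    return [float(pred == ref) for pred, ref in zip(predictions, references)]


scores = exact_match_scores(
    ["The quick brown fox jumps over the lazy dog.",
     "The quick brown fox jumps over the lazy dog."],
    ["The quick brown fox jumps over the lazy dog.",
     "The swift brown fox leaps over the lazy dog."],
)
print(scores)
```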


<a name="summarization"></a>
### Summarization Score
The Summarization Score, calculated by an LLM such as GPT-3.5, evaluates the quality of a generated summary. It measures how well the summary captures the essential information from the original document. The score uses the default [Summarization Prompt Template from Arize Phoenix](https://docs.arize.com/phoenix/evaluation/how-to-evals/running-pre-tested-evals/summarization-eval) and ranges from 0 to 1. Higher values indicate better summarization performance.



In [35]:
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
summarization = lm_eval_text.calculate_summarization_score(model_response_texts, reference_texts, model_name="gpt-3.5-turbo")




In [73]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], summarization[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Reference Text', 'Summarization Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+---------------------+
| Example |             Model Response Text              |                Reference Text                | Summarization Score |
+---------+----------------------------------------------+----------------------------------------------+---------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |         1.0         |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. |         0.0         |
+---------+----------------------------------------------+----------------------------------------------+---------------------+


<a name="relevance"></a>
### Relevance Score
The Relevance Score, computed by an LLM, measures how pertinent an AI-generated response is to a given reference. It assigns a float score between 0 and 1 to each input-reference pair, with 1 indicating high relevance and 0 indicating irrelevance. This metric ensures that the AI system generates responses that are on-topic and aligned with the desired context.



In [74]:
relevance = lm_eval_text.calculate_relevance_score(model_response_texts, reference_texts, model_name="gpt-3.5-turbo")




In [75]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], relevance[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Reference Text', 'Relevance Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+-----------------+
| Example |             Model Response Text              |                Reference Text                | Relevance Score |
+---------+----------------------------------------------+----------------------------------------------+-----------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |       1.0       |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. |       1.0       |
+---------+----------------------------------------------+----------------------------------------------+-----------------+


<a name="toxicity"></a>
### Toxicity Score
The Toxicity Score, determined by an LLM, assesses whether an AI-generated response contains toxic content, such as hateful statements, demeaning language, inappropriate words, or threats of violence. The LLM assigns a binary score of 1 (toxic) if the response meets the definition of toxicity, and 0 (non-toxic) if the response is free from any words, sentiments, or meanings that could be considered toxic. This score helps ensure that the AI system generates safe and respectful responses, avoiding the production of harmful or offensive content.

In [67]:
texts_to_evaluate = ["I am happy", "I am threatening violence"]

toxicity = lm_eval_text.calculate_toxicity_score(texts_to_evaluate, model_name="gpt-3.5-turbo")




In [68]:
# Print results in a nicely formatted table
data = []
for i in range(len(texts_to_evaluate)):
    data.append([i+1, texts_to_evaluate[i], toxicity[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Text to Evaluate', 'Toxicity Score'], tablefmt='pretty'))

+---------+---------------------------+----------------+
| Example |     Text to Evaluate      | Toxicity Score |
+---------+---------------------------+----------------+
|    1    |        I am happy         |      0.0       |
|    2    | I am threatening violence |      1.0       |
+---------+---------------------------+----------------+


<a name="custom-semantic-similarity"></a>
### Custom Semantic Similarity Score
The Custom Semantic Similarity Score, computed by an LLM such as GPT-3.5, measures the semantic similarity between a generated text and a reference text. It uses a customizable prompt template that lets you define the specific criteria for evaluating similarity; this notebook uses the default template from Arize Phoenix. The LLM assigns a score between 0 and 1 to each pair of texts, with higher values indicating greater semantic similarity.

In [40]:
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
custom_semantic_similarity = lm_eval_text.calculate_custom_llm_metric_example_semantic_similarity(model_response_texts, reference_texts, model_name="gpt-3.5-turbo")



In [41]:
# Print results in a nicely formatted table
data = []
for i in range(len(model_response_texts)):
    data.append([i+1, model_response_texts[i], reference_texts[i], custom_semantic_similarity[i]])  # Here i+1 will serve as the example number.

print(tabulate(data, headers=['Example', 'Model Response Text', 'Reference Text', 'Custom Semantic Similarity Score'], tablefmt='pretty'))

+---------+----------------------------------------------+----------------------------------------------+----------------------------------+
| Example |             Model Response Text              |                Reference Text                | Custom Semantic Similarity Score |
+---------+----------------------------------------------+----------------------------------------------+----------------------------------+
|    1    | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog. |               1.0                |
|    2    | The quick brown fox jumps over the lazy dog. | The swift brown fox leaps over the lazy dog. |               0.7                |
+---------+----------------------------------------------+----------------------------------------------+----------------------------------+
