## **Evaluate a simple LLM application**

In [16]:
!pip install ragas
!pip install sacrebleu
!pip install rapidfuzz
!pip install rouge_score
!pip install langchain-google-genai



### **Evaluating using a Non-LLM Metric**



Here is a simple example that uses the **BleuScore** metric to score a summary.

**BleuScore** evaluates the quality of a response by comparing it with a reference. It measures the similarity between the two using n-gram precision combined with a brevity penalty. BLEU was originally designed to evaluate machine translation systems, but it is also used in other natural language processing tasks. The score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

In [17]:
from ragas import SingleTurnSample
from ragas.metrics import BleuScore

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}

metric = BleuScore()
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)


0.13718598426177148
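
To make the two ingredients named above (n-gram precision and the brevity penalty) concrete, here is a minimal sketch of a simplified sentence-level BLEU. It is not the implementation used by `BleuScore`: it assumes whitespace tokenization, n-grams up to bigrams, and no smoothing, so it will not reproduce the score shown above. The names `ngram_counts` and `simple_bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(response, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    resp = response.lower().split()
    ref = reference.lower().split()
    if not resp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        resp_counts = ngram_counts(resp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each response n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in resp_counts.items())
        total = sum(resp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Penalize responses that are shorter than the reference.
    brevity_penalty = 1.0 if len(resp) >= len(ref) else math.exp(1 - len(ref) / len(resp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("the cat sat", "the cat sat on the mat"))
```

In the usage line, every n-gram of the short response appears in the reference, so both precisions are 1 and the score is determined entirely by the brevity penalty.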

The **NonLLMStringSimilarity** metric measures the similarity between the reference and the response using traditional string distance measures such as Levenshtein, Hamming, and Jaro. It is useful for comparing a response with the reference text without relying on large language models (LLMs). The metric returns a score between 0 and 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

In [18]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._string import NonLLMStringSimilarity, DistanceMeasure

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = NonLLMStringSimilarity(distance_measure=DistanceMeasure.LEVENSHTEIN)
await scorer.single_turn_ascore(sample)

0.8918918918918919
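
The score above can be checked by hand. The sketch below assumes the similarity is one minus the Levenshtein distance divided by the length of the longer string; this is an assumption about the normalization, not a copy of the library code. The helper names `levenshtein` and `normalized_similarity` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(response, reference):
    """Map the edit distance to a similarity in [0, 1]."""
    longest = max(len(response), len(reference))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(response, reference) / longest


response = "The Eiffel Tower is located in India."
reference = "The Eiffel Tower is located in Paris."
print(levenshtein(response, reference))
print(normalized_similarity(response, reference))
```

Both strings are 37 characters long and differ by 4 substitutions ("India" versus "Paris"), so the similarity is 1 - 4/37, which agrees with the value returned by the metric.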

**RougeScore** is a set of metrics used to evaluate the quality of natural language generations. It measures the n-gram overlap between the generated response and the reference text, reported as recall, precision, or F1 score. The score ranges from 0 to 1, where 1 indicates a perfect match between the response and the reference. This is a non-LLM-based metric.

In [19]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RougeScore

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="The Eiffel Tower is located in Paris."
)

scorer = RougeScore(mode="precision", rouge_type="rouge1")  # mode can also be "recall" or "fmeasure"
await scorer.single_turn_ascore(sample)

0.8571428571428571
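
The ROUGE-1 precision above can be reproduced with a short unigram-overlap sketch. It assumes lowercased alphanumeric tokenization and no stemming, which may differ from the tokenizer used by the underlying library. The function name `rouge1` is hypothetical.

```python
import re
from collections import Counter


def rouge1(response, reference):
    """Unigram-overlap precision, recall, and F1 between two strings."""
    resp = Counter(re.findall(r"[a-z0-9]+", response.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    # Multiset intersection: each shared token counts at most min(count) times.
    overlap = sum((resp & ref).values())
    precision = overlap / max(sum(resp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge1(
    "The Eiffel Tower is located in India.",
    "The Eiffel Tower is located in Paris.",
)
print(scores)
```

Six of the seven response tokens also occur in the reference, so the precision is 6/7, matching the value returned by the metric.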

The **ExactMatch** metric checks whether the response is exactly the same as the reference text. It is useful when the generated response must match the expected output word for word, **for example the arguments in a tool call**. The metric returns 1 if the response is an exact match with the reference, and 0 otherwise.

In [20]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import ExactMatch

sample = SingleTurnSample(
    response="India",
    reference="Paris"
)

scorer = ExactMatch()
await scorer.single_turn_ascore(sample)

0.0
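
As an illustration of the tool-call use case mentioned above, the sketch below applies a plain string equality check to serialized arguments. The function `exact_match` and the JSON argument strings are hypothetical and only mirror the behavior described in the text.

```python
def exact_match(response, reference):
    """Return 1.0 only when the two strings are identical, else 0.0."""
    return float(response == reference)


expected_args = '{"city": "Paris", "unit": "celsius"}'

# Identical argument strings match.
print(exact_match('{"city": "Paris", "unit": "celsius"}', expected_args))
# A difference in letter case is enough to fail the check.
print(exact_match('{"city": "paris", "unit": "celsius"}', expected_args))
```

Because the comparison is strict, differences in whitespace, key order, or letter case count as mismatches, so it can help to normalize both strings before scoring.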

The **StringPresence** metric checks if the response contains the reference text. It is useful in scenarios where you need to ensure that the generated response contains certain keywords or phrases. The metric returns 1 if the response contains the reference, and 0 otherwise.

In [21]:
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import StringPresence

sample = SingleTurnSample(
    response="The Eiffel Tower is located in India.",
    reference="Eiffel Tower"
)
scorer = StringPresence()
await scorer.single_turn_ascore(sample)

1.0
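
The check described above amounts to a substring test. The sketch below assumes a case-sensitive comparison, which is an assumption about the metric rather than something stated in this notebook. The function name `string_presence` is hypothetical.

```python
def string_presence(response, reference):
    """Return 1.0 if the reference occurs verbatim inside the response, else 0.0."""
    return float(reference in response)


response = "The Eiffel Tower is located in India."
print(string_presence(response, "Eiffel Tower"))
print(string_presence(response, "Paris"))
```

Note that the response in the example scores 1 even though it is factually wrong: the metric only confirms that the keyword is present, not that the statement is correct.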

### **Evaluating using an LLM-based Metric**

In [22]:
from google.colab import userdata
import os

os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')

In [23]:
config = {
    "model": "gemini-2.0-flash",
    "temperature": 0.4,
    "max_tokens": None,
    "top_p": 0.8,
}

In [24]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize with Google AI Studio
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
    model=config["model"],
    temperature=config["temperature"],
    max_tokens=config["max_tokens"],
    top_p=config["top_p"],
))

**AspectCritic** is an evaluation metric that judges responses against predefined aspects written in free-form natural language. Its output is binary (1 or 0), indicating whether the response aligns with the defined aspect or not.

In [25]:
from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

metric_summary_accuracy = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
test_data = SingleTurnSample(**test_data)

await metric_summary_accuracy.single_turn_ascore(test_data)

1

### **Evaluating on a Dataset**

In [26]:
data = [
    {
        "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
        "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    },
    {
        "user_input": "Investigadores de la Universidad de Stanford han publicado un estudio que detalla el desarrollo de un nuevo material sintético. Este material, un polímero ligero y extremadamente flexible, demuestra ser más resistente a la corrosión que el acero. Se espera que sus aplicaciones iniciales sean en el campo de la ingeniería aeroespacial, lo que podría reducir significativamente el peso de las naves espaciales y los satélites.",
        "response": "Investigadores han creado un nuevo polímero sintético que es más resistente que el acero, con posibles aplicaciones en la ingeniería aeroespacial para reducir el peso de las naves."
    },
    {
        "user_input": "La conferencia anual de tecnología, 'TechSphere 2024', se celebró con éxito en San Francisco, atrayendo a más de 15,000 participantes. El evento se centró en los avances de la inteligencia artificial y la computación cuántica, con presentaciones de líderes de la industria. Un tema recurrente fue el potencial de la IA para transformar la atención médica y la logística, generando un gran interés entre los asistentes.",
        "response": "La conferencia 'TechSphere 2024' atrajo a 15,000 participantes y se centró en avances en IA y computación cuántica, destacando su potencial para transformar la atención médica y la logística."
    },
    {
        "user_input": "La conferencia anual de tecnología, 'TechSphere 2024', se celebró con éxito en San Francisco, atrayendo a más de 15,000 participantes. El evento se centró en los avances de la inteligencia artificial y la computación cuántica, con presentaciones de líderes de la industria. Un tema recurrente fue el potencial de la IA para transformar la atención médica y la logística, generando un gran interés entre los asistentes.",
        "response": "Investigadores han creado un nuevo polímero sintético que es más resistente que el acero, con posibles aplicaciones en la ingeniería aeroespacial para reducir el peso de las naves."
    }
]

In [27]:
from datasets import load_dataset
from ragas import EvaluationDataset

# Alternatively, load a dataset from the Hugging Face Hub:
# eval_dataset = load_dataset("explodinggradients/earning_report_summary", split="train")
# Here, build the dataset from the in-memory list defined above.
eval_dataset = EvaluationDataset.from_list(data)

print("Features in dataset:", eval_dataset.features())
print("Total samples in dataset:", len(eval_dataset))

Features in dataset: ['user_input', 'response']
Total samples in dataset: 4


In [28]:
from ragas import evaluate
from ragas.metrics import AspectCritic, RubricsScore, SimpleCriteriaScore

metric_summary_accuracy = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
metric_verbosity = AspectCritic(name="verbosity", llm=evaluator_llm, definition="Verify if the response is concise and not unnecessarily wordy.")
metric_maliciousness = AspectCritic(name="maliciousness", llm=evaluator_llm, definition="Is the submission intended to harm, deceive, or exploit users?")

# Simple Criteria Scoring
simple_criteria_score = SimpleCriteriaScore(name="course_grained_score", definition="Score 0 to 5 by similarity", llm=evaluator_llm)

# Rubrics based criteria scoring
rubrics = {
    "score1_description": "The response is entirely incorrect and fails to address any aspect of the reference.",
    "score2_description": "The response contains partial accuracy but includes major errors or significant omissions that affect its relevance to the reference.",
    "score3_description": "The response is mostly accurate but lacks clarity, thoroughness, or minor details needed to fully address the reference.",
    "score4_description": "The response is accurate and clear, with only minor omissions or slight inaccuracies in addressing the reference.",
    "score5_description": "The response is completely accurate, clear, and thoroughly addresses the reference without any errors or omissions.",
}

rubrics_score = RubricsScore(rubrics=rubrics, llm=evaluator_llm)

results = evaluate(eval_dataset, metrics=[metric_summary_accuracy, metric_maliciousness, metric_verbosity, simple_criteria_score, rubrics_score])
results

{'summary_accuracy': 0.7500, 'maliciousness': 0.0000, 'verbosity': 1.0000, 'course_grained_score': 3.7500, 'domain_specific_rubrics': 3.5000}

In [29]:
results.to_pandas()

| | user_input | response | summary_accuracy | maliciousness | verbosity | course_grained_score | domain_specific_rubrics |
|---|---|---|---|---|---|---|---|
| 0 | summarise given text\nThe company reported an ... | The company experienced an 8% increase in Q3 2... | 1 | 0 | 1 | 5 | 4 |
| 1 | Investigadores de la Universidad de Stanford h... | Investigadores han creado un nuevo polímero si... | 1 | 0 | 1 | 5 | 4 |
| 2 | La conferencia anual de tecnología, 'TechSpher... | La conferencia 'TechSphere 2024' atrajo a 15,0... | 1 | 0 | 1 | 5 | 5 |
| 3 | La conferencia anual de tecnología, 'TechSpher... | Investigadores han creado un nuevo polímero si... | 0 | 0 | 1 | 0 | 1 |
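
The aggregate scores printed by `evaluate` are consistent with a simple per-metric mean of the per-sample scores listed above. The sketch below recomputes them from the four rows; the averaging rule is inferred from these numbers, not taken from the library source, and the variable names `rows` and `aggregate` are hypothetical.

```python
from statistics import mean

# Per-sample scores copied from the results table (rows 0 to 3).
rows = [
    {"summary_accuracy": 1, "maliciousness": 0, "verbosity": 1, "course_grained_score": 5, "domain_specific_rubrics": 4},
    {"summary_accuracy": 1, "maliciousness": 0, "verbosity": 1, "course_grained_score": 5, "domain_specific_rubrics": 4},
    {"summary_accuracy": 1, "maliciousness": 0, "verbosity": 1, "course_grained_score": 5, "domain_specific_rubrics": 5},
    {"summary_accuracy": 0, "maliciousness": 0, "verbosity": 1, "course_grained_score": 0, "domain_specific_rubrics": 1},
]

# Average each metric column across all samples.
aggregate = {name: mean(row[name] for row in rows) for name in rows[0]}
print(aggregate)
```

The last row, where the response summarizes a different text than the input, is the one that pulls `summary_accuracy` down to 0.75 and lowers both graded scores.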
