# BLEU/score (Bilingual Evaluation Understudy Score)
Purpose: Evaluates how closely the machine-generated text matches human reference text(s).
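How it works: BLEU computes the clipped (modified) n-gram precision between the generated text and the reference text(s), usually for n = 1 to 4, combines these precisions with a geometric mean, and applies a brevity penalty to outputs that are shorter than the reference.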
Scale: Typically ranges from 0 to 1, where 1 means a perfect match. Scores are often expressed as percentages (e.g., 0.75 = 75%).
Use case: Primarily used in machine translation.
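
To make the n-gram overlap idea concrete, here is a minimal sketch of a sentence-level BLEU. It is a simplified illustration, not the implementation used by the evaluation service: the function name `simple_bleu`, the lowercased whitespace tokenization, the single reference, the default `max_n=2`, and the lack of smoothing are all assumptions, so its values will generally differ from the `bleu/score` column reported later.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU (hypothetical helper, no smoothing)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: penalize candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


score = simple_bleu("The fire could be stopped", "The fire was stopped")
print(f"{score:.3f}")
```

For this pair, 3 of 5 unigrams and 1 of 4 bigrams match, and the candidate is longer than the reference, so the sketch returns the geometric mean of 0.6 and 0.25 (about 0.387).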
# COMET/score (Cross-lingual Optimized Metric for Evaluation of Translation)
Purpose: Measures the quality of machine translations by considering semantic meaning and fluency.
How it works: COMET uses a neural model trained on human judgment data to evaluate the generated text's semantic similarity and adequacy compared to the reference text.
Scale: The range depends on the COMET model version. Earlier models produce unbounded scores, while recent models are scaled to roughly 0 to 1. In both cases, higher scores indicate better quality.
Use case: Advanced translation quality assessment, focusing on both content and meaning.
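# MetricX/score
Purpose: Measures the quality of machine translations with a learned metric developed by Google.
How it works: MetricX uses a neural model fine-tuned on human quality judgments to predict an error score for a translation, given the reference and/or the source text.
Scale: Ranges from 0 to 25, where lower scores indicate better quality (0 means no errors were predicted). This is the opposite direction from BLEU and COMET.
Use case: Translation quality assessment that should track human error annotations.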

In [8]:
!pip install google-cloud-translate

Collecting google-cloud-translate
  Downloading google_cloud_translate-3.19.0-py2.py3-none-any.whl.metadata (5.5 kB)
Downloading google_cloud_translate-3.19.0-py2.py3-none-any.whl (192 kB)
Installing collected packages: google-cloud-translate
Successfully installed google-cloud-translate-3.19.0


In [5]:
!pip install google-cloud-language


In [1]:
# General
import pandas as pd

# Main
from vertexai import evaluation
from vertexai.evaluation.metrics import pointwise_metric

In [2]:
# @title Helper functions
from IPython.display import Markdown, display


def display_eval_result(eval_result, metrics=None, model_name=None, rows=0):
    """Display the evaluation results."""
    if model_name is not None:
        display(Markdown(f"## Eval Result for {model_name}"))

    summary_metrics, metrics_table = (
        eval_result.summary_metrics,
        eval_result.metrics_table,
    )

    metrics_df = pd.DataFrame.from_dict(summary_metrics, orient="index").T
    if metrics:
        metrics_df = metrics_df.filter(
            [
                metric
                for metric in metrics_df.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )
        metrics_table = metrics_table.filter(
            [
                metric
                for metric in metrics_table.columns
                if any(selected_metric in metric for selected_metric in metrics)
            ]
        )

    # Display the summary metrics
    display(Markdown("### Summary Metrics"))
    display(metrics_df)
    if rows > 0:
        # Display samples from the metrics table
        display(Markdown("### Row-based Metrics"))
        display(metrics_table.head(rows))

# Set up eval metrics for your data.

You can evaluate the quality of LLM-generated translations using:
- [BLEU](https://en.wikipedia.org/wiki/BLEU)
- [COMET](https://unbabel.github.io/COMET/html/index.html)
- [MetricX](https://github.com/google-research/metricx)

In [3]:
metrics = [
    "bleu",
    pointwise_metric.Comet(),
    pointwise_metric.MetricX(),
]

In [31]:
PROJECT_ID = "nine-quality-test"

# Import the Google Cloud Translation library
from google.cloud import translate_v3


def translate_text(
    text: str = "YOUR_TEXT_TO_TRANSLATE",
    language_code: str = "fr",
    source_language: str = "en-US",
) -> translate_v3.TranslateTextResponse:
    """Translate text with the Cloud Translation API.

    Args:
        text: The content to translate.
        language_code: The language code for the translation.
            E.g. "fr" for French, "es" for Spanish, etc.
            Available languages: https://cloud.google.com/translate/docs/languages#neural_machine_translation_model
        source_language: The language code of the input text.

    Returns:
        The full API response. The translated strings are in
        response.translations[i].translated_text.
    """

    # Initialize the Translation client
    client = translate_v3.TranslationServiceClient()
    parent = f"projects/{PROJECT_ID}/locations/global"
    # Translate text from the source language to the target language
    # Supported mime types: https://cloud.google.com/translate/docs/supported-formats
    response = client.translate_text(
        contents=[text],
        target_language_code=language_code,
        parent=parent,
        mime_type="text/plain",
        source_language_code=source_language,
    )

    # Display the translation for each input text provided
    for translation in response.translations:
        print(f"Translated text: {translation.translated_text}")
    # Example response:
    # Translated text: Bonjour comment vas-tu aujourd'hui?

    return response


In [32]:
# Text to translate
input_text = "Dem Feuer konnte Einhalt geboten werden"
source_language = "de"
# Target language code (e.g., 'es' for Spanish)
target_language = "en-US"

# Call the translation function; it returns the full API response
response = translate_text(input_text, target_language, source_language)

print(f"Original: {input_text}")
# Prints the whole response message, not just the translated string
print(f"Translated: {response}")

Translated text: The fire was stopped
Original: Dem Feuer konnte Einhalt geboten werden
Translated: translations {
  translated_text: "The fire was stopped"
}
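
The `Translated:` line above shows the whole response message because `translate_text` returns the API response rather than a string. A minimal sketch of pulling out just the translated strings follows. The helper name `extract_translations` is hypothetical, and `fake_response` is a stand-in object that only mimics the `translations` and `translated_text` attribute names used in the function above, so the example runs without calling the API.

```python
from types import SimpleNamespace


def extract_translations(response):
    """Return the translated strings from a translate_text-style response."""
    return [translation.translated_text for translation in response.translations]


# Stand-in for the API response (assumption: same attribute names as the real one).
fake_response = SimpleNamespace(
    translations=[SimpleNamespace(translated_text="The fire was stopped")]
)

print(extract_translations(fake_response))  # ['The fire was stopped']
```

With a real response, the same helper could be applied to the return value of `translate_text`.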



In [33]:
# Text to translate
input_text = "Schulen und Kindergärten wurden eröffnet."
source_language = "de"
# Target language code (e.g., 'es' for Spanish)
target_language = "en-US"

# Call the translation function; it returns the full API response
response = translate_text(input_text, target_language, source_language)

print(f"Original: {input_text}")
# Prints the whole response message, not just the translated string
print(f"Translated: {response}")

Translated text: Schools and kindergartens were opened.
Original: Schulen und Kindergärten wurden eröffnet.
Translated: translations {
  translated_text: "Schools and kindergartens were opened."
}



# Prepare your dataset

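This example evaluates stored (pre-generated) model responses. The evaluation dataset has one row per example and three columns: `source` (the original text), `response` (the translation to be scored), and `reference` (a trusted translation to compare against). Here the references are the Cloud Translation outputs from the cells above.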

In [34]:
sources = [
    "Dem Feuer konnte Einhalt geboten werden",
    "Schulen und Kindergärten wurden eröffnet.",
]

responses = [
    "The fire could be stopped",
    "Schools and kindergartens were open",
]

references = [
    "The fire was stopped",
    "Schools and kindergartens were opened.",
]

eval_dataset = pd.DataFrame(
    {
        "source": sources,
        "response": responses,
        "reference": references,
    }
)
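
Because the three lists are matched up by position, a length mismatch or an empty string would silently misalign or distort the scores. The following is a minimal sketch of a sanity check that could run before building the DataFrame. The helper name `check_parallel_columns` and its rules (equal lengths, non-blank strings) are assumptions for illustration, not requirements stated by the evaluation service.

```python
def check_parallel_columns(columns):
    """Return the row count, or raise ValueError if the columns are inconsistent."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for name, values in columns.items():
        for index, value in enumerate(values):
            # Every cell is expected to be a non-blank string.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"Empty or non-string value in {name!r} at row {index}")
    return next(iter(lengths.values()), 0)


n_rows = check_parallel_columns(
    {
        "source": [
            "Dem Feuer konnte Einhalt geboten werden",
            "Schulen und Kindergärten wurden eröffnet.",
        ],
        "response": [
            "The fire could be stopped",
            "Schools and kindergartens were open",
        ],
        "reference": [
            "The fire was stopped",
            "Schools and kindergartens were opened.",
        ],
    }
)
print(n_rows)  # 2
```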

# Run evaluation

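With the evaluation dataset and metrics defined, create an `EvalTask` and call `evaluate()`. Because the dataset already contains a `response` column, the stored responses are scored directly: the run below issues one request per row and metric (2 rows × 3 metrics = 6 requests).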

In [37]:
EXPERIMENT_NAME = "translationevaluationexperiment"
eval_task = evaluation.EvalTask(
    dataset=eval_dataset, metrics=metrics, experiment=EXPERIMENT_NAME
)
eval_result = eval_task.evaluate()

Associating projects/494586852359/locations/us-central1/metadataStores/default/contexts/translationevaluationexperiment-530625f6-90a1-4f8e-8bfe-3b2213a33b89 to Experiment: translationevaluationexperiment


Computing metrics with a total of 6 Vertex Gen AI Evaluation Service API requests.


100%|██████████| 6/6 [00:09<00:00,  1.63s/it]

All 6 metric requests are successfully computed.
Evaluation Took:9.861956587999885 seconds





You can view the summary metrics and row-based metrics for each response in the `EvalResult`.


In [38]:
display_eval_result(eval_result, rows=2)

### Summary Metrics

| | row_count | bleu/mean | bleu/std | comet/mean | comet/std | metricx/mean | metricx/std |
|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 0.391977 | 0.219969 | 0.947085 | 0.02699 | 3.363877 | 0.248002 |
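
The `/mean` and `/std` columns aggregate the per-row scores. As a quick check, the sketch below recomputes them from the row-level values reported for this run, using only the standard library. That `/std` is the sample standard deviation (n - 1 in the denominator) is an inference from these numbers, not something the output states.

```python
import statistics

# Per-row scores copied from the row-based metrics of this run.
row_scores = {
    "bleu": [0.236435, 0.547518],
    "comet": [0.928, 0.96617],
    "metricx": [3.188514, 3.539241],
}

summary = {}
for metric, scores in row_scores.items():
    summary[f"{metric}/mean"] = statistics.mean(scores)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    summary[f"{metric}/std"] = statistics.stdev(scores)

for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

The recomputed values agree with the summary row to within rounding.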


### Row-based Metrics

| | source | response | reference | bleu/score | comet/score | metricx/score |
|---|---|---|---|---|---|---|
| 0 | Dem Feuer konnte Einhalt geboten werden | The fire could be stopped | The fire was stopped | 0.236435 | 0.928 | 3.188514 |
| 1 | Schulen und Kindergärten wurden eröffnet. | Schools and kindergartens were open | Schools and kindergartens were opened. | 0.547518 | 0.96617 | 3.539241 |
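
When comparing rows, keep in mind that the three metrics do not point in the same direction: BLEU and COMET are higher-is-better, while MetricX is an error score where lower is better. The sketch below picks the best-scoring row per metric with that in mind. The `HIGHER_IS_BETTER` mapping and the `best_row` helper are hypothetical names introduced here, and the scores are copied from the table above.

```python
rows = [
    {"response": "The fire could be stopped",
     "bleu": 0.236435, "comet": 0.928, "metricx": 3.188514},
    {"response": "Schools and kindergartens were open",
     "bleu": 0.547518, "comet": 0.96617, "metricx": 3.539241},
]

# Assumption: score direction per metric (MetricX is an error score).
HIGHER_IS_BETTER = {"bleu": True, "comet": True, "metricx": False}


def best_row(rows, metric):
    """Return the row with the best score for a metric, respecting its direction."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(rows, key=lambda row: row[metric])


best_by_metric = {metric: best_row(rows, metric)["response"] for metric in HIGHER_IS_BETTER}
for metric, response in best_by_metric.items():
    print(f"{metric}: {response}")
```

On these two rows, BLEU and COMET favor the second response, while MetricX assigns the lower (better) error score to the first, which shows why the metrics should be read side by side rather than merged into a single ranking.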


# Clean up

Delete the `ExperimentRun` created by the evaluation.

In [39]:
from google.cloud import aiplatform

aiplatform.ExperimentRun(
    run_name=eval_result.metadata["experiment_run"],
    experiment=eval_result.metadata["experiment"],
).delete()

Experiment run 530625f6-90a1-4f8e-8bfe-3b2213a33b89 skipped backing tensorboard run deletion.
To delete backing tensorboard run, execute the following:
tensorboard_run_artifact = aiplatform.metadata.artifact.Artifact(artifact_name=f"translationevaluationexperiment-530625f6-90a1-4f8e-8bfe-3b2213a33b89-tb-run")
tensorboard_run_resource = aiplatform.TensorboardRun(tensorboard_run_artifact.metadata["resourceName"])
tensorboard_run_resource.delete()
tensorboard_run_artifact.delete()
Deleting Context : projects/494586852359/locations/us-central1/metadataStores/default/contexts/translationevaluationexperiment-530625f6-90a1-4f8e-8bfe-3b2213a33b89
Context deleted. . Resource name: projects/494586852359/locations/us-central1/metadataStores/default/contexts/translationevaluationexperiment-530625f6-90a1-4f8e-8bfe-3b2213a33b89
Deleting Context resource: projects/494586852359/locations/us-central1/metadataStores/default/contexts/translationevaluationexperiment-530625f6-90a1-4f8e-8bfe-3b2213a33b89
De