<a href="https://colab.research.google.com/github/olonok69/LLM_Notebooks/blob/main/mlflow/summarization/T5_large_Evaluation_multi_news_summarization_mlflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# LLM Evaluation Metrics

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html


There are two types of LLM evaluation metrics in MLflow:

- Heuristic-based metrics: These metrics calculate a score for each data record (a row, in Pandas/Spark DataFrame terms) using a fixed function, such as ROUGE (rougeL()), Flesch-Kincaid (flesch_kincaid_grade_level()), or Bilingual Evaluation Understudy (BLEU) (bleu()). They behave like traditional continuous-valued metrics. For the list of built-in heuristic metrics and how to define a custom metric with your own function, see the Heuristic-based Metrics section of the MLflow documentation linked above.

- LLM-as-a-Judge metrics: These metrics use LLMs to score the quality of model outputs. They address a limitation of heuristic-based metrics, which often miss nuances such as context and semantic accuracy. LLM-as-a-Judge metrics provide a more human-like evaluation for complex language tasks while being more scalable and cost-effective than human evaluation. MLflow provides various built-in LLM-as-a-Judge metrics and supports creating custom metrics with your own prompt, grading criteria, and reference examples. See the LLM-as-a-Judge Metrics section of the same documentation for more details.
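
As a rough illustration of what a heuristic metric does for each row, the sketch below computes a simplified ROUGE-1 F1 (unigram overlap) in plain Python. It is not MLflow's implementation: the whitespace tokenizer, the `rouge1_f1` helper, and the two sample rows are assumptions made for this example, and the built-in metric applies its own tokenization.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows: one score is produced per record
rows = [
    {"prediction": "the cat sat on the mat", "target": "the cat sat on the mat"},
    {"prediction": "a dog barked", "target": "the cat sat on the mat"},
]
scores = [rouge1_f1(r["prediction"], r["target"]) for r in rows]
print(scores)  # [1.0, 0.0]
```

The per-row scores are what later get summarized into dataset-level values such as a mean.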



### MLflow Metrics
The mlflow.metrics module helps you quantitatively and qualitatively measure your models.

https://mlflow.org/docs/latest/python_api/mlflow.metrics.html


Create a test set of inputs to pass to the model, plus a ground_truth column to compare against the model's generated output.

#### Task: text summarization (model_type="text-summarization")

With this model type, the default evaluator computes:
- ROUGE

- toxicity

- ari_grade_level

- flesch_kincaid_grade_level

#### Descriptions

- https://huggingface.co/spaces/evaluate-measurement/toxicity
- https://en.wikipedia.org/wiki/Automated_readability_index
- https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch%E2%80%93Kincaid_grade_level

### Toxicity
https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target

### Textstat
Textstat is an easy-to-use library for calculating statistics from text. It helps determine readability, complexity, and grade level.

https://pypi.org/project/textstat/
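
To make the two grade-level metrics concrete, here is a minimal sketch of the standard Flesch-Kincaid grade level and Automated Readability Index formulas in plain Python. The sentence splitter and the vowel-group syllable counter are crude assumptions made for this example; Textstat uses more careful rules, so its numbers will differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    n_sent, n_words = len(sentences), len(words)
    n_chars = sum(len(w) for w in words)
    n_syll = sum(count_syllables(w) for w in words)
    return {
        # Flesch-Kincaid grade level: sentence length and syllables per word
        "flesch_kincaid_grade": 0.39 * n_words / n_sent + 11.8 * n_syll / n_words - 15.59,
        # Automated Readability Index: characters per word and sentence length
        "ari": 4.71 * n_chars / n_words + 0.5 * n_words / n_sent - 21.43,
    }


simple = "The cat sat. The dog ran."
dense = "Institutional investors systematically underestimate macroeconomic volatility."
print(readability(simple))
print(readability(dense))
```

Both scores approximate a US school grade, so longer sentences and longer words push them up.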

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install mlflow   --quiet
! pip install  evaluate  textstat tiktoken -q
! pip install psutil pynvml
! pip install bert_score -q


In [None]:
# Transformers installation
! pip install -q --disable-pip-version-check py7zr sentencepiece loralib peft trl
! pip install -q    bitsandbytes
! pip install datasets evaluate rouge_score -q
! pip install transformers[torch] -q
! pip install accelerate -U -q
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git


In [None]:
! pip install onnxruntime optimum -q
! pip install optimum[onnxruntime] -q


In [None]:

import argparse
import os
from functools import partial

import bitsandbytes as bnb
import openai
import torch
import torch.nn as nn
import transformers
from datasets import load_dataset
from google.colab import userdata
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from torch import bfloat16, cuda
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    set_seed,
)

In [None]:

from google.colab import output
output.enable_custom_widget_manager()

from transformers.utils import logging


In [None]:
logging.set_verbosity_warning()

os.environ["TRANSFORMERS_VERBOSITY"] = "warning"

In [None]:


device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
device


'cuda:0'

# Load multi_news dataset
https://huggingface.co/datasets/multi_news

In [None]:
from datasets import load_dataset

dataset  = load_dataset("multi_news", trust_remote_code=True)


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['document', 'summary'],
        num_rows: 44972
    })
    validation: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
    test: Dataset({
        features: ['document', 'summary'],
        num_rows: 5622
    })
})

In [None]:

print(f"Train dataset size: {len(dataset['train'])}")
print(f"test dataset size: {len(dataset['test'])}")
print(f"Validation dataset size: {len(dataset['validation'])}")

Train dataset size: 44972
test dataset size: 5622
Validation dataset size: 5622


In [None]:
dataset['train'][100]['document']

'Katy Perry is all about breaking conventional beauty rules, from her love of everything technicolor and coated in glitter, to her no-brows, black lipstick Met Gala look. So, of course, the pop star — and face of CoverGirl — was the perfect person to help announce that the beauty brand has named its first-ever male CoverGirl, social media star James Charles. \n \n According to a press release from the brand, all CoverGirls “are role models and boundary-breakers, fearlessly expressing themselves, standing up for what they believe, and redefining what it means to be beautiful,” and who better to embody that ethos than Instagram sensation James Charles. After launching his beauty account a year ago, the teen has since quickly attracted hundreds of thousands of followers (427,000 to be exact) thanks to his unique, transformative approach to makeup artistry. \n \n RELATED PHOTOS: Katy Perry’s Most Outrageous Twitpics \n \n While Charles’ partnership with the brand kicks off today, we’ll hav

In [113]:
dataset['train'][100]['summary']

'– If a woman can be president, who\'s to say a man can\'t be a CoverGirl. On Tuesday, the makeup company\'s current spokesperson, Katy Perry, announced James Charles as the first ever "CoverBoy" on her Instagram page. Charles, a 17-year-old "aspiring makeup artist," started using makeup only a year ago but has already amassed more than 430,000 followers on Instagram, the Huffington Post reports. According to People, Charles will appear in TV, print, and digital ads for "So Lashy" mascara later this month and will work with CoverGirl through 2017. "I am so thankful and excited," Charles posted on Instagram. "And yes I know I have lipstick on my teeth. It was a looonnnnggg day." CoverGirl says it wants to work with "role models and boundary-breakers, fearlessly expressing themselves, standing up for what they believe, and redefining what it means to be beautiful," Teen Vogue reports. The company calls Charles an inspiration. Teen Vogue is definitely on board, stating: "We\'re firm belie

In [None]:

len(dataset['train'][100]['document'])

6217

In [None]:

len(dataset['train'][100]['summary'])

1268

In [None]:
import transformers
from mlflow.models import infer_signature
from mlflow.transformers import generate_signature_output
import locale
import mlflow
# Force locale.getpreferredencoding() to report UTF-8 for the rest of the session
def getpreferredencoding(do_setlocale=True):
    return "UTF-8"


locale.getpreferredencoding = getpreferredencoding

In [None]:
model_uri = "runs:/490668a70c06448d83903669efde0a8b/text_summarizer"

In [None]:
MLFLOW_TRACKING_URI="databricks"
# Specify the workspace hostname and token
DATABRICKS_HOST="https://adb-2467347032368999.19.azuredatabricks.net/"
DATABRICKS_TOKEN=userdata.get('DATABRCKS_TTOKEN')

In [None]:


if "MLFLOW_TRACKING_URI" not in os.environ:
    os.environ["MLFLOW_TRACKING_URI"] = MLFLOW_TRACKING_URI
if "DATABRICKS_HOST" not in os.environ:
    os.environ["DATABRICKS_HOST"] = DATABRICKS_HOST
if "DATABRICKS_TOKEN" not in os.environ:
    os.environ["DATABRICKS_TOKEN"] = DATABRICKS_TOKEN

In [None]:
os.environ["OPENAI_API_KEY"]=userdata.get('KEY_OPENAI')

In [None]:
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

In [None]:

mlflow.set_experiment("/Users/***REMOVED***/summarization_evaluation")


<Experiment: artifact_location='dbfs:/databricks/mlflow-tracking/3768969581786328', creation_time=1732444499247, experiment_id='3768969581786328', last_update_time=1732447879110, lifecycle_stage='active', name='/Users/***REMOVED***/summarization_evaluation', tags={'mlflow.experiment.sourceName': '/Users/***REMOVED***/summarization_evaluation',
 'mlflow.experimentType': 'MLFLOW_EXPERIMENT',
 'mlflow.ownerEmail': '***REMOVED***',
 'mlflow.ownerId': '1331640755799986'}>

In [None]:
mlflow.end_run()

In [None]:
# summarization_components = mlflow.transformers.load_model(
#     model_uri, return_type="components"
# )

2024/11/24 11:40:17 INFO mlflow.transformers: 'runs:/490668a70c06448d83903669efde0a8b/text_summarizer' resolved as 'dbfs:/databricks/mlflow-tracking/837187481682972/490668a70c06448d83903669efde0a8b/artifacts/text_summarizer'

In [None]:
# summarization_components.keys()

dict_keys(['task', 'framework', 'device', 'torch_dtype', 'model', 'tokenizer'])

In [None]:
import torch
from tqdm.auto import tqdm

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

In [None]:
# reconstructed_pipeline = transformers.pipeline(**summarization_components)

In [None]:
# test1= dataset['test'][100]['document']

In [None]:
# reconstructed_pipeline(test1)

In [None]:
df_test = dataset['validation'].to_pandas()

In [None]:
df_test.columns = ['inputs', 'summary']

In [None]:
df_test.head()

| | inputs | summary |
|---|---|---|
| 0 | Whether a sign of a good read; or a comment on... | – The Da Vinci Code has sold so many copies—th... |
| 1 | The deaths of three American soldiers in Afgha... | – A major snafu has hit benefit payments to st... |
| 2 | DUBAI Al Qaeda in Yemen has claimed responsibi... | – Yemen-based al-Qaeda in the Arabian Peninsul... |
| 3 | Cambridge Analytica, a data firm that worked f... | – Cambridge Analytica is calling it quits. The... |
| 4 | The N.S.A.’s Evolution: The National Security ... | – A lengthy report in the New York Times, base... |


In [None]:
import gc
import torch
import datetime
torch.cuda.empty_cache()
gc.collect()

52

# Evaluate MLflow default metrics

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html


In [None]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # column that holds the expected output
        model_type="text-summarization",  # selects the metrics relevant to this task
        evaluators="default",
    )


2024/11/24 11:40:44 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2024/11/24 11:41:52 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

🏃 View run Evaluation_2024-11-24_11:40:21 at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328/runs/508764dfc3f04514bf22c1810435ef89
🧪 View experiment at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328


# Custom Metrics

https://github.com/mlflow/mlflow/blob/master/examples/evaluation/evaluate_with_custom_metrics.py

https://huggingface.co/spaces/evaluate-metric/bertscore
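
Before the BERTScore functions, a dependency-free sketch of the same function shape may help: a custom metric that receives the evaluation table (with `prediction` and `target` columns, as used later in this notebook) and returns one aggregate number. The `compression_ratio` metric and the sample data are hypothetical, and a plain dict of lists stands in for the DataFrame that MLflow would pass.

```python
def compression_ratio(eval_df, _builtin_metrics):
    """Mean ratio of prediction length to target length, measured in words."""
    predictions = list(eval_df["prediction"])
    targets = list(eval_df["target"])
    ratios = [
        len(pred.split()) / max(1, len(target.split()))
        for pred, target in zip(predictions, targets)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical stand-in for the evaluation table
demo_eval_df = {
    "prediction": ["short summary here", "one two"],
    "target": ["a much longer reference summary here", "one two three four"],
}
print(compression_ratio(demo_eval_df, {}))  # 0.5
```

A function like this could be wrapped with make_metric(eval_fn=..., greater_is_better=...) in the same way the BERTScore functions are wrapped below.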

In [None]:
from mlflow.metrics import latency
from mlflow.metrics.genai import answer_correctness
from mlflow.models import infer_signature, make_metric

In [None]:
mlflow.enable_system_metrics_logging()


In [None]:
mlflow.metrics.__all__

['EvaluationMetric',
 'MetricValue',
 'make_metric',
 'flesch_kincaid_grade_level',
 'ari_grade_level',
 'accuracy',
 'rouge1',
 'rouge2',
 'rougeL',
 'rougeLsum',
 'toxicity',
 'mae',
 'mse',
 'rmse',
 'r2_score',
 'max_error',
 'mape',
 'binary_recall',
 'binary_precision',
 'binary_f1_score',
 'token_count',
 'latency',
 'genai',
 'bleu']

In [None]:
mlflow.metrics.genai.__all__

['EvaluationExample',
 'make_genai_metric',
 'make_genai_metric_from_prompt',
 'answer_similarity',
 'answer_correctness',
 'faithfulness',
 'answer_relevance',
 'relevance',
 'retrieve_custom_metrics']

In [None]:
from evaluate import load
import pandas as pd
from typing import List
bertscore = load("bertscore")
predictions = ["hello there"]
references = ["hello there"]
results = bertscore.compute(predictions=predictions, references=references, lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
results

{'precision': [0.998046875],
 'recall': [0.998046875],
 'f1': [0.998046875],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.46.2)'}

In [None]:
def _bertscore_mean(eval_df, key):
    # Score every prediction against its reference, then average over all rows.
    # (Indexing the result with [0] would report only the first row's score.)
    scores = bertscore.compute(
        predictions=eval_df["prediction"].tolist(),
        references=eval_df["target"].tolist(),
        lang="en",
    )[key]
    return sum(scores) / len(scores)


def calculate_bert_f1(eval_df, _builtin_metrics):
    return _bertscore_mean(eval_df, "f1")


def calculate_bert_recall(eval_df, _builtin_metrics):
    return _bertscore_mean(eval_df, "recall")


def calculate_bert_precision(eval_df, _builtin_metrics):
    return _bertscore_mean(eval_df, "precision")

In [None]:

torch.cuda.empty_cache()
gc.collect()

0

In [None]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}

custom metric BertScore and latency
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # column that holds the expected output
        model_type="text-summarization",  # selects the metrics relevant to this task
        evaluators="default",
        extra_metrics=[
            latency(),
            make_metric(eval_fn=calculate_bert_f1, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_recall, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_precision, greater_is_better=True),
        ],
    )


2024/11/24 12:06:20 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.



2024/11/24 12:06:40 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2024/11/24 12:07:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
2024/11/24 12:07:54 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...


🏃 View run Evaluation_2024-11-24_12:06:20 at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328/runs/9efd5cea293c4509bba854d3ef0235e5
🧪 View experiment at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328


2024/11/24 12:07:54 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!


# Evaluate with LLM-as-a-Judge metrics


In [None]:
from mlflow.metrics.genai import EvaluationExample, make_genai_metric

professionalism_metric = make_genai_metric(
    name="professionalism",
    definition=(
        "Professionalism refers to the use of a formal, respectful, and appropriate style of communication that is tailored to the context and audience. It often involves avoiding overly casual language, slang, or colloquialisms, and instead using clear, concise, and respectful language"
    ),
    grading_prompt=(
        "Professionalism: If the answer is written using a professional tone, below "
        "are the details for different scores: "
        "- Score 1: Language is extremely casual, informal, and may include slang or colloquialisms. Not suitable for professional contexts. "
        "- Score 2: Language is casual but generally respectful and avoids strong informality or slang. Acceptable in some informal professional settings. "
        "- Score 3: Language is balanced and avoids extreme informality or formality. Suitable for most professional contexts. "
        "- Score 4: Language is noticeably formal, respectful, and avoids casual elements. Appropriate for business or academic settings. "
        "- Score 5: Language is excessively formal, respectful, and avoids casual elements. Appropriate for the most formal settings such as textbooks. "
    ),
    examples=[
        EvaluationExample(
            input="What is MLflow?",
            output=(
                "MLflow is like your friendly neighborhood toolkit for managing your machine learning projects. It helps you track experiments, package your code and models, and collaborate with your team, making the whole ML workflow smoother. It's like your Swiss Army knife for machine learning!"
            ),
            score=2,
            justification=(
                "The response is written in a casual tone. It uses contractions, filler words such as 'like', and exclamation points, which make it sound less professional. "
            ),
        )
    ],
    version="v1",
    model="openai:/gpt-4",
    parameters={"temperature": 0.0},
    grading_context_columns=[],
    aggregations=["mean", "variance", "p90"],
    greater_is_better=True,
)

print(professionalism_metric)

EvaluationMetric(name=professionalism, greater_is_better=True, long_name=professionalism, version=v1, metric_details=
Task:
You must return the following fields in your response in two lines, one below the other:
score: Your numerical score for the model's professionalism based on the rubric
justification: Your reasoning about the model's professionalism score

You are an impartial judge. You will be given an input that was sent to a machine
learning model, and you will be given an output that the model produced. You
may also be given additional information that was used by the model to generate the output.

Your task is to determine a numerical score called professionalism based on the input and output.
A definition of professionalism and a grading rubric are provided below.
You must use the grading rubric to determine your score. You must also justify your score.

Examples could be included below for reference. Make sure to use them as references and to
understand them before complet
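
The judge prompt above asks for a two-line reply with a `score` and a `justification` field. As a sketch of how such a reply could be turned into structured data, the hypothetical helper below parses that format with regular expressions. It is not MLflow's parser, and the sample reply is invented for illustration.

```python
import re
from typing import Optional, Tuple


def parse_judge_response(text: str) -> Tuple[Optional[int], str]:
    """Extract (score, justification) from a 'score: ... / justification: ...' reply."""
    score_match = re.search(
        r"^\s*score:\s*(\d+)", text, flags=re.IGNORECASE | re.MULTILINE
    )
    justification_match = re.search(
        r"^\s*justification:\s*(.*)",
        text,
        flags=re.IGNORECASE | re.MULTILINE | re.DOTALL,
    )
    score = int(score_match.group(1)) if score_match else None
    justification = justification_match.group(1).strip() if justification_match else ""
    return score, justification


# Invented reply that follows the requested two-line format
reply = "score: 4\njustification: The summary uses formal, neutral language throughout."
score, justification = parse_judge_response(reply)
print(score, "-", justification)
```

Returning `None` for a missing score keeps malformed replies distinguishable from a genuine low score.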

In [None]:
torch.cuda.empty_cache()
gc.collect()

0

In [None]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model on Multi_News Dataset
model_uri: {model_uri}

custom metric BertScore , latency and professionalism
"""
with mlflow.start_run(run_name=f"Evaluation_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model_uri,
        df_test[:10],
        targets="summary",  # column that holds the expected output
        model_type="text-summarization",  # selects the metrics relevant to this task
        evaluators="default",
        extra_metrics=[
            latency(),
            make_metric(eval_fn=calculate_bert_f1, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_recall, greater_is_better=True),
            make_metric(eval_fn=calculate_bert_precision, greater_is_better=True),
            professionalism_metric,
        ],
    )
results.metrics

2024/11/24 12:16:11 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.



2024/11/24 12:16:34 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
2024/11/24 12:17:41 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...



2024/11/24 12:18:09 INFO mlflow.system_metrics.system_metrics_monitor: Stopping system metrics monitoring...


🏃 View run Evaluation_2024-11-24_12:16:11 at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328/runs/6f2f7e23a968445ab127ba90043ed334
🧪 View experiment at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328


2024/11/24 12:18:09 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!


{'latency/mean': 6.671608710289002,
 'latency/variance': 0.6731030340274226,
 'latency/p90': 8.005288028717041,
 'toxicity/v1/mean': 0.008075549112982116,
 'toxicity/v1/variance': 0.00011447759693802493,
 'toxicity/v1/p90': 0.020779486000537868,
 'toxicity/v1/ratio': 0.0,
 'flesch_kincaid_grade_level/v1/mean': 8.2,
 'flesch_kincaid_grade_level/v1/variance': 4.4,
 'flesch_kincaid_grade_level/v1/p90': 10.91,
 'ari_grade_level/v1/mean': 10.64,
 'ari_grade_level/v1/variance': 8.4824,
 'ari_grade_level/v1/p90': 14.2,
 'rouge1/v1/mean': 0.3028427195535619,
 'rouge1/v1/variance': 0.0031546244668660336,
 'rouge1/v1/p90': 0.35755637692595205,
 'rouge2/v1/mean': 0.09975140660777004,
 'rouge2/v1/variance': 0.0017836328314240729,
 'rouge2/v1/p90': 0.1453489083236499,
 'rougeL/v1/mean': 0.17154302370150604,
 'rougeL/v1/variance': 0.0009294687842309258,
 'rougeL/v1/p90': 0.20263299970929022,
 'rougeLsum/v1/mean': 0.17154302370150604,
 'rougeLsum/v1/variance': 0.0009294687842309258,
 'rougeLsum/v1/p9
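
Each metric in the dictionary above appears with `/mean`, `/variance`, and `/p90` suffixes, which summarize the per-row scores. The sketch below shows one way to compute such aggregates in plain Python, assuming population variance and a linearly interpolated 90th percentile (the behaviour of NumPy's defaults). Whether MLflow computes them exactly this way is an assumption, and the per-row scores are made up.

```python
import math
import statistics
from typing import Dict, List


def p90(values: List[float]) -> float:
    """90th percentile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    rank = 0.9 * (len(ordered) - 1)
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


def aggregate(name: str, per_row: List[float]) -> Dict[str, float]:
    return {
        f"{name}/mean": statistics.fmean(per_row),
        f"{name}/variance": statistics.pvariance(per_row),  # population variance
        f"{name}/p90": p90(per_row),
    }


# Hypothetical per-row scores, not taken from the run above
print(aggregate("rougeL/v1", [0.12, 0.15, 0.17, 0.18, 0.22]))
```

With only ten evaluated rows, as in this notebook, the p90 value is driven almost entirely by the one or two highest scores.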

In [None]:
torch.cuda.empty_cache()
gc.collect()

628

# Evaluate ONNX models in Custom PythonModel

```
from mlflow.pyfunc import PythonModel


class ONNXModelForSeq2SeqLM(PythonModel):
    def load_context(self, context):
        """
        Initialize the tokenizer and the ONNX Runtime model
        from the logged model snapshot directory.
        """
        from transformers import AutoTokenizer
        from optimum.onnxruntime import ORTModelForSeq2SeqLM

        self.model = ORTModelForSeq2SeqLM.from_pretrained(context.artifacts["snapshot"])
        self.tokenizer = AutoTokenizer.from_pretrained(context.artifacts["snapshot"])

    def predict(self, context, model_input, params=None):
        """
        Generate a summary for the first prompt in the input.
        """
        # Imported here: names imported inside load_context are not visible in predict
        from optimum.pipelines import pipeline

        prompt = model_input["prompt"][0]
        params = params or {}
        # temperature and max_tokens are read with defaults but not forwarded to the pipeline
        temperature = params.get("temperature", 0.7)
        max_tokens = params.get("max_tokens", 128)
        task = params.get("task", "summarization")

        pipe = pipeline(task, model=self.model, tokenizer=self.tokenizer)
        result = pipe(prompt)
        return {"candidates": [result[0]["summary_text"]]}
```

In [115]:
model_uri_onnx = "runs:/79c1dcaabd214f0cae2c55797175b16a/t5-summarization-onnx"

In [116]:
loaded_model = mlflow.pyfunc.load_model(model_uri_onnx)


In [117]:
from typing import List


def onnx_summ(inputs: pd.DataFrame) -> List[str]:
    predictions = []
    # Call the pyfunc model one row at a time and collect the generated summaries
    for _, row in inputs.iterrows():
        response = loaded_model.predict(
            pd.DataFrame({"prompt": [row["inputs"]]}),
            params={"temperature": 0.8, "max_tokens": 128},
        )
        predictions.append(response["candidates"][0])
    return predictions

In [118]:
df_val = dataset['validation'].to_pandas()

In [119]:
df_val.columns = ['inputs', 'summary']

In [120]:
df_val.head()

| | inputs | summary |
|---|---|---|
| 0 | Whether a sign of a good read; or a comment on... | – The Da Vinci Code has sold so many copies—th... |
| 1 | The deaths of three American soldiers in Afgha... | – A major snafu has hit benefit payments to st... |
| 2 | DUBAI Al Qaeda in Yemen has claimed responsibi... | – Yemen-based al-Qaeda in the Arabian Peninsul... |
| 3 | Cambridge Analytica, a data firm that worked f... | – Cambridge Analytica is calling it quits. The... |
| 4 | The N.S.A.’s Evolution: The National Security ... | – A lengthy report in the New York Times, base... |


In [122]:
torch.cuda.empty_cache()
gc.collect()

0

In [123]:
now = datetime.datetime.now()

description= f"""Evaluation Fine Tuned T5-Large Model converted to ONNX with optimum-cli
model_uri: {model_uri_onnx}
"""
with mlflow.start_run(run_name=f"evaluation_to_onnx_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model=onnx_summ,
        data=df_val[:10],
        targets="summary",  # column that holds the expected output
        model_type="text-summarization",  # selects the metrics relevant to this task
        evaluators="default",
    )


2024/11/24 18:30:30 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
2024/11/24 18:30:30 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

🏃 View run evaluation_to_onnx_2024-11-24_18:30:30 at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328/runs/2e650461ae23427686747e6482ad056e
🧪 View experiment at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328


2024/11/24 18:56:49 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!


In [124]:
import pprint
pprint.pprint(results.metrics)

{'ari_grade_level/v1/mean': 10.790000000000001,
 'ari_grade_level/v1/p90': 14.639999999999999,
 'ari_grade_level/v1/variance': 7.298900000000001,
 'flesch_kincaid_grade_level/v1/mean': 8.23,
 'flesch_kincaid_grade_level/v1/p90': 11.02,
 'flesch_kincaid_grade_level/v1/variance': 3.7861,
 'rouge1/v1/mean': 0.314253727742804,
 'rouge1/v1/p90': 0.3626764886433394,
 'rouge1/v1/variance': 0.0036922586975859156,
 'rouge2/v1/mean': 0.10554817520868061,
 'rouge2/v1/p90': 0.14376130198915008,
 'rouge2/v1/variance': 0.0015292036203371031,
 'rougeL/v1/mean': 0.17739143955610143,
 'rougeL/v1/p90': 0.21420280149561916,
 'rougeL/v1/variance': 0.0013818673226002169,
 'rougeLsum/v1/mean': 0.17739143955610143,
 'rougeLsum/v1/p90': 0.21420280149561916,
 'rougeLsum/v1/variance': 0.0013818673226002169,
 'toxicity/v1/mean': 0.011647447518771514,
 'toxicity/v1/p90': 0.041468564048409456,
 'toxicity/v1/ratio': 0.0,
 'toxicity/v1/variance': 0.000274218382722407}
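
The flat keys above follow a `metric/version/aggregation` pattern, which is easier to read when grouped per metric. A minimal sketch, assuming every key has exactly three slash-separated parts (true for the output shown here, but not guaranteed for custom metrics); `group_metrics` and `sample_metrics` are hypothetical names, and the values are copied from the output above:

```python
from collections import defaultdict


def group_metrics(flat_metrics):
    """Group keys like 'rouge1/v1/mean' into {'rouge1': {'mean': ...}}."""
    grouped = defaultdict(dict)
    for key, value in flat_metrics.items():
        # Assumption: keys look like '<metric>/<version>/<aggregation>'.
        name, _version, aggregation = key.split("/")
        grouped[name][aggregation] = value
    return dict(grouped)


# A few values copied from the metrics printed above.
sample_metrics = {
    "rouge1/v1/mean": 0.314253727742804,
    "rouge1/v1/p90": 0.3626764886433394,
    "rouge1/v1/variance": 0.0036922586975859156,
    "toxicity/v1/mean": 0.011647447518771514,
    "toxicity/v1/ratio": 0.0,
}

grouped = group_metrics(sample_metrics)
for name, aggregations in grouped.items():
    row = ", ".join(f"{agg}={val:.4f}" for agg, val in aggregations.items())
    print(f"{name}: {row}")
```

In the notebook itself, the same helper could be called on `results.metrics` directly.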


In [125]:
# Load the ONNX model logged in run c22d0ac9d7e54c659bc9c1206471dfc7
model_uri_onnx = "runs:/c22d0ac9d7e54c659bc9c1206471dfc7/t5-summarization-onnx"
loaded_model_q = mlflow.pyfunc.load_model(model_uri_onnx)


def onnx_summ(inputs: pd.DataFrame) -> List[str]:
    """Summarize each row of `inputs` with the loaded ONNX model."""
    predictions = []
    for _, row in inputs.iterrows():
        # The model is called with one single-row DataFrame per document
        response = loaded_model_q.predict(
            pd.DataFrame({"prompt": [row["inputs"]]}),
            params={"temperature": 0.8, "max_tokens": 128},
        )
        predictions.append(response["candidates"][0])

    return predictions


In [126]:
now = datetime.datetime.now()

description = f"""Evaluation of the tuned T5-Large model converted to ONNX with optimum-cli
and quantized to INT8
model_uri: {model_uri_onnx}
"""
with mlflow.start_run(run_name=f"evaluation_to_onnx_{now.strftime('%Y-%m-%d_%H:%M:%S')}", description=description) as run:

    results = mlflow.evaluate(
        model=onnx_summ,
        data=df_val[:10],  # evaluate on the first 10 validation rows
        targets="summary",  # specify which column corresponds to the expected output
        model_type="text-summarization",  # model type indicates which metrics are relevant for this task
        evaluators="default",
    )

2024/11/24 18:57:27 INFO mlflow.system_metrics.system_metrics_monitor: Started monitoring system metrics.
2024/11/24 18:57:27 INFO mlflow.models.evaluation.evaluators.default: Computing model predictions.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.

🏃 View run evaluation_to_onnx_2024-11-24_18:57:27 at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328/runs/90b9ecc64b7d46deb02e415101876b68
🧪 View experiment at: https://adb-2467347032368999.19.azuredatabricks.net/ml/experiments/3768969581786328


2024/11/24 19:31:05 INFO mlflow.system_metrics.system_metrics_monitor: Successfully terminated system metrics monitoring!


In [127]:
pprint.pprint(results.metrics)

{'ari_grade_level/v1/mean': 11.86,
 'ari_grade_level/v1/p90': 17.13,
 'ari_grade_level/v1/variance': 18.7444,
 'flesch_kincaid_grade_level/v1/mean': 7.1899999999999995,
 'flesch_kincaid_grade_level/v1/p90': 11.51,
 'flesch_kincaid_grade_level/v1/variance': 11.178899999999999,
 'rouge1/v1/mean': 0.2313331565719913,
 'rouge1/v1/p90': 0.3060869565217391,
 'rouge1/v1/variance': 0.007604724027665523,
 'rouge2/v1/mean': 0.07467023632907469,
 'rouge2/v1/p90': 0.13046080937831261,
 'rouge2/v1/variance': 0.0018025250403035456,
 'rougeL/v1/mean': 0.1377223841805843,
 'rougeL/v1/p90': 0.17156005522319373,
 'rougeL/v1/variance': 0.0022367412097056057,
 'rougeLsum/v1/mean': 0.1377223841805843,
 'rougeLsum/v1/p90': 0.17156005522319373,
 'rougeLsum/v1/variance': 0.0022367412097056057,
 'toxicity/v1/mean': 0.004668057763774413,
 'toxicity/v1/p90': 0.014208577387034892,
 'toxicity/v1/ratio': 0.0,
 'toxicity/v1/variance': 3.550725239558539e-05}
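
Because `results` is overwritten by each run, it can help to keep both metric dictionaries and compare them side by side. A minimal sketch, assuming both evaluations used the same rows (this section only shows `df_val[:10]` for the second run); `compare_metrics`, `first_run` and `onnx_run` are hypothetical names, and the values are the ROUGE means copied from the two outputs above:

```python
def compare_metrics(first, second, suffix="/mean"):
    """Return {key: (first, second, absolute diff, relative diff)} for shared keys."""
    rows = {}
    for key in sorted(first.keys() & second.keys()):
        if not key.endswith(suffix):
            continue
        a, b = first[key], second[key]
        # Relative change is undefined when the first value is zero.
        relative = (b - a) / a if a else float("nan")
        rows[key] = (a, b, b - a, relative)
    return rows


# ROUGE means copied from the first evaluation output in this notebook.
first_run = {
    "rouge1/v1/mean": 0.314253727742804,
    "rouge2/v1/mean": 0.10554817520868061,
    "rougeL/v1/mean": 0.17739143955610143,
}
# ROUGE means copied from the ONNX INT8 evaluation output above.
onnx_run = {
    "rouge1/v1/mean": 0.2313331565719913,
    "rouge2/v1/mean": 0.07467023632907469,
    "rougeL/v1/mean": 0.1377223841805843,
}

rows = compare_metrics(first_run, onnx_run)
for key, (a, b, diff, relative) in rows.items():
    print(f"{key:<18} {a:.4f} -> {b:.4f}  ({diff:+.4f}, {relative:+.1%})")
```

With only ten evaluated rows and a sampling temperature of 0.8, differences of this size should be read as indicative rather than conclusive.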
