## Evaluating the Finetuned Model

In this section, we demonstrate how to evaluate the previously finetuned model.

In [None]:
! pip install transformers comet-llm comet-ml sentencepiece --quiet

In [11]:
import warnings
warnings.filterwarnings("ignore")
import os
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
import transformers
import pandas as pd
import comet_llm

COMET_WORKSPACE = os.getenv("COMET_WORKSPACE")
COMET_API_KEY = os.getenv("COMET_API_KEY")

transformers.set_seed(35)

### Load the Finetuned Model

The first step is to load the finetuned model. You can load the model in different ways, but in this example we download the model and tokenizer from Comet (where we stored them in the last assignment) and then use Huggingface's Transformers library to load the pretrained model and tokenizer.

In [None]:
# Download model from registry:
 
from comet_ml import API

api = API(api_key=COMET_API_KEY)

# model name
model_name = "Emotion-T5-Base"

# get the Model object
model = api.get_model(workspace=COMET_WORKSPACE, model_name=model_name)

# Download a Registry Model:
model.download("1.0.0", "./deploy", expand=True)

COMET INFO: Remote Model 'josephlyu/Emotion-T5-Base:1.0.0' download has been started asynchronously.
COMET INFO: Still downloading 10 file(s), remaining 1.14 GB/1.14 GB
COMET INFO: Still downloading 2 file(s), remaining 1.13 GB/1.14 GB, Throughput 682.22 KB/s, ETA ~1740s
COMET INFO: Still downloading 2 file(s), remaining 1.12 GB/1.14 GB, Throughput 814.13 KB/s, ETA ~1443s
COMET INFO: Still downloading 2 file(s), remaining 1.11 GB/1.14 GB, Throughput 611.27 KB/s, ETA ~1907s
COMET INFO: Still downloading 2 file(s), remaining 1.10 GB/1.14 GB, Throughput 747.32 KB/s, ETA ~1545s
COMET INFO: Still downloading 2 file(s), remaining 1.08 GB/1.14 GB, Throughput 1.19 MB/s, ETA ~930s
COMET INFO: Still downloading 2 file(s), remaining 1.06 GB/1.14 GB, Throughput 1.33 MB/s, ETA ~822s
COMET INFO: Still downloading 2 file(s), remaining 1.04 GB/1.14 GB, Through

The `model.download()` method will download not only the model file, but all the related assets we logged, meaning we can point Huggingface's `from_pretrained()` method directly at our download folder and everything will just work.

In [2]:
# load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("./deploy/checkpoint-7")
# tokenizer = AutoTokenizer.from_pretrained("./deploy/checkpoint-7/")

# model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
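
Because `from_pretrained()` expects a folder that contains a `config.json`, it can help to check which subfolders of the download directory qualify before hard-coding a path such as `./deploy/checkpoint-7`. The following is a minimal sketch, not part of the original notebook: `find_checkpoint_dirs` is a hypothetical helper, and the demo runs on a throwaway directory that only mimics the layout of a downloaded registry model.

```python
import tempfile
from pathlib import Path


def find_checkpoint_dirs(root):
    """Return the sorted folders under `root` that contain a config.json."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.parent for p in root.rglob("config.json"))


# Demo on a temporary directory (an assumed layout, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint-7"
    checkpoint.mkdir()
    (checkpoint / "config.json").write_text("{}")
    (Path(tmp) / "logs").mkdir()  # a folder without a config.json is ignored
    found = [p.name for p in find_checkpoint_dirs(tmp)]

print(found)
```

In the notebook, calling `find_checkpoint_dirs("./deploy")` after the download has finished would list the candidate folders to pass to `from_pretrained()`.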

### Load the Data to Evaluate

The next step is to load the evaluation dataset. We are reloading the dataset from the previous notebook.

In [3]:
emotion_dataset_val_temp = pd.read_json(path_or_buf="https://raw.githubusercontent.com/comet-ml/comet-llmops/main/data/merged_training_sample_prepared_valid.jsonl", lines=True)
# use the second half of the validation file as the test set
emotion_dataset_test = emotion_dataset_val_temp.iloc[len(emotion_dataset_val_temp) // 2:]

In [4]:
emotion_dataset_test.head().prompt.tolist()

['i feel very very disturbed right now i dont know how to say this but guess i couldnt sleep tonight just to think about this about him\n\n###\n\n',
 'i feel make them the most dangerous and their level of annoyance is what gives them high priority\n\n###\n\n',
 'i can feel sympathetic joy for my boyfriend and colleagues the latter being like times harder than the former\n\n###\n\n',
 'i found these emails from scott dale and just reading them frusterated me so much that i feel the need to post them and show the world what a neurotic freak he was is\n\n###\n\n',
 'i won t lie and say there isn t a part of me that still feels insulted by it\n\n###\n\n']

### Evaluate Finetuned Emotion Classifier

Evaluate different models and prompting techniques, and log the results obtained when prompting the fine-tuned model. As a take-home exercise, feel free to log results with few-shot and one-shot prompting using gpt-3.5-turbo. This makes it possible to compare the finetuned model with other high-performing models.


In [None]:
# for comet logging
comet_llm.init(project="emotion-evaluation")

# prompt prefix
prefix = "Classify the provided piece of text into one of the following emotion labels.\n\nEmotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']\n\nText:"

# prepare prompts
prompts = [{"prompt": row.prompt.strip("\n\n###\n\n") + "\n\n" + "Emotion output:", "completion": row.completion.strip("\n").strip(" ")} for index, row in emotion_dataset_test.iterrows()]

# expected results to log
actual_completions = [prompt["completion"] for prompt in prompts]

# the results from the fine-tuned model
finetuned_completions = []

for prompt in prompts:

    # finetuned model outputs
    input_ids = tokenizer.encode(prefix + prompt["prompt"], return_tensors="pt")
    output = model.generate(input_ids, do_sample=True, max_new_tokens=1, temperature=0.1)
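    # skip_special_tokens=True already drops <pad>; a plain strip() only trims whitespace
    # (strip("<pad>") would remove the characters <, p, a, d, > and turn "anger" into "nger")
    output_text = tokenizer.decode(output[0], skip_special_tokens=True).strip()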
    finetuned_completions.append(output_text)

    # log the prompts
    comet_llm.log_prompt(
        prompt = prefix + prompt["prompt"],
        tags = ["flan-t5-base", "fine-tuned"],
        metadata = {
            "model_name": "flan-t5-base",
            "temperature": 0.1,
            "expected_output": prompt["completion"],
        },
        output = output_text
    )

    # exercise: log zero-shot and few-shot results with GPT-3.5-Turbo and GPT-4 and compare with your fine-tuned model
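
To make the prompt preparation easier to inspect, the sketch below assembles a single model input the same way the cell above does: prefix, then the text without its `\n\n###\n\n` separator, then the `Emotion output:` cue. The helper name `build_model_input` and the constant names are assumptions introduced for illustration; the example text is one of the prompts printed earlier.

```python
SEPARATOR = "\n\n###\n\n"
PREFIX = (
    "Classify the provided piece of text into one of the following emotion labels.\n\n"
    "Emotion labels: ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']\n\n"
    "Text:"
)


def build_model_input(raw_prompt, prefix=PREFIX):
    """Drop the trailing separator and wrap the text in the classification prompt."""
    text = raw_prompt
    if text.endswith(SEPARATOR):
        # remove the literal separator rather than a set of characters
        text = text[: -len(SEPARATOR)]
    return prefix + text + "\n\n" + "Emotion output:"


example = "i won t lie and say there isn t a part of me that still feels insulted by it\n\n###\n\n"
model_input = build_model_input(example)
print(model_input)
```

Removing the separator as a literal suffix avoids relying on `str.strip`, which interprets its argument as a set of characters.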


In [7]:
# View steps from above cell
print("------------------------")
print(f"prompt: {prompt}")
print("------------------------")
print(f"input_ids: {input_ids}")
print("------------------------")
print(f"output: {output}")
print("------------------------")
print(f"output_text: {output_text}")
print("------------------------")

------------------------
prompt: {'prompt': 'i put it down on paper like this i don t feel quite as frantic about the upcoming show\n\nEmotion output:', 'completion': 'fear'}
------------------------
input_ids: tensor([[ 4501,  4921,     8,   937,  1466,    13,  1499,   139,    80,    13,
             8,   826, 13868, 11241,     5,   262,  7259, 11241,    10,   784,
            31,     9,  9369,    31,     6,     3,    31,    89,  2741,    31,
             6,     3,    31,  1927,    63,    31,     6,     3,    31,  5850,
            15,    31,     6,     3,    31,     7,     9,    26,   655,    31,
             6,     3,    31,  3042,   102,  7854,    31,   908,  5027,    10,
            23,   474,    34,   323,    30,  1040,   114,    48,     3,    23,
           278,     3,    17,   473,   882,    38,     3,  6296,  1225,    81,
             8,     3,  4685,   504,   262,  7259,  3911,    10,     1]])
------------------------
output: tensor([[    0, 11213]])
------------------------


In [8]:
finetuned_completions = ["anger" if i=="nger" else i for i in finetuned_completions]
set(finetuned_completions)

{'anger', 'fear', 'joy', 'love', 'sadness', 'surprise'}
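
The `"nger"` remapping guards against a subtle `str.strip` behavior: its argument is treated as a set of characters, not as a literal substring. Calling `.strip("<pad>")` therefore removes any leading or trailing `<`, `p`, `a`, `d`, or `>`, which affects the label `anger`. The short sketch below demonstrates this on the six labels and shows one possible alternative; `remove_token` is a hypothetical helper, not something the notebook defines.

```python
labels = ["anger", "fear", "joy", "love", "sadness", "surprise"]

# strip("<pad>") removes leading/trailing characters from the set {<, p, a, d, >}
stripped = {label: label.strip("<pad>") for label in labels}
changed = {label: out for label, out in stripped.items() if label != out}
print(changed)


def remove_token(text, token="<pad>"):
    """Remove literal occurrences of `token`, then trim surrounding whitespace."""
    return text.replace(token, "").strip()


cleaned = remove_token("<pad> anger")
print(cleaned)
```

Only `anger` changes, because it is the only label that starts or ends with one of those characters.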

### Finetuned Model - Confusion Matrix

Prepare a confusion matrix to better understand the performance of the fine-tuned model on the multi-class classification task (each text is assigned exactly one of the six emotion labels).

In [18]:
# confusion matrix (logged to experiments as well)
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# map completion labels to integers
completion_map = {
    "anger": 0,
    "fear": 1,
    "joy": 2,
    "love": 3,
    "sadness": 4,
    "surprise": 5
}

# mapper back to string labels
completion_map_string = {
    0: "anger",
    1: "fear",
    2: "joy",
    3: "love",
    4: "sadness",
    5: "surprise"
}

actual_completions_int = [completion_map[completion] for completion in actual_completions]
# treat the unexpected output "nightmare" as "fear" (1); any other unknown label raises a KeyError
finetuned_completions_int = [1 if completion == "nightmare" else completion_map[completion] for completion in finetuned_completions]

cm = confusion_matrix(actual_completions_int, finetuned_completions_int)

# plot confusion matrix
plt.figure(figsize=(8, 8))
sns.heatmap(cm, annot=True, fmt=".0f", linewidths=0.5, square=True, cmap="Blues_r")

# label the axes
plt.ylabel("Actual label")
plt.xlabel("Predicted label")

# annotate the confusion matrix with completion labels
# (heatmap cells are centered at i + 0.5, so the ticks are placed there)
tick_marks = [i + 0.5 for i in range(len(completion_map_string))]
plt.xticks(tick_marks, list(completion_map_string.values()), rotation="vertical")
plt.yticks(tick_marks, list(completion_map_string.values()), rotation="horizontal")
os.makedirs("imgs", exist_ok=True)  # make sure the output folder exists
plt.savefig("imgs/confusion_matrix.png")

![confusion_matrix](imgs/confusion_matrix.png)
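
The confusion matrix can be complemented with summary numbers computed from the same two label lists. The sketch below is one simple way to get overall accuracy and per-class recall in plain Python. The function names are illustrative, and the `actual` and `predicted` lists are made-up toy values, not results from the notebook.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        return 0.0
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def per_class_recall(actual, predicted):
    """For each actual label, the share of its examples that were predicted correctly."""
    support = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / support[label] for label in support}


# Toy labels for illustration only (not the notebook's data).
actual = ["joy", "joy", "fear", "anger", "sadness", "fear"]
predicted = ["joy", "love", "fear", "anger", "sadness", "anger"]

overall = accuracy(actual, predicted)
recall = per_class_recall(actual, predicted)
print(round(overall, 3), recall)
```

In the notebook, the same functions could be called with `actual_completions` and `finetuned_completions` (or their integer versions).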

### Saving Confusion Matrix

The code below saves the confusion matrix to the selected Comet experiment. You can obtain the experiment key from Comet's experiment dashboard.

Make sure to change the experiment key to your own experiment key. Refer to the video lecture or [Comet's documentation](https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/ExistingExperiment/#existingexperimentlog_code) for how to locate the experiment key for your experiment.

In [17]:
from comet_ml import Experiment

experiment = Experiment(api_key=COMET_API_KEY, project_name="emotion-classification")
experiment.log_confusion_matrix(actual_completions_int, finetuned_completions_int, labels=list(completion_map_string.values()))
experiment.end()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
COMET INFO: Experiment is live on comet.com https://www.comet.com/josephlyu/emotion-classification/908ff1729dcd4c48bc65d16f4e23ff1a



{'web': 'https://www.comet.com/api/asset/download?assetId=948261be72b54314bf279f2f2dd61a19&experimentKey=908ff1729dcd4c48bc65d16f4e23ff1a',
 'api': 'https://www.comet.com/api/rest/v2/experiment/asset/get-asset?assetId=948261be72b54314bf279f2f2dd61a19&experimentKey=908ff1729dcd4c48bc65d16f4e23ff1a',
 'assetId': '948261be72b54314bf279f2f2dd61a19'}
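
The `huggingface/tokenizers` warning in the output above suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. One possible way to do that, assuming you are happy to disable tokenizer parallelism for this notebook, is to set the variable near the top of the notebook before the tokenizer is first used:

```python
import os

# Assumption: parallel tokenization is not needed here, so turn it off explicitly.
# Run this before the tokenizer is first used so the fork warning is not triggered.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the value to `"true"` instead keeps parallelism enabled; the warning only asks that the choice be made explicit.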



#### View in Comet dashboard

![comet_cm](imgs/comet_cm.png)