<a href="https://colab.research.google.com/github/lapaniku/generative-sentiment/blob/main/evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import urllib.request
urllib.request.urlretrieve("https://gist.githubusercontent.com/lapaniku/7632ced04a0ccc8465a67be13ce40c19/raw/b157628171c34edc4e2ec25a913b1babeceb5051/regen_valid.json", "regen_valid.json")

('regen_valid.json', <http.client.HTTPMessage at 0x7f5b27ddca00>)

In [2]:
import json

with open("regen_valid.json", "r") as f:
  data = json.load(f)

  print(f"Validation sample size: {len(data)}")

Validation sample size: 693
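Before evaluating, it can help to check how the ground-truth labels are distributed, since class imbalance is what makes macro and micro averages diverge. The sketch below assumes each record carries its label in the `output` field (the field read as ground truth later in this notebook); the helper name `label_counts` and the toy records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def label_counts(records: Iterable[Mapping[str, str]], key: str = "output") -> Dict[str, int]:
    """Count how many records carry each ground-truth label."""
    return dict(Counter(record[key] for record in records))


# Toy records for illustration only; in the notebook, pass `data` instead.
sample = [
    {"ticket": "The app works as expected.", "output": "Neutral"},
    {"ticket": "Thanks, that fixed it.", "output": "Mild Positive"},
    {"ticket": "Please update my address.", "output": "Neutral"},
]
counts = label_counts(sample)
print(counts)
```

Calling `label_counts(data)` on the loaded validation set would give the per-class support behind the scores reported below.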


In [3]:
def translate_score_to_sentiment(label: int) -> str:
    return ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"][label]

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("alapanik/blair-roberta-base-generative-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("alapanik/blair-roberta-base-generative-sentiment")

import time

sentences = [item["ticket"] for item in data]

# Time tokenization plus a single forward pass over the whole validation set.
start = time.time()
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
duration = time.time() - start
print(f"Inference duration per sentence in a batch mode: {duration/len(sentences)}")

# Convert predicted class indices to plain ints before mapping them to label names.
predicted = torch.argmax(outputs.logits, dim=1).tolist()
gt = [item["output"] for item in data]
predicted_labels = [translate_score_to_sentiment(label) for label in predicted]

Inference duration per sentence in a batch mode: 0.18217612000942918
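The cell above pushes all tickets through the model in one padded batch, so every ticket is padded to the length of the longest one. Splitting the input into mini-batches may reduce padding and peak memory. This is a minimal sketch rather than the notebook's method: the helper `iter_batches` is hypothetical and the batch size of 32 is an arbitrary assumption.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical usage with the tokenizer and model loaded in cell 4:
# logits = []
# for batch in iter_batches(sentences, 32):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
#     with torch.no_grad():
#         logits.append(model(**inputs).logits)
# predicted = torch.cat(logits).argmax(dim=1).tolist()

batches = list(iter_batches(["a", "b", "c", "d", "e"], 2))
print(batches)
```

Whether this is actually faster depends on the hardware and on how ticket lengths vary, so it is worth timing both variants.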


In [5]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
precision_macro = precision_score(gt, predicted_labels, average='macro', zero_division=0)
recall_macro = recall_score(gt, predicted_labels, average='macro', zero_division=0)
precision_micro = precision_score(gt, predicted_labels, average='micro', zero_division=0)
recall_micro = recall_score(gt, predicted_labels, average='micro', zero_division=0)
print(f"Precision macro: {precision_macro:.2f}")
print(f"Precision micro: {precision_micro:.2f}")
print(f"Recall macro: {recall_macro:.2f}")
print(f"Recall micro: {recall_micro:.2f}")


Precision macro: 0.79
Precision micro: 0.78
Recall macro: 0.75
Recall micro: 0.78
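To see where these averages come from, they can be recomputed from raw counts. The sketch below uses plain Python on the confusion matrix printed in cell 6 (rows are ground truth, columns are predictions); the helper name `macro_micro` is illustrative. In single-label multiclass classification, micro precision and micro recall both equal accuracy, which is why the two micro scores above are identical.

```python
from typing import List, Tuple


def macro_micro(cm: List[List[int]]) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, micro score) for a square confusion matrix."""
    n = len(cm)
    diag = [cm[i][i] for i in range(n)]
    row_sums = [sum(row) for row in cm]                              # ground-truth count per class
    col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]   # predicted count per class
    # Classes with no predictions or no examples contribute 0, like zero_division=0.
    precisions = [d / s if s else 0.0 for d, s in zip(diag, col_sums)]
    recalls = [d / s if s else 0.0 for d, s in zip(diag, row_sums)]
    micro = sum(diag) / sum(row_sums)  # equals accuracy for single-label data
    return sum(precisions) / n, sum(recalls) / n, micro


cm = [
    [162, 1, 15, 24, 0],
    [2, 55, 10, 0, 16],
    [49, 5, 58, 0, 0],
    [24, 0, 0, 76, 0],
    [0, 3, 0, 0, 193],
]
macro_p, macro_r, micro = macro_micro(cm)
print(f"{macro_p:.2f} {macro_r:.2f} {micro:.2f}")  # prints "0.79 0.75 0.78"
```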


In [6]:
confusion_matrix(gt, predicted_labels).tolist()

[[162, 1, 15, 24, 0],
 [2, 55, 10, 0, 16],
 [49, 5, 58, 0, 0],
 [24, 0, 0, 76, 0],
 [0, 3, 0, 0, 193]]

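The confusion matrix needs to be read with care: `confusion_matrix` sorts string labels alphabetically when no `labels` argument is given, so the rows (ground truth) and columns (predictions) above are ordered "Mild Negative", "Mild Positive", "Neutral", "Strong Negative", "Strong Positive". Read that way, the main problem is separating "Mild Negative" from "Neutral": 49 Neutral tickets are predicted as Mild Negative and 15 Mild Negative tickets as Neutral. This may be an artifact of the generated data and requires additional attention. "Mild Negative" and "Strong Negative" are also confused, with 24 errors in each direction. The next step for improvement is to figure out why Mild Positive is sometimes labeled Strong Positive (16 cases). I don't think this requires further model fine-tuning, but rather careful work on the data.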
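The matrix is easier to read with rows and columns in sentiment-score order. A minimal sketch, assuming the printed matrix follows the sorted (alphabetical) label order that `confusion_matrix` uses when `labels` is omitted; the variable names are illustrative.

```python
# Label order of the printed matrix (sorted) and the desired score order.
alphabetical = ["Mild Negative", "Mild Positive", "Neutral", "Strong Negative", "Strong Positive"]
by_score = ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"]

# Matrix as printed in cell 6: rows are ground truth, columns are predictions.
cm = [
    [162, 1, 15, 24, 0],
    [2, 55, 10, 0, 16],
    [49, 5, 58, 0, 0],
    [24, 0, 0, 76, 0],
    [0, 3, 0, 0, 193],
]

# Apply the same permutation to rows and columns.
idx = [alphabetical.index(label) for label in by_score]
reordered = [[cm[r][c] for c in idx] for r in idx]

for label, row in zip(by_score, reordered):
    print(f"{label:<16}{row}")
```

Passing `labels=by_score` to `confusion_matrix(gt, predicted_labels, labels=by_score)` should produce the same ordering directly. In score order, nearly all errors fall between adjacent sentiment levels.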