<a href="https://colab.research.google.com/github/lapaniku/generative-sentiment/blob/main/evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [7]:
import urllib.request
urllib.request.urlretrieve("https://gist.githubusercontent.com/lapaniku/697a9f205d2b06fd10213fff28bb7fb2/raw/216da915b5bb3129cf9e708bf0b220287030f6ef/regen_valid.json", "regen_valid.json")

('regen_valid.json', <http.client.HTTPMessage at 0x7d5f01ef9c00>)

In [8]:
import json

with open("regen_valid.json", "r") as f:
  data = json.load(f)

print(f"Validation sample size: {len(data)}")

Validation sample size: 464


In [9]:
def translate_score_to_sentiment(label: int) -> str:
    return ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"][label]

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("alapanik/blair-roberta-base-generative-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("alapanik/blair-roberta-base-generative-sentiment")

sentences = [item["ticket"] for item in data]


import time
# The timed span covers tokenization plus a single forward pass over all sentences.
start = time.time()
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
  outputs = model(**inputs)
duration = time.time() - start
# Average wall-clock seconds per sentence.
print(f"Inference duration per sentence in a batch mode: {duration/len(sentences)}")

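# The index of the highest logit is the predicted class id (0-4).
predicted = torch.argmax(outputs.logits, dim=1)
gt = [item["output"] for item in data]
# Convert the tensor to plain ints to match the helper's `label: int` signature.
predicted_labels = [translate_score_to_sentiment(label) for label in predicted.tolist()]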

Inference duration per sentence in a batch mode: 0.3007255458626254
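
The figure above comes from one forward pass over the whole validation set, padded to the longest ticket, and the timed span also includes tokenization. Below is a minimal sketch of timing the same work in fixed-size mini-batches, which bounds memory use. The names `iter_batches`, `time_per_item`, `predict_batch`, and `batch_size` are hypothetical and not part of the notebook, and `fake_predict` is only a stand-in for the real tokenizer and model call.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def time_per_item(
    predict_batch: Callable[[Sequence[str]], List[int]],
    items: Sequence[str],
    batch_size: int = 32,
) -> Tuple[List[int], float]:
    """Run `predict_batch` over mini-batches; return predictions and mean seconds per item."""
    predictions: List[int] = []
    start = time.perf_counter()
    for batch in iter_batches(items, batch_size):
        predictions.extend(predict_batch(batch))
    elapsed = time.perf_counter() - start
    return predictions, elapsed / max(len(items), 1)


# Stand-in for the real tokenizer + model call (assumption: one label id per text).
def fake_predict(batch: Sequence[str]) -> List[int]:
    return [len(text) % 5 for text in batch]


preds, seconds_per_item = time_per_item(fake_predict, ["a", "bb", "ccc"], batch_size=2)
print(preds, f"{seconds_per_item:.6f}")
```

In the notebook, `predict_batch` would wrap the `tokenizer(...)` call and `model(**inputs)` under `torch.no_grad()` and return `torch.argmax(outputs.logits, dim=1).tolist()`.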


In [11]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
precision_macro = precision_score(gt, predicted_labels, average='macro', zero_division=0)
recall_macro = recall_score(gt, predicted_labels, average='macro', zero_division=0)
precision_micro = precision_score(gt, predicted_labels, average='micro', zero_division=0)
recall_micro = recall_score(gt, predicted_labels, average='micro', zero_division=0)
print(f"Precision macro: {precision_macro:.2f}")
print(f"Precision micro: {precision_micro:.2f}")
print(f"Recall macro: {recall_macro:.2f}")
print(f"Recall micro: {recall_micro:.2f}")


Precision macro: 0.81
Precision micro: 0.82
Recall macro: 0.79
Recall micro: 0.82
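
Micro precision and micro recall are both 0.82 because, in single-label multiclass classification, every error counts once as a false positive and once as a false negative, so both reduce to accuracy. The macro figures average per-class scores instead, which is why they differ. A minimal pure-Python sketch of that calculation follows; `per_class_scores` is a hypothetical helper and the labels are toy data, not taken from `regen_valid.json`.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, f1)}, using 0.0 where a denominator is zero."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy labels for illustration only.
y_true = ["Neutral", "Neutral", "Mild Negative", "Strong Positive"]
y_pred = ["Neutral", "Mild Negative", "Mild Negative", "Strong Positive"]

scores = per_class_scores(y_true, y_pred)
macro_precision = sum(p for p, _, _ in scores.values()) / len(scores)
macro_recall = sum(r for _, r, _ in scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro precision={macro_precision:.2f} macro recall={macro_recall:.2f} accuracy={accuracy:.2f}")
```

Per-class scores like these show which sentiment classes pull the macro averages below the micro ones.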


In [12]:
confusion_matrix(gt, predicted_labels).tolist()

[[115, 1, 13, 11, 0],
 [3, 33, 6, 0, 8],
 [18, 5, 52, 1, 0],
 [14, 0, 0, 53, 0],
 [0, 3, 0, 0, 128]]

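The confusion matrix has to be read with the label order in mind: `confusion_matrix` sorts string labels alphabetically when no `labels` argument is passed, so the rows and columns are Mild Negative, Mild Positive, Neutral, Strong Negative, Strong Positive. Read that way, the main problem is separating "Mild Negative" from "Neutral" (13 and 18 errors) and from "Strong Negative" (11 and 14 errors), which might be caused by the generated data and requires additional attention. The next step for improvement is to figure out why Mild Positive is sometimes predicted as Strong Positive (8 cases). I don't think this requires more model fine-tuning, but rather careful work with the data.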
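
To check which label pairs are actually confused, the cells can be ranked by count. This is a minimal sketch that assumes the rows and columns follow sorted (alphabetical) label order, which is what `confusion_matrix` uses when no `labels` argument is given. The helpers `top_confusions` and `recall_per_class` are hypothetical names; the numbers are copied from the matrix printed above.

```python
from typing import Dict, List, Tuple

# Assumption: rows/columns follow sorted (alphabetical) label order.
labels = ["Mild Negative", "Mild Positive", "Neutral", "Strong Negative", "Strong Positive"]

# Values copied from the printed confusion matrix (rows = true, columns = predicted).
matrix = [
    [115, 1, 13, 11, 0],
    [3, 33, 6, 0, 8],
    [18, 5, 52, 1, 0],
    [14, 0, 0, 53, 0],
    [0, 3, 0, 0, 128],
]


def top_confusions(matrix: List[List[int]], labels: List[str], k: int = 5) -> List[Tuple[str, str, int]]:
    """Return the k largest off-diagonal cells as (true label, predicted label, count)."""
    cells = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(cells, key=lambda cell: cell[2], reverse=True)[:k]


def recall_per_class(matrix: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Recall per class: diagonal cell divided by its row total."""
    return {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(labels))}


for true_label, predicted_label, count in top_confusions(matrix, labels):
    print(f"{true_label} -> {predicted_label}: {count}")

for label, recall in recall_per_class(matrix, labels).items():
    print(f"{label}: recall={recall:.2f}")
```

Passing `labels=` to `confusion_matrix` in ordinal order (Strong Negative through Strong Positive) would make the printed matrix easier to read directly.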