<a href="https://colab.research.google.com/github/lapaniku/generative-sentiment/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install datasets transformers[torch] evaluate scikit-learn --quiet


In [3]:

import urllib.request
urllib.request.urlretrieve("https://gist.githubusercontent.com/lapaniku/90661c72db40684310a3033bfdcc2e78/raw/04170f67d729657d1c234ebaa6763b877b56116f/regen.json", "regen.json")

('regen.json', <http.client.HTTPMessage at 0x7d092834a050>)

In [4]:
# Prepare the dataset
import json
from datasets import Dataset

# Load the generated tickets; the context manager closes the file handle
with open("regen.json") as f:
    generated_tickets = json.load(f)

In [5]:
generated_tickets[:1]

[{'ticket': "Just got my new coffee maker and I couldn't be happier! It's sleek, brews super fast, and the coffee tastes amazing. I love how it has a programmable timer, so I can wake up to the smell of fresh coffee. Definitely worth the purchase!",
  'input': '',
  'output': 'Strong Positive',
  'most_similar_instructions': {'I was a little disappointed since it came without a box. So, where do I leave the adapter while the card is in the camera?': 0.1971830985915493,
   "Storage prices will always go down (though I won't say at what rate over what interval of time, a la Moore's Law - I'm not that smart.) At the time I purchased it, it was a very good price for this amount of memory.  In a year's time, it will probably be a laughably expensive sum for a pittance of storage.  Oh, well.With it in my phone, I don't have to worry about the number of photos or the duration of video I take.": 0.17777777777777778,
   "im currently using it for my nokia lumia 810 and my camera, works very wel

In [62]:
tickets = [item["ticket"] for item in generated_tickets]
sentiments = [item["output"] for item in generated_tickets]
ds = Dataset.from_dict({"text": tickets, "labels": sentiments})
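
Before splitting, it can help to check how the sentiment labels are distributed, since the split later in the notebook is stratified by label. The following is a minimal sketch using only the standard library; the short `sentiments` list is a hypothetical stand-in for the list built above, and `label_distribution` is an illustrative helper, not part of `datasets`.

```python
from collections import Counter

# Hypothetical stand-in for the `sentiments` list built in the cell above.
sentiments = ["Strong Positive", "Neutral", "Strong Positive", "Mild Negative"]

def label_distribution(labels):
    """Return {label: (count, share)} ordered by descending count."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, n / total) for label, n in counts.most_common()}

distribution = label_distribution(sentiments)
for label, (n, share) in distribution.items():
    print(f"{label:16s} {n:4d}  {share:.1%}")
```

A strongly skewed distribution would be one possible reason for weak results on the rarer classes.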

## Base model

https://huggingface.co/hyp1231/blair-roberta-base

BLaIR, short for "Bridging Language and Items for Retrieval and Recommendation", is a series of language models pre-trained on the Amazon Reviews 2023 dataset.

In [63]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hyp1231/blair-roberta-base")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = ds.map(tokenize_function, batched=True)

In [64]:
from datasets import ClassLabel

# Label order defines the integer ids: 0 = Strong Negative ... 4 = Strong Positive
label_names = ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"]
ds = tokenized_datasets.cast_column("labels", ClassLabel(names=label_names))
class_label_feature = ds.features["labels"]
id2label = [class_label_feature.int2str(i) for i in range(class_label_feature.num_classes)]


In [65]:

id2label

['Strong Negative',
 'Mild Negative',
 'Neutral',
 'Mild Positive',
 'Strong Positive']
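
The list above maps ids to names by position. Transformers model configs usually store this mapping as two dictionaries, so a small sketch of building them may be useful. The names `id2label_map` and `label2id_map` are illustrative, and passing them to `from_pretrained` is an optional addition that the original notebook does not make.

```python
# Same label order as the ClassLabel feature above.
label_names = ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"]

# id -> name and name -> id, the two directions a model config typically stores.
id2label_map = dict(enumerate(label_names))
label2id_map = {name: idx for idx, name in id2label_map.items()}

print(id2label_map[4])            # Strong Positive
print(label2id_map["Neutral"])    # 2

# Optional (assumption, not done in this notebook): hand the mappings to the model
# so that predictions are reported with readable names instead of LABEL_0 ... LABEL_4:
# AutoModelForSequenceClassification.from_pretrained(
#     "hyp1231/blair-roberta-base", num_labels=5,
#     id2label=id2label_map, label2id=label2id_map)
```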

In [66]:
ds = ds.train_test_split(test_size=0.3, stratify_by_column="labels")

In [67]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 705
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 303
    })
})
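
To illustrate what `stratify_by_column` is meant to preserve, here is a pure-Python sketch of a stratified split: each class contributes roughly the same fraction of its rows to the test set. This is only an illustration on toy labels under a fixed seed, not the implementation used by `datasets`, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with about `test_size` of every class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels, not the real dataset.
labels = ["neg"] * 10 + ["pos"] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.3)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

The split in the notebook is not seeded, so the exact rows in each part can differ between runs; `train_test_split` also accepts a `seed` argument if a reproducible split is wanted.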

In [68]:
from transformers import AutoModelForSequenceClassification

# RoBERTa-base encoder, roughly 125M parameters
model = AutoModelForSequenceClassification.from_pretrained("hyp1231/blair-roberta-base", num_labels=5)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at hyp1231/blair-roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [69]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [70]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix


In [71]:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred unpacks into the raw logits and the integer class ids
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    precision_macro = precision_score(labels, predictions, average='macro', zero_division=0)
    recall_macro = recall_score(labels, predictions, average='macro', zero_division=0)
    precision_micro = precision_score(labels, predictions, average='micro', zero_division=0)
    recall_micro = recall_score(labels, predictions, average='micro', zero_division=0)
    conf_matrix = confusion_matrix(labels, predictions).tolist()

    return {
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        'confusion_matrix': conf_matrix
    }

In [72]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", report_to="none")

In [73]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    compute_metrics=compute_metrics,
)

The confusion matrix (rows are true labels, columns are predictions) shows that the negative end of the scale is the hardest to classify. For example, Strong Negative and Neutral are not clearly separated from Mild Negative, and Mild Negative itself is noisy. This may simply reflect a lack of data and training.
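
The observation above can be made concrete by deriving per-class precision and recall from the confusion matrix. The sketch below copies the epoch-3 matrix from the training output and uses plain Python; it assumes the scikit-learn convention that rows are true labels and columns are predictions, and it is a hand calculation rather than the notebook's `compute_metrics`.

```python
label_names = ["Strong Negative", "Mild Negative", "Neutral", "Mild Positive", "Strong Positive"]

# Epoch-3 confusion matrix copied from the training output (rows: true, columns: predicted).
cm = [
    [34, 8, 0, 0, 0],
    [14, 76, 3, 0, 0],
    [0, 17, 23, 2, 0],
    [0, 0, 2, 26, 5],
    [0, 0, 0, 7, 86],
]

n = len(cm)
row_sums = [sum(row) for row in cm]                             # true examples per class
col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predictions per class

recall = [cm[i][i] / row_sums[i] for i in range(n)]
precision = [cm[j][j] / col_sums[j] for j in range(n)]

for name, p, r in zip(label_names, precision, recall):
    print(f"{name:16s} precision={p:.3f} recall={r:.3f}")

macro_precision = sum(precision) / n
macro_recall = sum(recall) / n
# In single-label classification, micro precision and micro recall both equal accuracy.
accuracy = sum(cm[i][i] for i in range(n)) / sum(row_sums)
print(f"macro precision={macro_precision:.4f} macro recall={macro_recall:.4f} accuracy={accuracy:.4f}")
```

The aggregates reproduce the epoch-3 row of the log (about 0.794, 0.777 and 0.809). Neutral has the lowest recall, about 0.55, because 17 of its 42 test examples are predicted as Mild Negative.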


In [74]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,Precision Micro,Recall Micro,Confusion Matrix
1,No log,0.652145,0.57803,0.601452,0.712871,0.712871,"[[0, 42, 0, 0, 0], [0, 90, 3, 0, 0], [0, 22, 18, 2, 0], [0, 0, 7, 23, 3], [0, 0, 0, 8, 85]]"
2,No log,0.563407,0.732116,0.750286,0.768977,0.768977,"[[36, 5, 0, 0, 1], [16, 64, 9, 4, 0], [0, 6, 22, 14, 0], [0, 0, 0, 25, 8], [0, 0, 0, 7, 86]]"
3,No log,0.507652,0.79403,0.777391,0.808581,0.808581,"[[34, 8, 0, 0, 0], [14, 76, 3, 0, 0], [0, 17, 23, 2, 0], [0, 0, 2, 26, 5], [0, 0, 0, 7, 86]]"


TrainOutput(global_step=267, training_loss=0.6048426824562558, metrics={'train_runtime': 238.8748, 'train_samples_per_second': 8.854, 'train_steps_per_second': 1.118, 'total_flos': 556494871311360.0, 'train_loss': 0.6048426824562558, 'epoch': 3.0})
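
The `global_step=267` value and the `No log` entries in the Training Loss column can be explained with a little arithmetic. The sketch below assumes the `TrainingArguments` defaults (a per-device batch size of 8 on a single device and `logging_steps=500`), since the notebook does not override them.

```python
import math

train_rows = 705      # train split size from the DatasetDict output
num_epochs = 3        # 'epoch': 3.0 in TrainOutput
batch_size = 8        # assumption: default per_device_train_batch_size, one device
logging_steps = 500   # assumption: default TrainingArguments.logging_steps

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)   # 89 267

# Training loss is only logged every `logging_steps` optimizer steps,
# so a run shorter than that never logs a training loss value.
loss_never_logged = total_steps < logging_steps
print(loss_never_logged)              # True
```

Under these assumptions, passing a smaller value such as `logging_steps=50` to `TrainingArguments` should populate the Training Loss column.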

In [75]:

model_path = "generative-sentiment"
trainer.save_model(model_path)

In [20]:
!pip install huggingface_hub



In [21]:

from huggingface_hub import login
login()

In [76]:
from huggingface_hub import HfApi

api = HfApi()
model_repo_name = "alapanik/blair-roberta-base-generative-sentiment"
# exist_ok=True lets this cell be re-run without failing if the repo already exists
api.create_repo(repo_id=model_repo_name, exist_ok=True)
api.upload_folder(
    folder_path=model_path,
    repo_id=model_repo_name
)
tokenizer.push_to_hub(model_repo_name)

CommitInfo(commit_url='https://huggingface.co/alapanik/blair-roberta-base-generative-sentiment/commit/2347bfde8834b24e143dd8fa4574e8bbf2ab667f', commit_message='Upload tokenizer', commit_description='', oid='2347bfde8834b24e143dd8fa4574e8bbf2ab667f', pr_url=None, pr_revision=None, pr_num=None)