<a href="https://colab.research.google.com/github/lapaniku/generative-sentiment/blob/main/train.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
!pip install datasets transformers[torch] evaluate scikit-learn --quiet


In [39]:

import urllib.request
urllib.request.urlretrieve("https://gist.githubusercontent.com/lapaniku/90661c72db40684310a3033bfdcc2e78/raw/04170f67d729657d1c234ebaa6763b877b56116f/regen.json", "regen.json")

('regen.json', <http.client.HTTPMessage at 0x7fad94241e70>)

In [40]:
# Prepare the dataset
import json

from datasets import Dataset

# Load the generated tickets downloaded in the previous cell
with open("regen.json") as f:
    generated_tickets = json.load(f)

In [19]:
generated_tickets[:1]

[{'ticket': "Just got my new coffee maker and I couldn't be happier! It's sleek, brews super fast, and the coffee tastes amazing. I love how it has a programmable timer, so I can wake up to the smell of fresh coffee. Definitely worth the purchase!",
  'input': '',
  'output': 'Strong Positive',
  'most_similar_instructions': {'I was a little disappointed since it came without a box. So, where do I leave the adapter while the card is in the camera?': 0.1971830985915493,
   "Storage prices will always go down (though I won't say at what rate over what interval of time, a la Moore's Law - I'm not that smart.) At the time I purchased it, it was a very good price for this amount of memory.  In a year's time, it will probably be a laughably expensive sum for a pittance of storage.  Oh, well.With it in my phone, I don't have to worry about the number of photos or the duration of video I take.": 0.17777777777777778,
   "im currently using it for my nokia lumia 810 and my camera, works very wel

In [20]:
tickets = [item["ticket"] for item in generated_tickets]
sentiments = [item["output"] for item in generated_tickets]
ds = Dataset.from_dict({"text": tickets, "labels": sentiments})

In [21]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = ds.map(tokenize_function, batched=True)

In [22]:
ds = tokenized_datasets.class_encode_column("labels")

In [23]:
ds = ds.train_test_split(test_size=0.3, stratify_by_column="labels")

In [24]:
ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 705
    })
    test: Dataset({
        features: ['text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 303
    })
})

In [25]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [27]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix


In [42]:
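def compute_metrics(eval_pred):
    """Compute macro/micro precision and recall plus the confusion matrix."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)

    # Macro averaging weights every class equally, regardless of its size
    precision_macro = precision_score(labels, predictions, average='macro', zero_division=0)
    recall_macro = recall_score(labels, predictions, average='macro', zero_division=0)
    # Micro averaging pools all samples; for single-label multiclass data
    # both micro scores equal accuracy
    precision_micro = precision_score(labels, predictions, average='micro', zero_division=0)
    recall_micro = recall_score(labels, predictions, average='micro', zero_division=0)
    # Rows are true classes, columns are predicted classes
    conf_matrix = confusion_matrix(labels, predictions).tolist()

    return {
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        'confusion_matrix': conf_matrix
    }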

In [43]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", report_to="none")

In [44]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    compute_metrics=compute_metrics,
)

The confusion matrix shows that the negative classes are the hardest to handle. For example, strong negative and neutral are not clearly separated, and mild negative is noisy, although this may simply reflect too little data and training.


In [45]:
trainer.train()

Epoch,Training Loss,Validation Loss,Precision Macro,Recall Macro,Precision Micro,Recall Micro,Confusion Matrix
1,No log,2.059468,0.700172,0.640162,0.726073,0.726073,"[[76, 0, 12, 5, 0], [5, 12, 7, 0, 9], [21, 1, 20, 0, 0], [16, 0, 0, 26, 0], [0, 7, 0, 0, 86]]"
2,No log,1.946654,0.766155,0.690281,0.768977,0.768977,"[[78, 1, 7, 7, 0], [5, 17, 2, 0, 9], [25, 1, 16, 0, 0], [11, 0, 0, 31, 0], [0, 2, 0, 0, 91]]"
3,No log,2.022929,0.7556,0.692473,0.772277,0.772277,"[[75, 1, 9, 8, 0], [3, 11, 5, 1, 13], [19, 1, 22, 0, 0], [8, 0, 0, 34, 0], [0, 1, 0, 0, 92]]"


TrainOutput(global_step=267, training_loss=0.053092849388551175, metrics={'train_runtime': 229.3971, 'train_samples_per_second': 9.22, 'train_steps_per_second': 1.164, 'total_flos': 556494871311360.0, 'train_loss': 0.053092849388551175, 'epoch': 3.0})
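
To make the confusion matrix easier to read, here is a minimal sketch that derives per-class precision and recall from the epoch 3 matrix in the training log above. The helper name `per_class_scores` is my own and not part of the original notebook, and classes are reported by integer index only, since the log does not show which sentiment name each index maps to.

```python
import numpy as np

# Epoch 3 confusion matrix copied from the training log above.
# scikit-learn convention: rows are true classes, columns are predicted classes.
conf_matrix = np.array([
    [75, 1, 9, 8, 0],
    [3, 11, 5, 1, 13],
    [19, 1, 22, 0, 0],
    [8, 0, 0, 34, 0],
    [0, 1, 0, 0, 92],
])


def per_class_scores(cm):
    """Return (precision, recall) arrays with one entry per class index.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    true_positives = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: how often each class actually occurs
    # Leave the score at 0 for classes that are never predicted or never present
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros(len(cm)), where=predicted_totals > 0)
    recall = np.divide(true_positives, actual_totals,
                       out=np.zeros(len(cm)), where=actual_totals > 0)
    return precision, recall


precision, recall = per_class_scores(conf_matrix)
for class_id, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {class_id}: precision={p:.3f} recall={r:.3f}")

# Fraction of all test samples on the diagonal
accuracy = np.trace(conf_matrix) / conf_matrix.sum()
print(f"macro precision={precision.mean():.4f} "
      f"macro recall={recall.mean():.4f} accuracy={accuracy:.4f}")
```

The macro averages and accuracy computed this way should agree with the epoch 3 row of the log (0.7556, 0.692473, and 0.772277). To attach sentiment names to the indices, `ds["train"].features["labels"].names` should list them in index order, assuming the `labels` column is still the `ClassLabel` feature created by `class_encode_column`.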