This notebook walks through fine-tuning the pretrained "bert-base-uncased" model from the Hugging Face transformers library on the Sentiment Labelled Sentences dataset (https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences). The dataset contains review sentences from amazon.com, imdb.com, and yelp.com, each labelled with its sentiment: 0 for negative or 1 for positive.

The classification head is configured for multi-label classification. It outputs one logit per label, and a sigmoid turns each logit into an independent probability, for 5 labels:
- Positive
- Negative
- Amazon
- Imdb
- Yelp

As a result, the model learns two things at once:
1. The sentiment of the sentence (Positive or Negative).
2. The type of website (Amazon, Imdb, Yelp) the sentence would appear in.
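
To make the target format concrete, here is a minimal sketch of how one review maps to a 5-element multi-hot vector. The label order follows the list above; `encode_targets` is a hypothetical helper name used only for illustration and does not appear elsewhere in the notebook.

```python
LABELS = ["Positive", "Negative", "Amazon", "Imdb", "Yelp"]

def encode_targets(sentiment, source):
    """Return a 5-element multi-hot list for a 0/1 sentiment and a source name."""
    # Exactly one sentiment label and one source label are active per review.
    active = {"Positive" if sentiment == 1 else "Negative", source}
    return [int(name in active) for name in LABELS]

print(encode_targets(1, "Imdb"))  # [1, 0, 0, 1, 0]
print(encode_targets(0, "Yelp"))  # [0, 1, 0, 0, 1]
```

Every target therefore has exactly two active entries, which is why a multi-label head is used here rather than a single softmax over five classes.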

#### Prepare the data

In [None]:
!pip install transformers

In [None]:
!pip install datasets

In [None]:
from datasets import load_dataset

In [None]:
import csv

In [None]:
amazon_data = []

In [None]:
with open('amazon.txt', 'r') as file:
  for line in file:
    l = line.strip()
    review = l[0:-2]
    label = int(l[-1])
    amazon_data.append([review, int(label==1), int(label==0), 1, 0, 0])

In [None]:
imdb_data = []

In [None]:
with open('imdb.txt', 'r') as file:
  for line in file:
    l = line.strip()
    review = l[0:-2]
    label = int(l[-1])
    imdb_data.append([review, int(label==1), int(label==0), 0, 1, 0])

In [None]:
yelp_data = []

In [None]:
with open('yelp.txt', 'r') as file:
  for line in file:
    l = line.strip()
    review = l[0:-2]
    label = int(l[-1])
    yelp_data.append([review, int(label==1), int(label==0), 0, 0, 1])
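
The three loops above differ only in the file name and the source flags, so they could be folded into one helper. The sketch below assumes each line has the form `sentence<TAB>label`, which is consistent with the slicing used above but is stated here as an assumption. `load_reviews` is a hypothetical name, and the demo file contains made-up sentences rather than dataset content.

```python
import os
import tempfile

def load_reviews(path, source_flags):
    """Parse 'sentence<TAB>label' lines into [text, positive, negative, *source_flags] rows."""
    rows = []
    with open(path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on int("")
            # Split on the last tab only, so tabs inside the sentence are preserved.
            review, label = line.rsplit("\t", 1)
            label = int(label)
            rows.append([review.strip(), int(label == 1), int(label == 0), *source_flags])
    return rows

# Tiny self-made file (not real dataset content), only to show the row layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Works as described.\t1\nStopped charging after a week.\t0\n")
    demo_rows = load_reviews(demo_path, [1, 0, 0])

print(demo_rows)
```

With such a helper, the Amazon rows would come from `load_reviews('amazon.txt', [1, 0, 0])`, and likewise for the other two files with their own flags. A line without a tab would raise a `ValueError` at the unpacking step, which makes malformed input visible early.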

Combine the data into a CSV file.

In [None]:
with open("data.csv", 'w', newline='') as csvfile:  # newline='' is what the csv module expects for writers
  csvwriter = csv.writer(csvfile)
  csvwriter.writerow(['Text', 'Positive', 'Negative', 'Amazon', 'Imdb', 'Yelp'])
  csvwriter.writerows(amazon_data)
  csvwriter.writerows(imdb_data)
  csvwriter.writerows(yelp_data)

Create a Dataset from the CSV file.

In [None]:
ds = load_dataset("csv", data_files="data.csv")

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-a8511f7ad0a55b54/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-a8511f7ad0a55b54/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



In [None]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['Text', 'Positive', 'Negative', 'Amazon', 'Imdb', 'Yelp'],
        num_rows: 3000
    })
})


Split the data into training and test sets, holding out 10% of the rows for testing.

In [None]:
train_test = ds['train'].train_test_split(test_size=0.1)

In [None]:
print(train_test)

DatasetDict({
    train: Dataset({
        features: ['Text', 'Positive', 'Negative', 'Amazon', 'Imdb', 'Yelp'],
        num_rows: 2700
    })
    test: Dataset({
        features: ['Text', 'Positive', 'Negative', 'Amazon', 'Imdb', 'Yelp'],
        num_rows: 300
    })
})


In [None]:
labels = ['Positive', 'Negative', 'Amazon', 'Imdb', 'Yelp']

In [None]:
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

In [None]:
print(id2label)

{0: 'Positive', 1: 'Negative', 2: 'Amazon', 3: 'Imdb', 4: 'Yelp'}


In [None]:
print(label2id)

{'Positive': 0, 'Negative': 1, 'Amazon': 2, 'Imdb': 3, 'Yelp': 4}


In [None]:
train_test["train"][100]

{'Text': 'The servers went back and forth several times, not even so much as an "Are you being helped?"',
 'Positive': 0,
 'Negative': 1,
 'Amazon': 0,
 'Imdb': 0,
 'Yelp': 1}

Preprocess data

In [None]:
from transformers import AutoTokenizer
import numpy as np

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [None]:
def preprocess_function(examples):
  encoding = tokenizer(examples["Text"], padding="max_length", truncation=True, max_length=128)
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  labels_matrix = np.zeros((len(examples["Text"]), len(labels)))
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]
  encoding["labels"] = labels_matrix.tolist()
  return encoding
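
The label-matrix step in `preprocess_function` can be checked in isolation with NumPy alone. The sketch below uses a made-up two-row batch in the column-oriented layout that a batched `map` passes to the function, and builds the same (batch_size, num_labels) matrix with `np.stack` as an alternative to the column-by-column loop. `toy_batch` is a hypothetical name.

```python
import numpy as np

labels = ["Positive", "Negative", "Amazon", "Imdb", "Yelp"]

# Made-up batch: one list per column, as in a batched `map` call.
toy_batch = {
    "Text": ["great phone", "slow service"],
    "Positive": [1, 0],
    "Negative": [0, 1],
    "Amazon": [1, 0],
    "Imdb": [0, 0],
    "Yelp": [0, 1],
}

# Stack the per-label columns side by side: shape (batch_size, num_labels).
# Floats are used because the binary cross-entropy loss expects float targets.
labels_matrix = np.stack([toy_batch[name] for name in labels], axis=1).astype(float)
print(labels_matrix.tolist())
# [[1.0, 0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 1.0]]
```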

In [None]:
processed_data = train_test.map(preprocess_function, batched=True, remove_columns=train_test['train'].column_names)

In [None]:
print(processed_data['train'][0])

{'input_ids': [101, 2035, 1996, 5889, 2507, 1037, 6919, 2836, 1010, 2926, 7673, 20524, 2004, 6175, 5671, 1010, 2040, 3431, 2013, 1996, 6091, 2732, 7485, 1999, 1996, 2927, 2083, 1996, 4326, 2824, 2016, 2003, 2112, 1997, 2000, 1996, 4658, 2732, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

In [None]:
example = processed_data['train'][0]

In [None]:
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [None]:
len(example['input_ids'])

128

In [None]:
tokenizer.decode(example['input_ids'])

'[CLS] all the actors give a wonderful performance, especially jennifer rubin as jamie harris, who changes from the nervous starlet in the beginning through the strange events she is part of to the cool star. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'

In [None]:
example['labels']

[1.0, 0.0, 0.0, 1.0, 0.0]

In [None]:
for idx, label in enumerate(example['labels']):
  if label == 1.0:
    print(id2label[idx])

Positive
Imdb


In [None]:
processed_data.set_format("torch")

#### Model

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
outputs = model(input_ids=processed_data['train'][0]['input_ids'].unsqueeze(0), 
                labels=processed_data['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.7366, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[ 0.4863, -0.1103,  0.4953, -0.4412, -0.0773]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
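
The `BinaryCrossEntropyWithLogitsBackward0` in the output shows that the loss is binary cross-entropy computed on the raw logits. As a sanity check, the sketch below recomputes it in plain Python from the five logits printed above and the labels of training example 0 shown earlier, assuming the loss is averaged over the labels. `bce_with_logits` is a hypothetical helper name.

```python
import math

logits = [0.4863, -0.1103, 0.4953, -0.4412, -0.0773]  # copied from the output above
targets = [1.0, 0.0, 0.0, 1.0, 0.0]                   # labels of training example 0

def bce_with_logits(logits, targets):
    """Mean over labels of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]."""
    total = 0.0
    for x, y in zip(logits, targets):
        # softplus(x) - y*x is an algebraically equivalent form of the same term.
        total += math.log1p(math.exp(x)) - y * x
    return total / len(logits)

loss = bce_with_logits(logits, targets)
print(round(loss, 4))  # approximately 0.7366, matching the logged loss
```

A value near log(2) ≈ 0.69 is what an untrained head typically produces, since its logits are close to zero and each label gets a probability near 0.5.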

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
!pip install --upgrade accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
#!pip uninstall -y transformers accelerate
#!pip install transformers accelerate

#### Train the model using Trainer class

In [None]:
training_args = TrainingArguments(
    "JosephTK/NLP-reviews", 
    num_train_epochs=10, 
    evaluation_strategy="epoch", 
    push_to_hub=True)

In [None]:
trainer = Trainer(
    model,
    args=training_args,
    train_dataset=processed_data['train'],
    eval_dataset=processed_data['test'],
    tokenizer=tokenizer
)

In [None]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


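| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.22702 |
| 2 | 0.223500 | 0.273674 |
| 3 | 0.064400 | 0.317131 |
| 4 | 0.064400 | 0.351071 |
| 5 | 0.019300 | 0.37256 |
| 6 | 0.011900 | 0.363833 |
| 7 | 0.011900 | 0.333673 |
| 8 | 0.004300 | 0.342381 |
| 9 | 0.001900 | 0.338725 |
| 10 | 0.001900 | 0.346695 |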


TrainOutput(global_step=3380, training_loss=0.04826677525360909, metrics={'train_runtime': 972.6145, 'train_samples_per_second': 27.76, 'train_steps_per_second': 3.475, 'total_flos': 1776047461632000.0, 'train_loss': 0.04826677525360909, 'epoch': 10.0})
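
In the log above, the training loss keeps falling while the validation loss is lowest after the first epoch, a pattern that is consistent with overfitting in the later epochs. A small sketch that picks the best epoch from the logged validation losses (values copied from the log; `val_loss` and `best_epoch` are illustrative names):

```python
# Validation losses copied from the training log, keyed by epoch.
val_loss = {1: 0.22702, 2: 0.273674, 3: 0.317131, 4: 0.351071, 5: 0.37256,
            6: 0.363833, 7: 0.333673, 8: 0.342381, 9: 0.338725, 10: 0.346695}

# Epoch whose validation loss is smallest.
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch, val_loss[best_epoch])  # 1 0.22702
```

One possible response, not used in this notebook, would be to train for fewer epochs or to set `load_best_model_at_end=True` in `TrainingArguments` together with a matching save strategy, so that the checkpoint with the lowest validation loss is the one kept.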

In [None]:
trainer.evaluate()

{'eval_loss': 0.3466948866844177,
 'eval_runtime': 2.171,
 'eval_samples_per_second': 138.184,
 'eval_steps_per_second': 17.503,
 'epoch': 10.0}
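
`trainer.evaluate()` reports only the loss here because no metric function was passed to the `Trainer`. Below is a minimal NumPy sketch of one, assuming a 0.5 probability threshold and using micro-averaged F1 and exact-match accuracy as metrics. Both choices are assumptions and are not part of the original notebook.

```python
import numpy as np

def compute_metrics(eval_pred, threshold=0.5):
    """Micro-F1 and exact-match accuracy for multi-label logits."""
    logits, label_ids = eval_pred
    logits = np.asarray(logits, dtype=float)
    y_true = np.asarray(label_ids).astype(int)
    # Sigmoid turns each logit into an independent per-label probability.
    probs = 1.0 / (1.0 + np.exp(-logits))
    y_pred = (probs >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    micro_f1 = 2 * tp / denom if denom else 0.0
    # A row counts as correct only if all five labels match.
    exact_match = float((y_pred == y_true).all(axis=1).mean())
    return {"micro_f1": micro_f1, "exact_match": exact_match}

# Made-up logits and labels, only to exercise the function.
demo_logits = [[7, -7, -7, 7, -7], [7, -7, 7, -7, -7]]
demo_labels = [[1, 0, 0, 1, 0], [0, 1, 1, 0, 0]]
demo_metrics = compute_metrics((demo_logits, demo_labels))
print(demo_metrics)  # {'micro_f1': 0.75, 'exact_match': 0.5}
```

A function of this shape could be supplied through the `compute_metrics` argument of `Trainer`, in which case its values would appear alongside `eval_loss` in the dictionary above.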

#### Evaluate on some test inputs

Negative Yelp-type review

In [None]:
text = "I'm never coming back to this restaurant"

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}

outputs = trainer.model(**encoding)

In [None]:
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-7.4608,  7.4232, -6.7949, -6.7606,  7.6961]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


Negative Amazon-type review

In [None]:
text = "The item was broken"

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}

outputs = trainer.model(**encoding)

In [None]:
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-7.1130,  7.0180,  7.5952, -7.0188, -6.9386]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


Positive Imdb-type review

In [None]:
text = "The story was great. It made me smile."

encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}

outputs = trainer.model(**encoding)

In [None]:
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[ 7.1301, -7.2671, -6.8889,  7.3633, -7.0190]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
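
The outputs above are raw logits, not probabilities. Because the model is multi-label, each logit goes through a sigmoid independently, and the labels whose probability clears a threshold are reported. The sketch below applies this to the three logit vectors printed above, assuming a threshold of 0.5. `decode` is a hypothetical helper name.

```python
import math

labels = ["Positive", "Negative", "Amazon", "Imdb", "Yelp"]

def decode(logits, threshold=0.5):
    """Return the names of labels whose sigmoid probability is at least `threshold`."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [name for name, p in zip(labels, probs) if p >= threshold]

# Logits copied from the three outputs above.
print(decode([-7.4608, 7.4232, -6.7949, -6.7606, 7.6961]))  # ['Negative', 'Yelp']
print(decode([-7.1130, 7.0180, 7.5952, -7.0188, -6.9386]))  # ['Negative', 'Amazon']
print(decode([7.1301, -7.2671, -6.8889, 7.3633, -7.0190]))  # ['Positive', 'Imdb']
```

All three decoded label pairs agree with the captions given for the test inputs.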


#### Save and upload the model

In [None]:
trainer.save_model("./my_model")


To https://huggingface.co/JosephTK/NLP-reviews
   57363b2..527c161  main -> main


To https://huggingface.co/JosephTK/NLP-reviews
   527c161..acbca08  main -> main




In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
trainer.push_to_hub()