This notebook demonstrates fine-tuning [AlephBERT](https://github.com/OnlpLab/AlephBERT) for a sentiment analysis task.



**First, we install the dependencies and set the number of sentiment labels**

In [26]:
!pip install transformers
labels = 5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


**Download the pretrained AlephBERT model**

In [27]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('onlplab/alephbert-base')
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("onlplab/alephbert-base", num_labels=labels)
model.save_pretrained("./initial_pretrained")


Some weights of the model checkpoint at onlplab/alephbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at onlplab/alephbert-base

In [28]:
!ls -latr ./initial_pretrained

total 492192
drwxr-xr-x 2 root root      4096 Jun 16 15:16 .
drwxr-xr-x 1 root root      4096 Jun 16 15:19 ..
-rw-r--r-- 1 root root       913 Jun 16 15:19 config.json
-rw-r--r-- 1 root root 503989869 Jun 16 15:20 pytorch_model.bin
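
As a rough sanity check on the checkpoint size listed above, the parameter count can be estimated from the file size. The sketch below assumes the weights are stored as 32-bit floats (4 bytes each) and ignores serialization overhead, so the result is an approximation rather than an exact count.

```python
# Size of pytorch_model.bin in bytes, copied from the directory listing.
checkpoint_bytes = 503_989_869

# Assumption: weights are stored as float32 (4 bytes per parameter) and
# the file's serialization overhead is negligible.
BYTES_PER_PARAM = 4

approx_params = checkpoint_bytes // BYTES_PER_PARAM
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~126M parameters
```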


**Convert the sentiment analysis data to Hugging Face inputs with encodings and labels**

In [29]:
import pandas as pd
from pathlib import Path
import torch
from transformers import DataCollatorWithPadding

class HebrewSentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def get_datasets():
    """Load the train/dev/test TSV files and wrap them as tokenized datasets."""
    token_root = Path("./")
    train = pd.read_csv(token_root / f"train_{labels}_labels.tsv", sep='\t')
    dev = pd.read_csv(token_root / f"dev_{labels}_labels.tsv", sep='\t')
    test = pd.read_csv(token_root / f"test_{labels}_labels.tsv", sep='\t')

    # Truncate long comments here; padding is left to the data collator
    train_encodings = tokenizer(train["comment"].to_list(), truncation=True)
    dev_encodings = tokenizer(dev["comment"].to_list(), truncation=True)
    test_encodings = tokenizer(test["comment"].to_list(), truncation=True)
    train_labels = train["label"].to_list()
    dev_labels = dev["label"].to_list()
    test_labels = test["label"].to_list()

    train_dataset = HebrewSentimentDataset(train_encodings, train_labels)
    dev_dataset = HebrewSentimentDataset(dev_encodings, dev_labels)
    test_dataset = HebrewSentimentDataset(test_encodings, test_labels)
    
    return train_dataset, dev_dataset, test_dataset
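
The tokenizer calls above truncate but do not pad, so each example keeps its own length; `DataCollatorWithPadding` pads later, one batch at a time. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 0 is an assumption.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad every sequence in `batch` to the length of the longest one.

    `batch` is a list of token-id lists; returns padded ids and attention masks.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only
batch = [[2, 15, 27, 3], [2, 8, 3]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[2, 15, 27, 3], [2, 8, 3, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest sequence in each batch, rather than to one global maximum, keeps batches of short comments small.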

**Fine-tune on the sentiment data. The number of classes is set explicitly through `num_labels` when the model is loaded.**

In [30]:
from transformers import Trainer,TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=5,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,                # log the training loss every 10 steps
)

train_dataset, dev_dataset, test_dataset = get_datasets()
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=dev_dataset,             # evaluation dataset
    data_collator=data_collator
)

trainer.train()
trainer.save_model("./alephbert_sentiment")

***** Running training *****
  Num examples = 10805
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3380
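
The step count in the log follows from the dataset size and the batch size. A short check, assuming a single device and that the last, partially filled batch of each epoch still counts as a step:

```python
import math

num_examples = 10805   # "Num examples" from the log
batch_size = 16        # per_device_train_batch_size (single device assumed)
num_epochs = 5         # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)  # last partial batch counts as a step
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 676 3380
```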


Step,Training Loss
10,1.6341
20,1.6461
30,1.6331
40,1.5894
50,1.5422
60,1.514
70,1.4804
80,1.4533
90,1.3042
100,1.2822
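
The first logged losses sit slightly above 1.61, which is the cross-entropy of a 5-class classifier that assigns equal probability to every class. A quick check of that baseline:

```python
import math

num_classes = 5  # value of `labels` in this notebook

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
uniform_loss = math.log(num_classes)
print(round(uniform_loss, 4))  # 1.6094
```

A training loss clearly below this value suggests the model has started to learn something beyond uniform guessing.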


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1000
Configuration saved in ./results/checkpoint-1000/config.json
Model weights saved in ./results/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-1500
Configuration saved in ./results/checkpoint-1500/config.json
Model weights saved in ./results/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2000
Configuration saved in ./results/checkpoint-2000/config.json
Model weights saved in ./results/checkpoint-2000/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-2500
Configuration saved in ./results/checkpoint-2500/config.json
Model weights saved in ./results/checkpoint-2500/pytorch_model.bin
Saving model checkpoint to ./results/checkpoint-3000
Configuration saved in ./results/checkpoint-3
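
The `warmup_steps=500` setting in the training arguments means the learning rate ramps up over the first 500 optimization steps before it decays. The sketch below shows the shape of such a schedule. It assumes linear warmup followed by linear decay and a peak learning rate of 5e-5; neither is set explicitly in this notebook, so both are assumptions about the defaults in use, and `linear_warmup_lr` is a hypothetical helper rather than a library function.

```python
def linear_warmup_lr(step, peak_lr=5e-5, warmup_steps=500, total_steps=3380):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    # Decay linearly from peak_lr down to 0 at the final step
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1940, 3380):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the rate peaks at step 500, is back to half the peak at step 1940, and reaches zero at the last step.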

In [31]:
!ls -latr ./alephbert_sentiment/

total 492236
-rw-r--r-- 1 root root     36014 Jun 16 15:10 trainer_state.json
drwxr-xr-x 1 root root      4096 Jun 16 15:20 ..
-rw-r--r-- 1 root root       962 Jun 16 15:56 config.json
drwxr-xr-x 2 root root      4096 Jun 16 15:56 .
-rw-r--r-- 1 root root 503992685 Jun 16 15:56 pytorch_model.bin
-rw-r--r-- 1 root root      3119 Jun 16 15:56 training_args.bin


In [32]:
import numpy as np


**Use the fine-tuned transformer for prediction**

**Calculate Accuracy**

In [33]:
class AlephBERTModel:
    def __init__(self, checkpoint_folder, labels=5):
        self.labels = labels
        self.model = AutoModelForSequenceClassification.from_pretrained(checkpoint_folder, num_labels=self.labels)
        self.tokenizer = AutoTokenizer.from_pretrained('onlplab/alephbert-base')

    class HebrewSentimentDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels

        def __getitem__(self, idx):
            item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
            item['labels'] = torch.tensor(self.labels[idx])
            return item

        def __len__(self):
            return len(self.labels)

    def get_dataset(self, root_folder):
        # Load and tokenize the TSV splits, then build the Trainer that predict() uses
        token_root = Path(root_folder)
        train = pd.read_csv(token_root / f"train_{self.labels}_labels.tsv", sep='\t')
        dev = pd.read_csv(token_root / f"dev_{self.labels}_labels.tsv", sep='\t')
        test = pd.read_csv(token_root / f"test_{self.labels}_labels.tsv", sep='\t')

        train_encodings = self.tokenizer(train["comment"].to_list(), truncation=True)
        dev_encodings = self.tokenizer(dev["comment"].to_list(), truncation=True)
        test_encodings = self.tokenizer(test["comment"].to_list(), truncation=True)
        train_labels = train["label"].to_list()
        dev_labels = dev["label"].to_list()
        test_labels = test["label"].to_list()

        self.train_dataset = self.HebrewSentimentDataset(train_encodings, train_labels)
        self.dev_dataset = self.HebrewSentimentDataset(dev_encodings, dev_labels)
        self.test_dataset = self.HebrewSentimentDataset(test_encodings, test_labels)

        self.training_args = TrainingArguments(
            output_dir=f'./alephbert_sentiment/{self.labels}_labels_results',  # output directory
            num_train_epochs=3,  # total number of training epochs
            per_device_train_batch_size=16,  # batch size per device during training
            per_device_eval_batch_size=64,  # batch size for evaluation
            warmup_steps=500,  # number of warmup steps for learning rate scheduler
            weight_decay=0.01,  # strength of weight decay
            logging_dir=f'./alephbert_sentiment/{self.labels}_labels_logs',  # directory for storing logs
            logging_steps=10
        )
        self.data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer)
        self.trainer = Trainer(
            model=self.model,  # the instantiated 🤗 Transformers model to be trained
            args=self.training_args,  # training arguments, defined above
            train_dataset=self.train_dataset,  # training dataset
            eval_dataset=self.dev_dataset,  # evaluation dataset
            data_collator=self.data_collator
        )

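    def predict(self):
        # trainer.predict returns (predictions, label_ids, metrics); only the logits are needed
        raw_pred, _, _ = self.trainer.predict(self.test_dataset)
        # Predicted class = index of the largest logit in each row
        y_pred = np.argmax(raw_pred, axis=1)
        # Accuracy = fraction of predictions that match the true labels
        accuracy = np.mean(np.array(self.test_dataset.labels) == y_pred)
        print(f"{self.labels} labels accuracy={accuracy}")
        return y_pred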
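
`predict` turns the model's raw outputs (one score per class for each example) into class predictions by taking the index of the largest score, then compares them with the true labels. The same two steps on made-up scores, in plain Python:

```python
# Made-up logits for three examples and five classes, for illustration only
logits = [
    [0.1, 2.3, -0.5, 0.0, 0.4],
    [1.7, 0.2, 0.1, -1.0, 0.3],
    [-0.2, 0.0, 0.1, 0.9, 2.5],
]
true_labels = [1, 0, 3]

# argmax per row: index of the highest-scoring class
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]

# accuracy: fraction of predictions that equal the true label
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)

print(predictions)  # [1, 0, 4]
print(accuracy)     # 0.6666666666666666
```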

In [34]:
# Evaluate the fine-tuned AlephBERT checkpoint on the test split
print("Evaluating AlephBERT:")
model = AlephBERTModel('alephbert_sentiment',labels=labels)
model.get_dataset('./')
predicted_labels = model.predict()

loading configuration file alephbert_sentiment/config.json
Model config BertConfig {
  "_name_or_path": "alephbert_sentiment",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.19.4",
  "type_vocab_size": 1,
  "

Evaluating AlephBERT:


All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the model checkpoint at alephbert_sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.
loading configuration file https://huggingface.co/onlplab/alephbert-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/c311cde38d67060cfab2730d54b583d4d7b55c2bf556914da310d274b806e592.6df48d87da51ccd2d7121eb1fd6ebc489d701a2baed5666032a314e019327cb0
Model config BertConfig {
  "_name_or_path": "onlplab/alephbert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate

5 labels accuracy=0.7034045008655511


In [36]:
test = pd.read_csv(f"test_{labels}_labels.tsv", sep='\t')
Y_test = test['label']
X_test = test['comment']
X_test

0       \n                 חיובי : אוזניות מצויינות לג...
1       \n                מכשיר הכי טוב שיש בשוק לדעתי...
2       \n                טלוויזיה מעולה שווה כל שקל ....
3       \n                מקרר נוח מבחינת עיצוב פנימי ...
4       \n                מכשיר פשוט מעולה , עוצמתי, ק...
                              ...                        
1728             \n                אחלה מוצר רכישה מוצלחת
1729    \n                רכשתי את המקרר לפני שבוע ואנ...
1730    \n                יש לי אותם כשלוש שנים אוזניו...
1731    \n                רכשתי 2 לפני פחות משנה... לא...
1732    \n                משתמש מעל שנה וחצי מאוד מרוצ...
Name: comment, Length: 1733, dtype: object

In [39]:
from sklearn.metrics import confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def plot_AlephBERT_matrix(y_true, y_pred, labels=5):
    cm = confusion_matrix(y_true, y_pred)
    display_labels = list(range(labels))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=display_labels)
    return disp.plot(
        include_values=True,
        cmap="viridis",
        ax=None,
        xticks_rotation="horizontal",
        values_format=None,
        colorbar=True,
    )

overSample = True
prec_mic = precision_score(Y_test, predicted_labels, average="micro")

rec_mic = recall_score(Y_test, predicted_labels, average="micro")

f1_mic = f1_score(Y_test, predicted_labels, average="micro")
print(f"Micro precision:{prec_mic}, recall:{rec_mic}, f1:{f1_mic}")
prec_mac = precision_score(Y_test, predicted_labels, average="macro")

rec_mac = recall_score(Y_test, predicted_labels, average="macro")

f1_mac = f1_score(Y_test, predicted_labels, average="macro")
print(f"Macro precision:{prec_mac}, recall:{rec_mac}, f1:{f1_mac}")
acc = accuracy_score(Y_test, predicted_labels)

# Mean absolute distance between true and predicted labels (treats the labels as ordinal)
distance = np.abs(Y_test.to_numpy() - predicted_labels)
# distance[distance <= 1] = 0

print(f"Accuracy: {acc} mean distance:{np.mean(distance)}")
plot = False
if plot:
    plot_AlephBERT_matrix(Y_test, predicted_labels)
    strategy = ""
    if overSample:
        strategy = " with over sampling"
    format_string = ".2f"
    algo = 'AlephBERT'
    plt.title(
        f'{algo}{strategy} \nAcc:{format(acc, format_string)} Mac prec:{format(prec_mac, format_string)} Mac rec:{format(rec_mac, format_string)} avg dist:{format(np.mean(distance), format_string)}')
    plt.savefig(f'{algo}{strategy}_{labels}_label.png')
    plt.show()
else:
    cm = confusion_matrix(Y_test, predicted_labels)
    print(cm)



Micro precision:0.7034045008655511, recall:0.7034045008655511, f1:0.7034045008655511
Macro precision:0.5328366745122418, recall:0.5077882927547386, f1:0.5133536862419379
Accuracy: 0.7034045008655511 mean distance:0.4085401038661281
[[270  22  14   5  10]
 [ 58  36  18  15  10]
 [ 30  18  27  26  22]
 [  7   7  22  80 128]
 [  8   1   6  87 806]]
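
The accuracy and the macro-averaged scores can be reproduced from the confusion matrix alone. Doing so also shows why micro-averaged precision, recall, and F1 all equal the accuracy: in single-label classification every misclassified example is one false positive (for the predicted class) and one false negative (for the true class), so the pooled counts coincide. A sketch in plain Python, assuming rows are true labels and columns are predicted labels (the scikit-learn convention):

```python
cm = [
    [270, 22, 14, 5, 10],
    [58, 36, 18, 15, 10],
    [30, 18, 27, 26, 22],
    [7, 7, 22, 80, 128],
    [8, 1, 6, 87, 806],
]
n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))

accuracy = correct / total  # equals micro precision, recall, and F1

# Per-class recall: diagonal / row sum; per-class precision: diagonal / column sum
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
f1s = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]

macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n  # macro F1 averages the per-class F1 scores

print(f"accuracy={accuracy:.4f}")  # accuracy=0.7034
print(f"macro precision={macro_precision:.4f} recall={macro_recall:.4f} f1={macro_f1:.4f}")
# macro precision=0.5328 recall=0.5078 f1=0.5134
```

The gap between accuracy (about 0.70) and macro recall (about 0.51) reflects the class imbalance: the two largest classes, 0 and 4, are recognized far more reliably than the three middle ones.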
