<a href="https://colab.research.google.com/github/j-hartmann/siebert/blob/main/SieBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **SieBERT: Leveraging Transfer Learning for Sentiment Analysis**



In [None]:
# install Hugging Face's transformers and datasets libraries
!pip install transformers
!pip install datasets

In [8]:
# check GPU status
!nvidia-smi

Sun May  8 06:21:31 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
# check version of transformers
import transformers
print(transformers.__version__)

4.18.0


### **Example 1:** Applying SieBERT, a pretrained sentiment analysis model, with *3 lines of code*

In [None]:
from transformers import pipeline  # load pipeline() function from transformers library
sentiment_analysis = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")  # load pretrained SieBERT model ("Sentiment in English")

In [11]:
sentiment_analysis("This is super helpful. I love it!")  # apply pretrained model to example sentence

[{'label': 'POSITIVE', 'score': 0.9988920092582703}]
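
The pipeline returns a list with one dictionary per input text, each holding a `label` and a `score`. A minimal sketch of reading that structure, reusing the output shown above; the `THRESHOLD` value and the `is_confident` helper are hypothetical and only serve as illustration:

```python
# Result copied from the output above: one dict per input text
results = [{'label': 'POSITIVE', 'score': 0.9988920092582703}]

THRESHOLD = 0.9  # hypothetical cut-off, not part of SieBERT


def is_confident(result, threshold=THRESHOLD):
    """Return True if the predicted label's score reaches the threshold."""
    return result['score'] >= threshold


for result in results:
    print(result['label'], round(result['score'], 4), is_confident(result))
```

A threshold like this is one possible way to flag low-confidence predictions for manual review.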

### **Example 2:** Classifying multiple sentences using SieBERT

In [12]:
# load dependencies
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# specify path of pretrained model
checkpoint = "siebert/sentiment-roberta-large-english"  # SieBERT

# load pretrained tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [13]:
# provide 3 example sentences
sequences = ["This is amazing", "I don't think it's useless.", "I hate this!"]

# tokenize sequences
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# predict with model (no gradients needed for inference)
with torch.no_grad():
    output = model(**tokens)

# transform logits to class labels
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
confidences = predictions.max(1)[0].tolist()
classes = predictions.argmax(-1).tolist()
labels = pd.Series(classes).map(model.config.id2label)

In [14]:
# consolidate results
df = pd.DataFrame(list(zip(sequences, classes, labels, confidences)), columns=['text', 'class', 'class_label', 'confidence'])

# return dataframe
print(df)

                          text  class class_label  confidence
0              This is amazing      1    POSITIVE    0.998669
1  I don't think it's useless.      1    POSITIVE    0.991253
2                 I hate this!      0    NEGATIVE    0.999456
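
The step from logits to class labels and confidences can be sketched without PyTorch. This is a minimal illustration, assuming made-up logits for a single sentence (`example_logits` is hypothetical, not model output); the `id2label` mapping mirrors the class ids and labels printed above:

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence (illustrative values only)
example_logits = [-2.0, 3.0]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # mapping as shown in the printed dataframe

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
confidence = probs[predicted_class]  # probability of the winning class
label = id2label[predicted_class]
print(label, round(confidence, 4))
```

The confidence is simply the largest softmax probability, which is what `predictions.max(1)[0]` extracts per row in the cell above.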


### **Example 3:** Fine-tuning SieBERT for multi-class sentiment analysis in a different domain

In [None]:
# load three-class sentiment data set from Hugging Face
from datasets import load_dataset
sentiment = load_dataset('sentiment140')  # source: https://huggingface.co/datasets/sentiment140/viewer/sentiment140/test
print(sentiment)

In [16]:
# print first row from training data split
print(sentiment['train'][0])

# count number of labels
NUM_LABELS = len(set(sentiment['test']['sentiment']))
print(set(sentiment['test']['sentiment']))
print(NUM_LABELS)

{'text': "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", 'date': 'Mon Apr 06 22:19:45 PDT 2009', 'user': '_TheSpecialOne_', 'sentiment': 0, 'query': 'NO_QUERY'}
{0, 2, 4}
3


In [17]:
# define preprocessing function
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

In [None]:
# tokenize dataset
tokenized_sentiment = sentiment.map(preprocess_function, batched=True)

# use dynamic padding
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
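
Dynamic padding means each batch is padded only to the length of its own longest sequence rather than to one global maximum. A rough sketch of the idea in plain Python, not the library's implementation; the token ids, `PAD_ID`, and the `pad_batch` helper are assumptions made for illustration:

```python
def pad_batch(batch, pad_id):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


# Hypothetical token ids (illustrative, not real tokenizer output)
batch_a = [[0, 11, 12, 2], [0, 21, 2]]
batch_b = [[0, 31, 32, 33, 34, 35, 2], [0, 41, 2]]

PAD_ID = 1  # assumed padding id for this sketch
ids_a, mask_a = pad_batch(batch_a, PAD_ID)
ids_b, mask_b = pad_batch(batch_b, PAD_ID)
print(len(ids_a[0]), len(ids_b[0]))  # each batch gets its own padded length
```

Short batches stay short, which saves computation compared with padding every example to the longest sequence in the whole dataset.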

In [19]:
# define evaluation metrics
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
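
With `average='macro'`, precision, recall, and F1 are computed per class and then averaged without weighting by class size. A hand-rolled sketch of that calculation, assuming toy labels and predictions (the `macro_prf` helper and the example data are hypothetical). A class that is never predicted has undefined precision, which is treated as 0 here; scikit-learn does the same by default and emits a warning:

```python
def macro_prf(labels, preds, num_classes):
    """Macro-averaged precision, recall, and F1 (unweighted mean over classes)."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0  # undefined case set to 0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = num_classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data: class 2 is never predicted
toy_labels = [0, 0, 1, 1, 2, 2]
toy_preds = [0, 0, 1, 0, 0, 0]
macro_p, macro_r, macro_f1 = macro_prf(toy_labels, toy_preds, num_classes=3)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f1, 4))
```

Because every class counts equally, a model that ignores one class is penalized even if that class is rare.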

In [None]:
# initialize pretrained model with updated classification head
model2 = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)

In [21]:
# set number of epochs and number of training examples
NUM_EPOCHS = 1
NUM_EXAMPLES = 400

In [22]:
# rename label column
tokenized_sentiment = tokenized_sentiment.rename_column("sentiment", "label")

In [None]:
from datasets import ClassLabel

# map raw labels 0, 2, 4 to class ids 0, 1, 2 (integer division keeps them integers)
def update_labels(example):
    example['label'] = example['label'] // 2
    return example

tokenized_sentiment = tokenized_sentiment.map(update_labels)

new_features = tokenized_sentiment['test'].features.copy()
new_features["label"] = ClassLabel(names=['neg', 'neu', 'pos'])
tokenized_sentiment['test'] = tokenized_sentiment['test'].cast(new_features)

In [None]:
# check features
tokenized_sentiment['test'].features
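
The label update maps the raw codes 0, 2, and 4 to consecutive class ids that line up with the `ClassLabel` names. A small sketch on made-up raw labels (the `raw_labels` list is hypothetical); it uses integer division so the ids stay integers, whereas `/` yields floats in Python 3:

```python
raw_labels = [0, 2, 4, 4, 0]   # hypothetical sample of raw sentiment codes
names = ['neg', 'neu', 'pos']  # same order as the ClassLabel names above

class_ids = [label // 2 for label in raw_labels]  # 0 -> 0, 2 -> 1, 4 -> 2
class_names = [names[i] for i in class_ids]       # look up the readable name

print(class_ids)
print(class_names)
print(type(4 / 2).__name__, type(4 // 2).__name__)  # true division vs integer division
```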

In [25]:
# train SieBERT
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=NUM_EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_sentiment["test"].select(range(0, NUM_EXAMPLES)),  # first NUM_EXAMPLES rows for training
    eval_dataset=tokenized_sentiment["test"].select(range(NUM_EXAMPLES, 498)),  # remaining rows up to index 498 for evaluation
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 400
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 25


Epoch  Training Loss  Validation Loss  Accuracy        F1  Precision    Recall
    1         No log         0.921058  0.683673  0.687327   0.719561  0.733903


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=25, training_loss=0.7060818481445312, metrics={'train_runtime': 13.4413, 'train_samples_per_second': 29.759, 'train_steps_per_second': 1.86, 'total_flos': 32064371807904.0, 'train_loss': 0.7060818481445312, 'epoch': 1.0})
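
The step count and throughput in the log follow from the training settings. A quick arithmetic check, assuming a single device and no gradient accumulation as reported in the log. It also suggests why the training loss column reads "No log": with only 25 steps, the run likely ends before the Trainer's default logging interval of 500 steps is reached:

```python
import math

num_examples = 400        # NUM_EXAMPLES
batch_size = 16           # per_device_train_batch_size
num_epochs = 1            # NUM_EPOCHS
train_runtime = 13.4413   # seconds, taken from the TrainOutput above

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps, round(samples_per_second, 3), round(steps_per_second, 2))
```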

In [26]:
# store evaluations for SieBERT
siebert_eval = trainer.evaluate()

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16


In [None]:
# specify path of pretrained model
checkpoint = "roberta-large"  # RoBERTa-large

# load pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# initialize pretrained model with updated classification head
model3 = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=NUM_LABELS, ignore_mismatched_sizes=True)

In [28]:
# train RoBERTa
trainer = Trainer(
    model=model3,
    args=training_args,
    train_dataset=tokenized_sentiment["test"].select(range(0, NUM_EXAMPLES)),  # same training rows as for SieBERT
    eval_dataset=tokenized_sentiment["test"].select(range(NUM_EXAMPLES, 498)),  # same evaluation rows as for SieBERT
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

The following columns in the training set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 400
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 25


Epoch  Training Loss  Validation Loss  Accuracy        F1  Precision    Recall
    1         No log         1.128973  0.295918  0.252930   0.194259  0.362773


The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16
  _warn_prf(average, modifier, msg_start, len(result))


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=25, training_loss=1.1199967956542969, metrics={'train_runtime': 13.7375, 'train_samples_per_second': 29.117, 'train_steps_per_second': 1.82, 'total_flos': 32064371807904.0, 'train_loss': 1.1199967956542969, 'epoch': 1.0})

In [29]:
# store evaluations for RoBERTa
roberta_eval = trainer.evaluate()
models = ['SieBERT', 'RoBERTa']
accuracies = [siebert_eval['eval_accuracy'], roberta_eval['eval_accuracy']]
f1_scores = [siebert_eval['eval_f1'], roberta_eval['eval_f1']]

The following columns in the evaluation set  don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: date, text, user, query. If date, text, user, query are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 98
  Batch size = 16


  _warn_prf(average, modifier, msg_start, len(result))


In [30]:
# consolidate results (named eval_df to avoid shadowing Python's built-in eval)
eval_df = pd.DataFrame(list(zip(models, accuracies, f1_scores)), columns=['model', 'accuracy', 'f1_score'])

# display dataframe
eval_df

     model  accuracy  f1_score
0  SieBERT  0.683673  0.687327
1  RoBERTa  0.295918  0.252930
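
Because the evaluation split holds only 98 examples, the accuracies can be converted back into counts of correct predictions. A small sketch using the numbers reported above (the variable names are illustrative):

```python
NUM_EVAL = 98  # "Num examples = 98" in the evaluation logs
reported_accuracy = {"SieBERT": 0.683673, "RoBERTa": 0.295918}  # from the table above

# accuracy * number of examples, rounded to the nearest whole prediction
correct = {name: round(acc * NUM_EVAL) for name, acc in reported_accuracy.items()}
for name, n_correct in correct.items():
    print(f"{name}: {n_correct} of {NUM_EVAL} correct")

gap = correct["SieBERT"] - correct["RoBERTa"]
print(f"difference: {gap} examples")
```

With a sample this small, each prediction moves accuracy by about one percentage point, so the comparison is indicative rather than precise.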


Source: https://huggingface.co/
