# Step 2a: Fine-tuning a Sentiment Analysis Model

Next, we'll fine-tune a sentiment analysis model that classifies a tweet as positive, negative, or neutral.

In [1]:
!pip install transformers datasets

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting datasets
  Downloading datasets-2.13.1-py3-none-any.whl (486 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Download

In [2]:
!pip install accelerate -U

Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.21.0


I used the `sentiment` configuration of the [tweet_eval](https://huggingface.co/datasets/tweet_eval/viewer/sentiment/train) dataset from the Hugging Face Hub. It labels each tweet as 0 (negative), 1 (neutral), or 2 (positive).



In [63]:
from datasets import load_dataset

dataset = load_dataset("tweet_eval", 'sentiment')




In [67]:
train = dataset["train"]
eval = dataset["validation"]
test = dataset["test"]

Here's what the dataset looks like. I also renamed the `label` column to `labels`, which is the argument name the model expects:

In [None]:
train = train.rename_column('label', 'labels')
print(train)
print(train[0])

I used the bert-base-uncased tokenizer.

In [107]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [108]:
def tokenize(example):
    # Truncate to the model's maximum length; padding is left to the data collator below
    return tokenizer(example["text"], truncation=True)

train_tokenized = train.map(tokenize, batched=True)
eval_tokenized = eval.map(tokenize, batched=True)
test_tokenized = test.map(tokenize, batched=True)






In [109]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
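
The collator pads each training batch to the length of its longest sequence, so short tweets are not padded out to the longest tweet in the whole dataset. The snippet below is a minimal pure-Python sketch of that idea, not the library's actual implementation. The function name `pad_batch` and the token ids in the example are hypothetical; only the `[CLS]` (101), `[SEP]` (102), and `[PAD]` (0) ids match the bert-base-uncased vocabulary.

```python
# Illustrative sketch of dynamic padding; NOT how transformers implements it.
PAD_TOKEN_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(batch, pad_token_id=PAD_TOKEN_ID):
    """Pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two tweets of different lengths.
batch = [[101, 2307, 2154, 102], [101, 2023, 2003, 1037, 3231, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because padding happens per batch, a batch of short tweets stays short, which tends to save compute compared with padding everything to one fixed length.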

I fine-tuned a bert-base-uncased model. Its sequence classification head defaults to two labels, so I set the number of labels to 3 in the config.

In [110]:
from transformers import AutoModelForSequenceClassification, AutoConfig, TrainingArguments

model_name = 'bert-base-uncased'
config = AutoConfig.from_pretrained(model_name, num_labels=3)
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="tweet-sentiment-analyzer",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
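
With only `num_labels=3` in the config, the fine-tuned model reports generic names such as `LABEL_0` at inference time. One optional refinement, which the notebook above does not use, is to store readable label names in the config. The sketch below assumes the tweet_eval sentiment ids (0 = negative, 1 = neutral, 2 = positive); the commented-out call shows where the mappings would go.

```python
# Label ids for the tweet_eval "sentiment" configuration.
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {name: idx for idx, name in id2label.items()}

# Assumed usage, replacing the config line in the cell above:
# config = AutoConfig.from_pretrained(
#     model_name, num_labels=3, id2label=id2label, label2id=label2id
# )

print(label2id)
```

The mappings are saved with the model config, so anything that loads the model later can show `positive` instead of `LABEL_2`.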


In [111]:
from transformers import Trainer
from sklearn.metrics import accuracy_score

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=eval_tokenized,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=lambda pred: {"accuracy": accuracy_score(pred.label_ids, pred.predictions.argmax(-1))},
)
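
The lambda above reports accuracy only. As a hedged alternative, the sketch below is a named `compute_metrics` function that also returns macro-averaged recall, which weights the three classes equally and is, to my understanding, the metric the TweetEval benchmark reports for sentiment (an assumption worth checking against the benchmark's documentation). It is written in plain Python so it runs without scikit-learn; the key name `macro_recall` and the example logits are hypothetical.

```python
from types import SimpleNamespace


def compute_metrics(pred):
    """Accuracy and macro-averaged recall from logits and gold labels."""
    # Argmax over each row of logits; works for lists and NumPy arrays alike.
    predicted = [max(range(len(row)), key=lambda i: row[i]) for row in pred.predictions]
    gold = [int(label) for label in pred.label_ids]

    accuracy = sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    # Recall per class = correct predictions for the class / examples of the class.
    recalls = []
    for cls in sorted(set(gold)):
        total = sum(g == cls for g in gold)
        hits = sum(p == g == cls for p, g in zip(predicted, gold))
        recalls.append(hits / total)
    macro_recall = sum(recalls) / len(recalls)

    return {"accuracy": accuracy, "macro_recall": macro_recall}


# Tiny hand-made example: hypothetical logits for four tweets, three classes.
example = SimpleNamespace(
    predictions=[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.0, 0.9, 0.2]],
    label_ids=[0, 1, 2, 1],
)
metrics = compute_metrics(example)
print(metrics)
```

Passing `compute_metrics=compute_metrics` to `Trainer` would then log both values at each evaluation.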

In [112]:
# Train the model
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 0.6251 | 0.602797 | 0.7445 |
| 2 | 0.4424 | 0.685299 | 0.7445 |


TrainOutput(global_step=11404, training_loss=0.5565205108738915, metrics={'train_runtime': 24414.5194, 'train_samples_per_second': 3.737, 'train_steps_per_second': 0.467, 'total_flos': 4242639624189750.0, 'train_loss': 0.5565205108738915, 'epoch': 2.0})

In [113]:
# Evaluate the model on the test set
eval_results = trainer.evaluate(test_tokenized)

# Print the accuracy
print(f"Test Accuracy: {eval_results['eval_accuracy']}")

Test Accuracy: 0.7030283295343537


Now I'll push the fine-tuned model to the Hugging Face Hub so we can access it in Step 3.

In [114]:
from huggingface_hub import notebook_login
notebook_login()


In [117]:
model_name = "Sentiment-Analyzer"

model.push_to_hub(model_name,
                  commit_message="Tweet Sentiment Analyzer Model",
                  private=False)


CommitInfo(commit_url='https://huggingface.co/mayapapaya/Sentiment-Analyzer/commit/c55342e2a7f95c551597c01f3a6fdd3c2f47dbe0', commit_message='Tweet Sentiment Analyzer Model', commit_description='', oid='c55342e2a7f95c551597c01f3a6fdd3c2f47dbe0', pr_url=None, pr_revision=None, pr_num=None)

Great! Now, we have our sentiment analysis model.