# Basic tutorial

This section follows the DataCamp tutorial by Ferrer (2024).

In [1]:
!pip install datasets
!pip install transformers
import numpy as np
import pandas as pd
from datasets import load_dataset

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])


Generating train split:   0%|          | 0/27481 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3534 [00:00<?, ? examples/s]

In [3]:
from transformers import GPT2Tokenizer

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
   return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
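
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what those two options do to a single sequence of token ids. The helper `pad_or_truncate` and the toy ids are hypothetical and only for illustration; the real tokenizer handles this internally. For GPT-2 the maximum length is 1024 tokens, so every tweet is padded out to that length, which is one reason training is slow.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # right padding with the pad token id
    mask += [0] * n_pad                  # 0 marks padding
    return ids, mask


PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in the cell above
ids, mask = pad_or_truncate([15496, 995], max_length=6, pad_id=PAD_ID)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, so the padding affects compute cost far more than it affects the prediction.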


In [24]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [4]:
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
!pip install numpy scikit-learn
import numpy as np



In [2]:
from sklearn.metrics import balanced_accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    balanced_acc = balanced_accuracy_score(labels, predictions)
    return {"balanced_accuracy": balanced_acc}
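
As a rough illustration of what `balanced_accuracy_score` reports, the sketch below computes the mean of the per-class recalls in plain Python. The function `balanced_accuracy` and the toy labels are hypothetical and stand in for the scikit-learn call; they are not part of the tutorial.

```python
from collections import defaultdict


def balanced_accuracy(labels, predictions):
    """Mean of per-class recall over the classes that appear in labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        if y == p:
            correct[y] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example with an over-represented class 0
labels = [0, 0, 0, 0, 1, 1, 2, 2]
predictions = [0, 0, 0, 0, 1, 0, 2, 1]

plain_accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
print(plain_accuracy)                          # fraction of all examples that are correct
print(balanced_accuracy(labels, predictions))  # each class counts equally
```

In this toy case plain accuracy is 0.75, while balanced accuracy is lower because the two smaller classes are each only half right. That is why the metric suits a three-class sentiment task where the classes may not be equally frequent.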

In [7]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()


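| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | No log | 0.95174 | 0.680234 |
| 2 | 0.955900 | 0.868013 | 0.708551 |
| 3 | 0.955900 | 1.06399 | 0.739895 |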


TrainOutput(global_step=750, training_loss=0.8000375264485677, metrics={'train_runtime': 1079.5845, 'train_samples_per_second': 2.779, 'train_steps_per_second': 0.695, 'total_flos': 1567794659328000.0, 'train_loss': 0.8000375264485677, 'epoch': 3.0})

# Adjusting training and testing set sizes

## 5000 train set

In [25]:
small_train_dataset1 = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
small_eval_dataset1 = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [26]:
from transformers import GPT2ForSequenceClassification

model1 = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [27]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model1,
   args=training_args,
   train_dataset=small_train_dataset1,
   eval_dataset=small_eval_dataset1,
   compute_metrics=compute_metrics,

)

trainer.train()



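| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | 0.7502 | 0.634055 | 0.741388 |
| 2 | 0.6822 | 0.725511 | 0.774971 |
| 3 | 0.5539 | 1.329678 | 0.778118 |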


TrainOutput(global_step=3750, training_loss=0.6817633341471354, metrics={'train_runtime': 1877.2439, 'train_samples_per_second': 7.99, 'train_steps_per_second': 1.998, 'total_flos': 7838973296640000.0, 'train_loss': 0.6817633341471354, 'epoch': 3.0})

## 10,000 train set

In [38]:
small_train_dataset2 = tokenized_datasets["train"].shuffle(seed=42).select(range(10000))
small_eval_dataset2 = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [39]:
from transformers import GPT2ForSequenceClassification

model2 = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [40]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model2,
   args=training_args,
   train_dataset=small_train_dataset2,
   eval_dataset=small_eval_dataset2,
   compute_metrics=compute_metrics,

)

trainer.train()



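| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | 0.6951 | 0.781725 | 0.759567 |
| 2 | 0.6709 | 0.930331 | 0.783953 |
| 3 | 0.5515 | 1.273375 | 0.785005 |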


TrainOutput(global_step=7500, training_loss=0.6811662231445312, metrics={'train_runtime': 3971.2997, 'train_samples_per_second': 7.554, 'train_steps_per_second': 1.889, 'total_flos': 1.567794659328e+16, 'train_loss': 0.6811662231445312, 'epoch': 3.0})

As expected and established in previous projects, simply increasing the size of the training set does not resolve the problem: in both runs validation loss rises from epoch to epoch while training loss falls, and balanced accuracy plateaus at about 78%, indicating that overfitting might be occurring in these cases.
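
One simple way to flag this pattern is to look for epochs where training loss fell while validation loss rose. The sketch below does this with the losses logged for the 5,000-example run; the helper `diverging_epochs` is hypothetical and is only a heuristic, not a formal test for overfitting.

```python
def diverging_epochs(train_loss, val_loss):
    """Return the 1-based epochs where training loss fell but validation loss rose."""
    flagged = []
    for i in range(1, len(train_loss)):
        if train_loss[i] < train_loss[i - 1] and val_loss[i] > val_loss[i - 1]:
            flagged.append(i + 1)
    return flagged


# Values copied from the 5,000-example run above
train_loss = [0.7502, 0.6822, 0.5539]
val_loss = [0.634055, 0.725511, 1.329678]

print(diverging_epochs(train_loss, val_loss))
```

Both the second and third epochs are flagged for this run, and the 10,000-example run shows the same pattern.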

# Implementing early stopping

In [35]:
small_train_dataset3 = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset3 = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [36]:
from transformers import GPT2ForSequenceClassification

model3 = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [37]:
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback

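early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=3,      # stop after 3 evaluations in a row without improvement
    early_stopping_threshold=0.01   # an improvement must exceed this absolute amount to count
)

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   save_strategy="epoch",           # must match evaluation_strategy to load the best model
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4,
   load_best_model_at_end=True,     # needed by EarlyStoppingCallback; the metric defaults to validation loss
   num_train_epochs=15              # upper bound; early stopping may end training sooner
   )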


trainer = Trainer(
   model=model3,
   args=training_args,
   train_dataset=small_train_dataset3,
   eval_dataset=small_eval_dataset3,
   compute_metrics=compute_metrics,
   callbacks=[early_stopping_callback]
)

trainer.train()



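| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | No log | 0.753506 | 0.725744 |
| 2 | 1.021200 | 0.976755 | 0.716443 |
| 3 | 1.021200 | 1.109429 | 0.748622 |
| 4 | 0.637100 | 1.702237 | 0.738372 |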


TrainOutput(global_step=1000, training_loss=0.8291336669921875, metrics={'train_runtime': 717.2079, 'train_samples_per_second': 20.914, 'train_steps_per_second': 5.229, 'total_flos': 2090392879104000.0, 'train_loss': 0.8291336669921875, 'epoch': 4.0})

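With the other parameters unchanged from the basic tutorial, training stopped after the fourth epoch. Validation loss, the metric that `EarlyStoppingCallback` monitors by default when `load_best_model_at_end=True`, never improved on its epoch 1 value by more than the 0.01 threshold in the three evaluations that followed. Balanced accuracy stayed between about 72% and 75% throughout.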
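
The patience logic can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that the monitored metric is validation loss (lower is better); `simulate_early_stopping` is a hypothetical helper, not the library's implementation.

```python
def simulate_early_stopping(eval_losses, patience, threshold):
    """Return the 1-based epoch after which training would stop, or None."""
    best = None
    bad_evals = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        # An evaluation counts as an improvement only if it beats the best
        # loss so far by more than the threshold.
        if best is None or (loss < best and best - loss > threshold):
            bad_evals = 0
        else:
            bad_evals += 1
        if best is None or loss < best:
            best = loss
        if bad_evals >= patience:
            return epoch
    return None


# Validation losses copied from the run above
eval_losses = [0.753506, 0.976755, 1.109429, 1.702237]
print(simulate_early_stopping(eval_losses, patience=3, threshold=0.01))
```

With these losses the counter reaches the patience of 3 at the fourth evaluation, which is consistent with the run ending at epoch 4 even though `num_train_epochs` was set to 15.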

# Implementing pre-processing steps

Chong et al. (2014) added various preprocessing steps in their solution to the sentiment classification problem. Namely, they removed URLs, hashtags, and other elements of the tweets that might impact the ability of the model to properly identify the sentiment of the tweet. After examining the data from Hugging Face, it was apparent that some of these elements were included in this dataset. As such, URLs and hashtags were removed from the dataset prior to tokenization to examine whether this affects the balanced accuracy values.

In [21]:
import re
dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])
def preprocess_text(tweet):
    hashtag_pattern = r"#\w+" #hashtags removed
    url_pattern = r"http[s]?://\S+" #urls removed

    tweet["text"] = re.sub(hashtag_pattern, "", tweet["text"])
    tweet["text"] = re.sub(url_pattern, "", tweet["text"])
    return tweet

# Apply preprocessing to the dataset
preprocessed_dataset = dataset.map(preprocess_text)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
def tokenize_function(examples):
   return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = preprocessed_dataset.map(tokenize_function, batched=True)
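
To see what the two patterns do to a single tweet, the sketch below applies them to a made-up example. The function `clean_tweet` and the sample text are hypothetical. The final whitespace collapse is an addition that is not part of the cell above, which leaves the extra spaces in place.

```python
import re

HASHTAG_PATTERN = r"#\w+"          # same pattern as in the cell above
URL_PATTERN = r"http[s]?://\S+"    # same pattern as in the cell above


def clean_tweet(text):
    """Remove hashtags and URLs, then collapse the leftover whitespace."""
    text = re.sub(HASHTAG_PATTERN, "", text)
    text = re.sub(URL_PATTERN, "", text)
    # Assumption: whitespace normalization, not done in the original cell
    return " ".join(text.split())


sample = "Loving the new update! #happy http://example.com/a?b=1 see you"
print(clean_tweet(sample))
```

Note that the hashtag pattern removes the whole tag, including the word after `#`, so any sentiment carried by the tag itself (such as `#happy`) is lost along with it.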

In [9]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [10]:
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [14]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()



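| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | No log | 2.025805 | 0.727796 |
| 2 | 0.481700 | 2.007914 | 0.745244 |
| 3 | 0.481700 | 2.120335 | 0.729669 |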


TrainOutput(global_step=750, training_loss=0.4120229288736979, metrics={'train_runtime': 1040.6995, 'train_samples_per_second': 2.883, 'train_steps_per_second': 0.721, 'total_flos': 1567794659328000.0, 'train_loss': 0.4120229288736979, 'epoch': 3.0})

As seen in this instance, basic pre-processing of the dataset did not result in any significant increase in the balanced accuracy or any pattern of reduced training or validation loss over time. This could be an avenue for future work as greater efforts in pre-processing could possibly solve issues related to syntax elements specific to Tweet-based media.

# Adjusting accumulation steps

## 8 accumulation steps

In [4]:
small_train_dataset4 = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset4 = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [5]:
from transformers import GPT2ForSequenceClassification

model4 = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [8]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   logging_strategy="epoch",  # Add this line
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=8
   )


trainer = Trainer(
   model=model4,
   args=training_args,
   train_dataset=small_train_dataset4,
   eval_dataset=small_eval_dataset4,
   compute_metrics=compute_metrics
)
trainer.train()


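| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | 1.0199 | 0.832067 | 0.58141 |
| 2 | 0.6609 | 0.702951 | 0.718869 |
| 3 | 0.4779 | 0.694577 | 0.745489 |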


TrainOutput(global_step=375, training_loss=0.7195721944173177, metrics={'train_runtime': 1048.511, 'train_samples_per_second': 2.861, 'train_steps_per_second': 0.358, 'total_flos': 1567794659328000.0, 'train_loss': 0.7195721944173177, 'epoch': 3.0})

## 12 accumulation steps

In [45]:
small_train_dataset5 = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset5 = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [46]:
from transformers import GPT2ForSequenceClassification

model5 = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)



In [47]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   logging_strategy="epoch",
   per_device_train_batch_size=1,
   per_device_eval_batch_size=1,
   gradient_accumulation_steps=12
   )


trainer = Trainer(
   model=model5,
   args=training_args,
   train_dataset=small_train_dataset5,
   eval_dataset=small_eval_dataset5,
   compute_metrics=compute_metrics
)
trainer.train()



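| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 0 | 1.1075 | 1.016199 | 0.458067 |
| 1 | 0.8335 | 0.751468 | 0.708993 |
| 2 | 0.5634 | 0.683903 | 0.726397 |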


TrainOutput(global_step=249, training_loss=0.8348278826977833, metrics={'train_runtime': 983.3244, 'train_samples_per_second': 3.051, 'train_steps_per_second': 0.253, 'total_flos': 1561523480690688.0, 'train_loss': 0.8348278826977833, 'epoch': 2.988})

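With more accumulation steps, balanced accuracy after the first epoch is low (0.58 with 8 steps and 0.46 with 12), likely because fewer optimizer updates have been made by that point. It then climbs and plateaus at roughly 73% to 75%. Unlike in the other runs, validation loss decreases from epoch to epoch.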
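
The relationship between accumulation steps and the number of optimizer updates can be checked with a little arithmetic. The sketch below assumes a single device and floor division per epoch, an assumption that matches the `global_step` values reported in the outputs above; other library versions may round the last partial group differently. The helper `optimizer_steps` is hypothetical.

```python
def optimizer_steps(num_examples, per_device_batch_size, accumulation_steps, epochs):
    """Optimizer updates for a run, assuming one device and floor division per epoch."""
    batches_per_epoch = -(-num_examples // per_device_batch_size)  # ceiling division
    steps_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return steps_per_epoch * epochs


for accumulation in (4, 8, 12):
    effective_batch_size = 1 * accumulation  # per-device batch size times accumulation steps
    steps = optimizer_steps(1000, 1, accumulation, epochs=3)
    print(accumulation, effective_batch_size, steps)
```

With 1,000 training examples and a batch size of 1, this gives 750, 375, and 249 updates for 4, 8, and 12 accumulation steps. The 12-step case drops the last four examples of each epoch under this assumption, which is consistent with the logged epoch value of 2.988.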

# Comparing to BERT

In [3]:
from transformers import AutoTokenizer

# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer.pad_token = '[PAD]'
def tokenize_function(examples):
    tokens = tokenizer(examples["text"], padding="max_length", truncation=True)
    tokens["labels"] = examples["label"]
    return tokens

tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [4]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [5]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=3
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()


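| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | No log | 0.744724 | 0.727427 |
| 2 | 0.787200 | 1.030595 | 0.747094 |
| 3 | 0.787200 | 1.38255 | 0.748441 |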


TrainOutput(global_step=750, training_loss=0.7002465006510417, metrics={'train_runtime': 180.1113, 'train_samples_per_second': 16.656, 'train_steps_per_second': 4.164, 'total_flos': 789340253184000.0, 'train_loss': 0.7002465006510417, 'epoch': 3.0})

BERT performs comparably to GPT-2 in this case. It is important to note that GPU runtime errors occurred often with this model and were mitigated by using a TPU. Validation loss increases over the epochs, indicating that overfitting is likely occurring.

# Comparing to BERTweet

In [12]:
!pip3 install emoji==0.6.0
from transformers import AutoTokenizer
# Loading the dataset to train our model
dataset = load_dataset("mteb/tweet_sentiment_extraction")

# BERTweet defines its own pad token (<pad>), so it is not overridden with '[PAD]' here
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)
def tokenize_function(examples):
    # This tokenizer reports no predefined maximum length, so padding and
    # truncation are skipped unless max_length is passed explicitly
    tokens = tokenizer(examples["text"], padding="max_length", truncation=True)
    return tokens

tokenized_datasets = dataset.map(tokenize_function, batched=True)



Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

In [13]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base", num_labels=3)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at vinai/bertweet-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
   output_dir="test_trainer",
   evaluation_strategy="epoch",
   per_device_train_batch_size=1,  # Reduce batch size here
   per_device_eval_batch_size=1,    # Optionally, reduce for evaluation as well
   gradient_accumulation_steps=4
   )


trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=small_train_dataset,
   eval_dataset=small_eval_dataset,
   compute_metrics=compute_metrics,

)

trainer.train()



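| Epoch | Training Loss | Validation Loss | Balanced Accuracy |
|-------|---------------|-----------------|-------------------|
| 1 | No log | 0.737384 | 0.748322 |
| 2 | 0.727500 | 1.04838 | 0.75863 |
| 3 | 0.727500 | 1.160444 | 0.771868 |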


TrainOutput(global_step=750, training_loss=0.6096639709472657, metrics={'train_runtime': 132.3355, 'train_samples_per_second': 22.67, 'train_steps_per_second': 5.667, 'total_flos': 30744186189444.0, 'train_loss': 0.6096639709472657, 'epoch': 3.0})

Using BERTweet for this classification does not result in marked changes in balanced accuracy. It performs comparably to the other models, GPT-2 and BERT.

# Discussion

Overall, changing different parameters (the number of training epochs, the gradient accumulation steps, the size of the training set, and even the model itself) did not result in significant differences in balanced accuracy or in the pattern of training and validation loss over time. In general, validation loss worsened over the training epochs even as training loss fell, with the exception of the runs in which gradient accumulation steps were increased. Final balanced accuracy values remained between about 73% and 79% for GPT-2, BERT, and BERTweet. Kheiri and Karimi (2023) evaluated many models for their performance in tweet sentiment evaluation, and found that a variation of BERT performed at around 72% accuracy, while GPT-3.5 Turbo performed at 97% accuracy. For any future work with this idea, it may be prudent to develop more thorough pre-processing of the tweets, and to evaluate some of the models that the literature on this problem identifies as more promising.
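
To put the runs side by side, the sketch below collects the balanced accuracy logged at the last epoch of each run above and reports the lowest, the highest, and the spread between them. The dictionary labels are informal names chosen here for illustration.

```python
# Balanced accuracy at the last logged epoch of each run, copied from the tables above
final_balanced_accuracy = {
    "GPT-2, 1,000 examples": 0.739895,
    "GPT-2, 5,000 examples": 0.778118,
    "GPT-2, 10,000 examples": 0.785005,
    "GPT-2, early stopping (epoch 4)": 0.738372,
    "GPT-2, pre-processed text": 0.729669,
    "GPT-2, 8 accumulation steps": 0.745489,
    "GPT-2, 12 accumulation steps": 0.726397,
    "BERT": 0.748441,
    "BERTweet": 0.771868,
}

lowest = min(final_balanced_accuracy, key=final_balanced_accuracy.get)
highest = max(final_balanced_accuracy, key=final_balanced_accuracy.get)
spread = final_balanced_accuracy[highest] - final_balanced_accuracy[lowest]

print(f"lowest:  {lowest} ({final_balanced_accuracy[lowest]:.3f})")
print(f"highest: {highest} ({final_balanced_accuracy[highest]:.3f})")
print(f"spread:  {spread:.3f}")
```

The spread across all nine runs is under six percentage points, and each value comes from a single run with one random seed, so small differences between configurations should be read with caution.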

# References

Chong, W. Y., Selvaretnam, B., & Soon, L. K. (2014, December). Natural language processing for sentiment analysis: an exploratory analysis on tweets. In 2014 4th international conference on artificial intelligence with applications in engineering and technology (pp. 212-217). IEEE.

Ferrer, J. (2024, August 1). An introductory guide to fine-tuning LLMs. DataCamp. https://www.datacamp.com/tutorial/fine-tuning-large-language-models

Hugging Face. (n.d.). transformers. Hugging Face. https://huggingface.co/docs/transformers/en/index

Kheiri, K., & Karimi, H. (2023). SentimentGPT: Exploiting GPT for advanced sentiment analysis and its departure from current machine learning. arXiv preprint arXiv:2307.10234.



