# Natural Language Processing and Text Analytics Final Exam [KAN-CDSCO1002U]

Afnan El-Segaier (137863)  
Une Aspelin (152294)  
Kristin Sundby (167303)  
Marlin Haavengen (167342)



---



This section outlines the code for the **BERT** model:
*   Preprocessing
*   Building the model using the training dataset
*   Tokenization
*   Testing the model using the test dataset
*   Cross Validation
*   Tuning



## BERT Model

### **Step 1: Preprocessing**

In [None]:
# Importing the necessary libraries:
#pip install transformers
#pip install torch
#pip install optuna
import pandas as pd
import numpy as np
import re
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold
import optuna
from transformers import BertConfig  # The other transformers classes are already imported above.
from sklearn.model_selection import train_test_split



In [None]:
# Loading the training and test datasets:
df_train = pd.read_csv('isarcasm_train.csv')
df_test = pd.read_csv('isarcasm_test.csv')

In [None]:
# Renaming the columns in both train and test to tweet and label for consistency:
df_train.columns = ['tweet', 'label']
df_test.columns = ['tweet', 'label']

**Quick Data Inspection:**

In [None]:
df_train["label"].value_counts()

0    2601
1     867
Name: label, dtype: int64

In [None]:
df_test["label"].value_counts()

0    1200
1     200
Name: label, dtype: int64
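
The counts above show that both splits are imbalanced, and that the test set is more imbalanced than the training set. The short sketch below makes this explicit using only the printed counts. It assumes that label 1 marks the sarcastic tweets; the helper names are illustrative and not part of the notebook.

```python
# Label counts copied from the value_counts() outputs above.
train_counts = {0: 2601, 1: 867}
test_counts = {0: 1200, 1: 200}


def positive_share(counts):
    """Fraction of tweets with label 1 (assumed to mean sarcastic)."""
    return counts[1] / sum(counts.values())


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent label."""
    return max(counts.values()) / sum(counts.values())


print(f"Share of label 1, train: {positive_share(train_counts):.3f}")
print(f"Share of label 1, test:  {positive_share(test_counts):.3f}")
print(f"Majority baseline accuracy, test: {majority_baseline_accuracy(test_counts):.3f}")
```

Because always predicting label 0 already gives a high accuracy on the test set, per-class precision, recall and F1 are more informative than accuracy alone when judging the model.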

In [None]:
# Check for missing values in the train set
df_train.isnull().sum()

tweet    1
label    0
dtype: int64

In [None]:
# Check for missing values in the test set
df_test.isnull().sum()

tweet    0
label    0
dtype: int64

**Preparing The Data:**

In [None]:
# Drop the missing value in the train set
df_train.dropna(inplace=True)

In [None]:
# Combine the datasets to be used for cross-validation
df_full = pd.concat([df_train, df_test], ignore_index=True)
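
To give a feel for how the combined data would be partitioned during cross-validation, the sketch below splits row indices into contiguous, unshuffled folds, which is how scikit-learn's KFold behaves when shuffling is off. It is a pure-Python stand-in, not the notebook's cross-validation code, and the choice of 5 folds is an assumption made for illustration.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# 3467 training rows remain after dropping the missing tweet, plus 1400 test rows.
n_full = 3467 + 1400
folds = list(kfold_indices(n_full, n_splits=5))  # 5 folds is an assumed value
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)}, val={len(val_idx)}")
```

Since the test rows are appended after the training rows and the two splits have different label proportions, shuffling or stratifying the folds is probably preferable to contiguous folds.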

### **Step 2: Build The Model**

In [None]:
# Defining test and train data:
X_train = df_train['tweet']
y_train = df_train['label']
X_test = df_test['tweet']
y_test = df_test['label']

In [None]:
# Model Initialization:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load pre-trained BERT tokenizer:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Printing Model And Tokenizer For Inspection:**

In [None]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
print(tokenizer)

BertTokenizer(name_or_path='bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


### **Step 3: Tokenize The Data And Convert To PyTorch Dataset**

**Inspection of Token Lengths:**  
The longest tokenized tweet determines the max_length used in the tokenizer function.

In [None]:
# Find max token length
max_token_length_train = max(len(tokenizer.encode(tweet)) for tweet in df_train['tweet'])
max_token_length_test = max(len(tokenizer.encode(tweet)) for tweet in df_test['tweet'])
print("Maximum token length in training data:", max_token_length_train)
print("Maximum token length in test data:", max_token_length_test)


Maximum token length in training data: 116
Maximum token length in test data: 146


**Defining And Applying Tokenizer Function:**

In [None]:
# Defining a function that tokenizes the input texts with the BERT tokenizer.
# It returns a dictionary-like object with the input IDs, attention mask and token type IDs that BERT requires:

def tokenize_data(texts, max_length=150):  # 150 exceeds the longest tweet (146 tokens), so no tweet is truncated.
    return tokenizer(
        texts.tolist(),  # Convert the pandas Series to a plain Python list.
        padding="max_length",  # Pad every sequence to max_length.
        truncation=True,  # Truncate sequences that exceed max_length.
        max_length=max_length,  # Maximum length of the tokenized sequences.
        return_tensors="pt"  # Return the tokenized sequences as PyTorch tensors.
    )

In [None]:
# Applying the tokenizer function to the training and testing data:
train_encodings = tokenize_data(X_train)
test_encodings = tokenize_data(X_test)
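
The effect of padding to a fixed length can be illustrated without the tokenizer. The sketch below is a simplified, pure-Python imitation and not the Hugging Face implementation: `pad_or_truncate` is a hypothetical helper, and its truncation simply cuts the list, whereas the real tokenizer keeps the closing [SEP] token. The token IDs are those of the first training tweet shown further down.

```python
PAD_ID = 0  # ID of [PAD] according to the tokenizer printout above


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_length."""
    kept = token_ids[:max_length]  # simplified truncation (see note above)
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


first_tweet = [101, 1996, 2069, 2518, 1045, 2288, 2013, 2267, 2003, 1037,
               24689, 7959, 3170, 13449, 102]
input_ids, attention_mask = pad_or_truncate(first_tweet, max_length=150)
print("Sequence length:", len(input_ids))
print("Real tokens:", sum(attention_mask))
```

Most positions of a short tweet are padding, which is why the attention mask is needed to keep the padded positions out of the attention computation.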

**Converting Token IDs For Inspection:**

In [None]:
#  Printing the token IDs for the first tweet:
print("Token IDs for the first tweet:", train_encodings['input_ids'][0])

Token IDs for the first tweet: tensor([  101,  1996,  2069,  2518,  1045,  2288,  2013,  2267,  2003,  1037,
        24689,  7959,  3170, 13449,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     

In [None]:
# Converting token IDs back to tokens and printing the tokens for the first tweet:
print("Tokens for the first tweet:", tokenizer.convert_ids_to_tokens(train_encodings['input_ids'][0]))

Tokens for the first tweet: ['[CLS]', 'the', 'only', 'thing', 'i', 'got', 'from', 'college', 'is', 'a', 'caf', '##fe', '##ine', 'addiction', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PA
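
The output above shows WordPiece splitting the word "caffeine" into 'caf', '##fe' and '##ine', where the '##' prefix marks a continuation of the previous token. As a rough sketch, the pieces can be merged back into words as follows; `merge_wordpieces` is an illustrative helper, not a tokenizer method.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Join '##' continuation pieces onto the preceding token and drop special tokens."""
    words = []
    for tok in tokens:
        if tok in special:
            continue
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip the '##' marker and append to the last word
        else:
            words.append(tok)
    return words


tokens = ['[CLS]', 'the', 'only', 'thing', 'i', 'got', 'from', 'college',
          'is', 'a', 'caf', '##fe', '##ine', 'addiction', '[SEP]', '[PAD]', '[PAD]']
print(" ".join(merge_wordpieces(tokens)))
```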

In [None]:
# Move model to GPU if available
#device = torch.device(
#	'cuda') if torch.cuda.is_available() else torch.device('cpu')
#model = model.to(device)

**Defining And Applying The Dataset Class For PyTorch:**

In [None]:
# Defining the 'SarcasmDataset' class which converts tokenized data and labels into a format that can be used with PyTorch's data loaders.
class SarcasmDataset(Dataset):
    def __init__(self, encodings, labels):
        # Initializing the dataset with encodings and labels:
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # To retrieve an item by the index.
        item = {key: self.encodings[key][idx] for key in self.encodings}  # Get the tokenized input at the specified index.
        item['labels'] = self.labels[idx]  # Add the corresponding label to the item.
        return item  # Return the dictionary containing the tokenized input and label.

    def __len__(self):
        # Return the length of the dataset (the number of samples).
        return len(self.labels)  # The length is based on the number of labels.

In [None]:
# Applying the class to the tokenized data and labels:
train_dataset = SarcasmDataset(train_encodings, y_train.values)
test_dataset = SarcasmDataset(test_encodings, y_test.values)
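
To show what a single item of such a dataset looks like, the sketch below mirrors the indexing logic of SarcasmDataset with plain Python lists instead of tensors. `PlainSarcasmDataset` and the toy encodings are hypothetical and made up for illustration; they are not data from the notebook.

```python
class PlainSarcasmDataset:
    """Torch-free stand-in that mirrors the indexing logic of SarcasmDataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick the idx-th row from every encoding field, then attach the label.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


# Toy encodings with two short, made-up sequences padded to length 4.
toy_encodings = {
    "input_ids": [[101, 2023, 102, 0], [101, 2307, 2154, 102]],
    "token_type_ids": [[0, 0, 0, 0], [0, 0, 0, 0]],
    "attention_mask": [[1, 1, 1, 0], [1, 1, 1, 1]],
}
toy_labels = [0, 1]

dataset = PlainSarcasmDataset(toy_encodings, toy_labels)
print(len(dataset))
print(dataset[1])
```

Each item is a dictionary holding the model inputs together with a 'labels' entry, which lets the Trainer compute the loss from the labels without further configuration.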

### **Step 4: Train The Model On The Data**

In [None]:
#pip install transformers[torch] --upgrade

In [None]:
#pip install accelerate -U

In [None]:
# Defining the training arguments:
training_args = TrainingArguments(
    output_dir='./results',          # Where to save the model and logs
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=8,   # Batch size for each device during training
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,                # Log training info every 10 steps
    evaluation_strategy="no"         # No evaluation is run during training
)
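
The arguments above do not set a learning rate or scheduler, so the Trainer falls back on its defaults: a peak learning rate of 5e-5 with linear warmup followed by linear decay. The sketch below reproduces that schedule in plain Python to show how warmup_steps=500 interacts with the total number of steps. It assumes a single device and the 3467 training tweets that remain after dropping the missing row; the function names are illustrative.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of optimizer steps per epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)


def linear_warmup_decay_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


total_steps = 3 * steps_per_epoch(n_samples=3467, batch_size=8)
print("Total training steps:", total_steps)

for step in (10, 500, 510, 1300):
    lr = linear_warmup_decay_lr(step, peak_lr=5e-5, warmup_steps=500,
                                total_steps=total_steps)
    print(f"step {step:>4}: learning rate = {lr:.4e}")
```

With 1302 steps in total, the warmup covers more than the entire first epoch (434 steps), so the model trains at a reduced learning rate for over a third of the run.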

**Initializing the Trainer class from the Hugging Face library to set up the training process:**

In [None]:
# Specifying the model, arguments and dataset:
trainer = Trainer(
    model=model,                  # The pre-trained BERT-model to be fine-tuned
    args=training_args,           # The training arguments defined above
    train_dataset=train_dataset   # Specifying that the model is to be trained on the train dataset
)

**Starting the training process using the Trainer Class:**  
The train() method handles the entire training loop, including the forward and backward passes and the optimization steps over the training dataset.  
It applies the training arguments specified above.

In [None]:
trainer.train()  # Runs the full training loop with the arguments defined above.

  0%|          | 0/1302 [00:00<?, ?it/s]

  1%|          | 10/1302 [00:11<22:25,  1.04s/it]

{'loss': 0.6447, 'grad_norm': 5.92990255355835, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.02}


  2%|▏         | 20/1302 [00:19<17:36,  1.21it/s]

{'loss': 0.5957, 'grad_norm': 5.627193927764893, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.05}


  2%|▏         | 30/1302 [00:28<17:16,  1.23it/s]

{'loss': 0.5668, 'grad_norm': 3.492962598800659, 'learning_rate': 3e-06, 'epoch': 0.07}


  3%|▎         | 40/1302 [00:36<16:55,  1.24it/s]

{'loss': 0.6185, 'grad_norm': 2.959822416305542, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.09}


  4%|▍         | 50/1302 [00:44<16:46,  1.24it/s]

{'loss': 0.6046, 'grad_norm': 5.445382118225098, 'learning_rate': 5e-06, 'epoch': 0.12}


  5%|▍         | 60/1302 [00:52<16:46,  1.23it/s]

{'loss': 0.4519, 'grad_norm': 7.433879852294922, 'learning_rate': 6e-06, 'epoch': 0.14}


  5%|▌         | 70/1302 [01:00<16:39,  1.23it/s]

{'loss': 0.5792, 'grad_norm': 2.805103302001953, 'learning_rate': 7.000000000000001e-06, 'epoch': 0.16}


  6%|▌         | 80/1302 [01:08<16:33,  1.23it/s]

{'loss': 0.4907, 'grad_norm': 2.839106559753418, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.18}


  7%|▋         | 90/1302 [01:17<16:40,  1.21it/s]

{'loss': 0.5612, 'grad_norm': 2.847900629043579, 'learning_rate': 9e-06, 'epoch': 0.21}


  8%|▊         | 100/1302 [01:25<16:23,  1.22it/s]

{'loss': 0.6013, 'grad_norm': 2.530181646347046, 'learning_rate': 1e-05, 'epoch': 0.23}


  8%|▊         | 110/1302 [01:33<16:02,  1.24it/s]

{'loss': 0.5062, 'grad_norm': 5.800738334655762, 'learning_rate': 1.1000000000000001e-05, 'epoch': 0.25}


  9%|▉         | 120/1302 [01:41<15:53,  1.24it/s]

{'loss': 0.5615, 'grad_norm': 5.338832855224609, 'learning_rate': 1.2e-05, 'epoch': 0.28}


 10%|▉         | 130/1302 [01:49<15:45,  1.24it/s]

{'loss': 0.5024, 'grad_norm': 2.630504846572876, 'learning_rate': 1.3000000000000001e-05, 'epoch': 0.3}


 11%|█         | 140/1302 [01:58<16:03,  1.21it/s]

{'loss': 0.5777, 'grad_norm': 9.434591293334961, 'learning_rate': 1.4000000000000001e-05, 'epoch': 0.32}


 12%|█▏        | 150/1302 [02:06<15:25,  1.25it/s]

{'loss': 0.5572, 'grad_norm': 3.2998549938201904, 'learning_rate': 1.5e-05, 'epoch': 0.35}


 12%|█▏        | 160/1302 [02:14<15:14,  1.25it/s]

{'loss': 0.4851, 'grad_norm': 6.325292110443115, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.37}


 13%|█▎        | 170/1302 [02:22<15:05,  1.25it/s]

{'loss': 0.6711, 'grad_norm': 5.084110736846924, 'learning_rate': 1.7000000000000003e-05, 'epoch': 0.39}


 14%|█▍        | 180/1302 [02:30<15:07,  1.24it/s]

{'loss': 0.5784, 'grad_norm': 3.537839889526367, 'learning_rate': 1.8e-05, 'epoch': 0.41}


 15%|█▍        | 190/1302 [02:38<15:12,  1.22it/s]

{'loss': 0.5795, 'grad_norm': 4.488585948944092, 'learning_rate': 1.9e-05, 'epoch': 0.44}


 15%|█▌        | 200/1302 [02:46<14:55,  1.23it/s]

{'loss': 0.5921, 'grad_norm': 5.294689655303955, 'learning_rate': 2e-05, 'epoch': 0.46}


 16%|█▌        | 210/1302 [02:55<15:00,  1.21it/s]

{'loss': 0.5911, 'grad_norm': 7.786727428436279, 'learning_rate': 2.1e-05, 'epoch': 0.48}


 17%|█▋        | 220/1302 [03:03<14:46,  1.22it/s]

{'loss': 0.6464, 'grad_norm': 2.634347677230835, 'learning_rate': 2.2000000000000003e-05, 'epoch': 0.51}


 18%|█▊        | 230/1302 [03:11<14:33,  1.23it/s]

{'loss': 0.6393, 'grad_norm': 5.68317985534668, 'learning_rate': 2.3000000000000003e-05, 'epoch': 0.53}


 18%|█▊        | 240/1302 [03:19<14:17,  1.24it/s]

{'loss': 0.496, 'grad_norm': 2.9087421894073486, 'learning_rate': 2.4e-05, 'epoch': 0.55}


 19%|█▉        | 250/1302 [03:28<14:27,  1.21it/s]

{'loss': 0.6302, 'grad_norm': 2.4079952239990234, 'learning_rate': 2.5e-05, 'epoch': 0.58}


 20%|█▉        | 260/1302 [03:36<14:07,  1.23it/s]

{'loss': 0.5257, 'grad_norm': 3.604937791824341, 'learning_rate': 2.6000000000000002e-05, 'epoch': 0.6}


 21%|██        | 270/1302 [03:44<13:53,  1.24it/s]

{'loss': 0.5929, 'grad_norm': 4.377285957336426, 'learning_rate': 2.7000000000000002e-05, 'epoch': 0.62}


 22%|██▏       | 280/1302 [03:52<13:49,  1.23it/s]

{'loss': 0.5661, 'grad_norm': 6.665372371673584, 'learning_rate': 2.8000000000000003e-05, 'epoch': 0.65}


 22%|██▏       | 290/1302 [04:00<13:45,  1.23it/s]

{'loss': 0.4806, 'grad_norm': 7.129540920257568, 'learning_rate': 2.9e-05, 'epoch': 0.67}


 23%|██▎       | 300/1302 [04:08<13:36,  1.23it/s]

{'loss': 0.4943, 'grad_norm': 12.313594818115234, 'learning_rate': 3e-05, 'epoch': 0.69}


 24%|██▍       | 310/1302 [04:17<13:19,  1.24it/s]

{'loss': 0.4789, 'grad_norm': 2.921327590942383, 'learning_rate': 3.1e-05, 'epoch': 0.71}


 25%|██▍       | 320/1302 [04:25<13:26,  1.22it/s]

{'loss': 0.6878, 'grad_norm': 6.915173053741455, 'learning_rate': 3.2000000000000005e-05, 'epoch': 0.74}


 25%|██▌       | 330/1302 [04:33<13:05,  1.24it/s]

{'loss': 0.5615, 'grad_norm': 5.3234333992004395, 'learning_rate': 3.3e-05, 'epoch': 0.76}


 26%|██▌       | 340/1302 [04:41<13:05,  1.22it/s]

{'loss': 0.5417, 'grad_norm': 6.282633304595947, 'learning_rate': 3.4000000000000007e-05, 'epoch': 0.78}


 27%|██▋       | 350/1302 [04:49<12:50,  1.24it/s]

{'loss': 0.5289, 'grad_norm': 3.9189069271087646, 'learning_rate': 3.5e-05, 'epoch': 0.81}


 28%|██▊       | 360/1302 [04:58<12:55,  1.21it/s]

{'loss': 0.6458, 'grad_norm': 5.412281036376953, 'learning_rate': 3.6e-05, 'epoch': 0.83}


 28%|██▊       | 370/1302 [05:06<12:29,  1.24it/s]

{'loss': 0.5575, 'grad_norm': 5.554996490478516, 'learning_rate': 3.7e-05, 'epoch': 0.85}


 29%|██▉       | 380/1302 [05:14<12:29,  1.23it/s]

{'loss': 0.5543, 'grad_norm': 4.966384410858154, 'learning_rate': 3.8e-05, 'epoch': 0.88}


 30%|██▉       | 390/1302 [05:22<12:21,  1.23it/s]

{'loss': 0.6247, 'grad_norm': 3.095430850982666, 'learning_rate': 3.9000000000000006e-05, 'epoch': 0.9}


 31%|███       | 400/1302 [05:30<12:13,  1.23it/s]

{'loss': 0.5234, 'grad_norm': 14.44895076751709, 'learning_rate': 4e-05, 'epoch': 0.92}


 31%|███▏      | 410/1302 [05:38<12:06,  1.23it/s]

{'loss': 0.5755, 'grad_norm': 3.054457902908325, 'learning_rate': 4.1e-05, 'epoch': 0.94}


 32%|███▏      | 420/1302 [05:46<11:57,  1.23it/s]

{'loss': 0.423, 'grad_norm': 3.32279634475708, 'learning_rate': 4.2e-05, 'epoch': 0.97}


 33%|███▎      | 430/1302 [05:55<12:32,  1.16it/s]

{'loss': 0.5368, 'grad_norm': 4.439864158630371, 'learning_rate': 4.3e-05, 'epoch': 0.99}


 34%|███▍      | 440/1302 [06:03<11:38,  1.23it/s]

{'loss': 0.4615, 'grad_norm': 3.3408291339874268, 'learning_rate': 4.4000000000000006e-05, 'epoch': 1.01}


 35%|███▍      | 450/1302 [06:11<11:25,  1.24it/s]

{'loss': 0.6173, 'grad_norm': 6.056369304656982, 'learning_rate': 4.5e-05, 'epoch': 1.04}


 35%|███▌      | 460/1302 [06:20<11:30,  1.22it/s]

{'loss': 0.5231, 'grad_norm': 6.068505764007568, 'learning_rate': 4.600000000000001e-05, 'epoch': 1.06}


 36%|███▌      | 470/1302 [06:28<11:17,  1.23it/s]

{'loss': 0.4869, 'grad_norm': 4.241458892822266, 'learning_rate': 4.7e-05, 'epoch': 1.08}


 37%|███▋      | 480/1302 [06:36<11:05,  1.24it/s]

{'loss': 0.5625, 'grad_norm': 10.584300994873047, 'learning_rate': 4.8e-05, 'epoch': 1.11}


 38%|███▊      | 490/1302 [06:44<11:01,  1.23it/s]

{'loss': 0.5938, 'grad_norm': 4.356757640838623, 'learning_rate': 4.9e-05, 'epoch': 1.13}


 38%|███▊      | 500/1302 [06:52<10:55,  1.22it/s]

{'loss': 0.625, 'grad_norm': 10.815343856811523, 'learning_rate': 5e-05, 'epoch': 1.15}


 39%|███▉      | 510/1302 [07:04<11:33,  1.14it/s]

{'loss': 0.5913, 'grad_norm': 3.3696529865264893, 'learning_rate': 4.937655860349127e-05, 'epoch': 1.18}


 40%|███▉      | 520/1302 [07:12<10:52,  1.20it/s]

{'loss': 0.5901, 'grad_norm': 3.796961545944214, 'learning_rate': 4.875311720698255e-05, 'epoch': 1.2}


 41%|████      | 530/1302 [07:21<10:48,  1.19it/s]

{'loss': 0.414, 'grad_norm': 4.983432769775391, 'learning_rate': 4.812967581047382e-05, 'epoch': 1.22}


 41%|████▏     | 540/1302 [07:29<10:29,  1.21it/s]

{'loss': 0.4214, 'grad_norm': 10.889023780822754, 'learning_rate': 4.750623441396509e-05, 'epoch': 1.24}


 42%|████▏     | 550/1302 [07:37<10:24,  1.20it/s]

{'loss': 0.5318, 'grad_norm': 6.002346992492676, 'learning_rate': 4.688279301745636e-05, 'epoch': 1.27}


 43%|████▎     | 560/1302 [07:46<10:11,  1.21it/s]

{'loss': 0.4022, 'grad_norm': 6.036193370819092, 'learning_rate': 4.6259351620947635e-05, 'epoch': 1.29}


 44%|████▍     | 570/1302 [07:54<10:21,  1.18it/s]

{'loss': 0.6014, 'grad_norm': 8.150571823120117, 'learning_rate': 4.5635910224438905e-05, 'epoch': 1.31}


 45%|████▍     | 580/1302 [08:03<10:02,  1.20it/s]

{'loss': 0.483, 'grad_norm': 5.583421230316162, 'learning_rate': 4.5012468827930175e-05, 'epoch': 1.34}


 45%|████▌     | 590/1302 [08:11<09:44,  1.22it/s]

{'loss': 0.4435, 'grad_norm': 6.6615471839904785, 'learning_rate': 4.438902743142145e-05, 'epoch': 1.36}


 46%|████▌     | 600/1302 [08:19<09:37,  1.22it/s]

{'loss': 0.6183, 'grad_norm': 4.727989196777344, 'learning_rate': 4.376558603491272e-05, 'epoch': 1.38}


 47%|████▋     | 610/1302 [08:28<09:30,  1.21it/s]

{'loss': 0.3996, 'grad_norm': 4.204909801483154, 'learning_rate': 4.314214463840399e-05, 'epoch': 1.41}


 48%|████▊     | 620/1302 [08:36<09:20,  1.22it/s]

{'loss': 0.5302, 'grad_norm': 4.595942974090576, 'learning_rate': 4.251870324189526e-05, 'epoch': 1.43}


 48%|████▊     | 630/1302 [08:44<09:03,  1.24it/s]

{'loss': 0.6236, 'grad_norm': 8.864283561706543, 'learning_rate': 4.189526184538654e-05, 'epoch': 1.45}


 49%|████▉     | 640/1302 [08:52<09:06,  1.21it/s]

{'loss': 0.6329, 'grad_norm': 5.174968242645264, 'learning_rate': 4.127182044887781e-05, 'epoch': 1.47}


 50%|████▉     | 650/1302 [09:01<09:16,  1.17it/s]

{'loss': 0.5595, 'grad_norm': 4.461569309234619, 'learning_rate': 4.064837905236908e-05, 'epoch': 1.5}


 51%|█████     | 660/1302 [09:09<08:45,  1.22it/s]

{'loss': 0.5058, 'grad_norm': 8.742401123046875, 'learning_rate': 4.0024937655860354e-05, 'epoch': 1.52}


 51%|█████▏    | 670/1302 [09:17<08:33,  1.23it/s]

{'loss': 0.543, 'grad_norm': 4.507143974304199, 'learning_rate': 3.9401496259351623e-05, 'epoch': 1.54}


 52%|█████▏    | 680/1302 [09:26<09:05,  1.14it/s]

{'loss': 0.4347, 'grad_norm': 5.999179363250732, 'learning_rate': 3.877805486284289e-05, 'epoch': 1.57}


 53%|█████▎    | 690/1302 [09:34<08:23,  1.22it/s]

{'loss': 0.5676, 'grad_norm': 2.725646734237671, 'learning_rate': 3.815461346633416e-05, 'epoch': 1.59}


 54%|█████▍    | 700/1302 [09:42<08:21,  1.20it/s]

{'loss': 0.4384, 'grad_norm': 5.908743858337402, 'learning_rate': 3.753117206982544e-05, 'epoch': 1.61}


 55%|█████▍    | 710/1302 [09:51<08:16,  1.19it/s]

{'loss': 0.6775, 'grad_norm': 7.658224582672119, 'learning_rate': 3.690773067331671e-05, 'epoch': 1.64}


 55%|█████▌    | 720/1302 [09:59<08:10,  1.19it/s]

{'loss': 0.4691, 'grad_norm': 4.6806559562683105, 'learning_rate': 3.628428927680798e-05, 'epoch': 1.66}


 56%|█████▌    | 730/1302 [10:08<08:02,  1.19it/s]

{'loss': 0.5333, 'grad_norm': 4.131410598754883, 'learning_rate': 3.5660847880299256e-05, 'epoch': 1.68}


 57%|█████▋    | 740/1302 [10:17<08:46,  1.07it/s]

{'loss': 0.5046, 'grad_norm': 5.0296759605407715, 'learning_rate': 3.5037406483790526e-05, 'epoch': 1.71}


 58%|█████▊    | 750/1302 [10:26<08:46,  1.05it/s]

{'loss': 0.447, 'grad_norm': 5.066383361816406, 'learning_rate': 3.4413965087281796e-05, 'epoch': 1.73}


 58%|█████▊    | 760/1302 [10:36<09:17,  1.03s/it]

{'loss': 0.3542, 'grad_norm': 8.390397071838379, 'learning_rate': 3.3790523690773065e-05, 'epoch': 1.75}


 59%|█████▉    | 770/1302 [10:46<08:26,  1.05it/s]

{'loss': 0.4919, 'grad_norm': 7.429685592651367, 'learning_rate': 3.316708229426434e-05, 'epoch': 1.77}


 60%|█████▉    | 780/1302 [10:54<06:54,  1.26it/s]

{'loss': 0.438, 'grad_norm': 5.88437557220459, 'learning_rate': 3.254364089775561e-05, 'epoch': 1.8}


 61%|██████    | 790/1302 [11:02<06:47,  1.26it/s]

{'loss': 0.5092, 'grad_norm': 4.1512346267700195, 'learning_rate': 3.192019950124688e-05, 'epoch': 1.82}


 61%|██████▏   | 800/1302 [11:10<06:40,  1.25it/s]

{'loss': 0.4974, 'grad_norm': 15.297371864318848, 'learning_rate': 3.129675810473816e-05, 'epoch': 1.84}


 62%|██████▏   | 810/1302 [11:18<06:38,  1.23it/s]

{'loss': 0.5595, 'grad_norm': 6.820730686187744, 'learning_rate': 3.067331670822943e-05, 'epoch': 1.87}


 63%|██████▎   | 820/1302 [18:01<49:00,  6.10s/it]    

{'loss': 0.5893, 'grad_norm': 11.999015808105469, 'learning_rate': 3.0049875311720698e-05, 'epoch': 1.89}


 64%|██████▎   | 830/1302 [18:10<07:40,  1.03it/s]

{'loss': 0.4458, 'grad_norm': 4.440852642059326, 'learning_rate': 2.942643391521197e-05, 'epoch': 1.91}


 65%|██████▍   | 840/1302 [18:18<06:07,  1.26it/s]

{'loss': 0.3709, 'grad_norm': 5.791650295257568, 'learning_rate': 2.880299251870324e-05, 'epoch': 1.94}


 65%|██████▌   | 850/1302 [18:26<05:59,  1.26it/s]

{'loss': 0.4428, 'grad_norm': 11.865669250488281, 'learning_rate': 2.8179551122194514e-05, 'epoch': 1.96}


 66%|██████▌   | 860/1302 [18:34<05:53,  1.25it/s]

{'loss': 0.4746, 'grad_norm': 8.407787322998047, 'learning_rate': 2.7556109725685787e-05, 'epoch': 1.98}


 67%|██████▋   | 870/1302 [18:42<05:23,  1.34it/s]

{'loss': 0.4081, 'grad_norm': 4.73918342590332, 'learning_rate': 2.6932668329177057e-05, 'epoch': 2.0}


 68%|██████▊   | 880/1302 [18:50<05:34,  1.26it/s]

{'loss': 0.1915, 'grad_norm': 3.917379140853882, 'learning_rate': 2.630922693266833e-05, 'epoch': 2.03}


 68%|██████▊   | 890/1302 [18:58<05:25,  1.27it/s]

{'loss': 0.3339, 'grad_norm': 6.213657379150391, 'learning_rate': 2.56857855361596e-05, 'epoch': 2.05}


 69%|██████▉   | 900/1302 [19:06<05:19,  1.26it/s]

{'loss': 0.3534, 'grad_norm': 19.390840530395508, 'learning_rate': 2.5062344139650874e-05, 'epoch': 2.07}


 70%|██████▉   | 910/1302 [19:14<05:10,  1.26it/s]

{'loss': 0.3125, 'grad_norm': 5.5154900550842285, 'learning_rate': 2.4438902743142143e-05, 'epoch': 2.1}


 71%|███████   | 920/1302 [19:22<05:06,  1.24it/s]

{'loss': 0.2389, 'grad_norm': 44.49089431762695, 'learning_rate': 2.3815461346633417e-05, 'epoch': 2.12}


 71%|███████▏  | 930/1302 [19:30<04:52,  1.27it/s]

{'loss': 0.3378, 'grad_norm': 2.303248882293701, 'learning_rate': 2.319201995012469e-05, 'epoch': 2.14}


 72%|███████▏  | 940/1302 [19:38<04:49,  1.25it/s]

{'loss': 0.4245, 'grad_norm': 2.4250645637512207, 'learning_rate': 2.256857855361596e-05, 'epoch': 2.17}


 73%|███████▎  | 950/1302 [19:46<04:39,  1.26it/s]

{'loss': 0.309, 'grad_norm': 4.453381538391113, 'learning_rate': 2.1945137157107233e-05, 'epoch': 2.19}


 74%|███████▎  | 960/1302 [19:55<04:42,  1.21it/s]

{'loss': 0.2011, 'grad_norm': 18.5045166015625, 'learning_rate': 2.1321695760598503e-05, 'epoch': 2.21}


 75%|███████▍  | 970/1302 [20:03<04:24,  1.26it/s]

{'loss': 0.5322, 'grad_norm': 5.756302356719971, 'learning_rate': 2.0698254364089776e-05, 'epoch': 2.24}


 75%|███████▌  | 980/1302 [20:11<04:25,  1.21it/s]

{'loss': 0.1978, 'grad_norm': 2.541327714920044, 'learning_rate': 2.0074812967581046e-05, 'epoch': 2.26}


 76%|███████▌  | 990/1302 [20:19<04:06,  1.27it/s]

{'loss': 0.1999, 'grad_norm': 15.596291542053223, 'learning_rate': 1.945137157107232e-05, 'epoch': 2.28}


 77%|███████▋  | 1000/1302 [20:27<03:59,  1.26it/s]

{'loss': 0.1393, 'grad_norm': 13.04025936126709, 'learning_rate': 1.8827930174563592e-05, 'epoch': 2.3}


 78%|███████▊  | 1010/1302 [20:37<04:01,  1.21it/s]

{'loss': 0.2423, 'grad_norm': 18.349170684814453, 'learning_rate': 1.8204488778054865e-05, 'epoch': 2.33}


 78%|███████▊  | 1020/1302 [20:45<03:43,  1.26it/s]

{'loss': 0.3277, 'grad_norm': 2.8709561824798584, 'learning_rate': 1.7581047381546135e-05, 'epoch': 2.35}


 79%|███████▉  | 1030/1302 [20:53<03:35,  1.26it/s]

{'loss': 0.1593, 'grad_norm': 2.095651388168335, 'learning_rate': 1.695760598503741e-05, 'epoch': 2.37}


 80%|███████▉  | 1040/1302 [21:01<03:26,  1.27it/s]

{'loss': 0.2238, 'grad_norm': 0.3645477592945099, 'learning_rate': 1.633416458852868e-05, 'epoch': 2.4}


 81%|████████  | 1050/1302 [21:09<03:18,  1.27it/s]

{'loss': 0.1876, 'grad_norm': 0.6350302696228027, 'learning_rate': 1.571072319201995e-05, 'epoch': 2.42}


 81%|████████▏ | 1060/1302 [21:17<03:14,  1.24it/s]

{'loss': 0.2529, 'grad_norm': 2.1427369117736816, 'learning_rate': 1.5087281795511225e-05, 'epoch': 2.44}


 82%|████████▏ | 1070/1302 [21:25<03:05,  1.25it/s]

{'loss': 0.1325, 'grad_norm': 18.09832763671875, 'learning_rate': 1.4463840399002496e-05, 'epoch': 2.47}


 83%|████████▎ | 1080/1302 [21:33<02:55,  1.27it/s]

{'loss': 0.2134, 'grad_norm': 18.542219161987305, 'learning_rate': 1.3840399002493768e-05, 'epoch': 2.49}


 84%|████████▎ | 1090/1302 [21:41<02:46,  1.27it/s]

{'loss': 0.4617, 'grad_norm': 29.316246032714844, 'learning_rate': 1.321695760598504e-05, 'epoch': 2.51}


 84%|████████▍ | 1100/1302 [21:49<02:39,  1.27it/s]

{'loss': 0.3128, 'grad_norm': 5.538759708404541, 'learning_rate': 1.259351620947631e-05, 'epoch': 2.53}


 85%|████████▌ | 1110/1302 [21:57<02:31,  1.27it/s]

{'loss': 0.2715, 'grad_norm': 0.5320150256156921, 'learning_rate': 1.197007481296758e-05, 'epoch': 2.56}


 86%|████████▌ | 1120/1302 [22:05<02:23,  1.27it/s]

{'loss': 0.2325, 'grad_norm': 4.110889911651611, 'learning_rate': 1.1346633416458854e-05, 'epoch': 2.58}


 87%|████████▋ | 1130/1302 [22:13<02:16,  1.26it/s]

{'loss': 0.1352, 'grad_norm': 0.1399904489517212, 'learning_rate': 1.0723192019950125e-05, 'epoch': 2.6}


 88%|████████▊ | 1140/1302 [22:21<02:09,  1.25it/s]

{'loss': 0.1174, 'grad_norm': 0.22517356276512146, 'learning_rate': 1.0099750623441397e-05, 'epoch': 2.63}


 88%|████████▊ | 1150/1302 [22:29<02:00,  1.26it/s]

{'loss': 0.1817, 'grad_norm': 0.1964261382818222, 'learning_rate': 9.476309226932668e-06, 'epoch': 2.65}


 89%|████████▉ | 1160/1302 [22:37<01:52,  1.26it/s]

{'loss': 0.3436, 'grad_norm': 0.30650174617767334, 'learning_rate': 8.85286783042394e-06, 'epoch': 2.67}


 90%|████████▉ | 1170/1302 [22:45<01:44,  1.26it/s]

{'loss': 0.3721, 'grad_norm': 4.262545585632324, 'learning_rate': 8.229426433915211e-06, 'epoch': 2.7}


 91%|█████████ | 1180/1302 [22:53<01:38,  1.24it/s]

{'loss': 0.2505, 'grad_norm': 49.222618103027344, 'learning_rate': 7.605985037406484e-06, 'epoch': 2.72}


 91%|█████████▏| 1190/1302 [23:01<01:29,  1.26it/s]

{'loss': 0.1123, 'grad_norm': 2.5640556812286377, 'learning_rate': 6.982543640897755e-06, 'epoch': 2.74}


 92%|█████████▏| 1200/1302 [23:09<01:21,  1.26it/s]

{'loss': 0.3647, 'grad_norm': 7.026347637176514, 'learning_rate': 6.359102244389027e-06, 'epoch': 2.76}


 93%|█████████▎| 1210/1302 [23:17<01:13,  1.25it/s]

{'loss': 0.0242, 'grad_norm': 0.2397855967283249, 'learning_rate': 5.735660847880299e-06, 'epoch': 2.79}


 94%|█████████▎| 1220/1302 [23:25<01:10,  1.16it/s]

{'loss': 0.4888, 'grad_norm': 0.32330232858657837, 'learning_rate': 5.112219451371572e-06, 'epoch': 2.81}


 94%|█████████▍| 1230/1302 [23:34<00:58,  1.24it/s]

{'loss': 0.2362, 'grad_norm': 4.474343776702881, 'learning_rate': 4.488778054862843e-06, 'epoch': 2.83}


 95%|█████████▌| 1240/1302 [23:42<00:50,  1.22it/s]

{'loss': 0.1952, 'grad_norm': 14.361903190612793, 'learning_rate': 3.865336658354115e-06, 'epoch': 2.86}


 96%|█████████▌| 1250/1302 [23:50<00:42,  1.23it/s]

{'loss': 0.1128, 'grad_norm': 29.49915313720703, 'learning_rate': 3.2418952618453866e-06, 'epoch': 2.88}


 97%|█████████▋| 1260/1302 [23:58<00:33,  1.24it/s]

{'loss': 0.5436, 'grad_norm': 22.390413284301758, 'learning_rate': 2.6184538653366586e-06, 'epoch': 2.9}


 98%|█████████▊| 1270/1302 [24:06<00:25,  1.23it/s]

{'loss': 0.2143, 'grad_norm': 14.416962623596191, 'learning_rate': 1.99501246882793e-06, 'epoch': 2.93}


 98%|█████████▊| 1280/1302 [24:14<00:17,  1.25it/s]

{'loss': 0.2126, 'grad_norm': 4.4092488288879395, 'learning_rate': 1.371571072319202e-06, 'epoch': 2.95}


 99%|█████████▉| 1290/1302 [24:23<00:09,  1.24it/s]

{'loss': 0.3919, 'grad_norm': 1.9017080068588257, 'learning_rate': 7.481296758104738e-07, 'epoch': 2.97}


100%|█████████▉| 1300/1302 [24:31<00:01,  1.25it/s]

{'loss': 0.1635, 'grad_norm': 0.7562901377677917, 'learning_rate': 1.2468827930174563e-07, 'epoch': 3.0}


100%|██████████| 1302/1302 [24:32<00:00,  1.13s/it]

{'train_runtime': 1472.646, 'train_samples_per_second': 7.063, 'train_steps_per_second': 0.884, 'train_loss': 0.4446687590353729, 'epoch': 3.0}





TrainOutput(global_step=1302, training_loss=0.4446687590353729, metrics={'train_runtime': 1472.646, 'train_samples_per_second': 7.063, 'train_steps_per_second': 0.884, 'total_flos': 801743580117000.0, 'train_loss': 0.4446687590353729, 'epoch': 3.0})

**Predict on Train And Test Data:**

In [None]:
# Predicting on the train set:
predictions_train = trainer.predict(train_dataset)

# Extracting the predicted labels:
pred_labels_train = np.argmax(predictions_train.predictions, axis=-1)

# Predicting on the test set:
predictions_test = trainer.predict(test_dataset)

# Extracting the predicted labels:
pred_labels_test = np.argmax(predictions_test.predictions, axis=-1)

100%|██████████| 434/434 [01:28<00:00,  4.91it/s]
100%|██████████| 175/175 [00:35<00:00,  4.95it/s]
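
The label extraction above relies on `predictions.predictions` holding one row of raw scores (logits) per tweet, with one column per class. The following is a minimal sketch of that step on made-up logits; the array values are hypothetical and are not taken from the trained model. It also shows an optional softmax, which is not part of the original code, to turn logits into probabilities.

```python
import numpy as np

# Hypothetical logits for three tweets; columns are [not sarcastic, sarcastic].
logits = np.array([
    [2.1, -1.3],
    [-0.4, 0.9],
    [0.2, 0.1],
])

# Same operation as in the cell above: pick the class with the highest score.
pred_labels = np.argmax(logits, axis=-1)

# Optional: a numerically stable softmax to read the scores as probabilities.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(pred_labels)
print(probs.round(3))
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities gives the same labels as taking the argmax of the logits.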


**Printing Confusion Matrices And Classification Reports For Evaluation:**

In [None]:
print("Classification Report Train set: \n", classification_report(y_train, pred_labels_train))
print("Confusion Matrix Train set: \n", confusion_matrix(y_train, pred_labels_train))

Classification Report Train set: 
               precision    recall  f1-score   support

           0       0.97      1.00      0.98      2600
           1       0.99      0.91      0.95       867

    accuracy                           0.98      3467
   macro avg       0.98      0.95      0.97      3467
weighted avg       0.98      0.98      0.97      3467

Confusion Matrix Train set: 
 [[2592    8]
 [  78  789]]


In [None]:
print("Classification Report Test set: \n", classification_report(y_test, pred_labels_test))
print("Confusion Matrix Test set: \n", confusion_matrix(y_test, pred_labels_test))

Classification Report Test set: 
               precision    recall  f1-score   support

           0       0.90      0.84      0.87      1200
           1       0.30      0.42      0.35       200

    accuracy                           0.78      1400
   macro avg       0.60      0.63      0.61      1400
weighted avg       0.81      0.78      0.79      1400

Confusion Matrix Test set: 
 [[1005  195]
 [ 115   85]]


The model performs very well on the training data (F1-score of 0.95 for the sarcastic class) but much worse on the test data (F1-score of 0.35 for the same class), which suggests overfitting. We therefore perform Cross Validation and Hyperparameter Tuning.
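
To make the gap between the two sets concrete, the sketch below recomputes precision, recall and F1 for the sarcastic class (label 1) directly from the two confusion matrices printed above. The helper name `positive_class_metrics` is our own and only serves as an illustration; the numbers are copied from the outputs.

```python
# Confusion matrices copied from the outputs above
# (rows = true label, columns = predicted label).
cm_train = [[2592, 8], [78, 789]]
cm_test = [[1005, 195], [115, 85]]


def positive_class_metrics(cm):
    """Return precision, recall and F1 for class 1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


train_p, train_r, train_f1 = positive_class_metrics(cm_train)
test_p, test_r, test_f1 = positive_class_metrics(cm_test)

print(f"Train: precision={train_p:.2f}, recall={train_r:.2f}, F1={train_f1:.2f}")
print(f"Test:  precision={test_p:.2f}, recall={test_r:.2f}, F1={test_f1:.2f}")
print(f"F1 gap (train - test): {train_f1 - test_f1:.2f}")
```

On the test set, 195 of the 280 tweets predicted as sarcastic are false positives, which is why precision for class 1 drops so sharply even though overall accuracy remains at 0.78.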

### **Step 5: Cross Validation:**

**Utilizing K-Fold Cross-Validation:**
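
Plain `KFold` splits by position only, so the share of sarcastic tweets can differ from fold to fold; in the first four folds reported below, the validation support for class 1 ranges from 200 to 233. One possible alternative, which is not what this notebook uses, is scikit-learn's `StratifiedKFold`. The sketch below compares the two on a stand-in label vector. We assume 3,800 non-sarcastic and 1,067 sarcastic labels, i.e. the train and test counts shown earlier added together; the real `df_full` is not used here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Stand-in for df_full['label'] (assumed class counts, see the text above).
labels = np.array([0] * 3800 + [1] * 1067)
# Placeholder features: only the labels matter for how the folds are built.
features = np.zeros((len(labels), 1))


def positives_per_fold(splitter):
    """Count class-1 samples in the validation part of every fold."""
    return [int(labels[val_idx].sum()) for _, val_idx in splitter.split(features, labels)]


plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=42))
stratified = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold, sarcastic tweets per validation fold:          ", plain)
print("StratifiedKFold, sarcastic tweets per validation fold:", stratified)
```

With stratification, every validation fold contains nearly the same number of sarcastic tweets, which makes the per-fold scores easier to compare. It does not by itself prevent a model from predicting only the majority class.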

In [None]:
# Initializing the KFold cross-validator with 5 splits, shuffling the data before splitting, and setting a random seed for reproducibility:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

results = [] # List to store evaluation results for each fold
confusion_matrices_val = [] # List to store confusion matrices for validation data for each fold

# Loop over each fold generated by KFold
for fold, (train_index, val_index) in enumerate(kf.split(df_full)):
    # Split data into training and validation sets for the current fold
    X_train, X_val = df_full.iloc[train_index]['tweet'], df_full.iloc[val_index]['tweet']
    y_train, y_val = df_full.iloc[train_index]['label'], df_full.iloc[val_index]['label']

    # Tokenizing the training and validation data:
    train_encodings = tokenize_data(X_train)
    val_encodings = tokenize_data(X_val)

    # Preparing datasets for the Trainer:
    train_dataset = SarcasmDataset(train_encodings, y_train.values)
    val_dataset = SarcasmDataset(val_encodings, y_val.values)

    # Reinitialize the BERT model for each fold to ensure independence:
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

    # Defining training arguments including output directories, number of epochs, batch size, evaluation strategy, logging settings, and learning rate:
    training_args = TrainingArguments(
        output_dir=f'./results_fold_{fold}', # Output directory for model checkpoints and other files
        num_train_epochs=3, # Number of training epochs = 3.
        per_device_train_batch_size=8, # Batch size for training.
        evaluation_strategy='epoch',  # Evaluate at the end of each epoch.
        logging_dir=f'./logs_fold_{fold}', # Directory for storing logs
        logging_steps=50, # Log every 50 steps
        learning_rate=5e-5, # Learning rate for the optimizer
    )

    # Initialize the Trainer with the model, training arguments, datasets, and optionally metrics computation:
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=None  # (Optionally define a compute_metrics function for evaluation metrics).
    )

    # Train the model on the training dataset:
    trainer.train()
    # Evaluating the model on the validation dataset:
    eval_result = trainer.evaluate()

    # Predicting on the validation dataset to obtain raw model predictions:
    predictions = trainer.predict(val_dataset)
    preds = np.argmax(predictions.predictions, axis=-1)

    # Computing the confusion matrix and a detailed classification report for the validation data predictions:
    cm_val = confusion_matrix(y_val, preds)
    report_val = classification_report(y_val, preds)

    # Storing the evaluation result and confusion matrix for this fold:
    results.append(eval_result)
    confusion_matrices_val.append(cm_val)

    print(f"Confusion Matrix for validation data for fold {fold+1}:\n{cm_val}\n")
    print(f"Classification Report for validation data for fold {fold+1}:\n{report_val}\n")

# After completing all folds, print the results for each fold:
for fold, result in enumerate(results):
    print(f"Fold {fold+1} results: {result}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1461 [00:39<18:40,  1.26it/s]

{'loss': 0.549, 'grad_norm': 2.5183486938476562, 'learning_rate': 4.8288843258042436e-05, 'epoch': 0.1}


  7%|▋         | 100/1461 [01:21<19:02,  1.19it/s]

{'loss': 0.5109, 'grad_norm': 9.904523849487305, 'learning_rate': 4.657768651608487e-05, 'epoch': 0.21}


 10%|█         | 150/1461 [04:48<16:55,  1.29it/s]   

{'loss': 0.5722, 'grad_norm': 4.253085613250732, 'learning_rate': 4.486652977412731e-05, 'epoch': 0.31}


 14%|█▎        | 200/1461 [10:09<16:29,  1.27it/s]   

{'loss': 0.5247, 'grad_norm': 14.571314811706543, 'learning_rate': 4.315537303216975e-05, 'epoch': 0.41}


 17%|█▋        | 250/1461 [25:52<15:52,  1.27it/s]    

{'loss': 0.5722, 'grad_norm': 5.9519124031066895, 'learning_rate': 4.1444216290212186e-05, 'epoch': 0.51}


 21%|██        | 300/1461 [35:00<19:55,  1.03s/it]    

{'loss': 0.4792, 'grad_norm': 7.611654758453369, 'learning_rate': 3.973305954825462e-05, 'epoch': 0.62}


 24%|██▍       | 350/1461 [35:38<14:23,  1.29it/s]

{'loss': 0.5303, 'grad_norm': 7.1768388748168945, 'learning_rate': 3.802190280629706e-05, 'epoch': 0.72}


 27%|██▋       | 400/1461 [36:17<13:43,  1.29it/s]

{'loss': 0.5663, 'grad_norm': 2.321303606033325, 'learning_rate': 3.6310746064339495e-05, 'epoch': 0.82}


 31%|███       | 450/1461 [36:56<13:02,  1.29it/s]

{'loss': 0.5479, 'grad_norm': 1.1229898929595947, 'learning_rate': 3.459958932238193e-05, 'epoch': 0.92}


                                                  
 33%|███▎      | 487/1461 [37:47<11:51,  1.37it/s]

{'eval_loss': 0.6251915097236633, 'eval_runtime': 22.8773, 'eval_samples_per_second': 42.575, 'eval_steps_per_second': 5.333, 'epoch': 1.0}


 34%|███▍      | 500/1461 [37:57<13:55,  1.15it/s]  

{'loss': 0.5295, 'grad_norm': 3.3769092559814453, 'learning_rate': 3.288843258042437e-05, 'epoch': 1.03}


 38%|███▊      | 550/1461 [42:39<12:02,  1.26it/s]   

{'loss': 0.4739, 'grad_norm': 2.255213737487793, 'learning_rate': 3.117727583846681e-05, 'epoch': 1.13}


 41%|████      | 600/1461 [43:18<11:04,  1.29it/s]

{'loss': 0.5545, 'grad_norm': 3.07189679145813, 'learning_rate': 2.9466119096509244e-05, 'epoch': 1.23}


 44%|████▍     | 650/1461 [43:57<10:34,  1.28it/s]

{'loss': 0.4961, 'grad_norm': 4.22857666015625, 'learning_rate': 2.775496235455168e-05, 'epoch': 1.33}


 48%|████▊     | 700/1461 [44:35<09:55,  1.28it/s]

{'loss': 0.5006, 'grad_norm': 1.8548916578292847, 'learning_rate': 2.6043805612594112e-05, 'epoch': 1.44}


 51%|█████▏    | 750/1461 [45:14<09:08,  1.30it/s]

{'loss': 0.5201, 'grad_norm': 2.5425729751586914, 'learning_rate': 2.433264887063655e-05, 'epoch': 1.54}


 55%|█████▍    | 800/1461 [45:53<08:33,  1.29it/s]

{'loss': 0.5242, 'grad_norm': 5.30844783782959, 'learning_rate': 2.262149212867899e-05, 'epoch': 1.64}


 58%|█████▊    | 850/1461 [54:45<18:11,  1.79s/it]    

{'loss': 0.5777, 'grad_norm': 4.785468578338623, 'learning_rate': 2.0910335386721424e-05, 'epoch': 1.75}


 62%|██████▏   | 900/1461 [55:24<07:13,  1.29it/s]

{'loss': 0.5096, 'grad_norm': 6.0311079025268555, 'learning_rate': 1.919917864476386e-05, 'epoch': 1.85}


 65%|██████▌   | 950/1461 [56:03<06:33,  1.30it/s]

{'loss': 0.5533, 'grad_norm': 9.902678489685059, 'learning_rate': 1.74880219028063e-05, 'epoch': 1.95}


                                                  
 67%|██████▋   | 974/1461 [56:44<05:48,  1.40it/s]

{'eval_loss': 0.5511655211448669, 'eval_runtime': 22.8612, 'eval_samples_per_second': 42.605, 'eval_steps_per_second': 5.337, 'epoch': 2.0}


 68%|██████▊   | 1000/1461 [57:13<08:05,  1.05s/it] 

{'loss': 0.5266, 'grad_norm': 2.8407340049743652, 'learning_rate': 1.5776865160848733e-05, 'epoch': 2.05}


 72%|███████▏  | 1050/1461 [58:00<05:30,  1.24it/s]

{'loss': 0.5292, 'grad_norm': 2.521515130996704, 'learning_rate': 1.406570841889117e-05, 'epoch': 2.16}


 75%|███████▌  | 1100/1461 [58:40<04:52,  1.24it/s]

{'loss': 0.5455, 'grad_norm': 2.9504432678222656, 'learning_rate': 1.2354551676933608e-05, 'epoch': 2.26}


 79%|███████▊  | 1150/1461 [59:21<04:08,  1.25it/s]

{'loss': 0.574, 'grad_norm': 2.114499092102051, 'learning_rate': 1.0643394934976045e-05, 'epoch': 2.36}


 82%|████████▏ | 1200/1461 [1:00:01<03:30,  1.24it/s]

{'loss': 0.5298, 'grad_norm': 9.144115447998047, 'learning_rate': 8.932238193018481e-06, 'epoch': 2.46}


 86%|████████▌ | 1250/1461 [1:00:41<02:51,  1.23it/s]

{'loss': 0.4639, 'grad_norm': 5.078207969665527, 'learning_rate': 7.2210814510609185e-06, 'epoch': 2.57}


 89%|████████▉ | 1300/1461 [1:01:22<02:09,  1.24it/s]

{'loss': 0.4979, 'grad_norm': 1.8541442155838013, 'learning_rate': 5.509924709103354e-06, 'epoch': 2.67}


 92%|█████████▏| 1350/1461 [1:02:03<01:30,  1.23it/s]

{'loss': 0.5023, 'grad_norm': 5.780013561248779, 'learning_rate': 3.7987679671457908e-06, 'epoch': 2.77}


 96%|█████████▌| 1400/1461 [1:02:43<00:49,  1.24it/s]

{'loss': 0.5487, 'grad_norm': 2.1060383319854736, 'learning_rate': 2.0876112251882273e-06, 'epoch': 2.87}


 99%|█████████▉| 1450/1461 [1:03:24<00:08,  1.22it/s]

{'loss': 0.5256, 'grad_norm': 4.658479690551758, 'learning_rate': 3.7645448323066393e-07, 'epoch': 2.98}


                                                     
100%|██████████| 1461/1461 [1:03:57<00:00,  2.63s/it]


{'eval_loss': 0.554473876953125, 'eval_runtime': 24.3736, 'eval_samples_per_second': 39.961, 'eval_steps_per_second': 5.005, 'epoch': 3.0}
{'train_runtime': 3837.9228, 'train_samples_per_second': 3.043, 'train_steps_per_second': 0.381, 'train_loss': 0.5292493969018455, 'epoch': 3.0}


100%|██████████| 122/122 [00:24<00:00,  4.99it/s]
100%|██████████| 122/122 [00:25<00:00,  4.81it/s]


Confusion Matrix for validation data for fold 1:
[[741   0]
 [233   0]]

Classification Report for validation data for fold 1:
              precision    recall  f1-score   support

           0       0.76      1.00      0.86       741
           1       0.00      0.00      0.00       233

    accuracy                           0.76       974
   macro avg       0.38      0.50      0.43       974
weighted avg       0.58      0.76      0.66       974




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1461 [00:41<19:23,  1.21it/s]

{'loss': 0.5387, 'grad_norm': 5.086459159851074, 'learning_rate': 4.8288843258042436e-05, 'epoch': 0.1}


  7%|▋         | 100/1461 [01:23<18:30,  1.23it/s]

{'loss': 0.5409, 'grad_norm': 7.689627170562744, 'learning_rate': 4.657768651608487e-05, 'epoch': 0.21}


 10%|█         | 150/1461 [02:04<17:40,  1.24it/s]

{'loss': 0.5644, 'grad_norm': 1.7159061431884766, 'learning_rate': 4.486652977412731e-05, 'epoch': 0.31}


 14%|█▎        | 200/1461 [02:49<20:30,  1.02it/s]

{'loss': 0.5175, 'grad_norm': 9.184463500976562, 'learning_rate': 4.315537303216975e-05, 'epoch': 0.41}


 17%|█▋        | 250/1461 [03:51<32:10,  1.59s/it]  

{'loss': 0.562, 'grad_norm': 2.179447889328003, 'learning_rate': 4.1444216290212186e-05, 'epoch': 0.51}


 21%|██        | 300/1461 [04:32<15:36,  1.24it/s]

{'loss': 0.4551, 'grad_norm': 4.094814777374268, 'learning_rate': 3.973305954825462e-05, 'epoch': 0.62}


 24%|██▍       | 350/1461 [05:13<14:56,  1.24it/s]

{'loss': 0.5481, 'grad_norm': 8.808262825012207, 'learning_rate': 3.802190280629706e-05, 'epoch': 0.72}


 27%|██▋       | 400/1461 [05:54<14:19,  1.23it/s]

{'loss': 0.5472, 'grad_norm': 2.233069896697998, 'learning_rate': 3.6310746064339495e-05, 'epoch': 0.82}


 31%|███       | 450/1461 [06:35<13:33,  1.24it/s]

{'loss': 0.5662, 'grad_norm': 4.893377780914307, 'learning_rate': 3.459958932238193e-05, 'epoch': 0.92}


 33%|███▎      | 487/1461 [07:05<13:16,  1.22it/s]
 33%|███▎      | 487/1461 [07:32<13:16,  1.22it/s]

{'eval_loss': 0.5349330902099609, 'eval_runtime': 26.7497, 'eval_samples_per_second': 36.412, 'eval_steps_per_second': 4.561, 'epoch': 1.0}


 34%|███▍      | 500/1461 [07:43<14:56,  1.07it/s]  

{'loss': 0.5575, 'grad_norm': 6.309485912322998, 'learning_rate': 3.288843258042437e-05, 'epoch': 1.03}


 38%|███▊      | 550/1461 [08:31<15:25,  1.02s/it]

{'loss': 0.533, 'grad_norm': 1.9990744590759277, 'learning_rate': 3.117727583846681e-05, 'epoch': 1.13}


 41%|████      | 600/1461 [09:20<12:54,  1.11it/s]

{'loss': 0.5842, 'grad_norm': 10.135369300842285, 'learning_rate': 2.9466119096509244e-05, 'epoch': 1.23}


 44%|████▍     | 650/1461 [10:03<11:04,  1.22it/s]

{'loss': 0.5294, 'grad_norm': 2.1002023220062256, 'learning_rate': 2.775496235455168e-05, 'epoch': 1.33}


 48%|████▊     | 700/1461 [10:44<10:19,  1.23it/s]

{'loss': 0.4958, 'grad_norm': 8.005758285522461, 'learning_rate': 2.6043805612594112e-05, 'epoch': 1.44}


 51%|█████▏    | 750/1461 [11:25<09:39,  1.23it/s]

{'loss': 0.5026, 'grad_norm': 2.737550973892212, 'learning_rate': 2.433264887063655e-05, 'epoch': 1.54}


 55%|█████▍    | 800/1461 [12:06<08:56,  1.23it/s]

{'loss': 0.4969, 'grad_norm': 6.171459197998047, 'learning_rate': 2.262149212867899e-05, 'epoch': 1.64}


 58%|█████▊    | 850/1461 [12:47<08:20,  1.22it/s]

{'loss': 0.583, 'grad_norm': 4.00252628326416, 'learning_rate': 2.0910335386721424e-05, 'epoch': 1.75}


 62%|██████▏   | 900/1461 [13:27<07:35,  1.23it/s]

{'loss': 0.5332, 'grad_norm': 7.345174789428711, 'learning_rate': 1.919917864476386e-05, 'epoch': 1.85}


 65%|██████▌   | 950/1461 [14:08<06:58,  1.22it/s]

{'loss': 0.5603, 'grad_norm': 10.913371086120605, 'learning_rate': 1.74880219028063e-05, 'epoch': 1.95}


 67%|██████▋   | 974/1461 [14:28<06:08,  1.32it/s]
 67%|██████▋   | 974/1461 [14:53<06:08,  1.32it/s]

{'eval_loss': 0.5239081382751465, 'eval_runtime': 25.2794, 'eval_samples_per_second': 38.529, 'eval_steps_per_second': 4.826, 'epoch': 2.0}


 68%|██████▊   | 1000/1461 [15:15<06:23,  1.20it/s] 

{'loss': 0.5177, 'grad_norm': 3.8249824047088623, 'learning_rate': 1.5776865160848733e-05, 'epoch': 2.05}


 72%|███████▏  | 1050/1461 [16:00<05:43,  1.20it/s]

{'loss': 0.5098, 'grad_norm': 7.132266521453857, 'learning_rate': 1.406570841889117e-05, 'epoch': 2.16}


 75%|███████▌  | 1100/1461 [16:42<04:56,  1.22it/s]

{'loss': 0.5395, 'grad_norm': 3.2823424339294434, 'learning_rate': 1.2354551676933608e-05, 'epoch': 2.26}


 79%|███████▊  | 1150/1461 [17:27<04:19,  1.20it/s]

{'loss': 0.5965, 'grad_norm': 2.701702356338501, 'learning_rate': 1.0643394934976045e-05, 'epoch': 2.36}


 82%|████████▏ | 1200/1461 [18:08<03:33,  1.22it/s]

{'loss': 0.5236, 'grad_norm': 2.912034749984741, 'learning_rate': 8.932238193018481e-06, 'epoch': 2.46}


 86%|████████▌ | 1250/1461 [18:50<02:54,  1.21it/s]

{'loss': 0.4975, 'grad_norm': 6.453081130981445, 'learning_rate': 7.2210814510609185e-06, 'epoch': 2.57}


 89%|████████▉ | 1300/1461 [19:31<02:12,  1.22it/s]

{'loss': 0.5093, 'grad_norm': 1.5080221891403198, 'learning_rate': 5.509924709103354e-06, 'epoch': 2.67}


 92%|█████████▏| 1350/1461 [20:12<01:31,  1.21it/s]

{'loss': 0.5527, 'grad_norm': 6.229341506958008, 'learning_rate': 3.7987679671457908e-06, 'epoch': 2.77}


 96%|█████████▌| 1400/1461 [20:53<00:51,  1.19it/s]

{'loss': 0.5587, 'grad_norm': 3.3003695011138916, 'learning_rate': 2.0876112251882273e-06, 'epoch': 2.87}


 99%|█████████▉| 1450/1461 [21:34<00:09,  1.22it/s]

{'loss': 0.5375, 'grad_norm': 6.001023292541504, 'learning_rate': 3.7645448323066393e-07, 'epoch': 2.98}


                                                   
100%|██████████| 1461/1461 [22:09<00:00,  1.10it/s]


{'eval_loss': 0.5243842005729675, 'eval_runtime': 25.3429, 'eval_samples_per_second': 38.433, 'eval_steps_per_second': 4.814, 'epoch': 3.0}
{'train_runtime': 1329.0299, 'train_samples_per_second': 8.788, 'train_steps_per_second': 1.099, 'train_loss': 0.5365577550231391, 'epoch': 3.0}


100%|██████████| 122/122 [00:25<00:00,  4.71it/s]
100%|██████████| 122/122 [00:26<00:00,  4.59it/s]


Confusion Matrix for validation data for fold 2:
[[762   0]
 [212   0]]

Classification Report for validation data for fold 2:
              precision    recall  f1-score   support

           0       0.78      1.00      0.88       762
           1       0.00      0.00      0.00       212

    accuracy                           0.78       974
   macro avg       0.39      0.50      0.44       974
weighted avg       0.61      0.78      0.69       974




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1461 [00:43<20:23,  1.15it/s]

{'loss': 0.5295, 'grad_norm': 4.460055351257324, 'learning_rate': 4.8288843258042436e-05, 'epoch': 0.1}


  7%|▋         | 100/1461 [01:26<19:14,  1.18it/s]

{'loss': 0.5331, 'grad_norm': 8.26445198059082, 'learning_rate': 4.657768651608487e-05, 'epoch': 0.21}


 10%|█         | 150/1461 [02:09<18:34,  1.18it/s]

{'loss': 0.5336, 'grad_norm': 3.914612054824829, 'learning_rate': 4.486652977412731e-05, 'epoch': 0.31}


 14%|█▎        | 200/1461 [02:51<17:22,  1.21it/s]

{'loss': 0.5252, 'grad_norm': 7.614934921264648, 'learning_rate': 4.315537303216975e-05, 'epoch': 0.41}


 17%|█▋        | 250/1461 [03:33<17:01,  1.19it/s]

{'loss': 0.5534, 'grad_norm': 2.6409201622009277, 'learning_rate': 4.1444216290212186e-05, 'epoch': 0.51}


 21%|██        | 300/1461 [04:17<18:38,  1.04it/s]

{'loss': 0.567, 'grad_norm': 1.8338356018066406, 'learning_rate': 3.973305954825462e-05, 'epoch': 0.62}


 24%|██▍       | 350/1461 [05:28<14:35,  1.27it/s]  

{'loss': 0.5433, 'grad_norm': 5.879697322845459, 'learning_rate': 3.802190280629706e-05, 'epoch': 0.72}


 27%|██▋       | 400/1461 [06:33<14:35,  1.21it/s]  

{'loss': 0.5689, 'grad_norm': 5.207136631011963, 'learning_rate': 3.6310746064339495e-05, 'epoch': 0.82}


 31%|███       | 450/1461 [07:14<13:51,  1.22it/s]

{'loss': 0.5349, 'grad_norm': 3.098618507385254, 'learning_rate': 3.459958932238193e-05, 'epoch': 0.92}


                                                  
 33%|███▎      | 487/1461 [08:09<12:44,  1.27it/s]

{'eval_loss': 0.5077562928199768, 'eval_runtime': 24.6519, 'eval_samples_per_second': 39.47, 'eval_steps_per_second': 4.949, 'epoch': 1.0}


 34%|███▍      | 500/1461 [08:20<14:42,  1.09it/s]  

{'loss': 0.535, 'grad_norm': 4.812981605529785, 'learning_rate': 3.288843258042437e-05, 'epoch': 1.03}


 38%|███▊      | 550/1461 [09:04<12:30,  1.21it/s]

{'loss': 0.5353, 'grad_norm': 7.288626670837402, 'learning_rate': 3.117727583846681e-05, 'epoch': 1.13}


 41%|████      | 600/1461 [09:46<11:46,  1.22it/s]

{'loss': 0.5275, 'grad_norm': 1.846017599105835, 'learning_rate': 2.9466119096509244e-05, 'epoch': 1.23}


 44%|████▍     | 650/1461 [10:27<11:06,  1.22it/s]

{'loss': 0.5261, 'grad_norm': 15.684773445129395, 'learning_rate': 2.775496235455168e-05, 'epoch': 1.33}


 48%|████▊     | 700/1461 [11:10<10:27,  1.21it/s]

{'loss': 0.5784, 'grad_norm': 10.517850875854492, 'learning_rate': 2.6043805612594112e-05, 'epoch': 1.44}


 51%|█████▏    | 750/1461 [11:52<09:55,  1.19it/s]

{'loss': 0.5361, 'grad_norm': 6.903830051422119, 'learning_rate': 2.433264887063655e-05, 'epoch': 1.54}


 55%|█████▍    | 800/1461 [12:38<09:51,  1.12it/s]

{'loss': 0.5231, 'grad_norm': 6.843202114105225, 'learning_rate': 2.262149212867899e-05, 'epoch': 1.64}


 58%|█████▊    | 850/1461 [13:23<08:54,  1.14it/s]

{'loss': 0.516, 'grad_norm': 7.916600704193115, 'learning_rate': 2.0910335386721424e-05, 'epoch': 1.75}


 62%|██████▏   | 900/1461 [14:10<08:06,  1.15it/s]

{'loss': 0.5388, 'grad_norm': 7.294417858123779, 'learning_rate': 1.919917864476386e-05, 'epoch': 1.85}


 65%|██████▌   | 950/1461 [14:55<07:29,  1.14it/s]

{'loss': 0.5131, 'grad_norm': 6.853280544281006, 'learning_rate': 1.74880219028063e-05, 'epoch': 1.95}


                                                  
 67%|██████▋   | 974/1461 [15:44<06:52,  1.18it/s]

{'eval_loss': 0.520832359790802, 'eval_runtime': 27.5352, 'eval_samples_per_second': 35.337, 'eval_steps_per_second': 4.431, 'epoch': 2.0}


 68%|██████▊   | 1000/1461 [16:08<07:12,  1.07it/s] 

{'loss': 0.5694, 'grad_norm': 7.594341278076172, 'learning_rate': 1.5776865160848733e-05, 'epoch': 2.05}


 72%|███████▏  | 1050/1461 [17:26<05:39,  1.21it/s]  

{'loss': 0.5411, 'grad_norm': 11.00889778137207, 'learning_rate': 1.406570841889117e-05, 'epoch': 2.16}


 75%|███████▌  | 1100/1461 [20:47<04:55,  1.22it/s]  

{'loss': 0.4738, 'grad_norm': 7.070241451263428, 'learning_rate': 1.2354551676933608e-05, 'epoch': 2.26}


 79%|███████▊  | 1150/1461 [30:58<04:16,  1.21it/s]    

{'loss': 0.4781, 'grad_norm': 4.2497029304504395, 'learning_rate': 1.0643394934976045e-05, 'epoch': 2.36}


 82%|████████▏ | 1200/1461 [36:32<04:24,  1.01s/it]  

{'loss': 0.4537, 'grad_norm': 3.4498586654663086, 'learning_rate': 8.932238193018481e-06, 'epoch': 2.46}


 86%|████████▌ | 1250/1461 [37:41<04:56,  1.40s/it]

{'loss': 0.4608, 'grad_norm': 13.16086196899414, 'learning_rate': 7.2210814510609185e-06, 'epoch': 2.57}


 89%|████████▉ | 1300/1461 [38:24<02:10,  1.23it/s]

{'loss': 0.4741, 'grad_norm': 9.075699806213379, 'learning_rate': 5.509924709103354e-06, 'epoch': 2.67}


 92%|█████████▏| 1350/1461 [39:05<01:31,  1.21it/s]

{'loss': 0.4951, 'grad_norm': 11.318077087402344, 'learning_rate': 3.7987679671457908e-06, 'epoch': 2.77}


 96%|█████████▌| 1400/1461 [39:46<00:49,  1.22it/s]

{'loss': 0.5176, 'grad_norm': 2.931446075439453, 'learning_rate': 2.0876112251882273e-06, 'epoch': 2.87}


 99%|█████████▉| 1450/1461 [40:27<00:09,  1.18it/s]

{'loss': 0.4459, 'grad_norm': 6.5088090896606445, 'learning_rate': 3.7645448323066393e-07, 'epoch': 2.98}


                                                   
100%|██████████| 1461/1461 [41:00<00:00,  1.68s/it]


{'eval_loss': 0.47629255056381226, 'eval_runtime': 24.152, 'eval_samples_per_second': 40.286, 'eval_steps_per_second': 5.051, 'epoch': 3.0}
{'train_runtime': 2460.9038, 'train_samples_per_second': 4.747, 'train_steps_per_second': 0.594, 'train_loss': 0.5215157891364232, 'epoch': 3.0}


100%|██████████| 122/122 [00:24<00:00,  5.02it/s]
100%|██████████| 122/122 [00:25<00:00,  4.87it/s]


Confusion Matrix for validation data for fold 3:
[[743  30]
 [168  32]]

Classification Report for validation data for fold 3:
              precision    recall  f1-score   support

           0       0.82      0.96      0.88       773
           1       0.52      0.16      0.24       200

    accuracy                           0.80       973
   macro avg       0.67      0.56      0.56       973
weighted avg       0.75      0.80      0.75       973




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1461 [00:42<21:07,  1.11it/s]

{'loss': 0.5359, 'grad_norm': 3.775261878967285, 'learning_rate': 4.8288843258042436e-05, 'epoch': 0.1}


  7%|▋         | 100/1461 [01:25<18:42,  1.21it/s]

{'loss': 0.554, 'grad_norm': 8.301746368408203, 'learning_rate': 4.657768651608487e-05, 'epoch': 0.21}


 10%|█         | 150/1461 [02:06<17:47,  1.23it/s]

{'loss': 0.5431, 'grad_norm': 4.537574768066406, 'learning_rate': 4.486652977412731e-05, 'epoch': 0.31}


 14%|█▎        | 200/1461 [02:48<17:14,  1.22it/s]

{'loss': 0.5356, 'grad_norm': 9.321693420410156, 'learning_rate': 4.315537303216975e-05, 'epoch': 0.41}


 17%|█▋        | 250/1461 [03:29<16:49,  1.20it/s]

{'loss': 0.483, 'grad_norm': 2.774860382080078, 'learning_rate': 4.1444216290212186e-05, 'epoch': 0.51}


 21%|██        | 300/1461 [04:10<15:48,  1.22it/s]

{'loss': 0.5626, 'grad_norm': 6.43848180770874, 'learning_rate': 3.973305954825462e-05, 'epoch': 0.62}


 24%|██▍       | 350/1461 [04:51<15:09,  1.22it/s]

{'loss': 0.4625, 'grad_norm': 3.1104700565338135, 'learning_rate': 3.802190280629706e-05, 'epoch': 0.72}


 27%|██▋       | 400/1461 [05:32<14:26,  1.22it/s]

{'loss': 0.5351, 'grad_norm': 3.1280484199523926, 'learning_rate': 3.6310746064339495e-05, 'epoch': 0.82}


 31%|███       | 450/1461 [06:14<13:52,  1.21it/s]

{'loss': 0.4747, 'grad_norm': 4.925756931304932, 'learning_rate': 3.459958932238193e-05, 'epoch': 0.92}


                                                  
 33%|███▎      | 487/1461 [07:10<13:33,  1.20it/s]

{'eval_loss': 0.5388342142105103, 'eval_runtime': 26.0506, 'eval_samples_per_second': 37.35, 'eval_steps_per_second': 4.683, 'epoch': 1.0}


 34%|███▍      | 500/1461 [07:22<15:11,  1.05it/s]  

{'loss': 0.5129, 'grad_norm': 5.380462646484375, 'learning_rate': 3.288843258042437e-05, 'epoch': 1.03}


 38%|███▊      | 550/1461 [08:08<12:44,  1.19it/s]

{'loss': 0.4182, 'grad_norm': 4.136476993560791, 'learning_rate': 3.117727583846681e-05, 'epoch': 1.13}


 41%|████      | 600/1461 [08:50<12:22,  1.16it/s]

{'loss': 0.4758, 'grad_norm': 5.127407550811768, 'learning_rate': 2.9466119096509244e-05, 'epoch': 1.23}


 44%|████▍     | 650/1461 [09:32<11:04,  1.22it/s]

{'loss': 0.4872, 'grad_norm': 7.52424430847168, 'learning_rate': 2.775496235455168e-05, 'epoch': 1.33}


 48%|████▊     | 700/1461 [10:14<10:26,  1.21it/s]

{'loss': 0.4656, 'grad_norm': 19.981616973876953, 'learning_rate': 2.6043805612594112e-05, 'epoch': 1.44}


 51%|█████▏    | 750/1461 [10:58<10:45,  1.10it/s]

{'loss': 0.4025, 'grad_norm': 13.072657585144043, 'learning_rate': 2.433264887063655e-05, 'epoch': 1.54}


 55%|█████▍    | 800/1461 [11:42<09:01,  1.22it/s]

{'loss': 0.4093, 'grad_norm': 5.078599452972412, 'learning_rate': 2.262149212867899e-05, 'epoch': 1.64}


 58%|█████▊    | 850/1461 [12:38<08:11,  1.24it/s]

{'loss': 0.4355, 'grad_norm': 5.265566349029541, 'learning_rate': 2.0910335386721424e-05, 'epoch': 1.75}


 62%|██████▏   | 900/1461 [29:30<07:24,  1.26it/s]    

{'loss': 0.4477, 'grad_norm': 9.342350959777832, 'learning_rate': 1.919917864476386e-05, 'epoch': 1.85}


 65%|██████▌   | 950/1461 [45:55<18:22,  2.16s/it]    

{'loss': 0.3718, 'grad_norm': 4.339460849761963, 'learning_rate': 1.74880219028063e-05, 'epoch': 1.95}


                                                  
 67%|██████▋   | 974/1461 [46:54<06:07,  1.32it/s]

{'eval_loss': 0.5487121939659119, 'eval_runtime': 40.2314, 'eval_samples_per_second': 24.185, 'eval_steps_per_second': 3.032, 'epoch': 2.0}


 68%|██████▊   | 1000/1461 [47:15<06:04,  1.27it/s] 

{'loss': 0.3008, 'grad_norm': 1.1076717376708984, 'learning_rate': 1.5776865160848733e-05, 'epoch': 2.05}


 72%|███████▏  | 1050/1461 [47:57<05:23,  1.27it/s]

{'loss': 0.259, 'grad_norm': 22.55064582824707, 'learning_rate': 1.406570841889117e-05, 'epoch': 2.16}


 75%|███████▌  | 1100/1461 [48:36<04:43,  1.27it/s]

{'loss': 0.2328, 'grad_norm': 0.46806278824806213, 'learning_rate': 1.2354551676933608e-05, 'epoch': 2.26}


 79%|███████▊  | 1150/1461 [49:16<04:04,  1.27it/s]

{'loss': 0.2862, 'grad_norm': 0.2580460011959076, 'learning_rate': 1.0643394934976045e-05, 'epoch': 2.36}


 82%|████████▏ | 1200/1461 [49:55<03:25,  1.27it/s]

{'loss': 0.2845, 'grad_norm': 0.5050647854804993, 'learning_rate': 8.932238193018481e-06, 'epoch': 2.46}


 86%|████████▌ | 1250/1461 [50:35<02:46,  1.27it/s]

{'loss': 0.2288, 'grad_norm': 0.3914144039154053, 'learning_rate': 7.2210814510609185e-06, 'epoch': 2.57}


 89%|████████▉ | 1300/1461 [1:03:38<02:35,  1.03it/s]    

{'loss': 0.2715, 'grad_norm': 7.116154193878174, 'learning_rate': 5.509924709103354e-06, 'epoch': 2.67}


 92%|█████████▏| 1350/1461 [1:04:18<01:27,  1.27it/s]

{'loss': 0.2094, 'grad_norm': 0.5312200784683228, 'learning_rate': 3.7987679671457908e-06, 'epoch': 2.77}


 96%|█████████▌| 1400/1461 [1:04:57<00:47,  1.28it/s]

{'loss': 0.2142, 'grad_norm': 0.1931004524230957, 'learning_rate': 2.0876112251882273e-06, 'epoch': 2.87}


 99%|█████████▉| 1450/1461 [1:05:37<00:08,  1.27it/s]

{'loss': 0.1946, 'grad_norm': 0.24741491675376892, 'learning_rate': 3.7645448323066393e-07, 'epoch': 2.98}


                                                     
100%|██████████| 1461/1461 [1:06:08<00:00,  2.72s/it]


{'eval_loss': 1.0272753238677979, 'eval_runtime': 23.0591, 'eval_samples_per_second': 42.196, 'eval_steps_per_second': 5.291, 'epoch': 3.0}
{'train_runtime': 3968.6694, 'train_samples_per_second': 2.944, 'train_steps_per_second': 0.368, 'train_loss': 0.39873880217289126, 'epoch': 3.0}


100%|██████████| 122/122 [00:23<00:00,  5.28it/s]
100%|██████████| 122/122 [00:24<00:00,  5.06it/s]


Confusion Matrix for validation data for fold 4:
[[674  71]
 [162  66]]

Classification Report for validation data for fold 4:
              precision    recall  f1-score   support

           0       0.81      0.90      0.85       745
           1       0.48      0.29      0.36       228

    accuracy                           0.76       973
   macro avg       0.64      0.60      0.61       973
weighted avg       0.73      0.76      0.74       973




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 50/1461 [00:39<18:32,  1.27it/s]

{'loss': 0.543, 'grad_norm': 3.0387356281280518, 'learning_rate': 4.8288843258042436e-05, 'epoch': 0.1}


  7%|▋         | 100/1461 [01:19<17:55,  1.27it/s]

{'loss': 0.5584, 'grad_norm': 6.849310398101807, 'learning_rate': 4.657768651608487e-05, 'epoch': 0.21}


 10%|█         | 150/1461 [02:26<17:51,  1.22it/s]  

{'loss': 0.5842, 'grad_norm': 3.454028367996216, 'learning_rate': 4.486652977412731e-05, 'epoch': 0.31}


 14%|█▎        | 200/1461 [03:08<17:46,  1.18it/s]

{'loss': 0.5289, 'grad_norm': 2.914250612258911, 'learning_rate': 4.315537303216975e-05, 'epoch': 0.41}


 17%|█▋        | 250/1461 [03:53<18:16,  1.10it/s]

{'loss': 0.5256, 'grad_norm': 2.0844948291778564, 'learning_rate': 4.1444216290212186e-05, 'epoch': 0.51}


 21%|██        | 300/1461 [04:40<18:53,  1.02it/s]

{'loss': 0.5434, 'grad_norm': 11.75590705871582, 'learning_rate': 3.973305954825462e-05, 'epoch': 0.62}


 24%|██▍       | 350/1461 [05:30<18:38,  1.01s/it]

{'loss': 0.5422, 'grad_norm': 2.671625852584839, 'learning_rate': 3.802190280629706e-05, 'epoch': 0.72}


 27%|██▋       | 400/1461 [06:21<17:36,  1.00it/s]

{'loss': 0.5613, 'grad_norm': 5.296032905578613, 'learning_rate': 3.6310746064339495e-05, 'epoch': 0.82}


 31%|███       | 450/1461 [07:11<16:46,  1.00it/s]

{'loss': 0.5256, 'grad_norm': 4.866761684417725, 'learning_rate': 3.459958932238193e-05, 'epoch': 0.92}


                                                  
 33%|███▎      | 487/1461 [08:19<14:59,  1.08it/s]

{'eval_loss': 0.5194503664970398, 'eval_runtime': 31.203, 'eval_samples_per_second': 31.183, 'eval_steps_per_second': 3.91, 'epoch': 1.0}


 34%|███▍      | 500/1461 [08:32<19:17,  1.20s/it]  

{'loss': 0.5772, 'grad_norm': 7.775692462921143, 'learning_rate': 3.288843258042437e-05, 'epoch': 1.03}


 38%|███▊      | 550/1461 [09:26<15:50,  1.04s/it]

{'loss': 0.485, 'grad_norm': 6.376954555511475, 'learning_rate': 3.117727583846681e-05, 'epoch': 1.13}


 41%|████      | 600/1461 [10:18<14:33,  1.01s/it]

{'loss': 0.541, 'grad_norm': 5.95436429977417, 'learning_rate': 2.9466119096509244e-05, 'epoch': 1.23}


 44%|████▍     | 650/1461 [11:08<13:27,  1.00it/s]

{'loss': 0.4595, 'grad_norm': 9.585408210754395, 'learning_rate': 2.775496235455168e-05, 'epoch': 1.33}


 48%|████▊     | 700/1461 [11:59<12:41,  1.00s/it]

{'loss': 0.5087, 'grad_norm': 9.318435668945312, 'learning_rate': 2.6043805612594112e-05, 'epoch': 1.44}


 51%|█████▏    | 750/1461 [12:50<11:54,  1.01s/it]

{'loss': 0.5177, 'grad_norm': 6.082591533660889, 'learning_rate': 2.433264887063655e-05, 'epoch': 1.54}


 55%|█████▍    | 800/1461 [13:41<11:07,  1.01s/it]

{'loss': 0.4507, 'grad_norm': 3.935647487640381, 'learning_rate': 2.262149212867899e-05, 'epoch': 1.64}


 58%|█████▊    | 850/1461 [14:32<10:59,  1.08s/it]

{'loss': 0.42, 'grad_norm': 9.771103858947754, 'learning_rate': 2.0910335386721424e-05, 'epoch': 1.75}


 62%|██████▏   | 900/1461 [15:23<09:22,  1.00s/it]

{'loss': 0.456, 'grad_norm': 9.76842975616455, 'learning_rate': 1.919917864476386e-05, 'epoch': 1.85}


 65%|██████▌   | 950/1461 [16:14<08:33,  1.01s/it]

{'loss': 0.4473, 'grad_norm': 3.362560510635376, 'learning_rate': 1.74880219028063e-05, 'epoch': 1.95}


                                                  
 67%|██████▋   | 974/1461 [17:11<08:41,  1.07s/it]

{'eval_loss': 0.4651079773902893, 'eval_runtime': 31.8721, 'eval_samples_per_second': 30.528, 'eval_steps_per_second': 3.828, 'epoch': 2.0}


 68%|██████▊   | 1000/1461 [17:38<08:00,  1.04s/it] 

{'loss': 0.3662, 'grad_norm': 8.074893951416016, 'learning_rate': 1.5776865160848733e-05, 'epoch': 2.05}


 72%|███████▏  | 1050/1461 [18:35<07:54,  1.15s/it]

{'loss': 0.3748, 'grad_norm': 14.686750411987305, 'learning_rate': 1.406570841889117e-05, 'epoch': 2.16}


 75%|███████▌  | 1100/1461 [19:29<06:25,  1.07s/it]

{'loss': 0.3258, 'grad_norm': 7.164148330688477, 'learning_rate': 1.2354551676933608e-05, 'epoch': 2.26}


 79%|███████▊  | 1150/1461 [20:22<05:29,  1.06s/it]

{'loss': 0.3672, 'grad_norm': 0.4963669776916504, 'learning_rate': 1.0643394934976045e-05, 'epoch': 2.36}


 82%|████████▏ | 1200/1461 [21:14<04:32,  1.04s/it]

{'loss': 0.2642, 'grad_norm': 5.738635540008545, 'learning_rate': 8.932238193018481e-06, 'epoch': 2.46}


 86%|████████▌ | 1250/1461 [22:07<03:44,  1.06s/it]

{'loss': 0.3145, 'grad_norm': 21.990352630615234, 'learning_rate': 7.2210814510609185e-06, 'epoch': 2.57}


 89%|████████▉ | 1300/1461 [23:09<03:13,  1.20s/it]

{'loss': 0.3318, 'grad_norm': 8.617081642150879, 'learning_rate': 5.509924709103354e-06, 'epoch': 2.67}


 92%|█████████▏| 1350/1461 [24:03<01:56,  1.05s/it]

{'loss': 0.3491, 'grad_norm': 3.3939435482025146, 'learning_rate': 3.7987679671457908e-06, 'epoch': 2.77}


 96%|█████████▌| 1400/1461 [24:55<01:02,  1.02s/it]

{'loss': 0.2808, 'grad_norm': 11.861671447753906, 'learning_rate': 2.0876112251882273e-06, 'epoch': 2.87}


 99%|█████████▉| 1450/1461 [25:49<00:10,  1.07it/s]

{'loss': 0.2934, 'grad_norm': 14.501498222351074, 'learning_rate': 3.7645448323066393e-07, 'epoch': 2.98}


                                                   
100%|██████████| 1461/1461 [26:24<00:00,  1.08s/it]


{'eval_loss': 0.6996310949325562, 'eval_runtime': 24.7558, 'eval_samples_per_second': 39.304, 'eval_steps_per_second': 4.928, 'epoch': 3.0}
{'train_runtime': 1584.0737, 'train_samples_per_second': 7.375, 'train_steps_per_second': 0.922, 'train_loss': 0.44801764305448954, 'epoch': 3.0}


100%|██████████| 122/122 [00:25<00:00,  4.83it/s]
100%|██████████| 122/122 [00:25<00:00,  4.78it/s]


Confusion Matrix for validation data for fold 5:
[[700  79]
 [115  79]]

Classification Report for validation data for fold 5:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       779
           1       0.50      0.41      0.45       194

    accuracy                           0.80       973
   macro avg       0.68      0.65      0.66       973
weighted avg       0.79      0.80      0.79       973
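
The per-class figures in the report follow directly from the confusion matrix. As a minimal sketch (plain Python, not the notebook's own code, which relies on `classification_report`), the fold 5 numbers can be recomputed by hand. The helper name `per_class_metrics` is introduced here for illustration only.

```python
# Confusion matrix for fold 5, copied from the output above.
# Rows are true labels, columns are predicted labels (scikit-learn convention).
cm = [[700, 79],
      [115, 79]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for class index `cls` of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as `cls`
    actual = sum(cm[cls])                    # row sum: everything that truly is `cls`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
metrics = {cls: per_class_metrics(cm, cls) for cls in range(len(cm))}
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)

for cls, (p, r, f1) in metrics.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Read this way, the matrix shows that only 79 of the 194 sarcastic tweets in this fold are recovered, which the 0.80 accuracy alone does not reveal.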


Fold 1 results: {'eval_loss': 0.554473876953125, 'eval_runtime': 24.4562, 'eval_samples_per_second': 39.826, 'eval_steps_per_second': 4.989, 'epoch': 3.0}
Fold 2 results: {'eval_loss': 0.5243842005729675, 'eval_runtime': 25.9365, 'eval_samples_per_second': 37.553, 'eval_steps_per_second': 4.704, 'epoch': 3.0}
Fold 3 results: {'eval_loss': 0.47629255056381226, 'eval_runtime': 24.2951, 'eval_samples_per_second': 40.049, 'eval_steps_per_second': 5.022, 'epoch': 3.0}
Fold 4 results: {'eval_loss': 1.0272753238677979, 'eval_runtime': 23.1038, 'e
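
To compare folds at a glance, the final `eval_loss` values can be averaged. The sketch below is an illustration rather than part of the original notebook. The fold 5 value is taken from its epoch 3.0 evaluation log line further up, since the summary output above is cut off.

```python
from statistics import mean, pstdev

# Final-epoch eval_loss per fold, copied from the training output above.
fold_eval_loss = {
    1: 0.554473876953125,
    2: 0.5243842005729675,
    3: 0.47629255056381226,
    4: 1.0272753238677979,
    5: 0.6996310949325562,  # from the fold 5 evaluation log line at epoch 3.0
}

losses = list(fold_eval_loss.values())
mean_loss = mean(losses)
spread = pstdev(losses)  # population standard deviation across the five folds
worst_fold = max(fold_eval_loss, key=fold_eval_loss.get)

print(f"mean eval_loss = {mean_loss:.4f}, std = {spread:.4f}, worst fold = {worst_fold}")
```

Fold 4 stands out with a loss roughly twice that of the other folds, so the mean should be read together with the spread.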

### **Step 6: Tuning:**

In [None]:
# Keep a random 50% of the original training data, then split that half 80/20 into a training subset and a validation subset:
train_subset, _ = train_test_split(train_dataset, test_size=0.5, random_state=42)
train_subset, valid_subset = train_test_split(train_subset, test_size=0.2, random_state=42)  # 80% training, 20% validation
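
The resulting subset sizes determine how many optimizer steps each trial runs. The following sketch works through that arithmetic under two assumptions that are not stated in the notebook: that `train_dataset` holds 3467 examples (the 3468 training rows minus the one dropped missing tweet), and that `train_test_split` rounds the fractional test share up. The helper `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_original = 3467  # ASSUMPTION: 3468 training rows minus the one dropped missing tweet

n_half, _ = split_sizes(n_original, 0.5)     # first split: keep half of the data
n_train, n_valid = split_sizes(n_half, 0.2)  # second split: 80/20

# Optimizer steps per epoch on a single device without gradient accumulation.
steps_per_epoch = {bs: math.ceil(n_train / bs) for bs in (8, 16)}

print(f"train subset: {n_train}, validation subset: {n_valid}")
print(f"steps per epoch: {steps_per_epoch}")
```

Under these assumptions the sketch yields 174 steps per epoch at batch size 8 and 87 at batch size 16, which agrees with the progress-bar totals of 87, 174, and 348 in the trial logs.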

**Optuna Hyperparameter Optimization:**

In [None]:
# Defining the objective function for Optuna hyperparameter optimization:
def objective(trial):
    # Suggesting hyperparameters for the trial:
    num_train_epochs = trial.suggest_int('num_train_epochs', 1, 2) # Suggesting 1 or 2 training epochs.
    per_device_train_batch_size = trial.suggest_categorical('per_device_train_batch_size', [8, 16]) # Batch size suggestion.
    learning_rate = trial.suggest_float('learning_rate', 2e-5, 5e-5) # Learning rate suggestion.
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-2) # Weight Decay suggestion.
    hidden_dropout_prob = trial.suggest_float('hidden_dropout_prob', 0.1, 0.3) # Suggesting hidden layer dropout probability.
    attention_probs_dropout_prob = trial.suggest_float('attention_probs_dropout_prob', 0.1, 0.3) # Suggesting attention dropout probability.
    warmup_steps = trial.suggest_int('warmup_steps', 100, 300) # Suggesting number of warmup steps.

    # Configuring the model with the suggested dropout rates:
    config = BertConfig.from_pretrained('bert-base-uncased',
                                        hidden_dropout_prob=hidden_dropout_prob,
                                        attention_probs_dropout_prob=attention_probs_dropout_prob)
    # Initializing the BERT for sequence classification:
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

    # Defining training arguments including output directory, number of epochs, batch size, learning rate, etc.:
    training_args = TrainingArguments(
        output_dir='./results', # Directory for saving.
        num_train_epochs=num_train_epochs, # Number of training epochs.
        per_device_train_batch_size=per_device_train_batch_size, # Batch size per device.
        learning_rate=learning_rate, # Learning rate.
        weight_decay=weight_decay, # Weight decay for regularization.
        warmup_steps=warmup_steps, # Number of warmup steps.
        logging_dir='./logs', # Logging directory.
        logging_steps=10, # Log every 10 steps.
        evaluation_strategy='steps', # Evaluate the model every eval_steps steps.
        eval_steps=50, # Evaluate every 50 steps.
        save_total_limit=1, # Keep at most one checkpoint on disk.
        load_best_model_at_end=True, # Load the best checkpoint at the end of training.
    )

    # Initializing the Trainer with the model, training arguments, and datasets:
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_subset, # The training subset.
        eval_dataset=valid_subset, # The validation subset.
    )

    # Train the model
    trainer.train()

    # Predicting on the validation set to get 'raw' predictions:
    predictions_valid = trainer.predict(valid_subset)
    # Get the predicted labels by taking the argmax of the prediction scores:
    pred_labels_valid = np.argmax(predictions_valid.predictions, axis=-1)

    # Calculate the evaluation metric: the error rate (share of misclassified examples) on the validation set:
    error_rate = np.mean(predictions_valid.label_ids != pred_labels_valid)

    # Print the trial number, objective value, and parameters:
    print(f'Trial {trial.number} finished with value: {error_rate} and parameters: {trial.params}')

    # Return the validation error rate as the objective to minimize:
    return error_rate
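
The objective above returns the validation error rate, that is, one minus accuracy. On an imbalanced dataset such as this one, that metric can hide whether the sarcastic class is detected at all. The sketch below illustrates the point with toy labels that are made up for this example. The `macro_f1` helper is hypothetical and is shown only as one possible alternative objective, not as the notebook's method.

```python
def error_rate(labels, preds):
    """Fraction of examples whose predicted label differs from the true label."""
    return sum(y != p for y, p in zip(labels, preds)) / len(labels)

def macro_f1(labels, preds, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in classes:
        tp = sum(y == cls and p == cls for y, p in zip(labels, preds))
        predicted = sum(p == cls for p in preds)
        actual = sum(y == cls for y in labels)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data, invented for illustration: 6 non-sarcastic (0) and 2 sarcastic (1) tweets.
labels = [0, 0, 0, 0, 0, 0, 1, 1]
always_zero = [0] * len(labels)       # never predicts sarcasm
one_hit = [0, 0, 0, 0, 0, 1, 1, 0]    # finds one sarcastic tweet, with one false alarm

for name, preds in [("always_zero", always_zero), ("one_hit", one_hit)]:
    print(f"{name}: error_rate={error_rate(labels, preds):.2f} "
          f"macro_f1={macro_f1(labels, preds):.2f}")
```

Both toy predictors have the same error rate, yet only one of them ever predicts the minority class. If macro F1 were used as the objective instead, the study would need `direction='maximize'`.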

In [None]:
# Creating an Optuna study object that minimizes the objective (the validation error rate):
study = optuna.create_study(direction='minimize')

# Optimizing the objective function using the study object, specifying the number of trials to run:
study.optimize(objective, n_trials=5) # 5 trials.

# Print the best hyperparameters found during this process:
print("Best hyperparameters: ", study.best_params)

[I 2024-05-25 23:00:16,800] A new study created in memory with name: no-name-ea1652e0-9a9c-49b7-9978-fd36968625f7
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
 11%|█▏        | 10/87 [03:02<22:46, 17.74s/it]

{'loss': 0.7212, 'grad_norm': 4.859196662902832, 'learning_rate': 3.5103003737191408e-06, 'epoch': 0.11}


 23%|██▎       | 20/87 [06:28<20:35, 18.43s/it]

{'loss': 0.6736, 'grad_norm': 6.755258083343506, 'learning_rate': 7.0206007474382816e-06, 'epoch': 0.23}


 34%|███▍      | 30/87 [09:22<15:29, 16.31s/it]

{'loss': 0.6282, 'grad_norm': 2.947716236114502, 'learning_rate': 1.0530901121157421e-05, 'epoch': 0.34}


 46%|████▌     | 40/87 [12:26<12:59, 16.58s/it]

{'loss': 0.5678, 'grad_norm': 3.7947611808776855, 'learning_rate': 1.4041201494876563e-05, 'epoch': 0.46}


 57%|█████▋    | 50/87 [15:21<12:10, 19.74s/it]

{'loss': 0.5578, 'grad_norm': 4.417141914367676, 'learning_rate': 1.7551501868595705e-05, 'epoch': 0.57}


                                               
 57%|█████▋    | 50/87 [15:33<12:10, 19.74s/it]

{'eval_loss': 0.5717975497245789, 'eval_runtime': 11.5833, 'eval_samples_per_second': 29.957, 'eval_steps_per_second': 3.799, 'epoch': 0.57}


 69%|██████▉   | 60/87 [18:41<08:00, 17.79s/it]

{'loss': 0.5491, 'grad_norm': 3.9263155460357666, 'learning_rate': 2.1061802242314843e-05, 'epoch': 0.69}


 80%|████████  | 70/87 [22:08<06:48, 24.05s/it]

{'loss': 0.5662, 'grad_norm': 2.979161262512207, 'learning_rate': 2.4572102616033985e-05, 'epoch': 0.8}


 92%|█████████▏| 80/87 [25:49<02:03, 17.62s/it]

{'loss': 0.5483, 'grad_norm': 2.287055492401123, 'learning_rate': 2.8082402989753126e-05, 'epoch': 0.92}


100%|██████████| 87/87 [28:36<00:00, 19.73s/it]


{'train_runtime': 1716.9042, 'train_samples_per_second': 0.807, 'train_steps_per_second': 0.051, 'train_loss': 0.5980319483526821, 'epoch': 1.0}


100%|██████████| 44/44 [00:09<00:00,  4.40it/s]
[I 2024-05-25 23:29:07,109] Trial 0 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 16, 'learning_rate': 4.387875467148926e-05, 'weight_decay': 0.0048523174781734185, 'hidden_dropout_prob': 0.26705371584687876, 'attention_probs_dropout_prob': 0.15203313011306313, 'warmup_steps': 125}. Best is trial 0 with value: 0.2478386167146974.


Trial 0 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 16, 'learning_rate': 4.387875467148926e-05, 'weight_decay': 0.0048523174781734185, 'hidden_dropout_prob': 0.26705371584687876, 'attention_probs_dropout_prob': 0.15203313011306313, 'warmup_steps': 125}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 10/348 [00:13<10:16,  1.82s/it]

{'loss': 0.9139, 'grad_norm': 10.911090850830078, 'learning_rate': 1.7281775572377228e-06, 'epoch': 0.06}


  6%|▌         | 20/348 [00:22<05:01,  1.09it/s]

{'loss': 0.8738, 'grad_norm': 9.33482837677002, 'learning_rate': 3.4563551144754456e-06, 'epoch': 0.11}


  9%|▊         | 30/348 [00:31<04:50,  1.09it/s]

{'loss': 0.695, 'grad_norm': 6.375826358795166, 'learning_rate': 5.184532671713168e-06, 'epoch': 0.17}


 11%|█▏        | 40/348 [00:41<04:37,  1.11it/s]

{'loss': 0.5903, 'grad_norm': 4.449949741363525, 'learning_rate': 6.912710228950891e-06, 'epoch': 0.23}


 14%|█▍        | 50/348 [00:49<04:08,  1.20it/s]

{'loss': 0.5009, 'grad_norm': 5.567315101623535, 'learning_rate': 8.640887786188614e-06, 'epoch': 0.29}



 14%|█▍        | 50/348 [00:58<04:08,  1.20it/s]

{'eval_loss': 0.5804856419563293, 'eval_runtime': 8.7388, 'eval_samples_per_second': 39.708, 'eval_steps_per_second': 5.035, 'epoch': 0.29}


 17%|█▋        | 60/348 [01:07<04:36,  1.04it/s]

{'loss': 0.6657, 'grad_norm': 4.950977325439453, 'learning_rate': 1.0369065343426337e-05, 'epoch': 0.34}


 20%|██        | 70/348 [01:15<03:57,  1.17it/s]

{'loss': 0.5081, 'grad_norm': 7.466126441955566, 'learning_rate': 1.209724290066406e-05, 'epoch': 0.4}


 23%|██▎       | 80/348 [01:24<03:45,  1.19it/s]

{'loss': 0.5758, 'grad_norm': 5.962028980255127, 'learning_rate': 1.3825420457901782e-05, 'epoch': 0.46}


 26%|██▌       | 90/348 [01:32<03:37,  1.19it/s]

{'loss': 0.5245, 'grad_norm': 6.878669261932373, 'learning_rate': 1.5553598015139505e-05, 'epoch': 0.52}


 29%|██▊       | 100/348 [01:41<03:30,  1.18it/s]

{'loss': 0.6332, 'grad_norm': 10.597390174865723, 'learning_rate': 1.7281775572377228e-05, 'epoch': 0.57}


                                                 
 29%|██▊       | 100/348 [01:53<03:30,  1.18it/s]

{'eval_loss': 0.5559613704681396, 'eval_runtime': 11.8783, 'eval_samples_per_second': 29.213, 'eval_steps_per_second': 3.704, 'epoch': 0.57}


 32%|███▏      | 110/348 [02:01<03:56,  1.01it/s]

{'loss': 0.5392, 'grad_norm': 11.342509269714355, 'learning_rate': 1.900995312961495e-05, 'epoch': 0.63}


 34%|███▍      | 120/348 [02:10<03:11,  1.19it/s]

{'loss': 0.6209, 'grad_norm': 4.948254108428955, 'learning_rate': 2.0738130686852674e-05, 'epoch': 0.69}


 37%|███▋      | 130/348 [02:18<03:07,  1.16it/s]

{'loss': 0.5828, 'grad_norm': 5.218197345733643, 'learning_rate': 2.24663082440904e-05, 'epoch': 0.75}


 40%|████      | 140/348 [02:28<03:03,  1.13it/s]

{'loss': 0.5735, 'grad_norm': 10.766552925109863, 'learning_rate': 2.419448580132812e-05, 'epoch': 0.8}


 43%|████▎     | 150/348 [02:36<02:49,  1.17it/s]

{'loss': 0.5682, 'grad_norm': 6.21796989440918, 'learning_rate': 2.5922663358565842e-05, 'epoch': 0.86}


                                                 
 43%|████▎     | 150/348 [02:47<02:49,  1.17it/s]

{'eval_loss': 0.5560653805732727, 'eval_runtime': 10.3617, 'eval_samples_per_second': 33.489, 'eval_steps_per_second': 4.246, 'epoch': 0.86}


 46%|████▌     | 160/348 [02:57<03:05,  1.01it/s]

{'loss': 0.5363, 'grad_norm': 6.082193851470947, 'learning_rate': 2.7650840915803565e-05, 'epoch': 0.92}


 49%|████▉     | 170/348 [03:06<02:33,  1.16it/s]

{'loss': 0.5631, 'grad_norm': 8.137551307678223, 'learning_rate': 2.9379018473041288e-05, 'epoch': 0.98}


 52%|█████▏    | 180/348 [03:17<02:32,  1.10it/s]

{'loss': 0.6233, 'grad_norm': 4.461733818054199, 'learning_rate': 3.110719603027901e-05, 'epoch': 1.03}


 55%|█████▍    | 190/348 [03:26<02:13,  1.19it/s]

{'loss': 0.6184, 'grad_norm': 14.608083724975586, 'learning_rate': 3.208361635011833e-05, 'epoch': 1.09}


 57%|█████▋    | 200/348 [03:34<02:01,  1.22it/s]

{'loss': 0.4355, 'grad_norm': 3.3196914196014404, 'learning_rate': 3.0053007720364002e-05, 'epoch': 1.15}


                                                 
 57%|█████▋    | 200/348 [03:44<02:01,  1.22it/s]

{'eval_loss': 0.6006419658660889, 'eval_runtime': 9.9774, 'eval_samples_per_second': 34.779, 'eval_steps_per_second': 4.41, 'epoch': 1.15}


 60%|██████    | 210/348 [03:53<02:14,  1.03it/s]

{'loss': 0.6482, 'grad_norm': 7.7127156257629395, 'learning_rate': 2.8022399090609678e-05, 'epoch': 1.21}


 63%|██████▎   | 220/348 [04:01<01:46,  1.20it/s]

{'loss': 0.5022, 'grad_norm': 5.399535179138184, 'learning_rate': 2.5991790460855353e-05, 'epoch': 1.26}


 66%|██████▌   | 230/348 [04:10<01:40,  1.17it/s]

{'loss': 0.545, 'grad_norm': 5.5490241050720215, 'learning_rate': 2.3961181831101028e-05, 'epoch': 1.32}


 69%|██████▉   | 240/348 [04:18<01:31,  1.19it/s]

{'loss': 0.5696, 'grad_norm': 6.727486610412598, 'learning_rate': 2.1930573201346704e-05, 'epoch': 1.38}


 72%|███████▏  | 250/348 [04:27<01:21,  1.21it/s]

{'loss': 0.4321, 'grad_norm': 4.033036231994629, 'learning_rate': 1.989996457159238e-05, 'epoch': 1.44}


                                                 
 72%|███████▏  | 250/348 [04:36<01:21,  1.21it/s]

{'eval_loss': 0.5808852314949036, 'eval_runtime': 9.4827, 'eval_samples_per_second': 36.593, 'eval_steps_per_second': 4.64, 'epoch': 1.44}


 75%|███████▍  | 260/348 [04:46<01:23,  1.05it/s]

{'loss': 0.4535, 'grad_norm': 4.7086100578308105, 'learning_rate': 1.7869355941838054e-05, 'epoch': 1.49}


 78%|███████▊  | 270/348 [04:54<01:04,  1.21it/s]

{'loss': 0.4886, 'grad_norm': 11.76833438873291, 'learning_rate': 1.583874731208373e-05, 'epoch': 1.55}


 80%|████████  | 280/348 [05:07<01:15,  1.12s/it]

{'loss': 0.5071, 'grad_norm': 5.702723026275635, 'learning_rate': 1.3808138682329405e-05, 'epoch': 1.61}


 83%|████████▎ | 290/348 [05:15<00:48,  1.19it/s]

{'loss': 0.4321, 'grad_norm': 5.783255100250244, 'learning_rate': 1.177753005257508e-05, 'epoch': 1.67}


 86%|████████▌ | 300/348 [05:24<00:39,  1.20it/s]

{'loss': 0.5119, 'grad_norm': 4.761653423309326, 'learning_rate': 9.746921422820757e-06, 'epoch': 1.72}


                                                 
 86%|████████▌ | 300/348 [05:33<00:39,  1.20it/s]

{'eval_loss': 0.5841008424758911, 'eval_runtime': 9.2914, 'eval_samples_per_second': 37.346, 'eval_steps_per_second': 4.736, 'epoch': 1.72}


 89%|████████▉ | 310/348 [05:42<00:35,  1.07it/s]

{'loss': 0.545, 'grad_norm': 8.270478248596191, 'learning_rate': 7.716312793066432e-06, 'epoch': 1.78}


 92%|█████████▏| 320/348 [05:50<00:23,  1.18it/s]

{'loss': 0.5572, 'grad_norm': 5.751743316650391, 'learning_rate': 5.6857041633121075e-06, 'epoch': 1.84}


 95%|█████████▍| 330/348 [05:58<00:14,  1.21it/s]

{'loss': 0.4412, 'grad_norm': 5.5620245933532715, 'learning_rate': 3.6550955335577836e-06, 'epoch': 1.9}


 98%|█████████▊| 340/348 [06:07<00:06,  1.19it/s]

{'loss': 0.6562, 'grad_norm': 10.674330711364746, 'learning_rate': 1.6244869038034596e-06, 'epoch': 1.95}


100%|██████████| 348/348 [06:15<00:00,  1.08s/it]


{'train_runtime': 375.2652, 'train_samples_per_second': 7.387, 'train_steps_per_second': 0.927, 'train_loss': 0.5712054844560295, 'epoch': 2.0}


100%|██████████| 44/44 [00:10<00:00,  4.27it/s]
[I 2024-05-25 23:35:36,190] Trial 1 finished with value: 0.2420749279538905 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'learning_rate': 3.248973807606919e-05, 'weight_decay': 0.006033468033489174, 'hidden_dropout_prob': 0.15698566483474938, 'attention_probs_dropout_prob': 0.19330502088918516, 'warmup_steps': 188}. Best is trial 1 with value: 0.2420749279538905.


Trial 1 finished with value: 0.2420749279538905 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'learning_rate': 3.248973807606919e-05, 'weight_decay': 0.006033468033489174, 'hidden_dropout_prob': 0.15698566483474938, 'attention_probs_dropout_prob': 0.19330502088918516, 'warmup_steps': 188}
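
The `learning_rate` values in the logs above are consistent with a linear warmup followed by a linear decay to zero. The notebook does not set a scheduler explicitly, so the sketch below assumes this is the `Trainer` default. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, not a library call.

```python
def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at optimizer step `step`: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

# Trial 0: 87 steps in total but warmup_steps=125, so warmup never completes.
trial0_base = 4.387875467148926e-05
trial0_lr_step10 = linear_warmup_lr(10, trial0_base, warmup_steps=125, total_steps=87)
trial0_lr_final = linear_warmup_lr(87, trial0_base, warmup_steps=125, total_steps=87)

# Trial 1: warmup ends at step 188 of 348, after which the rate decays.
trial1_base = 3.248973807606919e-05
trial1_lr_step190 = linear_warmup_lr(190, trial1_base, warmup_steps=188, total_steps=348)

print(f"trial 0, step 10:  {trial0_lr_step10:.6e}")
print(f"trial 0, step 87:  {trial0_lr_final:.6e} (peak {trial0_base:.6e} is never reached)")
print(f"trial 1, step 190: {trial1_lr_step190:.6e}")
```

One consequence worth noting: with a `warmup_steps` range of 100 to 300 and as few as 87 optimizer steps in a trial, some trials spend their entire run in warmup and never train at the suggested learning rate.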


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  6%|▌         | 10/174 [02:16<44:24, 16.24s/it]

{'loss': 0.7238, 'grad_norm': 4.473864555358887, 'learning_rate': 4.288694858920577e-06, 'epoch': 0.11}


 11%|█▏        | 20/174 [05:30<47:27, 18.49s/it]

{'loss': 0.6194, 'grad_norm': 3.394953727722168, 'learning_rate': 8.577389717841155e-06, 'epoch': 0.23}


 17%|█▋        | 30/174 [10:07<59:40, 24.87s/it]  

{'loss': 0.583, 'grad_norm': 2.482041835784912, 'learning_rate': 1.2866084576761732e-05, 'epoch': 0.34}


 23%|██▎       | 40/174 [12:44<32:27, 14.54s/it]

{'loss': 0.5585, 'grad_norm': 3.33992075920105, 'learning_rate': 1.715477943568231e-05, 'epoch': 0.46}


 29%|██▊       | 50/174 [15:10<27:52, 13.49s/it]

{'loss': 0.5652, 'grad_norm': 3.758619785308838, 'learning_rate': 2.1443474294602888e-05, 'epoch': 0.57}


                                                
 29%|██▊       | 50/174 [15:19<27:52, 13.49s/it]

{'eval_loss': 0.5685209631919861, 'eval_runtime': 9.5467, 'eval_samples_per_second': 36.348, 'eval_steps_per_second': 4.609, 'epoch': 0.57}


 34%|███▍      | 60/174 [18:15<34:10, 17.98s/it]

{'loss': 0.5667, 'grad_norm': 3.521420955657959, 'learning_rate': 2.5732169153523464e-05, 'epoch': 0.69}


 40%|████      | 70/174 [40:25<9:02:49, 313.17s/it]

{'loss': 0.5597, 'grad_norm': 3.0629777908325195, 'learning_rate': 3.0020864012444043e-05, 'epoch': 0.8}


 46%|████▌     | 80/174 [43:08<35:25, 22.61s/it]   

{'loss': 0.54, 'grad_norm': 1.7464849948883057, 'learning_rate': 3.430955887136462e-05, 'epoch': 0.92}


 52%|█████▏    | 90/174 [45:22<17:48, 12.72s/it]

{'loss': 0.604, 'grad_norm': 6.394249439239502, 'learning_rate': 3.8598253730285194e-05, 'epoch': 1.03}


 57%|█████▋    | 100/174 [47:10<12:35, 10.21s/it]

{'loss': 0.5396, 'grad_norm': 6.753331661224365, 'learning_rate': 4.2886948589205777e-05, 'epoch': 1.15}


                                                 
 57%|█████▋    | 100/174 [47:19<12:35, 10.21s/it]

{'eval_loss': 0.5443894267082214, 'eval_runtime': 8.3035, 'eval_samples_per_second': 41.789, 'eval_steps_per_second': 5.299, 'epoch': 1.15}


 63%|██████▎   | 110/174 [48:53<09:57,  9.34s/it]

{'loss': 0.569, 'grad_norm': 6.34322452545166, 'learning_rate': 4.6027592824353766e-05, 'epoch': 1.26}


 69%|██████▉   | 120/174 [50:34<08:39,  9.62s/it]

{'loss': 0.5679, 'grad_norm': 3.6089069843292236, 'learning_rate': 3.883578144554849e-05, 'epoch': 1.38}


 75%|███████▍  | 130/174 [52:14<06:54,  9.43s/it]

{'loss': 0.4715, 'grad_norm': 3.170910120010376, 'learning_rate': 3.1643970066743217e-05, 'epoch': 1.49}


 80%|████████  | 140/174 [54:01<05:35,  9.87s/it]

{'loss': 0.5361, 'grad_norm': 2.632513999938965, 'learning_rate': 2.4452158687937938e-05, 'epoch': 1.61}


 86%|████████▌ | 150/174 [55:58<04:54, 12.28s/it]

{'loss': 0.5081, 'grad_norm': 3.3666584491729736, 'learning_rate': 1.7260347309132663e-05, 'epoch': 1.72}


                                                 
 86%|████████▌ | 150/174 [56:07<04:54, 12.28s/it]

{'eval_loss': 0.5405374765396118, 'eval_runtime': 8.678, 'eval_samples_per_second': 39.986, 'eval_steps_per_second': 5.07, 'epoch': 1.72}


 92%|█████████▏| 160/174 [58:00<02:56, 12.64s/it]

{'loss': 0.5717, 'grad_norm': 2.1259605884552, 'learning_rate': 1.0068535930327387e-05, 'epoch': 1.84}


 98%|█████████▊| 170/174 [59:43<00:37,  9.26s/it]

{'loss': 0.5675, 'grad_norm': 3.916945695877075, 'learning_rate': 2.8767245515221104e-06, 'epoch': 1.95}


100%|██████████| 174/174 [1:00:25<00:00, 20.84s/it]


{'train_runtime': 3625.444, 'train_samples_per_second': 0.765, 'train_steps_per_second': 0.048, 'train_loss': 0.56887180092691, 'epoch': 2.0}


100%|██████████| 44/44 [00:08<00:00,  4.98it/s]
[I 2024-05-26 00:36:14,352] Trial 2 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'learning_rate': 4.674677396223429e-05, 'weight_decay': 0.009751308196446934, 'hidden_dropout_prob': 0.16386995713264338, 'attention_probs_dropout_prob': 0.10117117215984879, 'warmup_steps': 109}. Best is trial 1 with value: 0.2420749279538905.


Trial 2 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'learning_rate': 4.674677396223429e-05, 'weight_decay': 0.009751308196446934, 'hidden_dropout_prob': 0.16386995713264338, 'attention_probs_dropout_prob': 0.10117117215984879, 'warmup_steps': 109}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  3%|▎         | 10/348 [00:08<04:23,  1.29it/s]

{'loss': 0.6552, 'grad_norm': 4.1551127433776855, 'learning_rate': 1.5970132269784866e-06, 'epoch': 0.06}


  6%|▌         | 20/348 [00:17<05:41,  1.04s/it]

{'loss': 0.5993, 'grad_norm': 5.853025913238525, 'learning_rate': 3.194026453956973e-06, 'epoch': 0.11}


  9%|▊         | 30/348 [00:29<06:09,  1.16s/it]

{'loss': 0.584, 'grad_norm': 5.107651710510254, 'learning_rate': 4.79103968093546e-06, 'epoch': 0.17}


 11%|█▏        | 40/348 [00:43<06:46,  1.32s/it]

{'loss': 0.5906, 'grad_norm': 5.40311336517334, 'learning_rate': 6.388052907913946e-06, 'epoch': 0.23}


 14%|█▍        | 50/348 [00:57<07:20,  1.48s/it]

{'loss': 0.5055, 'grad_norm': 2.9208827018737793, 'learning_rate': 7.985066134892432e-06, 'epoch': 0.29}



 14%|█▍        | 50/348 [01:32<07:20,  1.48s/it]

{'eval_loss': 0.5667887330055237, 'eval_runtime': 34.9078, 'eval_samples_per_second': 9.94, 'eval_steps_per_second': 1.26, 'epoch': 0.29}


 17%|█▋        | 60/348 [01:43<06:03,  1.26s/it]  

{'loss': 0.6095, 'grad_norm': 3.5784153938293457, 'learning_rate': 9.58207936187092e-06, 'epoch': 0.34}


 20%|██        | 70/348 [01:59<08:00,  1.73s/it]

{'loss': 0.5256, 'grad_norm': 2.7750864028930664, 'learning_rate': 1.1179092588849405e-05, 'epoch': 0.4}


 23%|██▎       | 80/348 [02:16<07:32,  1.69s/it]

{'loss': 0.5619, 'grad_norm': 3.7890634536743164, 'learning_rate': 1.2776105815827893e-05, 'epoch': 0.46}


 26%|██▌       | 90/348 [02:31<06:23,  1.48s/it]

{'loss': 0.5293, 'grad_norm': 5.58758544921875, 'learning_rate': 1.4373119042806379e-05, 'epoch': 0.52}


 29%|██▊       | 100/348 [02:49<06:53,  1.67s/it]

{'loss': 0.6322, 'grad_norm': 7.831322193145752, 'learning_rate': 1.5970132269784865e-05, 'epoch': 0.57}



 29%|██▊       | 100/348 [03:04<06:53,  1.67s/it]

{'eval_loss': 0.5667465925216675, 'eval_runtime': 15.8474, 'eval_samples_per_second': 21.896, 'eval_steps_per_second': 2.776, 'epoch': 0.57}


 32%|███▏      | 110/348 [03:20<06:32,  1.65s/it]

{'loss': 0.5367, 'grad_norm': 8.959602355957031, 'learning_rate': 1.7567145496763353e-05, 'epoch': 0.63}


 34%|███▍      | 120/348 [03:32<04:51,  1.28s/it]

{'loss': 0.5955, 'grad_norm': 4.010586261749268, 'learning_rate': 1.916415872374184e-05, 'epoch': 0.69}


 37%|███▋      | 130/348 [03:45<04:40,  1.29s/it]

{'loss': 0.5476, 'grad_norm': 4.243637561798096, 'learning_rate': 2.0761171950720325e-05, 'epoch': 0.75}


 40%|████      | 140/348 [03:58<04:06,  1.18s/it]

{'loss': 0.5646, 'grad_norm': 9.865771293640137, 'learning_rate': 2.235818517769881e-05, 'epoch': 0.8}


 43%|████▎     | 150/348 [04:10<03:59,  1.21s/it]

{'loss': 0.5681, 'grad_norm': 4.980096340179443, 'learning_rate': 2.3955198404677297e-05, 'epoch': 0.86}



 43%|████▎     | 150/348 [04:25<03:59,  1.21s/it]

{'eval_loss': 0.559355616569519, 'eval_runtime': 14.3633, 'eval_samples_per_second': 24.159, 'eval_steps_per_second': 3.063, 'epoch': 0.86}


 46%|████▌     | 160/348 [04:38<04:20,  1.39s/it]

{'loss': 0.5323, 'grad_norm': 5.19037389755249, 'learning_rate': 2.5552211631655785e-05, 'epoch': 0.92}


 49%|████▉     | 170/348 [04:50<03:47,  1.28s/it]

{'loss': 0.5854, 'grad_norm': 5.540190696716309, 'learning_rate': 2.7149224858634273e-05, 'epoch': 0.98}


 52%|█████▏    | 180/348 [05:04<03:35,  1.28s/it]

{'loss': 0.6465, 'grad_norm': 2.544477939605713, 'learning_rate': 2.8746238085612757e-05, 'epoch': 1.03}


 55%|█████▍    | 190/348 [05:17<03:36,  1.37s/it]

{'loss': 0.5993, 'grad_norm': 8.155173301696777, 'learning_rate': 3.0343251312591245e-05, 'epoch': 1.09}


 57%|█████▋    | 200/348 [05:31<03:26,  1.39s/it]

{'loss': 0.4859, 'grad_norm': 3.148007869720459, 'learning_rate': 3.194026453956973e-05, 'epoch': 1.15}


                                                 
 57%|█████▋    | 200/348 [05:45<03:26,  1.39s/it]

{'eval_loss': 0.5547922849655151, 'eval_runtime': 14.2453, 'eval_samples_per_second': 24.359, 'eval_steps_per_second': 3.089, 'epoch': 1.15}


 60%|██████    | 210/348 [05:58<03:23,  1.47s/it]

{'loss': 0.6091, 'grad_norm': 5.3057403564453125, 'learning_rate': 3.353727776654822e-05, 'epoch': 1.21}


 63%|██████▎   | 220/348 [06:12<02:57,  1.38s/it]

{'loss': 0.5439, 'grad_norm': 3.190263271331787, 'learning_rate': 3.38615567882095e-05, 'epoch': 1.26}


 66%|██████▌   | 230/348 [06:27<02:37,  1.33s/it]

{'loss': 0.5303, 'grad_norm': 2.8922770023345947, 'learning_rate': 3.121612266413063e-05, 'epoch': 1.32}


 69%|██████▉   | 240/348 [06:39<02:16,  1.26s/it]

{'loss': 0.5885, 'grad_norm': 4.540046215057373, 'learning_rate': 2.8570688540051766e-05, 'epoch': 1.38}


 72%|███████▏  | 250/348 [06:55<02:13,  1.36s/it]

{'loss': 0.4457, 'grad_norm': 3.576287269592285, 'learning_rate': 2.5925254415972898e-05, 'epoch': 1.44}


                                                 
 72%|███████▏  | 250/348 [07:10<02:13,  1.36s/it]

{'eval_loss': 0.5916538834571838, 'eval_runtime': 15.5015, 'eval_samples_per_second': 22.385, 'eval_steps_per_second': 2.838, 'epoch': 1.44}


 75%|███████▍  | 260/348 [07:25<02:16,  1.55s/it]

{'loss': 0.4933, 'grad_norm': 3.221712350845337, 'learning_rate': 2.327982029189403e-05, 'epoch': 1.49}


 78%|███████▊  | 270/348 [07:39<02:01,  1.56s/it]

{'loss': 0.5291, 'grad_norm': 8.614871978759766, 'learning_rate': 2.0634386167815162e-05, 'epoch': 1.55}


 80%|████████  | 280/348 [07:52<01:26,  1.28s/it]

{'loss': 0.5464, 'grad_norm': 2.8462939262390137, 'learning_rate': 1.7988952043736294e-05, 'epoch': 1.61}


 83%|████████▎ | 290/348 [08:05<01:15,  1.30s/it]

{'loss': 0.4728, 'grad_norm': 5.538880825042725, 'learning_rate': 1.534351791965743e-05, 'epoch': 1.67}


 86%|████████▌ | 300/348 [08:19<01:07,  1.41s/it]

{'loss': 0.5395, 'grad_norm': 6.611605167388916, 'learning_rate': 1.2698083795578562e-05, 'epoch': 1.72}


                                                 
 86%|████████▌ | 300/348 [08:33<01:07,  1.41s/it]

{'eval_loss': 0.5653212666511536, 'eval_runtime': 13.996, 'eval_samples_per_second': 24.793, 'eval_steps_per_second': 3.144, 'epoch': 1.72}


 89%|████████▉ | 310/348 [08:46<01:00,  1.58s/it]

{'loss': 0.584, 'grad_norm': 5.597208499908447, 'learning_rate': 1.0052649671499694e-05, 'epoch': 1.78}


 92%|█████████▏| 320/348 [08:59<00:34,  1.22s/it]

{'loss': 0.6232, 'grad_norm': 4.030673027038574, 'learning_rate': 7.407215547420828e-06, 'epoch': 1.84}


 95%|█████████▍| 330/348 [09:11<00:23,  1.30s/it]

{'loss': 0.473, 'grad_norm': 3.4883766174316406, 'learning_rate': 4.761781423341961e-06, 'epoch': 1.9}


 98%|█████████▊| 340/348 [09:24<00:10,  1.31s/it]

{'loss': 0.6425, 'grad_norm': 3.8579299449920654, 'learning_rate': 2.1163472992630936e-06, 'epoch': 1.95}


100%|██████████| 348/348 [09:35<00:00,  1.65s/it]


{'train_runtime': 575.1048, 'train_samples_per_second': 4.82, 'train_steps_per_second': 0.605, 'train_loss': 0.5636824054279547, 'epoch': 2.0}


100%|██████████| 44/44 [00:14<00:00,  3.04it/s]
[I 2024-05-26 00:46:07,544] Trial 3 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'learning_rate': 3.465518702543316e-05, 'weight_decay': 0.007682524015421179, 'hidden_dropout_prob': 0.20260952628330847, 'attention_probs_dropout_prob': 0.21561738362324898, 'warmup_steps': 217}. Best is trial 1 with value: 0.2420749279538905.




Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  6%|▌         | 10/174 [00:15<04:07,  1.51s/it]

{'loss': 0.7661, 'grad_norm': 5.3499908447265625, 'learning_rate': 1.5271576626994448e-06, 'epoch': 0.06}


 11%|█▏        | 20/174 [00:31<04:01,  1.57s/it]

{'loss': 0.6623, 'grad_norm': 5.8095269203186035, 'learning_rate': 3.0543153253988896e-06, 'epoch': 0.11}


 17%|█▋        | 30/174 [00:44<03:13,  1.34s/it]

{'loss': 0.6353, 'grad_norm': 4.23299503326416, 'learning_rate': 4.5814729880983345e-06, 'epoch': 0.17}


 23%|██▎       | 40/174 [01:00<03:22,  1.51s/it]

{'loss': 0.6049, 'grad_norm': 4.642436504364014, 'learning_rate': 6.108630650797779e-06, 'epoch': 0.23}


 29%|██▊       | 50/174 [01:13<02:47,  1.35s/it]

{'loss': 0.5296, 'grad_norm': 3.880038261413574, 'learning_rate': 7.635788313497224e-06, 'epoch': 0.29}


                                                
 29%|██▊       | 50/174 [01:28<02:47,  1.35s/it]

{'eval_loss': 0.5751470327377319, 'eval_runtime': 15.1299, 'eval_samples_per_second': 22.935, 'eval_steps_per_second': 2.908, 'epoch': 0.29}


 34%|███▍      | 60/174 [01:42<02:45,  1.45s/it]

{'loss': 0.6194, 'grad_norm': 3.8372609615325928, 'learning_rate': 9.162945976196669e-06, 'epoch': 0.34}


 40%|████      | 70/174 [01:55<02:21,  1.36s/it]

{'loss': 0.5299, 'grad_norm': 2.982344150543213, 'learning_rate': 1.0690103638896114e-05, 'epoch': 0.4}


 46%|████▌     | 80/174 [02:08<01:59,  1.27s/it]

{'loss': 0.5883, 'grad_norm': 3.7484378814697266, 'learning_rate': 1.2217261301595558e-05, 'epoch': 0.46}


 52%|█████▏    | 90/174 [02:22<01:59,  1.42s/it]

{'loss': 0.5114, 'grad_norm': 4.7551116943359375, 'learning_rate': 1.3744418964295004e-05, 'epoch': 0.52}


 57%|█████▋    | 100/174 [02:35<01:28,  1.20s/it]

{'loss': 0.6276, 'grad_norm': 7.781628131866455, 'learning_rate': 1.5271576626994447e-05, 'epoch': 0.57}


                                                 
 57%|█████▋    | 100/174 [02:50<01:28,  1.20s/it]

{'eval_loss': 0.555624783039093, 'eval_runtime': 14.496, 'eval_samples_per_second': 23.938, 'eval_steps_per_second': 3.035, 'epoch': 0.57}


 63%|██████▎   | 110/174 [03:02<01:30,  1.42s/it]

{'loss': 0.513, 'grad_norm': 8.594584465026855, 'learning_rate': 1.6798734289693893e-05, 'epoch': 0.63}


 69%|██████▉   | 120/174 [03:14<01:05,  1.22s/it]

{'loss': 0.6115, 'grad_norm': 2.651008367538452, 'learning_rate': 1.8325891952393338e-05, 'epoch': 0.69}


 75%|███████▍  | 130/174 [03:26<00:51,  1.18s/it]

{'loss': 0.5733, 'grad_norm': 3.296694278717041, 'learning_rate': 1.985304961509278e-05, 'epoch': 0.75}


 80%|████████  | 140/174 [03:39<00:39,  1.16s/it]

{'loss': 0.5622, 'grad_norm': 8.211738586425781, 'learning_rate': 2.138020727779223e-05, 'epoch': 0.8}


 86%|████████▌ | 150/174 [03:51<00:28,  1.20s/it]

{'loss': 0.5484, 'grad_norm': 5.1276655197143555, 'learning_rate': 2.290736494049167e-05, 'epoch': 0.86}


                                                 
 86%|████████▌ | 150/174 [04:04<00:28,  1.20s/it]

{'eval_loss': 0.5538168549537659, 'eval_runtime': 13.2691, 'eval_samples_per_second': 26.151, 'eval_steps_per_second': 3.316, 'epoch': 0.86}


 92%|█████████▏| 160/174 [04:16<00:19,  1.36s/it]

{'loss': 0.5231, 'grad_norm': 5.040319442749023, 'learning_rate': 2.4434522603191116e-05, 'epoch': 0.92}


 98%|█████████▊| 170/174 [04:28<00:04,  1.23s/it]

{'loss': 0.5834, 'grad_norm': 5.961061477661133, 'learning_rate': 2.596168026589056e-05, 'epoch': 0.98}


100%|██████████| 174/174 [04:33<00:00,  1.57s/it]


{'train_runtime': 273.1753, 'train_samples_per_second': 5.074, 'train_steps_per_second': 0.637, 'train_loss': 0.5873013912946329, 'epoch': 1.0}


100%|██████████| 44/44 [00:13<00:00,  3.16it/s]
[I 2024-05-26 00:50:57,824] Trial 4 finished with value: 0.2478386167146974 and parameters: {'num_train_epochs': 1, 'per_device_train_batch_size': 8, 'learning_rate': 3.894252039883584e-05, 'weight_decay': 0.0063986528296316484, 'hidden_dropout_prob': 0.10500577565336684, 'attention_probs_dropout_prob': 0.19239252680070668, 'warmup_steps': 255}. Best is trial 1 with value: 0.2420749279538905.


Best hyperparameters:  {'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'learning_rate': 3.248973807606919e-05, 'weight_decay': 0.006033468033489174, 'hidden_dropout_prob': 0.15698566483474938, 'attention_probs_dropout_prob': 0.19330502088918516, 'warmup_steps': 188}


In [None]:
# After obtaining the best hyperparameters, rebuild the model from the pre-trained checkpoint with the tuned dropout rates:
best_params = study.best_params
config = BertConfig.from_pretrained('bert-base-uncased',
                                    hidden_dropout_prob=best_params['hidden_dropout_prob'],
                                    attention_probs_dropout_prob=best_params['attention_probs_dropout_prob'])
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Proceeding with the best hyperparameters found by Optuna:**

In [None]:
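# Define training arguments using the best hyperparameters found by Optuna.
# Note: the tuned warmup_steps is not passed here, so the Trainer default of 0 warmup steps applies.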
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=best_params['num_train_epochs'],  # Best number of training epochs from Optuna
    per_device_train_batch_size=best_params['per_device_train_batch_size'],  # Best batch size per device from Optuna
    learning_rate=best_params['learning_rate'],  # Best learning rate from Optuna
    weight_decay=best_params['weight_decay'],  # Best weight decay from Optuna
    logging_dir='./logs',  # Directory for logging the training information.
    logging_steps=50,  # Log every 50 steps.
    evaluation_strategy='no'  # No evaluation during training.
)


In [None]:
# Initialize the Trainer again, this time with the best hyperparameters:
trainer = Trainer(
    model=model,
    args=training_args,  # Training arguments including best hyperparameters found by Optuna
    train_dataset=train_dataset
)


In [None]:
# Training the final model with the best hyperparameters:
trainer.train()

  6%|▌         | 50/868 [00:42<10:53,  1.25it/s]

{'loss': 0.6148, 'grad_norm': 4.780924320220947, 'learning_rate': 3.0618209385051375e-05, 'epoch': 0.12}


 12%|█▏        | 100/868 [01:23<10:30,  1.22it/s]

{'loss': 0.5349, 'grad_norm': 2.5848705768585205, 'learning_rate': 2.8746680694033568e-05, 'epoch': 0.23}


 17%|█▋        | 150/868 [02:03<09:28,  1.26it/s]

{'loss': 0.5466, 'grad_norm': 5.00223970413208, 'learning_rate': 2.6875152003015758e-05, 'epoch': 0.35}


 23%|██▎       | 200/868 [02:43<08:53,  1.25it/s]

{'loss': 0.6029, 'grad_norm': 5.299203872680664, 'learning_rate': 2.5003623311997947e-05, 'epoch': 0.46}


 29%|██▉       | 250/868 [03:23<08:10,  1.26it/s]

{'loss': 0.6055, 'grad_norm': 4.381015777587891, 'learning_rate': 2.3132094620980137e-05, 'epoch': 0.58}


 35%|███▍      | 300/868 [04:04<07:43,  1.22it/s]

{'loss': 0.5549, 'grad_norm': 3.582388401031494, 'learning_rate': 2.1260565929962327e-05, 'epoch': 0.69}


 40%|████      | 350/868 [04:43<06:55,  1.25it/s]

{'loss': 0.5557, 'grad_norm': 5.753030776977539, 'learning_rate': 1.9389037238944516e-05, 'epoch': 0.81}


 46%|████▌     | 400/868 [05:24<06:19,  1.23it/s]

{'loss': 0.5701, 'grad_norm': 10.986193656921387, 'learning_rate': 1.7517508547926706e-05, 'epoch': 0.92}


 52%|█████▏    | 450/868 [06:05<05:43,  1.22it/s]

{'loss': 0.5138, 'grad_norm': 4.162977695465088, 'learning_rate': 1.5645979856908896e-05, 'epoch': 1.04}


 58%|█████▊    | 500/868 [06:46<05:00,  1.22it/s]

{'loss': 0.5583, 'grad_norm': 12.449533462524414, 'learning_rate': 1.3774451165891085e-05, 'epoch': 1.15}


 63%|██████▎   | 550/868 [07:30<04:24,  1.20it/s]

{'loss': 0.5209, 'grad_norm': 7.037698745727539, 'learning_rate': 1.1902922474873275e-05, 'epoch': 1.27}


 69%|██████▉   | 600/868 [08:12<03:40,  1.22it/s]

{'loss': 0.4809, 'grad_norm': 7.652212142944336, 'learning_rate': 1.0031393783855463e-05, 'epoch': 1.38}


 75%|███████▍  | 650/868 [08:53<02:58,  1.22it/s]

{'loss': 0.5877, 'grad_norm': 14.10260009765625, 'learning_rate': 8.159865092837654e-06, 'epoch': 1.5}


 81%|████████  | 700/868 [09:34<02:16,  1.23it/s]

{'loss': 0.4787, 'grad_norm': 8.906774520874023, 'learning_rate': 6.288336401819843e-06, 'epoch': 1.61}


 86%|████████▋ | 750/868 [10:17<01:36,  1.22it/s]

{'loss': 0.4906, 'grad_norm': 8.288435935974121, 'learning_rate': 4.416807710802033e-06, 'epoch': 1.73}


 92%|█████████▏| 800/868 [11:01<00:57,  1.18it/s]

{'loss': 0.4599, 'grad_norm': 13.823904991149902, 'learning_rate': 2.545279019784222e-06, 'epoch': 1.84}


 98%|█████████▊| 850/868 [11:47<00:14,  1.28it/s]

{'loss': 0.4922, 'grad_norm': 6.726139068603516, 'learning_rate': 6.737503287664117e-07, 'epoch': 1.96}


100%|██████████| 868/868 [27:32<00:00,  1.90s/it] 

{'train_runtime': 1652.8534, 'train_samples_per_second': 4.195, 'train_steps_per_second': 0.525, 'train_loss': 0.5370206420872069, 'epoch': 2.0}





TrainOutput(global_step=868, training_loss=0.5370206420872069, metrics={'train_runtime': 1652.8534, 'train_samples_per_second': 4.195, 'train_steps_per_second': 0.525, 'total_flos': 534495720078000.0, 'train_loss': 0.5370206420872069, 'epoch': 2.0})
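
The `learning_rate` values logged above fall in a straight line from the tuned value towards zero. That is consistent with the linear schedule the Hugging Face `Trainer` uses by default, here without warmup because `warmup_steps` was not passed to `TrainingArguments`. The sketch below reproduces two of the logged values in plain Python. The helper name `linear_lr` is hypothetical: it only mirrors the schedule's formula and is not part of the `transformers` API.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 towards base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0 at the last step
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BEST_LR = 3.248973807606919e-05  # learning_rate of the best Optuna trial
TOTAL_STEPS = 868                # total optimisation steps shown by the progress bar

lr_step_50 = linear_lr(50, BEST_LR, TOTAL_STEPS)    # logged: 3.0618209385051375e-05
lr_step_850 = linear_lr(850, BEST_LR, TOTAL_STEPS)  # logged: 6.737503287664117e-07
print(f"{lr_step_50:.6e}  {lr_step_850:.6e}")
```

The same helper also matches the Trial 4 log: `linear_lr(170, 3.894252039883584e-05, 174, 255)` is about 2.596e-05. In that trial the sampled warmup (255 steps) was longer than the whole run (174 steps), so the learning rate was still ramping up when training ended.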

In [None]:
# Predicting on the train set and extracting the predicted labels:
predictions_train = trainer.predict(train_dataset)
pred_labels_train = np.argmax(predictions_train.predictions, axis=-1)

# Predicting on the test set and extracting the predicted labels:
predictions_test = trainer.predict(test_dataset)
pred_labels_test = np.argmax(predictions_test.predictions, axis=-1)

100%|██████████| 434/434 [01:35<00:00,  4.56it/s]
100%|██████████| 175/175 [00:34<00:00,  5.06it/s]


**Print Classification Reports For Evaluation:**

In [None]:
print("Classification Report Train set: \n", classification_report(y_train, pred_labels_train))
print("Confusion Matrix Train set: \n", confusion_matrix(y_train, pred_labels_train))

Classification Report Train set: 
               precision    recall  f1-score   support

           0       0.83      0.97      0.89      2600
           1       0.81      0.40      0.54       867

    accuracy                           0.83      3467
   macro avg       0.82      0.69      0.72      3467
weighted avg       0.83      0.83      0.80      3467

Confusion Matrix Train set: 
 [[2521   79]
 [ 519  348]]


In [None]:
print("Classification Report Test set: \n", classification_report(y_test, pred_labels_test))
print("Confusion Matrix Test set: \n", confusion_matrix(y_test, pred_labels_test))

Classification Report Test set: 
               precision    recall  f1-score   support

           0       0.89      0.91      0.90      1200
           1       0.40      0.35      0.37       200

    accuracy                           0.83      1400
   macro avg       0.65      0.63      0.64      1400
weighted avg       0.82      0.83      0.83      1400

Confusion Matrix Test set: 
 [[1096  104]
 [ 130   70]]
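
The test report above can be cross-checked against its confusion matrix, which also makes the effect of the class imbalance visible. The sketch below recomputes accuracy and the sarcastic-class (label 1) precision, recall and F1 from the four counts, and compares accuracy with a baseline that always predicts the majority class. The function name `binary_metrics` is hypothetical; the counts are copied from the output above.

```python
def binary_metrics(cm):
    """Positive-class metrics from a 2x2 matrix laid out as [[TN, FP], [FN, TP]] (scikit-learn order)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Accuracy of a classifier that always predicts the larger class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


test_cm = [[1096, 104], [130, 70]]  # test-set confusion matrix after 2 epochs
metrics = binary_metrics(test_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On these counts, always predicting "non-sarcastic" would score 1200/1400 ≈ 0.857 accuracy, slightly above the model's 0.833, so the class-1 F1 of about 0.37 is the more informative figure for this dataset.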


### **Testing With 3 Training Epochs:**

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,  # Experimenting with 3 epochs to see the effect on the results.
    per_device_train_batch_size=best_params['per_device_train_batch_size'],
    learning_rate=best_params['learning_rate'],
    weight_decay=best_params['weight_decay'],
    logging_dir='./logs',
    logging_steps=50,
    evaluation_strategy='no'
)

In [None]:
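# Note: `model` is the same object that was fine-tuned for 2 epochs above; it is not re-initialized here,
# so the 3 epochs below continue from those weights rather than starting from bert-base-uncased.
trainer = Trainer(
    model=model,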
    args=training_args,
    train_dataset=train_dataset
)

In [None]:
# Training for 3 epochs, with the other hyperparameters set to the best values from Optuna:
trainer.train()

  4%|▍         | 50/1302 [00:49<18:09,  1.15it/s]

{'loss': 0.1604, 'grad_norm': 0.04119155928492546, 'learning_rate': 3.1242052282057315e-05, 'epoch': 0.12}


  8%|▊         | 100/1302 [01:36<23:58,  1.20s/it]

{'loss': 0.1193, 'grad_norm': 0.06762097775936127, 'learning_rate': 2.999436648804544e-05, 'epoch': 0.23}


 12%|█▏        | 150/1302 [02:41<18:39,  1.03it/s]

{'loss': 0.0996, 'grad_norm': 88.84729766845703, 'learning_rate': 2.8746680694033568e-05, 'epoch': 0.35}


 15%|█▌        | 200/1302 [03:26<15:40,  1.17it/s]

{'loss': 0.1488, 'grad_norm': 9.145594596862793, 'learning_rate': 2.7498994900021694e-05, 'epoch': 0.46}


 19%|█▉        | 250/1302 [04:11<15:22,  1.14it/s]

{'loss': 0.088, 'grad_norm': 0.009762235917150974, 'learning_rate': 2.625130910600982e-05, 'epoch': 0.58}


 23%|██▎       | 300/1302 [04:56<15:50,  1.05it/s]

{'loss': 0.1669, 'grad_norm': 0.02049676887691021, 'learning_rate': 2.5003623311997947e-05, 'epoch': 0.69}


 27%|██▋       | 350/1302 [05:51<13:44,  1.15it/s]

{'loss': 0.1043, 'grad_norm': 0.006245830096304417, 'learning_rate': 2.3755937517986074e-05, 'epoch': 0.81}


 31%|███       | 400/1302 [06:36<15:37,  1.04s/it]

{'loss': 0.1431, 'grad_norm': 3.097804307937622, 'learning_rate': 2.25082517239742e-05, 'epoch': 0.92}


 35%|███▍      | 450/1302 [07:20<12:03,  1.18it/s]

{'loss': 0.1654, 'grad_norm': 2.125974178314209, 'learning_rate': 2.1260565929962327e-05, 'epoch': 1.04}


 38%|███▊      | 500/1302 [08:03<11:07,  1.20it/s]

{'loss': 0.0603, 'grad_norm': 0.13426044583320618, 'learning_rate': 2.001288013595045e-05, 'epoch': 1.15}


 42%|████▏     | 550/1302 [08:47<10:27,  1.20it/s]

{'loss': 0.0403, 'grad_norm': 249.80010986328125, 'learning_rate': 1.876519434193858e-05, 'epoch': 1.27}


 46%|████▌     | 600/1302 [09:28<09:34,  1.22it/s]

{'loss': 0.0702, 'grad_norm': 0.019618425518274307, 'learning_rate': 1.7517508547926706e-05, 'epoch': 1.38}


 50%|████▉     | 650/1302 [10:09<09:08,  1.19it/s]

{'loss': 0.1002, 'grad_norm': 0.007629311643540859, 'learning_rate': 1.6269822753914832e-05, 'epoch': 1.5}


 54%|█████▍    | 700/1302 [10:51<08:19,  1.21it/s]

{'loss': 0.0926, 'grad_norm': 0.013807023875415325, 'learning_rate': 1.5022136959902957e-05, 'epoch': 1.61}


 58%|█████▊    | 750/1302 [11:33<07:35,  1.21it/s]

{'loss': 0.1471, 'grad_norm': 1.2716071605682373, 'learning_rate': 1.3774451165891085e-05, 'epoch': 1.73}


 61%|██████▏   | 800/1302 [12:15<06:51,  1.22it/s]

{'loss': 0.2437, 'grad_norm': 29.47140884399414, 'learning_rate': 1.2526765371879212e-05, 'epoch': 1.84}


 65%|██████▌   | 850/1302 [12:56<06:20,  1.19it/s]

{'loss': 0.4263, 'grad_norm': 162.63131713867188, 'learning_rate': 1.1279079577867336e-05, 'epoch': 1.96}


 69%|██████▉   | 900/1302 [13:37<05:24,  1.24it/s]

{'loss': 0.2526, 'grad_norm': 55.78738784790039, 'learning_rate': 1.0031393783855463e-05, 'epoch': 2.07}


 73%|███████▎  | 950/1302 [14:19<04:47,  1.22it/s]

{'loss': 0.2234, 'grad_norm': 0.09114867448806763, 'learning_rate': 8.783707989843591e-06, 'epoch': 2.19}


 77%|███████▋  | 1000/1302 [14:59<04:10,  1.21it/s]

{'loss': 0.1681, 'grad_norm': 25.60295867919922, 'learning_rate': 7.5360221958317166e-06, 'epoch': 2.3}


 81%|████████  | 1050/1302 [15:44<03:24,  1.23it/s]

{'loss': 0.1831, 'grad_norm': 0.04794296994805336, 'learning_rate': 6.288336401819843e-06, 'epoch': 2.42}


 84%|████████▍ | 1100/1302 [16:28<02:46,  1.21it/s]

{'loss': 0.3032, 'grad_norm': 0.576240599155426, 'learning_rate': 5.0406506078079694e-06, 'epoch': 2.53}


 88%|████████▊ | 1150/1302 [21:48<02:27,  1.03it/s]  

{'loss': 0.2352, 'grad_norm': 0.3192039132118225, 'learning_rate': 3.792964813796096e-06, 'epoch': 2.65}


 92%|█████████▏| 1200/1302 [23:27<04:57,  2.91s/it]

{'loss': 0.2481, 'grad_norm': 0.5398576855659485, 'learning_rate': 2.545279019784222e-06, 'epoch': 2.76}


 96%|█████████▌| 1250/1302 [24:10<00:42,  1.23it/s]

{'loss': 0.171, 'grad_norm': 92.58430480957031, 'learning_rate': 1.2975932257723485e-06, 'epoch': 2.88}


100%|█████████▉| 1300/1302 [24:51<00:01,  1.21it/s]

{'loss': 0.2763, 'grad_norm': 0.08760837465524673, 'learning_rate': 4.990743176047494e-08, 'epoch': 3.0}


100%|██████████| 1302/1302 [24:53<00:00,  1.15s/it]

{'train_runtime': 1493.132, 'train_samples_per_second': 6.966, 'train_steps_per_second': 0.872, 'train_loss': 0.1704139141573323, 'epoch': 3.0}





TrainOutput(global_step=1302, training_loss=0.1704139141573323, metrics={'train_runtime': 1493.132, 'train_samples_per_second': 6.966, 'train_steps_per_second': 0.872, 'total_flos': 801743580117000.0, 'train_loss': 0.1704139141573323, 'epoch': 3.0})

In [None]:
# Predicting on the train and test sets and extracting the predicted labels:
predictions_train = trainer.predict(train_dataset)
pred_labels_train = np.argmax(predictions_train.predictions, axis=-1)
predictions_test = trainer.predict(test_dataset)
pred_labels_test = np.argmax(predictions_test.predictions, axis=-1)

100%|██████████| 434/434 [01:27<00:00,  4.97it/s]
100%|██████████| 175/175 [00:35<00:00,  4.90it/s]


**Print Evaluation Metrics To Compare Performance With 3 Epochs:**

In [None]:
print("Classification Report Train set: \n", classification_report(y_train, pred_labels_train))
print("Confusion Matrix Train set: \n", confusion_matrix(y_train, pred_labels_train))

Classification Report Train set: 
               precision    recall  f1-score   support

           0       0.98      1.00      0.99      2600
           1       0.99      0.95      0.97       867

    accuracy                           0.99      3467
   macro avg       0.99      0.97      0.98      3467
weighted avg       0.99      0.99      0.99      3467

Confusion Matrix Train set: 
 [[2593    7]
 [  43  824]]


In [None]:
print("Classification Report Test set: \n", classification_report(y_test, pred_labels_test))
print("Confusion Matrix Test set: \n", confusion_matrix(y_test, pred_labels_test))

Classification Report Test set: 
               precision    recall  f1-score   support

           0       0.88      0.83      0.86      1200
           1       0.26      0.35      0.30       200

    accuracy                           0.76      1400
   macro avg       0.57      0.59      0.58      1400
weighted avg       0.79      0.76      0.78      1400

Confusion Matrix Test set: 
 [[996 204]
 [130  70]]
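
Putting the four confusion matrices side by side shows how the gap between train and test performance changes across the two runs. The sketch below computes the sarcastic-class F1 on each split. The helper `f1_positive` and the dictionary layout are illustrative choices, and the counts are copied from the outputs above. Note that the second run reuses the already fine-tuned `model`, so its train score reflects additional passes over the same data.

```python
def f1_positive(cm):
    """F1 for label 1 from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (_, fp), (fn, tp) = cm
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


runs = {
    "first run (2 epochs)": {
        "train": [[2521, 79], [519, 348]],
        "test": [[1096, 104], [130, 70]],
    },
    "second run (3 epochs)": {
        "train": [[2593, 7], [43, 824]],
        "test": [[996, 204], [130, 70]],
    },
}

gaps = {}
for name, cms in runs.items():
    f1_train = f1_positive(cms["train"])
    f1_test = f1_positive(cms["test"])
    gaps[name] = f1_train - f1_test  # train/test gap in class-1 F1
    print(f"{name}: train F1 = {f1_train:.2f}, test F1 = {f1_test:.2f}, gap = {gaps[name]:.2f}")
```

On these numbers the train F1 rises from about 0.54 to 0.97 while the test F1 falls from about 0.37 to 0.30, which is the usual signature of overfitting.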


### **Step 7: Predict On Some Selected Tweets:**

In [None]:
tweets = [
    ("Biden is a great President like none other we have had", 1),
    ("Max Verstappen is such a clean driver, he never makes dirty moves when racing", 1),
    ("I can't wait to spend time with my family over Christmas! I just love being the only single one and the many many questions asking when will I get a boyfriend 🙄", 1),
    ("Well this is awesome news to wake up to!", 0),
    ("So school have sent out a really helpful newsletter with all key dates and activities. So many fun things arranged for the children.", 0),
    ("I just love the smell of one million 😍", 0)
]

# Tokenize the tweets and create a dataset with placeholder float32 labels (the labels do not affect the predicted classes)
tweet_texts = [tweet for tweet, _ in tweets]
tweet_encodings = tokenizer(tweet_texts, padding=True, truncation=True, max_length=150, return_tensors='pt')
tweet_dataset = SarcasmDataset(tweet_encodings, labels=torch.zeros(len(tweets), dtype=torch.float32))

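# Determine the device (Apple MPS, then CUDA, then CPU) and move the model to it.
# torch.has_mps is deprecated, so the availability check from torch.backends.mps is used instead.
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')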
model.to(device)

# Predict and print
predictions = trainer.predict(tweet_dataset)
pred_labels = np.argmax(predictions.predictions, axis=-1)

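# Iterate over the tweet texts rather than the (text, label) tuples so that only the text is printed
for tweet, label in zip(tweet_texts, pred_labels):
    if label == 1:
        print(f"Tweet: {tweet}\nPrediction: Sarcastic\n")
    else:
        print(f"Tweet: {tweet}\nPrediction: Non-sarcastic\n")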

Tweet: Biden is a great President like none other we have had
Prediction: Non-sarcastic

Tweet: Max Verstappen is such a clean driver, he never makes dirty moves when racing
Prediction: Non-sarcastic

Tweet: I can't wait to spend time with my family over Christmas! I just love being the only single one and the many many questions asking when will I get a boyfriend 🙄
Prediction: Non-sarcastic

Tweet: Well this is awesome news to wake up to!
Prediction: Non-sarcastic

Tweet: So school have sent out a really helpful newsletter with all key dates and activities. So many fun things arranged for the children.
Prediction: Non-sarcastic

Tweet: I just love the smell of one million 😍
Prediction: Non-sarcastic
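
The loop above prints only the predictions, so the intended labels stored in `tweets` are never compared with them. A minimal sketch of that comparison follows. The six predicted labels are hard-coded from the printed output (all 0, i.e. non-sarcastic) so that the snippet runs without the model, and the variable names are illustrative.

```python
# Intended labels, in the same order as the `tweets` list (1 = sarcastic, 0 = non-sarcastic)
true_labels = [1, 1, 1, 0, 0, 0]
# Predictions copied from the printed output: every tweet was classed as non-sarcastic
predicted_labels = [0, 0, 0, 0, 0, 0]

pairs = list(zip(true_labels, predicted_labels))
correct = sum(t == p for t, p in pairs)
sarcastic_total = sum(t == 1 for t in true_labels)
sarcastic_found = sum(t == 1 and p == 1 for t, p in pairs)

print(f"Correct: {correct}/{len(true_labels)}")
print(f"Sarcastic tweets detected: {sarcastic_found}/{sarcastic_total}")
```

Inside the notebook, `predicted_labels` would be `pred_labels.tolist()` and `true_labels` would be `[label for _, label in tweets]`.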

