<a href="https://colab.research.google.com/github/ozlemkrblt/StatisticalLanguageProcessingAssignments/blob/main/Assignment4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
"""
Course:        Statistical Language Processing - Summer 2024
Assignment:    A4
Author(s):     Özlem Karabulut

Honor Code:    We pledge that this program represents our own work,
               and that we have not given or received unauthorized help
               with this assignment.
"""



# **1. Set Up the Notebook**

## *1.1. Installing and Importing Necessary Libraries and Packages*

In [2]:
!pip install transformers
!pip install accelerate -U #to be able to use TrainingArguments
!pip install datasets
!pip install evaluate



In [3]:
import transformers
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer

## *1.2. Mounting Google Drive and Setting Variable `device`*

In [4]:
from google.colab import drive
import os
import torch

drive.mount('/content/drive')

root_dir = '/content/drive/My Drive/Assignment4'

# Create a directory for training checkpoints (no error if it already exists)
checkpoints_dir = os.path.join(root_dir, 'trainingCheckpoints')
os.makedirs(checkpoints_dir, exist_ok=True)

# Create a directory for saving models (no error if it already exists)
models_dir = os.path.join(root_dir, 'savedModels')
os.makedirs(models_dir, exist_ok=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")




# **2. Prepare the Training Data**

## *2.1. Load the Emotion Subset of the Tweet Dataset*

In [5]:
from datasets import load_dataset

ds = load_dataset("cardiffnlp/tweet_eval", "emotion")




## *2.2. & 2.3. Split and Tokenize The Dataset*

In [6]:
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [7]:
if device.type == "cpu": # If the device is CPU, use only the first 50/10/10 samples for train/dev/test splits
    ds['train'] = ds['train'].select(range(50))
    ds['validation'] = ds['validation'].select(range(10))
    ds['test'] = ds['test'].select(range(10))

In [8]:
def tokenize_function(example):  # tokenize a batch of tweets (the dataset already provides the splits)
    return tokenizer(example["text"], padding=True, truncation=True)
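With `padding=True` inside a batched `map`, each batch handed to the tokenizer is padded to the longest sequence in that batch, not to a fixed global length. The sketch below mimics that behaviour in plain Python. The helper `pad_batch` and the toy token ids are hypothetical and only illustrate the idea; `pad_id=0` is assumed to match the `[PAD]` id of `bert-base-uncased`.

```python
# Hypothetical helper: pad every sequence in a batch to that batch's longest length.
def pad_batch(batch, pad_id=0):
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, made up for illustration (not real tokenizer output).
short_batch = [[101, 7592, 102], [101, 102]]
long_batch = [[101, 7592, 2088, 2003, 102], [101, 102]]

padded_short = pad_batch(short_batch)
padded_long = pad_batch(long_batch)

# The padded length depends on the batch, so the two batches end up with different widths.
print(len(padded_short["input_ids"][0]), len(padded_long["input_ids"][0]))
```

Padding per batch keeps short batches short, which is why the number of `[PAD]` tokens can differ between runs that batch the data differently.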

In [9]:
tokenized_train = ds['train'].map(tokenize_function, batched=True)
tokenized_validation = ds['validation'].map(tokenize_function, batched=True)
tokenized_test = ds['test'].map(tokenize_function, batched=True)


## *2.4. Debug Code*

In [10]:
# Debug: check that the dataset was split correctly
print("Length of train set:", len(ds['train']))
print("Length of validation set:", len(ds['validation']))
print("Length of test set:", len(ds['test']))

def print_examples(dataset, subset_name):
    print(f"\nExamples from {subset_name} set:")
    # Print at most five examples, even if the subset is smaller
    for i in range(min(5, len(dataset))):
        example = dataset[i]
        print(f"text: {example['text']}")
        print(f"label: {example['label']}")
        print("-------")

print_examples(ds['train'], "train")
print_examples(ds['validation'], "validation")
print_examples(ds['test'], "test")

Length of train set: 50
Length of validation set: 10
Length of test set: 10

Examples from train set:
text: “Worry is a down payment on a problem you may never have'.  Joyce Meyer.  #motivation #leadership #worry
label: 2
-------
text: My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs
label: 0
-------
text: No but that's so cute. Atsu was probably shy about photos before but cherry helped her out uwu
label: 1
-------
text: Rooneys fucking untouchable isn't he? Been fucking dreadful again, depay has looked decent(ish)tonight
label: 0
-------
text: it's pretty depressing when u hit pan on ur favourite highlighter
label: 3
-------

Examples from validation set:
text: @user @user Oh, hidden revenge and anger...I rememberthe time,she rebutted you.
label: 0
-------
text: if not then #teamchristine bc all tana has done is provoke her by tweeting shady shit and trying to be a hard bitch begging for a fight
label: 0
-------
text: Hey @user #Field

# **3. Pre-training Tasks**

## *3.1. Load the pretrained `bert-base-uncased` model*

In [11]:
from transformers import AutoModelForSequenceClassification

# Get the label names, the number of labels, and the label <-> id mappings
label_names = ds["train"].features["label"].names
num_labels = len(label_names)
label2id = {label: idx for idx, label in enumerate(label_names)}
id2label = {idx: label for label, idx in label2id.items()}

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=num_labels, label2id=label2id, id2label=id2label
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### *3.1.1. Debug Code*

In [12]:
#check the mapping
print(model.config.label2id)
print(model.config.id2label)

# Print some examples to verify tokenization and labels
def print_examples(dataset, subset_name):
    print(f"\nExamples from {subset_name} set:")
    for i in range(min(5, len(dataset))):
        example = dataset[i]
        # Decode the token ids back to text; special tokens such as [PAD] stay visible
        print(f"text: {tokenizer.decode(example['input_ids'])}")
        print(f"label: {id2label[example['label']]}")
        print("-------")

print_examples(tokenized_train, "train")
print_examples(tokenized_validation, "validation")
print_examples(tokenized_test, "test")

{'anger': 0, 'joy': 1, 'optimism': 2, 'sadness': 3}
{0: 'anger', 1: 'joy', 2: 'optimism', 3: 'sadness'}

Examples from train set:
text: [CLS] “ worry is a down payment on a problem you may never have '. joyce meyer. # motivation # leadership # worry [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: optimism
-------
text: [CLS] my roommate : it's okay that we can't spell because we have autocorrect. # terrible # firstworldprobs [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: anger
-------
text: [CLS] no but that's so cute. atsu was probably shy about photos before but cherry helped her out uwu [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: joy
-------
text: [CLS] rooneys fucking untouchable isn't he? been fucking dreadful again, depay has looked decent ( ish ) tonight [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
label: anger
-------
text: [CLS] it's pretty depressing when u hit pan

## *3.2. Training Arguments*

In [13]:
training_args = TrainingArguments(
    output_dir=checkpoints_dir,
    evaluation_strategy="steps",
    save_strategy="steps", #during training, the model is saved at intervals, according to save_strategy.
    logging_strategy="steps",
    save_total_limit=2,
    load_best_model_at_end=True, #when training is finished, load the best model.
    eval_steps=500,
    save_steps=500,
    logging_steps=500, #according to the instructions
    per_device_train_batch_size=8,  # controls how many samples are processed before the model's internal parameters are updated.
    per_device_eval_batch_size=8,
    #report_to="none",
)




## *3.3. Evaluation Setup*

In [14]:
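import evaluate
import numpy as np

metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # The task has four classes, so the default average="binary" does not apply;
    # macro averaging is an assumption here. compute() returns a dict, so take its "f1" value.
    f1 = metric.compute(predictions=preds, references=labels, average="macro")["f1"]

    return {
        'f1': f1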
    }
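Macro-averaged F1 is one common way to aggregate F1 over the four emotion classes: compute F1 per class, then take the unweighted mean. The following is a minimal plain-Python sketch of that calculation, not the `evaluate` implementation. The helpers `argmax` and `macro_f1` and the toy logits are hypothetical and exist only for illustration.

```python
def argmax(row):
    # Index of the largest value in a list of scores.
    return max(range(len(row)), key=lambda i: row[i])

def macro_f1(preds, labels, num_classes):
    scores = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Toy logits for four examples and four classes (made up for illustration).
toy_logits = [
    [2.0, 0.1, 0.0, 0.0],
    [0.0, 1.5, 0.2, 0.0],
    [0.9, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.1, 1.2],
]
toy_labels = [0, 1, 2, 3]

toy_preds = [argmax(row) for row in toy_logits]
score = macro_f1(toy_preds, toy_labels, num_classes=4)
print(toy_preds, round(score, 4))
```

Because every class counts equally, a class the model never predicts (class 2 in the toy data) pulls the macro score down regardless of how rare it is.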


# **4. Initialize Trainer and Train**

In [15]:
model.to(device)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_validation,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # function that computes the evaluation metric
)

trainer.train()

Step,Training Loss,Validation Loss


TrainOutput(global_step=21, training_loss=1.156018575032552, metrics={'train_runtime': 114.4279, 'train_samples_per_second': 1.311, 'train_steps_per_second': 0.184, 'total_flos': 2929218645600.0, 'train_loss': 1.156018575032552, 'epoch': 3.0})
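The `global_step=21` above can be reproduced by hand, which also suggests why the Step/Training Loss/Validation Loss table is empty: with `eval_steps=500` and `logging_steps=500`, a 21-step run never reaches the first evaluation or logging step. The sketch below assumes a single device, no gradient accumulation, and 3 epochs (consistent with `'epoch': 3.0` in the output). The helper `total_optimizer_steps` is hypothetical.

```python
import math

def total_optimizer_steps(num_samples, batch_size, num_epochs):
    # One optimizer step per batch; the last batch of an epoch may be smaller.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

# Values from this CPU run: 50 training samples, batch size 8, 3 epochs.
steps = total_optimizer_steps(num_samples=50, batch_size=8, num_epochs=3)
eval_steps = 500

print(steps)                # 21
print(steps >= eval_steps)  # False: the first evaluation step is never reached
```

On the full training set with a GPU the step count is much larger, so the 500-step interval would be reached there.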

# **5. Save the Best Model**

In [17]:
trainer.save_model(models_dir)

# **6. Load the Saved Model**

In [20]:
model = AutoModelForSequenceClassification.from_pretrained(models_dir)

# **7. Create the Pipeline**

In [21]:
from transformers import TextClassificationPipeline, AutoTokenizer

# Load the tokenizer that matches the fine-tuned checkpoint (bert-base-uncased)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    task="sentiment-analysis"
)
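For a single-label model with more than one class, the pipeline's `score` is understood to be the softmax probability of the top-scoring label. The sketch below shows that conversion in plain Python. The `softmax` helper, the toy logits, and `toy_id2label` are illustrative assumptions, not values produced by the model.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Label mapping copied from the notebook's debug output; the logits are made up.
toy_id2label = {0: "anger", 1: "joy", 2: "optimism", 3: "sadness"}
toy_logits = [1.2, -0.3, -0.5, 0.1]

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)

# Same shape of result as one pipeline prediction: a label and its probability.
print(f"Prediction: {toy_id2label[best]}, Score: {probs[best]:.4f}")
```

A score only slightly above 0.5 therefore means the remaining probability mass is spread over the other three labels.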




# **8. Use the Model for Inference**

In [26]:
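import random

# Pool the validation and test tweets, then sample 20 of them at random
candidate_tweets = list(ds['validation']['text']) + list(ds['test']['text'])

test_tweets = random.sample(candidate_tweets, k=20)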

for tweet in test_tweets:
    result = pipeline(tweet)

    prediction = result[0]['label']
    score = result[0]['score']

    print(f"Prediction: {prediction}, Score: {score:.4f}")
    print(f"Tweet: {tweet}")
    print()

Prediction: anger, Score: 0.5717
Tweet: Why have #Emmerdale had to rob #robron of having their first child together for that vile woman/cheating sl smh #bitter

Prediction: anger, Score: 0.6783
Tweet: @user @user Oh, hidden revenge and anger...I rememberthe time,she rebutted you.

Prediction: anger, Score: 0.6606
Tweet: #RIPBiwott I think Robert oukos soul can now rejoice and rest in peace. Call an evil man evil and a good man a good man. He was an evil man

Prediction: anger, Score: 0.6064
Tweet: O, the melancholy Catacombs quickly wandered about the Rue Morgue, Madman!

Prediction: anger, Score: 0.5930
Tweet: Rin might ever appeared gloomy but to be a melodramatic person was not her thing.\n\nBut honestly, she missed her old friend. The special one.

Prediction: anger, Score: 0.6756
Tweet: Hey @user #Fields in #skibbereen give your online delivery service a horrible name. 1.5 hours late on the 1 hour delivery window.

Prediction: anger, Score: 0.6185
Tweet: @user Interesting choice o