
# Introduction to HLT Project (Template)

- Student Name: Arvin Jalali
- Date: 22 August 2024
- Chosen Corpus: imdb
- Contributions (if group project): no
- Declaration: I used ChatGPT in several ways during this project, and I reviewed its output critically rather than accepting it as given.

### Corpus information

- Description of the chosen corpus:
The Large Movie Review Dataset (IMDb) is a dataset for binary sentiment classification that contains substantially more data than earlier benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data.

- Paper(s) and other published materials related to the corpus: Several papers give a good introduction to binary sentiment analysis on the IMDb corpus with different machine learning methods. Two that suit beginners are listed below; the second is particularly helpful for understanding the key concepts and details. In addition to the papers, I found ChatGPT informative.

- "Machine Learning-Based Classification for Sentiment Analysis of IMDb Reviews" by Chun-Liang Wu and Song-Ling Shin, Stanford University.

- "Sentiment Analysis on IMDb Using Lexicon and Neural Networks" by Zeeshan Shaukat, Abdul Ahad Zulfiqar, Chuangbai Xiao, Muhammad Azeem, and Tariq Mahmood.

- State-of-the-art performance (best published results) on this corpus:
According to the Papers with Code leaderboard linked below, the best reported result for IMDb binary sentiment analysis is an accuracy of 96.68%, achieved by RoBERTa-large with LlamBERT, an approach that leverages large-scale, low-cost data annotation.
https://paperswithcode.com/sota/sentiment-analysis-on-imdb

---

## 1. Setup

In [11]:
# Install necessary libraries quietly
!pip install --quiet torch transformers datasets evaluate accelerate

# Import essential libraries for NLP and data handling
import numpy as np  # For numerical operations
import datasets  # For handling datasets
from datasets import load_dataset, load_metric  # For loading datasets and metrics
import transformers  # For pre-trained transformer models
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BertConfig, Trainer, TrainingArguments, EarlyStoppingCallback
import evaluate  # For evaluation utilities
from sklearn.metrics import accuracy_score, f1_score

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [7]:
# Your code to download the corpus here

# Define the name of the dataset to load
DATASET = 'imdb'

# Load the dataset builder for the specified dataset
# The builder is responsible for preparing and configuring the dataset
builder = datasets.load_dataset_builder(DATASET)

# Load the dataset itself into memory
# This will download and prepare the dataset for use
dataset = datasets.load_dataset(DATASET)


### 2.2. Preprocessing

In [8]:


# Load the full dataset
dataset = load_dataset('imdb')

# Split the dataset into train, validation, and test sets
# Here we split the original train set into train and validation sets
train_full = dataset['train']
test = dataset['test']
train, val = train_full.train_test_split(test_size=0.1, seed=42).values()

# Define mappings for label IDs and labels
id2label = {0: "neg", 1: "pos"}
label2id = {"neg": 0, "pos": 1}

# Specify the name of the pre-trained BERT model
MODEL_NAME = "bert-base-uncased"

# Load the tokenizer for the specified BERT model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

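# Load the pre-trained BERT model with a two-class classification head.
# Passing the label settings to from_pretrained keeps the checkpoint's own
# configuration instead of replacing it with a default BertConfig.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)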

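# Define a function to tokenize text examples.
# Sequences are truncated to the model's maximum length; padding is left to
# the data collator so that each batch is padded only to its longest sequence.
def tokenize(example, tokenizer):
    return tokenizer(example['text'], truncation=True)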

# Apply tokenization to the training dataset
train = train.map(lambda example: tokenize(example, tokenizer), batched=True)

# Apply tokenization to the validation dataset
val = val.map(lambda example: tokenize(example, tokenizer), batched=True)

# Apply tokenization to the test dataset
test = test.map(lambda example: tokenize(example, tokenizer), batched=True)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
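
Padding can be applied at tokenization time (`padding='max_length'`) or per batch by a data collator (dynamic padding). The following is a minimal pure-Python sketch of how much the two differ in padded size. It uses made-up token-ID lists and a hypothetical `pad_batch` helper that is not part of `transformers`; the maximum length of 512 and the pad ID of 0 are the values used by `bert-base-uncased`.

```python
from typing import List, Optional

PAD_ID = 0        # [PAD] token ID in bert-base-uncased
MAX_LENGTH = 512  # maximum sequence length of bert-base-uncased


def pad_batch(batch: List[List[int]], target_len: Optional[int] = None) -> List[List[int]]:
    """Right-pad every sequence to target_len, or to the longest sequence in the batch."""
    if target_len is None:
        target_len = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (target_len - len(seq)) for seq in batch]


# Three made-up tokenized reviews of different lengths (illustrative IDs only)
batch = [[101, 2023, 102], [101, 2023, 3185, 2001, 102], [101, 102]]

static = pad_batch(batch, MAX_LENGTH)  # fixed-length padding, as with padding='max_length'
dynamic = pad_batch(batch)             # per-batch padding, as a padding collator would do

static_tokens = sum(len(seq) for seq in static)
dynamic_tokens = sum(len(seq) for seq in dynamic)
print(static_tokens, dynamic_tokens)  # 1536 15
```

Real reviews are much longer than these toy sequences, so the saving is smaller in practice, but short reviews still avoid being processed as 512 tokens.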



---

## 3. Machine learning model

### 3.1. Model training

In [None]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

# Define the metric to evaluate model performance
# (load_metric is deprecated in datasets; the evaluate library replaces it)
accuracy_metric = evaluate.load("accuracy")

# Function to compute metrics from model predictions
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class labels
    predictions = np.argmax(predictions, axis=1)
    # Compute accuracy using the loaded metric
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Prepare a data collator for dynamic padding of batches
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

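# Set up early stopping: halt training after this many consecutive evaluations without improvement.
# With max_steps=500 and eval_steps=100 there are only five evaluations, so a patience of 5 cannot trigger in this run.
early_stopping_patience = 5
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)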

# Define training arguments and configuration
trainer_args = transformers.TrainingArguments(
    output_dir='checkpoints',                 # Directory to save model checkpoints
    evaluation_strategy='steps',              # Evaluate model every few steps
    logging_strategy='steps',                 # Log training progress every few steps
    load_best_model_at_end=True,              # Load the best model based on validation performance at the end of training
    eval_steps=100,                           # Number of steps between evaluations
    logging_steps=100,                        # Number of steps between logging
    learning_rate=0.00005,                    # Learning rate for the optimizer
    per_device_train_batch_size=8,            # Batch size for training
    per_device_eval_batch_size=32,            # Batch size for evaluation
    max_steps=500,                            # Maximum number of training steps
)

# Initialize the model for sequence classification with pre-trained weights
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Set up the Trainer with the model, arguments, datasets, and evaluation function
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=train,
    eval_dataset=val,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[early_stopping],
)

# Train the model
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs


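| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100 | 0.554 | 0.424969 | 0.8376 |
| 200 | 0.372 | 0.277872 | 0.89 |
| 300 | 0.3465 | 0.344154 | 0.8936 |
| 400 | 0.2849 | 0.335111 | 0.9024 |
| 500 | 0.3111 | 0.28825 | 0.9068 |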


TrainOutput(global_step=500, training_loss=0.37368193054199217, metrics={'train_runtime': 791.4928, 'train_samples_per_second': 5.054, 'train_steps_per_second': 0.632, 'total_flos': 1052444221440000.0, 'train_loss': 0.37368193054199217, 'epoch': 0.17774617845716317})
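
The early-stopping setting in this cell can be checked against the logged validation losses. The sketch below implements a simplified patience rule (stop after `patience` consecutive evaluations without a new best loss). It is an illustration only, not the exact logic of `transformers.EarlyStoppingCallback`, and `first_stop_index` is a hypothetical helper name.

```python
from typing import List, Optional


def first_stop_index(losses: List[float], patience: int) -> Optional[int]:
    """Return the index of the evaluation at which a simple patience rule stops, or None."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(losses):
        if loss < best:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Validation losses logged at steps 100, 200, 300, 400, 500 in the run above
val_losses = [0.424969, 0.277872, 0.344154, 0.335111, 0.28825]

print(first_stop_index(val_losses, patience=5))  # None: five evaluations are too few to trigger
print(first_stop_index(val_losses, patience=2))  # 3: would stop at the fourth evaluation (step 400)
```

Under this simplified rule, a patience of 5 never fires within five evaluations, whereas a smaller patience or a longer run would make early stopping effective.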

### 3.2 Hyperparameter optimization

In [None]:
# Your code for hyperparameter optimization here

from transformers import AutoModelForSequenceClassification, BertConfig, Trainer, TrainingArguments, EarlyStoppingCallback
import evaluate
import numpy as np

# Load the accuracy metric to evaluate model performance
# (load_metric is deprecated in datasets; the evaluate library replaces it)
accuracy_metric = evaluate.load("accuracy")

# Function to compute accuracy from model predictions
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert model outputs (logits) to predicted class labels
    predictions = np.argmax(predictions, axis=1)
    # Compute accuracy using the accuracy_metric loaded earlier
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Prepare a data collator for dynamic padding of batches
# This is necessary to handle sequences of different lengths in a batch
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

# Define the hyperparameter search space
# You should manually enter the learning rates you want to explore
learning_rates = [1e-5, 3e-5]  # Example learning rates to try
# Set a fixed number of epochs for all experiments
# You can adjust this value if needed
num_train_epochs = 1  # Number of epochs

# Function to train and evaluate the model with a given learning rate
def objective(learning_rate):
    # Define training arguments using the current learning rate
    trainer_args = TrainingArguments(
        output_dir='checkpoints',             # Directory to save model checkpoints
        evaluation_strategy='steps',          # Evaluate model every few steps
        logging_strategy='steps',             # Log training progress every few steps
        load_best_model_at_end=True,          # Load the best model at the end of training
        eval_steps=100,                       # Evaluate every 100 steps
        logging_steps=100,                    # Log every 100 steps
        learning_rate=learning_rate,          # Current learning rate (hyperparameter being optimized)
        per_device_train_batch_size=8,        # Training batch size per GPU/CPU
        per_device_eval_batch_size=32,        # Evaluation batch size per GPU/CPU
        num_train_epochs=num_train_epochs,    # Fixed number of epochs
        metric_for_best_model="accuracy",     # Metric to select the best model
        disable_tqdm=True,                    # Disable tqdm progress bar to prevent clutter
    )

    # Initialize the model with pre-trained weights for sequence classification,
    # keeping the checkpoint's own configuration
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2,
        id2label=id2label,
        label2id=label2id,
    )

    # Set up the Trainer with the model, training arguments, datasets, and evaluation function
    trainer = Trainer(
        model=model,
        args=trainer_args,
        train_dataset=train,                  # Training dataset
        eval_dataset=val,                     # Validation dataset
        compute_metrics=compute_metrics,      # Function to compute accuracy
        tokenizer=tokenizer,                  # Tokenizer used for data preprocessing
        data_collator=data_collator,          # Data collator for dynamic padding
        callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # Early stopping callback
    )

    # Train the model and store the result
    trainer.train()

    # Evaluate the model on the validation set
    eval_result = trainer.evaluate()

    # Return the evaluation metric (accuracy) as the result of this training run
    return eval_result['eval_accuracy']

# Perform a grid search over learning rate values
best_accuracy = 0  # Initialize the best accuracy as 0
best_learning_rate = None  # Initialize the best learning rate as None

# Iterate over each learning rate in the search space
for lr in learning_rates:
    print(f"Training with learning_rate={lr}")
    # Train and evaluate the model with the current learning rate
    accuracy = objective(lr)
    print(f"Validation accuracy: {accuracy}")

    # Update the best learning rate if the current one is better
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_learning_rate = lr

# Print out the best learning rate and corresponding validation accuracy
print(f"Best Learning Rate: {best_learning_rate}")
print(f"Best Validation Accuracy: {best_accuracy}")


Training with learning_rate=1e-05


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.6071, 'grad_norm': 22.04717445373535, 'learning_rate': 9.644507643085674e-06, 'epoch': 0.03554923569143263}
{'eval_loss': 0.39751434326171875, 'eval_accuracy': 0.8724, 'eval_runtime': 79.6334, 'eval_samples_per_second': 31.394, 'eval_steps_per_second': 0.992, 'epoch': 0.03554923569143263}
{'loss': 0.3551, 'grad_norm': 1.8306313753128052, 'learning_rate': 9.289015286171348e-06, 'epoch': 0.07109847138286526}
{'eval_loss': 0.28923749923706055, 'eval_accuracy': 0.8884, 'eval_runtime': 78.7593, 'eval_samples_per_second': 31.742, 'eval_steps_per_second': 1.003, 'epoch': 0.07109847138286526}
{'loss': 0.3233, 'grad_norm': 15.086372375488281, 'learning_rate': 8.933522929257021e-06, 'epoch': 0.10664770707429791}
{'eval_loss': 0.35335102677345276, 'eval_accuracy': 0.89, 'eval_runtime': 78.4014, 'eval_samples_per_second': 31.887, 'eval_steps_per_second': 1.008, 'epoch': 0.10664770707429791}
{'loss': 0.3124, 'grad_norm': 0.6667317152023315, 'learning_rate': 8.578030572342695e-06, 'epoch'

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.5603, 'grad_norm': 9.016888618469238, 'learning_rate': 2.893352292925702e-05, 'epoch': 0.03554923569143263}
{'eval_loss': 0.3524421751499176, 'eval_accuracy': 0.8636, 'eval_runtime': 78.5288, 'eval_samples_per_second': 31.835, 'eval_steps_per_second': 1.006, 'epoch': 0.03554923569143263}
{'loss': 0.3583, 'grad_norm': 1.3554198741912842, 'learning_rate': 2.786704585851404e-05, 'epoch': 0.07109847138286526}
{'eval_loss': 0.30092519521713257, 'eval_accuracy': 0.9016, 'eval_runtime': 78.2346, 'eval_samples_per_second': 31.955, 'eval_steps_per_second': 1.01, 'epoch': 0.07109847138286526}
{'loss': 0.3652, 'grad_norm': 8.327018737792969, 'learning_rate': 2.680056878777106e-05, 'epoch': 0.10664770707429791}
{'eval_loss': 0.3142629861831665, 'eval_accuracy': 0.8884, 'eval_runtime': 78.7468, 'eval_samples_per_second': 31.747, 'eval_steps_per_second': 1.003, 'epoch': 0.10664770707429791}
{'loss': 0.3163, 'grad_norm': 0.6078668832778931, 'learning_rate': 2.5734091717028084e-05, 'epoch':
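
The `epoch` and `learning_rate` values in these logs can be reproduced with a little arithmetic. The sketch below assumes a single device, 22,500 training examples (90% of the 25,000 training reviews), and a linear decay to zero without warmup, which is the default schedule of `TrainingArguments`. The helper names `epoch_at` and `linear_lr` are hypothetical.

```python
import math

TRAIN_EXAMPLES = 22_500  # 90% of the 25,000 training reviews
BATCH_SIZE = 8           # per_device_train_batch_size, assuming a single device

# One epoch is the number of batches needed to see every training example once
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)


def epoch_at(step: int) -> float:
    """Fraction of an epoch completed after the given number of steps."""
    return step / steps_per_epoch


def linear_lr(initial_lr: float, step: int, total_steps: int) -> float:
    """Learning rate under linear decay from initial_lr to 0 with no warmup."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


print(steps_per_epoch)                        # 2813
print(epoch_at(100))                          # about 0.0355, as in the first log line
print(linear_lr(1e-5, 100, steps_per_epoch))  # about 9.64e-06
print(linear_lr(3e-5, 100, steps_per_epoch))  # about 2.89e-05
```

The same arithmetic explains the `'epoch': 0.1777...` value reported by the 500-step runs: 500 steps of batch size 8 cover roughly 18% of the training split.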

### 3.3. Evaluation on test set

In [14]:
# Your code to evaluate the final model on the test set here

# Retrain with learning_rate=3e-05 and evaluate on the test set every 100 steps.
# Note: because the test set is passed as eval_dataset below, it also drives
# early stopping and best-checkpoint selection in this run.

# Define the metric to evaluate model performance
accuracy_metric = evaluate.load("accuracy")

# Function to compute metrics from model predictions
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Convert logits to predicted class labels
    predictions = np.argmax(predictions, axis=1)
    # Compute accuracy using the loaded metric
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Prepare a data collator for dynamic padding of batches
data_collator = transformers.DataCollatorWithPadding(tokenizer=tokenizer)

# Set up early stopping: halt training after this many consecutive evaluations without improvement
early_stopping_patience = 5
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=early_stopping_patience)

# Define training arguments and configuration
trainer_args = transformers.TrainingArguments(
    output_dir='checkpoints',                 # Directory to save model checkpoints
    evaluation_strategy='steps',              # Evaluate model every few steps
    logging_strategy='steps',                 # Log training progress every few steps
    load_best_model_at_end=True,              # Load the best model based on validation performance at the end of training
    eval_steps=100,                           # Number of steps between evaluations
    logging_steps=100,                        # Number of steps between logging
    learning_rate=3e-05,                    # Learning rate for the optimizer
    per_device_train_batch_size=8,            # Batch size for training
    per_device_eval_batch_size=32,            # Batch size for evaluation
    max_steps=500,                            # Maximum number of training steps
)

# Initialize the model for sequence classification with pre-trained weights
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Set up the Trainer with the model, arguments, datasets, and evaluation function
trainer = transformers.Trainer(
    model=model,
    args=trainer_args,
    train_dataset=train,
    eval_dataset=test,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    callbacks=[early_stopping],
)

# Train the model
trainer.train()





Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
max_steps is given, it will override any value given in num_train_epochs


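| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 100 | 0.5542 | 0.350992 | 0.8634 |
| 200 | 0.3437 | 0.323091 | 0.89052 |
| 300 | 0.3473 | 0.2777 | 0.90088 |
| 400 | 0.2617 | 0.303614 | 0.90424 |
| 500 | 0.2903 | 0.24746 | 0.91432 |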


TrainOutput(global_step=500, training_loss=0.3594414825439453, metrics={'train_runtime': 4135.9896, 'train_samples_per_second': 0.967, 'train_steps_per_second': 0.121, 'total_flos': 1052444221440000.0, 'train_loss': 0.3594414825439453, 'epoch': 0.17774617845716317})
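
The run above passes the test set as `eval_dataset`, so the test data is also seen during training-time monitoring. A common alternative is to keep the validation split as `eval_dataset` while training and to score the test set once afterwards. The sketch below assumes a trainer object exposing the `evaluate(eval_dataset=..., metric_key_prefix=...)` method of `transformers.Trainer`. The names `report_test_metrics` and `StubTrainer` are hypothetical, and the stub returns a placeholder value, not a measured result.

```python
from typing import Any, Dict


def report_test_metrics(trainer: Any, test_dataset: Any) -> Dict[str, float]:
    """Score the held-out test set once, after model selection on validation data."""
    # metric_key_prefix="test" makes the returned keys read test_accuracy, test_loss, ...
    return trainer.evaluate(eval_dataset=test_dataset, metric_key_prefix="test")


class StubTrainer:
    """Stand-in for transformers.Trainer so the sketch runs without a model or GPU."""

    def evaluate(self, eval_dataset=None, metric_key_prefix="eval"):
        # Placeholder value for illustration only, not a real measurement
        return {f"{metric_key_prefix}_accuracy": 0.5}


metrics = report_test_metrics(StubTrainer(), test_dataset=[])
print(metrics)  # {'test_accuracy': 0.5}
```

With the real objects from this notebook, the call would be `report_test_metrics(trainer, test)` after a training run that used `eval_dataset=val`.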

---

## 4. Results and summary

### 4.1 Corpus insights

In my project, I used the bert-base-uncased model to perform binary sentiment analysis on the IMDb dataset. Through this process, I got insights into both the dataset and the chosen model.

The IMDb corpus used for sentiment analysis consists of movie reviews, each labeled as either positive or negative according to its overall sentiment, which makes this a binary classification task. The labels are derived from the user ratings that accompany the reviews: positive reviews correspond to ratings of 7 or above out of 10, and negative reviews to ratings of 4 or below. This labeled data is crucial for training and evaluating sentiment analysis models, allowing them to learn to distinguish positive from negative sentiment based on textual content.

The IMDb dataset provides a diverse range of text data, making it ideal for training and evaluating models on sentiment classification tasks. The large volume of reviews in the dataset ensures that the model encounters a variety of language patterns and expressions, which is necessary for robust model performance.

The bert-base-uncased model, a variant of BERT, proved to be a suitable choice for this task. Because it is "uncased", its tokenizer lowercases the input, so words are treated the same regardless of capitalization. This suits sentiment analysis in most cases, since the sentiment of a word or phrase rarely depends on capitalization, although emphasis expressed through capital letters is lost. The model's bidirectional self-attention, which uses context on both sides of each token, likely contributed to its strong performance in classifying sentiment.
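
To make the effect of an uncased model concrete, here is a minimal sketch of the kind of normalization an uncased BERT tokenizer applies before splitting text into word pieces: lowercasing and accent stripping. The function `uncased_normalize` is a hypothetical helper written for illustration; it approximates that preprocessing and is not the tokenizer's actual code.

```python
import unicodedata


def uncased_normalize(text: str) -> str:
    """Lowercase text and strip accents, roughly as an uncased BERT tokenizer does."""
    lowered = text.lower()
    # Decompose accented characters, then drop the combining marks (category Mn)
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(uncased_normalize("This movie was GREAT!"))  # this movie was great!
print(uncased_normalize("A Café Classic"))         # a cafe classic
```

After this step, "GREAT" and "great" are indistinguishable to the model, which is the trade-off of choosing an uncased checkpoint.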

Overall, working with the IMDb dataset and the bert-base-uncased model highlighted the importance of leveraging pre-trained models for complex NLP tasks. The model's versatility and strong contextual understanding made it a powerful tool for sentiment analysis, achieving relatively high accuracy in classifying movie reviews.

### 4.2 Results

The search in section 3.2 compared two learning rates, 1e-05 and 3e-05, and the final run in section 3.3 used 3e-05. That run was monitored over 500 training steps with a batch size of 8, which covers about 0.18 of an epoch. Because the test set was passed as the evaluation dataset in that run, the "Validation Loss" and "Accuracy" columns of its log are measured on the test set:

- Step 100: training loss 0.5542, evaluation loss 0.3510, accuracy 86.34%.
- Step 200: training loss 0.3437, evaluation loss 0.3231, accuracy 89.05%.
- Step 300: training loss 0.3473 (a marginal increase), evaluation loss 0.2777, accuracy 90.09%.
- Step 400: training loss 0.2617, evaluation loss 0.3036 (a slight increase), accuracy 90.42%.
- Step 500: training loss 0.2903, evaluation loss 0.2475 (the lowest value), accuracy 91.43% (the highest value).

Accuracy improved at every evaluation, while the training and evaluation losses fluctuated from step to step but trended downward overall. Accuracy was still rising at step 500, so longer training might improve the result further. Since the test set was also used for monitoring during this run, the 91.43% figure may be a slightly optimistic estimate of how well the model generalizes.
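
The accuracy figures above come from taking the argmax over the two output logits of each review and counting matches with the gold labels. A minimal pure-Python sketch of that computation follows, using made-up logits; `argmax` and `accuracy` are illustrative helpers, not the `evaluate` library's implementation.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits: List[Sequence[float]], labels: List[int]) -> float:
    """Share of examples whose highest-scoring class matches the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews; columns are (neg, pos)
toy_logits = [(2.1, -1.3), (-0.4, 0.9), (0.2, 0.1), (-1.5, 2.2)]
toy_labels = [0, 1, 1, 1]
print(accuracy(toy_logits, toy_labels))  # 0.75

# The reported step-500 accuracy expressed as a count over the 25,000 test reviews
correct_reviews = round(0.91432 * 25_000)
print(correct_reviews)  # 22858
```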

### 4.3 Relation to state of the art

In my project, I achieved an accuracy of 91.43% on the IMDb test set using bert-base-uncased fine-tuned for 500 steps. This is a strong result for such a short training run, but the best reported result for this task is 96.68%, achieved by RoBERTa-large with LlamBERT. LlamBERT leverages large-scale, low-cost data annotation techniques, which likely contribute to its superior performance. Although bert-base-uncased is effective, widely used, accessible, and efficient, the LlamBERT result shows what larger models and innovative data annotation methods can add.
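
The gap between the two results is easier to judge in terms of error rates than accuracies. The short sketch below works through that arithmetic from the two percentages quoted above; `error_rate` is a hypothetical helper.

```python
def error_rate(accuracy_pct: float) -> float:
    """Percentage of reviews classified incorrectly."""
    return 100.0 - accuracy_pct


mine, sota = 91.43, 96.68

gap = round(sota - mine, 2)                                   # difference in percentage points
error_ratio = round(error_rate(mine) / error_rate(sota), 2)   # how many times more errors

print(gap)          # 5.25
print(error_ratio)  # 2.58
```

In other words, the 5.25-point accuracy gap corresponds to roughly 2.6 times as many misclassified reviews.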

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2 Conversion into dataset

In [None]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [None]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [None]:
# Include your annotated out-of-domain data here