
# Introduction to HLT Project

- Student Name: Aaron Leino
- Date: 6.5.2025
- Chosen Corpus: Stanford Sentiment Treebank (SST-2)

### Corpus information

- Description of the chosen corpus: The Stanford Sentiment Treebank contains movie review sentences from Rotten Tomatoes. The reviews are annotated with binary sentiment labels: positive and negative.
- Paper(s) and other published materials related to the corpus: Socher et al., 2013: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
- State-of-the-art performance (best published results) on this corpus: T5-11B and MT-DNN-SMART, both with 97.5% accuracy. (Papers With Code)

---

## 1. Setup

In [15]:
!pip install -q datasets transformers evaluate optuna
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np
import torch
import optuna

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [10]:
dataset = load_dataset("glue", "sst2")
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})


### 2.2. Preprocessing

In [17]:
# Use the DistilBERT tokenizer

distilbert = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(distilbert)

# Tokenize sentences, truncating or padding each one to 128 tokens
def tokenize_func(inputdata):
    return tokenizer(inputdata["sentence"], truncation=True, padding="max_length", max_length=128)
tokenized_dataset = dataset.map(tokenize_func, batched=True)

# The test split does not have labels, so I hold out 200 training examples as the test set.
split1 = tokenized_dataset["train"].train_test_split(test_size=200, seed=2025)
test_dataset = split1["test"]
remaining_train = split1["train"]
# Train on a subset of 1,000 examples
train_dataset = remaining_train.select(range(1000))

# Use the first 200 validation examples for the hyperparameter search
validation_dataset = tokenized_dataset["validation"].select(range(200))
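
The tokenizer call above pads every sentence to 128 tokens. As a rough illustration of what that costs for short sentences, the sketch below compares fixed-length padding with padding to the longest sentence in a batch. The token counts are hypothetical values chosen for the example, not measurements from SST-2.

```python
# Hypothetical token counts for one batch of short sentences (not measured on SST-2).
token_lengths = [9, 14, 22, 31, 12, 18, 27, 40]
MAX_LENGTH = 128  # same value as max_length in tokenize_func


def padded_total(lengths, target):
    """Total number of token slots when every sequence is padded to `target`."""
    return len(lengths) * target


real_tokens = sum(token_lengths)
fixed_slots = padded_total(token_lengths, MAX_LENGTH)
dynamic_slots = padded_total(token_lengths, max(token_lengths))

# Share of slots that hold padding instead of real tokens.
fixed_padding_share = 1 - real_tokens / fixed_slots
dynamic_padding_share = 1 - real_tokens / dynamic_slots

print(f"real tokens: {real_tokens}")
print(f"padding share at max_length={MAX_LENGTH}: {fixed_padding_share:.1%}")
print(f"padding share with per-batch padding: {dynamic_padding_share:.1%}")
```

In `transformers`, per-batch padding is usually obtained by leaving out `padding="max_length"` in the tokenizer call and passing `DataCollatorWithPadding(tokenizer)` as the `data_collator` of the `Trainer`. That change was not made or tested in this notebook.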


---

## 3. Machine learning model

### 3.1. Model training

In [18]:

# Define accuracy metrics
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
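
To make the metric concrete, here is a small worked example of what `compute_metrics` does: take the index of the largest logit in each row as the predicted class, then count the share of predictions that match the labels. The logits and labels are made-up values, and the sketch uses plain Python in place of `numpy` and `evaluate`.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


# Made-up logits for four sentences; column 0 = negative, column 1 = positive.
logits = [[2.1, -1.3], [-0.4, 0.9], [0.3, 0.2], [-1.0, 1.5]]
labels = [0, 1, 1, 1]

predictions = [argmax(row) for row in logits]
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy_value = correct / len(labels)

print("predictions:", predictions)
print("accuracy:", accuracy_value)
```

The third sentence is misclassified because its negative logit is slightly larger, so three of the four predictions match and the accuracy is 0.75.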

### 3.2. Hyperparameter optimization

In [19]:
def model_init():  # return a fresh model for each trial
    return AutoModelForSequenceClassification.from_pretrained(distilbert, num_labels=2)

# Use Optuna for hyperparameter optimization
# Define the search space for Optuna

def optuna_hp_space(ex):
    return {
      "learning_rate": ex.suggest_categorical("learning_rate", [2e-5, 3e-5, 5e-5]),
      "num_train_epochs": ex.suggest_int("num_train_epochs", 2, 3),
      "per_device_train_batch_size": ex.suggest_categorical("per_device_train_batch_size", [8, 16]),
      # suggest_loguniform is deprecated; suggest_float with log=True samples the same log-uniform range
      "weight_decay": ex.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
    }

#define the base arguments for optuna
base_args = TrainingArguments(
    output_dir="./optuna_results",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="no",               
    load_best_model_at_end=False,     
    disable_tqdm=True
)

# hyperparameter search
trainer = Trainer(
    model_init=model_init,
    args=base_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    compute_metrics=compute_metrics,
)
best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    n_trials=12,
    hp_space=optuna_hp_space,
)


print("Best hyperparameters:", best_run.hyperparameters)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2025-05-07 11:14:41,290] A new study created in memory with name: no-name-7eef530c-c6c9-4846-b96d-9eaf835bb3bc


{'loss': 0.5075, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}
{'eval_loss': 0.4407897889614105, 'eval_accuracy': 0.8, 'eval_runtime': 5.5042, 'eval_samples_per_second': 36.336, 'eval_steps_per_second': 4.542, 'epoch': 1.0}




{'loss': 0.1864, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}
{'eval_loss': 0.6523927450180054, 'eval_accuracy': 0.79, 'eval_runtime': 4.7747, 'eval_samples_per_second': 41.887, 'eval_steps_per_second': 5.236, 'epoch': 2.0}




{'loss': 0.045, 'learning_rate': 0.0, 'epoch': 3.0}


[I 2025-05-07 11:20:31,150] Trial 0 finished with value: 0.83 and parameters: {'learning_rate': 5e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.002314276187863276}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.8185315132141113, 'eval_accuracy': 0.83, 'eval_runtime': 5.0142, 'eval_samples_per_second': 39.887, 'eval_steps_per_second': 4.986, 'epoch': 3.0}
{'train_runtime': 349.4683, 'train_samples_per_second': 8.584, 'train_steps_per_second': 1.073, 'train_loss': 0.24630388641357423, 'epoch': 3.0}


{'loss': 0.6155, 'learning_rate': 1e-05, 'epoch': 1.0}
{'eval_loss': 0.48285767436027527, 'eval_accuracy': 0.8, 'eval_runtime': 5.6789, 'eval_samples_per_second': 35.218, 'eval_steps_per_second': 4.402, 'epoch': 1.0}




{'loss': 0.3388, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:24:10,657] Trial 1 finished with value: 0.785 and parameters: {'learning_rate': 2e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'weight_decay': 1.6968814162145983e-05}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.42116352915763855, 'eval_accuracy': 0.785, 'eval_runtime': 5.2289, 'eval_samples_per_second': 38.249, 'eval_steps_per_second': 4.781, 'epoch': 2.0}
{'train_runtime': 219.0378, 'train_samples_per_second': 9.131, 'train_steps_per_second': 0.575, 'train_loss': 0.47712254902673146, 'epoch': 2.0}


{'loss': 0.559, 'learning_rate': 1.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4359535276889801, 'eval_accuracy': 0.815, 'eval_runtime': 5.0999, 'eval_samples_per_second': 39.216, 'eval_steps_per_second': 4.902, 'epoch': 1.0}




{'loss': 0.267, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:27:53,845] Trial 2 finished with value: 0.8 and parameters: {'learning_rate': 3e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'weight_decay': 6.146881799785113e-06}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.4291819632053375, 'eval_accuracy': 0.8, 'eval_runtime': 5.2897, 'eval_samples_per_second': 37.81, 'eval_steps_per_second': 4.726, 'epoch': 2.0}
{'train_runtime': 222.7167, 'train_samples_per_second': 8.98, 'train_steps_per_second': 0.566, 'train_loss': 0.4129905246552967, 'epoch': 2.0}


{'loss': 0.559, 'learning_rate': 1.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4359535276889801, 'eval_accuracy': 0.815, 'eval_runtime': 5.1646, 'eval_samples_per_second': 38.725, 'eval_steps_per_second': 4.841, 'epoch': 1.0}




{'loss': 0.267, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:31:34,524] Trial 3 finished with value: 0.8 and parameters: {'learning_rate': 3e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'weight_decay': 0.00010167643831554165}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.4291819632053375, 'eval_accuracy': 0.8, 'eval_runtime': 5.3636, 'eval_samples_per_second': 37.288, 'eval_steps_per_second': 4.661, 'epoch': 2.0}
{'train_runtime': 220.2741, 'train_samples_per_second': 9.08, 'train_steps_per_second': 0.572, 'train_loss': 0.4129905246552967, 'epoch': 2.0}


{'loss': 0.5323, 'learning_rate': 2.5e-05, 'epoch': 1.0}
{'eval_loss': 0.39774540066719055, 'eval_accuracy': 0.8, 'eval_runtime': 5.3571, 'eval_samples_per_second': 37.333, 'eval_steps_per_second': 4.667, 'epoch': 1.0}




{'loss': 0.2026, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:35:13,439] Trial 4 finished with value: 0.805 and parameters: {'learning_rate': 5e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'weight_decay': 0.002134381267111705}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.49984779953956604, 'eval_accuracy': 0.805, 'eval_runtime': 5.1832, 'eval_samples_per_second': 38.586, 'eval_steps_per_second': 4.823, 'epoch': 2.0}
{'train_runtime': 218.093, 'train_samples_per_second': 9.17, 'train_steps_per_second': 0.578, 'train_loss': 0.36744473472474115, 'epoch': 2.0}


{'loss': 0.5113, 'learning_rate': 1.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4693756401538849, 'eval_accuracy': 0.805, 'eval_runtime': 14.1161, 'eval_samples_per_second': 14.168, 'eval_steps_per_second': 1.771, 'epoch': 1.0}




{'loss': 0.2271, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:41:30,916] Trial 5 finished with value: 0.8 and parameters: {'learning_rate': 3e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'weight_decay': 6.319186005153657e-05}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.5429187417030334, 'eval_accuracy': 0.8, 'eval_runtime': 13.3434, 'eval_samples_per_second': 14.989, 'eval_steps_per_second': 1.874, 'epoch': 2.0}
{'train_runtime': 376.7123, 'train_samples_per_second': 5.309, 'train_steps_per_second': 0.664, 'train_loss': 0.36920738220214844, 'epoch': 2.0}


{'loss': 0.5019, 'learning_rate': 2.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4312130808830261, 'eval_accuracy': 0.81, 'eval_runtime': 13.2333, 'eval_samples_per_second': 15.113, 'eval_steps_per_second': 1.889, 'epoch': 1.0}




{'loss': 0.1818, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:49:15,222] Trial 6 finished with value: 0.82 and parameters: {'learning_rate': 5e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'weight_decay': 0.00011899112437839468}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.6180638074874878, 'eval_accuracy': 0.82, 'eval_runtime': 4.3194, 'eval_samples_per_second': 46.303, 'eval_steps_per_second': 5.788, 'epoch': 2.0}
{'train_runtime': 463.724, 'train_samples_per_second': 4.313, 'train_steps_per_second': 0.539, 'train_loss': 0.3418203659057617, 'epoch': 2.0}


{'loss': 0.511, 'learning_rate': 1.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4685359299182892, 'eval_accuracy': 0.805, 'eval_runtime': 4.7016, 'eval_samples_per_second': 42.539, 'eval_steps_per_second': 5.317, 'epoch': 1.0}




{'loss': 0.2266, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:52:56,608] Trial 7 finished with value: 0.8 and parameters: {'learning_rate': 3e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 8, 'weight_decay': 0.00478185475363006}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.5440424680709839, 'eval_accuracy': 0.8, 'eval_runtime': 4.4799, 'eval_samples_per_second': 44.644, 'eval_steps_per_second': 5.58, 'epoch': 2.0}
{'train_runtime': 220.9718, 'train_samples_per_second': 9.051, 'train_steps_per_second': 1.131, 'train_loss': 0.36881552124023437, 'epoch': 2.0}




{'loss': 0.559, 'learning_rate': 1.5e-05, 'epoch': 1.0}
{'eval_loss': 0.4359535276889801, 'eval_accuracy': 0.815, 'eval_runtime': 4.4489, 'eval_samples_per_second': 44.955, 'eval_steps_per_second': 5.619, 'epoch': 1.0}




{'loss': 0.267, 'learning_rate': 0.0, 'epoch': 2.0}


[I 2025-05-07 11:56:14,302] Trial 8 finished with value: 0.8 and parameters: {'learning_rate': 3e-05, 'num_train_epochs': 2, 'per_device_train_batch_size': 16, 'weight_decay': 0.0009414577897716729}. Best is trial 0 with value: 0.83.


{'eval_loss': 0.4291819632053375, 'eval_accuracy': 0.8, 'eval_runtime': 4.4205, 'eval_samples_per_second': 45.244, 'eval_steps_per_second': 5.655, 'epoch': 2.0}
{'train_runtime': 197.3346, 'train_samples_per_second': 10.135, 'train_steps_per_second': 0.639, 'train_loss': 0.4129905246552967, 'epoch': 2.0}




{'loss': 0.5556, 'learning_rate': 1.9999999999999998e-05, 'epoch': 1.0}


[I 2025-05-07 11:57:53,261] Trial 9 pruned. 


{'eval_loss': 0.47725075483322144, 'eval_accuracy': 0.775, 'eval_runtime': 4.4775, 'eval_samples_per_second': 44.668, 'eval_steps_per_second': 5.584, 'epoch': 1.0}


{'loss': 0.5076, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


[I 2025-05-07 12:00:30,516] Trial 10 pruned. 


{'eval_loss': 0.4480636715888977, 'eval_accuracy': 0.8, 'eval_runtime': 13.4333, 'eval_samples_per_second': 14.888, 'eval_steps_per_second': 1.861, 'epoch': 1.0}


{'loss': 0.5078, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}


[I 2025-05-07 12:05:05,203] Trial 11 pruned. 


{'eval_loss': 0.4426151216030121, 'eval_accuracy': 0.8, 'eval_runtime': 13.4545, 'eval_samples_per_second': 14.865, 'eval_steps_per_second': 1.858, 'epoch': 1.0}
Best hyperparameters: {'learning_rate': 5e-05, 'num_train_epochs': 3, 'per_device_train_batch_size': 8, 'weight_decay': 0.002314276187863276}
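
The `weight_decay` values in the trial log range from about 6e-6 to 5e-3 because the search space samples this parameter on a logarithmic scale between 1e-6 and 1e-1. The sketch below shows the idea with the standard library only: draw uniformly in log space and exponentiate, so that each power of ten gets roughly the same share of samples. It is an illustration of log-uniform sampling, not Optuna's own sampler; the function name and the seed are assumptions made for the example.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(2025)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(rng, 1e-6, 1e-1) for _ in range(10_000)]

# Count how many samples fall into each power-of-ten interval.
decade_counts = {}
for value in samples:
    decade = math.floor(math.log10(value))
    decade_counts[decade] = decade_counts.get(decade, 0) + 1

for decade in sorted(decade_counts):
    print(f"1e{decade} to 1e{decade + 1}: {decade_counts[decade]} samples")
```

With plain uniform sampling over the same range, about 90% of the values would lie above 1e-2, so small weight decay values would almost never be tried.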


### 3.3. Evaluation on test set

In [20]:

# Configure the final training run with the best hyperparameters from the search above
hp = best_run.hyperparameters
best_args = TrainingArguments(
    output_dir="./best_model",
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=hp["learning_rate"],
    per_device_train_batch_size=hp["per_device_train_batch_size"],
    num_train_epochs=hp["num_train_epochs"],
    weight_decay=hp.get("weight_decay", 0.0),
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    disable_tqdm=True,
)

# Combine the training and validation subsets into a larger training set
from datasets import concatenate_datasets
full_train = concatenate_datasets([train_dataset, validation_dataset])

# Note: eval_dataset is the training data itself, so the per-epoch eval_accuracy
# (and the checkpoint picked by load_best_model_at_end) reflects training accuracy.
best_trainer = Trainer(
    model=model_init(),
    args=best_args,
    train_dataset=full_train,
    eval_dataset=full_train,
    compute_metrics=compute_metrics,
)

# Train the model
best_trainer.train()

# Evaluate on the held-out test set
test_results = best_trainer.evaluate(test_dataset)
print("Test accuracy:", test_results["eval_accuracy"])

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'loss': 0.4702, 'learning_rate': 3.3333333333333335e-05, 'epoch': 1.0}
{'eval_loss': 0.15192928910255432, 'eval_accuracy': 0.9525, 'eval_runtime': 30.2194, 'eval_samples_per_second': 39.71, 'eval_steps_per_second': 4.964, 'epoch': 1.0}




{'loss': 0.1774, 'learning_rate': 1.6666666666666667e-05, 'epoch': 2.0}
{'eval_loss': 0.035744521766901016, 'eval_accuracy': 0.9925, 'eval_runtime': 26.4285, 'eval_samples_per_second': 45.406, 'eval_steps_per_second': 5.676, 'epoch': 2.0}




{'loss': 0.0567, 'learning_rate': 0.0, 'epoch': 3.0}
{'eval_loss': 0.018343310803174973, 'eval_accuracy': 0.9958333333333333, 'eval_runtime': 26.626, 'eval_samples_per_second': 45.069, 'eval_steps_per_second': 5.634, 'epoch': 3.0}
{'train_runtime': 464.9755, 'train_samples_per_second': 7.742, 'train_steps_per_second': 0.968, 'train_loss': 0.2347829967074924, 'epoch': 3.0}




{'eval_loss': 0.7507790923118591, 'eval_accuracy': 0.835, 'eval_runtime': 4.3702, 'eval_samples_per_second': 45.765, 'eval_steps_per_second': 5.721, 'epoch': 3.0}
Test accuracy: 0.835
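
The logged learning rates in the final run (3.33e-05, 1.67e-05 and 0.0 after epochs 1, 2 and 3) appear consistent with a linear decay from the initial value of 5e-5 to zero without warmup, which is the default schedule of `TrainingArguments`. The sketch below reproduces those numbers under that assumption; `linear_decay_lr` is a helper written for this example, not a library function.

```python
import math


def linear_decay_lr(initial_lr, step, total_steps):
    """Learning rate after `step` optimizer steps of a linear decay to zero (no warmup)."""
    return initial_lr * max(0.0, (total_steps - step) / total_steps)


num_examples = 1200  # 1,000 training + 200 validation examples
batch_size = 8
num_epochs = 3
initial_lr = 5e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

for epoch in range(1, num_epochs + 1):
    lr = linear_decay_lr(initial_lr, epoch * steps_per_epoch, total_steps)
    print(f"end of epoch {epoch}: learning rate = {lr:.4e}")
```

The 450 total steps also agree with the log: about 0.968 training steps per second over roughly 465 seconds.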


---

## 4. Results and summary

### 4.1. Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2. Results

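The hyperparameter search ran 12 Optuna trials (three of them pruned) on 1,000 training examples, using accuracy on 200 validation examples as the objective. The best trial reached a validation accuracy of 0.83 with a learning rate of 5e-5, 3 epochs, a batch size of 8 and a weight decay of about 0.0023. The final model, retrained with these settings on the combined 1,200 training and validation examples, reached an accuracy of 0.835 on the 200 held-out test examples.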
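
Because the test set in section 3.3 has only 200 examples, the reported accuracy of 0.835 carries noticeable sampling uncertainty. As a rough check, the sketch below computes a 95% interval with the normal approximation to the binomial distribution. The choice of this approximation and the helper name are assumptions made for the example; other interval methods would give slightly different bounds.

```python
import math


def normal_approx_interval(accuracy, n, z=1.96):
    """Approximate confidence interval for an accuracy measured on n examples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


test_accuracy = 0.835  # from section 3.3
test_size = 200        # size of the held-out test set

low, high = normal_approx_interval(test_accuracy, test_size)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

The interval spans roughly 0.78 to 0.89, so differences of one or two percentage points between trials on 200 examples should be read with caution.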

### 4.3. Relation to state of the art

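The test accuracy of 83.5% is well below the 97.5% reported for T5-11B and MT-DNN-SMART. The two figures are not directly comparable: the model here is a DistilBERT fine-tuned on 1,200 of the 67,349 available training sentences and evaluated on 200 sentences held out from the training split.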

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

(Briefly describe the process of annotation)

### 5.2. Conversion into dataset

In [7]:
# Your code to convert the annotations into a dataset here

### 5.3. Model evaluation on out-of-domain test set

In [8]:
# Your code to evaluate the model on the out-of-domain test set here

### 5.4. Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [9]:
# Include your annotated out-of-domain data here