<a href="https://colab.research.google.com/github/heinohen/tko_7095_i2hlt/blob/main/Blomqvist_Heinonen_course_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to HLT Project (Template)

- Student(s) Name(s): Mika Blomqvist & Henrik Heinonen
- Date: 2024-05-02
- Chosen Corpus: imdb
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
- Paper(s) and other published materials related to the corpus: https://aclanthology.org/P11-1015.pdf
- State-of-the-art performance (best published results) on this corpus:

---

## 1. Setup

In [76]:
# Install and import the libraries used in this notebook
!pip3 install -q transformers[torch] datasets evaluate plotly optuna

import datasets
from datasets import load_dataset_builder
from datasets import load_dataset, DatasetDict
datasets.disable_progress_bar()

from pprint import pprint # Pretty print
import sklearn.feature_extraction

---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [45]:
# Your code to download the corpus here

def dataset_features ( data : str ) -> DatasetDict:

  dataset = datasets.load_dataset(data)
  builder = datasets.load_dataset_builder(data)

  print(builder.info.description)

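  osoittaja = 0 # numerator: number of rows in one subset
  nimittaja = 0 # denominator: number of rows in the whole dataset
  tulos = 0 # result: relative size of the subset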

  for rivi in dataset.keys():
    nimittaja += dataset[rivi].num_rows

  print(f"Total number of rows : {nimittaja} \n")
  print("Relative sizes of subsets in the dataset: \n")

  for rivi in dataset.keys():
    osoittaja = dataset[rivi].num_rows
    tulos = osoittaja/nimittaja

    print(f"{rivi}: {tulos:.0%}")


  print("\n---\n")
  train_dataset = dataset['train']
  label_names = train_dataset.features['label'].names
  train_dict = {}

  for indeksi in range(len(train_dataset)) :
    label_name = label_names[train_dataset[indeksi]['label']]
    if label_name not in train_dict :
      train_dict[label_name] = 1
    else:
      train_dict[label_name] += 1

  print("Distribution of labels in the 'train' subset of the dataset: \n")

  for avain, arvo in train_dict.items():
    tulos = arvo/len(train_dataset)
    print(f"{avain}:{tulos:.0%}")

  return dataset

data = "imdb"

dataset = dataset_features(data)
del dataset['unsupervised']

print(dataset)



Total number of rows : 100000 

Relative sizes of subsets in the dataset: 

train: 25%
test: 25%
unsupervised: 50%

---

Distribution of labels in the 'train' subset of the dataset: 

neg:50%
pos:50%
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


In [46]:
test_valid_split = dataset['test'].train_test_split(test_size=0.5)
dataset = DatasetDict({
    'train': dataset['train'],
    'test': test_valid_split['train'],
    'validate': test_valid_split['test']})

print(dataset)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
    validate: Dataset({
        features: ['text', 'label'],
        num_rows: 12500
    })
})


### 2.2. Preprocessing

In [47]:
# Your code for any necessary preprocessing here

In [48]:
dataset = dataset.shuffle()

In [50]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary = True, max_features = 20000)

texts = [ex['text'] for ex in dataset['train']]
vectorizer.fit(texts)

In [51]:
# Example from course

def vectorize_example(ex) -> dict:
  vectorized = vectorizer.transform([ex['text']]) # Transform documents to document-term matrix.
  non_zero_features = vectorized.nonzero()[1] # nonzero() on the SciPy sparse matrix returns (row_indices, column_indices); keep the column (feature) indices
  non_zero_features += 1 # Index 0 is reserved for padding, so shift every feature index up by one
  return {"input_ids":non_zero_features}

vectorized = vectorize_example(dataset['train'][0])
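
The idea behind `vectorize_example` can be sketched without scikit-learn. The toy code below only imitates a binary bag-of-words encoding, assuming lowercase tokens of at least two characters and an alphabetically indexed vocabulary; `build_vocabulary`, `to_input_ids`, and the toy texts are hypothetical names and data introduced for illustration, not part of the notebook.

```python
import re

TOKEN_PATTERN = r"\b\w\w+\b"  # assumption: tokens are runs of two or more word characters

def build_vocabulary(texts, max_features):
    """Map the most frequent lowercase tokens to column indices 0..n-1."""
    counts = {}
    for text in texts:
        for token in re.findall(TOKEN_PATTERN, text.lower()):
            counts[token] = counts.get(token, 0) + 1
    # Keep the most frequent tokens, then index the kept tokens alphabetically
    kept = sorted(sorted(counts, key=lambda t: (-counts[t], t))[:max_features])
    return {token: index for index, token in enumerate(kept)}

def to_input_ids(text, vocabulary):
    """Return the sorted indices of vocabulary tokens present in text, shifted by +1."""
    tokens = re.findall(TOKEN_PATTERN, text.lower())
    present = {vocabulary[t] for t in tokens if t in vocabulary}
    return [index + 1 for index in sorted(present)]  # 0 stays free for padding

toy_texts = ["a great great movie", "a boring movie"]
toy_vocabulary = build_vocabulary(toy_texts, max_features=10)
toy_ids = to_input_ids("great movie, great cast", toy_vocabulary)
print(toy_vocabulary)
print(toy_ids)
```

Each known word contributes its index once, however often it occurs, and unknown words such as "cast" are dropped.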

In [52]:
# Apply the vectorizer to the whole dataset using .map().
# Multiprocessing speeds this up by running several worker processes on the CPU;
# the num_proc parameter of map() sets the number of processes to use.
dset_tokenized = dataset.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])


{'input_ids': [295,
               309,
               351,
               774,
               807,
               860,
               887,
               959,
               1157,
               1744,
               1756,
               1793,
               2103,
               2257,
               2295,
               2604,
               2625,
               2924,
               3053,
               3317,
               3603,
               3669,
               3671,
               3672,
               3797,
               3950,
               4257,
               4906,
               4924,
               4963,
               5540,
               5559,
               5736,
               5896,
               6306,
               6308,
               6584,
               6904,
               6907,
               7082,
               7127,
               7327,
               7330,
               7394,
               7590,
               7757,
               7967,
               8128,


In [53]:

import torch

def collator(list_of_examples):
  batch = {'labels':torch.tensor(list(ex['label'] for ex in list_of_examples))} # Labels in to a single tensor
  tensors = []
  max_len = max(len(example['input_ids']) for example in list_of_examples) # Get the length of longest input
  # To build a tensor
  for e in list_of_examples:
    ids = torch.tensor(e['input_ids']) # Pick the input ids
    # https://pytorch.org/docs/stable/generated/torch.nn.functional.pad.html
    # pad(input, (left, right))
    padded = torch.nn.functional.pad(ids, (0, max_len - ids.shape[0]))
    tensors.append(padded)
  # https://pytorch.org/docs/stable/generated/torch.vstack.html
  batch['input_ids'] = torch.vstack(tensors) # Stack tensors in sequence vertically (row wise).
  return batch
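
What the collator does to the inputs can be shown on plain lists. This is a minimal sketch of right-padding to the longest sequence in a batch; `pad_batch` is a hypothetical helper used only for illustration, and it assumes 0 is the padding id, as in the cells above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

padded = pad_batch([[3, 7, 9], [5]])
print(padded)
```

Because every real feature index was shifted up by one during vectorization, a 0 in the padded batch can only mean padding.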

---

## 3. Machine learning model

### 3.1. Model training

In [54]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

In [55]:
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # uniform_() also overwrites the padding row, so reset row 0 to zeros;
        # padding_idx=0 only guarantees that this row receives no gradient updates
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): #nevermind the attention_mask, its time will come, data collator insists on adding it
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) non-linearity (disabled)
        # The tanh layer from the course exercise is commented out here, so the summed
        # embeddings go straight to the output layer and the model is linear in its inputs
        ####projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)
        logits=self.output(embedded_summed)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()


mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=2)
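
The forward pass above amounts to summing embedding rows and applying one linear layer. The sketch below repeats that computation on plain Python lists with made-up toy weights (all values are assumptions chosen for readability), and shows that padding ids leave the logits unchanged as long as embedding row 0 is all zeros.

```python
def forward(input_ids, embedding, weight, bias):
    """Sum the embedding rows of input_ids, then apply a linear output layer."""
    hidden_size = len(embedding[0])
    summed = [0.0] * hidden_size
    for token_id in input_ids:
        for d in range(hidden_size):
            summed[d] += embedding[token_id][d]
    # One logit per output row: dot(weight_row, summed) + bias
    return [sum(w * s for w, s in zip(row, summed)) + b for row, b in zip(weight, bias)]

embedding = [[0.0, 0.0],   # row 0: padding, must stay zero
             [1.0, 0.0],
             [0.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -0.5]

logits_unpadded = forward([1, 2], embedding, weight, bias)
logits_padded = forward([1, 2, 0, 0], embedding, weight, bias)
print(logits_unpadded, logits_padded)
```

If row 0 held non-zero values, longer batches would shift the logits of short documents by the number of padding positions.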


In [56]:
# And we can make a model
mlp = MLP(mlp_config)
fake_batch = collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) # ** unpacks the batch dict so that input_ids and labels are passed as keyword arguments

(tensor(0.8185, grad_fn=<NllLossBackward0>),
 tensor([[ 0.2279, -0.0007],
         [ 0.1343, -0.1105]], grad_fn=<AddmmBackward0>))

In [57]:
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments

trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps", # Evaluation is done (and logged) every eval_steps.
    logging_strategy="steps", #  Logging is done every logging_steps.
    eval_steps=500, # Number of update steps between two evaluations if evaluation_strategy="steps".
    # Will default to the same value as logging_steps if not set.
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    logging_steps=500, #  Number of update steps between two logs if logging_strategy="steps".
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    learning_rate=1e-4, #learning rate of the gradient descent
    # float, optional, defaults to 5e-5) — The initial learning rate.
    max_steps=20000, #  (int, optional, defaults to -1)
    # If set to a positive number, the total number of training steps to perform.
    # Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted)

    # until max_steps is reached.
    #num_train_epochs=5.0,
    load_best_model_at_end=True, # Whether or not to load the best model found during training at the end of training.
    # When this option is enabled, the best checkpoint will always be saved.
    per_device_train_batch_size = 64
)

pprint(trainer_args) #print if needed

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla

### 3.2 Hyperparameter optimization

In [58]:
# Your code for hyperparameter optimization here

In [59]:
# TODO: Build more hyperparameter tests

In [61]:
import numpy as np
import evaluate

# Evaluate is a library that makes evaluating and comparing models
# and reporting their performance easier and more standardized.
# https://pypi.org/project/evaluate/

accuracy = evaluate.load('accuracy')

def compute_accuracy(outputs_and_labels):
  outputs, labels = outputs_and_labels
  preds = np.argmax(outputs, axis = -1) # Returns the indices of the maximum values along an axis.
  # https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
  return accuracy.compute(predictions = preds, references = labels)
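
As a sanity check on what `compute_accuracy` reports, the same quantity can be computed by hand. This is a minimal sketch on invented toy logits and labels; `argmax` and `accuracy_from_logits` are hypothetical helpers, not functions of the `evaluate` library.

```python
def argmax(row):
    """Index of the largest value in a list (first one on ties)."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [[0.2, -0.1], [-1.0, 1.5], [0.3, 0.9], [2.0, -2.0]]
toy_labels = [0, 1, 0, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```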

In [62]:
# The argument is the patience: training stops once the evaluation loss
# has failed to improve for this many consecutive evaluations
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["validate"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5273,0.418552,0.863
1000,0.3215,0.321208,0.879
1500,0.2432,0.291784,0.882
2000,0.1976,0.281546,0.884
2500,0.1682,0.281514,0.886
3000,0.1437,0.282806,0.893
3500,0.1287,0.288915,0.89
4000,0.109,0.295761,0.889
4500,0.0981,0.307312,0.879
5000,0.0842,0.317462,0.878


TrainOutput(global_step=5000, training_loss=0.2021428825378418, metrics={'train_runtime': 104.4017, 'train_samples_per_second': 12260.34, 'train_steps_per_second': 191.568, 'total_flos': 31773544992.0, 'train_loss': 0.2021428825378418, 'epoch': 12.787723785166241})
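
The run above ends at step 5000 although `max_steps` is 20000. A small sketch of patience-based stopping, fed with the validation losses from the table, reproduces that stopping point. This illustrates the general technique under the assumption that "improvement" means a strictly lower loss; it is not the library's implementation, and `early_stopping_step` is a hypothetical helper.

```python
def early_stopping_step(eval_losses, eval_steps, patience):
    """Return (stop_step, best_step) for a sequence of evaluation losses."""
    best_loss = float("inf")
    best_step = None
    bad_evals = 0  # consecutive evaluations without improvement
    for index, loss in enumerate(eval_losses, start=1):
        step = index * eval_steps
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step, best_step
    return None, best_step  # never stopped early

# Validation losses copied from the training log above (one per 500 steps)
validation_losses = [0.418552, 0.321208, 0.291784, 0.281546, 0.281514,
                     0.282806, 0.288915, 0.295761, 0.307312, 0.317462]
stop_step, best_step = early_stopping_step(validation_losses, eval_steps=500, patience=5)
print(stop_step, best_step)
```

The lowest validation loss is reached at step 2500, and the fifth evaluation without improvement falls on step 5000, which matches `global_step=5000` in the output.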

### 3.3. Evaluation on test set

In [93]:
import optuna

def objective(trial):
    # Define the search space for hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 64, 128, 256])


    trainer_args = transformers.TrainingArguments(
        "mlp_checkpoints", #save checkpoints here
        evaluation_strategy="steps",
        logging_strategy="steps",
        eval_steps=500,
        logging_steps=500,
        learning_rate=learning_rate, #learning rate of the gradient descent
        max_steps=10000,
        load_best_model_at_end=True,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size
    )

    mlp = MLP(mlp_config)

    trainer = transformers.Trainer(
        model=mlp,
        args=trainer_args,
        train_dataset=dset_tokenized["train"],
        eval_dataset=dset_tokenized["validate"].select(range(1000)), #make a smaller subset to evaluate on
        compute_metrics=compute_accuracy,
        data_collator=collator,
        callbacks=[transformers.EarlyStoppingCallback(5)] # a fresh early-stopping callback for each trial
    )

    # Train the model, then evaluate the best checkpoint on the validation subset
    trainer.train()
    eval_results = trainer.evaluate()
    return eval_results["eval_accuracy"] # Optuna maximizes this validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=5)

[I 2024-05-07 07:26:32,880] A new study created in memory with name: no-name-339b72d7-42d4-4ce6-b950-9269a9b5e4fe
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.694,0.691334,0.504
1000,0.6882,0.687066,0.504
1500,0.6841,0.683344,0.548
2000,0.68,0.680054,0.613
2500,0.6765,0.677056,0.672
3000,0.6731,0.674293,0.702
3500,0.6704,0.671785,0.728
4000,0.6674,0.66949,0.742
4500,0.6649,0.667382,0.744
5000,0.6625,0.665471,0.75


[I 2024-05-07 07:29:48,999] Trial 0 finished with value: 0.795 and parameters: {'learning_rate': 1.066739746678008e-06, 'batch_size': 64}. Best is trial 0 with value: 0.795.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5824,0.510022,0.853
1000,0.4217,0.40756,0.862
1500,0.333,0.35418,0.873
2000,0.2823,0.323986,0.878
2500,0.2466,0.306454,0.88
3000,0.2219,0.295592,0.885
3500,0.2037,0.288871,0.885
4000,0.1874,0.284863,0.883
4500,0.1759,0.282705,0.881
5000,0.1637,0.281589,0.882


[I 2024-05-07 07:34:00,185] Trial 1 finished with value: 0.884 and parameters: {'learning_rate': 4.302032799816157e-05, 'batch_size': 128}. Best is trial 1 with value: 0.884.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.6805,0.670819,0.8
1000,0.6555,0.649926,0.819
1500,0.6315,0.630131,0.828
2000,0.6098,0.611733,0.832
2500,0.589,0.594849,0.839
3000,0.5705,0.579553,0.839
3500,0.5542,0.565878,0.838
4000,0.5392,0.553609,0.841
4500,0.5267,0.542758,0.844
5000,0.5145,0.533108,0.844


[I 2024-05-07 07:39:19,263] Trial 2 finished with value: 0.851 and parameters: {'learning_rate': 5.434730358469891e-06, 'batch_size': 128}. Best is trial 1 with value: 0.884.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.5133,0.411181,0.863
1000,0.3119,0.32322,0.875
1500,0.2337,0.295275,0.882
2000,0.1924,0.28411,0.883
2500,0.1629,0.281784,0.885
3000,0.1414,0.284918,0.887
3500,0.1252,0.288333,0.886
4000,0.1102,0.294786,0.883
4500,0.0997,0.303686,0.879
5000,0.0887,0.311257,0.878


[I 2024-05-07 07:41:58,344] Trial 3 finished with value: 0.885 and parameters: {'learning_rate': 8.379951734284147e-05, 'batch_size': 128}. Best is trial 3 with value: 0.885.
max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.7,0.697147,0.496
1000,0.6913,0.690751,0.497
1500,0.6848,0.685565,0.519
2000,0.6795,0.681044,0.539
2500,0.6744,0.676966,0.558
3000,0.6698,0.673243,0.577
3500,0.6659,0.669846,0.601
4000,0.6619,0.666705,0.615
4500,0.6587,0.663847,0.632
5000,0.6555,0.661231,0.646


[I 2024-05-07 07:47:17,366] Trial 4 finished with value: 0.687 and parameters: {'learning_rate': 1.172646301517167e-06, 'batch_size': 128}. Best is trial 3 with value: 0.885.
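
The learning rates tried above range from about 1e-6 to about 8e-5, which is what searching on a logarithmic scale looks like. The sketch below shows one common way to draw such values, by sampling uniformly between the logarithms of the bounds. It is an illustration under that assumption, not Optuna's sampler, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(rng, 1e-6, 1e-3) for _ in range(1000)]

# On a log scale the middle of the range is the geometric mean of the bounds
geometric_midpoint = math.sqrt(1e-6 * 1e-3)
print(min(samples), max(samples), geometric_midpoint)
```

With this kind of sampling, each decade between 1e-6 and 1e-3 is about equally likely, so five trials can easily land several times near the lower bound, as trials 0, 2 and 4 did.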


In [94]:
learn = study.best_params['learning_rate']

batchsize = study.best_params['batch_size']

In [95]:
print(learn, batchsize)

8.379951734284147e-05 128


In [96]:
# And we can make a model
vali = MLP(mlp_config)
fake_batch = collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) # ** unpacks the batch dict into keyword arguments; note that this call still uses the earlier mlp, not the newly created vali

(tensor(0.0448, grad_fn=<NllLossBackward0>),
 tensor([[-0.8853,  1.4817],
         [-5.7466,  7.7066]], grad_fn=<AddmmBackward0>))

In [99]:
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
early_stopping = transformers.EarlyStoppingCallback(5)
vali_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps", # Evaluation is done (and logged) every eval_steps.
    logging_strategy="steps", #  Logging is done every logging_steps.
    eval_steps=500, # Number of update steps between two evaluations if evaluation_strategy="steps".
    # Will default to the same value as logging_steps if not set.
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    logging_steps=500, #  Number of update steps between two logs if logging_strategy="steps".
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    learning_rate=learn, #learning rate of the gradient descent
    # float, optional, defaults to 5e-5) — The initial learning rate.
    max_steps=20000, #  (int, optional, defaults to -1)
    # If set to a positive number, the total number of training steps to perform.
    # Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted)

    # until max_steps is reached.
    #num_train_epochs=5.0,
    load_best_model_at_end=True, # Whether or not to load the best model found during training at the end of training.
    # When this option is enabled, the best checkpoint will always be saved.
    per_device_train_batch_size = batchsize
)

pprint(vali_args) #print if needed

vali_trainer = transformers.Trainer(
    model = vali,
    args = vali_args,
    train_dataset = dset_tokenized['train'],
    eval_dataset = dset_tokenized['test'].select(range(1000)),
    compute_metrics = compute_accuracy,
    data_collator = collator,
    callbacks = [early_stopping]
)

vali_trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.1413,0.236634,0.9
1000,0.1164,0.241385,0.899
1500,0.0965,0.250631,0.894
2000,0.0811,0.266354,0.893
2500,0.0681,0.28469,0.886
3000,0.0573,0.294407,0.891


TrainOutput(global_step=3000, training_loss=0.09346104049682617, metrics={'train_runtime': 138.095, 'train_samples_per_second': 18537.965, 'train_steps_per_second': 144.828, 'total_flos': 41158147968.0, 'train_loss': 0.09346104049682617, 'epoch': 15.306122448979592})

In [100]:
vali_trainer.predict(dset_tokenized['test'])

PredictionOutput(predictions=array([[-0.06904455,  0.2582905 ],
       [-3.7400405 ,  4.6512074 ],
       [-0.77783036,  1.1135826 ],
       ...,
       [ 1.1818006 , -1.2326403 ],
       [ 2.9604218 , -3.367769  ],
       [-1.2348528 ,  1.6630334 ]], dtype=float32), label_ids=array([0, 1, 1, ..., 0, 0, 1]), metrics={'test_loss': 0.28368690609931946, 'test_accuracy': 0.88816, 'test_runtime': 4.1213, 'test_samples_per_second': 3033.007, 'test_steps_per_second': 379.247})
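
The `predictions` array holds raw logits, one pair per review. A short sketch of how such a pair can be turned into class probabilities and a predicted label, using the first row of the output above; `softmax` here is a hand-written helper for illustration, not a call into the trainer.

```python
import math

def softmax(logits):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

first_logits = [-0.06904455, 0.2582905]  # first row of the predictions above
probabilities = softmax(first_logits)
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```

For this row the model leans towards label 1 with a probability of roughly 0.58, a fairly uncertain prediction compared with rows whose logits are several units apart.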

In [88]:
# And we can make a model
bonus = MLP(mlp_config)
fake_batch = collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) # ** unpacks the batch dict into keyword arguments; note that this call still uses the earlier mlp, not the newly created bonus

(tensor(0.0448, grad_fn=<NllLossBackward0>),
 tensor([[-0.8853,  1.4817],
         [-5.7466,  7.7066]], grad_fn=<AddmmBackward0>))

In [101]:
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments
early_stopping = transformers.EarlyStoppingCallback(5)
bonus_trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps", # Evaluation is done (and logged) every eval_steps.
    logging_strategy="steps", #  Logging is done every logging_steps.
    eval_steps=500, # Number of update steps between two evaluations if evaluation_strategy="steps".
    # Will default to the same value as logging_steps if not set.
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    logging_steps=500, #  Number of update steps between two logs if logging_strategy="steps".
    # Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
    learning_rate=learn, #learning rate of the gradient descent
    # float, optional, defaults to 5e-5) — The initial learning rate.
    max_steps=20000, #  (int, optional, defaults to -1)
    # If set to a positive number, the total number of training steps to perform.
    # Overrides num_train_epochs. For a finite dataset, training is reiterated through the dataset (if all data is exhausted)

    # until max_steps is reached.
    #num_train_epochs=5.0,
    load_best_model_at_end=True, # Whether or not to load the best model found during training at the end of training.
    # When this option is enabled, the best checkpoint will always be saved.
    per_device_train_batch_size = batchsize
)

pprint(bonus_trainer_args) #print if needed

bonus_trainer = transformers.Trainer(
    model = bonus,
    args = bonus_trainer_args,
    train_dataset = dset_tokenized['train'],
    eval_dataset = dset_tokenized['test'].select(range(1000)),
    compute_metrics = compute_accuracy,
    data_collator = collator,
    callbacks = [early_stopping]
)

bonus_trainer.train()

max_steps is given, it will override any value given in num_train_epochs


Step,Training Loss,Validation Loss,Accuracy
500,0.1199,0.243638,0.896
1000,0.1043,0.251385,0.894
1500,0.0838,0.262608,0.894
2000,0.0702,0.280113,0.889
2500,0.0587,0.299319,0.888
3000,0.0493,0.310243,0.89


TrainOutput(global_step=3000, training_loss=0.08103721110026042, metrics={'train_runtime': 102.342, 'train_samples_per_second': 25014.174, 'train_steps_per_second': 195.423, 'total_flos': 41158147968.0, 'train_loss': 0.08103721110026042, 'epoch': 15.306122448979592})

---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

- Tweets from https://github.com/MarkHershey/CompleteTrumpTweetsArchive, annotated by hand for text classification: 50 positive and 50 negative.



(Briefly describe the process of annotation)

The dataset we used for the annotation task consists of tweets by Donald Trump from before and after his inauguration. Because only 100 tweets were needed, each treated as an individual document, the process was straightforward. We selected 25 negative and 25 positive tweets from each period, before and after the inauguration, for a total of 100. For each tweet we first looked for strongly positive or negative words and then tried to decide whether the tweet was satirical. Borderline cases, where the perceived sentiment depends on the reader's position on the political spectrum, were discarded in this small task; with a larger dataset we would have to reconsider how to handle them. For testing purposes we tried to select tweets that were as clearly positive or negative as possible. Annotation was quick, because the dataset is small and tweets are very short documents. We found the contents of the tweets interesting, as they display polarity between the two time frames.

The annotation process also has an ethical side: it involved reading a lot of hate speech, which in large amounts can be harmful to annotators' mental well-being. We can only imagine what it feels like to do this for a living for a small monetary compensation.

### 5.2 Conversion into dataset

In [102]:
# Your code to convert the annotations into a dataset here

In [103]:
!wget https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_pos.txt
!wget https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_neg.txt


--2024-05-07 07:56:08--  https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_pos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7300 (7.1K) [text/plain]
Saving to: ‘trump_pos.txt.2’


2024-05-07 07:56:09 (37.5 MB/s) - ‘trump_pos.txt.2’ saved [7300/7300]

--2024-05-07 07:56:09--  https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_neg.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7076 (6.9K) [text/plain]
Saving to: ‘trump_neg.txt.2’


2024-05-07 07:56:09 (62.3 MB/s) - ‘trump_

In [104]:
from datasets import Dataset, DatasetDict, ClassLabel, Features, Value, DatasetInfo

In [105]:
texts = []
labels = []

with open('trump_pos.txt', 'r', encoding = 'utf-8') as f:
  for row in f:
    texts.append(row.strip())
    labels.append(1)

with open('trump_neg.txt', 'r', encoding = 'utf-8') as f:
  for row in f:
    texts.append(row.strip())
    labels.append(0)

# A special dictionary that defines the internal structure of a dataset
features = Features({
    'text': Value('string'),
    'label': ClassLabel(num_classes=2, names=['negative', 'positive'])
})

# The base class Dataset implements a Dataset backed by an Apache Arrow table.
bonus_ds = Dataset.from_dict({'text': texts, 'label': labels}, features=features)

# description (str) — A description of the dataset.
bonus_ds.info.description = "This dataset contains tweets by Donald Trump, manually labeled as positive or negative. The tweets come from both before and after his inauguration. The positive/negative ratio is 50/50."

# Let's check that everything is ok
for i in range(10):
    print(bonus_ds[i]['text'], bonus_ds[i]['label'])

# Let's store the label names for further checking
label_names = bonus_ds.features['label'].names




# Let's shuffle the dataset
bonus_ds = bonus_ds.shuffle(seed=42)

print("---------------------")

# And let's check the labeling once more
for i in range(10):
    example = bonus_ds[i]
    text = example['text']
    label_num = example['label']
    label_name = bonus_ds.features['label'].int2str(label_num)
    print(text)
    print(f"Label: {label_num} ({label_name})")
    print('---')


print(len(texts))
print(len(labels))
print(bonus_ds)

2016 was AMAZING, but we never had this kind of ENTHUSIASM! 1
Will soon be heading to Wilmington, North Carolina, and then will be going to Battleship North Carolina. Look forward to seeing all of my friends! 1
Mike has my complete &amp; total endorsement. We need him badly in Washington. A great fighter pilot &amp; hero, &amp; a brilliant Annapolis grad, Mike will never let you down. Mail in ballots, &amp; check that they are counted! 1
I’m with the TRUCKERS all the way. Thanks for the meeting at the White House with my representatives from the Administration. It is all going to work out well! 1
Congressman Bill Johnson (@JohnsonLeads) is an incredible fighter for the Great State of Ohio! He’s a proud Veteran and a hard worker who Cares for our Veterans, Supports Small Business, and is Strong on the Border and Second Amendment.... 1
We are having very productive calls with the leaders of every sector of the economy who are all-in on getting America back to work, and soon. More to come
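
Before building the dataset it can help to confirm that the two annotation files really give the intended 50/50 split. The following is a minimal sketch only: it uses in-memory stand-ins for `trump_pos.txt` and `trump_neg.txt` with made-up lines (not taken from the annotated data), and `read_labeled_lines` is a hypothetical helper that also skips blank lines.

```python
import io
from collections import Counter


def read_labeled_lines(handle, label):
    """Return (text, label) pairs for every non-blank line in an open file."""
    return [(line.strip(), label) for line in handle if line.strip()]


# Made-up stand-ins for the two annotation files (assumption, not real data).
pos_file = io.StringIO("Great rally tonight!\n\nThank you Ohio!\n")
neg_file = io.StringIO("A total disgrace.\nVery unfair!\n")

examples = read_labeled_lines(pos_file, 1) + read_labeled_lines(neg_file, 0)

# Count how many examples carry each label; a balanced set has equal counts.
label_counts = Counter(label for _, label in examples)
print(label_counts)
```

With the real files, the same check would be run on handles opened with `open('trump_pos.txt', encoding='utf-8')` and `open('trump_neg.txt', encoding='utf-8')`.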

In [107]:

import sklearn.feature_extraction

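# max_features caps the vocabulary size:
# only the max_features most common words are kept
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in bonus_ds] # get a list of all texts in the bonus dataset
# NOTE: fitting here builds a new vocabulary from the tweets only. If the model was
# trained on ids from a vectorizer fitted on the IMDB training data, that vectorizer
# should be reused instead so that the feature ids match.
vectorizer.fit(texts) # "Trains" the vectorizer, i.e. builds its vocabulary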
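
The ids produced by a bag-of-words vectorizer depend on the data it was fitted on. As a minimal sketch of why that matters, the pure-Python example below shows the same sentence mapping to different ids under two vocabularies. It is a simplified stand-in for `CountVectorizer`, not its actual implementation; the helper names `build_vocabulary` and `to_input_ids` and all texts are hypothetical.

```python
def build_vocabulary(texts):
    """Map every distinct lowercased token to an id, in alphabetical order."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_input_ids(text, vocabulary):
    """Return the sorted ids of the known tokens; unknown tokens are dropped."""
    return sorted({vocabulary[tok] for tok in text.lower().split() if tok in vocabulary})


# Two vocabularies: one from made-up "training" reviews, one from a made-up tweet.
train_vocab = build_vocabulary(["a great movie", "a terrible movie"])
tweet_vocab = build_vocabulary(["great rally tonight"])

ids_train = to_input_ids("great rally tonight", train_vocab)
ids_tweet = to_input_ids("great rally tonight", tweet_vocab)
print(ids_train, ids_tweet)
```

If the model was trained with a vectorizer fitted on the IMDB training data, reusing that vectorizer for the tweets keeps the ids aligned with what the model saw during training.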




In [108]:
bonus_vectorized=vectorize_example(bonus_ds[1]) # vectorize a single example as a check

print(bonus_vectorized)

{'input_ids': array([ 45,  77, 123, 126, 141, 182, 184, 275, 294, 384, 394, 426, 430,
       448, 468, 486, 521, 566, 572, 601, 611, 726, 764, 793, 804, 856,
       921, 929, 944], dtype=int32)}


In [109]:
# Apply the vectorizer to the whole dataset using .map()
bonus_dset_tokenized = bonus_ds.map(vectorize_example,num_proc=4)

# Let's check one vector from the data
example = bonus_dset_tokenized[8]

pprint(example)

# Just checking that the labeling is still ok
num_label = example['label']

text_label = bonus_dset_tokenized.features['label'].int2str(num_label)

print("Numerical label:", num_label)
print("Corresponding text label:", text_label)




{'input_ids': [46,
               92,
               104,
               140,
               196,
               210,
               211,
               319,
               326,
               332,
               341,
               552,
               601,
               760,
               822,
               897,
               922],
 'label': 0,
 'text': 'Fiscal mismanagement of cash costing US Taxpayer billions---cut '
         'fraud and waste before cutting funding for Seniors.'}
Numerical label: 0
Corresponding text label: negative


In [110]:
bonus_eval_results = bonus_trainer.predict(bonus_dset_tokenized)

In [111]:
pprint(bonus_eval_results)

PredictionOutput(predictions=array([[ 4.57644403e-01, -3.58707786e-01],
       [-7.80374348e-01,  1.11082733e+00],
       [ 6.15025684e-02,  1.09371118e-01],
       [-5.64188063e-02,  2.46391654e-01],
       [-3.46976221e-01,  5.94066620e-01],
       [-3.04294229e-01,  5.40958285e-01],
       [-2.67233491e-01,  5.01465619e-01],
       [ 9.66164470e-03,  1.70603305e-01],
       [ 4.85751033e-01, -3.95431906e-01],
       [-3.24791729e-01,  5.67284703e-01],
       [ 2.08925903e-01, -6.37654215e-02],
       [-4.08392310e-01,  6.66656375e-01],
       [-3.23123276e-01,  5.65759838e-01],
       [ 4.43992972e-01, -3.45990181e-01],
       [-1.71758071e-01,  3.86813879e-01],
       [ 4.77082551e-01, -3.85391712e-01],
       [ 2.32562274e-01, -9.43184122e-02],
       [ 1.56319961e-01, -6.96476176e-03],
       [-1.33506849e-01,  3.39993924e-01],
       [ 2.33054042e-01, -9.90208387e-02],
       [-4.17769372e-01,  6.79193795e-01],
       [-2.53610790e-01,  4.81323481e-01],
       [ 5.74450910e-01, 

### 5.3. Model evaluation on out-of-domain test set
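
A minimal sketch of how accuracy could be computed from the logits returned by `bonus_trainer.predict`. The helper name `accuracy_from_logits` is hypothetical. The three logit rows are rounded from the first rows of the output printed above, while the gold labels are illustrative only and are not the real annotations.

```python
def accuracy_from_logits(logits, label_ids):
    """Share of examples whose highest-scoring class equals the gold label."""
    correct = 0
    total = 0
    for row, gold in zip(logits, label_ids):
        # argmax over the class dimension
        predicted = max(range(len(row)), key=lambda j: row[j])
        correct += int(predicted == gold)
        total += 1
    return correct / total


# Rounded logits from the printed output; labels are placeholders (assumption).
example_logits = [[0.4576, -0.3587], [-0.7804, 1.1108], [0.0615, 0.1094]]
example_labels = [0, 1, 0]

example_accuracy = accuracy_from_logits(example_logits, example_labels)
print(example_accuracy)  # 2 of the 3 placeholder labels match
```

In the notebook this would be called as `accuracy_from_logits(bonus_eval_results.predictions, bonus_eval_results.label_ids)`, assuming the dataset passed to `predict` carries its labels.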

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)
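
One way to present the out-of-domain results is a confusion matrix alongside accuracy, since it shows whether errors fall mostly on the negative or the positive class. A minimal sketch with a hypothetical helper `confusion_matrix`; the labels below are illustrative placeholders, not results from this project.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=2):
    """Count examples per (gold, predicted) pair; rows are gold, columns predicted."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for gold, predicted in zip(true_labels, predicted_labels):
        matrix[gold][predicted] += 1
    return matrix


# Placeholder labels (assumption): 0 = negative, 1 = positive.
true_labels = [0, 0, 1, 1, 1]
predicted_labels = [0, 1, 1, 1, 0]

matrix = confusion_matrix(true_labels, predicted_labels)
for name, row in zip(["negative", "positive"], matrix):
    print(name, row)
```

Row `negative` then holds the counts of gold-negative tweets predicted as negative and as positive, and row `positive` the same for gold-positive tweets.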

### 5.5. Annotated data

In [112]:
# Include your annotated out-of-domain data here

for e in bonus_ds:
    text = e['text']
    label_num = e['label']
    label_name = bonus_ds.features['label'].int2str(label_num)
    print(text)
    print(f"Label: {label_num} ({label_name})")
    print('---')

Russian Collusion with the Trump Campaign, one of the most successful in history, is a TOTAL HOAX. The Democrats paid for the phony and discredited Dossier which was, along with Comey, McCabe, Strzok and his lover, the lovely Lisa Page, used to begin the Witch Hunt. Disgraceful!
Label: 0 (negative)
---
Congresswoman @cathymcmorris of Washington State is an incredible leader who is respected by everyone in Congress. We need her badly in D.C. to keep building on #MAGA. She has my Strong Endorsement!
Label: 1 (positive)
---
To all of those who have asked, I will not be going to the Inauguration on January 20th.
Label: 0 (negative)
---
I applaud and congratulate the U.S. Senate for confirming our GREAT NOMINEE, Judge Brett Kavanaugh, to the United States Supreme Court. Later today, I will sign his Commission of Appointment, and he will be officially sworn in. Very exciting!
Label: 1 (positive)
---
Here's to a safe and happy Independence Day for one and all - Enjoy it! --Donald J. Trump
Lab