
# Introduction to HLT Project (Template)

- Student(s) Name(s): Mika Blomqvist & Henrik Heinonen
- Date: 2024-05-02
- Chosen Corpus: Rotten Tomatoes
- Contributions (if group project):

### Corpus information

- Description of the chosen corpus: Movie Review Dataset. This dataset contains 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. The data was first used in Bo Pang and Lillian Lee, "Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.", Proceedings of the ACL, 2005.
- Paper(s) and other published materials related to the corpus:
- State-of-the-art performance (best published results) on this corpus:

---

## 1. Setup

In [72]:
# Install and import the libraries used in this notebook
!pip3 install -q transformers[torch] datasets evaluate optuna plotly

import datasets
from datasets import load_dataset, load_dataset_builder, DatasetDict
datasets.disable_progress_bar()

from pprint import pprint # Pretty print
import sklearn.feature_extraction.text # Provides CountVectorizer


---

## 2. Data download and preprocessing

### 2.1. Download the corpus

In [73]:
# Your code to download the corpus here

def dataset_features ( data : str ) -> DatasetDict:

  dataset = datasets.load_dataset(data)
  builder = datasets.load_dataset_builder(data)

  print(builder.info.description)

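  osoittaja = 0  # numerator: number of rows in one split
  nimittaja = 0  # denominator: number of rows in all splits
  tulos = 0      # result: the split's share of the total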

  for rivi in dataset.keys():
    nimittaja += dataset[rivi].num_rows

  print(f"Total number of rows : {nimittaja} \n")
  print("Relative sizes of subsets in the dataset: \n")

  for rivi in dataset.keys():
    osoittaja = dataset[rivi].num_rows
    tulos = osoittaja/nimittaja

    print(f"{rivi}: {tulos:.0%}")


  print("\n---\n")
  train_dataset = dataset['train']
  label_names = train_dataset.features['label'].names
  train_dict = {}

  for indeksi in range(len(train_dataset)) :
    label_name = label_names[train_dataset[indeksi]['label']]
    if label_name not in train_dict :
      train_dict[label_name] = 1
    else:
      train_dict[label_name] += 1

  print("Distribution of labels in the 'train' subset of the dataset: \n")

  for avain, arvo in train_dict.items():
    tulos = arvo/len(train_dataset)
    print(f"{avain}:{tulos:.0%}")

  return dataset

data = "rotten_tomatoes"

dataset = dataset_features(data)





Total number of rows : 10662 

Relative sizes of subsets in the dataset: 

train: 80%
validation: 10%
test: 10%

---

Distribution of labels in the 'train' subset of the dataset: 

pos:50%
neg:50%
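
The label-counting loop above can also be written with `collections.Counter`. The following is only a minimal sketch on a made-up list of labels (`train_labels` and its values are hypothetical stand-ins for `dataset['train']`), not the code that produced the output above.

```python
from collections import Counter

# Hypothetical stand-in for the 'label' column of dataset['train'].
label_names = ["neg", "pos"]
train_labels = [1, 0, 1, 0, 0, 1]

# Count how often each label name occurs.
counts = Counter(label_names[label] for label in train_labels)
total = sum(counts.values())

# Convert the counts into relative shares.
shares = {name: count / total for name, count in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

On the real data the same idea would iterate over `train_dataset['label']` instead of the toy list.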


### 2.2. Preprocessing

In [74]:
# Your code for any necessary preprocessing here

In [75]:
dataset = dataset.shuffle()

In [76]:
print(dataset) # from assign 2

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})


In [77]:
vectorizer = sklearn.feature_extraction.text.CountVectorizer(binary = True, max_features = 20000)

texts = [ex['text'] for ex in dataset['train']]
vectorizer.fit(texts)

In [78]:
# Example from course

def vectorize_example(ex) -> dict:
  vectorized = vectorizer.transform([ex['text']]) # Transform the document into a (1 x vocabulary) sparse document-term matrix.
  non_zero_features = vectorized.nonzero()[1] # scipy's nonzero() returns (row_indices, column_indices); the columns are the ids of the words present.
  non_zero_features += 1 # Feature index 0 is reserved for padding,
                         # so shift every index up by one.
  return {"input_ids":non_zero_features}

vectorized = vectorize_example(dataset['train'][0])
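
To make the index shift concrete, here is a small pure-Python sketch of the same idea. The vocabulary and the helper `to_input_ids` are hypothetical, and whitespace splitting is a simplification of what `CountVectorizer` does.

```python
# Toy vocabulary standing in for vectorizer.vocabulary_ (word -> column index).
toy_vocabulary = {"bad": 0, "film": 1, "good": 2, "very": 3}

def to_input_ids(text, vocabulary):
    """Return the sorted, shifted ids of the distinct known words in text."""
    # A set mirrors binary=True: each word counts once, however often it occurs.
    columns = {vocabulary[word] for word in text.lower().split() if word in vocabulary}
    # Shift by one so that id 0 stays free for padding.
    return [column + 1 for column in sorted(columns)]

print(to_input_ids("a very very good film", toy_vocabulary))
```

Unknown words such as "a" are simply dropped, and no returned id is ever 0.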

In [79]:
# Apply the vectorizer to the whole dataset using .map()

# Multiprocessing speeds this up by running several worker processes on the CPU;
# num_proc sets the number of processes to use.
dset_tokenized = dataset.map(vectorize_example,num_proc=4)
pprint(dset_tokenized["train"][0])


{'input_ids': [350,
               556,
               644,
               662,
               4264,
               4600,
               5022,
               6288,
               6385,
               6712,
               6942,
               7815,
               8819,
               9801,
               9993,
               13061,
               13158,
               14637,
               14692,
               15800,
               15997,
               16227,
               16405,
               16417],
 'label': 0,
 'text': 'an admitted egomaniac , evans is no hollywood villain , and yet this '
         "grating showcase almost makes you wish he'd gone the way of don "
         'simpson .'}


In [80]:

import torch

def collator(list_of_examples):
  batch = {'labels':torch.tensor(list(ex['label'] for ex in list_of_examples))} # Labels in to a single tensor
  tensors = []
  max_len = max(len(example['input_ids']) for example in list_of_examples) # Get the length of longest input
  # To build a tensor
  for e in list_of_examples:
    ids = torch.tensor(e['input_ids']) # Pick the input ids
    # https://pytorch.org/docs/stable/generated/torch.nn.functional.pad.html
    # pad(input, (left, right))
    padded = torch.nn.functional.pad(ids, (0, max_len - ids.shape[0]))
    tensors.append(padded)
  # https://pytorch.org/docs/stable/generated/torch.vstack.html
  batch['input_ids'] = torch.vstack(tensors) # Stack tensors in sequence vertically (row wise).
  return batch
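
What the collator does to the inputs can be illustrated without torch. The sketch below uses plain lists and a hypothetical helper `pad_batch`; it only shows the right-padding step, not the tensor stacking.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three examples of different lengths become one rectangular batch.
batch = pad_batch([[5, 9, 12], [7], [3, 4]])
for row in batch:
    print(row)
```

Because the pad value is 0 and the real ids start at 1, padded positions can never be confused with a word.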

---

## 3. Machine learning model

### 3.1. Model training

In [81]:
# Your code to train the machine learning model on the training set and evaluate the performance on the validation set here

In [82]:
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # uniform_() also overwrites row 0, so zero the padding row again;
        # padding_idx=0 then keeps it at zero because that row receives no gradient
        with torch.no_grad():
            self.embedding.weight[0].fill_(0.0)
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None): # the custom collator supplies only input_ids and labels, so no attention_mask is needed here
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)

        #### MODIFIED HERE FOR EXERCISE 5 -> commented out
        ####projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)

        #### MODIFIED HERE FOR EXERCISE 5 -> base it off embedded_summed
        ##### OLD: logits=self.output(projected)
        logits=self.output(embedded_summed)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # We have labels, so we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)

# Configure the model:
#   these parameters are used in the model's __init__()


mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=20,nlabels=2)


In [83]:
# And we can make a model
mlp = MLP(mlp_config)
fake_batch = collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) # ** expands the dict so that input_ids and labels are passed as keyword arguments

(tensor(0.6662, grad_fn=<NllLossBackward0>),
 tensor([[-0.1366, -0.1987],
         [-0.1443, -0.1915]], grad_fn=<AddmmBackward0>))
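
The loss printed above can be checked by hand from the two rows of logits. This sketch recomputes the mean cross-entropy in plain Python; it assumes both examples in the fake batch have label 0 (the first one does according to the printed example, the second is an assumption that happens to reproduce the printed value).

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the correct class."""
    log_norm = math.log(sum(math.exp(value) for value in logits))
    return log_norm - logits[label]

# Logits copied from the output above, rounded to four decimals.
logits = [[-0.1366, -0.1987], [-0.1443, -0.1915]]
labels = [0, 0]  # assumption: both examples are negative reviews

losses = [cross_entropy(row, label) for row, label in zip(logits, labels)]
mean_loss = sum(losses) / len(losses)
print(mean_loss)
```

Under that assumption the result agrees with the printed loss of 0.6662 to about three decimals, which is what `torch.nn.CrossEntropyLoss` with its default mean reduction would give. A value close to ln 2 ≈ 0.693 is expected for an untrained two-class model.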

In [84]:
# https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments

trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", # Save checkpoints here.
    evaluation_strategy="steps", # Evaluation is done (and logged) every eval_steps.
    logging_strategy="steps", # Logging is done every logging_steps.
    eval_steps=500, # Number of update steps between two evaluations if evaluation_strategy="steps".
    # Defaults to the value of logging_steps if not set.
    # An integer, or a float in [0,1) that is interpreted as a ratio of the total training steps.
    logging_steps=500, # Number of update steps between two logs if logging_strategy="steps".
    # An integer, or a float in [0,1) that is interpreted as a ratio of the total training steps.
    learning_rate=1e-4, # Initial learning rate (float, optional, defaults to 5e-5).
    max_steps=20000, # (int, optional, defaults to -1) If positive, the total number of training steps to perform.
    # Overrides num_train_epochs. Training loops over the dataset again whenever the data is exhausted,
    # until max_steps is reached.
    #num_train_epochs=5.0,
    load_best_model_at_end=True, # Load the best checkpoint found during training when training ends.
    # When this option is enabled, the best checkpoint is always saved.
    per_device_train_batch_size = 64
)

pprint(trainer_args) #print if needed

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'gradient_accumulation_kwargs': None},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla

### 3.2 Hyperparameter optimization

In [85]:
# Your code for hyperparameter optimization here

In [86]:
# TODO: Build more hyperparameter tests

In [88]:
import numpy as np
import evaluate

# Evaluate is a library that makes evaluating and comparing models
# and reporting their performance easier and more standardized.
# https://pypi.org/project/evaluate/

accuracy = evaluate.load('accuracy')

def compute_accuracy(outputs_and_labels):
  outputs, labels = outputs_and_labels
  preds = np.argmax(outputs, axis = -1) # Returns the indices of the maximum values along an axis.
  # https://numpy.org/doc/stable/reference/generated/numpy.argmax.html
  return accuracy.compute(predictions = preds, references = labels)
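
For reference, the metric itself is simple: take the arg-max of each row of logits and count matches. The following plain-Python sketch uses made-up logits and labels and hypothetical helper names; it does not use the `evaluate` library.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda index: values[index])

def accuracy_from_logits(logits, labels):
    """Share of rows whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return correct / len(labels)

# Made-up model outputs for four examples and two classes.
toy_logits = [[0.2, -0.1], [-0.3, 0.4], [0.1, 0.0], [-0.5, -0.2]]
toy_labels = [0, 1, 1, 1]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```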

In [89]:

# The argument is the patience: training stops once the evaluation loss
# has failed to improve for this many consecutive evaluations
early_stopping = transformers.EarlyStoppingCallback(early_stopping_patience=5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
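    eval_dataset=dset_tokenized["test"].select(range(1000)), # 1000-example subset of the test split; note that early stopping and best-checkpoint selection therefore see test data (the validation split would keep the test set unseen)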
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

max_steps is given, it will override any value given in num_train_epochs


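| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 500  | 0.6446 | 0.613596 | 0.754 |
| 1000 | 0.5125 | 0.536539 | 0.779 |
| 1500 | 0.4    | 0.488547 | 0.791 |
| 2000 | 0.3191 | 0.461678 | 0.799 |
| 2500 | 0.2589 | 0.447899 | 0.796 |
| 3000 | 0.2149 | 0.441159 | 0.793 |
| 3500 | 0.1801 | 0.441746 | 0.79  |
| 4000 | 0.1528 | 0.444968 | 0.791 |
| 4500 | 0.1302 | 0.452422 | 0.791 |
| 5000 | 0.1121 | 0.460901 | 0.791 |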


TrainOutput(global_step=5500, training_loss=0.27473673664439807, metrics={'train_runtime': 66.715, 'train_samples_per_second': 19186.103, 'train_steps_per_second': 299.783, 'total_flos': 3046942584.0, 'train_loss': 0.27473673664439807, 'epoch': 41.04477611940298})
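
The stopping point can be related to the patience setting with a small simulation. The sketch below replays the evaluation losses listed in the log through a simplified patience rule (strict improvement, no threshold); it is an illustration of the idea, not the callback's actual implementation.

```python
# Evaluation losses copied from the training log (step -> loss).
eval_losses = {
    500: 0.613596, 1000: 0.536539, 1500: 0.488547, 2000: 0.461678,
    2500: 0.447899, 3000: 0.441159, 3500: 0.441746, 4000: 0.444968,
    4500: 0.452422, 5000: 0.460901,
}
patience = 5

best_loss = float("inf")
best_step = None
bad_evals = 0  # consecutive evaluations without improvement

for step, loss in eval_losses.items():
    if loss < best_loss:
        best_loss, best_step, bad_evals = loss, step, 0
    else:
        bad_evals += 1
    if bad_evals >= patience:
        break

print(best_step, bad_evals)
```

After the listed evaluations the lowest loss is at step 3000 and four evaluations in a row have failed to improve on it. One more non-improving evaluation at step 5500 would reach the patience of 5, which is consistent with `global_step=5500`; the log row for step 5500 is not shown, so this last step is an inference.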

### 3.3. Evaluation on test set

In [90]:
# Your code to evaluate the final model on the test set here

eval_results = trainer.evaluate(dset_tokenized["test"])
print(eval_results)

{'eval_loss': 0.4417257308959961, 'eval_accuracy': 0.7926829268292683, 'eval_runtime': 0.2604, 'eval_samples_per_second': 4094.477, 'eval_steps_per_second': 514.69, 'epoch': 41.04477611940298}
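
A few of the reported numbers can be sanity-checked with simple arithmetic. The sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept (`dataloader_drop_last=False` in the printed arguments).

```python
import math

train_rows = 8530      # size of the train split
batch_size = 64        # per_device_train_batch_size
global_step = 5500     # from TrainOutput

# Number of optimizer steps needed to see every training example once.
steps_per_epoch = math.ceil(train_rows / batch_size)
epochs = global_step / steps_per_epoch

# Number of correctly classified test examples implied by the accuracy.
test_rows = 1066
correct = round(0.7926829268292683 * test_rows)

print(steps_per_epoch, epochs, correct)
```

With 134 steps per epoch, 5500 steps correspond to roughly 41.04 epochs, matching the `epoch` field, and the test accuracy corresponds to 845 of 1066 examples.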


---

## 4. Results and summary

### 4.1 Corpus insights

(Briefly discuss what you learned about the corpus and its annotation)

### 4.2 Results

(Briefly summarize your results)

### 4.3 Relation to state of the art

(Compare your results to the state-of-the-art performance)

---

## 5. Bonus Task (optional)

### 5.1. Annotating out-of-domain documents

(Briefly describe the chosen out-of-domain documents)

The out-of-domain documents are tweets from https://github.com/MarkHershey/CompleteTrumpTweetsArchive, annotated by hand for text classification: 50 positive and 50 negative.



(Briefly describe the process of annotation)
The dataset we used for the annotation task consists of tweets by Donald Trump from before and after his inauguration. Because only a small number of annotated texts was needed (100 tweets, each treated as an individual document), the process was fairly straightforward. We selected 25 negative and 25 positive tweets from each period, before and after the inauguration, for a total of 100. For each tweet we first looked for strongly positive or negative words and then tried to decide whether the tweet was satirical. Borderline cases included tweets whose sentiment depends on which side of the political spectrum the reader is on. We discarded those in this small task; with a larger amount of data we would have to reconsider. Since the data is meant for testing, we tried to select tweets that are as clearly positive or negative as possible.

Annotation was quick because the dataset is small and tweets are very short documents. We found the contents of the tweets interesting, displaying polarity between the two time frames. The annotation process also has an ethical side: it involved reading a lot of hate speech, which in large amounts can harm an annotator's mental well-being. We can only imagine what it is like to do this for a living for a small monetary compensation.

### 5.2 Conversion into dataset

In [91]:
# Your code to convert the annotations into a dataset here

In [92]:
!wget https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_pos.txt
!wget https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_neg.txt


--2024-05-02 21:12:26--  https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_pos.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7300 (7.1K) [text/plain]
Saving to: ‘trump_pos.txt.1’


2024-05-02 21:12:26 (78.8 MB/s) - ‘trump_pos.txt.1’ saved [7300/7300]

--2024-05-02 21:12:26--  https://raw.githubusercontent.com/heinohen/tko_7095_i2hlt/main/prjct/trump_neg.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7076 (6.9K) [text/plain]
Saving to: ‘trump_neg.txt.1’


2024-05-02 21:12:26 (84.3 MB/s) - ‘trump_

In [93]:
from datasets import Dataset, DatasetDict, ClassLabel, Features, Value, DatasetInfo

In [94]:
texts = []
labels = []

with open('trump_pos.txt', 'r', encoding = 'utf-8') as f:
  for row in f:
    texts.append(row.strip())
    labels.append(1)

with open('trump_neg.txt', 'r', encoding = 'utf-8') as f:
  for row in f:
    texts.append(row.strip())
    labels.append(0)

# A special dictionary that defines the internal structure of a dataset
features = Features({
    'text': Value('string'),
    'label': ClassLabel(num_classes=2, names=['negative', 'positive'])
})

# The base class Dataset implements a Dataset backed by an Apache Arrow table.
bonus_ds = Dataset.from_dict({'text': texts, 'label': labels}, features=features)

# description (str) — A description of the dataset.
bonus_ds.info.description = "This dataset contains tweets from Donald Trump labeled as positive and negative. Annotation was done manually. Dataset has tweets from Donald Trump right before he was in office and after he was elected president. Ratio is 50/50."

# Let's check that everything is OK
for i in range(10):
    print(bonus_ds[i]['text'], bonus_ds[i]['label'])

# Let's store the label names for further checking
label_names = bonus_ds.features['label'].names

# Let's shuffle the dataset
bonus_ds = bonus_ds.shuffle(seed=42)

print("---------------------")

# And let's check the labeling once more
for i in range(10):
    example = bonus_ds[i]
    text = example['text']
    label_num = example['label']
    label_name = bonus_ds.features['label'].int2str(label_num)
    print(f"Label: {label_num} ({label_name})")
    print('---')


print(len(texts))
print(len(labels))
print(bonus_ds)

2016 was AMAZING, but we never had this kind of ENTHUSIASM! 1
Will soon be heading to Wilmington, North Carolina, and then will be going to Battleship North Carolina. Look forward to seeing all of my friends! 1
Mike has my complete &amp; total endorsement. We need him badly in Washington. A great fighter pilot &amp; hero, &amp; a brilliant Annapolis grad, Mike will never let you down. Mail in ballots, &amp; check that they are counted! 1
I’m with the TRUCKERS all the way. Thanks for the meeting at the White House with my representatives from the Administration. It is all going to work out well! 1
Congressman Bill Johnson (@JohnsonLeads) is an incredible fighter for the Great State of Ohio! He’s a proud Veteran and a hard worker who Cares for our Veterans, Supports Small Business, and is Strong on the Border and Second Amendment.... 1
We are having very productive calls with the leaders of every sector of the economy who are all-in on getting America back to work, and soon. More to come

In [95]:
initial_split = bonus_ds.train_test_split(test_size=0.9)
test_valid_split = initial_split['test'].train_test_split(test_size=0.1)
complete_dataset = DatasetDict({
    'train': initial_split['train'],
    'test': test_valid_split['train'],
    'validate': test_valid_split['test']})

print(complete_dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 10
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 81
    })
    validate: Dataset({
        features: ['text', 'label'],
        num_rows: 9
    })
})
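
The split sizes above follow from applying the two ratios in sequence. The sketch below reproduces them with a hypothetical helper `split_sizes`, assuming the test share is rounded up and the remainder goes to the train part.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_rows)  # assumption: test share is rounded up
    return n_rows - n_test, n_test

# First split: 100 tweets, 90% held out.
first_train, first_test = split_sizes(100, 0.9)
# Second split: the held-out part is divided again, 10% of it for validation.
second_train, second_test = split_sizes(first_test, 0.1)

sizes = {"train": first_train, "test": second_train, "validate": second_test}
print(sizes)
```

So only 10 tweets end up in `train`, while 81 are available for testing and 9 for validation.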


In [96]:

import sklearn.feature_extraction

# max_features caps the vocabulary size at the max_features most common words.
# Note: this refits `vectorizer` on the tweets, replacing the vocabulary that was
# built from Rotten Tomatoes, so the ids produced below no longer index the
# embedding rows the MLP was trained with.
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in complete_dataset["train"]] # Get a list of all texts from the training data
vectorizer.fit(texts) # "Trains" the vectorizer, i.e. builds its vocabulary




In [97]:
bonus_vectorized=vectorize_example(complete_dataset["train"][1])

print(bonus_vectorized)

{'input_ids': array([ 10,  17,  20,  24,  27,  41,  64,  67,  68,  73,  76,  77,  84,
        90,  94, 107, 113, 126, 128, 129, 133, 139, 144, 152, 158],
      dtype=int32)}


In [98]:
# Apply the vectorizer to the whole dataset using .map()
bonus_dset_tokenized = complete_dataset.map(vectorize_example,num_proc=4)

# Let's check one vector from the data
example = bonus_dset_tokenized['train'][8]

pprint(example)

#Just checking that the labeling is still ok
num_label = example['label']

text_label = bonus_dset_tokenized['train'].features['label'].int2str(num_label)

print("Numerical label:", num_label)
print("Corresponding text label:", text_label)



{'input_ids': [8, 14, 25, 37, 38, 44, 69, 104, 124, 133, 135, 146, 152, 156],
 'label': 0,
 'text': "We could use the Balanced Budget Amendment--Politicians don't have "
         'the will to cut spending'}
Numerical label: 0
Corresponding text label: negative


In [99]:
eval_results = trainer.evaluate(bonus_dset_tokenized["test"])

print(eval_results)



{'eval_loss': 0.7451614737510681, 'eval_accuracy': 0.5432098765432098, 'eval_runtime': 0.0623, 'eval_samples_per_second': 1300.599, 'eval_steps_per_second': 176.625, 'epoch': 41.04477611940298}
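
One possible contributor to the near-chance accuracy is that cell In [96] refits `vectorizer` on the tweets, so the same word may receive a different id than it had when the model was trained. The sketch below illustrates that effect with a toy vocabulary builder; `build_vocabulary`, `to_ids`, and the example texts are hypothetical, and sorting words alphabetically is a simplification of how `CountVectorizer` assigns indices.

```python
def build_vocabulary(texts):
    """Map each distinct word to an index, in alphabetical order."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_ids(text, vocabulary):
    """Shifted ids (0 reserved for padding) of the known words in text."""
    return sorted(vocabulary[w] + 1 for w in set(text.lower().split()) if w in vocabulary)

# Vocabulary "fitted" on reviews versus one "fitted" on tweets.
review_vocab = build_vocabulary(["a great film", "a terrible plot"])
tweet_vocab = build_vocabulary(["a great crowd tonight", "terrible deal"])

# The same text maps to different ids under the two vocabularies.
print(to_ids("great film", review_vocab))
print(to_ids("great film", tweet_vocab))
```

Reusing the vectorizer fitted on the Rotten Tomatoes training data for the tweets would keep the ids aligned with the embedding rows the model has learned; whether that alone explains the result is not established here.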


### 5.3. Model evaluation on out-of-domain test set

### 5.4 Bonus task results

(Present the results of the evaluation on the out-of-domain test set)

### 5.5. Annotated data

In [100]:
# Include your annotated out-of-domain data here