<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/mlp_imdb_hf_dset_and_trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup

Before we start running our own Python code, install the required Python packages using [pip](https://en.wikipedia.org/wiki/Pip):

* [`transformers`](https://huggingface.co/docs/transformers/index) is a popular deep learning package built primarily on top of PyTorch (`torch`); we need to reinstall it with the `[torch]` configuration (this might take a substantial amount of time)
* [`datasets`](https://huggingface.co/docs/datasets/) provides support for loading, creating, and manipulating datasets
* `evaluate` is a library of performance metrics (such as accuracy)

**You will likely need to do a Runtime/Restart session for everything to work after the installation.**

In [13]:
# !pip3 install -q datasets evaluate
# !pip install transformers[torch]

(Above, the `!` at the start of the line tells the notebook to run the line as an operating system command rather than Python code, and the `-q` argument to `pip` runs the command in "quiet" mode, with less output.)

---

# Get and prepare data

*   Let us work with the IMDB dataset of movie review sentiment
*   25,000 positive reviews
*   25,000 negative reviews
*   50,000 unlabeled reviews (which we discard for the time being)


In [14]:
from pprint import pprint #pprint => pretty-print, I use it occasionally throughout the notebook
import datasets
import torch
dset=datasets.load_dataset("imdb")
pprint(dset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [15]:
dset=dset.shuffle() #This is never a bad idea, datasets may have ordering to them, which is not what we want
del dset["unsupervised"] #Delete the unlabeled part of the dataset, we don't need it for anything

In [16]:
pprint(dset['train'][0]['text'])
print(dset['train'][0]['label'])

('I happened on "Shower" in the foreign film section of my local video store '
 'and passed it over several times since from its cover it looked like a farce '
 'or comedy. I then lucked into a copy to purchase at economical price and am '
 'happy for my luck. "Shower" is the story of three(3) men, a father and '
 'two(2) adult sons, each coming to terms with life changes as the world '
 'around them also continues to change in modern China. As with many "foreign" '
 'films, the Chinese culture itself is one of the most interesting facets of '
 'this movie.<br /><br />Beyond the fascinating characteristics of the local, '
 'Chinese color giving the setting to this story, is the difficult yet '
 'touching relationships between the men and a sole woman involved in the '
 'story, all set against the backdrop of a village bathhouse.<br /><br />The '
 "family's story moves from estrangement to understanding and made me glad I "
 'came to know these people. Added to the main story are the nu

## Tokenize and map vocabulary
         
*   We need to accomplish two complementary tasks
*   **Tokenize**: split the text into units that can be interpreted as features (words in this case)
*   **Map vocabulary**: build the feature vector for each example
*   In NLP, this means listing the non-zero elements of the feature vector, in other words the indices of the vocabulary items that occur in the text
*   Since we work with the bag-of-words (BoW) representation, these indices do not need to be (and are not) in the order in which the words appear in the text
*   These indices then refer to rows of the embedding matrix
*   A traditional and well-tested way is to use sklearn's feature extraction package
*   `CountVectorizer` is most likely what we want here, because we only need the ids, nothing else
*   For other NLP work, `TfidfVectorizer` is also very handy
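
To make the two tasks concrete, here is a minimal pure-Python sketch of tokenizing and vocabulary mapping on three toy documents. It only imitates the idea and is not sklearn's implementation; the helper names `tokenize`, `build_vocab` and `encode` are made up for this illustration, and the regular expression is assumed to mirror the vectorizer's default of keeping lowercase tokens of two or more characters.

```python
import re
from collections import Counter

docs = ["good movie, really good", "bad movie", "really bad acting"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocab(texts, max_features):
    # Keep the max_features most frequent tokens and give them
    # indices in alphabetical order
    counts = Counter(tok for text in texts for tok in tokenize(text))
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}

def encode(text, vocab):
    # Binary bag of words: a sorted list of the vocabulary indices present;
    # word order and repeats are discarded, unknown words are dropped
    return sorted({vocab[tok] for tok in tokenize(text) if tok in vocab})

vocab = build_vocab(docs, max_features=4)
print(vocab)
print(encode(docs[0], vocab))
print(encode(docs[2], vocab))
```

Note that "acting" occurs only once and therefore falls outside the four-word vocabulary, so it simply disappears from the encoding of the last document.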



In [17]:
import sklearn.feature_extraction

# max_features caps the size of the vocabulary:
# only the max_features most common words are kept
vectorizer=sklearn.feature_extraction.text.CountVectorizer(binary=True,max_features=20000)

texts=[ex["text"] for ex in dset["train"]] #get a list of all texts from the training data
vectorizer.fit(texts) #"Trains" the vectorizer, i.e. builds its vocabulary


# Building the feature vectors

* This is super-easy with the vectorizer
* It produces a sparse matrix of the non-zero elements

In [18]:
def vectorize_example(ex):
    vectorized=vectorizer.transform([ex["text"]]) # [...] because the vectorizer expects a list/iterable over inputs, not one input
    non_zero_features=vectorized.nonzero()[1] #.nonzero gives a pair of (rows,columns), we want the columns
    non_zero_features+=1 #feature index 0 will have a special meaning
                         # so let us not produce it by adding +1 to everything
    return {"input_ids":non_zero_features}

vectorized=vectorize_example(dset["train"][0])

In [19]:
print(vectorized)

{'input_ids': array([  451,   521,   606,   727,   774,   794,   887,  1115,  1157,
        1207,  1300,  1456,  1497,  1901,  1911,  2295,  2625,  2687,
        3037,  3040,  3058,  3062,  3187,  3189,  3560,  3599,  3616,
        3842,  3979,  4071,  4172,  4424,  4474,  5108,  5736,  5791,
        6578,  6640,  6664,  6683,  6706,  6878,  6887,  7127,  7147,
        7331,  7346,  7695,  7697,  8225,  8233,  8811,  9085,  9180,
        9428,  9485,  9557,  9602,  9630,  9638,  9639, 10090, 10449,
       10475, 10592, 10640, 10734, 10829, 10871, 10970, 11176, 11267,
       11576, 11721, 11761, 11762, 11857, 12260, 12363, 12437, 12439,
       12504, 12627, 12887, 13028, 13727, 14034, 14574, 15667, 15812,
       15816, 15830, 16041, 16161, 16381, 16404, 16522, 16565, 17068,
       17075, 17837, 17897, 17907, 17910, 17917, 17938, 17968, 18000,
       18081, 18115, 18217, 18219, 18553, 18685, 19144, 19173, 19325,
       19377, 19712, 19741, 19785, 19914])}


In [20]:
# We can map back to vocabulary and check that everything works
# vectorizer.vocabulary_ is a dictionary {key:word, value:idx}

idx2word=dict((i,w) for (w,i) in vectorizer.vocabulary_.items()) #inverse the vocab dictionary
words=[]
for idx in vectorized["input_ids"]:
    words.append(idx2word[idx-1]) # It is easy to forget that we shifted all indices by +1
pprint(", ".join(words)) #This is now the bag of words representation of the document

('added, adult, against, all, also, am, and, are, around, as, at, away, '
 'backdrop, between, beyond, br, by, came, change, changes, characteristics, '
 'characters, china, chinese, color, comedy, coming, conflicts, continues, '
 'copy, cover, culture, customers, difficult, each, economical, facets, '
 'family, farce, fascinating, father, film, films, for, foreign, friendships, '
 'from, giving, glad, happened, happy, humanity, in, individual, interesting, '
 'into, involved, is, it, its, itself, know, life, like, local, looked, luck, '
 'made, main, many, me, men, modern, most, moves, movie, my, numerous, of, on, '
 'one, or, over, passed, people, price, purchase, relationships, section, set, '
 'setting, several, shower, since, small, smiling, sole, sons, store, story, '
 'terms, the, their, them, then, these, this, three, times, to, touched, '
 'touching, two, understanding, video, village, walks, warmth, with, woman, '
 'world, yet')


# Tokenizing / vectorizing the whole dataset

* The datasets library allows us to efficiently map() a function across the whole dataset
* Can run in parallel

**Note**: confusingly, and unlike the Python `map` function, the [`Dataset.map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) function _updates_ the examples rather than replacing them: the values returned by the function are merged into each example, keeping the existing values, and the result is returned as a new dataset. Here, the call adds the returned `input_ids` to each example while also keeping the original `text` and `label` values.
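
The difference can be sketched in plain Python. The snippet below only emulates the merging behaviour on a list of dictionaries and does not use the `datasets` library; `add_length` and `map_update` are hypothetical helpers for this illustration.

```python
def add_length(example):
    # Returns only the new value, like vectorize_example returns only input_ids
    return {"n_chars": len(example["text"])}

examples = [
    {"text": "great film", "label": 1},
    {"text": "dull", "label": 0},
]

# Built-in map(): the result holds only what the function returned
plain = list(map(add_length, examples))

def map_update(fn, rows):
    # Emulated Dataset.map-style behaviour: merge the returned keys
    # into a copy of each example, keeping the existing keys
    return [{**row, **fn(row)} for row in rows]

merged = map_update(add_length, examples)

print(plain[0])
print(merged[0])
```

The first print shows a dictionary with `n_chars` only, the second one shows `text`, `label` and `n_chars` together.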


In [21]:
# Apply the tokenizer to the whole dataset using .map()
dset_tokenized = dset.map(vectorize_example)
pprint(dset_tokenized["train"][0])

Map: 100%|██████████| 25000/25000 [00:11<00:00, 2197.40 examples/s]
Map: 100%|██████████| 25000/25000 [00:10<00:00, 2287.36 examples/s]

{'input_ids': [451,
               521,
               606,
               727,
               774,
               794,
               887,
               1115,
               1157,
               1207,
               1300,
               1456,
               1497,
               1901,
               1911,
               2295,
               2625,
               2687,
               3037,
               3040,
               3058,
               3062,
               3187,
               3189,
               3560,
               3599,
               3616,
               3842,
               3979,
               4071,
               4172,
               4424,
               4474,
               5108,
               5736,
               5791,
               6578,
               6640,
               6664,
               6683,
               6706,
               6878,
               6887,
               7127,
               7147,
               7331,
               7346,
               7695,




## Input encoding for MLP

* Our `input_ids` are an array containing the indices of the tokens found in the text
* These are exactly the indices of the rows of the embedding matrix in the model
* That seems to be exactly what we need!
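
Why are the indices enough? A small sketch with a made-up 4 x 2 embedding matrix (the values are arbitrary and chosen only for illustration) shows that summing the rows picked by `input_ids` gives the same result as multiplying the full binary bag-of-words vector with the matrix.

```python
import math

# Hypothetical embedding matrix: 4 vocabulary items, 2 dimensions
E = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, -0.1],
    [-0.2, 0.4],
]
input_ids = [1, 3]  # the non-zero features of one example
dim = len(E[0])

# Sparse route: look up the rows and sum them
summed = [sum(E[i][d] for i in input_ids) for d in range(dim)]

# Dense route: binary feature vector times the embedding matrix
multi_hot = [1 if i in input_ids else 0 for i in range(len(E))]
dense = [sum(multi_hot[i] * E[i][d] for i in range(len(E))) for d in range(dim)]

print(summed)
print(dense)
```

Both routes give the same vector, but the sparse one touches only the rows that are actually present in the example.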


# Batching and padding

* When working with neural networks, one rarely trains on one example at a time
* Instead, processing happens one batch at a time
* There are two important reasons for this:
  1. Training without batching is too slow (GPU parallelization cannot kick in across examples)
  2. The gradients are averaged across the whole batch and applied only once, which smooths out the noise of individual examples and improves the stability of the training
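
The second point can be illustrated with a tiny hand-made example: a one-parameter model `w * x` with squared error. This is only a sketch of the averaging idea, not what the trainer does internally, and all numbers are invented.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.0    # current parameter value
lr = 0.01  # learning rate

def grad(w, x, y):
    # Derivative of (w*x - y)**2 with respect to w
    return 2 * (w * x - y) * x

# One gradient per example in the batch
per_example = [grad(w, x, y) for x, y in zip(xs, ys)]

# Average over the batch, then apply a single update
batch_grad = sum(per_example) / len(per_example)
w_new = w - lr * batch_grad

print(per_example)
print(batch_grad, w_new)
```

The individual gradients vary a lot between examples, while the single update uses their average.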


# Padding and Collation (forming a batch)

## Padding:

* In order to build a batch as a 2D array of (example, seq), we need to fit together examples of different length
* Solution: pad the shorter examples with zeroes to the length of the longest example in the batch
* Make sure that zero is understood as padding value rather than a (hypothetical) feature with index 0
* This is best shown by example; in the end it is easier than it may sound
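
As a first taste, padding can be sketched on plain Python lists; `pad_batch` is a hypothetical helper written for this illustration.

```python
def pad_batch(sequences, pad_value=0):
    # Pad every sequence on the right up to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = pad_batch([[5, 12, 40], [7], [3, 9]])
print(batch)
```

Every row now has length 3, so the batch can be turned into a single 2D array.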

## Collation:

* Just as examples are dictionaries holding the data, batches are also dictionaries holding the data
* The only difference is that in a batch, every data tensor has one extra (batch) dimension, that's all there is to it

## Collator function:

* Padding and collation are taken care of by a single function in the HF libraries
* It receives a list of examples, and returns a ready batch
* The surrounding library code takes care of forming these lists
* Let's try to implement one below

In [22]:
# 1) I need to define it here, will explain below
# 2) I show here a very straightforward implementation of padding and collation
# 3) Normally, one would use transformers.DataCollatorWithPadding but that assumes
#    a particular tokenizer, to which it outsources much of the work, and we do not
#    have it
def collator(list_of_examples):
    #this is easy, labels are made into a single tensor
    batch={"labels":torch.tensor(list(ex["label"] for ex in list_of_examples))}
    #the worse bit is now to pad the examples, as they are of different length
    tensors=[]
    max_len=max(len(example["input_ids"]) for example in list_of_examples) #this is the longest example in the batch
    #everything needs to be padded to fit in length the longest example
    #(so we can build a single tensor out of it)
    for example in list_of_examples:
        ids=torch.tensor(example["input_ids"]) #pick the input ids
        # pad(what,(from_left, from_right)) <- this is how we call the stock pad function
        padded=torch.nn.functional.pad(ids,(0,max_len-ids.shape[0])) #pad by max - current length, pads with zero by default
        tensors.append(padded) #accumulate the padded ids
    batch["input_ids"]=torch.vstack(tensors) #now that we have all of them the same length, a simple vstack() stacks them up
    return batch #...and that's all there is to it



#Build a batch from 2 examples, with padding
batch=collator([dset_tokenized["train"][2],dset_tokenized["train"][7]])
print("Shape of labels:",batch["labels"].shape)
print("Shape of input_ids:",batch["input_ids"].shape)
pprint(batch["labels"])
pprint(batch["input_ids"])

Shape of labels: torch.Size([2])
Shape of input_ids: torch.Size([2, 99])
tensor([0, 0])
tensor([[   10,   309,   887,   954,  1115,  1237,  1517,  1573,  1679,  1702,
          2454,  2604,  2625,  2687,  2711,  2717,  2958,  3637,  4130,  4225,
          4723,  5091,  5429,  6302,  6319,  6642,  6767,  6768,  7627,  7692,
          7932,  8047,  8242,  8303,  8322,  8350,  8379,  8780,  9302,  9377,
          9602,  9630,  9890, 10008, 10475, 10638, 10798, 10829, 11176, 11681,
         11857, 11897, 11936, 12134, 12142, 12256, 12363, 12437, 12439, 12577,
         12839, 13323, 13778, 14323, 14349, 15339, 15461, 15464, 16020, 16033,
         16478, 16568, 17302, 17309, 17636, 17649, 17885, 17893, 17897, 17907,
         17910, 17916, 17929, 17944, 17968, 18069, 18115, 18166, 18922, 19332,
         19355, 19521, 19540, 19639, 19712, 19805, 19813, 19832, 19843],
        [  309,   540,   720,   727,   764,   887,  1007,  1115,  1157,  1517,
          1523,  1744,  2516,  2784,  2924,  3199

# Build the MLP model

* Now that all of our data is in shape, we can build the model
* That is luckily quite easy in this case

The model class in its simplest form has `__init__()` which instantiates the layers and `forward()` which implements the actual computation. For more information on these, please see the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).

In [23]:
import torch
import transformers

# A model wants a config, I can simply inherit from the base
# class for pretrained configs
class MLPConfig(transformers.PretrainedConfig):
    pass

# This is the model
class MLP(transformers.PreTrainedModel):

    config_class=MLPConfig

    # In the initialization method, one instantiates the layers
    # these will be, for the most part the trained parameters of the model
    def __init__(self,config):
        super().__init__(config)
        self.vocab_size=config.vocab_size #embedding matrix row count
        # Build and initialize embedding of vocab size +1 x hidden size (+1 because of the padding index 0!)
        self.embedding=torch.nn.Embedding(num_embeddings=self.vocab_size+1,embedding_dim=config.hidden_size,padding_idx=0)
        # Normally you would not initialize these yourself, but I have my reasons here ;)
        torch.nn.init.uniform_(self.embedding.weight.data,-0.001,0.001) #initialize the embeddings with small random values
        # Note! uniform_() overwrites every row, including row 0 which padding_idx had set to zeros,
        # so we zero the padding row again; padding_idx=0 then keeps it at zero during training (it gets no gradient)
        self.embedding.weight.data[0].zero_()
        # This takes care of the lower half of the network, now the upper half
        # Output layer: hidden size x output size
        self.output=torch.nn.Linear(in_features=config.hidden_size,out_features=config.nlabels)
        # Now we have the parameters of the model


    # The computation of the model is put into the forward() function
    # it receives a batch of data and optionally the correct `labels`
    #
    # If given `labels` it returns (loss,output)
    # if not, then it returns (output,)
    def forward(self,input_ids,labels=None):
        #1) sum up the embeddings of the items
        embedded=self.embedding(input_ids) #(batch,ids)->(batch,ids,embedding_dim)
        # Since the Embedding keeps the first row of the matrix pure zeros, we don't need to worry about the padding
        # so next we sum the embeddings across the word dimension
        # (batch,ids,embedding_dim) -> (batch,embedding_dim)
        embedded_summed=torch.sum(embedded,dim=1)

        #2) apply non-linearity
        # (batch,embedding_dim) -> (batch,embedding_dim)
        projected=torch.tanh(embedded_summed) #Note how non-linearity is applied here and not when configuring the layer in __init__()

        #3) and now apply the upper, output layer of the network
        # (batch,embedding_dim) -> (batch, num_of_classes i.e. 2 in our case)
        logits=self.output(projected)

        # ...and that's all there is to it!

        #print("input_ids.shape",input_ids.shape)
        #print("embedded.shape",embedded.shape)
        #print("embedded_summed.shape",embedded_summed.shape)
        #print("projected.shape",projected.shape)
        #print("logits.shape",logits.shape)

        # If we have labels, we ought to calculate the loss
        if labels is not None:
            loss=torch.nn.CrossEntropyLoss() #This loss is meant for classification, so let's use it
            # You run it as loss(model_output,correct_labels)
            return (loss(logits,labels),logits)
        else:
            # No labels, so just return the logits
            return (logits,)



In [24]:
# Configure the model:
#   these parameters are used in the model's __init__()
mlp_config=MLPConfig(vocab_size=len(vectorizer.vocabulary_),hidden_size=1,nlabels=2)

# And now we can instantiate it
mlp=MLP(mlp_config)

#we can make a little test with a fake batch formed by the first two examples
fake_batch=collator([dset_tokenized["train"][0],dset_tokenized["train"][1]])
mlp(**fake_batch) #** expands input_ids and labels as parameters of the call

(tensor(0.7724, grad_fn=<NllLossBackward0>),
 tensor([[-0.4547,  0.3354],
         [-0.4613,  0.3380]], grad_fn=<AddmmBackward0>))
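
The loss printed above can be reproduced by hand from the two rows of logits. The sketch below assumes that the labels of the two examples are `[1, 0]`; this is an assumption, chosen because it is the assignment that reproduces the printed value, and the labels depend on the random shuffle. The `cross_entropy` helper is written here for illustration only.

```python
import math

def cross_entropy(logits, label):
    # log(sum(exp(logits))) minus the logit of the correct class,
    # computed in a numerically stable way
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

logits = [[-0.4547, 0.3354], [-0.4613, 0.3380]]  # copied from the output above
assumed_labels = [1, 0]  # ASSUMPTION, see the text

losses = [cross_entropy(row, y) for row, y in zip(logits, assumed_labels)]
mean_loss = sum(losses) / len(losses)  # the loss is averaged over the batch

print(losses)
print(mean_loss)
print(math.log(2))  # loss of a model that outputs 50:50 for two classes
```

The mean comes out at about 0.772, close to ln 2 (about 0.693), which is what one would expect from a freshly initialized model that knows nothing yet.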

# Train the model

We will use the Hugging Face [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class for training

* Loads of arguments that control the training
* Configurable metrics to evaluate performance
* Data collator builds the batches
* Early stopping callback stops when eval loss no longer improves
* Model load/save
* An excellent foundation for a later deep learning course
  

First, let's create a [`TrainingArguments`](https://huggingface.co/docs/transformers/v4.17.0/en/main_classes/trainer#transformers.TrainingArguments) object to specify hyperparameters and various other settings for training.

Printing this simple dataclass object will show not only the values we set, but also the defaults for all other arguments. Don't worry if you don't understand what all of these do! Many are not relevant to us here, and you can find the details in [`Trainer` documentation](https://huggingface.co/docs/transformers/main_classes/trainer) if you are interested.

In [25]:
# Set training arguments
# their names are mostly self-explanatory
trainer_args = transformers.TrainingArguments(
    "mlp_checkpoints", #save checkpoints here
    evaluation_strategy="steps", #evaluate every eval_steps steps (recent transformers releases call this argument eval_strategy)
    logging_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-5, #learning rate of the gradient descent
    max_steps=20000,
    load_best_model_at_end=True,
    per_device_train_batch_size=128
)

pprint(trainer_args)

TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=500,
evaluation_strategy=steps,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_l
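
A little arithmetic helps to read the numbers the trainer reports during training. The sketch below assumes a single device, no gradient accumulation, and a learning rate that decays linearly to zero without warmup (which I believe is the default schedule); `linear_lr` and `epoch_at` are hypothetical helpers, not part of the library.

```python
import math

train_size = 25_000  # number of training examples
batch_size = 128     # per_device_train_batch_size
max_steps = 20_000
base_lr = 1e-5

# The last, smaller batch of an epoch still counts as a step
steps_per_epoch = math.ceil(train_size / batch_size)

def linear_lr(step):
    # Linear decay from base_lr at step 0 to zero at max_steps
    return base_lr * (1 - step / max_steps)

def epoch_at(step):
    # Fractional epoch reached after a given number of steps
    return step / steps_per_epoch

print(steps_per_epoch)
print(linear_lr(500), round(epoch_at(500), 2))
print(round(epoch_at(max_steps), 1))
```

With these settings the first log entry, at step 500, should show a learning rate of 9.75e-06 at epoch 2.55, and the full 20,000 steps would amount to roughly 102 passes over the training data unless early stopping kicks in first.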

Next, let's create a metric for evaluating performance during and after training. We can use the convenience function `evaluate.load` to load one of many [pre-made metrics](https://huggingface.co/docs/datasets/about_metrics) and wrap it for use by the trainer.

As the task is simple binary classification and our data is evenly balanced (50:50), we can comfortably use the basic `accuracy` metric, defined as the proportion of correctly predicted labels out of all labels.

In [26]:
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_accuracy(outputs_and_labels):
    outputs, labels = outputs_and_labels
    predictions = np.argmax(outputs, axis=-1) #pick the index of the "winning" label
    return accuracy.compute(predictions=predictions, references=labels)
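
To see what this computes without loading the metric, here is the same calculation in plain Python on four invented rows of logits. The helpers `argmax` and `accuracy_by_hand` exist only in this sketch.

```python
def argmax(row):
    # Index of the largest value, i.e. the "winning" label
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_by_hand(logits, labels):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [-0.5, 0.5]]
toy_labels = [1, 0, 0, 1]

toy_accuracy = accuracy_by_hand(toy_logits, toy_labels)
print(toy_accuracy)
```

The predictions are `[1, 0, 1, 1]`, so three of the four examples are correct.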

We can then create the `Trainer` and train the model by invoking the [`Trainer.train`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.train) function.

In addition to the model, the settings passed in through the `TrainingArguments` object created above (`trainer_args`), the data, and the metric defined above, we create and pass the following to the `Trainer`:

* [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator): groups input into batches
* [`EarlyStoppingCallback`](https://huggingface.co/docs/transformers/main_classes/callback#transformers.EarlyStoppingCallback): stops training when performance stops improving
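
The idea of patience-based early stopping can be sketched in a few lines. This is a simplified illustration, not the callback's actual implementation; `early_stop_index` is a hypothetical helper and the loss values are invented.

```python
def early_stop_index(eval_losses, patience):
    # Return the index of the evaluation at which training would stop,
    # or None if the patience never runs out
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss
            bad_evals = 0   # improvement: reset the counter
        else:
            bad_evals += 1  # no improvement over the best so far
            if bad_evals >= patience:
                return i
    return None

toy_losses = [0.70, 0.65, 0.66, 0.64, 0.65, 0.66, 0.67]
stop_short = early_stop_index(toy_losses, patience=3)
stop_long = early_stop_index(toy_losses, patience=5)
print(stop_short, stop_long)
```

With a patience of 3, training stops at the last evaluation, three evaluations after the best loss of 0.64; with a patience of 5 it would keep going.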

In [27]:
# Make a new model
mlp = MLP(mlp_config)


# The argument is the patience: training is stopped once the evaluation
# loss has failed to improve for this many consecutive evaluations
# (an evaluation is run every eval_steps steps)
early_stopping = transformers.EarlyStoppingCallback(5)

trainer = transformers.Trainer(
    model=mlp,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"].select(range(1000)), #make a smaller subset to evaluate on
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)

# FINALLY!
trainer.train()

{'loss': 0.7903, 'grad_norm': 0.7438209056854248, 'learning_rate': 9.75e-06, 'epoch': 2.55}
{'eval_loss': 0.7590334415435791, 'eval_accuracy': 0.51, 'eval_runtime': 0.2868, 'eval_samples_per_second': 3487.041, 'eval_steps_per_second': 435.88, 'epoch': 2.55}

{'loss': 0.7324, 'grad_norm': 0.4123412072658539, 'learning_rate': 9.5e-06, 'epoch': 5.1}
{'eval_loss': 0.7225984930992126, 'eval_accuracy': 0.51, 'eval_runtime': 0.2448, 'eval_samples_per_second': 4084.827, 'eval_steps_per_second': 510.603, 'epoch': 5.1}

{'loss': 0.7113, 'grad_norm': 0.39016756415367126, 'learning_rate': 9.250000000000001e-06, 'epoch': 7.65}
{'eval_loss': 0.7062789797782898, 'eval_accuracy': 0.51, 'eval_runtime': 0.2508, 'eval_samples_per_second': 3987.068, 'eval_steps_per_second': 498.384, 'epoch': 7.65}

{'loss': 0.7009, 'grad_norm': 0.39031729102134705, 'learning_rate': 9e-06, 'epoch': 10.2}
{'eval_loss': 0.6963932514190674, 'eval_accuracy': 0.51, 'eval_runtime': 0.2718, 'eval_samples_per_second': 3679.108, 'eval_steps_per_second': 459.888, 'epoch': 10.2}

{'loss': 0.693, 'grad_norm': 0.2645820379257202, 'learning_rate': 8.750000000000001e-06, 'epoch': 12.76}
{'eval_loss': 0.6892740726470947, 'eval_accuracy': 0.51, 'eval_runtime': 0.2868, 'eval_samples_per_second': 3487.012, 'eval_steps_per_second': 435.877, 'epoch': 12.76}

{'loss': 0.6875, 'grad_norm': 0.3193091154098511, 'learning_rate': 8.5e-06, 'epoch': 15.31}
{'eval_loss': 0.6831551790237427, 'eval_accuracy': 0.51, 'eval_runtime': 0.2818, 'eval_samples_per_second': 3548.634, 'eval_steps_per_second': 443.579, 'epoch': 15.31}

{'loss': 0.6814, 'grad_norm': 0.26086321473121643, 'learning_rate': 8.25e-06, 'epoch': 17.86}
{'eval_loss': 0.677168607711792, 'eval_accuracy': 0.51, 'eval_runtime': 0.2838, 'eval_samples_per_second': 3523.888, 'eval_steps_per_second': 440.486, 'epoch': 17.86}

{'loss': 0.6754, 'grad_norm': 0.2898545265197754, 'learning_rate': 8.000000000000001e-06, 'epoch': 20.41}
{'eval_loss': 0.6705374717712402, 'eval_accuracy': 0.51, 'eval_runtime': 0.2788, 'eval_samples_per_second': 3587.018, 'eval_steps_per_second': 448.377, 'epoch': 20.41}

{'loss': 0.6674, 'grad_norm': 0.40939071774482727, 'learning_rate': 7.75e-06, 'epoch': 22.96}
{'eval_loss': 0.6629312634468079, 'eval_accuracy': 0.51, 'eval_runtime': 0.2648, 'eval_samples_per_second': 3776.271, 'eval_steps_per_second': 472.034, 'epoch': 22.96}

{'loss': 0.6586, 'grad_norm': 0.3682360053062439, 'learning_rate': 7.500000000000001e-06, 'epoch': 25.51}
{'eval_loss': 0.6547653675079346, 'eval_accuracy': 0.51, 'eval_runtime': 0.2718, 'eval_samples_per_second': 3679.343, 'eval_steps_per_second': 459.918, 'epoch': 25.51}

{'loss': 0.6486, 'grad_norm': 0.38483232259750366, 'learning_rate': 7.25e-06, 'epoch': 28.06}
{'eval_loss': 0.6467961668968201, 'eval_accuracy': 0.51, 'eval_runtime': 0.2738, 'eval_samples_per_second': 3652.461, 'eval_steps_per_second': 456.558, 'epoch': 28.06}

{'loss': 0.6398, 'grad_norm': 0.43703970313072205, 'learning_rate': 7e-06, 'epoch': 30.61}
{'eval_loss': 0.6392375230789185, 'eval_accuracy': 0.512, 'eval_runtime': 0.2618, 'eval_samples_per_second': 3819.772, 'eval_steps_per_second': 477.471, 'epoch': 30.61}

{'loss': 0.6309, 'grad_norm': 0.3851061165332794, 'learning_rate': 6.750000000000001e-06, 'epoch': 33.16}
{'eval_loss': 0.6322633028030396, 'eval_accuracy': 0.513, 'eval_runtime': 0.2728, 'eval_samples_per_second': 3665.635, 'eval_steps_per_second': 458.204, 'epoch': 33.16}

{'loss': 0.6234, 'grad_norm': 0.44433945417404175, 'learning_rate': 6.5000000000000004e-06, 'epoch': 35.71}
{'eval_loss': 0.6258219480514526, 'eval_accuracy': 0.515, 'eval_runtime': 0.2808, 'eval_samples_per_second': 3561.485, 'eval_steps_per_second': 445.186, 'epoch': 35.71}

{'loss': 0.6177, 'grad_norm': 0.5876233577728271, 'learning_rate': 6.25e-06, 'epoch': 38.27}
{'eval_loss': 0.6197741627693176, 'eval_accuracy': 0.518, 'eval_runtime': 0.3018, 'eval_samples_per_second': 3313.827, 'eval_steps_per_second': 414.228, 'epoch': 38.27}

{'loss': 0.6107, 'grad_norm': 0.4848390519618988, 'learning_rate': 6e-06, 'epoch': 40.82}
{'eval_loss': 0.6142891645431519, 'eval_accuracy': 0.523, 'eval_runtime': 0.2748, 'eval_samples_per_second': 3639.025, 'eval_steps_per_second': 454.878, 'epoch': 40.82}

{'loss': 0.6042, 'grad_norm': 0.3927479684352875, 'learning_rate': 5.75e-06, 'epoch': 43.37}
{'eval_loss': 0.6091866493225098, 'eval_accuracy': 0.532, 'eval_runtime': 0.2678, 'eval_samples_per_second': 3734.252, 'eval_steps_per_second': 466.781, 'epoch': 43.37}

{'loss': 0.6004, 'grad_norm': 0.4539059102535248, 'learning_rate': 5.500000000000001e-06, 'epoch': 45.92}
{'eval_loss': 0.6043636202812195, 'eval_accuracy': 0.538, 'eval_runtime': 0.2688, 'eval_samples_per_second': 3720.36, 'eval_steps_per_second': 465.045, 'epoch': 45.92}

{'loss': 0.5949, 'grad_norm': 0.5902479887008667, 'learning_rate': 5.2500000000000006e-06, 'epoch': 48.47}
{'eval_loss': 0.5999986529350281, 'eval_accuracy': 0.544, 'eval_runtime': 0.2608, 'eval_samples_per_second': 3834.428, 'eval_steps_per_second': 479.303, 'epoch': 48.47}

{'loss': 0.5903, 'grad_norm': 0.5105913877487183, 'learning_rate': 5e-06, 'epoch': 51.02}
{'eval_loss': 0.5958550572395325, 'eval_accuracy': 0.548, 'eval_runtime': 0.2728, 'eval_samples_per_second': 3665.869, 'eval_steps_per_second': 458.234, 'epoch': 51.02}

{'loss': 0.5859, 'grad_norm': 0.4184073507785797, 'learning_rate': 4.75e-06, 'epoch': 53.57}
{'eval_loss': 0.592052161693573, 'eval_accuracy': 0.555, 'eval_runtime': 0.2908, 'eval_samples_per_second': 3439.099, 'eval_steps_per_second': 429.887, 'epoch': 53.57}

{'loss': 0.5825, 'grad_norm': 0.4040972888469696, 'learning_rate': 4.5e-06, 'epoch': 56.12}
{'eval_loss': 0.5883980989456177, 'eval_accuracy': 0.566, 'eval_runtime': 0.2658, 'eval_samples_per_second': 3762.305, 'eval_steps_per_second': 470.288, 'epoch': 56.12}

{'loss': 0.5778, 'grad_norm': 0.5294850468635559, 'learning_rate': 4.25e-06, 'epoch': 58.67}
{'eval_loss': 0.5851428508758545, 'eval_accuracy': 0.575, 'eval_runtime': 0.2618, 'eval_samples_per_second': 3819.754, 'eval_steps_per_second': 477.469, 'epoch': 58.67}

{'loss': 0.575, 'grad_norm': 0.41317784786224365, 'learning_rate': 4.000000000000001e-06, 'epoch': 61.22}
{'eval_loss': 0.5820456743240356, 'eval_accuracy': 0.586, 'eval_runtime': 0.2798, 'eval_samples_per_second': 3574.208, 'eval_steps_per_second': 446.776, 'epoch': 61.22}

{'loss': 0.572, 'grad_norm': 0.4554649889469147, 'learning_rate': 3.7500000000000005e-06, 'epoch': 63.78}
{'eval_loss': 0.5791789293289185, 'eval_accuracy': 0.591, 'eval_runtime': 0.2848, 'eval_samples_per_second': 3511.288, 'eval_steps_per_second': 438.911, 'epoch': 63.78}

{'loss': 0.5679, 'grad_norm': 0.45365068316459656, 'learning_rate': 3.5e-06, 'epoch': 66.33}
{'eval_loss': 0.5765705704689026, 'eval_accuracy': 0.595, 'eval_runtime': 0.2718, 'eval_samples_per_second': 3679.108, 'eval_steps_per_second': 459.888, 'epoch': 66.33}

{'loss': 0.5665, 'grad_norm': 0.43149781227111816, 'learning_rate': 3.2500000000000002e-06, 'epoch': 68.88}
{'eval_loss': 0.5741108059883118, 'eval_accuracy': 0.604, 'eval_runtime': 0.2848, 'eval_samples_per_second': 3511.512, 'eval_steps_per_second': 438.939, 'epoch': 68.88}


 70%|███████   | 14000/20000 [05:24<02:08, 46.53it/s]

{'loss': 0.5634, 'grad_norm': 0.5133329033851624, 'learning_rate': 3e-06, 'epoch': 71.43}


                                                     
 70%|███████   | 14008/20000 [05:24<03:18, 30.15it/s]

{'eval_loss': 0.5719172954559326, 'eval_accuracy': 0.613, 'eval_runtime': 0.2588, 'eval_samples_per_second': 3863.994, 'eval_steps_per_second': 482.999, 'epoch': 71.43}


 72%|███████▎  | 14500/20000 [05:35<01:51, 49.39it/s]

{'loss': 0.5612, 'grad_norm': 0.45419302582740784, 'learning_rate': 2.7500000000000004e-06, 'epoch': 73.98}


                                                     
 73%|███████▎  | 14505/20000 [05:35<03:24, 26.87it/s]

{'eval_loss': 0.5699086785316467, 'eval_accuracy': 0.618, 'eval_runtime': 0.2783, 'eval_samples_per_second': 3593.167, 'eval_steps_per_second': 449.146, 'epoch': 73.98}


 75%|███████▌  | 15000/20000 [05:47<01:46, 46.83it/s]

{'loss': 0.5592, 'grad_norm': 0.4969077408313751, 'learning_rate': 2.5e-06, 'epoch': 76.53}


                                                     
 75%|███████▌  | 15004/20000 [05:47<03:23, 24.52it/s]

{'eval_loss': 0.5680720806121826, 'eval_accuracy': 0.626, 'eval_runtime': 0.2998, 'eval_samples_per_second': 3335.913, 'eval_steps_per_second': 416.989, 'epoch': 76.53}


 78%|███████▊  | 15500/20000 [05:58<01:37, 46.33it/s]

{'loss': 0.5575, 'grad_norm': 0.4224236309528351, 'learning_rate': 2.25e-06, 'epoch': 79.08}


                                                     
 78%|███████▊  | 15505/20000 [05:59<02:58, 25.25it/s]

{'eval_loss': 0.5664092898368835, 'eval_accuracy': 0.637, 'eval_runtime': 0.2688, 'eval_samples_per_second': 3720.37, 'eval_steps_per_second': 465.046, 'epoch': 79.08}


 80%|████████  | 16000/20000 [06:10<01:28, 44.95it/s]

{'loss': 0.5557, 'grad_norm': 0.4606848955154419, 'learning_rate': 2.0000000000000003e-06, 'epoch': 81.63}


                                                     
 80%|████████  | 16008/20000 [06:10<02:18, 28.76it/s]

{'eval_loss': 0.5649584531784058, 'eval_accuracy': 0.639, 'eval_runtime': 0.2778, 'eval_samples_per_second': 3599.927, 'eval_steps_per_second': 449.991, 'epoch': 81.63}


 82%|████████▎ | 16500/20000 [06:21<01:15, 46.58it/s]

{'loss': 0.5538, 'grad_norm': 0.42630496621131897, 'learning_rate': 1.75e-06, 'epoch': 84.18}


                                                     
 83%|████████▎ | 16505/20000 [06:22<02:24, 24.23it/s]

{'eval_loss': 0.5636917948722839, 'eval_accuracy': 0.644, 'eval_runtime': 0.3018, 'eval_samples_per_second': 3313.73, 'eval_steps_per_second': 414.216, 'epoch': 84.18}


 85%|████████▌ | 17000/20000 [06:33<01:04, 46.55it/s]

{'loss': 0.5529, 'grad_norm': 0.47312891483306885, 'learning_rate': 1.5e-06, 'epoch': 86.73}


                                                     
 85%|████████▌ | 17006/20000 [06:33<01:43, 28.85it/s]

{'eval_loss': 0.5625880360603333, 'eval_accuracy': 0.647, 'eval_runtime': 0.2798, 'eval_samples_per_second': 3574.208, 'eval_steps_per_second': 446.776, 'epoch': 86.73}


 88%|████████▊ | 17500/20000 [06:44<00:54, 45.51it/s]

{'loss': 0.552, 'grad_norm': 0.43302205204963684, 'learning_rate': 1.25e-06, 'epoch': 89.29}


                                                     
 88%|████████▊ | 17507/20000 [06:45<01:28, 28.32it/s]

{'eval_loss': 0.5616555213928223, 'eval_accuracy': 0.651, 'eval_runtime': 0.2568, 'eval_samples_per_second': 3894.071, 'eval_steps_per_second': 486.759, 'epoch': 89.29}


 90%|█████████ | 18000/20000 [06:56<00:47, 42.52it/s]

{'loss': 0.551, 'grad_norm': 0.41618162393569946, 'learning_rate': 1.0000000000000002e-06, 'epoch': 91.84}


                                                     
 90%|█████████ | 18005/20000 [06:57<01:32, 21.68it/s]

{'eval_loss': 0.5609046816825867, 'eval_accuracy': 0.655, 'eval_runtime': 0.2898, 'eval_samples_per_second': 3450.949, 'eval_steps_per_second': 431.369, 'epoch': 91.84}


 92%|█████████▎| 18500/20000 [07:09<00:32, 45.50it/s]

{'loss': 0.5498, 'grad_norm': 0.48752325773239136, 'learning_rate': 7.5e-07, 'epoch': 94.39}


                                                     
 93%|█████████▎| 18503/20000 [07:10<00:59, 25.23it/s]

{'eval_loss': 0.5603178143501282, 'eval_accuracy': 0.658, 'eval_runtime': 0.2608, 'eval_samples_per_second': 3834.382, 'eval_steps_per_second': 479.298, 'epoch': 94.39}


 95%|█████████▌| 19000/20000 [07:21<00:22, 43.78it/s]

{'loss': 0.5497, 'grad_norm': 0.4702327847480774, 'learning_rate': 5.000000000000001e-07, 'epoch': 96.94}


                                                     
 95%|█████████▌| 19006/20000 [07:21<00:35, 27.85it/s]

{'eval_loss': 0.5598985552787781, 'eval_accuracy': 0.658, 'eval_runtime': 0.2728, 'eval_samples_per_second': 3665.766, 'eval_steps_per_second': 458.221, 'epoch': 96.94}


 98%|█████████▊| 19500/20000 [07:33<00:11, 41.75it/s]

{'loss': 0.5493, 'grad_norm': 0.4368070662021637, 'learning_rate': 2.5000000000000004e-07, 'epoch': 99.49}


                                                     
 98%|█████████▊| 19508/20000 [07:33<00:17, 28.02it/s]

{'eval_loss': 0.5596480369567871, 'eval_accuracy': 0.659, 'eval_runtime': 0.2798, 'eval_samples_per_second': 3573.997, 'eval_steps_per_second': 446.75, 'epoch': 99.49}


100%|██████████| 20000/20000 [07:45<00:00, 46.67it/s]

{'loss': 0.5487, 'grad_norm': 0.4650239944458008, 'learning_rate': 0.0, 'epoch': 102.04}


                                                     
100%|██████████| 20000/20000 [07:45<00:00, 42.98it/s]

{'eval_loss': 0.5595652461051941, 'eval_accuracy': 0.659, 'eval_runtime': 0.2758, 'eval_samples_per_second': 3625.8, 'eval_steps_per_second': 453.225, 'epoch': 102.04}
{'train_runtime': 465.3703, 'train_samples_per_second': 5500.995, 'train_steps_per_second': 42.977, 'train_loss': 0.6097665313720703, 'epoch': 102.04}





TrainOutput(global_step=20000, training_loss=0.6097665313720703, metrics={'train_runtime': 465.3703, 'train_samples_per_second': 5500.995, 'train_steps_per_second': 42.977, 'train_loss': 0.6097665313720703, 'epoch': 102.04})
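
The dictionaries printed during training are also stored as a list in `trainer.state.log_history`. As a minimal sketch, assuming that list-of-dicts layout, the helper below (the name `best_eval` is hypothetical, not part of the notebook) picks the evaluation entry with the highest accuracy. The sample entries are copied from the log above, trimmed to a few fields.

```python
def best_eval(log_history, metric="eval_accuracy"):
    """Return the log entry with the highest value of `metric`."""
    # Training entries carry 'loss' rather than 'eval_*' keys, so skip them.
    eval_entries = [entry for entry in log_history if metric in entry]
    if not eval_entries:
        raise ValueError(f"no entries contain {metric!r}")
    return max(eval_entries, key=lambda entry: entry[metric])


# A few entries copied from the training log above (fields trimmed).
sample_history = [
    {"loss": 0.5949, "epoch": 48.47},
    {"eval_loss": 0.5999986529350281, "eval_accuracy": 0.544, "epoch": 48.47},
    {"eval_loss": 0.5603178143501282, "eval_accuracy": 0.658, "epoch": 94.39},
    {"eval_loss": 0.5595652461051941, "eval_accuracy": 0.659, "epoch": 102.04},
]

best = best_eval(sample_history)
print(best["epoch"], best["eval_accuracy"])
```

In the notebook itself the call would presumably be `best_eval(trainer.state.log_history)`.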

We can then evaluate the trained model on a given dataset (here our test subset) by calling [`Trainer.evaluate`](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer.evaluate):

In [28]:
eval_results = trainer.evaluate(dset_tokenized["test"])

print(eval_results)

100%|██████████| 3125/3125 [00:06<00:00, 510.21it/s]

{'eval_loss': 0.5708923935890198, 'eval_accuracy': 0.6408, 'eval_runtime': 6.1279, 'eval_samples_per_second': 4079.695, 'eval_steps_per_second': 509.962, 'epoch': 102.04}
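
The `eval_accuracy` value comes from the `compute_accuracy` function handed to the `Trainer`. That function is defined earlier in the notebook and may differ in detail; the sketch below only illustrates the usual idea with toy numbers: take the highest-scoring class for each example and count how often it matches the label. The name `accuracy_from_logits` and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = np.argmax(logits, axis=-1)  # one predicted class per row
    return float(np.mean(predictions == labels))


# Toy data: four examples, two classes (0 = negative, 1 = positive).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 0.2], [1.2, 0.9]])
toy_labels = np.array([0, 1, 0, 0])

print(accuracy_from_logits(toy_logits, toy_labels))
```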





# Save the model for later use

* You can save it with `trainer.save_model()`
* You can load it with `MLP.from_pretrained()`


In [29]:
trainer.save_model("mlp-imdb")

# Check save/load

In [30]:
mlp2=MLP.from_pretrained("mlp-imdb")

In [31]:
trainer = transformers.Trainer(
    model=mlp2,
    args=trainer_args,
    train_dataset=dset_tokenized["train"],
    eval_dataset=dset_tokenized["test"],
    compute_metrics=compute_accuracy,
    data_collator=collator,
    callbacks=[early_stopping]
)



In [32]:
eval_results = trainer.evaluate(dset_tokenized["test"])
print(eval_results)
print('Accuracy:', eval_results['eval_accuracy'])

100%|██████████| 3125/3125 [00:05<00:00, 526.74it/s]

{'eval_loss': 0.5708923935890198, 'eval_accuracy': 0.6408, 'eval_runtime': 5.9388, 'eval_samples_per_second': 4209.634, 'eval_steps_per_second': 526.204}
Accuracy: 0.6408
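
A quick way to confirm that saving and loading preserved the model is to compare the evaluation results from before and after, ignoring the timing fields, which vary from run to run. The helper below is a sketch with a hypothetical name (`same_metrics`); the two dictionaries are copied from the outputs above, trimmed to the relevant keys.

```python
import math


def same_metrics(before, after, keys=("eval_loss", "eval_accuracy")):
    """True if the two result dicts agree on the given (non-timing) keys."""
    return all(math.isclose(before[k], after[k], rel_tol=1e-6) for k in keys)


# Results copied from the two evaluation runs above.
before_save = {"eval_loss": 0.5708923935890198, "eval_accuracy": 0.6408, "eval_runtime": 6.1279}
after_load = {"eval_loss": 0.5708923935890198, "eval_accuracy": 0.6408, "eval_runtime": 5.9388}

print(same_metrics(before_save, after_load))
```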





# Extra time left?

* Read through the [`TrainingArguments` documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and try to understand at least some parts of it
* Read through the [Torch tensor operations documentation](https://pytorch.org/docs/stable/tensors.html) and try to understand at least some parts of it
* Run the model with different parameters (hidden layer width, learning rate, etc.). How much do the results change?


# What has the model learned?

* The embeddings should have some meaning to them
* Similar features should have similar embeddings

In [33]:
# Grab the embedding matrix out of the trained model
# and drop the first row (padding 0)
# then we can treat the embeddings as vectors
# and maybe compare them to each other
# ha ha this below took some googling
weights=mlp.embedding.weight.detach().cpu().numpy()
weights=weights[1:,:]

In [34]:
qry_idx=vectorizer.vocabulary_["lousy"] #index of the query word "lousy" in the vocabulary

#calculate the distance of the "lousy" embedding to all other embeddings
distance_to_qry=sklearn.metrics.pairwise.euclidean_distances(weights[qry_idx:qry_idx+1,:],weights)
nearest_neighbors=np.argsort(distance_to_qry) #indices of words nearest to "lousy"
for nearest in nearest_neighbors[0,:20]:
    print(idx2word[nearest])
# This works great!

lousy
atrocious
excuse
bored
unbelievable
joke
flat
bother
insult
sucks
decent
uninteresting
weak
dreadful
acting
incoherent
amateurish
low
couldn
sorry


In [35]:
print(nearest_neighbors)

[[10693  1315  6398 ...  7931 19747  6366]]


* The embeddings indeed seem to reflect the task
* There is a meaning to them
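
The lookup above ranks neighbors by Euclidean distance. A common alternative is cosine similarity, which compares only the direction of the vectors and ignores their length. The sketch below shows the idea on a tiny made-up embedding matrix; the function name `cosine_neighbors`, the toy words, and the toy vectors are all assumptions for illustration, and the code assumes no embedding row is all zeros.

```python
import numpy as np


def cosine_neighbors(embeddings, query_idx, k):
    """Indices of the k rows most similar (by cosine) to row `query_idx`."""
    # Scale every row to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    similarity = unit @ unit[query_idx]
    # argsort is ascending, so negate to get the most similar rows first.
    return np.argsort(-similarity)[:k]


# Hypothetical 2-dimensional embeddings for four words.
toy_words = ["lousy", "atrocious", "great", "superb"]
toy_embeddings = np.array([[-1.0, -0.9], [-0.9, -1.0], [1.0, 0.8], [0.9, 1.0]])

neighbors = cosine_neighbors(toy_embeddings, toy_words.index("lousy"), k=2)
print([toy_words[i] for i in neighbors])
```

With the notebook's variables, the equivalent call would presumably be `cosine_neighbors(weights, qry_idx, 20)`.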

# Feature weights

*   A typical "old-school" way to approach the classification would be a simple linear model, like a linear SVM
*   Under such a model, each feature (word) would have a single weight
*   And the classification would simply be based on the sum of these weights
*   In the context of this task, "positive" words would get a high weight, "negative" words would get a low weight
*   It is in fact quite easy to reconfigure the MLP model to work more or less like this and replicate the effect
*   I will leave that as an exercise for you
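
To make the "sum of weights" idea concrete, here is a minimal sketch of how such a linear scorer decides. It is not the reconfigured MLP from the exercise: the weights are invented for illustration, the bias term a real linear model usually has is left out, and the names `linear_score` and `classify` are hypothetical.

```python
def linear_score(text, word_weights):
    """Sum the weights of the words in `text`; unknown words count as 0."""
    return sum(word_weights.get(token, 0.0) for token in text.lower().split())


def classify(text, word_weights):
    """Label a text as positive when its summed weight is above zero."""
    return "positive" if linear_score(text, word_weights) > 0 else "negative"


# Hypothetical weights: high for positive words, low for negative ones.
toy_weights = {"excellent": 1.5, "great": 1.0, "boring": -1.2, "worst": -2.0}

print(classify("a great film with excellent acting", toy_weights))
print(classify("the worst and most boring film", toy_weights))
```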



In [36]:
# weights now looks like this
print(weights)

[[ 0.00931336]
 [ 0.00958689]
 [-0.0108391 ]
 ...
 [-0.01083061]
 [ 0.01148495]
 [ 0.00241483]]


In [37]:
# we don't want each weight to sit in its own one-element row, so let's reshape to a single row
weights = weights.reshape(1, -1)
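
As a small illustration of the reshape and of the sorting used in the next cells, the sketch below runs the same steps on a toy column of four made-up weights. `reshape(1, -1)` turns the `(N, 1)` column into a `(1, N)` row, and `np.argsort` returns indices in ascending order of value, so the front of the result holds the lowest weights and the back holds the highest.

```python
import numpy as np

# Toy stand-in for the embedding matrix: four words, one weight each.
column = np.array([[0.2], [-0.7], [0.5], [-0.1]])  # shape (4, 1)
row = column.reshape(1, -1)                        # shape (1, 4)

order = np.argsort(row)            # indices sorted by ascending weight
lowest_two = order[0, :2]          # indices of the two lowest weights
highest_two = order[0, -2:][::-1]  # two highest weights, largest first

print(row.shape, lowest_two, highest_two)
```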

In [38]:
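# print the hundred most positive words
# np.argsort sorts ascending, so these are the words with the LOWEST weights;
# in this run the positive words ended up at the low end (the sign of a learned
# weight is arbitrary, since later layers can flip it)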
sorted_feature_importances=np.argsort(weights)
for most_positive in sorted_feature_importances[0,:100]:
    print(idx2word[most_positive])

excellent
wonderful
great
perfect
amazing
favorite
superb
loved
fantastic
highly
best
beautiful
brilliant
wonderfully
touching
today
love
beautifully
gem
enjoyed
heart
terrific
perfectly
refreshing
recommended
powerful
incredible
captures
rare
outstanding
unique
always
underrated
performances
delightful
job
subtle
definitely
both
moving
finest
also
simple
fun
strong
flawless
emotions
awesome
true
noir
superbly
favorites
masterpiece
brilliantly
realistic
sweet
enjoyable
well
enjoy
different
favourite
greatest
friendship
perfection
breathtaking
stunning
solid
classic
journey
shows
magnificent
gives
especially
beauty
delight
excellently
performance
each
world
timeless
intense
portrayal
funniest
complex
still
atmosphere
tears
haunting
fascinating
appreciated
extraordinary
family
notch
helps
remarkable
unexpected
deeply
chilling
freedom
years


In [56]:
# print the hundred most negative words (the highest weights in this run, largest first)
for most_negative in reversed(sorted_feature_importances[0,-100:]):
    print(idx2word[most_negative])

worst
bad
waste
awful
terrible
boring
worse
stupid
horrible
poor
poorly
nothing
dull
crap
lame
avoid
pointless
ridiculous
wasted
supposed
mess
minutes
laughable
badly
pathetic
no
annoying
script
fails
money
plot
unfortunately
redeeming
disappointing
save
unfunny
unless
cheap
instead
oh
disappointment
predictable
garbage
attempt
dumb
wooden
mediocre
sorry
acting
decent
joke
atrocious
lousy
excuse
bored
unbelievable
flat
bother
insult
sucks
uninteresting
weak
dreadful
incoherent
amateurish
low
couldn
reason
lacks
remotely
tedious
why
even
silly
failed
idea
mst3k
guess
none
embarrassing
half
ok
whatsoever
rubbish
effort
wasting
least
unwatchable
costs
better
skip
forgettable
fake
unconvincing
trash
any
bunch
bland
pretentious
don
