<a href="https://colab.research.google.com/github/leukschrauber/LearningPortfolio/blob/main/learn_portfolio_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Learning Portfolio
*by Fabian Leuk (csba6437/12215478)*

## Session 6: Hugging Face Pre-Trained Models

### Key Learnings

**What is Natural Language Processing and what are associated Machine Learning tasks?**

Natural Language Processing is a field of linguistics and machine learning focused on understanding everything related to human language. The following is a non-exhaustive list of tasks accomplished by NLP models:

* Text classification
* Zero-Shot classification
* Text generation
* Text completion
* Token classification
* Question answering
* Summarization
* Translation

**What is huggingface and why is it useful to Machine Learning?**

Through its Transformers library, huggingface provides standardized architectures for Natural Language Processing (NLP) models. The library can be used to create, load, and use pre-trained NLP models. The huggingface model repository hosts many pre-trained models that can be used for the tasks listed above.

Because determining a model's weights from scratch requires large amounts of data and compute, reusing a pre-trained model is in most cases more environmentally friendly, more effective, and more efficient.

Beyond these models, huggingface also provides datasets and utility functions for NLP machine learning purposes.

**How can pre-trained models be used for tasks they have not been trained on?**

Most NLP tasks require context sensitivity. The model architectures achieve this by turning the inputs into mathematical (vector) representations, which is a very general process. Only the top layers of these architectures, the so-called model head, then use these vectors to accomplish the desired task.

Training a pre-trained model on a new task thus involves discarding the model head and replacing it with a suitable, randomly initialized head for the desired task. The weights are then trained using methods such as SGD. This process is called "transfer learning" or "fine-tuning".
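
The head replacement can be illustrated with a toy sketch in plain Python. This is not a real Transformer: `pretrained_body` is a hypothetical frozen feature extractor, the task and data are made up, and only the freshly initialized head is updated by SGD.

```python
import math
import random

def pretrained_body(x):
    # Hypothetical frozen feature extractor standing in for the pre-trained layers.
    return [x, x * x]

def head(features, weights, bias):
    # New task head: a single logistic unit on top of the body's features.
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
weights = [rng.uniform(-0.1, 0.1) for _ in range(2)]  # randomly initialized head
bias = 0.0

# Made-up toy task: the label is 1 when |x| > 1.
data = [(k / 10.0, 1 if abs(k) > 10 else 0) for k in range(-20, 21)]

learning_rate = 0.1
for epoch in range(300):
    rng.shuffle(data)
    for x, y in data:
        features = pretrained_body(x)                # the body is used, never updated
        error = head(features, weights, bias) - y    # gradient of the log loss w.r.t. z
        weights = [w - learning_rate * error * f for w, f in zip(weights, features)]
        bias -= learning_rate * error

correct = sum((head(pretrained_body(x), weights, bias) > 0.5) == (y == 1) for x, y in data)
accuracy = correct / len(data)
print(f"accuracy of the fine-tuned head: {accuracy:.2f}")
```

In practice the pre-trained layers are often updated as well during fine-tuning; freezing them here only keeps the sketch short.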

**How can transformer models be classified?**

There are three main types of model architectures:

*   Encoders
*   Decoders
*   Encoder-Decoders

Encoders transform an input into a mathematical vector representation, where each word is assigned a vector of n elements. These elements capture the meaning of the word as well as its context, specifically its relation to the other words in the input. This is accomplished through the use of so-called attention layers. As an encoder is aware of the left and right context of a word, it is considered bi-directional. Encoders are the best fit for text classification tasks.

Decoders work in a similar way, but only consider one side of the word as context, usually the left side (the words that precede it). They are considered uni-directional. Decoders are the best fit for text generation tasks. Furthermore, they are auto-regressive, meaning that all generated output is fed back in as input to determine the following tokens.

Encoder-Decoders combine the two architectures, using both bi-directional context sensitivity and auto-regression to accomplish tasks such as translation.
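
The difference in context can be sketched without any model. The helper below is a hypothetical illustration, assuming the common left-to-right (causal) setup for decoders, in which a token only sees itself and the tokens before it.

```python
def visible_positions(num_tokens, causal):
    """Return, for each position, the list of positions it may attend to."""
    rows = []
    for i in range(num_tokens):
        if causal:
            rows.append(list(range(i + 1)))       # itself and earlier tokens only
        else:
            rows.append(list(range(num_tokens)))  # every token in the input
    return rows

tokens = ["the", "cat", "sat", "down"]
encoder_view = visible_positions(len(tokens), causal=False)
decoder_view = visible_positions(len(tokens), causal=True)

for token, enc, dec in zip(tokens, encoder_view, decoder_view):
    enc_words = [tokens[j] for j in enc]
    dec_words = [tokens[j] for j in dec]
    print(f"{token:>5} | encoder sees {enc_words} | decoder sees {dec_words}")
```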

**What are necessary steps when using a Transformers model?**

1. Pre-Processing

Inputs need to be preprocessed the same way as the data the desired model was trained on. This means that the input strings need to be split the same way, e.g. character-based or word-based. There is also a range of techniques that revolve around sub-words. The result is a list of tokens.

The split inputs are then converted into an integer representation, where each integer represents a unique token. Special tokens, such as separator tokens, might be inserted.

Other inputs might be needed depending on the model. Some models require an input mask that classifies which part of the input belongs to which category.

To make use of parallelization, multiple inputs are grouped into batches. Truncation and padding make sure the inputs have the correct input size for the model.
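
A minimal sketch of these pre-processing steps in plain Python. The vocabulary, the special token names, and the word-based split are assumptions chosen for illustration; real tokenizers ship a vocabulary fixed during pre-training and typically split into sub-words. The attention mask, marking real tokens (1) versus padding (0), is one example of an additional input.

```python
# Hypothetical toy vocabulary (assumption, not taken from any real model).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "cat": 5, "sat": 6, "dog": 7, "barked": 8, "loudly": 9}

def encode(sentence, max_length):
    tokens = sentence.lower().split()                         # word-based split
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]  # tokens -> integers
    ids = ids[: max_length - 2]                               # truncation
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]          # special tokens

def pad_batch(batch_ids):
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [vocab["[PAD]"]] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

sentences = ["The cat sat", "The dog barked loudly today again"]
batch = [encode(s, max_length=6) for s in sentences]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)
print(attention_mask)
```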

2. Model

The inputs are then processed by the model, which produces vector representations of the sentences as an intermediate result.

This intermediate result is then used by the model head to produce the final task result.

3. Post-Processing

The final task result is a numeric representation and thus has to be converted back into a human-readable format. Different functions and steps might be necessary depending on the task and model. For a classification task, this might be the softmax function, which turns the raw scores into probabilities.
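
For classification, the post-processing step can be sketched as follows. The logits and label names are made up for illustration.

```python
import math

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

labels = ["not_paraphrase", "paraphrase"]  # hypothetical label names
logits = [-1.2, 2.3]                       # made-up raw model output
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]

print(predicted, [round(p, 3) for p in probs])
```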

**How can the outlined steps be undertaken using the huggingface Transformers library?**

From each model, it is possible to build a Tokenizer, which is a convenience object dealing with all the necessary pre-processing and post-processing steps.

The huggingface Trainer object is made for fine-tuning a model and automatically performs the gradient-based optimization based on the hyperparameters given to it.

Also, it is possible to create a pipeline from a model, which deals with all necessary steps from pre-processing to post-processing. However, this works only if the model head is retained.




### Useful Resources

huggingface Models: https://huggingface.co/models

huggingface Datasets: https://huggingface.co/datasets


### A full training example

In our example, we start by loading the MRPC dataset (https://huggingface.co/datasets/glue/viewer/mrpc/test), which is part of the GLUE benchmark used to judge the quality of NLP models.

In the dataset, two sentences are given and the label indicates whether the sentences are paraphrases of each other or not. Thus, the task is sequence classification.

We use the bert-base-uncased model to accomplish this task (https://huggingface.co/bert-base-uncased). The model has been pre-trained on the task of word masking: 15 % of the words in a text were masked and the model was trained to predict them. As the pre-training task and our task are considerably different, we will discard the model head.
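
The masking idea can be sketched in a few lines. This is a simplified illustration under our own assumptions: it masks whole words of a made-up sentence, whereas the actual pre-training works on sub-word tokens and treats the selected positions in a more elaborate way.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    rng = random.Random(seed)
    num_to_mask = max(1, round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}  # what the model has to predict
    return masked, targets

sentence = ("the quick brown fox jumps over the lazy dog and then "
            "runs far away into the deep green forest today")
tokens = sentence.split()
masked, targets = mask_tokens(tokens)

print(" ".join(masked))
print(targets)
```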

We create an AutoTokenizer from the bert-base-uncased model and apply it with the map function of the dataset, which comes with several advantages over processing each element in a loop, such as batched processing. Hereby, the input sentences are converted into integer representation.

Then, we begin to set up our Trainer object for the actual training. We define the necessary hyperparameters, the model used, the training and validation datasets, the metric used to evaluate on the validation set, as well as the tokenizer used, and begin the fine-tuning. For the metrics, we use huggingface's evaluate library, which comes with predefined metrics for the datasets.

There are many more configuration options, but for most of them a reasonable default is chosen for us.
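
To make the metrics concrete, here is a plain-Python sketch of accuracy and F1 computed from logits, the two values reported for MRPC in the results further down. It is our own illustration with made-up numbers, not the implementation used by the evaluate library.

```python
def argmax(row):
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_and_f1(logits, labels, positive=1):
    predictions = [argmax(row) for row in logits]
    pairs = list(zip(predictions, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)  # true positives
    fp = sum(p == positive and y != positive for p, y in pairs)  # false positives
    fn = sum(p != positive and y == positive for p, y in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]  # made-up model outputs
labels = [1, 0, 0, 1]                                        # made-up references
result = accuracy_and_f1(logits, labels)
print(result)
```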

**TO RUN THIS CODE YOU WILL NEED TO ADD A GPU TO THE COLAB ENVIRONMENT. RUNNING ON CPU ONLY WILL TAKE FOREVER.**


```python
# Install the libraries used below (notebook shell commands)
!pip install transformers
!pip install datasets
!pip install evaluate

import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Load the MRPC task of GLUE and the tokenizer that matches the checkpoint
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    # Encode both sentences as one pair; padding is left to the data collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


# batched=True passes many examples to the tokenizer at once
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
# Pads every batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


def compute_metrics(eval_preds):
    # Turn the logits into class predictions and score them with the MRPC metric
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


# Evaluate on the validation set after every epoch; other hyperparameters keep their defaults
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
# Loads the pre-trained body and attaches a new classification head with two labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
```

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 0.412519 | 0.833333 | 0.883562 |
| 2 | 0.517100 | 0.592316 | 0.840686 | 0.893617 |
| 3 | 0.292400 | 0.641461 | 0.862745 | 0.904437 |


TrainOutput(global_step=1377, training_loss=0.33005627844761315, metrics={'train_runtime': 190.8329, 'train_samples_per_second': 57.663, 'train_steps_per_second': 7.216, 'total_flos': 406183858377360.0, 'train_loss': 0.33005627844761315, 'epoch': 3.0})
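
As a sanity check, the numbers in this output can be related to each other. The sketch below only uses values printed above and assumes training on a single device without gradient accumulation. The batch size of 8 it arrives at was never set explicitly in our code, so it is presumably the library default.

```python
import math

train_runtime = 190.8329     # seconds, from TrainOutput
samples_per_second = 57.663  # from TrainOutput
global_step = 1377           # from TrainOutput
epochs = 3                   # from TrainOutput

samples_processed = round(train_runtime * samples_per_second)
samples_per_epoch = samples_processed // epochs
steps_per_epoch = global_step // epochs
implied_batch_size = math.ceil(samples_per_epoch / steps_per_epoch)

print(f"{samples_per_epoch} training examples per epoch")
print(f"{steps_per_epoch} optimizer steps per epoch")
print(f"implied batch size: {implied_batch_size}")
```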