# Quick tour (Group A)

This quick tour will help you get started with the ```transformers``` library. It will show you how to load preprocessors, i.e. tokenizers, and language models with a [transformers.AutoClass](https://huggingface.co/docs/transformers/main/en/./model_doc/auto), and how to quickly train and evaluate a model with [transformers.Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer).
* You only need to read what is in this notebook; the hyperlinks are for reference only.
* You will need to run each cell of code in order.
* You can ask the instructor to clarify any concept in this tutorial that is unclear to you.
* **This tutorial can be referred back to during the later assignment, so please read this notebook carefully.** 

#### You have 15 minutes on this tutorial. Let the instructor start timing when you read this sentence.

# 1. Library Import (run the code, no need to read through it)

In [1]:
# no need to read the import block
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
os.environ['HF_HOME'] = '/workspace/HF_cache/'
os.environ['HF_DATASETS_CACHE'] = '/workspace/HF_cache/datasets'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/HF_cache/transformers_cache/'
os.environ['TF_ENABLE_ONEDNN_OPTS']='0'
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1' 

import transformers

# 2. Natural Language Processing via the Transformers Library 

An NLP model takes tokenized text as input and outputs numerical values that are used to solve common NLP tasks. Two examples of such tasks are:


| **Task**                 | **Description**                             | **Application**    |
|--------------------------|---------------------------------------------|--------------------|
| Masked language modeling | Predicts a masked token in a sequence       | Pre-training       |
| Sequence classification  | Assigns a label to a given sequence of text | Sentiment analysis |

The ```transformers``` library provides the functionality and API to create and use such NLP models, which can be stored locally in the file system. It offers a unified API with few user-facing abstractions, which gives it a low barrier to entry for educators and practitioners.

* The code below lists three models stored locally under ```models```.
* Among these models, ```models/bert-base-uncased_v2``` and ```models/bert-base-uncased-sentiment``` are derived from ```models/bert-base-uncased```.
* Specifically, ```models/bert-base-uncased_v2``` is a newer version of ```models/bert-base-uncased``` obtained by fine-tuning; both perform the same pre-training task, i.e. masked language modeling.
* ```models/bert-base-uncased-sentiment``` is adapted from ```models/bert-base-uncased``` and is trained to perform a downstream task, i.e. sequence classification.

In [2]:
!ls models/bert-base-uncased # this model can perform task ```Masked language modeling```

config.json		special_tokens_map.json  training_args.bin
generation_config.json	tokenizer.json		 vocab.txt
pytorch_model.bin	tokenizer_config.json


In [3]:
!ls models/bert-base-uncased_v2 # this model can perform task ```Masked language modeling```

config.json		special_tokens_map.json  training_args.bin
generation_config.json	tokenizer.json		 vocab.txt
pytorch_model.bin	tokenizer_config.json


In [4]:
!ls models/bert-base-uncased-sentiment # this model can perform task ```Sequence classification```

config.json	   special_tokens_map.json  tokenizer_config.json  vocab.txt
pytorch_model.bin  tokenizer.json	    training_args.bin


# 3. Loading Tokenizer

A tokenizer is responsible for preprocessing text into an array of numbers that serves as input to a model. The most important thing to remember is that you need to instantiate the tokenizer with the same model name as the model, so that you use the same tokenization rules the model was pretrained with.

In [5]:
from transformers import AutoTokenizer

model_name = "models/bert-base-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Pass your text to the tokenizer:

In [6]:
encoding = tokenizer("We are very happy to show you the 🤗 Transformers library.")
print(encoding)

{'input_ids': [101, 11312, 10320, 12495, 19308, 10114, 11391, 10855, 10103, 100, 58263, 13299, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The tokenizer returns a dictionary containing:

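* [input_ids](https://huggingface.co/docs/transformers/main/en/./glossary#input-ids): numerical representations of your tokens.
* [token_type_ids](https://huggingface.co/docs/transformers/main/en/./glossary#token-type-ids): indicates which segment each token belongs to (all zeros for a single sentence).
* [attention_mask](https://huggingface.co/docs/transformers/main/en/./glossary#attention-mask): indicates which tokens should be attended to.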
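
To make these fields concrete, here is a minimal sketch of a toy tokenizer in plain Python. It is not how `AutoTokenizer` works internally (BERT tokenizers split text into WordPiece subwords), and `TOY_VOCAB` and `toy_encode` are hypothetical names used only for illustration. The ids 101, 102 and 100 are borrowed from the output above, where they appear at the start, at the end, and in place of the emoji, which suggests they mark the sequence boundaries and an unknown token.

```python
# Toy illustration only: real BERT tokenizers use WordPiece subwords.
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100  # special-token ids as seen in the output above

# Hypothetical vocabulary; a real one is stored in vocab.txt.
TOY_VOCAB = {"we": 1000, "are": 1001, "happy": 1002, ".": 1003}


def toy_encode(text):
    """Map whitespace-separated words to ids and build the companion fields."""
    words = text.lower().replace(".", " .").split()
    ids = [CLS_ID] + [TOY_VOCAB.get(word, UNK_ID) for word in words] + [SEP_ID]
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # single sentence: every token is in segment 0
        "attention_mask": [1] * len(ids),  # no padding yet: attend to every token
    }


toy_encoding = toy_encode("We are happy 🤗.")
print(toy_encoding["input_ids"])
```

A word that is missing from the vocabulary falls back to the unknown id. This is why a tokenizer and a model must share the same vocabulary: the model only knows what each id meant during pre-training.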

A tokenizer can also accept a list of inputs, and pad and truncate the text to return a batch with uniform length:

In [7]:
pt_batch = tokenizer(
    ["We are very happy to show you the Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
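
What `padding=True` and `truncation=True` do can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: `toy_pad_batch` is a hypothetical helper, the padding id 0 is an assumption, and a real tokenizer keeps the final special token when it truncates, whereas this sketch simply slices.

```python
PAD_ID = 0  # assumed padding id for this sketch


def toy_pad_batch(batch_ids, max_length):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = toy_pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]], max_length=4)
print(toy_batch)
```

The attention mask is what lets the model ignore the padded positions, so padding does not change the meaning of the shorter sentence.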

# 4. Loading Model

```transformers``` provides a simple and unified way to load pretrained instances. This means you can load an [AutoModel](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModel) like you would load an [AutoTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoTokenizer). The only difference is that you must select the correct [AutoModel](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModel) for the task. You can do this either by calling the task-specific class directly, i.e. AutoModelForSequenceClassification in this example, or by reading the architecture name from AutoConfig.

In [8]:
from transformers import AutoModelForSequenceClassification

model_name = "models/bert-base-uncased-sentiment"
model1 = AutoModelForSequenceClassification.from_pretrained(model_name)

In [9]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_name)
architecture = config.architectures[0]
model2 = getattr(transformers, architecture).from_pretrained(model_name)
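
The `getattr` call in the cell above turns a class name stored as a string into the class itself. The following sketch shows the same pattern without any model files. Everything in it is a hypothetical stand-in: `toy_library` plays the role of the `transformers` module and `toy_architectures` plays the role of `config.architectures`.

```python
import types


class ToyBertForSequenceClassification:
    """Hypothetical stand-in for a model class."""

    @classmethod
    def from_pretrained(cls, path):
        return f"{cls.__name__} loaded from {path}"


# Stand-in for the `transformers` module namespace.
toy_library = types.SimpleNamespace(
    ToyBertForSequenceClassification=ToyBertForSequenceClassification
)

# Stand-in for config.architectures, which stores class names as strings.
toy_architectures = ["ToyBertForSequenceClassification"]

toy_model_class = getattr(toy_library, toy_architectures[0])  # string -> class
toy_loaded = toy_model_class.from_pretrained("models/some-model")
print(toy_loaded)
```

Because the class name comes from the saved configuration, this approach picks the right class even when you do not know in advance which task the model was trained for.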


Now pass your preprocessed batch of inputs directly to the model. You just have to unpack the dictionary by adding `**`:

In [10]:
import torch
torch.equal(model1(**pt_batch).logits, model2(**pt_batch).logits)

True
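
The `.logits` compared above are raw, unnormalized scores, one per class. A common way to read them is to convert them into probabilities with a softmax. The sketch below does this in plain Python with made-up scores; `toy_softmax` is a hypothetical helper, and with PyTorch tensors you would typically use `torch.softmax` instead.

```python
import math


def toy_softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 0.0]  # made-up scores for two classes
toy_probs = toy_softmax(toy_logits)
toy_label = toy_probs.index(max(toy_probs))  # the class with the highest probability
print(toy_probs, toy_label)
```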

# 5. Model Training

All models are a standard [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) so you can use them in any typical training loop. While you can write your own training loop, ```transformers``` provides a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) class for PyTorch, which contains the basic training loop and adds additional functionality for features like distributed training, mixed precision, and more.
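
For intuition about what a basic training loop does, here is a toy sketch in plain Python. It is not how `Trainer` is implemented and it trains a one-parameter logistic regression rather than a Transformer; `toy_train` and its data are hypothetical. It only illustrates how the learning rate, batch size, and number of epochs interact.

```python
import math
import random


def toy_train(data, learning_rate=0.5, batch_size=2, num_epochs=200, seed=0):
    """Fit y ~ sigmoid(w * x + b) with mini-batch gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(num_epochs):  # one epoch = one pass over the data
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grad_w = grad_b = 0.0
            for x, y in batch:
                pred = 1.0 / (1.0 + math.exp(-(w * x + b)))
                grad_w += (pred - y) * x  # gradient of the cross-entropy loss
                grad_b += pred - y
            # Step against the average gradient of the batch.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


toy_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
toy_w, toy_b = toy_train(toy_data)
toy_preds = [int(toy_w * x + toy_b > 0) for x, _ in toy_data]
print(toy_preds)
```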

Depending on your task, you'll typically pass the following parameters to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

1. A [PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel) or a [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module):

   ```py
   >>> from transformers import AutoModelForSequenceClassification

   >>> model = AutoModelForSequenceClassification.from_pretrained("models/bert-base-uncased-sentiment")
   ```

2. [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) contains the model hyperparameters you can change like learning rate, batch size, and the number of epochs to train for. The default values are used if you don't specify any training arguments:

   ```py
   >>> from transformers import TrainingArguments

   >>> training_args = TrainingArguments(
   ...     output_dir="tmp_trainer",
   ...     per_device_train_batch_size=256,
   ...     per_device_eval_batch_size=256,
   ... )
   ```

3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:

   ```py
   >>> from transformers import AutoTokenizer

   >>> tokenizer = AutoTokenizer.from_pretrained("models/bert-base-uncased-sentiment")
   ```

4. Load a dataset:

   ```py
   >>> from datasets import load_dataset

   >>> dataset = load_dataset("datasets/rotten_tomatoes")
   ```

5. Create a function to tokenize and preprocess the dataset:

   ```py
   >>> def preprocess_dataset(dataset):
   ...     return tokenizer(dataset["text"])
   ```

   Then apply it over the entire dataset with [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map):

   ```py
   >>> dataset = dataset.map(preprocess_dataset, batched=True)
   ```

6. Now gather all these classes in [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

   ```py
   >>> from transformers import Trainer
   >>> trainer = Trainer(
   ...      model=model,
   ...      args=training_args,
   ...      train_dataset=dataset["train"],
   ...      eval_dataset=dataset["test"],
   ...      tokenizer=tokenizer,
   ... )
   ```
7. Call [trainer.train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to start training.
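
Step 5 relies on `batched=True`, which means the preprocessing function receives a dictionary of lists (one list per column) instead of a single example. The sketch below imitates this behavior in plain Python. It is a simplified illustration, not the `datasets` implementation: `toy_batched_map` and `toy_tokenize_batch` are hypothetical helpers, and the fake "token ids" are just word lengths.

```python
def toy_tokenize_batch(batch):
    """Receives a dict of lists and returns new columns of the same length."""
    # Fake "ids": the length of each word, purely for illustration.
    return {"input_ids": [[len(word) for word in text.split()] for text in batch["text"]]}


def toy_batched_map(rows, function, batch_size=2):
    """Apply `function` to slices of the data and merge the new columns back in."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Convert a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = function(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            for key, values in new_columns.items():
                merged[key] = values[i]
            out.append(merged)
    return out


toy_rows = [
    {"text": "a fine film", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "truly great acting", "label": 1},
]
toy_mapped = toy_batched_map(toy_rows, toy_tokenize_batch)
print(toy_mapped[0])
```

Processing many examples per call is usually faster than processing them one at a time, because a tokenizer can handle a whole list of texts in a single call.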

## * Useful Note:
```transformers``` lets the user derive a model for a downstream task, e.g. sequence classification, from a pretrained model. For example, to produce a sequence classification model like ```models/bert-base-uncased-sentiment``` from the pretrained masked language model ```models/bert-base-uncased_v2```, the user can simply do:

```py
   >>> # The weights of models/bert-base-uncased_v2 are loaded automatically into the
   >>> # sequence classification model; the additional weights needed for the
   >>> # downstream task are randomly initialized.
   >>> model = AutoModelForSequenceClassification.from_pretrained("models/bert-base-uncased_v2")
   >>> trainer = Trainer(
   ...    model=model,
   ...    args=training_args,
   ...    train_dataset=dataset["train"],
   ...    eval_dataset=dataset["test"],
   ...    tokenizer=tokenizer,
   ... )
```
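
The idea behind this weight transfer can be sketched with plain dictionaries: weights whose names match are copied, and weights that exist only in the new model are randomly initialized. This is a conceptual illustration, not the library's loading code. The weight names and `toy_load_into_classifier` are hypothetical, real checkpoints hold tensors rather than single floats, and the standard deviation of 0.02 is an assumption.

```python
import random


def toy_load_into_classifier(pretrained_weights, classifier_keys, seed=0):
    """Copy weights whose names match; randomly initialize the rest."""
    rng = random.Random(seed)
    loaded, newly_initialized, unused = {}, [], []
    for key in classifier_keys:
        if key in pretrained_weights:
            loaded[key] = pretrained_weights[key]  # reuse the pretrained value
        else:
            loaded[key] = rng.gauss(0.0, 0.02)  # new task-specific weight
            newly_initialized.append(key)
    for key in pretrained_weights:
        if key not in classifier_keys:
            unused.append(key)  # e.g. the masked-LM head, not needed here
    return loaded, newly_initialized, unused


# Hypothetical weight names for illustration.
toy_mlm_weights = {"encoder.layer0": 0.31, "encoder.layer1": -0.12, "mlm_head": 0.7}
toy_classifier_keys = ["encoder.layer0", "encoder.layer1", "classifier.weight"]

toy_state, toy_new, toy_unused = toy_load_into_classifier(toy_mlm_weights, toy_classifier_keys)
print(toy_new, toy_unused)
```

Because the classification weights start out random, the derived model needs to be trained on the downstream task before its predictions are meaningful.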

# 6. Model Evaluation

1. [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) does not automatically evaluate model performance during training. You'll need to pass [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) a function to compute and report metrics. The [Evaluate](https://huggingface.co/docs/evaluate/index) library provides a simple [`accuracy`](https://huggingface.co/spaces/evaluate-metric/accuracy) function you can load with the [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) function:

```py
    >>> import numpy as np
    >>> import evaluate

    >>> metric = evaluate.load("accuracy")
```

2. Call `compute` on `metric` to calculate the accuracy of your predictions. Before passing your predictions to `compute`, you need to convert the logits to predictions (remember all ```Transformers``` models return logits):

```py
    >>> def compute_metrics(eval_pred):
    ...    logits, labels = eval_pred
    ...    predictions = np.argmax(logits, axis=-1)
    ...    return metric.compute(predictions=predictions, references=labels)
```

3. Now gather all these classes in [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer):

```py
    >>> from transformers import TrainingArguments, Trainer

    >>> training_args = TrainingArguments(output_dir="tmp_trainer")
    >>> trainer = Trainer(
    ...      model=model,
    ...      args=training_args,
    ...      tokenizer=tokenizer,
    ...      eval_dataset=dataset,
    ...      compute_metrics=compute_metrics,
    ... )
```
4. Call [trainer.evaluate()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.evaluate) to start evaluation.
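
To see what the `compute_metrics` function in step 2 calculates, here is a small worked example in plain Python with made-up logits and labels. `toy_argmax` and `toy_accuracy` are hypothetical helpers that mirror `np.argmax(logits, axis=-1)` followed by an accuracy computation; they are not part of the `evaluate` library.

```python
def toy_argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def toy_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [toy_argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_eval_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]  # made-up scores
toy_eval_labels = [0, 1, 0, 0]
toy_result = toy_accuracy(toy_eval_logits, toy_eval_labels)
print(toy_result)  # {'accuracy': 0.75}
```

The predicted classes are 0, 1, 1 and 0, so three of the four predictions match the labels.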

# Don't close this tab when you are done reading!