<a href="https://colab.research.google.com/github/jeraldflowers/Models-HuggingFace/blob/main/Template_NLP_with_Hugging_Face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP with Hugging Face

## Processing the data for NLP

### Downloading the dataset

In [None]:
%%capture
!pip install datasets transformers evaluate

We will use the MRPC dataset, one of the 10 datasets that make up the [GLUE benchmark](https://huggingface.co/datasets/glue). GLUE is used to measure the performance of ML models across 10 different text classification tasks.

In other words, we select the `mrpc` subset of the `glue` dataset:

In [None]:
from datasets import load_dataset

ds = load_dataset("glue", "mrpc")

Downloading and preparing dataset glue/mrpc to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...


Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]

Dataset glue downloaded and prepared to /root/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.


This is what an example looks like. Each `mrpc` example consists of two sentences and a label indicating whether the two sentences are equivalent.

In [None]:
example = ds["train"][300]
example

{'sentence1': 'Clearly Roman creams of any type do not normally survive in the archaeological record .',
 'sentence2': 'Clearly Roman creams of any type , paint or cosmetic , do not normally survive ... it \'s pretty exceptional . "',
 'label': 1,
 'idx': 329}

In [None]:
labels = ds["train"].features["label"]
labels

ClassLabel(names=['not_equivalent', 'equivalent'], id=None)

In [None]:
labels.int2str(1)

'equivalent'
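As a minimal sketch of what `int2str` is doing, the label names printed above can be turned into two plain lookup tables. The variable names `names`, `id2label` and `label2id` are chosen here for illustration only; the real `ClassLabel` object handles this internally.

```python
# Label names exactly as printed by the ClassLabel feature above.
names = ["not_equivalent", "equivalent"]

# Integer id -> name, and name -> integer id.
id2label = dict(enumerate(names))
label2id = {name: i for i, name in id2label.items()}

print(id2label[1])                  # same lookup as labels.int2str(1)
print(label2id["not_equivalent"])   # the reverse direction
```

The same two dictionaries are the kind of mapping a classification model stores in its configuration so that predictions can be reported as names instead of integers.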

### Tokenizing

Remember that for vision we downloaded the feature extractor directly from the repository of the pre-trained model we used as a base?

We can think of the tokenizer as its equivalent in NLP.

We download the tokenizer directly from the repo of the model we will use.

In [None]:
from transformers import AutoTokenizer

repo_id = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

To preprocess the dataset we need to convert the text into numbers that the model can understand. This is done with a tokenizer.

Going from text to numbers is known as encoding. Encoding is a two-step process: tokenization, followed by conversion to input ids. For now it is enough to know that we are translating text into numbers called input ids, which are in the proper format to feed our model.
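The two steps can be sketched with a toy tokenizer. This is not how `bert-base-uncased` works internally (real tokenizers split text into subwords); it assumes a simple lowercase-and-split rule and a tiny hand-written vocabulary. The ids are the ones that appear for these words in the tokenizer outputs of this notebook; the names `toy_vocab`, `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical.

```python
# Tiny hand-written vocabulary (assumption: only these words are known).
toy_vocab = {"this": 2023, "is": 2003, "the": 1996, "first": 2034, "second": 2117}

def toy_tokenize(text):
    # Step 1: tokenization. Here simply lowercase and split on whitespace.
    return text.lower().split()

def toy_convert_tokens_to_ids(tokens):
    # Step 2: look each token up in the vocabulary to get its input id.
    return [toy_vocab[token] for token in tokens]

tokens = toy_tokenize("This is the first")
input_ids = toy_convert_tokens_to_ids(tokens)
print(tokens)
print(input_ids)
```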

We can feed the tokenizer a single sentence or a list of sentences. Here we tokenize just one first sentence from the training set:

In [None]:
tokenized_sentence_1 = tokenizer(ds["train"]["sentence1"][2])
tokenized_sentence_1

{'input_ids': [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We need to treat the two sentences as a pair, not separately. The tokenizer can take a pair of sequences and prepare them the way our model expects:

In [None]:
inputs = tokenizer("This is the first", "This is the second")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 102, 2023, 2003, 1996, 2117, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

What does each of the values returned by the tokenizer mean?
- `input_ids` are the ids of the tokens in the model's vocabulary, that is, the text translated to numbers.
- `attention_mask` has the same shape as `input_ids` but is filled with 0s and 1s: 1s mark tokens the model should attend to, and 0s mark tokens (such as padding) that the model should ignore.
- `token_type_ids` tells the model which part of the input is the first sentence and which is the second.

The model expects inputs of the form `[CLS] sentence 1 [SEP] sentence 2 [SEP]` when there are two sentences.

In [None]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 '[SEP]']
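To make the three returned fields concrete, here is a rough sketch that rebuilds the pair encoding shown above by hand. It assumes whitespace tokenization and a toy vocabulary whose ids are copied from the output above; `toy_vocab` and `encode_pair` are hypothetical names, not part of the `transformers` API.

```python
# Ids copied from the bert-base-uncased output above (toy vocabulary).
toy_vocab = {"[CLS]": 101, "[SEP]": 102, "this": 2023, "is": 2003,
             "the": 1996, "first": 2034, "second": 2117}

def encode_pair(first, second):
    # Segment A: [CLS] + first sentence + [SEP]
    segment_a = ["[CLS]"] + first.lower().split() + ["[SEP]"]
    # Segment B: second sentence + [SEP]
    segment_b = second.lower().split() + ["[SEP]"]
    tokens = segment_a + segment_b
    return {
        "input_ids": [toy_vocab[t] for t in tokens],
        # 0 for every token of segment A, 1 for every token of segment B.
        "token_type_ids": [0] * len(segment_a) + [1] * len(segment_b),
        # Nothing is padded here, so every token is attended to.
        "attention_mask": [1] * len(tokens),
    }

toy_inputs = encode_pair("This is the first", "This is the second")
print(toy_inputs)
```

The result matches the dictionary returned by the real tokenizer for this pair, which is a quick way to check one's understanding of each field.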

Other models on the Hub will not necessarily return `token_type_ids` in the tokenized inputs (for example, they are not returned if you use a `DistilBERT` model). They are only returned when the model knows what to do with them because it saw them during pretraining.

In general, we don't need to worry about whether `token_type_ids` are present in our tokenized inputs. As long as we use the tokenizer that corresponds to the model, everything will be fine, since the tokenizer knows what to provide to the model.

For example, in this class we will use a [`distilroberta-base`](https://huggingface.co/distilroberta-base) model for its size and effectiveness. Its tokenizer does not return `token_type_ids`, and the model still gives excellent results.

In the Platzi organization on the Hub you can find a [DistilRoBERTa model](https://huggingface.co/platzi/platzi-distilroberta-base-mrpc-glue-omar-espejel) fine-tuned following the same process we use in this class.

In [None]:
repo_id = "distilroberta-base"

tokenizer = AutoTokenizer.from_pretrained(repo_id)

We create a tokenizing function. It receives an example (or, when `map` is called with `batched=True`, a batch of examples) and tokenizes the sentence pairs.

In [None]:
def tokenize_function(example):
  return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [None]:
prepared_ds = ds.map(tokenize_function, batched=True)

  0%|          | 0/4 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]
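The three progress bars above count batches, not examples. A small sketch of the arithmetic, assuming the batch size that `map` uses with `batched=True` is 1000 (stated here as an assumption about the default), and using the split sizes reported when the dataset was generated. The helper `iter_batches` is hypothetical.

```python
# Split sizes as reported when the dataset was generated.
split_sizes = {"train": 3668, "validation": 408, "test": 1725}

def iter_batches(n_rows, batch_size=1000):
    # Yield (start, end) row ranges, the last one possibly shorter.
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

batches_per_split = {
    name: len(list(iter_batches(n_rows)))
    for name, n_rows in split_sizes.items()
}
print(batches_per_split)
```

Under that assumption, each call to `tokenize_function` receives lists of up to 1000 first sentences and 1000 second sentences rather than a single pair.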

### Defining the data collator: Dynamic padding

We need our tensors to be rectangular, that is, every example in a batch must have the same length. However, the texts do not necessarily have the same length.

To solve this we use padding. Padding makes all of our sentences the same length by adding a special token, called the padding token, to the shorter sentences. For example, if we have 10 sentences with 10 tokens and 1 sentence with 20 tokens, padding will ensure that all sentences have 20 tokens.

We leave the tokenizer's `padding` argument unset in our tokenization function for now. Padding all the samples to the maximum length in the dataset is not efficient. It is better to pad the samples when we build a batch, since then we only need to pad to the maximum length in that batch and not the maximum length in the entire dataset. This can save a lot of time and processing power when inputs have highly variable lengths!

We will use a data collator for this.

We pad all the examples to the length of the longest item in the batch. This technique is known as dynamic padding.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
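The idea behind dynamic padding can be sketched in plain Python. This is an illustration only, not the implementation of `DataCollatorWithPadding`: the helper `pad_batch`, the token ids in `features`, and the choice of `pad_token_id=1` are all assumptions made for the example.

```python
def pad_batch(features, pad_token_id):
    # Pad only up to the longest example in THIS batch.
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

# Two made-up tokenized examples of different lengths.
features = [
    {"input_ids": [0, 11, 12, 2], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [0, 11, 12, 13, 14, 2], "attention_mask": [1, 1, 1, 1, 1, 1]},
]

batch = pad_batch(features, pad_token_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is computed per batch, a batch of short sentences stays short even if some other example in the dataset is very long.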

## Training and Evaluation

Let's define the rest of the arguments needed for `Trainer`.

### Defining the metric 

In [None]:
import evaluate
import numpy as np

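def compute_metrics(eval_pred):
  # Load the GLUE metric configured for the MRPC task.
  metric = evaluate.load("glue", "mrpc")
  # eval_pred holds the raw model outputs (logits) and the true labels.
  logits, labels = eval_pred
  # The predicted class is the index of the largest logit.
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)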
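For intuition, the computation can be sketched without any library. The training table in this notebook reports accuracy and F1 for this metric; the sketch below computes both by hand, assuming F1 is taken with respect to the positive class (label 1). The logits and labels are made up, and `argmax` and `accuracy_and_f1` are hypothetical helper names.

```python
def argmax(row):
    # Index of the largest value in a list of logits.
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Made-up logits for four sentence pairs and their true labels.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]

toy_metrics = accuracy_and_f1(toy_logits, toy_labels)
print(toy_metrics)
```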

### Configuring `Trainer`


In [None]:
from transformers import AutoModelForSequenceClassification

labels = ds["train"].features["label"].names

model = AutoModelForSequenceClassification.from_pretrained(
    repo_id,
    num_labels=len(labels),
    id2label={i: c for i, c in enumerate(labels)},  # integer id -> label name
    label2id={c: i for i, c in enumerate(labels)}   # label name -> integer id
)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight'

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers",
    evaluation_strategy="steps",
    num_train_epochs=3,
    push_to_hub=True,
    load_best_model_at_end=True
)

In [None]:
!huggingface-cli login


    To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
    
Token: 
Add token as git credential? (Y/n) y
Token is valid.
[1m[31mCannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credenti

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=prepared_ds["train"],
    eval_dataset=prepared_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

Cloning https://huggingface.co/jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers into local empty directory.


### Training

In [None]:
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx. If sentence2, sentence1, idx are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3668
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1377
  Number of trainable parameters = 82119938
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 500 | 0.5289 | 0.566782 | 0.821078 | 0.868941 |
| 1000 | 0.3675 | 0.499023 | 0.843137 | 0.881481 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx. If sentence2, sentence1, idx are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8


Downloading builder script:   0%|          | 0.00/5.75k [00:00<?, ?B/s]

Saving model checkpoint to jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/checkpoint-500
Configuration saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/checkpoint-500/config.json
Model weights saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/checkpoint-500/pytorch_model.bin
tokenizer config file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/checkpoint-500/tokenizer_config.json
Special tokens file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/checkpoint-500/special_tokens_map.json
tokenizer config file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/tokenizer_config.json
Special tokens file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/special_tokens_map.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx. If sentence2, sentence1, idx are no

Upload file pytorch_model.bin:   0%|          | 3.34k/313M [00:00<?, ?B/s]

Upload file runs/Nov29_01-40-08_d14fb559468f/events.out.tfevents.1669686048.d14fb559468f.77.0:  64%|######3   …

remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers
   2c9a995..c06bab8  main -> main

To https://huggingface.co/jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers
   c06bab8..1618fec  main -> main



***** train metrics *****
  epoch                    =        3.0
  total_flos               =   191920GF
  train_loss               =     0.3901
  train_runtime            = 0:02:09.63
  train_samples_per_second =     84.886
  train_steps_per_second   =     10.622
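The step counts in the training log above can be reproduced with a little arithmetic. The number of examples, batch size and epochs are taken from the log; the evaluation interval of 500 steps is an assumption about the default used when `evaluation_strategy="steps"` is set without an explicit interval.

```python
import math

num_examples = 3668   # "Num examples" in the training log
batch_size = 8        # "Total train batch size" in the training log
num_epochs = 3        # "Num Epochs" in the training log
eval_every = 500      # assumed default evaluation interval, in steps

# One optimization step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Steps at which an evaluation is run during training.
eval_steps = list(range(eval_every, total_steps + 1, eval_every))

print(steps_per_epoch, total_steps, eval_steps)
```

This agrees with the 1377 total optimization steps in the log and with the two evaluation rows, at steps 500 and 1000, in the results table.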


### Evaluation

In [None]:
metrics = trainer.evaluate(prepared_ds["validation"])
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence2, sentence1, idx. If sentence2, sentence1, idx are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 408
  Batch size = 8


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.8431
  eval_f1                 =     0.8815
  eval_loss               =      0.499
  eval_runtime            = 0:00:03.02
  eval_samples_per_second =     134.77
  eval_steps_per_second   =     16.846


### Sharing on the Hub

In [None]:
kwargs = {
    "finetuned_from": model.config._name_or_path,
    "tasks": "text-classification",
    "dataset": ["glue", "mrpc"],
    "tags": ["text-classification"]
}

trainer.push_to_hub(commit_message="Task accomplished", **kwargs)

Saving model checkpoint to jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers
Configuration saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/config.json
Model weights saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/pytorch_model.bin
tokenizer config file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/tokenizer_config.json
Special tokens file saved in jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers/special_tokens_map.json
To https://huggingface.co/jeraldflowers/distilroberts-base-mrpc-glue-jeraldflowers
   4c886e5..c979ce4  main -> main

