# NLP with Hugging Face

## Setup

In [38]:
import numpy as np

import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

## Downloading the data

In [2]:
repo_id = "glue"
ds = load_dataset(repo_id, "mrpc")


In [4]:
ex = ds["train"][400]
ex

{'sentence1': 'U.S. Agriculture Secretary Ann Veneman , who announced Tuesdays ban , also said Washington would send a technical team to Canada to help .',
 'sentence2': "U.S. Agriculture Secretary Ann Veneman , who announced yesterday 's ban , also said Washington would send a technical team to Canada to assist in the Canadian situation .",
 'label': 1,
 'idx': 446}

In [9]:
labels = ds["train"].features["label"]

## Tokenizer

In [11]:
repo_tokenizer = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(repo_tokenizer)

In [13]:
tokenize_sentence_1 = tokenizer(ds["train"][2]["sentence1"])
tokenize_sentence_1

{'input_ids': [101, 2027, 2018, 2405, 2019, 15147, 2006, 1996, 4274, 2006, 2238, 2184, 1010, 5378, 1996, 6636, 2005, 5096, 1010, 2002, 2794, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [18]:
inputs = tokenizer("This is the first", "This is the second sentence that I am using")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 102, 2023, 2003, 1996, 2117, 6251, 2008, 1045, 2572, 2478, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [19]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'sentence',
 'that',
 'i',
 'am',
 'using',
 '[SEP]']
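The output above shows how a BERT-style tokenizer lays out a sentence pair: `[CLS]`, the first sentence, `[SEP]`, the second sentence, `[SEP]`, with `token_type_ids` marking which segment each token belongs to. The sketch below rebuilds that layout in plain Python from already-split tokens. `build_pair_layout` is a hypothetical helper for illustration only, not part of `transformers`.

```python
def build_pair_layout(tokens_a, tokens_b):
    """Mimic the BERT sentence-pair layout for already-tokenized input."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]   # segment 0
    second = tokens_b + ["[SEP]"]              # segment 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    attention_mask = [1] * len(tokens)         # no padding, so every position is attended
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_layout(
    ["this", "is", "the", "first"],
    ["this", "is", "the", "second", "sentence", "that", "i", "am", "using"],
)
print(tokens)
print(token_type_ids)
```

The resulting `token_type_ids` (six zeros followed by ten ones) match the tokenizer output shown earlier for the same pair.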

In [20]:
repo_tokenizer = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(repo_tokenizer)

In [22]:
def tokenize_fn(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

In [24]:
processed_ds = ds.map(tokenize_fn, batched=True)
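With `batched=True`, the mapping function receives a dictionary of columns, each holding a list of values, rather than one example at a time. This lets a fast tokenizer process many sentence pairs per call. A minimal pure-Python sketch of that calling convention follows. `map_batched` and `fake_tokenize` are hypothetical stand-ins, not the `datasets` implementation.

```python
def map_batched(rows, fn, batch_size=1000):
    """Apply fn to column-wise batches and merge its output back into the rows."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = fn(batch)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_cols.items()})
            out.append(merged)
    return out


def fake_tokenize(batch):
    # Stand-in for the tokenizer: whitespace-split both sentences of each pair.
    return {
        "tokens": [
            a.split() + b.split()
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


rows = [
    {"sentence1": "a b", "sentence2": "c"},
    {"sentence1": "d", "sentence2": "e f"},
    {"sentence1": "g", "sentence2": "h"},
]
processed = map_batched(rows, fake_tokenize, batch_size=2)
print(processed[1]["tokens"])
```

The original columns are kept and the new ones are added next to them, which is also why `processed_ds` still contains `sentence1`, `sentence2`, `label`, and `idx`.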

## Data Collator

In [40]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
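`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one global length. The idea can be sketched in plain Python as below. `pad_batch` is a hypothetical helper, the token ids are arbitrary, and the default `pad_token_id=1` is an assumption based on the RoBERTa vocabulary, so check `tokenizer.pad_token_id` for the real value.

```python
def pad_batch(features, pad_token_id=1):
    """Pad input_ids and attention_mask to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch


# Arbitrary token ids, for illustration only.
features = [
    {"input_ids": [0, 11, 12, 13, 2], "attention_mask": [1, 1, 1, 1, 1]},
    {"input_ids": [0, 11, 2], "attention_mask": [1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding happens per batch, `tokenize_fn` only needs `truncation=True` and can leave padding to the collator.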

## Training and Evaluation

### Setting the metric

In [27]:
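# Load the metric once so it is not re-created on every evaluation pass
metric = evaluate.load("glue", "mrpc")


def compute_metrics(eval_pred):
    # eval_pred unpacks into (logits, labels)
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)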
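The GLUE MRPC metric reports accuracy and F1. As a rough sketch of what those two numbers mean, the following NumPy code computes them by hand from logits, treating label 1 as the positive class. `accuracy_and_f1` is a hypothetical helper and the logits are made up; this is not how `evaluate` implements the metric internally.

```python
import numpy as np


def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from raw logits."""
    predictions = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float((predictions == labels).mean())
    tp = int(((predictions == 1) & (labels == 1)).sum())  # true positives
    fp = int(((predictions == 1) & (labels == 0)).sum())  # false positives
    fn = int(((predictions == 0) & (labels == 1)).sum())  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits for four examples; argmax gives predictions [0, 1, 1, 0].
logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
labels = [0, 1, 0, 1]
result = accuracy_and_f1(logits, labels)
print(result)
```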

In [31]:
labels = ds["train"].features["label"].names

In [33]:
model = AutoModelForSequenceClassification.from_pretrained(
    repo_tokenizer,  # same checkpoint as the tokenizer: "distilroberta-base"
    num_labels=len(labels),
    # Map integer class ids to label names and back
    id2label={i: c for i, c in enumerate(labels)},
    label2id={c: i for i, c in enumerate(labels)},
)

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias',

In [35]:
training_args = TrainingArguments(
    output_dir="mmenendezg-distilroberta-base-mrpc",
    evaluation_strategy="steps",
    num_train_epochs=5,
    push_to_hub=True,
    load_best_model_at_end=True,
)

In [37]:
!huggingface-cli login --token

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid.
Your token has been saved to /Users/mmenendezg/.cache/huggingface/token
Login successful


In [None]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=processed_ds["train"],
    eval_dataset=processed_ds["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [42]:
train_results = trainer.train()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_model()

{'loss': 0.5289, 'learning_rate': 3.910675381263617e-05, 'epoch': 1.09}




{'eval_loss': 0.6889318823814392, 'eval_accuracy': 0.8235294117647058, 'eval_f1': 0.8775510204081631, 'eval_runtime': 5.9509, 'eval_samples_per_second': 68.561, 'eval_steps_per_second': 8.57, 'epoch': 1.09}

{'loss': 0.3653, 'learning_rate': 2.8213507625272335e-05, 'epoch': 2.18}


                                                   

{'eval_loss': 0.7133319973945618, 'eval_accuracy': 0.8235294117647058, 'eval_f1': 0.8705035971223022, 'eval_runtime': 6.4487, 'eval_samples_per_second': 63.269, 'eval_steps_per_second': 7.909, 'epoch': 2.18}



{'loss': 0.2021, 'learning_rate': 1.7320261437908496e-05, 'epoch': 3.27}


                                                   

{'eval_loss': 1.0655598640441895, 'eval_accuracy': 0.8357843137254902, 'eval_f1': 0.8846815834767642, 'eval_runtime': 6.6603, 'eval_samples_per_second': 61.258, 'eval_steps_per_second': 7.657, 'epoch': 3.27}



{'loss': 0.1044, 'learning_rate': 6.427015250544663e-06, 'epoch': 4.36}


                                                   

{'eval_loss': 0.9791407585144043, 'eval_accuracy': 0.8406862745098039, 'eval_f1': 0.8873483535528597, 'eval_runtime': 5.9696, 'eval_samples_per_second': 68.346, 'eval_steps_per_second': 8.543, 'epoch': 4.36}


{'train_runtime': 960.2466, 'train_samples_per_second': 19.099, 'train_steps_per_second': 2.39, 'train_loss': 0.2687155272706142, 'epoch': 5.0}
***** train metrics *****
  epoch                    =        5.0
  train_loss               =     0.2687
  train_runtime            = 0:16:00.24
  train_samples_per_second =     19.099
  train_steps_per_second   =       2.39
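The numbers in the log can be cross-checked with a little arithmetic. The sketch below assumes the default per-device batch size of 8 on a single device, the default learning rate of 5e-5 with linear decay and no warmup, and 3,668 training examples in MRPC. None of these values are set explicitly in the notebook, so treat them as assumptions that happen to reproduce the logged values.

```python
import math

# Assumed values (not set explicitly in the notebook).
train_examples = 3668
batch_size = 8
epochs = 5
base_lr = 5e-5

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step):
    """Linear decay from base_lr at step 0 to zero at total_steps, no warmup."""
    return base_lr * (total_steps - step) / total_steps


print(total_steps)                      # 2295
print(round(500 / steps_per_epoch, 2))  # 1.09
print(linear_lr(500))
print(linear_lr(2000))
```

Under these assumptions the total of 2295 steps, the epoch value 1.09 at step 500, and the logged learning rates at steps 500 and 2000 all line up with the training output.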

### Evaluation

In [43]:
metrics = trainer.evaluate(processed_ds["validation"])
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)


***** eval metrics *****
  epoch                   =        5.0
  eval_accuracy           =     0.8235
  eval_f1                 =     0.8776
  eval_loss               =     0.6889
  eval_runtime            = 0:00:06.09
  eval_samples_per_second =     66.915
  eval_steps_per_second   =      8.364
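The final evaluation numbers equal those logged at step 500, not those of the last step. That is consistent with `load_best_model_at_end=True` reloading the checkpoint with the lowest `eval_loss`, which is the documented default when `metric_for_best_model` is not set. The sketch below replays that selection over the logged values (rounded to four decimals). `pick_best` is a hypothetical helper, not a `Trainer` method.

```python
# Evaluation results copied from the training log, rounded to four decimals.
eval_log = [
    {"step": 500, "eval_loss": 0.6889, "eval_accuracy": 0.8235, "eval_f1": 0.8776},
    {"step": 1000, "eval_loss": 0.7133, "eval_accuracy": 0.8235, "eval_f1": 0.8705},
    {"step": 1500, "eval_loss": 1.0656, "eval_accuracy": 0.8358, "eval_f1": 0.8847},
    {"step": 2000, "eval_loss": 0.9791, "eval_accuracy": 0.8407, "eval_f1": 0.8873},
]


def pick_best(log, metric, greater_is_better):
    """Return the log entry with the best value of the given metric."""
    chooser = max if greater_is_better else min
    return chooser(log, key=lambda entry: entry[metric])


best_by_loss = pick_best(eval_log, "eval_loss", greater_is_better=False)
best_by_f1 = pick_best(eval_log, "eval_f1", greater_is_better=True)
print(best_by_loss["step"], best_by_f1["step"])  # 500 2000
```

Selecting on loss keeps the step-500 checkpoint even though accuracy and F1 are higher at step 2000. Passing `metric_for_best_model="f1"` to `TrainingArguments` would select on F1 instead.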





## Upload to the Hub

In [44]:
kwargs = {
    "finetuned_from": model.config._name_or_path,
    "tasks": "text-classification",
    "dataset": ["glue", "mrpc"],
    "tags": ["text-classification"],
}

trainer.push_to_hub(commit_message="NLP model, v1.0", **kwargs)

'https://huggingface.co/mmenendezg/mmenendezg-distilroberta-base-mrpc/commit/31053f4e617e73f3b9857d6a91268148583c146d'