<a href="https://colab.research.google.com/github/mariagrandury/hands-on-nlp-hugging-face/blob/main/hands-on-nlp-hugging-face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
Hands-on NLP with Hugging Face 
</h1>

<p align="center">
NLP Workshop at the <a href="https://www.womentech.net/women-tech-conference">WomenTech Global Conference 2021</a>.
</p>

<p align="center">
<br>
<img src="https://pbs.twimg.com/media/E24NHqOWEA0AxRE?format=jpg&name=medium" alt="logo" width="400"/>
</p>


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow

# Install `datasets`
!pip install datasets

# Install `transformers` from master
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

# 🤗 Datasets

We are going to use the [Spanish Billion Words](https://huggingface.co/datasets/spanish_billion_words) dataset (10.22 GiB), loading only the first 5% of its training split.

In [3]:
from datasets import load_dataset

dataset = load_dataset("spanish_billion_words", split='train[:5%]')

Reusing dataset spanish_billion_words (/root/.cache/huggingface/datasets/spanish_billion_words/corpus/1.1.0/8ba50a854d61199f7d36b4c3f598589a2f8b493a2644b88ce80adb2cebcbc107)


In [4]:
print(len(dataset))
print(dataset[23])

2346265
{'text': 'El señor John Dashwood no tenía la profundidad de sentimientos del resto de la familia pero sí le afectó una recomendación de tal índole en un momento como ése y prometió hacer todo lo que le fuera posible por el bienestar de sus parientes'}


# 🤗 Tokenizers

Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. The three main types of tokenizers used in 🤗 Transformers are:
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

Since we are going to train a RoBERTa-like model, we will use a Byte-level BPE tokenizer.

[More info](https://huggingface.co/transformers/tokenizer_summary.html#byte-pair-encoding)
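
To make the idea behind BPE more concrete, here is a minimal sketch of how merges can be learned. It is a toy version that works on characters of a handful of words, not the byte-level implementation in 🤗 Tokenizers; the function name `train_bpe` and the example words are made up for illustration.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # The most frequent pair becomes one new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


toy_merges = train_bpe(["hola", "hola", "hoy", "hora"], num_merges=2)
print(toy_merges)
```

The pair `('h', 'o')` appears in every word, so it is merged first. A real tokenizer repeats this until the vocabulary reaches the requested size.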

In [5]:
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

In [6]:
%%time 
from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train_from_iterator(
    iterator=batch_iterator(), 
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ]
)

CPU times: user 6min 49s, sys: 8.19 s, total: 6min 57s
Wall time: 3min 35s


In [7]:
# Save the files
!mkdir -p EsBERTa
tokenizer.save_model("EsBERTa")

['EsBERTa/vocab.json', 'EsBERTa/merges.txt']

In [10]:
tokenizer.save_model("/content/drive/MyDrive/EsBERTa")

['/content/drive/MyDrive/EsBERTa/vocab.json',
 '/content/drive/MyDrive/EsBERTa/merges.txt']

Now we have two files:
- `vocab.json`: the vocabulary, mapping each token to its id (the most frequent tokens ranked by frequency)
- `merges.txt`: a list of merges
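
As a rough sketch of how these two files work together, the snippet below applies a list of merges in priority order and then looks up the resulting tokens in a vocabulary. The miniature `merges` and `vocab` are hypothetical stand-ins, not the contents of the real files, and `bpe_encode` is an illustrative helper rather than a library function.

```python
def bpe_encode(word, merges, vocab):
    """Apply merges in priority order, then map tokens to ids (toy version)."""
    # Earlier merges have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Pick the adjacent pair with the best (lowest) merge rank.
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        i = pairs.index(best)
        symbols[i : i + 2] = [best[0] + best[1]]
    return symbols, [vocab[s] for s in symbols]


# Hypothetical miniature stand-ins for merges.txt and vocab.json
merges = [("l", "a"), ("h", "o"), ("ho", "la")]
vocab = {"h": 0, "o": 1, "l": 2, "a": 3, "la": 4, "ho": 5, "hola": 6}

tokens, ids = bpe_encode("hola", merges, vocab)
print(tokens, ids)
```

A word that is fully covered by the merges collapses into a single token, while a rarer word stays split into smaller pieces.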

Let's see how we can use the trained tokenizer!

In [11]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EsBERTa/vocab.json",
    "./EsBERTa/merges.txt",
)

In [12]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

In [13]:
tokenizer.encode("Hola me llamo Maria.")

Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [15]:
tokenizer.encode("Buenos dias, me llamo Maria.").tokens

['<s>', 'Buenos', 'Ġdias', ',', 'Ġme', 'Ġllamo', 'ĠMaria', '.', '</s>']

In [16]:
tokenizer.encode("Encantada de estar hoy aquí.").tokens

['<s>', 'Enc', 'ant', 'ada', 'Ġde', 'Ġestar', 'Ġhoy', 'ĠaquÃŃ', '.', '</s>']

In [17]:
tokenizer.encode("Me gusta mucho la divulgación.").tokens

['<s>', 'Me', 'Ġgusta', 'Ġmucho', 'Ġla', 'ĠdivulgaciÃ³n', '.', '</s>']

In [18]:
tokenizer.encode("Espero que os guste el taller.").tokens

['<s>', 'Espero', 'Ġque', 'Ġos', 'Ġguste', 'Ġel', 'Ġtaller', '.', '</s>']

In [24]:
tokenizer.encode("estrambolico, despampanante, genialidad").tokens

['<s>',
 'estr',
 'amb',
 'ol',
 'ico',
 ',',
 'Ġdesp',
 'am',
 'pan',
 'ante',
 ',',
 'Ġgen',
 'ialidad',
 '</s>']
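
The `Ġ` prefix and sequences such as `Ã³` in the tokens above are not encoding errors: a byte-level tokenizer works on UTF-8 bytes and shows each byte as a printable character. The sketch below reproduces that display, assuming the same byte-to-character table as the GPT-2 reference implementation; `show_bytes` is a helper name made up for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7E + 1))    # '!' .. '~'
        + list(range(0xA1, 0xAC + 1))  # '¡' .. '¬'
        + list(range(0xAE, 0xFF + 1))  # '®' .. 'ÿ'
    )
    mapping = {b: chr(b) for b in printable}
    # The remaining bytes (space, control characters, ...) are shifted to U+0100 and above.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def show_bytes(text):
    """Render the UTF-8 bytes of `text` the way a byte-level tokenizer displays them."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(show_bytes(" divulgación"))
print(show_bytes(" aquí"))
```

A leading space becomes `Ġ`, and each accented letter takes two bytes in UTF-8, which is why it shows up as two characters.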

# 🤗 Transformers

We are going to train a [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)-like model. 

The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

Compared with BERT, it modifies key hyperparameters, removes the next-sentence pretraining objective, and trains with much larger mini-batches and learning rates.

In [25]:
# Check that we have a GPU
!nvidia-smi

Tue Jun  8 22:01:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P0    23W / 300W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [26]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

True

Now let's define the model architecture with a `RobertaConfig`. The model itself is instantiated from this configuration a few cells further down.

In [27]:
from transformers import RobertaConfig

# Configure a RoBERTa model
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [28]:
from transformers import RobertaTokenizerFast

# Create a tokenizer
# `model_max_length` replaces the deprecated `max_len` argument
# tokenizer = RobertaTokenizerFast.from_pretrained("./EsBERTa", model_max_length=512)
tokenizer = RobertaTokenizerFast.from_pretrained("/content/drive/MyDrive/EsBERTa", model_max_length=512)

In [29]:
from transformers import RobertaForMaskedLM

# Initialize the model
model = RobertaForMaskedLM(config=config)

In [30]:
model.num_parameters()  # 84 million parameters!

83504416
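
To see where this number comes from, we can add up the weights by hand. This is a back-of-the-envelope sketch that assumes the `RobertaConfig` defaults not set above (`hidden_size=768`, `intermediate_size=3072`), an output layer that shares its weights with the word embeddings, and no pooling layer; the helper names are made up for this example.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a LayerNorm: scale plus shift."""
    return 2 * n


def roberta_mlm_params(vocab_size=52_000, hidden=768, intermediate=3072,
                       layers=6, max_positions=514, type_vocab=1):
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm(hidden)
    # Self-attention: query, key, value and output projections, plus a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)
    # Feed-forward block: expand, project back, plus a LayerNorm.
    feed_forward = (linear(hidden, intermediate) + linear(intermediate, hidden)
                    + layer_norm(hidden))
    encoder = layers * (attention + feed_forward)
    # LM head: dense + LayerNorm + output bias (output weights assumed tied to the embeddings).
    lm_head = linear(hidden, hidden) + layer_norm(hidden) + vocab_size
    return embeddings + encoder + lm_head


total = roberta_mlm_params()
print(total)
```

Under these assumptions the sum matches `model.num_parameters()`, and almost half of the weights sit in the word embeddings.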

Now let's initialize our [transformers.Trainer](https://huggingface.co/transformers/main_classes/trainer.html#id1). We need a dataset, a data collator and some training arguments.

In [31]:
%%time
def encode(examples):
    # Truncate to the tokenizer's maximum length and pad every example to that length
    return tokenizer(examples['text'], truncation=True, padding='max_length')

dataset = dataset.map(encode, batched=True)


CPU times: user 8min 28s, sys: 17.4 s, total: 8min 45s
Wall time: 6min 46s
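
Here is a simplified sketch of what `truncation=True` and `padding='max_length'` do to a single example. It is not the tokenizer's actual code: real truncation also keeps the closing special token, and the pad id of 1 is an assumption based on the order of `special_tokens` used when training the tokenizer.

```python
def pad_and_truncate(ids, max_length, pad_id):
    """Cut `ids` to `max_length`, then pad on the right up to `max_length`."""
    ids = ids[:max_length]
    padding = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * padding,
    }


# Hypothetical ids for a short sentence: <s> ... </s>
example = pad_and_truncate([0, 10, 11, 2], max_length=6, pad_id=1)
print(example)
```

Padding every sentence to the full 512 positions is simple, but most of each row is then padding, which costs memory and compute during training.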


In [32]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
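
The collator is what turns plain sentences into a masked language modeling task. Below is a minimal pure-Python sketch of the usual BERT-style recipe: pick about 15% of the tokens, replace 80% of those with `<mask>`, 10% with a random token, and leave 10% unchanged. The function `mask_tokens` and the ids used here are illustrative, not the collator's real implementation.

```python
import random


def mask_tokens(ids, special_ids, mask_id, vocab_size, mlm_probability=0.15, rng=random):
    """Return (inputs, labels) for masked language modeling (toy version)."""
    inputs = list(ids)
    labels = [-100] * len(ids)  # -100 means "ignore this position in the loss"
    for i, token in enumerate(ids):
        # Special tokens are never masked; other tokens are selected at random.
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical sentence: <s> (id 0), 20 ordinary tokens, </s> (id 2)
sentence = [0] + list(range(100, 120)) + [2]
inputs, labels = mask_tokens(sentence, special_ids={0, 1, 2, 3, 4}, mask_id=4,
                             vocab_size=52_000, rng=random.Random(0))
print(inputs)
print(labels)
```

Because the masking is random, the same sentence is masked differently each time the collator sees it.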

In [33]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./EsBERTa",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # replaces the deprecated `per_gpu_train_batch_size`
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)
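
With these arguments we can estimate how long one epoch is and which checkpoints end up on disk. The arithmetic below is a sketch that assumes a single GPU, no gradient accumulation, and that the last partial batch is kept.

```python
import math

num_examples = 2_346_265  # len(dataset), printed earlier
batch_size = 64
save_steps = 10_000
save_total_limit = 2

# One optimizer step per batch; the last batch may be smaller.
steps_per_epoch = math.ceil(num_examples / batch_size)

# A checkpoint is written every `save_steps` steps...
checkpoint_steps = list(range(save_steps, steps_per_epoch + 1, save_steps))

# ...but only the most recent `save_total_limit` checkpoints are kept.
kept = checkpoint_steps[-save_total_limit:]

print(steps_per_epoch)
print(checkpoint_steps)
print(kept)
```

So one epoch is about 36,661 steps, and only the checkpoints from steps 20,000 and 30,000 remain at the end, which is why we also save the final model explicitly after training.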

In [34]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Finally, let's train the model!

In [None]:
%%time
trainer.train()

In [None]:
trainer.save_model("./EsBERTa")

trainer.save_model("/content/drive/MyDrive/EsBERTa")

Don't forget to share your model!

[How to upload a model to the 🤗 Model Hub.](https://huggingface.co/transformers/model_sharing.html)