<a href="https://colab.research.google.com/github/mariagrandury/hands-on-nlp-hugging-face/blob/main/hands_on_nlp_with_hugging_face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style="text-align:center"> Hands-on NLP with Hugging Face </h1>

In [None]:
# We won't need TensorFlow here
!pip uninstall -y tensorflow

# Install `datasets`
!pip install datasets

# Install `transformers` from source (the repository's default branch)
!pip install git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

# 🤗 Datasets

We are going to use the data set [Spanish Billion Words](https://huggingface.co/datasets/spanish_billion_words) (10.22 GiB).

In [None]:
from datasets import load_dataset

dataset = load_dataset("spanish_billion_words")


Downloading and preparing dataset spanish_billion_words/corpus (download: 1.89 GiB, generated: 8.34 GiB, post-processed: Unknown size, total: 10.22 GiB) to /root/.cache/huggingface/datasets/spanish_billion_words/corpus/1.1.0/8ba50a854d61199f7d36b4c3f598589a2f8b493a2644b88ce80adb2cebcbc107...



Dataset spanish_billion_words downloaded and prepared to /root/.cache/huggingface/datasets/spanish_billion_words/corpus/1.1.0/8ba50a854d61199f7d36b4c3f598589a2f8b493a2644b88ce80adb2cebcbc107. Subsequent calls will reuse this data.


In [None]:
# Keep the train split under its own name; later cells refer to `dataset_train`
dataset_train = dataset["train"]
print(len(dataset_train))
print(dataset_train[37])

{'text': 'El señor John Dashwood no tenía la profundidad de sentimientos del resto de la familia pero sí le afectó una recomendación de tal índole en un momento como ése y prometió hacer todo lo que le fuera posible por el bienestar de sus parientes'}


# 🤗 Tokenizers

Tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. The three main types of tokenizers used in 🤗 Transformers are:
- Byte-Pair Encoding (BPE)
- WordPiece
- SentencePiece

Since we are going to train a RoBERTa-like model, we will use a Byte-level BPE tokenizer.

[More info](https://huggingface.co/transformers/tokenizer_summary.html#byte-pair-encoding)
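To make the idea of BPE concrete, here is a minimal sketch of how merges are learned. It works on characters rather than bytes and uses a tiny made-up word list, so it only illustrates the principle and is not the `tokenizers` implementation; the function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy character-level BPE: repeatedly merge the most frequent adjacent pair."""
    # Each word is a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # On ties, the pair that was seen first wins.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


# Made-up mini corpus, only for illustration.
merges = learn_bpe_merges(["hola", "hola", "hora", "ola"], num_merges=3)
print(merges)
```

A byte-level tokenizer follows the same loop but starts from the 256 possible byte values instead of characters, so any input string can be encoded without an unknown token.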

In [None]:
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset_train), batch_size):
        yield dataset_train[i : i + batch_size]["text"]

In [None]:
%%time 
from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train_from_iterator(
    iterator=batch_iterator(), 
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ]
)

In [None]:
# Save the files
!mkdir -p EspaBERTa
tokenizer.save_model("EspaBERTa")

Now we have two files:
- `vocab.json`: a list of the most frequent tokens ranked by frequency
- `merges.txt`: a list of merges
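The sketch below shows how the two files work together at encoding time: the merges are applied in order to split a word, and the vocabulary maps each resulting token to an id. The file contents here are small made-up strings that mimic the layout of `vocab.json` and `merges.txt`; the real files are much larger and use byte-level symbols, and `apply_merges` is a hypothetical, simplified helper.

```python
import json

# Toy stand-ins for the two files (assumed layout: a JSON object of token -> id,
# and one space-separated merge per line after a "#version" header).
vocab_json = '{"h": 0, "o": 1, "l": 2, "a": 3, "r": 4, "ho": 5, "la": 6, "hola": 7}'
merges_txt = "#version: 0.2\nh o\nl a\nho la\n"

vocab = json.loads(vocab_json)
merges = [
    tuple(line.split(" "))
    for line in merges_txt.splitlines()
    if line and not line.startswith("#")
]


def apply_merges(word, merges):
    """Split a word into characters, then apply each merge in the learned order."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


tokens = apply_merges("holar", merges)
ids = [vocab[token] for token in tokens]
print(tokens, ids)
```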

Let's see how we can use the trained tokenizer!

In [None]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing


tokenizer = ByteLevelBPETokenizer(
    "./EspaBERTa/vocab.json",
    "./EspaBERTa/merges.txt",
)

In [None]:
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)
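As a rough picture of what these two settings do, the sketch below wraps a sequence of token ids in `<s>` ... `</s>` and truncates it so that the total length, special tokens included, stays within `max_length`. The helper `post_process` is hypothetical, and the ids 0 and 2 for `<s>` and `</s>` are an assumption based on the order in which the special tokens were passed at training time.

```python
def post_process(token_ids, cls_id, sep_id, max_length=512):
    """Truncate, then add the start and end tokens around the sequence."""
    # Reserve two positions for the special tokens.
    body = token_ids[: max_length - 2]
    return [cls_id] + body + [sep_id]


# Assumed ids: "<s>" -> 0 and "</s>" -> 2.
CLS_ID, SEP_ID = 0, 2

short = post_process([10, 11, 12], CLS_ID, SEP_ID)
truncated = post_process(list(range(10, 20)), CLS_ID, SEP_ID, max_length=6)
print(short)
print(truncated)
```

Note that `BertProcessing` takes the separator pair first and the start-of-sequence pair second, which is why `</s>` appears before `<s>` in the cell above.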

In [None]:
tokenizer.encode("Hola me llamo Maria.")

In [None]:
tokenizer.encode("Hola me llamo Maria.").tokens

# 🤗 Transformers

We are going to train a [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html)-like model. 

The RoBERTa model was proposed in [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

In [None]:
# Check that we have a GPU
!nvidia-smi

In [None]:
# Check that PyTorch sees it
import torch
torch.cuda.is_available()

Now let's define the model architecture with a RoBERTa configuration; the model itself is instantiated from it a few cells below.

In [None]:
from transformers import RobertaConfig

# Configure a RoBERTa model
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

In [None]:
from transformers import RobertaTokenizerFast

# Create a tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained("./EspaBERTa", model_max_length=512)

In [None]:
from transformers import RobertaForMaskedLM

# Initialize the model
model = RobertaForMaskedLM(config=config)

In [None]:
model.num_parameters()  # 84 million parameters!
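The parameter count can be sanity-checked by hand. The sketch below is a back-of-the-envelope estimate, assuming the `RobertaConfig` defaults for everything not set above (hidden size 768, feed-forward size 3072), output weights tied to the input embeddings, and no pooler layer. The exact number reported by `model.num_parameters()` may differ slightly between library versions.

```python
# Values set in the config above
vocab_size = 52_000
max_position_embeddings = 514
num_hidden_layers = 6
type_vocab_size = 1

# Assumed defaults (not set explicitly in the notebook)
hidden = 768
intermediate = 3072


def linear(n_in, n_out):
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift

embeddings = (
    vocab_size * hidden
    + max_position_embeddings * hidden
    + type_vocab_size * hidden
    + layer_norm
)

per_layer = (
    3 * linear(hidden, hidden)      # query, key, value projections
    + linear(hidden, hidden)        # attention output projection
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)

# LM head: dense + layer norm + output bias (decoder weights are tied to the embeddings)
lm_head = linear(hidden, hidden) + layer_norm + vocab_size

total_params = embeddings + num_hidden_layers * per_layer + lm_head
print(f"{total_params:,} parameters")
```

Under these assumptions the estimate lands close to the 84 million figure quoted above, with the token embeddings alone accounting for almost half of it.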

Now let's initialize our [transformers.Trainer](https://huggingface.co/transformers/main_classes/trainer.html#id1). We need a dataset, a data collator and some training arguments.

In [None]:
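%%time
from transformers import LineByLineTextDataset

# NOTE: `file_path` must point to a plain-text file with one training example
# per line. This notebook never creates `./oscar.eo.txt`, so export the
# Spanish corpus to a text file first and use that path here.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,
)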

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
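To illustrate what the collator does with `mlm=True` and `mlm_probability=0.15`, here is a plain-Python sketch of the masking rule: each non-special token is selected with probability 0.15, and a selected token is replaced by `<mask>` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. The function `mask_tokens` and the token ids used are hypothetical; the real collator works on batched tensors.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling on one sequence."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for pos, tok in enumerate(token_ids):
        # Special tokens are never selected; unselected positions are ignored by the loss.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)  # -100 is the index ignored by the cross-entropy loss
            continue
        labels.append(tok)  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id                     # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Assumed ids: <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4
ids = [0] + list(range(10, 110)) + [2]
inputs, labels = mask_tokens(ids, mask_id=4, vocab_size=52_000,
                             special_ids={0, 1, 2, 3, 4})
print(sum(label != -100 for label in labels), "tokens selected for prediction")
```

Because the masking is drawn again each time a batch is built, the model sees different masked positions for the same sentence across epochs.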

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./EspaBERTa",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)
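These arguments determine how many optimization steps one epoch takes and how many checkpoints end up on disk. The arithmetic below is a small sketch, assuming a single GPU, no gradient accumulation, and a hypothetical dataset of one million examples (the real size depends on the corpus you tokenize).

```python
import math

num_examples = 1_000_000      # hypothetical dataset size, for illustration only
batch_size_per_device = 64    # batch size per GPU from the arguments above
num_devices = 1               # assumption: a single GPU
num_train_epochs = 1
save_steps = 10_000
save_total_limit = 2

steps_per_epoch = math.ceil(num_examples / (batch_size_per_device * num_devices))
total_steps = steps_per_epoch * num_train_epochs

# A checkpoint is written every `save_steps` steps; only the newest
# `save_total_limit` checkpoints are kept.
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(steps_per_epoch, total_steps, checkpoints_written, checkpoints_kept)
```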

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

And let's train the model!

In [None]:
%%time
trainer.train()

In [None]:
trainer.save_model("./EspaBERTa")

Don't forget to share your model!

[How to upload a model to the 🤗 Model Hub.](https://huggingface.co/transformers/model_sharing.html)