# Quickstart to Hugging Face Transformers

## What Transformers is and how it is structured

https://huggingface.co/transformers/v2.5.0/quickstart.html

Transformers is an opinionated library built for NLP researchers seeking to use, study, or extend large-scale transformer models.

The library is built around three types of classes for each model:

1. model classes, which are PyTorch models (torch.nn.Modules) for the 8 model architectures currently provided in the library, e.g. BertModel

2. configuration classes, which store all the parameters required to build a model, e.g. BertConfig. You don’t always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model automatically takes care of instantiating the configuration (which is part of the model)

3. tokenizer classes, which store the vocabulary for each model and provide methods for encoding strings into lists of token embedding indices to be fed to a model, and for decoding those lists back into strings, e.g. BertTokenizer


All these classes can be instantiated from pretrained instances and saved locally using two methods:

1. from_pretrained() lets you instantiate a model/configuration/tokenizer from a pretrained version either provided by the library itself (currently 27 models are provided, as listed in the library’s pretrained models documentation) or stored locally (or on a server) by the user,

2. save_pretrained() lets you save a model/configuration/tokenizer locally so that it can be reloaded using from_pretrained().
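
As a minimal sketch of that save/reload round trip, the snippet below builds a deliberately tiny BertConfig (the sizes are arbitrary illustrative values, not those of any released checkpoint), writes it to a temporary directory with save_pretrained(), and reads it back with from_pretrained(). It only touches a configuration, so nothing is downloaded; model and tokenizer classes expose the same two methods.

```python
import os
import tempfile

from transformers import BertConfig

# Tiny, made-up sizes chosen only to keep the example fast.
config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

with tempfile.TemporaryDirectory() as save_dir:
    # Writes a config.json file into the (already existing) directory.
    config.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))

    # Passing a local directory instead of a model name reloads from disk.
    reloaded = BertConfig.from_pretrained(save_dir)

print(saved_files)
print(reloaded.hidden_size, reloaded.num_hidden_layers)
```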


## Installation

Use PyTorch 1.x and TensorFlow 2.x.


In [0]:
! pip install -q transformers

# Quick Usage

## BERT

Let’s start by preparing a tokenized input (a list of token embedding indices to be fed to BERT) from a text string using BertTokenizer:

In [0]:
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening under the hood, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

In [0]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
# assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']


In [27]:
print(tokenized_text)

['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
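
The '##eer' piece in this output comes from WordPiece, the sub-word scheme BERT uses: a word that is missing from the vocabulary is split greedily into the longest pieces that are in it, and every piece after the first is prefixed with '##'. The following is a toy sketch of that greedy longest-match-first idea, not the library's implementation; the wordpiece function and the five-entry toy_vocab are hypothetical and exist only for illustration.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Split one lower-cased word into WordPiece-style sub-tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary; the real bert-base-uncased one has about 30k entries.
toy_vocab = {"jim", "henson", "puppet", "##eer", "##s"}

print(wordpiece("puppeteer", toy_vocab))
print(wordpiece("puppets", toy_vocab))
print(wordpiece("muppet", toy_vocab))
```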


In [0]:
# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated with the 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

In [31]:
print(tokens_tensor.shape)
print(segments_tensors.shape)

torch.Size([1, 14])
torch.Size([1, 14])
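
Hard-coding segments_ids works for one example but is easy to get wrong. Below is a small sketch of deriving them from the tokens instead, assuming the two-sentence layout used above ([CLS] A [SEP] B [SEP]), in which the first [SEP] belongs to sentence A. segment_ids_for is a hypothetical helper, not a library function.

```python
def segment_ids_for(tokens, sep_token="[SEP]"):
    """Return 0 for every token up to and including the first [SEP], 1 afterwards."""
    ids = []
    current = 0
    for token in tokens:
        ids.append(current)
        if token == sep_token and current == 0:
            current = 1  # everything after the first [SEP] is sentence B
    return ids


tokens = ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]',
          'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

print(segment_ids_for(tokens))
# → [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```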


In [0]:
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')

# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

In [0]:
# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

In [0]:
# Predict hidden states features for each layer
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the hidden state of the last layer of the Bert model
    encoded_layers = outputs[0]
# We have encoded our input sequence in a FloatTensor of shape (batch size, sequence length, model hidden dimension)
# assert tuple(encoded_layers.shape) == (1, len(indexed_tokens), model.config.hidden_size)

In [33]:
print(encoded_layers.shape)
print(model.config.hidden_size)

torch.Size([1, 14, 768])
768


Now let’s see how to use BertForMaskedLM to predict a masked token:

In [0]:
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
segments_tensors = segments_tensors.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    predictions = outputs[0]


In [0]:
print(predictions)

In [0]:
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
# assert predicted_token == 'henson'

In [26]:
print(predicted_token)

henson
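
predictions holds one score (logit) per vocabulary entry for every position, and argmax keeps only the best-scoring entry at the masked position. To look at the runners-up as well, the scores can be ranked. The sketch below does this in plain Python on a made-up five-word vocabulary with made-up scores, purely to show the mechanics; with the real tensor, torch.topk would play the role of the hypothetical top_k helper.

```python
def top_k(scores, vocab, k=3):
    """Return the k (token, score) pairs with the highest scores."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Illustrative values only; they are not outputs of BertForMaskedLM.
toy_vocab = ["henson", "carrey", "was", "the", "morrison"]
toy_scores = [9.1, 4.3, -2.0, -1.5, 5.7]

best_token, _ = top_k(toy_scores, toy_vocab, k=1)[0]
print(best_token)
print(top_k(toy_scores, toy_vocab))
```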


## Using OpenAI GPT-2

Here is a quick-start example using the GPT2Tokenizer and GPT2LMHeadModel classes with OpenAI’s pre-trained model to predict the next token from a text prompt.

First let’s prepare a tokenized input from our text string using GPT2Tokenizer:

In [0]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

In [0]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [0]:
# Encode a text input
text = "Who was Jim Henson ? Jim Henson was a"
indexed_tokens = tokenizer.encode(text)

In [38]:
print(indexed_tokens)

[8241, 373, 5395, 367, 19069, 5633, 5395, 367, 19069, 373, 257]


In [39]:
# Convert indexed tokens into a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
print(tokens_tensor.shape)

torch.Size([1, 11])


Let’s see how to use GPT2LMHeadModel to generate the next token following our text:

In [0]:
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [0]:
# Set the model in evaluation mode to deactivate the DropOut modules
# This is IMPORTANT to have reproducible results during evaluation!
model.eval()

In [0]:
# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

In [0]:
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [44]:
print(predictions)

tensor([[[ -38.7640,  -38.4434,  -42.0484,  ...,  -45.7126,  -43.8866,
           -38.7747],
         [-104.8652, -103.7108, -108.8487,  ..., -112.9846, -110.5398,
          -107.1295],
         [ -71.2187,  -70.2136,  -76.3859,  ...,  -84.0147,  -78.1190,
           -73.4547],
         ...,
         [ -96.3705,  -98.9886, -102.8611,  ..., -110.6566, -103.5795,
           -99.4158],
         [-101.4872, -102.2246, -106.9355,  ..., -111.9820, -107.7468,
          -105.1568],
         [-111.4282, -111.0716, -115.7848,  ..., -121.6386, -117.4221,
          -113.0788]]], device='cuda:0')


In [0]:
# get the predicted next sub-word (in our case, the word 'man')
predicted_index = torch.argmax(predictions[0, -1, :]).item()

In [0]:
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

In [47]:
print(predicted_text)

Who was Jim Henson? Jim Henson was a man
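
The values printed in predictions are raw logits, which is why they are large negative numbers rather than probabilities. A softmax over the vocabulary dimension turns them into a distribution, and because softmax depends only on differences between logits, subtracting the maximum first avoids overflow or underflow in the exponentials. Here is a minimal plain-Python sketch, using the six visible entries of the last printed row as a stand-in for the full vocabulary-sized row (so the resulting numbers are illustrative only):

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Six visible entries from the last row of the printed tensor.
logits = [-111.4282, -111.0716, -115.7848, -121.6386, -117.4221, -113.0788]
probs = softmax(logits)

best = max(range(len(probs)), key=probs.__getitem__)
print(best)                  # index of the largest logit
print(round(sum(probs), 6))  # the probabilities sum to 1
```

On the real tensor, torch.softmax(predictions[0, -1, :], dim=-1) performs the same computation over the whole vocabulary.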


## Using the Past

GPT-2 as well as some other models (GPT, XLNet, Transfo-XL, CTRL) make use of a past or mems attribute, which caches the key/value pairs of previously processed tokens so that they are not recomputed during sequential decoding. This is useful when generating sequences, because a large part of the attention computation can be reused from previous steps.

Here is a fully-working example using the past with GPT2LMHeadModel and argmax decoding (which should only be used as an example, as argmax decoding introduces a lot of repetition):

In [0]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [0]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')

In [0]:
generated = tokenizer.encode("The Manhattan bridge")

In [51]:
print(generated)

[464, 13458, 7696]


In [0]:
context = torch.tensor([generated])

In [0]:
past = None

In [0]:
for i in range(100):
    print(i)
    output, past = model(context, past=past)
    token = torch.argmax(output[..., -1, :])

    generated += [token.tolist()]
    context = token.unsqueeze(0)

sequence = tokenizer.decode(generated)
print(sequence)

The model only requires a single token as input as all the previous tokens’ key/value pairs are contained in the past.
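
To make the saving concrete, here is a toy single-head attention in plain Python, not GPT-2 itself. Every name in it (embed, W_Q, W_K, W_V, attend) is made up for illustration, and embeddings are single numbers rather than vectors. It computes the attention output for the newest token twice, once recomputing every key and value and once appending to a cache, and counts the key/value projections each approach performs.

```python
import math

W_Q, W_K, W_V = 0.5, -1.0, 2.0  # toy scalar projection weights


def embed(token_id):
    """Toy embedding: map a token id to a small float."""
    return (token_id % 10) / 10.0


def attend(query, keys, values):
    """Softmax-weighted average of values for a single query."""
    scores = [query * k for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


tokens = [464, 13458, 7696, 318, 257]  # prompt ids plus two arbitrary extra ids
prompt_len = 3

# Without a cache: every step re-projects all tokens seen so far.
projections_full = 0
outputs_full = []
for end in range(prompt_len, len(tokens) + 1):
    seen = [embed(t) for t in tokens[:end]]
    keys = [W_K * e for e in seen]
    values = [W_V * e for e in seen]
    projections_full += len(seen)
    outputs_full.append(attend(W_Q * seen[-1], keys, values))

# With a cache: each token is projected exactly once and then reused.
cached_keys, cached_values = [], []
projections_cached = 0
outputs_cached = []
for position, token in enumerate(tokens):
    e = embed(token)
    cached_keys.append(W_K * e)
    cached_values.append(W_V * e)
    projections_cached += 1
    if position >= prompt_len - 1:
        outputs_cached.append(attend(W_Q * e, cached_keys, cached_values))

print(projections_full, projections_cached)
print(all(math.isclose(a, b) for a, b in zip(outputs_full, outputs_cached)))
```

The gap widens with length: the uncached count grows quadratically with the number of generated tokens, while the cached count grows linearly.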

## Model2Model example

Encoder-decoder architectures require two tokenized inputs: one for the encoder and the other one for the decoder.

Let’s assume that we want to use Model2Model for generative question answering, and start by tokenizing the question and answer that will be fed to the model.

In [0]:
! pip install -q transformers

In [0]:
import torch
from transformers import BertTokenizer, Model2Model

In [0]:
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Encode the input to the encoder (the question)
question = "Who was Jim Henson?"
encoded_question = tokenizer.encode(question)

# Encode the input to the decoder (the answer)
answer = "Jim Henson was a puppeteer"
encoded_answer = tokenizer.encode(answer)

# Convert inputs to PyTorch tensors
question_tensor = torch.tensor([encoded_question])
answer_tensor = torch.tensor([encoded_answer])

In [4]:
print(encoded_question)
print(encoded_answer)

[101, 2040, 2001, 3958, 27227, 1029, 102]
[101, 3958, 27227, 2001, 1037, 13997, 11510, 102]


In [5]:
print(question_tensor.shape)
print(answer_tensor.shape)

torch.Size([1, 7])
torch.Size([1, 8])


Let’s see how we can use Model2Model to get the value of the loss associated with this (question, answer) pair:

In [0]:
# In order to compute the loss we need to provide language model
# labels (the token ids that the model should have produced) to
# the decoder.
lm_labels = encoded_answer
labels_tensor = torch.tensor([lm_labels])

In [7]:
print(labels_tensor.shape)

torch.Size([1, 8])


In [0]:
# Load pre-trained model (weights)
model = Model2Model.from_pretrained('bert-base-uncased')

In [0]:
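# Use the GPU if one is available, otherwise stay on the CPU.
# The model must live on the same device as its inputs.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
question_tensor = question_tensor.to(device)
answer_tensor = answer_tensor.to(device)
labels_tensor = labels_tensor.to(device)
model.to(device)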

In [0]:
# Compute the language-modeling loss for the (question, answer) pair
with torch.no_grad():
    # See the models docstrings for the detail of the inputs
    outputs = model(question_tensor, answer_tensor, decoder_lm_labels=labels_tensor)
    # Transformers models always output tuples.
    # See the models docstrings for the detail of all the outputs
    # In our case, the first element is the value of the LM loss 
    lm_loss = outputs[0]
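
For intuition about what lm_loss measures, the sketch below computes a next-token cross-entropy by hand in plain Python. It assumes the usual causal language-modeling convention, in which the scores at position t are compared with the label at position t + 1 and the loss is averaged over positions; whether Model2Model shifts and averages in exactly this way should be checked against its docstring. The function names, four-entry vocabulary and scores are all made up.

```python
import math


def cross_entropy(scores, target):
    """Negative log-probability of `target` under softmax(scores)."""
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_total - scores[target]


def shifted_lm_loss(all_scores, labels):
    """Average loss of predicting labels[t + 1] from the scores at position t."""
    pairs = list(zip(all_scores[:-1], labels[1:]))
    return sum(cross_entropy(s, y) for s, y in pairs) / len(pairs)


labels = [0, 2, 3, 1]  # toy token ids over a 4-entry vocabulary

# A model that knows nothing scores every token equally ...
uniform_scores = [[0.0] * 4 for _ in labels]
# ... while a good one puts a high score on the token that comes next.
confident_scores = [[5.0 if v == nxt else 0.0 for v in range(4)]
                    for nxt in labels[1:] + [0]]

print(shifted_lm_loss(uniform_scores, labels))    # log(4), about 1.386
print(shifted_lm_loss(confident_scores, labels))  # much smaller
```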

This loss can be used to fine-tune Model2Model on the question answering task. Assuming that we fine-tuned the model, let us now see how to generate an answer:

In [0]:
# Let's re-use the previous question
question = "Who was Jim Henson?"
encoded_question = tokenizer.encode(question)
question_tensor = torch.tensor([encoded_question])

# This time we try to generate the answer, so we start with an empty sequence
answer = "[CLS]"
encoded_answer = tokenizer.encode(answer, add_special_tokens=False)
answer_tensor = torch.tensor([encoded_answer])

In [0]:
# Load pre-trained model (weights)
model = Model2Model.from_pretrained('fine-tuned-weights')
model.eval()

In [0]:
# Use the GPU if one is available, otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
question_tensor = question_tensor.to(device)
answer_tensor = answer_tensor.to(device)
model.to(device)

In [0]:
# Predict all tokens
with torch.no_grad():
    outputs = model(question_tensor, answer_tensor)
    predictions = outputs[0]

# confirm we were able to predict 'jim'
predicted_index = torch.argmax(predictions[0, -1]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]

In [0]:
print(predicted_token)
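
The cell above predicts only the first token of the answer. Producing a full answer means repeating that step: append the predicted token to the decoder input and predict again until [SEP] appears. A plain-Python sketch of that greedy loop follows. The next_token_scores stub stands in for the fine-tuned model and simply replays a fixed answer, so the function names and its behavior are assumptions for illustration, not the library's API.

```python
VOCAB = ["[CLS]", "[SEP]", "jim", "henson", "was", "a", "puppet", "##eer"]
SCRIPT = ["jim", "henson", "was", "a", "puppet", "##eer", "[SEP]"]


def next_token_scores(decoder_tokens):
    """Stub model: score 1.0 for the scripted next token, 0.0 elsewhere."""
    step = len(decoder_tokens) - 1  # tokens generated so far, excluding [CLS]
    wanted = SCRIPT[min(step, len(SCRIPT) - 1)]
    return [1.0 if token == wanted else 0.0 for token in VOCAB]


def greedy_decode(max_steps=20):
    """Grow the answer one argmax token at a time until [SEP] or max_steps."""
    decoder_tokens = ["[CLS]"]
    for _ in range(max_steps):
        scores = next_token_scores(decoder_tokens)
        best = max(range(len(VOCAB)), key=scores.__getitem__)
        decoder_tokens.append(VOCAB[best])
        if VOCAB[best] == "[SEP]":
            break
    return decoder_tokens


print(greedy_decode())
```

With a real model, the stub would be replaced by a forward pass on the question and the tokens generated so far, taking the scores at the last decoder position as in the cell above.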