My goals:

- Learn how to use a pre-trained model out of the box.
- Learn how to fine-tune a model.
- Learn how to upload my own model once it is fine-tuned.
- Find out what products Hugging Face offers.

In [3]:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


### Reformer

From https://huggingface.co/google/reformer-enwik8

Also from their page: `Note: Language generation using ReformerModelWithLMHead is not optimized yet and is rather slow.`, and indeed, it is _very_ slow.

In [4]:
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform ids back to chars; ids < 2 are special tokens and become ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
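
To make the `+ 2` offset in `encode` and `decode` concrete, here is a minimal pure-Python sketch of the same byte-level scheme without tensors. The names `encode_bytes` and `decode_bytes` are hypothetical helpers for illustration, not part of `transformers`. Unlike the `decode` above, which maps each byte through `chr`, this sketch decodes the bytes as UTF-8, so it also round-trips non-ASCII text.

```python
def encode_bytes(text, offset=2):
    """Map a string to byte-level ids, shifted by `offset` to reserve the low ids."""
    return [b + offset for b in text.encode("utf-8")]


def decode_bytes(ids, offset=2):
    """Invert encode_bytes, dropping ids below `offset` (padding and other special ids)."""
    raw = bytes(i - offset for i in ids if i >= offset)
    return raw.decode("utf-8", errors="replace")


ids = encode_bytes("IBM")
padded = ids + [0, 0]        # 0 is the pad id that encode() uses by default
print(ids)                   # [75, 68, 79]
print(decode_bytes(padded))  # IBM
```

The padding ids are silently dropped on the way back, which is why padded batches decode to the original strings.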

In [None]:
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))


## Tutorial

### Quick Tour

https://huggingface.co/transformers/quicktour.html

#### Using Pipelines

In [None]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

classifier('We are very happy to show you the 🤗 Transformers library.')

In [4]:
classifier(['I am happy', 'I am sad', 'I am neither happy nor sad but more happy than sad'])

[{'label': 'POSITIVE', 'score': 0.9998801946640015},
 {'label': 'NEGATIVE', 'score': 0.9991856217384338},
 {'label': 'POSITIVE', 'score': 0.9831374287605286}]

You can also use a specific model from the [model hub](https://huggingface.co/models). The classifier loaded below handles text in English, French, Dutch, German, Italian and Spanish.

In [5]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")


In [6]:
classifier(['I am very happy!!!', 'I hate this.', 
            'I have a love-hate relationship with this.', 
            'I guess its ok. Meh.',
           'I only like it a little bit.',
           'Its pretty bad.'])

[{'label': '5 stars', 'score': 0.8226729035377502},
 {'label': '1 star', 'score': 0.8582792282104492},
 {'label': '5 stars', 'score': 0.5077174305915833},
 {'label': '3 stars', 'score': 0.758967399597168},
 {'label': '3 stars', 'score': 0.6192184686660767},
 {'label': '2 stars', 'score': 0.454030841588974}]

#### Using specific models and tokenizers

In [42]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

Grab the tokenizer and model

In [43]:
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`pt_model` is a PyTorch model!

In [44]:
from torch.nn import Module

In [45]:
isinstance(pt_model, Module)

True

In [46]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
inputs

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

#### This is how you can create a batch with padding:

In [47]:
pt_batch = tokenizer(
     ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
     padding=True,
     truncation=True,
     max_length=512,
     return_tensors="pt"
 )

In [48]:
pt_batch

{'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996,   100,
         19081,  3075,  1012,   102],
        [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,
           102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
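
The padded batch above can be reproduced by hand. This is a minimal pure-Python sketch of what `padding=True` does conceptually: pad every sequence to the longest one and mark real tokens with 1 in the attention mask. `pad_batch` is a hypothetical helper, and the token ids are copied from the output above; the real tokenizer also handles truncation and tensor conversion.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad each id sequence to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
]
ids, mask = pad_batch(batch)
print(ids[1][-3:])   # [0, 0, 0]
print(sum(mask[1]))  # 11 real tokens in the second sentence
```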

Use the model with the inputs you pre-processed

In [49]:
pt_outputs = pt_model(**pt_batch)
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [50]:
pt_outputs.logits

tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>)

In [51]:
print(pt_outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)


In [52]:
type(pt_outputs)

transformers.modeling_outputs.SequenceClassifierOutput

In [53]:
??pt_outputs.__repr__

> All 🤗 Transformers models (PyTorch or TensorFlow) return the activations of the model before the final activation function (like SoftMax) since this final activation function is often fused with the loss.

In [54]:
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs.logits, dim=-1)

print(pt_predictions)

tensor([[2.2043e-04, 9.9978e-01],
        [5.3086e-01, 4.6914e-01]], grad_fn=<SoftmaxBackward>)
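
As a sanity check, the softmax step can be reproduced in pure Python from the logits printed earlier. This is only a sketch: the `id2label` mapping is an assumption (index 0 as NEGATIVE, index 1 as POSITIVE), chosen to agree with the pipeline output for this model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [[-4.0833, 4.3364], [0.0818, -0.0418]]  # copied from pt_outputs.logits
id2label = {0: "NEGATIVE", 1: "POSITIVE"}        # assumed label order

predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    predictions.append((id2label[best], probs[best]))
    print(id2label[best], round(probs[best], 4))
```

The second sentence lands close to 50/50, which matches the near-zero logits the model produced for it.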


#### Supplying labels

In [55]:
import torch
pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0]))

In [56]:
pt_outputs

SequenceClassifierOutput(loss=tensor(0.3167, grad_fn=<NllLossBackward>), logits=tensor([[-4.0833,  4.3364],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
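
The reported loss can be checked by hand. This sketch assumes the loss is the mean cross-entropy over the batch, which is consistent with the numbers: it reproduces 0.3167 from the logits and labels above. The per-example values also line up with the unreduced loss shown in the TF output later in these notes. `cross_entropy` is a hypothetical helper written for illustration.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax of the true class, computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


logits = [[-4.0833, 4.3364], [0.0818, -0.0418]]  # copied from pt_outputs.logits
labels = [1, 0]

per_example = [cross_entropy(row, y) for row, y in zip(logits, labels)]
mean_loss = sum(per_example) / len(per_example)
print(round(mean_loss, 4))  # 0.3167
```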

#### Training

This is skipped and deferred to another tutorial.

#### Saving

In [32]:
save_directory = 'model_save'
tokenizer.save_pretrained(save_directory)
pt_model.save_pretrained(save_directory)

In [34]:
!ls model_save/

config.json	   special_tokens_map.json  vocab.txt
pytorch_model.bin  tokenizer_config.json


In [41]:
from transformers import AutoModel
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModel.from_pretrained(save_directory, from_tf=False)

### Output all hidden states and attention weights

In [87]:
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
hs = pt_outputs.hidden_states

In [88]:
pt_outputs.keys()

odict_keys(['logits', 'hidden_states', 'attentions'])

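Let's look at the attention outputs. There is one tensor per layer, each of shape `(batch_size, num_heads, seq_len, seq_len)`. The last two dimensions form a square `seq_len x seq_len` matrix because every time step attends over all time steps (possibly with a mask). These are the attention weights `softmax(Q @ K.T / sqrt(d_k))`; multiplying them by `V` gives the attention output.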
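
To see why the last two dimensions are square, here is a minimal single-head sketch of scaled dot-product attention in pure Python. The toy `Q`, `K` and `V` values are made up for illustration, and this is not how the library implements it (the real model works on batched tensors with several heads).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """softmax(Q @ K.T / sqrt(d_k)) for one head; Q and K hold one vector per time step."""
    d_k = len(K[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    return [softmax(row) for row in scores]


def attend(weights, V):
    """Weighted sum of the value vectors: weights @ V."""
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]


# Toy inputs: seq_len = 3, d_k = 2
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

W = attention_weights(Q, K)
out = attend(W, V)
print(len(W), len(W[0]))      # 3 3 -> seq_len x seq_len weights
print(len(out), len(out[0]))  # 3 2 -> seq_len x value dimension
```

Each row of `W` sums to 1: it is the distribution one time step places over all time steps.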

In [98]:
for a in pt_outputs.attentions:
    print(a.shape)

torch.Size([2, 12, 14, 14])
torch.Size([2, 12, 14, 14])
torch.Size([2, 12, 14, 14])
torch.Size([2, 12, 14, 14])
torch.Size([2, 12, 14, 14])
torch.Size([2, 12, 14, 14])


In [99]:
for h in pt_outputs.hidden_states:
    print(h.shape)

torch.Size([2, 14, 768])
torch.Size([2, 14, 768])
torch.Size([2, 14, 768])
torch.Size([2, 14, 768])
torch.Size([2, 14, 768])
torch.Size([2, 14, 768])
torch.Size([2, 14, 768])


### Customizing Model Head

If you only need to change the number of output labels in the classification head, you can do this with `num_labels`:

In [71]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [74]:
model(**pt_batch).logits.shape

torch.Size([2, 10])

### Customizing Other aspects of the model

If you need to customize things other than the number of labels, you can use a `Config` object:

>  If you do core modifications, like changing the hidden size, you won’t be able to use a pretrained model anymore and will need to train from scratch.

In [94]:
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)

In [96]:
out = model(**pt_batch, output_attentions=True)

In [97]:
for a in out.attentions:
    print(a.shape)

torch.Size([2, 8, 14, 14])
torch.Size([2, 8, 14, 14])
torch.Size([2, 8, 14, 14])
torch.Size([2, 8, 14, 14])
torch.Size([2, 8, 14, 14])
torch.Size([2, 8, 14, 14])
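
One constraint to keep in mind when picking `n_heads` and `dim`: multi-head attention splits the hidden size evenly across heads, so `dim` has to be divisible by `n_heads`. A small sketch of that arithmetic, where `head_dim` is a hypothetical helper and not a `transformers` function:

```python
def head_dim(dim, n_heads):
    """Per-head feature size when `dim` is split evenly across `n_heads`."""
    if dim % n_heads != 0:
        raise ValueError(f"dim={dim} is not divisible by n_heads={n_heads}")
    return dim // n_heads


print(head_dim(768, 12))  # 64, the pretrained sizes seen in the earlier shapes
print(head_dim(512, 8))   # 64, the custom config above
```

Both settings end up with 64 features per head, so the custom config shrinks the model by using fewer heads rather than thinner ones.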


## TF Scratch

In [60]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tf_batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="tf"
)

tf_outputs = tf_model(tf_batch)

import tensorflow as tf
tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0]), 
                     output_hidden_states=True, output_attentions=True)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_78']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [61]:
print(tf_outputs)

TFSequenceClassifierOutput(loss=<tf.Tensor: shape=(2,), dtype=float32, numpy=array([2.2051287e-04, 6.3326043e-01], dtype=float32)>, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 ,  4.3364143 ],
       [ 0.081807  , -0.04178282]], dtype=float32)>, hidden_states=(<tf.Tensor: shape=(2, 14, 768), dtype=float32, numpy=
array([[[ 0.35494003, -0.1386167 , -0.22525916, ...,  0.15358105,
          0.07478691,  0.13097724],
        [-0.5773195 ,  0.67906535, -0.9738145 , ...,  0.8804597 ,
          1.1043587 , -0.7628254 ],
        [-0.34514365, -0.20938337,  0.57090867, ...,  0.32083964,
          0.08531715,  0.45752436],
        ...,
        [ 0.44307742,  0.09305832, -0.10337374, ..., -0.77375007,
          0.08131394,  0.07283318],
        [-0.5604855 ,  0.1081146 ,  0.12285595, ...,  0.45192814,
          0.21039748,  0.29700357],
        [-0.61160445,  0.01562406, -0.05545302, ..., -0.17362992,
          0.19333245, -0.00208834]],

       [[ 0.35494003, -0.1386

In [63]:
tf_outputs.attentions

(<tf.Tensor: shape=(2, 12, 14, 14), dtype=float32, numpy=
 array([[[[6.90215603e-02, 3.89681682e-02, 2.48741675e-02, ...,
           6.21191151e-02, 9.88976732e-02, 2.06353441e-01],
          [6.62244186e-02, 9.92845446e-02, 1.45226652e-02, ...,
           1.18213117e-01, 2.10416373e-02, 2.15363875e-02],
          [2.41780296e-01, 1.38992026e-01, 2.06516031e-02, ...,
           2.83739194e-02, 6.07644357e-02, 1.63915738e-01],
          ...,
          [1.11827508e-01, 8.38079900e-02, 3.28807272e-02, ...,
           3.59347872e-02, 3.51291858e-02, 5.04803620e-02],
          [2.19152153e-01, 5.36153167e-02, 3.43204476e-02, ...,
           2.99999453e-02, 1.11683317e-01, 9.44220796e-02],
          [2.23596662e-01, 3.22642587e-02, 4.10014652e-02, ...,
           2.21555736e-02, 6.74072579e-02, 1.47974193e-01]],
 
         [[9.80098963e-01, 7.91435014e-04, 8.09619029e-04, ...,
           5.09542122e-04, 1.95403863e-03, 2.48786109e-03],
          [1.78714131e-03, 1.18398890e-02, 1.69390962e-0