My goals:

- Learn how to use a pre-trained model out of the box.
- Learn how to fine-tune a model.
- Learn how to upload my own model once it is fine-tuned.
- Find out what products Hugging Face offers.

In [52]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


### Reformer

The encoding and decoding helpers below come from the model card at https://huggingface.co/google/reformer-enwik8

The model card also warns: `Note: Language generation using ReformerModelWithLMHead is not optimized yet and is rather slow.` Indeed, it is _very_ slow.

In [1]:
import torch

# Encoding
def encode(list_of_strings, pad_token_id=0):
    max_length = max([len(string) for string in list_of_strings])

    # create empty tensors
    attention_masks = torch.zeros((len(list_of_strings), max_length), dtype=torch.long)
    input_ids = torch.full((len(list_of_strings), max_length), pad_token_id, dtype=torch.long)

    for idx, string in enumerate(list_of_strings):
        # make sure string is in byte format
        if not isinstance(string, bytes):
            string = str.encode(string)

        input_ids[idx, :len(string)] = torch.tensor([x + 2 for x in string])
        attention_masks[idx, :len(string)] = 1

    return input_ids, attention_masks

# Decoding
def decode(outputs_ids):
    decoded_outputs = []
    for output_ids in outputs_ids.tolist():
        # transform each ID back to a char; IDs < 2 are simply transformed to ""
        decoded_outputs.append("".join([chr(x - 2) if x > 1 else "" for x in output_ids]))
    return decoded_outputs
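
To sanity-check the byte-level scheme above without downloading the model, here is a minimal pure-Python sketch of the same mapping: each UTF-8 byte is shifted by 2, and IDs below 2 are treated as reserved. The helper names `to_ids` and `from_ids` are hypothetical and not part of the model card. Unlike the `chr`-per-byte decoding above, this sketch decodes the whole byte sequence as UTF-8, which is an assumption about what one would want for non-ASCII text.

```python
def to_ids(text, offset=2):
    """Map a string to byte-level IDs: each UTF-8 byte is shifted by `offset`."""
    return [b + offset for b in text.encode("utf-8")]


def from_ids(ids, offset=2):
    """Invert to_ids, skipping reserved IDs below `offset` (e.g. padding)."""
    raw = bytes(i - offset for i in ids if i >= offset)
    # errors="replace" keeps decoding from failing on a truncated multi-byte character
    return raw.decode("utf-8", errors="replace")


ids = to_ids("IBM")
print(ids)                      # [75, 68, 79]
print(from_ids(ids + [0, 0]))   # IBM (trailing padding IDs are dropped)
```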

In [2]:
from transformers import ReformerModelWithLMHead

model = ReformerModelWithLMHead.from_pretrained("google/reformer-enwik8")
encoded, attention_masks = encode(["In 1965, Brooks left IBM to found the Department of"])
decode(model.generate(encoded, do_sample=True, max_length=150))


['In 1965, Brooks left IBM to found the Department of Defense on [[August 21]] [[1968]]. He also played [[Pioneer Studies]] at the [[Input Station Stati']

## Tutorial

### Quick Tour

https://huggingface.co/transformers/quicktour.html

#### Using Pipelines

In [3]:
from transformers import pipeline
import sys
classifier = pipeline('sentiment-analysis')

classifier('We are very happy to show you the 🤗 Transformers library.')

[{'label': 'POSITIVE', 'score': 0.9997795224189758}]

In [4]:
classifier(['I am happy', 'I am sad', 'I am neither happy nor sad but more happy than sad'])

[{'label': 'POSITIVE', 'score': 0.9998801946640015},
 {'label': 'NEGATIVE', 'score': 0.9991856217384338},
 {'label': 'POSITIVE', 'score': 0.9831374287605286}]

Use a specific model from the [model hub](https://huggingface.co/models). The classifier loaded below can handle text in English, French, Dutch, German, Italian and Spanish.

In [5]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")


In [6]:
classifier(['I am very happy!!!', 'I hate this.',
            'I have a love-hate relationship with this.',
            'I guess its ok. Meh.',
            'I only like it a little bit.',
            'Its pretty bad.'])

[{'label': '5 stars', 'score': 0.8226729035377502},
 {'label': '1 star', 'score': 0.8582792282104492},
 {'label': '5 stars', 'score': 0.5077174305915833},
 {'label': '3 stars', 'score': 0.758967399597168},
 {'label': '3 stars', 'score': 0.6192184686660767},
 {'label': '2 stars', 'score': 0.454030841588974}]
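
This model returns its labels as strings such as `'5 stars'`. If a numeric rating is more convenient downstream, the labels can be parsed. A minimal sketch, where `stars_from_label` is a hypothetical helper and the sample results are copied from the output above:

```python
def stars_from_label(label):
    """Parse a label like '1 star' or '5 stars' into an integer rating."""
    return int(label.split()[0])


# A few entries copied from the classifier output shown above
results = [
    {'label': '5 stars', 'score': 0.8226729035377502},
    {'label': '1 star', 'score': 0.8582792282104492},
    {'label': '2 stars', 'score': 0.454030841588974},
]

ratings = [stars_from_label(r['label']) for r in results]
print(ratings)  # [5, 1, 2]
```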

#### Using specific models and tokenizers

In [42]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

Grab the tokenizer and model

In [43]:
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`pt_model` is a PyTorch model!

In [44]:
from torch.nn import Module

In [45]:
isinstance(pt_model, Module)

True

In [46]:
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
inputs

{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

#### Creating a batch with padding

In [47]:
pt_batch = tokenizer(
    text=["We re very happy to show you the HF libraray", "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [48]:
pt_batch

{'input_ids': tensor([[  101,  2057,  2128,  2200,  3407,  2000,  2265,  2017,  1996,  1044,
          2546,  5622, 10024,  9447,   102],
        [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,
           102,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
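
The effect of `padding=True` on this batch can be reproduced by hand: the shorter sequence is right-padded to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. A minimal pure-Python sketch, assuming a pad ID of 0 (which matches the output above); `pad_batch` is a hypothetical helper, not a tokenizer API:

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask


# Unpadded token IDs, copied from the output above
seqs = [
    [101, 2057, 2128, 2200, 3407, 2000, 2265, 2017, 1996, 1044, 2546, 5622, 10024, 9447, 102],
    [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102],
]

input_ids, attention_mask = pad_batch(seqs)
print(input_ids[1][-4:])   # [0, 0, 0, 0]
print(attention_mask[1])   # eleven 1s followed by four 0s
```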

Use the model with the inputs you pre-processed

In [49]:
pt_outputs = pt_model(**pt_batch)
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1998,  4.5413],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [50]:
pt_outputs.logits

tensor([[-4.1998,  4.5413],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>)
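
The logits are unnormalized scores. A softmax over the last dimension turns them into class probabilities. Here is a minimal pure-Python sketch using the values printed above; in the notebook itself one would more likely call `torch.softmax(pt_outputs.logits, dim=-1)`. Reading index 0 as NEGATIVE and index 1 as POSITIVE is an assumption, which `pt_model.config.id2label` can confirm.

```python
import math


def softmax(row):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the output above, one row per input sentence
logits = [[-4.1998, 4.5413], [0.0818, -0.0418]]
probs = [softmax(row) for row in logits]

for row in probs:
    print([round(p, 4) for p in row])
```

The first sentence puts nearly all of its probability on index 1, while the second is close to an even split, so its predicted label should be treated with caution.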

In [51]:
print(pt_outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.1998,  4.5413],
        [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)
