# Looking Into Pipelines

1. Preprocessing with a tokeniser
2. Passing inputs through the model
3. Postprocessing

In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Preprocessing

1. Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
2. Mapping each token to an integer
3. Adding additional inputs that may be useful to the model

### Loading a Tokeniser

- Spelling is `tokenizer`
- **Needs to be done exactly the same way as when the model was pretrained**
  - Information can be downloaded through [Model Hub](https://huggingface.co/models)
- Sentences can be directly passed to the tokeniser
- It returns a dictionary that will be fed into the model

In [5]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # Tokeniser and from_pretrained()

- Todo: convert list of inputs -> tensors
  - With the Transformers library there is no need to worry about which ML framework is used as the backend (PyTorch, TensorFlow, Flax)
  - Transformers models only accept `tensors` as input
    - Tensors: like NumPy arrays, can be a scalar (0D), a vector (1D), a matrix (2D) or have more dimensions
- Output: a dictionary with two keys
  - `input_ids`: two rows of integers (one for each sentence), unique identifiers of the tokens (words) in each sentence
  - `attention_mask`

In [6]:
# Human readable data input
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# return_tensors specifies the type of tensors returned ("pt" = PyTorch); without it, the result is a list of lists
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}

## Passing Inputs Through the Model

- Download pretrained model the same way as tokeniser
- This architecture contains only the base Transformer module
  - Given some inputs, it outputs hidden states also known as features
  - For each model input, a high-dimensional vector representing the contextual understanding of that input by the Transformer model is retrieved
  - Hidden states are usually inputs to another part of the model (also useful on their own)

In [129]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" # Should be cached because it has been downloaded in the previous blocks
model = AutoModel.from_pretrained(checkpoint) # Instantiate model

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [130]:
model.save_pretrained("models/" + checkpoint)

- Vector output has three dimensions (large)
  - Batch size: number of sequences processed at a time
  - Sequence length: length of numerical representation of the sequence
  - Hidden size: vector dimension of each model input (high dimension); 768 is common for smaller models, and larger models can reach 3072 or more

In [11]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape) # Access elements by attributes
print(outputs["last_hidden_state"]) # Access elements by key
print(outputs[0]) # Access elements by index

torch.Size([2, 16, 768])
tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8987, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward>)
tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.301

Model head is an additional component, usually made up of one or a few layers, to convert the transformer predictions to a task-specific output. Adaptation heads, also known simply as heads, come up in different forms: language modeling heads, question answering heads, sequence classification heads...

- The high-dimensional vectors are inputs to the model heads
- Project the vectors onto a different dimension
- Output of the Transformer model is sent to the model head to be processed
- Embedding layers convert each input ID in the tokenised input into a representing vector
- Subsequent layers manipulate those vectors using the attention mechanism to produce final sentence representations

[Model Input] -> [[Embeddings + Layers] (Transformer network) -> [Hidden States] -> [Head]] (Full model) -> [Model Output]
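
To make the role of the head concrete, here is a minimal sketch in plain Python (not PyTorch) of a head that projects one hidden-state vector onto one score per label. The sizes, the weights and the `linear_head` name are made up for illustration; a real head works on much larger vectors and uses trained weights.

```python
def linear_head(hidden, weights, bias):
    """Project a hidden vector onto len(weights) output scores (logits)."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]

# Hypothetical hidden state of size 4 (real models use 768 or more)
hidden_state = [0.5, -1.0, 0.25, 2.0]

# Hypothetical head parameters: 2 labels x 4 hidden features
weights = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 2.0, 0.0],
]
bias = [0.1, -0.1]

logits = linear_head(hidden_state, weights, bias)
print(logits)  # one raw score per label
```

The output has one value per label regardless of how large the hidden vector is, which is the "projection onto a different dimension" described above.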

- Different architectures are needed for different tasks
- For example, to use a model with a sequence classification head we need to use another class

In [15]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" # These are just a set of trained weights
model = AutoModelForSequenceClassification.from_pretrained(checkpoint) # This is the architecture
outputs = model(**inputs) # Set inputs
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

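- The result of the model head has shape 2 x 2 (two sentences, one value per label)
- The head reduces the dimensionality: from a 768-dimensional vector per token to two values per sentence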

In [16]:
outputs.logits.shape

torch.Size([2, 2])

## Postprocessing

- Postprocessing is needed to understand the outputs
- Logits are the raw, unnormalised scores output by the last layer of the model
- Logits need to be converted to probabilities through a `SoftMax` layer

What is the point of applying a SoftMax function to the logits output by a sequence classification model? It bounds every value between 0 and 1 so that the scores are understandable, and it makes the outputs for each sentence sum to 1, which allows a probabilistic interpretation.
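
As a rough illustration of what SoftMax computes, the sketch below re-implements it in plain Python and applies it to the logits shown earlier. The `softmax` helper is written here for illustration only; it is not the PyTorch implementation.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    top = max(logits)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the sequence classification output above
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]

probabilities = [softmax(row) for row in batch_logits]
for row in probabilities:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 4))
```

Each row is normalised independently, which is what `dim=-1` expresses in the PyTorch call.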

In [17]:
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward>)

In [18]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

- Now we need to get the corresponding labels

In [19]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
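
A small sketch of the last step, assuming the probabilities and the `id2label` mapping shown above: pick the index with the highest probability for each sentence and look up its label. The values are copied by hand from the outputs, so this runs without loading the model.

```python
# Mapping as returned by model.config.id2label above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above
predictions = [
    [4.0195e-02, 9.5980e-01],
    [9.9946e-01, 5.4418e-04],
]

results = []
for probs in predictions:
    # Index of the largest probability in this row
    best = max(range(len(probs)), key=lambda i: probs[i])
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```

The labels and scores agree with what the `pipeline` call returned at the start of these notes.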

# Models

The `AutoModel` class is used. It is handy because you can instantiate any model from a checkpoint. This class and its relatives are simple wrappers over the wide variety of models in the library.

## Loading a Model

Creating a transformer from a configuration. A model built from the default config can be used, but it is initialised with random weights and outputs gibberish until it is trained from scratch.

In [22]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

# Default config initialises the model with random values

The `BertConfig` contains many attributes used to build the model. `hidden_size` defines the size of the `hidden_states` vector. `num_hidden_layers` defines the number of layers of the transformer.

In [23]:
config

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

Loading a pre-trained transformer.

In [119]:
from transformers import BertModel # BertConfig is not needed when loading a checkpoint

checkpoint = "bert-base-cased"

model = BertModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Your code should work seamlessly with another checkpoint if it already works with one, even if the architecture is different; the checkpoint only needs to be trained for a similar task. The weights have been downloaded and cached (so future calls to the from_pretrained() method won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/transformers. You can customize your cache folder by setting the `HF_HOME` environment variable.

In [123]:
from transformers import AutoModel # Produces checkpoint-agnostic code

model = AutoModel.from_pretrained(checkpoint) # Resolves to BertModel for this checkpoint

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Saving a Model

In [124]:
model.save_pretrained("models/" + checkpoint)

The `config.json` file holds the model config attributes and some metadata (where the checkpoint originated, which Transformers version last saved the model, etc.). `pytorch_model.bin` is the state dictionary that contains all the model weights. The config defines the architecture, and the weights are the model's parameters.

In [27]:
!ls models

config.json
pytorch_model.bin


## Using a Model for Inference

The model is used to make predictions. Transformer models can only process numbers, which the tokeniser generates.

### Inputs

What kind of inputs are accepted by the models? Models accept tensors. The tokeniser normally takes care of casting the inputs to the appropriate framework's tensors, but it helps to see what must be done before the inputs are sent to the model.

In [28]:
sequences = ["Hello!", "Cool.", "Nice!"]

The tokeniser converts these sequences into vocabulary indices, called `input_ids`. Each sequence becomes a list of numbers, so the result is a list of lists.

In [29]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

Tensors only accept rectangular shapes like matrices. This list of lists is already rectangular, so it is easy to convert into a tensor.

In [30]:
import torch

model_inputs = torch.tensor(encoded_sequences)
model_inputs

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])

### Putting Tensors Into Model

`input_ids` are the only mandatory input. The model accepts many other optional arguments.

In [31]:
output = model(model_inputs)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4033e-02,
           3.9394e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1958e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0996e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1118e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

# Tokenisers

They translate text into numerical data for models to understand because models can only accept numerical data. They find the **most meaningful representation (makes most sense to the model) and, if possible, the smallest representation**. Must use the same rules used when the model was pretrained.

In [32]:
string = "Jim Henson was a puppeteer"

## Word-Based

Easy to set up, few rules, decent results.

- Split on spaces
- Split on punctuation

Vocabulary is defined by the total number of independent tokens in the corpus. Each word is assigned an ID, from 0 up to the size of the vocabulary, which the model uses to identify it. Covering a language this way needs an enormous number of tokens. A custom token `[UNK]` is also needed for words that are not in the vocabulary (to be avoided as much as possible).

In [33]:
tokenized_text = "Jim Henson was a puppeteer".split() # Split on spaces
print(tokenized_text)

['Jim', 'Henson', 'was', 'a', 'puppeteer']
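
A minimal sketch of a word-based vocabulary, assuming a tiny made-up corpus: every new word gets the next free ID, and any word missing from the vocabulary falls back to `[UNK]`. The corpus and the `encode` helper are hypothetical.

```python
# Hypothetical toy corpus used to build the vocabulary
corpus = ["Jim Henson was a puppeteer", "Jim was a puppeteer"]

# Reserve ID 0 for the unknown token
vocab = {"[UNK]": 0}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(text):
    """Map each space-separated word to its ID, or to [UNK] if unseen."""
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]

print(vocab)
print(encode("Jim Henson was a dog"))  # "dog" is not in the vocabulary
```

Every word the corpus never contained collapses onto the same `[UNK]` ID, which is why a word-based tokeniser loses information on unseen text.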


## Character-Based

This type of tokeniser reduces the number of unknown tokens. The vocabulary is much smaller and there are far fewer unknown tokens, because every word can be built from characters. But it can be less meaningful, because a single character does not mean a lot on its own (depending on the language; Chinese characters, for example, carry more information). The model also has to process a much larger number of tokens, since there are more characters than words.
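
The trade-off can be seen on the example string with a few lines of plain Python. This is only a sketch: spaces are kept as tokens here, which is one possible choice rather than a rule.

```python
text = "Jim Henson was a puppeteer"

word_tokens = text.split()   # word-based: split on spaces
char_tokens = list(text)     # character-based: one token per character

# The character vocabulary is the set of distinct characters seen
char_vocab = sorted(set(char_tokens))

print("words:", len(word_tokens))
print("characters:", len(char_tokens))
print("distinct characters:", len(char_vocab))
```

The sequence gets several times longer, while the vocabulary needed to cover it stays small.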

## Subword

Best of both word- and character-based tokenisers. Frequently used words should not be split into smaller subwords. Rare words are decomposed into meaningful subwords.

[annoyingly] -> [annoying] + [ly]

Works especially well with agglutinative languages like Turkish.
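
The decomposition above can be sketched with a greedy longest-match-first split, in the spirit of WordPiece-style tokenisers. The vocabulary and the `subword_tokenize` name are hypothetical, and real tokenisers learn their vocabulary from data and handle many more cases.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest pieces found in vocab.

    Pieces that continue a word are stored with a "##" prefix.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration
vocab = {"annoying", "##ly", "Trans", "##former"}

print(subword_tokenize("annoyingly", vocab))
print(subword_tokenize("Transformer", vocab))
print(subword_tokenize("xyz", vocab))
```

A word present in the vocabulary as a whole (such as "annoying") is returned unsplit, matching the rule that frequent words should stay intact.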

## Others

- Byte-level BPE, as used in GPT-2
- WordPiece, as used in BERT
- SentencePiece or Unigram, as used in several multilingual models

## Loading and Saving

Load and save the tokeniser algorithm (like model architecture) and its vocabulary (like model weights).

In [35]:
# from transformers import BertTokenizer
from transformers import AutoTokenizer # Automatically grabs tokeniser based on checkpoint name

# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [38]:
tokenizer("The Harry Potter book series are the best in decades.")

{'input_ids': [101, 1109, 3466, 11434, 1520, 1326, 1132, 1103, 1436, 1107, 4397, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [115]:
tokenizer.save_pretrained("tokenizers/" + checkpoint)

('tokenizers/distilbert-base-uncased-finetuned-sst-2-english\\tokenizer_config.json',
 'tokenizers/distilbert-base-uncased-finetuned-sst-2-english\\special_tokens_map.json',
 'tokenizers/distilbert-base-uncased-finetuned-sst-2-english\\vocab.txt',
 'tokenizers/distilbert-base-uncased-finetuned-sst-2-english\\added_tokens.json',
 'tokenizers/distilbert-base-uncased-finetuned-sst-2-english\\tokenizer.json')

## Encoding

Text -> numbers is encoding.

1. Tokenisation
2. Conversion to `input_id`s

Conversion involves tokens -> numbers to build a tensor out of them. The tokeniser has a vocabulary (downloaded during instantiation). The vocab must be the same as when the model was pretrained.

### Tokenisation

In [46]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence) # Subword tokeniser

tokens

['Using', 'a', 'Trans', '##former', 'network', 'is', 'simple']

### Conversion

In [47]:
ids = tokenizer.convert_tokens_to_ids(tokens)
ids

# I hate this so much input ids: 146, 4819, 1142, 1177, 1277
# I love this so much input ids: 146, 1567, 1142, 1177, 1277
# Each word has a fixed ID (using the same tokeniser)

[7993, 170, 13809, 23763, 2443, 1110, 3014]

## Decoding

Vocab indices -> string

In [48]:
decoded_string = tokenizer.decode([7993, 170, 13809, 23763, 2443, 1110, 3014]) # Converts and groups together the tokens part of the same words, Trans + ##former
decoded_string

'Using a Transformer network is simple'
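
A rough sketch of what decoding does, using only the tokens and IDs shown above: look each ID up in the vocabulary, then glue `##` pieces onto the previous word. The `decode` helper is written for illustration and covers far fewer cases than `tokenizer.decode`.

```python
# Tokens and IDs copied from the tokenisation and conversion outputs above
tokens = ["Using", "a", "Trans", "##former", "network", "is", "simple"]
ids = [7993, 170, 13809, 23763, 2443, 1110, 3014]

# Reverse lookup table: ID -> token
id_to_token = dict(zip(ids, tokens))

def decode(token_ids):
    """Convert IDs back to text, merging '##' continuation pieces."""
    words = []
    for token_id in token_ids:
        token = id_to_token[token_id]
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continue the previous word
        else:
            words.append(token)
    return " ".join(words)

print(decode(ids))
```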

# Handling Multiple Sequences

In [49]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor(ids)
# This line will fail.
model(input_ids)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Transformers models expect multiple sentences by default. A new dimension must be added. Batching is sending multiple sentences to the model all at once.
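
The missing dimension can be seen with nested lists alone. The sketch below uses a hypothetical `shape` helper to mimic how a tensor reports its dimensions; the IDs are a few sample values for illustration.

```python
def shape(nested):
    """Return the dimensions of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(dims)

ids = [1045, 1005, 2310]  # a few sample input IDs

print(shape(ids))         # one dimension: sequence length only
print(shape([ids]))       # two dimensions: (batch size, sequence length)
print(shape([ids, ids]))  # a batch of two sequences
```

Wrapping the list in another pair of brackets adds the batch dimension the model expects, even when the batch holds a single sentence.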

In [104]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "I've been waiting for a HuggingFace course my whole life."

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

input_ids = torch.tensor([ids]) # Wrapped with []
print("Input IDs:", input_ids)

output = model(input_ids)
print("Logits:", output.logits)

Input IDs: tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])
Logits: tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward>)


In [105]:
batched_ids = [ids, ids]

input_ids = torch.tensor(batched_ids) # A batch containing the same sequence twice

output = model(input_ids)
print("Logits:", output.logits)

Logits: tensor([[-2.7276,  2.8789],
        [-2.7276,  2.8789]], grad_fn=<AddmmBackward>)


## Padding

Batching two or more sentences together raises a problem when they have different lengths, because tensors must be rectangular. A list of input IDs with rows of unequal length cannot be converted directly into a tensor. To work around this problem, we need to **pad** the inputs.

In [53]:
batched_ids = [
    [200, 200, 200],
    [200, 200]
]

# Cannot be converted to a tensor

Padding makes the tensor rectangular: it adds a padding token to the shorter sentences until they match the length of the longest one.

In [54]:
padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]
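
The manual padding above can be generalised to any batch with a short helper. This is a sketch in plain Python; `pad_batch` is a hypothetical name, not a Transformers function.

```python
def pad_batch(batch, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

padding_id = 100

batched_ids = [
    [200, 200, 200],
    [200, 200],
]

padded = pad_batch(batched_ids, padding_id)
print(padded)
```

The helper returns new lists and leaves the originals untouched, so the unpadded sequences stay available for comparison.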

There’s something wrong with the logits in the batched predictions below: the second row should be the same as the logits for the second sentence on its own, but the values are completely different! This is because attention layers contextualise each token by also attending to the padding tokens. To get the same results, use an attention mask to tell the attention layers to ignore the padding tokens.

In [56]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

seq1_ids = [[200, 200, 200]]
seq2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(seq1_ids)).logits)
print(model(torch.tensor(seq2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
tensor([[ 1.5694, -1.3895],
        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)


## Attention Masks

They are tensors with the exact same shape as the input IDs tensor, containing 0s and 1s. 1s: tokens must be processed. 0s: tokens must be ignored.

In [57]:
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
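
One common way a mask takes effect inside an attention layer is sketched below in plain Python: scores at masked positions are replaced by negative infinity before SoftMax, so those positions receive zero weight. The scores are hypothetical, and the exact implementation inside Transformers may differ in detail.

```python
import math

def masked_softmax(scores, mask):
    """SoftMax over scores, ignoring positions where mask is 0.

    Assumes at least one position has mask == 1.
    """
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 3.0]  # hypothetical attention scores for three tokens
mask = [1, 1, 0]          # the third token is padding

weights = masked_softmax(scores, mask)
print(weights)
```

Even though the padding position had the highest raw score, it contributes nothing once masked, which is why the padded batch now gives the same logits as the unpadded sentence.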


Manual tokenisation, padding and attention masking

In [106]:
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]

# Tokenise each sequence and convert its tokens to input IDs
batched_ids = [
    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(seq)) for seq in sequences
]

# Build the attention mask first (1 = real token, 0 = padding),
# then pad every sequence to the length of the longest one
max_len = max(len(ids) for ids in batched_ids)
attention_mask = [
    [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batched_ids
]
batched_ids = [
    ids + [tokenizer.pad_token_id] * (max_len - len(ids)) for ids in batched_ids
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
outputs.logits

tensor([[-2.7276,  2.8789],
        [ 3.1931, -2.6685]], grad_fn=<AddmmBackward>)

Most Transformer models have a limit of 512 or 1024 tokens, and will crash when asked to process longer sequences.

Two solutions available:
- Use a model that supports longer sequence length
- Truncate sequences
  - `sequence = sequence[:max_sequence_length]`
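
A small sketch of the truncation option, applied to a whole batch of ID lists. The sequence contents and the `truncate_batch` name are hypothetical; 512 is used as the limit because it is one of the values mentioned above.

```python
def truncate_batch(batch, max_sequence_length):
    """Cut every sequence down to at most max_sequence_length tokens."""
    return [seq[:max_sequence_length] for seq in batch]

max_sequence_length = 512

batch = [
    list(range(600)),  # hypothetical sequence that is too long
    list(range(10)),   # hypothetical sequence that already fits
]

truncated = truncate_batch(batch, max_sequence_length)
print([len(seq) for seq in truncated])
```

Sequences that already fit are returned unchanged, so truncation can be applied to every batch unconditionally.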

# Putting It All Together

In [128]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
path = "tokenizers/" + checkpoint
tokenizer = AutoTokenizer.from_pretrained(path, local_files_only=True)

sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence) # Includes input IDs and attention mask and additional inputs

In [131]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

In [132]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

model_inputs = tokenizer(sequences)

In [133]:
# Will pad the sequences up to the maximum sequence length
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

In [134]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

In [135]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Returns PyTorch tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")

# Returns TensorFlow tensors
model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")

# Returns NumPy arrays
model_inputs = tokenizer(sequences, padding=True, return_tensors="np")

## Special Tokens

If we take a look at the input IDs returned by the tokenizer API, we will see they are a tiny bit different from what we had earlier. One token ID was added at the beginning, and one at the end.

In [136]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]


They are the [CLS] and [SEP] tokens. They need to be added because this particular model was pretrained with them.

In [137]:
print(tokenizer.decode(model_inputs["input_ids"]))
print(tokenizer.decode(ids))

[CLS] i've been waiting for a huggingface course my whole life. [SEP]
i've been waiting for a huggingface course my whole life.


In [140]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained("tokenizers/" + checkpoint, local_files_only=True)
model = AutoModelForSequenceClassification.from_pretrained("models/" + checkpoint, local_files_only=True)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at models/distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This warning appears because the model saved under `models/` earlier was the base `AutoModel`, which has no classification head. Loading it with `AutoModelForSequenceClassification` therefore initialises the head weights randomly, so the logits from this cell are not meaningful predictions. Saving the `AutoModelForSequenceClassification` model instead would keep the trained head.
