Ken Perry attribution:
- Derived from [HuggingFace Course, Chapter 2, "Putting it all together"](https://huggingface.co/course/chapter2/6?fw=tf)
  - [Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter2/section6_tf.ipynb)
  

# Putting it all together (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [16]:
!pip install datasets evaluate transformers[sentencepiece]


# Tokenize the input

A Transformer takes sequences of integer *token identifiers* as input, so the text must be converted in two steps
- split the text into tokens ("word parts")
- map each token to its token identifier
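
The two conversion steps can be illustrated with a toy vocabulary. This is a simplified sketch of greedy longest-match-first splitting, the strategy used by WordPiece tokenizers such as BERT's, where a `##` prefix marks a piece that continues the previous one. The vocabulary `TOY_VOCAB`, its ids, and the helper names `wordpiece` and `text_to_ids` are hypothetical; a real tokenizer also handles punctuation, normalization, and special tokens.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "hugging": 1, "##face": 2, "course": 3, "a": 4}


def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into word parts."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def text_to_ids(text, vocab):
    """Step 1: text -> tokens. Step 2: tokens -> token identifiers."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens, [vocab[t] for t in tokens]


tokens, ids = text_to_ids("a HuggingFace course", TOY_VOCAB)
print(tokens)  # ['a', 'hugging', '##face', 'course']
print(ids)     # [4, 1, 2, 3]
```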



A *model* is identified by a **checkpoint**
  - a string naming the model architecture together with the weights saved when training stopped
    - n.b., training for longer changes the weights, which results in a different checkpoint

A pre-trained model is usually paired with the Tokenizer that was used when it was trained.

We can obtain the Tokenizer from a checkpoint via `AutoTokenizer.from_pretrained(checkpoint)`

In [17]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



Let's see what the Tokenizer produces

In [18]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

print("Model inputs: ", model_inputs)

print("Model inputs (input_ids): ", model_inputs["input_ids"])

Model inputs:  {'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Model inputs (input_ids):  [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]


The value under the `input_ids` key holds the *token identifiers*.

Out of curiosity, we can obtain the token identifiers in 2 sub-steps
- convert text to tokens
- convert tokens to token identifiers

In [19]:
print("Text: ", sequence)

Text:  I've been waiting for a HuggingFace course my whole life.


In [20]:
print("Text: ", sequence)

print("\nFirst step: Manually convert sequence of characters to sequence of tokens")
tokens = tokenizer.tokenize(sequence)

print("Tokens: ", tokens)

print("\nSecond step: Manually convert tokens to ids")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Token identifiers: ", token_ids)

# Verify that the sequence of token ids created manually is identical to that created by the one-step process
model_inputs = tokenizer(sequence)

assert token_ids == model_inputs["input_ids"][1:-1]
print('\nVerified ! token_ids == model_inputs["input_ids"][1:-1]')
print('\n\tThat is: model_inputs has bracketed the token_ids with the special start and end tokens')

print("\n")
print("Decoded model inputs (input_ids): ", tokenizer.decode(model_inputs["input_ids"]))
print("Decoded token identifiers: ", tokenizer.decode(token_ids) )

Text:  I've been waiting for a HuggingFace course my whole life.

First step: Manually convert sequence of characters to sequence of tokens
Tokens:  ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']

Second step: Manually convert tokens to ids
Token identifiers:  [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

Verified ! token_ids == model_inputs["input_ids"][1:-1]

	That is: model_inputs has bracketed the token_ids with the special start and end tokens


Decoded model inputs (input_ids):  [CLS] i've been waiting for a huggingface course my whole life. [SEP]
Decoded token identifiers:  i've been waiting for a huggingface course my whole life.


You can see that
- `input_ids` has the special token `[CLS]` added at the start and `[SEP]` added at the end of the text
- these special tokens are required by the Transformer model, because it was trained on inputs in this format

`token_ids` is identical to `input_ids` except for these special tokens.
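
The bracketing with special tokens can be reproduced by hand. This is a minimal sketch that takes the ids 101 (`[CLS]`) and 102 (`[SEP]`) from the output printed earlier; the helper name `add_special_tokens` is hypothetical and not part of the library.

```python
CLS_ID, SEP_ID = 101, 102  # ids of [CLS] and [SEP], read off the printed input_ids


def add_special_tokens(token_ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Bracket a list of token ids with the start and end special tokens."""
    return [cls_id] + list(token_ids) + [sep_id]


# Token identifiers produced by the manual two-step conversion
token_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
             17662, 12172, 2607, 2026, 2878, 2166, 1012]

input_ids = add_special_tokens(token_ids)
print(input_ids)

# Stripping the first and last id recovers the manual token ids
assert input_ids[1:-1] == token_ids
```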

The Tokenizer's behavior can be modified.

When a batch contains more than one example, the examples may have different lengths after tokenization.

The Tokenizer can adapt its behavior by padding short sequences or truncating long ones.

We just list the options without going further into them.


In [21]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequence, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequence, padding="max_length")

# Will pad the sequences up to the specified max length
# (padding never shortens: a sequence longer than 8 tokens is left as-is unless truncation=True)
model_inputs = tokenizer(sequence, padding="max_length", max_length=8)
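
To make the effect of padding concrete, here is a minimal pure-Python sketch of padding a batch to its longest member and building the matching `attention_mask`. It assumes right-side padding with pad id 0 (the `[PAD]` id in BERT-style vocabularies; `tokenizer.pad_token_id` gives the value for a particular checkpoint). The helper name `pad_to_longest` is hypothetical, and the ids of the second, shorter example are an illustrative stand-in rather than tokenizer output taken from this notebook.

```python
PAD_ID = 0  # assumed id of the [PAD] token


def pad_to_longest(batch, pad_id=PAD_ID):
    """Right-pad every id list to the length of the longest one.

    The attention mask is 1 for real tokens and 0 for padding.
    """
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    # 16 ids, as printed for the example sentence
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
     17662, 12172, 2607, 2026, 2878, 2166, 1012, 102],
    # 6 ids, illustrative stand-in for a shorter sentence
    [101, 2061, 2031, 1045, 999, 102],
]

padded = pad_to_longest(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

The mask matters because the padded positions carry no information; it lets the model ignore them when computing attention.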


In [22]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
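
Truncation can be sketched in the same spirit. The assumption here is that the content tokens are cut from the right so that the result, including `[CLS]` and `[SEP]`, has exactly `max_length` ids; the helper name `truncate` is hypothetical.

```python
def truncate(ids, max_length, cls_id=101, sep_id=102):
    """Shorten a [CLS] ... [SEP] id list to max_length, keeping both special tokens."""
    if len(ids) <= max_length:
        return list(ids)  # already short enough: nothing to cut
    content = ids[1:-1]  # drop the special tokens before cutting
    return [cls_id] + content[: max_length - 2] + [sep_id]


input_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
             17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]

short = truncate(input_ids, max_length=8)
print(short)       # [101, 1045, 1005, 2310, 2042, 3403, 2005, 102]
print(len(short))  # 8
```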

In [23]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

All model checkpoint layers were used when initializing TFDistilBertForSequenceClassification.

All the layers of TFDistilBertForSequenceClassification were initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [24]:
output

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606955,  1.6122806],
       [-3.6183183,  3.9137495]], dtype=float32)>, hidden_states=None, attentions=None)

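The output is a `TFSequenceClassifierOutput` whose `logits` field is a `Tensor` of shape `(2, 2)`: one row per example, one column per class
- the values are the `logits` (scores, **not** probabilities) of the Binary Classification model

Convert them to probabilities by applying softmax to each row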

In [25]:
import numpy as np

# Softmax turns each row of logits into probabilities that sum to 1
probs = tf.nn.softmax(output["logits"]).numpy()

# Index of the most probable class for each example
ex_classes = np.argmax(probs, axis=1)

for i, ex_class in enumerate(ex_classes):
    print(f"Example {i}: Class {ex_class:d} with probability {probs[i, ex_class]:3.2f}")


Example 0: Class 1 with probability 0.96
Example 1: Class 1 with probability 1.00
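
To see what the softmax step computes, the same probabilities can be reproduced from the printed logits without TensorFlow. This is a minimal sketch; the helper name `softmax` is our own, and subtracting the row maximum is a common numerical-stability trick rather than something the notebook does explicitly.

```python
import math


def softmax(row):
    """Exponentiate each score and normalize so the row sums to 1."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output printed above
logits = [[-1.5606955, 1.6122806],
          [-3.6183183, 3.9137495]]

for i, row in enumerate(logits):
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    print(f"Example {i}: Class {best} with probability {probs[best]:3.2f}")
```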
