In [1]:
try:
  from google.colab import drive
  IN_COLAB=True
except ImportError:
  IN_COLAB=False

if IN_COLAB:
  print("We're running Colab")

We're running Colab


In [2]:
if not IN_COLAB:
    %run Latex_macros.ipynb

Ken Perry attribution:
- Derived from [HuggingFace Course, Chapt 2, "Putting it all together"](https://huggingface.co/course/chapter2/6?fw=tf)
  - [Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter2/section6_tf.ipynb)
      - link to Github repo no longer works
      - repo probably updated

# Putting it all together (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [3]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.1/84.1 kB 2.6 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


# Tokenize the input

The Transformer's inputs are sequences of *token identifiers* (of type integer)
- Need to convert text into tokens ("word parts")
- Need to convert the tokens to token identifiers
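
To make the two conversions concrete, here is a minimal sketch of a greedy longest-match word-part tokenizer. The vocabulary `TOY_VOCAB`, its identifiers, and the helper names `toy_tokenize` and `toy_convert_tokens_to_ids` are hypothetical and invented for illustration; a real Tokenizer uses the vocabulary that ships with its checkpoint and handles punctuation, casing rules, and special tokens.

```python
# Toy vocabulary: token -> token identifier (hypothetical values, not from a real checkpoint)
TOY_VOCAB = {"[UNK]": 0, "a": 1, "course": 2, "hugging": 3, "##face": 4}


def toy_tokenize(text, vocab):
    """Step 1: split each whitespace-separated word into word parts (longest match first)."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # marks a continuation of the same word
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:  # no word part matches: the whole word is unknown
                pieces = ["[UNK]"]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens


def toy_convert_tokens_to_ids(tokens, vocab):
    """Step 2: look up the integer identifier of each token."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]


tokens = toy_tokenize("a HuggingFace course", TOY_VOCAB)
token_ids = toy_convert_tokens_to_ids(tokens, TOY_VOCAB)
print("Tokens: ", tokens)
print("Token identifiers: ", token_ids)
```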



A *model* is identified by a **checkpoint**
  - string identifying the model architecture and state at which training was ended
    - n.b., if you train for longer, the weights will change (resulting in a different checkpoint)

A pre-trained model is usually paired with the Tokenizer on which it was trained.

We can obtain the Tokenizer from a checkpoint via `AutoTokenizer.from_pretrained(checkpoint)`

In [4]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Let's understand the Tokenizer

In [5]:
sequence = "I've been waiting for a HuggingFace course my whole life."

model_inputs = tokenizer(sequence)

print("Model inputs: ", model_inputs)

print("Model inputs (input_ids): ", model_inputs["input_ids"])

Model inputs:  {'input_ids': [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Model inputs (input_ids):  [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]


The `input_ids` entry holds the *token identifiers*.

Out of curiosity, we can obtain the token identifiers in two sub-steps
- convert text to tokens
- convert tokens to token identifiers

In [6]:
print("Text: ", sequence)

Text:  I've been waiting for a HuggingFace course my whole life.


In [7]:
print("Text: ", sequence)

print("\nFirst step: Manually convert sequence of characters to sequence of tokens")
tokens = tokenizer.tokenize(sequence)

print("Tokens: ", tokens)

print("\nSecond step: Manually convert tokens to ids")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print("Token identifiers: ", token_ids)

# Verify that the sequence of token ids created manually is identical to that created by the one-step process
model_inputs = tokenizer(sequence)

assert token_ids == model_inputs["input_ids"][1:-1]
print('\nVerified ! token_ids == model_inputs["input_ids"][1:-1]')
print('\n\tThat is: model_inputs has bracketed the token_ids with the special start and end tokens')

print("\n")
print("Decoded model inputs (input_ids): ", tokenizer.decode(model_inputs["input_ids"]))
print("Decoded token identifiers: ", tokenizer.decode(token_ids) )

Text:  I've been waiting for a HuggingFace course my whole life.

First step: Manually convert sequence of characters to sequence of tokens
Tokens:  ['i', "'", 've', 'been', 'waiting', 'for', 'a', 'hugging', '##face', 'course', 'my', 'whole', 'life', '.']

Second step: Manually convert tokens to ids
Token identifiers:  [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]

Verified ! token_ids == model_inputs["input_ids"][1:-1]

	That is: model_inputs has bracketed the token_ids with the special start and end tokens


Decoded model inputs (input_ids):  [CLS] i've been waiting for a huggingface course my whole life. [SEP]
Decoded token identifiers:  i've been waiting for a huggingface course my whole life.


You can see that
- `input_ids` has the special token `[CLS]` added at the start and `[SEP]` added at the end of the text
- these special tokens are expected by the Transformer model

`token_ids` is identical to `input_ids` except for these special tokens
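
The relationship can be reproduced by hand. This is only a sketch: the identifiers 101 and 102 are read off the two ends of the `input_ids` printed above (they decode to `[CLS]` and `[SEP]`), and the helper `add_special_tokens` is a hypothetical name, not a Tokenizer method.

```python
# Ids observed at the start and end of input_ids in the output above
CLS_ID, SEP_ID = 101, 102

# The manually produced token identifiers from the two-step process
token_ids = [1045, 1005, 2310, 2042, 3403, 2005, 1037,
             17662, 12172, 2607, 2026, 2878, 2166, 1012]


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Bracket a list of token identifiers with the start and end special tokens."""
    return [cls_id] + list(ids) + [sep_id]


input_ids = add_special_tokens(token_ids)
attention_mask = [1] * len(input_ids)  # every position is a real token, so all ones

print(input_ids)
print(attention_mask)
```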

The Tokenizer's behavior can be modified.

When dealing with more than one example, the examples may have different lengths after tokenization.

The Tokenizer can adapt its behavior through padding and truncation.

We just list the options without going further into them.


In [8]:
# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequence, padding="longest")

# Will pad the sequences up to the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequence, padding="max_length")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequence, padding="max_length", max_length=8)


In [9]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)
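
To see what padding and truncation do to a batch, here is a minimal sketch in plain Python. The helper `pad_and_truncate`, the short id lists, and the padding identifier `PAD_ID = 0` are assumptions made for illustration. A real Tokenizer also keeps the special tokens in place when truncating, which this sketch ignores.

```python
PAD_ID = 0  # assumed padding identifier for this illustration


def pad_and_truncate(batch, max_length=None, pad_id=PAD_ID):
    """Cut each sequence to max_length (if given), then pad to the longest one."""
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    target = max(len(ids) for ids in batch)
    return {
        "input_ids": [ids + [pad_id] * (target - len(ids)) for ids in batch],
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch],
    }


# Two hypothetical tokenized examples of different lengths
batch = [[101, 7, 8, 9, 102], [101, 7, 102]]

padded = pad_and_truncate(batch)
truncated = pad_and_truncate(batch, max_length=4)
print(padded)
print(truncated)
```

The `attention_mask` is what lets the model distinguish real tokens from padding once all sequences have the same length.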

In [12]:
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, use_safetensors=False)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf")
output = model(**tokens)

Some layers from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english and are newly initialized: ['dropout_59']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
output

TFSequenceClassifierOutput(loss=None, logits=<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-1.5606961,  1.6122813],
       [-3.6183178,  3.9137495]], dtype=float32)>, hidden_states=None, attentions=None)

The output is a `TFSequenceClassifierOutput`
- its `logits` field is a `Tensor` with one row per example
- the values are the `logits` (scores, **not** probabilities) of the Binary Classification model

Convert them to probabilities

In [14]:
import numpy as np
probs = tf.nn.softmax(output["logits"]).numpy()

ex_classes = np.argmax(probs, axis=1)

for i, prob in enumerate(probs):
  ex_class = ex_classes[i]
  print(f"Example {i}: Class {ex_class:d} with probability {probs[i, ex_class]:3.2f}")


Example 0: Class 1 with probability 0.96
Example 1: Class 1 with probability 1.00
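
For reference, the softmax can also be worked out by hand. The sketch below uses only the standard library and the logit values printed above; the `softmax` helper is our own, not the TensorFlow function, and should yield the same rounded probabilities.

```python
import math


def softmax(logits):
    """Exponentiate and normalize; subtracting the max keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above
logits = [[-1.5606961, 1.6122813],
          [-3.6183178, 3.9137495]]

probs = [softmax(row) for row in logits]

for i, row in enumerate(probs):
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    print(f"Example {i}: Class {best} with probability {row[best]:3.2f}")
```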


# Classifier model output type: logits vs probabilities

There is a **subtle but important** point about how to pass the Loss function to Keras when using HuggingFace models.

Recall that some Classifiers, e.g., Logistic Regression, work by
- computing a score/logit
$$
\text{logit} = \Theta \cdot \x
$$
- converting the logit to a probability
    - by applying a sigmoid (binary case) or a `softmax` (multiclass case) to the logits
    

Our practice has been to assume that
- the model output
$$\y = \text{model}(\x)$$
- is a *probability* vector
    - Given possible labels/classes
    $$ C = \{ c_1, \ldots, c_\text{#C} \}$$
    - $y_j$ is the probability that input $\x$ is from class $c_j$
    
**However**: the HuggingFace standard is that $\y$ are **logits** rather than probabilities
- values *before* applying a softmax

The import of the difference is that
- the *loss function* must know
- that the model is returning *logits*, rather than *probabilities* (the Keras default)

In Keras, we can pass the loss either
- as a loss object (an instance of a loss class)
    - e.g., `tf.keras.losses.SparseCategoricalCrossentropy()`
- or as a *string* naming the loss
    - e.g., `sparse_categorical_crossentropy`

To conform to the HuggingFace standard
- we should specify the loss as an object
- passing in an (optional) argument indicating that the model outputs are logits
    - e.g., `SparseCategoricalCrossentropy(from_logits=True)`

So the typical compile statement should look like

        model.compile(
              ...,
              loss=SparseCategoricalCrossentropy(from_logits=True),
              ...)
              
rather than


        model.compile(
              ...,
              loss='sparse_categorical_crossentropy',
              ...)


See the [warning about this common pitfall](https://huggingface.co/learn/nlp-course/chapter3/3?fw=tf) in the HuggingFace course

```
Note a very common pitfall here — you can just pass the name of the loss as a string to Keras, but by default Keras will assume that you have already applied a softmax to your outputs. Many models, however, output the values right before the softmax is applied, which are also known as the logits. We need to tell the loss function that that’s what our model does, and the only way to do that is to call it directly, rather than by name with a string.
```

**Remember**

- the Loss function must be compatible with the type of the model output
    - logits or probabilities
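
A small hand computation shows why the distinction matters. This is a sketch with our own helper functions (`sparse_ce_from_logits`, `sparse_ce_from_probs`), not the Keras implementation: the cross-entropy computed directly from logits agrees with the one computed from probabilities only if the softmax has actually been applied.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_ce_from_logits(logits, label):
    """Cross-entropy when the model outputs logits: log-sum-exp minus the true-class logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def sparse_ce_from_probs(probs, label):
    """Cross-entropy when the model outputs probabilities: -log of the true-class probability."""
    return -math.log(probs[label])


logits = [-1.5606961, 1.6122813]  # first example's logits from the model output
label = 1

loss_from_logits = sparse_ce_from_logits(logits, label)
loss_from_probs = sparse_ce_from_probs(softmax(logits), label)
print(loss_from_logits, loss_from_probs)  # the two agree

# Treating raw logits as if they were probabilities is not meaningful:
# they can be negative and do not sum to 1.
try:
    sparse_ce_from_probs(logits, 0)
    mismatch_failed = False
except ValueError:
    mismatch_failed = True
print("Mismatched loss failed:", mismatch_failed)
```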

# Examining the model

- inspect the `__init__` and `call` methods

`__init__` will show the model components
- we can recursively inspect the components

`call` will show you how the model transforms input to output

In [15]:
model.__init__??

We will recursively examine the `distilbert` attribute which is a `TFDistilBertMainLayer`

But first, let's examine the `call` method, which will help us understand how the components are connected.

In [16]:
model.call??

If you scroll through the entire output, you will find the body of `call`
- the `distilbert` component is called and the result is assigned to `distilbert_output`
- `distilbert_output` is passed through several layers
- before passing through a Classifier `self.classifier`
  - which produces logits

## Let's work our way down the model, starting with the `distilbert` attribute

In [17]:
model.distilbert

<transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertMainLayer at 0x7f2dd2307da0>

We can see that it is of type `TFDistilBertMainLayer`

Use the Jupyter notebook `?` and `??` tools to inspect
- the signature
- the code
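
Outside a notebook, roughly the same information is available from the standard `inspect` module. The sketch below uses `json.dumps` as a stand-in target so that it runs without loading the model; that choice is ours, purely for illustration.

```python
import inspect
import json

# Rough counterpart of `json.dumps?`: the signature (the docstring is in json.dumps.__doc__)
signature = inspect.signature(json.dumps)
print(signature)

# Rough counterpart of `json.dumps??`: the source code, when it is available
source = inspect.getsource(json.dumps)
print(source[:300])
```

Assuming the `model` from the earlier cell is loaded, `inspect.getsource(model.distilbert.call)` should show the same code as `model.distilbert.call??`.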

In [18]:
model.distilbert?

In [19]:
model.distilbert.__init__??

In [20]:
model.distilbert.call??

The `__init__` method shows you the components
- let's examine `self.transformer`

In [21]:
model.distilbert.transformer.__init__??

We see that it has a `layer` attribute, which is a list of `TFTransformerBlock` layers

Let's examine one block

In [22]:
l = model.distilbert.transformer.layer[0]

l.__init__??

You can see that the Feed Forward Network is at the `.ffn` attribute

In [25]:
l.ffn.__init__??

You will see two `Dense` layers at attributes `lin1` and `lin2`
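
As a rough picture of what two stacked `Dense` layers compute, here is a minimal sketch in plain Python. The source only tells us that there are two `Dense` layers, `lin1` and `lin2`. The expand-then-project shapes, the GELU activation between them, and the omission of dropout are assumptions based on the usual Transformer feed-forward design, and all weights below are made-up toy values.

```python
import math


def gelu(x):
    """GELU activation (assumed here; check the model config for the actual choice)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dense(x, weights, bias):
    """A Dense layer on one vector: x (d_in) times weights (d_in x d_out) plus bias (d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]


def ffn(x, w1, b1, w2, b2):
    """lin1 expands the vector, the activation is applied, lin2 projects back."""
    hidden = [gelu(h) for h in dense(x, w1, b1)]
    return dense(hidden, w2, b2)


# Toy sizes: model dimension 2, hidden dimension 4 (hypothetical values)
x = [1.0, 2.0]
w1 = [[0.1, -0.2, 0.3, 0.0],
      [0.0, 0.5, -0.1, 0.2]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.2, 0.0], [0.0, 0.1], [-0.3, 0.4], [0.1, 0.1]]
b2 = [0.5, -0.5]

y = ffn(x, w1, b1, w2, b2)
print(y)  # same length as x: the network is applied to each token position independently
```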

In [24]:
print("Done")

Done
