# Models

* Outlines how to create and use models with the `AutoModel` class from the Hugging Face `Transformers` library
* Shows what happens when one instantiates an `AutoModel` object
* All classes and functions are imported from the `Transformers` library

## Setup

In [44]:
model_provider = "bert"
model_name = "bert-base-uncased"
model = f"{model_provider}/{model_name}"

---

## Creating a Transformer

* As with a tokenizer, call the `from_pretrained` method on the `AutoModel` class to:
    * Download and cache the model data, then load it into a model object
* The checkpoint name corresponds to a specific model architecture and weights:
    * The `bert-base-cased` checkpoint used here is a BERT model with the base architecture:
        * 12 layers
        * 768 hidden size
        * 12 attention heads
        * Cased inputs

**Notes:**

* The `AutoModel` class is simply a wrapper designed to fetch an appropriate model architecture given a checkpoint
* It is called *auto* because it infers the appropriate architecture from the checkpoint and instantiates the matching model class
* If the model architecture is known, a model can be instantiated using its direct class that defines its architecture:
    * For example: `BertModel` will directly instantiate a BERT model
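
As a sketch of the direct-class route, the snippet below builds a BERT architecture from a `BertConfig` rather than from a checkpoint. The tiny sizes are illustrative assumptions chosen so that it runs quickly; the weights are randomly initialized, not pretrained, and nothing is downloaded.

```python
from transformers import BertConfig, BertModel

# Illustrative, deliberately tiny sizes
# (the base BERT architecture uses 12 layers, hidden size 768, 12 heads)
config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)

# Instantiating from a config yields randomly initialized weights
tiny_model = BertModel(config)

print(type(tiny_model).__name__)            # BertModel
print(tiny_model.config.num_hidden_layers)  # 2
```

Using `BertModel.from_pretrained` with a checkpoint name would instead load pretrained weights, at the cost of fixing the architecture in code.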

In [45]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

---

## Loading and Saving

* Saving a model is similar to saving a tokenizer
* The `save_pretrained` method saves a model's weights and architecture configuration to disk:
    * `config.json`: model architecture configuration
    * `model.safetensors`: model weights
* The `config.json` file contains all the necessary attributes to build the model architecture including:
    * Where the checkpoint originated
    * The Transformers version in use when the checkpoint was last saved
* The `model.safetensors` file stores the state dictionary, which holds all of the model's weights
* Both files work together:
    * Configuration file is needed to know the model architecture
    * Model weights are the parameters of the model
* Use the `from_pretrained` method again to reuse a saved model
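
The round trip described above can be sketched without downloading anything. This example assumes a tiny, randomly initialized `BertModel` (the sizes are illustrative, not those of `bert-base-cased`) and a temporary directory; the exact name of the weights file may vary with the installed Transformers version.

```python
import json
import os
import tempfile

import torch
from transformers import AutoModel, BertConfig, BertModel

# Tiny stand-in model with illustrative sizes and random weights
config = BertConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=4, intermediate_size=128
)
tiny_model = BertModel(config)

with tempfile.TemporaryDirectory() as save_dir:
    tiny_model.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))
    print(saved_files)  # config.json plus a weights file

    # config.json is plain JSON describing the architecture
    with open(os.path.join(save_dir, "config.json")) as f:
        saved_config = json.load(f)
    print(saved_config["model_type"], saved_config["num_hidden_layers"])

    # AutoModel reads config.json to pick the class, then loads the weights
    reloaded = AutoModel.from_pretrained(save_dir)

    # Every tensor in the state dictionary should survive the round trip
    same_weights = all(
        torch.equal(p, q)
        for p, q in zip(
            tiny_model.state_dict().values(), reloaded.state_dict().values()
        )
    )

print(same_weights)
```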

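In [46]:
import os

# "~" is not expanded automatically, so resolve it explicitly.
# The subdirectory name is illustrative.
save_dir = os.path.expanduser("~/bert-base-cased-saved")
model.save_pretrained(save_dir)

In [47]:
model = AutoModel.from_pretrained(save_dir)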

---

## Publishing Models

* The Transformers library can publish models to the Model Hub using one's Hugging Face account
* The `push_to_hub` method is used to publish a model to the Model Hub
* Two commonly used arguments of `push_to_hub` are:
    * `repo_id`: the name of the repository to publish the model to
    * `token`: the token used to authenticate with the Model Hub (not needed after logging in)

### Log in to Hugging Face

In [48]:
from huggingface_hub import notebook_login

notebook_login()

### Push Model

* The model is pushed to Model Hub into a repository under one's namespace
* Anyone can then load the model using the `from_pretrained` method

In [49]:
# model.push_to_hub("my-model")

In [50]:
# model = AutoModel.from_pretrained("username/my-model")

---

## Encoding Text

* Transformer models handle text by turning inputs into numbers
* Text is split into tokens then transformed into numbers
* The `AutoTokenizer` class is used to encode text into numbers
* It returns a dictionary with the following fields:
    * `input_ids`: the numerical representations of tokens
    * `token_type_ids`: tells the model which sentence each token belongs to (0 for the first sentence, 1 for the second)
    * `attention_mask`: indicates which tokens should be attended to or not
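
To make the three fields concrete, here is a toy sketch in plain Python. The vocabulary and the `toy_encode` helper are hypothetical and only mimic the shape of a tokenizer's output; real tokenizers use learned subword vocabularies rather than whitespace splitting.

```python
# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "hello": 4, "world": 5}


def toy_encode(text):
    """Split on whitespace, map tokens to IDs, and wrap with [CLS]/[SEP]."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    input_ids = [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence: all segment 0
        "attention_mask": [1] * len(input_ids),  # no padding: attend to every token
    }


encoded = toy_encode("Hello world again")
print(encoded["input_ids"])  # [2, 4, 5, 1, 3] ("again" is not in the vocabulary)
```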

In [51]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoded_input = tokenizer("Hello, I'm a single sentence!")

In [52]:
print(encoded_input)

{'input_ids': [101, 8667, 117, 146, 112, 182, 170, 1423, 5650, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


* Input IDs can be decoded back into text
* The tokenizer added special tokens:
    * \[CLS\]: marks the start of the input; its final hidden state is typically used for classification
    * \[SEP\]: separates sentences and marks the end of the input
* The model requires these special tokens because it was pretrained with them
* Not all models need special tokens; only those pretrained with them do

In [53]:
tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I ' m a single sentence! [SEP]"

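* Multiple sentences can be encoded together:
    * As a sentence pair, by passing two strings as separate arguments (as below): the result is one sequence in which `token_type_ids` is 0 for the first sentence and 1 for the second
    * As a batch, by passing a list of strings: each dictionary value then holds one list per sentence
* One can also ask the tokenizer to return tensors with the `return_tensors` parameter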

In [54]:
encoded_input = tokenizer("How are you?", "I'm fine, thank you!")

In [55]:
print(encoded_input)

{'input_ids': [101, 1731, 1132, 1128, 136, 102, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [56]:
encoded_input = tokenizer("How are you?", "I'm fine, thank you!", return_tensors="pt")

In [57]:
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  136,  102,  146,  112,  182, 2503,  117, 6243,
         1128,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


* When sentences of different lengths are batched via a list, the resulting lists do not have the same length
* Arrays and tensors need to be rectangular
* Such ragged lists cannot be converted to a tensor directly
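
A quick plain-Python check illustrates the problem. The input IDs are those this notebook's tokenizer produces for the two sentences; `is_rectangular` is a hypothetical helper written for this sketch.

```python
# Input IDs for each sentence encoded on its own
batch = [
    [101, 1731, 1132, 1128, 136, 102],                      # "How are you?"
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],  # "I'm fine, thank you!"
]


def is_rectangular(rows):
    """Return True when every row has the same length."""
    return len({len(row) for row in rows}) <= 1


print([len(row) for row in batch])  # [6, 10]
print(is_rectangular(batch))        # False
```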

---

### Padding Inputs

* By setting the `padding` parameter, the tokenizer will make all sentences the same length by adding a special padding token to the sentences that are shorter than the longest one
* The result will be rectangular tensors
* For this tokenizer, the padding token has input ID $0$
* Padding positions have an attention mask value of $0$
* The model therefore does not attend to these tokens
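
What padding does to a batch can be sketched in plain Python. This is an illustration of the idea, not the tokenizer's actual implementation; `pad_batch` is a hypothetical helper, and the pad ID of 0 is the value described above.

```python
PAD_ID = 0  # ID of the padding token for the tokenizer used in this notebook


def pad_batch(rows, pad_id=PAD_ID):
    """Right-pad every row to the longest row and build the matching attention mask."""
    longest = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (longest - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (longest - len(row)) for row in rows]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1731, 1132, 1128, 136, 102],                      # "How are you?"
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],  # "I'm fine, thank you!"
]
padded = pad_batch(batch)
print(padded["input_ids"][0])       # [101, 1731, 1132, 1128, 136, 102, 0, 0, 0, 0]
print(padded["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```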

In [58]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)

In [59]:
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  136,  102,    0,    0,    0,    0],
        [ 101,  146,  112,  182, 2503,  117, 6243, 1128,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


---

### Truncating Inputs

* Sequences might be too long for a model to process
* For instance, BERT was only pretrained on sequences of up to 512 tokens
* Sequences longer than a model can handle need to be truncated
* Use the `truncation` parameter to truncate sequences; without a `max_length`, the limit is the model's maximum input length

In [60]:
encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)

In [61]:
print(encoded_input["input_ids"])

[101, 1188, 1110, 170, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1304, 1263, 5650, 119, 102]


* Combining padding and truncation parameters, one can ensure the tensors are the size one needs
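
The effect of `max_length` can be approximated in plain Python. This sketch assumes truncation trims the ordinary tokens while keeping the leading \[CLS\] (101) and trailing \[SEP\] (102); `truncate` is a hypothetical helper, not the tokenizer's implementation.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in this notebook's outputs


def truncate(row, max_length):
    """Keep [CLS] and [SEP], trimming the tokens in between to fit max_length."""
    body = row[1:-1]
    return [CLS_ID] + body[: max_length - 2] + [SEP_ID]


batch = [
    [101, 1731, 1132, 1128, 136, 102],                      # "How are you?"
    [101, 146, 112, 182, 2503, 117, 6243, 1128, 106, 102],  # "I'm fine, thank you!"
]
truncated = [truncate(row, max_length=5) for row in batch]
print(truncated)
# [[101, 1731, 1132, 1128, 102], [101, 146, 112, 182, 102]]
```

Both rows end up with five IDs, so in this case no padding is needed afterwards.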

In [62]:
encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)

In [63]:
print(encoded_input)

{'input_ids': tensor([[ 101, 1731, 1132, 1128,  102],
        [ 101,  146,  112,  182,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]])}


---

### Add Special Tokens

* Special tokens are important to some models, such as BERT
* BERT's special tokens mark sentence boundaries, padding, and out-of-vocabulary words:
    * \[CLS\]: classifies the input
    * \[SEP\]: separates sentences
    * \[PAD\]: padding token
    * \[UNK\]: unknown token

In [64]:
encoded_input = tokenizer("How are you?")

In [65]:
print(encoded_input["input_ids"])

[101, 1731, 1132, 1128, 136, 102]


In [66]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] How are you? [SEP]'

---

## Example

Given the sequences:

In [67]:
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

Once tokenized:

In [70]:
encoded_sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

* This is a list of encoded sequences (a list of lists)
* Tensors only accept rectangular shapes (matrices)
* The first sequence is shown cut to eight IDs (so it lacks the closing `102`), which makes this 2D list rectangular, so it can be converted to a tensor

In [71]:
import torch

model_inputs = torch.tensor(encoded_sequences)

* Using the tensors is simple: call the model with inputs!
* Only the input IDs are required for the model

In [72]:
output = model(model_inputs)

In [76]:
print(output.last_hidden_state.shape)

torch.Size([2, 8, 768])
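
The three dimensions of the output are batch size, sequence length, and hidden size. The sketch below shows this with a tiny, randomly initialized `BertModel`; the configuration values are illustrative assumptions (a hidden size of 32 instead of 768), so only the last dimension differs from the output above.

```python
import torch
from transformers import BertConfig, BertModel

# Illustrative tiny configuration with random weights (nothing is downloaded)
config = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2, intermediate_size=64
)
tiny_model = BertModel(config).eval()

input_ids = torch.tensor(
    [
        [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037],
        [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
    ]
)

with torch.no_grad():
    output = tiny_model(input_ids)

# (batch size, sequence length, hidden size)
print(tuple(output.last_hidden_state.shape))  # (2, 8, 32)
```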
