# Models (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%%capture
!pip install datasets evaluate transformers[sentencepiece]

Also, log in to Hugging Face.

In [2]:
from huggingface_hub import notebook_login

notebook_login()


In this section we'll take a closer look at creating and using a model. We'll use the <font color='blue'>AutoModel class</font>, which is handy when you want to <font color='blue'>instantiate any model</font> from a <font color='blue'>checkpoint</font>.

The AutoModel class and all of its relatives are actually simple <font color='blue'>wrappers</font> over the <font color='blue'>wide variety of models</font> available in the <font color='blue'>library</font>. They are clever wrappers, as they can <font color='blue'>automatically guess the appropriate model architecture</font> for your checkpoint and then <font color='blue'>instantiate a model</font> with this <font color='blue'>architecture</font>.
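To make this dispatch concrete, here is a minimal sketch. It assumes that `AutoConfig.for_model` and `AutoModel.from_config` behave as in recent Transformers releases. The tiny sizes are illustrative values chosen so the models build quickly, not the settings of any real checkpoint. Nothing is downloaded; both models are randomly initialized.

```python
from transformers import AutoConfig, AutoModel

# Illustrative, deliberately tiny hyperparameters (assumptions for this sketch).
bert_config = AutoConfig.for_model(
    "bert",
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
distilbert_config = AutoConfig.for_model(
    "distilbert",
    dim=64,
    n_layers=2,
    n_heads=4,
    hidden_dim=128,
)

# The same AutoModel call returns a different concrete class for each config.
bert_model = AutoModel.from_config(bert_config)
distilbert_model = AutoModel.from_config(distilbert_config)

print(type(bert_model).__name__)
print(type(distilbert_model).__name__)
```

When you load from a checkpoint instead, the class is chosen in the same way, based on the configuration stored with that checkpoint.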

However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let's take a look at how this works with a <font color='blue'>BERT model</font>.

### Creating a Transformer

The <font color='blue'>first thing</font> we'll need to do to initialize a BERT model is <font color='blue'>load a configuration object</font>:

In [3]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)

The configuration contains many attributes that are used to build the model:

In [4]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



While you haven't seen what all of these attributes do yet, you should recognize some of them: the <font color='blue'>hidden_size attribute</font> defines the <font color='blue'>size</font> of the <font color='blue'>hidden_states vector</font>, and <font color='blue'>num_hidden_layers</font> defines the <font color='blue'>number of layers</font> the Transformer model has.
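As a rough illustration of how these two attributes shape the model, the sketch below builds a much smaller BERT than the default. The specific values are arbitrary assumptions made for this example; the model is randomly initialized and nothing is downloaded.

```python
from transformers import BertConfig, BertModel

# Illustrative small values (assumptions for this sketch only).
small_config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,  # must divide hidden_size evenly
    intermediate_size=512,
)
small_model = BertModel(small_config)

# num_hidden_layers controls how many encoder layers are stacked.
num_layers = len(small_model.encoder.layer)
# hidden_size controls the width of the embeddings and hidden states.
embedding_width = small_model.embeddings.word_embeddings.embedding_dim
num_parameters = sum(p.numel() for p in small_model.parameters())

print(num_layers, embedding_width, num_parameters)
```

Changing either attribute changes the parameter count, which is why weights saved for one configuration cannot be loaded into a model built from a different one.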

### Different loading methods

Creating a model from the <font color='blue'>default configuration</font> initializes it with <font color='blue'>random</font> values:

In [5]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

The model can be used in this state, but it will output gibberish; it <font color='blue'>needs to be trained first</font>. We could train the model from scratch on the task at hand, but as you saw in [Chapter 1](https://huggingface.co/course/chapter1), this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To <font color='blue'>avoid</font> unnecessary and duplicated effort, it's imperative to be able to <font color='blue'>share and reuse models that have already been trained</font>.

Loading a Transformer model that is already trained is simple. We can do this using the <font color='blue'>from_pretrained() method</font>:

In [13]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

As you saw earlier, we could replace <font color='blue'>BertModel</font> with the equivalent <font color='blue'>AutoModel class</font>. We'll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the <font color='blue'>architecture is different</font>, as long as the <font color='blue'>checkpoint was trained for a similar task</font> (for example, a sentiment analysis task).

In the code sample above we <font color='blue'>didn't use BertConfig</font>, and instead loaded a <font color='blue'>pretrained model</font> via the <font color='blue'>bert-base-cased identifier</font>. This is a model checkpoint that was <font color='blue'>trained by the authors of BERT themselves</font>; you can find more details about it in its [model card](https://huggingface.co/bert-base-cased).

This model is now <font color='blue'>initialized</font> with all the <font color='blue'>weights of the checkpoint</font>. It can be used <font color='blue'>directly</font> for <font color='blue'>inference</font> on the <font color='blue'>tasks it was trained on</font>, and it can also be <font color='blue'>fine-tuned on a new task</font>. By training with pretrained weights rather than from scratch, we can quickly achieve good results.

The weights have been downloaded and cached (so future calls to the `from_pretrained()` method won't re-download them) in the cache folder, which defaults to `~/.cache/huggingface/hub` in recent versions of the library (older versions used `~/.cache/huggingface/transformers`). You can <font color='blue'>customize</font> your <font color='blue'>cache folder</font> by setting the <font color='blue'>HF_HOME environment variable</font>.
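The sketch below shows roughly how the cache root follows from the environment. `resolve_hf_home` is a hypothetical helper written for this example, not a library function. The fallback to `XDG_CACHE_HOME` and then `~/.cache` is an assumption about how recent versions of the Hugging Face libraries behave. Model files live in a subfolder of this root, whose name depends on the library version.

```python
import os
from pathlib import Path


def resolve_hf_home(env=None):
    """Return the cache root implied by an environment mapping.

    Hypothetical helper for illustration; it is not part of any library.
    """
    env = os.environ if env is None else env
    if "HF_HOME" in env:
        return Path(env["HF_HOME"]).expanduser()
    # Assumed fallback: <XDG_CACHE_HOME or ~/.cache>/huggingface
    base = env.get("XDG_CACHE_HOME", "~/.cache")
    return (Path(base) / "huggingface").expanduser()


# "/data/hf" is a made-up path used only to show the override.
print(resolve_hf_home({}))
print(resolve_hf_home({"HF_HOME": "/data/hf"}))
```

If you set `HF_HOME` from Python rather than from your shell, do it before importing `transformers`, since the value is typically read when the library is imported.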

The <font color='blue'>identifier</font> used to <font color='blue'>load the model</font> can be the identifier of <font color='blue'>any model on the Model Hub</font>, as long as it is <font color='blue'>compatible</font> with the <font color='blue'>BERT architecture</font>. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?filter=bert).

### Saving methods

Saving a model is as easy as loading one. We use the <font color='blue'>save_pretrained() method</font>, which is <font color='blue'>analogous</font> to the <font color='blue'>from_pretrained()</font> method:

In [7]:
model.save_pretrained("directory_on_my_computer")

This saves two files to your disk:

In [14]:
!ls directory_on_my_computer

config.json  model.safetensors


In [15]:
!cat directory_on_my_computer/config.json

{
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.42.4",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


If you take a look at the <font color='blue'>config.json file</font>, you'll recognize the <font color='blue'>attributes necessary</font> to <font color='blue'>build the model architecture</font>. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint.

The <font color='blue'>model.safetensors</font> file holds the <font color='blue'>state dictionary</font>; it contains all your <font color='blue'>model's weights</font>. (Older versions of the library saved this file as `pytorch_model.bin`.) The two files go hand in hand; the configuration is necessary to know your model's architecture, while the model weights are your model's parameters.
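To check that the configuration and weights together are enough to rebuild a model, here is a small round-trip sketch. It uses a tiny randomly initialized BERT (the sizes are illustrative assumptions) and a temporary directory, so nothing is downloaded and nothing is left on disk.

```python
import tempfile

import torch
from transformers import BertConfig, BertModel

# Tiny random model (illustrative sizes) so the example runs quickly.
tiny_config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
original = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as save_dir:
    original.save_pretrained(save_dir)
    # from_pretrained() also accepts a path to a local directory.
    reloaded = BertModel.from_pretrained(save_dir)

# Every tensor in the state dictionary should survive the round trip.
original_state = original.state_dict()
reloaded_state = reloaded.state_dict()
weights_match = all(
    torch.equal(original_state[name], reloaded_state[name])
    for name in original_state
)
print(weights_match)
```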

### Using a Transformer model for inference

Now that you know how to load and save a model, let's try <font color='blue'>using</font> it to <font color='blue'>make some predictions</font>. Transformer models can <font color='blue'>only process numbers</font>, specifically the numbers that the tokenizer generates. But before we discuss tokenizers, let's explore what inputs the model accepts.

<font color='blue'>Tokenizers</font> can take care of <font color='blue'>casting</font> the <font color='blue'>inputs</font> to the <font color='blue'>appropriate framework's tensors</font>, but to help you understand what's going on, we'll take a quick look at what must be done before sending the inputs to the model.

Let's say we have a couple of sequences:

In [9]:
sequences = ["Hello!", "Cool.", "Nice!"]

The <font color='blue'>tokenizer</font> converts these to <font color='blue'>vocabulary indices</font> which are typically called <font color='blue'>input IDs</font>. Each sequence is now a list of numbers! The resulting output is:

In [10]:
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

This is a <font color='blue'>list</font> of <font color='blue'>encoded sequences</font>: a list of lists. Tensors only accept rectangular shapes (think matrices). This "array" is already of rectangular shape, so <font color='blue'>converting</font> it to a <font color='blue'>tensor</font> is easy:
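"Rectangular" here simply means that every inner list has the same length. The helper below is a hypothetical illustration of that check in plain Python. The second batch is a made-up example in which one sequence is shorter than the other.

```python
def is_rectangular(batch):
    """Return True if every inner list has the same length."""
    lengths = {len(sequence) for sequence in batch}
    return len(lengths) <= 1


same_length = [[101, 7592, 999, 102], [101, 4658, 1012, 102]]
# Made-up batch: the second sequence has three IDs instead of four.
ragged = [[101, 7592, 999, 102], [101, 4658, 102]]

print(is_rectangular(same_length))
print(is_rectangular(ragged))
```

A ragged batch like the second one cannot be turned into a tensor directly; its sequences would first have to be brought to a common length.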

In [11]:
import torch

model_inputs = torch.tensor(encoded_sequences)

In [16]:
print(model_inputs)

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])


### Using the tensors as inputs to the model

Using the tensors with the model is simple. We just call the model with the inputs:

In [12]:
output = model(model_inputs)

In [17]:
print(output)

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9393e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1964e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0995e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4876e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1110e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2

While the model accepts a <font color='blue'>lot</font> of <font color='blue'>different arguments</font>, only the <font color='blue'>input IDs</font> are <font color='blue'>necessary</font>. We'll explain what the other arguments do and when they are required later, but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand.
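The printed output is easier to interpret through its shapes than through its values. The sketch below feeds the same input IDs to a tiny, randomly initialized BERT (the sizes are illustrative assumptions, and the values it produces are meaningless) and inspects only the shapes of the result.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random model; only the output shapes matter in this sketch.
tiny_config = BertConfig(
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
tiny_model = BertModel(tiny_config)
tiny_model.eval()  # disable dropout for inference

input_ids = torch.tensor(
    [
        [101, 7592, 999, 102],
        [101, 4658, 1012, 102],
        [101, 3835, 999, 102],
    ]
)

with torch.no_grad():  # no gradients are needed for inference
    tiny_output = tiny_model(input_ids)

# last_hidden_state: (batch size, sequence length, hidden_size)
hidden_shape = tuple(tiny_output.last_hidden_state.shape)
# pooler_output: (batch size, hidden_size)
pooled_shape = tuple(tiny_output.pooler_output.shape)
print(hidden_shape, pooled_shape)
```

With the `bert-base-cased` checkpoint loaded earlier, the last dimension would be 768, matching the `hidden_size` in its configuration.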