# Models

## Architectures vs. checkpoints

**Architecture**: This is the skeleton of the model — the definition of each layer and each operation that happens within the model.

**Checkpoints**: These are the weights that will be loaded in a given architecture.

**Model**: This is an umbrella term that isn’t as precise as “architecture” or “checkpoint”: it can mean either.

The AutoModel class and all of its relatives are actually simple wrappers over the wide variety of models available in the library.

When you call from_pretrained(), the weights are downloaded and cached (so future calls won’t re-download them) in the cache folder, which defaults to ~/.cache/huggingface/hub. Older releases of Transformers used ~/.cache/huggingface/transformers.

In [1]:
!apt-get install tree


In [2]:
!pip install -q datasets evaluate transformers[sentencepiece]

## Load pretrained model

### Load model from config
- The model is randomly initialized: it has the right architecture, but its weights are untrained.

In [3]:
from transformers import BertConfig, BertModel

# Building the config
config = BertConfig()

# Building the model from the config
model = BertModel(config)
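A minimal sketch of what random initialization means in practice: two models built from the same config share an architecture (identical parameter shapes) but not their weights. The tiny config values below are arbitrary assumptions, chosen only so the example runs quickly.

```python
import torch
from transformers import BertConfig, BertModel

# Deliberately tiny, arbitrary values (an assumption for this sketch).
tiny_config = BertConfig(
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
)

# Same architecture, two independent random initializations.
model_a = BertModel(tiny_config)
model_b = BertModel(tiny_config)

pairs = list(zip(model_a.parameters(), model_b.parameters()))
same_shapes = all(p.shape == q.shape for p, q in pairs)
same_weights = all(torch.equal(p, q) for p, q in pairs)

print(f"same shapes: {same_shapes}, same weights: {same_weights}")
```

Loading a checkpoint is what replaces these random values with trained ones.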

The configuration contains many attributes that are used to build the model:

In [None]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
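Any of these attributes can be overridden when the config is created. The sketch below uses arbitrary small values (assumptions, not recommended settings) and checks that the resulting model reflects them.

```python
from transformers import BertConfig, BertModel

# Override a few defaults; every other attribute keeps its default value.
small_config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=256,
)
small_model = BertModel(small_config)

num_layers = len(small_model.encoder.layer)
embedding_shape = tuple(small_model.embeddings.word_embeddings.weight.shape)

print(num_layers)       # 2
print(embedding_shape)  # (30522, 128): default vocab_size, overridden hidden_size
```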



In [None]:
model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [4]:
# Model size: total number of parameters

model_size = sum(t.numel() for t in model.parameters())
print(f"Model size: {model_size/1000**2:.1f}M parameters")

Model size: 109.5M parameters
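The 109.5M figure can be reproduced by hand from the config values. The helper below is a hypothetical function written for this sketch, not part of Transformers. It assumes the standard BertModel layout: embeddings, then identical encoder layers, then a pooler.

```python
def bert_parameter_count(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    type_vocab_size=2,
):
    """Count BertModel parameters (weights plus biases) from config values."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias vector

    layer_norm = 2 * hidden_size  # scale + shift

    embeddings = (
        vocab_size * hidden_size
        + max_position_embeddings * hidden_size
        + type_vocab_size * hidden_size
        + layer_norm
    )
    # query, key, value and output projections, followed by a LayerNorm
    attention = 4 * linear(hidden_size, hidden_size) + layer_norm
    feed_forward = (
        linear(hidden_size, intermediate_size)
        + linear(intermediate_size, hidden_size)
        + layer_norm
    )
    pooler = linear(hidden_size, hidden_size)
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


total = bert_parameter_count()
print(f"{total:,} parameters ({total / 1000**2:.1f}M)")
# 109,482,240 parameters (109.5M)
```

The number of attention heads does not appear in the formula: the heads split the hidden dimension between them rather than adding parameters.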


### Load model from pretrained

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
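The warning above appears because the checkpoint contains pretraining heads (the cls.* weights) that BertModel has no layers for. As a rough offline sketch of the same situation, the code below saves a tiny, randomly initialized BertForPreTraining model to a temporary directory and reloads it as a BertModel. The tiny config values are arbitrary assumptions.

```python
import tempfile

import torch
from transformers import BertConfig, BertForPreTraining, BertModel

tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
full_model = BertForPreTraining(tiny_config)  # encoder + pretraining heads

with tempfile.TemporaryDirectory() as checkpoint_dir:
    full_model.save_pretrained(checkpoint_dir)
    # Loading into the bare encoder drops the head weights.
    base_model = BertModel.from_pretrained(checkpoint_dir)

head_keys = [k for k in full_model.state_dict() if k.startswith("cls.")]
base_has_heads = any(k.startswith("cls.") for k in base_model.state_dict())
encoder_matches = torch.equal(
    full_model.bert.embeddings.word_embeddings.weight,
    base_model.embeddings.word_embeddings.weight,
)

print(len(head_keys) > 0, base_has_heads, encoder_matches)
```

The encoder weights are carried over unchanged; only the head weights are left unused.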


### Load model from AutoModel

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
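The warning names BertModel even though the call used AutoModel, because AutoModel picks the concrete class from the configuration. A minimal offline sketch, using a tiny arbitrary config (an assumption for illustration) and AutoModel.from_config:

```python
from transformers import AutoModel, BertConfig, BertModel

tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# AutoModel inspects the config type and instantiates the matching class.
auto_model = AutoModel.from_config(tiny_config)
class_name = type(auto_model).__name__

print(class_name)  # BertModel
```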


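You can change the cache location globally by setting the HF_HOME environment variable (the model cache then lives in $HF_HOME/hub), or for a single call by passing cache_dir to from_pretrained().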

In [None]:
from transformers import AutoModel

# Specify the cache folder path
cache_folder = "/content/model"

# Load the checkpoint with the specified cache folder
model = AutoModel.from_pretrained("bert-base-cased", cache_dir=cache_folder)

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
!tree -h "/content/model"

/content/model
└── [4.0K]  models--bert-base-cased
    ├── [4.0K]  blobs
    │   ├── [ 570]  107460496b431545e4f921afb3fd5486fd2ae79d
    │   └── [416M]  1d8bdcee6021e2c25f0325e84889b61c2eb26b843eef5659c247af138d64f050
    ├── [4.0K]  refs
    │   └── [  40]  main
    └── [4.0K]  snapshots
        └── [4.0K]  5532cc56f74641d4bb33641f5c76a55d11f846e0
            ├── [  52]  config.json -> ../../blobs/107460496b431545e4f921afb3fd5486fd2ae79d
            └── [  76]  model.safetensors -> ../../blobs/1d8bdcee6021e2c25f0325e84889b61c2eb26b843eef5659c247af138d64f050

5 directories, 5 files
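The folder name models--bert-base-cased and the default cache location both follow simple rules. The two helpers below are hypothetical functions written for this sketch. They mirror the documented lookup order (HF_HUB_CACHE, then HF_HOME, then XDG_CACHE_HOME, then ~/.cache) but are not the library's own code.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the directory where hub downloads would be cached (sketch)."""
    env = os.environ if env is None else env
    if "HF_HUB_CACHE" in env:
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if "HF_HOME" in env:
        hf_home = Path(env["HF_HOME"]).expanduser()
    else:
        cache_root = env.get("XDG_CACHE_HOME", "~/.cache")
        hf_home = Path(cache_root).expanduser() / "huggingface"
    return hf_home / "hub"


def repo_folder_name(repo_id, repo_type="model"):
    """Build the per-repository folder name used inside the cache."""
    return "--".join([f"{repo_type}s", *repo_id.split("/")])


print(resolve_hub_cache({"HF_HOME": "/data/hf"}))
print(repo_folder_name("bert-base-cased"))  # models--bert-base-cased
```

When cache_dir is passed to from_pretrained(), the models--... folder is created directly inside that directory, as the tree output above shows.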


## Saving models

In [None]:
save_folder = "/content/model2"

model.save_pretrained(save_folder)

In [None]:
!tree -h "/content/model2"

/content/model2
├── [ 656]  config.json
└── [413M]  pytorch_model.bin

0 directories, 2 files


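- config.json holds the attributes needed to rebuild the architecture, plus some metadata such as the checkpoint the model came from and the Transformers version that saved it.

- The pytorch_model.bin file contains the state dictionary, that is, all of your model’s weights. More recent Transformers releases write model.safetensors by default instead.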
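A save and reload round trip can be checked without downloading anything. The sketch below uses a tiny, randomly initialized model (arbitrary config values, an assumption for illustration) and a temporary directory. The weight file name depends on the installed Transformers version, so only config.json is checked by name.

```python
import os
import tempfile

import torch
from transformers import BertConfig, BertModel

tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)
tiny_model = BertModel(tiny_config)

with tempfile.TemporaryDirectory() as save_dir:
    tiny_model.save_pretrained(save_dir)
    saved_files = sorted(os.listdir(save_dir))
    reloaded = BertModel.from_pretrained(save_dir)

# Compare every tensor in the two state dictionaries.
original_state = tiny_model.state_dict()
reloaded_state = reloaded.state_dict()
identical = all(
    torch.equal(original_state[name], reloaded_state[name])
    for name in original_state
)

print(saved_files)
print(identical)
```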



In [None]:
!cat /content/model2/config.json

{
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.30.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}


## Common functions

- **forward(inputs)**: This function performs the forward pass through the model. It takes inputs such as input IDs, attention masks, and token type (segment) IDs, and returns the model's outputs. In practice you call the model itself, model(**inputs), which invokes forward() for you.

- **from_pretrained(pretrained_model_name_or_path)**: This class method loads a pretrained model, either from the Hugging Face Hub by its identifier (for example bert-base-cased) or from a local directory written by save_pretrained().

- **eval()**: This function sets the model to evaluation mode, which deactivates dropout. It does not turn off gradient tracking; use torch.no_grad() for that. Models returned by from_pretrained() are already in evaluation mode.

- **train()**: This function sets the model to training mode, which reactivates dropout. Call it before training or fine-tuning a model loaded with from_pretrained().

- **generate(inputs)**: If the model is a language generation model, this function generates text based on the provided inputs. The inputs may include input prompts or conditional information.

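- **predict(inputs)**: PyTorch models in Transformers do not define a predict() method. For a classifier or a sequence labeling model, call the model and take the argmax of the returned logits to get labels or tags. A predict() method does exist on Trainer and on the Keras-based TensorFlow model classes.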

- **save_pretrained(directory)**: This function saves the model and its associated configuration file to the specified directory. It allows you to save the model for future use or to share it with others.
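A minimal sketch of several of these functions together, using a tiny, randomly initialized classifier. The config values and token IDs are arbitrary assumptions, so the predicted labels are meaningless. The example only shows the call pattern, the output shapes, and the effect of eval() versus train() on dropout.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

tiny_config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
classifier = BertForSequenceClassification(tiny_config)

# Two sequences of length 4; the second one ends with a padding token (id 0).
inputs = {
    "input_ids": torch.tensor([[2, 15, 27, 3], [2, 8, 3, 0]]),
    "attention_mask": torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]]),
}

classifier.eval()  # dropout off: repeated calls give the same logits
with torch.no_grad():
    logits_a = classifier(**inputs).logits  # calling the model runs forward()
    logits_b = classifier(**inputs).logits
eval_is_stable = torch.allclose(logits_a, logits_b)

classifier.train()  # dropout on: repeated calls give different logits
with torch.no_grad():
    logits_c = classifier(**inputs).logits
    logits_d = classifier(**inputs).logits
train_is_stable = torch.allclose(logits_c, logits_d)

# In place of a predict() method: take the highest-scoring class per sequence.
predicted_labels = logits_a.argmax(dim=-1)

print(tuple(logits_a.shape), tuple(predicted_labels.shape))  # (2, 3) (2,)
print(eval_is_stable, train_is_stable)
```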

## References

- Hugging Face NLP Course
  - https://huggingface.co/learn/nlp-course/chapter1/1

- datacamp: An Introduction to Using Transformers and Hugging Face
  - https://www.datacamp.com/tutorial/an-introduction-to-using-transformers-and-hugging-face