# Model

This notebook will:

    I. Present transformer models
    II. Explain how to use a model
        A. Model download
        B. Model inference
        C. Model with head
    III. Train a model

## I. Transformer


#### 1. Transformer Model structure

This is the original transformer structure, which is composed of an encoder and a decoder.

<img src="https://machinelearningmastery.com/wp-content/uploads/2021/08/attention_research_1-727x1024.png" width="300"/>

https://huggingface.co/docs/transformers/en/model_summary

#### 2. Model Types

Based on the principal components of the transformer models, there are 3 major types:

* encoder-only model
* decoder-only model
* encoder-decoder model
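
In practice, the main difference between these types is which positions each token may attend to. Below is a minimal pure-Python sketch of that idea; the helper names `bidirectional_mask` and `causal_mask` are illustrative and not part of any library. Encoder-only models let every token see the whole sequence, while decoder-only models restrict each token to itself and the earlier positions. Encoder-decoder models combine both: a bidirectional encoder and a causal decoder that also attends to the encoder output.

```python
def bidirectional_mask(seq_len):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * seq_len for _ in range(seq_len)]


def causal_mask(seq_len):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


# Print the causal mask for a 4-token sequence, one row per query position.
for row in causal_mask(4):
    print(row)
```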

#### 3. Pretrained Models

This is a summary of some pretrained models and their applications.

| Model type      | Pretrained models                   | Applications                                 |
|-----------------|-------------------------------------|----------------------------------------------|
| encoder-only    | ALBERT, BERT, DistilBERT, RoBERTa   | text classification, NER, text comprehension |
| decoder-only    | GPT, GPT-2, Bloom, LLaMA            | text generation                              |
| encoder-decoder | BART, T5, Marian, mBART, GLM        | text summarization, translation              |

#### 4. Model Head

The model head consists of the final one or more fully connected layers. It projects the hidden states onto the outputs required by the task.
Transformers provides several predefined heads:
* *Model: no head, returns the hidden states
* *ForCausalLM: decoder, predicts the next token (used for text generation)
* *ForMaskedLM: encoder, predicts the masked tokens
* *ForSeq2SeqLM
* *ForMultipleChoice
* *ForQuestionAnswering
* *ForSequenceClassification
* *ForTokenClassification
* ...
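
To make the idea of a head concrete, here is a minimal sketch of a classification head as a single linear projection from one hidden vector to task logits. The function `linear_head` and the toy sizes (hidden size 4, 2 labels) are assumptions chosen for illustration; real heads operate on tensors and may include dropout or extra layers.

```python
def linear_head(hidden, weight, bias):
    """Project one hidden vector to task logits: logits = W @ hidden + b."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]


# Toy values (assumptions): hidden_size=4, num_labels=2.
hidden = [0.5, -1.0, 0.25, 2.0]
weight = [
    [0.1, 0.0, 0.2, 0.0],  # weights producing the logit of label 0
    [0.0, 0.3, 0.0, 0.1],  # weights producing the logit of label 1
]
bias = [0.0, 0.1]

logits = linear_head(hidden, weight, bias)
print(len(hidden), "->", len(logits))
```

The base model is unchanged across tasks; only the shape of this last projection depends on the task (number of labels, vocabulary size, and so on).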

## II. How to Use

In [1]:
# 1) import
###########

from transformers import AutoConfig, AutoModel, AutoTokenizer

### A. Model download

By default, we can use "AutoModel" to download a base model. A base model usually contains the shared network structure without a task-specific output head (e.g., a final feed-forward or dense layer).

In [2]:
# 2) Loading
############

## load online using Transformers

# download base model without head

# the model is in: https://huggingface.co/google-bert/bert-base-uncased

model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

In [None]:
## download using git

!git lfs clone https://huggingface.co/google-bert/bert-base-uncased --include="*.bin"

In [None]:
## load offline (here, from the local directory cloned with git)

model = AutoModel.from_pretrained("bert-base-uncased")

In [3]:
# 3) model config
#################

## show model config

# The config contains common characteristics of the model, such as:
#   * max_position_embeddings: the maximum sequence length (number of positions) the model accepts
#   * hidden_size: the dimension of the hidden states
#   * hidden_dropout_prob: dropout rate
#   * ...

# The printed config lists only the model-related parameters.
# It does not list the customizable arguments (such as output_attentions).

model.config

BertConfig {
  "_name_or_path": "google-bert/bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.41.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
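
As a worked example of how these config values determine the model size, the sketch below derives the per-head dimension and the parameter count of the base model. It assumes the standard BERT layout (query/key/value/output projections, a two-layer feed-forward block and two LayerNorms per layer, plus the pooler); the helper names are illustrative.

```python
# Values copied from the config printed above.
cfg = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
}


def linear_params(n_in, n_out):
    """Weights plus bias of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm_params(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


h = cfg["hidden_size"]
ffn = cfg["intermediate_size"]

# Each attention head works on hidden_size / num_attention_heads dimensions.
head_dim = h // cfg["num_attention_heads"]

# Word, position and token-type embeddings, followed by one LayerNorm.
embeddings = (
    cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
) * h + layer_norm_params(h)

# One encoder layer: 4 attention projections, feed-forward block, 2 LayerNorms.
per_layer = (
    4 * linear_params(h, h)
    + layer_norm_params(h)
    + linear_params(h, ffn)
    + linear_params(ffn, h)
    + layer_norm_params(h)
)

pooler = linear_params(h, h)
total = embeddings + cfg["num_hidden_layers"] * per_layer + pooler

print(head_dim, total)  # 64 109482240
```

If the model is loaded, the result can be compared with `sum(p.numel() for p in model.parameters())`.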

In [4]:
## config

# the config can also be loaded on its own, separately from the model
# it is loaded as:

config = AutoConfig.from_pretrained("google-bert/bert-base-uncased")
config



BertConfig {
  "_name_or_path": "google-bert/bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.41.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [5]:
## change model config

# show attentions
print(model.config.output_attentions)
model.config.output_attentions = True
print(model.config.output_attentions)

# convert id to label
print(model.config.id2label)
model.config.id2label = {0: "Negative", 1: "Positive"}
print(model.config.id2label)

False
True
{0: 'LABEL_0', 1: 'LABEL_1'}
{0: 'Negative', 1: 'Positive'}


In [6]:
model.config.id2label = {0: "Negative", 1: "Positive"}
model.config.id2label

{0: 'Negative', 1: 'Positive'}

### B. Model Inference

In [7]:
# 1) prepare some data
######################

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
input_text = "I came here searching for an answer."
tokenized_data = tokenizer(input_text, return_tensors='pt')
# keep the token ids in a variable; the inference cells below use it
input_ids = tokenized_data.input_ids
input_ids

tensor([[ 101, 1045, 2234, 2182, 6575, 2005, 2019, 3437, 1012,  102]])

In [9]:
# 2) Model with no head
#######################

# This is an encoder-only model.

# Its last linear layer is Linear(in_features=768, out_features=768)

model_nh = AutoModel.from_pretrained("google-bert/bert-base-uncased")

model_nh

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [10]:
# 3) Inference
##############

## output of model

# We feed some sample inputs to the model to show its outputs.

# By default, the model gives 2 outputs:
#
#   * last_hidden_state: the hidden state of each input token, of dim [batch, seq_len, hidden_size].
#     Since our input has 10 tokens and the hidden size of the model is 768 (see model.config),
#     the last_hidden_state has dim [1, 10, 768].
#     To check its dim we can use: output.last_hidden_state.size()

#   * pooler_output: the pooled version of the last_hidden_state, of dim [batch, hidden_size].
#     With the hidden size of 768 (see model.config), the pooler_output has dim [1, 768].
#     To check its dim we can use: output.pooler_output.size()

# All other outputs (hidden_states, past_key_values, attentions, cross_attentions) are None.

output = model_nh(input_ids)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0414,  0.2263, -0.3680,  ..., -0.0351,  0.2934,  0.5573],
         [ 0.6679,  0.0198, -0.5826,  ..., -0.0465,  0.5988,  0.4897],
         [ 0.0205, -0.1866,  0.0394,  ...,  0.0380,  0.1209, -0.2524],
         ...,
         [ 0.6688, -0.5342,  0.1776,  ..., -0.1937,  0.1931,  0.0144],
         [ 0.5437,  0.2672, -0.4816,  ..., -0.0911, -0.2954, -0.7393],
         [ 0.7015,  0.1112,  0.1859,  ...,  0.3119, -0.4980, -0.4962]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9180, -0.4124, -0.8767,  0.7994,  0.6171, -0.1314,  0.8555,  0.2321,
         -0.8629, -0.9999, -0.5016,  0.9781,  0.9815,  0.5616,  0.9504, -0.7621,
         -0.3130, -0.6262,  0.1598, -0.5764,  0.7138,  0.9999, -0.0433,  0.3060,
          0.3983,  0.9946, -0.6577,  0.9428,  0.9665,  0.7421, -0.6475,  0.0761,
         -0.9916,  0.0264, -0.8849, -0.9890,  0.4022, -0.6485,  0.1285,  0.2126,
         -0.9041,  0.1580,  1.00
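
To show where pooler_output comes from, here is a minimal pure-Python sketch of the pooling step as BERT performs it: take the hidden state of the first token ([CLS]), apply a dense layer, then tanh. The function name `pooler` and the toy sizes (3 tokens, hidden size 2) are assumptions for illustration. The tanh also explains why every value in pooler_output lies between -1 and 1.

```python
import math


def pooler(last_hidden_state, weight, bias):
    """Pool a [seq_len][hidden] sequence into one [hidden] vector.

    Only the first token's hidden state is used: tanh(W @ h_first + b).
    """
    first_token = last_hidden_state[0]
    return [
        math.tanh(sum(w * x for w, x in zip(row, first_token)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy values (assumptions): seq_len=3, hidden_size=2.
last_hidden_state = [[0.5, -1.5], [2.0, 0.1], [-0.3, 0.7]]
weight = [[1.0, 0.5], [-2.0, 1.0]]
bias = [0.0, 0.25]

pooled = pooler(last_hidden_state, weight, bias)
print(len(last_hidden_state), "tokens ->", len(pooled), "pooled values")
```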

In [11]:
## show other outputs

# After setting output_attentions to True, attentions is no longer None.

# So we can pass such arguments when loading the model to control the outputs.

model_nh = AutoModel.from_pretrained("google-bert/bert-base-uncased", output_attentions=True)
output = model_nh(input_ids)
output



BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.0414,  0.2263, -0.3680,  ..., -0.0351,  0.2934,  0.5573],
         [ 0.6679,  0.0198, -0.5826,  ..., -0.0465,  0.5988,  0.4897],
         [ 0.0205, -0.1866,  0.0394,  ...,  0.0380,  0.1209, -0.2524],
         ...,
         [ 0.6688, -0.5342,  0.1776,  ..., -0.1937,  0.1931,  0.0144],
         [ 0.5437,  0.2672, -0.4816,  ..., -0.0911, -0.2954, -0.7393],
         [ 0.7015,  0.1112,  0.1859,  ...,  0.3119, -0.4980, -0.4962]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9180, -0.4124, -0.8767,  0.7994,  0.6171, -0.1314,  0.8555,  0.2321,
         -0.8629, -0.9999, -0.5016,  0.9781,  0.9815,  0.5616,  0.9504, -0.7621,
         -0.3130, -0.6262,  0.1598, -0.5764,  0.7138,  0.9999, -0.0433,  0.3060,
          0.3983,  0.9946, -0.6577,  0.9428,  0.9665,  0.7421, -0.6475,  0.0761,
         -0.9916,  0.0264, -0.8849, -0.9890,  0.4022, -0.6485,  0.1285,  0.2126,
         -0.9041,  0.1580,  1.00
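
With output_attentions=True, output.attentions is a tuple with one tensor per layer (12 here), each of dim [batch, num_heads, seq_len, seq_len], so [1, 12, 10, 10] for this input. The sketch below shows, in pure Python and for a single head, how one such [seq_len, seq_len] matrix is obtained with scaled dot-product attention. The function name and the toy query/key values are assumptions for illustration.

```python
import math


def attention_weights(queries, keys):
    """Return a [seq_len][seq_len] matrix of softmax(Q K^T / sqrt(d)) weights."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        # Numerically stable softmax over the key positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy values (assumptions): 3 tokens, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

attn = attention_weights(q, k)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row describes how one token distributes its attention over all tokens, which is why every row sums to 1.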

### C. Model with Head

In [12]:
## load model

# import task specific model
from transformers import AutoModelForSequenceClassification

# load model
# By default, the number of classes (num_labels) is 2
model_cla = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

# model inference
out_cla = model_cla(input_ids)

out_cla

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=None, logits=tensor([[ 0.2068, -0.3478]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
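
The logits are raw scores. A minimal sketch of how they are usually turned into probabilities and a label is shown below, using the logits printed above and the default label names. Since the classifier head is newly initialized at this point, the resulting prediction carries no meaning yet.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.2068, -0.3478]  # copied from the output above
id2label = {0: "LABEL_0", 1: "LABEL_1"}  # default label names

probs = softmax(logits)
pred = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(id2label[pred], round(probs[pred], 3))
```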

In [13]:
## Model

# Compared to the model without a head (BertModel), the model with a head is called BertForSequenceClassification;
# it wraps the BertModel base model.
# This wrapped model has an extra layer at the end: Linear(in_features=768, out_features=2, bias=True).

model_cla

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [14]:
## change output classes

# load model
# We set the output classes to 10
model_cla = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=10)

# model inference
out_cla = model_cla(input_ids)

out_cla

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


SequenceClassifierOutput(loss=None, logits=tensor([[ 0.4765,  0.2972, -0.2578, -0.2620,  0.4579,  0.1838,  0.5107,  0.1503,
         -0.1623,  0.0592]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

## III. Training

Restart the kernel before running the cells below.

In [1]:
# select the GPU to use
# Since I have 2 GPUs and only want to use one, I need to run this.
# This cell should be run first

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
# 1) load dataset
#################

# This class does not tokenize the text.
# As shown in hf_transformers_basics_tokenizer.ipynb, tokenizing a whole batch at once is faster,
# so tokenizing item by item here would be less efficient.

from torch.utils.data import Dataset
from datasets import load_dataset

class MyDataset(Dataset):

    label2id = {"negative":0, "positive": 1}
    id2label = {0: "negative", 1: "positive"}

    def __init__(self, ckp):

        super().__init__()
        self.ckp = ckp
        data = load_dataset(ckp)
        self.data = {"text":[], "label": []}
        # there are 3 classes; keep only 2 of them: negative and positive
        for row in data["train"]:
            if row["division"] in MyDataset.label2id:
                self.data["text"].append(row["review"])
                # convert string label to numerical label
                self.data["label"].append(MyDataset.label2id[row["division"]])

    def __getitem__(self, index):

        return self.data["text"][index], self.data["label"][index]

    def __len__(self):

        return len(self.data["text"])


ckp_data = "davidberg/sentiment-reviews"   
data = MyDataset(ckp_data)


In [3]:
print(len(data))
for i in range(10):
    print(data[i])


3548
('able play youtube alexa', 1)
('able recognize indian accent really well drop function helpful call device talk person near device smart plug schedule work seamlessly con would sound kindloud but lack clarity mid frequency need tweeked optimum clarity rarely device doesnt respond call alexa', 1)
('absolute smart device amazon connect external sub woofer sound amaze recons voice even close room like almost collection songs english hindi must quite moneys worth', 1)
('absolutely amaze new member family control home voice connect home anywhere world', 1)
('absolutely amaze previously sceptical invest money but arrive worth ityou absolutely buy wont regret cheer', 1)
('absolutely cheat customer if buy amazon product definitely want buy amazon prime members also case if song want play absolutely need amazon prime membership otherwise can not play music app no google apps not work amazon alexa if anybody want amazon alexa go google home everything also free cost app', 1)
('absolutely h

In [4]:
# 2) split data
###############

from torch.utils.data import random_split

trainset, validset = random_split(data, lengths=(0.9, 0.1))

len(trainset), len(validset)

(3194, 354)
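
The sizes above can be reproduced by hand. The sketch below assumes that fractional lengths are resolved by flooring each share and then handing out the leftover items one by one, starting with the first split, which is my understanding of how random_split behaves; `split_lengths` is an illustrative helper, not a PyTorch function.

```python
import math


def split_lengths(n_items, fractions):
    """Turn fractions into integer split sizes that sum to n_items."""
    lengths = [math.floor(n_items * f) for f in fractions]
    # Distribute the items lost by flooring, round-robin from the first split.
    remainder = n_items - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths


print(split_lengths(3548, (0.9, 0.1)))  # [3194, 354]
```

The split itself is random; passing `generator=torch.Generator().manual_seed(...)` to random_split makes it reproducible across runs.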

In [5]:
# 3) tokenizer
##############

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

def collate_fct(batch):

    texts, labels = [], []

    for item in batch:
        texts.append(item[0])
        labels.append(item[1])
    toks = tokenizer(texts, max_length=128, truncation=True, padding="max_length", return_tensors="pt")
    toks["labels"] = torch.tensor(labels)
    return toks
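
To illustrate what max_length, truncation and padding="max_length" do to each text, here is a simplified pure-Python sketch. `pad_or_truncate` is a hypothetical helper; unlike the real tokenizer, it cuts the sequence without preserving the final [SEP] token, so it only illustrates the shapes and the attention mask.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = ids[:max_length]                 # truncation
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short sequence is padded with the pad token id (0 for this tokenizer).
print(pad_or_truncate([101, 2204, 4031, 102], max_length=6))

# A long sequence is cut down to max_length.
print(pad_or_truncate(list(range(10)), max_length=6))
```

Padding every batch to 128 tokens keeps the tensor shapes constant. Padding only to the longest sequence in the batch (padding="longest") is a common alternative that avoids computing on unnecessary padding.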

In [6]:
# 4) dataloader
###############

from torch.utils.data import DataLoader

trainloader = DataLoader(trainset, batch_size=32, shuffle=True, collate_fn=collate_fct)
validloader = DataLoader(validset, batch_size=32, shuffle=False, collate_fn=collate_fct)

In [7]:
next(iter(trainloader))

{'input_ids': tensor([[  101,  2525,  8224,  ...,     0,     0,     0],
        [  101,  4067,  2643,  ...,     0,     0,     0],
        [  101,  3452,  2204,  ...,     0,     0,     0],
        ...,
        [  101,  2131,  2197,  ...,     0,     0,     0],
        [  101, 21688,  2614,  ...,     0,     0,     0],
        [  101,  2204,  4031,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 1, 1, 1, 1, 1, 1, 1])}

In [8]:
# 5) load model
###############

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")

# send to GPU

if torch.cuda.is_available():
    model = model.cuda()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
# 6) define optimizer
#####################

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=2e-5)

In [10]:
# 7) evaluation
###############

def eval():

    acc_count = 0

    model.eval()

    # gradients are not needed for evaluation
    with torch.inference_mode():

        for batch in validloader:

            # if a GPU is available, send the data to it
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            output = model(**batch)

            pred = torch.argmax(output.logits, dim=-1)

            # count correct predictions
            acc_count += (pred.int() == batch["labels"].int()).sum()

    return acc_count / len(validset)


In [11]:
# 8) define Training
####################

def train(epoch=3, log_step=50):

    gStep = 0

    for e in range(epoch):

        model.train()

        for batch in trainloader:
            
            # if there is GPU, send the data to GPU
            if torch.cuda.is_available():
                batch = {k: v.to(model.device) for k, v in batch.items()}

            optimizer.zero_grad()

            output = model(**batch)

            output.loss.backward()

            optimizer.step()

            if gStep % log_step == 0:

                print(f"{e+1} / {epoch} - global step: {gStep}, loss: {output.loss.item()}")

            gStep += 1

        acc = eval()

        print(f"{e+1} / {epoch} - acc: {acc}")

In [12]:
# 9) train
##########

train()

1 / 3 - global step: 0, loss: 0.6653264760971069
1 / 3 - global step: 50, loss: 0.10414525121450424
1 / 3 - acc: 0.9519774317741394
2 / 3 - global step: 100, loss: 0.1194240152835846
2 / 3 - global step: 150, loss: 0.18372759222984314
2 / 3 - acc: 0.9519774317741394
3 / 3 - global step: 200, loss: 0.058997586369514465
3 / 3 - global step: 250, loss: 0.27312278747558594
3 / 3 - acc: 0.9548022747039795
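
Two quick sanity checks help to read this log. First, a two-class classifier that predicts both classes with equal probability has a cross-entropy loss of ln 2 ≈ 0.693, which is close to the loss at step 0 (the head is still untrained). Second, with 354 validation samples, the printed accuracies correspond to whole numbers of correct predictions. The sketch below is a pure-Python illustration; `cross_entropy` is a hand-written helper, not the PyTorch function.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[label]


# Equal logits mean equal probabilities, so the loss is ln 2.
uniform_loss = cross_entropy([0.0, 0.0], label=1)

# Recover the number of correct predictions from the logged accuracies.
n_valid = 354
correct_epoch_1 = round(0.9519774317741394 * n_valid)
correct_epoch_3 = round(0.9548022747039795 * n_valid)

print(uniform_loss, correct_epoch_1, correct_epoch_3)
```

So the improvement between the first and the last epoch corresponds to one additional correct prediction (337 versus 338 out of 354).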


In [14]:
# inference
###########

## manual

text = "I am not sure I like it."

with torch.inference_mode():

    inputs = tokenizer(text, return_tensors="pt")

    inputs = {k:v.to(model.device) for k, v in inputs.items()}

    output = model(**inputs)

    logits = output.logits

    pred = torch.argmax(logits, dim=-1)

    print(f"input: {text}, prediction: {MyDataset.id2label.get(pred.item())}")


input: I am not sure I like it., prediction: negative


In [15]:
## pipeline of transformers

from transformers import pipeline

# set id2label in the model config
# otherwise, the results use the default labels: LABEL_0/LABEL_1
model.config.id2label = MyDataset.id2label

# The pipeline needs a task name.
# In this case, the name is "text-classification".
# If we don't know the name, we can pass an arbitrary string, and an error message will show up.
# At the end of the error message, there is a list of all the task names we can choose from.

pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)



In [16]:
pipe(text)

[{'label': 'negative', 'score': 0.6834433674812317}]