<a href="https://colab.research.google.com/github/iamhasanhumane/Hugging_Face/blob/main/Chapter_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
!pip install transformers
!pip install huggingface_hub



## Inside the Pipeline Function

Let's have a look at what actually happens when we execute the following code:

In [4]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much"
])


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9516071081161499},
 {'label': 'NEGATIVE', 'score': 0.9995144605636597}]

There are three stages in the pipeline:


**Tokenizer** → **Model** → **Postprocessing**



---



*Tokenizer*

*   We convert the raw text to numbers the model can make sense of, using a tokenizer.

*   This course is amazing! → [101, 2023, 2607, 2003, 6429, 999, 102]


*Model*

*   These numbers go through the model, which outputs logits.
*   [101, 2023, 2607, 2003, 6429, 999, 102] → [-4.3630, 4.6859]


*Postprocessing*

*   The postprocessing step converts those logits into labels and scores.
*   [-4.3630, 4.6859] → [Positive: 99.99%, Negative: 0.01%]



### Stage 1 : Tokenization



1.   First, the text is split into small chunks called tokens. They can be words, parts of words, or punctuation symbols.
2.   Then the tokenizer adds some special tokens, such as [CLS] and [SEP], if the model expects them.
3.   Lastly, the tokenizer matches each token to its unique ID in the vocabulary of the pretrained model.
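
The three steps above can be sketched in plain Python. This is only a toy illustration, not the real DistilBERT tokenizer: `toy_vocab` and `toy_tokenize` are hypothetical names, the splitting rule is a simple regular expression, and the IDs are copied from the "This course is amazing!" example earlier in this notebook.

```python
import re

# Hypothetical mini-vocabulary; the IDs are copied from the example above.
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "this": 2023, "course": 2607, "is": 2003, "amazing": 6429, "!": 999,
}

def toy_tokenize(text):
    # Step 1: lowercase, then split into words and punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 2: add the special tokens the model expects.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 3: look up each token's ID in the vocabulary.
    ids = [toy_vocab[token] for token in tokens]
    return tokens, ids

tokens, ids = toy_tokenize("This course is amazing!")
print(tokens)
print(ids)
```

A real tokenizer also handles words that are missing from the vocabulary (for example by splitting them into subwords), which this sketch does not attempt.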



*The AutoTokenizer class can load the tokenizer for any checkpoint*

In [5]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much"
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   102,     0,     0,     0,
             0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}


### Stage 2 : Model



The AutoModel class loads a model without its pretraining head

In [6]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])


The AutoModel API will only instantiate the body of the model, i.e., the part of the model that is left once the pretraining head is removed.

It outputs a high-dimensional tensor that is a representation of the sentences passed: one vector of size 768 for each of the 15 token positions in each of the 2 sentences.

In [7]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

print(outputs.logits)

tensor([[-1.4683,  1.5105],
        [ 4.2141, -3.4158]], grad_fn=<AddmmBackward0>)


Each AutoModelForXxxx class loads a model suitable for a specific task
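
To make the difference between the model body and a task head more concrete, here is a rough pure-Python sketch. It assumes a head that is a single linear layer applied to the first token's hidden vector; the function name `classification_head`, the tiny hidden size of 4, and all numbers are made up for illustration, and real heads generally contain more than this.

```python
def classification_head(hidden_states, weights, bias):
    """Map hidden states [batch, seq_len, hidden] to logits [batch, num_labels]."""
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]  # vector at the first position of the sequence
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)
        ])
    return logits

# Made-up "body" output: batch of 2 sentences, 3 tokens each, hidden size 4.
hidden_states = [
    [[1.0, 0.0, 2.0, 1.0], [0.2, 0.1, 0.0, 0.3], [0.5, 0.5, 0.5, 0.5]],
    [[0.0, 1.0, 0.0, 1.0], [0.9, 0.8, 0.7, 0.6], [0.1, 0.1, 0.1, 0.1]],
]
# Made-up head parameters: 2 labels, hidden size 4.
weights = [[0.5, -1.0, 0.0, 1.0], [-0.5, 1.0, 0.5, 0.0]]
bias = [0.1, -0.1]

toy_logits = classification_head(hidden_states, weights, bias)
print(toy_logits)  # one row of 2 logits per sentence
```

The shape change is the point: the body returns one vector per token, while the head reduces each sentence to one score per label.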

### Stage 3 : Post Processing

In [8]:
import torch

In [9]:
torch.set_printoptions(precision=4, sci_mode=False)

To go from logits to probabilities, we apply a softmax layer.

In [10]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits , dim = 1)
print(predictions)

tensor([[    0.0484,     0.9516],
        [    0.9995,     0.0005]], grad_fn=<SoftmaxBackward0>)


In [11]:
predicted_labels = torch.argmax(predictions, dim = 1)
predicted_labels

tensor([1, 0])

In [12]:
sentiment_labels = ["positive" if label == 1 else "negative" for label in predicted_labels]
sentiment_labels

['positive', 'negative']
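
For reference, the whole postprocessing stage can be reproduced without PyTorch. The sketch below applies a hand-written softmax to the logits printed above (rounded to four decimals, so the scores match only approximately); the `id2label` mapping is an assumption that index 0 means NEGATIVE and index 1 means POSITIVE, which is consistent with the pipeline output at the top of this notebook.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the AutoModelForSequenceClassification output above.
logits = [[-1.4683, 1.5105], [4.2141, -3.4158]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label order

results = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    results.append({"label": id2label[best], "score": probs[best]})

print(results)
```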

## Instantiating a Transformer Model

The AutoModel API allows you to instantiate a pretrained model from any checkpoint

In [13]:
from transformers import AutoModel

bert_model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
print(type(bert_model))

gpt_model = AutoModel.from_pretrained("gpt2")
print(type(gpt_model))

bart_model = AutoModel.from_pretrained("facebook/bart-base")
print(type(bart_model))

<class 'transformers.models.bert.modeling_bert.BertModel'>


<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>


<class 'transformers.models.bart.modeling_bart.BartModel'>


The AutoConfig API allows you to instantiate the configuration of a pretrained model from any checkpoint

In [14]:
from transformers import AutoConfig

bert_config = AutoConfig.from_pretrained("google-bert/bert-base-uncased")
print(type(bert_config))

gpt_config = AutoConfig.from_pretrained("gpt2")
print(type(gpt_config))

bart_config = AutoConfig.from_pretrained("facebook/bart-base")
print(type(bart_config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
<class 'transformers.models.bart.configuration_bart.BartConfig'>


We can also use the specific configuration class corresponding to the checkpoint.

#### BertConfig

In [15]:
from transformers import BertConfig

bert_config = BertConfig.from_pretrained("google-bert/bert-base-cased")
print(bert_config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.48.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



#### GPT2Config

In [16]:
from transformers import GPT2Config

gpt2_config = GPT2Config.from_pretrained("gpt2")
print(gpt2_config)

GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.48.3",
  "use_cache": true,
  "vocab_size": 50257
}



#### BartConfig

In [17]:
from transformers import BartConfig

bart_config = BartConfig.from_pretrained("facebook/bart-base")
print(bart_config)

BartConfig {
  "_name_or_path": "bart-base",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bar

### The Configuration Class contains all the information needed to load the model.

In [18]:
print(type(bert_config))
print(type(gpt2_config))
print(type(bart_config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>
<class 'transformers.models.gpt2.configuration_gpt2.GPT2Config'>
<class 'transformers.models.bart.configuration_bart.BartConfig'>




---



We can instantiate a given model with random weights from this config

In [19]:
from transformers import BertConfig , BertModel

bert_config = BertConfig.from_pretrained("google-bert/bert-base-uncased")
bert_model = BertModel(bert_config)

Using only 10 layers instead of 12

In [20]:
from transformers import BertConfig , BertModel

bert_config = BertConfig.from_pretrained("google-bert/bert-base-uncased",
                                         num_hidden_layers = 10)
bert_model = BertModel(bert_config)
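
To get a feel for what changing `num_hidden_layers` does, the parameter count can be estimated from the configuration values alone. The sketch below is a back-of-the-envelope calculation that assumes the standard BertModel layout (embeddings, identical encoder layers, and a pooler); `bert_parameter_count` is a hypothetical helper, and the numbers are copied from the BertConfig printout for bert-base-cased shown earlier.

```python
def bert_parameter_count(vocab_size, hidden_size, intermediate_size,
                         max_position_embeddings, type_vocab_size,
                         num_hidden_layers):
    layer_norm = 2 * hidden_size  # weight + bias
    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = ((vocab_size + max_position_embeddings + type_vocab_size)
                  * hidden_size + layer_norm)
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two feed-forward projections (weights + biases), plus LayerNorm.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size
                    + layer_norm)
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler

# Values copied from the bert-base-cased configuration printed above.
config_values = dict(vocab_size=28996, hidden_size=768, intermediate_size=3072,
                     max_position_embeddings=512, type_vocab_size=2)

full = bert_parameter_count(num_hidden_layers=12, **config_values)
smaller = bert_parameter_count(num_hidden_layers=10, **config_values)
print(f"12 layers: {full:,} parameters")
print(f"10 layers: {smaller:,} parameters")
```

Under these assumptions, each removed layer saves about 7 million parameters, while the embeddings and pooler are unaffected.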

### Saving the Pretrained Model

To save a model, we just have to use the save_pretrained method.

In [21]:
# Save the model inside a folder named my-bert-model in the current working directory
bert_model.save_pretrained("my-bert-model")

### Reloading from Local Directory

To reload a saved model or its configuration, we can use the from_pretrained method with the path of the local folder.

In [22]:
from transformers import AutoConfig

loaded_model_config = AutoConfig.from_pretrained("my-bert-model")

print(type(loaded_model_config))


<class 'transformers.models.bert.configuration_bert.BertConfig'>


Instantiating a GPT-2 model from its configuration (the weights are randomly initialized, not pretrained) and saving it to a local directory

In [23]:
from transformers import GPT2Model , GPT2Config

gpt_config = GPT2Config.from_pretrained("gpt2")
gpt_model = GPT2Model(gpt_config)

gpt_model.save_pretrained("my-gpt-model")

Loading the saved model from our local directory


In [24]:
from transformers import AutoModel

loaded_gpt = AutoModel.from_pretrained("my-gpt-model")
print(type(loaded_gpt))

<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>


## Overview of Tokenizers

In Natural Language Processing, most of the data that we handle consists of raw text. However, machine learning models cannot read and understand text in its raw form; they can only work with numbers. **The tokenizer's objective is to translate the text into numbers.**

We'll take a look at three different tokenization algorithms:


1.   Word-based
2.   Character-based
3.   Subword-based



### Word Based Tokenizers

**Word-based tokenization** is the idea of splitting the raw text into words, by splitting on spaces or by other specific rules such as punctuation.


Let's do tokenization! → [Let's, do, tokenization!]

In this algorithm, each word is assigned a specific number, an "ID".

The model has representations that are based on entire words.


**Limits**



*   Very similar words (for example "dog" and "dogs") get entirely unrelated IDs, so the model initially sees no connection between them.
*   The vocabulary can end up very large.
*   Large vocabularies result in heavy models.
*   Out-of-vocabulary words result in a loss of information.
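
A minimal sketch of a word-based tokenizer makes two of these limits visible. Everything here is illustrative: `build_vocab` and `encode` are hypothetical helpers, the corpus is made up, and splitting is done on whitespace only.

```python
def build_vocab(corpus):
    # Reserve ID 0 for unknown words, then number words in order of appearance.
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    # Words missing from the vocabulary all collapse to the [UNK] ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

corpus = ["the dog runs", "the dogs run"]
vocab = build_vocab(corpus)

print(vocab)                          # "dog" and "dogs" get unrelated IDs
print(encode("the dog runs", vocab))
print(encode("the cat runs", vocab))  # "cat" is out of vocabulary
```

Note how "cat" becomes the same ID as any other unseen word, which is the loss of information mentioned above.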



### Character Based Tokenization

**Character-based tokenization** splits the input text into individual characters rather than words. Vocabularies are slimmer: with a character-based vocabulary, we can get by with only 256 characters!

Character-based vocabularies contain far fewer distinct tokens than word-based vocabularies.

Let's do tokenization! → ['L', 'e', 't', "'", 's', ' ', 'd', 'o', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', '!']

**Limits**



*   Characters do not hold as much information individually as a word would.
*   Sequences are translated into a very large number of tokens to be processed by the model.
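
The trade-off can be checked directly on the example sentence. This is a small sketch using only built-in Python; the variable names are arbitrary.

```python
text = "Let's do tokenization!"

char_tokens = list(text)       # one token per character, spaces included
word_tokens = text.split()     # whitespace-separated words, for comparison
char_vocab = sorted(set(text)) # distinct characters needed for this sentence

print(len(word_tokens), "word tokens")
print(len(char_tokens), "character tokens")
print(len(char_vocab), "distinct characters:", char_vocab)
```

The vocabulary stays tiny, but the sequence the model has to process is several times longer than with word-based splitting.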



### Subword Based Tokenization

**Subword tokenization** lies between character-based and word-based algorithms. These algorithms rely on the following principles:

*   *Frequently used words should not be split into smaller subwords.*
*   *Rare words should be decomposed into meaningful subwords.*

DOG → DOG

DOGS → DOG + S

TOKENIZATION → TOKEN + IZATION

Most models obtaining state-of-the-art results in English today use some kind of subword-tokenization algorithm.



*   WordPiece (BERT, DistilBERT)
*   Unigram (XLNet, ALBERT)
*   Byte-Pair Encoding (GPT-2, RoBERTa)
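
As a rough illustration of how a word is decomposed into subwords, the sketch below uses greedy longest-match-first splitting in the style of WordPiece, where continuation pieces are prefixed with `##`. It is a simplification under stated assumptions: the vocabulary `toy_vocab` is made up, the function name `wordpiece_split` is hypothetical, and learning the vocabulary from data (the hard part of these algorithms) is not shown.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Made-up vocabulary for illustration only.
toy_vocab = {"dog", "##s", "token", "##ization"}

for word in ["dog", "dogs", "tokenization", "cat"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

The frequent word "dog" stays whole, while "dogs" and "tokenization" reuse pieces that also appear in other words.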




## The Tokenization Pipeline

The first step of the pipeline is to split the text into tokens

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokens = tokenizer.tokenize("Let's try to tokenize!")

print(tokens)

['let', "'", 's', 'try', 'to', 'token', '##ize', '!']


In [26]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")

tokens = tokenizer.tokenize("Let's try to tokenize!")

print(tokens)

['▁let', "'", 's', '▁try', '▁to', '▁to', 'ken', 'ize', '!']


The second step of the tokenization pipeline is to map those tokens to their respective IDs as defined by the vocabulary of the tokenizer.

In [27]:
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

tokens = bert_tokenizer.tokenize("Let's try to tokenize!")

input_ids = bert_tokenizer.convert_tokens_to_ids(tokens)

print(input_ids)


[2292, 1005, 1055, 3046, 2000, 19204, 4697, 999]


Lastly, the tokenizer adds the special tokens the model expects.

In [28]:
final_inputs = bert_tokenizer.prepare_for_model(input_ids)

print(final_inputs['input_ids'])

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


[101, 2292, 1005, 1055, 3046, 2000, 19204, 4697, 999, 102]


We can look at the special tokens and more generally how the tokenizer has changed our text by using the decode method on the outputs of the tokenizer object.

In [29]:
bert_tokenizer.decode(final_inputs['input_ids'])

"[CLS] let ' s try to tokenize! [SEP]"

The decode method allows us to check how the final output of the tokenizer translates back into text.

These special tokens vary depending on the tokenizer we are using. The BERT tokenizer uses [CLS] and [SEP], but the RoBERTa tokenizer uses HTML-like tags, `<s>` and `</s>`.

In [30]:
from transformers import AutoTokenizer

roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

inputs = roberta_tokenizer("Let's try to tokenize!")

print(roberta_tokenizer.decode(inputs["input_ids"]))

<s>Let's try to tokenize!</s>


In summary, a tokenizer takes text as input and outputs numbers the associated model can make sense of.

## Batching Inputs Together

In [31]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis",
                      model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english")

classifier([
    "I've been waiting for a HuggingFace course my whole life",
    "I hate this so much"
])

Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9516071081161499},
 {'label': 'NEGATIVE', 'score': 0.9995144605636597}]

Sentences we want to group inside a batch will often have different lengths.

In [69]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
distillbert_tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this."
]

tokens = [distillbert_tokenizer.tokenize(sentence) for sentence in sentences]
ids = [distillbert_tokenizer.convert_tokens_to_ids(token) for token in tokens]

In [70]:
print(ids[0])
print(ids[1])

[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
[1045, 5223, 2023, 1012]


In [83]:
# Keep an unpadded copy of the second sentence; ids[1] itself is padded in place later on
ids_2 = list(ids[1])

In [61]:
torch.tensor(ids)

ValueError: expected sequence of length 14 at dim 1 (got 4)

Here we can see that we can't build a tensor with lists of different lengths.

One way to overcome this issue is to pad the smaller sentences to the length of the longest one.
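
Padding itself is a simple list operation, sketched below in plain Python. `pad_batch` is a hypothetical helper, and `pad_id=0` is an assumption that matches the padding value visible in the outputs of this notebook; other tokenizers may use a different padding ID.

```python
def pad_batch(batch, pad_id=0):
    # Extend every sequence with pad_id until it matches the longest one.
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Token IDs copied from the two sentences above.
batch = [
    [1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012],
    [1045, 5223, 2023, 1012],
]

padded = pad_batch(batch)
for row in padded:
    print(len(row), row)
```

Because the helper builds new lists, the original `batch` is left untouched and the unpadded sequences remain available.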

In [72]:
ids

[[1045,
  1005,
  2310,
  2042,
  3403,
  2005,
  1037,
  17662,
  12172,
  2607,
  2026,
  2878,
  2166,
  1012],
 [1045, 5223, 2023, 1012]]

Here we convert each sentence's input IDs to a tensor with a batch dimension, so that each sentence can be passed to the model on its own.

In [74]:
ids1 = torch.tensor(ids[0])

ids1 = ids1.unsqueeze(0)

ids1

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012]])

In [75]:
ids2 = torch.tensor(ids_2)

ids2 = ids2.unsqueeze(0)

ids2

tensor([[1045, 5223, 2023, 1012]])

In [77]:
for _ in range(len(ids[0]) - len(ids[1])):
  ids[1].append(distillbert_tokenizer.pad_token_id)

In [78]:
all_ids = torch.tensor(ids)
all_ids

tensor([[ 1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012],
        [ 1045,  5223,  2023,  1012,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]])

In [79]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [80]:
print(model(ids1).logits)

tensor([[-2.7276,  2.8789]], grad_fn=<AddmmBackward0>)


In [86]:
print(model(ids2).logits)

tensor([[ 3.9497, -3.1357]], grad_fn=<AddmmBackward0>)


In [48]:
print(model(all_ids).logits)

tensor([[-2.7276,  2.8789],
        [ 1.5444, -1.3998]], grad_fn=<AddmmBackward0>)


Here we see that the logits for the second sentence in all_ids differ from the ones we got when passing that sentence alone. This is because the attention layers include the padding tokens in the context they look at for each token in the sentence. To tell the attention layers to ignore the padding tokens, we need to pass them an attention mask.

In [87]:
# 1 for real tokens, 0 for padding tokens
attention_mask = torch.tensor([[int(token_id != distillbert_tokenizer.pad_token_id) for token_id in sentence] for sentence in ids])

In [88]:
attention_mask

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [89]:
output = model(all_ids , attention_mask = attention_mask)
print(output.logits)

tensor([[-2.7276,  2.8789],
        [ 3.9497, -3.1357]], grad_fn=<AddmmBackward0>)


With the proper attention mask, predictions are the same for a given sentence, with or without padding.
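
Why the mask restores the original result can be seen in a heavily simplified stand-in for attention: a softmax-weighted average over token values. This is not DistilBERT's actual attention implementation; the `attend` function, the scores, and the values are all made up, and the mask is applied by setting masked scores to negative infinity so that their softmax weight becomes zero.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values, mask=None):
    """Softmax-weighted average of values; masked positions get zero weight."""
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

# Made-up attention scores and values for a 4-token sentence.
scores = [0.5, 1.0, -0.3, 0.2]
values = [1.0, 2.0, 3.0, 4.0]
unpadded = attend(scores, values)

# The same sentence padded to length 6; padding positions carry arbitrary numbers.
pad_scores = scores + [0.7, 0.7]
pad_values = values + [9.0, 9.0]
mask = [1, 1, 1, 1, 0, 0]

padded_no_mask = attend(pad_scores, pad_values)
padded_masked = attend(pad_scores, pad_values, mask)

print("unpadded:          ", unpadded)
print("padded, no mask:   ", padded_no_mask)
print("padded, with mask: ", padded_masked)
```

Without the mask, the padding positions receive part of the attention weight and shift the result; with the mask, the padded and unpadded results agree.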

Using padding=True, the tokenizer can directly prepare the inputs with padding and the proper attention mask.