
# Behind the pipeline

### Pipeline recap

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## This pipeline groups together three steps
1. Preprocessing
2. Passing the inputs through the model
3. Postprocessing

## Preprocessing with a tokenizer

We use a *tokenizer*, which will be responsible for:

- Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
- Mapping each token to an integer
- Adding additional inputs that may be useful to the model
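
The three responsibilities listed above can be sketched in plain Python. This is only an illustration under loud assumptions: `VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary is made up, and the splitting rule is far cruder than what a real tokenizer does.

```python
# Hypothetical toy vocabulary; a real checkpoint ships its own vocabulary file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def toy_tokenize(text):
    """Step 1: split lowercased text into word and punctuation tokens."""
    tokens = []
    for word in text.lower().split():
        # Peel trailing punctuation off into separate tokens.
        trailing = []
        while word and word[-1] in "!?.,":
            trailing.append(word[-1])
            word = word[:-1]
        if word:
            tokens.append(word)
        tokens.extend(reversed(trailing))
    return tokens


def toy_encode(text):
    """Steps 2 and 3: map tokens to integers and add the extra inputs."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    input_ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    attention_mask = [1] * len(input_ids)  # every position is a real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode("I hate this so much!")
print(encoded)
```

The returned dictionary has the same two keys that the real tokenizer output contains for this checkpoint, which is why it can later be unpacked straight into the model call.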

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

We use the `AutoTokenizer` class and its `from_pretrained()` method so that all the preprocessing is done exactly the same way as when the model was pretrained.

In [None]:
raw_inputs = [
    "I'm excited to lean this course from Hugging Face",
    "I hate that i came across this so late."
]
inputs = tokenizer(raw_inputs , padding=True, truncation=True, return_tensors='pt')
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  1049,  7568,  2000,  8155,  2023,  2607,  2013,
         17662,  2227,   102],
        [  101,  1045,  5223,  2008,  1045,  2234,  2408,  2023,  2061,  2397,
          1012,   102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}


## Going through the model

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

The `AutoModel` class is used to download the model from the Hugging Face Hub.

This architecture contains only the base Transformer module: given some inputs, it outputs what we’ll call *hidden states*, also known as *features*.

While these *hidden states* can be useful on their own, they’re usually inputs to another part of the model, known as the *head*.

### High-dimensional vector
The vector output by the Transformer module is usually large. It generally has three dimensions:

- Batch size: The number of sequences processed at a time (2 in our example).
- Sequence length: The length of the numerical representation of the sequence (13 in our example).
- Hidden size: The vector dimension of each model input.
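
To make the three dimensions concrete, here is a small sketch that builds a nested list with the same layout and reads its shape back. `shape_of` is a hypothetical helper, and the zeros are placeholders rather than real hidden states; the sizes are the ones from our example (768 is the hidden size this checkpoint reports).

```python
def shape_of(nested):
    """Return the shape of a regular (non-ragged) nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:
            break
        nested = nested[0]
    return tuple(shape)


batch_size, sequence_length, hidden_size = 2, 13, 768

# Placeholder zeros standing in for real hidden states.
hidden_states = [
    [[0.0] * hidden_size for _ in range(sequence_length)]
    for _ in range(batch_size)
]

print(shape_of(hidden_states))      # (2, 13, 768)
print(len(hidden_states[0][0]))     # 768 numbers describe a single token
```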

In [None]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 13, 768])


### Model heads: Making sense out of numbers

The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers.
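
As a rough sketch of what "projecting onto a different dimension" means, the snippet below applies one linear layer to a single 768-dimensional vector and gets two numbers out. The weights are random stand-ins (a real head uses trained weights and may stack more than one layer), and `linear` is a hypothetical helper, not a library function.

```python
import random


def linear(vector, weights, bias):
    """Compute y = W x + b for one input vector (W has one row per output)."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


hidden_size, num_labels = 768, 2
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Random stand-ins for one hidden-state vector and for the head's parameters.
hidden_vector = [rng.uniform(-1, 1) for _ in range(hidden_size)]
weights = [
    [rng.uniform(-0.02, 0.02) for _ in range(hidden_size)]
    for _ in range(num_labels)
]
bias = [0.0] * num_labels

logits = linear(hidden_vector, weights, bias)
print(len(hidden_vector), "->", len(logits))
```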

We will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won’t actually use the `AutoModel` class, but `AutoModelForSequenceClassification`:

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [None]:
print(outputs.logits.shape)

torch.Size([2, 2])


## Postprocessing the output

In [None]:
print(outputs.logits)

tensor([[-3.8919,  4.1693],
        [ 3.5199, -2.9280]], grad_fn=<AddmmBackward0>)


These are not probabilities but *logits*, the raw, unnormalized scores output by the last layer of the model. To convert them to probabilities, they need to go through a SoftMax layer:

In [None]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=1)
print(predictions)

tensor([[3.1542e-04, 9.9968e-01],
        [9.9842e-01, 1.5814e-03]], grad_fn=<SoftmaxBackward0>)
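
To see what the softmax step does numerically, the same computation can be written by hand. This is a plain-Python sketch rather than the PyTorch implementation; the logits are copied (rounded to four decimals) from the output printed earlier, so the probabilities should agree with the tensor above up to rounding.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    peak = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the printed model output.
logits = [[-3.8919, 4.1693], [3.5199, -2.9280]]

probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print(row, "sum =", sum(row))
```

Each row sums to 1, and the larger logit in a row always receives the larger probability.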


To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config (more on this in the next section):

In [None]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
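
Putting the last two outputs together gives the kind of result the pipeline returned at the start. The sketch below is one possible way to do it in plain Python: `to_labels` is a hypothetical helper, and the probabilities are copied from the softmax output printed earlier.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the printed softmax output.
probabilities = [[3.1542e-04, 9.9968e-01], [9.9842e-01, 1.5814e-03]]


def to_labels(rows, id2label):
    """Pick the most likely class in each row and attach its label."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(probabilities, id2label)
for result in results:
    print(result)
```

The first sentence comes out as `POSITIVE` and the second as `NEGATIVE`, each with the score of its winning class.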

# Models

## Creating a Transformer

In [None]:
from transformers import BertConfig, BertModel

# build the config
config = BertConfig()

# Build the model from config
model = BertModel(config)

In [None]:
print(config)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.40.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}



## Different Loading methods

In [None]:
from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

In the case above, the model is randomly initialized, so it needs to be trained before it can produce useful output.

In [None]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

In the code sample above we didn’t use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of BERT themselves.

## Saving Methods

In [None]:
model.save_pretrained("/content/02_hf_t")

In [None]:
%%bash
ls "/content/02_hf_t"

config.json
model.safetensors


## Using a transformer model for inference
Tokenizers can take care of casting the inputs to the appropriate framework’s tensors.

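The cells below show what a tokenizer does under the hood: each sequence is converted to a list of vocabulary indices, and the list of lists is then turned into a tensor.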

In [None]:
sequences = ["Hello!", "Cool.", "Nice!"]
encoded_sequences = [
    [101, 7592, 999, 102],
    [101, 4658, 1012, 102],
    [101, 3835, 999, 102],
]

In [None]:
import torch

model_inputs = torch.tensor(encoded_sequences)

In [None]:
output = model(model_inputs)

In [None]:
output["last_hidden_state"].shape

torch.Size([3, 4, 768])

# Tokenizers

## Loading and saving

There are two ways to load a tokenizer. Using the model-specific `BertTokenizer` class:

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Using the generic `AutoTokenizer` class:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [None]:
# using the tokenizer
tokenizer("Soon you would have forgeten everything.")

{'input_ids': [101, 5398, 1128, 1156, 1138, 5042, 1424, 1917, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Saving a tokenizer:

In [None]:
tokenizer.save_pretrained("/content/02_hf_t")

('/content/02_hf_t/tokenizer_config.json',
 '/content/02_hf_t/special_tokens_map.json',
 '/content/02_hf_t/vocab.txt',
 '/content/02_hf_t/added_tokens.json',
 '/content/02_hf_t/tokenizer.json')

## Encoding
Translating text to numbers is known as encoding.

Encoding works in 2 steps:
1. text to tokens (tokenization)
2. tokens to numbers

### Tokenization

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "All people suffer, but not all people pitty themself."
tokens = tokenizer.tokenize(sequence)

print(tokens)

['All', 'people', 'suffer', ',', 'but', 'not', 'all', 'people', 'pit', '##ty', 'them', '##self', '.']
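
The `##` prefix in the output marks a piece that continues the previous word. A simplified sketch of the greedy longest-match-first splitting that produces such pieces is shown below. It assumes a tiny made-up vocabulary (`toy_vocab`) and a hypothetical `wordpiece` helper; the real tokenizer also handles normalization, punctuation, and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, just large enough for the example words.
toy_vocab = {"people", "pit", "##ty", "them", "##self"}

for word in ["people", "pitty", "themself"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Words that are in the vocabulary stay whole, while rarer words such as "pitty" are broken into pieces the vocabulary does contain.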


### Tokens to numbers

In [None]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1398, 1234, 8813, 117, 1133, 1136, 1155, 1234, 7172, 2340, 1172, 19303, 119]


## Decoding
Decoding goes the other way: from vocabulary indices back to a string.

In [None]:
decoded_str = tokenizer.decode(ids)
print(decoded_str)

All people suffer, but not all people pitty themself.


The decoder not only converts the indices back to tokens, but also groups together the tokens that were part of the same word to produce a readable sentence.
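
The grouping step can be approximated in a few lines of plain Python. This is a rough sketch, not the library's decoder: `merge_wordpieces` is a hypothetical helper that glues `##` pieces and a handful of punctuation marks onto the preceding word, whereas the real cleanup rules are more involved.

```python
def merge_wordpieces(tokens):
    """Join tokens into a sentence, attaching '##' pieces and punctuation."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]      # continuation of the previous word
        elif token in {",", ".", "!", "?"} and words:
            words[-1] += token          # no space before punctuation
        else:
            words.append(token)
    return " ".join(words)


# Tokens copied from the tokenization output above.
tokens = ['All', 'people', 'suffer', ',', 'but', 'not', 'all', 'people',
          'pit', '##ty', 'them', '##self', '.']

print(merge_wordpieces(tokens))
```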

# Handling multiple sequences

## Models expect a batch of inputs

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = "Find a way! I entrusted everything to you! My pride, my promise, EVERYTHING! I won't tolerate failure! "

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)


In [None]:
input_ids = torch.tensor(ids)
# This call fails with an IndexError.
# The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect a batch of sequences (a 2D tensor) by default.
model(input_ids)

IndexError: too many indices for tensor of dimension 1

Adding a batch dimension by wrapping `ids` in a list fixes this:

In [None]:
input_ids = torch.tensor([ids])

outputs = model(input_ids)

print(f"Input Id : {input_ids}")
print(f"Logits : {outputs.logits}")

Input Id : tensor([[ 2424,  1037,  2126,   999,  1045, 18011,  2673,  2000,  2017,   999,
          2026,  6620,  1010,  2026,  4872,  1010,  2673,   999,  1045,  2180,
          1005,  1056, 19242,  4945,   999]])
Logits : tensor([[-3.4117,  3.6325]], grad_fn=<AddmmBackward0>)


*Batching* is the act of sending multiple sentences through the model at once.

If you only have one sentence, you can still build a batch, here by repeating the same sequence twice:

In [None]:
batched_ids = [ids, ids]
batch_tensor = torch.tensor(batched_ids)

batch_outputs = model(batch_tensor)

In [None]:
batch_outputs.logits

tensor([[-3.4117,  3.6325],
        [-3.4117,  3.6325]], grad_fn=<AddmmBackward0>)

NOTE: both rows have identical logits, and they match the single-sequence result above.

Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There’s a second issue, though. When you’re trying to batch together two (or more) sentences, they might be of different lengths.

### Padding the inputs

In [2]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

seq1_id = [[2023, 2003, 7367, 4160, 1015]]
seq2_id = [[1998, 2023, 2028, 2003, 5537, 1016]]

batched_ids = [
    [2023, 2003, 7367, 4160, 1015, tokenizer.pad_token_id],
    [1998, 2023, 2028, 2003, 5537, 1016]
]

In [3]:
print(model(torch.tensor(seq1_id)).logits)
print(model(torch.tensor(seq2_id)).logits)
print(model(torch.tensor(batched_ids)).logits)

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


tensor([[ 2.0993, -1.7500]], grad_fn=<AddmmBackward0>)
tensor([[ 2.7036, -2.2732]], grad_fn=<AddmmBackward0>)
tensor([[ 2.3905, -1.9832],
        [ 2.7036, -2.2732]], grad_fn=<AddmmBackward0>)


### Attention Masks

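Notice that the first row of the batched output differs from the logits of the first sequence on its own: without a mask, the attention layers also attend to the padding token. Attention masks fix this. They are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).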

In [4]:
batched_ids = [
    [2023, 2003, 7367, 4160, 1015, tokenizer.pad_token_id],
    [1998, 2023, 2028, 2003, 5537, 1016]
]

attention_mask = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1]
]

outputs = model(torch.tensor(batched_ids), attention_mask = torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 2.0993, -1.7500],
        [ 2.7036, -2.2732]], grad_fn=<AddmmBackward0>)
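
The padded batch and the mask above were written by hand. A small helper can build both at once; the sketch below is a plain-Python illustration in which `pad_batch` is a hypothetical name and the padding ID is assumed to be 0, matching the padded `input_ids` printed earlier for this checkpoint.

```python
def pad_batch(sequences, pad_token_id=0):
    """Pad every sequence to the longest one and build matching masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 = ignore
    return input_ids, attention_mask


padded_ids, mask = pad_batch([
    [2023, 2003, 7367, 4160, 1015],
    [1998, 2023, 2028, 2003, 5537, 1016],
])
print(padded_ids)
print(mask)
```

In practice the tokenizer does this for you when called with `padding=True`, as in the first section.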


### Longer Sequences

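Transformer models can only handle sequences up to a maximum length (the BERT config printed earlier, for instance, has `max_position_embeddings` set to 512), and they fail on longer inputs. There are two solutions to this problem: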

- Use a model with a longer supported sequence length.
- Truncate your sequences.

In [None]:
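max_sequence_length = tokenizer.model_max_length  # longest input the tokenizer reports for this checkpoint
ids = ids[:max_sequence_length]  # truncate the list of token IDs, not the raw string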
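
For a whole batch, the same slicing is applied to every sequence. The sketch below uses a hypothetical `truncate_batch` helper and made-up ID lists; it only cuts from the right, which is one possible choice rather than the only one.

```python
def truncate_batch(sequences, max_length):
    """Keep at most max_length leading IDs of every sequence."""
    return [list(seq[:max_length]) for seq in sequences]


# Made-up ID lists: one longer than the limit, one shorter.
batch = [[11, 12, 13, 14, 15], [21, 22]]

truncated = truncate_batch(batch, max_length=3)
print(truncated)  # [[11, 12, 13], [21, 22]]
```

Sequences that are already short enough are left unchanged, so truncation and padding can be combined without conflict.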

# Putting it all together

In [6]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [8]:
sequence1 = "Friday it is !"
sequence2 = ["Friday it was.", "Saturday it is."]

model_input = tokenizer(sequence1)
model_input2 = tokenizer(sequence2)

Padding

In [10]:
# With padding

# "longest" = pads every sequence up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequence2, padding="longest")

# "max_length" = pads every sequence up to the model max length
model_inputs = tokenizer(sequence2, padding="max_length")

# pads every sequence up to the specified max length
model_inputs = tokenizer(sequence2, padding="max_length", max_length=8)


Truncate

In [12]:
# truncates the sequences that are longer than the model max length
model_inputs = tokenizer(sequence2, truncation=True)

# truncates the sequences that are longer than the specified max length
model_inputs = tokenizer(sequence2, max_length=8, truncation=True)

The `tokenizer` object can handle the conversion to specific framework tensors, which can then be directly sent to the model. The `return_tensors` argument selects the framework ("pt" returns PyTorch tensors, "tf" returns TensorFlow tensors, and "np" returns NumPy arrays):

In [14]:
# Returns PyTorch tensors
model_inputs = tokenizer(sequence2, padding=True, return_tensors='pt')

# Returns TensorFlow tensors
model_input_tf = tokenizer(sequence2, padding=True, return_tensors='tf')

# Returns NumPy arrays
model_inputs_np = tokenizer(sequence2, padding=True, return_tensors='np')

## Special tokens

In [16]:
sequence = "Keep making me dance waving my hands."

model_inputs = tokenizer(sequence)
print(model_inputs["input_ids"])

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[101, 2562, 2437, 2033, 3153, 12015, 2026, 2398, 1012, 102]
[2562, 2437, 2033, 3153, 12015, 2026, 2398, 1012]


One token ID was added at the beginning, and one at the end. Let’s decode the two sequences of IDs above to see what this is about:

In [17]:
print(tokenizer.decode(model_inputs['input_ids']))
print(tokenizer.decode(ids))

[CLS] keep making me dance waving my hands. [SEP]
keep making me dance waving my hands.


The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well.
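
The difference between the two ID lists printed earlier can be reproduced by hand. In this sketch, `add_special_tokens` is a hypothetical helper, and the IDs 101 and 102 are simply the values observed in the output above for `[CLS]` and `[SEP]` with this checkpoint; other checkpoints may use different special tokens or IDs.

```python
CLS_ID, SEP_ID = 101, 102  # values observed in the printed input_ids


def add_special_tokens(ids, cls_id=CLS_ID, sep_id=SEP_ID):
    """Wrap a list of token IDs with the start and end special tokens."""
    return [cls_id] + list(ids) + [sep_id]


# IDs copied from the convert_tokens_to_ids output above.
ids = [2562, 2437, 2033, 3153, 12015, 2026, 2398, 1012]

with_specials = add_special_tokens(ids)
print(with_specials)
```

The result matches the `input_ids` returned by calling the tokenizer directly, which is why the direct call is the safer way to prepare model inputs.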

## WRAPUP: Tokenizer-to-model

In [19]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequence = ["Believe you can and you're halfway there.", "Just say One more try."]

tokens = tokenizer(sequence, padding=True, truncation=True, return_tensors='pt')
outputs = model(**tokens)

In [20]:
print(outputs)

SequenceClassifierOutput(loss=None, logits=tensor([[-4.0864,  4.4267],
        [ 2.1826, -1.8462]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
