<a href="https://colab.research.google.com/github/not-sid-29/transformers_huggingface/blob/main/5_Batching_Inputs_for_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook-5: Handling Multiple Inputs at Once (Batching Inputs for Transformers)

In [1]:
!pip install --q datasets evaluate transformers[sentencepiece]


### What is the need for batching inputs for transformers?
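→ Transformer models expect a batch of sequences as input, even when there is only one: the input tensor must have shape `(batch_size, sequence_length)`. A single list of token IDs therefore has to be wrapped in an extra dimension before it is passed to the model, and several sequences can be stacked along that dimension so they are processed together.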

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

seq = "Using HuggingFace is quite easy"

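# Tokenize manually: split into word pieces, then map each piece to its vocabulary ID.
# Note: this path does not add the special tokens that calling tokenizer(seq) would add.
tokens = tokenizer.tokenize(seq)
input_ids = tokenizer.convert_tokens_to_ids(tokens)
# Wrap the IDs in an extra list so the tensor has shape (batch_size=1, seq_len).
input_ids = torch.tensor([input_ids])
print("Input IDs: ", input_ids)

# Forward pass: the model returns raw logits, one row per sequence in the batch.
score = model(input_ids)
print("Output Logits: ", score.logits)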

Input IDs:  tensor([[ 2478, 17662, 12172,  2003,  3243,  3733]])
Output Logits:  tensor([[-0.4818,  0.6146]], grad_fn=<AddmmBackward0>)


### Batching with Padding (for sequences of unequal length)

In [5]:
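# Padding ID chosen for illustration; in practice use tokenizer.pad_token_id.
pad_token_id = 100

seq1_ids = [[150, 500, 350, 450, 100, 200]]
seq2_ids = [[150, 150, 150]]

# Pad the shorter sequence to the length of the longest so the batch is rectangular.
batched_ids = [
    [150, 500, 350, 450, 100, 200],
    [150, 150, 150, pad_token_id, pad_token_id, pad_token_id]
]

# Run each sequence on its own, then both together as one batch.
print(model(torch.tensor(seq1_ids)).logits)
print(model(torch.tensor(seq2_ids)).logits)
print(model(torch.tensor(batched_ids)).logits)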

tensor([[ 0.9795, -0.8957]], grad_fn=<AddmmBackward0>)
tensor([[ 0.9764, -0.9153]], grad_fn=<AddmmBackward0>)
tensor([[ 0.9795, -0.8957],
        [ 0.9984, -0.8844]], grad_fn=<AddmmBackward0>)


⇒ Here, the logits for `seq2_ids` and the second row of `batched_ids` should be the same, but they are not. No attention mask was passed, so the attention layers also attended to the `pad_token_id` positions, and the padded sequence produced different predictions.
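To make the effect of padding on attention concrete, here is a minimal sketch in plain Python of how one query's attention weights change with and without a mask. The scores are made-up numbers and `masked_softmax` is a hypothetical helper, not part of the `transformers` API; a real model applies the same idea inside every attention head.

```python
import math

def masked_softmax(scores, mask=None):
    """Softmax over scores; positions where mask is 0 get zero weight."""
    if mask is None:
        mask = [1] * len(scores)
    # Masked positions are set to -inf, so exp() maps them to exactly 0.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores: 3 real tokens followed by 3 padding tokens.
scores = [2.0, 1.0, 0.5, 0.3, 0.3, 0.3]
mask = [1, 1, 1, 0, 0, 0]

unpadded = masked_softmax(scores[:3])       # the short sequence on its own
no_mask = masked_softmax(scores)            # padded, padding is attended to
with_mask = masked_softmax(scores, mask)    # padded, padding is ignored

print("unpadded: ", unpadded)
print("no mask:  ", no_mask)
print("with mask:", with_mask)
```

Without the mask, some of the attention weight leaks onto the padding positions, which shifts the weights of the real tokens. With the mask, the real tokens receive the same weights as in the unpadded sequence.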

In [7]:
# 1 marks a real token to attend to, 0 marks a padding position to ignore.
attention_mask = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0]
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print("New preds for batched_ids: ", outputs.logits)

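New preds for batched_ids:  tensor([[ 0.9795, -0.8957],
        [ 0.9765, -0.9153]], grad_fn=<AddmmBackward0>)

⇒ With the attention mask, the second row now matches the logits for `seq2_ids` on its own (`[0.9764, -0.9153]`), up to floating-point rounding.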
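The padded IDs and the attention mask in this notebook were written out by hand. As a rough sketch, both can be built from ragged lists with a small helper. `pad_batch` is a hypothetical name introduced here for illustration; in practice, calling the tokenizer on a list of strings with `padding=True` returns padded `input_ids` together with an `attention_mask`.

```python
def pad_batch(sequences, pad_id):
    """Pad lists of token IDs to equal length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Append padding on the right until the sequence reaches max_len.
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 for every real token, 0 for every padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

# Same two sequences and illustrative padding ID as in the cells above.
ids, mask = pad_batch([[150, 500, 350, 450, 100, 200], [150, 150, 150]], pad_id=100)
print(ids)
print(mask)
```

The two results can then be wrapped with `torch.tensor(...)` and passed to the model as the input IDs and the `attention_mask` argument, as in the previous cell.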
