<a href="https://colab.research.google.com/github/harshinharshi/huggingface_NLP/blob/main/transformers-course/chapter1/Behind_the_pipeline_(PyTorch).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Behind the pipeline (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# load token
import os
from dotenv import load_dotenv

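# load_dotenv() reads .env and adds its variables (e.g. HF_TOKEN) to os.environ
load_dotenv()
# .get() returns None instead of raising if no token is configured
hf_token = os.environ.get("HF_TOKEN")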

## Pipeline 
The most basic object in the Hugging Face Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, so we can pass in raw text and get an intelligible answer. It groups together three steps: preprocessing, passing the inputs through the model, and postprocessing.

`pipeline("sentiment-analysis")`: if we create a pipeline without naming a model, a default checkpoint is used. Here that default is `distilbert/distilbert-base-uncased-finetuned-sst-2-english`.

In [8]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've love pizza",
        "I hate ice cream",
    ]
)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.999572217464447},
 {'label': 'NEGATIVE', 'score': 0.9974052309989929}]

## Behind Pipeline
Hugging Face allows us to bypass the high-level pipeline abstraction and directly utilize the tokenizer and model for more granular control. In this explanation, we’ll explore how to implement this approach, which involves two key steps:  

1. **Preprocessing with a Tokenizer**  
   This step involves converting raw text into tokenized inputs (e.g., token IDs, attention masks) that the model can process.  

2. **Processing with the Model**  
   Once preprocessed, the tokenized data is passed through the model to generate outputs, such as predictions or embeddings.  

Let’s break down each step in detail.

### Preprocessing with a tokenizer
**Core Objective**: Transform raw text into numerical representations through the following steps:  

1. **Tokenization**  
   Split the input text into smaller units (tokens), such as words, subwords, or symbols (e.g., punctuation marks).  

2. **Numerical Mapping**  
   Convert each token into a corresponding integer ID using a predefined vocabulary or lookup table.  

3. **Input Augmentation**  
   Add special tokens (e.g., `[CLS]`, `[SEP]`) or metadata (e.g., attention masks, token type IDs) required by the model to process the input effectively.  

This process ensures the text is structured into a format compatible with machine learning models.
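
The three steps above can be sketched with a toy encoder. This is only an illustration under stated assumptions: `TOY_VOCAB` and `toy_encode` are hypothetical names, the whitespace split stands in for the checkpoint's real subword tokenizer, and the word IDs are copied from the tokenizer output shown later in this notebook. The `[UNK]` ID of 100 is an assumption based on the usual BERT uncased vocabulary.

```python
# Toy illustration of the three preprocessing steps.
# This is NOT the real tokenizer: the vocabulary is a tiny hand-made subset
# and the whitespace split is a simplifying assumption.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "hate": 5223, "ice": 3256, "cream": 6949,
}

def toy_encode(text):
    # 1. Tokenization: lowercase (the checkpoint is uncased) and split on whitespace.
    tokens = text.lower().split()
    # 2. Numerical mapping: look up each token, falling back to [UNK].
    ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    # 3. Input augmentation: wrap the sequence in [CLS] ... [SEP].
    return [TOY_VOCAB["[CLS]"]] + ids + [TOY_VOCAB["[SEP]"]]

print(toy_encode("I hate ice cream"))
# prints [101, 1045, 5223, 3256, 6949, 102]
```

The result matches the non-padded part of the second row of `input_ids` produced by the real tokenizer in the next cells.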

Since the default checkpoint of the sentiment-analysis pipeline is distilbert-base-uncased-finetuned-sst-2-english, we load the tokenizer for that same checkpoint here.

In [None]:
from transformers import AutoTokenizer

# initializing tokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


The tokenizer call in the next cell takes the following arguments:

1. `raw_inputs`: A string or a list of strings to be tokenized.
2. `padding=True`: Pads shorter sequences to match the longest sequence in the batch.
3. `truncation=True`: Truncates longer sequences to fit within the model's token limit.
4. `return_tensors="pt"`: Returns the output as PyTorch tensors.

In [22]:
raw_inputs = [
        "I've love pizza",
        "I hate ice cream",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2293, 10733,   102],
        [  101,  1045,  5223,  3256,  6949,   102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0]])}
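
To see where the trailing `0` in `input_ids` and in `attention_mask` comes from, here is a minimal sketch of padding in plain Python. `pad_batch` is a hypothetical helper written for illustration, not part of the Transformers API, and it assumes a padding ID of 0, as in the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every ID list to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        # Real tokens keep their IDs; padded positions get pad_id.
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 1005, 2310, 2293, 10733, 102],  # first sentence (7 tokens)
    [101, 1045, 5223, 3256, 6949, 102],         # second sentence (6 tokens)
])
print(batch["input_ids"][1])       # prints [101, 1045, 5223, 3256, 6949, 102, 0]
print(batch["attention_mask"][1])  # prints [1, 1, 1, 1, 1, 1, 0]
```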


### Going through the model

The vector output by the Transformer module is usually large. It generally has three dimensions:

1. Batch size: The number of sequences processed at a time (2 in our example).
2. Sequence length: The length of the numerical representation of the sequence (7 in our example, including special tokens and padding).
3. Hidden size: The dimension of the vector that represents each token (768 for this checkpoint).

In [33]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [34]:
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 7, 768])


The output shape `torch.Size([2, 7, 768])` describes a 3D tensor with the following dimensions:

2 → Batch size: The model processed two input sequences.

7 → Sequence length: Each sequence contains 7 tokens after tokenization and padding.

768 → Hidden size: Each token is represented by a 768-dimensional vector (the same hidden size as BERT-base).

This tensor is the `last_hidden_state`, that is, the output of the model's final Transformer layer.

In [35]:
outputs

BaseModelOutput(last_hidden_state=tensor([[[ 0.2425, -0.1938,  0.2781,  ...,  0.4556,  1.1108, -0.2243],
         [ 0.7231, -0.0086,  0.2829,  ...,  0.3625,  1.2168, -0.0976],
         [ 1.1178, -0.0109,  0.4670,  ...,  0.7832,  0.2728, -0.7872],
         ...,
         [ 0.8897, -0.0912,  0.6215,  ...,  0.2053,  1.0163,  0.0544],
         [ 0.2794, -0.2736,  0.5849,  ...,  0.5751,  0.7293, -0.1187],
         [ 0.7754, -0.1281,  0.4774,  ...,  0.7552,  0.8777, -0.5446]],

        [[-0.0709,  0.7972, -0.2863,  ..., -0.1594, -0.1840,  0.1574],
         [-0.2289,  0.8642, -0.1575,  ..., -0.2345, -0.0336,  0.3861],
         [-0.0514,  0.9576, -0.0853,  ..., -0.2791, -0.0349,  0.3560],
         ...,
         [-0.8821,  0.5989, -0.0820,  ...,  0.0766, -0.0486,  0.1625],
         [ 0.1153,  0.3351, -0.0721,  ...,  0.1415, -0.1309, -0.1116],
         [-0.1579,  0.5985, -0.1570,  ..., -0.0764, -0.0212,  0.0562]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

To classify the sentences as positive or negative, we need a model with a sequence classification head. So, we won’t actually use the AutoModel class, but AutoModelForSequenceClassification:

In [26]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)

Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label):

In [27]:
print(outputs.logits.shape)

torch.Size([2, 2])
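
The way a head reduces `[batch, sequence, hidden]` to `[batch, labels]` can be sketched in plain Python. This is a deliberately simplified illustration: `toy_head`, the weights, and the tiny sizes (3 tokens, hidden size 4) are made up, and the head in the actual checkpoint contains more layers than the single linear map shown here. The sketch assumes the head reads only the vector of the first token of each sequence.

```python
# Hypothetical toy sizes: 2 sequences, 3 tokens each, hidden size 4
# (the real model uses hidden size 768).
hidden_states = [
    [[0.5, -1.0, 1.5, 2.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
    [[2.0, 1.0, 0.5, -1.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
]
# Made-up weights: one row of size `hidden` per label, plus one bias per label.
weights = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
bias = [0.0, 0.0]

def toy_head(hidden_states, weights, bias):
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]  # vector of the first token in the sequence
        # One dot product per label turns `hidden` values into one score.
        logits.append([
            sum(w * x for w, x in zip(row, first_token)) + b
            for row, b in zip(weights, bias)
        ])
    return logits

logits = toy_head(hidden_states, weights, bias)
print(len(logits), len(logits[0]))  # prints 2 2
```

The output has one row per sequence and one score per label, which is the same layout as `outputs.logits`.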


In [28]:
print(outputs.logits)

tensor([[-3.7717,  3.9848],
        [ 3.2460, -2.7057]], grad_fn=<AddmmBackward0>)


These values are logits: raw, unnormalized scores rather than probabilities. To convert them to probabilities, they need to go through a softmax layer:

In [29]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[4.2779e-04, 9.9957e-01],
        [9.9741e-01, 2.5947e-03]], grad_fn=<SoftmaxBackward0>)
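
As a cross-check, softmax can be reproduced with the standard library alone. The sketch below re-types the rounded logits printed earlier, so its results agree with the tensor above only up to rounding. The function name `softmax` here is a local helper, not the PyTorch one.

```python
import math

def softmax(logits):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    # Each probability is exp(logit) divided by the sum over all labels.
    return [e / total for e in exps]

# Rounded logits copied from the printed model output.
logit_rows = [[-3.7717, 3.9848], [3.2460, -2.7057]]
probability_rows = [softmax(row) for row in logit_rows]
for row in probability_rows:
    print(row)
```

Each row sums to 1, and the larger logit in a row always receives the larger probability.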


In [30]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

Now we can conclude that the model predicted the following:

First sentence: NEGATIVE: 0.00043, POSITIVE: 0.99957

Second sentence: NEGATIVE: 0.99741, POSITIVE: 0.00259

These match the labels and scores returned by the pipeline at the start of the notebook.
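
The last postprocessing step, picking the most probable label for each sentence, can be sketched as follows. `to_pipeline_format` is a hypothetical helper that only mimics the shape of the pipeline's output; the probabilities are the rounded values printed by the softmax cell.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
# Rounded probabilities copied from the softmax output above.
probabilities = [
    [4.2779e-04, 9.9957e-01],
    [9.9741e-01, 2.5947e-03],
]

def to_pipeline_format(probabilities, id2label):
    results = []
    for row in probabilities:
        # Index of the highest probability in this row (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        results.append({"label": id2label[best], "score": row[best]})
    return results

results = to_pipeline_format(probabilities, id2label)
print(results)
# prints [{'label': 'POSITIVE', 'score': 0.99957}, {'label': 'NEGATIVE', 'score': 0.99741}]
```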