# Behind the pipeline (PyTorch)

This notebook is explained in chapter 2, section 2 of the Hugging Face course: [Behind the pipeline](https://huggingface.co/course/chapter2/2?fw=pt)

The original code for this notebook is in Hugging Face's notebooks repository, linked here through SageMaker Studio Lab: [section2_pt.ipynb](https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter2/section2_pt.ipynb)

## Run conditions

This notebook has been tested in the following environment:
- Environment: Project created in [Paperspace Gradient](https://gradient.paperspace.com) with Python 3.9.13.
- Machine: P5000 (30 GiB RAM, 8 CPUs, 16 GiB GPU) (more details on [Paperspace Machines](https://docs.paperspace.com/gradient/machines/)).
- IDE: Visual Studio Code with remote Jupyter server.

## Install dependencies

In [1]:
# Install the libraries datasets v2.7.1, evaluate v0.3.0, and transformers v4.25.1 with quiet and upgrade flags.
%pip install -q datasets==2.7.1 evaluate==0.3.0 transformers==4.25.1 --upgrade

Note: you may need to restart the kernel to use updated packages.


## Sentiment Analysis pipeline

In [2]:
# Import pipeline from Transformers.
from transformers import pipeline

# Create a classifier with a sentiment analysis pipeline.
classifier = pipeline("sentiment-analysis")
# Classify two sentences in a list.
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

## Preprocessing with a tokenizer

In [3]:
# Import AutoTokenizer from Transformers.
from transformers import AutoTokenizer

# Set the checkpoint name.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Create a tokenizer from the checkpoint name.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [4]:
# Set 2 sentences as raw inputs.
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
# Tokenize the raw inputs with padding and truncation, returning PyTorch tensors.
tokenized_inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
# Print the tokenized inputs.
tokenized_inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
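
The second row of `input_ids` ends in zeros because the shorter sentence is padded to the length of the longer one, and `attention_mask` marks which positions hold real tokens. The sketch below reproduces that layout in plain Python from the token ids shown above. `pad_batch` is a hypothetical helper written for illustration, not part of the Transformers API, and it assumes right-padding with a pad id of 0, as seen in the output.

```python
# Token ids copied from the tokenizer output above, with the padding removed.
sequences = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
     2607, 2026, 2878, 2166, 1012, 102],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]


def pad_batch(sequences, pad_id=0):
    """Right-pad each sequence to the longest length and build the matching mask.

    Hypothetical helper for illustration; pad_id=0 is an assumption based on
    the zeros visible in the output above.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch(sequences)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Both rows end up with 16 entries, which is why the tokenizer can return a single rectangular tensor.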

## Going through the model

In [5]:
# Import AutoModel from Transformers.
from transformers import AutoModel

# Create a base model (without a task-specific head) from the checkpoint name.
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'classifier.bias', 'pre_classifier.weight', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## A high-dimensional vector?

In [6]:
# Create outputs from the model with the tokenized inputs.
outputs = model(**tokenized_inputs)
# Print the shape of the last hidden state.
outputs.last_hidden_state.shape

torch.Size([2, 16, 768])

## Model heads: Making sense out of numbers

In [7]:
# Import AutoModelForSequenceClassification from Transformers.
from transformers import AutoModelForSequenceClassification

# Create a model with a sequence classification head from the checkpoint name.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Create outputs from the model with the tokenized inputs.
outputs = model(**tokenized_inputs)
# Print the shape of the logits.
outputs.logits.shape

torch.Size([2, 2])
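
The shape changes from `[2, 16, 768]` to `[2, 2]` because the head reduces each sequence to one score per label. The toy sketch below shows one way that reduction can work, assuming the head reads only the first token's hidden vector and applies a single linear layer. The sizes, the random weights, and the `classify` function are all made up for illustration. The actual head in this checkpoint also has a `pre_classifier` layer, as the weight names in the earlier warning indicate.

```python
import random

random.seed(0)

# hidden_size is 768 in the real model; 8 keeps this toy example small.
batch_size, seq_len, hidden_size, num_labels = 2, 16, 8, 2

# Fake hidden states with shape [batch_size, seq_len, hidden_size].
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(seq_len)]
    for _ in range(batch_size)
]

# Hypothetical linear head: one weight row and one bias per label.
weights = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
biases = [0.0] * num_labels


def classify(hidden_states):
    """Map [batch, seq, hidden] to [batch, num_labels] using each sequence's first token."""
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]  # assumption: the head only looks at position 0
        logits.append([
            sum(w * h for w, h in zip(row, first_token)) + b
            for row, b in zip(weights, biases)
        ])
    return logits


logits = classify(hidden_states)
print(len(logits), len(logits[0]))
```

The sequence dimension disappears because only one position per sentence feeds the head, and the hidden dimension collapses to the number of labels.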

## Postprocessing the output

In [8]:
# Print the output logits.
outputs.logits

tensor([[-1.5607,  1.6123],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)

In [9]:
# Import torch.
import torch

# Convert the output logits to probabilities with softmax.
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Print the predictions.
predictions

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)
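
To make the softmax step concrete, the sketch below recomputes the probabilities in plain Python from the logits printed earlier. It is a minimal illustration of the formula, not how PyTorch implements it. The `softmax` function here is a local helper, and the logits are the rounded values shown in the output, so the results agree only to about four decimal places.

```python
import math

# Logits copied (rounded) from the output above.
logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]


def softmax(row):
    """Exponentiate and normalize one row so its values are positive and sum to 1."""
    shift = max(row)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in logits]
for row in probabilities:
    print([round(p, 4) for p in row])
```

Each row sums to 1, so the values can be read as the model's confidence in each label.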

In [10]:
# Get labels from the model.
labels = model.config.id2label
# Print the labels.
labels

{0: 'NEGATIVE', 1: 'POSITIVE'}
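
Combining the probabilities with this label mapping gives the same kind of result the pipeline returned at the start. The sketch below does this in plain Python with the values printed above. `to_labels` is a hypothetical helper for illustration, not the pipeline's actual postprocessing code.

```python
# Values copied from the outputs above.
predictions = [[4.0195e-02, 9.5980e-01], [9.9946e-01, 5.4418e-04]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def to_labels(predictions, id2label):
    """Pick the highest-probability class in each row and attach its label name."""
    results = []
    for row in predictions:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest value
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(predictions, id2label)
print(results)
```

The first sentence maps to `POSITIVE` and the second to `NEGATIVE`, matching the labels and (rounded) scores from the sentiment analysis pipeline.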