# Behind the Pipeline

* Outlines what the `pipeline` function from the Hugging Face `Transformers` library does behind the scenes
* Uses a sentiment analysis example
* Uses the `distilbert/distilbert-base-uncased-finetuned-sst-2-english` model
* All classes and functions are imported from the `Transformers` library

## Setup

In [27]:
model_provider = "distilbert"
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = f"{model_provider}/{model_name}"
task_name = "sentiment-analysis"

## Sentiment Analysis Pipeline

In [28]:
from transformers import pipeline

classifier = pipeline(task=task_name, model=model)
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

Device set to use mps:0


[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
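The pipeline returns one dictionary per input sentence, in the same order as the inputs. As a minimal sketch of how that output might be consumed, the snippet below pairs each sentence with its prediction. The `results` list is pasted in from the output above rather than computed, and the names `sentences`, `results`, and `summary` are illustrative only.

```python
# Inputs and output copied from the cell above. In a live session,
# `results` would be the return value of classifier(sentences).
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
results = [
    {"label": "POSITIVE", "score": 0.9598049521446228},
    {"label": "NEGATIVE", "score": 0.9994558691978455},
]

# Predictions come back in input order, so zip lines them up with the text.
summary = [
    (text, result["label"], round(result["score"], 4))
    for text, result in zip(sentences, results)
]

for text, label, score in summary:
    print(f"{label:<8} {score:.4f}  {text}")
```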

## Preprocessing with Tokenizer

* Raw text needs to be tokenized and converted into integers for input into the model
* This preprocessing must be done the same way as when the model was pretrained
* Call the `AutoTokenizer` class's `from_pretrained` method with the model's checkpoint name to:
    * Automatically fetch the data associated with the model's tokenizer
    * Cache the data so it can be reused as needed
* Once the `tokenizer` object is set, text sentences can be passed to it
* The `raw_inputs` variable holds a list of the sentences to pass to the tokenizer
* Transformer models only accept *tensors* as input
* The `return_tensors` argument of the tokenizer specifies the type of tensors to return
    * `pt` returns PyTorch tensors
* The `inputs` variable is the output of the tokenizer to be used as input to the model
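    * It is a dictionary-like object with 2 keys:
        * `input_ids`: Contains 2 rows of integers (one per sentence); each integer is the vocabulary ID of a token, and the shorter sentence is padded with `0` to match the length of the longer one
        * `attention_mask`: Contains 2 rows of the same shape, with `1` marking real tokens and `0` marking padding positions that the model should ignore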
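The two sentences tokenize to different lengths, so the tokenizer has to pad the shorter one before both fit in a single rectangular tensor. The sketch below illustrates that idea in plain Python. It is not the library's implementation: `pad_batch` and `PAD_ID` are hypothetical names, and the token IDs are copied from the tokenizer output printed in this notebook (with the padding stripped).

```python
# Padding ID as it appears in the notebook output. Assumed here for
# illustration, not read from a real tokenizer.
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the printed tokenizer output, without padding.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037,
         17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
# [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0]
print(batch["attention_mask"][1])
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The real tokenizer produces the same layout when called with `padding=True`, and `return_tensors="pt"` then wraps the rows in PyTorch tensors.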

In [29]:
checkpoint = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [30]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [31]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

In [32]:
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

In [33]:
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
