# Looking Into Pipelines

1. Preprocessing with a tokeniser
2. Passing inputs through the model
3. Postprocessing

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9598047137260437},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]
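
The `score` values above are probabilities. As a minimal sketch of what a postprocessing step could look like, the snippet below applies a softmax to raw model scores (logits) and picks the most likely label. The logit values, the `ID2LABEL` mapping, and the `postprocess` helper are illustrative assumptions, not output captured from this notebook.

```python
import math

# Assumed label order for illustration; a real model exposes this mapping in its config.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(batch_logits):
    """Return the most likely label and its probability for each row of logits."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


# Illustrative logits, one row per sentence (hypothetical values).
illustrative_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
results = postprocess(illustrative_logits)
for result in results:
    print(result)
```

The larger the gap between the two logits in a row, the closer the winning probability gets to 1.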

## Preprocessing

1. Splitting the input into words, subwords, or symbols (like punctuation) that are called tokens
2. Mapping each token to an integer
3. Adding additional inputs that may be useful to the model
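
The three steps above can be sketched with a toy example. This is not how the real tokeniser works internally (it uses a learned subword vocabulary); `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names made up for illustration, and the splitting rule is a simple regular expression.

```python
import re

# Hypothetical toy vocabulary; a real tokeniser ships a vocabulary with many thousands of entries.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9,
}


def toy_tokenize(text):
    """Step 1: split lowercased text into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(text):
    """Run the three preprocessing steps on a single sentence."""
    tokens = toy_tokenize(text)
    # Step 3: add extra inputs, here special start and end tokens.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 2: map each token to an integer; unknown tokens fall back to [UNK].
    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    return {"tokens": tokens, "input_ids": input_ids}


encoded = toy_encode("I hate this so much!")
print(encoded["tokens"])     # ['[CLS]', 'i', 'hate', 'this', 'so', 'much', '!', '[SEP]']
print(encoded["input_ids"])  # [2, 4, 5, 6, 7, 8, 9, 3]
```

Any word missing from the toy vocabulary maps to `[UNK]`, which is one reason real tokenisers split rare words into subwords instead.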

### Loading a Tokeniser

- In code, the spelling is `tokenizer`
- **Preprocessing needs to be done in exactly the same way as when the model was pretrained**
  - The required information can be downloaded from the [Model Hub](https://huggingface.co/models)
- Sentences can be directly passed to the tokeniser
- It returns a dictionary that will be fed into the model

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint) # Load the tokeniser that matches the checkpoint

- To do: convert the list of inputs into tensors
  - With the Transformers library, there is no need to worry about which ML framework is used as the backend (PyTorch, TensorFlow, or Flax)
  - Transformers models only accept tensors as input
    - Tensors: like NumPy arrays, they can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions
- Output: a dictionary with two keys
  - `input_ids`: two rows of integers (one per sentence), the unique identifiers of the tokens in each sentence
  - `attention_mask`: marks which positions hold real tokens (1) and which hold padding (0)

In [None]:
# Human readable data input
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# return_tensors specifies the type of tensors returned ("pt" = PyTorch); without it, the default return is a list of lists
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
inputs

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
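
The zeros at the end of the second row come from `padding=True`: the shorter sentence is padded to the length of the longest one, and `attention_mask` records which positions are real. A minimal sketch of that idea in plain Python follows; `pad_batch` is a hypothetical helper written for illustration, not the tokeniser's actual implementation, and it assumes a padding ID of 0 as seen in the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs copied from the output above.
first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172,
         2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

batch = pad_batch([first, second])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

Both rows end up with 16 entries, so the batch can be stored as a rectangular 2D tensor.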

### Going Through the Model

- Download the pretrained model the same way as the tokeniser, using `from_pretrained()`
- The architecture loaded by `AutoModel` contains only the base Transformer module
  - Given some inputs, it outputs hidden states, also known as features
  - For each model input, it returns a high-dimensional vector representing the Transformer model's contextual understanding of that input
  - Hidden states are usually inputs to another part of the model, but they are also useful on their own

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" # Should be cached because it has been downloaded in the previous blocks
model = AutoModel.from_pretrained(checkpoint) # Instantiate model

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
