# Pipeline

Figure: the full NLP pipeline (full-nlp-pipeline.svg)

* Tokenizer: Raw text is preprocessed into the input format the model expects
* Model: The preprocessed inputs are passed to the model
* Post-processing: The model's raw predictions are post-processed into a readable result
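The post-processing stage has no worked example in these notes, so here is a minimal sketch of what it could look like for a classification model, assuming the model emits one raw score (logit) per label. The logits and label names are made-up illustrative values, and `softmax` and `postprocess` are hypothetical helpers written in plain Python, not part of the transformers API.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, labels):
    """Return the most likely label together with its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical logits for a single sentence and hypothetical label names
example_logits = [-1.5, 1.5]
example_labels = ["NEGATIVE", "POSITIVE"]

label, score = postprocess(example_logits, example_labels)
print(label, round(score, 3))
```

The softmax keeps the ordering of the scores, so the predicted label is the one with the largest logit; the probabilities only make the result easier to read.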

## Tokenizer

* Motivation:
  * Splitting the input into words, subwords, or symbols
  * Mapping each token to an integer
  * Adding additional inputs that may be useful to the model
* Text must be tokenized in the same way as when the model was trained
* The tokenizer is therefore tied to the model (checkpoint) in use
* The `AutoTokenizer` class and its `from_pretrained()` method load the tokenizer that matches a checkpoint
* Example:

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

In [None]:
from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

From list of strings to tensors:

In [3]:
raw_inputs = ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
# Pad to the longest sequence, truncate overlong ones, and return PyTorch tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])}
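To make the tokenizer's three jobs and the padding visible in the output above more concrete, here is a minimal toy sketch in plain Python. The vocabulary, the whitespace splitting, and the helpers `encode` and `encode_batch` are all hypothetical simplifications; a real tokenizer uses a learned subword vocabulary that ships with the checkpoint.

```python
# Hypothetical toy vocabulary; real tokenizers load theirs from the checkpoint
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "hate": 5, "this": 6, "love": 7, "course": 8}

def encode(text):
    """Split on whitespace, map tokens to ids, and add special tokens."""
    tokens = text.lower().split()                          # 1. split into tokens
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]  # 2. map each token to an integer
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]       # 3. add extra inputs (special tokens)

def encode_batch(texts):
    """Pad every sequence to the longest one and build the attention mask."""
    encoded = [encode(t) for t in texts]
    max_len = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        pad = max_len - len(ids)
        input_ids.append(ids + [VOCAB["[PAD]"]] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = encode_batch(["I love this course", "I hate this"])
print(batch)
```

The same pattern shows up in the real output: each row starts with 101 and ends its real tokens with 102 (the special tokens of this checkpoint's vocabulary), the shorter sentence is padded with 0s, and the attention mask marks those padded positions with 0.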


## Model

The [`AutoModel` class](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) also provides a `from_pretrained()` method:

In [5]:
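from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# AutoModel loads the base model only, so the classifier weights in the checkpoint are not used
model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)  # unpack input_ids and attention_mask as keyword arguments
print(outputs.last_hidden_state.shape)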

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 16, 768])


The output of the model (hidden states) has three dimensions:
* Batch size: The number of sequences processed at a time (2 here)
* Sequence length: The length of the numerical representation of each sequence (16 here, after padding)
* Hidden size: The dimension of the vector the model produces for each token (768 here)
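As a rough illustration of where the three dimensions come from, the sketch below builds a nested list with one vector per token position and reports its shape. `toy_hidden_states` and `shape` are hypothetical helpers, and the random values merely stand in for what a real model would compute.

```python
import random

def toy_hidden_states(input_ids, hidden_size):
    """Return one random vector of length hidden_size per token position."""
    return [[[random.random() for _ in range(hidden_size)] for _ in seq]
            for seq in input_ids]

def shape(nested):
    """Compute the shape of a regular (non-ragged) nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims

# Two padded sequences of 16 ids each, mirroring the tokenizer output (ids replaced by zeros)
input_ids = [[0] * 16, [0] * 16]
hidden = toy_hidden_states(input_ids, hidden_size=768)
print(shape(hidden))  # [2, 16, 768]
```

The first two dimensions are fixed by the batch that the tokenizer produced, while the last one is a property of the model architecture.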