# The Hugging Face Pipeline

First, why are we even using Hugging Face? Is there something better? What does Hugging Face offer us?

Hugging Face strives to be *the* hub for sharing LLMs. Because anyone can share models there, and big companies *do* share theirs, we can use models with tens of billions of parameters with little fuss.

Hugging Face advertises:
- **It's easy**: "Downloading, loading, and using LLMs is done in a few lines of code"
  - We saw that this is true by performing a task in two lines of code using `pipeline`.
- **Flexibility**: At their core, all models are standard PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes.

We've used the `pipeline` function.

The `pipeline` function abstracts away three steps:
1. **Tokenization**: Preprocessing the raw text into the numeric inputs the model expects
2. **Model Input**: Passing the preprocessed data through the model, which outputs logits
3. **Postprocessing**: Converting the logits to probabilities and mapping them to readable results
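
As a rough illustration of the third step, logits are usually turned into probabilities with a softmax. The sketch below is plain Python rather than anything `pipeline` exposes; the logit values and the `labels` list are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed values: two logits for one sentence, one per class.
logits = [-1.5, 2.0]
labels = ["NEGATIVE", "POSITIVE"]  # illustrative label names

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the top class
print(labels[best], probs[best])
```

The larger logit gets the larger probability, and the gap between the logits controls how confident the prediction looks.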

We will now discuss each of these individually.

## Tokenization

Neural networks cannot process raw text directly. The text must first be *tokenized*, that is, converted into numbers.

Plainly, the tokenization process has three steps:
1. Splitting textual data (sentences, phrases, words) into words, subwords, or symbols, all of which are called **tokens**.
2. Adding special tokens that the model might need, for example to mark where a sentence starts and ends.
3. Converting the tokens into numeric representations (integer IDs).
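
The three steps above can be sketched with a toy tokenizer. This is not how `AutoTokenizer` works internally (real tokenizers use learned subword vocabularies). `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, and the IDs are copied from the tokenizer output shown later in this section, assuming each word there maps to a single token.

```python
import re

# Hypothetical mini-vocabulary; IDs taken from the output later in this section.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "hugging": 17662, "face": 2227, "is": 2003, "a": 1037,
    "nice": 3835, "website": 4037, "!": 999,
}

def toy_tokenize(text):
    """Step 1: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(text):
    # Step 2: wrap the tokens in the special start/end markers.
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    # Step 3: look up each token's ID, falling back to [UNK] for unknown words.
    ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    return tokens, ids

tokens, ids = toy_encode("Hugging Face is a nice website!")
print(tokens)
print(ids)
```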

Each model has its own preprocessing (tokenization) method. You might ask, "How does this make sense?"

It makes sense because a model learns what each numeric ID means during training. If a word is not given the same ID at inference time that it had in training, then to the model it effectively isn't the same word. Imagine reading two dictionaries in which the definitions of the same words were completely different!
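
A small sketch of that mismatch, using two made-up vocabularies (`build_vocab` is a hypothetical helper, not a library function): the same sentence gets different IDs under each vocabulary, and IDs produced with one vocabulary read as nonsense under the other.

```python
def build_vocab(corpus):
    """Assign each new word the next free integer ID, in order of appearance."""
    vocab = {}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

# Two toy vocabularies built from the same words in a different order.
vocab_a = build_vocab("i love cats and dogs")
vocab_b = build_vocab("dogs and cats love i")

sentence = "i love dogs"
ids_a = [vocab_a[w] for w in sentence.split()]
ids_b = [vocab_b[w] for w in sentence.split()]
print(ids_a, ids_b)  # same sentence, different IDs

# Reading vocabulary A's IDs with vocabulary B scrambles the sentence.
id_to_word_b = {idx: word for word, idx in vocab_b.items()}
misread = " ".join(id_to_word_b[i] for i in ids_a)
print(misread)
```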

Hugging Face lets us easily reuse a model's original preprocessing through the `AutoTokenizer` class.

*Def'n.* **checkpoint** - The saved weights (parameters) of a trained model. The checkpoint name also tells `AutoTokenizer` which tokenizer to load.

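In [1]:
import warnings

import tensorflow as tf  # TensorFlow backend, needed for return_tensors='tf'
from transformers import AutoTokenizer

warnings.filterwarnings("ignore", category=DeprecationWarning)

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # model name on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # loads the tokenizer that matches this checkpoint

data = ['Hugging Face is a nice website!',
        'I love to learn about tech-related topics!']

inputs = tokenizer(data, padding=True, truncation=True, return_tensors='tf')  # pad to the longest sentence; cut anything over the model's max length
print(inputs)

inputs = tokenizer(data, padding=True, return_tensors='tf')  # same result without truncation, since both sentences are short
print(inputs)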

{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101, 17662,  2227,  2003,  1037,  3835,  4037,   999,   102,
            0,     0,     0],
       [  101,  1045,  2293,  2000,  4553,  2055,  6627,  1011,  3141,
         7832,   999,   102]])>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101, 17662,  2227,  2003,  1037,  3835,  4037,   999,   102,
            0,     0,     0],
       [  101,  1045,  2293,  2000,  4553,  2055,  6627,  1011,  3141,
         7832,   999,   102]])>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
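
In the output above, the shorter sentence is padded with `0`s so both rows have length 12, and `attention_mask` marks real tokens with `1` and padding with `0`. A minimal pure-Python sketch of that bookkeeping follows; `pad_batch` is a hypothetical helper, not part of `transformers`, and the pad ID of `0` is read off the output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded token IDs for the two sentences, taken from the output above.
batch = [
    [101, 17662, 2227, 2003, 1037, 3835, 4037, 999, 102],
    [101, 1045, 2293, 2000, 4553, 2055, 6627, 1011, 3141, 7832, 999, 102],
]

padded = pad_batch(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```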
