# The Hugging Face Pipeline

First, why are we even using Hugging Face? Is there something better? What does Hugging Face offer us?

Hugging Face strives to be *the* hub for sharing LLMs. Because anyone can share models there, and big companies *do* share theirs, we can use models with tens of billions of parameters with little fuss.

Hugging Face advertises:
- **It's easy**: "Downloading, loading, and using LLMs is done in a few lines of code"
  - We see this is true by performing a task in two lines of code using `pipeline`
- **Flexibility**: At their core, all models are PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes

We've already used the `pipeline` function.

The `pipeline` function abstracts away three steps:
1. **Tokenization/Encoding**: Preprocessing the raw text into tokens and then into numeric IDs
2. **Model Input**: Passing the preprocessed data through the model, which outputs logits
3. **Postprocessing**: Converting the logits to probabilities and mapping them to readable results
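
To make the postprocessing step concrete, here is a minimal sketch of turning logits into probabilities with a softmax. The logit values and label names are made up for illustration; they are not produced by a real model.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a two-class model (assumed values).
example_logits = [-1.5, 2.0]
labels = ["NEGATIVE", "POSITIVE"]  # assumed label names

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], probs[best])
```

The highest probability picks the predicted label, which is roughly what `pipeline` reports as its result.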

We will now discuss each of these individually.

## Tokenization (Preprocessing; Encoding)

Neural networks cannot process raw text directly. The text must first be "tokenized" and converted to numbers.

Plainly, the tokenization process has three steps:
1. Splitting textual data (sentences, phrases, words) into words, subwords, or symbols, all of which are called **tokens**.
2. Converting the tokens into numeric representations (IDs).
3. Adding special tokens that the model might need, for example to mark where a sequence starts and ends.
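
The three steps above can be sketched with a toy vocabulary. The IDs are copied from the `bert-base-cased` output shown later in this section; the dictionary itself is a hand-built stand-in, since a real vocabulary has tens of thousands of entries.

```python
# Toy vocabulary (assumption: hand-built from the IDs printed later in this section).
toy_vocab = {
    "[CLS]": 101, "[SEP]": 102, "I": 146, "love": 1567, "to": 1106,
    "learn": 3858, "about": 1164, "LL": 12427, "##Ms": 25866, "!": 106,
}

# Step 1: split the text into tokens (pre-split here; a real tokenizer does this).
tokens = ["I", "love", "to", "learn", "about", "LL", "##Ms", "!"]

# Step 2: map each token to its integer ID.
ids = [toy_vocab[token] for token in tokens]

# Step 3: add the special tokens the model expects around the sequence.
input_ids = [toy_vocab["[CLS]"]] + ids + [toy_vocab["[SEP]"]]

print(ids)
print(input_ids)
```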

Note that a model's vocabulary is the set of tokens its tokenizer knows. It is built from the corpus the tokenizer was trained on.

Also, there are three main types of tokenization:
1. Word: (worst) Creates a huge vocabulary; if the vocabulary is trimmed, many words become unknown; moderate number of tokens to input
   - `Dog` is not found in [`Dogs`], even though `Dogs` is just the plural of `Dog`
2. Character: (better) Small vocabulary, but each token carries less meaning (less of an issue for ideogram-based languages, where a character carries more meaning); huge number of tokens to input
3. Sub-word: (best) Rare words are broken into smaller, meaningful pieces, which keeps the vocabulary small
   - `Dog` is found in [`Dog`, `s`], and [`Dog`, `s`] together still represents `Dogs`
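
Here is a rough sketch comparing the three approaches on the word `Dogs`. The sub-word function uses a greedy longest-match split in the style of WordPiece, with `##` marking a continuation piece as in the BERT output later in this section. The vocabularies are tiny made-up sets, and real tokenizers are more involved than this.

```python
def word_tokenize(word, vocab):
    # Word-level: the whole word must be in the vocabulary, else it is unknown.
    return [word] if word in vocab else ["[UNK]"]

def char_tokenize(word):
    # Character-level: every character becomes its own token.
    return list(word)

def subword_tokenize(word, vocab):
    """Greedy longest-match-first split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits
        pieces.append(piece)
        start = end
    return pieces

word_vocab = {"Dog"}            # assumed toy vocabulary
subword_vocab = {"Dog", "##s"}  # assumed toy vocabulary

print(word_tokenize("Dogs", word_vocab))        # ['[UNK]']
print(char_tokenize("Dogs"))                    # ['D', 'o', 'g', 's']
print(subword_tokenize("Dogs", subword_vocab))  # ['Dog', '##s']
```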

There are other ways to tokenize; which one applies depends on the model you want to use.

Each model has its own preprocessing (tokenization) method. You might ask, "How does this make sense?"

It makes sense because a model learns from the numeric IDs it saw during training. If we give a word a different ID than it had in training, the model effectively sees a different word. Imagine reading two dictionaries in which the same words had completely different definitions!
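
A tiny sketch of that mismatch, using two made-up vocabularies that contain the same tokens with different IDs:

```python
# Two hypothetical vocabularies: same tokens, different IDs.
train_vocab = {"I": 0, "love": 1, "dogs": 2}   # what the model was trained with
other_vocab = {"dogs": 0, "I": 1, "love": 2}   # a different tokenizer

def encode(tokens, vocab):
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    id_to_token = {i: token for token, i in vocab.items()}
    return [id_to_token[i] for i in ids]

tokens = ["I", "love", "dogs"]
wrong_ids = encode(tokens, other_vocab)

# From the model's point of view, these IDs mean something else entirely.
print(wrong_ids)                       # [1, 2, 0]
print(decode(wrong_ids, train_vocab))  # ['love', 'dogs', 'I']
```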

Hugging Face lets us easily use a model's original preprocessing through the `AutoTokenizer` class.

*Def'n.* **checkpoint** - The saved weights (trained parameters) of a model, usually referred to by name, such as `bert-base-cased`.

### Tokenizer Code

We can get the tokenizer by explicitly using `BertTokenizer`, or by using `AutoTokenizer`. Note how `tokenizer` and `tokenizer2` load the same pretrained tokenizer and therefore produce exactly the same output.

Note, **this is the long way to tokenize** so we can show decoding.

In [18]:
from transformers import BertTokenizer, AutoTokenizer
import tensorflow as tf
checkpoint = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer2 = AutoTokenizer.from_pretrained(checkpoint)

sequence = 'I love to learn about LLMs!'

# Tokenization
tokens1 = tokenizer.tokenize(sequence)
tokens2 = tokenizer2.tokenize(sequence)
print(tokens1)
print(tokens2, end = '\n\n')

# Encoding
ids1 = tokenizer.convert_tokens_to_ids(tokens1)
ids2 = tokenizer2.convert_tokens_to_ids(tokens2)
print(ids1)
print(ids2, end = '\n\n')

# Shortcut: calling the tokenizer directly does both steps above at once and
# also adds the special tokens (IDs 101 and 102 at the start and end of the output)
print(tokenizer(sequence))
print(tokenizer2(sequence))

['I', 'love', 'to', 'learn', 'about', 'LL', '##Ms', '!']
['I', 'love', 'to', 'learn', 'about', 'LL', '##Ms', '!']

[146, 1567, 1106, 3858, 1164, 12427, 25866, 106]
[146, 1567, 1106, 3858, 1164, 12427, 25866, 106]

{'input_ids': [101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The **output for both tokenizers is the exact same; one is just being more explicit.**
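
Decoding goes the other way, from IDs back to text. Below is a minimal sketch with a hand-built reverse vocabulary taken from the output above. For real tokenizers, the `decode` method of the tokenizer plays this role; the function here is only an illustration and does not tidy up spacing around punctuation.

```python
# Reverse vocabulary (assumption: hand-built from the IDs printed above).
id_to_token = {
    101: "[CLS]", 102: "[SEP]", 146: "I", 1567: "love", 1106: "to",
    3858: "learn", 1164: "about", 12427: "LL", 25866: "##Ms", 106: "!",
}
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}

def toy_decode(input_ids):
    words = []
    for token_id in input_ids:
        token = id_to_token[token_id]
        if token in SPECIAL_TOKENS:
            continue  # special tokens are not part of the text
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue a continuation piece to the previous word
        else:
            words.append(token)
    return " ".join(words)

decoded = toy_decode([101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102])
print(decoded)  # I love to learn about LLMs !
```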

In [17]:
# Save if you want the Tokenizer's data locally
some_file_name = 'BertTokenizerFiles' # could be any name
tokenizer.save_pretrained(some_file_name)

('BertTokenizerFiles\\tokenizer_config.json',
 'BertTokenizerFiles\\special_tokens_map.json',
 'BertTokenizerFiles\\vocab.txt',
 'BertTokenizerFiles\\added_tokens.json')

## Models

At first, I was going to call this section "Modeling", then I realized it's not actually modeling because we're using pretrained models; I'll use the Hugging Face section name "Models".

First, I'll initialize a model from a configuration, but we won't use it because it isn't trained; it's just for completeness.

In [20]:
# Initialize an untrained model with BERT architecture
from transformers import BertConfig, TFBertModel
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

config = BertConfig() # build config

model = TFBertModel(config) # Build model

In [23]:
print(config)
print(model)

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

<transformers.models.bert.modeling_tf_bert.TFBertModel object at 0x000001EDD9565B10>


It's good to see what's going on, but we won't use this model because it's untrained. As is taught in the first part of the Hugging Face course, training a huge model can take days or weeks and cost a ton of money.

Again, the auto class and the explicit BERT class load the same pretrained model from the checkpoint.

In [29]:
from transformers import TFAutoModel, TFBertModel

auto_model = TFAutoModel.from_pretrained(checkpoint)
bert_model = TFBertModel.from_pretrained(checkpoint)

**If you need to save the model, for example after fine-tuning:**

In [None]:
# Saving model
auto_model.save_pretrained('some_directory_on_pc')

In [2]:
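# Note: the outputs below come from a DistilBERT checkpoint (see the loading
# messages further down), so `checkpoint` was reassigned before this cell ran.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

data = ['Hugging Face is a nice website!',
        'I love to learn about tech-related topics!']

# padding=True pads the shorter sentence to the length of the longest one;
# truncation=True cuts sequences longer than the model's maximum length
inputs = tokenizer(data, padding=True, truncation=True, return_tensors='tf')
print(inputs)

# Without truncation the result is the same here, since both sentences are short
inputs = tokenizer(data, padding=True, return_tensors='tf')
print(inputs)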

{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101, 17662,  2227,  2003,  1037,  3835,  4037,   999,   102,
            0,     0,     0],
       [  101,  1045,  2293,  2000,  4553,  2055,  6627,  1011,  3141,
         7832,   999,   102]])>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
{'input_ids': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[  101, 17662,  2227,  2003,  1037,  3835,  4037,   999,   102,
            0,     0,     0],
       [  101,  1045,  2293,  2000,  4553,  2055,  6627,  1011,  3141,
         7832,   999,   102]])>, 'attention_mask': <tf.Tensor: shape=(2, 12), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])>}
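
To see what `padding = True` is doing, here is a plain-Python sketch that reproduces the `input_ids` and `attention_mask` shown above. The `pad_batch` helper is hypothetical, and using 0 as the padding ID is read off the output above; the real tokenizer handles this internally.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Unpadded IDs copied from the tokenizer output above.
batch = pad_batch([
    [101, 17662, 2227, 2003, 1037, 3835, 4037, 999, 102],
    [101, 1045, 2293, 2000, 4553, 2055, 6627, 1011, 3141, 7832, 999, 102],
])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

The attention mask is what lets the model ignore the padded positions, so padding does not change the meaning of the shorter sentence.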


In [5]:
model = TFAutoModel.from_pretrained(checkpoint)




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertModel: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing TFDistilBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFDistilBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertModel for predictions without further training.


In [7]:
outputs = model(inputs)
print(outputs.last_hidden_state.shape)

(2, 12, 768)


In [13]:
outputs[0]

<tf.Tensor: shape=(2, 12, 768), dtype=float32, numpy=
array([[[ 0.5692251 , -0.03803831,  0.32966405, ...,  0.6067085 ,
          0.98890364, -0.384846  ],
        [ 0.7754908 ,  0.09545754,  0.4908302 , ...,  0.76056325,
          1.0096803 , -0.30047473],
        [ 0.86226773,  0.01781805,  0.4306031 , ...,  0.6162475 ,
          0.98571277, -0.25459784],
        ...,
        [ 0.4978409 , -0.12197125,  0.2099949 , ...,  0.7613839 ,
          0.90970665, -0.2837196 ],
        [ 0.57905954, -0.04940146,  0.20574325, ...,  0.72581595,
          0.9981229 , -0.24939202],
        [ 0.4884028 ,  0.07165703,  0.26795864, ...,  0.78447723,
          0.8851079 , -0.22291045]],

       [[ 0.55294013, -0.16169162,  0.25375643, ...,  0.4124907 ,
          0.98647946, -0.506946  ],
        [ 1.0206703 ,  0.04772499,  0.03159535, ...,  0.4072253 ,
          0.95650345, -0.27522287],
        [ 0.8618861 ,  0.18678255,  0.33349323, ...,  0.39239886,
          0.86635435, -0.3706951 ],
        ...,


In [18]:
from transformers import TFAutoModelForSequenceClassification

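model2 = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
# Careful: this line calls the base `model`, not `model2`, so `outputs2` holds
# hidden states (a TFBaseModelOutput, as printed below).
# Use `model2(inputs)` to get the classification logits instead.
outputs2 = model(inputs)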

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [19]:
outputs2

TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(2, 12, 768), dtype=float32, numpy=
array([[[ 0.5692251 , -0.03803831,  0.32966405, ...,  0.6067085 ,
          0.98890364, -0.384846  ],
        [ 0.7754908 ,  0.09545754,  0.4908302 , ...,  0.76056325,
          1.0096803 , -0.30047473],
        [ 0.86226773,  0.01781805,  0.4306031 , ...,  0.6162475 ,
          0.98571277, -0.25459784],
        ...,
        [ 0.4978409 , -0.12197125,  0.2099949 , ...,  0.7613839 ,
          0.90970665, -0.2837196 ],
        [ 0.57905954, -0.04940146,  0.20574325, ...,  0.72581595,
          0.9981229 , -0.24939202],
        [ 0.4884028 ,  0.07165703,  0.26795864, ...,  0.78447723,
          0.8851079 , -0.22291045]],

       [[ 0.55294013, -0.16169162,  0.25375643, ...,  0.4124907 ,
          0.98647946, -0.506946  ],
        [ 1.0206703 ,  0.04772499,  0.03159535, ...,  0.4072253 ,
          0.95650345, -0.27522287],
        [ 0.8618861 ,  0.18678255,  0.33349323, ...,  0.39239886,
          0.8