# The Hugging Face Pipeline

First, why are we even using Hugging Face? Is there something better? What does Hugging Face offer us?

Hugging Face strives to be *the* hub for sharing LLMs. Because anyone can share LLMs, and big companies *do* share their LLMs here, we can use models with tens of billions of parameters with little fuss.

Hugging Face advertises:
- **It's easy**: "Downloading, loading, and using LLMs is done in a few lines of code"
  - We can see this is true by performing a task in two lines of code using `pipeline`.
- **Flexibility**: Every model is a standard PyTorch `nn.Module` or TensorFlow `tf.keras.Model` class.

We've already used the `pipeline` function. It abstracts away three steps:
1. **Preprocessing**: Tokenization and Encoding
2. **Model Input**: Inputting the preprocessed data into the model and outputting logits
3. **Postprocessing**: converting logits to probabilities, and getting the results

We will now discuss each of these, individually.
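
To make the three stages concrete before looking at the real library, here is a minimal pure-Python sketch. Everything in it (`TOY_VOCAB`, `TOY_WEIGHTS`, `toy_model`, the labels) is a hypothetical stand-in invented for illustration; it is not how `pipeline` is implemented, only the same preprocess, model, postprocess shape in miniature.

```python
import math

# Hypothetical toy vocabulary and per-token weights, invented for illustration.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "hate": 3, "llms": 4}
# One (negative, positive) weight pair per token id.
TOY_WEIGHTS = {0: (0.0, 0.0), 1: (0.0, 0.0), 2: (-1.0, 2.0), 3: (2.0, -1.0), 4: (0.0, 0.5)}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: split on whitespace and map each token to an integer id."""
    tokens = text.lower().split()
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]


def toy_model(input_ids):
    """Stage 2: turn the ids into one raw score (logit) per label."""
    negative = sum(TOY_WEIGHTS[i][0] for i in input_ids)
    positive = sum(TOY_WEIGHTS[i][1] for i in input_ids)
    return [negative, positive]


def postprocess(logits):
    """Stage 3: softmax the logits into probabilities and pick the best label."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


ids = preprocess("I love LLMs")
logits = toy_model(ids)
result = postprocess(logits)
print(ids, logits, result)
```

The real stages are far richer, but the data flow is the same: text becomes ids, ids become logits, logits become a labeled probability.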

## 1. Preprocessing (Tokenization and Encoding)

Neural networks cannot process raw text directly; it must first be encoded as numbers.

Plainly, preprocessing has three steps:
1. **Tokenization**: Splitting text (sentences, phrases, words) into words, subwords, or symbols, all of which are called **tokens**.
2. **Encoding**: Converting the tokens into numeric representations (integer ids).
3. Adding special tokens that the model might need, for example to separate sentences.
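
The three steps can be traced by hand with a small sketch. The token-to-id pairs below are copied from the `bert-base-cased` output shown later in this notebook; the tokenization step itself is hard-coded here (an assumption made to keep the sketch short), so only the lookup and the special tokens are actually computed.

```python
# Token -> id pairs copied from the bert-base-cased output in this notebook.
VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "I": 146, "love": 1567, "to": 1106, "learn": 3858,
    "about": 1164, "LL": 12427, "##Ms": 25866, "!": 106,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}

# Step 1, tokenization: hard-coded result for 'I love to learn about LLMs!'
tokens = ["I", "love", "to", "learn", "about", "LL", "##Ms", "!"]

# Step 2, encoding: look each token up in the vocabulary.
ids = [VOCAB[token] for token in tokens]

# Step 3, special tokens: BERT wraps the sequence in [CLS] ... [SEP].
input_ids = [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

# Decoding is the reverse lookup.
decoded = [ID_TO_TOKEN[token_id] for token_id in input_ids]

print(input_ids)
print(decoded)
```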

Note that a model's vocabulary is the set of tokens its tokenizer knows, built from the corpus the model was trained on.

Also, there are three main types of tokenization:
1. Word: (worst) Creates a huge vocabulary; if the vocabulary is trimmed, many words become unknown; moderate number of input tokens
   - "Dog" and "Dogs" get unrelated ids, even though "Dogs" is just the plural of "Dog".
2. Character: (better) (good for ideogram-based languages) Tokens are less meaningful, but the vocabulary is small; huge number of input tokens
3. Sub-word: (best) Words are broken into pieces to keep the vocabulary small
   - "Dogs" becomes [Dog, s], so it shares the token "Dog" with the word "Dog".

There are other ways to tokenize; which one applies depends on the model you want to use.
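
The three main strategies can be compared side by side with a rough sketch. The sub-word function below is a simplified greedy longest-match splitter in the style of WordPiece (the `##` prefix marks a continuation piece, as in the BERT output later on); `TOY_SUBWORD_VOCAB` and `TOY_WORD_VOCAB` are hypothetical vocabularies, and real tokenizers handle casing, punctuation, and much more.

```python
def word_tokenize(text):
    """Word-level: one token per whitespace-separated word."""
    return text.split()


def char_tokenize(text):
    """Character-level: one token per character."""
    return list(text)


def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word (WordPiece style)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies.
TOY_WORD_VOCAB = {"Dog"}
TOY_SUBWORD_VOCAB = {"Dog", "Cat", "##s", "LL", "##Ms"}

print(word_tokenize("Dogs")[0] in TOY_WORD_VOCAB)   # word-level: "Dogs" is unknown
print(char_tokenize("Dogs"))                        # many tiny tokens
print(subword_tokenize("Dogs", TOY_SUBWORD_VOCAB))  # shares "Dog" with the singular
print(subword_tokenize("LLMs", TOY_SUBWORD_VOCAB))
```

With only five sub-word entries, the toy vocabulary covers "Dog", "Dogs", "Cat", and "Cats", whereas a word-level vocabulary would need one entry per form.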

Each model has its own preprocessing (tokenization/encoding) method. You might ask "how does this make sense?"

It makes sense because the model learned what each numeric id means during training. If a word is not given the same encoding it had in training, then to the model it is effectively a different word. Imagine reading with two dictionaries whose definitions for the same words were completely different!
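
The dictionary analogy can be shown in a few lines. In this sketch, `vocab_a` and `vocab_b` are two hypothetical vocabularies over the same words; a model trained with `vocab_a` interprets ids through `vocab_a`, so ids produced with `vocab_b` are read as different words.

```python
# Two hypothetical vocabularies that assign different ids to the same tokens.
vocab_a = {"I": 0, "love": 1, "dogs": 2, "cats": 3}  # used during training
vocab_b = {"cats": 0, "dogs": 1, "I": 2, "love": 3}  # a mismatched tokenizer


def encode(tokens, vocab):
    """Map tokens to ids with the given vocabulary."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Map ids back to tokens with the given vocabulary."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return [id_to_token[token_id] for token_id in ids]


tokens = ["I", "love", "dogs"]
ids_a = encode(tokens, vocab_a)
ids_b = encode(tokens, vocab_b)

# What a model trained with vocab_a effectively "sees" in each case.
seen_with_matching_vocab = decode(ids_a, vocab_a)
seen_with_wrong_vocab = decode(ids_b, vocab_a)

print(ids_a, seen_with_matching_vocab)
print(ids_b, seen_with_wrong_vocab)
```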

Hugging Face allows us to easily use a model's original preprocessing techniques through the `AutoTokenizer` class.

*Def'n.* **checkpoint** - The saved weights of a trained model, loaded as the parameters of a given architecture (for example, `bert-base-cased`).

### Tokenizer Code

We can get the tokenizer by explicitly using the BERT tokenizer, or using the AutoTokenizer. Note how `tokenizer` and `tokenizer2` call the same pretrained tokenizer and therefore have the exact same output.

**This is the long way to tokenize** using `transformers`, so we can show each step (tokenization, then encoding).

In [1]:
from transformers import BertTokenizer, AutoTokenizer
import tensorflow as tf
checkpoint = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(checkpoint)
tokenizer2 = AutoTokenizer.from_pretrained(checkpoint)

sequence = 'I love to learn about LLMs!'

# Tokenization
tokens1 = tokenizer.tokenize(sequence)
tokens2 = tokenizer2.tokenize(sequence)
print(tokens1)
print(tokens2, end = '\n\n')

# Encoding
ids1 = tokenizer.convert_tokens_to_ids(tokens1)
ids2 = tokenizer2.convert_tokens_to_ids(tokens2)
print(ids1)
print(ids2, end = '\n\n')

# Both tokenizers produce the same output.
# Calling the tokenizer directly is the shortcut for the steps above;
# it also adds the special tokens [CLS] (101) and [SEP] (102).
print(tokenizer('I love to learn about LLMs!'))
print(tokenizer2('I love to learn about LLMs!'))


['I', 'love', 'to', 'learn', 'about', 'LL', '##Ms', '!']
['I', 'love', 'to', 'learn', 'about', 'LL', '##Ms', '!']

[146, 1567, 1106, 3858, 1164, 12427, 25866, 106]
[146, 1567, 1106, 3858, 1164, 12427, 25866, 106]

{'input_ids': [101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The **output for both tokenizers is the exact same; one is just being more explicit.**

In [17]:
# Save if you want the Tokenizer's data locally
some_file_name = 'BertTokenizerFiles' # could be any name
tokenizer.save_pretrained(some_file_name)

('BertTokenizerFiles\\tokenizer_config.json',
 'BertTokenizerFiles\\special_tokens_map.json',
 'BertTokenizerFiles\\vocab.txt',
 'BertTokenizerFiles\\added_tokens.json')

## 2. Models

At first, I was going to call this section "Modeling", then I realized it's not actually modeling because we're using pretrained models; I'll use the Hugging Face section name "Models".

First, I'll initialize a model from a configuration, but we won't use it because it isn't trained; it's just for completeness.

In [20]:
# Initialize an untrained model with BERT architecture
from transformers import BertConfig, TFBertModel
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

config = BertConfig() # build config

model = TFBertModel(config) # Build model

In [23]:
print(config) # Outputs model hyperparameters
print(model) # Outputs the class name information

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

<transformers.models.bert.modeling_tf_bert.TFBertModel object at 0x000001EDD9565B10>


It's good to see what's going on, but we won't use this model because it's untrained. As is taught in the first part of the Hugging Face course, training a huge model can take days or weeks and cost a ton of money.

Again, I'll show that the two classes (`TFAutoModel` and `TFBertModel`) produce the same output.

In [38]:
from transformers import TFAutoModel, TFBertModel

auto_model = TFAutoModel.from_pretrained(checkpoint)
bert_model = TFBertModel.from_pretrained(checkpoint)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

In [4]:
import numpy as np
sequences = [sequence,
            'actually, I love to learn about anything tech related!',
            'and I love to learn about languages',
            'NLP is perfect!']

# Just use the tokenizer, it saves having to pad manually, which takes maybe 4-5 more lines
inputs = tokenizer(sequences, padding = True, truncation=True, return_tensors = 'tf')
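
For reference, the manual padding that `padding = True` saves us from looks roughly like the sketch below. The first sequence reuses the real ids from the tokenizer output above; the second uses made-up ids (an assumption, purely to have a shorter sequence). The pad id of 0 matches the `pad_token_id` printed in the BERT config. Truncation is not covered here.

```python
PAD_ID = 0  # matches "pad_token_id": 0 in the BERT config printed earlier


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 146, 1567, 1106, 3858, 1164, 12427, 25866, 106, 102],  # real ids from above
    [101, 7, 8, 9, 102],                                         # hypothetical ids
]
padded = pad_batch(batch)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Padding is what lets sequences of different lengths be stacked into one rectangular tensor, which is why the batch dimension and sequence length appear together in the model's output shape.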

In [None]:
outputs = auto_model(inputs)
outputs2 = bert_model(inputs)

In [66]:
# Both classes load the same checkpoint, so this should print True
print(np.allclose(outputs[0].numpy(), outputs2[0].numpy()))

In [67]:
print(outputs.last_hidden_state.shape)

(4, 13, 768)


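**This output was shown for completeness, but to perform a task we need the proper model head for that task** (a task-specific layer on top of the base model, not an attention head). In the next cells, the two classes load identical configurations. Their predictions, however, are not identical: `bert-base-cased` has no trained sequence-classification head, so each model's `classifier` layer is randomly initialized, as the warnings below state.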

In [7]:
from transformers import TFAutoModelForSequenceClassification, TFBertForSequenceClassification

auto_clf = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
bert_clf = TFBertForSequenceClassification.from_pretrained(checkpoint)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
auto_clf.config

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

In [15]:
bert_clf.config

BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.36.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

In [18]:
out1 = auto_clf(inputs)
out2 = bert_clf(inputs)

In [19]:
print(tf.math.softmax(out1.logits, axis = 1))
print()
print(tf.math.softmax(out2.logits, axis = 1))

tf.Tensor(
[[0.48359364 0.51640636]
 [0.51789117 0.48210883]
 [0.49984252 0.50015754]
 [0.5000088  0.49999118]], shape=(4, 2), dtype=float32)

tf.Tensor(
[[0.6227092  0.37729082]
 [0.6379381  0.3620619 ]
 [0.6230812  0.37691882]
 [0.6356983  0.3643017 ]], shape=(4, 2), dtype=float32)
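
The two sets of probabilities differ because, as the warnings above report, `classifier.weight` and `classifier.bias` are newly initialized in each model. The sketch below illustrates the effect with a tiny linear head in pure Python. `HIDDEN_SIZE`, `features`, and the seeds are hypothetical, and drawing weights from a Gaussian with the config's `initializer_range` as standard deviation is an assumption about the scheme, not a claim about the library's exact initializer.

```python
import random

HIDDEN_SIZE = 4           # the real BERT hidden size is 768; kept tiny here
NUM_LABELS = 2
INITIALIZER_RANGE = 0.02  # value of "initializer_range" in the printed config


def init_head(seed):
    """Create a randomly initialized linear classification head."""
    rng = random.Random(seed)
    weight = [
        [rng.gauss(0.0, INITIALIZER_RANGE) for _ in range(HIDDEN_SIZE)]
        for _ in range(NUM_LABELS)
    ]
    bias = [0.0] * NUM_LABELS
    return weight, bias


def apply_head(head, features):
    """Compute one logit per label: weight @ features + bias."""
    weight, bias = head
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


# Hypothetical output of the shared, pretrained base model for one sentence.
features = [0.5, -1.0, 0.25, 2.0]

logits_a = apply_head(init_head(seed=1), features)        # first model's head
logits_b = apply_head(init_head(seed=2), features)        # second model's head
logits_a_again = apply_head(init_head(seed=1), features)  # same seed, same head

print(logits_a)
print(logits_b)
```

The base model and its input are identical in both cases; only the untrained head differs. Fine-tuning on a downstream task is what gives the head meaningful weights.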


**To save a model, for example after fine-tuning:**

In [None]:
# Saving model
auto_model.save_pretrained('some_directory_on_pc')