# What happens inside the pipeline function?
* Raw text goes into the tokenizer
* The tokenizer passes input IDs to the model
* The model produces logits
* Postprocessing turns the logits into predictions

## Tokenization
* Text is split into tokens
* Special tokens are added if necessary (beginning and end of sentence, etc)
* Tokenizer matches each token to its unique ID in the vocabulary of the pre-trained model (maps each token to an integer)
* Any additional inputs that may be useful to the model are added (for example, the attention mask)
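
The steps above can be sketched in a few lines of plain Python. This is only a toy illustration: `TOY_VOCAB`, its IDs, and `toy_encode` are made up for this note and do not match any real checkpoint or the actual tokenizer implementation.

```python
# Toy vocabulary; the IDs are invented for illustration only.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "hate": 5, "this": 6, "!": 7}


def toy_encode(text):
    # 1. Split the text into tokens (lowercase, and separate "!" from words).
    tokens = text.lower().replace("!", " !").split()
    # 2. Add the special tokens that mark the start and end of the sequence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3. Map each token to its integer ID, falling back to [UNK] if it is missing.
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 4. Add an extra input: the attention mask (all ones, since nothing is padded).
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


encoded = toy_encode("I hate this!")
print(encoded)
```

The returned dictionary has the same two keys as the real tokenizer output used in the cells below, which is why it can be unpacked straight into a model call there.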

### AutoTokenizer class can load the tokenizer for any checkpoint

In [3]:
from transformers import AutoTokenizer

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    'I have been waiting for a HuggingFace course my whole life.',
    'I hate this so much!',
]

inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors='pt')

# padding=True: Since the two sentences aren't the same length, the shorter one is padded.
# truncation=True: Any sentence longer than the maximum that the model can handle is truncated.
# return_tensors='pt': Return a PyTorch tensor

# inputs is a dictionary with two keys ('input_ids' and 'attention_mask')
# attention_mask indicates what padding has been applied so the model does not pay attention to it

# once you have the tokenizer, you can directly pass your sentences to it 
# and you will get back a dict that is ready to feed into a model

In [5]:
inputs['input_ids']

tensor([[  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
          2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,
             0,     0,     0,     0,     0]])

In [7]:
inputs['attention_mask']

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])

In [8]:
inputs['input_ids'][0]

tensor([  101,  1045,  2031,  2042,  3403,  2005,  1037, 17662, 12172,  2607,
         2026,  2878,  2166,  1012,   102])

### AutoModel class loads a model without its pretraining head

In [10]:
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight', 'classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


torch.Size([2, 15, 768])


In [None]:
# it outputs hidden states, also known as features

# 2 is the batch size
# 15 is the sequence length
# 768 is the hidden size

# for each model input, we get a high-dimensional vector representing the 
# contextual understanding of that input by the Transformer model

### Model Adaptation Heads

The model adaptation heads (also known as model heads) take the high-dimensional hidden-state vectors as input and project them onto a different dimension.  
They come in different forms, each targeting a specific task: language modeling heads, question answering heads, sequence classification heads, and so on.  
The output of the Transformer model is sent directly to the model head to be processed. 
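
As a rough picture of what "project onto a different dimension" means, here is a minimal pure-Python sketch of a classification head. It assumes the head is a single linear layer applied to the first token's vector; the real DistilBERT head has more layers, and the sizes, random weights, and the name `classification_head` are illustrative only.

```python
import random

HIDDEN_SIZE = 4   # stands in for the real hidden size (768 above)
NUM_LABELS = 2    # e.g. NEGATIVE / POSITIVE

random.seed(0)

# Fake hidden states with shape [batch=2, sequence_length=3, hidden=4].
hidden_states = [
    [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(3)]
    for _ in range(2)
]

# Randomly initialized head parameters: one weight row per label.
weight = [[random.uniform(-1, 1) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS


def classification_head(hidden_states):
    logits = []
    for sequence in hidden_states:
        first_token = sequence[0]  # use the first token's vector as a summary
        # Linear projection: hidden_size values in, num_labels values out.
        logits.append([
            sum(w * h for w, h in zip(row, first_token)) + b
            for row, b in zip(weight, bias)
        ])
    return logits


logits = classification_head(hidden_states)
print(len(logits), len(logits[0]))  # batch size, number of labels
```

The point is the change of shape: `[batch, sequence_length, hidden_size]` goes in and `[batch, num_labels]` comes out.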


### Each AutoModelForXxx class loads a model for a specific task

In [14]:
from transformers import AutoModelForSequenceClassification

# we need a model with a sequence classification head (to be able to classify the sentences as positive or negative)

checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

print(outputs.logits)


tensor([[-1.3782,  1.4346],
        [ 4.1692, -3.3464]], grad_fn=<AddmmBackward0>)


### To go from logits to probabilities we apply SoftMax

In [15]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[5.6637e-02, 9.4336e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)


In [None]:
# Now they are probabilities that are positive and sum up to 1

In [16]:
# Get the labels
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [None]:
# first sentence: NEGATIVE 5.7%, POSITIVE 94.3%
# second sentence: NEGATIVE 99.9%, POSITIVE 0.05%
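
To make the postprocessing step concrete, here is a small pure-Python sketch of softmax followed by a label lookup, using the logits printed above. The helper name `softmax` and the loop are illustrative; the notebook itself relies on `torch.nn.functional.softmax`.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Logits and label mapping copied from the cells above.
logits = [[-1.3782, 1.4346], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    print(id2label[best], f"{probs[best]:.2%}")
```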

# Models

#### Instantiate a Transformer Model
The AutoModel API allows you to instantiate a pretrained model from any checkpoint.

In [19]:
from transformers import AutoModel

bert_model = AutoModel.from_pretrained('bert-base-cased')
print(type(bert_model))

gpt_model = AutoModel.from_pretrained('gpt2')
print(type(gpt_model))

bart_model = AutoModel.from_pretrained('facebook/bart-base')
print(type(bart_model))




Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<class 'transformers.models.bert.modeling_bert.BertModel'>
<class 'transformers.models.gpt2.modeling_gpt2.GPT2Model'>



<class 'transformers.models.bart.modeling_bart.BartModel'>


AutoModel.from_pretrained() takes a checkpoint name or a local folder, downloads the config file,  
instantiates the matching config class, and then uses that config to build the model and load its weights.  
The config for a model is a blueprint that contains all the info necessary to create the model architecture.

In [22]:
from transformers import BertConfig, BertModel

# Building the config
bert_config = BertConfig.from_pretrained('bert-base-cased')
print(bert_config)

# Building the model from the config
model = BertModel(bert_config)

BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.22.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}



In [24]:
# Creating a model from the default configuration initializes it with random values
# Don't do this

from transformers import BertConfig, BertModel

config = BertConfig()
model = BertModel(config)

# Model is randomly initialized!

In [25]:
# Instead, use the from_pretrained method

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# The weights get downloaded and cached to ~/.cache/huggingface/transformers

### Saving a Model

In [None]:
# Keyword arguments let you change any part of the model's configuration
# To save a model, use the save_pretrained method

In [28]:
model.save_pretrained('tmp')

# This saves two files in the 'tmp' folder: config.json, which describes the model, and pytorch_model.bin, which is needed to recreate it.
# pytorch_model.bin is the state dictionary - it contains all the model weights (your model's parameters)

# Tokenizers

In [None]:
# Tokenizers translate text into numerical data that can be processed by the model.

## Word-based

In [None]:
# Split the text on spaces or punctuation
# Each word has a specific ID 

### Disadvantages:
* very similar words get entirely unrelated representations 
  (cat vs cats have two different IDs and embeddings, so the model doesn't know up front that these words are close)
* the vocabulary size (total number of words) can end up very large, which makes the model very large
* we can limit the number of words we add to the vocabulary (take the 10K most frequent words, for example)
* however, out-of-vocabulary words then result in a loss of information
* to avoid these flaws, try a character-based tokenizer
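
A toy word-based tokenizer shows both problems at once. This is a sketch under simple assumptions: the tiny corpus, the `[UNK]` token name, and the helpers `build_word_vocab` and `word_encode` are all made up for illustration.

```python
import re


def split_words(text):
    # Split on whitespace and keep punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_word_vocab(corpus, max_size):
    counts = {}
    for sentence in corpus:
        for word in split_words(sentence):
            counts[word] = counts.get(word, 0) + 1
    # Keep only the most frequent words (ties broken alphabetically).
    most_frequent = sorted(counts, key=lambda w: (-counts[w], w))[:max_size]
    vocab = {"[UNK]": 0}
    for word in most_frequent:
        vocab[word] = len(vocab)
    return vocab


def word_encode(text, vocab):
    # Any word outside the vocabulary collapses to the same [UNK] ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in split_words(text)]


corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_word_vocab(corpus, max_size=3)
print(vocab)
print(word_encode("the cats sat", vocab))
```

Here "cats" is not in the vocabulary, so it becomes `[UNK]` even though "cat" is known: the information that the two words are related is lost.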

## Character-based

* Helps to reduce the number of unknown tokens 
* Vocabularies are slimmer since you use a character-based vocabulary instead of a word-based one (so 256 vs 170K+ vocab size)
* Even words unseen during tokenizer training can still be tokenized
* So there are fewer out-of-vocabulary words (misspelled words can be retained instead of discarded)

### Disadvantages:
* characters don't hold as much information as words
* sequences are translated into a very large number of tokens to be processed by the model
* since the model can only handle a limited number of tokens, less input text fits
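
The trade-off can be seen with a toy character-based tokenizer. The corpus and the helper names `build_char_vocab` and `char_encode` are illustrative assumptions, not part of any library.

```python
def build_char_vocab(corpus):
    # One vocabulary entry per distinct character, plus an unknown token.
    vocab = {"[UNK]": 0}
    for ch in sorted(set("".join(corpus))):
        vocab[ch] = len(vocab)
    return vocab


def char_encode(text, vocab):
    return [vocab.get(ch, vocab["[UNK]"]) for ch in text]


corpus = ["the cat sat", "the dog ran"]
vocab = build_char_vocab(corpus)

text = "the cats sat"  # "cats" never appears in the corpus
ids = char_encode(text, vocab)

print("vocab size:", len(vocab))
print("unknown tokens:", ids.count(vocab["[UNK]"]))
print(len(text.split()), "words ->", len(ids), "character tokens")
```

The unseen word "cats" is encoded without any unknown token, but three words already cost twelve tokens.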

To get the best of both worlds, we can use a third technique that combines the two approaches.

## Subword-based

Split the raw text into subwords.  
'cats' is split into 'cat' and 's'
* Frequently used words should not be split into smaller subwords
* Rare words SHOULD be split into smaller subwords
* The subwords still carry a lot of semantic meaning
* The tokenizer can also mark which tokens start a word and which continue one
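
One common way to do this splitting is greedy longest-match-first lookup with a `##` prefix on continuation pieces, in the style of BERT's WordPiece. The sketch below assumes a tiny hand-written vocabulary; `TOY_SUBWORD_VOCAB` and `subword_tokenize` are hypothetical names, and real tokenizers learn their vocabulary from data.

```python
# Hand-written toy vocabulary; "##" marks a piece that continues a word.
TOY_SUBWORD_VOCAB = {"[UNK]", "cat", "token", "jogging", "##s", "##ize"}


def subword_tokenize(word, vocab):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at the start of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched, so give up on the whole word
        pieces.append(piece)
        start = end
    return pieces


for word in ["jogging", "cats", "tokenize", "dog"]:
    print(word, "->", subword_tokenize(word, TOY_SUBWORD_VOCAB))
```

A frequent word such as "jogging" stays whole, while "cats" and "tokenize" are split into a known stem plus a continuation piece.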

# Loading and saving tokenizers

In [1]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [2]:
# Can also use AutoTokenizer to get the correct tokenizer class based on the checkpoint name

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [3]:
tokenizer("Using a Transformer network is simple")

{'input_ids': [101, 7993, 170, 13809, 23763, 2443, 1110, 3014, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [4]:
tokenizer.save_pretrained('tmp')

('tmp/tokenizer_config.json',
 'tmp/special_tokens_map.json',
 'tmp/vocab.txt',
 'tmp/added_tokens.json',
 'tmp/tokenizer.json')

# Encoding

In [8]:
# The tokenizer takes text as inputs and outputs numbers
# raw text -> tokens -> special tokens -> input IDs

from transformers import AutoTokenizer

# split the text into tokens
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("Let's go jogging in the park for exercise to tokenize!")
print(tokens)
# different tokenizers use different markers for word boundaries (here, ## marks a piece that continues the previous word)

input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

['let', "'", 's', 'go', 'jogging', 'in', 'the', 'park', 'for', 'exercise', 'to', 'token', '##ize', '!']
[2292, 1005, 1055, 2175, 28233, 1999, 1996, 2380, 2005, 6912, 2000, 19204, 4697, 999]


In [15]:
# lastly the tokenizer adds special tokens that the model expects
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs['input_ids'])

# Use the decode method to see the final output 
inputs = tokenizer("Let's go jogging in the park for exercise to tokenize!")
print(tokenizer.decode(inputs['input_ids']))

# [CLS] and [SEP] are the special tokens used by this model; other models use different ones

[101, 2292, 1005, 1055, 2175, 28233, 1999, 1996, 2380, 2005, 6912, 2000, 19204, 4697, 999, 102]
[CLS] let's go jogging in the park for exercise to tokenize! [SEP]
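
Part of what decoding does is glue `##` continuation pieces back onto the previous token. The sketch below shows only that merging step, with a made-up helper `merge_wordpieces`; the real `decode` method also maps IDs back to tokens and tidies the spacing around punctuation, which this sketch does not attempt.

```python
def merge_wordpieces(tokens):
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without a space.
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# Tokens from the cell above, with the special tokens added.
tokens = ["[CLS]", "let", "'", "s", "go", "jogging", "in", "the", "park",
          "for", "exercise", "to", "token", "##ize", "!", "[SEP]"]
merged = merge_wordpieces(tokens)
print(merged)
```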


# Handling Multiple Sequences

In [19]:
# How to batch inputs together?

# Usually sentences have different lengths
# BUT you can't build a tensor from rows of different lengths, so you need to pad the shorter sentence

# the model has a specific padding ID that you need to use

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenizer.pad_token_id

# Padding and attention layers go hand in hand
# Without a mask, the attention layers would include the padding tokens in the context they look at for each token in the sentence.

# So we have to pass an attention mask to tell the attention layers to ignore the padding.

# Use padding=True to tell the tokenizer to prepare the inputs with padding and the proper attention mask

0
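
What `padding=True` produces can be sketched by hand. The helper `pad_batch` below is a hypothetical stand-in, not the tokenizer's own code, and the input IDs are only illustrative.

```python
def pad_batch(batch_ids, pad_token_id=0):
    # Pad every sequence to the length of the longest one in the batch.
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1045, 5223, 102], [101, 102]], pad_token_id=0)
print(batch["input_ids"])
print(batch["attention_mask"])
```

After padding, every row has the same length, so the batch can be turned into a rectangular tensor.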

Transformer models expect multiple sentences by default.  
Batching is the act of sending multiple sentences through the model all at once.  
<br>
There is a limit to the lengths of sequences that Transformer models can handle.  
Most can handle up to 512 or 1024 tokens max.  
<br>
Some models specialize in handling very long sequences (like Longformer).

Otherwise, you should truncate your sequences by doing 
> sequence = sequence[:max_sequence_length]

# Wrap Up

In [22]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [23]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)