This notebook and the ones that follow dive into the Hugging Face Transformers philosophy and how the library is built.

https://huggingface.co/docs/transformers/philosophy

### Easy & Fast to use

> Three classes for any model: configuration, model and preprocessor (tokenizer for NLP, image_processor for vision, feature_extractor for audio, and processor for multimodal inputs). All are initialized with the .from_pretrained() method, which pulls the model data from the Hugging Face Hub.

> pipeline() for inference and Trainer for training the models

### Provide SOTA models that are close in performance to the original models:

> One example of each architecture is provided, that reproduces the results of the model authors. 

> The code stays **close** to the original codebase, meaning some of it may not be idiomatic PyTorch

> Provides API access to **Full Hidden States** and **attention weights** of the model

In [1]:
# Looking at the attention masks in Transformers

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [2]:
seq_a = "This is a sentence of 3 words"
seq_b = "This is a sentence of more than 3 words, providing lot more information"

In [3]:
encode_a = tokenizer(seq_a)['input_ids']
encode_b = tokenizer(seq_b)['input_ids']
len(encode_a), len(encode_b)  # (9, 16)  # Have different lengths

(9, 16)

In [5]:
# How the tokenizer output looks, with a single input
tokenizer(seq_a)

{'input_ids': [101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [6]:
tokenizer([seq_a, seq_b], padding=True)  # shorter sequences are padded to the longest one; attention_mask is 0 at padded positions

{'input_ids': [[101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 5650, 1104, 1167, 1190, 124, 1734, 117, 3558, 1974, 1167, 1869, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

In [7]:
encoded = tokenizer([seq_a, seq_b], padding=True) # The tokens and attention masks are padded where required

In [11]:
tokenizer.decode(encoded['input_ids'][0])

'[CLS] This is a sentence of 3 words [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]'
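What `padding=True` does to a batch can be approximated in plain Python. This is only a minimal sketch, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and `pad_id=0` is taken from the output above, where id 0 decodes to `[PAD]`.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad token ids
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the tokenizer outputs above
ids_a = [101, 1188, 1110, 170, 5650, 1104, 124, 1734, 102]
ids_b = [101, 1188, 1110, 170, 5650, 1104, 1167, 1190, 124, 1734, 117, 3558, 1974, 1167, 1869, 102]

padded = pad_batch([ids_a, ids_b])
print(padded["attention_mask"][0])
```

The mask lets the model ignore the padded positions, so the short sequence gives the same result as it would on its own.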

#### Some interesting / important terms:

backbone: the model / network that outputs raw hidden states. It is usually connected to a **head**, which takes those hidden states and turns them into predictions. There are different heads for different tasks:
    
    > LM Head
    > DoubleHeads
    > Question Answering
    > Sequence Classification
    > Token Classification

CTC / Connectionist Temporal Classification: an algorithm that lets a model learn without knowing exactly how the inputs and outputs are aligned. It's commonly used in **speech recognition**.
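One visible piece of CTC is how per-frame predictions are collapsed into an output sequence: repeated symbols are merged, then blanks are dropped. The sketch below shows only this greedy decoding step, not the training loss. The blank symbol `_` and the helper name `ctc_collapse` are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real models reserve a dedicated token id


def ctc_collapse(frames):
    """Merge consecutive repeats, then remove blanks (greedy CTC decoding)."""
    collapsed = [symbol for symbol, _ in groupby(frames)]
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


# One predicted symbol per audio frame; the blank separates the two real "l"s
frames = list("hh_e_ll_lloo")
print(ctc_collapse(frames))
```

Because many frame-level sequences collapse to the same text, the model never needs an explicit frame-to-character alignment.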

convolution: a neural-network layer in which each patch of the input is multiplied element-wise by a smaller kernel matrix, and the products are summed into one entry of a new matrix
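The multiply-and-sum operation can be sketched with nested lists. This is a minimal, unoptimized illustration with no padding and stride 1. As in most deep learning libraries, the kernel is not flipped (strictly speaking a cross-correlation). `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum each patch."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv2d_valid(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```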

decoder models are auto-regressive: they are pretrained with causal language modeling, reading the text in order and predicting the next word. A mask hides the future tokens at each timestep.
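The idea of hiding future tokens can be illustrated with a lower-triangular mask and the (context, target) pairs it implies. This is a sketch under simple assumptions, not any model's actual code. `causal_mask` and `next_token_pairs` are hypothetical helper names.

```python
def causal_mask(seq_len):
    """Row i marks the positions token i may attend to: itself and everything before it."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(tokens):
    """Training pairs for next-token prediction: (visible prefix, token to predict)."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


mask = causal_mask(3)
pairs = next_token_pairs(["This", "is", "a"])
print(mask)   # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(pairs)  # [(['This'], 'is'), (['This', 'is'], 'a')]
```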

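encoder models are auto-encoding: they turn an input into a numerical representation (an embedding). They are often pretrained with Masked Language Modeling, which masks parts of the input and forces the model to build more meaningful representations.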

labels are an optional argument that lets the model compute the loss itself. The base models don't accept labels, as they only output features.
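The "loss only when labels are passed" behaviour can be mimicked with a toy classification head. This sketch assumes a mean cross-entropy loss and is not the library's implementation. `classification_head` and `cross_entropy` are hypothetical names.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class (numerically stable log-softmax)."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def classification_head(logits_batch, labels=None):
    """Always return logits; add a mean loss only when labels are provided."""
    output = {"logits": logits_batch}
    if labels is not None:
        losses = [cross_entropy(logits, y) for logits, y in zip(logits_batch, labels)]
        output["loss"] = sum(losses) / len(losses)
    return output


without_labels = classification_head([[0.0, 0.0]])
with_labels = classification_head([[0.0, 0.0]], labels=[1])
print("loss" in without_labels, with_labels["loss"])
```

With two equal logits the model is maximally unsure, so the loss equals `log(2)`.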

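position_ids tell a Transformer where each token sits in the sequence, since the architecture has no built-in notion of order. They are optional: if omitted, absolute position IDs are created automatically in the range [0, config.max_position_embeddings - 1]. Some models use other positional embeddings instead, such as sinusoidal or relative position embeddings.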
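A minimal sketch of how default absolute position IDs could be generated and range-checked. `default_position_ids` is a hypothetical helper, and the default of 512 is an assumption matching the usual `max_position_embeddings` of BERT-base checkpoints.

```python
def default_position_ids(seq_len, max_position_embeddings=512):
    """Absolute position ids 0..seq_len-1, valid only within the embedding table."""
    if seq_len > max_position_embeddings:
        raise ValueError(
            f"Sequence length {seq_len} exceeds max_position_embeddings={max_position_embeddings}"
        )
    return list(range(seq_len))


# The 9-token encoding of seq_a gets positions 0 through 8
position_ids = default_position_ids(9)
print(position_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

This is why inputs longer than the model's maximum length have to be truncated or split before they reach the model.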

self-supervised learning: the model creates its own learning objective and learns from **unlabeled data**. Masked language modeling is one example of self-supervised learning.
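The sketch below shows how masked language modeling turns unlabeled text into (input, label) pairs. The positions are fixed by hand to keep the example deterministic; in practice they are sampled at random (BERT masks roughly 15% of tokens). `mask_tokens` is a hypothetical helper.

```python
MASK = "[MASK]"


def mask_tokens(tokens, positions):
    """Hide the tokens at `positions`; the hidden tokens become the labels to predict."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not scored
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = ["This", "is", "a", "sentence", "of", "3", "words"]
masked, labels = mask_tokens(tokens, positions=[3])
print(masked)  # ['This', 'is', 'a', '[MASK]', 'of', '3', 'words']
print(labels)  # [None, None, None, 'sentence', None, None, None]
```

No human annotation is involved: the supervision signal comes from the text itself.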

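ZeRO: Zero Redundancy Optimizer, a parallelism technique that shards tensors across devices, somewhat like tensor parallelism. Each full tensor is reconstructed just in time for the forward or backward computation, so the model itself does not need to be modified.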
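The shard-then-reconstruct idea can be shown with a toy example in plain Python. Real ZeRO implementations (for example in DeepSpeed) do this with collective communication across GPUs. The lists and the helper names `shard` and `all_gather` here are only stand-ins for illustration.

```python
def shard(params, world_size):
    """Split a flat parameter list into one contiguous shard per rank."""
    shard_size = -(-len(params) // world_size)  # ceiling division
    return [params[rank * shard_size:(rank + 1) * shard_size] for rank in range(world_size)]


def all_gather(shards):
    """Rebuild the full parameter list from every rank's shard."""
    return [value for rank_shard in shards for value in rank_shard]


params = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = shard(params, world_size=2)  # each "rank" stores only its own slice
full = all_gather(shards)             # reconstructed only when a computation needs it
print(shards)  # [[0.1, 0.2, 0.3], [0.4, 0.5]]
```

Between computations each rank keeps only its own shard, which is where the memory saving comes from.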



In [12]:
# Domain and the models segregated into different architectures

computer_vision = {
    "encoder": ['ViT','Swin', 'SegFormer', 'BEiT'],
    "decoder": ['ImageGPT'],
    "encoder-decoder": ['DETR'],
    "convolution": ['ConvNeXT']
}

NLP = {
    "encoder": ["BERT", "RoBERTa", "ALBERT", "DistilBERT", "DeBERTa", "Longformer"],
    "decoder": ["GPT-2", "XLNet", "GPT-J", "OPT", "BLOOM"],
    "encoder-decoder": ["BART", "Pegasus", "T5", ],
}

Audio = {
    "encoder": ["Wav2Vec2", "Hubert"],
    "encoder-decoder": ["Speech2Text", "Whisper"]
}

MultiM = {
    "encoder": ["VisualBERT", "ViLT", "CLIP", "OWL-ViT"],
    "encoder-decoder": ["TrOCR", "Donut"]
}

Reinforcement = {
    "decoder": ["Trajectory transformer", "Decision transformer"]
}

In [13]:
# Tokenizers
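# Tokenization moved from rule-based and word-level, to character-level, and settled on subword algorithms.
# Subwords keep the vocabulary at a reasonable size while still allowing meaningful representations to be learned.
# spaCy and Moses are rule-based tokenizers; XLM and FlauBERT are models that rely on rule-based pre-tokenization.
Rule_based = ['spacy', 'moses', 'XLM', 'FlauBERT',]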

sub_word = ['Byte-pair-encoding', 'WordPiece', 'Unigram', 'SentencePiece']
# Need to locate the data on models and their respective tokenisation algorithms

space_based = ["GPT-2", "RoBERTa"]

In [None]:
tokenizer_algos = {
    "byte_pair": {
        "base": ['GPT'],
        "byte_level": ['GPT-2'],
        "intro": "https://arxiv.org/abs/1508.07909"
    },
    "WordPiece":{
        "base": ['BERT', 'DistilBERT', 'Electra'],
        "intro": "https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf"
    },
    "Unigram": {
        "base": [],
        "intro": "https://arxiv.org/pdf/1804.10959.pdf"
    },
    "SentencePiece":{
        "base": ["ALBERT", "XLNet", "Marian", "T5"],
        "intro": "https://arxiv.org/pdf/1808.06226.pdf"
    }
}

In [14]:
# working of BertTokenizer
tokens = tokenizer.tokenize("I have a great Nvidia 4070 GPU")
tokens  
# '##' marks a token that continues the previous token, so it attaches to it without a space

['I', 'have', 'a', 'great', 'N', '##vid', '##ia', '40', '##70', 'GP', '##U']
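The `##` convention makes it straightforward to glue WordPiece tokens back into words. The sketch below is a plain-Python illustration, not the tokenizer's own decoding logic. `merge_wordpieces` is a hypothetical helper.

```python
def merge_wordpieces(tokens):
    """Attach every '##' continuation piece to the word started before it."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # drop the '##' marker and append
        else:
            words.append(token)     # a new word starts here
    return words


# Tokens copied from the output above
pieces = ['I', 'have', 'a', 'great', 'N', '##vid', '##ia', '40', '##70', 'GP', '##U']
words = merge_wordpieces(pieces)
print(" ".join(words))  # I have a great Nvidia 4070 GPU
```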

In [16]:
from transformers import XLNetTokenizer

xlnet_tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
xlnet_tokenizer.tokenize("Do you love your GPU very much? I do.")

['▁Do',
 '▁you',
 '▁love',
 '▁your',
 '▁G',
 'PU',
 '▁very',
 '▁much',
 '?',
 '▁I',
 '▁do',
 '.']
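In the output above, the `▁` character (U+2581) marks where a space preceded the piece, so the original text can be recovered by concatenating the pieces and turning the marker back into a space. A minimal sketch follows. `detokenize_sentencepiece` is a hypothetical helper, not a library function.

```python
SPACE_MARKER = "\u2581"  # the '▁' character used in the pieces above


def detokenize_sentencepiece(pieces):
    """Concatenate the pieces and turn every marker back into a space."""
    return "".join(pieces).replace(SPACE_MARKER, " ").strip()


# Pieces copied from the XLNet tokenizer output above
xlnet_pieces = ['▁Do', '▁you', '▁love', '▁your', '▁G', 'PU',
                '▁very', '▁much', '?', '▁I', '▁do', '.']
text = detokenize_sentencepiece(xlnet_pieces)
print(text)  # Do you love your GPU very much? I do.
```

Because the space is kept as an ordinary symbol, decoding needs no language-specific rules.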

In [None]:
# GPT has a vocabulary size of 40,478: 478 base characters plus the 40,000 merges after which training was stopped.

# GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 bytes base tokens, a special end-of-text token and the symbols learned with 50,000 merges.
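Both vocabulary sizes follow the same rule: base symbols plus one new symbol per merge (plus any special tokens). The sketch below trains a tiny character-level BPE to make that concrete. The toy word frequencies and the `train_bpe` helper are assumptions for illustration, and real tokenizers add pre-tokenization and byte-level handling on top.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    splits = {word: list(word) for word in word_freqs}
    vocab = sorted({ch for word in word_freqs for ch in word})  # base characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])  # every merge adds exactly one symbol
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return vocab, merges


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}  # toy corpus
vocab, merges = train_bpe(word_freqs, num_merges=3)
print(merges)      # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
print(len(vocab))  # 7 base characters + 3 merges = 10

gpt_vocab_size = 478 + 40_000        # base characters + merges
gpt2_vocab_size = 256 + 1 + 50_000   # byte tokens + end-of-text token + merges
```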