# 1. Transformer Models

NLP (Natural Language Processing) is a field of linguistics and machine learning concerned with understanding everything related to human language: not just individual words, but also the context in which groups of words appear.

---

**Common NLP Tasks**
- Classifying each word in a sentence.
- Classifying whole sentences.
- Generating text content.
- Extracting an answer from a text.
- Generating a new sentence from an input text.

The most basic object in the Hugging Face Transformers library is the `pipeline()` function. It connects a model with its preprocessing and postprocessing steps, so you can pass in raw text directly.

In [None]:
# Sentiment Analysis
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier(["I am just starting", "i like this course"])


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9993411898612976},
 {'label': 'POSITIVE', 'score': 0.9998490810394287}]

**Behind the scenes**
- The text input is preprocessed into a format the model understands.
- The preprocessed input is passed into the model.
- The predictions of the model are postprocessed.
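The three steps above can be sketched with a toy example. Everything here is a hypothetical stand-in: `TOY_VOCAB`, `toy_model`, and `toy_pipeline` are invented for illustration and are not part of the Transformers library, which loads a real tokenizer and neural network instead.

```python
# Hypothetical toy vocabulary and label list; a real pipeline loads both from a checkpoint.
TOY_VOCAB = {"[UNK]": 0, "i": 1, "like": 2, "hate": 3, "this": 4, "course": 5}
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Step 1: turn raw text into token ids the model understands."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in text.lower().split()]


def toy_model(input_ids):
    """Step 2: stand-in for the network; returns one raw score per label."""
    negative = float(input_ids.count(TOY_VOCAB["hate"]))
    positive = float(input_ids.count(TOY_VOCAB["like"]))
    return [negative, positive]


def postprocess(scores):
    """Step 3: map the highest score back to a human-readable label."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]


def toy_pipeline(text):
    return postprocess(toy_model(preprocess(text)))


print(preprocess("I like pizza"))          # unknown words map to the [UNK] id
print(toy_pipeline("i like this course"))
```

The point is only the shape of the flow: text goes in, ids reach the model, and raw scores are turned back into labels.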

The pipeline supports several tasks, including:

`feature-extraction`, `sentiment-analysis`, `fill-mask`, `question-answering`, `summarization`, `translation`, `zero-shot-classification`, `text-generation`

In [None]:
# Zero-shot classification
# This pipeline is called zero-shot because you don't need to fine-tune the model on your data to use it.
# It can directly return probability scores for any list of labels you want!
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This course is about NLP using HuggingFace",
    candidate_labels=["education", "politics", "oil company"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'sequence': 'This course is about NLP using HuggingFace',
 'labels': ['education', 'politics', 'oil company'],
 'scores': [0.9342062473297119, 0.04426104575395584, 0.021532732993364334]}

In [None]:
# Text generation
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "This course is about the development of.",
    max_length=50,
    num_return_sequences=2,
)

In [None]:
# Fill mask
from transformers import pipeline

unmasker = pipeline("fill-mask")
# The top_k argument controls how many candidate fills are returned
unmasker("This could be the best <mask> i had ", top_k=2)

In [None]:
# Question answering
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [None]:
# Translation
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]

# 2. How do Transformers work?

*Originally designed for translation.* Broadly, Transformers can be classified into three categories:
1. Auto-regressive models, such as GPT and LLaMA
2. Auto-encoding models, such as BERT
3. Sequence-to-sequence models, such as T5 and BART

Transformers are first pretrained on large amounts of raw text in a self-supervised fashion, i.e. the model derives its own labels from the data, with no human annotation. These massive language models are then fine-tuned in a supervised fashion on specific downstream tasks, which is a form of transfer learning.

Predicting the next word from the previous *n* words is called *causal language modeling*. Another objective is *masked language modeling*, where the model learns to predict a masked word within a sentence.
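To make the two objectives concrete, here is a minimal sketch of how training examples could be derived from one tokenized sentence. The helper names and the `[MASK]` placeholder are illustrative assumptions; real pretraining works on token ids and masks positions at random.

```python
tokens = ["my", "name", "is", "sylvain"]


def causal_lm_examples(tokens):
    """Causal LM: every prefix is an input, and the token that follows it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, position, mask_token="[MASK]"):
    """Masked LM: hide one token; the model sees both sides and must recover it."""
    corrupted = list(tokens)
    label = corrupted[position]
    corrupted[position] = mask_token
    return corrupted, label


for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

print(masked_lm_example(tokens, position=2))
```

Note that the causal examples only ever expose left context, while the masked example keeps the words on both sides of the gap.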

- The encoder builds a representation of the input, so it is optimized for understanding the input.
- The decoder generates a target sequence given the encoder's representation.

**Encoder-only models**: Good for tasks such as sentence classification and named entity recognition. They are good at extracting meaning from sentences and are often characterized as auto-encoding models.
- They are bidirectional: each token can use context from both its left and its right.
- Intended for extractive question answering, sequence classification, masked language modeling, and natural language understanding (NLU) in general. E.g. BERT, RoBERTa.
- The pretraining of these models usually revolves around somehow corrupting a given sentence (by masking random words in it) and tasking the model with finding or reconstructing the initial sentence.

**Decoder-only models**: Good for text generation. A decoder can generally be used for most of the tasks an encoder handles, usually with a small loss of performance.
- These models are unidirectional, i.e. at any given position the model has access to only one side of the context (in practice the left, meaning the tokens that come before). The pretraining of decoder models usually revolves around predicting the next word in the sentence.
- They are great at causal language modeling, i.e. text generation. E.g. GPT, GPT-2, LLaMA.
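The difference between bidirectional and unidirectional attention can be pictured as a mask over token positions. The sketch below is a plain-Python illustration with hypothetical helper names, not the library's internal implementation: entry `[i][j]` is 1 when position `i` is allowed to attend to position `j`.

```python
def bidirectional_mask(n):
    """Encoder-style: every position can attend to every other position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style: position i can attend only to positions j <= i (left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


for row in causal_mask(4):
    print(row)
```

The lower-triangular shape is what prevents a decoder from looking at the tokens it is supposed to predict.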

**Encoder-decoder models** (sequence-to-sequence): Good for generative tasks that require an input, such as summarization or translation. The encoder computes a numerical representation of the input once; the decoder then uses that representation, together with the tokens it has generated so far, to produce the output autoregressively.
- Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering. E.g. T5.
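The control flow described above (encode once, then decode one token at a time) can be sketched as follows. The `encode` and `decode_step` functions are deliberately trivial stand-ins invented for this illustration: this toy "model" just emits the source tokens in reverse, whereas a real decoder would score the whole vocabulary at each step.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def encode(source_tokens):
    """Stand-in encoder: a real one returns contextual vectors, computed once."""
    return tuple(source_tokens)


def decode_step(memory, generated):
    """Stand-in decoder: picks the next token from the encoder output and what was generated so far."""
    i = len(generated)
    return memory[-1 - i] if i < len(memory) else EOS


def generate(source_tokens, max_new_tokens=10):
    memory = encode(source_tokens)      # the encoder runs a single time
    generated = []
    for _ in range(max_new_tokens):     # the decoder runs once per output token
        next_token = decode_step(memory, generated)
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated


print(generate(["a", "b", "c"]))
```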

## Behind the scenes
## 2.1 The Models

Downloaded weights are cached in `~/.cache/huggingface/hub` by default (older versions of the library used `~/.cache/huggingface/transformers`). You can change the cache location by setting the `HF_HOME` environment variable.

In [None]:
from transformers import BertConfig, BertModel

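# Build a default BERT configuration
config = BertConfig()

# Build the model from the config
model = BertModel(config)
# The model is randomly initialized (no pretrained weights are loaded)!
# train() switches to training mode and returns the model, so the cell prints its architecture
model.train()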


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

TinyLlama-1.1B, the model loaded in the next cell:
- Parameters: 1.1B
- Attention variant: Grouped Query Attention
- Model size: 22 layers, 32 heads, 4 query groups, embedding size 2048

In [9]:
from transformers import AutoModel, AutoConfig

config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModel.from_config(config)
model.train()




LlamaModel(
  (embed_tokens): Embedding(32000, 2048)
  (layers): ModuleList(
    (0-21): 22 x LlamaDecoderLayer(
      (self_attn): LlamaSdpaAttention(
        (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
        (k_proj): Linear(in_features=2048, out_features=256, bias=False)
        (v_proj): Linear(in_features=2048, out_features=256, bias=False)
        (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (mlp): LlamaMLP(
        (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
        (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
        (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
        (act_fn): SiLU()
      )
      (input_layernorm): LlamaRMSNorm()
      (post_attention_layernorm): LlamaRMSNorm()
    )
  )
  (norm): LlamaRMSNorm()
)

In [13]:
# Shrink the architecture: fewer layers and fewer attention heads
config.num_hidden_layers = 10
config.num_attention_heads = 16

In [18]:
model2 = AutoModel.from_config(config)
print(sum(p.numel() for p in model.parameters()))
print(sum(p.numel() for p in model2.parameters()))

1034512384
516466688
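The two parameter counts can be reproduced by hand from the layer shapes printed above. The sketch below is a back-of-the-envelope calculation, not a library function: `llama_param_count` is a hypothetical helper, the vocabulary size (32000) and MLP width (5632) are read from the printed `LlamaModel`, and it assumes the number of key/value heads stays at 4 (the "query groups" listed earlier) after the config is edited.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_layers, num_heads, num_kv_heads):
    """Count weights of a LlamaModel-style stack (no biases, no LM head)."""
    head_dim = hidden_size // num_heads
    kv_dim = num_kv_heads * head_dim            # k_proj / v_proj output width under GQA

    embeddings = vocab_size * hidden_size
    attention = 2 * hidden_size * hidden_size   # q_proj and o_proj
    attention += 2 * hidden_size * kv_dim       # k_proj and v_proj
    mlp = 3 * hidden_size * intermediate_size   # gate_proj, up_proj, down_proj
    layer_norms = 2 * hidden_size               # two RMSNorm weight vectors per layer
    final_norm = hidden_size

    return embeddings + num_layers * (attention + mlp + layer_norms) + final_norm


original = llama_param_count(32000, 2048, 5632, num_layers=22, num_heads=32, num_kv_heads=4)
shrunk = llama_param_count(32000, 2048, 5632, num_layers=10, num_heads=16, num_kv_heads=4)
print(original)  # 1034512384
print(shrunk)    # 516466688
```

Halving the number of heads doubles the head dimension (64 to 128), so with 4 key/value heads the `k_proj` and `v_proj` outputs grow from 256 to 512. This is why the smaller model is not simply 10/22 of the original.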


## 2.2 Tokenizers

Models can only process numbers. A tokenizer breaks sentences and words down into smaller chunks (tokens) and then maps those tokens to numbers.

1. Word-based tokenizer
Each word is assigned an ID. The raw text is split into tokens with simple rules (for example, splitting on whitespace), which gives decent but limited results. Some variants add extra rules for punctuation, and the vocabulary can become very large. Finally, we need a custom token to represent words that are not in the vocabulary, known as the unknown token: `[UNK]` or `<unk>`.

2. Character-level tokenizer
Splits raw text into characters rather than words. The vocabulary is much smaller and there are far fewer `<unk>` tokens. The drawbacks are that individual characters carry little meaning on their own and that each text produces a large number of tokens. How much this matters depends on the language.

3. Subword tokenizer
This approach relies on the principle that frequently used words should not be split, while rare words should be decomposed into smaller, meaningful subwords. This gives an efficient representation of long words while preserving semantic meaning: for example, "Tokenization" can be split into "Token" and "ization". It allows relatively good coverage with a small vocabulary and close to no unknown tokens.

- Byte-level BPE, as used in GPT-2.
- WordPiece, as used in BERT.
- SentencePiece, as used in several multilingual models.
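As a rough illustration of subword splitting, here is a greedy longest-match tokenizer in the spirit of WordPiece. It is a simplified sketch: the vocabulary is a made-up toy set, and real tokenizers learn their vocabulary from a corpus and also handle normalization and punctuation. The `##` prefix marks a piece that continues a word, following BERT's convention.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter candidate
        if piece is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"token", "##ization", "##s", "un", "##related"}  # hypothetical vocabulary
print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("tokens", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With only five entries, the toy vocabulary already covers "token", "tokens", "tokenization", and "unrelated", which is the coverage benefit described above.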

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "This library is very cool to use!."
tokens = tokenizer(sequence)
print(tokens)

{'input_ids': [101, 1188, 3340, 1110, 1304, 4348, 1106, 1329, 106, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [8]:
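# The tokenizer call above returns a dict of model inputs, not a list of tokens,
# so split the text into tokens first and then map those tokens to ids.
word_pieces = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(word_pieces)
print("The ids: ", ids)

decoded_string = tokenizer.decode([101, 1188, 3340, 1110, 1304, 4348, 1106, 1329, 106, 119, 102])
print("The decoded string: ", decoded_string)

The ids:  [1188, 3340, 1110, 1304, 4348, 1106, 1329, 106, 119]
The decoded string:  [CLS] This library is very cool to use!. [SEP]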


## 2.3 Handling multiple sequences

So far we've handled a single sequence.
- What if we have multiple sequences?
- What if the sequences have different lengths?
- Is there such a thing as too long a sequence?

**Models expect a batch of inputs**

In [24]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence = ["I've been waiting for a HuggingFace course my whole life.", "I like to train transformers"]

tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = torch.tensor([ids])  # Batching: add an extra dimension, since Transformer models expect a batch of sequences by default.
model(input_ids)

SequenceClassifierOutput(loss=None, logits=tensor([[-2.6099,  2.7623]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

- Batching is the act of sending multiple sentences through the model all at once. If you only have one sentence, you can build a batch containing just that sequence, `batched_ids = [ids]`, or repeat it to get a batch of two identical sequences:

`batched_ids = [ids, ids]`

- Padding ensures that all sequences in a batch have the same length by adding a special token, the `padding token`, to the shorter ones.

In [25]:
tokenizer.pad_token_id

0

The key feature of Transformer models is attention layers that contextualize each token. Because they attend to all of the tokens in a sequence, they also take the padding tokens into account, so an attention mask is needed to tell the model to ignore them.
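Padding and the matching attention mask can be sketched in plain Python. `pad_batch` is a hypothetical helper written for this note, not a library function; it uses a padding id of 0, the value `tokenizer.pad_token_id` returned above, and the token ids in the example are arbitrary placeholders.

```python
def pad_batch(batch_ids, pad_token_id=0):
    """Right-pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[200, 200, 200], [200, 200]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The result is rectangular, so it can be turned into a tensor, and the mask marks which positions the attention layers should ignore.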

Most Transformer models handle sequences of up to 512 or 1024 tokens. For longer inputs, either use a model with a larger context length or truncate the sequence.


In [27]:
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

# Will pad the sequences up to the length of the longest sequence in the batch
model_inputs = tokenizer(sequences, padding="longest")

# Will pad the sequences up to the specified max length
model_inputs = tokenizer(sequences, padding="max_length", max_length=8)

# Will truncate the sequences that are longer than the model max length
# (512 for BERT or DistilBERT)
model_inputs = tokenizer(sequences, truncation=True)

# Will truncate the sequences that are longer than the specified max length
model_inputs = tokenizer(sequences, max_length=8, truncation=True)

Tokenizers add special tokens such as `[CLS]` and `[SEP]` because the model was pretrained with them. Which tokens are added, and where, depends on the model: some add them only at the beginning, some only at the end, and some at both.

**Wrap-up**

In [28]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

In [30]:
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.5607,  1.6123],
        [-3.6183,  3.9137]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
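The logits above are raw, unnormalized scores. A minimal sketch of the postprocessing step, in plain Python, applies a softmax to each row and picks the most likely label. The label order used here is an assumption for illustration; the checkpoint's own mapping is available as `model.config.id2label`.

```python
import math

logits = [[-1.5607, 1.6123], [-3.6183, 3.9137]]   # copied from the output above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}          # assumed label order


def softmax(row):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]          # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


predictions = []
for row in logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=lambda i: probs[i])
    predictions.append((id2label[best], probs[best]))

for label, score in predictions:
    print(label, round(score, 4))
```

This is the same kind of label and score pair that `pipeline("sentiment-analysis")` returned at the start of these notes.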