# Introduction:

* Transformer models are very large, with millions to tens of billions of parameters, which makes training, fine-tuning, and deploying them hard.
* The Hugging Face **`Transformers`** library addresses that problem: its goal is to provide a single `API` through which any transformer model can be loaded, trained, and saved.
* With the **`Transformers`** library:
    - We can download, load, and use models for inference or fine-tuning with just a couple of lines of code.
    - All models in the library are stored like any other model: at their core they are just simple PyTorch `nn.Module` classes.
    - All components of a model are kept in one file, so there are no abstractions or shared modules across files.


# Behind the Pipeline:

* To understand what's happening behind the scenes, we first start with what we already know: the **Pipeline**

In [4]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier(['My birthday is today!'])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9996117949485779}]

* As we saw in the previous chapter, the `pipeline` groups 3 steps (preprocessing, running the model, and postprocessing) to perform such a task:

![pipeline](pic1.png)



## Preprocessing with a Tokenizer:

* To convert raw text into its numerical form before feeding it to the model, we use a **tokenizer**.
* Here is how we tokenize an input text:

In [6]:
from transformers import AutoTokenizer
mdl_ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
inputs = 'My birthday is today!'
outputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
outputs

{'input_ids': tensor([[ 101, 2026, 5798, 2003, 2651,  999,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

* First we pick a model, `distilbert-base-uncased-finetuned-sst-2-english`, which is the same model our pipeline used to classify the sentence.
* We use `AutoTokenizer` to get the tokenization method that matches that model, because each model has its own way of tokenizing words.
* Then we feed the text to the tokenizer and pick which type of tensors we want returned:
    - `pt` stands for PyTorch
    - other parameters will be covered later

* We get a dictionary with 2 keys: `input_ids` and `attention_mask`
* `attention_mask` will be covered later; `input_ids` contains one row of integers per input sentence, the IDs of its tokens.
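
To make the mapping from text to `input_ids` concrete, here is a toy sketch in plain Python. It is not the algorithm the real tokenizer uses: `TOY_VOCAB` and `toy_encode` are hypothetical names, and the vocabulary only holds the IDs visible in the output above, with `101` and `102` being the `[CLS]` and `[SEP]` markers that wrap the sentence.

```python
import re

# Toy vocabulary read off the tokenizer output above; a real vocabulary
# has tens of thousands of entries.
TOY_VOCAB = {
    "[CLS]": 101, "[SEP]": 102,
    "my": 2026, "birthday": 5798, "is": 2003, "today": 2651, "!": 999,
}

def toy_encode(text):
    """Lowercase, split into words and punctuation, then look up each ID."""
    pieces = re.findall(r"\w+|[^\w\s]", text.lower())
    ids = [TOY_VOCAB["[CLS]"]] + [TOY_VOCAB[p] for p in pieces] + [TOY_VOCAB["[SEP]"]]
    # Every position holds a real token here, so the mask is all ones.
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

encoded = toy_encode("My birthday is today!")
print(encoded)
```

A word missing from this toy vocabulary raises a `KeyError`; real tokenizers instead fall back to sub-word pieces or an unknown token.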

##  Going through the model:

* We can download the pretrained model the same way we did with the tokenizer, by using the `AutoModel` class, which also has a `from_pretrained` method.
* We just need to load the same checkpoint that was used for tokenization.

In [8]:
from transformers import AutoModel
model = AutoModel.from_pretrained(mdl_ckpt)

* The architecture we just downloaded contains only the base transformer module: given some inputs, it outputs what we call **hidden states**.
* For each model input we retrieve a high-dimensional vector representing the model's contextual understanding of that input.
* These hidden states can be used as they are, but usually they are fed as input to another part of the model called the **head**.
* Each head is task-specific.

##  A high-dimensional vector?

* The model usually outputs a large tensor with 3 dimensions:
  - **Batch size**: the number of sequences processed at a time (1 in our case, since we pass only one sentence)
  - **Sequence length**: the length of the numerical representation of the sequence (7 in our example)
  - **Hidden size**: the vector dimension of each model input.

* The high dimensionality of this tensor comes from the last dimension: the hidden size can be very large (768 is common for smaller models like this one, and larger models can reach 3072 or more):

In [10]:
outs = model(**outputs)
outs.last_hidden_state.shape

torch.Size([1, 7, 768])
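
The three dimensions can be read off directly from the inputs. A small sketch in plain Python, where `expected_hidden_shape` is a hypothetical helper and the hidden size of 768 is taken from the output above:

```python
def expected_hidden_shape(input_ids, hidden_size):
    """Return (batch size, sequence length, hidden size) for a rectangular batch."""
    batch_size = len(input_ids)          # number of sentences
    sequence_length = len(input_ids[0])  # tokens per sentence, special tokens included
    return (batch_size, sequence_length, hidden_size)

# The single tokenized sentence from the tokenizer cell above.
input_ids = [[101, 2026, 5798, 2003, 2651, 999, 102]]
shape = expected_hidden_shape(input_ids, hidden_size=768)
print(shape)
```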

##  Model heads: Making sense out of numbers:

* To wrap up the whole process: the inputs are first converted to input IDs, then the embedding layer converts each ID into a vector that represents the associated token.
* The subsequent layers manipulate these vectors using the attention mechanism to produce a contextual understanding of the input in the form of a **high-dimensional vector**.

![model](pic2.png)

* There are many architectures available in the Transformers library, each designed to tackle a specific task.
* For example, if we want a model with a sequence classification head, we use `AutoModelForSequenceClassification` instead of `AutoModel`.



In [14]:
text = ['do you feel any better today?', 'I feel warm and cosy in my house']
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
inps = tokenizer(text, padding=True, truncation=True, return_tensors='pt')

In [16]:
from transformers import AutoModelForSequenceClassification
mdl_ckpt = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckpt)
outs = model(**inps)
outs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2121,  0.4987],
        [-3.9382,  4.1996]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [17]:
outs.logits.shape

torch.Size([2, 2])

* In this case we have 2 sentences and 2 labels (`negative`, `positive`), hence the shape `[2, 2]`.
* The head takes the high-dimensional vectors as input and outputs a vector that matches our task.
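
In its simplest form, a classification head is a linear layer that maps one hidden vector to one score per label. The following is a minimal sketch with made-up weights and a hidden size of 4 instead of 768; `linear_head` is a hypothetical helper, not the head shipped in the checkpoint, which may contain additional layers.

```python
def linear_head(hidden_vector, weights, bias):
    """Map one hidden vector to one logit per label: logits = W @ h + b."""
    return [
        sum(w * h for w, h in zip(row, hidden_vector)) + b
        for row, b in zip(weights, bias)
    ]

# Made-up numbers for illustration only.
hidden_vector = [0.5, -1.0, 2.0, 0.0]
weights = [
    [0.1, 0.2, -0.3, 0.4],   # row producing the logit for label 0
    [-0.1, 0.0, 0.5, 0.2],   # row producing the logit for label 1
]
bias = [0.0, 0.1]

logits = linear_head(hidden_vector, weights, bias)
print(logits)
```

The output has one entry per label, which is why two sentences and two labels give logits of shape `[2, 2]`.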

### Post processing:

* The values we get are not probabilities but *logits*, the raw, unnormalized scores output by the last layer of the model, so we need to convert them into something meaningful for our task.

In [18]:
outs.logits

tensor([[-0.2121,  0.4987],
        [-3.9382,  4.1996]], grad_fn=<AddmmBackward0>)

* These are the raw predictions for each sentence, one score per label, so we need to know which label each position corresponds to and convert the logits into meaningful values.
* To convert the logits into probabilities, we pass them through a softmax layer.

In [19]:
import torch
preds = torch.nn.functional.softmax(outs.logits, dim=-1)
preds

tensor([[3.2942e-01, 6.7058e-01],
        [2.9218e-04, 9.9971e-01]], grad_fn=<SoftmaxBackward0>)
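
The softmax step can be checked by hand. A small sketch in plain Python, where the `softmax` helper is written here for illustration and the logits are copied from the output above; it should reproduce the probabilities printed above up to rounding.

```python
import math

def softmax(row):
    """Exponentiate each score and normalize so the row sums to 1."""
    largest = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output above, one row per sentence.
logits = [[-0.2121, 0.4987], [-3.9382, 4.1996]]
probs = [softmax(row) for row in logits]
print(probs)
```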

* Now we need to know the label of each column:

In [20]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

* So position [0] is the `NEGATIVE` probability and position [1] is the `POSITIVE` probability.
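
Putting the last two steps together, picking the most probable column and looking it up in the label mapping gives a result shaped like the pipeline output. This is a sketch in plain Python with the probabilities copied (rounded) from the softmax output above; `predictions` is a name chosen for illustration.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied (rounded) from the softmax output above.
probs = [[0.32942, 0.67058], [0.00029218, 0.99971]]

predictions = []
for row in probs:
    best = max(range(len(row)), key=lambda i: row[i])  # index of the largest probability
    predictions.append({"label": id2label[best], "score": row[best]})
print(predictions)
```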

# Models

* As we saw before, the `AutoModel` class is a handy tool to instantiate a model from a checkpoint (its saved weights).
* It can guess the architecture that corresponds to the checkpoint.

### Building the transformer:

* If we know exactly which model we want, we can also use its class directly.

In [21]:
from transformers import BertConfig, BertModel
cnfg = BertConfig()
mdl = BertModel(cnfg)

* The configuration contains many attributes related to the architecture:

In [22]:
cnfg

BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.34.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

* We can understand many of these attributes, like:
   - `hidden_act`: the activation function, here `gelu`
   - `hidden_size`: the dimension of the vector representing each input token
   - `num_attention_heads`, `num_hidden_layers`, `model_type` ...
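
These attributes fully determine how many weights the model has. As a back-of-the-envelope sketch, assuming the standard BERT layout (token, position, and segment embeddings, then per layer four attention projections and a two-layer feed-forward block, plus a final pooler), the count can be estimated from the values printed above. `estimate_bert_params` and its helpers are hypothetical names, not library functions.

```python
# Values copied from the BertConfig printed above.
cfg = {
    "vocab_size": 30522, "hidden_size": 768, "intermediate_size": 3072,
    "max_position_embeddings": 512, "type_vocab_size": 2, "num_hidden_layers": 12,
}

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias

def layer_norm(n):
    return 2 * n  # scale + shift

def estimate_bert_params(c):
    h, ff = c["hidden_size"], c["intermediate_size"]
    embeddings = (c["vocab_size"] + c["max_position_embeddings"] + c["type_vocab_size"]) * h + layer_norm(h)
    attention = 4 * linear(h, h) + layer_norm(h)  # query, key, value, output projections
    feed_forward = linear(h, ff) + linear(ff, h) + layer_norm(h)
    pooler = linear(h, h)
    return embeddings + c["num_hidden_layers"] * (attention + feed_forward) + pooler

total = estimate_bert_params(cfg)
print(f"{total:,}")
```

Under these assumptions the estimate lands near the roughly 110 million parameters usually quoted for a BERT base model. `num_attention_heads` does not appear in the count because the heads split the same `hidden_size` projections rather than adding new ones.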

* While it is possible to build and use a model this way, it will produce very poor predictions because its weights are initialized randomly.
* We would have to train it from scratch, which is a daunting process that consumes a lot of time, money, and energy.
* This is why it is preferable to load the model the other way, starting from a pretrained one:

In [23]:
from transformers import BertModel
mdl = BertModel.from_pretrained('bert-base-cased')

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

* We could even use `AutoModel` instead of `BertModel`, which produces checkpoint-agnostic code that works with any architecture.

* At this point the model is initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on new tasks or more data.

## Saving the model:

* To save a model once we are satisfied with its performance:

In [24]:
mdl.save_pretrained('path')

In [25]:
!ls 'path'

config.json  pytorch_model.bin


* This saves 2 files:
    - `config.json`: contains all attributes necessary to build the model architecture, and also it contains some metadata
    - `pytorch_model.bin`: contains the learnable weights.
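
Since `config.json` is plain JSON, it can be inspected with the standard library. The sketch below is a toy stand-in that writes and reloads a few values from the configuration shown earlier; `toy_config` is a hypothetical name, and this only mimics the configuration half of the saved folder, not how `save_pretrained` is implemented.

```python
import json
import os
import tempfile

# A few attributes taken from the BertConfig printed earlier.
toy_config = {"model_type": "bert", "hidden_size": 768, "num_hidden_layers": 12}

with tempfile.TemporaryDirectory() as save_dir:
    config_path = os.path.join(save_dir, "config.json")
    # Write the architecture description as human-readable JSON.
    with open(config_path, "w") as f:
        json.dump(toy_config, f, indent=2)
    # Read it back, as a loader would before rebuilding the architecture.
    with open(config_path) as f:
        reloaded = json.load(f)

print(reloaded)
```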

##  Using a Transformer model for inference:

* The tokenizer converts the input text into input IDs (in practice, the tokenizer should be loaded from the same checkpoint as the model):


In [28]:
sequences = ["Hello!", "Cool.", "Nice!"]
inps = tokenizer(sequences, return_tensors='pt')
encoded_sequences = inps.input_ids
encoded_sequences

tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])

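* Because we passed `return_tensors='pt'`, the output here is already a PyTorch tensor rather than a list of lists.
* Tensors only accept rectangular shapes, which works here because all three sequences have the same length, so we can simply copy it:

In [29]:
input = encoded_sequences.clone().detach()  # already a tensor, so copy it instead of calling torch.tensor()
input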


tensor([[ 101, 7592,  999,  102],
        [ 101, 4658, 1012,  102],
        [ 101, 3835,  999,  102]])
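
The three sequences above happen to have the same length. When they do not, the batch must be padded to a rectangle before it can become a tensor, which is what `padding=True` asks the tokenizer to do. A minimal sketch in plain Python, assuming right-padding with ID `0` (the `pad_token_id` in the configuration shown earlier); `pad_batch` is a hypothetical helper.

```python
PAD_ID = 0  # matches "pad_token_id": 0 in the configuration shown earlier

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask

# Two sequences of different lengths, reusing IDs that appear earlier in these notes.
ragged = [[101, 7592, 999, 102], [101, 2026, 5798, 2003, 2651, 999, 102]]
ids, mask = pad_batch(ragged)
print(ids)
print(mask)
```

The mask tells the model which positions hold real tokens and which are padding to be ignored.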

###  Using the tensors as inputs to the model

* Making use of this tensor is as easy as passing it through the model:

In [31]:
outputs = mdl(input)
outputs

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 4.4496e-01,  4.8276e-01,  2.7797e-01,  ..., -5.4032e-02,
           3.9394e-01, -9.4770e-02],
         [ 2.4943e-01, -4.4093e-01,  8.1772e-01,  ..., -3.1917e-01,
           2.2992e-01, -4.1172e-02],
         [ 1.3668e-01,  2.2518e-01,  1.4502e-01,  ..., -4.6915e-02,
           2.8224e-01,  7.5566e-02],
         [ 1.1789e+00,  1.6738e-01, -1.8187e-01,  ...,  2.4671e-01,
           1.0441e+00, -6.1970e-03]],

        [[ 3.6436e-01,  3.2464e-02,  2.0258e-01,  ...,  6.0111e-02,
           3.2451e-01, -2.0995e-02],
         [ 7.1866e-01, -4.8725e-01,  5.1740e-01,  ..., -4.4012e-01,
           1.4553e-01, -3.7545e-02],
         [ 3.3223e-01, -2.3271e-01,  9.4877e-02,  ..., -2.5268e-01,
           3.2172e-01,  8.1079e-04],
         [ 1.2523e+00,  3.5754e-01, -5.1320e-02,  ..., -3.7840e-01,
           1.0526e+00, -5.6255e-01]],

        [[ 2.4042e-01,  1.4718e-01,  1.2110e-01,  ...,  7.6062e-02,
           3.3564e-01,  2