Has [79 videos!! in this yt playlist](https://www.youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o). I am creating one section per chapter for now.

# Additional videos/resources after this
 - [Jay Alammar's illustrated transformer](https://jalammar.github.io/illustrated-transformer/) _all of his other blog posts are worth a read too they say_

# Installing

```shell
mamba install transformers
```

## Fetching models and tokenizers

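There are tons of different models (_different networks and weights_) and tokenizers. Each tokenizer is paired with its model. As far as I can tell it is not neural itself: it is built with a statistical algorithm (BPE, WordPiece, Unigram etc) over a text corpus, typically the same kind of corpus the model is pretrained on.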

Flesh this section out from https://huggingface.co/docs/transformers/installation as needed.

# Boilerplate

Import transformers as well as some diagramming libs (_`nb_js_diagrammers` allows mermaid and others while `iplantuml` allows plantuml magics_)

In [None]:
import transformers

# The next two are for my diagramming. Nothing to do with HF
%load_ext nb_js_diagrammers
import iplantuml


# Video 2 - Pipeline function

Uses the `pipeline` object from `transformers`. Has many different types of pipelines (_presumably composed of multiple pieces per the name_)


In [None]:
%%plantuml
@startmindmap
* pipeline
** **sentiment-analysis**
** **zero-shot-classification** classifies along user supplied categories
** Hello
@endmindmap

## Sentiment analysis

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])

## Zero shot classification

Zero-shot means no additional training is needed. The pipeline takes a bunch of user supplied labels and scores the input text against each of them (_one score per label_).

In [None]:
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

## Text generation

Are these models auto quantized to fit GPU memory ?
 - defaulted to `gpt2`
 - Specified `distilgpt2` was 330M in size

In [None]:
# Generated text pipeline
# Defaulted to gpt2
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course we will teach you how to",
    max_length=30,
    num_return_sequences=2
)

## Fill mask

> Defaulted to `distilroberta-base` which was 331M in size.

Filling some missing piece of a text

The following code will print out two most likely completions for the missing word. For each
 - Lists the `token_str` which replaces the masked word
 - The `score` which is the probability of seeing this completion ?

In [None]:
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k = 2)

## Named Entity Recognition (NER)

> This defaulted to `dbmdz/bert-large-cased-finetuned-conll03-english` which was 1.3G in size

```python
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn")
```

results in the following json.

```json
[{'entity_group': 'PER',
  'score': 0.9986171,
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': 0.97779936,
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': 0.9889684,
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
```

The args
 - `grouped_entities=True` groups multiple tokens that belong to the same entity together: `Hugging` and `Face` come back as the single entity `Hugging Face`.

- This is important for me so should study this in more detail. 
- Also how to use the Spacy visualizers to visualize this info ?

In [None]:
ner = pipeline("ner", grouped_entities=False)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn")

## Question Answering

> defaults to distilbert-base-cased-distilled-squad which is 261M in size

In [None]:
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn"
)

## Summarization

> defaults to sshleifer/distilbart-cnn-12-6 which is 1.2G in size!

The summary for the following article is 

_' First time U.S. troops have been killed by enemy fire in the Middle East since Gaza war . US officials say drone attack was launched by Iran-backed militants and appeared to come from Syria . President Joe Biden vows to hold those responsible for the attack on a US base in Jordan .'_

In [None]:
# Picked a random thing from the internet instead of copying his text
summarizer = pipeline("summarization")
summarizer("""
US President Joe Biden vowed to hold "to account" those responsible for a drone attack on a US outpost in Jordan, which killed three US Army soldiers and injured at least 34, marking the first time US troops have been killed by enemy fire in the Middle East since the beginning of the Gaza war.

US Central Command confirmed the deaths and said eight personnel had to be medically evacuated from Jordan. The number of wounded is expected to rise.

The Islamic Resistance in Iraq, an umbrella group for several Iran-backed militias in the country, said it attacked a number of places along the Jordan-Syria border on Sunday — including a camp near the US base in Jordan where soldiers were killed.

US officials have said the drone that killed the US service members at Tower 22 was launched by Iran-backed militants and appeared to come from Syria. The US government has not yet named a specific militia they hold responsible.

Iran denied it played any role in the attack.
""")

## Translation

Translates from one language to the other.

In [None]:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

# Video 4 - What is transfer learning

**Leverage the knowledge in a model trained for a specific task on a new task**
 - Initialize weights from model A onto model B
 - Use foundational models (_those trained on lots of data_) and transfer that knowledge to one trained on a smaller dataset.
 - The head (_last layer in the deep network_) of the source model is thrown away (_replaced with random weights_) prior to fine-tuning. 
   - Say an image recognition model needs to be used in a cat/dog classification task. Then
     - replace head with a 2 output classification layer (_random weights_)
     - train on smaller dog/cat dataset
 - Pick a pre-trained model as close in objectives to the final model.

👉 Transfer learning based on a foundational model (_with lots of data_) ends up being way more accurate than training from scratch on your specialized task's dataset. Vision models are typically pretrained on ImageNet's 1.2 million images for instance and BERT on a huge text corpus.
 - ImageNet's 1.2 million images are supervised (_labels provided for each image_)
 - GPT-2 was trained on 40GB of internet data that was pre-processed heavily.
 - BERT, GPT-2 etc are self-supervised and so can work on way larger datasets. Their training objective is to guess a masked word (_BERT_) or the next word (_GPT-2_). The correct answer is already in the text, so no hand-labeled data is needed.

> Note that the transfer learning inherits the learning and the biases.
> - ImageNet has mostly US and Western EU images
> - Text generation for female pronouns has a bias toward physical characteristics as opposed to male pronouns


![](./img/transfer-learning-concept.png "Transfer Learning concept")

![](./img/transfer-learning-heads.png "Transfer Learning head change")

# Video 5,6,7,8 - The transformer network - encoders, decoders and encoder-decoders

> Note: 👉 Incomplete notes. After a while I just watched all of them to get an idea. Eventually watch all 4 videos again to take further notes as needed. For now, I have some superficial understanding.

> Note: You can choose separate encoder and decoder models to make up an encoder-decoder set based on what they each excel at. 
>
> Wonder if masked language generation can be used for co-reference resolution ?

Shows arch from the original Vaswani paper: _Attention is all you need_. Simplifies it to an `encoder` & `decoder` block to teach some high-level concepts

![](./img/transformers-details.png "Transformers in detail")

![Encoder decoder blocks](./img/transformers-encoder-decoder-blocks.png "Encode Decoder Blocks")

 - Watch them all.
 - Encoder
   - 👉 Embeds semantic understanding of the training material
   - converts text to numbers. Embeddings (_I understand_) or features he says.
   - accepts inputs and converts them to a high-level representation
     - Calls them feature vector (_also embeddings I guess. Feature space is neurally learned_)
   - `bi-directional` here means that it takes into account words that come to the left and those that come to the right of the word being encoded.
   - `self-attention` seems to mean that the word is not encoded in isolation but uses the context of surrounding words (_both left and right in bi-directional schemes_)
 - Decoder
   - `auto-regressive`
   - `uni-directional`: The attention context of a word is just the words to the left (_usually_) or right.
     - also referred to as `masked self-attention` where a part of the context is masked (_from being the context i.e._)
     - If using left-context, this would be great at generating words that come to the right: completion tasks.
   - Uses outputs from encoder alongside other inputs to generate a prediction. This prediction will be re-used in future iterations, hence the term: `auto-regressive`
     - In a completion task, is it adding a single token, then back to input and add one more token etc ?
- can be used together (`sequence-to-sequence`) or separately.

## Encoder

![](./img/encoder-schematic.png)

Take the example of encoding a sentence

![](./img/encoder-words-to-vec.png)
 - One sequence of numbers (_a vector of some fixed dimension_) per input word
   - The size of the vector is the _dimensionality_ of the vector space of the features.
   - Naively assigning one word to a dimension will lead to a high-dimensional space (_high compute and storage expense not to mention sparse_)
   - Can pick randomly smaller dimension and see how the neural nets create the embedding as the training progresses. _Embedding is the act of embedding a feature into a fixed dimensional space with low to no loss of knowledge_
 - Each vector is a feature vector or tensor. 
 - **Each word _in the initial sequence_ affects the representation of every other word in the sequence**. _Is the sequence a sentence?_

**Bi-directionality** 

![](./img/encoder-bi-directionlaity.png)
  - The representation contains the value of a word (_more precisely, its contextual embedding_)
    - **Contextual**: This is not the representation of just the word. It is the value in the context of its surrounding words. Bi-directional to include words to its left and to its right. Hence _contextualized value_. This is done via the _self-attention_ mechanism.
    - **Embedding** embedding a word from a high-dimensional space (one dim per word) into a lower dimensional space

### Applicability of encoder only models

 - Good at extracting meaningful information _because of the bi-directional context_ ?
 - Sequence classification (Sentiment analysis)
 - Question Answering
 - **Masked language modeling** (MLM): Guessing a randomly masked word in a sequence/sentence. _This is pretty much the main training method for bi-directional encoding so it stands to reason that they do well here_
 - NLU (_Natural language understanding_)
 - Examples: BERT, RoBERTa, ALBERT

## Decoder

![](./img/decoder-schematic.png)

![](./img/decoder-words-to-vec.png)

 - Similar to encoder in that this also converts words to vectors. Each word to a fixed-dimension vector/tensor.
 - Similarly also called a feature-vector or tensor (_also an embedding into a usually lower dimensional space_)

![](./img/decoder-vec-per-word.png)

**One big difference with encoders is that these are unidirectional** i.e., the context included (_attention_) is in just one direction: either to the left or to the right of the word that is being decoded.
 - This is called _masked self attention_
 - If the words to the right are removed from the context, then they are considered being _masked from attention_.
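
To make the masking idea concrete for myself, here is a minimal plain-Python sketch (_not how the library implements it; the helper name `attention_mask_matrix` is made up for illustration_). A `1` at row `i`, column `j` means word `i` is allowed to use word `j` as context.

```python
def attention_mask_matrix(seq_len, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

words = ["My", "name", "is", "Sylvain"]

# Encoder style: every word sees every other word (bi-directional).
bidirectional = attention_mask_matrix(len(words), causal=False)

# Decoder style: each word only sees itself and the words to its left.
left_only = attention_mask_matrix(len(words), causal=True)

for word, row in zip(words, left_only):
    visible = [w for w, allowed in zip(words, row) if allowed]
    print(f"{word!r} attends to {visible}")
```

The lower-triangular shape of `left_only` is the _masked self attention_ pattern: everything to the right of the diagonal is hidden.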

![](./img/decoder-words-to-vec-unidirectional.png)

**When should these be used**
 - The strength is mainly in that the context is restricted to the **left** (_usually. Can be right too I guess if the language is right-to-left_). Unidirectionality is a strength here because it enables specific tasks.
 - Great at causal tasks: Generating sequences given a left context (_for languages that go left-to-right_)
 - **NLG**: Natural language generation
 - Examples: GPT-2, GPT Neo

Example:
 - Start with _My_ as the input to the decoder
 - Model outputs _name_ as the most likely next word (_actually outputs a number-sequence, which is mapped by a language-modeling-head to a word_). Language modeling head is the final layer that converts the low dimensional vector to a single word in a high dimensional space ? De-Embedding is a thing ?
 - **This is where the auto-regressive** part comes in
 - Now add this generated word to the previous input sequence and send in _My name_ to the model
 - Repeat till a stop condition (_end of sentence or some special token_) is received
 - _My_ → _My name_ → _My name is_ → _My name is Sylvain._
 - Starting from a single word, we have a full sentence!
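
The loop above can be sketched in a few lines. This is a toy, assuming a hypothetical `toy_decoder` that just looks up the next word in a hard-coded table (_a real decoder would score every token in its vocabulary_), and a made-up `<eos>` stop token.

```python
# Hypothetical stand-in for a decoder: maps the sequence so far to the next word.
NEXT_WORD = {
    ("My",): "name",
    ("My", "name"): "is",
    ("My", "name", "is"): "Sylvain.",
    ("My", "name", "is", "Sylvain."): "<eos>",
}

def toy_decoder(sequence):
    """Return the 'most likely' next word for the given left context."""
    return NEXT_WORD.get(tuple(sequence), "<eos>")

def generate(prompt, max_new_tokens=10):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        next_word = toy_decoder(sequence)   # predict from the left context only
        if next_word == "<eos>":            # stop condition
            break
        sequence.append(next_word)          # feed the output back in: auto-regressive
        print(" ".join(sequence))
    return sequence

result = generate(["My"])
```

`max_new_tokens` is there because a real model may never emit the stop token, so generation needs a hard limit as well.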

> When they say GPT-2 has a context of 1024 words (_tokens actually but simplify_), this literally means the _left context_: while generating the 1024th word, it still has memory of the first word in the sequence. After that the generation loses the context of the first word.

## Encoder Decoder

The two work together in the following way
 - Encoder _Words → Contexted Vec_
 - Send encoder output to Decoder
 - Give decoder a sequence **alongside** the encoder output
   - Initial seq is empty: some sentinel value
   - Keep looping like usual till end of generation. _Note that all generation is in context of the encoder output_

The following shows a concrete language translation (_sequence to sequence transduction_) which makes it clearer
 - The _encoded("Welcome to NYC")_ is the context for the generation. _Likely training used english vector + actual french translation vector_ for training so they have semantic similarity or some such.
 - Generator generates
   - _encoded("Welcome to NYC")_: START → `Bienvenu`
   - _encoded("Welcome to NYC")_: `Bienvenu` → `a`
   - _encoded("Welcome to NYC")_: `Bienvenu a` → `NYC`
   - _encoded("Welcome to NYC")_: `Bienvenu a NYC` → STOP

![](./img/encoder-decoder-translation-task.png)

Encoder and decoder are different networks with different weights and embeddings. In above case
 - Encoder understands english sentences
 - Decoder can generate french in the context of the encoder

### Where do they shine

 - Sequence to sequence tasks
   - Many to many
   - Translation
   - Summarization
 - Output length is independent of input length (_In the case of language translation for sure_)
   - Summarization for instance is always smaller than the input text
   - Elaboration usually more
   - Translation between languages can be any
   - This allows for different context sizes between encoders and decoders (_just operational for GPU memory or is this relevant along other dimensions too ?_)
 - Examples
   - BART
   - ProphetNet
   - mT5
   - M2m100
   - T5
   - Pegasus
   - MarianMT
   - mBART
   - and many others
 - Can mix and match _I am sure there are details here_. _Encoder : Decoder_ pairs. Pick each for their proven worth on specific tasks.
   - BERT    : GPT-2
   - BERT    : BERT
   - RoBERTa : RoBERTa   


# Video 9, 10 - What happens inside a pipeline function (pyTorch or otherwise)

Uses the `sentiment-analysis` pipeline as an example

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier([
  "I've been waiting for a HuggingFace course my whole life.",
  "I hate this so much!",
])
```

![](./img/pipeline-classifier-flow.png)
 - Tokenizer: Text → Vectors/Numbers
 - Model : Numbers → Logits
 - PostProcessing: Logits → Predictions

## Tokenizer

![](./img/pipeline-tokenizer-details.png)

 - Raw text  → Tokens (_whole pre-processing steps here_)
 - Tokens → Sentencified Tokens (_Add begin/end special tokens_)
 - Sentencified Tokens  → InputID (_map tokens to numbers based on vocabulary. Lookups or embeddings_)

![](./img/pipeline-tokenizer-details-code.png)

 - AutoTokenizer loads tokenizer for any model checkpoint. `AutoTokenizer.from_pretrained(<checkpoint>)`
 - Truncation asks to truncate any input longer than what the model can handle. _Is there a warning though ?_
 - Makes all inputs the same lengths by 0 padding 
 - `attention_mask` is nice to see. Simply masks out the padding in this case.
 - `return_tensors="pt"` asks to return a PyTorch tensor

## Model

![](./img/pipeline-model-details.png)
 - Convert input_id's (_numbers_) to logits (_what it is?_)

### AutoModel

> AutoModel is meant to be used only for finetuning ? If one of the `AutoModelForXXX` is not suitable, then load the base pretrained model and then transfer train it ?

![](./img/pipeline-model-details-code.png)
 - `AutoModel.from_pretrained(<checkpoint>)` loads a model **without its pretraining head**. _What!! What does it replace it with then ? I thought this was only for transfer-learning in which case a new head has to be trained_
 - This outputs a `high-dimensional tensor` (_ok, since this is a sentiment model, removal of the head means removal of the final layer which has two dimensions (pos, neg). Why remove it ?_). 
 - The second last-layer (_A hidden layer per deep learning. Input and Output layers are named and the rest are hidden_)
 - `print(outputs.last_hidden_state.shape)` shows the dimension as `torch.Size([2, 16, 768])`
   - 2 sentences (_array of sequences_)
   - of size 16 (_sequence length_)
   - hidden size of 768 (_neurons in last layer essentially_)

### AutoModelForXXX

> Each AutoModelForXXX method loads a model suitable for a specific task at hand. This includes a task-specific head!. Note that sequence classifier uses `distilbert-base-uncased-finetuned-sst-2-english`. So likely:
> 
> Base → `distilbert-base-uncased` which is then finetuned with 
> Head → `sst-2-english` (_SST-2 is the Stanford Sentiment Treebank: an English sentiment dataset with 2 labels, hence the 2 outputs_)
>
> There is one `AutoModelForXXX` class for each NLP task in the transformers library

![](./img/pipeline-modelForClassification-details-code.png)

The outputs from this stage are not probabilities yet. These are logits

## Post Processing

![](./img/pipeline-postprocessing-details.png)

 - `torch.nn.functional.softmax(output.logits, dim=-1)` is used to apply a [softmax layer](https://towardsdatascience.com/softmax-activation-function-how-it-actually-works-d292d335bd78)
   - Usually the last layer in any DNN. 
   - Scales logits to probabilities
- `model.config.id2label` provides the labels for the id/index which map to the index/order of the probability outputs. So in this case, map output `{0.0402, 0.9598}` using `{0: 'NEGATIVE', 1 : 'POSITIVE'}` to `{'NEGATIVE': 4.02%, 'POSITIVE': 95.98%}`
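
To see what the post-processing does without loading a model, here is a plain-Python sketch of softmax plus the label lookup. The logits are assumed values I picked so that the result lands on the probabilities quoted above, and `id2label` is typed in by hand instead of being read from `model.config`.

```python
import math

def softmax(logits):
    """Scale raw logits to probabilities that sum to 1."""
    top = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hand-written copy of the mapping; a real model exposes it as model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Assumed logits for one sentence (one raw score per label)
logits = [-1.5607, 1.6123]

probabilities = softmax(logits)
prediction = {id2label[i]: round(p, 4) for i, p in enumerate(probabilities)}
print(prediction)
```

The logits can be any real numbers (negative too), which is why they are not readable as probabilities until the softmax is applied.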


## Impressions

I now see what one of the internet comments was saying about this being very transparent under the hood. I wonder how much Sylvain and his FastAI background, interaction with Jeremy contributed here: a lot I would think. This is great architecture aiming for generality and simplicity.


# Video 11 - Instantiate a Transformers model

`AutoModel` allows one to load any model from the hub: `AutoModel.from_pretrained("checkpointName")`
 - Downloads `config` and `model` file
 - `checkpoint | local folder`
 - What each of these mean are likely a p
 - if _folder_ needs a valid config file and a weights file

![](./img/v11-instantiate-via-automodel.png)

![](./img/v11-instantiate-via-automodel-details.png)

## Loading Config

> If you want to modify the config before loading the model ?
>
> Uses `AutoConfig`

![](./img/v11-instantiate-config.png)

## Whats in a config

![](./img/v11-whats-in-a-config.png)

## Train a checkpointed model from scratch

I am not entirely clear about what the `Config`, `ConfigClass`, `ModelClass`, `Model`, `Architecture` do. Likely there is some redundancy and loose terminology. Nevertheless, if there is a checkpointed (_and hence useful_) trained model, you can reuse its config to train the same architecture on your own data, with customizations as needed. For instance, to train a blank `BERT` model with a change: `num_hidden_layers: 12 → 10`, do the following.

![](./img/v11-instantiate-bert-train-from-scratch.png)

> Note: A config loaded from a `checkpoint` is a collection of config values that worked in the past.  There could be several other BERT models with different configs on different input corpuses.

### Saving after training

```python
from transformers import BertConfig, BertModel

bert_config = BertConfig.from_pretrained("bert-base-cased")
bert_model  = BertModel(bert_config)

# Training code
# Data, epochs and all that

# Save..
# This will go to the process's CWD
bert_model.save_pretrained("my_bert_model")
```

> This can then be pushed to the Hub as well if intended for publishing.

### Load saved custom model

> From CWD. Can we specify a folder name as well ?

```python
from transformers import BertModel

bert_model = BertModel.from_pretrained("my_bert_model")
```

# Video 13, 14, 15, 16 - Tokenizers

> Also see [NLP Word Embeddings.ipynb](./NLP%20Word%20Embeddings.ipynb) and [NLP_TraditionalTextFeatureEngineering](./NLP_TraditionalTextFeatureEngineering.ipynb)

## History and motivation

A typical neural-net's input layer takes N numerical inputs and outputs M numerical outputs. Using such a network on non-numerical data requires the conversion of such data to numbers in the first place and then converting the final output numbers to the output domain (usually different from the input domain).

In the case of an image-classifier for instance, the inputs are the pixels (_one input per pixel. Hence the usual limitation on image size to avoid blowing up input parameters_) and the outputs can point to one of N available labels. Even here, to reduce data headaches, the outputs might map the label strings to IDs. A simple numerical mapping from `text → int`

In the case of text though, you have to map the input text to numbers as well. Say you are operating on the universe of words, we could consider mapping each word to a unique ID. One way of doing this is a 1:1 mapping like [one hot encoding](https://en.wikipedia.org/wiki/One-hot) in which each word will map to a distinct value/vector (_where only one bit is on: the hot bit_). This way each bit can map to one of the inputs of the neural-net. The problems start arising when you consider that the English language alone has roughly 170k words. That's a 170k-dimensional vector per word and a tremendous amount of matrix math on huge matrices where the vectors are very sparse.
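
A tiny sketch of one-hot encoding to make the sparsity obvious. The 4-word vocabulary is made up; the 170k figure is the rough word count mentioned above.

```python
def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

vocabulary = ["cat", "dog", "dogs", "fish"]   # toy vocabulary
for word in vocabulary:
    print(f"{word:>5} -> {one_hot(word, vocabulary)}")

# At full scale every vector is almost entirely zeros
english_words = 170_000
print(f"zeros per vector with {english_words:,} words: {english_words - 1:,}")
```

Note that `dog` and `dogs` end up exactly as far apart as `dog` and `fish`: a one-hot vector carries no notion of similarity.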

With this background, we can look at the other videos that explain tokenization (_the process of converting words, or whatever the input is, to tokens and then to input IDs_).


## Word based

 - Split a sentence into words (_split on spaces, punctuations etc_)
 - Each word gets assigned an ID/Code/Token
 - A word has some semantic information (_used in certain contexts and with certain words etc_) so this type of tokenization is expected to be powerful.
 - However `dog` and `dogs` can get two entirely different tokens. _Not sure why this is being presented as a big issue. Isn't the lemmatization that is done in traditional NLP pre-processing done here as well? Or do we want to learn about dog and dogs separately and hence this issue ?_
 - 170,000 words in just English so potentially 170,000 codes. That's a large vocabulary and hence a large vector. Huge compute and storage needs.
 - Strategy to limit vocabulary size
   - Limit to 10,000 most popular words. Any unknown word gets an `OOV / UNKNOWN` token for _Out of vocabulary_
   - This is a compromise because all unknown words have the same representation, which is a problem when using novel text that has a lot of OOV words.
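
A minimal sketch of the word-based scheme with a capped vocabulary. The corpus, the cap of 3 words and the `[UNK]` token name are all made up for illustration; real tokenizers do far more pre-processing than `lower().split()`.

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Keep only the max_size most frequent words; id 0 is reserved for unknowns."""
    counts = Counter(word for sentence in corpus for word in sentence.lower().split())
    vocab = {"[UNK]": 0}
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Map each word to its id, falling back to the [UNK] id for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.lower().split()]

corpus = [
    "the dog chased the cat",
    "the dog ignored the cat",
    "the dog slept",
]
vocab = build_vocab(corpus, max_size=3)
ids = encode("the dogs chased the cat", vocab)
print(vocab)
print(ids)
```

Both `dogs` (never seen) and `chased` (seen, but cut by the cap) collapse to the same id, which is exactly the information loss described above.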

## Character based

This has been developed to handle the flaws of word based tokenization.

 - Split text into chars instead of words
 - Unlike large word vocab, a limited character vocab _256 is usually enough ?_
 - **Even never seen words will still be composed mostly of the same chars. So we'll have a much smaller chance of OOV chars**
 - However, characters don't hold as much information as words.
 - Large number of input tokens since there are a lot of characters vs words in a given input text.
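
A quick plain-Python check of the two trade-offs: far more tokens per text, but almost no unknowns. The sample sentence is the one used in the tokenizer video later on; the "unseen" word is my own pick.

```python
text = "Let's try to tokenize!"

word_tokens = text.split()   # naive word-level split on spaces
char_tokens = list(text)     # character-level split

# Build a character vocabulary from the text itself
char_vocab = {ch: idx for idx, ch in enumerate(sorted(set(char_tokens)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

print(len(word_tokens), "word tokens vs", len(char_tokens), "character tokens")

# A word that never appeared in the text is still fully covered by known chars
unseen = "tokenizer"
unseen_covered = all(ch in char_vocab for ch in unseen)
print(f"{unseen!r} representable without OOV: {unseen_covered}")
```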

## Subword based

Developed to overcome the shortcomings in both word-based and character based encodings.

**word based**
  - Very large vocabularies
  - Large quantity of OOV tokens when limiting vocab size
  - Loss of meaning across similar words (dog, dogs for instance)

**character based**
  - very long sequences (_1 token per char_)
  - less meaningful individual tokens


The current (2024) guidelines seem to be:
 - Frequently used words should not be split into smaller subwords
 - Rare words should be decomposed into meaningful subwords

![](./img/v16-subword-decomposition-1.png) 

![](./img/v16-subword-decomposition-rare-words-1.png)

### Detecting meaning across words

Same root or same extension allows the model to understand syntactic or semantic similarity in text.

![](./img/v16-subword-decomposition-rare-words-2.png)


### Marking start-of-word or suffix tokens

To signal whether a token starts a word or continues one, some models mark it in the token itself. For instance, `BERT` prefixes a continuation (suffix) token with `##` (_from its WordPiece algo_). Other models may do things differently.

![](./img/v16-subword-mark-suffix-token.png)
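
Here is a rough sketch of how a word could be split into `##`-marked pieces: greedily take the longest piece found in the vocabulary, then continue with the rest. This is my simplified take on the greedy longest-match idea usually described for WordPiece, with a tiny hand-made vocabulary; it is not the library's code.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return [unk]                       # no piece matched at all
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (made up)
vocab = {"token", "##ize", "##ization", "dog", "##s"}

for word in ["dog", "dogs", "tokenize", "tokenization", "xyz"]:
    print(word, "->", wordpiece_tokenize(word, vocab))
```

`dog` and `dogs` now share the `dog` token, and `tokenize` and `tokenization` share `token`, which is the similarity that pure word-based tokenization throws away.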


There are many different sub-word tokenization algorithms that can be used to generate tokens.

![](./img/v16-subword-tokenization-algos.png)

> What about neurally learned tokenization ? I thought fixed algo tokenization is usually superseded by the learned varieties.


# Video 17 - The tokenizer pipeline

The tokenizer takes text as input and outputs numbers the associated model can make use of
 - You can see it use subword tokens since input length and output length differ (_could also be terminals etc_)

![](./img/v17-using-tokenizer-1.png)


![](./img/v17-inside-tokenizer.png)

## Tokenize first: text to subword tokens


In [None]:
from transformers import AutoTokenizer

# bert uses ## preceding suffixes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Lets try to tokenize!")
print(tokens)

In [None]:
from transformers import AutoTokenizer

# ALBERT's tokenizer puts a ▁ in front of every token that follows a space.
# Apparently a convention shared by all SentencePiece based tokenizers.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v1")
tokens = tokenizer.tokenize("Lets try to tokenize!")
print(tokens)

## IDs next: string tokens to numbers

Here we perform the final part of the tokenization by mapping the textual tokens to numbers. Note that the same tokenizer that generated the string tokens should be used to convert them to numeric IDs.

In [None]:
from transformers import AutoTokenizer

# bert uses ## preceding suffixes.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Let's try to tokenize!")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(input_ids)

## Model ready IDs last: Add special tokens

Turns out we need to perform a final step of adding special tokens (sentence-start, sentence-end etc. Whats the exact list here?).

In [None]:
final_inputs = tokenizer.prepare_for_model(input_ids)
print(final_inputs["input_ids"])



## Decode input_ids back to text tokens

There is a method to go the reverse way. For example, to check what the text-tokens corresponding to the special tokens are, we can run the model-ready input-ids through `decode`.

In [None]:
# These can be decoded into text as well
# ok. So Decode is from input_id to text tag
print(tokenizer.decode(final_inputs["input_ids"]))

In [None]:
# Roberta tokenizer uses HTML style special tokens
# Note the use of tokenizer("string") instead of tokenizer.tokenize("string")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
inputs = tokenizer("Let's try to tokenize!")
print(tokenizer.decode(inputs["input_ids"]))

## Other usages

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Let's try to tokenize!")
print((inputs))

# Video 18,19 - Batching inputs together

The standard HF API allows multiple inputs (sentences) to be sent in for its various pipelines (_sentiment analysis etc_).

```python
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this.",
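]

# padding=True pads the shorter sentence so both rows have the same length
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"])
```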

This seems painless. The `padding=True` seems out of place (_shouldn't it do this automatically ?_) but otherwise, all is simple. However, a few things happen under the hood which is good to understand. I am glad they add these kinds of internal details in the videos.

## How does batching work.

 - In general sentences we pass to the model don't have the same length
 - Each sentence is converted to a vector (1d tensor)
 - Multiple sentences are added as rows to a 2D tensor (matrix) which is sent in as a single batched input
 - Since each sentence has a potentially different size, the smaller ones are all padded to the length of the longest one.
 - Padding is via a known token called `tokenizer.pad_token_id` which the model knows.
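
The padding step, plus the attention mask that goes with it, sketched in plain Python. The token ids and the pad id of `0` are made-up values; with a real tokenizer the pad id comes from `tokenizer.pad_token_id`.

```python
def pad_batch(batch_ids, pad_token_id):
    """Pad every sequence to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 = ignore this position
    return {"input_ids": input_ids, "attention_mask": attention_mask}

PAD_ID = 0  # assumed value for this sketch

batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]], PAD_ID)
for row, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(row, mask)
```

The mask is what lets the model tell real tokens from filler; without it the pads would be treated as context.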


### converting each sentence to its own 1d tensor

![](./img/v18-sentence-to-1d_tensor.png)

Here you see that each tensor has a different length. This is all good if you want to process each separately.

### batching the inputs into one tensor

Converting this directly to a tensor will fail because it is not rectangular.

![](./img/v18-nonrect-ids-to-tensor.png)

The way we fix this is to pad the smaller tensors. Truncating the larger one is an idea but a terrible one. Note that the pads are not random values, they should be `tokenizer.pad_token_id`

![](./img/v18-pad-smaller-tensor.png)

## running the singles and batched inputs through the model

This shows the outputs gotten by executing the model on the single as well as the batched inputs.

![](./img/v18-executing-single-batched-inputs.png)

Note that the batched outputs (_second sentence of the batch_) do differ from the single ones. This is because even though we did pad the inputs, there is an attention layer that is paying attention to the pads as well. _Note that this is incrementally examining what the internals do. Not that a developer using the external API will ever run into this as it is all taken care of by the HF library_.

![](./img/v18-attention-on-pads.png)

To fix this, we need to mask the attention layers so they ignore the padded values.

![](./img/v18-mask-attention-on-padding.png)

Putting it all together, we get the following low level code which works as expected.

![](./img/v18-padded-and-attention-masked-batch.png)


## Normal higher level batched input usage

![](./img/v18-high-level-batched-input-API.png)

# Video 20, 21 - Huggingface datasets overview

Don't think I'll need this right now but good to know in case I want to validate some model behavior.

 - [HF Datasets Reference](https://huggingface.co/docs/datasets/en/index)
 - [Apache Arrow](https://arrow.apache.org/) to facilitate in-memory datasets (_that are disk mapped so very little is actually loaded in memory_)
 - Text, Audio, Vision, Tabular etc datasets
 - 👉 can be streamed in as opposed to completely downloaded in one-shot.
 - 👉 can directly give PyTorch tensors via `Dataset.with_format("torch")`. _`Dataset` is a wrapper over an Apache Arrow table_
 - Simple API to fetch many publicly available datasets
 - Where on the web do they catalog this ?
 - new python package `datasets` ?

 ```python
 from datasets import load_dataset

 raw_datasets = load_dataset("glue", "mrpc")
 raw_datasets

 >> DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668    
    }),
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows : 408
    }),
    test : Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows : 1725
    })
 })
 ```

 for instance the **glue** dataset contains pairs of sentences.
  - think `features` == `columns`

👉 The files in the dataset are saved to disk using [Apache Arrow](https://arrow.apache.org/). This means, we can load just a slice of the dataset into memory as needed without loading the whole thing into RAM.
 - To access a single row, use `raw_datasets["train"][6]`
 - To access a slice of the above data, `raw_datasets["train"][:5]` will load rows [0, 1, 2, 3, 4]

## Exploring the dataset structures

A simple `print(raw_datasets)` dumps out the outline we see above

The `features` attribute of each split (_e.g. `raw_datasets["train"].features`_) shows more information about the features. _This is one of the confusing aspects of python. Objects can define `__repr__()` and `__str__()`, which let a structure print a custom version of itself. So it is not always clear whether something is a dictionary or not_.

```python
raw_datasets["train"].features

> { 'sentence1': Value(dtype='string', id=None),
    'sentence2': Value(dtype='string', id=None),
    'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
    'idx': Value(dtype='int32', id=None)}
```

From the above, we can read that label 0 → `not_equivalent` and label 1 → `equivalent`

## Example tokenization of this dataset

```python
from transformers import AutoTokenizer

checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    # max_length ? Needs max_length=xxx and then pads till that ? Why not max_length=xxx and padding=True ?
    # If truncating to max_length, will there be a warning or accumulation of such warnings ?
    #
    # Normally, we tokenize a single string. Sending in two strings needs some investigation. Since the
    # tokenizer object is being called directly, this is the __call__ method.
    # Look at https://huggingface.co/docs/transformers/v4.37.2/en/main_classes/tokenizer#transformers.PreTrainedTokenizer
    #
    # arg1: text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. 
    # Each sequence can be a string or a list of strings (pretokenized string). If the sequences are 
    # provided as list of strings (pretokenized), you must set is_split_into_words=True 
    # (to lift the ambiguity with a batch of sequences).
    #
    # arg2: text_pair (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. 
    # Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided
    #  as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity
    #  with a batch of sequences).
    #
    # Still don't understand what special things are done when the inputs are text_pairs. How is the pairing 
    # relevant to tokenizing ? Since labels are provided, during training we can mark the sentence-pair as similar
    # which means keep the vector embedding close otherwise, keep vector embedding far apart ? Hence send tokens
    # for the pair in one shot ?
    return tokenizer(
        example["sentence1"], example["sentence2"], padding="max_length", truncation=True, max_length=128    
    )

tokenized_datasets = raw_datasets.map(tokenize_function)
print(tokenized_datasets.column_names)

# This map is not a generic map function. It adds new 'keys' to existing keys in raw_datasets.
# Also seems to apply to each table (underlying Apache Arrow tables)
# tokenize adds the following columns:
#  'input_ids', 'token_type_ids', 'attention_mask'
```

## Example batched tokenization of this dataset.

> Batching is an operational thing to optimize the usage of the GPU. Fit as much into one batch as the free memory of the GPU allows to speed up training. During inference, the same logic applies. Additionally, you can batch multiple inference requests together to trade off slightly increased latency for much higher throughput. An impactful topic.

We can send multiple inputs to the tokenizer. If you look at the arguments to the `tokenizer.__call__` method, the first arg can be 
 - `str`
 - `List[str]`
 - `List[List[str]]`
 - can be strings or pre-tokenized strings (_if pre-tokenized set `is_split_into_words=True`_)

So basically send in a list of strings. This is accomplished via the `batched=True` argument of `Dataset.map`. The actual `tokenize_function` does not have to change since it accepts `str`, `List[str]` or `List[List[str]]`.

![](./img/v20-batched-dataset-mapping.png)
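
To convince myself why the same function works in both modes, here is a toy sketch of what a batched map does. This is not the HF implementation: `toy_map` and `toy_tokenize` are hypothetical stand-ins, and whitespace splitting replaces the real tokenizer. The point is only that in batched mode the function receives one dict of columns (lists) instead of one row.

```python
def toy_map(rows, fn, batched=False, batch_size=1000):
    """Toy stand-in for Dataset.map; rows is a list of dicts, one per example."""
    out = []
    if not batched:
        for row in rows:
            out.append({**row, **fn(row)})
        return out
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched mode hands fn one dict of columns: {"col": [v0, v1, ...]}
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(columns)
        for i, row in enumerate(chunk):
            out.append({**row, **{k: v[i] for k, v in new_columns.items()}})
    return out


def toy_tokenize(example):
    """Accepts a str or a list of str, like the real tokenizer's first argument."""
    text = example["sentence1"]
    if isinstance(text, str):
        return {"tokens": text.split()}
    return {"tokens": [t.split() for t in text]}


rows = [{"sentence1": "a b c"}, {"sentence1": "d e"}, {"sentence1": "f"}]
one_by_one = toy_map(rows, toy_tokenize)
in_batches = toy_map(rows, toy_tokenize, batched=True, batch_size=2)
print(one_by_one == in_batches)
```

The results are identical; the batched version simply makes far fewer calls to the function, which is where a parallel tokenizer can shine.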

## Example preparation for training

 - Remove columns not needed for training. _Remove `idx`, `sentence1`, `sentence2`_
 - rename columns as needed. _HF models expect the column `labels` so rename `label` -> `labels`_
 - convert to ML runtime's format (_PyTorch for instance_)

 ![](./img/v20-prepare-tokenized-dataset-for-training.png)

If needed, a smaller sample of the dataset can be sliced out using the `select` method: 

```python
small_train_dataset = tokenized_datasets["train"].select(range(100))
```

# Video 22,23 - Pre Processing sentence pairs

Problems like
 - **Identifying duplicate questions** (_Supervised. Labels provided with the pairs_)
    - `What are the best resources for learning Morse code` **not duplicate** of `What is Morse code?`
    - `How does an IQ test work and what is determined from an IQ test?` **duplicate** of `How does IQ test works?`
    - etc
    - To answer: `Are these two questions duplicates?`
 - **Does the first sentence imply the other or not**
   - This uses labels
     - *Contradiction* - False. Opposite of imply
     - *neutral* - Neither true nor false
     - *entailment* - Yes, it implies
   - `Fun for only children` **contradicts** `Fun for adults and children`
   - `Well you're a mechanics student right?` **neutral** toward `Yeah well you're a student right?`
   - `The other men were shuffled around` **entailment** of `The other men shuffled` _sounds nonsensical. "were shuffled around" is not "shuffled around"._ Anyway, I get the point of how the labeling works for training.

## sentence-pair datasets

The [GLUE](https://gluebenchmark.com/) benchmark is an academic benchmark for text classification. It currently provides 10 datasets, of which 8 are sentence pairs
 - MRPC _Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent_
 - STS-B - _The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 1 to 5._
 - QQP _Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent._
 - MNLI _Multi-Genre Natural Language Inference Corpus is a crowdsourced collection of sentence pairs with textual entailment annotations_
 - QNLI _Question NLI, derived from the Stanford Question Answering Dataset, a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator)._
 - RTE _Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. The authors of the benchmark combined the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Examples are constructed based on news and Wikipedia text._ 
 - WNLI - _Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices._ **CoRef ?**

 The single sentence ones are 
  - SST-2 - _Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels._ 
  - CoLA - _Corpus of Linguistic Acceptability_: whether it is a grammatical sentence

## BERT training objectives on sentence pairs

[BERT was trained to achieve dual objectives](https://www.linkedin.com/pulse/what-bert-how-trained-high-level-overview-suraj-yadav/) _Sylvain is often hard to understand, and it is hard to provide context for so much in so little time anyway. I looked around to get more info on BERT objectives_. BERT optimizes both of the following objectives simultaneously.
  - **Masked language modeling** (MLM) Predict a masked word from the other words using a bi-directional context.
  - **Next sentence prediction** (NSP) Predict whether two sentences occur consecutively or not. The input is the concatenation of the two sentences separated by a `[SEP]` token. Additionally, a special classification token `[CLS]` (_token to indicate a classification task on the entire input_) is inserted at the beginning.
    - During training a binary classification layer is added on top of the BERT model to perform this binary classification. It takes the hidden state representation of the `[CLS]` token ?? and outputs a probability score.

 ![NSP Example](./img/v22-nsp-example.png)

 The `transformers` library has a nice API to deal with pairs of sentences (_single pair or batched_)

```python
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Arg1 and Arg2 can be both
# str for singles
# List[str] or List[List[str]] for batched inputs.
tokenizer("My name is Sylvain.", "I work at Hugging Face.")
```

This outputs a single token stream (the pair is encoded as one input)
 - `token_type_ids`: shows which tokens belong to the first sentence and which to the second. `0` indicates sentence 1 and `1` the second, the one that is supposed to be classified as occurring next or not.
 - `attention_mask`: covers all tokens of both sentences.

```python
{
    'input_ids': [...],
    'token_type_ids' : [0, 0, 0,..., 1,1,1],
    'attention_mask' : [..]
}
```
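
To make the layout concrete, here is a small sketch of how a pair ends up as one sequence. It is an illustration only: `encode_pair` is a hypothetical helper, and whitespace splitting stands in for the real WordPiece tokenizer, so the tokens will not match the real output.

```python
def encode_pair(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP], BERT style."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Nothing is padded here, so every position is attended to
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


# Whitespace splitting is a stand-in for the real tokenizer
tokens, type_ids, mask = encode_pair(
    "my name is sylvain .".split(),
    "i work at hugging face .".split(),
)
for token, type_id in zip(tokens, type_ids):
    print(type_id, token)
```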

## sentence pair through the BERT model

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
batch = tokenizer(
    ["My name is Sylvain.", "Going to the cinema"],
    ["I work at Hugging face.", "This movie is great!"],
    padding=True,
    return_tensors="pt",
)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**batch)
```

# Video 24 - Dynamic padding

 - We need to pad sentences of different lengths so we can send em all in as a batch.
 - When we have a huge dataset
   - can pad em all to the longest string in the dataset (_global but likely very wasteful_)
   - pad them when we compose the batch (_obvious. I was expecting a more aha moment_). This is called *dynamic padding*. Downside is that batch shapes differ from batch to batch and this can slow things down on an accelerator that prefers fixed shapes ??


Some details on how to use the data collator; dynamic padding on CPUs, fixed padding on GPUs ?
**re-watch with more detailed notes if needed**
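
Until I re-watch, a quick back-of-the-envelope sketch of why padding per batch is cheaper than padding everything to the global maximum. The sequence lengths are made up and `padded_token_count` is a hypothetical helper; it only counts tokens, it does not touch any HF API.

```python
def padded_token_count(lengths, batch_size, dynamic):
    """Total tokens (real + pad) fed to the model for one pass over the data."""
    global_max = max(lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Dynamic padding pads to the longest item of this batch only
        target = max(batch) if dynamic else global_max
        total += target * len(batch)
    return total


lengths = [5, 7, 6, 128, 9, 8, 4, 6]  # hypothetical token counts per sentence
fixed_total = padded_token_count(lengths, batch_size=4, dynamic=False)
dynamic_total = padded_token_count(lengths, batch_size=4, dynamic=True)
print(fixed_total, dynamic_total)  # 1024 548
```

One long outlier forces every batch to 128 tokens with fixed padding, while with dynamic padding only the batch containing it pays that cost. The flip side is that the two batches now have different shapes.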

# Video 25 - The trainer API

Skipping for now.

# Videos 26..29 - TensorFlow specific 
Skipping for now.

# Video 30 - Write your training loop in PyTorch
Skipping for now.

# Video 31 - Supercharge PyTorch training with Accelerate
Skipping for now.

# Video 32, 33, 34, 35 - ModelHub push
Skipping for now.

# Video 36,37,38,39,40,41 - Custom Datasets
Skipping for now.

# Video 42 - Text embeddings and semantic search

I already have multiple docs on embeddings and their use in semantic search. It will be very useful to find out how the HF folk expose that functionality.

Start with just the basic high level concepts
 - represent text as an array of numbers, i.e. a vector
 - usually use an encoder transformer model _Recall that an encoder, because of its bi-directional context, is especially good at capturing language meaning and syntax. As long as the loss-function is defined as low vector-distance for similar and high for dissimilar, wouldn't any DL model work ok though ?_

![Text embeddings of some statements](./img/v42-embeddings-of-some-statements.png)
 - Simply reading the vector components shows that the `I took my dog for a walk` is very similar to `I took my cat for a walk`

## Vector similarity

When figuring out how different two vectors are, one measure to use is [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
 - an angle close to 180 degrees (_cosine close to -1_) can be a measure of opposite-ness
 - details related to normalizing the vectors (_as shown below_) affecting the scores. How does the magnitude of the vector come into play ?
 - [PyTorch - Cosine Similarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html) - _cosine similarity along `dim` which defaults to 1. PyTorch and others deal with tensors which are n-dimensional constructs. Dimension here corresponds to which dimension of the tensor is to be used as the vector. Right ?_
 - [PyTorch - Cosine loss](https://pytorch.org/docs/stable/generated/torch.nn.CosineEmbeddingLoss.html) shows how to use cosine similarity to compute the loss.


![cosine similarity](./img/v42-cosine-similarity.png)
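
A plain-Python sketch of the formula, to check my understanding of the magnitude question: cosine similarity is the dot product divided by the product of the two norms, so scaling a vector should not change the score. The vectors are made up and `cosine_similarity` here is my own helper, not the PyTorch one.

```python
import math


def cosine_similarity(u, v):
    """dot(u, v) / (|u| * |v|); only the angle between u and v matters."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(a, [2.0, 4.0, 6.0])     # a scaled by 2
opposite = cosine_similarity(a, [-1.0, -2.0, -3.0])        # 180 degrees apart
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])     # 90 degrees apart
print(same_direction, opposite, orthogonal)
```

The scaled vector scores (up to floating point rounding) 1, the flipped one -1, and the orthogonal pair 0. If the vectors are normalized to unit length first, the denominator is 1 and the score is just the dot product.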

## Hairyness

 - BERT produces one vector per token! A 384 dim vector!
 - Some math is performed to get a vector per sentence (_Do I need to check the math behind this ?_)
 - Exposes to additional tools to actually plot distances
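
One common way to get from per-token vectors to a single sentence vector is mean pooling over the real (non-padded) tokens. I have not confirmed this is exactly the math used in the video, so treat it as an assumption. A tiny sketch with made-up 2-dim vectors instead of 384-dim ones (`mean_pool` is a hypothetical helper):

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            summed = [s + x for s, x in zip(summed, vec)]
            count += 1
    return [s / count for s in summed]


# Two real tokens and one padded position whose vector must not leak in
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```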

## Detailed study

This is an important topic for me. Split off into its own notebook at [./NLP_HuggingFace_Embeddings.ipynb](./NLP_HuggingFace_Embeddings.ipynb)

# Video 43 - Training a new tokenizer

Video: https://www.youtube.com/watch?v=DJimQynXZsQ&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=43

**When is an existing tokenizer not applicable**

 - Corpus is in a different language
   - Sometimes, too many tokens are generated
   - Lots of `[UNK]` tokens
   - Too long tokens might clash against model limits
   - Learned NSP or MLM objectives are inapplicable.
 - Uses new characters
   - Lots of `[UNK]` tokens
 - Uses a specific vocab (different domain: medical, law etc)
 - Different style (_from a century back, say_)

## To train a new tokenizer

 - Gather data: the corpus of text
 - Choose a tokenizer architecture or a new arch (_if you have the expertise_)
 - Train
 - Save


All fast tokenizers from HF (_the ones `AutoTokenizer` returns by default_) have the method `train_new_from_iterator(text_iterator, vocab_size, new_special_tokens=None, special_tokens_map=None, **kwargs)` to train a tokenizer using a known architecture on a new corpus.


```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Python corpus from online public sources and github
raw_datasets = load_dataset("code_search_net", "python")

# A generator: each step yields a list of up to 1000 function source strings
def get_training_corpus_iterator():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx: start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus_iter = get_training_corpus_iterator()

# Start with the GPT2 tokenizer (its architecture is reused on the new corpus)
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Use a vocabulary of 52000 (Why?) for the new Tokenizer
# Wonder how long this takes on my PC with a 2080Ti. Will it even ?
new_tokenizer = old_tokenizer.train_new_from_iterator(training_corpus_iter, 52000)
```
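
The corpus iterator above is a Python generator, so it hands over the corpus in chunks instead of building one giant list. A toy version of the same pattern, with a made-up corpus and a hypothetical `batch_iterator` helper, shows the chunk sizes and one gotcha: a generator can only be consumed once.

```python
def batch_iterator(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


corpus = [f"def f{i}(): pass" for i in range(2500)]  # made-up stand-in corpus
batches = batch_iterator(corpus, 1000)

sizes = [len(batch) for batch in batches]
print(sizes)  # [1000, 1000, 500]

# The generator is exhausted after one pass; create a new one to iterate again
leftover = list(batches)
print(leftover)  # []
```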

# Video 44 - Why are fast tokenizers called fast

> Fast ones are backed by Rust. Capable of parallel processing

Sylvain uses the `mnli` configuration of the `glue` dataset for testing. Its splits and features are listed in the comments below.

```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Dataset
# This has multiple splits:
#    train,
#    validation_matched, validation_mismatched,
#    test_matched and test_mismatched
# Each with features: `premise`, `hypothesis`, `label` and `idx`
raw_datasets = load_dataset("glue", "mnli")

# There is a `use_fast=True` arg here making it fast! Has to be available
# right ?
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=False)

# And use this method to run the tokenizer
def tokenize(tokenizer, examples):
    return tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )

# On jupyter use %time to time the calls
# Using partially applied functions here instead of the examples
from functools import partial

# Tokenize with fast vs slow
# He says the fast one is 4 times faster
fast_tokenized_datasets = raw_datasets.map(
    partial(tokenize, fast_tokenizer)
    )

slow_tokenized_datasets = raw_datasets.map(
    partial(tokenize, slow_tokenizer)
    )

# However, since it is parallel capable
# using batched=True exploits its speed best
# This shows a 20x speedup!
fast_tokenized_datasets = raw_datasets.map(
    partial(tokenize, fast_tokenizer),
    batched=True
    )

slow_tokenized_datasets = raw_datasets.map(
    partial(tokenize, slow_tokenizer),
    batched=True
    )
```



In [None]:
from transformers import AutoTokenizer
from datasets import load_dataset
from functools import partial

# Gah! Dashed.
# Some weird download problems with this.
# Spent a lot of time debugging and found some issues. Must be common
# Turned out I needed to move to latest 2.17.0 datasets from the old one I had.
raw_datasets = load_dataset("glue", "mnli")

fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# And use this method to run the tokenizer
def v44_tokenize(tokenizer, examples):
    return tokenizer(
        examples["premise"], examples["hypothesis"], truncation=True
    )

v44_tokenize_with_fast = partial(v44_tokenize, fast_tokenizer)

In [None]:
%%time

# However, since it is parallel capable
# using batched=True exploits its speed best
# This shows a 20x speedup!
fast_tokenized_datasets = raw_datasets.map(v44_tokenize_with_fast, batched=True)

When I run the above on my Ryzen 16-core CPU + 2080Ti, it works great: 11.9s vs his 12.1s (_his numbers are from 2 years ago. Tokenization runs on the CPU, so these timings depend on the CPU rather than the GPU. My CPU and GPU are also old_). Awesome that this machine is up to these kinds of tasks.

# Video 45 - Fast tokenizer superpowers

 - Fast (_Rust based_)
 - Loaded via `AutoTokenizer.from_pretrained(checkpoint_name, use_fast=True)`. _`use_fast=True` is the default so you don't explicitly see this_
 - And have new features


## New features of fast tokenizers
 - Skips empty spaces in tokenization
 - Standardizes things like start-of-word symbols and words split across multiple tokens. Without this, it is not easy to figure out which word a token belongs to as each tokenizer has a different way of marking it.
   - RoBERTa uses `Ġ` at the start of a word
   - T5 uses `▁` at the start of a word
   - BERT uses nothing at the start of a word but uses `##` for trailing parts.

## word-ids

Fast tokenizers, along with the standard `encoding.tokens()` method, also have an `encoding.word_ids()` method which keeps track of which word each token belongs to. It returns a word-id for each token. When multiple tokens have the same word-id, you know they are splits of the same word.

## offset_mapping

There is a new `return_offsets_mapping=True` argument when tokenizing which adds a new data-field to the output: `encoding["offset_mapping"]`, which gives the span of characters each token comes from.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Deliberately introduce spaces
encoding  = tokenizer(
    "Let's talk about tokenizers     superpowers.",
    return_offsets_mapping=True)

In [None]:
inp_str = "Let's talk about tokenizers     superpowers."

# Explore the output
# Notice no-tokens for empty spaces
# ['[CLS]', 'Let', "'", 's', 'talk', 'about', 'token', '##izer', '##s', 'super', '##power', '##s', '.', '[SEP]']
print(encoding.tokens())

# Shows that word-id:5,6 are split into 3 tokens each.
# [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, None]
print(encoding.word_ids())

# Show for instance: that token:Let has offset (0,3). (char_idx=start, char_idx < end; ++char_idx)
# [(0, 0), (0, 3), (3, 4), (4, 5), (6, 10), (11, 16), (17, 22), (22, 26), (26, 27), (32, 37), (37, 42), (42, 43), (43, 44), (0, 0)]
print(encoding["offset_mapping"])

# Print out sub-strings belonging to tokens this way.
print(inp_str[32:42])

These new features of the tokenizer are super useful for special tasks where you need to map the tokens back to their spans etc.
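
As a small example of such a mapping, combining the word-ids with the offsets recovers each whole word from the original string. The sketch below needs no model download: the `word_ids` and `offsets` lists are copied from the outputs printed above, and the merging logic is my own.

```python
inp_str = "Let's talk about tokenizers     superpowers."

# Copied from encoding.word_ids() and encoding["offset_mapping"] above
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 7, None]
offsets = [(0, 0), (0, 3), (3, 4), (4, 5), (6, 10), (11, 16), (17, 22), (22, 26),
           (26, 27), (32, 37), (37, 42), (42, 43), (43, 44), (0, 0)]

word_spans = {}
for word_id, (start, end) in zip(word_ids, offsets):
    if word_id is None:  # special tokens such as [CLS] and [SEP]
        continue
    # Keep the start of the first token of the word, extend the end to the last one
    first_start, _ = word_spans.get(word_id, (start, end))
    word_spans[word_id] = (first_start, end)

words = [inp_str[start:end] for start, end in word_spans.values()]
print(words)
# ['Let', "'", 's', 'talk', 'about', 'tokenizers', 'superpowers', '.']
```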

# Video 46,47 - Inside the token classification pipeline

Resources
  - NLP Course: https://huggingface.co/learn/nlp-course/chapter7/2
  - Video: https://www.youtube.com/watch?v=0E7ltQB7fM8&list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o&index=46

`token-classification` is one of the supported HF pipelines. This is a generic problem which covers assigning a label/classification to each token. This can be done along many lines
 - NER _Find entities: names of people, locations, organizations etc_
 - POS _Parts of speech. Mark each word in a sentence as corresponding to a particular part of speech (noun, verb, adjective etc)_
 - Chunking _Find tokens that belong to the same entity. This task (combinable with POS or NER) can be formulated as one label, usually `B-`, for any token at the beginning of a chunk, another label, usually `I-`, for tokens inside a chunk, and a third label, usually `O`, for tokens that do not belong to any chunk._
 - Several others..
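
To see how the `B-` / `I-` / `O` scheme turns per-token labels into chunks, here is a minimal sketch. The tags are hand-written for illustration (no model is run), `bio_to_chunks` is a hypothetical helper, and joining tokens with a space is a simplification.

```python
def bio_to_chunks(tokens, tags):
    """Group tokens into (label, text) chunks from B-/I-/O tags."""
    chunks = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # A token continues the open chunk only if it is I-<same label>
        continues = prefix == "I" and label == current_label
        if not continues:
            if current_tokens:
                chunks.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
            if tag != "O":
                current_label = label
        if tag != "O":
            current_tokens.append(token)
    if current_tokens:
        chunks.append((current_label, " ".join(current_tokens)))
    return chunks


tokens = ["Sylvain", "works", "at", "Hugging", "Face", "in", "Brooklyn"]
tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]  # hand-written tags
chunks = bio_to_chunks(tokens, tags)
print(chunks)
# [('PER', 'Sylvain'), ('ORG', 'Hugging Face'), ('LOC', 'Brooklyn')]
```

A chunk that starts directly with an `I-` tag (no `B-`) is also accepted, since some models only ever emit `I-` labels.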

When you use `pipeline("token-classification")`, it performs NER classification
 - Tokens are classified into one of N entity classes (`PER`, `ORG`, `LOC`, `MISC`), plus the `O` label for tokens that belong to no entity



In [None]:
from transformers import pipeline

# This is a straightforward method which many times comes up with many small tokens 
# for each word.
token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging face in Brooklyn")

In [None]:
# We can form groups (by recognizing B-, I- tokens, start/end of tokens etc) to combine tokens related to one
# entity in one place. Instead of I-PER, I-LOC, we get PER, LOC etc.
token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging face in Brooklyn")

## Pipeline in detail

This has the usual steps of `Tokenization → Model → Post Processing`. In the following code you'll see that
 - `My name is Sylvain and I work at Hugging Face in Brooklyn.` gets broken down into 19 tokens
 - These 19 tokens have 9 outputs each which correspond to the probabilities associated with each of the possible 9 token classifier labels
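
Before the real code, a toy version of the post-processing step: each token's logits go through a softmax to become probabilities, and the argmax picks the label. The logits and the 3-label `id2label` map are made up (the real model has 9 labels).

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)
    exps = [math.exp(x - biggest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "O", 1: "I-PER", 2: "I-ORG"}       # hypothetical label map
token_logits = [[4.0, 0.5, 0.1], [0.2, 5.0, 0.3]]  # made-up logits for 2 tokens

labels = []
for logits in token_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    labels.append(id2label[best])

print(labels)  # ['O', 'I-PER']
```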

In [None]:
# Phase 1 - Tokenization
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")

# Phase 2 - Model execution to generate logits
outputs = model(**inputs)

print(inputs)

print(inputs["input_ids"].shape)
print(outputs.logits.shape)

In [None]:
# Phase 3 - Post processing
import torch

# classification label probabilities
# [NumBatches x NumTokens x NumLabels]
with torch.no_grad():
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]

#print(probabilities)

# argmax returns the indices of the tensor with max values
# along the supplied dimension (-1 or 2 in this case for the last dim)
predictions   = probabilities.argmax(dim=-1).tolist()
#print(predictions)

# The label -> name mapping is stored here
#print(model.config.id2label)

# Print out more information per-token
# - classification
# - text span (Note that this uses the fast-tokenizer's offsets)
results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]

    # unclassified tokens have the "O" label (letter O not number 0)
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {"entity": label, "score": probabilities[idx][pred],
             "word": tokens[idx], "start": start, "end":end}
        )

print(results)


In [None]:
# Group all tokens that belong to a word together into one-word
# Implementation of pipeline("token-classification", aggregation_strategy="simple")
import numpy as np

probabilities_list = probabilities.tolist()
label_map = model.config.id2label
results = []
idx = 0


# Note: Either (B-xxx)|(I-xxx)*
#         -or- (I-xxx)* (I-yyy)  # Where yyy != xxx. i.e, till a label transition occurs
# The label-transition pattern is more universal since not all classifiers use B-/I- patterns
while idx < len(predictions):
    pred   = predictions[idx]
    label  = label_map[pred]
    scores = []

    # unclassified tokens have the "O" label (letter O not number 0)
    if label != "O":
        # Remove the B- or I- tag
        label      = label[2:]
        start, end = offsets[idx]

        # take the first token (B-xxx or I-xxx), then grab till transition to a different label
        scores.append(probabilities_list[idx][pred])
        idx += 1
        while idx < len(predictions) and label_map[predictions[idx]] == f"I-{label}":
            _, end = offsets[idx]
            scores.append(probabilities_list[idx][predictions[idx]])
            idx += 1

        word = example[start:end]
        #print(scores)

        results.append(
            {"entity_group": label,
             "score": sum(scores)/len(scores),
             "word" : word,
             "start": start,
             "end": end}
        )
        # idx already points at the first token after this group, so do not skip it
        continue

    idx += 1

print(results)

# Extra - Parts of Speech tagging (POS)

While the videos do not show how POS tagging is done, I found some POS examples by searching for `POS` under [token classification tasks](https://huggingface.co/models?pipeline_tag=token-classification) and found a model with usage docs. It also shows a different way to assemble a pipeline from a tokenizer and a model. _There are several such POS models, so good_
 - [QCRI/bert-base-multilingual-cased-pos-english](https://huggingface.co/QCRI/bert-base-multilingual-cased-pos-english) is recent:  Jan 2023

```python
[
    {'entity': 'PRP$', 'score': 0.9995228, 'index': 1, 'word': 'My', 'start': 0, 'end': 2}, 
     {'entity': 'NN', 'score': 0.9995235, 'index': 2, 'word': 'name', 'start': 3, 'end': 7}, 
     {'entity': 'VBZ', 'score': 0.9995004, 'index': 3, 'word': 'is', 'start': 8, 'end': 10}, 
     {'entity': 'NNP', 'score': 0.9951285, 'index': 4, 'word': 'Jorge', 'start': 11, 'end': 16}, 
     {'entity': 'CC', 'score': 0.9996544, 'index': 5, 'word': 'and', 'start': 17, 'end': 20}, 
     {'entity': 'PRP', 'score': 0.9996605, 'index': 6, 'word': 'I', 'start': 21, 'end': 22}, 
     {'entity': 'VBP', 'score': 0.99771214, 'index': 7, 'word': 'live', 'start': 23, 'end': 27}, 
     {'entity': 'IN', 'score': 0.99966407, 'index': 8, 'word': 'in', 'start': 28, 'end': 30},
     {'entity': 'NNP', 'score': 0.9996543, 'index': 9, 'word': 'Mountain', 'start': 31, 'end': 39}, 
     {'entity': 'NNP', 'score': 0.9995591, 'index': 10, 'word': 'View', 'start': 40, 'end': 44}, 
     {'entity': ',', 'score': 0.9998921, 'index': 11, 'word': ',', 'start': 44, 'end': 45}, 
     {'entity': 'NNP', 'score': 0.99960035, 'index': 12, 'word': 'California', 'start': 46, 'end': 56}, 
     {'entity': '.', 'score': 0.9999255, 'index': 13, 'word': '.', 'start': 56, 'end': 57}
     ]
```     

Joining `Mountain` and `View` together because they are both NNP ? Need to come up with rules for POS just like for NER.

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

# Large model. pytorch_model.bin is 712M
model_name = "QCRI/bert-base-multilingual-cased-pos-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Named pos_pipeline so it does not shadow the `pipeline` function imported in other cells
pos_pipeline = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
outputs = pos_pipeline("My name is Jorge and I live in Mountain View, California.")
print(outputs)

In [1]:
# Load model directly
# https://huggingface.co/tliu/asp-coref-flan-t5-large/tree/main
# While it says the following should work, even after upgrading to latest transformers, nothing works.
#  Does it need TF to be installed somehow
#from transformers import T5Coref, AutoTokenizer
##
## Its config.json says it needs Transformers 4.33.3
##checkpoint = "tliu/asp-coref-flan-t5-large"
#
#tokenizer = AutoTokenizer.from_pretrained(checkpoint)
##model = T5Coref.from_pretrained(checkpoint)

ImportError: cannot import name 'T5Coref' from 'transformers' (/home/vamsi/mambaforge/envs/ml/lib/python3.10/site-packages/transformers/__init__.py)

# Video 48,49 - Inside the question answering pipeline

This pipeline can extract answers from a given context: _some text that contains the answer_.




In [None]:
from transformers import pipeline

question_answer = pipeline("question-answering")
context = """
Transformers is backed by the three most popular deep learning libraries - Jax, PyTorch and TensorFlow - with a seamless
 integration between them. It's straightforward to train your models with one before loading them for 
 inference with the other
"""

question = "Which deep learning libraries back Transformers?"

# This responds with 'Jax, PyTorch and TensorFlow'
question_answer(question=question, context=context)

The above snippet shows a small context, however this works for very long contexts as well. We'll show how later in this section.

The usual flow is used here: `tokenizer -> Model -> Postprocessing`
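
For the post-processing step, my understanding is that the model scores every token as a possible answer start and as a possible answer end, and the answer is the best valid (start, end) pair. A toy sketch of that selection with made-up logits; `best_span` and `max_answer_len` are hypothetical names, and the real pipeline does more (e.g. ignoring question tokens and handling long contexts).

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest scoring span with start <= end."""
    best = None
    for s, s_score in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best


tokens = ["which", "libraries", "?", "jax", ",", "pytorch", "and", "tensorflow"]
start_logits = [0.1, 0.0, 0.2, 5.0, 0.1, 1.0, 0.0, 0.5]  # made-up scores
end_logits = [0.0, 0.1, 0.3, 1.0, 0.0, 0.8, 0.1, 6.0]    # made-up scores

score, start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)  # 3 7 jax , pytorch and tensorflow
```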


# Video 50 - What is normalization

# Video 51 - What is pre-tokenization

# Video 52 - Byte-pair encoding tokenization

# Video 53 - WordPiece tokenization

# Video 54 - Unigram tokenization

# Video 55 - Building a new tokenizer

# Video 56 - Data processing for token classification

# Video 57 - Data processing for masked language modeling

# Video 58 - What is perplexity

# Video 59 - What is domain adaptation

# Video 60 - Data processing for translation

# Video 61 - What is the BLEU metric

# Video 62 - Data processing for summarization

# Video 63 - What is the ROUGE metric

# Video 64 - Data processing for causal Language modeling

# Video 65 - Using a custom loss function

# Video 66 - Data processing for Question answering

# Video 67,68 - Post processing in Question Answering (pyTorch)

# Video 69 - Data collators - A tour

# Video 70 - What to do when you get an error

# Video 71 - Using a debugger in a notebook

# Video 72 - Using a debugger in a terminal

# Video 73 - Asking for help on the forums

# Video 74,75 - Debugging the training pipeline (PyTorch)

# Video 76 - Writing a good issue/bug-report

# FAQ - HfArgumentParser

See 
 - [Automatically Generate Python CLI](https://python.plainenglish.io/how-to-automatically-generate-command-line-interface-for-python-programs-e9fd9b6a99ca)
 - [Python Data classes](https://realpython.com/python-data-classes/)

> The `HfArgumentParser` takes a bunch of dataclass types _(dataclasses are new in Python 3.7: supercharged dicts and tuples)_ which contain fields. It then dynamically generates CLI args for an underlying `ArgumentParser` and packs the parsed args back into instances of the data-classes.

Data classes look like plain classes with a lot of the boilerplate removed. They look a lot like Rust and Scala case classes.
  - lots of boilerplate removed
  - automatic `__repr__` to give good output

In the context of the TANL code-base, here is one such data-class

```python
from dataclasses import dataclass, field

import transformers

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    """
    Arguments for the Trainer.
    """
    output_dir: str = field(
        default='experiments',
        metadata={"help": "The output directory where the results and model weights will be written."}
    )
    
    zero_shot: bool = field(
        default=False,
        metadata={"help": "Zero-shot setting"}
    )
```

and this is how the args are parsed

```python
from transformers import HfArgumentParser

second_parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
second_parser.set_defaults(**defaults)
model_args, data_args, training_args = second_parser.parse_args_into_dataclasses(remaining_args)
```

 - It uses additional data-classes (`ModelArguments` and `DataTrainingArguments`)
 - One of the returned items is `training_args`.
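
To demystify the idea, here is a standard-library-only sketch of "dataclass fields in, CLI flags out, dataclass instance back". This is not how `HfArgumentParser` is implemented; `ToyTrainingArguments` and `parse_into_dataclass` are hypothetical names, and inferring the argument type from the default value is a simplification.

```python
import argparse
from dataclasses import dataclass, field, fields


@dataclass
class ToyTrainingArguments:
    output_dir: str = field(default="experiments", metadata={"help": "Where results are written."})
    epochs: int = field(default=3, metadata={"help": "Number of training epochs."})


def parse_into_dataclass(cls, argv):
    """Generate one --flag per dataclass field, parse argv, return a cls instance."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        parser.add_argument(
            f"--{f.name}",
            type=type(f.default),  # simplification: infer the CLI type from the default
            default=f.default,
            help=f.metadata.get("help"),
        )
    namespace = parser.parse_args(argv)
    return cls(**vars(namespace))


args = parse_into_dataclass(ToyTrainingArguments, ["--epochs", "5"])
print(args)  # ToyTrainingArguments(output_dir='experiments', epochs=5)
```

The `metadata={"help": ...}` entries play the same role as in the TANL data-class above: they become the help text of the generated flags.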



