# LLM (Gemma-2B-it)
Documentation:
- [Gemma-2B-it](https://huggingface.co/google/gemma-2b-it)
- [AutoTokenizer](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoTokenizer)
  - [GemmaTokenizer](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/gemma#transformers.GemmaTokenizer)
  - [Tokenizer call](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__)
- [AutoModelForCausalLM](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoModelForCausalLM)
  - [GemmaModel](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/gemma#transformers.GemmaModel)
  - [Model generation](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate)

In [3]:
# Global constants for LLM
llm_model_id = "google/gemma-2b-it"
llm_model_folder = f"./models/{llm_model_id.split('/')[-1]}"

## Download the model

In [4]:
import os

# Gemma is a gated model: first request access on its model page (https://huggingface.co/google/gemma-2b-it).
# Once access is granted, authenticate with huggingface-cli (https://huggingface.co/docs/huggingface_hub/guides/cli)

# Speed up file transfers with the Hub (requires the `hf_transfer` package). huggingface_hub reads this
# variable when it is imported, so set it before the import.
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

from huggingface_hub import snapshot_download

# Download the model
print(f"Downloading model '{llm_model_id}' to '{llm_model_folder}'")
snapshot_download(llm_model_id, local_dir=llm_model_folder, local_dir_use_symlinks=False)

Downloading model google/gemma-2b-it at ./models/gemma-2b-it


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

gemma-2b-it.gguf:   0%|          | 0.00/10.0G [00:00<?, ?B/s]

'/Users/isaacperezborrero/Documents/vlm-gemma2b/models/gemma-2b-it'

## Load the model and the tokenizer

In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM

llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_folder, local_files_only=True)
llm_model = AutoModelForCausalLM.from_pretrained(llm_model_folder, local_files_only=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Inspect Tokenizer
The Stanford NLP group defines tokenization as follows:

_Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation._

The tokenizer used in `Gemma-2B-it` is based on [byte-level Byte-Pair-Encoding](https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/models/gemma/tokenization_gemma.py#L39). Each string is split according to this algorithm, and each resulting piece is mapped to a token.

Internally, the model represents each token as an embedding vector, whose values are learned during training.

Each token has an integer identifier (id) that maps its string to the corresponding embedding.

The vocabulary of `Gemma-2B-it` contains `256000` tokens, and each token uses a `2048`-dimensional embedding vector.
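To get a feel for these numbers, the short calculation below estimates the size of the embedding table. It is only a back-of-the-envelope sketch: the vocabulary size and embedding dimension come from the text above, while the 4 bytes per value (float32) is an assumption about how the weights are stored in memory.

```python
# Values taken from the text above
vocab_size = 256_000     # number of tokens in the vocabulary
embedding_dim = 2_048    # size of each embedding vector

# The embedding table holds one vector per token
num_parameters = vocab_size * embedding_dim
print(f"Embedding parameters: {num_parameters:,}")  # 524,288,000

# Memory footprint, assuming 4 bytes per value (float32)
bytes_per_value = 4
size_gib = num_parameters * bytes_per_value / 1024**3
print(f"Size in float32: {size_gib:.2f} GiB")  # 1.95 GiB
```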

We can convert strings into token ids (integers) and back again with the following methods:
  - [`encode`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode): `str` -> `list[int]`
  - [`decode`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode): `list[int]` -> `str`

The model uses special tokens to mark, among other things, where a sequence begins and where the user prompt ends. These are the special tokens used by `Gemma-2B-it`:
  - `<bos>`: Stands for "beginning of sequence." It marks the start of a text sequence. By default, the tokenizer adds it at the start of every encoded input.
  - `<eos>`: Stands for "end of sequence." It marks the end of a text sequence and helps the model determine when to stop generating text.
  - `<pad>`: Stands for "padding." It fills the empty positions of shorter sequences so that all sequences in a batch have the same length.
  - `<unk>`: Stands for "unknown." It is a placeholder for text that cannot be mapped to any token in the model's vocabulary.
  - `<start_of_turn>`: Marks the beginning of a speaker's turn in a conversation.
  - `<end_of_turn>`: Marks the end of a speaker's turn in a conversation.
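As an illustration of how the turn tokens fit together, the sketch below wraps a single user message in the turn markers described on the Gemma model card. `build_gemma_prompt` is a hypothetical helper written for this example, not part of any library; in practice the tokenizer's `apply_chat_template` method is the usual way to build such prompts. The `<bos>` token is left out on purpose because the tokenizer adds it when encoding.

```python
def build_gemma_prompt(user_message: str) -> str:
    """Hypothetical helper: wrap one user message in Gemma's turn markers."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"  # the model writes its answer after this marker
    )

prompt = build_gemma_prompt("What is a tokenizer?")
print(prompt)
```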

In [35]:
print(f"Using a vocabulary of {llm_tokenizer.vocab_size} tokens")

# Special tokens (some are standard in the Hugging Face library, others are specific to Gemma)
hf_special_tokens = [llm_tokenizer.bos_token, llm_tokenizer.eos_token, llm_tokenizer.unk_token, llm_tokenizer.pad_token]
gemma_special_tokens = llm_tokenizer.additional_special_tokens
special_tokens = hf_special_tokens + gemma_special_tokens
print(f"Special tokens: {special_tokens}")

# Print some tokens and their ids
vocab = llm_tokenizer.get_vocab()
print("Mapping of some tokens:")
for key in special_tokens + ['0', '1', '2', 'a', 'b', 'c', 'the']:
    print(f"Token '{key}' correspond to id '{vocab[key]}'")  # also valid: llm_tokenizer.convert_tokens_to_ids(key)

# Let's encode a string
text = "the0a1b"
text_encoded = llm_tokenizer.encode(text)

# It adds the <bos> token at the beginning. To disable this, pass add_special_tokens=False to the encode method.
print("Encoding:")
print(f"'{text}' is encoded as {text_encoded}")

# Decode the tokens back to a string
text_decoded = llm_tokenizer.decode(text_encoded)
print(f"{text_encoded} is decoded as '{text_decoded}'")

Using a vocabulary of 256000 tokens
Special tokens: ['<bos>', '<eos>', '<unk>', '<pad>', '<start_of_turn>', '<end_of_turn>']
Mapping of some tokens:
Token '<bos>' correspond to id '2'
Token '<eos>' correspond to id '1'
Token '<unk>' correspond to id '3'
Token '<pad>' correspond to id '0'
Token '<start_of_turn>' correspond to id '106'
Token '<end_of_turn>' correspond to id '107'
Token '0' correspond to id '235276'
Token '1' correspond to id '235274'
Token '2' correspond to id '235284'
Token 'a' correspond to id '235250'
Token 'b' correspond to id '235268'
Token 'c' correspond to id '235260'
Token 'the' correspond to id '1175'
Encoding:
'the0a1b' is encoded as [2, 1175, 235276, 235250, 235274, 235268]
[2, 1175, 235276, 235250, 235274, 235268] is decoded as '<bos>the0a1b'


To train the model we need two things: the token ids, which the model uses to look up the embeddings, and the __attention mask__. The attention mask lets the model distinguish actual content from padding within the input sequences.

In NLP tasks, input sequences vary in length, but the sequences in a batch must share the same length so they can be stacked into a single tensor. To address this, shorter sequences are padded with a special token until they reach a uniform length. While necessary for processing, these padding tokens should not influence the model's predictions.

The `attention_mask` is a binary mask (zeros and ones) that tells the model which positions hold real tokens and which hold padding. For most models, a 1 indicates a real token and a 0 indicates a padding token. During the attention calculation, the mask is used to set the attention scores of padding positions to a very large negative value, so that after the softmax their weights are close to zero and they do not contribute to the final output.
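The masking step can be sketched in a few lines of plain Python. This is a simplified illustration rather than the code the model actually runs: `masked_softmax`, the example scores, and the `-1e9` constant are all assumptions made for this example.

```python
import math

def masked_softmax(scores, attention_mask, neg=-1e9):
    """Softmax over scores, ignoring positions where attention_mask is 0."""
    # Add a very large negative value to the scores of padding positions
    masked = [s if m == 1 else s + neg for s, m in zip(scores, attention_mask)]
    # Numerically stable softmax: subtract the maximum before exponentiating
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.5]   # raw attention scores for four positions
attention_mask = [1, 1, 0, 0]   # the last two positions are padding
weights = masked_softmax(scores, attention_mask)
print([round(w, 4) for w in weights])  # [0.7311, 0.2689, 0.0, 0.0]
```

The two padding positions receive a weight of zero, so only the real tokens share the attention.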

Consider a scenario where you have two sentences:

- Sentence A: `"Hello, how are you?"`
- Sentence B: `"Good morning."`

If you need to pad these sentences to a fixed length of 6 tokens each, your input might look like this after tokenization and padding (assuming a simple tokenizer that splits on words and punctuation, and `[PAD]` as the padding token):

- Sentence A Tokens: `["Hello", ",", "how", "are", "you", "?"]`
- Sentence B Tokens: `["Good", "morning", ".", "[PAD]", "[PAD]", "[PAD]"]`

Correspondingly, the attention mask for these inputs would be:

- Sentence A Mask: `[1, 1, 1, 1, 1, 1]` (indicating all tokens should be attended to)
- Sentence B Mask: `[1, 1, 1, 0, 0, 0]` (indicating only the first three tokens are real and the rest are padding)

By using the attention mask, the model knows to focus on the meaningful content and ignore the padding, thus ensuring more accurate processing and analysis of the input data.

We can prepare everything we need to train the model with the tokenizer's `__call__` method, a more flexible and feature-rich interface than `encode`. When you use the tokenizer object like a function (i.e., `tokenizer("Your input text here.")`), you are invoking its `__call__` method. It tokenizes the text like `encode` does, and it can also pad the inputs to a common length, truncate them to the model's maximum length, and return tensors ready to feed into a model. In short, it prepares the model inputs in one step.

The `__call__` method returns a dictionary-like object with keys such as `input_ids`, `token_type_ids`, and `attention_mask`, depending on the configuration and on the model you are working with. For `Gemma-2B-it`, the output contains `input_ids` and `attention_mask`.

In [54]:
# Since we plan to use the tokenizer to train the model and the model expects PyTorch tensors as input, we set
# return_tensors="pt". We also enable padding because the sentences do not have the same length.
sentences = ["the", "the0a1b"]
tokenizer_output = llm_tokenizer(sentences, return_tensors="pt", padding=True)

print(f"Tokenizer output for two sentences: {sentences}")
for k, v in tokenizer_output.items():
    print(f"{k} {tuple(v.shape)}: {v}")

Tokenizer output for two sentences: ['the', 'the0a1b']
input_ids (2, 6): tensor([[     0,      0,      0,      0,      2,   1175],
        [     2,   1175, 235276, 235250, 235274, 235268]])
attention_mask (2, 6): tensor([[0, 0, 0, 0, 1, 1],
        [1, 1, 1, 1, 1, 1]])
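
Note that the padding was added on the left of the shorter sentence. The sketch below rebuilds the same batch by hand from the token ids printed earlier. `pad_batch` is a hypothetical helper written for this example, and its defaults (pad id `0`, left padding) are assumptions chosen to match the output above.

```python
def pad_batch(batch, pad_id=0, padding_side="left"):
    """Pad lists of token ids to a common length and build the attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + ids)
            attention_mask.append([0] * n_pad + [1] * len(ids))
        else:
            input_ids.append(ids + [pad_id] * n_pad)
            attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

# Token ids of "the" and "the0a1b", including the leading <bos> (id 2)
batch = [[2, 1175], [2, 1175, 235276, 235250, 235274, 235268]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[0, 0, 0, 0, 2, 1175], [2, 1175, 235276, 235250, 235274, 235268]]
print(attention_mask)  # [[0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 1, 1]]
```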


## Inspect LLM

So far we have not used embeddings. We have converted text to token ids with the `encode` method and converted the ids back to text with the `decode` method. The `__call__` method does the same as `encode`, with some extra functionality.

The tokenizer does not provide embeddings. It only converts text to token ids, which are then fed into the model to obtain the embeddings. __The embeddings themselves are retrieved from the model, not the tokenizer__.

In the case of `Gemma-2B-it`, the embedding layer is stored at `model.embed_tokens` (accessed below as `llm_model.model.embed_tokens`). It is a PyTorch `nn.Embedding` layer, a simple lookup table that stores one fixed-size embedding vector for each entry of a fixed dictionary.
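
To make the "lookup table" idea concrete, here is a toy version in plain Python. The table values are made up and the functions `embed` and `embed_one_hot` are hypothetical names for this sketch. It shows that looking up a row by token id gives the same result as multiplying a one-hot vector by the table.

```python
# Toy embedding table: 4 tokens, 3 dimensions (made-up values)
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def embed(token_ids):
    """Lookup: return the table row for each token id."""
    return [table[i] for i in token_ids]

def embed_one_hot(token_ids):
    """Same result, computed as one-hot vector times table."""
    n_tokens, n_dims = len(table), len(table[0])
    rows = []
    for i in token_ids:
        one_hot = [1.0 if j == i else 0.0 for j in range(n_tokens)]
        rows.append([sum(one_hot[j] * table[j][d] for j in range(n_tokens))
                     for d in range(n_dims)])
    return rows

print(embed([2, 0]))  # [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]
print(embed([2, 0]) == embed_one_hot([2, 0]))  # True
```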

In [41]:
import torch

# Embedding for 'the' token
embedding = llm_model.model.embed_tokens(torch.LongTensor([1175]))
print(f"Embedding for 'the' token (id 1175) with shape {embedding.shape}: {embedding.detach()}")

Embedding for 'the' token (id 1175) with shape torch.Size([1, 2048]): tensor([[ 0.1924, -0.0713, -0.0732,  ...,  0.0156, -0.0017,  0.0288]])


## Inference

# Visual encoder (CLIP model)