# LLM (Gemma-2B-it)
Documentation:
- [Gemma-2B-it](https://huggingface.co/google/gemma-2b-it)
- [AutoTokenizer](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoTokenizer)
  - [GemmaTokenizer](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/gemma#transformers.GemmaTokenizer)
  - [Tokenizer call](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__)
- [AutoModelForCausalLM](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/auto#transformers.AutoModelForCausalLM)
  - [GemmaModel](https://huggingface.co/docs/transformers/v4.38.2/en/model_doc/gemma#transformers.GemmaModel)
  - [Model generation](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate)

In [2]:
# Global constants for LLM
llm_model_id = "google/gemma-2b-it"
llm_model_folder = f"./models/{llm_model_id.split('/')[-1]}"

## Download the model

In [3]:
import os
from huggingface_hub import snapshot_download

# First, go to the model's web site and apply to access the model (https://huggingface.co/google/gemma-2b-it). Once you 
# are granted access, use huggingface-cli for authentication (https://huggingface.co/docs/huggingface_hub/guides/cli)

# Speed up file transfers with the Hub (requires the optional `hf_transfer` package to be installed)
os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'

# Download the model
print(f"Downloading model '{llm_model_id}' to '{llm_model_folder}'")
snapshot_download(llm_model_id, local_dir=llm_model_folder, local_dir_use_symlinks=False)

Downloading model 'google/gemma-2b-it' to './models/gemma-2b-it'


Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

gemma-2b-it.gguf:   0%|          | 0.00/10.0G [00:00<?, ?B/s]

'/Users/isaacperezborrero/Documents/vlm-gemma2b/models/gemma-2b-it'

## Load the model and the tokenizer

In [46]:
from transformers import AutoTokenizer, AutoModelForCausalLM

llm_tokenizer = AutoTokenizer.from_pretrained(llm_model_folder, local_files_only=True)
llm_model = AutoModelForCausalLM.from_pretrained(llm_model_folder, local_files_only=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Inspect Tokenizer
The definition of tokenization, as given by the Stanford NLP group, is:

_Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation_

The tokenizer used in `Gemma-2B-it` is based on [byte-level Byte-Pair-Encoding](https://github.com/huggingface/transformers/blob/092f1fdaa4224fdd88c616dc9678e6fcb37bfffd/src/transformers/models/gemma/tokenization_gemma.py#L39). Each string is split according to this algorithm, and each resulting piece is mapped to a token.
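
To make the splitting step more concrete, here is a toy sketch of how Byte-Pair-Encoding merges work. It is not Gemma's actual tokenizer: the helper `bpe_split` and the two merge rules in `toy_merges` are hypothetical, chosen only so that the string used later in this notebook splits into the same pieces.

```python
def bpe_split(word, merges):
    """Split `word` into pieces by applying BPE merge rules in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:  # earlier rules have higher priority
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair (left, right) wherever it appears side by side
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules; a real tokenizer learns many thousands of them from data
toy_merges = [("t", "h"), ("th", "e")]

pieces = bpe_split("the0a1b", toy_merges)
print(pieces)
```

With these two rules, the characters `t`, `h` and `e` are merged into a single piece, while the remaining characters stay on their own because no rule covers them.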

A token is represented as an embedding vector internally in the model. During training, the model learns the values for the embedding.

Each token has an integer identifier that maps its string to the corresponding embedding.

The vocabulary of `Gemma-2B-it` contains `256000` tokens, and each token uses a `2048`-dimensional embedding vector.

We can transform strings into tokens (represented as integers) and the other way around. This can be achieved with the
methods:
  - [`encode`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.encode): `str` -> `list[int]`
  - [`decode`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.decode): `list[int]` -> `str`

The model uses special tokens to mark, among other things, the beginning of a sequence or the end of a turn in the prompt. These are the special tokens used by `Gemma-2B-it`:
  - `<bos>`: Stands for "beginning of sequence." It marks the start of a text sequence. The tokenizer prepends it to every encoded input by default.
  - `<eos>`: Stands for "end of sequence." It marks the end of a text sequence. During generation, the model emits it to signal that the answer is complete.
  - `<pad>`: Stands for "padding." It fills the empty positions of shorter sequences so that all sequences in a batch have the same length.
  - `<unk>`: Stands for "unknown." It is a placeholder for words or characters that cannot be represented with the model's vocabulary.
  - `<start_of_turn>`: Marks the beginning of a speaker's turn in a conversation. It is followed by the name of the role (`user` or `model`).
  - `<end_of_turn>`: The counterpart of `<start_of_turn>`. It marks the end of a speaker's turn.

In [47]:
print(f"Using a vocabulary of {llm_tokenizer.vocab_size} tokens")

# Special tokens (some of them already exist in the HuggingFace library and others are specific to Gemma)
hf_special_tokens = [llm_tokenizer.bos_token, llm_tokenizer.eos_token, llm_tokenizer.unk_token, llm_tokenizer.pad_token]
gemma_special_tokens = llm_tokenizer.additional_special_tokens
special_tokens = hf_special_tokens + gemma_special_tokens
print(f"Special tokens: {special_tokens}")

# Print some tokens and their ids
vocab = llm_tokenizer.get_vocab()
print("Mapping of some tokens:")
for key in special_tokens + ['0', '1', '2', 'a', 'b', 'c', 'the', '\n\n']:
    print(f"Token {repr(key)} correspond to id '{vocab[key]}'")  # also valid: llm_tokenizer.convert_tokens_to_ids(key)

# Let's encode a string
text = "the0a1b"
text_encoded = llm_tokenizer.encode(text)

# It adds the <bos> token at the beginning. To disable this, pass add_special_tokens=False to the encode method.
print("Encoding:")
print(f"'{text}' is encoded as {text_encoded}")

# Decode the tokens back to a string
text_decoded = llm_tokenizer.decode(text_encoded)
print(f"{text_encoded} is decoded as '{text_decoded}'")

Using a vocabulary of 256000 tokens
Special tokens: ['<bos>', '<eos>', '<unk>', '<pad>', '<start_of_turn>', '<end_of_turn>']
Mapping of some tokens:
Token '<bos>' correspond to id '2'
Token '<eos>' correspond to id '1'
Token '<unk>' correspond to id '3'
Token '<pad>' correspond to id '0'
Token '<start_of_turn>' correspond to id '106'
Token '<end_of_turn>' correspond to id '107'
Token '0' correspond to id '235276'
Token '1' correspond to id '235274'
Token '2' correspond to id '235284'
Token 'a' correspond to id '235250'
Token 'b' correspond to id '235268'
Token 'c' correspond to id '235260'
Token 'the' correspond to id '1175'
Token '\n\n' correspond to id '109'
Encoding:
'the0a1b' is encoded as [2, 1175, 235276, 235250, 235274, 235268]
[2, 1175, 235276, 235250, 235274, 235268] is decoded as '<bos>the0a1b'


To train the model, we need two things: the ids of the tokens, so that the model can look up their embeddings, and the __attention mask__. The primary use of the attention mask is to allow the model to differentiate between the actual content and the padding within the input sequences.

In NLP tasks, input sequences can vary in length. However, the sequences in a batch must share the same length so that they can be stacked into a single tensor. To address this, sequences are padded with a special token to reach a uniform length before being fed into the model. While necessary for processing, these padding tokens should not influence the model's predictions.

The attention mask is a binary mask (i.e., consisting of zeros and ones) that tells the model which positions hold actual data and which hold padding. For most models, a 1 indicates a real token and a 0 indicates a padding token.

During the attention calculation, the mask is used to cancel the effect of the padding tokens by setting their attention scores to a very large negative value. When the softmax function is then applied to the attention scores, the weights of the padding tokens are close to zero, so they do not contribute to the final output.

Consider a scenario where you have two sentences:

- Sentence A: `"Hello, how are you?"`
- Sentence B: `"Good morning."`

If you need to pad these sentences to a fixed length of 5 tokens each, your input might look like this after tokenization and padding (assuming `[PAD]` is the padding token):

- Sentence A Tokens: `[Hello, how, are, you, ?]`
- Sentence B Tokens: `[Good, morning, [PAD], [PAD], [PAD]]`

Correspondingly, the attention mask for these inputs would be:

- Sentence A Mask: `[1, 1, 1, 1, 1]` (indicating all tokens should be attended to)
- Sentence B Mask: `[1, 1, 0, 0, 0]` (indicating only the first two tokens are real and the rest are padding)

By using the attention mask, the model knows to focus on the meaningful content and ignore the padding, thus ensuring more accurate processing and analysis of the input data.
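
The masking step can be sketched in plain Python. The function `masked_softmax` and the raw scores are hypothetical and only illustrate the idea for the mask of Sentence B; a real model applies the same trick to full attention matrices inside each attention layer.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over `scores`, ignoring the positions where `mask` is 0."""
    # Replace the scores of the padding positions with a very large negative value
    masked = [s if m == 1 else -1e9 for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5, 0.5]  # hypothetical raw attention scores
mask = [1, 1, 0, 0, 0]              # attention mask of Sentence B

weights = masked_softmax(scores, mask)
print([round(w, 3) for w in weights])
```

The attention weights of the three padding positions come out as zero, and the two real tokens share all of the attention between them.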

We can prepare everything we need to train the model using the Tokenizer call method (`__call__`). The `__call__` method is a more flexible and feature-rich interface to the tokenizer. When you use the tokenizer object like a function (i.e., `tokenizer("Your input text here.")`), you are actually invoking its `__call__` method. This method performs tokenization (like `encode`), but it also handles additional features such as padding the inputs to a common length, truncating inputs to the model's maximum length, and returning tensors ready to feed into a model. Essentially, it is designed to prepare the model inputs in one step.

The `__call__` method returns a dictionary-like object whose keys depend on the configuration and on the specific model, for example `input_ids`, `token_type_ids`, and `attention_mask`.

In [54]:
# We plan to use the tokenizer output to train the model, which expects PyTorch tensors as input, so we set
# return_tensors="pt". We also enable padding because the sentences do not have the same length.
sentences = ["the", "the0a1b"]
tokenizer_output = llm_tokenizer(sentences, return_tensors="pt", padding=True)

print(f"Tokenizer output for two sentences: {sentences}")
for k, v in tokenizer_output.items():
    print(f"{k} {tuple(v.shape)}: {v}")

Tokenizer output for two sentences: ['the', 'the0a1b']
input_ids (2, 6): tensor([[     0,      0,      0,      0,      2,   1175],
        [     2,   1175, 235276, 235250, 235274, 235268]])
attention_mask (2, 6): tensor([[0, 0, 0, 0, 1, 1],
        [1, 1, 1, 1, 1, 1]])
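
Note that in the output above the padding id (`0`) is placed on the left of the shorter sentence. The following plain-Python sketch reproduces that result from the token ids. The helper `pad_batch` is hypothetical and is not part of the `transformers` library, where the padding side is controlled by the tokenizer's `padding_side` attribute.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad lists of token ids to a common length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        zeros = [0] * len(padding)  # mask value for padding positions
        ones = [1] * len(seq)       # mask value for real tokens
        if side == "left":
            input_ids.append(padding + seq)
            attention_mask.append(zeros + ones)
        else:
            input_ids.append(seq + padding)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask


# Token ids of "the" and "the0a1b", including the leading <bos> (id 2)
batch = [[2, 1175], [2, 1175, 235276, 235250, 235274, 235268]]

ids, mask = pad_batch(batch)
print(ids)
print(mask)
```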


Gemma-2B-it is an instruction model: a language model that has been trained or fine-tuned to follow natural language instructions and to generate responses based on them.

Instruction models are typically created by training or fine-tuning large language models on datasets that contain pairs of instructions and corresponding outputs or actions. 

Beyond serving as instruction models, LLMs are increasingly employed for facilitating chat-based interactions. In such scenarios, rather than generating continuations for a singular text string, these models engage in conversations. These interactions are structured as sequences of messages, each tagged with a role—such as "user" or "assistant"—and accompanied by the corresponding message text. This approach enables the model to understand and maintain the flow of dialogue, distinguishing between different speakers and their messages.

Much like tokenization, different models expect very different input formats for chat. This is the reason HuggingFace added __chat templates__ as a feature. Chat templates are part of the tokenizer. They specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

Chat templates are written using [Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/).

__Gemma-2B-it is presented as an instruction model rather than a dedicated chat model__, but its tokenizer still exposes a chat template that we can inspect.

In [37]:
# The chat template for a model is stored on the tokenizer.chat_template attribute
chat_template = llm_tokenizer.chat_template
print("}\n{".join(chat_template.split("}{")))

# We can apply the template to any chat. A chat is a list of dictionaries, each with the role and the message content
chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},

]

print(f"Chat template applied:\n{llm_tokenizer.apply_chat_template(chat, tokenize=False)}")

{{ bos_token }}
{% if messages[0]['role'] == 'system' %}
{{ raise_exception('System role not supported') }}
{% endif %}
{% for message in messages %}
{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}
{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}
{% endif %}
{% if (message['role'] == 'assistant') %}
{% set role = 'model' %}
{% else %}
{% set role = message['role'] %}
{% endif %}
{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}
{% endfor %}
{% if add_generation_prompt %}
{{'<start_of_turn>model
'}}
{% endif %}
Chat template applied:
<bos><start_of_turn>user
Hello, how are you?<end_of_turn>
<start_of_turn>model
I'm doing great. How can I help you today?<end_of_turn>
<start_of_turn>user
I'd like to show off how chat templating works!<end_of_turn>
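
To see what the Jinja template does step by step, its logic can be rewritten as a plain Python function. This is only a sketch for illustration: `render_gemma_chat` is a hypothetical helper that mirrors the template printed above, and in practice `apply_chat_template` should be used.

```python
def render_gemma_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python equivalent of the chat template printed above."""
    if messages[0]["role"] == "system":
        raise ValueError("System role not supported")
    text = bos_token
    for index, message in enumerate(messages):
        # Even positions must be "user" turns, odd positions "assistant" turns
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        # The template renames the "assistant" role to "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        text += "<start_of_turn>" + role + "\n" + message["content"].strip() + "<end_of_turn>\n"
    if add_generation_prompt:
        # Open a new model turn so that the LLM continues with its answer
        text += "<start_of_turn>model\n"
    return text


chat = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

rendered = render_gemma_chat(chat)
print(rendered)
```

Passing `add_generation_prompt=True` appends `<start_of_turn>model` and a newline at the end, which is the form typically used when the model is expected to write the next turn.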



## Inspect LLM

So far we have not used embeddings. We have converted text to token identifiers with the `encode` method and then converted them back to text with the `decode` method. The `__call__` method does the same as `encode`, with some extra functionality.

The tokenizer does not produce embeddings. It only converts text to token IDs, which are then fed into the model. __The embeddings themselves are retrieved from the model, not the tokenizer__.

In the case of `Gemma-2B-it`, the embedding layer is stored at `model.embed_tokens`. It is a PyTorch `nn.Embedding` layer, a simple lookup table that stores one fixed-size embedding vector for each entry of a fixed-size vocabulary.

In [41]:
import torch

# Embedding for 'the' token
embedding = llm_model.model.embed_tokens(torch.LongTensor([1175]))
print(f"Embedding for 'the' token (id 1175) with shape {embedding.shape}: {embedding.detach()}")

Embedding for 'the' token (id 1175) with shape torch.Size([1, 2048]): tensor([[ 0.1924, -0.0713, -0.0732,  ...,  0.0156, -0.0017,  0.0288]])


We can add new tokens to the tokenizer without training it again.

The new token can be a special token. Special tokens signal specific types of information or instructions to the model. They help the model understand the structure of the input data, differentiate between its segments, and react to the type of token encountered.

For example, padding tokens (`<pad>` in Gemma) are used to ensure that all input sequences in a batch have the same length. The attention mask tells the model to ignore them, so sequences of varying lengths can be batched together without affecting the model's output.

When we add a new token we have to modify the Tokenizer and the embedding layer of the LLM. First we have to add the new token to the Tokenizer so that it can be identified, and then we have to modify the embedding layer to add a new random embedding for this new token.
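
The two steps can be pictured with a toy vocabulary and a toy embedding table in plain Python. Everything in this sketch is hypothetical (the ids, the 3-dimensional vectors and the helper `resize_embedding_table`), and the way the real `resize_token_embeddings` method initializes the new rows may differ from the simple Gaussian used here.

```python
import random


def resize_embedding_table(table, new_size, std=0.02, seed=0):
    """Return a copy of `table` grown to `new_size` rows; the new rows are random."""
    rng = random.Random(seed)
    dim = len(table[0])
    resized = [row[:] for row in table]  # the existing embeddings are kept unchanged
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, std) for _ in range(dim)])
    return resized


# Toy vocabulary and embedding table (hypothetical ids and values)
vocab = {"<pad>": 0, "the": 1}
table = [[0.0, 0.0, 0.0], [0.19, -0.07, -0.07]]

# Step 1: register the new token in the vocabulary so that it gets its own id
vocab["<image>"] = len(vocab)

# Step 2: grow the embedding table so that the new id has a row to look up
table = resize_embedding_table(table, len(vocab))

print(vocab["<image>"], table[vocab["<image>"]])
```

The new row starts out random, so the model has to be trained (or at least that embedding has to be trained) before the new token carries any meaning.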

In [16]:
import torch

# new tokens 
new_tokens = ["new_token"]

# check that at least one of the new tokens is not already in the vocabulary
assert len(set(new_tokens) - set(llm_tokenizer.get_vocab().keys())) > 0

# add the tokens to the tokenizer vocabulary
print(f"token id for '{new_tokens[0]}' token before: {llm_tokenizer.encode(new_tokens[0], add_special_tokens=False)}")
llm_tokenizer.add_tokens(list(new_tokens))
print(f"token id for '{new_tokens[0]}' token after: {llm_tokenizer.encode(new_tokens[0], add_special_tokens=False)}")

# add new, random embeddings for the new tokens
llm_model.resize_token_embeddings(len(llm_tokenizer))

# The old embeddings are the same, but we have a new one for the new token
embedding = llm_model.model.embed_tokens(torch.LongTensor([1175, 256000]))
print(f"Embedding for 'the' and 'new_token' tokens (id 1175 and 256000) with shape {embedding.shape}:" 
      f"{embedding.detach()}")

# If we want to add a new special token, we just have to pass special_tokens=True to add_tokens
new_special_tokens = ["<image>"]
llm_tokenizer.add_tokens(new_special_tokens, special_tokens=True)
llm_model.resize_token_embeddings(len(llm_tokenizer))
print(f"token id for '{new_special_tokens[0]}' token: {llm_tokenizer.encode(new_special_tokens[0])}")  # the leading 2 is <bos>

sentence = "<image> What do you see in this picture?"
tokenizer_output = llm_tokenizer(sentence, return_tensors="pt")
print(f"Tokenizer output for sentence: {sentence}")
for k, v in tokenizer_output.items():
    print(f"{k} {tuple(v.shape)}: {v}")

token id for 'new_token' token before: [1404, 235298, 5526]
token id for 'new_token' token after: [256000]
Embedding for 'the' and 'new_token' tokens (id 1175 and 256000) with shape torch.Size([2, 2048]):tensor([[ 0.1924, -0.0713, -0.0732,  ...,  0.0156, -0.0017,  0.0288],
        [ 0.0056, -0.0140, -0.0007,  ..., -0.0064, -0.0337, -0.0594]])
token id for '<image>' token: [2, 256001]
Tokenizer output for sentence: <image> What do you see in this picture?
input_ids (1, 10): tensor([[     2, 256001,   2439,    749,    692,   1443,    575,    736,   5642,
         235336]])
attention_mask (1, 10): tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])


To run inference with the LLM, we only need to pass the output of the tokenizer (`input_ids` and, optionally, `attention_mask`) to the `generate` method of the model.
The output of `generate` is a tensor with the ids of the tokens (the prompt followed by the generated tokens), so we need to use the tokenizer to decode it and create the final text.

In [48]:
text = "What is the biggest mountain in the world?"
inputs = llm_tokenizer(text, return_tensors="pt")
generate_ids = llm_model.generate(**inputs, max_length=256)

# The output of the model is in batch format so we use batch_decode instead of decode to get the text. Also, we set 
# skip_special_tokens=True to remove all special tokens from the output (like <bos>). The output is a list with the 
# decoded text, so we only need the first element of the list ([0]), as we only have one sentence.
text_output_with_special_tokens = llm_tokenizer.batch_decode(
    generate_ids, skip_special_tokens=False, clean_up_tokenization_spaces=False
)[0]
text_output_without_special_tokens = llm_tokenizer.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(f"Raw model's output (id of the tokens): {generate_ids}")
print(f"Final output with special tokens: {text_output_with_special_tokens}")
print(f"Final output without special tokens: {text_output_without_special_tokens}")

# Note that the output also contains the input text!

Raw model's output (id of the tokens): tensor([[     2,   1841,    603,    573,  12324,   8180,    575,    573,   2134,
         235336,    109,  24059,  99771, 235269,   7023,    575,    573, 148783,
         235269,    603,    573,   9393,   8180,    611,  10379, 235269,    675,
            476,  14538,  28554,    576, 235248, 235321, 235269, 235321, 235310,
         235321, 235265, 235321, 235318,  18678,    591, 235284, 235315, 235269,
         235276, 235304, 235284, 235265, 235310,   5368,    846,      1]])
Final output with special tokens: <bos>What is the biggest mountain in the world?

Mount Everest, located in the Himalayas, is the highest mountain on Earth, with a peak elevation of 8,848.86 meters (29,032.4 feet).<eos>
Final output without special tokens: What is the biggest mountain in the world?

Mount Everest, located in the Himalayas, is the highest mountain on Earth, with a peak elevation of 8,848.86 meters (29,032.4 feet).
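
Because the output starts with the prompt, the newly generated part can be recovered by skipping as many ids as the prompt contains. The sketch below uses plain lists with ids copied from the raw output above; it assumes that the first ten ids are the prompt, as the decoded text suggests, and it keeps only the first few generated ids for brevity.

```python
# First ten ids of the raw output: <bos> followed by the question
prompt_ids = [2, 1841, 603, 573, 12324, 8180, 575, 573, 2134, 235336]

# Prompt plus the first generated ids (the rest of the answer is omitted here)
generated_ids = prompt_ids + [109, 24059, 99771, 235269, 7023]

# Keep only what comes after the prompt
new_ids = generated_ids[len(prompt_ids):]
print(new_ids)

# With the tensors of the cell above, the same slicing would be written as
# generate_ids[:, inputs["input_ids"].shape[1]:]
```

The first remaining id is `109`, which the vocabulary inspection earlier mapped to `'\n\n'`, the blank line that separates the question from the answer.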


We can take a look at the configuration, architecture, the number of parameters of the model and the memory it needs using the following methods:

In [56]:
# Model configuration
print(f"LLM configuration:\n{llm_model.config}")

# Model architecture
print(f"Model architecture:\n{llm_model}")

# Number of parameters
print(f"Number of parameters: {llm_model.num_parameters()}")

# Memory requirements (in Bytes)
print(f"Memory requirement (in Bytes): {llm_model.get_memory_footprint()}")

LLM configuration:
GemmaConfig {
  "_name_or_path": "./models/gemma-2b-it",
  "architectures": [
    "GemmaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 2,
  "eos_token_id": 1,
  "head_dim": 256,
  "hidden_act": "gelu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 16384,
  "max_position_embeddings": 8192,
  "model_type": "gemma",
  "num_attention_heads": 8,
  "num_hidden_layers": 18,
  "num_key_value_heads": 1,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 256000
}

Model architecture:
GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
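
As a sanity check, the number of parameters can be estimated from the configuration values printed above. The sketch below rests on several assumptions, because the architecture printout is cut short: bias-free linear layers, a gated MLP with three projections per layer, two normalization weight vectors per layer plus a final one, and an output head that shares its weights with the embedding table. The function `count_gemma_parameters` is a hypothetical helper.

```python
# Values copied from the GemmaConfig printed above
config = {
    "vocab_size": 256000,
    "hidden_size": 2048,
    "intermediate_size": 16384,
    "num_hidden_layers": 18,
    "num_attention_heads": 8,
    "num_key_value_heads": 1,
    "head_dim": 256,
}


def count_gemma_parameters(cfg):
    """Estimate the parameter count under the assumptions listed above."""
    hidden = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    embedding = cfg["vocab_size"] * hidden
    # Query, key, value and output projections, all without bias
    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
    # Gate, up and down projections of the MLP (assumed)
    mlp = 3 * hidden * cfg["intermediate_size"]
    # Two normalization weight vectors per layer (assumed)
    norms = 2 * hidden

    per_layer = attention + mlp + norms
    # The last term is the final normalization layer (assumed)
    return embedding + cfg["num_hidden_layers"] * per_layer + hidden


total = count_gemma_parameters(config)
print(f"Estimated parameters: {total:,}")
print(f"Estimated size in float32:  {total * 4 / 1e9:.2f} GB")
print(f"Estimated size in bfloat16: {total * 2 / 1e9:.2f} GB")
```

The estimate describes the original checkpoint. After adding tokens and resizing the embeddings, as done earlier in this notebook, the values reported by `num_parameters()` and `get_memory_footprint()` will be slightly higher.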
  

# Visual encoder (CLIP model)