# Model Example

Explore the architecture of a decoder-only Transformer model and understand how it generates text, token by token.

## Setup

We start by setting up the lab: installing the required libraries (`transformers` and `accelerate`) and suppressing warnings. The `accelerate` library is required by the `Phi-3` model. You don't need to install these libraries yourself, because the requirements for this lab are already installed.

In [2]:
# !pip install "transformers>=4.41.2" "accelerate>=0.31.0"

In [3]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

## Loading the LLM

Let's first load the model and its tokenizer. To do so, you will import the `AutoModelForCausalLM` and `AutoTokenizer` classes. When you want to process a sentence, you can apply the tokenizer and then the model in two separate steps, or you can create a pipeline object that wraps both steps and apply the pipeline to the sentence. You will explore both approaches in this notebook, which is why you will also import the `pipeline` function.

<p style="padding:15px; "> <b>FYI: </b> The transformers library provides, among others, two types of model classes: <code>AutoModelForCausalLM</code> and <code>AutoModelForMaskedLM</code>. Causal language models are the decoder-only models used for text generation. They are called causal because, to predict the next token, the model can attend only to the preceding tokens on its left. Masked language models are the encoder-only models used for rich text representation. They are called masked because they are trained to predict a masked, or hidden, token in a sequence.</p>
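To make the word "causal" concrete, here is a minimal sketch in plain Python of the visibility rule described in the note: position `i` may look at position `j` only when `j <= i`. The function name `causal_mask` and the example tokens are chosen for illustration and are not part of the `transformers` API.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if position i may attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Example tokens, chosen only for illustration.
tokens = ["The", "capital", "of", "France", "is"]
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    visible = [tokens[j] for j, allowed in enumerate(row) if allowed]
    print(f"{tokens[i]!r} can attend to {visible}")
```

The first token sees only itself, while the last token sees the whole sequence. A masked language model would not apply this restriction.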

In [4]:
# import the required classes
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.3.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\jonat\development\llms\001 how transformer LLMs work\.venv\Lib\site-packages\ipykernel_launcher.py", line 18, in <module>
    app.launch_new_instance()
  File "c:\Users\jonat\development\llms\001 how transformer LLMs work\.venv\Lib\site-packages\traitlets\config\application.py", line 1075, in launch_instance
    app.start()
  File "c:\Users\jonat\development\llms\001 how transformer LLMs work\.venv\Lib\

In [5]:
# Load model and tokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  8.49it/s]


<p style=" padding:15px; "> <b> Note:</b> You will get a warning that the flash-attention package was not found. This is because flash attention requires certain types of GPU hardware to run. Since the model in this lab does not use a GPU, you can ignore this warning.</p>

You can now wrap the model and the tokenizer in a [pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline) object that has "text-generation" as its task.

In [6]:
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False, # False means to not include the prompt text in the returned text
    max_new_tokens=50, 
    do_sample=False, # no randomness in the generated text
)
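The `do_sample=False` argument above asks for deterministic output: at each step the most likely token is chosen instead of one being drawn at random. The sketch below contrasts the two strategies in plain Python. The probability table and the helper names `greedy_pick` and `sample_pick` are invented for illustration and do not come from the model or the library.

```python
import random

# Hypothetical next-token probabilities, invented for illustration.
next_token_probs = {"Paris": 0.7, "Lyon": 0.2, "Nice": 0.1}


def greedy_pick(probs):
    """Deterministic choice: always return the most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Stochastic choice: draw a token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so that repeated runs draw the same samples
greedy_results = {greedy_pick(next_token_probs) for _ in range(5)}
sampled_results = [sample_pick(next_token_probs, rng) for _ in range(5)]

print("greedy :", greedy_results)   # a single distinct token
print("sampled:", sampled_results)  # may contain different tokens
```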

## Generating a Text Response to a Prompt

You will now use the pipeline object (named `generator`) to generate a response of up to 50 new tokens for the given prompt.

<p style="padding:15px; border-width:3px; border-color:#f5ecda; border-style:solid; border-radius:6px"> ⏳ <b>Note:</b> The model may take around 2 minutes to generate the output.</p>

In [7]:
#pip install numpy

In [8]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened. "

In [9]:
output = generator(prompt)


You are not running the flash-attention implementation, expect numerical differences.


RuntimeError: Numpy is not available

In [10]:
print(output[0]['generated_text'])

NameError: name 'output' is not defined

## Exploring the Model's Architecture

You can print the model to take a look at its architecture.

In [None]:
model

Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=3206

The vocabulary size is 32064 tokens, and the embedding vector for each token has size 3072.

In [None]:
model.model.embed_tokens

Embedding(32064, 3072, padding_idx=32000)
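As a rough illustration of what these two numbers mean, an embedding layer can be pictured as a lookup table with one row per vocabulary entry and one column per embedding dimension. The sketch below computes the table size from the printed values and then performs a lookup on a tiny table whose values are invented for illustration.

```python
VOCAB_SIZE = 32064   # rows, from the printed Embedding layer
HIDDEN_SIZE = 3072   # columns, from the printed Embedding layer

# Number of weights stored in the embedding table.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
print(f"{embedding_params:,}")  # 98,500,608

# Toy lookup table: 3 vocabulary entries, vectors of size 2 (values invented).
toy_table = [
    [0.1, 0.2],   # token id 0
    [0.3, 0.4],   # token id 1
    [0.5, 0.6],   # token id 2
]
toy_ids = [2, 0, 2]

# Embedding a sequence is just selecting one row per token id.
toy_vectors = [toy_table[i] for i in toy_ids]
print(toy_vectors)  # [[0.5, 0.6], [0.1, 0.2], [0.5, 0.6]]
```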

You can also print just the stack of transformer blocks, without the LM head component.

In [None]:
model.model

Phi3Model(
  (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-31): 32 x Phi3DecoderLayer(
      (self_attn): Phi3Attention(
        (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
        (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
        (rotary_emb): Phi3RotaryEmbedding()
      )
      (mlp): Phi3MLP(
        (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
        (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
        (activation_fn): SiLU()
      )
      (input_layernorm): Phi3RMSNorm()
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
      (post_attention_layernorm): Phi3RMSNorm()
    )
  )
  (norm): Phi3RMSNorm()
)

There are 32 transformer blocks or layers. You can access any particular block.

In [None]:
model.model.layers[0]

Phi3DecoderLayer(
  (self_attn): Phi3Attention(
    (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
    (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
    (rotary_emb): Phi3RotaryEmbedding()
  )
  (mlp): Phi3MLP(
    (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
    (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
    (activation_fn): SiLU()
  )
  (input_layernorm): Phi3RMSNorm()
  (resid_attn_dropout): Dropout(p=0.0, inplace=False)
  (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
  (post_attention_layernorm): Phi3RMSNorm()
)
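The printed shapes are enough for a back-of-the-envelope weight count. The sketch below counts only the four bias-free `Linear` layers shown for one decoder layer. The normalization layers are left out because the printout does not list their sizes. One reading of the widths, not confirmed by the printout itself, is that `qkv_proj` (9216 = 3 × 3072) fuses the query, key, and value projections, and that `gate_up_proj` (16384 = 2 × 8192) fuses two projections whose combined result has width 8192 before `down_proj`.

```python
# (in_features, out_features) of each Linear layer in one Phi3DecoderLayer,
# copied from the printed architecture. All of them have bias=False.
linear_shapes = {
    "self_attn.o_proj": (3072, 3072),
    "self_attn.qkv_proj": (3072, 9216),
    "mlp.gate_up_proj": (3072, 16384),
    "mlp.down_proj": (8192, 3072),
}

# A bias-free Linear layer stores in_features * out_features weights.
params_per_layer = sum(i * o for i, o in linear_shapes.values())
num_layers = 32  # from "(0-31): 32 x Phi3DecoderLayer"

print(f"{params_per_layer:,}")               # 113,246,208
print(f"{params_per_layer * num_layers:,}")  # 3,623,878,656
```

Under these assumptions, the decoder stack accounts for roughly 3.6 billion weights in its `Linear` layers alone.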

## Generating a Single Token for a Prompt

Earlier, you used the pipeline object to generate a text response to a prompt. The pipeline hides the underlying process of text generation: the tokens in the response are actually generated one at a time.
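The one-token-at-a-time loop can be sketched in plain Python. Here `toy_next_token` is a hypothetical stand-in for the model: it looks up the next token in a fixed table, whereas a real model scores every vocabulary entry from the whole context. The `"<eos>"` marker and the `generate` helper are likewise invented for illustration.

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a language model: returns the next token
    from a fixed table keyed on the last token of the context."""
    table = {
        "The": "capital", "capital": "of", "of": "France",
        "France": "is", "is": "Paris", "Paris": "<eos>",
    }
    return table.get(tokens[-1], "<eos>")


def generate(prompt_tokens, max_new_tokens):
    """Append one token per step until the limit or an end marker is reached.
    Only the newly generated tokens are returned, not the prompt."""
    tokens = list(prompt_tokens)
    new_tokens = []
    for _ in range(max_new_tokens):
        next_token = toy_next_token(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)      # the new token joins the next input
        new_tokens.append(next_token)
    return new_tokens


print(generate(["The", "capital", "of"], max_new_tokens=50))
# ['France', 'is', 'Paris']
```

Each generated token is appended to the input before the next step. The `max_new_tokens` limit bounds the loop even if no end marker is produced.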

Let's now give the model a prompt and check the first token it will generate.

In [None]:
prompt = "The capital of France is"

You first need to tokenize the prompt and get the ids of its tokens.

In [None]:
# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids

tensor([[ 450, 7483,  310, 3444,  338]])
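As a rough picture of what encoding and decoding do, the sketch below uses a hypothetical word-level vocabulary. It reuses the five ids printed above under the assumption that each word of the prompt maps to exactly one token, in order. The real tokenizer works on subword pieces and has a far larger vocabulary, so treat the mapping and the `toy_encode` and `toy_decode` helpers as illustrative only.

```python
# Hypothetical word-level vocabulary. The ids reuse the five values printed
# above, assuming each word of the prompt corresponds to exactly one token.
toy_vocab = {"The": 450, "capital": 7483, "of": 310, "France": 3444, "is": 338}
id_to_token = {token_id: token for token, token_id in toy_vocab.items()}


def toy_encode(text):
    """Map each whitespace-separated word to its id."""
    return [toy_vocab[word] for word in text.split()]


def toy_decode(ids):
    """Map ids back to words and join them with spaces."""
    return " ".join(id_to_token[i] for i in ids)


ids = toy_encode("The capital of France is")
print(ids)              # [450, 7483, 310, 3444, 338]
print(toy_decode(ids))  # The capital of France is
```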

Let's now pass the token ids through the stack of transformer blocks (the part of the model before the LM head).

In [None]:
# Get the output of the model before the lm_head
model_output = model.model(input_ids)

The stack of transformer blocks outputs, for each token, a vector of size 3072 (the embedding size). Let's check the shape of this output.

In [None]:
# Get the shape of the model output before the lm_head
model_output[0].shape

The first number is the batch size, which is 1 here because we have one prompt. The second number, 5, is the number of tokens. The last number, 3072, is the embedding size (the size of the vector that corresponds to each token).

Let's now get the output of the LM head.

In [None]:
# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])

In [None]:
lm_head_output.shape

For each token in the input prompt, the LM head outputs a vector of size 32064 (the vocabulary size). So there are 5 vectors, each of size 32064. Each vector can be mapped to a probability distribution that gives, for every token in the vocabulary, the probability that it comes after the given token in the input prompt.
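The mapping from a hidden vector to a probability distribution can be sketched with toy sizes. The printed architecture shows `lm_head` as a `Linear` layer. The softmax step used here is the standard way to turn scores into probabilities and is an assumption about how one would do that mapping, since the notebook itself never computes probabilities. All numbers are invented for illustration.

```python
import math


def linear(vector, weight):
    """Bias-free linear layer: one output score per row of `weight`."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes: hidden size 3 and vocabulary size 4 (the real ones are 3072 and 32064).
hidden = [1.0, 0.0, -1.0]
toy_lm_head = [
    [0.5, 0.1, 0.0],
    [1.0, 0.0, -1.0],
    [0.0, 0.3, 0.2],
    [-1.0, 0.0, 1.0],
]

logits = linear(hidden, toy_lm_head)  # one score per vocabulary entry
probs = softmax(logits)

print(logits)
print(probs)
```

The vocabulary entry with the largest score also receives the largest probability.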

Since we want the token that comes after the last token in the input prompt ("is"), we'll focus on the last vector. In the next cell, `lm_head_output[0,-1]` is a vector of size 32064 from which you can pick that next token: find the id of the token with the highest value in the vector using `argmax(-1)`, where -1 means along the last axis.

In [None]:
token_id = lm_head_output[0,-1].argmax(-1)
token_id
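The cell above applies `argmax` directly to the raw scores without converting them to probabilities first. A softmax would not change which entry is largest, so the extra step is unnecessary for picking the top token. The sketch below checks this on a toy LM-head output. The values and the helper names are invented for illustration.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy LM-head output for a 3-token prompt and a 4-entry vocabulary (invented).
toy_lm_head_output = [
    [0.1, 1.2, -0.3, 0.0],
    [2.0, 0.5, 0.1, -1.0],
    [-0.5, 0.2, 3.1, 0.4],   # scores for the token that follows the prompt
]

last_logits = toy_lm_head_output[-1]  # same idea as lm_head_output[0, -1]
token_from_logits = argmax(last_logits)
token_from_probs = argmax(softmax(last_logits))

print(token_from_logits, token_from_probs)  # 2 2
```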

Finally, let's decode the returned token id.

In [None]:
tokenizer.decode(token_id)

NameError: name 'tokenizer' is not defined