# Using Framework

- ### Multi-Head Attention

## Architecture


<p align="center">
  <img src="files/image.png">
</p>


In [1]:
import torch
from torch import nn

In [2]:
class Transformers(nn.Module):

    # Constructor
    def __init__(self, vocab_size, embedding_dim, n_heads, n_layers, dropout):

        # Initialize the parent class (nn.Module)
        super().__init__()

        # Attributes
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.n_heads = n_heads
        self.n_layers = n_layers
        self.dropout = dropout

        # Embedding layer: maps the input token sequence to a sequence of vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Multi-head self-attention mechanism
        self.attention = nn.MultiheadAttention(embedding_dim, n_heads, dropout=dropout)

        # Feed-forward network: generates the output sequence from the input
        self.feed_forward = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # Output layer
        self.out = nn.Linear(embedding_dim, embedding_dim)
        
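    def forward(self, x):

        # Input: token indices -> embeddings
        x = self.embedding(x)

        # Multi-head self-attention: query, key and value are all x.
        # nn.MultiheadAttention returns a tuple (output, attention weights).
        x, _ = self.attention(x, x, x)

        # Feed-forward
        x = self.feed_forward(x)

        # Output projection
        x = self.out(x)

        return x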

In [3]:
modelo = Transformers(vocab_size= 1000, 
                      embedding_dim = 32,
                      n_heads = 4,
                      n_layers = 3,
                      dropout = 0.4)

modelo.modules

<bound method Module.modules of Transformers(
  (embedding): Embedding(1000, 32)
  (attention): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=32, out_features=32, bias=True)
  )
  (feed_forward): Sequential(
    (0): Linear(in_features=32, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=32, bias=True)
  )
  (out): Linear(in_features=32, out_features=32, bias=True)
)>

## Without Frameworks

- ### Only - attention mechanism

### Hyperparameters

In [4]:
import numpy as np

A Transformer model consists of several main parts:

- 1- Embedding Layer: Transforms words into fixed-size numerical vectors.
- 2- Attention Mechanism: Allows the model to focus on different parts of the input.
- 3- Encoder and Decoder Layers: Process the data through a stack of layers, one after another.
- 4- Linear and Softmax Layer: Produces the final predictions.

In [5]:
dim_model = 64
seq_length = 10
vocab_size = 10

### Embedding

Embeddings provide a rich, dense representation of words or tokens, capturing the contextual and semantic information that is essential for tasks such as machine translation and text classification.

In [6]:
def embedding(input, vocab_size, dim_model):
    # Maps the input sequence of token indices to its embeddings

    # Create a matrix in which each row represents one vocabulary token (random values)
    embed = np.random.randn(vocab_size, dim_model)

    # For each token index in the input, select the corresponding row of the matrix.
    # The result is the array of embeddings for the input sequence.

    return np.array([embed[i] for i in input])
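
As a rough sketch of the same lookup, the row selection can also be written with NumPy integer-array indexing. The helper name `embed_tokens`, the `seed` parameter and the toy token list are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

# Hypothetical helper (not in the notebook): same lookup as `embedding`,
# but with a seeded generator so the random table is reproducible.
def embed_tokens(token_ids, vocab_size, dim_model, seed=0):
    rng = np.random.default_rng(seed)
    table = rng.standard_normal((vocab_size, dim_model))  # one row per vocabulary token
    return table[np.asarray(token_ids)]                   # shape: (seq_len, dim_model)

tokens = [2, 2, 5]  # toy sequence, chosen for illustration
vectors = embed_tokens(tokens, vocab_size=10, dim_model=64)

print(vectors.shape)                           # (3, 64)
print(np.array_equal(vectors[0], vectors[1]))  # True: the same token maps to the same row
```

Because the table here is random, the vectors carry no meaning yet; in a trained model the table entries would be learned.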

### Attention mechanism

Query, Key and Value - Components

<p align="center">
  <img src="files/image copy 5.png">
</p>


<img src="files/image copy 4.png">


Q (Query): the "search representation". It is generated by multiplying the input \( X \) by a weight matrix \( W_Q \).

\( Q = X W_Q \)

\( Q \) encodes what we are trying to find elsewhere in the sequence. For example, when processing a word, \( Q \) may represent the question: "Which words should this word relate to?"

K (Key): a "label" that represents the semantic content of the input. It is generated by multiplying \( X \) by a weight matrix \( W_K \).

\( K = X W_K \)

 - \( K \) is used to determine how relevant a part of the input is to the question asked by \( Q \). It answers the question: "Do I have the information you are looking for?"

V (Value): the actual information that will be used in the final result. It is generated by multiplying \( X \) by a weight matrix \( W_V \).

\( V = X W_V \)

After the relevant parts have been determined using \( Q \) and \( K \), \( V \) delivers the related "content".
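
The three projections can be sketched in NumPy as follows. The sizes and the random weight matrices are illustrative assumptions; in a trained model the weights would be learned rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the notebook)
seq_len, dim_model, d_k = 4, 8, 8

X = rng.standard_normal((seq_len, dim_model))  # one row per token

# Random stand-ins for the learned weight matrices
W_q = rng.standard_normal((dim_model, d_k))
W_k = rng.standard_normal((dim_model, d_k))
W_v = rng.standard_normal((dim_model, d_k))

Q = X @ W_q  # what each token is looking for
K = X @ W_k  # what each token offers as a label
V = X @ W_v  # the content each token contributes

print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

The model built later in this notebook passes the embeddings directly as Q, K and V, which corresponds to using the identity matrix for all three weights.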

#### Explanation of \( Q \), \( K \), and \( V \) in Transformers

Value (\( V \))
- The **Value** represents the actual information that will be used in the final result.
- It is derived by multiplying the input \( X \) by a weight matrix \( W_V \):

\[ V = X W_V \]

- After determining the relevant parts using \( Q \) (Query) and \( K \) (Key), the \( V \) provides the "content" related to the calculated attention.

---

How do \( Q \), \( K \), and \( V \) interact?

The interaction between \( Q \), \( K \), and \( V \) occurs during the **attention calculation**, where the model decides which values (\( V \)) are most important for a given query (\( Q \)).

1. Dot Product of \( Q \) and \( K \)
- First, the similarity between \( Q \) (what we are looking for) and \( K \) (what is available) is calculated using the dot product:

\[ Q K^{\top} \]

- This measures how strongly \( Q \) and \( K \) are related.

2. Normalization with Softmax
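- The result of the dot product is divided by \( \sqrt{d_k} \) and passed through a softmax function to produce weights that sum to 1:

\[ \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) \]

- These weights indicate how much attention each part of the input should receive.  
- The denominator \( \sqrt{d_k} \), where \( d_k \) is the dimension of \( K \), is a scaling factor that helps stabilize training: it keeps the dot products from growing with \( d_k \), which would otherwise push the softmax into regions where gradients are very small.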

3. Weighting the Values (\( V \))
- The attention weights are then applied to \( V \), combining the values based on their relevance:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V \]

- The result is a weighted sum of \( V \), where the weights come from the similarity between \( Q \) and \( K \).
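
The three steps above can be traced on a small hand-made example. The matrices below are toy values chosen for illustration, not data from the notebook.

```python
import numpy as np

# Toy inputs (assumptions): 2 queries, 3 keys/values, d_k = 2
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0],
              [5.0, 5.0]])

d_k = K.shape[-1]

# Step 1: dot product of every query with every key -> shape (2, 3)
scores = Q @ K.T

# Step 2: scale by sqrt(d_k), then apply a row-wise softmax
scaled = scores / np.sqrt(d_k)
exp = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

# Step 3: weighted sum of the values -> shape (2, 2)
output = weights @ V

print(scores)                # [[1. 0. 1.], [0. 1. 1.]] printed as a 2x3 array
print(weights.sum(axis=-1))  # each row of weights sums to 1 (up to rounding)
print(output.shape)          # (2, 2)
```

The first query matches keys 0 and 2 equally, so those two values receive the same weight in the first output row.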

---


1. **\( Q \)**: Defines what information is being searched for (query).  
2. **\( K \)**: Represents the available information (key).  
3. **\( V \)**: Contains the actual data that will be used (value).  

- The **attention mechanism** ensures that each word or token can "focus" on other parts of the input sequence that are contextually relevant. This enables the model to understand relationships in the data dynamically.


##### Softmax

In [7]:
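def softmax(x):
    # Subtract the row-wise maximum before exponentiating, for numerical stability
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))

    # Normalize so that each row sums to 1
    return e_x / e_x.sum(axis=-1, keepdims=True)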
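
Subtracting the maximum before calling `np.exp` does not change the result mathematically, but it avoids overflow for large inputs. The sketch below compares a naive version with a stable one; the function names `naive_softmax` and `stable_softmax` and the example logits are assumptions for illustration.

```python
import numpy as np

def naive_softmax(x):
    # No shift: np.exp overflows to inf for large inputs
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    # Shift each row so that its largest entry is 0 before exponentiating
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

logits = np.array([[1000.0, 1001.0, 1002.0]])

# Silence the overflow warnings so the nan result is easy to see
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(logits))  # [[nan nan nan]]: inf / inf

print(stable_softmax(logits))     # roughly [[0.090 0.245 0.665]]
```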

##### Scaled dot-product attention

In [8]:
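def scale_dot_product_attention(Q, K, V):
    # Similarity scores between every query and every key
    matmul_qk = np.dot(Q, K.T)

    # Dimension of K
    depth = K.shape[-1]

    # Scale the scores by sqrt(depth)
    logits = matmul_qk / np.sqrt(depth)

    # Attention weights: each row sums to 1
    attention_weights = softmax(logits)

    # Weighted sum of the values
    output = np.dot(attention_weights, V)

    return output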
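
To see why the scores are divided by `np.sqrt(depth)`, the sketch below samples random query and key vectors and compares the spread of the raw and scaled dot products. The number of sampled pairs and the use of standard-normal vectors are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_k = 64          # same value as dim_model in this notebook
n_pairs = 10_000  # number of random (q, k) pairs; an arbitrary choice

q = rng.standard_normal((n_pairs, d_k))
k = rng.standard_normal((n_pairs, d_k))

raw = np.sum(q * k, axis=-1)  # one dot product per (q, k) pair
scaled = raw / np.sqrt(d_k)

print(raw.std())     # close to sqrt(64) = 8
print(scaled.std())  # close to 1
```

Without the scaling, the scores spread out as `d_k` grows, and the softmax tends to put almost all of its weight on a single position.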

#### Linear and Softmax

In [9]:
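def linear_and_softmax(input):
    # Random projection from the model dimension to the vocabulary size
    weights = np.random.randn(dim_model, vocab_size)

    # One score per vocabulary token for each position
    logits = np.dot(input, weights)

    # Convert the scores to probabilities
    return softmax(logits)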

### Model

In [10]:
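def transformer_model(input):
    # Token indices -> embedding vectors
    embedded_input = embedding(input, vocab_size, dim_model)

    # Self-attention: the embeddings serve as Q, K and V
    attention_output = scale_dot_product_attention(embedded_input, embedded_input, embedded_input)

    # Probabilities over the vocabulary for each position
    output_probabilities = linear_and_softmax(attention_output)

    # Most likely token index at each position
    output_indices = np.argmax(output_probabilities, axis=-1)

    return output_indices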

## Predictions

In [11]:
input_sequence = np.random.randint(0, vocab_size, seq_length)
print(f"sequence: {input_sequence}")

sequence: [2 2 5 2 6 7 1 2 1 1]


In [12]:
output = transformer_model(input_sequence)
print(f"Output: {output}")

Output: [5 5 0 5 4 0 9 5 9 9]
