<a href="https://colab.research.google.com/github/koliby777/Pytorch_tutorials/blob/main/Attention_is_all_you_need.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Attention is all you need

https://monica.im/share/chat?shareId=DhWsggKYwHZdyjOC

In [1]:
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SimpleAttention, self).__init__()
        # Linear layer to transform the input into query vectors
        self.query = nn.Linear(input_dim, hidden_dim)
        # Linear layer to transform the input into key vectors
        self.key = nn.Linear(input_dim, hidden_dim)
        # Linear layer to transform the input into value vectors
        self.value = nn.Linear(input_dim, hidden_dim)
        # Softmax layer to normalize attention scores
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        # x shape: (batch_size, seq_length, input_dim)
        # Transform the input tensor x into query vectors Q
        Q = self.query(x)  # (batch_size, seq_length, hidden_dim)
        # Transform the input tensor x into key vectors K
        K = self.key(x)    # (batch_size, seq_length, hidden_dim)
        # Transform the input tensor x into value vectors V
        V = self.value(x)  # (batch_size, seq_length, hidden_dim)

        # Calculate attention scores by performing dot product of Q and K
        # Transpose K's last two dimensions for correct matrix multiplication
        # Divide by the square root of the dimension of K to scale the scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(K.shape[-1], dtype=torch.float32))

        # Apply softmax to the attention scores to get attention weights
        # Softmax makes every weight positive and makes each row sum to 1
        weights = self.softmax(scores)  # (batch_size, seq_length, seq_length)

        # Compute the weighted sum of the value vectors V
        # The weights determine how much focus each value vector gets
        output = torch.matmul(weights, V)  # (batch_size, seq_length, hidden_dim)

        # Return the output tensor, which now contains the attended information
        return output

# Example usage
batch_size = 2  # Number of sequences in a batch
seq_length = 5  # Length of each sequence
input_dim = 10  # Dimensionality of the input features
hidden_dim = 8  # Dimensionality of the hidden (transformed) features

# Create a random input tensor with shape (batch_size, seq_length, input_dim)
x = torch.randn(batch_size, seq_length, input_dim)
# Instantiate the SimpleAttention module
attention = SimpleAttention(input_dim, hidden_dim)
# Pass the input tensor through the attention mechanism
output = attention(x)
# Print the shape of the output tensor
print(output.shape)  # Should print: torch.Size([2, 5, 8])


torch.Size([2, 5, 8])


## Detailed Explanation of Each Line

### Importing Libraries

```python
import torch
import torch.nn as nn
```

- `torch`: The main PyTorch library for tensor operations.
- `torch.nn`: The submodule for building neural networks.

### Defining the SimpleAttention Class

```python
class SimpleAttention(nn.Module):
```

- `class SimpleAttention(nn.Module)`: Defines a new class inheriting from `nn.Module`, the base class for all neural network modules in PyTorch.

### Initializing the Class

```python
def __init__(self, input_dim, hidden_dim):
    super(SimpleAttention, self).__init__()
```

- `__init__`: The constructor method that initializes the instance.
- `super(SimpleAttention, self).__init__()`: Calls the constructor of the parent class `nn.Module`. This must run before any submodule is assigned so that PyTorch can register it.

### Defining Linear Layers

```python
self.query = nn.Linear(input_dim, hidden_dim)
self.key = nn.Linear(input_dim, hidden_dim)
self.value = nn.Linear(input_dim, hidden_dim)
```

- `nn.Linear(input_dim, hidden_dim)`: Creates a learnable linear transformation that maps the last dimension of its input from `input_dim` to `hidden_dim`.
- `self.query`, `self.key`, `self.value`: These layers transform the input into query, key, and value vectors respectively.
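
To make the linear layers more concrete, here is a minimal sketch of what a single `nn.Linear` does to a batched sequence. The variable names `layer`, `y`, and `manual` are illustrative and not part of the notebook; the sizes are the ones from the example usage.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

input_dim, hidden_dim = 10, 8
layer = nn.Linear(input_dim, hidden_dim)

# The weight is stored as (out_features, in_features); the bias has one entry per output.
print(layer.weight.shape)  # torch.Size([8, 10])
print(layer.bias.shape)    # torch.Size([8])

# A 3-D input keeps its leading dimensions; only the last one is transformed.
x = torch.randn(2, 5, input_dim)
y = layer(x)
print(y.shape)             # torch.Size([2, 5, 8])

# The same result computed by hand: x @ W^T + b
manual = x @ layer.weight.T + layer.bias
print(torch.allclose(y, manual, atol=1e-5))
```

Because the layer acts only on the last dimension, every position in every sequence is projected with the same weights.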

### Defining Softmax Layer

```python
self.softmax = nn.Softmax(dim=-1)
```

- `nn.Softmax(dim=-1)`: Creates a softmax layer that normalizes the attention scores along the last dimension, that is, across all keys for each query.
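
The effect of `dim=-1` is easiest to see on a tiny hand-made score matrix. The sketch below uses made-up scores (not values from the notebook) to show that the normalization happens within each row.

```python
import torch
import torch.nn as nn

# Two "queries", each with three made-up scores (one per key).
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])

softmax = nn.Softmax(dim=-1)
weights = softmax(scores)

# Each row sums to 1 (up to floating-point rounding).
print(weights.sum(dim=-1))

# Equal scores give equal weights.
print(weights[1])  # tensor([0.3333, 0.3333, 0.3333])

# Larger scores receive larger weights within the same row.
print(weights[0])
```

With `dim=-2` the columns would be normalized instead, which would no longer give one weight distribution per query.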

### Defining the Forward Method

```python
def forward(self, x):
```

- `forward(self, x)`: Defines the forward pass of the model, which computes the output given the input `x`.
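
In practice `forward` is rarely called by name: calling the module instance itself dispatches to it. The sketch below uses a hypothetical toy module, `Doubler`, purely to illustrate this.

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):
    """Hypothetical toy module used only to illustrate how forward is invoked."""

    def forward(self, x):
        return 2 * x

m = Doubler()
x = torch.tensor([1.0, 2.0])

# Calling the instance goes through nn.Module.__call__, which runs forward().
result = m(x)
print(result)  # tensor([2., 4.])
```

This is why the example usage writes `attention(x)` rather than `attention.forward(x)`.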

### Transforming Input into Q, K, V

```python
Q = self.query(x)
K = self.key(x)
V = self.value(x)
```

- `self.query(x)`, `self.key(x)`, `self.value(x)`: Apply the linear transformations to obtain the query, key, and value vectors.

### Calculating Attention Scores

```python
scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(K.shape[-1], dtype=torch.float32))
```

- `torch.matmul(Q, K.transpose(-2, -1))`: Computes the dot product of every query with every key, giving a score matrix of shape `(batch_size, seq_length, seq_length)`.
- `/ torch.sqrt(torch.tensor(K.shape[-1], dtype=torch.float32))`: Divides the scores by the square root of the key dimension (`hidden_dim` here), which keeps large dot products from saturating the softmax.
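
One way to see why the scaling helps is to measure how the spread of raw dot products grows with the vector dimension. The sketch below is an illustration under the assumption that queries and keys have independent, standard-normal entries; the sample size `n` and the dimensions tried are arbitrary choices.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

n = 20000      # number of random query/key pairs (arbitrary)
results = {}   # d -> (variance of raw scores, variance of scaled scores)

for d in (8, 64, 512):
    q = torch.randn(n, d)
    k = torch.randn(n, d)
    raw = (q * k).sum(dim=-1)   # one dot product per pair
    scaled = raw / d ** 0.5     # same scaling as in the attention scores
    results[d] = (raw.var().item(), scaled.var().item())
    print(d, results[d])
```

Under this assumption the variance of the raw scores comes out close to `d`, while the scaled scores stay close to a variance of 1 for every dimension tried.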

### Applying Softmax to Scores

```python
weights = self.softmax(scores)
```

- `self.softmax(scores)`: Normalizes the attention scores to get the attention weights.

### Computing the Weighted Sum of Value Vectors

```python
output = torch.matmul(weights, V)
```

- `torch.matmul(weights, V)`: Computes the weighted sum of the value vectors `V` using the attention weights.
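
A small worked example may help here. The weights and values below are made up by hand (they are not produced by the notebook's model) so that each output row can be checked mentally.

```python
import torch

# Three value vectors of dimension 2.
V = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 2.0]])

# One row of attention weights per query; each row sums to 1.
weights = torch.tensor([[1.0, 0.0, 0.0],   # attends only to the first value
                        [0.5, 0.5, 0.0],   # averages the first two values
                        [0.0, 0.5, 0.5]])  # averages the last two values

output = torch.matmul(weights, V)
print(output)
# Row 0: [1.0, 0.0]
# Row 1: [0.5, 0.5]
# Row 2: [1.0, 1.5]
```

Each output row is a mixture of the value vectors, with the mixing proportions given by the corresponding row of `weights`.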

### Returning the Output

```python
return output
```

- `return output`: Returns the final attended output.

### Example Usage

```python
batch_size = 2
seq_length = 5
input_dim = 10
hidden_dim = 8
x = torch.randn(batch_size, seq_length, input_dim)
attention = SimpleAttention(input_dim, hidden_dim)
output = attention(x)
print(output.shape)  # Should print: torch.Size([2, 5, 8])
```

- `batch_size`, `seq_length`, `input_dim`, `hidden_dim`: Define the dimensions for the example.
- `torch.randn(batch_size, seq_length, input_dim)`: Creates a random input tensor.
- `SimpleAttention(input_dim, hidden_dim)`: Instantiates the attention module.
- `attention(x)`: Passes the input tensor through the attention mechanism.
- `print(output.shape)`: Prints the shape of the output tensor.
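
As a sanity check, the same computation can be compared against PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. This sketch assumes PyTorch 2.0 or newer, where that function is available. It repeats the score, softmax, and weighted-sum steps by hand instead of reusing the `SimpleAttention` class, so it runs on its own; the projection layers `query`, `key`, and `value` are fresh stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the run is repeatable

batch_size, seq_length, input_dim, hidden_dim = 2, 5, 10, 8
x = torch.randn(batch_size, seq_length, input_dim)

# Stand-in projection layers with the same shapes as in SimpleAttention.
query = nn.Linear(input_dim, hidden_dim)
key = nn.Linear(input_dim, hidden_dim)
value = nn.Linear(input_dim, hidden_dim)

with torch.no_grad():
    Q, K, V = query(x), key(x), value(x)

    # Manual attention, following the same steps as the forward method.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / K.shape[-1] ** 0.5
    manual = torch.matmul(torch.softmax(scores, dim=-1), V)

    # Built-in version (assumes PyTorch >= 2.0); by default it also scales
    # by 1 / sqrt(last dimension of the query).
    builtin = F.scaled_dot_product_attention(Q, K, V)

matches = torch.allclose(manual, builtin, atol=1e-5)
print(manual.shape)  # torch.Size([2, 5, 8])
print(matches)
```

If the two results agree within tolerance, the hand-written steps implement standard scaled dot-product attention without masking or dropout.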