# Large Language Models (LLMs)
## Self-attention mechanism
The **self-attention mechanism** is the backbone of **transformer**-based LLMs. It lets the model weigh every token of the input against every other token, so that each token's representation reflects the parts of the sequence most relevant to it. In this way it captures relationships and dependencies between words, regardless of their distance from one another.
- Self-attention is a key innovation behind the success of transformer-based models in modern AI.

Here, we implement the formulas for the output of a single self-attention head. The input is the matrix $X$, whose rows are the embedding vectors of the input tokens.
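For reference, the computation can be written compactly as follows. This is a summary in the notation of the code in this notebook ($n$ tokens, embedding dimension $d$, projection dimension $d_k$), with the softmax applied to each row separately:

$$
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V, \qquad X \in \mathbb{R}^{n \times d}, \quad W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}
$$

$$
\text{output} = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \in \mathbb{R}^{n \times d_k}
$$

The matrix inside the softmax has size $n \times n$: entry $(i, j)$ scores how strongly token $i$ attends to token $j$.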

<br>The code is at: https://github.com/ostad-ai/Large-Language-Models
<br>Explanation: https://www.pinterest.com/HamedShahHosseini/Deep-Learning/Large-Language-Models

In [1]:
# Importing the required module
import numpy as np

In [2]:
# An example
# Input sequence: 3 tokens, each with 5-dimensional embeddings
# so n=3, and d=5
X = np.array([
    [1, 0, 1, 0 , 1],  # Token 1
    [0, 2, 0, 2 , 3],  # Token 2
    [2, 1, 3, 1 , 0]   # Token 3
])

d = X.shape[1]  # Embedding dimension of each token
d_k = 4  # Dimension of the query/key/value vectors (often we choose d_k=d, but not in this example)

# Randomly initialize the three weight matrices
W_Q = np.random.rand(d, d_k)
W_K = np.random.rand(d, d_k)
W_V = np.random.rand(d, d_k)

# Compute Query, Key, and Value vectors
# Three matrices: Query, Key, and Value; each with dimension n*d_k
Q = X@W_Q
K = X@W_K
V = X@W_V

# Compute attention scores, a matrix of size n*n
scores=Q@K.T

# Scale the scores by the square root of d_k
scaled_scores = scores / np.sqrt(d_k)

# Apply softmax row by row to get attention weights, a matrix of size n*n,
# so the elements of each row sum to one:
# attention_weights[i].sum()=1, for i=0,1,..,n-1
# Subtracting the row maximum first leaves the result unchanged
# but avoids overflow in np.exp for large scores
exp_scores = np.exp(scaled_scores - scaled_scores.max(axis=1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Compute weighted sum of Value vectors
# output is a matrix of size n*d_k
output=attention_weights@V
print(f'Input Data of size: {X.shape}\n{X}')
print(f'\nAttention Weights of size {attention_weights.shape}:\n{attention_weights}')
print(f'\nOutput of size {output.shape}:\n{output}')

Input Data of size: (3, 5)
[[1 0 1 0 1]
 [0 2 0 2 3]
 [2 1 3 1 0]]

Attention Weights of size (3, 3):
[[1.20392655e-03 8.34065810e-01 1.64730264e-01]
 [6.43315607e-08 9.87412162e-01 1.25877732e-02]
 [2.71767270e-06 8.99538710e-01 1.00458572e-01]]

Output of size (3, 4):
[[1.86325483 2.40016869 3.94122033 1.66696158]
 [1.72402463 2.18216352 4.03390909 1.30739099]
 [1.80482602 2.3086879  3.98192516 1.51505179]]
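
The weight matrices above are drawn at random without a seed, so the printed numbers change on every run. As a minimal sketch, the same steps can be wrapped in a reusable function with a fixed seed. The names `softmax_rows` and `self_attention` and the seed value `0` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention (hypothetical helper)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))  # shape (n, n)
    return weights @ V, weights                     # output has shape (n, d_k)

# Fixed seed: an assumption added here so that repeated runs agree
rng = np.random.default_rng(0)

X = np.array([[1, 0, 1, 0, 1],
              [0, 2, 0, 2, 3],
              [2, 1, 3, 1, 0]], dtype=float)
d, d_k = X.shape[1], 4
W_Q, W_K, W_V = (rng.random((d, d_k)) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(weights.sum(axis=1))  # each row sums to 1, up to floating-point rounding
print(output.shape)         # (3, 4)
```

Each output row is a weighted average of the rows of `V`. In the run printed above, the second column of the attention weights dominates every row, which is why the three output rows are so close to one another.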
