### Effect of scaling: example
### Scaling the query-key dot products by the square root of the dimension before applying softmax to compute the attention weights in a causal attention mechanism

## 1. Stability in learning

In [1]:
import torch

tensor = torch.tensor([0.3, -0.1, -0.2, 0.5, 0.3, 0.7])
scaled_tensor = tensor * 7

softmax_result = torch.softmax(tensor, dim=-1)
softmax_scaled_result = torch.softmax(scaled_tensor, dim=-1)

print(f"Result without multiplying: {softmax_result}")
print(f"Result after multiplying: {softmax_scaled_result}")


Result without multiplying: tensor([0.1669, 0.1119, 0.1013, 0.2039, 0.1669, 0.2490])
Result after multiplying: tensor([0.0443, 0.0027, 0.0013, 0.1795, 0.0443, 0.7279])


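#### Comparing the results shows that multiplying the inputs by 7 makes the softmax output peaky: the largest input (0.7) receives about 73% of the probability mass, up from about 25%, which is out of proportion to its lead over the other values. Large dot products have the same effect on attention weights and make the model overconfident in a single key. Dividing the dot products by the square root of the dimension keeps them small enough to avoid such a sharp softmax distribution.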
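To connect this to causal attention, here is a minimal sketch that compares attention weights computed with and without dividing the scores by the square root of the dimension. The sizes `seq_len` and `d_k`, the helper `causal_attention_weights`, and the random `queries` and `keys` (standing in for real projected inputs) are assumptions made for illustration only.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes chosen for illustration
seq_len, d_k = 4, 64

# Random stand-ins for the query and key matrices
queries = torch.randn(seq_len, d_k)
keys = torch.randn(seq_len, d_k)

# Raw dot products between every query and every key: (seq_len, seq_len)
scores = queries @ keys.T

# Causal mask: True above the diagonal, so position i cannot attend to j > i
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)


def causal_attention_weights(scores, mask, scale):
    """Divide scores by `scale`, hide future positions, then apply softmax."""
    masked = (scores / scale).masked_fill(mask, float("-inf"))
    return torch.softmax(masked, dim=-1)


weights_unscaled = causal_attention_weights(scores, mask, scale=1.0)
weights_scaled = causal_attention_weights(scores, mask, scale=d_k ** 0.5)

# Largest attention weight in each row, without and with scaling
print("Max weight per row (unscaled):", weights_unscaled.max(dim=-1).values)
print("Max weight per row (scaled):  ", weights_scaled.max(dim=-1).values)
```

The first row has only one visible key, so its weight is 1 in both cases. In every other row the largest weight with scaling is no greater than without it, because dividing the scores by a factor larger than 1 can only flatten the softmax distribution.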

## 2. Stabilizing the variance of the dot product

In [2]:
import numpy as np
import torch


def compute_variance(dim, trials=1000):
    dot_products = []
    scaled_dot_products = []

    for _ in range(trials):
        # Sample a query and a key with i.i.d. standard normal components
        q = np.random.randn(dim)
        k = np.random.randn(dim)

        dot_product = np.dot(q, k)
        dot_products.append(dot_product)

        # Scale by the square root of the dimension
        scaled_dot_product = dot_product / np.sqrt(dim)
        scaled_dot_products.append(scaled_dot_product)

    variance_before_scaling = np.var(dot_products)
    variance_after_scaling = np.var(scaled_dot_products)

    return variance_before_scaling, variance_after_scaling

# Seed NumPy's generator, since q and k are sampled with np.random
np.random.seed(123)

variance_before_5, variance_after_5 = compute_variance(5)
variance_before_20, variance_after_20 = compute_variance(20)

print(f"Variance before scaling for dimension = 5: {variance_before_5}")
print(f"Variance after scaling for dimension = 5: {variance_after_5}")

print(f"Variance before scaling for dimension = 20: {variance_before_20}")
print(f"Variance after scaling for dimension = 20: {variance_after_20}")



Variance before scaling for dimension = 5: 4.821266469703955
Variance after scaling for dimension = 5: 0.9642532939407908
Variance before scaling for dimension = 20: 20.403357325660153
Variance after scaling for dimension = 20: 1.0201678662830074


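### As the dimension of the query and key vectors (the output dimension of the trainable weight matrices) increases, the variance of their dot product grows roughly in proportion to that dimension, which can make learning unstable. After scaling by the square root of the dimension, the variance stays close to 1 regardless of the dimension, which helps keep learning stable.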
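The pattern in the output above follows from a short calculation. It is sketched here under the same assumption the experiment makes: the components $q_i$ and $k_i$ are independent, with zero mean and unit variance. Trained queries and keys need not satisfy this, so the result is a guideline rather than a guarantee.

$$
\operatorname{Var}(q \cdot k)
= \operatorname{Var}\!\left(\sum_{i=1}^{d} q_i k_i\right)
= \sum_{i=1}^{d} \operatorname{Var}(q_i k_i)
= \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d
$$

$$
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d}}\right)
= \frac{1}{d}\,\operatorname{Var}(q \cdot k)
= 1
$$

The second step uses the independence of the terms $q_i k_i$, and the third uses $\mathbb{E}[q_i k_i] = 0$, so each term's variance equals $\mathbb{E}[q_i^2 k_i^2] = 1$.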