# The Three Ways of Attention and Dot Product Attention
This notebook implements the latter two of the following three kinds of attention:
- Encoder - Decoder Attentions
- Causal Attention
- Bi-Directional Self Attention

1. Attention models constitute powerful tools in the NLP practitioner's toolkit.
2. Like LSTMs, they learn which words are most important to phrases, sentences, paragraphs and so on.
3. They mitigate the vanishing gradient problem better than LSTMs.
4. Attention can be combined with LSTMs to form encoder-decoder models.

![alt text](images/C4_W2_L3_dot-product-attention_S01_introducing-attention_stripped.png)

This section covers how to integrate attention into transformers. Transformers are not recurrent sequence models, so they are easier to parallelize and accelerate. Applications of transformers include:
- Auto-completion
- Named Entity Recognition
- Chatbots
- Question-Answering

1. Along with embeddings, positional encoding, dense layers and residual connections, attention is a crucial component of transformers.
2. At the heart of any attention scheme used in a transformer is dot product attention; the figures below give a simplified picture of it.

![alt text](images/C4_W2_L3_dot-product-attention_S03_concept-of-attention_stripped.png)
![alt text](images/C4_W2_L3_dot-product-attention_S04_attention-math_stripped.png)

With basic dot product attention:
- The interaction between every word (embedding) in the query and every word in the key is captured.
- When the queries and keys belong to the same sentence, this constitutes bi-directional self-attention.
- Sometimes it is more appropriate to consider only the words that come before the current one.
- Such cases, particularly when the queries and keys come from the same sentence, fall into the category of causal attention.
    
![alt text](images/C4_W2_L4_causal-attention_S02_causal-attention_stripped.png)
For causal attention, we add a mask to the argument of our softmax function:
![alt text](images/C4_W2_L4_causal-attention_S03_causal-attention-math_stripped.png)
![alt text](images/C4_W2_L4_causal-attention_S04_causal-attention-math-2_stripped.png)
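
As a minimal sketch of that masking step, the snippet below builds an additive causal mask with NumPy and adds it to some query-key scores before a row-wise softmax. The helper names `causal_mask` and `softmax` and the example scores are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def causal_mask(n, neg=-1e9):
    """Additive mask (hypothetical helper): 0 on and below the diagonal, `neg` above it."""
    # np.triu(..., k=1) keeps only the strictly upper triangle, i.e. the "future" positions.
    return np.triu(np.full((n, n), neg), k=1)

def softmax(x):
    """Row-wise softmax over the last axis, shifted by the row maximum for stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Made-up query-key scores for a 3-word sentence (rows = queries, columns = keys).
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0]])

weights = softmax(scores + causal_mask(3))
print(np.round(weights, 3))
```

Each row of `weights` still sums to 1, but every entry above the diagonal is zero, so word *i* only attends to words 0 through *i*.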


In [2]:
import sys
import numpy as np
import scipy.special

import textwrap
wrapper = textwrap.TextWrapper(width=70)

# To print the entire np array
np.set_printoptions(threshold=sys.maxsize)

Here are some helper functions that will help you create tensors and display useful information:

* `create_tensor()` creates a numpy array from a list of lists.
* `display_tensor()` prints out the shape and the actual tensor.

In [4]:
def create_tensor(t):
    """
    Create tensor from list of lists
    """
    return np.array(t)

def display_tensor(t, name):
    """
    Display shape and tensor
    """
    print(f'{name} shape: {t.shape}\n')
    print(f'{t}\n')

Create some tensors and display their shapes. The query, key, and value arrays must all have the same embedding dimension (number of columns), and the mask array must have the same shape as `np.dot(query, key.T)`.

In [16]:
q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')

k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')

v = create_tensor([[0, 1, 0], [1, 0, 1]])
display_tensor(v, 'value')

m = create_tensor([[0, 0], [-1e9, 0]])
display_tensor(m, 'mask')

query shape: (2, 3)

[[1 0 0]
 [0 1 0]]

key shape: (2, 3)

[[1 2 3]
 [4 5 6]]

value shape: (2, 3)

[[0 1 0]
 [1 0 1]]

mask shape: (2, 2)

[[ 0.e+00  0.e+00]
 [-1.e+09  0.e+00]]
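
The mask shown above is in additive form: 0 leaves a score unchanged and -1e9 drives its softmax weight to effectively zero. Code that gates scores with `np.where(mask, ...)` instead expects a boolean mask. The sketch below shows one way to convert between the two, assuming the convention that `True` means "keep the score". The variable names and example scores are illustrative.

```python
import numpy as np

# Additive mask with the same values as `m` above: 0 keeps a score, -1e9 suppresses it.
additive = np.array([[0.0, 0.0],
                     [-1e9, 0.0]])

# Boolean form (assumed convention: True = keep the score).
keep = additive == 0

# Illustrative scores with the same (L_q, L_k) shape as the mask.
scores = np.array([[1.0, 4.0],
                   [2.0, 5.0]])

masked_by_where = np.where(keep, scores, -1e9)  # replace masked scores with -1e9
masked_by_add = scores + additive               # shift masked scores by -1e9

def softmax(x):
    """Row-wise softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

print(keep)
print(softmax(masked_by_where))
print(softmax(masked_by_add))
```

The two masked score arrays differ slightly at the masked position (-1e9 versus 2 - 1e9), but both give the same attention weights after the softmax.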



## Dot Product Attention
$$\mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}} + M \right) V$$

where the scaling factor $\sqrt{d}$ is the square root of the embedding dimension $d$, and $M$ is the additive mask.
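
To make the formula concrete, here is a small step-by-step evaluation in NumPy for the unmasked (bi-directional) case, followed by the multiplication of the softmax weights with the values. The array names `Q`, `K`, `V`, `M` mirror the symbols in the formula; the all-zero mask is an assumption chosen so that nothing is masked.

```python
import numpy as np

Q = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # queries, shape (L_q, d)
K = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # keys,    shape (L_k, d)
V = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])  # values,  shape (L_k, d)
M = np.zeros((2, 2))                              # additive mask, all zeros = nothing masked

d = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d) + M                 # shape (L_q, L_k)

# Softmax over the key axis (last axis), shifted by the row maximum for stability.
exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

attention = weights @ V                           # shape (L_q, d)
print(weights)
print(attention)
```

Both rows of `scores` differ by the same amount between the two keys, so both queries end up with the same weights, roughly 0.15 on the first key and 0.85 on the second.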

In [13]:
def DotProductAttention(query, key, value, mask, scale=True):
    """
    Dot Product Attention
    Args:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_k by d), i.e. L_v = L_k
        mask (numpy.ndarray or None): boolean attention mask, broadcastable to (L_q by L_k);
            True keeps a score, False replaces it with -1e9. None applies no mask.
        scale (bool): whether to scale the dot product of the query and transposed key

    Returns:
        numpy.ndarray: attention array for the q, k, v arrays (L_q by d)
    """

    assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

    # Save depth/dimension of the query embedding for scaling down the dot product
    if scale:
        depth = query.shape[-1]
    else:
        depth = 1

    # Calculate scaled query key dot product according to formula above
    dots = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth)

    # Apply the mask: keep scores where the mask is True, set the rest to -1e9
    if mask is not None:
        dots = np.where(mask, dots, np.full_like(dots, -1e9))
        
    # Softmax formula implementation
    # Use scipy.special.logsumexp for numerical stability (avoids overflow when exponentiating large scores)
    # Normalise over the key axis (the last axis) so that each query's weights sum to 1,
    # including when a 3-D mask broadcasts dots to shape (1, L_q, L_k)
    # Note softmax = e^(dots - logsumexp(dots)) = e^dots / sumexp(dots)
    logsumexp = scipy.special.logsumexp(dots, axis=-1, keepdims=True)

    # Take exponential of dots minus logsumexp to get softmax
    dots = np.exp(dots - logsumexp)

    # Multiply the softmax weights by value to get the attention output
    attention = np.matmul(dots, value)

    return attention

Now let's implement the *masked* dot product self-attention (at the heart of causal attention) as a special case of dot product attention.

In [14]:
def dot_product_self_attention(q, k, v, scale=True):
    """ Masked dot product self attention.
    Args:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    # Create a boolean matrix with True on and below the diagonal and False above.
    # It has shape (1, mask_size, mask_size)
    # Use np.tril() - Lower triangle of an array and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype=np.bool_), k=0)

    return DotProductAttention(q, k, v, mask, scale=scale)

In [15]:
dot_product_self_attention(q, k, v)

array([[[0.        , 1.        , 0.        ],
        [0.84967455, 0.15032545, 0.84967455]]])
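
As a side note on the softmax step in `DotProductAttention`, the identity softmax(x) = exp(x - logsumexp(x)) is used because exponentiating large scores directly can overflow. The sketch below illustrates this with a NumPy-only `logsumexp`, written here as a stand-in for `scipy.special.logsumexp`; the helper names and the example scores are assumptions for illustration.

```python
import numpy as np

def logsumexp(x, axis=-1):
    """NumPy-only stand-in for scipy.special.logsumexp with keepdims=True."""
    x_max = x.max(axis=axis, keepdims=True)
    return x_max + np.log(np.exp(x - x_max).sum(axis=axis, keepdims=True))

def stable_softmax(x, axis=-1):
    """softmax(x) = exp(x - logsumexp(x)), computed without overflow."""
    return np.exp(x - logsumexp(x, axis=axis))

# Deliberately large scores: exp(1000) overflows to inf in float64.
large = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    # Naive softmax: inf / inf produces nan.
    naive = np.exp(large) / np.exp(large).sum(axis=-1, keepdims=True)

print(naive)
print(stable_softmax(large))
```

The naive version returns only `nan` values, while the logsumexp version returns the same weights as a softmax over `[0, 1, 2]`, since softmax is unchanged when a constant is subtracted from every score.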