Create a function named sparse_window_attention that computes sparse attention over long sequences by sliding a fixed-radius window across the sequence.

• The parameter window_size represents the radius w of the window.

For a token at index i, attend only to tokens whose indices lie in the range max(0, i - w) through min(seq_len - 1, i + w), inclusive.
Tokens near the beginning or end of the sequence simply have smaller windows; no padding is added.
• Inputs

Q, K, V: NumPy arrays with shapes (seq_len, d_k) for Q and K, and (seq_len, d_v) for V.
window_size: integer window radius.
scale_factor (optional): divisor applied to the dot-product scores; if None, it defaults to sqrt(d_k).
• Output

A NumPy array of shape (seq_len, d_v) containing the attention results.
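
The window rule above can be sketched on its own, before any attention math. The helper name window_bounds is hypothetical and is not part of the required function; it only lists the inclusive index range each position would attend to, under the clipping rule stated above.

```python
def window_bounds(seq_len: int, window_size: int) -> list:
    """Inclusive (lo, hi) index range attended to by each position i.

    Hypothetical helper for illustration only.
    """
    return [
        (max(0, i - window_size), min(seq_len - 1, i + window_size))
        for i in range(seq_len)
    ]

# With 5 tokens and radius 1, the edge tokens see 2 positions, the rest see 3.
print(window_bounds(seq_len=5, window_size=1))
# [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4)]
```

Note that a radius at least as large as the sequence length makes every window cover the whole sequence, so the result reduces to ordinary full attention.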
Example:
Input:
import numpy as np
Q = np.array([[1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0]])
print(sparse_window_attention(Q, K, V, 1))
Output:
[[1.5]
 [2. ]
 [2.5]]
Reasoning:
With window_size = 1, each query in Q attends only to the key at its own position and at most one neighbor on each side. The scores inside each window are passed through softmax to produce weights for the corresponding rows of V. In this example every query-key score is the same (1.0 / sqrt(1) = 1.0), so softmax assigns uniform weights within each window and each output row is the plain mean of the values in its window: row 0 averages V[0] and V[1] (1.5), row 1 averages V[0], V[1], and V[2] (2.0), and row 2 averages V[1] and V[2] (2.5).
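
As a cross-check of the reasoning above, the same example can be computed with a dense mask instead of a per-token loop. This is a minimal sketch, not the intended solution: the function name dense_masked_window_attention is hypothetical, and it builds the full (seq_len, seq_len) score matrix, so it is only suitable for verifying small cases.

```python
import numpy as np

def dense_masked_window_attention(Q, K, V, window_size, scale_factor=None):
    """Reference check (hypothetical helper): full scores, out-of-window set to -inf."""
    L, d_k = Q.shape
    scale = np.sqrt(d_k) if scale_factor is None else float(scale_factor)
    scores = (Q @ K.T) / scale                            # (L, L)
    idx = np.arange(L)
    in_window = np.abs(idx[:, None] - idx[None, :]) <= window_size
    scores = np.where(in_window, scores, -np.inf)         # mask positions outside the window
    scores = scores - scores.max(axis=1, keepdims=True)   # stable softmax, row by row
    weights = np.exp(scores)                              # exp(-inf) evaluates to 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

Q = np.array([[1.0], [1.0], [1.0]])
K = np.array([[1.0], [1.0], [1.0]])
V = np.array([[1.0], [2.0], [3.0]])
out, weights = dense_masked_window_attention(Q, K, V, 1)
print(weights)  # rows are uniform inside each window: (1/2, 1/2, 0), (1/3, 1/3, 1/3), (0, 1/2, 1/2)
print(out)      # matches the expected output up to floating-point rounding
```

The diagonal is always inside the window for a non-negative radius, so every row has at least one finite score and the row-wise maximum is well defined.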

import numpy as np

def sparse_window_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray, window_size: int, scale_factor: float = None) -> np.ndarray:
    """
    Sparse sliding-window attention (radius = window_size).
    Q,K: (L, d_k), V: (L, d_v). Returns (L, d_v).
    For token i, attend to j in [max(0,i-w), min(L-1,i+w)].
    """
    L, d_k = Q.shape # L is the sequence length
    d_v = V.shape[1]
    if scale_factor is None:
        scale = np.sqrt(d_k)
    else:
        scale = float(scale_factor)

    out = np.zeros((L, d_v), dtype=float)

    for i in range(L):
        lo = max(0, i - window_size)
        hi = min(L - 1, i + window_size) + 1  # slice end is exclusive
        K_win = K[lo:hi]                     # (W, d_k)
        V_win = V[lo:hi]                     # (W, d_v)

        # scores: (W,)
        scores = (K_win @ Q[i]) / scale
        # stable softmax
        scores -= scores.max()
        weights = np.exp(scores)
        weights /= weights.sum()

        out[i] = weights @ V_win            # (d_v,)
    return out

# Example
if __name__ == "__main__":
    Q = np.array([[1.0],[1.0],[1.0]])
    K = np.array([[1.0],[1.0],[1.0]])
    V = np.array([[1.0],[2.0],[3.0]])
    print(sparse_window_attention(Q, K, V, 1, None))
    # [[1.5]
    #  [2. ]
    #  [2.5]]

[[1.5]
 [2. ]
 [2.5]]
