## Vector Similarity and Dot Product in Machine Learning

- https://medium.com/advanced-deep-learning/understanding-vector-similarity-b9c10f7506de
- https://medium.com/data-science-collective/cosine-distance-vs-dot-product-vs-euclidean-in-vector-similarity-search-227a6db32edb

### Introduction to Vectors in ML

Vectors as Data Representations: In ML, data (e.g., words, documents, images) is encoded as vectors in high-dimensional space.

Example: TF-IDF vectors for documents.

- Doc A: [10, 20, 30]
- Doc B: [100, 200, 300] (scaled version of A).




Why Similarity?: A similarity measure quantifies how "related" two data points are. Applications:

- Information retrieval (search engines).
- Recommendation systems.
- Natural Language Processing (NLP).


Key Insight: A similarity measure can compare direction (angle) alone, or direction together with magnitude (length).

<img src='imgs/fig_embedding_space.png' width=800>
<a href='https://miro.medium.com/v2/resize:fit:2000/format:webp/1*vOvIYq7cLw4qTC70tSwwgQ.png'>Figure 1: Embedding Space </a>

### Dot Product: The Foundation

- Algebraic Definition:
For vectors $\vec{a} = [a_1, \dots, a_n]$ and $\vec{b} = [b_1, \dots, b_n]$,
$\vec{a} \cdot \vec{b} = \sum_{i=1}^n a_i b_i$
- Geometric Interpretation:
$\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos \theta$
where $\|\vec{a}\| = \sqrt{\sum a_i^2}$ is the Euclidean norm, and $\theta$ is the angle between vectors.

- Properties:
    - Captures both direction and magnitude.
    - Range: $-\infty$ to $+\infty$.
    - Positive: Acute angle (vectors point in broadly the same direction).
    - Zero: Right angle (vectors are perpendicular, i.e. orthogonal).
    - Negative: Obtuse angle (vectors point in broadly opposite directions).


- Limitation: Sensitive to vector length → For a fixed angle, vectors with larger norms yield scores of larger absolute value (e.g., long documents dominate in search).
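
The agreement between the two definitions, and the length sensitivity noted above, can be checked numerically. The following is a minimal sketch using NumPy; the 2D vectors `a` and `b` and the scaling factor of 10 are illustrative assumptions, not values from the referenced articles.

```python
import numpy as np

# Illustrative 2D vectors (assumed for this sketch)
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Algebraic definition: sum of element-wise products
algebraic = float(np.sum(a * b))

# Geometric definition: |a| |b| cos(theta), where theta is the
# difference between the polar angles of the two vectors (2D only)
theta = np.arctan2(a[1], a[0]) - np.arctan2(b[1], b[0])
geometric = float(np.linalg.norm(a) * np.linalg.norm(b) * np.cos(theta))

# Scaling b by 10 leaves the angle unchanged but scales the dot product by 10
scaled = float(np.dot(a, 10 * b))

print(algebraic, geometric, scaled)
```

The first two printed values should agree up to floating-point rounding, while the third is ten times larger even though the direction of `b` has not changed.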

### Cosine Similarity: Direction-Only Measure

- Formula:
$\text{Cosine Similarity} = \cos \theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$

- Key Advantage: Normalizes for magnitude → Focuses purely on angle.
- Range: -1 (exactly opposite) to +1 (identical direction); 0 (orthogonal).
- Example Calculation:
    - $\vec{a} = [1.5, 1.5]$, $\vec{b} = [2.0, 1.0]$
    - Dot product: $1.5 \cdot 2.0 + 1.5 \cdot 1.0 = 4.5$
    - Norms: $\|\vec{a}\| \approx 2.121$, $\|\vec{b}\| \approx 2.236$
    - Cosine: $4.5 / (2.121 \cdot 2.236) \approx 0.949$ (close to 1, similar direction).

- Use Cases: Text similarity (ignores document length), word embeddings (e.g., Word2Vec).
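
The example calculation above can be reproduced in a few lines. This is a small sketch with NumPy; the variable names and the extra check with a rescaled copy of `b` (factor 5, chosen arbitrarily) are additions for illustration.

```python
import numpy as np

a = np.array([1.5, 1.5])
b = np.array([2.0, 1.0])

dot = float(np.dot(a, b))              # 1.5*2.0 + 1.5*1.0
norm_a = float(np.linalg.norm(a))      # sqrt(1.5^2 + 1.5^2)
norm_b = float(np.linalg.norm(b))      # sqrt(2.0^2 + 1.0^2)
cosine = dot / (norm_a * norm_b)
angle_deg = float(np.degrees(np.arccos(cosine)))

# Rescaling one vector changes the dot product but not the cosine
cosine_scaled = float(np.dot(a, 5 * b) / (norm_a * np.linalg.norm(5 * b)))

print(round(dot, 3), round(norm_a, 3), round(norm_b, 3), round(cosine, 3))
print(round(angle_deg, 1), np.isclose(cosine, cosine_scaled))
```

A cosine of about 0.949 corresponds to an angle of roughly 18.4 degrees between the two vectors.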

<img src='imgs/fig_cos_sim.png' width=800>

### Proof of the Cosine Rule (Geometric Derivation)

- Setup: Place $\vec{a}$ and $\vec{b}$ tail to tail so that they form two sides of a triangle with angle $\theta$ between them. The third side is $\vec{a} - \vec{b}$.
- Law of Cosines (from geometry):
$\|\vec{a} - \vec{b}\|^2 = \|\vec{a}\|^2 + \|\vec{b}\|^2 - 2 \|\vec{a}\| \|\vec{b}\| \cos \theta$
- Expand the Left Side:
$\|\vec{a} - \vec{b}\|^2 = (\vec{a} - \vec{b}) \cdot (\vec{a} - \vec{b}) = \vec{a} \cdot \vec{a} - 2 \vec{a} \cdot \vec{b} + \vec{b} \cdot \vec{b} = \|\vec{a}\|^2 - 2 \vec{a} \cdot \vec{b} + \|\vec{b}\|^2$
- Equate Both Sides:
$\|\vec{a}\|^2 - 2 \vec{a} \cdot \vec{b} + \|\vec{b}\|^2 = \|\vec{a}\|^2 + \|\vec{b}\|^2 - 2 \|\vec{a}\| \|\vec{b}\| \cos \theta$
- Simplify:
$-2 \vec{a} \cdot \vec{b} = -2 \|\vec{a}\| \|\vec{b}\| \cos \theta$
$\vec{a} \cdot \vec{b} = \|\vec{a}\| \|\vec{b}\| \cos \theta$,
$\cos \theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}$
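- Validity: The law of cosines is a result of plane geometry. Any two vectors span a plane (a subspace of at most two dimensions), so the same argument applies in n dimensions, where the dot product and norm are defined by the same formulas. The Cauchy-Schwarz inequality, $|\vec{a} \cdot \vec{b}| \le \|\vec{a}\| \|\vec{b}\|$, guarantees that the ratio lies in $[-1, 1]$, so $\theta$ is well defined.
- Note: Assumes real, nonzero vectors in Euclidean space.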
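
As a sanity check on the derivation, the expansion step can be verified numerically in more than three dimensions. This is only a sketch: the dimension (8) and the random seed are arbitrary assumptions, and a numerical check illustrates the identity rather than proving it.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
a = rng.normal(size=8)
b = rng.normal(size=8)

# Left side: squared length of the third side of the triangle
lhs = np.linalg.norm(a - b) ** 2

# Right side: expansion of (a - b) . (a - b)
rhs = np.dot(a, a) - 2 * np.dot(a, b) + np.dot(b, b)

# The ratio that defines cos(theta); Cauchy-Schwarz keeps it in [-1, 1]
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.isclose(lhs, rhs))
print(-1.0 <= cos_theta <= 1.0)
```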

In [3]:
import numpy as np

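a = np.array([1.0, 0.0])  # Unit vector
b = np.array([0.0, 1.0])  # Unit vector, orthogonal to a

cos_theta = np.dot(a, b)   # 0.0; equals the cosine because both norms are 1
d = np.linalg.norm(a - b)  # Direct Euclidean distance: sqrt(2) ≈ 1.414
print(np.sqrt(2 - 2 * cos_theta))  # Unit-norm identity sqrt(2 - 2 cos θ); matches d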

1.4142135623730951


### Euclidean Distance: Magnitude and Direction

- Formula:
$d(\vec{a}, \vec{b}) = \|\vec{a} - \vec{b}\| = \sqrt{\sum_{i=1}^n (a_i - b_i)^2}$
- Relation to Cosine: For unit-norm vectors, $d = \sqrt{2 - 2 \cos \theta}$.
- Properties:

    - Always non-negative (0 for identical vectors).
    - Satisfies triangle inequality → Useful for indexing (e.g., KD-trees).


- When to Use: Clustering (k-means), anomaly detection → Considers absolute differences.
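
One practical consequence of the unit-norm relation above is that, once vectors are normalized, ranking by Euclidean distance and ranking by cosine similarity give the same order. The sketch below illustrates this on random data; the sizes, the seed, and the helper name `normalize` are assumptions made for the example.

```python
import numpy as np

def normalize(x):
    # Scale each vector (last axis) to unit Euclidean norm
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(42)            # arbitrary seed
q = normalize(rng.normal(size=16))         # one query vector
D = normalize(rng.normal(size=(5, 16)))    # five document vectors

cos = D @ q                                # cosine similarity (unit norms)
dist = np.linalg.norm(D - q, axis=1)       # Euclidean distance to the query

# d = sqrt(2 - 2 cos(theta)) for unit-norm vectors
identity_holds = np.allclose(dist, np.sqrt(2 - 2 * cos))

# Highest cosine first vs. smallest distance first
same_ranking = np.array_equal(np.argsort(-cos), np.argsort(dist))

print(identity_holds, same_ranking)
```

Without normalization the two rankings can differ, because Euclidean distance also reacts to differences in vector length.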

### Comparison of Measures

| Measure            | Formula                                      | Magnitude-Sensitive? | Range      | Best For                          | Is a Metric?                       |
|--------------------|----------------------------------------------|----------------------|------------|-----------------------------------|------------------------------------|
| Dot Product        | $\vec{a} \cdot \vec{b}$                      | Yes                  | $-\infty$ to $\infty$ | Fast scoring, inner-product search | No                                 |
| Cosine Similarity  | $\dfrac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\;\|\vec{b}\|}$ | No                   | $-1$ to $1$ | Directional (text, embeddings)    | No ($1 - \cos\theta$ is a common distance, but it violates the triangle inequality) |
| Euclidean Distance | $\sqrt{\sum_i (a_i - b_i)^2}$                | Yes                  | $0$ to $\infty$ | Absolute distance, clustering | Yes                                |

<center><img src='imgs/fig_dot_and_cos.png' width=800></center>


In [2]:
import numpy as np

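def dot_product(a, b):
    # Sum of element-wise products; sensitive to direction and magnitude
    return np.dot(a, b)

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms; direction only
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Cosine is undefined for a zero vector; return 0.0 by convention
    return np.dot(a, b) / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Norm of the difference vector
    return np.linalg.norm(a - b)

# Example: b is a scaled copy of a (Doc A and Doc B from the introduction)
a = np.array([10, 20, 30])
b = np.array([100, 200, 300])

print("Dot:", dot_product(a, b))
print("Cosine:", cosine_similarity(a, b))
print("Euclidean:", euclidean_distance(a, b))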

Dot: 14000
Cosine: 1.0
Euclidean: 336.7491648096547
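
The "Is a Metric?" column of the comparison table can be illustrated with a small counterexample. The sketch below uses three hand-picked 2D vectors (an assumption for illustration) to show that cosine distance, $1 - \cos\theta$, can violate the triangle inequality, whereas the angle itself does not. The function names are hypothetical helpers, not part of any library.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cos(theta); a common dissimilarity, but not a true metric
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def angular_distance(u, v):
    # The angle theta in radians; clip guards against rounding outside [-1, 1]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])   # halfway between a and c (45 degrees from each)
c = np.array([0.0, 1.0])

direct = cosine_distance(a, c)                         # 1 - cos(90°)
via_b = cosine_distance(a, b) + cosine_distance(b, c)  # 2 * (1 - cos(45°))

print("cosine distance:", direct, "vs detour", via_b)
print("angular distance:", angular_distance(a, c),
      "vs detour", angular_distance(a, b) + angular_distance(b, c))
```

Here the direct cosine distance from `a` to `c` is larger than the detour through `b`, which a metric does not allow. For the angular distance, the direct path and the detour are equal up to rounding.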


### Attention Mechanism: Dot Product in Transformers

- Context: From "Attention Is All You Need" (Vaswani et al., 2017).
- Scaled Dot-Product Attention:

    - Inputs: Queries (Q), Keys (K), Values (V) — projections of embeddings.
    - Similarity Scores: $QK^T$ (matrix of dot products).
    - Scaled: Divide by $\sqrt{d_k}$, where $d_k$ is the dimension of the keys, to stabilize gradients.
    - Full Formula:
    $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$


- Why Dot Product?:

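    - Efficient computation: $O(n^2 d)$ for sequence length $n$ and dimension $d$, but it reduces to matrix multiplications that parallelize well.
    - Measures alignment (like an unnormalized cosine similarity).
    - For unit-norm query and key vectors, the dot product equals the cosine similarity.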


- Intuition: High dot score → Query "attends" strongly to that Key → Weights Values accordingly.
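- Multi-Head Attention: Run several attention heads in parallel, each with its own learned projections → Different heads can capture different relationships.
- Example: In translation, the query for "it" attends to the corresponding subject in the source sentence.
- Scaling Rationale: If the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$. In high dimensions the scores therefore grow large → Softmax saturates (gradients → 0). Dividing by $\sqrt{d_k}$ brings the variance back to 1.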
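
The attention formula and the scaling rationale above can be sketched in NumPy. This is a minimal single-head illustration, not the implementation from the paper: the helper names, the matrix sizes, and the use of random Gaussian inputs in place of learned projections are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # dot product of every query with every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted average of the values

rng = np.random.default_rng(0)          # arbitrary seed
n_q, n_k, d_k, d_v = 3, 4, 64, 8        # illustrative sizes
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_k, d_k))
V = rng.normal(size=(n_k, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=1))

# Scaling rationale: with unit-variance components, raw dot products
# have variance close to d_k; dividing by sqrt(d_k) brings it near 1
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))
raw = np.sum(q * k, axis=1)
raw_var = float(raw.var())
scaled_var = float((raw / np.sqrt(d_k)).var())
print(raw_var, scaled_var)
```

The output has one row per query and one column per value dimension. The two printed variances should land near $d_k$ and near 1 respectively, which is why the unscaled scores would push softmax toward saturation as $d_k$ grows.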