# Term vector similarity

In this exercise, you'll complete the code that computes the similarity between two documents represented by their term vectors.
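To make the notion of a term vector concrete, here is a minimal sketch of how two toy documents could be turned into count vectors over a shared vocabulary. The helper `term_vector`, the whitespace tokenization, and the example documents are assumptions made for illustration; the exercise itself starts from ready-made vectors.

```python
from collections import Counter
from typing import List


def term_vector(doc: str, vocabulary: List[str]) -> List[int]:
    """Counts how often each vocabulary term occurs in a document."""
    # Assumption: lowercase, whitespace-separated tokens are good enough here.
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]


# Toy documents, made up for illustration.
doc_x = "the cat sat on the mat"
doc_y = "the dog sat"

# A shared, sorted vocabulary fixes the meaning of each vector position.
vocabulary = sorted(set(doc_x.split()) | set(doc_y.split()))

x = term_vector(doc_x, vocabulary)
y = term_vector(doc_y, vocabulary)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(x)           # [1, 0, 1, 1, 1, 2]
print(y)           # [0, 1, 0, 0, 1, 1]
```

Position $i$ in both vectors refers to the same term, which is what allows the similarity measures in this exercise to compare the vectors element by element.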

In [None]:
import ipytest
import math
import pytest
from typing import List

ipytest.autoconfig()

## Jaccard similarity

This metric is a set similarity: it captures only the presence or absence of terms, with no regard to their frequency. Put simply, it is the ratio of the number of terms the two documents share to the number of distinct terms that appear in either document.

$$sim_{\mathrm{Jaccard}}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

where $X$ and $Y$ denote the set of terms in documents $x$ and $y$, respectively.
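The set form of the definition can be sketched directly with Python sets. This is only an illustration: `jaccard_sets` is a hypothetical helper that operates on sets of terms, not on the term vectors the exercise asks for, and returning 0 for two empty documents is an assumption.

```python
def jaccard_sets(terms_x: set, terms_y: set) -> float:
    """Jaccard similarity of two term sets (hypothetical helper)."""
    union = terms_x | terms_y
    if not union:
        # Assumption: two empty documents are given similarity 0.
        return 0.0
    return len(terms_x & terms_y) / len(union)


# Toy term sets, made up for illustration.
X = {"the", "cat", "sat", "on", "mat"}
Y = {"the", "dog", "sat"}

# 2 shared terms ("the", "sat") out of 6 distinct terms.
print(jaccard_sets(X, Y))
```

Because only set membership matters, repeating a term in a document does not change the result.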

If the two documents are given as term vectors, Jaccard similarity may be computed as:

$$sim_{\mathrm{Jaccard}}(\mathbf{x},\mathbf{y}) = \frac{\sum_{i} \mathbb{1}(x_i) \times \mathbb{1}(y_i)}{\sum_{i} \mathbb{1}(x_i+y_i)}$$

where $\mathbb{1}(x)$ is an indicator function ($1$ if $x>0$ and $0$ otherwise).
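As a worked example of the vector form, take $\mathbf{x} = (0, 1, 1, 0, 1, 1)$ and $\mathbf{y} = (5, 0, 3, 0, 7, 0)$. The numerator counts the positions where both entries are positive, and the denominator counts the positions where at least one entry is positive:

$$sim_{\mathrm{Jaccard}}(\mathbf{x},\mathbf{y}) = \frac{0 + 0 + 1 + 0 + 1 + 0}{1 + 1 + 1 + 0 + 1 + 1} = \frac{2}{5} = 0.4$$

Note that the actual counts (for example $y_1 = 5$ or $y_5 = 7$) play no role; only whether an entry is positive matters.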

In [None]:
def jaccard(x: List[int], y: List[int]) -> float:
    """Computes the Jaccard similarity between two term vectors."""
    # TODO Complete this function.
    return -1

Tests.

In [None]:
%%run_pytest[clean]

def test_no_common_terms():
    x = [0, 0, 0, 1, 2, 1]
    y = [1, 5, 3, 0, 0, 0]
    assert jaccard(x, y) == 0

def test_only_common_terms():
    x = [0, 1, 2, 1, 0, 1]
    y = [0, 5, 3, 7, 0, 1]
    assert jaccard(x, y) == 1

def test_some_common_terms():
    x = [0, 1, 1, 0, 1, 1]
    y = [5, 0, 3, 0, 7, 0]
    assert jaccard(x, y) == pytest.approx(0.4)

## Cosine similarity

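Unlike Jaccard similarity, cosine similarity takes term frequencies into account: it is the cosine of the angle between the two term vectors.

$$sim_{\mathrm{cos}}(\mathbf{x},\mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}||~||\mathbf{y}||} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n y_i^2}}$$

where $n$ is the length of the term vectors.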
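As a worked example, take the two-term vectors $\mathbf{x} = (4, 2)$ and $\mathbf{y} = (1, 3)$:

$$sim_{\mathrm{cos}}(\mathbf{x},\mathbf{y}) = \frac{4 \cdot 1 + 2 \cdot 3}{\sqrt{4^2 + 2^2}~\sqrt{1^2 + 3^2}} = \frac{10}{\sqrt{20}~\sqrt{10}} = \frac{10}{\sqrt{200}} = \frac{\sqrt{2}}{2} \approx 0.707$$

The denominator is zero if either vector contains only zeros, so an implementation has to decide how to handle empty documents; the formula itself leaves that case undefined.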

In [None]:
def cosine(x: List[float], y: List[float]) -> float:
    """Computes the Cosine similarity between two term vectors."""
    # TODO Complete this function.
    return -1

Tests.

In [None]:
%%run_pytest[clean]

def test_no_common_terms():
    x = [0, 0, 0, 1, 2, 1]
    y = [1, 5, 3, 0, 0, 0]
    assert cosine(x, y) == 0

def test_identical_docs():
    x = [0, 0, 0, 1, 2, 1]
    assert cosine(x, x) == pytest.approx(1, rel=1e-3)

def test_short_docs():
    x = [4, 2]
    y = [1, 3]
    assert cosine(x, y) == pytest.approx(math.sqrt(2) / 2, rel=1e-3)