# Exercise #1: Computing the similarity between term vectors

## Jaccard similarity

This metric is a set similarity: it captures only the presence or absence of terms, with no regard to their frequency. Put simply, it is the ratio of the number of terms the two documents share to the number of distinct terms that occur in either document.

$$sim_{Jaccard}(x,y) = \frac{|X \cap Y|}{|X \cup Y|}$$

where $X$ and $Y$ denote the set of terms in documents $x$ and $y$, respectively.
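
To connect the set notation with term vectors, here is a minimal sketch that treats each vector position as a term and collects the positions with a non-zero count into a set. The example vectors are the ones used in `test_some_common_terms`; the variable names `X` and `Y` simply mirror the formula and are chosen for illustration.

```python
# Term vectors: position i holds the frequency of term i in the document.
x = [0, 1, 1, 0, 1, 1]
y = [5, 0, 3, 0, 7, 0]

# A term belongs to a document's set if its frequency is greater than zero.
X = {i for i, count in enumerate(x) if count > 0}
Y = {i for i, count in enumerate(y) if count > 0}

print(sorted(X & Y))            # shared terms: [2, 4]
print(sorted(X | Y))            # terms in either document: [0, 1, 2, 4, 5]
print(len(X & Y) / len(X | Y))  # 2 / 5 = 0.4
```

Note that the actual frequencies (for example 3 versus 7) play no role in the result; only whether a term occurs matters.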

In [1]:
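def jaccard(x, y):
    """Computes the Jaccard similarity between two term vectors."""
    num_both = 0    # terms present in both documents (intersection size)
    num_either = 0  # terms present in at least one document (union size)
    for xi, yi in zip(x, y):
        num_both += int(xi > 0 and yi > 0)
        num_either += int(xi > 0 or yi > 0)
    # Two empty documents have no terms at all; return 0 by convention
    # instead of dividing by zero.
    if num_either == 0:
        return 0
    return num_both / num_either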

In [2]:
import unittest

class TestJaccard(unittest.TestCase):

    def test_no_common_terms(self):
        x = [0, 0, 0, 1, 2, 1]
        y = [1, 5, 3, 0, 0, 0]
        self.assertEqual(jaccard(x, y), 0)

    def test_only_common_terms(self):
        x = [0, 1, 2, 1, 0, 1]
        y = [0, 5, 3, 7, 0, 1]
        self.assertEqual(jaccard(x, y), 1)

    def test_some_common_terms(self):
        x = [0, 1, 1, 0, 1, 1]
        y = [5, 0, 3, 0, 7, 0]
        self.assertEqual(jaccard(x, y), 0.4)
        
        
unittest.main(argv=['-q', 'TestJaccard'], verbosity=2, exit=False)

test_no_common_terms (__main__.TestJaccard) ... ok
test_only_common_terms (__main__.TestJaccard) ... ok
test_some_common_terms (__main__.TestJaccard) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK



## Cosine similarity

$$sim_{cos}(x,y) = \frac{\mathbf{x} \cdot \mathbf{y}}{||\mathbf{x}||~||\mathbf{y}||} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2} \sqrt{\sum_{i=1}^n y_i^2}}$$
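
As a worked illustration of the formula, take the two short vectors $\mathbf{x} = (4, 2)$ and $\mathbf{y} = (1, 3)$, which are the ones used in `test_short_docs`:

$$sim_{cos}(x,y) = \frac{4 \cdot 1 + 2 \cdot 3}{\sqrt{4^2 + 2^2} \sqrt{1^2 + 3^2}} = \frac{10}{\sqrt{20} \sqrt{10}} = \frac{10}{10\sqrt{2}} = \frac{\sqrt{2}}{2} \approx 0.7071$$

Unlike Jaccard similarity, the term frequencies enter the computation directly through the products $x_i y_i$ and the vector lengths.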

In [3]:
import math

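def cosine(x, y):
    """Computes the Cosine similarity between two term vectors."""
    dot_product = 0
    x_len = 0  # squared Euclidean length of x
    y_len = 0  # squared Euclidean length of y
    for xi, yi in zip(x, y):
        dot_product += xi * yi
        x_len += xi ** 2
        y_len += yi ** 2

    # The cosine is undefined for an all-zero vector; return 0 by convention
    # instead of dividing by zero.
    if x_len == 0 or y_len == 0:
        return 0
    return dot_product / (math.sqrt(x_len) * math.sqrt(y_len))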

In [4]:
class TestCosine(unittest.TestCase):

    def test_no_common_terms(self):
        x = [0, 0, 0, 1, 2, 1]
        y = [1, 5, 3, 0, 0, 0]
        self.assertEqual(cosine(x, y), 0)
        
    def test_identical_docs(self):
        x = [0, 0, 0, 1, 2, 1]
        self.assertAlmostEqual(cosine(x, x), 1.0, places=4)

    def test_short_docs(self):
        x = [4, 2]
        y = [1, 3]
        self.assertAlmostEqual(cosine(x, y), math.sqrt(2) / 2, places=4)
        
unittest.main(argv=['-q', 'TestCosine'], verbosity=2, exit=False)

test_identical_docs (__main__.TestCosine) ... ok
test_no_common_terms (__main__.TestCosine) ... ok
test_short_docs (__main__.TestCosine) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.004s

OK



## Feedback

Please give (anonymous) feedback on this exercise by filling out [this form](https://forms.gle/22o3ursi5YsR1Ztb8).