# Text Similarity Metrics

Exercise notebook

Course: Algorytmy Tekstowe (Text Algorithms) at AGH University

## Preprocessing and vectorization

1. Preprocessing: Convert the text documents to lowercase and remove all punctuation marks (using regular expressions, for example).
2. Vocabulary creation: Create a vocabulary by taking all unique words from all text documents.
3. Word frequency vectors: For each text document, create a vector whose i-th entry counts how often the i-th vocabulary word occurs in that document.
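
The three steps above can be sketched with a regular expression and `collections.Counter`. This is only an illustration under stated assumptions: the names `preprocess_regex` and `frequency_vectors` are hypothetical, the vocabulary is sorted so that vector positions are reproducible, and the pattern `[^\w\s]` keeps underscores, so it is not exactly equivalent to filtering by `string.punctuation`.

```python
import re
from collections import Counter


def preprocess_regex(text: str) -> str:
    # Step 1: drop every character that is neither a word character
    # nor whitespace, then lowercase the result.
    return re.sub(r"[^\w\s]", "", text).lower()


def frequency_vectors(docs: list[str]) -> tuple[list[str], list[list[int]]]:
    counts = [Counter(preprocess_regex(doc).split()) for doc in docs]
    # Step 2: the vocabulary is every unique word across all documents.
    vocabulary = sorted(set().union(*counts))
    # Step 3: one count per vocabulary word, per document.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = frequency_vectors(["A cat, a dog.", "A dog!"])
print(vocabulary)  # ['a', 'cat', 'dog']
print(vectors)     # [[2, 1, 1], [1, 0, 1]]
```

Returning the vocabulary next to the vectors makes it possible to tell which word each count belongs to.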

In [88]:
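from string import punctuation


def preprocess(text: str) -> str:
    """Remove ASCII punctuation and lowercase the text."""
    text = "".join(c for c in text if c not in punctuation)
    return text.lower()


def text_to_vec(docs: list[str]) -> list[list[int]]:
    """Return one word-frequency vector per document over a shared vocabulary."""
    # split() without arguments also handles repeated spaces, tabs and newlines.
    tokenized = [preprocess(doc).split() for doc in docs]

    # Map every unique word to a vector position. The order comes from a set,
    # so positions may differ between runs, but all vectors share the same order.
    unique_words = {word for doc in tokenized for word in doc}
    vocabulary = {word: ind for ind, word in enumerate(unique_words)}

    freq_vecs = []
    for doc in tokenized:
        vector = [0] * len(vocabulary)
        for word in doc:
            vector[vocabulary[word]] += 1
        freq_vecs.append(vector)
    return freq_vecs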


In [89]:
# Tests
text_a = "The quick brown fox jumped over the lazy dog."
text_b = "The lazy dog was jumped over by the quick brown fox."
vec_a, vec_b = text_to_vec([text_a, text_b])

# Vector positions follow set order, so compare the sorted counts.
assert sorted(vec_a) == [0, 0, 1, 1, 1, 1, 1, 1, 1, 2]
assert sorted(vec_b) == [1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

## Cosine similarity

$$
\begin{equation}
    \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|}= \frac{\sum\limits_{i=1}^{n} A_i B_i}{\sqrt{\sum\limits_{i=1}^{n} A_i^2} \sqrt{\sum\limits_{i=1}^{n} B_i^2}}
    \qquad\begin{aligned}
    &\text{where:} \\
    &\mathbf{A}\text{ and }\mathbf{B} \text{ are the two vectors being compared}\\
    &n \text{ is the dimensionality of the vectors}\\
    &\theta \text{ represents the angle between two vectors } \mathbf{A} \text{ and } \mathbf{B} \text{ in a high-dimensional space}
    \end{aligned}
\end{equation}
$$

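The dot product of $\mathbf{A}$ and $\mathbf{B}$ is divided by the product of their Euclidean lengths, which normalizes the result to the range [-1, 1]. A value of 1 means the vectors point in the same direction (identical up to scaling), 0 means they are orthogonal, and -1 means they point in opposite directions. Word-frequency vectors have no negative entries, so for text documents the similarity always lies in [0, 1], and 0 means the documents share no words.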
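
The boundary cases of the formula can be checked numerically. The following is a minimal pure-Python sketch; the helper name `cosine` and the example vectors are made up for illustration and are not part of the notebook.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the Euclidean lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1, 2, 0], [2, 4, 0])  # b is a scaled copy of a: close to 1
no_overlap = cosine([1, 0, 0], [0, 3, 1])      # no shared non-zero dimension: 0
opposite = cosine([1, 2], [-1, -2])            # b is a negated copy of a: close to -1
```

Because of floating-point rounding, results such as `same_direction` are best compared with `math.isclose` rather than `==`. The sketch does not guard against all-zero vectors, which would cause a division by zero.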


In [90]:
import numpy as np


def cosine_similarity(text_a: str, text_b: str) -> float:
    vec_1, vec_2 = text_to_vec([text_a, text_b])
    vec_1 = np.array(vec_1)
    vec_2 = np.array(vec_2)
    
    return vec_1.dot(vec_2) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))

In [91]:
# Tests
dist = cosine_similarity(text_a, text_b)
assert(abs(dist - 0.91986) < 0.0001)

## Dice coefficient / Sørensen-Dice Index

$$
\begin{equation}
    \text{Dice}(A, B) = \frac{2 |A \cap B|}{|A| + |B|} 
    \qquad\begin{aligned}
    &\text{where:} \\
    &A \text{ and } B \text{ represent the two sets being compared} \\
    &|A| \text{ and } |B| \text{ represent the cardinality (number of elements) of the sets} \\
    &\text{and } |A \cap B| \text{ represents the size of the intersection of the two sets}
    \end{aligned}
\end{equation}
$$
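
The formula maps directly onto Python sets. The sketch below is an illustration only: `dice_sets` is a hypothetical helper, the word sets are made up, and treating two empty sets as identical is an assumption, since the formula itself is undefined in that case.

```python
def dice_sets(a: set[str], b: set[str]) -> float:
    if not a and not b:
        # Assumption: two empty sets count as identical.
        return 1.0
    # Twice the intersection size over the sum of both set sizes.
    return 2 * len(a & b) / (len(a) + len(b))


words_a = {"the", "quick", "fox"}
words_b = {"the", "lazy", "fox", "dog"}
score = dice_sets(words_a, words_b)  # 2 * 2 / (3 + 4)
```

Only the presence of a word matters here, not how often it occurs.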


In [92]:
def dice(text_a: str, text_b: str) -> float:
    vec_1, vec_2 = text_to_vec([text_a, text_b])
    # |A ∩ B|: words that occur in both documents.
    count_common = sum(1 for a, b in zip(vec_1, vec_2) if a > 0 and b > 0)
    # |A| and |B|: distinct words in each document.
    count_1 = sum(1 for a in vec_1 if a > 0)
    count_2 = sum(1 for b in vec_2 if b > 0)
    return 2 * count_common / (count_1 + count_2)

dice(text_a, text_b)

0.8888888888888888

In [104]:
# Tests
dist = dice(text_a, text_b)
assert(abs(dist - 0.8888888) < 0.0001)

## Euclidean distance

$$
\begin{equation}
    d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}
    \qquad\begin{aligned}
    &\text{where:} \\
    &d(x,y) \text{ is the Euclidean distance} \\
    &x_i, y_i \text{ are the values of the i-th dimension of vectors } x \text{ and } y \\
    &n \text{ is the number of dimensions in the vectors}
    \end{aligned}
\end{equation}
$$
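
As a worked instance of the formula, the sketch below uses the word counts of the two test sentences from this notebook. The word order in the comment is an assumption chosen for readability; `text_to_vec` may order the positions differently, which does not change the distance.

```python
import math

# Counts for: the, quick, brown, fox, jumped, over, lazy, dog, was, by
x = [2, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y = [2, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Direct translation of the formula: square root of the summed squared differences.
manual = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The standard library offers the same computation (Python 3.8 and later).
builtin = math.dist(x, y)
```

The vectors differ by 1 in exactly two positions, so the distance is $\sqrt{1^2 + 1^2} = \sqrt{2}$.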

In [105]:
def euclidean_distance(text_a: str, text_b: str) -> float:
    vec_1, vec_2 = text_to_vec([text_a, text_b])
    vec_1 = np.array(vec_1)
    vec_2 = np.array(vec_2)
    return np.linalg.norm(vec_1 - vec_2)

In [106]:
# Tests

dist = euclidean_distance(text_a, text_b)
print(dist)
assert(abs(dist - 1.4142135) < 0.0001)

1.4142135623730951


## LCS - Longest Common Subsequence

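The longest sequence of elements that appears in both input sequences in the same relative order. The elements do not have to be adjacent, which is what distinguishes it from the longest common substring (a contiguous run).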

In [107]:
from typing import Any, Sequence

def lcs(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    m = len(seq_a)
    n = len(seq_b)
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i-1] == seq_b[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def word_lcs(text_a: str, text_b: str) -> int:
    seq_a, seq_b = preprocess(text_a).split(), preprocess(text_b).split()
    return lcs(seq_a, seq_b)


In [108]:
# Tests
assert lcs("banana", "ananas") == 5
assert word_lcs(text_a, text_b) == 4
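
A common subsequence and a common substring can have different lengths. The sketch below compares the two on a small example; `subsequence_length` and `substring_length` are hypothetical helpers written for this illustration, each keeping only two rows of the dynamic-programming table.

```python
from typing import Any, Sequence


def subsequence_length(a: Sequence[Any], b: Sequence[Any]) -> int:
    # Longest common subsequence: matching elements may be separated by gaps.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def substring_length(a: Sequence[Any], b: Sequence[Any]) -> int:
    # Longest common substring: a mismatch resets the current run to zero.
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


print(subsequence_length("ABCBDAB", "BDCABA"))  # 4, e.g. "BCBA"
print(substring_length("ABCBDAB", "BDCABA"))    # 2, e.g. "AB"
```

For "banana" and "ananas" both functions return 5, because the longest common subsequence "anana" happens to be contiguous in both strings.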

## Levenshtein distance

The minimum number of single-element operations needed to turn sequence A into sequence B.

Available operations:

* Replace element
* Remove element
* Add element
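
The three operations can be illustrated by hand on plain Python strings. This is only a sketch: it shows one possible sequence of edits, not a distance computation, and the pair "cat" / "cut" is a made-up example.

```python
# Remove + add: "banana" -> "ananas" in two operations.
word = "banana"
word = word[1:]    # remove element: drop the leading "b", giving "anana"
word = word + "s"  # add element: append "s", giving "ananas"

# Replace: "cat" -> "cut" in one operation.
other = "cat"
other = other[:1] + "u" + other[2:]  # replace element: "a" becomes "u"
```

The first pair needs two operations, which matches the distance of 2 that the notebook's test expects for "banana" and "ananas".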

In [109]:

def levenshtein(seq_a: Sequence[Any], seq_b: Sequence[Any]) -> int:
    m = len(seq_a)
    n = len(seq_b)
    # dp[i][j] is the edit distance between seq_a[:i] and seq_b[:j].
    dp = [[0 for _ in range(n + 1)] for _ in range(m + 1)]

    # Turning a prefix into an empty sequence (or the reverse) takes one
    # removal (or addition) per element.
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                # Matching elements cost nothing.
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j - 1],  # replace
                    dp[i - 1][j],      # remove
                    dp[i][j - 1],      # add
                )
    return dp[-1][-1]


def word_levenshtein(text_a: str, text_b: str) -> int:
    seq_a, seq_b = preprocess(text_a).split(), preprocess(text_b).split()
    return levenshtein(seq_a, seq_b)


In [110]:
# Tests
assert levenshtein("banana", "ananas") == 2
assert word_levenshtein(text_a, text_b) == 7