# Scalable Jaccard neighbour finding

## Jaccard clustering

A standard technique for handling document similarity or clustering is to tokenize the documents and then cluster on each document's tokens. At small scale, this is easy to get running in a notebook environment.

A tokenizer maps each document to a set of tokens:

$ t : d_i \rightarrow \{t(d_i)_j\}_{j=1}^{n_i} = \{t_{ij}\}_{j=1}^{n_i}$

A tokenizer can be as elaborate or as simple as one wishes. Perhaps the simplest is splitting a document into words on whitespace.

In [2]:
docs = ["I am a fish", "a fish swims"]
tokenized_docs = [x.split(" ") for x in docs]
print(tokenized_docs)

[['I', 'am', 'a', 'fish'], ['a', 'fish', 'swims']]
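
As an illustration of a slightly more elaborate tokenizer, the sketch below lowercases the text and drops punctuation before splitting into words. The name `simple_tokenizer` and the sample sentences are hypothetical and only serve the example; real tokenizers may make very different choices.

```python
import re


def simple_tokenizer(doc):
    """Lowercase the document and return its word tokens."""
    # \w+ matches runs of letters, digits and underscores,
    # so whitespace and punctuation are discarded.
    return re.findall(r"\w+", doc.lower())


# Hypothetical sample documents with capitals and punctuation.
example_docs = ["I am a fish.", "A fish swims!"]
example_tokens = [simple_tokenizer(d) for d in example_docs]
print(example_tokens)
# [['i', 'am', 'a', 'fish'], ['a', 'fish', 'swims']]
```

With this normalization, "A" and "a" count as the same token, which a plain `split(" ")` would treat as different.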


Using the tokens, one can estimate how similar two documents are. A typical measure is the Jaccard similarity: for a pair of documents, the number of distinct tokens they have in common divided by the number of distinct tokens that appear in either document:

$ J_{sim}(d_x, d_y) = \frac{\left| \{t_{xj}\}_{j=1}^{n_x} \cap \{t_{yj}\}_{j=1}^{n_y} \right|}{\left| \{t_{xj}\}_{j=1}^{n_x} \cup \{t_{yj}\}_{j=1}^{n_y} \right|} $

In [3]:
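def jaccard_sim(d_x, d_y):
    # Size of the token intersection over size of the token union.
    # Raises ZeroDivisionError if both documents have no tokens.
    return len(set(d_x) & set(d_y)) / len(set(d_x) | set(d_y))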

For identical token sets, the Jaccard similarity is 1. For sets with no tokens in common, it is 0. For sets that share some but not all of their tokens, it lies strictly between 0 and 1.

In [6]:
jaccard_sim(tokenized_docs[0], tokenized_docs[0])

1.0

In [7]:
jaccard_sim(tokenized_docs[0], tokenized_docs[1])

0.4
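
To see where this value comes from, the two example documents share two tokens out of five distinct tokens overall:

$ J_{sim}(d_1, d_2) = \frac{\left| \{\textrm{a}, \textrm{fish}\} \right|}{\left| \{\textrm{I}, \textrm{am}, \textrm{a}, \textrm{fish}, \textrm{swims}\} \right|} = \frac{2}{5} = 0.4 $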

From the similarity, one can define a distance, the Jaccard distance, by

$J_{dist}(d_x, d_y) = 1 - J_{sim}(d_x, d_y)$.
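
A minimal sketch of this distance in code follows, together with a brute-force spot check on a few token lists. The helper name `jaccard_dist` and the third sample document are assumptions made for illustration, and `jaccard_sim` is repeated so the block runs on its own. Passing the checks on a handful of examples is a sanity test, not a proof.

```python
from itertools import product


def jaccard_sim(d_x, d_y):
    return len(set(d_x) & set(d_y)) / len(set(d_x) | set(d_y))


def jaccard_dist(d_x, d_y):
    """Jaccard distance: one minus the Jaccard similarity."""
    return 1 - jaccard_sim(d_x, d_y)


# Small hypothetical sample of tokenized documents.
samples = [
    ["I", "am", "a", "fish"],
    ["a", "fish", "swims"],
    ["a", "bird", "flies"],
]

# Check the distance properties over every ordered triple of samples.
for x, y, z in product(samples, repeat=3):
    assert jaccard_dist(x, x) == 0                    # zero distance to itself
    if set(x) != set(y):
        assert jaccard_dist(x, y) > 0                 # positive for distinct sets
    assert jaccard_dist(x, y) == jaccard_dist(y, x)   # symmetry
    # Triangle inequality, with a small tolerance for floating-point rounding.
    assert jaccard_dist(x, z) <= jaccard_dist(x, y) + jaccard_dist(y, z) + 1e-12
```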

This distance function is a proper metric on token sets [[1]](https://en.wikipedia.org/wiki/Metric_space), i.e. it satisfies the metric axioms:
- $ J_{dist}(d_x, d_x) = 0 $
- $ J_{dist}(d_x, d_y) > 0 \textrm{  for  } d_x \neq d_y $
- $ J_{dist}(d_x, d_y) = J_{dist}(d_y, d_x) $
- $ J_{dist}(d_x, d_z) \leq J_{dist}(d_x, d_y) + J_{dist}(d_y, d_z) $

which means that the distance can legitimately be used in clustering. One can apply the usual techniques, such as hierarchical clustering. In particular, for simple clustering, one can find a document's neighbours by thresholding on the distance, defining the neighbours of a document as the other documents within a threshold distance $T$ of that document:

$ \textrm{neighbours}(d_i) = \{d_j | J_{dist}(d_i, d_j) < T \}$ 

In [8]:
def neighbours(this_doc, all_docs, threshold, tokenizer=lambda x: x.split(" ")):
    # Tokenize the query document once, then compare it against every candidate.
    this_tokens = tokenizer(this_doc)
    # Keep each document whose Jaccard distance to the query is below the threshold.
    return [
        doc for doc in all_docs
        if 1 - jaccard_sim(tokenizer(doc), this_tokens) < threshold
    ]

In [10]:
neighbours("I a am fish", docs, 0.1)

['I am a fish']

In [11]:
neighbours("I a am fish", docs, 0.7)

['I am a fish', 'a fish swims']

The simple implementations above work well for small datasets. For large datasets, however, computing all pairwise distances, or finding the neighbours of every document, takes $O(N^2)$ comparisons, where $N$ is the number of documents. Depending on how the results are stored, storage can also grow as $O(N^2)$.
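
To make the quadratic growth concrete, the sketch below computes every pairwise distance for a tiny sample and then counts how many unordered pairs larger collections would need. The function name `all_pairwise_distances` and the sample data are hypothetical; this is a brute-force illustration of the cost, not a suggested approach for large $N$.

```python
from itertools import combinations
from math import comb


def all_pairwise_distances(tokenized_docs):
    """Return {(i, j): Jaccard distance} for every unordered pair i < j."""
    token_sets = [set(tokens) for tokens in tokenized_docs]
    return {
        (i, j): 1 - len(token_sets[i] & token_sets[j]) / len(token_sets[i] | token_sets[j])
        for i, j in combinations(range(len(token_sets)), 2)
    }


# Hypothetical tokenized sample with N = 3 documents.
tokenized = [
    ["I", "am", "a", "fish"],
    ["a", "fish", "swims"],
    ["a", "bird", "flies"],
]
distances = all_pairwise_distances(tokenized)
print(len(distances))  # 3 pairs for N = 3

# The number of unordered pairs is N * (N - 1) / 2.
for n in (1_000, 1_000_000):
    print(n, comb(n, 2))
# 1000 499500
# 1000000 499999500000
```

Going from a thousand to a million documents multiplies the number of comparisons by roughly a million, and storing every distance scales the same way.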

## References

[1] Wikipedia, "Metric space", https://en.wikipedia.org/wiki/Metric_space