# LSH for Minhash Signatures: Banding Technique

In this section, we discuss the more traditional approach to LSH which follows the workflow of ***shingling $\rightarrow$ minhashing $\rightarrow$ banding*** (*the actual LSH step*).

Recall: We can express documents as *k*-shingles (or whichever token we choose) and then minhash them to obtain signatures. We arrange these signatures as the columns of a matrix, where each row corresponds to one element of a signature. Also, suppose that the signatures all have the same length (i.e., number of rows) thanks to the prior 'preprocessing'.

We obtain a structure similar to what is shown below:

<img src='band_structure.PNG'></img>
</br>
<center><b>Figure 1. Segmented signature matrix. Four segments/bands with three rows per segment.</b></br>Taken from Leskovec et al. <a href="http://www.mmds.org/"><i>Mining Massive Datasets</i></a> (2020)</center>

Notice that, unlike the signature matrices discussed in Minhashing, this signature matrix is now segmented into bands of three (3) rows each. Why do this?

Let's break it down. If we simply apply a hash function to full signatures, then most likely only completely identical signatures will land in the same bucket. We lose pairs that are similar in some segments of their signatures but not identical, and those are exactly the candidate pairs we want. *Note: We end up discarding similar but not identical documents.* This is another compelling reason for the banded LSH approach.
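To make the problem concrete, here is a minimal sketch of bucketing on full signatures. The three signatures are made up for illustration, and a Python dictionary keyed on the whole tuple stands in for a hash table; neither comes from the book.

```python
from collections import defaultdict

# Hypothetical signatures: doc_b differs from doc_a in one row only,
# while doc_c is an exact copy of doc_a.
signatures = {
    "doc_a": (1, 0, 0, 3, 2, 1),
    "doc_b": (1, 0, 0, 3, 2, 2),
    "doc_c": (1, 0, 0, 3, 2, 1),
}

# Bucket every document by its entire signature.
buckets = defaultdict(list)
for name, sig in signatures.items():
    buckets[sig].append(name)

# Only buckets holding more than one document yield pairs.
shared = [names for names in buckets.values() if len(names) > 1]
print(shared)  # [['doc_a', 'doc_c']]
```

Even though `doc_b` agrees with `doc_a` on five of six rows, it never shares a bucket with it; only the exact duplicate does.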

A natural remedy is to "hash" the items several times using different hash functions, banking on the idea that similar items are more likely to be hashed to the same bucket, while dissimilar items are not. The book's term for items hashed into the same bucket is *candidate pairs*. Narrowing down the search? Voila! We only compare *candidate pairs* instead of all $\binom{n}{2}$ pairs.
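To get a feel for how large $\binom{n}{2}$ becomes, the short sketch below counts all pairs for an assumed collection of one million documents (the collection size is an arbitrary choice for illustration).

```python
import math

n_documents = 1_000_000  # assumed collection size, for illustration only

# Number of distinct unordered pairs: n * (n - 1) / 2
n_pairs = math.comb(n_documents, 2)
print(n_pairs)  # 499999500000
```

That is roughly half a trillion comparisons if every pair were checked, which is why restricting the work to candidate pairs matters.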

Alternatively, for minhashed signatures like the one shown in Figure 1, hashing can be applied per band/segment. The hash function can either vary per band/segment or be the same for all of them. In effect, multiple hashing and/or segmented hashing addresses the problem of catching only identical, but not similar, items.

Each band of a signature is then hashed, with a separate hash table per band. Candidate pairs are those items hashed to the same bucket in at least one band. See Figure 2 for the underlying mechanics.


<img src='bandhash_mechanics.PNG' height=500 width=700></img>
</br>
<center><b>Figure 2. Underlying mechanics of the "Band Hashing" method for two banded signature sets $a$ and $b$.</b> </br>Taken from James Briggs' <a href="https://www.pinecone.io/learn/locality-sensitive-hashing/"><i>Locality Sensitive Hashing (LSH): The Illustrated Guide</i></a> © Pinecone Systems, Inc.</center>
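
The band hashing described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the book's implementation: the function name `candidate_pairs`, the toy signatures, and the use of a dictionary keyed on each band's tuple as the per-band hash table are all assumptions made here. It also assumes every signature has the same length and that this length is divisible by the number of bands.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, n_bands):
    """Return pairs of items whose signatures agree on at least one whole band."""
    sig_len = len(next(iter(signatures.values())))
    if sig_len % n_bands != 0:
        raise ValueError("signature length must be divisible by n_bands")
    rows = sig_len // n_bands

    candidates = set()
    for b in range(n_bands):
        # A fresh table per band, so equal values in different bands never mix.
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(name)
        # Every pair of items sharing a bucket becomes a candidate pair.
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates

# Toy signatures with 12 rows, split into 4 bands of 3 rows as in Figure 1.
signatures = {
    "a": [1, 0, 0, 3, 2, 1, 0, 1, 3, 2, 2, 1],
    "b": [1, 0, 0, 3, 2, 2, 0, 1, 3, 0, 2, 1],  # matches "a" in bands 0 and 2
    "c": [2, 1, 3, 0, 0, 1, 1, 2, 0, 3, 1, 0],  # matches nobody in any band
}
print(candidate_pairs(signatures, n_bands=4))  # {('a', 'b')}
```

Items `a` and `b` are not identical, yet they become a candidate pair because two of their bands match exactly, while `c` is never compared against either of them.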