# LSH Banding Technique

In this section, we discuss the more traditional approach to LSH which follows the workflow of ***shingling $\rightarrow$ minhashing $\rightarrow$ banding*** (*the actual LSH step*).

Recall: we can express documents as *k*-shingles (or whichever token we choose) and then apply minhashing to obtain signatures. We arrange these signatures as the columns of a matrix, where each row corresponds to one element of a signature. Also, suppose that all signatures now have the same length (i.e., number of rows) as a result of the earlier preprocessing.

We obtain a structure similar to what is shown below:

```{figure} ./images/band_structure.PNG
:name: band-structure

Segmented signature matrix. Four segments/bands with three rows per segment. Taken from {cite:ps}`rajaraman2011mining`.
```

Notice that, unlike the signature matrices discussed in Minhashing, this signature matrix is now segmented into bands of three (3) rows each. Why do this?

Let's break it down. If we simply apply a hash function to the full signatures, then most likely only completely identical signatures will land in the same bucket. We lose the pairs whose signatures agree in some segments but not everywhere, which are exactly the pairs we would want as candidate pairs. *Note: We end up discarding similar but not identical documents.* This is a compelling reason for the banded LSH approach.
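As a small illustration of this point, the sketch below places two signatures into a single hash table keyed by the whole signature. The signature values are made up for illustration, and a Python `dict` stands in for the hash table; neither comes from the original text.

```python
# Two hypothetical 12-row minhash signatures (values made up for illustration).
sig_a = (1, 5, 3, 7, 2, 9, 4, 4, 8, 6, 0, 2)
sig_b = (1, 5, 3, 7, 2, 8, 4, 4, 1, 6, 0, 3)

# Fraction of rows on which the two signatures agree; this estimates
# the Jaccard similarity of the underlying documents.
agreement = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# One hash table keyed by the whole signature (a dict stands in for it).
buckets = {}
for name, sig in (("A", sig_a), ("B", sig_b)):
    buckets.setdefault(sig, []).append(name)

print(agreement)     # 0.75
print(len(buckets))  # 2: A and B sit in separate buckets
```

Even though the signatures agree on 9 of 12 rows, the three mismatching rows are enough to send them to different buckets.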

A natural remedy is to hash the items several times using different hash functions, banking on the idea that similar items are more likely to be hashed to the same bucket than dissimilar items are. The book's term for items hashed into the same bucket is *candidate pairs*. Narrowing down the search? Voila! We only need to check the *candidate pairs* instead of all $\binom{n}{2}$ pairs.

Alternatively, for minhashed signatures like the ones shown in {numref}`band-structure`, hashing can be applied per band/segment. The hash function can either vary per band/segment or be the same for all of them. In effect, multiple hashing and/or segmented hashing addresses the problem of catching only identical, but not similar, items.

Each band of the signatures is then hashed, with a separate hash table for each band. Candidate pairs are the pairs that are hashed to the same bucket in at least one band. See {numref}`bandhash-mechanics` for the underlying mechanics.

```{figure} ./images/bandhash_mechanics.PNG
:width: 700px
:name: bandhash-mechanics

Underlying mechanics of the "Band Hashing" method for two banded signatures A and B. In this figure, A and B are considered a candidate pair because their 3rd bands are hashed to the same bucket. Here only one hash function is used, but each band gets its own set of hash buckets.
```
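The mechanics in the figure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a reference implementation: the function name `candidate_pairs`, the document ids, and the signature values are hypothetical, and each band's tuple of rows is used directly as the bucket key (a `dict` per band plays the role of that band's hash table).

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, bands):
    """Return pairs of document ids that share a bucket in at least one band.

    `signatures` maps a document id to its minhash signature (sequences of
    equal length); `bands` must divide that length evenly.
    """
    n_rows = len(next(iter(signatures.values())))
    if n_rows % bands != 0:
        raise ValueError("number of bands must divide the signature length")
    rows_per_band = n_rows // bands

    candidates = set()
    for band in range(bands):
        start = band * rows_per_band
        buckets = defaultdict(list)  # a fresh hash table for every band
        for doc_id, sig in signatures.items():
            # The band's rows, as a tuple, act as the bucket key.
            key = tuple(sig[start:start + rows_per_band])
            buckets[key].append(doc_id)
        # Every pair of documents sharing a bucket becomes a candidate pair.
        for doc_ids in buckets.values():
            candidates.update(combinations(sorted(doc_ids), 2))
    return candidates


# Hypothetical 12-row signatures, split into 4 bands of 3 rows each.
signatures = {
    "A": (1, 5, 3, 7, 2, 9, 4, 4, 8, 6, 0, 2),
    "B": (2, 6, 3, 7, 1, 8, 4, 4, 8, 5, 0, 3),
    "C": (9, 9, 1, 0, 3, 3, 7, 2, 5, 1, 1, 4),
}
pairs = candidate_pairs(signatures, bands=4)
print(pairs)  # {('A', 'B')}
```

Here A and B agree only on their 3rd band, which is enough to make them a candidate pair. With `bands=1` (hashing the full signature) the same call returns an empty set.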

We follow a more tolerant approach in which any pair is classified as a *candidate pair* as long as it is hashed to the same bucket in any of the bands formed. The number of bands, and the resulting number of rows in each band, therefore act as tuning parameters. For a fixed signature length, more bands (and thus fewer rows per band) increase the probability that any pair, even one with low similarity, is tagged as a *candidate pair*. In other words, the number of bands determines an approximate similarity threshold, as shown below: document pairs with pairwise similarities above the threshold (dashed line) are likely to be tagged as *candidate pairs*.
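The relationship between the number of bands and the threshold can be worked out as follows. This derivation assumes the standard setting: a signature of length $m$ is split into $b$ bands of $r$ rows each ($m = br$), and two documents with Jaccard similarity $s$ agree on any single minhash row with probability $s$.

$$
\begin{aligned}
P(\text{all } r \text{ rows of one band agree}) &= s^{r} \\
P(\text{one band disagrees in at least one row}) &= 1 - s^{r} \\
P(\text{no band agrees entirely}) &= (1 - s^{r})^{b} \\
P(\text{candidate pair}) &= 1 - (1 - s^{r})^{b}
\end{aligned}
$$

The steepest rise of this S-curve sits at approximately $s \approx (1/b)^{1/r}$, which serves as the similarity threshold. For example, with $b = 20$ and $r = 5$ the threshold is about $0.55$, and a pair with $s = 0.8$ becomes a candidate pair with probability $1 - (1 - 0.8^{5})^{20} \approx 0.9996$.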

In [1]:
import matplotlib.pyplot as plt
from ipywidgets import widgets, interact
import numpy as np
%matplotlib inline

def _get_factors(x):
    factors = []
    for i in range(1, x + 1):
        if x % i == 0:
            factors.append(i)  # collect the divisor itself, not x
    return factors

def _get_valinit(factors):
    index = len(factors) // 2 
    return factors[index]

def _prob_of_s(s, b, r):
    """Return the probability that a pair with similarity s becomes a candidate pair, given b bands of r rows"""
    return 1 - (1 - s**r)**b

def _get_approx_thresh(b, r):
    """Return approximate similarity threshold for chosen b and r"""
    thresh = (1/b) ** (1/r)
    return thresh

# plotting
init_m = 100
init_b_value = init_m // 2

# fig, ax = plt.subplots(figsize=(10, 5))
style = {'description_width': 'initial',
         'handle_color' : 'deeppink',
         'font_weight': 'bold',
         'text_color' : 'darkslategray'}

b_widget = widgets.IntSlider(
    description='Number of Bands (b)',
    value=init_b_value,
    min=1,
    max=init_m,
    continuous_update=True,
    orientation='horizontal',
    readout=True,
    disabled=False,
    layout=widgets.Layout(width='50%', height='80px'),
    style=style
)

size_widget = widgets.IntText(
    value=init_m,
    description='Signature\nSize (m)',
    disabled=False,
    continuous_update=True,
    layout=widgets.Layout(width='25%', height='80px'),
    style=style
)


def update_b_range(*args):
    # Raise the upper bound first so the new value is not clamped to the old max
    b_widget.max = size_widget.value
    b_widget.value = _get_valinit(_get_factors(size_widget.value))
    
size_widget.observe(update_b_range, 'value')

def plotter(b, m):
    """Interactive plotter for s-curve"""
    b = int(b)
    s_list = np.linspace(0, 1, num=100)
    r = m / b
    p_list = np.array([_prob_of_s(s, b, r) for s in s_list])
  
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(s_list, p_list, lw=3, c='deeppink')
    thresh = _get_approx_thresh(b, r)
    
    plt.axvline(thresh, color='black', linestyle='--',
               label=f'Similarity Threshold: {thresh:.2f}')
    # Hide the right and top spines
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    
    # set spines lw
    ax.spines['left'].set_linewidth(3)
    ax.spines['bottom'].set_linewidth(3)
    
    plt.title('Probability of becoming a candidate given a similarity\nThe S-curve',
                 fontsize=15)
    plt.ylabel('Probability', fontsize=13)
    plt.xlabel('Jaccard Similarity of Documents', fontsize=13)
    plt.legend(fontsize=11)


interact(plotter, b=b_widget, m=size_widget)
plt.show()
