# Lab: Finding Similar Items
Data Mining 2019/2020 <br> 
Author: Data Mining Teaching Team

**WHAT** This *optional* lab consists of several programming and insight exercises/questions. 
These exercises are meant to let you practice with the theory covered in [Chapter 3][2] of "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Talk to other students or use [Mattermost][1]
to discuss questions with your peers. For additional questions and feedback, please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.

[1]: https://mattermost.ewi.tudelft.nl/signup_user_complete/?id=ccffzw3cdjrkxkksq79qbxww7a
[2]: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

**SUMMARY** <br>
In these exercises, you will create algorithms for finding similar items in a dataset. 
* Exercise 1: Shingles    
* Exercise 2: MinHashing
* Exercise 3: Locality Sensitive Hashing


## Exercise 1: Shingles

### Step 1: implement shingleString

First we will implement the shingleString function. This function takes a string and the size parameter k as arguments, cuts the string up into shingles of size k, and returns a set of shingles.

For example, if the input string is "abcdabd", the resulting ShingleSet with k = 2 should be: {"ab", "bc", "cd", "da", "bd"}. Note that "ab" occurs twice in the string but only once in the set.

Implement this function and verify that it works as intended.
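The sliding-window idea behind shingling can be sketched as follows. This is only one possible approach, and the helper name `sliding_shingles` is hypothetical (it is not part of the lab skeleton); you could use it to sanity-check your own `shingleString`.

```python
def sliding_shingles(text, k):
    """Illustrative sketch: collect every length-k substring of text in a set."""
    # A string of length n has n - k + 1 windows of length k (none if n < k).
    return {text[i:i + k] for i in range(len(text) - k + 1)}


example = sliding_shingles("abcdabd", 2)
print(sorted(example))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

Because the result is a set, the repeated shingle "ab" is stored only once, and a string shorter than k yields an empty set.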

In [None]:
def shingleString(string, k):
    """
    This function takes as argument some string and cuts it up in shingles of size k.
    input ("abcdabd", 2) -> {'ab', 'bc', 'cd', 'da', 'bd'}
    :param string: The input string
    :param k: The size of the shingles
    :return: A Set of Shingles with size k
    """    
    shingles = set()
    
    # Start coding here!
    
    return shingles

shingleString("abcdabd", 2)

### Question 

What would the ShingleSet be with k set to 4? Will the size of the ShingleSet increase or decrease?

### Step 2: implement JaccardDistance

Next we will be implementing the jaccardDistance function. This function takes as input two sets and computes the distance between them. Remember that the Jaccard distance can be calculated as follows: 

# <center> $d(A, B) = 1 - \frac{| A \cap B|}{|A \cup B|}$ </center>
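As a small worked instance of the formula, Python's built-in set operators `&` (intersection) and `|` (union) can be applied to two toy sets. The sets `a` and `b` below are made up for illustration only.

```python
a = {"ab", "bc", "cd"}
b = {"bc", "cd", "de"}

intersection = a & b   # {"bc", "cd"}
union = a | b          # {"ab", "bc", "cd", "de"}

# d(A, B) = 1 - |A ∩ B| / |A ∪ B| = 1 - 2/4
distance = 1 - len(intersection) / len(union)
print(distance)  # 0.5
```

Note that the division fails when both sets are empty, so an implementation has to decide how to treat that case.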



In [None]:
def jaccardDistance(a, b):
    """
    This function takes as input two sets and computes the distance between them -> 1 - length(intersection)/length(union)
    :param a: The first set
    :param b: The second set to compare
    :return: The (Jaccard) distance between set 'a' and 'b' (0 <= distance <= 1)
    """    
    
    distance = -1.0
    
    # Start coding here!
    
    return distance

jaccardDistance({"ab"}, {"ab"})

### Step 3: apply shingleString and JaccardDistance

Create two separate ShingleSets with k set to 5 (using shingleString from step 1) from the following strings:  

<center> "The plane was ready for touch down"</center> 

<center> "The quarterback scored a touchdown"</center>

Are these sentences very similar? Do you expect that the Jaccard distance between these two sentences will be large or small? <br>
Calculate the Jaccard distance between these two sets using the function implemented in step 2.

In [None]:
s1 = "The plane was ready for touch down"
s2 = "The quarterback scored a touchdown"

def jaccardDistanceUnstripped(s1, s2):
    """
    This function calculates the jaccard distance between two strings.
    :param s1: The first string
    :param s2: The second string to compare
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """   
    # Start coding here!
    
    return -1.0
    
jaccardDistanceUnstripped(s1, s2)

### Question

The Jaccard distance you calculated for the above sentences should be approximately 0.966.
What would happen if we lower k to 1? Would it increase or decrease the distance between the two sets? Which k would be appropriate for these two sentences?

### Step 4: remove whitespace

Both sentences from step 3 contain whitespace, which does not appear to contribute much to the actual meaning of the sentences. One option is to strip all whitespace from the sentences before cutting them into shingles. Create a function that removes all whitespace from the strings before creating any shingles, and calculate the Jaccard distance again.

In [None]:
def jaccardDistanceStripped(s1, s2):
    """
    This function computes the Jaccard distance between the ShingleSets of two strings after removing all whitespace from the strings
    :param s1: The first string
    :param s2: The second string to compare
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """  
    # Start coding here!
    
    return -1.0

jaccardDistanceStripped(s1, s2)

### Question 

Did the jaccard distance between the two sets increase or decrease? Why is that?

## Exercise 2: MinHashing

In this exercise you will be creating a minhashing signature matrix. 

### Step 1
First, for this exercise you are given four ShingleSets, s1-s4, with k set to 1.

In [None]:
s1 = {"a","b"}
s2 = {"a","c"}
s3 = {"d", "c"}
s4 = {"g", "b", "a"}

# Init Shingle sets
sets = [s1,s2,s3,s4]

Second, create a function that hashes an integer $x$ given the parameters $\alpha$ and $\beta$. The function should compute:

<center> $h(x) = ((x \cdot \alpha) + \beta) \bmod n$ </center>

where $x$ is an integer and $n$ is the number of unique shingles over all sets. <br>
For example, with $\alpha = \beta = 1$, $x = 3$ and $n = 2$ you should get $h(x) = 0$.
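The formula can be tried out on a few values with a small standalone function. The name `linear_hash` and the values of n below are assumptions chosen for illustration; the lab itself expects the logic inside `HashFunction.hashf`.

```python
def linear_hash(x, alpha, beta, n):
    """Illustrative sketch of h(x) = (x * alpha + beta) mod n."""
    return (x * alpha + beta) % n


# With alpha = beta = 1 and n = 2, x = 3 gives (3 + 1) mod 2 = 0.
print(linear_hash(3, 1, 1, 2))  # 0

# Hashing all row indices 0..n-1 shows how the rows get reordered.
print([linear_hash(x, 1, 1, 5) for x in range(5)])  # [1, 2, 3, 4, 0]
print([linear_hash(x, 3, 1, 5) for x in range(5)])  # [1, 4, 2, 0, 3]

# When alpha and n share a factor, several rows collide on the same value.
print([linear_hash(x, 2, 1, 4) for x in range(4)])  # [1, 3, 1, 3]
```

The first two lists are permutations of 0..4, which is what minhashing relies on; the last one is not, because 2 and 4 are not coprime.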

In [None]:
class HashFunction:
    """
    This HashFunction class can be used to create a hash function from a given alpha and beta
    """
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def hashf(self, x, n):
        """
        Returns a hash given an integer x and n
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        # Replace this with your implementation!
        hash_value = -1
        
        return hash_value

# Assume alpha and beta equal 1
h1 = HashFunction(1,1)

# Solve 
h1.hashf(3, 2)       

### Question 
To gain some insight into computing minhash signature matrices, compute the MinHash
signature of the ShingleSets given above by hand, using the hash functions $h_1$ and $h_2$ ($h_2$ is defined in Step 2 below as `HashFunction(3,1)`).
Refer to the slides or study material if you forgot how to do this.

### Step 2.

Next we are going to create the `computeSignature` function, which builds the minhash signature matrix from our sets `s1`-`s4` using our hash functions $h_1$ and $h_2$. You can use the pseudocode below as a guide.
  
``` python
foreach shingle x in the shingle space do 
    foreach ShingleSet S do
        if x ∈ S then
            foreach hash function h do
                signature(h, S) = min(h(x), signature(h, S))
            end
        end
    end
end
```
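The pseudocode above could be translated along the following lines. This is a sketch under stated assumptions, not the expected solution: the name `minhash_signature` is hypothetical, the sets contain row indices 0..4 instead of shingles, and the hash functions are plain callables rather than `HashFunction` objects. The toy sets are not the lab's `s1`-`s4`.

```python
import sys


def minhash_signature(rows, column_sets, hash_funcs):
    """Illustrative sketch: signature[h][s] = min of hash h over the rows in set s."""
    signature = [[sys.maxsize] * len(column_sets) for _ in hash_funcs]
    for x in rows:                                   # every row index in the shingle space
        for s, members in enumerate(column_sets):
            if x in members:                         # only rows where the set has a 1
                for h, func in enumerate(hash_funcs):
                    signature[h][s] = min(func(x), signature[h][s])
    return signature


n = 5
toy_sets = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]
toy_hashes = [lambda x: (x + 1) % n, lambda x: (3 * x + 1) % n]
print(minhash_signature(range(n), toy_sets, toy_hashes))
# [[1, 3, 0, 1], [0, 2, 0, 0]]
```

Each row of the result belongs to one hash function and each column to one set, which matches the shape of the `result` array in the skeleton below.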

In [None]:
def shingleSpace(sets):
    """
    Sets up the total shingle space given the list of shingles (sets)
    :param sets: A list of ShingleSets
    :return: The ShingleSpace set
    """
    space = set()
    
    # Start coding here!
    
    return space


# Init List of hash functions
hashes = list()

h1 = HashFunction(1,1)
h2 = HashFunction(3,1)

hashes.append(h1)
hashes.append(h2)

In [None]:
import numpy as np
import sys

space = shingleSpace(sets)
sortedSpace = sorted(space)

def computeSignature(space, hashes, sets):
    """
    This function creates the minhash signature matrix from our sets s1-s4
    using the list of hash functions 'hashes' and the shingle space 'space'
    :param space: The ShingleSpace set
    :param hashes: The list of hash functions
    :param sets: The list of ShingleSets
    :return: The minhash signature matrix (one row per hash function, one column per set)
    """
    
    result = np.full((len(hashes),len(sets)), sys.maxsize)
    
    # Start coding here!
    
    return result

computeSignature(space, hashes, sets)

### Question

Compute the minhash signature matrix using your implemented function. Verify that the result of your implementation is correct by comparing the answers of your program to the answers of your manual calculation.

## Exercise 3: Locality Sensitive Hashing

In this exercise we will use the functions implemented in the previous exercises to compute a Locality-Sensitive Hashing table, using the banding technique for minhash signatures described in the lecture and in the book.

### Step 1.
For this exercise we will need many hash functions. Construct a class that creates a hash function with a random $\alpha$ and $\beta$.

In [None]:
import random

class RandomHashFunction:
    """
    This RandomHashFunction class creates a hash function whose alpha and beta are drawn at random from [1, alpha] and [1, beta]
    """
    def __init__(self, alpha, beta):
        self.alpha = (random.randint(1,alpha))
        self.beta = (random.randint(1,beta))
        
    def hashf(self, x, n):
        """
        Returns a random hash given an integer x and n
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        
        # Replace this with your implementation!
        hash_value = -1

        return hash_value
    

### Step 2.

Now create a function which computes the candidates using the LSH technique given a Minhash table. For this you may use the pseudocode given below.  
  
``` python
# Initialize buckets
foreach band do
    foreach set do
        s = a column segment of length r, for this band and set
        add set to buckets[hash(s),band]
    end
end
```  
   
``` python
# Retrieve candidates
foreach item in buckets[hash(s),band] do
    add [set,item] to the list of candidates
end

```
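The banding idea in the pseudocode can be sketched as below. Several things here are assumptions made for illustration: the name `band_candidates` is hypothetical, the signature is a plain list of rows, and a dictionary keyed by the column segment replaces the fixed number of buckets `bs`, so unrelated segments never share a bucket. The skeleton that follows uses a fixed bucket list instead.

```python
from collections import defaultdict
from itertools import combinations


def band_candidates(signature, r):
    """Illustrative sketch: column pairs that agree on all r rows of at least one band."""
    num_rows = len(signature)
    num_cols = len(signature[0])
    assert num_rows % r == 0, "signature length must be a multiple of r"

    candidates = set()
    for band in range(num_rows // r):
        buckets = defaultdict(list)                  # fresh buckets for every band
        for col in range(num_cols):
            segment = tuple(signature[row][col]
                            for row in range(band * r, (band + 1) * r))
            buckets[segment].append(col)             # the segment itself is the bucket key
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


toy_signature = [
    [1, 1, 4],
    [2, 2, 0],
    [3, 0, 3],
    [5, 1, 5],
]
print(sorted(band_candidates(toy_signature, 2)))  # [(0, 1), (0, 2)]
```

Columns 0 and 1 match in the first band and columns 0 and 2 match in the second, so both pairs become candidates even though no pair agrees on the whole signature.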

In [None]:
def computeCandidates(mhs, bs, r):
    """
    This function computes the candidates using the LSH technique given a Minhash table
    :param mhs: The minhash table
    :param bs: The number of buckets
    :param r: The rows per band
    :return: The list of candidates
    """
    
    assert(mhs.shape[0] % r == 0)
    b = mhs.shape[0] // r  # number of bands (integer division)
    result = set()
    buckets = list()
  
    for i in range(bs):
        buckets.append(list())

    # Initialize the buckets
    for i in range(int(b)):
        for j in range(mhs.shape[1]):
            # Take segment from mhs column
            colSegment = mhs[i*r:(i+1)*r,[j]]
            
            # Convert column segment to string
            s = np.array2string(colSegment.flatten(), separator = '')
            s = s[1:len(s)-1]
            
            # Init bucket list item
            item = list()
            
            # Append string (s) to the bucket list (buckets)
            
            # Start coding here!
    
    
    # Retrieve the candidates
    for item in buckets:   
        item = set(item)
        
        # Start coding here!
        
    for x in result:
        jd = jaccardDistance(sets[x[0]], sets[x[1]])
        if jd < 0.5:
            print("ShingleSets: ", x, "within tolerance   jaccard distance: ", jd)
        else:
            print("ShingleSets: ", x, "not within tolerance   jaccard distance: ", jd)
    return result

### Question 
An important implementation issue is that you should keep separate lists of buckets for each band. This means that the algorithm will work suboptimally if you index the buckets only as buckets[hash(s)] instead of buckets[hash(s),band]. Why is this the case?  

### Step 3. 
As before, compute the minhash signature matrix, this time using 100 random hash functions. Use 10000 buckets and 5 rows per band.

In [None]:
# Init list for the 100 random hashes
rhashes = [RandomHashFunction(100,100) for x in range(100)]

# Calculate Minhash Table
mhs = computeSignature(space, rhashes, sets)

# Apply Locality Sensitive Hashing to find candidate pairs
computeCandidates(mhs, 10000, 5)

### Question
When you run your code multiple times you will notice that you sometimes get different candidates. Why?  

### Question 
Run your code 10 times. Note which candidates get suggested and how many times each candidate gets suggested. How does this relate to the Jaccard distance between the two
sets of the candidate pair (not in terms of formulas, just an indication)? To check this, compute the Jaccard distance between all possible pairs of ShingleSets and compare these distances to the frequencies (how many times a pair gets suggested as a candidate) to verify your idea.

### Question
Why (or when) would you use this algorithm?

### Question 
What happens if the number of buckets is too small? For example what would happen
if we only use 10 buckets?  

### Question 
What is the effect of the number of rows per band? If we set the number of rows per band to 1, what will happen? And if you set the number of rows per band to the length of the
signature?  
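
To explore this question numerically, you could use the standard banding analysis from Chapter 3 of the book: two sets with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, where r is the number of rows per band and b the number of bands. The sketch below assumes 100 signature rows and independent hash functions, and the name `candidate_probability` is hypothetical.

```python
def candidate_probability(s, r, b):
    """Probability that two columns with similarity s agree in at least one of b bands of r rows."""
    return 1 - (1 - s ** r) ** b


# Three ways of splitting 100 signature rows into bands.
for r, b in [(1, 100), (5, 20), (100, 1)]:
    probs = [round(candidate_probability(s, r, b), 3) for s in (0.2, 0.5, 0.8)]
    print(f"r={r:3d}, b={b:3d}: {probs}")
```

Comparing the three printed rows gives an indication of how the choice of r shifts the similarity threshold at which pairs start to be reported.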