# Lab: Finding Similar Items
Data Mining 2019/2020 <br> 
Author: Data Mining Teaching Team

**WHAT** This *optional* lab consists of several programming and insight exercises/questions. 
These exercises are meant to let you practice with the theory covered in: [Chapter 3][2] from "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, J. D. Ullman. <br>

**WHY** Practicing, both through programming and answering the insight questions, aims at deepening your knowledge and preparing you for the exam. <br>

**HOW** Follow the exercises in this notebook either on your own or with a friend. Talk to other students or use [Mattermost][1]
to discuss questions with your peers. For additional questions and feedback please consult the TAs at the assigned lab session. The answers to these exercises will not be provided.

[1]: https://mattermost.ewi.tudelft.nl/signup_user_complete/?id=ccffzw3cdjrkxkksq79qbxww7a
[2]: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf

**SUMMARY** <br>
In these exercises, you will create algorithms for finding similar items in a dataset. 
* Exercise 1: Shingles    
* Exercise 2: MinHashing
* Exercise 3: Locality Sensitive Hashing


## Exercise 1: Shingles

### Step 1: implement shingleString

First we will implement the shingleString function. This function takes as arguments a string and the size parameter k, cuts the string up into shingles of size k, and returns a set of shingles. 

For example, if the input string is "abcdabd", the resulting ShingleSet with a k of 2 should be: {"ab", "bc", "cd", "da", "bd"}

Implement this function and verify that it works as intended.

In [15]:
def shingleString(string, k):
    """
    This function takes as argument some string and cuts it up in shingles of size k.
    input ("abcdabd", 2) -> {'ab', 'bc', 'cd', 'da', 'bd'}
    :param string: The input string
    :param k: The size of the shingles
    :return: A Set of Shingles with size k
    """    
    shingles = set()
    # Start coding here!
    for i, _ in enumerate(string):
        if i == len(string)-k+1:
            break
        curr = string[i:i+k]
#         alternatively
#         curr = ""
#         for j in range(k):
#             curr += string[i+j]
        shingles.add(curr)
    
    return shingles

shingleString("abcdabd", 2)

{'ab', 'bc', 'bd', 'cd', 'da'}

### Question 

What would be the output of the ShingleSet with k set to 4? Will the size of the ShingleSet increase or decrease? 

In [16]:
shingleString("abcdabd", 4)
# As k increases, the number of shingles in the set decreases (here from 5 to 4)

{'abcd', 'bcda', 'cdab', 'dabd'}

### Step 2: implement JaccardDistance

Next we will be implementing the jaccardDistance function. This function takes as input two sets and computes the distance between them. Remember that the Jaccard distance can be calculated as follows: 

<center> $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$ </center>



In [17]:
def jaccardDistance(a, b):
    """
    This function takes as input two sets and computes the distance between them -> 1 - length(intersection)/length(union)
    :param a: The first set
    :param b: The second set to compare
    :return: The (Jaccard) distance between set 'a' and 'b' (0 <= distance <= 1)
    """    
    
    distance = -1.0
    
    # Start coding here!
    intersection = a.intersection(b)
    union = a.union(b)
    jaccard_similarity = float(len(intersection)) / len(union)
    distance = 1 - jaccard_similarity
    
    return distance

print(jaccardDistance({"ab"}, {"ab"}))
print(jaccardDistance({"ab"}, {"abc"}))
print(jaccardDistance({"ab"}, {"abcd"}))

0.0
1.0
1.0
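
The three calls above only cover identical and fully disjoint sets. As an extra check with partial overlap, here is a minimal sketch. The helper name `jaccard_distance_demo` and the two hand-written shingle sets are illustrative assumptions, not part of the lab.

```python
def jaccard_distance_demo(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| (illustrative helper, not the lab function)."""
    return 1 - len(a & b) / len(a | b)

# 2-shingles of "abcd" and "bcde", written out by hand
shingles_a = {"ab", "bc", "cd"}
shingles_b = {"bc", "cd", "de"}

# The intersection {"bc", "cd"} has 2 elements and the union has 4, so d = 1 - 2/4
print(jaccard_distance_demo(shingles_a, shingles_b))  # 0.5
```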


### Step 3: apply shingleString and JaccardDistance

Create two separate ShingleSets with k set to 5 (using shingleString from step 1) from the following strings:  

<center> "The plane was ready for touch down"</center> 

<center> "The quarterback scored a touchdown"</center>

Are these sentences very similar? Do you expect that the Jaccard distance between these two sentences will be large or small? <br>
Calculate the Jaccard distance between these two sets using the function implemented in step 2.

In [18]:
s1 = "The plane was ready for touch down"
s2 = "The quarterback scored a touchdown"

def jaccardDistanceUnstripped(s1, s2, k):
    """
    This function calculates the Jaccard distance between two strings.
    :param s1: The first string
    :param s2: The second string to compare
    :param k: The size of the shingles
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """   
    # Start coding here!
    # q1. The two sentences are not very similar
    # q2. I expect the Jaccard distance to be large
    shingle_1 = shingleString(s1, k)
    shingle_2 = shingleString(s2, k)
    
    return jaccardDistance(shingle_1, shingle_2)
    
jaccardDistanceUnstripped(s1, s2, 5)

0.9655172413793104

### Question

The Jaccard distance you calculated for the above sentences should be approximately 0.97.
What would happen if we lower our k to 1? Would it increase or decrease the distance between the two sets? Which k would be appropriate for these two sentences?  

In [19]:
for i in range(10):
    dist = jaccardDistanceUnstripped(s1, s2, i)
    print("i={}, distance={}".format(i, dist))

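# A lower k decreases the distance, because short shingles are shared more easily.
# Note that k=0 is degenerate: both sets contain only the empty string, so the distance is 0.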

i=0, distance=0.0
i=1, distance=0.33333333333333337
i=2, distance=0.75
i=3, distance=0.8571428571428572
i=4, distance=0.9122807017543859
i=5, distance=0.9655172413793104
i=6, distance=0.9824561403508771
i=7, distance=1.0
i=8, distance=1.0
i=9, distance=1.0


### Step 4: remove whitespaces

Both sentences from step 3 contain whitespace, which appears not to contribute much to the actual meaning of the sentence. An option would be to strip all whitespace from the sentences before cutting them into shingles. Create a function that removes all whitespace from the strings before creating any shingles and calculate the Jaccard distance again.

In [20]:
def jaccardDistanceStripped(s1, s2, k):
    """
    This method computes the Jaccard distance between the ShingleSets of two strings after removing all whitespace
    :param s1: The first string
    :param s2: The second string to compare
    :param k: The size of the shingles
    :return: The (Jaccard) distance between strings 's1' and 's2' (0 <= distance <= 1)
    """  
    # Start coding here!
    
    strip_s1 = s1.replace(" ", "")
    strip_s2 = s2.replace(" ", "")
    
    return jaccardDistanceUnstripped(strip_s1, strip_s2, k)

jaccardDistanceStripped(s1, s2, 5)

0.8888888888888888

### Question 

Did the Jaccard distance between the two sets increase or decrease? Why is that?

In [21]:
# Decreased, because "touch down" and "touchdown" are considered the same word now
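
To see which shingles the two stripped sentences actually share, the following sketch lists the intersection of their 5-shingle sets. The helper `shingle_set` is a hypothetical compact re-implementation used only to keep the example self-contained.

```python
def shingle_set(text, k):
    """Return the set of all k-character shingles of text (illustrative helper)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

# Remove the spaces before shingling, as in step 4
a = "The plane was ready for touch down".replace(" ", "")
b = "The quarterback scored a touchdown".replace(" ", "")

# Shingles that occur in both stripped sentences
shared = shingle_set(a, 5) & shingle_set(b, 5)
print(sorted(shared))
```

The shared shingles are the five that span "touchdown", which only line up once the space in "touch down" is removed.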

## Exercise 2: MinHashing

In this exercise you will be creating a minhashing signature matrix. 

### Step 1
Firstly, for this exercise you are given 4 ShingleSets: s1-s4, with k set to 1.

In [22]:
s1 = {"a","b"}
s2 = {"a","c"}
s3 = {"d", "c"}
s4 = {"g", "b", "a"}

# Init Shingle sets
sets = [s1,s2,s3,s4]

Secondly, create a function which hashes an integer $x$ given an $\alpha$ and $\beta$. The function should hash a value $x$ using the formula:

<center> $h(x) = (x \cdot \alpha + \beta) \bmod n$ </center>

where $x$ is an integer and $n$ is the number of unique shingles of all sets. <br>
For example, for $x$=3 and $n$=2 (with $\alpha$ = $\beta$ = 1) you should get $h(x)$ = 0.

In [23]:
class HashFunction:
    """
    This HashFunction class can be used to create a hash function given an alpha and beta
    """
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def hashf(self, x, n):
        """
        Returns a hash given an integer x and n
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        # Replace this with your implementation!
        hash_value = (x * self.alpha + self.beta) % n
        
        return hash_value

# Assume alpha and beta equal 1
h1 = HashFunction(1,1)

# Solve 
h1.hashf(3, 2)       

0

### Question 
To gain some insight into computing minhash signature matrices, compute the MinHash
signature of the ShingleSets given above by hand, using the hash functions $h_1$ (alpha = 1, beta = 1) and $h_2$ (alpha = 3, beta = 1). Do
this computation by hand! Refer to the slides or study material if you forgot how to do this.  
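
As a starting point for the manual computation, the sketch below tabulates both hash values per shingle. It assumes that each shingle is numbered by its position in sorted order (a = 0, b = 1, c = 2, d = 3, g = 4) and that $n$ = 5; this numbering is an assumption made here, since the lab text does not fix one.

```python
# Assumption: shingles are numbered by their position in sorted order
shingles = ["a", "b", "c", "d", "g"]
n = len(shingles)  # 5 unique shingles over s1-s4

def h1(x):
    return (1 * x + 1) % n  # alpha = 1, beta = 1

def h2(x):
    return (3 * x + 1) % n  # alpha = 3, beta = 1

# One table row per shingle: row index, shingle, h1(row), h2(row)
for row, shingle in enumerate(shingles):
    print(row, shingle, h1(row), h2(row))
```

For each ShingleSet, the signature entry under a hash function is then the smallest hash value among the rows whose shingle is in that set.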

### Step 2.

Next we are going to create the `computeSignature` function, which will create the minhash signature matrix from our sets `s1`-`s4` using our hash functions $h_1$ and $h_2$. You could make use of the pseudocode below.
  
``` python
foreach shingle x in the shingle space do 
    foreach ShingleSet S do
        if x ∈ S then
            foreach hash function h do
                signature(h, S) = min(h(x), signature(h, S))
            end
        end
    end
end
```

In [24]:
def shingleSpace(sets):
    """
    Sets up the total shingle space given the list of shingles (sets)
    :param sets: A list of ShingleSets
    :return: The ShingleSpace set
    """
    space = set()
    # Start coding here!
    for ss in sets:
        for e in ss:
            space.add(e)
    
    return space


# Init List of hash functions
hashes = list()

h1 = HashFunction(1,1)
h2 = HashFunction(3,1)

hashes.append(h1)
hashes.append(h2)

In [25]:
import numpy as np
import sys

space = shingleSpace(sets)
sortedSpace = sorted(space)

def computeSignature(space, hashes, sets):
    """
    This function will create the minhash signature matrix from our sets s1-s4 
    using the list of hashfunction hashes and the shingleSpace space
    :param space: The ShingleSpace set
    :param hashes: The list of hashes
    :param sets: The list of ShingleSets
    :return: The MinHashSignature
    """
    
    result = np.full((len(hashes),len(sets)), sys.maxsize)
    # Start coding here!
    n = len(space)  # number of unique shingles, as required by the hash formula
    # Sort the space so that every shingle gets a deterministic row index
    for x_i, x in enumerate(sorted(space)):
        for s_i, s in enumerate(sets):
            if x in s:
                for h_i, h in enumerate(hashes):
                    # Hash the row index of the shingle and keep the minimum per set
                    result[h_i, s_i] = min(h.hashf(x_i, n), result[h_i, s_i])
    return result

computeSignature(space, hashes, sets)

array([[1, 1, 3, 0],
       [1, 1, 0, 1]])

### Question

Compute the minhash signature matrix using your implemented function. Verify that the result of your implementation is correct by comparing the answers of your program to the answers of your manual calculation.

## Exercise 3: Locality Sensitive Hashing

In this part of the exercise we will use the functions implemented in the previous exercises to compute a Locality-Sensitive Hashing table, using the banding technique for minhashes described in the lecture and in the book.

### Step 1.
For this exercise we will need many hash functions. Construct a class which can create a hash function with a random $\alpha$ and $\beta$.

In [26]:
import random

class RandomHashFunction:
    """
    This RandomHashFunction class creates a hash function with a random alpha in [1, alpha] and a random beta in [1, beta]
    """
    def __init__(self, alpha, beta):
        self.alpha = (random.randint(1,alpha))
        self.beta = (random.randint(1,beta))
        
    def hashf(self, x, n):
        """
        Returns a random hash given an integer x and n
        :param x: The value to be hashed
        :param n: The number of unique shingles of all sets
        :return: The hashed value x given alpha and beta
        """
        
        # Replace this with your implementation!
        hash_value = (x * self.alpha + self.beta) % n

        return hash_value
    

### Step 2.

Now create a function which computes the candidates using the LSH technique given a Minhash table. For this you may use the pseudocode given below.  
  
``` python
# Initialize buckets
foreach band do
    foreach set do
        s = a column segment of length r, for this band and set
        add set to buckets[hash(s),band]
    end
end
```  
   
``` python
# Retrieve candidates
foreach item in buckets[hash(s),band] do
    add [set,item] to the list of candidates
end

```

In [27]:
def computeCandidates(minhash_table, bucket_size, row_per_band):
    """
    This function computes the candidates using the LSH technique given a Minhash table
    :param minhash_table: The minhash table
    :param bucket_size: The number of buckets per band
    :param row_per_band: The rows per band
    :return: The set of candidate pairs (as pairs of column indices)
    """
    
    assert(minhash_table.shape[0] % row_per_band == 0)
    b = minhash_table.shape[0] // row_per_band
    result = set()

    for i in range(b):                          # For each band
        # Keep a separate list of buckets for each band
        buckets = [list() for _ in range(bucket_size)]

        for j in range(minhash_table.shape[1]): # For each set (column of minhash_table)
            # Take segment from minhash_table column
            colSegment = minhash_table[i*row_per_band:(i+1)*row_per_band, j]

            # Hash the column segment to one of the buckets of this band
            s = tuple(colSegment.tolist())
            bucket = buckets[hash(s) % bucket_size]

            # Retrieve the candidates: every set already in this bucket pairs with set j
            for item in bucket:
                result.add((item, j))
            bucket.append(j)

    # Compare the candidates against the actual Jaccard distance (uses the global 'sets')
    for x in result:
        jd = jaccardDistance(sets[x[0]], sets[x[1]])
        if jd < 0.5:
            print("ShingleSets: ", x, "within tolerance   jaccard distance: ", jd)
        else:
            print("ShingleSets: ", x, "not within tolerance   jaccard distance: ", jd)
    return result

### Question 
An important implementation issue is that you should keep separate lists of buckets for each band. This means that this algorithm will work suboptimally if you index the buckets only as buckets[hash(s)] instead of buckets[hash(s),band]. Why is this the case?  
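
One way to explore this question is with a small experiment. The sketch below uses two made-up signatures and a hypothetical helper `candidate_pairs`; for simplicity it uses the band segment itself as the bucket key instead of hashing it into a fixed number of buckets.

```python
from collections import defaultdict

# Two hypothetical signatures with 4 rows, split into 2 bands of r = 2 rows each.
# They never agree within the same band, but band 0 of A equals band 1 of B.
signatures = {"A": [1, 2, 3, 4], "B": [3, 4, 1, 2]}
r = 2

def candidate_pairs(signatures, r, per_band):
    """Collect pairs that share a bucket; per_band decides whether the band is part of the key."""
    buckets = defaultdict(list)
    pairs = set()
    for name, sig in signatures.items():
        for band in range(len(sig) // r):
            segment = tuple(sig[band * r:(band + 1) * r])
            key = (segment, band) if per_band else segment
            for other in buckets[key]:
                if other != name:
                    pairs.add((other, name))
            buckets[key].append(name)
    return pairs

print(candidate_pairs(signatures, r, per_band=True))   # set()
print(candidate_pairs(signatures, r, per_band=False))  # {('A', 'B')}
```

With shared buckets, A and B become a candidate pair even though they do not agree in any single band, which is a false positive caused purely by the indexing.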

### Step 3. 
As before, compute the minhash signature matrix, this time using 100 random hash functions. Use a bucket size of 1000 and 5 rows per band.

In [28]:
# Init list for the 100 random hashes
rhashes = [RandomHashFunction(100,100) for x in range(100)]

# Calculate Minhash Table
mhs = computeSignature(space, rhashes, sets)
# print(mhs)

# Apply Locality Sensitive Hashing to find candidate pairs
candidates = computeCandidates(mhs, 1000, 5)


__Content-based__ recommendation uses only information about the items being recommended. There is no information about the users.

__Collaborative filtering__ takes advantage of user information. Generally speaking, the data contains the likes or dislikes of every user for every item they have used. The likes and dislikes can be implicit, such as the fact that a user watched a whole movie, or explicit, such as a thumbs up or a good star rating.

### Question
When you run your code multiple times you will notice that sometimes you get other candidates. Why?


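Because the hash functions use randomly drawn alpha and beta values, each run produces a different signature matrix, so different pairs can end up in the same bucket.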

### Question 
Run your code 10 times. Note which candidates get suggested and how many times each candidate gets suggested. How does this relate to the Jaccard distance between the two
sets of the candidate pair (not in terms of formulas, just an indication)? To check this, compute the Jaccard distance between all possible combinations of all ShingleSets and compare this to the frequencies (how many times a pair gets suggested as a candidate) to verify your idea.
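
The pairwise distances for this comparison can be computed with a short loop. This is a minimal sketch that redefines the four ShingleSets from Exercise 2 and uses an illustrative helper `jaccard` so that it runs on its own.

```python
from itertools import combinations

# The four ShingleSets s1-s4 from Exercise 2, indexed 0-3
sets = [{"a", "b"}, {"a", "c"}, {"d", "c"}, {"g", "b", "a"}]

def jaccard(a, b):
    """Jaccard distance between two sets (illustrative helper)."""
    return 1 - len(a & b) / len(a | b)

# Distance for every unordered pair of set indices
distances = {(i, j): jaccard(sets[i], sets[j])
             for i, j in combinations(range(len(sets)), 2)}

# Print the pairs from most similar to least similar
for pair, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

The pair with the smallest distance is the one you would expect to be suggested as a candidate most often.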

### Question
Why (or when) would you use this algorithm?


Why LSH?
LSH can be considered an algorithm for dimensionality reduction. A problem that arises when we recommend items from large datasets is that there may be too many pairs of items to calculate the similarity of each pair. Also, the items will likely have only sparse overlapping data.

LSH is about the probability that two documents land in the same bucket. After generating the minhash signatures, divide them into 'b' bands of 'r' rows each. Documents whose signatures agree on all rows of at least one band are candidates for being similar. Once the candidates for a given document are found, you can use any similarity measure to compare them and pick the k most similar documents.

Candidate pairs are those that hash to the same bucket for at least one band.
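
The probability mentioned above can be made concrete. Under the standard assumption that two sets with Jaccard similarity $s$ agree on any single minhash row with probability $s$, they become a candidate pair with probability $1 - (1 - s^r)^b$. The function name `candidate_probability` below is illustrative, and the sketch ignores accidental bucket collisions.

```python
def candidate_probability(s, r, b):
    """Probability that two sets with Jaccard similarity s agree on all r rows of at least one of b bands."""
    return 1 - (1 - s ** r) ** b

# 100 hash functions, split either into 20 bands of 5 rows or into 100 bands of 1 row
for s in (0.2, 0.5, 0.8):
    print(s,
          round(candidate_probability(s, r=5, b=20), 4),
          round(candidate_probability(s, r=1, b=100), 4))
```

With 5 rows per band, dissimilar pairs are rarely suggested while similar pairs almost always are; with 1 row per band, nearly every pair becomes a candidate.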

### Question 
What happens if the number of buckets is too small? For example what would happen
if we only use 10 buckets?  

### Question 
What is the effect of the number of rows per band? If we set the number of rows per band to 1, what will happen? And if you set the number of rows per band to the length of the
signature?  

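If the number of rows per band increases, then:
    
- fewer pairs are selected for comparison;
- the number of false positives (FP) goes down, but the number of false negatives (FN) goes up;
- performance goes up, but so does the error rate, in the form of missed similar pairs.

With 1 row per band, any two sets that agree on a single minhash value become candidates, so many more pairs are suggested. With as many rows per band as the signature is long, there is a single band and only sets with identical signatures become candidates.

- Comparing all pairs of signatures takes too much time or space,
     - so we need LSH.
- These methods can produce false negatives or false positives.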

## Key idea
Hash each column C to a small signature sig(C), such that:

1. sig(C) is small enough that we can fit a signature in main memory for each column
2. sim(C1, C2) is (almost) the same as the "similarity" of sig(C1) and sig(C2)
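
The second property can be checked empirically. The sketch below builds minhash signatures from random permutations and compares the fraction of agreeing signature rows with the true Jaccard similarity. The two columns `c1` and `c2`, the 500 permutations and the fixed seed are arbitrary choices made for illustration.

```python
import random

def minhash_signature(items, permutations):
    """For each permutation (a dict item -> position), keep the smallest position among the items."""
    return [min(perm[x] for x in items) for perm in permutations]

random.seed(0)  # fixed seed so the sketch is reproducible
universe = list(range(50))
c1 = set(range(0, 30))
c2 = set(range(10, 40))

# Build 500 random permutations of the universe
permutations = []
for _ in range(500):
    order = universe[:]
    random.shuffle(order)
    permutations.append({item: pos for pos, item in enumerate(order)})

sig1 = minhash_signature(c1, permutations)
sig2 = minhash_signature(c2, permutations)

# Fraction of rows on which the signatures agree estimates the Jaccard similarity
estimate = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
true_sim = len(c1 & c2) / len(c1 | c2)
print(true_sim, estimate)
```

The estimate becomes more accurate as the number of permutations (signature rows) grows, at the cost of a longer signature.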