## **Assignment 1** - Group 50

Lütfi Altin (lutfia@kth.se) |
Jakob Heyder (heyder@kth.se)

### Description:

You are to implement the stages of finding textually similar documents based on Jaccard similarity using the shingling, minhashing, and locality-sensitive hashing (LSH) techniques and corresponding algorithms. The implementation can be done using any big data processing framework, such as Apache Spark, Apache Flink, or no framework, e.g., in Java, Python, etc. To test and evaluate your implementation, write a program that uses your implementation to find similar documents in a corpus of 5-10 or more documents such as web pages or emails.

The stages should be implemented as a collection of classes, modules, functions or procedures depending on the framework and the language of your choice. Below, we give a description of sample classes that implement different stages of finding textually similar documents. You do not have to develop the exact same classes and data types as described below. Feel free to use data structures that suit you best.

* A class Shingling that constructs k–shingles of a given length k (e.g., 10) from a given document, computes a hash value for each unique shingle, and represents the document in the form of an ordered set of its hashed k-shingles.
* A class CompareSets that computes the Jaccard similarity of two sets of integers – two sets of hashed shingles.
* A class MinHashing that builds a minHash signature (in the form of a vector or a set) of a given length n from a given set of integers (a set of hashed shingles).
* A class CompareSignatures that estimates similarity of two integer vectors – minhash signatures – as a fraction of components, in which they agree.
* (Optional task for 2 extra bonus points) A class LSH that implements the LSH technique: given a collection of minhash signatures (integer vectors) and a similarity threshold t, the LSH class (using banding and hashing) finds all candidate pairs of signatures that agree on at least a fraction t of their components.

To test and evaluate scalability (the execution time versus the size of input dataset) of your implementation, write a program that uses your classes to find similar documents in a corpus of 5-10 documents. Choose a similarity threshold s (e.g., 0.8) that states that two documents are similar if the Jaccard similarity of their shingle sets is at least s.

### Dataset and Tools

The implementation is done in Python. We use the [Eco-Hotel review dataset](https://archive.ics.uci.edu/ml/datasets/Eco-hotel), which provides a corpus of 401 text reviews as CSV data.

The classes are implemented as Python functions.

In [0]:
# Load dependencies (pandas, csv etc.)
import csv
import numpy as np
import re
import hashlib
import itertools
from collections import Counter
from pprint import pprint
import pandas as pd


In [2]:
# Produces the set of hashed k-shingles for a given text
def shingling(text, k = 2, trim = True):
    # optionally strip all whitespace before shingling
    s = re.sub(r'\s+', '', text) if trim else text

    # generate k-sized shingles and store them in a set (avoids duplicates)
    shingles = {s[i:i + k] for i in range(len(s) - k + 1)}
    #print("Shingles", shingles)

    # note: the built-in hash() of a str is salted per process unless PYTHONHASHSEED is set,
    # so the hashed values differ between runs
    hashes = {hash(i) for i in shingles}
    return hashes

shingling('this is a test')

{-7313416603091209679,
 -6784575876679379395,
 -5676341188209704449,
 -5652620322941114103,
 -5124870803369009610,
 -5106443392422741323,
 1642313828906246015,
 6256248063624487866,
 6695213119578814433}
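
Because `hash()` on strings is randomized per Python process, the values printed above cannot be reproduced in a new session. The following is a minimal sketch of a deterministic alternative, assuming a 32-bit CRC is an acceptable shingle hash for a corpus of this size. The name `stable_shingling` is hypothetical and not part of the original notebook.

```python
import re
import zlib

def stable_shingling(text, k=2, trim=True):
    # Hypothetical variant of shingling(): same shingle construction,
    # but hashed with CRC32 so the result is identical in every process.
    s = re.sub(r'\s+', '', text) if trim else text
    shingles = {s[i:i + k] for i in range(len(s) - k + 1)}
    return {zlib.crc32(sh.encode('utf-8')) for sh in shingles}

hashes = stable_shingling('this is a test')
print(len(hashes))  # 9 distinct 2-shingles, matching the 9 values above
```

CRC32 is not a cryptographic hash, but it only has to spread shingles over the integer range here, and it is fast.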

In [3]:
# Jaccard similarity for two sets
def compareSets(setA, setB):
    return len(setA.intersection(setB)) / len(setA.union(setB))

setA = shingling("lutfi is watchin")
setB = shingling("lets do this")   
compareSets(setA,setB) 

0.1

In [4]:
# Creates the min-hash signature of length n from the shingle set
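# The SHAKE variants are excluded because their hexdigest() requires an explicit length argument.
# Note: algorithms_guaranteed is a set, so the order of this list is not fixed across runs.
algorithms = [x for x in hashlib.algorithms_guaranteed if x not in ["shake_128", "shake_256"]]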

def hashWith(alg, i):
    return int(hashlib.new(alg,str(i).encode('UTF-8')).hexdigest(), 16)

def minHashing(shingleSet, n = 3):
    # throw error if not enough hash functions are available
    if (n > len(algorithms)):
        raise ValueError('The maximum number of hash functions available is {0}.'.format(len(algorithms)))

    # iterate over hash functions and compute h_min(s) for the set.
    signature = [min(hashWith(alg, i) for i in shingleSet) for alg in algorithms[0:n]] 
    return signature

setA = shingling("lutfi is watchin")
signature = minHashing(setA, 12)
signature

[653522856589351066804982148836236964834507874529499700764425269523190911183043656104232048197055886598826652602065,
 2878782684885436217424264982721683562044914418219364350271341582825087244457958699899293877270865963077385307259753540546542469158071426733453335715400334,
 14144138279377104566127191076145144661468382159,
 6621206272040904076684347470483009911773157093580518936096250384798541153960,
 8645872421062765289909519959436344346574527858136863341143855008479672386244,
 376249982905439382313504452239938203898353071754973219989723774079,
 130696007095401796113889637556669921871605540957823295254733966364408800862753830170001848123680253515585750937275512243684057006244539068018152368375540,
 264835318754428606591177414508884148497389438160055789241582670618,
 574899293936071695424235681025076953640511785774343798506990393265257889770417353687423169253375068933127448534351777848364948495768156059444973689765291,
 7706346851118855014968578915069660240562869324552771590761825839555
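
The signature length above is capped by the number of algorithms in `hashlib`. A common way around such a cap is a family of random linear hash functions `h(x) = (a*x + b) mod p`. The sketch below illustrates that idea under stated assumptions: the names `make_hash_params` and `min_hash_universal` and the choice of prime are hypothetical, and this is not the method used in the rest of this notebook.

```python
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; inputs are reduced modulo this value

def make_hash_params(n, seed=0):
    # Draw n (a, b) pairs; a fixed seed keeps signatures comparable across documents.
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def min_hash_universal(shingle_set, params):
    # One signature component per (a, b) pair: the minimum hash over the set.
    # Inputs that are congruent modulo PRIME collide, which is rare for 64-bit hashes.
    return [min((a * x + b) % PRIME for x in shingle_set) for a, b in params]

params = make_hash_params(100)
sig = min_hash_universal({1, 2, 3, 42}, params)
print(len(sig))  # 100
```

All documents must be hashed with the same `params`, otherwise their signatures are not comparable.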

In [5]:
# Compare the signatures, the returned probability will approximate the jaccard similarity of the original shingle sets
def compareSignatures(signatureA, signatureB):
    if (len(signatureA) != len(signatureB)):
        raise ValueError('The signatures have different lengths ({0} and {1}) and should not be compared!'.format(len(signatureA), len(signatureB)))

    A = np.array(signatureA)
    B = np.array(signatureB)
    count = np.count_nonzero(A==B)
    # Important: Not jaccard similarity -> Probability instead (number_of_same/number_of_total)
    probability = count / len(signatureA)
    #print(A==B, count, probability)
    
    return probability

setA = shingling("lütfi is watchin")
setB = shingling("lütfi is working")
print("Set A", setA)
print("Set B", setB)
similarity = compareSets(setA, setB)
print(similarity)
signatureA = minHashing(setA, 12)
signatureB = minHashing(setB, 12)
print("Signature A", signatureA)
print("Signature B", signatureB)
compareSignatures(signatureA, signatureB)

Set A {-8035911973540809245, 6695213119578814433, 3542108280038597480, -4527573745910354262, 2784169489973620333, 199007250924976368, 6386736999077128751, -5106443392422741323, -6565529605814372745, -1012230118648613320, -5542241521173551875, 6256248063624487866, -7641244423143262656}
Set B {6695213119578814433, -90497505286013432, 3542108280038597480, -4527573745910354262, 2784169489973620333, 2206788520029522222, -5542241521173551875, 6386736999077128751, 2721089368504791728, -6565529605814372745, 6564115130264095928, -5284489781584247171, -7641244423143262656}
0.4444444444444444
Signature A [653522856589351066804982148836236964834507874529499700764425269523190911183043656104232048197055886598826652602065, 3266090763773593029957100425728239449296043916690705806335763740478570997366870808168167737897065736782188831942500198489882346037074324291547019688239882, 14144138279377104566127191076145144661468382159, 10607885856810717787709741883475047215184833637329737871926315773612560774810

0.5
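
The estimate of 0.5 is close to the exact Jaccard similarity of 0.444 but not equal to it. With only 12 components, the estimate can move only in steps of 1/12. As a rough sketch, assume the hash functions behave like independent random permutations. Each component then agrees with probability `s`, and the estimate has a standard deviation of `sqrt(s * (1 - s) / n)`. The helper name `minhash_std_error` is hypothetical.

```python
import math

def minhash_std_error(s, n):
    # Standard deviation of the fraction of agreeing components,
    # assuming n independent components that each agree with probability s.
    return math.sqrt(s * (1 - s) / n)

s = 4 / 9   # exact Jaccard similarity printed above (8 shared shingles out of 18)
n = 12      # signature length used above
print(round(minhash_std_error(s, n), 3))  # 0.143
```

Under this assumption, a deviation of about 0.06 between the estimate and the exact value is well within one standard deviation.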

In [11]:
# Locality-sensitive hashing. nr_bands is the number of band boundaries,
# so each signature is cut into nr_bands - 1 bands (np.linspace below returns nr_bands points).
def lhs(signatures, similarity_threshold, nr_bands = 5, nr_buckets = 5):
    if (len(signatures) < 1 or not all(len(s) == len(signatures[0]) for s in signatures)):
        raise ValueError('The signatures need to have all the same length and be non empty.')

    # 1.) Iterate over signatures, cut in bands and hash each band into a bucket
    buckets = [set() for x in range(0, nr_buckets)]
    bands = np.linspace(0, len(signatures[0]), nr_bands).astype(int).tolist()

    for index, signature in enumerate(signatures):
        for i in range(0, nr_bands-1):
            band_start = bands[i]
            band_end = bands[i+1]
            # join band to be hashed "as one entity"
            band = "".join(str(x) for x in signature[band_start:band_end])
            bucket = hash(band) % nr_buckets
            # add the signature-set-join ("identifier") to the bucket
            #buckets[bucket].add("".join(str(x) for x in signature)) # store stringified signature
            buckets[bucket].add("index %s" % index) # replaced above line with a human readable string in order to validate results after
    
    # 2.) Use sets of buckets to determine candidate pairs based on threshold 
    relevant_buckets = [x for x in buckets if len(x) >= 2] # only check buckets with at least two signatures
    relevant_pairs = []
    for bucket in relevant_buckets:
        # collect all pairs from the bucket; sorting gives every pair a canonical order,
        # so (a, b) and (b, a) are counted as the same pair
        pairs = list(itertools.combinations(sorted(bucket), 2))
        relevant_pairs += pairs

    # count how often each pair shares a bucket and keep the pairs whose
    # fraction of shared buckets (relative to the number of bands) reaches the threshold
    count = Counter(relevant_pairs)
    candidate_pairs = [pair for pair, c in count.items() if (c/(nr_bands-1)) >= similarity_threshold]

    return candidate_pairs

print(lhs([signatureA, signatureB], 0.1))
print(lhs([signatureA, signatureB], 0.8))

[('index 0', 'index 1')]
[]
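
The function above hashes all bands into one shared list of buckets, so two signatures can meet in a bucket through different bands. For comparison, here is a sketch of the textbook banding scheme. Each band gets its own bucket table, and a pair becomes a candidate as soon as it is identical in at least one band. The name `lsh_candidates` is hypothetical, and the sketch assumes the signature length is divisible by the number of bands.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, nr_bands):
    # Assumes len(signature) is divisible by nr_bands.
    rows = len(signatures[0]) // nr_bands
    candidates = set()
    for b in range(nr_bands):
        # A separate bucket table per band; the band itself is the bucket key.
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            key = tuple(sig[b * rows:(b + 1) * rows])
            buckets[key].append(idx)
        for members in buckets.values():
            # Indices are appended in increasing order, so pairs are (i, j) with i < j.
            candidates.update(combinations(members, 2))
    return candidates

sigs = [[1, 2, 3, 4], [1, 2, 9, 9], [7, 7, 7, 7]]
print(lsh_candidates(sigs, nr_bands=2))  # {(0, 1)}
```

Using the band tuple as a dictionary key avoids collisions between unrelated bands, at the cost of one dictionary per band.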


Next, we apply the functions defined above to our dataset to find similar comments. The dataset file contains one comment per line. We chose a shingle size of 4 because the comments are short documents with only a few lines of content (the average character count is around 270).

The shingle sets are stored together with the line number of each comment so that the results can be evaluated later. A helper function `compare` applies a similarity function to every pair of comments. The threshold values are adjusted to keep the number of reported pairs manageable.

In [9]:
import codecs
from pprint import pprint

# Import hotel review data (the with-block closes the file after reading)
with codecs.open('data.txt', encoding='utf-8') as f:
    dataSet = [line.strip() for line in f]

print('avg char count', sum(len(d) for d in dataSet)/len(dataSet))

# Executes given comparison function over combinations of elements in input array
def compare(fn, arr):
    return [(s[0][0], s[1][0], fn(s[0][1], s[1][1])) for s in itertools.combinations(arr, 2)]

# jaccard similarity with shinglings
shinglings = [(i+1, shingling(t, k=4)) for i, t in enumerate(dataSet)] # value i+1 is the line number; it is stored to evaluate the results
similarities = compare(compareSets, shinglings)

similarities = [s for s in similarities if s[2] > 0.3]
pprint(similarities)

# jaccard similarity with minHashing
signatures = [(i, minHashing(s, n=12)) for i, s in shinglings]
similarities = compare(compareSignatures, signatures)

similarities = [s for s in similarities if s[2] > 0.4]
pprint(similarities)

# locality-sensitive hashing
pprint(lhs([s for i, s in signatures], 0.5, nr_buckets=200)) # index i represents the comment with line number i+1

avg char count 268.6708229426434
[(73, 74, 1.0),
 (90, 186, 0.4090909090909091),
 (90, 194, 0.3235294117647059),
 (102, 140, 0.3492063492063492),
 (153, 154, 1.0),
 (178, 241, 0.375),
 (178, 395, 0.3333333333333333),
 (235, 347, 0.34146341463414637),
 (239, 258, 0.34285714285714286),
 (239, 347, 0.3333333333333333),
 (241, 395, 0.30303030303030304),
 (258, 320, 0.34285714285714286),
 (258, 347, 0.34375),
 (258, 351, 0.3055555555555556),
 (258, 366, 0.3783783783783784),
 (320, 328, 0.34615384615384615),
 (320, 330, 0.3055555555555556),
 (320, 347, 0.3333333333333333),
 (320, 377, 0.3055555555555556),
 (320, 382, 0.32075471698113206),
 (330, 377, 0.3142857142857143)]
[(8, 194, 0.4166666666666667),
 (40, 59, 0.4166666666666667),
 (68, 102, 0.4166666666666667),
 (73, 74, 1.0),
 (75, 194, 0.4166666666666667),
 (90, 186, 0.5),
 (98, 382, 0.4166666666666667),
 (124, 146, 0.4166666666666667),
 (128, 375, 0.4166666666666667),
 (145, 392, 0.4166666666666667),
 (149, 361, 0.4166666666666667),
 (1

The exact Jaccard similarities from shingling and the estimates from min-hashing are similar. The results of locality-sensitive hashing differ slightly because of the low number of bands. Only 12 hash algorithms are available for min-hashing, so we use the maximum signature length of 12. Each signature is split into 4 bands with 3 rows per band, so the fraction of matching bands can only be a multiple of 25%.
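
To see how coarse 4 bands of 3 rows are, consider the standard banding analysis. It assumes independent signature components and a scheme where a pair is a candidate once any single band matches, which is not exactly how `lhs` applies its threshold. Under those assumptions, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`. The sketch below evaluates this for `r = 3` and `b = 4`; the function name is hypothetical.

```python
def candidate_probability(s, rows, bands):
    # Probability that at least one of `bands` bands of `rows` components agrees,
    # assuming each component agrees independently with probability s.
    return 1 - (1 - s ** rows) ** bands

for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, rows=3, bands=4), 3))

# Rule-of-thumb position of the steepest rise of the curve: (1/b)^(1/r)
print(round((1 / 4) ** (1 / 3), 2))
```

For this configuration the curve rises around a similarity of roughly 0.63. A sharper cut-off would need more bands and rows, and therefore longer signatures.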

Two pairs have a Jaccard similarity of 1.0, meaning their shingle sets are identical. Checking the input file shows that the comments on lines 73 and 153 are each duplicated on the following line. To validate correctness, let's take the pair (258, 347), which is found by both the shingling and the min-hashing method.

Comment in line 258: "Staff are wonderful. Thank you"

Comment in line 347: "Thank you, it was wonderful"

We can see that the comments are indeed similar: both contain "Thank you" and "wonderful".

Now, we compare the execution time of the three algorithms. To reduce variance, we run each algorithm 20 times.

In [8]:
import time

start = time.time()
for i in range(20):
    compare(compareSets, shinglings)
print("--- %s seconds ---" % (time.time() - start))

start = time.time()
for i in range(20):
    compare(compareSignatures, signatures)
print("--- %s seconds ---" % (time.time() - start))

start = time.time()
for i in range(20):
    lhs([s for i, s in signatures], 0.7, nr_buckets=200)
print("--- %s seconds ---" % (time.time() - start))


--- 25.963175773620605 seconds ---
--- 15.989728450775146 seconds ---
--- 0.17509746551513672 seconds ---


The execution times are as expected: comparing the shingle sets is the slowest and locality-sensitive hashing is the fastest.
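
The gap is plausible from the amount of work alone. Both pairwise methods call the comparison function once per pair of comments, while `lhs` makes a single pass over the signatures and only pairs up signatures that share a bucket. As a quick calculation, assuming the corpus has the 401 reviews stated in the dataset description:

```python
n_docs = 401  # assumption: one review per line, as stated in the dataset description

pairs = n_docs * (n_docs - 1) // 2   # number of unordered pairs
print(pairs)        # 80200 comparisons per run
print(pairs * 20)   # 1604000 comparisons over the 20 timed runs
```

The number of pairs grows quadratically with the number of documents, so the advantage of hashing into buckets becomes larger on bigger corpora.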