# Finding Similar Items: Textually Similar Documents

In [1]:
from numpy import unique
from pandas import read_csv
url = 'https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv'
df = read_csv(url)

A class **Shingling** that constructs k-shingles of a given length k (e.g., 10) from a given document, computes a hash value for each unique shingle, and represents the document as an ordered set of its hashed k-shingles.

In [2]:
import random, math
class Shingling:
   
    def __init__(self, input_text, input_k):
        self.text = input_text
        self.k = input_k
           
    def shingles(self):
        # Collect the k-shingles of every line, not only the first one
        result = set()
        for line in self.text:
            result |= {line[i:i + self.k] for i in range(len(line) - self.k + 1)}
        return result
    
    def ordered_hash(self, input_sh):
        random.seed(111)
        a = random.randint(0,100)
        b = random.randint(0,100)
        # Convert each shingle into an int, hash it, and return the
        # unique hash values in ascending order (a sorted list)
        return sorted({(a * int.from_bytes(x.encode(), 'little') + b) % 1039205197
                       for x in input_sh})

In [3]:
import io
sample_text = io.StringIO("This is a sample text. It is an ordinary string but simulated to act as the contents of a file")
# Construct 10-shingles
sh = Shingling(sample_text,10) 
shingles = sh.shingles() 
# Ordered set of hashed 10-shingles
ordered_set = sh.ordered_hash(shingles) 

A class **CompareSets** that computes the Jaccard similarity of two sets of integers (here, two sets of hashed shingles).
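A minimal sketch of such a class, assuming the inputs are any iterables of hashed shingles (for example the output of `ordered_hash`). The method name `jaccard` and the convention that two empty sets have similarity 1.0 are assumptions made for illustration.

```python
class CompareSets:
    def __init__(self, set1, set2):
        # Accept any iterable of integers and normalise it to a set
        self.s1 = set(set1)
        self.s2 = set(set2)

    def jaccard(self):
        # Jaccard similarity = |intersection| / |union|
        union = self.s1 | self.s2
        if not union:
            return 1.0  # Assumed convention for two empty sets
        return len(self.s1 & self.s2) / len(union)


# Toy example: 2 shared values out of 4 distinct values
print(CompareSets({1, 2, 3}, {2, 3, 4}).jaccard())  # 0.5
```

Applied to two documents, the arguments would be the hashed shingle sets produced by two `Shingling` instances.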

A class **MinHashing** that builds a minhash signature (in the form of a vector or a set) of a given length n from a given set of integers (a set of hashed shingles).

In [4]:
class MinHashing:

    def __init__(self, input_set, input_n):
        self.s = input_set
        self.n = input_n

    def hashing(self):
        min_hash = [None] * self.n # Signature vector of length n
        random.seed(111)
        for j in range(self.n):
            # Coefficients of the j-th hash function h(x) = (a*x + b) mod p
            a = random.randint(0, 2**32-1)
            b = random.randint(0, 2**32-1)
            # Make sure that the prime p is larger than the max shingle id
            min_hash[j] = min((a * value + b) % 4294967311 for value in self.s)
        return min_hash
    

In [11]:
mh = MinHashing(ordered_set,20)
minhash = mh.hashing()

A class **CompareSignatures** that estimates the similarity of two integer vectors (minhash signatures) as the fraction of components in which they agree.

In [6]:
class CompareSignatures:
    def __init__(self, vector1, vector2):
        self.v1 = vector1
        self.v2 = vector2

    def estimate(self):
        # Both vectors are assumed to have the same length
        count = 0
        for x, y in zip(self.v1, self.v2):
            if x == y:
                count += 1

        return count/len(self.v1)

In [7]:
mh1 = MinHashing(ordered_set,20).hashing()
mh2 = MinHashing(ordered_set,20).hashing()

In [8]:
CompareSignatures(mh1,mh2).estimate()

1.0

(Optional task for 2 extra bonus points) A class **LSH** that implements the LSH technique: given a collection of minhash signatures (integer vectors) and a similarity threshold t, the LSH class (using banding and hashing) finds candidate pairs of signatures that agree on at least a fraction t of their components.
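A minimal sketch of the banding approach, under stated assumptions: the constructor parameters, the method names `candidates` and `similar_pairs`, and the toy signatures are hypothetical, and the number of bands is assumed to divide the signature length. A Python dictionary keyed by each band's tuple of values stands in for hashing bands into buckets. Two signatures become a candidate pair when they are identical in at least one band, and candidates are then filtered by the fraction of components on which they agree.

```python
from collections import defaultdict
from itertools import combinations


class LSH:
    def __init__(self, signatures, threshold, bands):
        self.signatures = signatures  # List of equal-length minhash signatures
        self.t = threshold            # Similarity threshold t
        self.b = bands                # Number of bands (assumed to divide the length)

    def candidates(self):
        n = len(self.signatures[0])
        r = n // self.b  # Rows per band
        pairs = set()
        for band in range(self.b):
            buckets = defaultdict(list)
            for idx, sig in enumerate(self.signatures):
                # Signatures with identical values in this band share a bucket
                key = tuple(sig[band * r:(band + 1) * r])
                buckets[key].append(idx)
            for members in buckets.values():
                pairs.update(combinations(members, 2))
        return pairs

    def similar_pairs(self):
        # Keep only candidates agreeing on at least a fraction t of components
        result = []
        for i, j in sorted(self.candidates()):
            s1, s2 = self.signatures[i], self.signatures[j]
            agree = sum(x == y for x, y in zip(s1, s2)) / len(s1)
            if agree >= self.t:
                result.append((i, j, agree))
        return result


# Toy signatures: the first two agree on 4 of 6 components
sigs = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 9, 9],
    [7, 8, 7, 8, 7, 8],
]
print(LSH(sigs, 0.6, 3).similar_pairs())
```

With 3 bands of 2 rows each, only signatures 0 and 1 share a bucket, so they form the single candidate pair. For b bands of r rows, the similarity at which a pair becomes likely to be a candidate is roughly (1/b)^(1/r), which can guide the choice of b for a given t.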