Skip to content

This repo includes an implementation of the MinHash function combined with the banding technique. This allows the user to create an LSH (Locality Sensitivity Hash) signature for texts (like file names)

Notifications You must be signed in to change notification settings

ofirgag/MinHash-function-implementation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

MinHash-function-implementation

This repo includes an implementation of the MinHash function combined with the banding technique. This allows the user to create an LSH (Locality Sensitivity Hash) signature for texts (like file names)

MinHash function and banding technique

The MinHash function is useful when trying to find the similarity between sets. In our case, each file name/software name (or simply name) represents a set of tri-grams derived from the name's words. Given a set of tri-grams, the MinHash function outputs a signature. We now elaborate on the process of calculating a signature for a given set of tri-grams of a given name. First we build a characteristic matrix (CM). The CM rows defined by all possible tri-grams, overall there are 26^3 = 17; 576 possible tri-grams (26 letters), hence the matrix has 17,576 rows. The matrix columns defined by all the names. The matrixcells filled according to the next definition:

2021-06-07_15h02_57

Next we define 100 different hash functions of the form:

2021-06-07_15h03_03

for random ai, bi and i in [1, 100]. The input x represents the tri-gram position in [0, 26^3 - 1] defined according alphabetic order. We define the set of hash functions as H.

At this stage we calculate the signature matrix (SM). The SM rows defined by all hash functions (100 overall). The matrix columns defined by all the names. The matrix cells filled according to the next Algorithm:

2021-06-07_15h03_06

In practice, we only iterate through the non-zero elements. Each column in the SM represents a 100 length signature of a file name/software name (or simply name). Next we apply on the SM the banding technique. We split each signature into 25 bands of length 4 each. Each of this 4 length number represents a band tuple.

About

This repo includes an implementation of the MinHash function combined with the banding technique. This allows the user to create an LSH (Locality Sensitivity Hash) signature for texts (like file names)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages