
Read: https://www.learndatasci.com/tutorials/building-recommendation-engine-locality-sensitive-hashing-lsh-python/



In [2]:
!pip install datasketch

Collecting datasketch
  Downloading https://files.pythonhosted.org/packages/8d/35/3e39356d97dc67c4bddaddb51693c20a6eb61e535ce5be09d3755ba2b823/datasketch-1.5.3-py2.py3-none-any.whl (67kB)
Installing collected packages: datasketch
Successfully installed datasketch-1.5.3


In [3]:
import numpy as np
import pandas as pd
import re
import time
from datasketch import MinHash, MinHashLSHForest

In [4]:
# preprocess strips punctuation, lowercases the text, and splits it on
# whitespace into individual tokens (shingles).
def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text)  # keep only word characters and whitespace
    tokens = text.lower().split()
    return tokens

In [5]:
text = 'The devil went down to Georgia'
print('The shingles (tokens) are:', preprocess(text))

The shingles (tokens) are: ['the', 'devil', 'went', 'down', 'to', 'georgia']
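MinHash approximates the Jaccard similarity between sets of shingles, so it can help to see the exact value on a tiny example first. The following is a minimal sketch: the `jaccard` helper and the second sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def preprocess(text):
    # Same tokenizer as above: strip punctuation, lowercase, split on whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

def jaccard(tokens_a, tokens_b):
    # Hypothetical helper: exact Jaccard similarity of two token lists, treated as sets.
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here for two empty documents
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = preprocess('The devil went down to Georgia')
doc_b = preprocess('The devil went down to Texas')  # hypothetical second document

# 5 shared shingles out of 7 distinct shingles overall.
print('Exact Jaccard similarity:', jaccard(doc_a, doc_b))
```

Computing this exactly for every pair of documents gets expensive as the collection grows, which is the cost MinHash and LSH are meant to avoid.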


In [6]:
#Number of Permutations
permutations = 128

#Number of Recommendations to return
num_recommendations = 1
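To get a feel for what the number of permutations controls, here is a toy MinHash written with only the standard library. It is a sketch under stated assumptions: `minhash_signature` and `estimate_jaccard` are hypothetical helpers, and salted BLAKE2 hashes stand in for random permutations, which is not how `datasketch` implements MinHash internally. More permutations give a longer signature and, in general, a less noisy similarity estimate.

```python
import hashlib

def minhash_signature(tokens, num_perm):
    """Toy MinHash: one salted hash per 'permutation', keeping the minimum over all tokens."""
    unique = set(tokens)
    if not unique:
        raise ValueError('need at least one token')
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, 'little')  # a different salt simulates a different permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode('utf8'), digest_size=8, salt=salt).digest(),
                'little')
            for t in unique
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

doc_a = ['the', 'devil', 'went', 'down', 'to', 'georgia']
doc_b = ['the', 'devil', 'went', 'down', 'to', 'texas']  # hypothetical second document

# The exact Jaccard similarity of these two token sets is 5/7.
for perms in (16, 128, 1024):
    est = estimate_jaccard(minhash_signature(doc_a, perms),
                           minhash_signature(doc_b, perms))
    print(perms, round(est, 3))
```

The trade-off is that every extra permutation adds hashing work and memory per document, so 128 is a middle-ground setting rather than a required value.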

To create the MinHash forest, we will execute the following steps:

1. Pass in a dataframe with every string you want to query.
2. Preprocess a string of text using our preprocessing step above.
3. Set the number of permutations in your MinHash.
4. MinHash the string using all of its shingles.
5. Store the MinHash of the string.
6. Repeat steps 2-5 for all strings in your dataframe.
7. Build a forest of all the MinHashed strings.
8. Index your forest to make it searchable.

In [7]:
def get_forest(data, perms):
    start_time = time.time()
    
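    # Build one MinHash signature per row of data['text'].
    minhash = []
    
    for text in data['text']:
        tokens = preprocess(text)
        m = MinHash(num_perm=perms)
        for s in tokens:
            m.update(s.encode('utf8'))  # MinHash.update expects bytes
        minhash.append(m)
        
    forest = MinHashLSHForest(num_perm=perms)
    
    # The key for each signature is its row position, which predict later uses with iloc.
    for i,m in enumerate(minhash):
        forest.add(i,m)
        
    # index() must be called after adding keys; the forest is not searchable before it.
    forest.index()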
    
    print('It took %s seconds to build forest.' %(time.time()-start_time))
    
    return forest

To query the forest that was built, we will follow the steps below:

1. Preprocess your text into shingles.
2. Set the same number of permutations for your MinHash as was used to build the forest.
3. Create your MinHash on the text using all your shingles.
4. Query the forest with your MinHash and return the number of requested recommendations.
5. Return the title of each recommended conference paper.

In [8]:
def predict(text, database, perms, num_results, forest):
    start_time = time.time()
    
    tokens = preprocess(text)
    m = MinHash(num_perm=perms)
    for s in tokens:
        m.update(s.encode('utf8'))
        
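    # Ask the forest for the keys of the top num_results approximate neighbours.
    idx_array = np.array(forest.query(m, num_results))
    if len(idx_array) == 0:
        return None  # the forest returned no matches
    
    # Keys are row positions (assigned by enumerate in get_forest), so use iloc.
    result = database.iloc[idx_array]['title']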
    
    print('It took %s seconds to query forest.' %(time.time()-start_time))
    
    return result
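Because the forest returns approximate neighbours, a brute-force ranking by exact Jaccard similarity can serve as a sanity check on a small sample. The sketch below assumes a hypothetical helper `brute_force_top_k` and three made-up titles; it uses only the standard library and returns row positions, which is the same kind of key `predict` gets back from the forest.

```python
import re

def preprocess(text):
    # Same tokenizer as above: strip punctuation, lowercase, split on whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    return text.lower().split()

def brute_force_top_k(query, texts, k):
    # Hypothetical baseline: score every document by exact Jaccard similarity
    # to the query and return the positions of the k best matches.
    query_set = set(preprocess(query))
    scored = []
    for i, text in enumerate(texts):
        doc_set = set(preprocess(text))
        union = query_set | doc_set
        score = len(query_set & doc_set) / len(union) if union else 0.0
        scored.append((score, i))
    # Highest score first; ties are broken by the lower row position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]

# Made-up documents for illustration only.
texts = [
    'Locality sensitive hashing for nearest neighbour search',
    'Deep learning for image recognition',
    'Hashing methods for similarity search',
]

print(brute_force_top_k('nearest neighbour search with hashing', texts, 2))
```

This baseline compares the query against every document, so it is only practical for spot checks; the forest exists to avoid that full scan.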