## Introduction
About 100.000 users that watched in total 17.770 movies;
- Each user watched between 300 and 3000 movies
- The file contains about 65.000.000 records (720 MB) of the form:
`<user_id, movie_id> : “user_id watched movie_id”`
- Similarity between users: Jaccard similarity of sets of movies they watched: 
``` 
jsim(S1, S2) = #intersect(S1, S2)/#union(S1, S2)
``` 
- Task: find (with help of LSH) pairs of users whose jsim > 0.5

Process:
1. Tune it (signature length, number of bands, number of rows per band)
2. Randomize, optimize, benchmark, polish the code, ...
3. Dump results to a text file ans.txt (just a csv list of records: user1, user2) 

## Data preparation

In [148]:
# import packages
import numpy as np
import pandas as pd
import time
import csv
from collections import defaultdict
from scipy.sparse import csc_matrix

t0 = time.time()
np.random.seed(seed=20)

In [149]:
#  load the data
FILE = '../data/user_movie.npy'
df = pd.DataFrame(np.load(FILE), columns = ['user','movie'])

m_by_u = df.groupby('user')['movie'].apply(list)
n_movie = len(df.movie.unique())

## MinHash

The dataset is so small that random permutations is used instead of hash functions. 
In this way we can avoid time consuming loops.

In [150]:
# set user as the column, movie as the row in the sparse matrix
mat = csc_matrix(([1]*df.shape[0], (df.iloc[:,1], df.iloc[:,0])))

In [151]:
# Relatively short signatures (50-150) should result in good results (and take less time to compute).
n_sig = 90

In [152]:
def minhashing(mat, n_sig = n_sig):
    # create the sig matrix using minhashing
    sig_matrix = np.array([])
    for i in range(n_sig):
        perm = mat[np.random.permutation(n_movie)]
        # return the index of the first occurrence of 1
        sig = perm.argmax(axis=0)
        if sig_matrix.any():
            sig_matrix = np.vstack((sig_matrix,sig))
        else:
            sig_matrix = sig
        if i%10==0 and i: # check the progress
            print(i,' signatures have been created')
    return sig_matrix

In [153]:
sig_matrix = minhashing(mat,n_sig)

10  signatures have been created
20  signatures have been created
30  signatures have been created
40  signatures have been created
50  signatures have been created
60  signatures have been created
70  signatures have been created
80  signatures have been created


In [154]:
cal_time = round(time.time()-t0)
print("From data importing to minhashing, it takes {0:.2f} minutes".format(cal_time/60))

From data importing to minhashing, it takes 11.55 minutes


In [155]:
# check the dimension of the sig matrix
sig_matrix.shape

(90, 103703)

## Implement the LSH algorithm

When there are many users that fall into the same bucket (i.e., there are many candidates for being similar to each other) then checking if all the potential pairs are really similar might be very expensive: you have to check k(k-1)/2 pairs, when the bucket has k elements. Postpone evaluation of such a bucket till the very end (or just ignore it – they are really expensive). Or better: consider increasing the number of rows per band – that will reduce the chance of encountering big buckets.

Note that b*r doesn’t have to be exactly the length of the signature. For example, when you work with signature of length n=100, you may consider e.g., b=3, r=33, b=6, r=15, etc. – any combination of b
 
To make sure that your program will not exceed the 30 minutes runtime you are advised to close the result.txt file after any new pair is appended to it (and open it again, when you want to append a new one).

In [156]:
def calculate_jsim(a,b):
    intersect = len(set(a) & set(b))
    union = len(a) + len(b) - intersect
    jsim = intersect/union
    return jsim

In [157]:
def check_unique_pair(pair):
    # check if the pair from the bucket is similar based on sig matrix
    sigi = sig_matrix[:,pair[0]].ravel().tolist()[0]
    sigj = sig_matrix[:,pair[1]].ravel().tolist()[0]
    jsim = calculate_jsim(sigi, sigj)
    if jsim > 0.5:
        return True

def check_true_pair(pair):
    # check if the similar pair based on sig matrix is a true pair
    jsim = calculate_jsim(m_by_u[pair[0]], m_by_u[pair[1]])
    if jsim > 0.5:
        return True

Too many slices would give a lot of false positives while too few slices would only be able to identify the highest degrees of similarity.

In [163]:
bands = 3
rows = 30
max_bucket_item = 80

## Check, no write out

In [159]:
def LSH_bucket(sig_matrix, max_bucket_item = 25):
    unique_pairs = set() 
    for b in range(bands):
        bucket_list = defaultdict(set) # remove the duplicate items
        s = np.sum(sig_matrix[b*rows:(b+1)*rows,:], axis =0)
        for index, x in np.ndenumerate(s):
            bucket_list[x].add(index[1])
        for key, value in bucket_list.items():
            if len(value) > 1 and len(value) < max_bucket_item: 
                for i in value:
                    for j in value:
                        if i < j and ((i,j) not in unique_pairs): # remove the duplicate pair
                            unique_pairs.add((i,j))
    return unique_pairs

In [160]:
def print_not_write():
    unique_pairs = LSH_bucket(sig_matrix,max_bucket_item)
    print('Unique pairs: ',len(unique_pairs))

    sim_pairs = []
    for pair in unique_pairs:
        if check_unique_pair(pair):
            sim_pairs.append(t_pair)
    print('Similar pairs based on sig matrix: ',len(sim_pairs))

    true_pairs = []
    for pair in sim_pairs:
        if check_true_pair(pair):
            true_pairs.append(pair) 
    print('True similar pairs: ', true_pairs)  
print_not_write()

Unique pairs:  813429
Similar pairs based on sig matrix:  0
True similar pairs:  []


## Write to text file, for final submital

In [164]:
def LSH_bucket_true_paris(sig_matrix, max_bucket_item = 25): 
    unique_pairs = set() 
    for b in range(bands):
        bucket_list = defaultdict(set) # remove the duplicate items
        s = np.sum(sig_matrix[b*rows:(b+1)*rows,:], axis =0)
        for index, x in np.ndenumerate(s):
            bucket_list[x].add(index[1])
        for key, value in bucket_list.items():
            if len(value) > 1 and len(value) < max_bucket_item: 
                for i in value:
                    for j in value:
                        if i < j and ((i,j) not in unique_pairs): # remove the duplicate pair
                            unique_pairs.add((i,j))
                            if check_unique_pair((i,j)):
                                if check_true_pair((i,j)):
                                    with open('ans.txt','a') as file:
                                        writer = csv.DictWriter(file,fieldnames=['user1', 'user2'])
                                        writer.writerow({'user1':i,'user2':j})
#LSH_bucket_true_paris(sig_matrix,max_bucket_item)

## Total Time

In [162]:
cal_time = round(time.time()-t0)
print("Total running time is {0:.2f} minutes".format(cal_time/60))

Total running time is 12.47 minutes


Reference:
- 