## Introduction
About 100.000 users that watched in total 17.770 movies;
- Each user watched between 300 and 3000 movies
- The file contains about 65.000.000 records (720 MB) of the form:
`<user_id, movie_id> : “user_id watched movie_id”`
- Similarity between users: Jaccard similarity of sets of movies they watched: 
``` 
jsim(S1, S2) = #intersect(S1, S2)/#union(S1, S2)
``` 
- Task: find (with help of LSH) pairs of users whose jsim > 0.5

Process:
1. Tune it (signature length, number of bands, number of rows per band)
2. Randomize, optimize, benchmark, polish the code, ...
3. Dump results to a text file ans.txt (just a csv list of records: user1, user2) 

## Data preparation

In [208]:
# import packages
import numpy as np
import pandas as pd
from collections import defaultdict
from scipy.sparse import csc_matrix
np.random.seed(seed=17)

In [209]:
#  load the data
FILE = '../data/user_movie.npy'
df = pd.DataFrame(np.load(FILE), columns = ['user','movie'])

In [210]:
# data exploration
m_by_u = df.groupby('user')['movie'].apply(list)
n_user = len(df.user.unique())
n_movie = len(df.movie.unique())
print("Number of unique movies", n_movie)
print("movie starts at", df.movie.min())
print("movieID ends at", df.movie.max())

Number of unique movies 17770
movie starts at 0
movieID ends at 17769


Based on the min and max movieID, the movieID attribute is continous without gap.

## MinHash

The dataset is so small that random permutations is used instead of hash functions. 
In this way we can avoid time consuming loops.

In [217]:
# Relatively short signatures (50-150) should result in good results (and take less time to compute).
n_sig = 50

In [219]:
# set user as the column, movie as the row in the sparse matrix
mat = sparse.csc_matrix(([1]*df.shape[0], (df.iloc[:,1], df.iloc[:,0])))

In [220]:
def calculate_jsim(a,b):
    intersect = set(a) & set(b)
    union = set(a) | set(b)
    jsim = len(intersect)/len(union)
    return jsim

In [221]:
def minhashing(mat, n_sig = n_sig):
    sig_matrix = np.array([])
    for i in range(n_sig):
        perm = mat[np.random.permutation(n_movie)]
        if sig_matrix.any():
            # return the index of the first occurrence of 1
            sig_matrix = np.vstack((sig_matrix, perm.argmax(axis=0)))
        else:
            sig_matrix = perm.argmax(axis=0)
    return sig_matrix

In [222]:
sig_matrix = minhashing(mat,n_sig)

In [223]:
sig_matrix.shape

(50, 103703)

## Implement the LSH algorithm

When there are many users that fall into the same bucket (i.e., there are many candidates for being similar to each other) then checking if all the potential pairs are really similar might be very expensive: you have to check k(k-1)/2 pairs, when the bucket has k elements. Postpone evaluation of such a bucket till the very end (or just ignore it – they are really expensive). Or better: consider increasing the number of rows per band – that will reduce the chance of encountering big buckets.

Note that b*r doesn’t have to be exactly the length of the signature. For example, when you work with signature of length n=100, you may consider e.g., b=3, r=33, b=6, r=15, etc. – any combination of b
 
To make sure that your program will not exceed the 30 minutes runtime you are advised to close the result.txt file after any new pair is appended to it (and open it again, when you want to append a new one).

Too many slices would give a lot of false positives while too few slices would only be able to identify the highest degrees of similarity.

In [257]:
bands = 20
rows = 5
def LSH_bucket(mat):
    bucket_list = defaultdict(list)
    for b in range(bands):
        s = np.sum(mat[b*rows:(b+1)*rows,:], axis =0)
        for index, x in np.ndenumerate(s):
            bucket_list[x].append(index[1])
    return bucket_list.items()

In [258]:
bucket_list = LSH_bucket(sig_matrix)

In [None]:
unique_pairs = set()
sim_pairs = set()
for key, value in bucket_list:
    if len(value) > 0 :#and len(value) < 100: # remove the duplicate pairs
        for i in value:
            for j in value:
                if i > j:
                    t = j
                    j = i
                    i = t
                if i!=j and ((i,j) not in unique_pairs):
                    unique_pairs.add((i,j))
                    sigi = sig_matrix[:,i].ravel().tolist()[0]
                    sigj = sig_matrix[:,j].ravel().tolist()[0]
                    jsim = calculate_jsim(sigi, sigj)
                    if jsim > 0.5:
                        sim_pairs.add((i,j))

In [354]:
len(sim_pairs)

425

In [355]:
true_pairs = []
for pair in sim_pairs:
    jsim = calculate_jsim(m_by_u[pair[0]], m_by_u[pair[1]])
    if jsim>0.5:
        true_pairs.append(pair)
true_pairs

[]

Reference:
- 