## Introduction
- About 100.000 users who watched 17.770 distinct movies in total
- Each user watched between 300 and 3000 movies
- The file contains about 65.000.000 records (720 MB) of the form:
`<user_id, movie_id> : “user_id watched movie_id”`
- Similarity between users: Jaccard similarity of sets of movies they watched: 
``` 
jsim(S1, S2) = #intersect(S1, S2)/#union(S1, S2)
``` 
- Task: estimate the running time of a brute-force search over all pairs of users
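
As a small illustration of the similarity measure defined above, here is a minimal sketch on two made-up sets of movie IDs. The function name `jaccard`, the toy data, and the choice to return 0.0 for two empty sets are assumptions for illustration, not part of the original notebook.

```python
def jaccard(s1, s2):
    """Jaccard similarity of two collections: |S1 ∩ S2| / |S1 ∪ S2|."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Both sets empty: the ratio is undefined; 0.0 is a convention chosen here.
        return 0.0
    return len(s1 & s2) / len(union)

# Hypothetical movie IDs watched by two users
user_a = {1, 2, 3, 4}
user_b = {3, 4, 5, 6}

# 2 shared movies out of 6 distinct movies
print(jaccard(user_a, user_b))
```

The two users share 2 of 6 distinct movies, so their similarity is 1/3, below the 0.5 threshold used later in the notebook.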

## Brute-force search

In [1]:
# import packages
import numpy as np
import pandas as pd
from scipy import sparse
import time

In [2]:
#  load the data
FILE = '../data/user_movie.npy'
df = pd.DataFrame(np.load(FILE), columns = ['user','movie'])

In [6]:
# data exploration
n_user = len(df.user.unique())
print("Number of unique users", n_user)
print("UserID starts at", df.user.min())
print("UserID ends at", df.user.max())

n_movie = len(df.movie.unique())
print("Number of unique movies", n_movie)
print("MovieID starts at", df.movie.min())
print("MovieID ends at", df.movie.max())

Number of unique users 103703
UserID starts at 0
UserID ends at 103702
Number of unique movies 17770
MovieID starts at 0
MovieID ends at 17769


There are 103703 unique UserIDs ranging from 0 to 103702, so the UserID attribute is contiguous with no gaps. Likewise, there are 17770 unique MovieIDs ranging from 0 to 17769, so the MovieID attribute is also contiguous with no gaps.
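
The contiguity argument relies on three numbers together: the count of distinct IDs, the minimum, and the maximum. A minimal sketch of that check follows. The helper name `is_contiguous` and the toy arrays are hypothetical, and the check assumes integer IDs that start at 0.

```python
import numpy as np

def is_contiguous(ids):
    """Return True if the distinct integer values in `ids` are exactly 0, 1, ..., k-1."""
    unique_ids = np.unique(ids)  # sorted distinct values
    if unique_ids.size == 0:
        return False
    # k distinct sorted integers with min 0 and max k-1 must be 0..k-1 with no gaps
    return bool(unique_ids[0] == 0 and unique_ids[-1] == unique_ids.size - 1)

print(is_contiguous(np.array([0, 1, 1, 2, 3])))  # True
print(is_contiguous(np.array([0, 1, 3])))        # False
```

On the notebook's data, the same check could be applied to `df.user.values` and `df.movie.values`.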

### Method 1: Calculating the intersection and union sets directly

In [20]:
def calculate_jsim(a,b):
    intersect = set(a) & set(b)
    union = set(a) | set(b)
    jsim = len(intersect)/len(union)
    return jsim

In [22]:
t0 = time.time()

n_sample = 1000

df_sample = df[df.user < n_sample]
#df_sample = df[1:800]
m_by_u = df_sample.groupby('user')['movie'].apply(list)
user = df_sample.user.unique().tolist()
pair=[]
for i in user:
    for j in user:
        if i != j:
            jsim = calculate_jsim(m_by_u[i],m_by_u[j])
            #print(i,j,jsim)
            if jsim > 0.5:
                pair.append([i,j,jsim])
                
sample_time = round(time.time()-t0, 3)
print("Running time:", sample_time, "s" )

Running time: 225.056 s


### Method 2: |Union| = |A| + |B| - |Intersection|

In [4]:
def calculate_jsim2(a,b):
    intersect = len(set(a) & set(b))
    union = len(a) + len(b) - intersect
    jsim = intersect/union
    return jsim

In [5]:
t0 = time.time()

n_sample = 1000

df_sample = df[df.user < n_sample]
#df_sample = df[1:800]
m_by_u = df_sample.groupby('user')['movie'].apply(list)
user = df_sample.user.unique().tolist()
pair=[]
for i in user:
    for j in user:
        if i != j:
            jsim = calculate_jsim2(m_by_u[i],m_by_u[j])
            #print(i,j,jsim)
            if jsim > 0.5:
                pair.append([i,j,jsim])
                
sample_time = round(time.time()-t0, 3)
print("Running time:", sample_time, "s" )

Running time: 113.871 s


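Method 2 runs in about half the time of Method 1 (113.9 s versus 225.1 s on the 1000-user sample), because it builds only the intersection set and derives the size of the union arithmetically.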
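
The shortcut in Method 2 is only exact when neither input contains duplicates, since `len()` on a list counts repeated entries. The sketch below compares the two formulas on made-up data. The names `jsim_sets` and `jsim_counts` and the toy inputs are hypothetical, and whether the real file contains duplicate `<user_id, movie_id>` records is not checked here.

```python
def jsim_sets(a, b):
    """Reference version: build the intersection and the union explicitly."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jsim_counts(a, b):
    """Inclusion-exclusion version; exact only when a and b are sets."""
    intersect = len(a & b)
    return intersect / (len(a) + len(b) - intersect)

# On true sets both versions agree (2 shared movies, 5 distinct movies)
a = {1, 2, 3, 4}
b = {3, 4, 5}
assert jsim_sets(a, b) == jsim_counts(a, b)

# With a duplicate in a list, len() overcounts and the shortcut drifts
dup = [1, 2, 2, 3]
other = [2, 3, 4]
intersect = len(set(dup) & set(other))
shortcut = intersect / (len(dup) + len(other) - intersect)
print(shortcut, jsim_sets(dup, other))  # 0.4 0.5
```

Converting each user's list to a set once, before the pairwise loop, would make the shortcut safe regardless of duplicates. Its effect on the running time was not measured in this notebook.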

## Estimate running time for the whole dataset

In [7]:
# The nested loop compares every user with every other user,
# so the work grows quadratically with the number of users.
multiple = n_user*n_user/n_sample/n_sample
print("The number of sample users is:", n_sample,"\nAnd the number of total users is:", n_user)
print("Since the running time is O(n^2), the total running time is {0:.2f} times the sample running time".format(multiple))

total_time = sample_time * multiple 
print("Total running time is {0:.2f} hours".format(total_time/60/60))

The number of sample users is: 1000 
And the number of total users is: 103703
Since the running time is O(n^2), the total running time is 10754.31 times the sample running time
Total running time is 340.17 hours


**Brute-force search takes about 340 hours. This is very expensive because more than 5.000.000.000 user pairs have to be compared.**
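
The pair count and the extrapolation can be reproduced with a few lines of arithmetic. The sketch below takes the user count and the Method 2 sample time from the outputs above; the variable names are illustrative. It also separates ordered pairs, which the nested loop with `i != j` visits, from unordered pairs. It assumes the time per comparison stays constant as the dataset grows, which is a simplification.

```python
import math

n_user = 103703        # from the data exploration output
n_sample = 1000        # users in the timed sample
sample_time = 113.871  # seconds, Method 2 on the sample

# The nested loop with `i != j` visits every ordered pair (i, j) and (j, i)
ordered_pairs = n_user * (n_user - 1)
# Jaccard similarity is symmetric, so each unordered pair is needed only once
unordered_pairs = math.comb(n_user, 2)
print(unordered_pairs)  # 5377104253

# Time per comparison, estimated from the sample run
sec_per_pair = sample_time / (n_sample * (n_sample - 1))

hours_ordered = sec_per_pair * ordered_pairs / 3600
hours_unordered = sec_per_pair * unordered_pairs / 3600
print(round(hours_ordered), round(hours_unordered))
```

Under these assumptions the ordered-pair estimate lands close to the 340 hours reported above, and visiting each unordered pair once (for example with `itertools.combinations`) would roughly halve it. Either way, brute force remains impractical.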