# User Similarity Using Jaccard, MinHash & LSH

> *Data Mining*  
> *MSc in Data Science, Department of Informatics*  
> *Athens University of Economics and Business*

---

1) **Compute exact Jaccard similarity of users**

Download the MovieLens dataset. To assess the similarity between users, compute the exact Jaccard similarity for all pairs of users and output only the unique pairs of users whose similarity is at least 0.5 (>= 50%). For each pair, report the two user ids and the similarity score.
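
As a small illustration of the measure itself, the sketch below computes the Jaccard similarity of two sets. The helper name `jaccard` and the toy sets are hypothetical and are not part of the notebook code.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Toy data: two hypothetical users and the movie ids they rated
alice = {1, 2, 3, 4}
bob = {3, 4, 5, 6}

# 2 shared movies out of 6 distinct movies
print(jaccard(alice, bob))
```

A pair like this one would not be reported, since its similarity is below the 0.5 threshold.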

3) **Compute similarity using Min-hash signatures**

In this step you compute min-hash signatures for each user and use them to estimate their similarity.

Description of hash functions: use the following family of hash functions: $h_{a,b}(x) = (ax + b) \bmod R$, with $a, b$ random integers in the interval $(0, R)$ and $R$ a large enough prime number that you may want to fine-tune in your initial experimentation. Make sure that each hash function uses a different $(a, b)$ pair.

Evaluation of min-hashing: use 50, 100, and 200 hash functions. For each value, output the pairs of users that have estimated similarity at least 0.5, and report the number of false positives and false negatives (against the exact Jaccard similarity) that you obtain. For the false positives and negatives, report the averages over 5 different runs using different hash functions.
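
The description above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the modulus `2**31 - 1` (a Mersenne prime) is an assumed choice for R, and the helper names `make_hash_family`, `minhash_signature` and `estimate_similarity` are hypothetical. The estimate used here is the fraction of signature positions on which two users agree.

```python
import random

# Assumed modulus: 2**31 - 1 is a Mersenne prime, well above any MovieLens movie id
R = 2**31 - 1

def make_hash_family(k, R, seed=None):
    """Draw k distinct (a, b) pairs with 0 < a, b < R (hypothetical helper)."""
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < k:
        pairs.add((rng.randrange(1, R), rng.randrange(1, R)))
    return sorted(pairs)

def minhash_signature(items, family, R):
    """For each (a, b), keep the minimum of (a*x + b) mod R over the set."""
    return [min((a * x + b) % R for x in items) for a, b in family]

def estimate_similarity(sig1, sig2):
    """Fraction of positions on which two signatures agree."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

# Toy sets of movie ids with exact Jaccard similarity 100 / 300
ids = random.Random(1).sample(range(1, 1683), 300)
set_a, set_b = set(ids[:200]), set(ids[100:])

family = make_hash_family(200, R, seed=0)
estimate = estimate_similarity(minhash_signature(set_a, family, R),
                               minhash_signature(set_b, family, R))
print(f'exact = {100 / 300:.3f}, estimated = {estimate:.3f}')
```

Collecting the pairs in a set is one simple way to guarantee that no two hash functions share the same $(a, b)$ pair.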

4) **Locate similar users using LSH index**

Using a set of 200 hash functions, break up the signatures into $b$ bands with $r$ hash functions per band ($b \times r = 200$) and implement *Locality Sensitive Hashing*.
Recall that with LSH we first locate users that are similar (have the same mini-signatures) in at least one band and then assess their true similarity using their initial representations. Use the following two instances of LSH:

- LSH instance 1: b = 25, r = 8
- LSH instance 2: b = 40, r = 5

Using each instance, find the pairs of users with similarity at least 0.5 and report:

- The number of true pairs returned (true positives).
- The number of similarity evaluations performed using the initial representations.

Report the averages over 5 different runs using different hash functions.
Based on the reported results, what do we gain/lose by using LSH instead of directly comparing users on their true representations?
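
To get a feel for the two instances before running them, the standard banding analysis can be sketched as below. It assumes the idealised model in which each signature position agrees with probability equal to the Jaccard similarity $s$, independently of the others, so that a pair becomes a candidate with probability $1 - (1 - s^r)^b$. The function name `candidate_probability` is hypothetical.

```python
def candidate_probability(s, b, r):
    """Probability that two users with Jaccard similarity s agree on all r rows
    of at least one of the b bands (idealised, independent hash functions)."""
    return 1.0 - (1.0 - s**r) ** b

for b, r in [(25, 8), (40, 5)]:
    probs = {s: round(candidate_probability(s, b, r), 3) for s in (0.3, 0.5, 0.7, 0.9)}
    print(f'b = {b}, r = {r}: {probs}')
```

Under this model, a pair sitting exactly at the 0.5 threshold is a candidate with probability of roughly 9% for instance 1 and roughly 72% for instance 2, which suggests that instance 2 should find more of the similar pairs at the cost of more evaluations.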

### *Libraries*

In [1]:
import pandas as pd
import random
import numpy as np
import os
from itertools import combinations
from collections import defaultdict

import time

### *Data*

- We will use the MovieLens dataset.
- The dataset pertains to movie ratings provided by users.
- It comprises 100,000 ratings (1-5) from 943 users on 1682 movies.
- Each user has rated a minimum of 20 movies.
- The dataset is distributed across three files: users.txt, movies.txt, and ratings.txt.
  - users.txt: Contains id, age, gender, occupation, and postcode separated by |.
  - movies.txt: Includes id, title (with release year), and additional unrelated information separated by |.
  - ratings.txt: Tab-separated file containing userid, movieid, rating (1-5), and timestamp.
- For this assignment, only the set of movies that a user has rated will be used, excluding the ratings themselves.

In [2]:
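# folder that holds the three dataset files
data_dir = os.path.join(os.getcwd(), 'Movie Lens Dataset')

# ratings
ratings_df = pd.read_csv(os.path.join(data_dir, 'ratings.txt'), sep='\t', header=None,
                         names=['userId', 'movieId', 'rating', 'timestamp'])

# movies
# latin-1 decodes the file on any platform (the 'ANSI' codec alias exists only on Windows)
movies_df = pd.read_csv(os.path.join(data_dir, 'movies.txt'), sep='|', header=None, usecols=[0, 1],
                        encoding='latin-1', names=['movieId', 'Title']).set_index('movieId')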

In [3]:
# Get the set of distinct user IDs from the 'userId' column in the ratings DataFrame
distinct_users = set(ratings_df['userId'])

# Dictionary to store user IDs as keys and the set of movie IDs they have rated as values
user_films = {}

# Iterate through each distinct user
for user in distinct_users:
    # Extract movie IDs that the current user has rated and convert to a list
    my_films = ratings_df[ratings_df['userId'] == user]['movieId'].values.tolist()
    
    # Store the set of movie IDs in the user_films dictionary
    user_films[user] = set(my_films)
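
The loop above filters the whole ratings table once per user. As an alternative sketch, the same mapping can be built in a single pass over the (user, movie) rows. The toy rows below are made up for illustration; on the real data the rows would presumably come from `zip(ratings_df['userId'], ratings_df['movieId'])`.

```python
from collections import defaultdict

# Toy (userId, movieId) rows standing in for the two ratings columns
toy_rows = [(1, 10), (1, 20), (2, 10), (2, 30), (2, 40), (3, 20)]

# One pass: add each movie id to the set of the user who rated it
toy_user_films = defaultdict(set)
for user_id, movie_id in toy_rows:
    toy_user_films[user_id].add(movie_id)

print(dict(toy_user_films))
```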

## *Compute Exact Jaccard Similarity*

##### *Compute user similarity using Jaccard coefficient*

In [4]:
def user_jaccard_similarity(user_films, threshold_similarity=0.5):
    """
    Calculate Jaccard similarity between pairs of users based on their rated movies.

    Parameters:
    - user_films: Dictionary with user IDs as keys and sets of movie IDs they have rated as values.
    - threshold_similarity: Minimum Jaccard similarity threshold to print pairs.

    Returns:
    - usim: Dictionary containing Jaccard similarities between user pairs.
    - max_pair: Tuple containing the user pair with the maximum Jaccard similarity.
    - max_jacc: The maximum Jaccard similarity found.
    """

    # Generate pairs of user combinations
    pairs = list(combinations(list(user_films.keys()), 2))

    # Dictionary to store Jaccard similarities between user pairs
    usim = defaultdict(dict)

    # Variable to track the maximum Jaccard similarity
    max_jacc = 0
    max_pair = None

    # Iterate through pairs of users
    for u1, u2 in pairs:
        # Calculate Jaccard similarity
        union = user_films[u1].union(user_films[u2])
        intersection = user_films[u1].intersection(user_films[u2])
        jaccard_similarity = len(intersection) / len(union)
        
        # Store Jaccard similarity in the usim dictionary
        usim[u1][u2] = jaccard_similarity

        # Print user pairs with Jaccard similarity above the threshold
        if jaccard_similarity >= threshold_similarity:
            print(f'User Pair: ({u1}, {u2}) Jaccard Similarity: {jaccard_similarity}')

        # Update maximum Jaccard similarity and corresponding pair
        if jaccard_similarity > max_jacc:
            max_jacc = jaccard_similarity
            max_pair = (u1, u2)

    return usim, max_pair, max_jacc

usim_result, max_jacc_pair, max_jacc = user_jaccard_similarity(user_films)

User Pair: (197, 600) Jaccard Similarity: 0.5
User Pair: (197, 826) Jaccard Similarity: 0.512987012987013
User Pair: (328, 788) Jaccard Similarity: 0.6729559748427673
User Pair: (408, 898) Jaccard Similarity: 0.8387096774193549
User Pair: (451, 489) Jaccard Similarity: 0.5333333333333333
User Pair: (489, 587) Jaccard Similarity: 0.6299212598425197
User Pair: (554, 764) Jaccard Similarity: 0.5170068027210885
User Pair: (600, 826) Jaccard Similarity: 0.5454545454545454
User Pair: (674, 879) Jaccard Similarity: 0.5217391304347826
User Pair: (800, 879) Jaccard Similarity: 0.5


In [5]:
# Calculate the set of all films and common films for the most similar user pair
all_films = user_films[max_jacc_pair[0]].union(user_films[max_jacc_pair[1]])
common_films = user_films[max_jacc_pair[0]].intersection(user_films[max_jacc_pair[1]])

# Print the most similar user pair and Jaccard similarity
print(f'Most Similar User Pair: ({max_jacc_pair[0]}, {max_jacc_pair[1]}) Jaccard Similarity: {max_jacc}\n')

# Print common films
print('Common Films:\n')
for film in common_films:
    print(movies_df.loc[film]['Title'])

# Print every film rated by at least one of the two users (the union of their sets)
print('\n\nUnique Films:\n')
for film in all_films:
    print(movies_df.loc[film]['Title'])

Most Similar User Pair: (408, 898) Jaccard Similarity: 0.8387096774193549

Common Films:

Contact (1997)
Gattaca (1997)
Starship Troopers (1997)
Indian Summer (1996)
Good Will Hunting (1997)
Mouse Hunt (1997)
English Patient, The (1996)
Scream (1996)
Rocket Man (1997)
Air Force One (1997)
L.A. Confidential (1997)
Jackal, The (1997)
Rainmaker, The (1997)
Midnight in the Garden of Good and Evil (1997)
Titanic (1997)
Apt Pupil (1998)
Everyone Says I Love You (1996)
Lost Highway (1997)
Cop Land (1997)
Conspiracy Theory (1997)
U Turn (1997)
Wag the Dog (1997)
Spawn (1997)
Saint, The (1997)
Tomorrow Never Dies (1997)
Kolya (1996)


Unique Films:

Contact (1997)
Gattaca (1997)
Starship Troopers (1997)
Indian Summer (1996)
Good Will Hunting (1997)
Mouse Hunt (1997)
English Patient, The (1996)
Scream (1996)
Liar Liar (1997)
Rocket Man (1997)
Air Force One (1997)
L.A. Confidential (1997)
Jackal, The (1997)
Deceiver (1997)
Rainmaker, The (1997)
Midnight in the Garden of Good and Evil (1997)
Titan

## *Compute Similarity Using MinHash Signatures* 

##### *Compute user similarity using MinHash signatures*

In [15]:
# Function to create a universal hash function
def universal_hash(p, a, b):
    # Return a lambda function representing the hash function
    return lambda x: (a * x + b) % p

# Function to get a random hash function with given prime 'p'
def get_random_hash_fn(p):
    # Generate random coefficients 'a' and 'b' within the range [1, p-1] and [0, p-1] respectively
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    # Return a universal hash function with the generated coefficients
    return universal_hash(p, a, b)

def generate_min_hash_signatures(user_films, num_hashes, R):
    hash_functions = [get_random_hash_fn(R) for _ in range(num_hashes)]

    min_hash_signatures = {}
    for user, films in user_films.items():
        signature = [float('inf')] * num_hashes
        for film in films:
            for i, hash_fn in enumerate(hash_functions):
                h = hash_fn(film)
                if h < signature[i]:
                    signature[i] = h
        min_hash_signatures[user] = signature

    return min_hash_signatures

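def compute_minhash_similarity(u1, u2, min_hash_signatures):
    signature1 = min_hash_signatures[u1]
    signature2 = min_hash_signatures[u2]
    # Number of signature positions on which the two users agree
    common_hashes = sum(s1 == s2 for s1, s2 in zip(signature1, signature2))
    # Number of distinct hash values across both signatures. Dividing by this
    # count rather than by the signature length gives roughly c / (2k - c) for
    # c matches out of k hashes, which is lower than the usual MinHash
    # estimate c / k whenever 0 < c < k.
    union_hashes = len(set(signature1 + signature2))
    similarity = common_hashes / union_hashes
    return similarity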

def evaluate_min_hashing(num_hashes, user_films, usim, print_flag = True, similarity_threshold=0.5):
    pairs = list(combinations(list(user_films.keys()), 2))

    # Generate MinHash signatures for all users.
    # The task asks for a prime modulus R; 2**31 - 1 is a Mersenne prime.
    min_hash_signatures = generate_min_hash_signatures(user_films, num_hashes, 2**31 - 1)

    False_Positives = 0
    False_Negatives = 0
    Similar_Pairs = []

    if (print_flag):
        print(f'Similar Pairs ({num_hashes} Hash Functions)')
        print('='*100) 
    
    for u1, u2 in pairs:
        # Compute MinHash similarity between two users
        minhash_similarity = compute_minhash_similarity(u1, u2, min_hash_signatures)

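        # Count over-estimates (as false positives) and under-estimates (as false
        # negatives) relative to the exact Jaccard similarity rounded to 2 decimals.
        # These counters compare the two scores directly for every pair; they do
        # not check on which side of the similarity threshold each score falls.
        if minhash_similarity > round(usim[u1][u2], 2):
            False_Positives += 1
        elif minhash_similarity < round(usim[u1][u2], 2):
            False_Negatives += 1   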
            
        if minhash_similarity >= similarity_threshold:
            
            Similar_Pairs.append((u1, u2, minhash_similarity))

            if (print_flag):
                print(f'User Pair: ({u1}, {u2}) MinHash Similarity: {minhash_similarity}'
                      f' Jaccard Similarity: {round(usim[u1][u2], 2)}')

    if (print_flag):
        print('\n')
        print(f'FP, FN ({num_hashes} Hash Functions)')
        print('='*100)
        print(f'\nFalse Positives ({num_hashes} Hash Functions) : {False_Positives}') 
        print(f'False Negatives   ({num_hashes} Hash Functions) : {False_Negatives}') 
        
    return False_Positives, False_Negatives, Similar_Pairs
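
For comparison with the cell above, here is a hedged sketch of the more common formulation: the MinHash estimate taken as the fraction of matching signature positions, and false positives/negatives counted only when the estimate and the exact similarity fall on different sides of the threshold. The function names and the toy values are hypothetical and are not taken from the notebook.

```python
def positional_agreement(sig1, sig2):
    """Standard MinHash estimate: matching positions divided by signature length."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

def threshold_errors(exact, estimated, threshold=0.5):
    """Count false positives and false negatives at the threshold.

    exact, estimated: dicts mapping a user pair to its similarity (same keys).
    """
    fp = sum(1 for pair, est in estimated.items()
             if est >= threshold and exact[pair] < threshold)
    fn = sum(1 for pair, est in estimated.items()
             if est < threshold and exact[pair] >= threshold)
    return fp, fn

# Toy signatures: 3 of 4 positions agree
sig_u, sig_v = [1, 2, 3, 4], [1, 2, 3, 9]
print(positional_agreement(sig_u, sig_v))   # 3 / 4
print(3 / len(set(sig_u + sig_v)))          # 3 / 5, the normalisation used in the cell above

# Toy similarity scores for three hypothetical pairs
exact = {(1, 2): 0.60, (1, 3): 0.20, (2, 3): 0.55}
estimated = {(1, 2): 0.58, (1, 3): 0.52, (2, 3): 0.45}
print(threshold_errors(exact, estimated))   # (1, 1)
```

With this definition the counts are bounded by the number of pairs near the threshold, so they would be expected to be far smaller than the totals reported in this notebook.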

##### *Report the average number of False Positives (FP) and False Negatives (FN) for 5 different runs using different functions*

In [16]:
def mulpiple_min_hash_evaluations(user_films, usim_result, hash_num, eval_runs = 5):
    
    # initialize empty lists
    # to store FP and FN values
    false_positives = []
    false_negatives = []
    
    for i in range(eval_runs):
        
        # compute FP FN and similar pairs
        FP, FN, Similar_Pairs = evaluate_min_hashing(hash_num, user_films, usim_result, False)
        
        # append the values from each iteration
        false_positives.append(FP)
        false_negatives.append(FN)
        
    # calculate the averages
    FP_avg = round(np.mean(false_positives))
    FN_avg = round(np.mean(false_negatives))
    
    # print
    print(f'Average False Positives ({hash_num} Hash Functions) : {FP_avg}')
    print(f'Average False Negatives ({hash_num} Hash Functions) : {FN_avg}')
    
    return FP_avg, FN_avg

### *Using 50 Hash Functions*

In [17]:
False_Positives_50, False_Negatives_50, Similar_Pairs_50 = evaluate_min_hashing(50, user_films, usim_result)

Similar Pairs (50 Hash Functions)
User Pair: (408, 898) MinHash Similarity: 0.7241379310344828 Jaccard Similarity: 0.84
User Pair: (489, 587) MinHash Similarity: 0.5384615384615384 Jaccard Similarity: 0.63


FP, FN (50 Hash Functions)

False Positives (50 Hash Functions) : 60212
False Negatives   (50 Hash Functions) : 366821


In [18]:
# start time
st = time.time()

FP_avg_50, FN_avg_50 = mulpiple_min_hash_evaluations(user_films, usim_result, 50)

# end time
et = time.time()

print(f'\nElapsed time: {round(et-st)} secs.')

Average False Positives (50 Hash Functions) : 43591
Average False Negatives (50 Hash Functions) : 383539

Elapsed time: 45 secs.


### *Using 100 Hash Functions*

In [19]:
False_Positives_100, False_Negatives_100, Similar_Pairs_100 = evaluate_min_hashing(100, user_films, usim_result)

Similar Pairs (100 Hash Functions)
User Pair: (408, 898) MinHash Similarity: 0.7857142857142857 Jaccard Similarity: 0.84


FP, FN (100 Hash Functions)

False Positives (100 Hash Functions) : 12967
False Negatives   (100 Hash Functions) : 414316


In [20]:
# start time
st = time.time()

FP_avg_100, FN_avg_100 = mulpiple_min_hash_evaluations(user_films, usim_result, 100)

# end time
et = time.time()

print(f'\nElapsed time: {round(et-st)} secs.')

Average False Positives (100 Hash Functions) : 21937
Average False Negatives (100 Hash Functions) : 405578

Elapsed time: 70 secs.


### *Using 200 Hash Functions*

In [21]:
False_Positives_200, False_Negatives_200, Similar_Pairs_200 = evaluate_min_hashing(200, user_films, usim_result)

Similar Pairs (200 Hash Functions)
User Pair: (408, 898) MinHash Similarity: 0.6460905349794238 Jaccard Similarity: 0.84


FP, FN (200 Hash Functions)

False Positives (200 Hash Functions) : 5883
False Negatives   (200 Hash Functions) : 422165


In [80]:
# start time
st = time.time()

FP_avg_200, FN_avg_200 = mulpiple_min_hash_evaluations(user_films, usim_result, 200)

# end time
et = time.time()

print(f'\nElapsed time: {round(et-st)} secs.')

Average False Positives (200 Hash Functions) : 8967
Average False Negatives (200 Hash Functions) : 419150

Elapsed time: 103 secs.


### *Comments* 

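- **What the counters measure:**
  - In `evaluate_min_hashing`, a pair counts as a false positive when its MinHash score is above the exact Jaccard similarity (rounded to two decimals) and as a false negative when it is below. The totals therefore reflect over- and under-estimation across all user pairs, not only errors around the 0.5 threshold.

- **Number of Hash Functions and False Positives:**
  - **Decreasing False Positives:** The average number of false positives drops from 43591 (50 hash functions) to 21937 (100) and 8967 (200).
    - More hash functions give a less noisy estimate, making it less likely for dissimilar sets to have a high fraction of matching hash values by chance.
  - **Improved Precision:** Higher precision means that sets with a MinHash similarity above the threshold are more likely to be truly similar. Every pair printed above with a MinHash similarity of at least 0.5 also has an exact Jaccard similarity above 0.5.

- **Number of Hash Functions and False Negatives:**
  - **Increasing False Negatives:** Conversely, the average number of false negatives rises from 383539 (50 hash functions) to 405578 (100) and 419150 (200).
    - `compute_minhash_similarity` divides the number of matching positions by the number of distinct values in both signatures, which pulls the score below the fraction of matching positions.
    - With fewer chance over-estimates, more pairs end up below their exact similarity, and only one or two of the ten pairs with exact similarity of at least 0.5 are reported in the runs shown above.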
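
The effect of the number of hash functions on the noise of the estimate can be quantified with a short calculation. The sketch below assumes the idealised model in which each of the $k$ signature positions matches independently with probability equal to the true similarity $s$, so the fraction of matching positions has standard deviation $\sqrt{s(1-s)/k}$. The function name `minhash_std_error` is hypothetical.

```python
import math

def minhash_std_error(s, k):
    """Standard deviation of the fraction of matching positions for true
    similarity s and k independent hash functions (binomial model)."""
    return math.sqrt(s * (1 - s) / k)

# Noise of the estimate for a pair sitting exactly at the 0.5 threshold
for k in (50, 100, 200):
    print(f'k = {k}: std error = {minhash_std_error(0.5, k):.3f}')
```

Quadrupling the number of hash functions halves the standard deviation, so going from 50 to 200 functions roughly halves the noise around the threshold.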

## *Locate Similar Users Using LSH Index*

##### *Locate similar users using LSH index*

In [52]:
def create_signature_table(signature_matrix, num_bands):
    """
    Create a signature table from the given MinHash signatures.

    Parameters:
    - signature_matrix: 3D array representing MinHash signatures (users, bands, rows)
    - num_bands: Number of bands used in the LSH process

    Returns:
    - List of dictionaries representing signature tables for each band
    """
    users = signature_matrix.shape[0]
    bands = signature_matrix.shape[1]
    
    signature_tables = [{} for _ in range(num_bands)]
    
    for i in range(users):
        for b in range(bands):
            band = signature_matrix[i, b, :]
            hash_value = tuple(band)
            
            if hash_value not in signature_tables[b]:
                signature_tables[b][hash_value] = []
            
            signature_tables[b][hash_value].append(i)
            
    return signature_tables


def LSH_similar_users(user_films, signature_table, hash_tables, threshold):
    """
    Find similar users using Locality-Sensitive Hashing (LSH) on MinHash signatures.

    Parameters:
    - user_films: Dictionary mapping user IDs to sets of movies
    - signature_table: 3D array representing MinHash signatures (users, bands, rows)
    - hash_tables: List of dictionaries representing hash tables for each band
    - threshold: Jaccard similarity threshold for considering users as similar

    Returns:
    - Dictionary of similar user pairs with Jaccard similarity scores
    - Count of true similar user pairs
    - Count of total evaluations performed
    """
    users = signature_table.shape[0]
    bands = signature_table.shape[1]
    
    similar_users = {}
    true_pairs = 0
    evaluations = 0
    
    # Row index i of the signature array corresponds to user id i + 1
    # (user ids are assumed to be 1..n, in that order)
    for i in range(users):
        for j in range(i + 1, users):
            for b in range(bands):
                
                # Mini-signature of user i in band b
                band = signature_table[i, b, :]
                hash_value = tuple(band)

                if hash_value in hash_tables[b]:                    
                    # User j shares the bucket, so (i, j) is a candidate pair
                    if j in hash_tables[b][hash_value]:
                        
                        union = user_films[i + 1].union(user_films[j + 1])
                        inter = user_films[i + 1].intersection(user_films[j + 1])

                        jacc = len(inter) / len(union)
                
                        evaluations += 1
                        
                        if jacc >= threshold:                           
                            true_pairs += 1
                            key = str(i + 1) + "_" + str(j + 1)
                    
                            similar_users[key] = jacc
                            
                            # Stop checking further bands once the pair is confirmed.
                            # A pair below the threshold is evaluated again for every
                            # additional band in which it collides.
                            break
                        
    similar_users = sorted(similar_users.items(), key=lambda x: x[1], reverse=True)
    
    return dict(similar_users), true_pairs, evaluations

def run_lsh_repeated(user_films, b, r, num_runs=5, threshold=0.5):
    """
    Run the LSH evaluation several times with fresh hash functions and collect the results.

    Parameters:
    - user_films: Dictionary mapping user IDs to sets of movies
    - b: Number of bands
    - r: Number of rows in each band
    - num_runs: Number of runs for LSH evaluation
    - threshold: Jaccard similarity threshold for considering users as similar

    Returns:
    - List of true positive counts, one per run
    - List of total evaluations (candidate pairs), one per run
    """
    True_Pairs = []
    Candidates_Cnt = []

    for _ in range(num_runs):

        # Each call draws new random hash functions and builds b * r signatures
        # (run_lsh_evaluation is defined below)
        _, true_pairs, evaluations = run_lsh_evaluation(user_films, b * r, b, r, threshold)

        # Collect results for each run
        True_Pairs.append(true_pairs)
        Candidates_Cnt.append(evaluations)

    return True_Pairs, Candidates_Cnt

def run_lsh_evaluation(user_films, num_hashes, b, r, threshold=0.5):
    """
    Run Locality-Sensitive Hashing (LSH) evaluation using MinHash signatures.

    Parameters:
    - user_films: Dictionary mapping user IDs to sets of movies
    - num_hashes: Number of hash functions used in generating MinHash signatures
    - b: Number of bands in the LSH process
    - r: Number of rows in each band
    - threshold: Jaccard similarity threshold for considering users as similar

    Returns:
    - Dictionary of similar user pairs with Jaccard similarity scores
    - Count of true similar user pairs
    - Count of total evaluations performed
    """
    # Generate MinHash signatures for all users.
    # The task asks for a prime modulus R; 2**31 - 1 is a Mersenne prime.
    min_hash_signatures = generate_min_hash_signatures(user_films, num_hashes, 2**31 - 1)

    # Create a 3D array representing MinHash signatures
    signature_matrix = np.array(list(min_hash_signatures.values())).reshape(len(min_hash_signatures), b, r)

    # Create signature tables from the MinHash signatures
    sig_tables = create_signature_table(signature_matrix, b)

    # Create hash tables for each band
    hash_tables = [defaultdict(list) for _ in range(b)]
    for i, sig_table in enumerate(sig_tables):
        for hash_value, users in sig_table.items():
            hash_tables[i][hash_value] = users

    # Run LSH to find similar users
    similar_users, true_pairs, evaluations = LSH_similar_users(user_films, signature_matrix, hash_tables, threshold)

    return similar_users, true_pairs, evaluations
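
`LSH_similar_users` above loops over every pair of users and looks each pair up in the band tables. A common alternative, sketched below under stated assumptions, is to read the candidate pairs directly from the buckets, so that only users sharing a bucket are ever paired and each candidate is evaluated once. The function names and the toy data are hypothetical; signatures are assumed to be stored as a dictionary from user id to a flat list of `b * r` values.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """Return the set of pairs (u, v), u < v, that share a bucket in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for user, sig in signatures.items():
            # The mini-signature of this band is used as the bucket key
            buckets[tuple(sig[band * r:(band + 1) * r])].append(user)
        for users in buckets.values():
            candidates.update(combinations(sorted(users), 2))
    return candidates

def verify_candidates(candidates, user_films, threshold=0.5):
    """Evaluate each candidate once on the original sets and keep the true pairs."""
    similar = {}
    for u, v in candidates:
        jacc = len(user_films[u] & user_films[v]) / len(user_films[u] | user_films[v])
        if jacc >= threshold:
            similar[(u, v)] = jacc
    return similar, len(candidates)

# Toy signatures with b = 2 bands of r = 2 rows
toy_signatures = {1: [5, 7, 2, 9], 2: [5, 7, 3, 8], 3: [1, 4, 2, 9], 4: [0, 0, 0, 0]}
toy_films = {1: {1, 2, 3}, 2: {1, 2, 3, 4}, 3: {7, 8}, 4: {9}}

toy_candidates = lsh_candidates(toy_signatures, b=2, r=2)
print(sorted(toy_candidates))                        # [(1, 2), (1, 3)]
print(verify_candidates(toy_candidates, toy_films))  # ({(1, 2): 0.75}, 2)
```

Because the candidates are kept in a set, a pair that collides in several bands is still counted as a single evaluation.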

### *LSH Instance 1* 

- $b = 25$
- $r = 8$

In [60]:
# start time
st = time.time()

# initialization
b = 25
r = 8

# lists to hold values for each iteration
True_Pairs = []
Candidates = []

# five runs
for i in range(5):
    
    similar_users, true_pairs, evaluations = run_lsh_evaluation(user_films, 200, b, r, threshold=0.5)
    
    #append 
    True_Pairs.append(true_pairs)
    Candidates.append(evaluations)
    
# end time
et = time.time()

print(f'True Pairs Average (b = {b}, r = {r}): {sum(True_Pairs) / len(True_Pairs)}')
print(f'Evaluations Average (b = {b}, r = {r}): {sum(Candidates) / len(Candidates)}')

print(f'\nElapsed time: {round(et-st)} secs.')

True Pairs Average (b = 25, r = 8): 2.2
Evaluations Average (b = 25, r = 8): 50.0

Elapsed time: 135 secs.


### *LSH Instance 2*  <a class='anchor' id='lsh_instance_2'></a>

- $b = 40$
- $r = 5$

In [61]:
# start time
st = time.time()

# initialization
b = 40
r = 5

# lists to hold values for each iteration
True_Pairs = []
Candidates = []

# five runs
for i in range(5):
    
    similar_users, true_pairs, evaluations = run_lsh_evaluation(user_films, 200, b, r, threshold=0.5)
    
    #append 
    True_Pairs.append(true_pairs)
    Candidates.append(evaluations)
    
# end time
et = time.time()

print(f'True Pairs Average (b = {b}, r = {r}): {sum(True_Pairs) / len(True_Pairs)}')
print(f'Evaluations Average (b = {b}, r = {r}): {sum(Candidates) / len(Candidates)}')

print(f'\nElapsed time: {round(et-st)} secs.')

True Pairs Average (b = 40, r = 5): 8.2
Evaluations Average (b = 40, r = 5): 2751.6

Elapsed time: 173 secs.


### *Comments* 

1. **True Pairs Average and Evaluations Average:**
   - Both instances use 200 hash functions, so moving from instance 1 to instance 2 increases `b` (25 to 40) while decreasing `r` (8 to 5).
   - The average number of true pairs found rises from 2.2 to 8.2, out of the 10 pairs with exact Jaccard similarity of at least 0.5.
   - The average number of evaluations (comparisons between pairs of users on their initial representations) rises from 50.0 to 2751.6.

2. **Effect of `b` (Number of Bands):**
   - Two users become a candidate pair if they agree on all rows of at least one band, so more bands give each pair more chances to collide.
   - A higher `b` therefore increases recall (more of the truly similar pairs are found) at the cost of more candidate pairs to evaluate.

3. **Effect of `r` (Number of Rows in Each Band):**
   - A band matches only if all `r` of its values agree, so a higher `r` makes each band more selective.
   - A higher `r` reduces the number of candidate pairs (fewer evaluations, higher precision) but makes it more likely that truly similar pairs are missed.

4. **Trade-off between Precision and Recall:**
   - There is a trade-off between precision (the share of evaluated candidates that are truly similar) and recall (the share of truly similar pairs that are found).
   - Higher values of `b` with lower values of `r` (instance 2) lead to higher recall but lower precision, while lower values of `b` with higher values of `r` (instance 1) lead to higher precision but lower recall.

5. **Computational Cost:**
   - A higher number of evaluations generally increases computational cost, which is reflected in the elapsed time (135 secs for instance 1, 173 secs for instance 2).
   - Compared with the 444,153 exact comparisons needed for all pairs of 943 users, LSH evaluates only a small fraction of the pairs. What we lose is completeness: similar pairs that never share a band are not found.
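
The trade-off in the comments above can be made concrete with a short calculation. The sketch below uses the common rule of thumb that the similarity at which a pair becomes more likely than not to be a candidate is approximately $(1/b)^{1/r}$, and compares the reported average evaluation counts with the number of all user pairs. The function name `approx_threshold` is hypothetical; the evaluation counts are the averages printed earlier in this notebook.

```python
from math import comb

def approx_threshold(b, r):
    """Rule-of-thumb similarity threshold of an LSH scheme with b bands of r rows."""
    return (1 / b) ** (1 / r)

# All unordered pairs of the 943 users
total_pairs = comb(943, 2)

# Average number of evaluations reported for each instance
reported_evaluations = {(25, 8): 50.0, (40, 5): 2751.6}

for (b, r), evaluations in reported_evaluations.items():
    share = evaluations / total_pairs
    print(f'b = {b}, r = {r}: threshold ~ {approx_threshold(b, r):.3f}, '
          f'evaluated {share:.4%} of {total_pairs} pairs')
```

Instance 1 has an approximate threshold of about 0.67, well above the 0.5 target, which is consistent with it missing most of the similar pairs. Instance 2 sits at about 0.48, just below the target, and still evaluates well under 1% of all pairs.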