## Overview

This is the code for homework 2 of the Big Data Computing course. The notebook is organized into 10 numbered sections, each addressing one step of using LSH for cosine similarity and of comparing LSH with a plain (exhaustive) similarity computation. You can run all the sections or only some of them by following the instructions in this introduction.

All the sections that involve heavier computations (4, 6, 7, 9, 10) contain the lines of code used to write the result of the corresponding computation to a file. By reading the data back from these files, you can skip the slow portions of the code and run only the ones you need. This notebook comes with all the files already created.

**Disclaimer 1:** This notebook mounts Google Drive on Colab to manage the files. If you use another environment, remember to include the files unless you want to recreate them from scratch. In any case, change the file paths (defined in section 3) to match your setup. If the paths are not set correctly, no computation can be completed.

**Disclaimer 2:** Running the writing lines of this notebook may overwrite or corrupt the files mentioned above. To avoid unintentional execution, all the writing lines are commented out. Uncomment them to perform the actual writing (you can change the names of the destination files to keep the original ones safe).

### Details

1. **Import:** It contains all needed libraries and functions for correctly executing the other sections.
2. **Download dataset:** It contains the code to download and unzip the Kaggle dataset Amazon Fine Food Reviews (https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).
3. **Useful functions and parameters:** It contains functions and variables shared by many following sections, to avoid repeating them several times. Here is also the definition of the file paths.
4. **Preprocessing:** It contains the steps needed to preprocess the data obtained from the dataset.
> 1. **Retrieve data:** It reads data from the 'Reviews.csv' file downloaded from the original dataset.
> 2. **Perform the transformation:** It contains the function 'preprocessing' that performs the actual transformation of the texts. When the function is called, individual preprocessing steps (expand contractions, discard tags, delete punctuation, delete numbers, apply principled, part-of-speech-based lemmatization, and delete stopwords) can be enabled or disabled through the 'options' dictionary passed as a parameter.
> 3. **Write preprocessed data:** It writes the result of the preprocessing steps to a file.
5. **Create tf-idf vectors:** It reads the preprocessed data and creates the vectorized representation. In this section the data are split into different sets for the following computations. In particular, the data are partitioned into the 'test' set (0.5% of the total) and the 'train' set. In addition, the set 'thresh_test' (1% of the total) is created to perform a preliminary analysis for setting the threshold theta above which two documents are considered similar. From these sets, three different sparse tf-idf matrices are obtained.
6. **Set the threshold:** It helps to find the right value for theta
> 1. **Compute similarities:** It computes the similarities between documents of 'thresh_test' (pairwise). Then it writes the result on a file.
> 2. **Read results:** It reads the file containing the previous results.
> 3. **Choose theta:** It analyzes the results to set the value of theta. There are different ways to address this problem. The assumption made here is that at least one document in 'thresh_test' has its most similar document in 'thresh_test'. To avoid cutting out too many results, we select the smallest among the per-document maximum similarity values. In this way every document d (in 'thresh_test' and hopefully in the whole dataset) has at least one similar document d'.
7. **Ground truth:** for each document in 'test', it computes the exact set of similar documents belonging to 'train' set using cosine similarity and the chosen threshold. During the process, it writes the results on a file.
8. **Signature matrix:** It defines the class that computes the signature of a given document d. It has the variable 'projections' (the set of random vectors whose dot products with d are computed) and the function 'generate_signature', which takes d and returns its signature as a sequence of bits. Note that both the Signature class and the following LSH class can be instantiated from scratch or by reusing previously computed variables retrieved from a file (as described in the lines below dedicated to section 9.3).
9. **LSH:** It contains all the needed steps to implement Locality Sensitive Hashing. It does not perform the actual similarity computation (the following section is responsible for this)
> 1. **Create LSH class:** It defines the LSH class. Its main components are the variable 'lst_dict', which stores one dictionary for each band into which the signatures of the 'train' documents are split, and the function 'compute_sim', which takes as input a document d of the 'test' set and computes the approximate set of similar documents using the banding technique.
> 2. **Create new instances:** It creates new instances of the Signature and LSH classes. Then it writes the variables 'projections' and 'lst_dict' to one file.
> 3. **Retrieve old instances:** It retrieves 'projections' and 'lst_dict' from the file in which they are stored and uses them to create instances of the Signature and LSH classes, respectively, with the same parameters as the previous computation.
10. **Comparison:** It returns the final results
> 1. **LSH result:** It performs the LSH computation and compares the set of returned similar documents with the true set of similar documents computed in section 7. It uses the Jaccard coefficient and also measures how fast LSH is compared with the exact approach.
> 2. **Aggregated results:** It shows the average values of the main indicators over the test documents.

Some implementation details are described in the code comments, so they can be understood while reading the code.

### How to run

For each task, the sections strictly necessary to perform it are listed in the order in which they must be run.

* **Preprocess data:** 1 -> 3 -> 4
* **Create tf-idf vectors:** 1 -> 3 -> 5
* **Set threshold (computing pairwise similarities of 'thresh_test' set from scratch):** 1 -> 3 -> 5 -> 6
* **Set threshold (reusing stored data):** 1 -> 3 -> 5 -> 6.2 -> 6.3
* **Compute exact similarities (from scratch):** 1 -> 3 -> 5 -> 7
* **Create instances of Signature class and LSH class (from scratch):** 1 -> 3 -> 5 -> 8 -> 9.1 -> 9.2
* **Create instances of Signature class and LSH class (reusing stored data):** 1 -> 3 -> 5 -> 8 -> 9.1 -> 9.3
* **Use LSH and compare results with ground truth:** 1 -> 3 -> 5 -> 8 -> 9.1 -> 9.3 -> 10.1
* **Get aggregated results (reusing stored data):** 1 -> 3 -> 5 -> 10.2


Final results:

* Average time for exact similarity computation: 73.19 s per test document
* Average time for LSH similarity computation: 27.70 s per test document
* Average Jaccard coefficient: 0.73

## 1) Import

In [None]:
# comment out these lines if you don't use Google Colab
from google.colab import files
from google.colab import drive
drive.mount('/content/drive')

import os
import time

import json
import collections
from zipfile import ZipFile
import linecache

import re
import string
import pandas as pd
import numpy as np
np.seterr(divide='ignore', invalid='ignore')

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## 2) Download dataset

In [None]:
# upload kaggle.json
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [None]:
# download and unzip the dataset
os.environ['KAGGLE_CONFIG_DIR'] = "/content"
!kaggle datasets download -d snap/amazon-fine-food-reviews

with ZipFile('amazon-fine-food-reviews.zip', 'r') as zipObj:
  zipObj.extractall()

Downloading amazon-fine-food-reviews.zip to /content
 95% 231M/242M [00:01<00:00, 180MB/s]
100% 242M/242M [00:01<00:00, 173MB/s]


## 3) Useful functions and parameters

Some elements defined in this section may be overridden by running the code in the following sections (e.g., theta)

In [None]:
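# compute the angular similarity between two vectors (1 - angle/pi, derived from the cosine)
def cos_sim(u, v):
  norm = np.linalg.norm(u) * np.linalg.norm(v)
  cosine = np.dot(u, v.T) / norm
  # clip to [-1, 1]: rounding can push the cosine of near-identical vectors slightly above 1, which makes arccos return nan
  ang = np.arccos(np.clip(cosine, -1.0, 1.0))
  return 1 - ang/np.pi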

In [None]:
# define some useful parameters for the following computations
# if theta changes in section 6, adjust the other parameters
theta = 0.60
b = 24
r = 6
m = b * r
print(round((1/b)**(1/r), 2), m)

0.59 144
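
The printed value 0.59 is the usual approximation `(1/b)**(1/r)` of the point where the banding S-curve rises. As a rough sketch of why it should sit close to theta: if each signature bit of two documents agrees with probability `s` (for random-hyperplane signatures this is the angular similarity returned by `cos_sim`), and bits are treated as independent, the pair shares at least one band with probability `1 - (1 - s**r)**b`. The helper names below are hypothetical and not part of the notebook.

```python
def candidate_probability(s, b=24, r=6):
    """Probability that two documents whose signature bits agree with
    probability s collide in at least one of b bands of r bits each."""
    return 1 - (1 - s ** r) ** b

def approx_threshold(b=24, r=6):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)

print(round(approx_threshold(), 2))
for s in (0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s={s:.1f} -> P(candidate)={candidate_probability(s):.3f}")
```

Changing `b` and `r` moves the curve: more bands push the threshold down (more candidates, fewer misses), more rows per band push it up.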


In [None]:
# define file paths
dataset_file = '/content/drive/MyDrive/Reviews.csv'
preprocessed_file = '/content/drive/MyDrive/preprocessed.csv'
thresh_sim_file = '/content/drive/MyDrive/thresh.csv'
true_sim_file = '/content/drive/MyDrive/true_sim.txt'
sig_file = '/content/drive/MyDrive/sig.txt'
lsh_file = '/content/drive/MyDrive/lsh.txt'
comparison_file = '/content/drive/MyDrive/comparison.txt'

## 4) Preprocessing

### 4.1) Retrieve data

In [None]:
# read and clear the dataset
df = pd.read_csv(dataset_file, sep=',')
df.dropna(subset=['Score', 'Summary', 'Text'], inplace=True)
df.describe()

In [None]:
# extract useful information
scores = df['Score'].tolist()
summaries = df['Summary'].tolist()
texts = df['Text'].tolist()

### 4.2) Perform the transformation

In [None]:
contracts = {"ain't": "is not", "aren't": "are not", "can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have", "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'd've": "he would have", "he'll": "he will", "he'll've": "he will have", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have", "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not", "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", 
"wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'alls": "you alls", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have"}
puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√']

# define preprocessing pipeline
def preprocessing(string, contracts, puncts, options):
  # convert all characters into lowercase characters 
  string = string.lower()
  
  # expand any contractions using 'contracts' dictionary
  if options.get("expand_contractions"):
    for pair in contracts.items():
      if pair[0] in string:
        string = string.replace(pair[0], pair[1])
    
    ''' alternative
    regex = re.compile('(%s)' % '|'.join(contracts.keys()))
    string = regex.sub(lambda x: contracts[x.group(0)], string)
    '''
  
  # delete any tags (they carry no information)
  if options.get("discard_tags"):
    for tag in ["<p>", "<br />", "</a>"]:
      if tag in string:
        string = string.replace(tag, " ")
    
    string = " ".join(list(filter(lambda elem: "a href" not in elem, re.split("<|>", string))))

  # delete any punctuation symbols using 'puncts' list
  if options.get("del_punctuations"):
    for punct in puncts:
      if punct in string:
        string = string.replace(punct, " ")

    ''' alternative
    regex = re.compile('(%s)' % '|'.join(map(re.escape, puncts)))
    string = regex.sub(lambda x: "", string)
    '''
  
  # delete any numbers
  if options.get("del_numbers"):
    for num in list(range(0, 10)):
      if str(num) in string:
        string = string.replace(str(num), " ")
    
    ''' alternative
    string = re.sub(r'\d+', "", string)
    '''

  # apply lemmatization
  if options.get("use_lemmatization"):
    lemmatizer = WordNetLemmatizer()

    # principled strategy based on part-of-speech tags (inspired by https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)
    pos_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    string = " ".join([lemmatizer.lemmatize(pair[0], pos_dict.get(pair[1][0].upper(), wordnet.NOUN)) for pair in pos_tag(word_tokenize(string))])

    ''' alternative
    string = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(string)])
    '''

  # delete stopwords
  if options.get("del_stopwords"):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    string = " ".join([word for word in word_tokenize(string) if word not in stopwords])

  return string

In [None]:
# choose how to preprocess data
options = {"expand_contractions": True,
           "discard_tags": True,
           "del_punctuations": True,
           "del_numbers": True,
           "use_lemmatization": True,
           "del_stopwords": True}

# apply the actual preprocessing step
summaries = [preprocessing(elem, contracts, puncts, options) for elem in summaries]
texts = [preprocessing(elem, contracts, puncts, options) for elem in texts]
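
One detail of the contraction step is worth illustrating. With sequential `str.replace` calls in dictionary order, a short key such as "can't" appears to be expanded before a longer key that contains it ("can't've"), so the longer key can no longer match. The sketch below, on a small hypothetical dictionary (`small_contracts` and `expand` are illustrative names, not part of the notebook), shows a regex variant that tries the longest keys first.

```python
import re

# Hypothetical three-entry subset of the 'contracts' dictionary.
small_contracts = {"can't": "cannot", "can't've": "cannot have", "it's": "it is"}

# Sort keys longest-first so that "can't've" is tried before its prefix "can't".
_keys = sorted(small_contracts, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys))

def expand(text):
    """Lowercase the text and expand every contraction found in it."""
    return _pattern.sub(lambda m: small_contracts[m.group(0)], text.lower())

print(expand("It's late and I can't've missed it"))
# it is late and i cannot have missed it
```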

### 4.3) Write preprocessed data

In [None]:
# write preprocessed data into a csv file
# uncomment the following lines to perform the actual writing
'''
f_csv = open(preprocessed_file, "w")
f_csv.write("score,summary,text\n")

for i in range(len(scores)):
  f_csv.write(str(scores[i]) + "," + summaries[i] + "," + texts[i] + "\n")

f_csv.close()
'''

## 5) Create tf-idf vectors

In [None]:
# read preprocessed data
df = pd.read_csv(preprocessed_file, sep=',')

scores = df['score'].tolist()
summaries = df['summary'].tolist()
texts = df['text'].tolist()

# design choice: each document contains a reconciled string composed of 'scores', 'summaries' and 'texts'
# comment the following line to ignore 'scores' and 'summaries' (the number of terms in the vectorized representations will be less)
texts = [str(elem[0]) + " " + str(elem[1]) + " " + str(elem[2]) for elem in zip(scores, summaries, texts)]

In [None]:
# split the documents in train set and test set
# use thresh_test to carry out a preliminary analysis on the threshold (theta)
thresh_test_prop = 0.01
test_prop = 0.005

_, thresh_test = train_test_split(texts, test_size=thresh_test_prop, random_state=32)
train, test = train_test_split(texts, test_size=test_prop, random_state=64)

In [None]:
# compute the tf-idf vectors (sparse representation)
vectorizer = TfidfVectorizer(min_df=100)
texts_tfidf = vectorizer.fit_transform(texts)
train_tfidf = vectorizer.transform(train)
test_tfidf = vectorizer.transform(test)
thresh_test_tfidf = vectorizer.transform(thresh_test)

## 6) Set the threshold

### 6.1) Compute similarities

In [None]:
sim_list = []

# compute the similarity between each pair of documents in the 'thresh_test' set and keep the values in [0.6, 1.0)
for i in range(thresh_test_tfidf.shape[0]):
  u = thresh_test_tfidf[i:i+1].toarray()

  for j in range(thresh_test_tfidf.shape[0]):
    if j > i: # compute only one half of the similarities to speed up the process (the remaining ones are dual)
      v = thresh_test_tfidf[j:j+1].toarray()
      val = round(cos_sim(u,v).item(), 2)
      
      if 0.6 <= val < 1.0:
        sim_list.append([i, j, val])

# compute the second half and unify
sh = [[sim[1], sim[0], sim[2]] for sim in sim_list]
sim_list = sim_list + sh

In [None]:
# write the similarities obtained in the previous lines on a csv file
# uncomment the following lines to perform the actual writing
'''
f_csv = open(thresh_sim_file, "w")
f_csv.write("i,j,val\n")

# sim_list already contains both (i, j) and (j, i), so each entry is written only once
for i in range(len(sim_list)):
  f_csv.write(str(sim_list[i][0]) + "," + str(sim_list[i][1]) + "," + str(sim_list[i][2]) + "\n")

f_csv.close()
'''

### 6.2) Read results

In [None]:
# read the similarities from the file
df = pd.read_csv(thresh_sim_file, sep=',')

i = df['i'].tolist()
j = df['j'].tolist()
val = df['val'].tolist()

sim_list = []
for k in range(len(i)):
  sim_list.append([i[k], j[k], val[k]])

### 6.3) Choose theta

In [None]:
# create a dictionary for similarity values
sim_dict = {}

for sim in sim_list:
  sim_dict.setdefault(sim[0], [])
  sim_dict[sim[0]].append(sim[2])

# create a dictionary for statistical insights
stat_dict = {}

for key, value in sim_dict.items():
  stat_dict.setdefault(key, [])
  stat_dict[key].append(np.mean(value))
  stat_dict[key].append(np.median(value))
  stat_dict[key].append(np.percentile(value, 90))

In [None]:
# assumption: at least one document d in 'thresh_test' has its most-similar document in 'thresh_test'
# for each document in 'thresh_test', compute the maximum similarity value obtained
best_list = [max(lst) for lst in sim_dict.values()]
best_list.sort()

# set theta as the minimum of the previous list (conservative approach)
theta = min(best_list)
print(theta)

# the choice is empirically supported
# almost all documents in the following computations have at least one similar document (similarity above the threshold theta)
# still, there will be documents with many similar vectors and a few documents with no similar vectors (negligible)
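
The selection rule above (the minimum over documents of each document's best similarity) can be sketched on toy data as follows. `choose_theta` and the triples are hypothetical and only mirror the `[i, j, val]` layout of `sim_list`.

```python
from collections import defaultdict

def choose_theta(sim_triples):
    """Return the smallest of the per-document maximum similarities."""
    best = defaultdict(float)
    for i, _j, val in sim_triples:
        best[i] = max(best[i], val)
    return min(best.values())

# Toy symmetric similarities between three documents.
toy = [(0, 1, 0.72), (1, 0, 0.72),
       (0, 2, 0.61), (2, 0, 0.61),
       (1, 2, 0.65), (2, 1, 0.65)]
print(choose_theta(toy))  # 0.65: document 2 has the weakest best match
```

Note that only documents appearing in the stored list take part in the minimum. Since section 6.1 keeps values in [0.6, 1.0) only, the resulting theta cannot fall below 0.6.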

## 7) Ground truth

In [None]:
# define the writing function to store similar documents
def write(true_pairs_dict, file_path=true_sim_file):
  f = open(file_path, "a")

  for test_doc in true_pairs_dict.items():
    f.write(json.dumps(test_doc))
    f.write('\n')

  f.close()

In [None]:
# for each document in the test set, compute the similar documents in the train set
true_pairs_dict = {}

# mini-batch approach:
# a full computation takes too much time; the b_range parameter helps to focus on a subset of the documents
b_range = (0, 20)

for i in range(b_range[0], b_range[1]):
  
  # get the i-th document in the test set
  u = test_tfidf[i:i+1].toarray()

  # each value of true_pairs_dict is a two-element list
  # the first entry is a list of pairs (index of a train document similar to u, similarity value)
  # the second entry is the time taken to perform the computation
  # this structure is flexible enough for all the following comparisons and for future extensions of this project
  true_pairs_dict.setdefault(i, [[], 0.0])

  start = time.time()

  # for each document in the train set, compute the similarity
  for j in range(train_tfidf.shape[0]):
    v = train_tfidf[j:j+1].toarray()
    val = round(cos_sim(u, v).item(), 2)
    
    # store in true_pairs_dict the index of the document and the similarity value
    if val >= theta:
      true_pairs_dict[i][0].append([j, val])
  
  # store in true_pairs_dict the time needed to complete the computation for the i-th document
  true_pairs_dict[i][1] = round(time.time() - start, 2)
  
  # write the dictionary after some computations
  # uncomment the following lines to perform the actual writing
  '''
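  # flush after every 10 documents and after the last one, so that no result is left unwritten
  if (i + 1) % 10 == 0 or i == b_range[1] - 1:
    write(true_pairs_dict)
    true_pairs_dict = {}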
  '''

## 8) Signature Matrix

In [None]:
# define the class related to the signature computation
class SignatureClass:
  def __init__(self, m, dim, projections=[]):
    self.m = m
    self.dim = dim

    # create the set of vectors to compute the signature (or assign the set of vectors given as parameter)
    if projections == []:
      self.projections = np.random.randn(self.dim, self.m)
    else:
      self.projections = np.array(projections)

  # compute the dot product with the given vector v and create the signature as a bit sequence 
  def generate_signature(self, v):
    return ''.join(np.where(np.dot(v, self.projections) > 0, "1", "0"))
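
To see why these bit signatures are useful, the sketch below checks numerically that the fraction of agreeing bits between two signatures approximates the angular similarity `1 - angle/pi` of the original vectors. The dimensions, the seed and the helper `signature_bits` are arbitrary choices for illustration and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, m = 50, 2000  # m is much larger than in the notebook to reduce the noise
projections = rng.standard_normal((dim, m))

def signature_bits(v):
    """One bit per random hyperplane: True if v lies on its positive side."""
    return np.dot(v, projections) > 0

u = rng.standard_normal(dim)
v = u + 0.5 * rng.standard_normal(dim)  # a noisy copy of u

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
exact = 1 - np.arccos(np.clip(cosine, -1.0, 1.0)) / np.pi
estimate = np.mean(signature_bits(u) == signature_bits(v))
print(round(float(exact), 3), round(float(estimate), 3))
```

With only m = 144 bits, as configured in section 3, the estimate is noisier, which is one reason why candidates are re-checked with the exact similarity later on.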

## 9) LSH

### 9.1) Create LSH class

In [None]:
# define the class related to LSH
# (note that both the Signature class and the LSH class can be instantiated from scratch
# or reusing previously computed variables retrieved from a file -> section 9.3)
class LSH:
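  def __init__(self, so, b, r, theta, train_tfidf, lst_dict=None):
    self.so = so
    self.r = r
    self.b = b
    self.theta = theta
    self.train_tfidf = train_tfidf
    # use a fresh list per instance: a mutable default ([]) would be shared across instances and filled by the appends below
    self.lst_dict = [] if lst_dict is None else lst_dict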

    # performing LSH means to take the signatures of two vectors, split them into bands and compare the related dictionaries
    # most of the time is spent in creating the signature matrix M for each document
    # for this reason the idea is to precompute the signatures and the dictionaries (lst_dict) for each band for every doc in training set
    # this means speeding up the process of computing the similar documents when we call the function 'compute_sim' with the test document d
    # creating an instance of LSH class with this precomputation takes about 6 minutes with the predefined parameters
    # after this precomputation, the function 'compute_sim' takes a few seconds to perform
    # we can also reuse lst_dict if we store it in a file (sections 9.2 and 9.3)
    if self.lst_dict == []:
      M = []

      for j in range(train_tfidf.shape[0]):
        v = train_tfidf[j:j+1].toarray()[0]
        v_sig = so.generate_signature(v)
        M.append(self.__partition(v_sig, r))
    
      for i in range(self.b):
        self.lst_dict.append(collections.defaultdict(set))

      for j, elem in enumerate(M):
        for idx in range(self.b):
          self.lst_dict[idx][self.__get_n_hash(elem[idx])].add(j)
  
  # split the signature into b bands of r bits each
  def __partition(self, sig, r):
    return [sig[i:i+r] for i in range(0, len(sig), r)]
  
  # turn bit_seq into decimal integer
  def __get_n_hash(self, bit_seq):
    return int(bit_seq, 2)

  def cos_sim(self, u, v):
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    cosine = np.dot(u, v.T) / norm
    ang = np.arccos(cosine)
    return 1 - ang/np.pi
  
  # compute the train documents similar to vec
  def compute_sim(self, vec):
    start = time.time()

    # generate the signature of vec and split it into bands
    u_sig = self.so.generate_signature(vec[0])
    u_band = self.__partition(u_sig, self.r)
    sim_set = set()

    # using lst_dict, check for each band the documents that share the same bucket as vec
    for idx, band in enumerate(u_band):
      # a band of vec may hash to a bucket that no train document occupies: fall back to an empty tuple
      for elem in self.lst_dict[idx].get(self.__get_n_hash(band), ()):
        sim_set.add(elem)

    # discard the false positives by computing the real similarity between vec and the candidate documents
    sim_list = []
    for doc in sim_set:
      v = self.train_tfidf[doc].toarray()
      val = round(self.cos_sim(vec, v).item(), 2)
    
      if val >= self.theta:
        sim_list.append(doc)

    end = round(time.time() - start, 2)
    return [sim_list, end]
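
The banding lookup performed by the class can be illustrated on a toy example with 6-bit signatures and r = 2 (so 3 bands). The function names and the signatures below are hypothetical. The sketch only covers candidate generation, not the final check against theta.

```python
from collections import defaultdict

def bands(sig, r):
    """Split a bit string into consecutive bands of r bits."""
    return [sig[i:i + r] for i in range(0, len(sig), r)]

def build_index(signatures, r):
    """One dictionary per band: bucket id -> set of document ids."""
    n_bands = len(next(iter(signatures.values()))) // r
    index = [defaultdict(set) for _ in range(n_bands)]
    for doc_id, sig in signatures.items():
        for idx, band in enumerate(bands(sig, r)):
            index[idx][int(band, 2)].add(doc_id)
    return index

def candidates(index, sig, r):
    """Documents sharing at least one band bucket with the query signature."""
    found = set()
    for idx, band in enumerate(bands(sig, r)):
        found |= index[idx].get(int(band, 2), set())  # missing bucket -> no candidates
    return found

train_sigs = {0: "110010", 1: "110111", 2: "001101"}
index = build_index(train_sigs, r=2)
print(sorted(candidates(index, "111000", r=2)))  # [0, 1]: both match on the first band
```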

### 9.2) Create new instances

In [None]:
# create the instances of the signature class and the LSH class
so = SignatureClass(m, test_tfidf.shape[1])
lsh = LSH(so, b, r, theta, train_tfidf)

In [None]:
# write the variables of so (projections) and lsh (lst_dict)
# uncomment the following lines to perform the actual writing
'''
f = open(sig_file, "w")

# write so.projections in the first line of the file
f.write(json.dumps(so.projections.tolist()))
f.write('\n')

# write the dictionaries of lst_dict in the following lines (one per line)
for d in lsh.lst_dict:
  tmp = {}

  for k, v in d.items():
    tmp.setdefault(k, list(v))

  f.write(json.dumps(tmp))
  f.write('\n')

f.close()
'''

### 9.3) Retrieve old instances

In [None]:
# retrieve the variables of so (projections) and lsh (list_dict)
f = open(sig_file, "r")

list_dict = []
projections = []

i = 0
for line in f:
  if i == 0: # read so.projections from the first line of the file
    projections = json.loads(line)
    i += 1
  else: # read the dictionaries of list_dict from the following lines (one per line)
    d = json.loads(line)
    tmp = {}

    for k, v in d.items():
      tmp.setdefault(int(k), list(v))
    list_dict.append(tmp)

f.close()

In [None]:
# create the instances of the signature class and the LSH class using the parameters retrieved from the file
so = SignatureClass(m, test_tfidf.shape[1], projections)
lsh = LSH(so, b, r, theta, train_tfidf, list_dict)

## 10) Comparison

### 10.1) LSH result

In [None]:
# define the jaccard coefficient
def jaccard(sim_list1, sim_list2):
  set1, set2 = (set(sim_list1), set(sim_list2))

  # as noted in section 6, some test documents may have no similar train documents
  # in this case, the correct result for LSH is an empty set, just like the ground truth
  # thus, if both sets are empty, the jaccard coefficient is defined to be 1
  if len(set1) == 0 and len(set2) == 0:
    return 1.0
  
  return len(set1.intersection(set2)) / len(set1.union(set2))
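
A small worked example of the coefficient, using made-up document indices. The function is repeated here in compact form only to keep the sketch self-contained.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections; two empty sets count as 1.0."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

ground_truth = [3, 7, 9, 12]  # hypothetical exact neighbours
lsh_result = [3, 7, 12]       # hypothetical LSH output: document 9 was missed

print(jaccard(ground_truth, lsh_result))  # 0.75 = 3 shared / 4 distinct
print(jaccard([], []))                    # 1.0
```

Because `compute_sim` keeps only candidates whose exact similarity reaches theta, the LSH set should be a subset of the ground truth. In that case the coefficient reduces to the fraction of true neighbours that LSH recovers.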

In [None]:
# define the writing function for LSH computations and comparison between LSH and ground truth
def write(doc, lsh_res, file_path):
  f = open(file_path, "a")

  f.write(json.dumps([doc, lsh_res]))
  f.write('\n')

  f.close()

In [None]:
# define the function to compare LSH and ground truth for document doc
def compare(doc):

  # get the real similar documents
  line = linecache.getline(true_sim_file, doc+1)
  test_doc = json.loads(line)

  ground_sim = [elem[0] for elem in test_doc[1][0]]
  ground_t = test_doc[1][1]

  # get the LSH results
  lsh_sim, lsh_t = lsh.compute_sim(test_tfidf[doc].toarray())

  # uncomment the following line to perform the actual writing
  # write(doc, [lsh_sim, lsh_t], lsh_file)

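  # compute the jaccard coefficient between the two sets and the speed-up of LSH over the exact approach
  jac = jaccard(ground_sim, lsh_sim)
  # named 'speedup' to avoid shadowing the 'time' module; lsh_t is rounded to 2 decimals, so guard against a zero denominator
  speedup = ground_t / lsh_t if lsh_t > 0 else float('inf')

  # uncomment the following line to perform the actual writing
  # write(doc, [jac, speedup], comparison_file)
  
  print(jac)
  print(speedup)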

In [None]:
# choose the index of the document to test
doc = 0
compare(doc)

0.8455284552845529
1.781063122923588


### 10.2) Aggregated results

In [None]:
def read(file_path, idx):
  f = open(file_path, "r")
  s = 0

  for line in f:
    test_doc = json.loads(line)
    s += test_doc[1][idx]
    
  f.close()
  return s

print('Average time for exact similarity computation: {:.2f}'.format(read(true_sim_file, 1) / test_tfidf.shape[0]))
print('Average time for LSH similarity computation: {:.2f}'.format(read(lsh_file, 1) / test_tfidf.shape[0]))
print('Average Jaccard coefficient: {:.2f}'.format(read(comparison_file, 0) / test_tfidf.shape[0]))

Average time for exact similarity computation: 73.19
Average time for LSH similarity computation: 27.70
Average Jaccard coefficient: 0.73
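
As a quick reading of these numbers, the ratio of the two averages gives the overall speed-up. The values are copied from the output above and are assumed to be in seconds, since they come from `time.time()` differences.

```python
exact_avg = 73.19  # average time of the exact computation (from the output above)
lsh_avg = 27.70    # average time of the LSH computation (from the output above)

speedup = exact_avg / lsh_avg
print(f"LSH is about {speedup:.2f}x faster on average")  # about 2.64x
```

This is a ratio of averages. The value printed by `compare` in section 10.1 is a per-document ratio, so the two figures need not coincide.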
