# Homework 1 (includes all parts of Challenge 1 + optional, Challenge 2, and the homework)
Authors:
- Nazarii Drushchak
- Igor Babin
- Uliana Zbezhkhovska

In [1]:
!pip install findspark
!pip install -q annoy
!pip install -q joblib
!pip install -q joblibspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1


In [2]:
import findspark
findspark.init()

In [38]:
import pyspark
import numpy as np
from tqdm import tqdm

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import MinHashLSH
from pyspark.sql.functions import col, avg, when
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover
from pyspark.ml.feature import Word2Vec
from annoy import AnnoyIndex
from sklearn.neighbors import NearestNeighbors
from pyspark.sql.functions import monotonically_increasing_id

from joblib import Parallel
from joblibspark import register_spark

In [4]:
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark

## Challenge I

In [5]:
!wget http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2023-09-03/visualisations/listings.csv

--2023-10-28 11:36:51--  http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2023-09-03/visualisations/listings.csv
Resolving data.insideairbnb.com (data.insideairbnb.com)... 16.182.106.29, 52.216.251.19, 52.217.45.147, ...
Connecting to data.insideairbnb.com (data.insideairbnb.com)|16.182.106.29|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1698431 (1.6M) [application/csv]
Saving to: ‘listings.csv’


2023-10-28 11:36:52 (1.69 MB/s) - ‘listings.csv’ saved [1698431/1698431]



In [6]:
df = spark.read.csv("listings.csv", header=True, multiLine=True)
df.show(5)

+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+
|    id|                name|host_id|host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|
+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+
|761411|Condo in Amsterda...|4013546|   Xsjong|               NULL|   Noord-Oost|52.40164|  4.95106|   Private room|   61|             3|              303| 2023-08-19|  

Tokenize the listing names (remove punctuation and split into words). You can do this in pure Python or with the MLlib `Tokenizer`.
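Spark's `Tokenizer` lowercases the text and splits on whitespace only, so punctuation stays attached to the tokens (`RegexTokenizer` is the MLlib variant that accepts a pattern). As a minimal pure-Python sketch of the "remove punctuation and split by word" step, the hypothetical helper `tokenize` below uses a regular expression; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import re

def tokenize(text):
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    # \w+ matches runs of letters, digits, and underscores; punctuation and
    # whitespace act as separators and are discarded.
    return re.findall(r"\w+", text.lower())

# Hypothetical listing name, used only to illustrate the behaviour.
tokens = tokenize("Cozy houseboat, near the city-centre!")
print(tokens)
```

Note that this sketch also splits hyphenated words such as "city-centre" into two tokens, which may or may not be desirable.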

In [7]:
tokenizer = Tokenizer(inputCol="name", outputCol="words")
wordData = tokenizer.transform(df)
wordData.show(5)

+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+
|    id|                name|host_id|host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|               words|
+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+
|761411|Condo in Amsterda...|4013546|   Xsjong|               NULL|   Noord-Oost|52.40164|  4.95106|   Pri

Remove stop words using the MLlib `StopWordsRemover` and store the result in a new column called “CleanTokens”.

In [8]:
remover = StopWordsRemover(inputCol="words", outputCol="CleanTokens")
cleanData = remover.transform(wordData)
cleanData.show(5)

+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+--------------------+
|    id|                name|host_id|host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|               words|         CleanTokens|
+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+--------------------+
|761411|Condo in Amsterda...|4013546|   Xsj

However, no `StopWordsRemover` covers every language and context. Create your own list of stop words from this text (think: what is a stop word?), remove stop words again, and store the result in a column called “MyCleanTokens”.
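One way to approach "what is a stop word?" is to treat it as a token that occurs in so many documents that it carries little information. The sketch below is an assumption about how such a list could be derived, not the method used in this notebook (which uses a hand-written list). The helper `frequent_tokens`, the `min_doc_ratio` threshold, and the toy documents are all hypothetical.

```python
from collections import Counter

def frequent_tokens(docs, min_doc_ratio=0.5):
    """Return tokens that appear in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    threshold = min_doc_ratio * len(docs)
    return sorted(tok for tok, n in doc_freq.items() if n >= threshold)

# Toy tokenized listing names (made up for illustration).
docs = [
    ["rental", "unit", "in", "amsterdam"],
    ["boat", "in", "amsterdam"],
    ["condo", "in", "amsterdam"],
    ["houseboat", "in", "amsterdam", "centre"],
]
candidate_stopwords = frequent_tokens(docs)
print(candidate_stopwords)
```

On listing names, a corpus-specific word such as the city name would likely show up as a candidate next to ordinary function words, which is the kind of context-dependent stop word a generic list misses.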

In [9]:
stop_list = ["the", "a", "an", "another", "for", "nor", "but", "or", "yet", "so",
             "in", "under", "towards", "before"]
remover = StopWordsRemover(stopWords=stop_list, inputCol='words', outputCol='MyCleanTokens')
cleanData = remover.transform(cleanData)
cleanData.show(5)

+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+--------------------+--------------------+
|    id|                name|host_id|host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|               words|         CleanTokens|       MyCleanTokens|
+------+--------------------+-------+---------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+--------------------+--

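Perform TF-IDF. In the cell below, the hashed term frequencies are stored in a new column called “VectorSpace” and the resulting TF-IDF weights in “features”.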
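For intuition, here is a minimal pure-Python sketch of the weighting, assuming the smoothed inverse document frequency that Spark's `IDF` documents, `log((N + 1) / (df + 1))`. It skips the hashing step (`HashingTF` maps tokens into `numFeatures` buckets, so different tokens can collide), and the helper `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document using raw TF times smoothed IDF."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # document frequency: one count per document
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in docs]

# Toy tokenized documents (made up for illustration).
docs = [["boat", "amsterdam"], ["condo", "amsterdam"], ["boat", "boat", "canal"]]
weights = tfidf(docs)
for w in weights:
    print(w)
```

Tokens that appear in fewer documents ("condo", "canal") receive a larger weight than tokens shared by several documents ("amsterdam", "boat").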

In [10]:
hashingTF = HashingTF(inputCol="MyCleanTokens", outputCol="VectorSpace", numFeatures=20)
featurizedData = hashingTF.transform(cleanData)

idf = IDF(inputCol="VectorSpace", outputCol="features")
idfModel = idf.fit(featurizedData)
results = idfModel.transform(featurizedData)

results.select("MyCleanTokens", "features").show(5)

+--------------------+--------------------+
|       MyCleanTokens|            features|
+--------------------+--------------------+
|[condo, amsterdam...|(20,[1,2,7,10,11,...|
|[rental, unit, am...|(20,[1,2,5,11,12,...|
|[boat, amsterdam,...|(20,[0,1,2,5,11,1...|
|[houseboat, amste...|(20,[0,1,5,6,11,1...|
|[rental, unit, am...|(20,[1,2,9,11,12,...|
+--------------------+--------------------+
only showing top 5 rows



## Homework (Optional)

In a new column (‘word2vec’), repeat the procedure using Word2Vec instead of TF-IDF.
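Spark's `Word2VecModel.transform` turns each document into a single vector by averaging the vectors of its words. The sketch below illustrates only that averaging step with a hand-written toy vocabulary; the helper `document_vector` and the two-dimensional vectors are hypothetical, and skipping tokens that are missing from the vocabulary is a simplification of this sketch rather than a statement about Spark's behaviour.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that are present in word_vectors."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional "embeddings" (made up for illustration).
word_vectors = {"boat": [1.0, 0.0], "amsterdam": [0.0, 1.0]}
doc_vec = document_vector(["boat", "amsterdam"], word_vectors, dim=2)
print(doc_vec)
```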

In [11]:
word2Vec = Word2Vec(vectorSize=20, minCount=0, inputCol="MyCleanTokens", outputCol="word2vec")
model = word2Vec.fit(results)
result = model.transform(results)

result.select("MyCleanTokens", "word2vec").show(5)

+--------------------+--------------------+
|       MyCleanTokens|            word2vec|
+--------------------+--------------------+
|[condo, amsterdam...|[-0.0925399937799...|
|[rental, unit, am...|[-0.2815365232527...|
|[boat, amsterdam,...|[-0.1428595901681...|
|[houseboat, amste...|[-0.4491232094856...|
|[rental, unit, am...|[-0.1793470645944...|
+--------------------+--------------------+
only showing top 5 rows



Show the word2vec vector of the first row

In [12]:
result.select("word2vec").first()

Row(word2vec=DenseVector([-0.0925, 0.1008, 0.4251, -0.2498, 0.2804, -0.1673, 0.123, 0.2466, -0.2948, -0.1483, -0.2474, -0.0043, -0.4474, 0.0884, 0.3136, 0.0642, -0.0431, -0.4975, 0.1821, 0.3358]))

## Challenge II

Take the first 500 flats in the list

In [13]:
mysample = result.limit(500)
mysample.count()

500

Find the 3 nearest neighbors for each element in that subset (candidates and query points are within the sample of 500) using exact k-nearest neighbors (KNN)
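Exact KNN can be pictured as computing the distance from the query to every other point and keeping the `k` smallest. The brute-force sketch below assumes Euclidean distance, which is also the default metric of scikit-learn's `NearestNeighbors`; the helper `knn` and the toy points are hypothetical, and `math.dist` requires Python 3.8 or newer.

```python
import math

def knn(points, query_index, k):
    """Return the indices of the k points closest to points[query_index], excluding itself."""
    query = points[query_index]
    dists = [
        (math.dist(query, p), i)
        for i, p in enumerate(points)
        if i != query_index  # a point is always its own nearest neighbor, so skip it
    ]
    return [i for _, i in sorted(dists)[:k]]

# Toy 2-dimensional points (made up for illustration).
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0], [0.5, 0.5]]
neighbors = knn(points, 0, 3)
print(neighbors)
```

This costs one distance computation per candidate for every query, which is the cost the approximate methods later in the notebook try to avoid.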

In [14]:
mysample_pd = mysample.toPandas()
tfidf = mysample_pd['features'].tolist()
text = mysample_pd['name'].tolist()
id_ = mysample_pd['id'].tolist()

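# fit nearest neighbors; k=4 because each query point is also its own nearest neighbor
nbrs = NearestNeighbors(n_neighbors=4).fit(tfidf)
# query only the first 5 rows as a quick check
distances, indices = nbrs.kneighbors(tfidf[:5])

# show the ids of the 4 nearest neighbors of the first row (the first id is the row itself)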
print('id', [id_[i] for i in indices[0]])

id ['761411', '634170', '721291', '730916']


Find the 3 nearest neighbors for each element in that subset (candidates and query points are within the sample of 500) using LSH with sklearn

In [15]:
# LSHForest was deprecated in scikit-learn 0.19 and removed in 0.21,
# so this cell only runs on an older release (for example 0.16.1).
try:
    from sklearn.neighbors import LSHForest

    mysample_pd = mysample.toPandas()
    tfidf = mysample_pd['features'].tolist()
    text = mysample_pd['name'].tolist()
    ids = mysample_pd['id'].tolist()

    lshf = LSHForest(random_state=42)
    lshf.fit(tfidf)

    # get the feature vector of the first row
    query = tfidf[0]

    # show 3 nearest neighbors for first row except itself
    # (the list of ids is kept intact so it can be indexed by neighbor position)
    distances, indices = lshf.kneighbors([query], n_neighbors=4)
    for i in range(1, len(distances[0])):
        print("distance: ", distances[0][i], "id: ", ids[indices[0][i]])
except ImportError:
    print("LSHForest could not be imported")

LSHForest could not be imported


Find the 3 nearest neighbors for each element in that subset (candidates and query points are within the sample of 500) using LSH with PySpark
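`MinHashLSH` approximates Jaccard similarity between sets: for each hash function it keeps the minimum hash value over a set's elements, and two sets agree on that minimum with probability close to their Jaccard similarity. The sketch below is a simplified illustration of that idea, not Spark's implementation; the helpers, the linear hash family, the prime modulus, and the index sets are assumptions chosen for the example.

```python
import random

PRIME = 2_147_483_647  # a large prime used as the modulus of the toy hash family

def minhash_signature(item_set, hash_params):
    """Keep, for every (a, b) hash function, the minimum hash over the set's elements."""
    return [min((a * x + b) % PRIME for x in item_set) for a, b in hash_params]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of hash functions on which the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

rng = random.Random(42)
hash_params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(200)]

# Sets of non-zero feature indices for two hypothetical rows.
a = {1, 2, 7, 10, 11}
b = {1, 2, 5, 11, 12}
true_jaccard = len(a & b) / len(a | b)
est = estimated_jaccard(minhash_signature(a, hash_params), minhash_signature(b, hash_params))
print("true:", true_jaccard, "estimated:", est)
```

Using more hash functions tightens the estimate at the cost of longer signatures, which is the trade-off behind the `numHashTables` parameter.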

In [16]:
mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
model = mh.fit(mysample)

# get the feature vector of the first row
key =  mysample.select("features").take(1)[0].features
id_ = mysample.select("id").take(1)[0].id


# show 3 nearest neighbors for first row except itself
model.approxNearestNeighbors(mysample, key, 4).filter(col("id") != id_).show()

+-------+--------------------+-------+----------------+-------------------+-------------+--------+---------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+
|     id|                name|host_id|       host_name|neighbourhood_group|neighbourhood|latitude|longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|             license|               words|         CleanTokens|       MyCleanTokens|         VectorSpace|            features|            word2vec|              hashes|           distCol|
+-------+--------------------+-------+----------------+-------------------+-------------+--------+

## Challenge III: Homework

![image.png](HW1_task.jpeg)

## Load Barcelona Airbnb data

In [17]:
!wget http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv -O listings_barcelona.csv 

--2023-10-28 11:37:07--  http://data.insideairbnb.com/spain/catalonia/barcelona/2023-09-06/visualisations/listings.csv
Resolving data.insideairbnb.com (data.insideairbnb.com)... 54.231.199.253, 52.216.212.101, 52.217.81.187, ...
Connecting to data.insideairbnb.com (data.insideairbnb.com)|54.231.199.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3664972 (3.5M) [application/csv]
Saving to: ‘listings_barcelona.csv’


2023-10-28 11:37:08 (2.73 MB/s) - ‘listings_barcelona.csv’ saved [3664972/3664972]



In [18]:
df = spark.read.csv("listings_barcelona.csv", header=True, multiLine=True)
df.show(5)

+------+--------------------+-------+----------------+-------------------+--------------------+-----------------+-----------------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-----------+
|    id|                name|host_id|       host_name|neighbourhood_group|       neighbourhood|         latitude|        longitude|      room_type|price|minimum_nights|number_of_reviews|last_review|reviews_per_month|calculated_host_listings_count|availability_365|number_of_reviews_ltm|    license|
+------+--------------------+-------+----------------+-------------------+--------------------+-----------------+-----------------+---------------+-----+--------------+-----------------+-----------+-----------------+------------------------------+----------------+---------------------+-----------+
| 18674|Rental unit in Ba...|  71615|Mireia And Maria|           Eixample|  la Sagrada Família|        

In [19]:
df.count()

18086

## Load wikipedia data

Download the file in CSV format from: https://pageviews.wmcloud.org/topviews/?project=uk.wikipedia.org&platform=all-access&date=2023-09&excludes=

In [24]:
df = spark.read.csv("work/topviews-2023_09.csv", header=True, multiLine=True)
df.show(5)

+--------------------+-----+-------+------+
|                Page|Edits|Editors| Views|
+--------------------+-----+-------+------+
|Умєров Рустем Енв...|   54|     34|127352|
|             Ukr.net|    2|      1| 97183|
|             Україна|    8|      6| 94568|
|Кадиров Рамзан Ах...|   15|      8| 86347|
|    Нагірний Карабах|   32|     12| 79264|
+--------------------+-----+-------+------+
only showing top 5 rows



In [25]:
df.count()

991

## Helper functions

In [39]:
use_stopwords = True  # Use True or False
use_custom_stopwords = False  # Use True or False
latent_features = 20  # Dimension of features
nearest = 3  # Number of nearest neighbors

# Register Spark as a joblib backend
register_spark()

def timeit(func):
    def timed(*args, **kwargs):
        import time
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print("Time elapsed for " + func.__name__ + ": " + str(end - start))
        return result

    return timed

def read_data(spark, data):
    if data == "barcelona":
        df = spark.read.csv("listings_barcelona.csv", header=True, multiLine=True)
    elif data == "titles":
        df = spark.read.csv("work/topviews-2023_09.csv", header=True, multiLine=True)
        # add id column as row number
        df = df.withColumn("id", monotonically_increasing_id())
    else:
        raise ValueError("Invalid data")
    return df

def limit_data(df, limit=50): # -1 means no limit
    if limit > 0:
        df = df.limit(limit)
    return df

def get_features(df, input_col="name", output_col="features", type_features="tfidf"):
    tokenizer = Tokenizer(inputCol=input_col, outputCol="words")
    df = tokenizer.transform(df)

    if use_stopwords:
        if use_custom_stopwords:
            remover = StopWordsRemover(stopWords=stop_list, inputCol="words", outputCol="clean_tokens")
        else:
            remover = StopWordsRemover(inputCol="words", outputCol="clean_tokens")
        df = remover.transform(df)
        df = df.drop("words")
        df = df.withColumnRenamed("clean_tokens", "words")

    if type_features == "tfidf":
        hashing = HashingTF(inputCol="words", outputCol="hash", numFeatures=latent_features)
        df = hashing.transform(df)

        idf = IDF(inputCol="hash", outputCol=output_col)
        model = idf.fit(df)
        df = model.transform(df)
    elif type_features == "word_to_vec":
        word_vec = Word2Vec(vectorSize=latent_features, minCount=0, inputCol="words", outputCol=output_col)
        model = word_vec.fit(df)
        df = model.transform(df)
    else:
        raise ValueError("Invalid feature " + type_features)

    return df

@timeit
def compute_gt(ds, spark, k=nearest, input_col="features", output_col="gt_neighbors"):
    df = ds.toPandas()
    features = df[input_col].tolist()
    model = NearestNeighbors(n_neighbors=k + 1, algorithm='ball_tree').fit(features)
    _, indices = model.kneighbors(features)
    # remove self from neighbors
    indices = [Vectors.dense(df["id"][np.delete(ind, np.where(ind == i))].values) for i, ind in enumerate(indices)]
    df[output_col] = indices

    return spark.createDataFrame(df)

@timeit
def lsh_prediction(ds, spark, k=nearest, input_col="features",
                   output_col="ann_neighbors", num_hash_tables=3):
    model = MinHashLSH(inputCol=input_col, outputCol=output_col, numHashTables=num_hash_tables)
    model = model.fit(ds)

    # TODO There should be something better than this
    pred = []
    for i in tqdm(ds.collect()):
        id_ = i["id"]
        key = i[input_col]
        pred.append(model.approxNearestNeighbors(ds, key, k + 1).filter(col("id") != id_).select("id").collect())

    pred = [Vectors.dense([i["id"] for i in ann]) for ann in pred]
    df = ds.toPandas()
    df[output_col] = pred
    ds = spark.createDataFrame(df)

    return ds

@timeit
def grid_search_lsh(ds, spark, k=nearest, input_col="features", output_col="ann_neighbors"):
    param_grid = [5, 10, 20, 100]
    results = []
    print("="*50)
    print("Grid search for LSH")

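    # NOTE: the Parallel context uses the Spark backend, but the loop below never
    # dispatches work through `parallel`, so the configurations run one after another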
    with Parallel(n_jobs=-1, backend="spark") as parallel:
        for num_hash_table in param_grid:
            ds_ = lsh_prediction(ds, spark, k=k, input_col=input_col, output_col=output_col,
                                num_hash_tables=num_hash_table)
            acc = evaluation(ds_)
            print(f"Method: LSH - Num Hash Tables: {num_hash_table} - Accuracy: {acc}\n")
            results.append((num_hash_table, acc))
    print("="*50, "\n\n")
    return results

@timeit
def annoy_prediction(ds, k=nearest, input_col="features", output_col="ann_neighbors", metric='angular', tree=10):
    df = ds.toPandas()
    features = df[input_col].tolist()
    f = len(features[0])
    t = AnnoyIndex(f, metric=metric)
    for i, v in enumerate(features):
        t.add_item(i, v)
    t.build(tree)
    pred = []
    for i in tqdm(features):
        pred.append(t.get_nns_by_vector(i, k + 1, include_distances=False)[1:])
    pred = [Vectors.dense([df["id"][i] for i in ann]) for ann in pred]
    df[output_col] = pred
    ds = spark.createDataFrame(df)

    return ds

@timeit
def grid_search_annoy(ds, k=nearest, input_col="features", output_col="ann_neighbors"):
    metrics = ['angular', 'euclidean', 'dot']
    trees = [10, 100, 1000]
    param_grid = [(metric, tree) for metric in metrics for tree in trees]
    results = []
    print("="*50)
    print("Grid search for Annoy")

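    # NOTE: the Parallel context uses the Spark backend, but the loop below never
    # dispatches work through `parallel`, so the configurations run one after another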
    with Parallel(n_jobs=-1, backend="spark") as parallel:
        for metric, tree in param_grid:
            ds = annoy_prediction(ds, k=k, input_col=input_col, output_col=output_col, metric=metric, tree=tree)
            acc = evaluation(ds)
            print(f"Method: Annoy - Metric: {metric} - Tree: {tree} - Accuracy: {acc}\n")
            results.append((metric, tree, acc))
    print("="*50, "\n\n")
    return results

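def evaluation(ds):
    # Fraction of ground-truth neighbors that the approximate method also returned
    rows = ds.collect()  # collect once and reuse for both the loop and the row count
    acc = 0
    for i in rows:
        gt = i["gt_neighbors"]
        ann = i["ann_neighbors"]
        acc += len(set(gt).intersection(set(ann)))
    acc /= len(rows) * nearest

    return acc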
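The value reported as "Accuracy" in the runs below is the share of exact nearest neighbors that the approximate method also found, averaged over all query rows. A minimal sketch of that computation on plain Python lists, with a hypothetical helper `neighbor_overlap` and made-up neighbor ids:

```python
def neighbor_overlap(gt_neighbors, ann_neighbors, k):
    """Share of ground-truth neighbors recovered by the approximate search."""
    hits = sum(
        len(set(gt) & set(ann))  # order does not matter, only membership
        for gt, ann in zip(gt_neighbors, ann_neighbors)
    )
    return hits / (len(gt_neighbors) * k)

# Two query rows with k = 3 neighbors each (ids made up for illustration).
gt = [[2, 3, 4], [1, 3, 5]]
ann = [[2, 3, 9], [5, 1, 3]]
score = neighbor_overlap(gt, ann, k=3)
print(score)
```

The first row recovers 2 of 3 neighbors and the second all 3, so the score is 5 of 6.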

## Barcelona dataset
### type_features: `TF-IDF`
**Try the LSH method from PySpark (not optimized) and the approximate nearest-neighbor method from Annoy**

In [27]:
limit = 50  # Use -1 for no limit
data = "barcelona"  # Use "barcelona" or "titles"
nearest = 3
use_stopwords = True  # Use True or False
use_custom_stopwords = False  # Use True or False
latent_features = 20  # Dimension of features
type_features = "tfidf"  # Use "tfidf" or "word_to_vec"

In [28]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_lsh(df, spark)

Reading data for barcelona with limit 50 and features tfidf ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.3051784038543701
Grid search for LSH


100%|██████████| 50/50 [00:13<00:00,  3.65it/s]


Time elapsed for lsh_prediction: 15.093586206436157
Method: LSH - Num Hash Tables: 5 - Accuracy: 0.5733333333333334



100%|██████████| 50/50 [00:11<00:00,  4.36it/s]


Time elapsed for lsh_prediction: 11.93342900276184
Method: LSH - Num Hash Tables: 10 - Accuracy: 0.5666666666666667



100%|██████████| 50/50 [00:09<00:00,  5.41it/s]


Time elapsed for lsh_prediction: 9.636497497558594
Method: LSH - Num Hash Tables: 20 - Accuracy: 0.5666666666666667



100%|██████████| 50/50 [00:10<00:00,  4.84it/s]


Time elapsed for lsh_prediction: 10.809253215789795
Method: LSH - Num Hash Tables: 100 - Accuracy: 0.5666666666666667



Time elapsed for grid_search_lsh: 48.77841663360596


[(5, 0.5733333333333334),
 (10, 0.5666666666666667),
 (20, 0.5666666666666667),
 (100, 0.5666666666666667)]

In [29]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_annoy(df, k=nearest, input_col="features", output_col="ann_neighbors")

Reading data for barcelona with limit 50 and features tfidf ...



Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.1712968349456787
Grid search for Annoy


100%|██████████| 50/50 [00:00<00:00, 3755.64it/s]


Time elapsed for annoy_prediction: 0.37523818016052246
Method: Annoy - Metric: angular - Tree: 10 - Accuracy: 0.82



100%|██████████| 50/50 [00:00<00:00, 7341.94it/s]

Time elapsed for annoy_prediction: 0.1696934700012207





Method: Annoy - Metric: angular - Tree: 100 - Accuracy: 0.8133333333333334



100%|██████████| 50/50 [00:00<00:00, 2013.01it/s]


Time elapsed for annoy_prediction: 0.25090670585632324
Method: Annoy - Metric: angular - Tree: 1000 - Accuracy: 0.8133333333333334



100%|██████████| 50/50 [00:00<00:00, 6132.02it/s]


Time elapsed for annoy_prediction: 0.22156262397766113
Method: Annoy - Metric: euclidean - Tree: 10 - Accuracy: 0.9466666666666667



100%|██████████| 50/50 [00:00<00:00, 5015.91it/s]


Time elapsed for annoy_prediction: 0.2560689449310303
Method: Annoy - Metric: euclidean - Tree: 100 - Accuracy: 0.96



100%|██████████| 50/50 [00:00<00:00, 1943.77it/s]


Time elapsed for annoy_prediction: 0.259688138961792
Method: Annoy - Metric: euclidean - Tree: 1000 - Accuracy: 0.9666666666666667



100%|██████████| 50/50 [00:00<00:00, 7620.74it/s]

Time elapsed for annoy_prediction: 0.18961048126220703





Method: Annoy - Metric: dot - Tree: 10 - Accuracy: 0.31333333333333335



100%|██████████| 50/50 [00:00<00:00, 5088.94it/s]

Time elapsed for annoy_prediction: 0.20073294639587402





Method: Annoy - Metric: dot - Tree: 100 - Accuracy: 0.3



100%|██████████| 50/50 [00:00<00:00, 1433.67it/s]


Time elapsed for annoy_prediction: 0.3222932815551758
Method: Annoy - Metric: dot - Tree: 1000 - Accuracy: 0.3



Time elapsed for grid_search_annoy: 4.976179599761963


[('angular', 10, 0.82),
 ('angular', 100, 0.8133333333333334),
 ('angular', 1000, 0.8133333333333334),
 ('euclidean', 10, 0.9466666666666667),
 ('euclidean', 100, 0.96),
 ('euclidean', 1000, 0.9666666666666667),
 ('dot', 10, 0.31333333333333335),
 ('dot', 100, 0.3),
 ('dot', 1000, 0.3)]
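A likely reason the Annoy metrics differ so much, stated here as an interpretation rather than a measured finding: the ground truth comes from scikit-learn's `NearestNeighbors`, whose default metric is Euclidean, so an index that ranks by another notion of closeness is scored against neighbors it was never trying to find. The toy sketch below, with hypothetical helpers and points, shows the same query getting a different "nearest" point under Euclidean distance and under dot product.

```python
import math

def nearest_by_euclidean(points, query_index):
    """Index of the point with the smallest Euclidean distance to the query."""
    q = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return min(others, key=lambda i: math.dist(q, points[i]))

def nearest_by_dot(points, query_index):
    """Index of the point with the largest dot product with the query."""
    q = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    return max(others, key=lambda i: sum(a * b for a, b in zip(q, points[i])))

# Toy points: index 1 is close to the query, index 2 is far away but has a large norm.
points = [[1.0, 1.0], [1.1, 0.9], [10.0, 10.0]]
print(nearest_by_euclidean(points, 0), nearest_by_dot(points, 0))
```

Dot product rewards large vectors regardless of how far away they are, which would be consistent with the `dot` metric scoring lowest against a Euclidean ground truth.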

## Barcelona dataset
### type_features: `Word2Vec`
**Try the LSH method from PySpark (not optimized) and the approximate nearest-neighbor method from Annoy**

In [30]:
limit = 50  # Use -1 for no limit
data = "barcelona"  # Use "barcelona" or "titles"
nearest = 3
use_stopwords = True  # Use True or False
use_custom_stopwords = False  # Use True or False
latent_features = 20  # Dimension of features
type_features = "word_to_vec"  # Use "tfidf" or "word_to_vec"

In [31]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_lsh(df, spark)

Reading data for barcelona with limit 50 and features word_to_vec ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.1323404312133789
Grid search for LSH


100%|██████████| 50/50 [00:10<00:00,  4.84it/s]


Time elapsed for lsh_prediction: 10.80126690864563
Method: LSH - Num Hash Tables: 5 - Accuracy: 0.07333333333333333



100%|██████████| 50/50 [00:09<00:00,  5.29it/s]


Time elapsed for lsh_prediction: 9.926118850708008
Method: LSH - Num Hash Tables: 10 - Accuracy: 0.07333333333333333



100%|██████████| 50/50 [00:08<00:00,  5.63it/s]


Time elapsed for lsh_prediction: 9.312865972518921
Method: LSH - Num Hash Tables: 20 - Accuracy: 0.07333333333333333



100%|██████████| 50/50 [00:08<00:00,  5.75it/s]


Time elapsed for lsh_prediction: 9.116144895553589
Method: LSH - Num Hash Tables: 100 - Accuracy: 0.07333333333333333



Time elapsed for grid_search_lsh: 40.24219989776611


[(5, 0.07333333333333333),
 (10, 0.07333333333333333),
 (20, 0.07333333333333333),
 (100, 0.07333333333333333)]
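One possible explanation for the identical, low score across all hash-table counts, offered as an assumption rather than something verified in this notebook: `MinHashLSH` treats every non-zero entry of a vector as a binary 1 and compares the resulting index sets with Jaccard distance. Word2Vec vectors are dense, so nearly every row maps to the same index set and all pairwise distances collapse to 0. The hypothetical helper below computes that set-based distance for made-up dense and sparse vectors.

```python
def jaccard_distance_of_supports(u, v):
    """Jaccard distance between the sets of non-zero indices of two vectors."""
    su = {i for i, x in enumerate(u) if x != 0}
    sv = {i for i, x in enumerate(v) if x != 0}
    return 1 - len(su & sv) / len(su | sv)

# Dense vectors (Word2Vec-like): every position is non-zero.
dense_a = [-0.09, 0.10, 0.43, -0.25]
dense_b = [0.31, -0.52, 0.07, 0.88]

# Sparse vectors (TF-IDF-like): only a few positions are non-zero.
sparse_a = [0.0, 1.2, 0.0, 0.7]
sparse_b = [0.9, 1.2, 0.0, 0.0]

dense_dist = jaccard_distance_of_supports(dense_a, dense_b)
sparse_dist = jaccard_distance_of_supports(sparse_a, sparse_b)
print(dense_dist, sparse_dist)
```

If this is what happens, the neighbors returned for dense inputs are effectively arbitrary, and a Euclidean-oriented method such as `BucketedRandomProjectionLSH` would be the more natural PySpark choice for Word2Vec features.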

In [32]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_annoy(df, k=nearest, input_col="features", output_col="ann_neighbors")

Reading data for barcelona with limit 50 and features word_to_vec ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.11008501052856445
Grid search for Annoy


100%|██████████| 50/50 [00:00<00:00, 37203.34it/s]


Time elapsed for annoy_prediction: 0.19582915306091309
Method: Annoy - Metric: angular - Tree: 10 - Accuracy: 0.8333333333333334



100%|██████████| 50/50 [00:00<00:00, 20317.30it/s]

Time elapsed for annoy_prediction: 0.1560664176940918





Method: Annoy - Metric: angular - Tree: 100 - Accuracy: 0.84



100%|██████████| 50/50 [00:00<00:00, 2824.29it/s]

Time elapsed for annoy_prediction: 0.1698594093322754





Method: Annoy - Metric: angular - Tree: 1000 - Accuracy: 0.8533333333333334



100%|██████████| 50/50 [00:00<00:00, 115164.85it/s]

Time elapsed for annoy_prediction: 0.19256114959716797





Method: Annoy - Metric: euclidean - Tree: 10 - Accuracy: 0.9733333333333334



100%|██████████| 50/50 [00:00<00:00, 20728.99it/s]

Time elapsed for annoy_prediction: 0.15189266204833984





Method: Annoy - Metric: euclidean - Tree: 100 - Accuracy: 0.9866666666666667



100%|██████████| 50/50 [00:00<00:00, 2515.08it/s]

Time elapsed for annoy_prediction: 0.182830810546875





Method: Annoy - Metric: euclidean - Tree: 1000 - Accuracy: 0.9866666666666667



100%|██████████| 50/50 [00:00<00:00, 55627.37it/s]

Time elapsed for annoy_prediction: 0.1544325351715088





Method: Annoy - Metric: dot - Tree: 10 - Accuracy: 0.08666666666666667



100%|██████████| 50/50 [00:00<00:00, 36908.69it/s]

Time elapsed for annoy_prediction: 0.17861557006835938





Method: Annoy - Metric: dot - Tree: 100 - Accuracy: 0.04666666666666667



100%|██████████| 50/50 [00:00<00:00, 3084.64it/s]


Time elapsed for annoy_prediction: 0.27036428451538086
Method: Annoy - Metric: dot - Tree: 1000 - Accuracy: 0.04666666666666667



Time elapsed for grid_search_annoy: 4.144101858139038


[('angular', 10, 0.8333333333333334),
 ('angular', 100, 0.84),
 ('angular', 1000, 0.8533333333333334),
 ('euclidean', 10, 0.9733333333333334),
 ('euclidean', 100, 0.9866666666666667),
 ('euclidean', 1000, 0.9866666666666667),
 ('dot', 10, 0.08666666666666667),
 ('dot', 100, 0.04666666666666667),
 ('dot', 1000, 0.04666666666666667)]

## Wikipedia dataset
### type_features: `TF-IDF`
**Try the LSH method from PySpark (not optimized) and the approximate nearest-neighbor method from Annoy**

In [40]:
limit = 50  # Use -1 for no limit
data = "titles"  # Use "barcelona" or "titles"
nearest = 3
use_stopwords = True  # Use True or False
use_custom_stopwords = False  # Use True or False
latent_features = 20  # Dimension of features
type_features = "tfidf"  # Use "tfidf" or "word_to_vec"

In [41]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, input_col="Page", type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_lsh(df, spark)

Reading data for titles with limit 50 and features tfidf ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.19445300102233887
Grid search for LSH


100%|██████████| 50/50 [00:09<00:00,  5.25it/s]


Time elapsed for lsh_prediction: 10.044053792953491
Method: LSH - Num Hash Tables: 5 - Accuracy: 0.6266666666666667



100%|██████████| 50/50 [00:09<00:00,  5.26it/s]


Time elapsed for lsh_prediction: 9.916013479232788
Method: LSH - Num Hash Tables: 10 - Accuracy: 0.6933333333333334



100%|██████████| 50/50 [00:09<00:00,  5.27it/s]


Time elapsed for lsh_prediction: 10.049498558044434
Method: LSH - Num Hash Tables: 20 - Accuracy: 0.6933333333333334



100%|██████████| 50/50 [00:08<00:00,  5.66it/s]


Time elapsed for lsh_prediction: 9.221503496170044
Method: LSH - Num Hash Tables: 100 - Accuracy: 0.6933333333333334



Time elapsed for grid_search_lsh: 40.323973178863525


[(5, 0.6266666666666667),
 (10, 0.6933333333333334),
 (20, 0.6933333333333334),
 (100, 0.6933333333333334)]

In [42]:
print("Reading data for " + data + " with limit " + str(limit) + " and features " + type_features + " ...\n")

df = read_data(spark, data)
df = limit_data(df, limit)

df = get_features(df, input_col="Page", type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
df = compute_gt(df, spark)

grid_search_annoy(df, k=nearest, input_col="features", output_col="ann_neighbors")

Reading data for titles with limit 50 and features tfidf ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.13683843612670898
Grid search for Annoy


100%|██████████| 50/50 [00:00<00:00, 9367.30it/s]


Time elapsed for annoy_prediction: 0.15227961540222168
Method: Annoy - Metric: angular - Tree: 10 - Accuracy: 0.72



100%|██████████| 50/50 [00:00<00:00, 7221.35it/s]


Time elapsed for annoy_prediction: 0.18560028076171875
Method: Annoy - Metric: angular - Tree: 100 - Accuracy: 0.7133333333333334



100%|██████████| 50/50 [00:00<00:00, 2391.28it/s]


Time elapsed for annoy_prediction: 0.18099522590637207
Method: Annoy - Metric: angular - Tree: 1000 - Accuracy: 0.7133333333333334



100%|██████████| 50/50 [00:00<00:00, 7551.05it/s]


Time elapsed for annoy_prediction: 0.1476612091064453
Method: Annoy - Metric: euclidean - Tree: 10 - Accuracy: 0.9



100%|██████████| 50/50 [00:00<00:00, 6001.12it/s]


Time elapsed for annoy_prediction: 0.19230151176452637
Method: Annoy - Metric: euclidean - Tree: 100 - Accuracy: 0.9666666666666667



100%|██████████| 50/50 [00:00<00:00, 2224.22it/s]


Time elapsed for annoy_prediction: 0.1911003589630127
Method: Annoy - Metric: euclidean - Tree: 1000 - Accuracy: 0.9666666666666667



100%|██████████| 50/50 [00:00<00:00, 8288.48it/s]


Time elapsed for annoy_prediction: 0.20042037963867188
Method: Annoy - Metric: dot - Tree: 10 - Accuracy: 0.35333333333333333



100%|██████████| 50/50 [00:00<00:00, 5986.73it/s]


Time elapsed for annoy_prediction: 0.1830732822418213
Method: Annoy - Metric: dot - Tree: 100 - Accuracy: 0.38



100%|██████████| 50/50 [00:00<00:00, 1762.24it/s]


Time elapsed for annoy_prediction: 0.2177422046661377
Method: Annoy - Metric: dot - Tree: 1000 - Accuracy: 0.38



Time elapsed for grid_search_annoy: 4.075116872787476


[('angular', 10, 0.72),
 ('angular', 100, 0.7133333333333334),
 ('angular', 1000, 0.7133333333333334),
 ('euclidean', 10, 0.9),
 ('euclidean', 100, 0.9666666666666667),
 ('euclidean', 1000, 0.9666666666666667),
 ('dot', 10, 0.35333333333333333),
 ('dot', 100, 0.38),
 ('dot', 1000, 0.38)]
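
One plausible reading of the gap between metrics is that the ground truth was computed under Euclidean distance (an assumption, since `compute_gt` is not shown in this section), so an index that ranks by angle or by dot product can order the same candidates differently. The sketch below uses hypothetical 2-D vectors and plain Python rather than Annoy itself to show the three rankings disagreeing on the top neighbor.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Hypothetical query and candidates, chosen so that each metric prefers a different one.
query = (1.0, 0.0)
candidates = {
    "close": (1.0, 0.5),    # smallest Euclidean distance to the query
    "aligned": (3.0, 0.0),  # same direction as the query, so cosine is 1
    "large": (4.0, 3.0),    # biggest dot product because of its magnitude
}

nearest_euclidean = min(candidates, key=lambda n: euclidean(query, candidates[n]))
# Highest cosine similarity is equivalent to smallest angular distance.
nearest_angular = max(candidates, key=lambda n: cosine(query, candidates[n]))
nearest_dot = max(candidates, key=lambda n: dot(query, candidates[n]))

print(nearest_euclidean, nearest_angular, nearest_dot)
```

Under this reading, a lower score for `angular` or `dot` says more about the mismatch with the ground-truth metric than about the quality of the index itself.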

## Wikipedia dataset
### type_features: `Word2Vec`
**Try the LSH method from PySpark (not optimized) and the approximate nearest-neighbor index from Annoy**

In [43]:
limit = 50  # Use -1 for no limit
data = "titles"  # Use "barcelona" or "titles"
nearest = 3
use_stopwords = True  # Use True or False
use_custom_stopwords = False  # Use True or False
latent_features = 20  # Dimension of features
type_features = "word_to_vec"  # Use "tfidf" or "word_to_vec"

In [44]:
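print(f"Reading data for {data} with limit {limit} and features {type_features} ...\n")

# Load the dataset and keep at most `limit` rows
df = read_data(spark, data)
df = limit_data(df, limit)

# Build the feature vectors from the "Page" column
df = get_features(df, input_col="Page", type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
# Exact nearest neighbors, used as ground truth (gt) for the accuracy scores
df = compute_gt(df, spark)

# Evaluate PySpark LSH over the number of hash tables
grid_search_lsh(df, spark)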

Reading data for titles with limit 50 and features word_to_vec ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.08867573738098145
Grid search for LSH


100%|██████████| 50/50 [00:09<00:00,  5.14it/s]


Time elapsed for lsh_prediction: 10.199967861175537
Method: LSH - Num Hash Tables: 5 - Accuracy: 0.05333333333333334



100%|██████████| 50/50 [00:09<00:00,  5.02it/s]


Time elapsed for lsh_prediction: 10.422884702682495
Method: LSH - Num Hash Tables: 10 - Accuracy: 0.05333333333333334



100%|██████████| 50/50 [00:08<00:00,  5.57it/s]


Time elapsed for lsh_prediction: 9.395716905593872
Method: LSH - Num Hash Tables: 20 - Accuracy: 0.05333333333333334



100%|██████████| 50/50 [00:09<00:00,  5.29it/s]


Time elapsed for lsh_prediction: 9.97730803489685
Method: LSH - Num Hash Tables: 100 - Accuracy: 0.05333333333333334



Time elapsed for grid_search_lsh: 41.161221742630005


[(5, 0.05333333333333334),
 (10, 0.05333333333333334),
 (20, 0.05333333333333334),
 (100, 0.05333333333333334)]
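
The flat 0.053 accuracy across every table count is close to what random guessing would give (picking 3 neighbors out of roughly 50 rows is on the order of 0.06). One possible explanation, assuming `grid_search_lsh` is built on the `MinHashLSH` imported at the top of the notebook, is that MinHash targets Jaccard distance over the set of non-zero vector indices. Dense Word2Vec embeddings have essentially every index non-zero, so all rows look identical under that distance, whereas sparse TF-IDF rows do not. A plain-Python sketch with hypothetical vectors:

```python
def jaccard_distance_nonzero(a, b):
    """Jaccard distance between the sets of non-zero indices of two vectors."""
    set_a = {i for i, x in enumerate(a) if x != 0}
    set_b = {i for i, x in enumerate(b) if x != 0}
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical dense embeddings: very different values, but every index is non-zero.
dense_u = [0.12, -0.40, 0.33, 0.05]
dense_v = [-0.91, 0.22, -0.10, 0.77]

# Hypothetical sparse TF-IDF-like rows: only some indices are non-zero.
sparse_u = [0.0, 1.3, 0.0, 0.7]
sparse_v = [0.9, 1.3, 0.0, 0.0]

dense_distance = jaccard_distance_nonzero(dense_u, dense_v)
sparse_distance = jaccard_distance_nonzero(sparse_u, sparse_v)
print(dense_distance, sparse_distance)
```

If this is what happens, adding hash tables cannot help, because the distance carries no information for dense inputs. An LSH family designed for Euclidean distance may be a better match for Word2Vec features.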

In [45]:
print(f"Reading data for {data} with limit {limit} and features {type_features} ...\n")

# Load the dataset and keep at most `limit` rows
df = read_data(spark, data)
df = limit_data(df, limit)

# Build the feature vectors from the "Page" column
df = get_features(df, input_col="Page", type_features=type_features)

print("Calculating gt neighbors nearest neighbors, could take a while...")
# Exact nearest neighbors, used as ground truth (gt) for the accuracy scores
df = compute_gt(df, spark)

# Evaluate Annoy over its metric and tree-count grid
grid_search_annoy(df, k=nearest, input_col="features", output_col="ann_neighbors")

Reading data for titles with limit 50 and features word_to_vec ...

Calculating gt neighbors nearest neighbors, could take a while...
Time elapsed for compute_gt: 0.15445518493652344
Grid search for Annoy


100%|██████████| 50/50 [00:00<00:00, 59426.24it/s]


Time elapsed for annoy_prediction: 0.15851211547851562
Method: Annoy - Metric: angular - Tree: 10 - Accuracy: 0.5



100%|██████████| 50/50 [00:00<00:00, 25052.59it/s]


Time elapsed for annoy_prediction: 0.16124391555786133
Method: Annoy - Metric: angular - Tree: 100 - Accuracy: 0.5066666666666667



100%|██████████| 50/50 [00:00<00:00, 3065.29it/s]


Time elapsed for annoy_prediction: 0.25551581382751465
Method: Annoy - Metric: angular - Tree: 1000 - Accuracy: 0.5066666666666667



100%|██████████| 50/50 [00:00<00:00, 66031.23it/s]


Time elapsed for annoy_prediction: 0.185532808303833
Method: Annoy - Metric: euclidean - Tree: 10 - Accuracy: 0.88



100%|██████████| 50/50 [00:00<00:00, 20944.29it/s]


Time elapsed for annoy_prediction: 0.1557481288909912
Method: Annoy - Metric: euclidean - Tree: 100 - Accuracy: 1.0



100%|██████████| 50/50 [00:00<00:00, 2407.92it/s]


Time elapsed for annoy_prediction: 0.21593117713928223
Method: Annoy - Metric: euclidean - Tree: 1000 - Accuracy: 1.0



100%|██████████| 50/50 [00:00<00:00, 55319.23it/s]


Time elapsed for annoy_prediction: 0.18690085411071777
Method: Annoy - Metric: dot - Tree: 10 - Accuracy: 0.26666666666666666



100%|██████████| 50/50 [00:00<00:00, 16586.14it/s]


Time elapsed for annoy_prediction: 0.19915127754211426
Method: Annoy - Metric: dot - Tree: 100 - Accuracy: 0.22



100%|██████████| 50/50 [00:00<00:00, 2573.38it/s]


Time elapsed for annoy_prediction: 0.18084073066711426
Method: Annoy - Metric: dot - Tree: 1000 - Accuracy: 0.22



Time elapsed for grid_search_annoy: 4.051683187484741


[('angular', 10, 0.5),
 ('angular', 100, 0.5066666666666667),
 ('angular', 1000, 0.5066666666666667),
 ('euclidean', 10, 0.88),
 ('euclidean', 100, 1.0),
 ('euclidean', 1000, 1.0),
 ('dot', 10, 0.26666666666666666),
 ('dot', 100, 0.22),
 ('dot', 1000, 0.22)]
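
To read a grid like the one above programmatically, one option is to rank the tuples by accuracy and break ties in favor of the smaller tree count, since more trees make the index larger and slower to build. A minimal sketch using the Word2Vec results printed above. The tie-breaking rule is an assumption, not something the notebook states.

```python
# (metric, number of trees, accuracy) tuples copied from the grid search output.
results = [
    ("angular", 10, 0.5),
    ("angular", 100, 0.5066666666666667),
    ("angular", 1000, 0.5066666666666667),
    ("euclidean", 10, 0.88),
    ("euclidean", 100, 1.0),
    ("euclidean", 1000, 1.0),
    ("dot", 10, 0.26666666666666666),
    ("dot", 100, 0.22),
    ("dot", 1000, 0.22),
]

# Highest accuracy first; among equal accuracies prefer fewer trees.
best_metric, best_trees, best_accuracy = min(results, key=lambda r: (-r[2], r[1]))
print(best_metric, best_trees, best_accuracy)
```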

# Results and conclusions: