
# **Running PySpark in Colab**

To run Spark in Colab, first install the dependencies in the Colab environment: Apache Spark 3.0.1 built for Hadoop 2.7 (the version downloaded in the cell below), Java 8, and findspark, which locates Spark on the system. The installation can be carried out from inside the Colab notebook itself. If you are new to Spark, it is better to avoid version 2.4.0, since some users have reported compatibility issues with Python.
Follow these steps to install the dependencies:

## Installing the dependencies

In [None]:
# !wget -q https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# !tar xf spark-3.0.1-bin-hadoop2.7.tgz
# !pip install -q findspark

Now that Spark and Java are installed in Colab, set the environment variables that let PySpark run in your Colab environment. Set the locations of Java and Spark by running the following code:

In [21]:
import os
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "../spark-3.0.1-bin-hadoop2.7"
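
A wrong `SPARK_HOME` usually only surfaces later, when the session fails to start. The following is a minimal sketch of an early check, assuming that a Spark distribution always contains a `bin/` directory with its launcher scripts. The helper name `check_spark_home` is hypothetical and not part of findspark or PySpark.

```python
import os


def check_spark_home(path):
    """Return the absolute path if it looks like a Spark install, else raise."""
    abs_path = os.path.abspath(path)
    # Assumption: a Spark distribution ships its launcher scripts in bin/.
    if not os.path.isdir(os.path.join(abs_path, "bin")):
        raise FileNotFoundError(f"No Spark distribution found at {abs_path}")
    return abs_path


# Usage (commented out so the sketch runs without a Spark download):
# os.environ["SPARK_HOME"] = check_spark_home("../spark-3.0.1-bin-hadoop2.7")
```

Resolving the relative path to an absolute one also makes the setting independent of later changes to the working directory.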

Run a local Spark session to test your installation:

In [22]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Download the data:

In [1]:
# ! wget http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2020-08-18/data/listings.csv
# ! wget http://data.insideairbnb.com/the-netherlands/north-holland/amsterdam/2020-08-18/data/reviews.csv
# ! wget http://data.insideairbnb.com/spain/catalonia/barcelona/2020-09-12/visualisations/listings.csv

--2020-11-03 20:42:45--  http://data.insideairbnb.com/spain/catalonia/barcelona/2020-09-12/visualisations/listings.csv
Resolving data.insideairbnb.com (data.insideairbnb.com)... 52.216.77.19
Connecting to data.insideairbnb.com (data.insideairbnb.com)|52.216.77.19|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3137507 (3,0M) [application/csv]
Saving to: ‘listings.csv’


2020-11-03 20:42:50 (836 KB/s) - ‘listings.csv’ saved [3137507/3137507]



In [23]:
import pandas as pd
import tqdm
import numpy as np

In [69]:
listings = spark.read.csv('listings.csv',inferSchema=True, header =True)

train = spark.read.parquet('train_test_data/train.pkt')
test = spark.read.parquet('train_test_data/test.pkt')

In [70]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover
from pyspark.ml.feature import BucketedRandomProjectionLSH


tokenizer = Tokenizer(inputCol="name", outputCol="CleanTokens")
stopwordsremover = StopWordsRemover(inputCol="CleanTokens", outputCol="CleanTokensStopRemoved")
hashingTF = HashingTF(inputCol="CleanTokensStopRemoved", outputCol="VectorSpace", numFeatures=50)
idf = IDF(inputCol="VectorSpace", outputCol="VectorSpaceIDF")



In [71]:
pipeline = Pipeline(stages=[tokenizer, stopwordsremover, hashingTF, idf])
pipelineModel = pipeline.fit(train)
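
To make the four pipeline stages concrete, here is a pure-Python sketch of what they compute on a few listing names. It is an illustration under stated assumptions, not Spark's implementation: `STOP_WORDS` is a tiny hypothetical list (Spark ships a much longer default), and `zlib.crc32` stands in for the MurmurHash3 function that `HashingTF` uses, so bucket indices will differ from Spark's. The IDF weight follows the formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`.

```python
import math
import zlib

STOP_WORDS = {"with", "the", "a", "in"}  # hypothetical, for illustration only


def tokenize(text):
    # Like Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def remove_stop_words(tokens):
    # Like StopWordsRemover: drop tokens found in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_features=50):
    # Hash each token to one of num_features buckets and count occurrences.
    # crc32 is a stand-in hash; Spark's HashingTF uses MurmurHash3.
    vec = [0.0] * num_features
    for t in tokens:
        vec[zlib.crc32(t.encode("utf-8")) % num_features] += 1.0
    return vec


def fit_idf(tf_vectors):
    # One weight per bucket: log((m + 1) / (df + 1)),
    # where m is the number of documents and df the document frequency.
    m = len(tf_vectors)
    size = len(tf_vectors[0])
    df = [sum(1 for v in tf_vectors if v[j] > 0) for j in range(size)]
    return [math.log((m + 1) / (d + 1)) for d in df]


names = ["Cozy Apartment", "Beautiful Apartment with Balcony", "Cozy Room in the Center"]
tf = [hashing_tf(remove_stop_words(tokenize(name))) for name in names]
idf = fit_idf(tf)
tfidf = [[count * weight for count, weight in zip(v, idf)] for v in tf]
print(len(tfidf), len(tfidf[0]))
```

With only 50 buckets, as in the notebook's `numFeatures=50`, distinct words can land in the same bucket, which is worth keeping in mind when interpreting the neighbour results.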

In [72]:
# The grid search below fits the LSH model on df_pipe, so it must be defined.
df_pipe = pipelineModel.transform(train)

Learning LSH

In [73]:
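# Reuse the pipeline fitted on the training data; refitting it on the test set
# is unnecessary and would base the IDF weights on test data.
test_prepared = pipelineModel.transform(test)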

In [74]:
test_pd = test_prepared.toPandas()

In [75]:
looking_for = test_pd.VectorSpaceIDF.to_list()

In [31]:
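def find_neighbours(data, value, number, colName):
    # Relies on a global `model`: a fitted BucketedRandomProjectionLSH model.
    result = model.approxNearestNeighbors(data, value, number, distCol=colName)
    # Skip the first row, which is assumed to be the query item itself.
    return result.select("id").toPandas()['id'][1:].to_list()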



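This approach is slow in any case, since `approxNearestNeighbors` is called once per query vector and each result is collected with `toPandas()`.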

In [94]:
def compare_lists(a,b):
    results = []
    for i, _ in enumerate(a):
        intersection = set(a[i]).intersection(b[i])
        results.append(len(intersection))
        
    return np.mean(results), np.sum(np.array(results) > 1)/len(b), np.sum(np.array(results) == 5)/len(b) 
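
The three scores are easier to read with a small worked example. The sketch below restates the metric in plain Python (the name `overlap_scores` and the toy id lists are made up for illustration) and assumes both inputs have the same length, in which case it matches `compare_lists`.

```python
def overlap_scores(predicted, truth):
    # Number of ids shared between prediction and ground truth, per query.
    hits = [len(set(p) & set(t)) for p, t in zip(predicted, truth)]
    n = len(hits)
    mean_hits = sum(hits) / n                       # average overlap
    share_more_than_one = sum(h > 1 for h in hits) / n  # queries with 2+ hits
    share_all_five = sum(h == 5 for h in hits) / n      # queries with all 5 hits
    return mean_hits, share_more_than_one, share_all_five


# Toy data: three queries, five neighbours each.
predicted = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [1, 2, 3, 4, 5]]
truth = [[1, 2, 3, 4, 5], [6, 0, 0, 0, 0], [1, 2, 0, 0, 0]]

# Per-query overlaps are 5, 1 and 2.
mean_hits, share_gt1, share_all = overlap_scores(predicted, truth)
print(mean_hits, share_gt1, share_all)
```

Here the mean overlap is 8/3, two of the three queries have more than one correct neighbour, and one of the three recovers all five.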

In [95]:
score_a, score_b, score_c = compare_lists(out, test_pd['ground_truth'])

In [93]:
score_a, score_b, score_c

(1.7424354243542435, 0.5682656826568265, 0.07060270602706027)

In [39]:
test_pd.loc[test_pd.id.isin(out[5])].name

177           Beautiful Apartment Sagrada Familia
206           Gran Vía Apartment Catalunya square
285            Beautiful Double Room With Balcony
327                                Cozy Apartment
331    Beautiful Double Room with private terrace
Name: name, dtype: object

In [85]:
grid_bucket_length = [2, 10]
grid_num_hash_tables = [1, 10]

def find_neighbours(model_, data, value, number, colName):
    result = model_.approxNearestNeighbors(data, value, number, distCol=colName)
    
    return result.select("id").toPandas()['id'][1:].to_list()

def compare_lists(a,b):
    results = []
    for i, _ in enumerate(a):
        intersection = set(a[i]).intersection(b[i])
        results.append(len(intersection))
        
#     return np.mean(results), results
    return np.mean(results), np.sum(np.array(results) > 1)/len(b), np.sum(np.array(results) == 5)/len(b), results


from tqdm.notebook import tqdm as notebook_tqdm  # replaces the deprecated tqdm.tqdm_notebook


def grid_search_lsh(train_data, test_data, grid_bucket_length, grid_num_hash_tables, targets, limit=300):
    results = []
    for bucket_length in grid_bucket_length:
        for n_hash_table in grid_num_hash_tables:
            brp = BucketedRandomProjectionLSH(inputCol="VectorSpaceIDF",
                                              outputCol="hashes",
                                              bucketLength=bucket_length,
                                              numHashTables=n_hash_table)
            # Fit the LSH model on the training vectors.
            model = brp.fit(train_data)
            print(f'Models params: \nbucket: {model.getBucketLength()}, n_ht: {model.getNumHashTables()}')

            looking_for = targets.VectorSpaceIDF.to_list()
            print(f'Calculating LSH for bucket_length={bucket_length} and numHashTables = {n_hash_table}')
            # Search within (and query with) the first `limit` test rows only.
            test = test_data.limit(limit)
            targ = looking_for[:limit]
            prediction = []

            for key in notebook_tqdm(targ):
                # Ask for 6 neighbours: find_neighbours drops the first one.
                # The distance column gets its own name instead of reusing "hashes".
                neigh = find_neighbours(model, test, key, 6, 'distCol')
                prediction.append(neigh)

            # Compare against the same number of ground-truth rows as were queried.
            truth = targets['ground_truth'].to_list()[:limit]
            score_a, score_b, score_c, num_neighb = compare_lists(prediction, truth)
            print(f'Total score: {score_a}, {score_b}, {score_c}\n {num_neighb[:20]} \n{prediction[:20]}')
            results.append([bucket_length, n_hash_table, score_a, score_b, score_c, model])
    return results
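
The two grid parameters are easier to reason about with the hash function written out. Spark documents the bucketed random projection hash as `h(x) = floor((x . v) / r)`, where `v` is a random unit vector and `r` is the bucket length; `numHashTables` sets how many such functions are used. The sketch below is a plain-Python illustration of that formula only. The helper names are hypothetical, and the any-table-matches rule in `is_candidate` is the simplest OR-amplified form; Spark's candidate selection inside `approxNearestNeighbors` is more involved and is not reproduced here.

```python
import math
import random


def make_hash_tables(dim, num_hash_tables, seed=0):
    # One random unit vector per hash table.
    rng = random.Random(seed)
    tables = []
    for _ in range(num_hash_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        tables.append([x / norm for x in v])
    return tables


def lsh_hash(x, tables, bucket_length):
    # h(x) = floor((x . v) / bucket_length), one value per table.
    return [math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
            for v in tables]


def is_candidate(h1, h2):
    # Simplified rule: candidates if ANY table puts both points in one bucket.
    return any(a == b for a, b in zip(h1, h2))


tables = make_hash_tables(dim=3, num_hash_tables=2)
p, q = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
for bucket_length in (0.5, 2.0, 10.0):
    hp = lsh_hash(p, tables, bucket_length)
    hq = lsh_hash(q, tables, bucket_length)
    print(bucket_length, hp, hq, is_candidate(hp, hq))
```

A larger bucket length makes buckets wider, so more points collide and more candidates have to be checked; more hash tables give each true neighbour more chances to collide, at the cost of extra computation.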


In [None]:
res = grid_search_lsh(df_pipe, test_prepared, grid_bucket_length,
                      grid_num_hash_tables, test_pd, limit=test_prepared.count())

Models params: 
bucket: 2.0, n_ht: 1
Calculating LSH for bucket_length=2 and numHashTables = 1



In [None]:
res
