# Exercise 2

In this exercise we implemented the LSH (locality-sensitive hashing) algorithm to identify similar news articles. The implementation uses Spark, more specifically the PySpark library with the DataFrame API.

## Imports

PySpark is the only non-standard library required.

In [72]:
import os.path
import random
import math
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Row, DataFrame
from pyspark.sql.types import StringType, ArrayType, IntegerType, StructType, StructField, DecimalType, DoubleType
from itertools import combinations, chain
from functools import partial
from typing import Iterable, Any, List, Callable

## Parameters

The values for parameters $b$ and $r$ chosen, according to the requirements, were:
- $b = 13$
- $r = 11$

The values were hand-picked by visually analyzing the plot for the probability of two documents sharing a bucket depending on their similarity, as $b$ and $r$ changed.
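
The same check can also be done programmatically. The sketch below is only an illustration, not the procedure we followed: it enumerates $(r, b)$ combinations and keeps the ones that meet both requirements. The helper `valid_parameters` and its `max_functions` budget are hypothetical names introduced here.

```python
def prob(s: float, r: int, b: int) -> float:
    """Probability that two documents with similarity s share at least one bucket."""
    return 1 - (1 - s**r)**b


def valid_parameters(max_functions: int = 150):
    """Return every (r, b) with r*b <= max_functions that satisfies both requirements."""
    found = []
    for r in range(1, max_functions + 1):
        for b in range(1, max_functions // r + 1):
            # At least 90% at s = 0.85, and less than 5% at s = 0.6
            if prob(0.85, r, b) >= 0.9 and prob(0.6, r, b) < 0.05:
                found.append((r, b))
    return found


candidates = valid_parameters()
print((11, 13) in candidates)
```

Any pair returned this way could then still be compared visually on the plot before settling on one.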

In [73]:
N = 100

r = 11
b = 13

point_below = (0.85, 0.9)
point_above = (0.6, 0.05)

prob = lambda s, r, b: 1 - (1 - s**r)**b

try:
    import matplotlib.pyplot as plt

    ss = [i/N for i in range(N)]

    plt.plot(*point_below, color='g', marker='o')
    plt.plot(*point_above, color='r', marker='o')
    plt.plot(ss, [prob(s, r, b) for s in ss])

    plt.title(f'Probability of two documents sharing a bucket w.r.t. their similarity $s$\n($r={r}$, $b={b}$)')
    plt.legend(['at least', 'less than', 'probability'])
    plt.xlabel('$s$')
    plt.ylabel('probability')

    plt.show()

except ImportError:
    print('Could not plot, since the \'matplotlib\' module is not present.')

assert prob(point_below[0], r, b) >= point_below[1], 'Pairs with a similarity of 85% should have at least 90% probability of sharing a bucket!'
assert prob(point_above[0], r, b) <  point_above[1], 'Pairs with a similarity of 60% should have less than 5% probability of sharing a bucket!'

Could not plot, since the 'matplotlib' module is not present.


Here we define all the variables used as parameters for the algorithm.

In [74]:
# Shingle size
k = 9

# Number of bands
b = 13

# Number of rows per band
r = 11

# Min-hash: number of hash functions
num_functions = b*r

# Seed for the random number generator
seed = 123

# Minimum estimated similarity (fraction of matching min-hash values) for a candidate pair to be kept
similarity_threshold = 0.85

# Sample of the dataset to use
sample_fraction = 0.01

# Number of explicit partitions
num_partitions = 8

In [75]:
random.seed(seed)

## Spark Initialization

Spark is initialized with as many worker threads as there are logical cores on the machine.
We did not use a fixed value since the machines used for development had different numbers of CPU cores.

In [76]:
spark = SparkSession.builder \
    .appName('LSH') \
    .config('spark.master', 'local[*]') \
    .getOrCreate()

## Prepare the Data

The dataset is about news on Twitter, where each row contains a tweet id, its text and the URL of the tweet.

The data is in JSON format and is loaded into a dataframe.

In [77]:
# TODO: Configure partitions for speedup?
df = spark.read \
    .json('./data/covid_news_small.json.bz2') \
    .repartition(num_partitions)

                                                                                

The dataframe `df` will have three columns: `text`, `tweet_id` and `url`.

## Pipeline

### Generate shingles

In [78]:
# TODO: ignore punctuation? use different shingling strategy? (see 'Further fun' slide, which is the last, of 3b)
@F.udf(returnType=ArrayType(IntegerType(), False))
def generate_shingles(text: str):
    shingles = (text[idx:idx+k] for idx in range(len(text) - k + 1))
    # Get last 32 bits in order to have 4-byte integers (Python allows arbitrarily large integers)
    to_integer = lambda s: hash(s) & ((1 << 32) - 1)
    return list(set(to_integer(shingle_str) for shingle_str in shingles))

The first step of the algorithm is to generate the shingles for each document/tweet.
We accomplish this by first using a filter to remove all the tweets that are too short to have at least one shingle of size `k`.
Then we use a UDF to create all the shingles of each `text`, immediately hashing each one to a 32-bit number.

In [79]:
df_shingles = df \
    .drop('url') \
    .filter(F.length('text') >= k) \
    .repartition(num_partitions) \
    .withColumn('shingles', generate_shingles('text')) \
    .drop('text')

With this, the dataframe `df_shingles` will be composed of two columns: `tweet_id` and `shingles`, the latter being an array of the hashed shingles for this tweet.
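
As a minimal illustration of the shingling step outside Spark, the sketch below builds the hashed shingle set of a single string. It is not the UDF above: the function name `hashed_shingles` is hypothetical, and `zlib.crc32` is swapped in for Python's built-in `hash()` only because it gives the same values on every run.

```python
import zlib


def hashed_shingles(text: str, k: int = 9) -> set:
    """Set of 32-bit hashes of every k-character shingle in text."""
    shingles = (text[i:i + k] for i in range(len(text) - k + 1))
    # crc32 is an assumption made for reproducibility; the notebook uses hash()
    return {zlib.crc32(s.encode('utf-8')) for s in shingles}


example = hashed_shingles('abcdefghij')  # shingles: 'abcdefghi' and 'bcdefghij'
too_short = hashed_shingles('abc')       # shorter than k, so no shingles at all
print(len(example), len(too_short))
```

The empty result for the short string is the reason the pipeline filters out texts with fewer than `k` characters before calling the UDF.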

### Min-hash

The next step is to generate the min-hash signatures.

First we need to generate the hash functions. The function below generates `num_functions` hash functions from the universal hash family contained in the lecture slides, $h_{a,b}(x) = ((a x + b) \bmod p) \bmod N$.
Our `N` is the number of possible shingles (in this case our hashed shingles are 32-bit integers, so `N` is `2**32`), and `p` is a prime number larger than `N` (here `2**61 - 1`).

In [80]:
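# Assumes the values to hash are 4-byte integers.
# Returns K distinct parameter tuples (a, b, p, N), one per hash function
# h(x) = ((a*x + b) % p) % N, rather than callables.
def generate_universal_hash_family(K: int) -> List[tuple]:
    N = 1 << 32
    p = 2305843009213693951  # Mersenne prime 2**61 - 1

    parameters = set()
    # Keep drawing until K distinct (a, b) pairs are available
    while len(parameters) < K:
        parameters |= {(random.randint(1, N), random.randint(0, N)) for _ in range(K - len(parameters))}

    return [(a, b, p, N) for a, b in parameters]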

In [81]:
hash_family = generate_universal_hash_family(num_functions)
broadcasted_hash_family = spark.sparkContext.broadcast(hash_family)

Then we just need to use the generated hash functions to calculate the min-hash signature of each tweet.

In [82]:
@F.udf(returnType=ArrayType(IntegerType(), False))
def calculate_min_hash(shingles: List[int]):
    return [min(((a * shingle + b) % p) % N for shingle in shingles) for (a, b, p, N) in broadcasted_hash_family.value]

In [83]:
df_minhash = df_shingles.withColumn('min_hash', calculate_min_hash('shingles')).drop('shingles')

With this, the dataframe `df_minhash` will be composed of two columns: `tweet_id` and `min_hash`. `tweet_id` is the id of the document/tweet, and `min_hash` is its signature: a list of `num_functions` integers computed by the `calculate_min_hash` UDF, each one being the minimum value that one of the hash functions takes over the shingles of the document/tweet.
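
The signature computation can be sketched in plain Python as follows. This is a simplified stand-in for the cells above, under stated assumptions: `make_hash_family` and `min_hash_signature` are hypothetical names, and the sketch does not enforce distinct $(a, b)$ pairs as the notebook does.

```python
import random


def make_hash_family(K: int, seed: int = 123):
    """K parameter tuples (a, b, p, N) for h(x) = ((a*x + b) % p) % N."""
    rng = random.Random(seed)
    N = 1 << 32
    p = (1 << 61) - 1  # a prime larger than N
    return [(rng.randint(1, N), rng.randint(0, N), p, N) for _ in range(K)]


def min_hash_signature(shingles, family):
    """One minimum per hash function, taken over all shingles of the document."""
    return [min(((a * s + b) % p) % N for s in shingles) for (a, b, p, N) in family]


family = make_hash_family(K=143)
sig_a = min_hash_signature({1, 2, 3, 4}, family)
sig_b = min_hash_signature({4, 3, 2, 1}, family)  # same set, so same signature
print(len(sig_a), sig_a == sig_b)
```

Because every document is hashed with the same family, two documents with many shingles in common will agree on many signature positions.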

In [84]:
#t = time.time()
#hash_col_names = [f'hashed_{i}' for i in range(num_functions)]

#data = df_shingles \
#    .withColumn('shingles', F.explode('shingles')) \
#    .withColumnRenamed('shingles', 'shingle') \
#    .select('tweet_id', *( (((a * F.col('shingle').cast(DecimalType()) + b) % p) % N).alias(name) for (a, b, p, N), name in zip(hash_family, hash_col_names) )) \
#    .groupby('tweet_id') \
#    .min(*hash_col_names) \
#    .withColumn('min_hash', F.array(*(f'min({name})' for name in hash_col_names))) \
#    .select('tweet_id', 'min_hash') \
#    .collect()

#print('Execution time:', time.time() - t)

The min-hash signatures are saved to disk in Parquet format (Spark's default format) for later use. If the output already exists, it is reused instead of being recomputed.

In [85]:
fname_minhash = f'minhash_{r}_{b}'
if not os.path.exists(fname_minhash):
    df_minhash.write.mode('overwrite').parquet(path=fname_minhash, compression='gzip')

df_minhash = spark.read.parquet(fname_minhash)

### LSH

The last step is to apply the LSH algorithm to the min-hash signatures. We will use the LSH algorithm described in the slides.

First we need to divide the min-hash signatures into `b` bands, each of size `r`. We will use the following UDF to do so:

In [86]:
@F.udf(returnType=ArrayType(ArrayType(IntegerType(), False), False))
def generate_even_slices(minhashes: List[int]):
    return [minhashes[i:i+r] for i in range(0, num_functions, r)]

Second, we need to hash the min-hash values of each band of each document to obtain its bucket identifier in that band.
For this we use Spark's built-in `F.hash` function, creating a column named `bands` that holds, for each `tweet_id`, one struct per band with the bucket identifier (`band_hash`) and the `band` number.

At the end of the function we explode this array into one row per band and split the struct into two separate columns, `band` and `band_hash`.

In [87]:
def create_df_bands(df: DataFrame) -> DataFrame:
    return df \
        .withColumn('min_hash_slices', generate_even_slices('min_hash')) \
        .withColumn('bands', F.array(*(
            F.struct(
                F.hash(F.col('min_hash_slices')[band]).alias('band_hash'),
                F.lit(band).alias('band')
            )
            for band in range(b))
        )) \
        .withColumn('bands', F.explode('bands')) \
        .select('tweet_id', F.col('bands').band.alias('band'), F.col('bands').band_hash.alias('band_hash'))

df_bands = create_df_bands(df_minhash)

This leaves us with the dataframe `df_bands`, which will be composed of three columns: `tweet_id`, `band` and `band_hash`, the latter being the bucket identifier.
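
To make the banding step concrete, here is a small plain-Python sketch. It is an illustration rather than the Spark code above: `band_buckets` is a hypothetical helper, and it uses the band's tuple of values directly as the bucket key, whereas the notebook compresses it to an integer with `F.hash`.

```python
def band_buckets(signature, r):
    """Split a signature into bands of r values and return (band, bucket_key) pairs."""
    bands = [tuple(signature[i:i + r]) for i in range(0, len(signature), r)]
    # The tuple itself stands in for the hashed bucket identifier
    return [(band, rows) for band, rows in enumerate(bands)]


sig_x = [1, 2, 3, 4, 5, 6]
sig_y = [1, 2, 9, 9, 5, 6]  # agrees with sig_x in bands 0 and 2 only
shared = set(band_buckets(sig_x, r=2)) & set(band_buckets(sig_y, r=2))
shared_bands = sorted(band for band, _ in shared)
print(shared_bands)
```

Two documents become a candidate pair as soon as they share a bucket in at least one band, which is the case here.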

In [88]:
@F.udf(returnType=ArrayType(ArrayType(StringType(), False), False))
def combine_pairs(elems: Iterable[Any]):
    return list(combinations(elems, 2))

Having the bucket of each document/tweet in every band, we can now generate the pairs of documents/tweets that are candidates for being similar.
For this, we begin by grouping the documents/tweets by `band` and `band_hash`.
This, with a `collect_list`, gives us the list of tweets in each bucket, called `candidates`.
Then we sort the `candidates` list so that buckets with the same tweets compare equal, and filter out the rows that have only one tweet or fewer.
Selecting only `candidates` and applying `distinct()` removes buckets that repeat across bands, which eliminates most of the duplicates.
Finally, we use `combine_pairs` to expand each `candidates` list into all of its pairs, explode them into one row per pair and split each pair into two columns, named `candidate_pair_first` and `candidate_pair_second`.
A second `distinct()` removes the duplicate pairs that remain when the same pair appears in different buckets.
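
The grouping and pair-generation logic described above can be sketched without Spark as follows. The function `candidate_pairs` and the toy rows are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(rows):
    """rows: iterable of (tweet_id, band, band_hash). Returns the set of candidate pairs."""
    buckets = defaultdict(list)
    for tweet_id, band, band_hash in rows:
        buckets[(band, band_hash)].append(tweet_id)

    pairs = set()
    for members in buckets.values():
        # Sorting makes (a, b) and (b, a) the same pair; buckets of size 1 yield nothing
        pairs.update(combinations(sorted(members), 2))
    return pairs


rows = [('t1', 0, 17), ('t2', 0, 17), ('t3', 0, 99),
        ('t2', 1, 5), ('t1', 1, 5), ('t3', 1, 5)]
pairs = candidate_pairs(rows)
print(sorted(pairs))
```

Using a set plays the role of the two `distinct()` calls: the pair `('t1', 't2')` is produced by two buckets but kept only once.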

In [89]:
def create_df_candidate_pairs(df: DataFrame) -> DataFrame:
    return df \
        .groupby('band', 'band_hash') \
        .agg(F.collect_list('tweet_id')) \
        .withColumnRenamed('collect_list(tweet_id)', 'candidates') \
        .withColumn('candidates', F.array_sort('candidates')) \
        .filter(F.size('candidates') > 1) \
        .repartition(num_partitions) \
        .select('candidates') \
        .distinct() \
        .select(F.explode(combine_pairs('candidates')).alias('candidate_pair')) \
        .select(F.col('candidate_pair')[0].alias('candidate_pair_first'), F.col('candidate_pair')[1].alias('candidate_pair_second')) \
        .distinct() 

df_candidate_pairs = create_df_candidate_pairs(df_bands)

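Lastly, before saving the results to disk, we remove the false positives.
To decide whether a given pair is a false positive, we compare the min-hash signatures of its two documents using the UDF below: the fraction of positions on which they agree estimates their similarity, and pairs below `similarity_threshold` are discarded.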

In [90]:
@F.udf(returnType=IntegerType())
def min_hash_similar(min_hash_0: List[int], min_hash_1: List[int]):
    return sum((elem0 == elem1) for elem0, elem1 in zip(min_hash_0, min_hash_1))

In [91]:
def create_df_candidate_pairs_fpless(df_candidate_pairs: DataFrame, df_minhash: DataFrame, similarity_threshold: float) -> DataFrame:
    return df_candidate_pairs \
        .join(df_minhash, df_minhash['tweet_id'] == F.col('candidate_pair_first')) \
        .withColumnRenamed('min_hash', 'min_hash_first') \
        .drop('tweet_id') \
        .join(df_minhash, df_minhash['tweet_id'] == F.col('candidate_pair_second')) \
        .withColumnRenamed('min_hash', 'min_hash_second') \
        .drop('tweet_id') \
        .withColumn('similarity', min_hash_similar('min_hash_first', 'min_hash_second') / num_functions) \
        .filter(F.col('similarity') >= similarity_threshold)

df_candidate_pairs_fpless = create_df_candidate_pairs_fpless(df_candidate_pairs, df_minhash, similarity_threshold)

This leaves us with the dataframe `df_candidate_pairs_fpless`, which will be composed of five columns: `candidate_pair_first`, `candidate_pair_second`, `min_hash_first`, `min_hash_second` and `similarity`.
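
The false-positive filter amounts to the following plain-Python logic. This is a sketch with hypothetical names (`estimated_similarity`, `drop_false_positives`) and toy four-value signatures, not the Spark implementation above.

```python
def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def drop_false_positives(pairs, signatures, threshold=0.85):
    """Keep only the candidate pairs whose estimated similarity reaches the threshold."""
    return [(a, b) for a, b in pairs
            if estimated_similarity(signatures[a], signatures[b]) >= threshold]


signatures = {'t1': [1, 2, 3, 4], 't2': [1, 2, 3, 4], 't3': [1, 9, 9, 9]}
kept = drop_false_positives([('t1', 't2'), ('t1', 't3')], signatures)
print(kept)
```

The pair `('t1', 't3')` agrees on only one of four positions, so it is treated as a false positive and dropped.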

Finally, the results are saved to disk.

In [92]:
fname_candidate_pairs = f'candidate_pairs_{r}_{b}'
if not os.path.exists(fname_candidate_pairs):
    df_candidate_pairs_fpless.write.mode('overwrite').parquet(path=fname_candidate_pairs, compression='gzip')

df_candidate_pairs_fpless = spark.read.parquet(fname_candidate_pairs)

### For exercise 2.2 we developed a function to get similar articles.

This function keeps every pair that contains the given document/`tweet_id` and returns a list with the other article of each of those pairs.
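
On an in-memory list of pairs, the same lookup could be written as below. This is a hypothetical sketch (`similar_articles` is not part of the notebook) meant only to show what the function returns.

```python
def similar_articles(tweet_id, pairs):
    """Return the other member of every pair that contains tweet_id."""
    return [b if a == tweet_id else a for a, b in pairs if tweet_id in (a, b)]


pairs = [('t1', 't2'), ('t2', 't3'), ('t4', 't5')]
result = similar_articles('t2', pairs)
print(result)
```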

In [93]:
def get_similar_articles(tweet_id: str) -> List[str]:
    rows = df_candidate_pairs_fpless \
        .withColumn('candidate_pair_first', F.when(F.col('candidate_pair_first') != tweet_id, F.col('candidate_pair_first'))) \
        .withColumn('candidate_pair_second', F.when(F.col('candidate_pair_second') != tweet_id, F.col('candidate_pair_second'))) \
        .filter(F.col('candidate_pair_first').isNull() | F.col('candidate_pair_second').isNull()) \
        .select(F.coalesce('candidate_pair_first', 'candidate_pair_second').alias('similar_article')) \
        .collect()

    return [row.similar_article for row in rows]

## Analysis of false positives/negatives

Here we load a sample of the min-hash data to analyse the false positives and false negatives.

Then we generate the candidate pairs as before.

In [94]:
df_minhash_sample = df_minhash.sample(fraction=0.1, seed=seed, withReplacement=False)

df_candidate_pairs_sample = create_df_candidate_pairs(create_df_bands(df_minhash_sample))

df_candidate_pairs_fpless_sample = create_df_candidate_pairs_fpless(df_candidate_pairs_sample, df_minhash_sample, similarity_threshold)

In [102]:
df_candidate_pairs_sample.count()

6997

In [103]:
df_candidate_pairs_fpless_sample.count()

5645

Getting the false positive percentage (false discovery rate).

Since we have both the dataframe of candidate pairs and the dataframe of candidate pairs without false positives, the number of false positives is the difference between their row counts. Dividing it by the number of candidate pairs gives the rate.

In [95]:
# TODO: is this what we are supposed to calculate?
print(f'Percentage of false positives: {(df_candidate_pairs_sample.count() - df_candidate_pairs_fpless_sample.count()) / df_candidate_pairs_sample.count():%}')

Percentage of false positives: 19.322567%
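
The figure printed above follows directly from the two counts reported earlier for the sample. As a worked check in plain Python (the variable names are illustrative):

```python
candidate_pairs_count = 6997  # df_candidate_pairs_sample.count()
confirmed_pairs_count = 5645  # df_candidate_pairs_fpless_sample.count()

# Candidate pairs that did not survive the similarity filter
false_positives = candidate_pairs_count - confirmed_pairs_count
false_discovery_rate = false_positives / candidate_pairs_count
print(f'{false_positives} false positives, rate {false_discovery_rate:%}')
```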


Getting the false negative percentage (false omission rate).
The following code finds the false negatives by computing the min-hash similarity of every pair in the sample that was not selected as a candidate.

In [100]:
df_minhash_sample \
    .crossJoin(df_minhash_sample.select(F.col('tweet_id').alias('tweet_id_other'), F.col('min_hash').alias('min_hash_other'))) \
    .filter(F.col('tweet_id') < F.col('tweet_id_other')) \
    .select(F.array('tweet_id', 'tweet_id_other').alias('pair'),'min_hash', 'min_hash_other') \
    .join(df_candidate_pairs_sample, F.array(df_candidate_pairs_sample['candidate_pair_first'], df_candidate_pairs_sample['candidate_pair_second']) == F.col('pair'), 'left') \
    .filter(F.col('candidate_pair_first').isNull()) \
    .drop('candidate_pair_first', 'candidate_pair_second') \
    .withColumn('similarity', min_hash_similar('min_hash', 'min_hash_other') / num_functions) \
    .show()

+--------------------+--------------------+--------------------+--------------------+
|                pair|            min_hash|      min_hash_other|          similarity|
+--------------------+--------------------+--------------------+--------------------+
|[1375958178026901...|[4173847, 522047,...|[1867425, 3935602...|0.027972027972027972|
|[1375958178026901...|[4173847, 522047,...|[13250592, 486195...|0.013986013986013986|
|[1375958178026901...|[4173847, 522047,...|[181913, 643007, ...|0.006993006993006993|
|[1375958178026901...|[4173847, 522047,...|[13250592, 486195...|0.013986013986013986|
|[1375958178026901...|[4173847, 522047,...|[13250592, 486195...|0.013986013986013986|
|[1375958178026901...|[4173847, 522047,...|[822262, 731778, ...|0.013986013986013986|
|[1375958178026901...|[4173847, 522047,...|[1867425, 1306842...|0.013986013986013986|
|[1375958178026901...|[4173847, 522047,...|[857035, 2891616,...|0.006993006993006993|
|[1375958178026901...|[4173847, 522047,...|[1867425, 1

                                                                                

In [96]:
false_negatives = df_minhash_sample \
    .crossJoin(df_minhash_sample.select(F.col('tweet_id').alias('tweet_id_other'), F.col('min_hash').alias('min_hash_other'))) \
    .filter(F.col('tweet_id') < F.col('tweet_id_other')) \
    .select(F.array('tweet_id', 'tweet_id_other').alias('pair'),'min_hash', 'min_hash_other') \
    .join(df_candidate_pairs_sample, F.array(df_candidate_pairs_sample['candidate_pair_first'], df_candidate_pairs_sample['candidate_pair_second']) == F.col('pair'), 'left') \
    .filter(F.col('candidate_pair_first').isNull()) \
    .drop('candidate_pair_first', 'candidate_pair_second') \
    .withColumn('similarity', min_hash_similar('min_hash', 'min_hash_other') / num_functions) \
    .filter(F.col('similarity') >= similarity_threshold) \
    .count()

                                                                                

To calculate the percentage of false negatives, we divide the count obtained above by the total number of negatives, that is, the number of possible document pairs in the sample minus the number of candidate pairs.
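
Written as a standalone function, the calculation looks like this. The function name `false_omission_rate` and the numbers in the usage example are hypothetical and are not results from the dataset.

```python
import math


def false_omission_rate(false_negatives: int, num_documents: int, num_candidate_pairs: int) -> float:
    """False negatives divided by all pairs that LSH did not flag as candidates."""
    all_pairs = math.comb(num_documents, 2)
    negatives = all_pairs - num_candidate_pairs
    return false_negatives / negatives


# Hypothetical numbers, for illustration only: 4 documents give 6 pairs,
# 2 of them are candidates, and 1 of the remaining 4 is actually similar.
rate = false_omission_rate(false_negatives=1, num_documents=4, num_candidate_pairs=2)
print(f'{rate:%}')
```

Note that all three inputs must come from the same set of documents, otherwise the denominator does not count the right pairs.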

In [97]:
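# Use the candidate pairs of the sample, so that the denominator matches the sampled documents
print(f'Percentage of false negatives: {false_negatives / (math.comb(df_minhash_sample.count(), 2) - df_candidate_pairs_sample.count()):%}')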

Percentage of false negatives: 0.000000%


                                                                                