# Example 02: Text Similarity: LSH

## Locality-Sensitive Hashing (LSH) Algorithms

LSH for Euclidean distance takes dense or sparse vectors as input, each representing a point in Euclidean space, and outputs hash vectors of configurable dimension. Hash values in the same dimension are computed by the same hash function. The cells in this notebook use `MinHashLSH`, the LSH family for Jaccard distance, applied to sets of word 3-grams.
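
To make the Euclidean case concrete, here is a minimal pure-Python sketch of one common construction: each hash function projects a point onto a random unit vector `v` and returns `floor((x . v) / r)` for a bucket length `r`. The function names, the bucket length of 2.0, and the seeds are assumptions chosen for illustration; this is not Spark's implementation.

```python
import math
import random

def make_projection_hash(dim, bucket_length, seed):
    """Build one hash function h(x) = floor((x . v) / bucket_length)."""
    rng = random.Random(seed)
    # Draw a random direction and normalise it to unit length.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]

    def h(x):
        dot = sum(xi * vi for xi, vi in zip(x, v))
        return math.floor(dot / bucket_length)

    return h

# Three hash functions give a 3-dimensional hash vector per point.
hash_functions = [
    make_projection_hash(dim=2, bucket_length=2.0, seed=s) for s in range(3)
]

def hash_vector(x):
    """Apply every hash function to the same point."""
    return [h(x) for h in hash_functions]

near_a = hash_vector([1.0, 1.0])
near_b = hash_vector([1.1, 0.9])
far_c = hash_vector([40.0, -35.0])
print(near_a, near_b, far_c)
```

Because `v` has unit length, two points that are closer together than the bucket length can differ by at most one bucket in each dimension, which is what makes the hash locality-sensitive.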

In [1]:
# Load Spark libraries
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import *

import time, os, string

from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import RegexTokenizer
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import NGram
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.feature import BucketedRandomProjectionLSH

from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import regexp_replace, trim, col, lower, when, size, lit, avg

In [2]:
# Configuration

cwd = os.getcwd()
book_folder = "/data/"
book1 = 'file://'+cwd+book_folder+"01-Harry_Potter_and_the_Sorcerers_Stone.txt.gz"
book2 = 'file://'+cwd+book_folder+"02-Harry_Potter_and_the_Chamber_of_Secrets.txt.gz"

In [3]:
# Functions to convert to lower case, remove punctuation, and drop empty lines

def removePunctuation(column):
    return trim(lower(regexp_replace(column,'[!,*)@#%|“”(&$_?.^—]', ''))).alias('text')

class RemoveEmptyLines(Transformer):
    def __init__(self, column: str):
        super(RemoveEmptyLines, self).__init__()
        self.column = column

    def _transform(self, df: DataFrame) -> DataFrame:
        return df.withColumn(self.column, when(size(col(self.column)) == 0, lit(None)).otherwise(col(self.column))).na.drop()
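
The cleaning step can be tried outside Spark. The sketch below applies the same character class as `removePunctuation` with Python's `re` module; `clean_line` and the sample sentence are hypothetical names and data for illustration, and `str.strip` only approximates Spark's `trim`.

```python
import re

# Same character class as removePunctuation. The curly quotes and the long
# dash are written as Unicode escapes so the pattern is unambiguous.
PUNCTUATION = re.compile("[!,*)@#%|\u201c\u201d(&$_?.^\u2014]")

def clean_line(line):
    """Strip the listed punctuation, lower-case, and trim whitespace."""
    return PUNCTUATION.sub("", line).lower().strip()

sample = "  \u201cMr. Dursley,\u201d said Harry\u2019s aunt!  "
print(clean_line(sample))
print(clean_line("Wait; what?"))
```

Characters outside the class, such as the curly apostrophe, semicolons, and colons, are left in place, which is why tokens like `she’s` survive in the later output.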

## Create Spark Session

In [4]:
# Create local Spark session
sc = SparkSession.builder\
     .appName("SparkSimilarityLSH")\
     .master("local[*]") \
     .getOrCreate()

start_time = time.time()

### Reading Data

In [5]:
# Read the file for Book 1
text_1 = sc.read.text(book1)
text_1.show(10, truncate = False)
# Remove blank lines
text_1 = text_1.filter("value != ''")
# Remove punctuation; removePunctuation already aliases the result column to 'text'
text_1 = text_1.select(removePunctuation(col('value')))
text_1.show(10, truncate = False)


In [6]:
# Read the file for Book 2
text_2 = sc.read.text(book2)
text_2.show(10, truncate = False)
# Remove blank lines
text_2 = text_2.filter("value != ''")
# Remove punctuation; removePunctuation already aliases the result column to 'text'
text_2 = text_2.select(removePunctuation(col('value')))
text_2.show(10, truncate = False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                     |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Not for the first time, an argument had broken out over breakfast at number four, Privet Drive. Mr. Vernon Dursley had been woken in the early hours of the morning by a loud, hooting noise from his nephew Harry’s room.|

## Calculate Similarity

### Tokenizer

In [7]:
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", minTokenLength=2, toLowercase=True)
tokenData = tokenizer.transform(text_1)
tokenData.show(truncate = False)

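
As a rough illustration of what the tokenizer does with these settings, the sketch below splits on whitespace, lower-cases, and drops tokens shorter than two characters. `regex_tokenize` is a hypothetical helper, and the whitespace pattern is assumed to be the default that `RegexTokenizer` uses when no `pattern` is passed.

```python
import re

def regex_tokenize(text, pattern=r"\s+", min_token_length=2, to_lowercase=True):
    """Split text on the gap pattern and keep tokens that are long enough."""
    if to_lowercase:
        text = text.lower()
    return [tok for tok in re.split(pattern, text) if len(tok) >= min_token_length]

print(regex_tokenize("m r and mrs dursley"))
print(regex_tokenize("do i look stupid"))
```

The minimum length of 2 explains why single letters such as `m`, `r`, and `i` disappear from the `tokens` column.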

### Shingling

In [8]:
ngram = NGram(n=3, inputCol="tokens", outputCol="ngrams")
ngramData = ngram.transform(tokenData)

rememptylines = RemoveEmptyLines(column = "ngrams")
ngramData = rememptylines.transform(ngramData)

ngramData.show(truncate=False)

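
Shingling with `n=3` slides a window of three tokens over each line. A minimal sketch, using a hypothetical `word_ngrams` helper, shows both the normal case and why short lines need the `RemoveEmptyLines` step: fewer than three tokens yield an empty list.

```python
def word_ngrams(tokens, n=3):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams(["mr", "dursley", "was", "the"]))
print(word_ngrams(["no", "said"]))
```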

### Counting Hash

In [9]:
hash_tf = HashingTF(inputCol="ngrams", outputCol="vectors")

hashtfData = hash_tf.transform(ngramData)
hashtfData.show() 

+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|
+--------------------+--------------------+--------------------+--------------------+
|m r and mrs dursl...|[and, mrs, dursle...|[and mrs dursley,...|(262144,[11709,13...|
|mr dursley was th...|[mr, dursley, was...|[mr dursley was, ...|(262144,[11297,13...|
|the dursleys had ...|[the, dursleys, h...|[the dursleys had...|(262144,[3002,346...|
|when mr and mrs d...|[when, mr, and, m...|[when mr and, mr ...|(262144,[5287,659...|
|none of them noti...|[none, of, them, ...|[none of them, of...|(262144,[2188,298...|
|at half past eigh...|[at, half, past, ...|[at half past, ha...|(262144,[7558,450...|
|little tyke chort...|[little, tyke, ch...|[little tyke chor...|(262144,[8466,988...|
|it was on the cor...|[it, was, on, the...|[it was on, was o...|(262144,[1394,192...|
|but on the edge o...|[but, on, the, ed...|[but on the
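
The `vectors` column is a sparse term-frequency vector of size 262144 (2^18): each 3-gram is hashed to an index and its occurrences are counted. The sketch below shows the idea with `zlib.crc32` as a stand-in hash, which is an assumption for illustration only; Spark uses a different hash function, so the indices will not match the output above.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size visible in the output above

def hashing_tf(terms, num_features=NUM_FEATURES):
    """Map each term to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for term in terms:
        # Stand-in hash for this sketch; not the hash Spark applies.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    # Sorted (index -> count) pairs mirror a sparse vector's layout.
    return dict(sorted(counts.items()))

vec = hashing_tf(["mr dursley was", "dursley was the", "mr dursley was"])
print(vec)
```

For the MinHash step that follows, only the set of non-zero indices matters, not the counts themselves.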

### Min-Hashing

In [10]:
minhash = MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=3).fit(hashtfData)

minhashData = minhash.transform(hashtfData)
minhashData.show()  # pass truncate=False to see the full hash values

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|m r and mrs dursl...|[and, mrs, dursle...|[and mrs dursley,...|(262144,[11709,13...|[[2064303.0], [5....|
|mr dursley was th...|[mr, dursley, was...|[mr dursley was, ...|(262144,[11297,13...|[[1.3667454E7], [...|
|the dursleys had ...|[the, dursleys, h...|[the dursleys had...|(262144,[3002,346...|[[2.2917606E7], [...|
|when mr and mrs d...|[when, mr, and, m...|[when mr and, mr ...|(262144,[5287,659...|[[2.657059E7], [1...|
|none of them noti...|[none, of, them, ...|[none of them, of...|(262144,[2188,298...|[[2.01390165E8], ...|
|at half past eigh...|[at, half, past, ...|[at half past, ha...|(262144,[7558,450...|[[1.81252919E8], ...|
|little tyke chort...|[little, tyke, 
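
Each entry in the `lsh` column is the minimum hash value over a row's non-zero indices, one per hash table. The sketch below is a simplified pure-Python version of that idea; the linear hash form, the prime, the seed, and all function names are assumptions for illustration rather than Spark's exact internals.

```python
import random

PRIME = 2038074743  # a large prime, assumed here for illustration

def make_hash_params(num_tables, seed=42):
    """Draw one random (a, b) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_tables)]

def minhash_signature(indices, params):
    """For each (a, b), keep the minimum of ((1 + i) * a + b) % PRIME over the set."""
    return [min(((1 + i) * a + b) % PRIME for i in indices) for a, b in params]

def estimate_similarity(sig_a, sig_b):
    """Fraction of tables where the two minimum hash values agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

params = make_hash_params(num_tables=3)
set_a = {11709, 13250, 40000}
set_b = {11709, 13250, 99999}
sig_a = minhash_signature(set_a, params)
sig_b = minhash_signature(set_b, params)
print(sig_a, sig_b, estimate_similarity(sig_a, sig_b))
```

With only three tables the estimate is coarse; more tables reduce the variance at the cost of more computation. Note that `min` fails on an empty set, which mirrors why rows without any 3-grams are dropped before this stage.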

#### Using Pipeline

In [11]:
pipeline = Pipeline(stages=[
            tokenizer,
            ngram,
            rememptylines,
            hash_tf,
            minhash
        ])

model = pipeline.fit(text_1)

text_A = model.transform(text_1)
text_B = model.transform(text_2)

text_A.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|m r and mrs dursl...|[and, mrs, dursle...|[and mrs dursley,...|(262144,[11709,13...|[[2064303.0], [5....|
|mr dursley was th...|[mr, dursley, was...|[mr dursley was, ...|(262144,[11297,13...|[[1.3667454E7], [...|
|the dursleys had ...|[the, dursleys, h...|[the dursleys had...|(262144,[3002,346...|[[2.2917606E7], [...|
|when mr and mrs d...|[when, mr, and, m...|[when mr and, mr ...|(262144,[5287,659...|[[2.657059E7], [1...|
|none of them noti...|[none, of, them, ...|[none of them, of...|(262144,[2188,298...|[[2.01390165E8], ...|
|at half past eigh...|[at, half, past, ...|[at half past, ha...|(262144,[7558,450...|[[1.81252919E8], ...|
|little tyke chort...|[little, tyke, 

In [12]:
text_B.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|not for the first...|[not, for, the, f...|[not for the, for...|(262144,[2157,114...|[[2.9732996E7], [...|
|third time this w...|[third, time, thi...|[third time this,...|(262144,[45337,50...|[[1.0894894E7], [...|
|harry tried yet a...|[harry, tried, ye...|[harry tried yet,...|(262144,[43840,47...|[[8.4393945E7], [...|
|she’s bored he sa...|[she’s, bored, he...|[she’s bored he, ...|(262144,[2579,287...|[[1.76535334E8], ...|
|do i look stupid ...|[do, look, stupid...|[do look stupid, ...|(262144,[4449,178...|[[1.26008883E8], ...|
|he exchanged dark...|[he, exchanged, d...|[he exchanged dar...|(262144,[29140,73...|[[1.25424955E8], ...|
|harry tried to ar...|[harry, tried, 

### Locality-Sensitive Hashing (LSH)

In [13]:
rows_text_A = text_A.count()
rows_text_B = text_B.count()

# Show similarity with Jaccard Distance below 0.9 
result_A_B = model.stages[-1].approxSimilarityJoin(text_A, text_B, 0.9, distCol="JaccardDistance")
result_A_B.show()

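rows_result_A_B = result_A_B.count()
# Matched pairs per row of Text 2, as a percentage. This counts pairs, not
# distinct rows, so it can exceed 100 when a row matches several others.
simil_index_AB = rows_result_A_B / rows_text_B * 100
print("Similarity Text 1 x Text 2 = ",simil_index_AB, " %")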

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|   JaccardDistance|
+--------------------+--------------------+------------------+
|{all right harry ...|{all right harry ...|0.8571428571428572|
|{how did you know...|{you saved her yo...|0.8888888888888888|
|{what are you tal...|{what are you tal...|0.7647058823529411|
|{the standard boo...|{the standard boo...|               0.0|
|{the what said ha...|{yes said harry a...|0.7142857142857143|
|{bye said harry a...|{yes said harry a...|0.8666666666666667|
|{get out of the w...|{oh get out of th...|              0.85|
|{and ron pulled h...|{you wish said ha...|0.8461538461538461|
|{no said harry, [...|{no said harry ha...|               0.8|
|{are you all righ...|{ron ron are you ...|0.8181818181818181|
|{and ron pulled h...|{riddle slid off ...|             0.875|
|{the what said ha...|{harry and ron ex...|0.8571428571428572|
|{what are you tal...|{what are you tal...|            
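
The `JaccardDistance` column is one minus the ratio of shared elements to total distinct elements. A small worked sketch, with a hypothetical `jaccard_distance` helper and made-up shingle sets, reproduces the kind of values shown above: identical sets give 0.0, and one shared shingle out of seven distinct ones gives 1 - 1/7, roughly 0.857.

```python
def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B| for two non-empty collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1.0 - len(a & b) / len(union)

line_a = {"all right harry", "right harry said", "harry said ron", "said ron quickly"}
line_b = {"all right harry", "right harry i", "harry i am", "i am fine"}

print(jaccard_distance(line_a, line_a))
print(jaccard_distance(line_a, line_b))
```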

In [14]:
# Show similarity with Jaccard Distance below 0.9
result_A_A = model.stages[-1].approxSimilarityJoin(text_A, text_A, 0.9, distCol="JaccardDistance")
result_A_A.show()

simil_index_AA = result_A_A.count() / rows_text_A * 100
print("Similarity Text 1 x Text 1 = ",simil_index_AA, " %")

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|   JaccardDistance|
+--------------------+--------------------+------------------+
|{couldn’t make us...|{couldn’t make us...|               0.0|
|{aunt petunia gav...|{aunt petunia gav...|               0.0|
|{term begins on s...|{term begins on s...|               0.0|
|{he pointed at ha...|{he pointed at ha...|               0.0|
|{i’m not trying t...|{i’m not trying t...|               0.0|
|{harry looked at ...|{harry looked at ...|               0.0|
|{what house are y...|{what house are y...|               0.0|
|{there was only o...|{there was only o...|               0.0|
|{well that’s it a...|{well that’s it a...|               0.0|
|{he’s got lots o’...|{he’s got lots o’...|               0.0|
|{not if i can hel...|{not if i can hel...|               0.0|
|{the next chamber...|{the next chamber...|               0.0|
|{you must come an...|{you must come an...|            
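
The self-join output can be understood with a simplified model of an approximate similarity join: two rows become a candidate pair when they share a hash value in at least one table, and candidates are then filtered by their exact Jaccard distance. The sketch below follows that model in pure Python; every name, the toy rows, and the nested-loop pairing are assumptions for illustration (a real engine would use a hash join rather than comparing all pairs).

```python
import random
from itertools import product

PRIME = 2038074743  # a large prime, assumed here for illustration

def make_hash_params(num_tables, seed):
    """Draw one random (a, b) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_tables)]

def minhash_signature(items, params):
    """Minimum of a linear hash over the set, one value per table."""
    return [min(((1 + i) * a + b) % PRIME for i in items) for a, b in params]

def jaccard_distance(a, b):
    """Exact Jaccard distance between two non-empty sets."""
    return 1.0 - len(a & b) / len(a | b)

def approx_similarity_join(rows_a, rows_b, threshold, params):
    """Return (i, j, distance) for candidate pairs below the threshold."""
    sigs_a = [minhash_signature(row, params) for row in rows_a]
    sigs_b = [minhash_signature(row, params) for row in rows_b]
    pairs = []
    for (i, sig_a), (j, sig_b) in product(enumerate(sigs_a), enumerate(sigs_b)):
        # Candidate only if the rows collide in at least one hash table.
        if any(x == y for x, y in zip(sig_a, sig_b)):
            distance = jaccard_distance(rows_a[i], rows_b[j])
            if distance < threshold:
                pairs.append((i, j, distance))
    return pairs

rows = [{1, 2, 3}, {2, 3, 4}, {10, 11, 12}]
params = make_hash_params(num_tables=3, seed=7)
pairs = approx_similarity_join(rows, rows, threshold=0.9, params=params)
pair_ratio = len(pairs) / len(rows) * 100
print(pairs, pair_ratio)
```

In a self-join every row collides with itself at distance 0.0, so the pair count is at least the row count and the resulting percentage is at least 100. Pairs that never share a hash value are missed even if their true distance is below the threshold, which is the approximate part of the join.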

## Finishing

In [15]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))
# Stop Spark
sc.stop()

--- Execution time: 23.942343711853027 seconds ---
