# Example 02: Text Similarity: LSH

## Locality-Sensitive Hashing (LSH) Algorithms

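LSH for the Euclidean distance metric (BucketedRandomProjectionLSH in Spark) takes dense or sparse vectors as input, each of which represents a point in Euclidean space. The output is a set of hash vectors of configurable dimension. Hash values in the same dimension are calculated by the same hash function. The cells in this notebook use MinHashLSH, which applies the same idea to the Jaccard distance between sets.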
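As an illustration of the Euclidean case described above, here is a minimal pure-Python sketch of a bucketed random projection hash, h(x) = floor((x · v) / bucket_length), with one hash function per output dimension. The function names, the bucket length of 2.0, the four output dimensions, and the sample points are assumptions chosen for the example. This is not Spark's implementation.

```python
import math
import random

def make_projection_hash(dim, bucket_length, seed):
    """Build h(x) = floor((x . v) / bucket_length) for a random unit vector v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]  # normalize so projections preserve distance bounds

    def h(x):
        dot = sum(xi * vi for xi, vi in zip(x, v))
        return math.floor(dot / bucket_length)

    return h

# One hash function per output dimension (hypothetical settings).
hash_functions = [make_projection_hash(dim=3, bucket_length=2.0, seed=s) for s in range(4)]

def signature(x):
    """Hash vector of a point: the same function is used for a given dimension."""
    return [h(x) for h in hash_functions]

near_a = [1.0, 2.0, 3.0]
near_b = [1.1, 2.0, 2.9]    # close to near_a
far_c = [40.0, -25.0, 10.0]  # far from both

print(signature(near_a))
print(signature(near_b))
print(signature(far_c))
```

Because each projection vector has unit length, two points closer than the bucket length can differ by at most one bucket in every dimension, which is why nearby points tend to collide.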

In [1]:
# Start Spark environment
import findspark
findspark.init()

In [2]:
# Load Spark Library
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import *

import time, os, string

from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import RegexTokenizer
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import NGram
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.feature import BucketedRandomProjectionLSH

from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import regexp_replace, trim, col, lower, when, size, lit, avg

In [3]:
# Configuration

cwd = os.getcwd()
book_folder = "/data/"
dubliners = 'file://'+cwd+book_folder+"Dubliners_James_Joyce.txt.gz"
ulysses= 'file://'+cwd+book_folder+"Ulysses_James_Joyce.txt.gz"

In [4]:
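# Functions to remove punctuation and empty lines

def removePunctuation(column):
    # Drop every character that is not whitespace, an ASCII letter, or a digit; then lowercase and trim
    return trim(lower(regexp_replace(column, r'[^\sa-zA-Z0-9]', ''))).alias('text')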

class RemoveEmptyLines(Transformer):
    def __init__(self, column: str):
        super(RemoveEmptyLines, self).__init__()
        self.column = column  # name of the array column to check

    def _transform(self, df: DataFrame) -> DataFrame:
        # Replace empty arrays with null, then drop the rows that contain nulls
        return df.withColumn(self.column, when(size(col(self.column)) == 0, lit(None)).otherwise(col(self.column))).na.drop()
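
To see what the cleaning regex does without starting Spark, the same pattern can be tried with Python's `re` module. This is only a sketch: the helper name `remove_punctuation` is hypothetical, and Spark's `regexp_replace` uses Java regular expressions, although this particular pattern should behave the same way in both.

```python
import re

def remove_punctuation(line):
    """Pure-Python counterpart of the cleaning step: strip, lowercase, trim."""
    cleaned = re.sub(r'[^\sa-zA-Z0-9]', '', line)
    return cleaned.lower().strip()

samples = [
    "with this eBook or online at www.gutenberg.org",
    "re-use it under the terms of the Project Gutenberg License included",
]
for sample in samples:
    print(remove_punctuation(sample))
```

Note that the character class keeps only ASCII letters and digits, so accented letters are removed along with punctuation.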

## Create Spark Session

In [5]:
# Create local Spark session
spark = SparkSession.builder\
        .appName("SparkSimilarityLSH")\
        .master("local[*]") \
        .getOrCreate()

start_time = time.time()

### Reading Data

In [6]:
# Read the book file: Ulysses
text_1 = spark.read.text(ulysses)
text_1.show(10, truncate = False)
text_1 = text_1.filter("value != ''")
text_1.show(10, truncate = False)
text_1 = text_1.select(removePunctuation(col('value')))
text_1 = text_1.withColumnRenamed('value', 'text')  # no-op: removePunctuation already aliases the column to 'text'
text_1.show(10, truncate = False)

+-------------------------------------------------------------------+
|value                                                              |
+-------------------------------------------------------------------+
|                                                                   |
|The Project Gutenberg EBook of Ulysses, by James Joyce             |
|                                                                   |
|This eBook is for the use of anyone anywhere at no cost and with   |
|almost no restrictions whatsoever. You may copy it, give it away or|
|re-use it under the terms of the Project Gutenberg License included|
|with this eBook or online at www.gutenberg.org                     |
|                                                                   |
|                                                                   |
|Title: Ulysses                                                     |
+-------------------------------------------------------------------+
only showing top 10 rows

In [7]:
# Read the book file: Dubliners
text_2 = spark.read.text(dubliners)
text_2 = text_2.filter("value != ''")
text_2 = text_2.select(removePunctuation(col('value')))
text_2 = text_2.withColumnRenamed('value', 'text')  # no-op: removePunctuation already aliases the column to 'text'
text_2.show(10, truncate = False)

+------------------------------------------------------------------+
|text                                                              |
+------------------------------------------------------------------+
|the project gutenberg ebook of dubliners by james joyce           |
|this ebook is for the use of anyone anywhere at no cost and with  |
|almost no restrictions whatsoever you may copy it give it away or |
|reuse it under the terms of the project gutenberg license included|
|with this ebook or online at wwwgutenbergorg                      |
|title dubliners                                                   |
|author james joyce                                                |
|release date september 2001 ebook 2814                            |
|last updated january 20 2019                                      |
|language english                                                  |
+------------------------------------------------------------------+
only showing top 10 rows



## Calculate Similarity

### Tokenizer

In [8]:
tokenizer = RegexTokenizer(pattern=r'\s+', inputCol="text", outputCol="tokens", minTokenLength=3, toLowercase=True)

tokenData = tokenizer.transform(text_1)
tokenData.show(truncate = False)

+------------------------------------------------------------------+----------------------------------------------------------------------+
|text                                                              |tokens                                                                |
+------------------------------------------------------------------+----------------------------------------------------------------------+
|the project gutenberg ebook of ulysses by james joyce             |[the, project, gutenberg, ebook, ulysses, james, joyce]               |
|this ebook is for the use of anyone anywhere at no cost and with  |[this, ebook, for, the, use, anyone, anywhere, cost, and, with]       |
|almost no restrictions whatsoever you may copy it give it away or |[almost, restrictions, whatsoever, you, may, copy, give, away]        |
|reuse it under the terms of the project gutenberg license included|[reuse, under, the, terms, the, project, gutenberg, license, included]|
|with this ebook or 
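
The tokenizer settings above (split on whitespace, lowercase, keep tokens of at least 3 characters) can be approximated in plain Python. This is a rough sketch and the helper name `regex_tokenize` is hypothetical.

```python
import re

def regex_tokenize(text, pattern=r'\s+', min_token_length=3):
    """Split on the pattern, lowercase, and drop tokens shorter than the minimum."""
    tokens = re.split(pattern, text.lower())
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("the project gutenberg ebook of ulysses by james joyce"))
```

Short words such as "of" and "by" disappear, matching the first row of the table above.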

### Shingling

In [9]:
ngram = NGram(n=3, inputCol="tokens", outputCol="ngrams")
ngramData = ngram.transform(tokenData)

rememptylines = RemoveEmptyLines(column = "ngrams")
ngramData = rememptylines.transform(ngramData)

ngramData.show(truncate=False)

+---------------------------------------------------------------------+------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                 |tokens                                                                        |ngrams                                                                                                                                                                             |
+---------------------------------------------------------------------+------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|the pr
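
Shingling with n=3 turns each token list into overlapping word trigrams. A minimal sketch of the idea follows. The helper name `word_ngrams` is hypothetical, and the token lists are taken from lines shown earlier in this notebook.

```python
def word_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams(["the", "project", "gutenberg", "ebook", "ulysses", "james", "joyce"]))
print(word_ngrams(["author", "james", "joyce"]))
print(word_ngrams(["title", "ulysses"]))  # fewer than n tokens: no shingles
```

Lines with fewer than three tokens yield an empty list, which is why the RemoveEmptyLines stage runs right after the n-gram step.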

### Counting Hash

In [10]:
hash_tf = HashingTF(inputCol="ngrams", outputCol="vectors")

hashtfData = hash_tf.transform(ngramData)
hashtfData.show() 

+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|
+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[57299,74...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[17652,18...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[37352,11...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[74024,89...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[61204,17...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[190411],...|
|release date augu...|[release, date, a...|[release date aug...|(262144,[63333,72...|
|last updated octo...|[last, updated, o...|[last updated oct...|(262144,[25383,13...|
|character set enc...|[character, set, ...|[character 
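
The vectors above have 262144 dimensions because each shingle is mapped to an index by a hash function (the hashing trick). The sketch below shows the principle with `zlib.crc32`, which is an assumption made for illustration: it will not reproduce the indices in the table, since Spark uses a different hash function.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=262144):
    """Map each term to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(sorted(counts.items()))  # sparse vector as {index: count}

shingles = ["the project gutenberg", "project gutenberg ebook", "the project gutenberg"]
print(hashing_tf(shingles))
```

For the min-hashing step that follows, only the set of non-zero indices matters, not the counts.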

### Min-Hashing

In [11]:
minhash = MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=3).fit(hashtfData)

minhashData = minhash.transform(hashtfData)
minhashData.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[57299,74...|[[1.80943127E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[17652,18...|[[2.99329989E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[37352,11...|[[2.54070642E8], ...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[74024,89...|[[1.80943127E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[61204,17...|[[8.44159097E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[190411],...|[[1.26469144E8], ...|
|release date augu...|[release, date,
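
Each entry of the lsh column is the minimum hash value over the non-zero indices of a row, computed once per hash table (three here, from numHashTables=3). The sketch below illustrates the idea with a linear hash family of the form ((1 + i) * a + b) mod p. The prime, the seed, and the function names are assumptions for the example and are not meant to reproduce Spark's values.

```python
import random

PRIME = 2038074743  # a large prime, chosen here for illustration

def make_hash_params(num_hashes, seed=42):
    """Draw one (a, b) pair per hash function."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]

def minhash_signature(indices, params):
    """For each hash function, keep the minimum hash over the set's indices."""
    return [min(((1 + i) * a + b) % PRIME for i in indices) for a, b in params]

def estimated_jaccard_similarity(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def exact_jaccard_similarity(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

params = make_hash_params(num_hashes=200)
set_a = set(range(0, 80))
set_b = set(range(20, 100))
estimate = estimated_jaccard_similarity(
    minhash_signature(set_a, params), minhash_signature(set_b, params)
)
print("exact:", exact_jaccard_similarity(set_a, set_b), "estimate:", estimate)
```

Two sets share a minimum exactly when the smallest hash of their union falls in their intersection, so the match rate approximates the Jaccard similarity. More hash functions give a tighter estimate at a higher cost.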

#### Using Pipeline

In [12]:
pipeline = Pipeline(stages=[
            tokenizer,
            ngram,
            rememptylines,
            hash_tf,
            minhash
        ])

model = pipeline.fit(text_1)

text_A = model.transform(text_1)
text_B = model.transform(text_2)

text_A.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[57299,74...|[[1.80943127E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[17652,18...|[[2.99329989E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[37352,11...|[[2.54070642E8], ...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[74024,89...|[[1.80943127E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[61204,17...|[[8.44159097E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[190411],...|[[1.26469144E8], ...|
|release date augu...|[release, date,

In [13]:
text_B.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[74024,11...|[[1.80943127E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[17652,18...|[[2.99329989E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[37352,11...|[[2.54070642E8], ...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[74024,89...|[[1.80943127E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[61204,17...|[[8.44159097E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[190411],...|[[1.26469144E8], ...|
|release date sept...|[release, date,

### Locality-Sensitive Hashing (LSH)

In [14]:
rows_text_A = text_A.count()
rows_text_B = text_B.count()

# Find pairs of lines (Ulysses x Dubliners) with Jaccard distance below 0.9
result_A_B = model.stages[-1].approxSimilarityJoin(text_A, text_B, 0.9, distCol="JaccardDistance")
result_A_B.show()

rows_result_A_B = result_A_B.count()
simil_index_AB = rows_result_A_B / rows_text_B * 100
print("Similarity Ulysses x Dubliners = ",simil_index_AB, " %")

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|   JaccardDistance|
+--------------------+--------------------+------------------+
|[on the chair by ...|[military review ...|0.6666666666666667|
|[mr chairman ladi...|[ladies and gentl...|               0.8|
|[trial to resembl...|[no no i can see ...|             0.875|
|[section 3 inform...|[section 4 inform...|               0.8|
|[ways including c...|[ways including c...|               0.0|
|[progressed the t...|[and did he go ho...|             0.875|
|[the nymph during...|[and make me your...|0.8888888888888888|
|[with this ebook ...|[ebook or online ...|             0.875|
|[badge i was in m...|[im up to all the...|0.8888888888888888|
|[which possibly a...|[nothing politica...|             0.875|
|[and may not be u...|[and may not be u...|               0.0|
|[author james joy...|[author james joy...|               0.0|
|[language of our ...|[every morning in...|            
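
The JaccardDistance column is 1 minus the ratio of shared elements to distinct elements across the two rows. A value of 0.8, for example, is what one shared element out of five distinct ones produces. The sketch below reproduces that arithmetic and the percentage index computed in the cell above. The shingle sets and the counts passed to `similarity_index` are made up for illustration.

```python
def jaccard_distance(set_a, set_b):
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)

def similarity_index(num_pairs, num_rows):
    """Matched pairs as a percentage of the number of rows, as in the cell above."""
    return num_pairs / num_rows * 100

# Hypothetical shingle sets: one shared shingle out of five distinct ones.
line_a = {"mr chairman ladies", "chairman ladies and", "ladies and gentlemen"}
line_b = {"ladies and gentlemen", "and gentlemen may", "gentlemen may the"}

print(jaccard_distance(line_a, line_b))
print(similarity_index(num_pairs=25, num_rows=100))
```

Since a single row can appear in several matched pairs, this index is best read as a rough indicator, and it can exceed 100 %.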

In [15]:
# Find pairs of lines (Ulysses x Ulysses) with Jaccard distance below 0.5
result_A_A = model.stages[-1].approxSimilarityJoin(text_A, text_A, 0.5, distCol="JaccardDistance")
result_A_A.show()

simil_index_AA = result_A_A.count() / rows_text_A * 100
print("Similarity Ulysses x Ulysses = ",simil_index_AA, " %")

+--------------------+--------------------+---------------+
|            datasetA|            datasetB|JaccardDistance|
+--------------------+--------------------+---------------+
|[you can almost t...|[you can almost t...|            0.0|
|[cranlys arm his ...|[cranlys arm his ...|            0.0|
|[their brazen bel...|[their brazen bel...|            0.0|
|[between his fing...|[between his fing...|            0.0|
|[toothache encore...|[toothache encore...|            0.0|
|[what i look like...|[what i look like...|            0.0|
|[in westland row ...|[in westland row ...|            0.0|
|[morning under th...|[morning under th...|            0.0|
|[sly their charac...|[sly their charac...|            0.0|
|[ill take one of ...|[ill take one of ...|            0.0|
|[plastos sir phil...|[plastos sir phil...|            0.0|
|[knows there are ...|[knows there are ...|            0.0|
|[we can do that h...|[we can do that h...|            0.0|
|[excuse me j j om...|[excuse me j j om.

## Finishing

In [16]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))
# Stop Spark
spark.stop()

--- Execution time: 56.270848751068115 seconds ---
