# Text Similarity: LSH

## Locality-Sensitive Hashing (LSH) Algorithms

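This description covers LSH for the Euclidean distance metric. The input is a set of dense or sparse vectors, each representing a point in Euclidean space. The output is one vector of configurable dimension per input, and hash values in the same dimension are calculated by the same hash function. The notebook below applies the same idea to text with `MinHashLSH`, which targets the Jaccard distance between sets rather than the Euclidean distance.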
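
The Euclidean scheme described above can be sketched in plain Python. This is only an illustration, assuming the common bucketed random projection form `h(x) = floor((x . v) / r)`; the names `make_projection_hash`, `bucket_length` and `signature` are hypothetical and are not part of the notebook or of any Spark API.

```python
import math
import random

def make_projection_hash(dim, bucket_length, seed=42):
    """Return h(x) = floor((x . v) / bucket_length) for a random unit vector v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]

    def h(x):
        dot = sum(a * b for a, b in zip(x, v))
        return math.floor(dot / bucket_length)

    return h

# Three hash functions give a 3-dimensional output vector per point.
# Dimension k of every output is always computed by hash function k.
hashes = [make_projection_hash(dim=2, bucket_length=2.0, seed=s) for s in range(3)]

def signature(point):
    return [h(point) for h in hashes]

near_a = signature([1.0, 1.0])
near_b = signature([1.1, 0.9])
far = signature([40.0, -35.0])
print(near_a, near_b, far)
```

Points that are close in Euclidean space project to nearby values, so their bucket indices differ by at most one in each dimension, while distant points tend to land in different buckets.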

In [1]:
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import DataFrame
from pyspark.sql.types import *

import time, os, string

from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import RegexTokenizer
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import NGram
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.feature import BucketedRandomProjectionLSH

from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import regexp_replace, trim, col, lower, when, size, lit, avg

def removePunctuation(column):
    # Keep only letters, digits and whitespace, then lowercase and trim the line.
    return trim(lower(regexp_replace(column, r'[^\sa-zA-Z0-9]', ''))).alias('text')

class RemoveEmptyLines(Transformer):
    """Drop rows whose array column (for example the n-grams) is empty."""
    def __init__(self, column: str):
        super(RemoveEmptyLines, self).__init__()
        self.column = column

    def _transform(self, df: DataFrame) -> DataFrame:
        # Turn empty arrays into nulls, then drop every row that contains a null.
        return df.withColumn(self.column, when(size(col(self.column)) == 0, lit(None)).otherwise(col(self.column))).na.drop()

cwd = os.getcwd()
book_folder = "/data/"
dubliners = 'file://'+cwd+book_folder+"Dubliners_James_Joyce.txt.gz"
ulysses= 'file://'+cwd+book_folder+"Ulysses_James_Joyce.txt.gz"

## Create Spark Session

In [2]:
# Create local Spark session
spark = SparkSession.builder\
        .appName("SparkSimilarityLSH")\
        .master("local[*]") \
        .getOrCreate()

start_time = time.time()

### Reading Data

In [3]:
# Read the book file: Ulysses
text_1 = spark.read.text(ulysses)
text_1.show(10, truncate = False)
text_1 = text_1.filter("value != ''")
text_1.show(10, truncate = False)
text_1 = text_1.select(removePunctuation(col('value')))
text_1 = text_1.withColumnRenamed('value', 'text')
text_1.show(10, truncate = False)

+-------------------------------------------------------------------+
|value                                                              |
+-------------------------------------------------------------------+
|                                                                   |
|The Project Gutenberg EBook of Ulysses, by James Joyce             |
|                                                                   |
|This eBook is for the use of anyone anywhere at no cost and with   |
|almost no restrictions whatsoever. You may copy it, give it away or|
|re-use it under the terms of the Project Gutenberg License included|
|with this eBook or online at www.gutenberg.org                     |
|                                                                   |
|                                                                   |
|Title: Ulysses                                                     |
+-------------------------------------------------------------------+
only showing top 10 rows

In [4]:
# Read the book file: Dubliners
text_2 = spark.read.text(dubliners)
text_2 = text_2.filter("value != ''")
text_2 = text_2.select(removePunctuation(col('value')))
text_2 = text_2.withColumnRenamed('value', 'text')
text_2.show(10, truncate = False)

+------------------------------------------------------------------+
|text                                                              |
+------------------------------------------------------------------+
|the project gutenberg ebook of dubliners by james joyce           |
|this ebook is for the use of anyone anywhere at no cost and with  |
|almost no restrictions whatsoever you may copy it give it away or |
|reuse it under the terms of the project gutenberg license included|
|with this ebook or online at wwwgutenbergorg                      |
|title dubliners                                                   |
|author james joyce                                                |
|release date september 2001 ebook 2814                            |
|last updated january 20 2019                                      |
|language english                                                  |
+------------------------------------------------------------------+
only showing top 10 rows



## Calculate Similarity

### Tokenizer

In [5]:
tokenizer = RegexTokenizer(pattern=r'\s+', inputCol="text", outputCol="tokens", minTokenLength=3, toLowercase=True)
#tokenizer = RegexTokenizer(pattern="", inputCol="text", outputCol="tokens", minTokenLength=1, toLowercase=True)
#tokenizer = RegexTokenizer(pattern="", inputCol="text", outputCol="tokens", minTokenLength=1)

tokenData = tokenizer.transform(text_1)
tokenData.show() #(truncate = False)

+--------------------+--------------------+
|                text|              tokens|
+--------------------+--------------------+
|the project guten...|[the, project, gu...|
|this ebook is for...|[this, ebook, for...|
|almost no restric...|[almost, restrict...|
|reuse it under th...|[reuse, under, th...|
|with this ebook o...|[with, this, eboo...|
|       title ulysses|    [title, ulysses]|
|  author james joyce|[author, james, j...|
|release date augu...|[release, date, a...|
|last updated octo...|[last, updated, o...|
|    language english| [language, english]|
|character set enc...|[character, set, ...|
|start of this pro...|[start, this, pro...|
|produced by col c...|[produced, col, c...|
|               cover|             [cover]|
|             ulysses|           [ulysses]|
|      by james joyce|      [james, joyce]|
|            contents|          [contents]|
|                   i|                  []|
|                   1|                  []|
|                   2|          
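
The tokenizer settings above explain why lines such as `i` or `1` end up with an empty token list: the text is split on whitespace and every token shorter than three characters is discarded. A minimal plain-Python sketch of that behaviour follows; the function `regex_tokenize` is hypothetical and only mimics the parameters used in the cell.

```python
import re

def regex_tokenize(text, pattern=r"\s+", min_token_length=3, to_lowercase=True):
    """Split on the pattern and keep tokens of at least min_token_length characters."""
    if to_lowercase:
        text = text.lower()
    tokens = re.split(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("this ebook is for the use of anyone"))
# ['this', 'ebook', 'for', 'the', 'use', 'anyone']
print(regex_tokenize("i"))
# []
```

Short words such as "is" and "of" disappear, which also shortens the n-grams built in the next step.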

### Shingling

In [6]:
ngram = NGram(n=3, inputCol="tokens", outputCol="ngrams")
ngramData = ngram.transform(tokenData)

rememptylines = RemoveEmptyLines(column = "ngrams")
ngramData = rememptylines.transform(ngramData)

#ngramData = ngramData.withColumn("ngrams", when(size(col("ngrams")) == 0, lit(None)).otherwise(col("ngrams"))).na.drop()
ngramData.show() #(truncate=False)

+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|
+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|
|almost no restric...|[almost, restrict...|[almost restricti...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|
|  author james joyce|[author, james, j...|[author james joyce]|
|release date augu...|[release, date, a...|[release date aug...|
|last updated octo...|[last, updated, o...|[last updated oct...|
|character set enc...|[character, set, ...|[character set en...|
|start of this pro...|[start, this, pro...|[start this proje...|
|produced by col c...|[produced, col, c...|[produced col cho...|
|stately plump buc...|[stately, plump, ...|[stately plump bu...|
|lather on which a...|[la
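
The shingling step can be illustrated without Spark. The sketch below builds word 3-grams the same way the output above suggests, and shows why rows with fewer than three tokens (such as `title ulysses`) produce an empty list and are then removed by `RemoveEmptyLines`. The helper `word_ngrams` is a hypothetical name used only for this illustration.

```python
def word_ngrams(tokens, n=3):
    """Return all runs of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams(["author", "james", "joyce"]))
# ['author james joyce']
print(word_ngrams(["this", "ebook", "for", "the"]))
# ['this ebook for', 'ebook for the']
print(word_ngrams(["title", "ulysses"]))
# []
```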

### Counting Hash

In [7]:
hash_tf = HashingTF(inputCol="ngrams", outputCol="vectors")

hashtfData = hash_tf.transform(ngramData)
hashtfData.show() #(truncate=False)

+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|
+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[11558,11...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[18308,32...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[10891,10...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[52778,86...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[13156,17...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[19040],[...|
|release date augu...|[release, date, a...|[release date aug...|(262144,[111497,1...|
|last updated octo...|[last, updated, o...|[last updated oct...|(262144,[25383,13...|
|character set enc...|[character, set, ...|[character 
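
The `vectors` column is a sparse vector of size 262144 whose non-zero positions are hashed n-grams. The following plain-Python sketch shows the hashing trick under one loud assumption: it uses `zlib.crc32` as a stand-in hash, so the indices will not match the ones Spark produces. The function `hashing_tf` here is a hypothetical helper, not the Spark class.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=262144):
    """Map each term to a bucket index and count how often each bucket is hit."""
    counts = Counter()
    for term in terms:
        # Stand-in hash (assumption): Spark uses a different hash function.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(sorted(counts.items()))

ngrams = ["this ebook for", "ebook for the", "this ebook for"]
sparse = hashing_tf(ngrams)
print(sparse)  # {index: count, ...} with at most two distinct indices
```

For the min-hashing step only the set of non-zero indices matters, so each row is effectively treated as a set of hashed shingles.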

### Min-Hashing

In [8]:
#minhash = MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5).fit(hashtfData)
minhash = MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=3).fit(hashtfData)

minhashData = minhash.transform(hashtfData)
minhashData.show() #(truncate=False)

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[11558,11...|[[1.41920939E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[18308,32...|[[3.06401254E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[10891,10...|[[5.5546337E7], [...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[52778,86...|[[1.41920939E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[13156,17...|[[2.99715787E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[19040],[...|[[1.078648149E9],...|
|release date augu...|[release, date,
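
Each entry of the `lsh` column is the minimum of a hash function over the non-zero indices of the row's vector, with one entry per hash table. The sketch below is a simplified illustration of that idea, assuming random linear hash functions modulo a large prime; the constant, the helper names and the sample index sets are all hypothetical, and the resulting numbers will not match the Spark output.

```python
import random

HASH_PRIME = 2038074743  # large prime modulus (assumption for this sketch)

def make_minhash(num_hash_tables, seed=0):
    """Build a signature function made of num_hash_tables random linear hashes."""
    rng = random.Random(seed)
    coefficients = [
        (rng.randrange(1, HASH_PRIME), rng.randrange(0, HASH_PRIME))
        for _ in range(num_hash_tables)
    ]

    def signature(indices):
        # One value per hash table: the smallest hash over the set's elements.
        return [
            min(((1 + i) * a + b) % HASH_PRIME for i in indices)
            for a, b in coefficients
        ]

    return signature

def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

signature = make_minhash(num_hash_tables=3)
doc_a = {11558, 18308, 52778, 86000}  # hypothetical non-zero vector indices
doc_b = {11558, 18308, 52778, 99999}
print(signature(doc_a), signature(doc_b))
print(estimated_similarity(signature(doc_a), signature(doc_b)))
```

Two sets that share most of their elements are likely to share the element that attains the minimum, so matching signature positions indicate a high Jaccard similarity. With only three hash tables the estimate is coarse.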

#### Using Pipeline

In [9]:
pipeline = Pipeline(stages=[
            tokenizer,
            ngram,
            rememptylines,
            hash_tf,
            minhash
        ])

model= pipeline.fit(text_1)

text_A = model.transform(text_1)
text_B = model.transform(text_2)

text_A.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[11558,11...|[[1.41920939E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[18308,32...|[[3.06401254E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[10891,10...|[[5.5546337E7], [...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[52778,86...|[[1.41920939E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[13156,17...|[[2.99715787E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[19040],[...|[[1.078648149E9],...|
|release date augu...|[release, date,

In [10]:
text_B.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|              tokens|              ngrams|             vectors|                 lsh|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|the project guten...|[the, project, gu...|[the project gute...|(262144,[144429,1...|[[1.41920939E8], ...|
|this ebook is for...|[this, ebook, for...|[this ebook for, ...|(262144,[18308,32...|[[3.06401254E8], ...|
|almost no restric...|[almost, restrict...|[almost restricti...|(262144,[10891,10...|[[5.5546337E7], [...|
|reuse it under th...|[reuse, under, th...|[reuse under the,...|(262144,[52778,86...|[[1.41920939E8], ...|
|with this ebook o...|[with, this, eboo...|[with this ebook,...|(262144,[13156,17...|[[2.99715787E8], ...|
|  author james joyce|[author, james, j...|[author james joyce]|(262144,[19040],[...|[[1.078648149E9],...|
|release date sept...|[release, date,

### Locality-Sensitive Hashing (LSH)

In [11]:
rows_text_A = text_A.count()
rows_text_B = text_B.count()

# Show similarity with Jaccard Distance below 0.9
result_A_B = model.stages[-1].approxSimilarityJoin(text_A, text_B, 0.9, distCol="JaccardDistance")
result_A_B.show()

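rows_result_A_B = result_A_B.count()
# Rough index: matched pairs relative to the number of rows in Dubliners.
# It is not a true percentage, because one row can appear in several pairs.
simil_index_AB = rows_result_A_B / rows_text_B * 100
print("Similarity Ulysses x Dubliners = ", simil_index_AB, " %")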

+--------------------+--------------------+------------------+
|            datasetA|            datasetB|   JaccardDistance|
+--------------------+--------------------+------------------+
|[before cheese i ...|[read the paragra...|0.8823529411764706|
|[performances and...|[performances and...|               0.0|
|[united states an...|[united states an...|               0.0|
|[where did you ge...|[who did you get ...|0.8571428571428572|
|[house of stephen...|[why asked miss i...|0.8571428571428572|
|[or any other wor...|[or any other wor...|               0.0|
|[1e1 the followin...|[1e1 the followin...|               0.0|
|[access to or dis...|[access to or dis...|               0.0|
|[or entity provid...|[or entity provid...|               0.0|
|[production promo...|[distribution of ...|               0.8|
|[for additional c...|[for additional c...|               0.0|
|[including how to...|[generations to l...|               0.8|
|[father conmee tu...|[she turned away ...|0.8823529411
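
The `JaccardDistance` column is one minus the ratio between the size of the intersection and the size of the union of the two rows' shingle sets. The sketch below computes it directly on sets of strings; the two shingle lists are invented for illustration and are not taken from the books. Spark computes the distance on the hashed indices, which should give the same value unless two shingles collide.

```python
def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B| for two non-empty collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1.0 - len(a & b) / len(union)

# Hypothetical shingle sets sharing a single 3-gram out of seven distinct ones.
left = ["where did you", "did you get", "you get that", "get that hat"]
right = ["who did you", "did you get", "you get the", "get the book"]

print(jaccard_distance(left, right))  # 1 - 1/7, about 0.857
print(jaccard_distance(left, left))   # 0.0
```

A distance of 0.0 therefore means the two rows have identical shingle sets, as in the license boilerplate shared by both books.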

In [12]:
# Show similarity with Jaccard Distance below 0.5
result_A_A = model.stages[-1].approxSimilarityJoin(text_A, text_A, 0.5, distCol="JaccardDistance")
result_A_A.show()

# Every row matches itself at distance 0.0, so this index is at least 100.
simil_index_AA = result_A_A.count() / rows_text_A * 100
print("Similarity Ulysses x Ulysses = ", simil_index_AA, " %")

+--------------------+--------------------+---------------+
|            datasetA|            datasetB|JaccardDistance|
+--------------------+--------------------+---------------+
|[well i mean it h...|[well i mean it h...|            0.0|
|[five fathoms out...|[five fathoms out...|            0.0|
|[the pillars prie...|[the pillars prie...|            0.0|
|[me from her door...|[me from her door...|            0.0|
|[his hand accepte...|[his hand accepte...|            0.0|
|[felt its way und...|[felt its way und...|            0.0|
|[if im not there ...|[if im not there ...|            0.0|
|[the chemist turn...|[the chemist turn...|            0.0|
|[wheeling by farr...|[wheeling by farr...|            0.0|
|[and madame twent...|[and madame twent...|            0.0|
|[pawning the furn...|[pawning the furn...|            0.0|
|[with a fluent cr...|[with a fluent cr...|            0.0|
|[the priest close...|[the priest close...|            0.0|
|[the thing else t...|[the thing else t.
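
The join is approximate: a pair is only compared if it collides in at least one hash table. Assuming each table behaves like a single ideal min-hash function (an assumption of this sketch, not something the notebook states), the chance that a pair with Jaccard similarity `s` becomes a candidate with `k` tables is `1 - (1 - s)^k`. The helper `candidate_probability` is a hypothetical name.

```python
def candidate_probability(jaccard_similarity, num_hash_tables):
    """Probability that two sets share a bucket in at least one hash table."""
    return 1.0 - (1.0 - jaccard_similarity) ** num_hash_tables

# Distances used as thresholds in the two joins, plus the exact-match case.
for distance in (0.0, 0.5, 0.9):
    similarity = 1.0 - distance
    print(distance, round(candidate_probability(similarity, 3), 3))
```

Under that assumption, with `numHashTables=3` a pair at distance 0.5 is found about 87.5% of the time, while a pair at distance 0.9 is found only about 27% of the time. The counts reported for the 0.9 threshold are therefore likely to understate the number of matching pairs, and raising `numHashTables` trades run time for recall.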

## Finishing

In [13]:
print("--- Execution time: %s seconds ---" % (time.time() - start_time))
# Stop Spark
spark.stop()

--- Execution time: 73.51522588729858 seconds ---
