## Search Engine Example - 2

#### Step by Step

- Add a plain-text file to Hadoop FS (TXT extension)
  - for example, `alice_in_wonderland.txt`
- Add a stop-words file (TXT extension)
  - [https://gist.github.com/sebleier/554280](https://gist.github.com/sebleier/554280)
- Index the sentences of the text file
  - Read sentence by sentence
  - Remove the stop words
  - Build a structure with the words and their repetition counts
- Turn this structure into a DataFrame
- Create a table based on the DataFrame
- Query it with PySpark SQL
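
As a rough, Spark-free illustration of the indexing steps above, the sketch below builds the "words and counts" structure in plain Python. The sample paragraphs, the tiny `STOP_WORDS` set and the helper name `to_keys` are assumptions made for this example only; they are not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "was", "to", "of", "in", "she", "and"}

def to_keys(paragraph):
    """Lowercase a paragraph, keep alphanumeric words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", paragraph.lower())
    return [w for w in words if w not in STOP_WORDS]

paragraphs = [
    "Alice opened the door and found a small passage.",
    "Alice was late, and the Rabbit was late.",
]

# One (keys, paragraph) row per paragraph, plus overall word counts.
index = [(to_keys(p), p) for p in paragraphs]
counts = Counter(key for keys, _ in index for key in keys)

print(index[1][0])
print(counts.most_common(2))
```

Each `(keys, paragraph)` pair plays the role of one row in the table that the notebook later registers for querying.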

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, trim, split, array_contains, lower
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("search-engine-example-2")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [3]:
%%time
! hadoop fs -put ../datasets/alice_in_wonderland.txt

CPU times: user 159 ms, sys: 52.2 ms, total: 212 ms
Wall time: 4.9 s


In [4]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-03-31 23:14 hdfs://node-master:9000/user/root/.sparkStaging
-rw-r--r--   2 root supergroup     152173 2023-03-31 23:15 hdfs://node-master:9000/user/root/alice_in_wonderland.txt


### Auxiliary Functions

In [5]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading txt file

In [6]:
dataset_path = "hdfs://node-master:9000/user/root/alice_in_wonderland.txt"

In [7]:
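# First attempt: read the text with the CSV reader. It splits each line on
# commas, and with a one-column schema only the first field is kept.
dfft_ = spark.read.format(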
    "csv"
).option(
    "header", "false"
).schema(
    StructType([StructField("full_text", StringType(), True)])
).load(
    dataset_path
)

In [8]:
shape(dfft_)

(2726, 1)

In [9]:
dfft_.head(20)

[Row(full_text="Alice's Adventures in Wonderland"),
 Row(full_text="                ALICE'S ADVENTURES IN WONDERLAND"),
 Row(full_text='                          Lewis Carroll'),
 Row(full_text='               THE MILLENNIUM FULCRUM EDITION 3.0'),
 Row(full_text='                            CHAPTER I'),
 Row(full_text='                      Down the Rabbit-Hole'),
 Row(full_text='  Alice was beginning to get very tired of sitting by her sister'),
 Row(full_text='on the bank'),
 Row(full_text='peeped into the book her sister was reading'),
 Row(full_text='pictures or conversations in it'),
 Row(full_text="thought Alice `without pictures or conversation?'"),
 Row(full_text='  So she was considering in her own mind (as well as she could'),
 Row(full_text='for the hot day made her feel very sleepy and stupid)'),
 Row(full_text='the pleasure of making a daisy-chain would be worth the trouble'),
 Row(full_text='of getting up and picking the daisies'),
 Row(full_text='Rabbit with pink eyes ra

In [10]:
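# Paragraphs are separated by a blank line, so the separator depends on the
# file's line endings (this copy uses Windows-style "\r\n").
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"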

In [11]:
dfft = spark.read.text(
    [dataset_path],
    lineSep=WINDOWS_SEP
)
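
To see why the separator matters, the plain-Python sketch below splits a small made-up string once per physical line and once per blank-line-separated paragraph. The `sample` text is an assumption for illustration, and `str.split` only stands in for the reader's `lineSep` option.

```python
# Made-up sample with Windows line endings and one blank line between paragraphs.
sample = "First line,\r\nsecond line.\r\n\r\nNext paragraph."

lines = sample.split("\r\n")           # one record per physical line
paragraphs = sample.split("\r\n\r\n")  # one record per paragraph

print(len(lines), len(paragraphs))
print(repr(paragraphs[0]))
```

Splitting on the paragraph separator keeps the inner line breaks inside each record, which is why the rows shown after this read still contain `\r\n`.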

In [12]:
shape(dfft)

(842, 1)

In [13]:
dfft.printSchema()

root
 |-- value: string (nullable = true)



In [14]:
dfft.head(20)

[Row(value="Alice's Adventures in Wonderland"),
 Row(value="                ALICE'S ADVENTURES IN WONDERLAND"),
 Row(value='                          Lewis Carroll'),
 Row(value='               THE MILLENNIUM FULCRUM EDITION 3.0'),
 Row(value=''),
 Row(value='\r\n                            CHAPTER I'),
 Row(value='                      Down the Rabbit-Hole'),
 Row(value="\r\n  Alice was beginning to get very tired of sitting by her sister\r\non the bank, and of having nothing to do:  once or twice she had\r\npeeped into the book her sister was reading, but it had no\r\npictures or conversations in it, `and what is the use of a book,'\r\nthought Alice `without pictures or conversation?'"),
 Row(value='  So she was considering in her own mind (as well as she could,\r\nfor the hot day made her feel very sleepy and stupid), whether\r\nthe pleasure of making a daisy-chain would be worth the trouble\r\nof getting up and picking the daisies, when suddenly a White\r\nRabbit with pink eyes ran

### Data Wrangling

In [17]:
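# Replace line breaks and a few punctuation marks with a space.
# A raw string avoids Python's invalid-escape warnings.
dfft_s1 = dfft.withColumn(
    "value_s1", regexp_replace(col("value"), r"[\r\n`';,]", " ")
)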

In [18]:
dfft_s1.collect()[7]["value_s1"]

'    Alice was beginning to get very tired of sitting by her sister  on the bank  and of having nothing to do:  once or twice she had  peeped into the book her sister was reading  but it had no  pictures or conversations in it   and what is the use of a book    thought Alice  without pictures or conversation? '

In [19]:
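# Drop everything except letters, digits and spaces.
# "A-Za-z" is used because the range "A-z" also matches [ \ ] ^ _ and `.
dfft_s2 = dfft_s1.withColumn(
    "value_s2", regexp_replace(col("value_s1"), r"[^A-Za-z0-9 ]", "")
)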
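
When stripping non-alphanumeric characters, the spelling of the letter range matters: in ASCII, a range written `A-z` also covers the six characters that sit between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick). The sketch below shows the difference using Python's `re` module on a made-up string; it is an illustration of the character-class behaviour, not Spark code.

```python
import re

sample = "a_b [c] ^d"

loose = re.sub(r"[^A-z0-9 ]", "", sample)      # A-z spans ASCII 65..122
strict = re.sub(r"[^A-Za-z0-9 ]", "", sample)  # letters, digits, space only

print(repr(loose))
print(repr(strict))
```

With the loose class the underscore, brackets and caret all survive; the strict class removes them.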

In [20]:
dfft_s2.collect()[7]["value_s2"]

'    Alice was beginning to get very tired of sitting by her sister  on the bank  and of having nothing to do  once or twice she had  peeped into the book her sister was reading  but it had no  pictures or conversations in it   and what is the use of a book    thought Alice  without pictures or conversation '

In [21]:
dfft_s3 = dfft_s2.select(
    "value",
    "value_s1",
    "value_s2",
    trim(col("value_s2")).alias("value_s3")
)

In [22]:
dfft_s3.collect()[7]["value_s3"]

'Alice was beginning to get very tired of sitting by her sister  on the bank  and of having nothing to do  once or twice she had  peeped into the book her sister was reading  but it had no  pictures or conversations in it   and what is the use of a book    thought Alice  without pictures or conversation'

In [23]:
dfft_s4 = dfft_s3.withColumn(
    "value_s4", regexp_replace(col("value_s3"), "\s+", " ")
)

In [25]:
dfft_s4.collect()[7]["value_s4"]

'Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversation'

In [26]:
dfftc = dfft_s4.select(col("value_s4").alias("paragraph"))

In [27]:
dfftc.show()

+--------------------+
|           paragraph|
+--------------------+
|Alice s Adventure...|
|ALICE S ADVENTURE...|
|       Lewis Carroll|
|THE MILLENNIUM FU...|
|                    |
|           CHAPTER I|
| Down the RabbitHole|
|Alice was beginni...|
|So she was consid...|
|There was nothing...|
|In another moment...|
|The rabbithole we...|
|Either the well w...|
|Well thought Alic...|
|Down down down Wo...|
|Presently she beg...|
|Down down down Th...|
|Alice was not a b...|
|There were doors ...|
|Suddenly she came...|
+--------------------+
only showing top 20 rows



### Removing Stop Words

In [28]:
dfftc_s1 = dfftc.withColumn("words", split(lower(col("paragraph")), " "))

In [26]:
dfftc_s1.printSchema()

root
 |-- paragraph: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [27]:
dfftc_s1.head(10)

[Row(paragraph='Alice s Adventures in Wonderland', words=['alice', 's', 'adventures', 'in', 'wonderland']),
 Row(paragraph='ALICE S ADVENTURES IN WONDERLAND', words=['alice', 's', 'adventures', 'in', 'wonderland']),
 Row(paragraph='Lewis Carroll', words=['lewis', 'carroll']),
 Row(paragraph='THE MILLENNIUM FULCRUM EDITION 30', words=['the', 'millennium', 'fulcrum', 'edition', '30']),
 Row(paragraph='', words=['']),
 Row(paragraph='CHAPTER I', words=['chapter', 'i']),
 Row(paragraph='Down the RabbitHole', words=['down', 'the', 'rabbithole']),
 Row(paragraph='Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversation', words=['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'h

In [29]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [31]:
dfftc_s2 = remover.transform(dfftc_s1)
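
Conceptually, this transformation filters each word list against a stop-word list. The plain-Python sketch below mimics that on one row; the four-word `STOP_WORDS` set and the function name `remove_stop_words` are hypothetical and far smaller than any real stop-word list.

```python
# Hypothetical mini stop-word list, only for this illustration.
STOP_WORDS = {"s", "in", "the", "i"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep the words that are not in the stop list (case-insensitive)."""
    return [w for w in words if w.lower() not in stop_words]

words = ["alice", "s", "adventures", "in", "wonderland"]
print(remove_stop_words(words))
```

`StopWordsRemover` also accepts a `stopWords` list, which is one way the stop-word file mentioned in the step list could be plugged in instead of the defaults.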

In [32]:
dfftc_s2.head()

Row(paragraph='Alice s Adventures in Wonderland', words=['alice', 's', 'adventures', 'in', 'wonderland'], keys=['alice', 'adventures', 'wonderland'])

In [34]:
dfftc_s2.collect()[7]

Row(paragraph='Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought Alice without pictures or conversation', words=['alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', 'thought', 'alice', 'without', 'pictures', 'or', 'conversation'], keys=['alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', 'use', 'book', 'thought', 'alice', 'without', 'pictures', '

### Registering search table

In [41]:
dff = dfftc_s2.select(
    "keys", "paragraph"
)

In [42]:
dff.show()

+--------------------+--------------------+
|                keys|           paragraph|
+--------------------+--------------------+
|[alice, adventure...|Alice s Adventure...|
|[alice, adventure...|ALICE S ADVENTURE...|
|    [lewis, carroll]|       Lewis Carroll|
|[millennium, fulc...|THE MILLENNIUM FU...|
|                  []|                    |
|           [chapter]|           CHAPTER I|
|        [rabbithole]| Down the RabbitHole|
|[alice, beginning...|Alice was beginni...|
|[considering, min...|So she was consid...|
|[nothing, remarka...|There was nothing...|
|[another, moment,...|In another moment...|
|[rabbithole, went...|The rabbithole we...|
|[either, well, de...|Either the well w...|
|[well, thought, a...|Well thought Alic...|
|[fall, never, com...|Down down down Wo...|
|[presently, began...|Presently she beg...|
|[nothing, else, a...|Down down down Th...|
|[alice, bit, hurt...|Alice was not a b...|
|[doors, round, ha...|There were doors ...|
|[suddenly, came, ...|Suddenly s

In [43]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement.
dff.createOrReplaceTempView("finder")

In [61]:
kws = "door"

In [62]:
result = spark.sql(f"""
select *
from finder
where array_contains(keys, lower("{kws}"))
""")
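
Note that `array_contains` looks for an element equal to the search term, so a paragraph whose keys only include `doors` is not returned for `door`. The plain-Python sketch below mimics that lookup on two made-up rows; `find` is an illustrative helper, not a Spark function.

```python
rows = [
    (["doors", "round", "hall"], "There were doors all round the hall"),
    (["alice", "opened", "door"], "Alice opened the door"),
]

def find(rows, keyword):
    """Mimic `array_contains(keys, lower(keyword))` with list membership."""
    kw = keyword.lower()
    return [paragraph for keys, paragraph in rows if kw in keys]

print(find(rows, "Door"))
```

Matching word variants would need an extra normalisation step (stemming, for instance), which the notebook does not attempt.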

In [63]:
result.count()

27

In [64]:
result.show(10)

+--------------------+--------------------+
|                keys|           paragraph|
+--------------------+--------------------+
|[doors, round, ha...|There were doors ...|
|[suddenly, came, ...|Suddenly she came...|
|[alice, opened, d...|Alice opened the ...|
|[seemed, use, wai...|There seemed to b...|
|[indeed, ten, inc...|And so it was ind...|
|[finding, nothing...|After a while fin...|
|[soon, eye, fell,...|Soon her eye fell...|
|[head, struck, ro...|Just then her hea...|
|[narrow, escape, ...|That WAS a narrow...|
|[white, rabbit, t...|It was the White ...|
+--------------------+--------------------+
only showing top 10 rows



### Calculating score

In [91]:
multi_kws = "alice rabbit door"

#### Jaccard Similarity

In [92]:
score = spark.sql(f"""
select *, size(intersection_)/size(union_) score
from (
    select *,
           array_intersect(keys, split(lower("{multi_kws}"), " ")) intersection_,
           array_union(keys, split(lower("{multi_kws}"), " ")) union_
    from finder
    where size(array_intersect(keys, split(lower("{multi_kws}"), " "))) > 0
) tmp
order by score desc
limit 10
""")
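
The score in this query is the Jaccard similarity: the number of distinct terms shared by the paragraph keys and the search terms, divided by the number of distinct terms in either. A minimal plain-Python sketch of the same arithmetic follows; `jaccard` is an illustrative name, and Python sets stand in for `array_intersect` and `array_union`.

```python
def jaccard(keys, query):
    """Shared distinct terms divided by all distinct terms."""
    a, b = set(keys), set(query)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query = "alice rabbit door".split(" ")

# One shared term (alice) out of four distinct terms overall.
print(jaccard(["said", "alice"], query))
```

Because the union grows with the paragraph length, short paragraphs that contain a single search term can outrank long ones that contain the same term.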

- one word: 264 ms
- four words: 326 ms
- ten words: 314 ms

In [93]:
%%time
score.show()

+--------------------+--------------------+-------------+--------------------+------------------+
|                keys|           paragraph|intersection_|              union_|             score|
+--------------------+--------------------+-------------+--------------------+------------------+
|[alice, went, tim...|Alice went timidl...|[alice, door]|[alice, went, tim...|0.3333333333333333|
|       [said, alice]|But what am I to ...|      [alice]|[said, alice, rab...|              0.25|
|     [alice, silent]|    Alice was silent|      [alice]|[alice, silent, r...|              0.25|
|       [said, alice]| What for said Alice|      [alice]|[said, alice, rab...|              0.25|
|   [inquired, alice]|What was that inq...|      [alice]|[inquired, alice,...|              0.25|
|   [alice, evidence]|    Alice s Evidence|      [alice]|[alice, evidence,...|              0.25|
|[said, alice, duc...|Very said Alice w...|      [alice]|[said, alice, duc...|               0.2|
|[alice, adventure..

#### Cosine Similarity

In [43]:
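def cossim(v1, v2):
    # One possible completion, assuming v1 and v2 are plain Python lists of
    # keywords treated as binary term vectors: shared / sqrt(|v1| * |v2|).
    s1, s2 = set(v1), set(v2)
    return len(s1 & s2) / (len(s1) * len(s2)) ** 0.5 if s1 and s2 else 0.0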
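
As a sanity check on the idea, the plain-Python sketch below spells out cosine similarity for two keyword lists viewed as 0/1 vectors over their combined vocabulary. It assumes binary weights (a term is either present or absent); `cosine_binary` is an illustrative name and not a Spark function.

```python
import math

def cosine_binary(keys, query):
    """Cosine similarity of two term lists viewed as 0/1 vectors."""
    vocab = sorted(set(keys) | set(query))
    v1 = [1 if t in keys else 0 for t in vocab]
    v2 = [1 if t in query else 0 for t in vocab]
    dot = math.fsum(x * y for x, y in zip(v1, v2))
    # For 0/1 entries the squared norm equals the number of ones.
    norm = math.sqrt(math.fsum(v1)) * math.sqrt(math.fsum(v2))
    return dot / norm if norm else 0.0

query = ["alice", "rabbit", "door"]

# One shared term; the vectors have two and three ones: 1 / sqrt(2 * 3).
print(cosine_binary(["said", "alice"], query))
```

Compared with the Jaccard score, the denominator here grows with the square root of the list sizes, so long paragraphs are penalised less heavily.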
    

links:
- [spark-sql-array-funcs](https://kontext.tech/article/587/spark-sql-array-functions)
- [spark-map-syntax](https://sparkbyexamples.com/pyspark/pyspark-map-transformation/)