## Search Engine Example

#### Step by Step

- Add a plain-text file to the Hadoop FS (TXT extension)
  - for example, the file `alice_in_wonderland.txt`
- Add a stop-words file (TXT extension)
  - [https://gist.github.com/sebleier/554280](https://gist.github.com/sebleier/554280)
- Index the sentences of the text file
  - Read the text sentence by sentence
  - Remove the stop words
  - Build a structure holding each word and its number of occurrences
- Turn that structure into a DataFrame
- Create a table based on the DataFrame
- Query it with PySpark SQL
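
As a rough orientation, the indexing step in the list above can be sketched in plain Python before any Spark is involved. This is only an illustration under assumptions: `sample_text`, `sample_stop_words`, and `count_words` are hypothetical names, the text is split on blank lines rather than on true sentence boundaries, and punctuation is stripped from each token.

```python
from collections import Counter

# Hypothetical stand-ins for the text file and the stop-words file.
sample_text = "Alice was tired.\n\nThe Rabbit ran. Alice ran after the Rabbit."
sample_stop_words = {"was", "the", "after"}


def count_words(text, stop_words):
    """Count non-stop-word tokens across blank-line-separated paragraphs."""
    counts = Counter()
    for paragraph in text.split("\n\n"):
        for token in paragraph.split():
            # Trim surrounding punctuation and lowercase before the lookup.
            word = token.strip(".,;:!?'`\"()").lower()
            if word and word not in stop_words:
                counts[word] += 1
    return counts


word_counts = count_words(sample_text, sample_stop_words)
# (word, count) tuples: a shape that can later be turned into DataFrame rows.
rows = sorted(word_counts.items())
print(rows)
```

The sorted `(word, count)` pairs are one possible form for the "words and number of occurrences" structure that the later steps turn into a DataFrame.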

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, row_number, max
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("search-engine-example")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [3]:
def shape(df):
    """Return (row count, column count), similar to pandas' DataFrame.shape."""
    return (df.count(), len(df.columns))

In [4]:
%%time
! hadoop fs -put ../datasets/alice_in_wonderland.txt

CPU times: user 309 ms, sys: 152 ms, total: 462 ms
Wall time: 9.51 s


In [5]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-03-24 23:14 hdfs://node-master:9000/user/root/.sparkStaging
-rw-r--r--   2 root supergroup     152173 2023-03-24 23:15 hdfs://node-master:9000/user/root/alice_in_wonderland.txt


In [7]:
with open("../datasets/alice_in_wonderland.txt") as fp:
    full_text = fp.read()

In [59]:
paragraphs = full_text.split("\n\n")

In [60]:
paragraphs[7]

"\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'"

In [61]:
paragraphs = [p.replace("\n", " ") for p in paragraphs]

In [62]:
type(paragraphs)

list

In [63]:
paragraphs[7]

"   Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do:  once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'"
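
The paragraph above still carries leading spaces and double spaces after the newline replacement. A possible alternative, shown here only as a sketch, is to collapse every run of whitespace while splitting. The helper name `normalize_paragraphs` and the `sample` string are hypothetical. Unlike the notebook cells, this version also drops empty paragraphs, so its counts would differ from the ones shown later.

```python
def normalize_paragraphs(text):
    """Split on blank lines and collapse whitespace runs inside each paragraph."""
    # str.split() with no argument splits on any whitespace and drops empties,
    # so joining with a single space removes newlines and repeated spaces.
    cleaned = (" ".join(p.split()) for p in text.split("\n\n"))
    return [p for p in cleaned if p]


sample = "Title\n\n   Alice was beginning\nto get very tired\n\n\n\nThe end"
result = normalize_paragraphs(sample)
print(result)
```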

In [64]:
dff = spark.createDataFrame(paragraphs, StringType())

In [66]:
dff.show()

+--------------------+
|               value|
+--------------------+
|Alice's Adventure...|
|                A...|
|                 ...|
|               TH...|
|                    |
|                 ...|
|                 ...|
|   Alice was begi...|
|  So she was cons...|
|  There was nothi...|
|  In another mome...|
|  The rabbit-hole...|
|  Either the well...|
|  `Well!' thought...|
|  Down, down, dow...|
|  Presently she b...|
|  Down, down, dow...|
|  Alice was not a...|
|  There were door...|
|  Suddenly she ca...|
+--------------------+
only showing top 20 rows



In [68]:
dff.count()

842

In [69]:
with open("../datasets/stop-words-en.txt") as fp:
    stop_words = fp.read()

In [71]:
stop_words = stop_words.split("\n")

In [86]:
words_per_paragraph = [p.split(" ") for p in paragraphs]

In [97]:
words_per_paragraph

[["Alice's", 'Adventures', 'in', 'Wonderland'],
 ['',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  "ALICE'S",
  'ADVENTURES',
  'IN',
  'WONDERLAND'],
 ['',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'Lewis',
  'Carroll'],
 ['',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'THE',
  'MILLENNIUM',
  'FULCRUM',
  'EDITION',
  '3.0'],
 [''],
 ['',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'CHAPTER',
  'I'],
 ['',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  '',
  'Down',
  'the',
  'Rabbit-Hole'],
 ['',
  '',
  '',
  'Alice',
  'was',
  'beginning',
  'to',
  'get',
  'very',
  'tired',
  'of',


In [93]:
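# Drop empty tokens and stop words; skip paragraphs left with no words.
filtered = []
for wp in words_per_paragraph:
    p = []
    for w in wp:
        # split(" ") yields "" for every repeated space, so discard those too
        if w != "" and w.lower() not in stop_words:
            p.append(w)
    if len(p) > 0:
        filtered.append(p)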
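
The loop above keeps tokens exactly as they appear in the text, so punctuation stays attached (for example `'bank,'` or `` '`and' ``) and a stop word with punctuation attached is not recognized. A hedged variant, assuming it is acceptable to lowercase tokens and strip surrounding punctuation, might look like the sketch below. `filter_words`, `sample_words`, and `sample_stop` are hypothetical names, and the stop words are held in a set so each lookup does not scan a list.

```python
import string


def filter_words(paragraph_words, stop_words):
    """Lowercase tokens, strip surrounding punctuation, and drop stop words."""
    stop_set = {w.lower() for w in stop_words}
    result = []
    for words in paragraph_words:
        kept = []
        for w in words:
            # string.punctuation covers , . : ; ' ` ( ) and similar characters.
            token = w.strip(string.punctuation).lower()
            if token and token not in stop_set:
                kept.append(token)
        if kept:  # skip paragraphs with nothing left
            result.append(kept)
    return result


sample_words = [
    ["", "", "Alice", "was", "on", "the", "bank,"],
    [""],
    ["`and", "the", "book,'"],
]
sample_stop = ["was", "on", "the", "and"]
filtered_sample = filter_words(sample_words, sample_stop)
print(filtered_sample)
```

Only leading and trailing punctuation is removed, so tokens such as `Alice's` or `Rabbit-Hole` keep their inner apostrophe or hyphen.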

In [94]:
filtered

[["Alice's", 'Adventures', 'Wonderland'],
 ["ALICE'S", 'ADVENTURES', 'WONDERLAND'],
 ['Lewis', 'Carroll'],
 ['MILLENNIUM', 'FULCRUM', 'EDITION', '3.0'],
 ['CHAPTER'],
 ['Rabbit-Hole'],
 ['Alice',
  'beginning',
  'get',
  'tired',
  'sitting',
  'sister',
  'bank,',
  'nothing',
  'do:',
  'twice',
  'peeped',
  'book',
  'sister',
  'reading,',
  'pictures',
  'conversations',
  'it,',
  '`and',
  'use',
  "book,'",
  'thought',
  'Alice',
  '`without',
  'pictures',
  "conversation?'"],
 ['considering',
  'mind',
  '(as',
  'well',
  'could,',
  'hot',
  'day',
  'made',
  'feel',
  'sleepy',
  'stupid),',
  'whether',
  'pleasure',
  'making',
  'daisy-chain',
  'would',
  'worth',
  'trouble',
  'getting',
  'picking',
  'daisies,',
  'suddenly',
  'White',
  'Rabbit',
  'pink',
  'eyes',
  'ran',
  'close',
  'her.'],
 ['nothing',
  'remarkable',
  'that;',
  'Alice',
  'think',
  'much',
  'way',
  'hear',
  'Rabbit',
  'say',
  'itself,',
  '`Oh',
  'dear!',
  'Oh',
  'dear!',
 

In [95]:
dff = spark.createDataFrame([" ".join(words) for words in filtered], StringType())

In [96]:
dff.show()

+--------------------+
|               value|
+--------------------+
|Alice's Adventure...|
|ALICE'S ADVENTURE...|
|       Lewis Carroll|
|MILLENNIUM FULCRU...|
|             CHAPTER|
|         Rabbit-Hole|
|Alice beginning g...|
|considering mind ...|
|nothing remarkabl...|
|another moment we...|
|rabbit-hole went ...|
|Either well deep,...|
|`Well!' thought A...|
|Down, down, down....|
|Presently began a...|
|Down, down, down....|
|Alice bit hurt, j...|
|doors round hall,...|
|Suddenly came upo...|
|Alice opened door...|
+--------------------+
only showing top 20 rows



In [98]:
dff.count()

829
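
With the filtered paragraphs available, the "words and number of occurrences" structure from the step list could take the form of one row per distinct word per paragraph. The following is a minimal sketch in plain Python, not the notebook's own implementation. `build_index_rows`, `search`, and `sample_filtered` are hypothetical names, and the paragraph id is simply the position in the list.

```python
from collections import Counter


def build_index_rows(filtered_paragraphs):
    """Return (word, paragraph_id, count) rows, one per distinct word per paragraph."""
    rows = []
    for paragraph_id, words in enumerate(filtered_paragraphs):
        for word, count in Counter(w.lower() for w in words).items():
            rows.append((word, paragraph_id, count))
    return rows


def search(rows, term):
    """Return (paragraph_id, count) pairs for `term`, most frequent first."""
    hits = [(pid, n) for word, pid, n in rows if word == term.lower()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


sample_filtered = [
    ["Alice", "beginning", "sister", "book", "sister"],
    ["White", "Rabbit", "pink", "eyes"],
    ["Alice", "Rabbit", "Rabbit", "Rabbit"],
]
index_rows = build_index_rows(sample_filtered)
print(search(index_rows, "rabbit"))
```

Rows of this shape could then be passed to `spark.createDataFrame` with a matching schema and registered as a table, so that the same lookup is expressed as a SQL query.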