## Search Engine Example - 2

#### Step by Step

- Add a plain-text file to the Hadoop FS (TXT extension)
  - it can be the file `alice_in_wonderland.txt`
- Add a stop-words file (TXT extension)
  - [https://gist.github.com/sebleier/554280](https://gist.github.com/sebleier/554280)
- Index the sentences of the text file
  - Read sentence by sentence
  - Remove the stop words
  - Build a structure with the words and their occurrence counts
- Turn that structure into a DataFrame
- Create a table based on the DataFrame
- Query via PySpark SQL
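
The indexing steps above can be sketched in plain Python before moving to Spark. This is only a minimal illustration, not the notebook's implementation: the sample text and the tiny `STOP_WORDS` set are made up for the example, and `index_sentences` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the real one would come from the TXT file.
STOP_WORDS = {"the", "was", "to", "of", "a", "by", "her"}

def index_sentences(text):
    """Count non-stop-word occurrences, reading the text sentence by sentence."""
    counts = Counter()
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

sample = "Alice was tired. Alice ran to the rabbit. The rabbit ran!"
index = index_sentences(sample)
print(index.most_common(3))
```

A `Counter` maps each word to its number of repetitions, which is the shape that the later DataFrame step would need (one row per word, one column for the count).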

In [150]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, row_number, max, regexp_replace, trim, split
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("search-engine-example-2")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [3]:
%%time
! hadoop fs -put ../datasets/alice_in_wonderland.txt

CPU times: user 62.9 ms, sys: 49.1 ms, total: 112 ms
Wall time: 3.33 s


In [4]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-03-27 20:12 hdfs://node-master:9000/user/root/.sparkStaging
-rw-r--r--   2 root supergroup     148574 2023-03-27 20:13 hdfs://node-master:9000/user/root/alice_in_wonderland.txt


### Auxiliary Functions

In [5]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading txt file

In [6]:
dataset_path = "hdfs://node-master:9000/user/root/alice_in_wonderland.txt"

In [115]:
dfft_ = spark.read.format(
    "csv"
).option(
    "header", "false"
).schema(
    StructType([StructField("full_text", StringType(), True)])
).load(
    dataset_path
)

In [116]:
shape(dfft_)

(2726, 1)

In [117]:
dfft_.head(20)

[Row(full_text="Alice's Adventures in Wonderland"),
 Row(full_text="                ALICE'S ADVENTURES IN WONDERLAND"),
 Row(full_text='                          Lewis Carroll'),
 Row(full_text='               THE MILLENNIUM FULCRUM EDITION 3.0'),
 Row(full_text='                            CHAPTER I'),
 Row(full_text='                      Down the Rabbit-Hole'),
 Row(full_text='  Alice was beginning to get very tired of sitting by her sister'),
 Row(full_text='on the bank'),
 Row(full_text='peeped into the book her sister was reading'),
 Row(full_text='pictures or conversations in it'),
 Row(full_text="thought Alice `without pictures or conversation?'"),
 Row(full_text='  So she was considering in her own mind (as well as she could'),
 Row(full_text='for the hot day made her feel very sleepy and stupid)'),
 Row(full_text='the pleasure of making a daisy-chain would be worth the trouble'),
 Row(full_text='of getting up and picking the daisies'),
 Row(full_text='Rabbit with pink eyes ra
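
The shortened rows above (for example `on the bank`) are consistent with the CSV reader treating every comma as a field delimiter and keeping only the first field for the single-column schema. The effect can be reproduced with Python's standard `csv` module; this is an analogy for illustration only, not Spark's reader.

```python
import csv

line = "on the bank, and of having nothing to do:  once or twice she had"

# A CSV parser splits the line at the comma into two fields.
fields = next(csv.reader([line]))
first_field = fields[0]
print(len(fields), repr(first_field))
```

This suggests why a plain-text reader is a better fit for prose than the `csv` format.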

In [124]:
dfft = spark.read.text(
    [dataset_path],
    lineSep="\n\n"
)

In [125]:
shape(dfft)

(842, 1)

In [126]:
dfft.printSchema()

root
 |-- value: string (nullable = true)



In [127]:
dfft.head(20)

[Row(value="Alice's Adventures in Wonderland"),
 Row(value="                ALICE'S ADVENTURES IN WONDERLAND"),
 Row(value='                          Lewis Carroll'),
 Row(value='               THE MILLENNIUM FULCRUM EDITION 3.0'),
 Row(value=''),
 Row(value='\n                            CHAPTER I'),
 Row(value='                      Down the Rabbit-Hole'),
 Row(value="\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'"),
 Row(value='  So she was considering in her own mind (as well as she could,\nfor the hot day made her feel very sleepy and stupid), whether\nthe pleasure of making a daisy-chain would be worth the trouble\nof getting up and picking the daisies, when suddenly a White\nRabbit with pink eyes ran close by her.'),
 R
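
Splitting on a blank line (`"\n\n"`) is what groups lines into paragraphs, and it also seems to explain the empty row and the leading `\n` in the output above: an odd run of newlines leaves one behind. A plain-Python sketch with a made-up excerpt, using `str.split` as a stand-in for the reader's `lineSep` option:

```python
# Made-up excerpt: five newlines after the title, one blank line between paragraphs.
text = "TITLE\n\n\n\n\nCHAPTER I\n\nfirst line\nsecond line"

# str.split is used here as a stand-in for lineSep="\n\n".
paragraphs = text.split("\n\n")
for p in paragraphs:
    print(repr(p))
```

The single newlines that remain inside a paragraph are what the first cleaning step in the next section replaces with spaces.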

### Data Wrangling

In [128]:
# Replace line breaks, backticks, apostrophes, semicolons and commas with a space
dfft_s1 = dfft.withColumn(
    "value_s1", regexp_replace(col("value"), r"[\n`';,]", " ")
)

In [130]:
dfft_s1.collect()[7]

Row(value="\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'", value_s1='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do:  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation? ')

In [133]:
# Keep only letters, digits and spaces (A-z would also match [ \ ] ^ _ and `)
dfft_s2 = dfft_s1.withColumn(
    "value_s2", regexp_replace(col("value_s1"), "[^A-Za-z0-9 ]", "")
)

In [134]:
dfft_s2.collect()[7]

Row(value="\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'", value_s1='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do:  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation? ', value_s2='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation ')

In [138]:
dfft_s3 = dfft_s2.select(
    "value",
    "value_s1",
    "value_s2",
    trim(col("value_s2")).alias("value_s3")
)

In [139]:
dfft_s3.collect()[7]

Row(value="\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'", value_s1='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do:  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation? ', value_s2='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation ', value_s3='Alice was beginning to get very 

In [140]:
# Collapse runs of whitespace into a single space
dfft_s4 = dfft_s3.withColumn(
    "value_s4", regexp_replace(col("value_s3"), r"\s+", " ")
)

In [141]:
dfft_s4.collect()[7]

Row(value="\n  Alice was beginning to get very tired of sitting by her sister\non the bank, and of having nothing to do:  once or twice she had\npeeped into the book her sister was reading, but it had no\npictures or conversations in it, `and what is the use of a book,'\nthought Alice `without pictures or conversation?'", value_s1='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do:  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation? ', value_s2='   Alice was beginning to get very tired of sitting by her sister on the bank  and of having nothing to do  once or twice she had peeped into the book her sister was reading  but it had no pictures or conversations in it   and what is the use of a book   thought Alice  without pictures or conversation ', value_s3='Alice was beginning to get very 

In [144]:
dfftc = dfft_s4.withColumn("paragraph", col("value_s4")).select("paragraph")

In [146]:
dfftc.show()

+--------------------+
|           paragraph|
+--------------------+
|Alice s Adventure...|
|ALICE S ADVENTURE...|
|       Lewis Carroll|
|THE MILLENNIUM FU...|
|                    |
|           CHAPTER I|
| Down the RabbitHole|
|Alice was beginni...|
|So she was consid...|
|There was nothing...|
|In another moment...|
|The rabbithole we...|
|Either the well w...|
|Well thought Alic...|
|Down down down Wo...|
|Presently she beg...|
|Down down down Th...|
|Alice was not a b...|
|There were doors ...|
|Suddenly she came...|
+--------------------+
only showing top 20 rows
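
The four cleaning steps can be checked locally with Python's `re` module on a short string. This sketch assumes these simple character classes behave the same in Python as in Spark's Java regex engine, and it uses `A-Za-z` rather than `A-z`, since the `A-z` range also covers `[`, `\`, `]`, `^`, `_` and the backtick. `clean_paragraph` is a hypothetical helper name, and the input string is made up.

```python
import re

def clean_paragraph(raw):
    """Mirror the cleaning steps: separators to spaces, drop symbols, trim, collapse."""
    s1 = re.sub(r"[\n`';,]", " ", raw)      # line breaks, quotes, ; and , become spaces
    s2 = re.sub(r"[^A-Za-z0-9 ]", "", s1)   # drop everything except letters, digits, spaces
    s3 = s2.strip(" ")                      # remove leading and trailing spaces
    return re.sub(r"\s+", " ", s3)          # collapse runs of whitespace

raw = "\n  Alice's sister,\n`and what is the use of a book?'"
cleaned = clean_paragraph(raw)
print(cleaned)

# The A-z range is wider than it looks: it matches '_', '^' and '[' as well as letters.
wide_range_hits = re.findall(r"[A-z]", "a_Z^[1")
print(wide_range_hits)
```

Note that replacing the apostrophe with a space turns `Alice's` into `Alice s`, which matches the `Alice s Adventure...` row shown in the table above.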



### Removing Stop Words

In [152]:
# split() lives in pyspark.sql.functions (Column has no .split method)
dfftc_s1 = dfftc.withColumn("words", split(col("paragraph"), " "))

In [151]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [149]:
# StopWordsRemover needs an array<string> input column, so pass the tokenized DataFrame
remover.transform(dfftc_s1).show(truncate=False)