## Block Project: Data Engineering: Big Data
### TP5

- Ingestion
    - Select the 30 books in Plain Text UTF-8 format
    - Add the files to HDFS
- Cleaning and Normalization
    - Import the data into Spark DataFrames
    - Remove disposable characters
    - Remove disposable lines
    - Remove stop words
    - Apply lemmatization
    - Use complementary techniques, if deemed necessary
    - Unified DataFrame with information on the 30 selected books, with the following columns:
        - Book Name
        - Book Language
        - Paragraph Number (starting at index #1)
        - Original Paragraph (before cleaning and normalization)
        - Paragraph Word Set (after cleaning and normalization)
        - Other columns, if deemed necessary
- Analysis and Time Measurement
- Answer the following questions:
    - Number of unique words used per book
    - Number of paragraphs and number of non-unique words per paragraph per book
    - Identify the most frequent and the least frequent word per paragraph per book
    - For the books in English, select the top 10 most frequent words. Do the same for the books in Portuguese
    - Build two line charts, one for the books in English and one for the books in Portuguese, with the paragraph index on the X axis, the number of unique words on the Y axis, and one line per book
    - Build two histograms, one for the books in English and one for the books in Portuguese, to analyze the word frequency of the books
- Show the time spent processing each of the answers above
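
The per-book and per-paragraph questions above reduce to counting words over rows shaped like the unified DataFrame. A minimal sketch in plain Python, assuming hypothetical in-memory rows (`book`, `paragraph_no`, `words`) instead of a Spark DataFrame, with `time.perf_counter` standing in for the time measurement:

```python
import time
from collections import Counter

# Hypothetical rows mimicking the unified DataFrame: (book, paragraph_no, words)
rows = [
    ("Book A", 1, ["rabbit", "door", "rabbit"]),
    ("Book A", 2, ["door", "garden"]),
    ("Book B", 1, ["convict", "island", "convict", "prison"]),
]

def unique_words_per_book(rows):
    """Number of distinct words used in each book."""
    vocab = {}
    for book, _, words in rows:
        vocab.setdefault(book, set()).update(words)
    return {book: len(words) for book, words in vocab.items()}

def most_and_least_frequent(words):
    """Most and least frequent word of one paragraph.

    On ties, the first-seen word wins for "most" and the last-seen for "least".
    """
    ranked = Counter(words).most_common()
    return ranked[0][0], ranked[-1][0]

# Time one of the answers, as the assignment asks for each question.
start = time.perf_counter()
unique_counts = unique_words_per_book(rows)
elapsed = time.perf_counter() - start
print(unique_counts, f"{elapsed * 1000:.3f} ms")
```

On the cluster the same counts would come from grouping the exploded word column by book; the sketch only fixes the expected logic on data small enough to check by hand.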


In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, row_number, max, regexp_replace, trim, split, array_contains, lower
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("luis-barbosa-pb-tp5")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [3]:
%%time
! hadoop fs -put ../datasets/books

CPU times: user 117 ms, sys: 43.9 ms, total: 160 ms
Wall time: 4.45 s


In [4]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-06-24 23:48 hdfs://node-master:9000/user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2023-06-24 23:48 hdfs://node-master:9000/user/root/books


### Auxiliary Functions

In [5]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading txt file

In [6]:
dataset_path = "hdfs://node-master:9000/user/root/books"

In [7]:
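# First attempt: the CSV reader splits every line at commas, and with a
# single-column schema only the text before the first comma is kept.
dfft_ = spark.read.format(
    "csv"
).option(
    "header", "false"
).schema(
    StructType([
        StructField("line", StringType(), True)
    ])
).load(
    dataset_path
)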

In [8]:
shape(dfft_)

(1264, 1)

In [165]:
dfft_.head(50)

[Row(line='The Project Gutenberg eBook'),
 Row(line='This eBook is for the use of anyone anywhere in the United States and most'),
 Row(line='other parts of the world at no cost and with almost no restrictions'),
 Row(line='whatsoever.  You may copy it'),
 Row(line='the Project Gutenberg License included with this eBook or online at'),
 Row(line='www.gutenberg.org.  If you are not located in the United States'),
 Row(line='to check the laws of the country where you are located before using this ebook.'),
 Row(line='Title: Aaron Trow'),
 Row(line='Author: Anthony Trollope'),
 Row(line='Release Date: January 16'),
 Row(line='[This file was first posted on July 31'),
 Row(line='Language: English'),
 Row(line='Character set encoding: UTF-8'),
 Row(line='***START OF THE PROJECT GUTENBERG EBOOK AARON TROW***'),
 Row(line='Transcribed from the 1864 Chapman and Hall “Tales of All Countries”'),
 Row(line='edition by David Price'),
 Row(line='                               AARON TROW.'),
 Row(li
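
The rows above stop at the first comma (for example `'whatsoever.  You may copy it'`), which is consistent with commas being treated as field separators and only the first field fitting the one-column schema. A small sketch with Python's standard `csv` module, used here only as a stand-in to illustrate the effect, not as Spark's reader:

```python
import csv
import io

line = "whatsoever.  You may copy it, give it away or re-use it under the terms of"

# Parse the line as CSV: the comma splits it into two fields.
fields = next(csv.reader(io.StringIO(line)))

# Keeping only the first field reproduces the truncated row seen above.
first_field = fields[0]
print(first_field)
```

This is why the next cells switch to a plain-text read with an explicit paragraph separator.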

In [166]:
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"

In [167]:
dfft = spark.read.text(
    [dataset_path],
    lineSep=WINDOWS_SEP
).withColumnRenamed("value", "original_paragraph")


In [168]:
shape(dfft)

(178, 1)
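
Using a blank line as the record separator turns each paragraph into one row. A rough plain-Python sketch of the same idea, assuming a hypothetical helper `split_paragraphs` that normalizes Windows line endings first, so that both separators defined above are covered, and numbers the paragraphs from 1 as the assignment asks:

```python
def split_paragraphs(text):
    """Split raw book text on blank lines and number the paragraphs from 1."""
    # Normalize Windows line endings so "\r\n\r\n" and "\n\n" behave alike.
    normalized = text.replace("\r\n", "\n")
    # Drop empty paragraphs produced by consecutive blank lines.
    paragraphs = [p for p in normalized.split("\n\n") if p.strip()]
    return list(enumerate(paragraphs, start=1))

sample = "Title: Aaron Trow\r\n\r\nAuthor: Anthony Trollope\r\n\r\n\r\n\r\nLanguage: English"
numbered = split_paragraphs(sample)
for number, paragraph in numbered:
    print(number, repr(paragraph))
```

Unlike the Spark read, which keeps consecutive separators as empty rows, this sketch discards empty paragraphs before numbering.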

In [169]:
dfft.printSchema()

root
 |-- original_paragraph: string (nullable = true)



In [170]:
dfft.head(20)

[Row(original_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope'),
 Row(original_paragraph="\r\nThis eBook is for the use of anyone anywhere in the United States and most\r\nother parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms of\r\nthe Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you'll have\r\nto check the laws of the country where you are located before using this ebook."),
 Row(original_paragraph=''),
 Row(original_paragraph='\r\nTitle: Aaron Trow'),
 Row(original_paragraph='\r\nAuthor: Anthony Trollope'),
 Row(original_paragraph=''),
 Row(original_paragraph='Release Date: January 16, 2015  [eBook #3713]\r\n[This file was first posted on July 31, 2001]'),
 Row(original_paragraph='Language: English'),
 Row(original_paragraph='Character set encoding: UTF-8'),
 Row(original_paragraph='\r\n

### Data Wrangling

In [214]:
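# Note: the second and third steps also read original_paragraph, so each
# withColumn overwrites clean_paragraph and only the final whitespace collapse
# is reflected in the result (punctuation is kept and the trim is discarded).
# Reading col("clean_paragraph") in those steps would chain them instead.
dfft_s1 = dfft.withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), r"[:\r\n`';,]", " ")
).withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), r"[^A-z0-9 ]", "")
).select(
    "original_paragraph",
    trim(col("clean_paragraph")).alias("clean_paragraph")
).withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), r"\s+", " ")
)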

In [215]:
dfft_s1.collect()[7]

Row(original_paragraph='Language: English', clean_paragraph='Language: English')
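
The colon survives in `clean_paragraph` because each `regexp_replace` call reads `original_paragraph`, so only the last step is reflected in the result. A minimal sketch of the three steps applied in sequence, written with Python's `re` module as a stand-in for the Spark expressions. The helper name `clean_paragraph` and the narrower `A-Za-z` range are assumptions (the range `A-z` also matches `[`, `\`, `]`, `^`, `_` and the backtick):

```python
import re

def clean_paragraph(text):
    """Apply the three cleaning steps in sequence, each on the previous result."""
    # 1. Replace separators and line breaks with a space.
    text = re.sub(r"[:\r\n`';,]", " ", text)
    # 2. Drop everything that is not an ASCII letter, digit or space.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    # 3. Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

print(clean_paragraph("Language: English"))
print(clean_paragraph("\r\nTitle: Aaron Trow"))
```

In Spark the same chaining would come from passing `col("clean_paragraph")` to the second and third `regexp_replace` calls. For the Portuguese books an ASCII-only character class would also strip accented letters, so the class would need to be extended.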

### Removing Stop Words

In [216]:
dfftc_s1 = dfft_s1.withColumn("words", split(lower(col("clean_paragraph")), " "))

In [217]:
dfftc_s1.printSchema()

root
 |-- original_paragraph: string (nullable = true)
 |-- clean_paragraph: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [218]:
dfftc_s1.head(10)

[Row(original_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', clean_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', words=['the', 'project', 'gutenberg', 'ebook,', 'aaron', 'trow,', 'by', 'anthony', 'trollope']),
 Row(original_paragraph="\r\nThis eBook is for the use of anyone anywhere in the United States and most\r\nother parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms of\r\nthe Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you'll have\r\nto check the laws of the country where you are located before using this ebook.", clean_paragraph=" This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project

In [219]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [220]:
dfftc_s2 = remover.transform(dfftc_s1)

In [221]:
dfftc_s2.head()

Row(original_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', clean_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', words=['the', 'project', 'gutenberg', 'ebook,', 'aaron', 'trow,', 'by', 'anthony', 'trollope'], keys=['project', 'gutenberg', 'ebook,', 'aaron', 'trow,', 'anthony', 'trollope'])

In [222]:
dfftc_s2.collect()[1]

Row(original_paragraph="\r\nThis eBook is for the use of anyone anywhere in the United States and most\r\nother parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms of\r\nthe Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you'll have\r\nto check the laws of the country where you are located before using this ebook.", clean_paragraph=" This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you'll have to check the laws of the country where you are located before using this ebook.", words=['', 'this', 'ebook', 'is', 'for', 'the', 'us
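
The `words` arrays above contain empty strings (from leading spaces) and tokens with trailing punctuation such as `ebook,`, and both survive into `keys`. A small plain-Python sketch of the filtering step that also drops empty tokens, assuming a tiny hypothetical stop-word set rather than the default list used by `StopWordsRemover`:

```python
# Tiny illustrative stop-word set; the real default list is much longer.
STOP_WORDS = {"the", "by", "is", "for", "of", "this"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Drop empty tokens and stop words, keeping the original order."""
    return [w for w in words if w and w not in stop_words]

words = ["", "the", "project", "gutenberg", "ebook,", "aaron", "trow,", "by", "anthony", "trollope"]
keys = remove_stop_words(words)
print(keys)
```

Tokens like `ebook,` still carry punctuation here, so an exact-match search for `ebook` would miss them unless the cleaning step removes punctuation first.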

### Registering search table

In [223]:
dff = dfftc_s2.select(
    "original_paragraph", "clean_paragraph", "keys"
)

In [224]:
dff.show()

+--------------------+--------------------+--------------------+
|  original_paragraph|     clean_paragraph|                keys|
+--------------------+--------------------+--------------------+
|The Project Guten...|The Project Guten...|[project, gutenbe...|
|
This eBook is f...| This eBook is fo...|[, ebook, use, an...|
|                    |                    |                  []|
| 
Title: Aaron Trow|   Title: Aaron Trow|[, title:, aaron,...|
|
Author: Anthony...| Author: Anthony ...|[, author:, antho...|
|                    |                    |                  []|
|Release Date: Jan...|Release Date: Jan...|[release, date:, ...|
|   Language: English|   Language: English|[language:, english]|
|Character set enc...|Character set enc...|[character, set, ...|
|
***START OF THE...| ***START OF THE ...|[, ***start, proj...|
|
Transcribed fro...| Transcribed from...|[, transcribed, 1...|
|                    |                    |                  []|
|                    |   

In [150]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement.
dff.createOrReplaceTempView("books")

In [232]:
result = spark.sql(f"""
select keys as language
from books
where array_contains(keys, "convict")
""")

In [233]:
result.count()

5

In [236]:
result.collect()

[Row(language=['bermuda,', 'world', 'knows,', 'british', 'colony', 'maintain', 'convict', 'establishment.', 'outlying', 'convict', 'establishments', 'sent', 'back', 'upon', 'hands', 'colonies,', 'one', 'still', 'maintained.', 'also', 'islands', 'strong', 'military', 'fortress,', 'though', 'fortress', 'looking', 'magnificent', 'eyes', 'civilians,', 'malta', 'gibraltar.', 'also', 'six', 'thousand', 'white', 'people', 'six', 'thousand', 'black', 'people,', 'eating,', 'drinking,', 'sleeping,', 'dying.']),
 Row(language=['convict', 'establishment', 'notable', 'feature', 'bermuda', 'stranger,', 'seem', 'attract', 'much', 'attention', 'regular', 'inhabitants', 'place.', 'intercourse', 'prisoners', 'bermudians.', 'convicts', 'rarely', 'seen', 'them,', 'convict', 'islands', 'rarely', 'visited.', 'prisoners', 'themselves,', 'course', 'open', 'them—or', 'open', 'them—to', 'intercourse', 'prison', 'authorities.']),
 Row(language=['have,', 'however,', 'instances', 'convicts', 'escaped', 'confinemen

### Calculating score

In [91]:
multi_kws = "alice rabbit door"

#### Jaccard Similarity

In [92]:
score = spark.sql(f"""
select *, size(intersection_)/size(union_) score
from (
    select *,
           array_intersect(keys, split(lower("{multi_kws}"), " ")) intersection_,
           array_union(keys, split(lower("{multi_kws}"), " ")) union_
    from finder
    where size(array_intersect(keys, split(lower("{multi_kws}"), " "))) > 0
) tmp
order by score desc
limit 10
""")

- one word: 264 ms
- four words: 326 ms
- ten words: 314 ms

In [93]:
%%time
score.show()

+--------------------+--------------------+-------------+--------------------+------------------+
|                keys|           paragraph|intersection_|              union_|             score|
+--------------------+--------------------+-------------+--------------------+------------------+
|[alice, went, tim...|Alice went timidl...|[alice, door]|[alice, went, tim...|0.3333333333333333|
|       [said, alice]|But what am I to ...|      [alice]|[said, alice, rab...|              0.25|
|     [alice, silent]|    Alice was silent|      [alice]|[alice, silent, r...|              0.25|
|       [said, alice]| What for said Alice|      [alice]|[said, alice, rab...|              0.25|
|   [inquired, alice]|What was that inq...|      [alice]|[inquired, alice,...|              0.25|
|   [alice, evidence]|    Alice s Evidence|      [alice]|[alice, evidence,...|              0.25|
|[said, alice, duc...|Very said Alice w...|      [alice]|[said, alice, duc...|               0.2|
|[alice, adventure..
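
The score column is the Jaccard similarity between the paragraph keys and the query words: the size of their intersection divided by the size of their union. A minimal plain-Python sketch of the same ratio, where the helper name `jaccard` is an assumption and sets replace the SQL array functions:

```python
def jaccard(keys, query):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two word collections."""
    a, b = set(keys), set(query)
    union = a | b
    # Two empty collections have no defined ratio; return 0.0 by convention.
    return len(a & b) / len(union) if union else 0.0

query = "alice rabbit door".lower().split(" ")
# One shared word ("alice") out of four distinct words in the union.
print(jaccard(["said", "alice"], query))
print(jaccard(["alice", "silent"], query))
```

Both calls give 0.25, matching the rows for `[said, alice]` and `[alice, silent]` in the table above.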

#### Cosine Similarity

In [43]:
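def cossim(v1, v2):
    # Sketch for plain Python word lists treated as binary vectors: |A ∩ B| / sqrt(|A| * |B|)
    s1, s2 = set(v1), set(v2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / (len(s1) * len(s2)) ** 0.5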

links:
- [spark-sql-array-funcs](https://kontext.tech/article/587/spark-sql-array-functions)
- [spark-map-syntax](https://sparkbyexamples.com/pyspark/pyspark-map-transformation/)