
# Block Project: Data Engineering: Big Data
---

## **TP5**

**Choose 15 books in English and 15 books in Portuguese (30 books in total).**

**Using Jupyter Notebook with the Apache Spark interpreter in Python, we will ingest, clean and normalize, and analyze the chosen texts, measuring the execution time of each step.**

---

- **Ingestion**
    - Select the 30 books in Plain Text UTF-8 format
    - Add the files to HDFS

- **Cleaning and Normalization**
    - Import the data into Spark DataFrames
    - Remove disposable characters
    - Remove disposable lines
    - Remove stop words
    - Apply lemmatization
    - Use complementary techniques, if deemed necessary
    - Unified DataFrame with information on the 30 selected books, with the following columns:
        - Book Name
        - Book Language
        - Paragraph Number (starting at index #1)
        - Original Paragraph (before cleaning and normalization)
        - Set of Words in the Paragraph (after cleaning and normalization)
        - Other columns, if deemed necessary


### **Analysis and Time Measurement**
---
Answer the following questions:
- Number of unique words used per book
- Number of paragraphs, and of non-unique words per paragraph, per book
- Identify the most frequent and the least frequent word per paragraph, per book
- For the English books, select the top 10 most frequent words. Do the same for the Portuguese books
- Build two line charts, one for the English books and one for the Portuguese books, with the paragraph index on the X axis, the number of unique words on the Y axis, and one line per book
- Build two histograms, one for the English books and one for the Portuguese books, to analyze the word frequency of the books
- Show the time spent processing each of the answers above
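
The last item asks for the time spent on each answer. Below is a minimal sketch of one way to collect those timings in plain Python. The `timed` helper and the `timings` dictionary are hypothetical names, not part of the notebook. With Spark, the timed block would need to contain an action such as `count()` or `show()`, since transformations are lazy.

```python
import time
from contextlib import contextmanager

# Hypothetical helper: records wall-clock seconds per labelled step.
timings = {}

@contextmanager
def timed(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Stand-in workload; in the notebook this would be a Spark action.
with timed("unique_words"):
    unique_words = len(set("the cat and the hat".split()))

print(unique_words, f"{timings['unique_words']:.6f}s")
```

Keeping the timings in a dictionary makes it easy to print one summary at the end of the notebook.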


In [1]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, sum, avg, row_number, max, regexp_replace, trim, split, array_contains, lower
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [2]:
conf = SparkConf().setAppName("TP5")
# The builder reuses an existing session if one is already running
spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

### **Cleaning and Normalization**

- **Importing the data into Spark DataFrames**

In [3]:
%%time
! hadoop fs -put ../datasets/tp5

put: `tp5/At The Mountains of Madness by H. P. Lovecraft.txt': File exists
put: `tp5/Dracula by Bram Stoker.txt': File exists
put: `tp5/Elegia by Manuel Maria Barbosa du Bocage.txt': File exists
put: `tp5/Frankenstein by Mary Wollstonecraft Shelley.txt': File exists
put: `tp5/Improvisos de Bocage by Manuel Maria Barbosa du Bocage.txt': File exists
put: `tp5/Medusa's coil by Zealia B. Bishop and H. P. Lovecraft.txt': File exists
put: `tp5/Os jardins ou a arte de aformosear as paisagens Poema by Jacques Delille.txt': File exists
put: `tp5/Poesias Eroticas, Burlescas, e Satyricas de M.M. de Barbosa du Bocage by Bocage.txt': File exists
put: `tp5/Queixumes do Pastor Elmano Contra a Falsidade da Pastora Urselina by Bocage.txt': File exists
put: `tp5/The Call of Cthulhu by H. P. Lovecraft.txt': File exists
put: `tp5/The Castle of Otranto by Horace Walpole.txt': File exists
put: `tp5/The Colour Out of Space by H. P. Lovecraft.txt': File exists
put: `tp5/The Curse of Yig by Zealia B. Bishop an

In [4]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-06-24 04:10 hdfs://node-master:9000/user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5


In [5]:
! hadoop fs -ls hdfs://node-master:9000/user/root/tp5

Found 20 items
-rw-r--r--   2 root supergroup     270199 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/At The Mountains of Madness by H. P. Lovecraft.txt
-rw-r--r--   2 root supergroup     881691 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Dracula by Bram Stoker.txt
-rw-r--r--   2 root supergroup      23527 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Elegia by Manuel Maria Barbosa du Bocage.txt
-rw-r--r--   2 root supergroup     448609 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Frankenstein by Mary Wollstonecraft Shelley.txt
-rw-r--r--   2 root supergroup      37440 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Improvisos de Bocage by Manuel Maria Barbosa du Bocage.txt
-rw-r--r--   2 root supergroup     111420 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Medusa's coil by Zealia B. Bishop and H. P. Lovecraft.txt
-rw-r--r--   2 root supergroup     241819 2023-06-24 00:27 hdfs://node-master:9000/user/root/tp5/Os jardins 

In [6]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading txt file

In [7]:
dataset_path = "hdfs://node-master:9000/user/root/tp5"

In [8]:
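# First attempt: the CSV reader splits every line at commas, and the one-column
# schema keeps only the first field, so any text after a comma is lost.
dfft_ = spark.read.format(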
    "csv"
).option(
    "header", "false"
).schema(
    StructType([StructField("full_text", StringType(), True)])
).load(
    f"{dataset_path}/*.txt"
)

In [12]:
shape(dfft_)

(87194, 1)

In [13]:
dfft_.head(30)

[Row(full_text='The Project Gutenberg eBook of The Mysteries of Udolpho'),
 Row(full_text='This eBook is for the use of anyone anywhere in the United States and'),
 Row(full_text='most other parts of the world at no cost and with almost no restrictions'),
 Row(full_text='whatsoever. You may copy it'),
 Row(full_text='of the Project Gutenberg License included with this eBook or online at'),
 Row(full_text='www.gutenberg.org. If you are not located in the United States'),
 Row(full_text='will have to check the laws of the country where you are located before'),
 Row(full_text='using this eBook.'),
 Row(full_text='Title: The Mysteries of Udolpho'),
 Row(full_text='Author: Ann Radcliffe'),
 Row(full_text='Release Date: March 4'),
 Row(full_text='[Most recently updated: December 1'),
 Row(full_text='Language: English'),
 Row(full_text='Produced by: Karalee Coleman and David Widger'),
 Row(full_text='*** START OF THE PROJECT GUTENBERG EBOOK THE MYSTERIES OF UDOLPHO ***'),
 Row(full_text='cov
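
Several rows above are cut short (`'whatsoever. You may copy it'`, `'Release Date: March 4'`) because the CSV reader treats commas as field separators and the one-column schema keeps only the first field. The effect can be reproduced with Python's `csv` module. This illustrates the splitting only, not Spark's reader itself:

```python
import csv

line = "whatsoever. You may copy it, give it away or re-use it under the terms"

# A CSV parser splits the line at the comma; a single-column schema keeps field 0.
fields = next(csv.reader([line]))
first_field = fields[0]

print(len(fields))
print(first_field)
```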

- **Removal of disposable lines**

In [14]:
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"

In [15]:
dfft = spark.read.text(
    [f"{dataset_path}/*.txt"],
    lineSep=WINDOWS_SEP
)
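
`UNIX_LIKE_SEP` is defined above, but only `WINDOWS_SEP` is passed to the reader, so a file saved with `\n` line endings would come back as a single record instead of one record per paragraph. Below is a plain-Python sketch of one way to handle both cases. The `split_paragraphs` helper is a hypothetical name, and it works on a string rather than on the Spark reader:

```python
def split_paragraphs(text):
    """Split text into paragraphs on blank lines, for CRLF or LF files."""
    # Normalize Windows line endings so a single separator works for both.
    normalized = text.replace("\r\n", "\n")
    # Drop empty chunks produced by runs of blank lines.
    return [p for p in normalized.split("\n\n") if p.strip()]

windows_text = "Title: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\n\r\n\r\nCHAPTER I"
unix_text = "Title: Elegia\n\nLanguage: Portuguese"

print(split_paragraphs(windows_text))
print(split_paragraphs(unix_text))
```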

In [16]:
shape(dfft)

(14255, 1)

In [17]:
dfft.printSchema()

root
 |-- value: string (nullable = true)



In [18]:
dfft.head(20)

[Row(value='The Project Gutenberg eBook of The Mysteries of Udolpho, by Ann Radcliffe'),
 Row(value='This eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.'),
 Row(value='Title: The Mysteries of Udolpho'),
 Row(value='Author: Ann Radcliffe'),
 Row(value='Release Date: March 4, 2001 [eBook #3268]\r\n[Most recently updated: December 1, 2022]'),
 Row(value='Language: English'),
 Row(value='\r\nProduced by: Karalee Coleman and David Widger'),
 Row(value='*** START OF THE PROJECT GUTENBERG EBOOK THE MYSTERIES OF UDOLPHO ***'),
 Row(value=''),
 Row(value='\r\ncover '),
 Row(value=''),
 Row(val

### Data Wrangling

In [19]:
# Replace line breaks and selected punctuation with spaces (raw string avoids invalid escapes)
dfft_s1 = dfft.withColumn(
    "value_s1", regexp_replace(col("value"), r"[\r\n`';,]", " ")
)

In [20]:
dfft_s1.collect()[2]["value_s1"]

'Title: The Mysteries of Udolpho'

In [21]:
dfftc = dfft_s1.withColumn("paragraph", col("value_s1")).select("paragraph")

In [22]:
dfftc.show()

+--------------------+
|           paragraph|
+--------------------+
|The Project Guten...|
|This eBook is for...|
|Title: The Myster...|
|Author: Ann Radcl...|
|Release Date: Mar...|
|   Language: English|
|  Produced by: Ka...|
|*** START OF THE ...|
|                    |
|              cover |
|                    |
|The Mysteries of ...|
|          A Romance |
| Interspersed Wit...|
|    By Ann Radcliffe|
|                    |
|            Contents|
|   VOLUME I   CHA...|
| VOLUME II   CHAP...|
| VOLUME III   CHA...|
+--------------------+
only showing top 20 rows



In [23]:
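# Keep only Unicode letters, digits and spaces; \p{L} preserves the accented
# letters of the Portuguese books, which an ASCII-only range would drop.
dfft_s2 = dfftc.withColumn(
    "value_s2", regexp_replace(col("paragraph"), r"[^\p{L}\p{N} ]", "")
)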

In [24]:
dfft_s2.collect()[2]["value_s2"]

'Title The Mysteries of Udolpho'

In [25]:
dfft_s3 = dfft_s2.select(
    "value_s2",
    "paragraph",
    trim(col("value_s2")).alias("value_s3")
)

In [26]:
dfft_s3.collect()[2]["value_s3"]

'Title The Mysteries of Udolpho'

In [27]:
# Collapse runs of whitespace into a single space
dfft_s4 = dfft_s3.withColumn(
    "value_s4", regexp_replace(col("value_s3"), r"\s+", " ")
)

In [28]:
dfft_s4.collect()[2]["value_s4"]

'Title The Mysteries of Udolpho'
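
The four steps above (replace line breaks and punctuation with spaces, drop non-alphanumeric characters, trim, collapse whitespace) can be checked on a single string with Python's `re` module. This is a sketch with a hypothetical helper `clean_paragraph`, not the Spark pipeline itself. It also shows why a range written as `A-z` is wider than it looks: it spans the ASCII characters between `Z` and `a`, so `[`, `\`, `]`, `^`, `_` and the backtick are kept as well.

```python
import re

def clean_paragraph(text):
    """Mirror the wrangling steps on one string (plain Python, not Spark)."""
    s = re.sub(r"[\r\n`';,]", " ", text)   # line breaks and punctuation -> space
    s = re.sub(r"[^\w ]|_", "", s)          # keep letters/digits (accents too) and spaces
    s = s.strip()                           # trim both ends
    return re.sub(r"\s+", " ", s)           # collapse runs of whitespace

print(clean_paragraph("Title: The Mysteries of Udolpho"))
print(clean_paragraph("  Coração;\r\nsaudade,  é  "))

# The range A-z also matches [ \ ] ^ _ and the backtick, so they survive:
print(re.sub(r"[^A-z0-9 ]", "", "a_b^c[d]"))      # range too wide
print(re.sub(r"[^A-Za-z0-9 ]", "", "a_b^c[d]"))   # intended ASCII ranges
```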

### Removing Stop Words

In [29]:
dfftc_s1 = dfft_s4.withColumn("words", split(lower(col("value_s4")), " "))

In [30]:
dfftc_s1.printSchema()

root
 |-- value_s2: string (nullable = true)
 |-- paragraph: string (nullable = true)
 |-- value_s3: string (nullable = true)
 |-- value_s4: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [31]:
dfftc_s1 = dfftc_s1.select("words", "paragraph")

In [32]:
dfftc_s1.printSchema()

root
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- paragraph: string (nullable = true)



In [33]:
dfftc_s1.head(10)

[Row(words=['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'mysteries', 'of', 'udolpho', 'by', 'ann', 'radcliffe'], paragraph='The Project Gutenberg eBook of The Mysteries of Udolpho  by Ann Radcliffe'),
 Row(words=['this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook'], paragraph='This eBook is for the use of anyone anywhere in the United States and  most other par

In [34]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [35]:
dfftc_s2 = remover.transform(dfftc_s1)

In [36]:
dfftc_s2.head()

Row(words=['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'mysteries', 'of', 'udolpho', 'by', 'ann', 'radcliffe'], paragraph='The Project Gutenberg eBook of The Mysteries of Udolpho  by Ann Radcliffe', keys=['project', 'gutenberg', 'ebook', 'mysteries', 'udolpho', 'ann', 'radcliffe'])

In [37]:
dfftc_s2.collect()[2]

Row(words=['title', 'the', 'mysteries', 'of', 'udolpho'], paragraph='Title: The Mysteries of Udolpho', keys=['title', 'mysteries', 'udolpho'])
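
`StopWordsRemover` uses an English stop-word list by default, so Portuguese paragraphs would keep words such as `de` or `que`. PySpark provides lists for other languages through `StopWordsRemover.loadDefaultStopWords("portuguese")`. The plain-Python sketch below only illustrates the idea of choosing the list per language. The tiny `STOP_WORDS` dictionary and the `remove_stop_words` helper are hypothetical stand-ins, not the Spark lists.

```python
# Hypothetical, deliberately tiny stop-word lists (illustration only).
STOP_WORDS = {
    "English": {"the", "of", "by", "and", "a"},
    "Portuguese": {"o", "a", "de", "do", "da", "e", "que"},
}

def remove_stop_words(words, language):
    """Drop stop words for the given language, keeping word order."""
    stops = STOP_WORDS.get(language, set())
    return [w for w in words if w not in stops]

print(remove_stop_words(["title", "the", "mysteries", "of", "udolpho"], "English"))
print(remove_stop_words(["queixumes", "do", "pastor", "elmano"], "Portuguese"))
```

In the notebook, the same choice could be made by splitting the DataFrame by language and applying one remover per subset.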

### Registering search table

In [38]:
dff = dfftc_s2.select(
    "keys", "paragraph"
)

In [39]:
dff.show()

+--------------------+--------------------+
|                keys|           paragraph|
+--------------------+--------------------+
|[project, gutenbe...|The Project Guten...|
|[ebook, use, anyo...|This eBook is for...|
|[title, mysteries...|Title: The Myster...|
|[author, ann, rad...|Author: Ann Radcl...|
|[release, date, m...|Release Date: Mar...|
| [language, english]|   Language: English|
|[produced, karale...|  Produced by: Ka...|
|[start, project, ...|*** START OF THE ...|
|                  []|                    |
|             [cover]|              cover |
|                  []|                    |
|[mysteries, udolpho]|The Mysteries of ...|
|           [romance]|          A Romance |
|[interspersed, pi...| Interspersed Wit...|
|    [ann, radcliffe]|    By Ann Radcliffe|
|                  []|                    |
|          [contents]|            Contents|
|[volume, chapter,...|   VOLUME I   CHA...|
|[volume, ii, chap...| VOLUME II   CHAP...|
|[volume, iii, cha...| VOLUME II

In [40]:
dfftc = dfft_s1.withColumn("paragraph", col("value_s1")).select("paragraph")

In [47]:
dff_title = dff.withColumn("Title", col("paragraph").contains("Title:"))

In [46]:
# Column has no filter() method; build the boolean flag with contains() instead
dff_title2 = dff.withColumn("Title", col("paragraph").contains("Title:"))

In [42]:
dff_title2.show()

+--------------------+--------------------+-----+
|                keys|           paragraph|Title|
+--------------------+--------------------+-----+
|[project, gutenbe...|The Project Guten...|false|
|[ebook, use, anyo...|This eBook is for...|false|
|[title, mysteries...|Title: The Myster...| true|
|[author, ann, rad...|Author: Ann Radcl...|false|
|[release, date, m...|Release Date: Mar...|false|
| [language, english]|   Language: English|false|
|[produced, karale...|  Produced by: Ka...|false|
|[start, project, ...|*** START OF THE ...|false|
|                  []|                    |false|
|             [cover]|              cover |false|
|                  []|                    |false|
|[mysteries, udolpho]|The Mysteries of ...|false|
|           [romance]|          A Romance |false|
|[interspersed, pi...| Interspersed Wit...|false|
|    [ann, radcliffe]|    By Ann Radcliffe|false|
|                  []|                    |false|
|          [contents]|            Contents|false|


In [107]:
column_title = dff.filter(col("paragraph").contains("Title:"))

In [108]:
column_title = column_title.select("paragraph")

In [109]:
column_title.show()

+--------------------+
|           paragraph|
+--------------------+
|Title: The Myster...|
|      Title: Dracula|
|  Title: The Phan...|
|Title: Frankenste...|
|Title: At the mou...|
|Title: Os jardins...|
|Title: The Castle...|
|Title: Poesias Er...|
|Title: The Strang...|
|  Title: The Wendigo|
|Title: Medusa s coil|
|Title: The call o...|
|Title: The colour...|
|Title: The lurkin...|
|  Title: The Fall...|
|Title: The curse ...|
|  Title: Improvis...|
|  Title: Queixume...|
|    Title: The Raven|
|       Title: Elegia|
+--------------------+



In [110]:
column_title = column_title.withColumn(
    "Titles", regexp_replace(col("paragraph"), "Title: ", "")
)
column_title = column_title.select("Titles")

In [111]:
column_title.show()

+--------------------+
|              Titles|
+--------------------+
|The Mysteries of ...|
|             Dracula|
|  The Phantom of ...|
|Frankenstein     ...|
|At the mountains ...|
|Os jardins ou a a...|
|The Castle of Otr...|
|Poesias Eroticas ...|
|The Strange Case ...|
|         The Wendigo|
|       Medusa s coil|
| The call of Cthulhu|
|The colour out of...|
|    The lurking fear|
|  The Fall of the...|
|    The curse of Yig|
|  Improvisos de B...|
|  Queixumes do Pa...|
|           The Raven|
|              Elegia|
+--------------------+



In [112]:
columns_title_language = column_title.withColumn('Language', lit(None).cast(StringType()))

In [113]:
columns_title_language.show()

+--------------------+--------+
|              Titles|Language|
+--------------------+--------+
|The Mysteries of ...|    null|
|             Dracula|    null|
|  The Phantom of ...|    null|
|Frankenstein     ...|    null|
|At the mountains ...|    null|
|Os jardins ou a a...|    null|
|The Castle of Otr...|    null|
|Poesias Eroticas ...|    null|
|The Strange Case ...|    null|
|         The Wendigo|    null|
|       Medusa s coil|    null|
| The call of Cthulhu|    null|
|The colour out of...|    null|
|    The lurking fear|    null|
|  The Fall of the...|    null|
|    The curse of Yig|    null|
|  Improvisos de B...|    null|
|  Queixumes do Pa...|    null|
|           The Raven|    null|
|              Elegia|    null|
+--------------------+--------+



In [114]:
column_language = dff.filter(col("paragraph").contains("Language:"))

In [115]:
column_language = column_language.select("paragraph")

In [116]:
column_language = column_language.withColumn(
    "Language", regexp_replace(col("paragraph"), "Language: ", "")
)
column_language = column_language.select("Language")

In [117]:
column_language.show()

+----------+
|  Language|
+----------+
|   English|
|   English|
|   English|
|   English|
|   English|
|Portuguese|
|   English|
|Portuguese|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|Portuguese|
|Portuguese|
|   English|
|Portuguese|
+----------+



In [118]:
data_union = column_title.union(column_language)

In [119]:
data_union.printSchema()

root
 |-- Titles: string (nullable = true)



In [120]:
data_union.show()

+--------------------+
|              Titles|
+--------------------+
|The Mysteries of ...|
|             Dracula|
|  The Phantom of ...|
|Frankenstein     ...|
|At the mountains ...|
|Os jardins ou a a...|
|The Castle of Otr...|
|Poesias Eroticas ...|
|The Strange Case ...|
|         The Wendigo|
|       Medusa s coil|
| The call of Cthulhu|
|The colour out of...|
|    The lurking fear|
|  The Fall of the...|
|    The curse of Yig|
|  Improvisos de B...|
|  Queixumes do Pa...|
|           The Raven|
|              Elegia|
+--------------------+
only showing top 20 rows
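
`union` appends rows, which is why `data_union` above still has a single `Titles` column, with the languages stacked below the titles rather than placed in a second column. To keep each title paired with its language, both values need to be extracted per book. Below is a plain-Python sketch on one header string, using a hypothetical helper `parse_header`. In Spark the same idea would need a per-file key (for example the value of `input_file_name()`), which the notebook does not yet carry.

```python
import re

def parse_header(text):
    """Extract the 'Title:' and 'Language:' fields from a Gutenberg-style header."""
    fields = {}
    for name in ("Title", "Language"):
        # Match the field at the start of a line and capture the rest of that line.
        match = re.search(rf"^{name}:[ \t]*(.+?)[ \t]*\r?$", text, flags=re.MULTILINE)
        fields[name] = match.group(1) if match else None
    return fields

header = "Title: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\nLanguage: English\r\n"
print(parse_header(header))
```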



In [121]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
dff.createOrReplaceTempView("finder")

In [42]:
kws = "title"

In [43]:
result = spark.sql(f"""
select *
from finder
where array_contains(keys, lower("{kws}"))
""")

In [44]:
result.count()

36

In [45]:
result.show(10)

+--------------------+--------------------+
|                keys|           paragraph|
+--------------------+--------------------+
|[title, mysteries...|Title: The Myster...|
|[extraordinary, s...|“This is very ext...|
|[every, man, dese...|“If every man des...|
|[valancourt, whos...|With Valancourt  ...|
|[emily, seized, f...|  Emily seized th...|
|    [title, dracula]|      Title: Dracula|
|[call, said, hope...|"Call me what you...|
|[lord, godalming,...|"Lord Godalming  ...|
|[quincey, find, l...|"Quincey and I wi...|
|[cursory, glance,...|After a cursory g...|
+--------------------+--------------------+
only showing top 10 rows



### Calculating score

In [46]:
multi_kws = "title language"

#### Jaccard Similarity

In [47]:
score = spark.sql(f"""
select *, size(intersection_)/size(union_) score
from (
    select *,
           array_intersect(keys, split(lower("{multi_kws}"), " ")) intersection_,
           array_union(keys, split(lower("{multi_kws}"), " ")) union_
    from finder
    where size(array_intersect(keys, split(lower("{multi_kws}"), " "))) > 0
) tmp
order by score desc
limit 10
""")

- one word: 264 ms
- four words: 326 ms
- ten words: 314 ms
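
The query scores each paragraph by Jaccard similarity: the size of the intersection of the two word sets divided by the size of their union. Below is a plain-Python sketch of the same computation, using a hypothetical helper `jaccard`. For the keys `[language, english]` and the keywords `title language`, the intersection has one word and the union has three, which gives 1/3.

```python
def jaccard(keys, keywords):
    """Jaccard similarity between two word collections, treated as sets."""
    a, b = set(keys), set(keywords)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

query = "title language".lower().split(" ")
print(jaccard(["language", "english"], query))
print(jaccard(["title", "dracula"], query))
```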

In [48]:
%%time
score.show()

+--------------------+--------------------+-------------+--------------------+------------------+
|                keys|           paragraph|intersection_|              union_|             score|
+--------------------+--------------------+-------------+--------------------+------------------+
| [language, english]|   Language: English|   [language]|[language, englis...|0.3333333333333333|
|    [title, dracula]|      Title: Dracula|      [title]|[title, dracula, ...|0.3333333333333333|
| [language, english]|   Language: English|   [language]|[language, englis...|0.3333333333333333|
| [language, english]|   Language: English|   [language]|[language, englis...|0.3333333333333333|
| [language, english]|   Language: English|   [language]|[language, englis...|0.3333333333333333|
| [language, english]|   Language: English|   [language]|[language, englis...|0.3333333333333333|
|[language, portug...|Language: Portuguese|   [language]|[language, portug...|0.3333333333333333|
| [language, english

#### Cosine Similarity

In [43]:
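def cossim(v1, v2):
    # Cosine similarity of two plain Python word lists seen as binary
    # (present/absent) vectors: |v1 ∩ v2| / sqrt(|v1| * |v2|); 0.0 if either is empty.
    s1, s2 = set(v1), set(v2)
    return len(s1 & s2) / (len(s1) * len(s2)) ** 0.5 if s1 and s2 else 0.0
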

links:
- [spark-sql-array-funcs](https://kontext.tech/article/587/spark-sql-array-functions)
- [spark-map-syntax](https://sparkbyexamples.com/pyspark/pyspark-map-transformation/)