
# Block Project: Data Engineering: Big Data
---

## **TP5**

**Choose 15 books in English and 15 books in Portuguese (30 books in total).**

**Using Jupyter Notebook with the Apache Spark interpreter in Python, we will carry out ingestion, cleaning and normalization, analysis, and measurement of execution times for the selected texts.**

---

- **Ingestion**
    - Select the 30 books in Plain Text UTF-8 format
    - Add the files to HDFS

- **Cleaning and Normalization**
    - Import the data into Spark DataFrames
    - Remove disposable characters
    - Remove disposable lines
    - Remove stop words
    - Apply lemmatization
    - Use complementary techniques, if deemed necessary
    - Unified DataFrame with information on the 30 selected books, with the following columns:
        - Book Name
        - Book Language
        - Paragraph Number (starting at index 1)
        - Original Paragraph (before cleaning and normalization)
        - Set of Words in the Paragraph (after cleaning and normalization)
        - Other columns, if deemed necessary


### **Analysis and Time Measurement**
---
Answer the following questions:
- Number of unique words used per book
- Number of paragraphs, and number of non-unique words per paragraph, per book
- Identify the most frequent and the least frequent word per paragraph per book
- For the books in English, select the top 10 most frequent words. Do the same for the books in Portuguese
- Build two line charts, one for the books in English and one for those in Portuguese, with the paragraph index on the X axis, the number of unique words on the Y axis, and one line per book
- Build two histograms, one for the books in English and one for those in Portuguese, to analyze the word frequency of the books
- Display the time spent processing each of the answers above
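
One possible way to record the time spent on each answer is a small context manager built on `time.perf_counter`. This is a minimal sketch: the `timed` helper and the `timings` dictionary are hypothetical names, not part of the notebook, and the workload is a stand-in. With Spark, the timed block would need to contain an action such as `count()` or `show()`, since transformations alone are lazy.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical registry: label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workload; in the notebook this would be a Spark action
with timed("unique_words"):
    unique_words = len(set("the raven the wendigo".split()))

print(f"unique_words: {timings['unique_words']:.6f} s")
```

Keeping the measurements in one dictionary makes it straightforward to print a summary of all answers at the end.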


In [3]:
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, col, sum, avg, row_number, max, regexp_replace, trim, split, array_contains, lower
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [4]:
conf = SparkConf().setAppName("TP5")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

### Importing the Data into Spark DataFrames

In [5]:
%%time
! hadoop fs -put ../datasets/tp5

CPU times: user 18.7 ms, sys: 20.5 ms, total: 39.2 ms
Wall time: 2.18 s


In [6]:
! hadoop fs -ls hdfs://node-master:9000/user/root

Found 2 items
drwxr-xr-x   - root supergroup          0 2023-06-24 18:30 hdfs://node-master:9000/user/root/.sparkStaging
drwxr-xr-x   - root supergroup          0 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5


In [7]:
! hadoop fs -ls hdfs://node-master:9000/user/root/tp5

Found 20 items
-rw-r--r--   2 root supergroup     270199 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/At The Mountains of Madness by H. P. Lovecraft.txt
-rw-r--r--   2 root supergroup     881691 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Dracula by Bram Stoker.txt
-rw-r--r--   2 root supergroup      23527 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Elegia by Manuel Maria Barbosa du Bocage.txt
-rw-r--r--   2 root supergroup     448609 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Frankenstein by Mary Wollstonecraft Shelley.txt
-rw-r--r--   2 root supergroup      37440 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Improvisos de Bocage by Manuel Maria Barbosa du Bocage.txt
-rw-r--r--   2 root supergroup     111420 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Medusa's coil by Zealia B. Bishop and H. P. Lovecraft.txt
-rw-r--r--   2 root supergroup     241819 2023-06-24 18:30 hdfs://node-master:9000/user/root/tp5/Os jardins 

#### Row Count

In [8]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading the Files

In [9]:
dataset_path = "hdfs://node-master:9000/user/root/tp5"

In [10]:
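# Exploratory read: the CSV reader splits each line on commas, and with a
# one-column schema the text after the first comma is dropped.
dfft_ = spark.read.format(
    "csv"
).option(
    "header", "false"
).schema(
    StructType([StructField("full_text", StringType(), True)])
).load(
    f"{dataset_path}/*.txt"
)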

In [11]:
shape(dfft_)

(87194, 1)

In [12]:
dfft_.head(30)

[Row(full_text='The Project Gutenberg eBook of The Mysteries of Udolpho'),
 Row(full_text='This eBook is for the use of anyone anywhere in the United States and'),
 Row(full_text='most other parts of the world at no cost and with almost no restrictions'),
 Row(full_text='whatsoever. You may copy it'),
 Row(full_text='of the Project Gutenberg License included with this eBook or online at'),
 Row(full_text='www.gutenberg.org. If you are not located in the United States'),
 Row(full_text='will have to check the laws of the country where you are located before'),
 Row(full_text='using this eBook.'),
 Row(full_text='Title: The Mysteries of Udolpho'),
 Row(full_text='Author: Ann Radcliffe'),
 Row(full_text='Release Date: March 4'),
 Row(full_text='[Most recently updated: December 1'),
 Row(full_text='Language: English'),
 Row(full_text='Produced by: Karalee Coleman and David Widger'),
 Row(full_text='*** START OF THE PROJECT GUTENBERG EBOOK THE MYSTERIES OF UDOLPHO ***'),
 Row(full_text='cov

### Removing Disposable Lines

In [13]:
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"

In [14]:
dfft = spark.read.text(
    [f"{dataset_path}/*.txt"],
    lineSep=WINDOWS_SEP
)

In [15]:
shape(dfft)

(14255, 1)
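
The read above treats `\r\n\r\n` as the record separator, so each row holds one paragraph. The notebook defines `UNIX_LIKE_SEP` as well but reads only with `WINDOWS_SEP`. As a sketch of the same idea in plain Python, normalizing line endings first lets a single rule cover both styles; `split_paragraphs` is a hypothetical helper, and unlike the Spark read it also discards empty paragraphs.

```python
import re


def split_paragraphs(text):
    """Split raw text into paragraphs on blank lines, for either line-ending style."""
    normalized = text.replace("\r\n", "\n")
    # One or more blank lines separate paragraphs
    parts = re.split(r"\n{2,}", normalized)
    return [p for p in parts if p.strip()]


sample = "Title: Dracula\r\n\r\nAuthor: Bram Stoker\r\n\r\nLine one\r\nline two"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```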

In [16]:
dfft.printSchema()

root
 |-- value: string (nullable = true)



In [17]:
dfft.head(20)

[Row(value='The Project Gutenberg eBook of The Mysteries of Udolpho, by Ann Radcliffe'),
 Row(value='This eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.'),
 Row(value='Title: The Mysteries of Udolpho'),
 Row(value='Author: Ann Radcliffe'),
 Row(value='Release Date: March 4, 2001 [eBook #3268]\r\n[Most recently updated: December 1, 2022]'),
 Row(value='Language: English'),
 Row(value='\r\nProduced by: Karalee Coleman and David Widger'),
 Row(value='*** START OF THE PROJECT GUTENBERG EBOOK THE MYSTERIES OF UDOLPHO ***'),
 Row(value=''),
 Row(value='\r\ncover '),
 Row(value=''),
 Row(val

In [18]:
dfft_s1 = dfft.withColumn(
    "value_s1", regexp_replace(col("value"), r"[\r\n`';,]", " ")
)

In [19]:
dfft_s1.collect()[2]["value_s1"]

'Title: The Mysteries of Udolpho'

#### Saving the Original Paragraph for Comparison

In [20]:
dfftc = dfft_s1.withColumn("paragraph", col("value_s1")).select("paragraph")

In [21]:
dfftc.show()

+--------------------+
|           paragraph|
+--------------------+
|The Project Guten...|
|This eBook is for...|
|Title: The Myster...|
|Author: Ann Radcl...|
|Release Date: Mar...|
|   Language: English|
|  Produced by: Ka...|
|*** START OF THE ...|
|                    |
|              cover |
|                    |
|The Mysteries of ...|
|          A Romance |
| Interspersed Wit...|
|    By Ann Radcliffe|
|                    |
|            Contents|
|   VOLUME I   CHA...|
| VOLUME II   CHAP...|
| VOLUME III   CHA...|
+--------------------+
only showing top 20 rows



### Removing Disposable Characters

In [22]:
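# Note: the range A-z also matches [ \ ] ^ _ and the backtick, and it
# removes accented letters, which affects the Portuguese books.
dfft_s2 = dfftc.withColumn(
    "value_s2", regexp_replace(col("paragraph"), r"[^A-z0-9\ ]", "")
)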

In [23]:
dfft_s2.collect()[2]["value_s2"]

'Title The Mysteries of Udolpho'
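
The character class `[^A-z0-9\ ]` deserves care: in ASCII the range `A-z` also covers `[`, `\`, `]`, `^`, `_` and the backtick, and it excludes accented letters. The sketch below uses Python's `re` module (not Spark's Java regex engine) to compare it with a Unicode-aware alternative; both function names are hypothetical. In Spark, whose `regexp_replace` uses Java regular expressions, a class such as `[^\p{L}0-9 ]` plays a similar role.

```python
import re


def strip_ascii_range(text):
    """Same class as the cell above: keeps the ASCII range A-z, digits and spaces."""
    return re.sub(r"[^A-z0-9 ]", "", text)


def strip_unicode_aware(text):
    """Keeps letters of any script, digits and spaces; drops underscores and punctuation."""
    return re.sub(r"[^\w ]|_", "", text)


sample = "Só crimino esse heróe, _Algernon_!"
print(strip_ascii_range(sample))    # accents lost, underscores kept
print(strip_unicode_aware(sample))  # accents kept, underscores removed
```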

In [24]:
dfft_s3 = dfft_s2.select(
    "value_s2",
    "paragraph",
    trim(col("value_s2")).alias("value_s3")
)

In [25]:
dfft_s3.collect()[2]["value_s3"]

'Title The Mysteries of Udolpho'

In [26]:
dfft_s4 = dfft_s3.withColumn(
    "value_s4", regexp_replace(col("value_s3"), r"\s+", " ")
)

In [27]:
dfft_s4.collect()[2]["value_s4"]

'Title The Mysteries of Udolpho'

### Removing Stop Words

In [28]:
dfftc_s1 = dfft_s4.withColumn("words", split(lower(col("value_s4")), " "))

In [29]:
dfftc_s1.printSchema()

root
 |-- value_s2: string (nullable = true)
 |-- paragraph: string (nullable = true)
 |-- value_s3: string (nullable = true)
 |-- value_s4: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [30]:
dfftc_s1 = dfftc_s1.select("words", "paragraph")

In [31]:
dfftc_s1.printSchema()

root
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- paragraph: string (nullable = true)



In [32]:
dfftc_s1.head(10)

[Row(words=['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'mysteries', 'of', 'udolpho', 'by', 'ann', 'radcliffe'], paragraph='The Project Gutenberg eBook of The Mysteries of Udolpho  by Ann Radcliffe'),
 Row(words=['this', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'united', 'states', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorg', 'if', 'you', 'are', 'not', 'located', 'in', 'the', 'united', 'states', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'ebook'], paragraph='This eBook is for the use of anyone anywhere in the United States and  most other par

In [33]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [34]:
dfftc_s2 = remover.transform(dfftc_s1)

In [35]:
dfftc_s2.head()

Row(words=['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'mysteries', 'of', 'udolpho', 'by', 'ann', 'radcliffe'], paragraph='The Project Gutenberg eBook of The Mysteries of Udolpho  by Ann Radcliffe', keys=['project', 'gutenberg', 'ebook', 'mysteries', 'udolpho', 'ann', 'radcliffe'])

In [36]:
dfftc_s2.collect()[2]

Row(words=['title', 'the', 'mysteries', 'of', 'udolpho'], paragraph='Title: The Mysteries of Udolpho', keys=['title', 'mysteries', 'udolpho'])
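
`StopWordsRemover` defaults to an English stop-word list, so Portuguese paragraphs can keep words such as `de` and `que`. The idea of choosing the list by language can be sketched in plain Python; the tiny word sets and the `remove_stop_words` helper are illustrative assumptions, not the lists Spark ships. In PySpark, `StopWordsRemover.loadDefaultStopWords("portuguese")` returns a Portuguese list that can be passed through the `stopWords` parameter.

```python
# Illustrative, deliberately tiny stop-word sets (assumptions, not Spark's lists)
STOP_WORDS = {
    "English": {"the", "of", "by", "and"},
    "Portuguese": {"de", "que", "na", "e", "ao"},
}


def remove_stop_words(words, language):
    """Drop stop words for the given language; unknown languages pass through."""
    stop = STOP_WORDS.get(language, set())
    return [w for w in words if w not in stop]


english = remove_stop_words(["title", "the", "mysteries", "of", "udolpho"], "English")
portuguese = remove_stop_words(["que", "na", "eschola", "de", "marte"], "Portuguese")
print(english)      # ['title', 'mysteries', 'udolpho']
print(portuguese)   # ['eschola', 'marte']
```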

### Creating a DataFrame with the Cleaned Data

In [153]:
dff = dfftc_s2.select(
    "keys", "paragraph"
)

In [154]:
dff.show()

+--------------------+--------------------+
|                keys|           paragraph|
+--------------------+--------------------+
|[project, gutenbe...|The Project Guten...|
|[ebook, use, anyo...|This eBook is for...|
|[title, mysteries...|Title: The Myster...|
|[author, ann, rad...|Author: Ann Radcl...|
|[release, date, m...|Release Date: Mar...|
| [language, english]|   Language: English|
|[produced, karale...|  Produced by: Ka...|
|[start, project, ...|*** START OF THE ...|
|                  []|                    |
|             [cover]|              cover |
|                  []|                    |
|[mysteries, udolpho]|The Mysteries of ...|
|           [romance]|          A Romance |
|[interspersed, pi...| Interspersed Wit...|
|    [ann, radcliffe]|    By Ann Radcliffe|
|                  []|                    |
|          [contents]|            Contents|
|[volume, chapter,...|   VOLUME I   CHA...|
|[volume, ii, chap...| VOLUME II   CHAP...|
|[volume, iii, cha...| VOLUME II

#### Adding the Paragraph Number

In [119]:
w = Window().orderBy("paragraph")
dff_ids = dff.select(row_number().over(w).alias("ID"), col("*"))
dff_ids.show()

+---+--------------------+--------------------+
| ID|                keys|           paragraph|
+---+--------------------+--------------------+
|  1|                  []|                    |
|  2|               [pag]|                 ...|
|  3|[_algernon, black...|                 ...|
|  4|                  []|                 ...|
|  5|   [horace, walpole]|                 ...|
|  6|                [ii]|                 ...|
|  7|                [iv]|                 ...|
|  8|                [ix]|                 ...|
|  9|              [note]|                 ...|
| 10|                 [v]|                 ...|
| 11|                [vi]|                 ...|
| 12|                 [x]|                 ...|
| 13|                [xi]|                 ...|
| 14|               [iii]|                 ...|
| 15|   [castle, otranto]|                 ...|
| 16|               [vii]|                 ...|
| 17|              [viii]|                 ...|
| 18|               [xii]|              

In [123]:
dff_ids.collect()[799]

Row(ID=800, keys=['crimino', 'esse', 'de', 'bola', 'chata', 'que', 'na', 'eschola', 'de', 'marte', 'inda', 'menino', 'e', 'ao', 'falso', 'pastor', 'pastor', 'sem', 'tino', 'que', 'mal', 'das', 'ovelhas', 'cura', 'e', 'tracta'], paragraph='    Só crimino esse heróe de bola chata       Que na eschola de Marte inda é menino       E ao falso pastor  pastor sem tino       Que tão mal das ovelhas cura  e tracta:')
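
Because the window orders by the `paragraph` text, the IDs follow alphabetical order across all files rather than reading order within each book. A per-book index starting at 1 can be sketched in plain Python as below, assuming each paragraph is already paired with its book name and arrives in reading order; `number_paragraphs` and the sample rows are hypothetical.

```python
from collections import defaultdict


def number_paragraphs(rows):
    """Assign a 1-based paragraph index per book, preserving input order."""
    counters = defaultdict(int)
    numbered = []
    for book, paragraph in rows:
        counters[book] += 1
        numbered.append((book, counters[book], paragraph))
    return numbered


rows = [
    ("Dracula", "first paragraph"),
    ("Elegia", "primeiro parágrafo"),
    ("Dracula", "second paragraph"),
]
numbered = number_paragraphs(rows)
for row in numbered:
    print(row)
```

In Spark, a comparable result would typically partition the window by a book column and order by a column that preserves position; the DataFrame at this point has neither.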

In [124]:
dff_ids.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- keys: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- paragraph: string (nullable = true)



In [125]:
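# Note: this comparison only drops exactly-empty strings; whitespace-only
# paragraphs pass. Comparing trim(col("paragraph")) to "" would drop those too.
dff_ids = dff_ids.filter(dff_ids.paragraph != '')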

In [126]:
dff_ids.show()

+---+--------------------+--------------------+
| ID|                keys|           paragraph|
+---+--------------------+--------------------+
|  1|                  []|                    |
|  2|               [pag]|                 ...|
|  3|[_algernon, black...|                 ...|
|  4|                  []|                 ...|
|  5|   [horace, walpole]|                 ...|
|  6|                [ii]|                 ...|
|  7|                [iv]|                 ...|
|  8|                [ix]|                 ...|
|  9|              [note]|                 ...|
| 10|                 [v]|                 ...|
| 11|                [vi]|                 ...|
| 12|                 [x]|                 ...|
| 13|                [xi]|                 ...|
| 14|               [iii]|                 ...|
| 15|   [castle, otranto]|                 ...|
| 16|               [vii]|                 ...|
| 17|              [viii]|                 ...|
| 18|               [xii]|              

### Selecting the Rows with Titles

In [127]:
# dff_ids has no "value_s1" column (see the schema above), so select "paragraph" directly
dfftc = dff_ids.select("paragraph")

In [130]:
dff_title = dff_ids.withColumn("title", col("paragraph").contains("Title:"))

In [131]:
dff_title.show()

+---+--------------------+--------------------+-----+
| ID|                keys|           paragraph|title|
+---+--------------------+--------------------+-----+
|  1|                  []|                    |false|
|  2|               [pag]|                 ...|false|
|  3|[_algernon, black...|                 ...|false|
|  4|                  []|                 ...|false|
|  5|   [horace, walpole]|                 ...|false|
|  6|                [ii]|                 ...|false|
|  7|                [iv]|                 ...|false|
|  8|                [ix]|                 ...|false|
|  9|              [note]|                 ...|false|
| 10|                 [v]|                 ...|false|
| 11|                [vi]|                 ...|false|
| 12|                 [x]|                 ...|false|
| 13|                [xi]|                 ...|false|
| 14|               [iii]|                 ...|false|
| 15|   [castle, otranto]|                 ...|false|
| 16|               [vii]|  

In [139]:
dff_db = dff_title.withColumn("language", col("paragraph").contains("Language:"))

In [140]:
dff_db.show()

+---+--------------------+--------------------+-----+--------+
| ID|                keys|           paragraph|title|language|
+---+--------------------+--------------------+-----+--------+
|  1|                  []|                    |false|   false|
|  2|               [pag]|                 ...|false|   false|
|  3|[_algernon, black...|                 ...|false|   false|
|  4|                  []|                 ...|false|   false|
|  5|   [horace, walpole]|                 ...|false|   false|
|  6|                [ii]|                 ...|false|   false|
|  7|                [iv]|                 ...|false|   false|
|  8|                [ix]|                 ...|false|   false|
|  9|              [note]|                 ...|false|   false|
| 10|                 [v]|                 ...|false|   false|
| 11|                [vi]|                 ...|false|   false|
| 12|                 [x]|                 ...|false|   false|
| 13|                [xi]|                 ...|false|  

### Selecting Title

In [141]:
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
dff_db.createOrReplaceTempView("tabledb")

In [144]:
titledb = spark.sql("""
select *
from tabledb
where title = true
""")

In [145]:
titledb.show()

+-----+--------------------+--------------------+-----+--------+
|   ID|                keys|           paragraph|title|language|
+-----+--------------------+--------------------+-----+--------+
| 1487|     [title, elegia]|       Title: Elegia| true|   false|
| 1488|[title, improviso...|  Title: Improvis...| true|   false|
| 1489|[title, queixumes...|  Title: Queixume...| true|   false|
| 1490|[title, fall, hou...|  Title: The Fall...| true|   false|
| 1491|[title, phantom, ...|  Title: The Phan...| true|   false|
| 1492|    [title, wendigo]|  Title: The Wendigo| true|   false|
|10509|[title, mountains...|Title: At the mou...| true|   false|
|10510|    [title, dracula]|      Title: Dracula| true|   false|
|10511|[title, frankenst...|Title: Frankenste...| true|   false|
|10512|[title, medusa, c...|Title: Medusa s coil| true|   false|
|10513|[title, os, jardi...|Title: Os jardins...| true|   false|
|10514|[title, poesias, ...|Title: Poesias Er...| true|   false|
|10515|[title, castle, o.

### Selecting Language

In [146]:
languagedb = spark.sql("""
select *
from tabledb
where language = true
""")

In [147]:
languagedb.show()

+----+--------------------+--------------------+-----+--------+
|  ID|                keys|           paragraph|title|language|
+----+--------------------+--------------------+-----+--------+
|7739| [language, english]|   Language: English|false|    true|
|7740| [language, english]|   Language: English|false|    true|
|7741| [language, english]|   Language: English|false|    true|
|7742| [language, english]|   Language: English|false|    true|
|7743| [language, english]|   Language: English|false|    true|
|7744| [language, english]|   Language: English|false|    true|
|7745| [language, english]|   Language: English|false|    true|
|7746| [language, english]|   Language: English|false|    true|
|7747| [language, english]|   Language: English|false|    true|
|7748| [language, english]|   Language: English|false|    true|
|7749| [language, english]|   Language: English|false|    true|
|7750| [language, english]|   Language: English|false|    true|
|7751| [language, english]|   Language: 

### Title Table

In [87]:
column_title = dff.filter(col("paragraph").contains("Title:"))

In [88]:
column_title = column_title.select("paragraph")

In [89]:
column_title.show()

+--------------------+
|           paragraph|
+--------------------+
|Title: The Myster...|
|      Title: Dracula|
|  Title: The Phan...|
|Title: Frankenste...|
|Title: At the mou...|
|Title: Os jardins...|
|Title: The Castle...|
|Title: Poesias Er...|
|Title: The Strang...|
|  Title: The Wendigo|
|Title: Medusa s coil|
|Title: The call o...|
|Title: The colour...|
|Title: The lurkin...|
|  Title: The Fall...|
|Title: The curse ...|
|  Title: Improvis...|
|  Title: Queixume...|
|    Title: The Raven|
|       Title: Elegia|
+--------------------+



In [90]:
column_title = column_title.withColumn(
    "title", regexp_replace(col("paragraph"), "Title: ", "")
)
column_title = column_title.select("title")

In [91]:
column_title.show()

+--------------------+
|               title|
+--------------------+
|The Mysteries of ...|
|             Dracula|
|  The Phantom of ...|
|Frankenstein     ...|
|At the mountains ...|
|Os jardins ou a a...|
|The Castle of Otr...|
|Poesias Eroticas ...|
|The Strange Case ...|
|         The Wendigo|
|       Medusa s coil|
| The call of Cthulhu|
|The colour out of...|
|    The lurking fear|
|  The Fall of the...|
|    The curse of Yig|
|  Improvisos de B...|
|  Queixumes do Pa...|
|           The Raven|
|              Elegia|
+--------------------+



### Language Table

In [148]:
column_language = dff.filter(col("paragraph").contains("Language:"))

In [149]:
column_language = column_language.select("paragraph")

In [150]:
column_language = column_language.withColumn(
    "language", regexp_replace(col("paragraph"), "Language: ", "")
)
column_language = column_language.select("language")

In [151]:
column_language.show()

+----------+
|  language|
+----------+
|   English|
|   English|
|   English|
|   English|
|   English|
|Portuguese|
|   English|
|Portuguese|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|   English|
|Portuguese|
|Portuguese|
|   English|
|Portuguese|
+----------+
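
The title and language filters rely on the `Title:` and `Language:` lines of the Project Gutenberg header. As a sketch of the same extraction in plain Python, the hypothetical `parse_header` helper below pulls both fields from one book's paragraphs, which keeps each title paired with its language instead of producing two separate tables.

```python
import re

# Matches header lines such as "Title: Dracula" or "   Language: English"
HEADER_FIELD = re.compile(r"^\s*(Title|Language):\s*(.+?)\s*$")


def parse_header(paragraphs):
    """Return the first Title and Language found in a book's paragraphs."""
    fields = {}
    for paragraph in paragraphs:
        match = HEADER_FIELD.match(paragraph)
        if match and match.group(1) not in fields:
            fields[match.group(1)] = match.group(2)
    return fields


book = [
    "The Project Gutenberg eBook of The Mysteries of Udolpho  by Ann Radcliffe",
    "Title: The Mysteries of Udolpho",
    "Author: Ann Radcliffe",
    "   Language: English",
]
header = parse_header(book)
print(header)
```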

