## Projeto de Bloco: Engenharia de Dados: Big Data
### TP5
    
- Ingestão
    - Selecionar os 30 livros no formato Plain Text UTF-8
    - Adicionar os arquivos no HDFS
- Limpeza e Normalização
    - Importação dos dados em DataFrames Spark
    - Remoção de caracteres descartáveis
    - Remoção de linha descartáveis
    - Remoção de stop-words
    - Aplicação de lematização
    - Utilização de técnicas complementares, caso julgue necessário
    - DataFrame unificado de informações dos 30 livros selecionados, com as seguintes colunas:
        - Nome do Livro
        - Idioma do Livro
        - Número do Parágrafo (começando com o índice #1)
        - Parágrafo Original (antes da limpeza e normalização)
        - Conjunto de Palavras do Parágrafo (após limpeza e normalização)
        - Outras colunas, caso julgue necessário
- Análise e Mensuração dos tempos
- Responder às seguintes perguntas:
    - Quantidade de palavras únicas utilizadas por livro
    - Quantidade de parágrafos e palavras não-únicas por parágrafo por livro
    - Identificar a palavra que mais aparece e a palavra que menos aparece por parágrafo por livro
    - Dos livros em inglês, selecionar as top-10 palavras que mais aparecem. Fazer o mesmo para os livros em português
    - Montar dois gráficos9,10 de linhas, um para os livros em inglês e o outro para os em português, com o eixo X sendo o índice dos parágrafos, o eixo Y sendo a quantidade de palavras únicas e a linha o livro em questão
    - Montar dois histogramas8,9,10, um para os livros em inglês e o outro para os em português, para análise da frequência de palavras dos livros
- Exibir o tempo gasto no processamento de cada uma das respostas acima


In [1]:
import findspark
import os
findspark.init()
from pyspark import SparkContext, SparkConf, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, row_number, max, regexp_replace, trim, split, array_contains, lower, lit, monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window
from pyspark.ml.feature import StopWordsRemover
from pathlib import Path

In [None]:
conf = SparkConf().setAppName("luis-barbosa-pb-tp5")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [None]:
%%time
! hadoop fs -put ../datasets/books

In [None]:
! hadoop fs -ls hdfs://node-master:9000/user/root

### Auxiliary Functions

In [None]:
def shape(df):
    return (df.count(), len(df.columns))

### Reading txt file

In [None]:
dataset_path = "hdfs://node-master:9000/user/root/books"

In [None]:
def dfft(file, field, type): 
    return spark.read.format(
        "csv"
    ).option(
        "header", "false"
    ).schema(
        StructType([
            StructField(field, type, True)
        ])
    ).load(
        file
    )
    

In [None]:
def captureValue(df, field, text, collumn): 
    return df.filter(col(field).contains(text)
        ).select(trim(split(col(field), ":").getItem(1)).alias(collumn)
        ).withColumn(
            collumn, regexp_replace(col(collumn), "[\r\n]", " ")
        ).withColumn(
            collumn, regexp_replace(col(collumn), "\s+", " ")
        ).first()[0]

In [2]:
files_rdd = spark.sparkContext.wholeTextFiles(f"{dataset_path}/*.txt")
index = 1
dfs = []
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"

# Percorre cada arquivo .txt separadamente
for file_path, file_content in files_rdd.collect():
    dfft_ = spark.read.text(
        [file_path],
        lineSep=WINDOWS_SEP
    ).withColumnRenamed("value", "original_paragraph") \
    .withColumn("id", monotonically_increasing_id() + 1)
            
    language = captureValue(dfft_, "original_paragraph", "Language:", "language")
    title = captureValue(dfft_, "original_paragraph", "Title:", "title")  
    
    dfft_ = dfft_.withColumn("language", lit(language)
                ).withColumn("title", lit(title))
    
        
    dfft_s1 = dfft_.withColumn(
        "clean_paragraph", regexp_replace(col("original_paragraph"), "[\:\r\n\`\'\;\,]", " ")
    ).withColumn(
        "clean_paragraph", regexp_replace(col("original_paragraph"), "[^A-z0-9\ ]", "")
    ).withColumn(
        "clean_paragraph", regexp_replace(col("original_paragraph"), "\s+", " ")
    )
    
    dfftc_s1 = dfft_s1.withColumn("words", split(lower(col("clean_paragraph")), " "))
    remover = StopWordsRemover(inputCol="words", outputCol="keys")
    dfftc_s2 = remover.transform(dfftc_s1)
    dfftc_s2 = dfftc_s2.select(
        "id",
        "original_paragraph",
        "keys",
        "title",
        "language"
    )
    index += 1
    
    dfs.append(dfftc_s2)

dfftc_s2.collect()[0]

NameError: name 'spark' is not defined

In [206]:
# Unifica os DataFrames
merged_df = dfs[0]
for df in dfs[1:]:
    merged_df = merged_df.union(df)

# Exibe o DataFrame unificado
test = merged_df.orderBy(merged_df["language"].desc())
test.show()

+----------+--------------------+--------------------+--------------------+--------+
|        id|  original_paragraph|                keys|               title|language|
+----------+--------------------+--------------------+--------------------+--------+
|8589934593|The Project Guten...|[project, gutenbe...|Abbé Mouret's Tra...| English|
|8589934594|This eBook is for...|[ebook, use, anyo...|Abbé Mouret's Tra...| English|
|8589934595|Title: A Pata da ...|[title:, pata, da...|Abbé Mouret's Tra...| English|
|8589934596|Author: José Mart...|[author:, josé, m...|Abbé Mouret's Tra...| English|
|8589934597|Release Date: Apr...|[release, date:, ...|Abbé Mouret's Tra...| English|
|8589934598|Language: Portuguese|[language:, portu...|Abbé Mouret's Tra...| English|
|8589934599|Produced by: Laur...|[produced, by:, l...|Abbé Mouret's Tra...| English|
|8589934600|*** START OF THE ...|[***, start, proj...|Abbé Mouret's Tra...| English|
|8589934601|             
SENIO|           [, senio]|Abbé Mouret

In [185]:
merged_df.registerTempTable("tabledb")
test = spark.sql('''
    select *
    from tabledb
    where language = "Portuguese"
''')

test.show()

+---+------------------+----+-----+--------+
| id|original_paragraph|keys|title|language|
+---+------------------+----+-----+--------+
+---+------------------+----+-----+--------+



In [106]:
shape(dfft_)

(7166, 4)

In [22]:
WINDOWS_SEP = "\r\n\r\n"
UNIX_LIKE_SEP = "\n\n"

In [23]:
dfft = spark.read.text(
    [dataset_path],
    lineSep=WINDOWS_SEP
).withColumnRenamed("value", "original_paragraph")


In [24]:
shape(dfft)

(39311, 1)

In [25]:
dfft.printSchema()

root
 |-- original_paragraph: string (nullable = true)



In [26]:
dfft.head(20)

[Row(original_paragraph="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola"),
 Row(original_paragraph='This eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.'),
 Row(original_paragraph="Title: Abbé Mouret's Transgression\r\nLa Faute De L'abbé Mouret"),
 Row(original_paragraph='Author: Émile Zola'),
 Row(original_paragraph='Editor: Ernest Alfred Vizetelly'),
 Row(original_paragraph='Release Date: November 28, 2004 [eBook #14200]\r\n[Most recently updated: June 8, 2021]'),
 Row(original_paragraph='Language: English'),
 Row(original_paragraph='Character set encoding:

### Data Wrangling

In [49]:
dfft_s1 = dfft.withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), "[\:\r\n\`\'\;\,]", " ")
).withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), "[^A-z0-9\ ]", "")
).select(
    "original_paragraph",
    trim(col("clean_paragraph")).alias("clean_paragraph")
).withColumn(
    "clean_paragraph", regexp_replace(col("original_paragraph"), "\s+", " ")
)

#.withColumn(
#    "language", regexp_replace(col("original_paragraph"), "Language:\s*([^\n]+)", "$1")
#)

In [48]:
dfft_s1.collect()

[Row(original_paragraph="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola", clean_paragraph="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola", language="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola"),
 Row(original_paragraph='This eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.', clean_paragraph='This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms 

### Removing Stop Words

In [50]:
dfftc_s1 = dfft_s1.withColumn("words", split(lower(col("clean_paragraph")), " "))

In [51]:
dfftc_s1.printSchema()

root
 |-- original_paragraph: string (nullable = true)
 |-- clean_paragraph: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [218]:
dfftc_s1.head(10)

[Row(original_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', clean_paragraph='The Project Gutenberg eBook, Aaron Trow, by Anthony Trollope', words=['the', 'project', 'gutenberg', 'ebook,', 'aaron', 'trow,', 'by', 'anthony', 'trollope']),
 Row(original_paragraph="\r\nThis eBook is for the use of anyone anywhere in the United States and most\r\nother parts of the world at no cost and with almost no restrictions\r\nwhatsoever.  You may copy it, give it away or re-use it under the terms of\r\nthe Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.  If you are not located in the United States, you'll have\r\nto check the laws of the country where you are located before using this ebook.", clean_paragraph=" This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project

In [53]:
remover = StopWordsRemover(inputCol="words", outputCol="keys")

In [54]:
dfftc_s2 = remover.transform(dfftc_s1)

In [55]:
dfftc_s2.head()

Row(original_paragraph="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola", clean_paragraph="The Project Gutenberg eBook of Abbé Mouret's Transgression, by Émile Zola", words=['the', 'project', 'gutenberg', 'ebook', 'of', 'abbé', "mouret's", 'transgression,', 'by', 'émile', 'zola'], keys=['project', 'gutenberg', 'ebook', 'abbé', "mouret's", 'transgression,', 'émile', 'zola'])

In [64]:
for test in dfftc_s2.collect():
    

SyntaxError: unexpected EOF while parsing (<ipython-input-64-33765c24f383>, line 2)

### Registering search table

In [223]:
dff = dfftc_s2.select(
    "original_paragraph", "clean_paragraph", "keys"
)

In [224]:
dff.show()

+--------------------+--------------------+--------------------+
|  original_paragraph|     clean_paragraph|                keys|
+--------------------+--------------------+--------------------+
|The Project Guten...|The Project Guten...|[project, gutenbe...|
|
This eBook is f...| This eBook is fo...|[, ebook, use, an...|
|                    |                    |                  []|
| 
Title: Aaron Trow|   Title: Aaron Trow|[, title:, aaron,...|
|
Author: Anthony...| Author: Anthony ...|[, author:, antho...|
|                    |                    |                  []|
|Release Date: Jan...|Release Date: Jan...|[release, date:, ...|
|   Language: English|   Language: English|[language:, english]|
|Character set enc...|Character set enc...|[character, set, ...|
|
***START OF THE...| ***START OF THE ...|[, ***start, proj...|
|
Transcribed fro...| Transcribed from...|[, transcribed, 1...|
|                    |                    |                  []|
|                    |   

In [150]:
dff.registerTempTable("books")

In [232]:
result = spark.sql(f"""
select keys as language
from books
where array_contains(keys, "convict")
""")

In [233]:
result.count()

5

In [236]:
result.collect()

[Row(language=['bermuda,', 'world', 'knows,', 'british', 'colony', 'maintain', 'convict', 'establishment.', 'outlying', 'convict', 'establishments', 'sent', 'back', 'upon', 'hands', 'colonies,', 'one', 'still', 'maintained.', 'also', 'islands', 'strong', 'military', 'fortress,', 'though', 'fortress', 'looking', 'magnificent', 'eyes', 'civilians,', 'malta', 'gibraltar.', 'also', 'six', 'thousand', 'white', 'people', 'six', 'thousand', 'black', 'people,', 'eating,', 'drinking,', 'sleeping,', 'dying.']),
 Row(language=['convict', 'establishment', 'notable', 'feature', 'bermuda', 'stranger,', 'seem', 'attract', 'much', 'attention', 'regular', 'inhabitants', 'place.', 'intercourse', 'prisoners', 'bermudians.', 'convicts', 'rarely', 'seen', 'them,', 'convict', 'islands', 'rarely', 'visited.', 'prisoners', 'themselves,', 'course', 'open', 'them—or', 'open', 'them—to', 'intercourse', 'prison', 'authorities.']),
 Row(language=['have,', 'however,', 'instances', 'convicts', 'escaped', 'confinemen

### Calculating score

In [91]:
multi_kws = "alice rabbit door"

#### Jaccard Similarity

In [92]:
score = spark.sql(f"""
select *, size(intersection_)/size(union_) score
from (
    select *,
           array_intersect(keys, split(lower("{multi_kws}"), " ")) intersection_,
           array_union(keys, split(lower("{multi_kws}"), " ")) union_
    from finder
    where size(array_intersect(keys, split(lower("{multi_kws}"), " "))) > 0
) tmp
order by score desc
limit 10
""")

- uma palavra: 264 ms
- quatro palavras: 326 ms
- dez palavras: 314 ms

In [93]:
%%time
score.show()

+--------------------+--------------------+-------------+--------------------+------------------+
|                keys|           paragraph|intersection_|              union_|             score|
+--------------------+--------------------+-------------+--------------------+------------------+
|[alice, went, tim...|Alice went timidl...|[alice, door]|[alice, went, tim...|0.3333333333333333|
|       [said, alice]|But what am I to ...|      [alice]|[said, alice, rab...|              0.25|
|     [alice, silent]|    Alice was silent|      [alice]|[alice, silent, r...|              0.25|
|       [said, alice]| What for said Alice|      [alice]|[said, alice, rab...|              0.25|
|   [inquired, alice]|What was that inq...|      [alice]|[inquired, alice,...|              0.25|
|   [alice, evidence]|    Alice s Evidence|      [alice]|[alice, evidence,...|              0.25|
|[said, alice, duc...|Very said Alice w...|      [alice]|[said, alice, duc...|               0.2|
|[alice, adventure..

#### Cosine Similarity

In [43]:
def cossim(v1, v2):
    v = v2.map(lambda elem: 1 if array_contains(v1, elem) else 0)
    

links:
- [spark-sql-array-funcs](https://kontext.tech/article/587/spark-sql-array-functions)
- [spark-map-syntax](https://sparkbyexamples.com/pyspark/pyspark-map-transformation/)