# Exercises with Spark

In [1]:
import pyspark

conf = pyspark.SparkConf()
conf.setAppName('Aplicação')
conf.setMaster('local[*]')

sc = pyspark.SparkContext(conf=conf)

In [2]:
sc

For this exercise we will work with the Amazon reviews dataset available at https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews. Download the file "train.csv".

Let's read the "train.csv" file into an RDD.

In [3]:
rdd = sc.textFile('train.csv')

In [4]:
rdd.take(1)

['"2","Stuning even for the non-gamer","This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^"']

According to the documentation for this file on Kaggle, each line contains 3 elements: the review sentiment (1 for negative, 2 for positive), the title, and the body of the review. The line holds these elements in comma-separated values (CSV) format, with each field delimited by double quotes. If the text itself (title or body) contains a double quote, it appears as a pair of double quotes. Let's use `.filter()` to find an example of this.

In [5]:
example_line = rdd.filter(lambda x: '""' in x).take(1)
example_line = example_line[0]

example_line

'"2","Amazing!","This soundtrack is my favorite music of all time, hands down. The intense sadness of ""Prisoners of Fate"" (which means all the more if you\'ve played the game) and the hope in ""A Distant Promise"" and ""Girl who Stole the Star"" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like ""Chrono Cross ~ Time\'s Scar~"", ""Time of the Dreamwatch"", and ""Chronomantique"" (indefinably remeniscent of Chrono Trigger) are all absolutely superb as well.This soundtrack is amazing music, probably the best of this composer\'s work (I haven\'t heard the Xenogears soundtrack, so I can\'t say for sure), and even if you\'ve never played the game, it would be worth twice the price to buy it.I wish I could give it 6 stars."'

Taking this into account, let's write a simple function to split the fields:

In [6]:
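def parse_line(line):
    # Drop the outer quotes, then split on the quote-comma-quote separator.
    parts = line[1:-1].split('","')
    sentiment = int(parts[0])
    # Un-escape doubled quotes inside the text fields.
    title = parts[1].replace('""', '"')
    body = parts[2].replace('""', '"')
    return (sentiment, title, body)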

In [7]:
parse_line(example_line)

(2,
 'Amazing!',
 'This soundtrack is my favorite music of all time, hands down. The intense sadness of "Prisoners of Fate" (which means all the more if you\'ve played the game) and the hope in "A Distant Promise" and "Girl who Stole the Star" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like "Chrono Cross ~ Time\'s Scar~", "Time of the Dreamwatch", and "Chronomantique" (indefinably remeniscent of Chrono Trigger) are all absolutely superb as well.This soundtrack is amazing music, probably the best of this composer\'s work (I haven\'t heard the Xenogears soundtrack, so I can\'t say for sure), and even if you\'ve never played the game, it would be worth twice the price to buy it.I wish I could give it 6 stars.')
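
The simple split on `","` works for the example above, but it would misread a field whose text contains a quote directly next to a comma, since that is stored as `"",""` and looks like a field separator. Below is a minimal sketch of a more defensive variant, assuming Python's standard `csv` module with its default dialect (comma delimiter, doubled quotes as escapes). The name `parse_line_csv` and the sample line are hypothetical and not part of the original notebook.

```python
import csv

def parse_line_csv(line):
    # Hypothetical alternative to parse_line: let csv.reader apply the
    # quoting rules instead of splitting on '","' by hand.
    sentiment, title, body = next(csv.reader([line]))
    return (int(sentiment), title, body)

# A made-up line whose body contains the text: foo","bar
tricky = '"1","t","foo"",""bar"'
parsed = parse_line_csv(tricky)
print(parsed)
```

It could be dropped into the pipeline as `rdd.map(parse_line_csv)`. Whether the extra per-line cost is worth it depends on how often such fields occur in the data.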

We can now use our function to split the fields of every line in the dataset.

In [8]:
rdd_split = rdd.map(parse_line).cache()

As usual, nothing actually happens until an "action" is invoked. `.map()` is a "transformation". Let's use a simple action to "materialize" the new RDD.
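
As a rough analogy only (this is plain Python, not Spark), the built-in `map` is also lazy: it records what to do and runs the function only when something consumes the result. The sketch below uses a hypothetical `traced_double` function to show that no call happens until `sum` forces evaluation, much like `.count()` forces the `.map(parse_line)` above.

```python
calls = []

def traced_double(x):
    # Record every call so we can see when evaluation really happens.
    calls.append(x)
    return 2 * x

lazy = map(traced_double, [1, 2, 3])  # "transformation": nothing runs yet
calls_before = list(calls)

total = sum(lazy)                     # "action": consuming forces evaluation
calls_after = list(calls)

print(calls_before, total, calls_after)
```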

In [9]:
rdd_split.count()

3600000

Let's explore the results to check that it worked.

In [10]:
rdd_split.take(1)

[(2,
  'Stuning even for the non-gamer',
  'This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^')]

**Activity**: Count how many distinct sentiments exist, and how many times each appears, to confirm that only sentiments 1 and 2 are present:

In [11]:
rdd_split.map(lambda x: x[0]).countByValue()

defaultdict(int, {2: 1800000, 1: 1800000})
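
`countByValue()` is an action that returns to the driver a dictionary mapping each distinct value to its number of occurrences, which is reasonable here because only a handful of distinct labels are expected. The same computation can be sketched locally with `collections.Counter`; the labels below are made up and are not taken from the dataset.

```python
from collections import Counter

# Made-up sentiment labels standing in for rdd_split.map(lambda x: x[0]).
labels = [2, 1, 2, 2, 1]

counts = Counter(labels)   # value -> number of occurrences
distinct = sorted(counts)  # the distinct labels that were seen

print(distinct, dict(counts))
```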

**Activity**: How many reviews have no title?

In [12]:
rdd_split.filter(lambda x: x[1] == "").count()

48

**Activity**: How many reviews have no body?

In [13]:
rdd_split.filter(lambda x: x[2] == "").count()

0

**Activity**: What is the maximum length of a title and of a body?

In [14]:
rdd_split.map(lambda x: (len(x[1]), len(x[2]))) \
         .reduce(lambda x, y: (max(x[0], y[0]), max(x[1], y[1])))

(139, 1010)
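
The combiner passed to `.reduce()` takes two `(title_length, body_length)` pairs and keeps the element-wise maximum. Because that operation is associative and commutative, the order in which Spark merges partitions does not change the result. Here is a local sketch with `functools.reduce` on made-up records; `pairwise_max` and `sample` are hypothetical names.

```python
from functools import reduce

# Made-up (sentiment, title, body) records, not taken from the dataset.
sample = [
    (2, "Great", "Loved it"),
    (1, "Bad", "Broke after two days"),
    (2, "", "ok"),
]

def pairwise_max(x, y):
    # Element-wise maximum of two (title_length, body_length) pairs.
    return (max(x[0], y[0]), max(x[1], y[1]))

lengths = [(len(title), len(body)) for _, title, body in sample]
max_lengths = reduce(pairwise_max, lengths)
max_lengths_reversed = reduce(pairwise_max, reversed(lengths))

print(max_lengths)
```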

**Activity**: What is the longest palindromic word in the title or body?

In [19]:
def split_word(item):
    title = item[1]
    body = item[2]
    return title.split() + body.split()

def is_palindrome(word):
    return word == word[::-1]


rdd_split.flatMap(split_word)\
         .filter(is_palindrome)\
         .take(26)

['I',
 'I',
 'I',
 'a',
 '^_^',
 'a',
 'I',
 'a',
 'a',
 'I',
 'I',
 '~',
 'I',
 'I',
 '6',
 'I',
 'I',
 'I',
 'I',
 '-',
 '-',
 'a',
 'a',
 'a',
 'I',
 'a']
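
The sample above consists mostly of single characters and punctuation, which are trivially palindromes, and `take` does not pick the longest one. One possible way to answer the question is sketched below on a made-up sentence: strip punctuation, lowercase, discard one-character tokens, and reduce by length. The helpers `normalize`, `is_real_palindrome`, and `longer` are hypothetical, and the normalization rules are an assumption about what should count as a word.

```python
import string
from functools import reduce

def normalize(word):
    # Strip leading/trailing punctuation and ignore case.
    return word.strip(string.punctuation).lower()

def is_real_palindrome(word):
    # Require at least two letters so 'I', 'a' and '' are excluded.
    return len(word) > 1 and word.isalpha() and word == word[::-1]

def longer(a, b):
    # Keep the longer of two words (the first one on ties).
    return a if len(a) >= len(b) else b

# Made-up text, not taken from the dataset.
words = "Wow, the Racecar level is a kayak ride ^_^ I".split()
candidates = [w for w in map(normalize, words) if is_real_palindrome(w)]
longest = reduce(longer, candidates)

print(candidates, longest)
```

On the RDD, the same helpers could be chained as `rdd_split.flatMap(split_word).map(normalize).filter(is_real_palindrome).reduce(longer)`.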

**Activity**: What are the 20 most popular words in the titles?