# Simple MapReduce Process

> What is MapReduce? Why is it important?

MapReduce is a programming model for Big Data that splits the work into independent tasks and distributes them across the nodes of a cluster.

MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware). Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured). MapReduce can take advantage of the locality of data, processing it near the place it is stored in order to minimize communication overhead.

A MapReduce framework (or system) is usually composed of three operations (or steps):

• Map: each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of the redundant input data is processed.

• Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.

• Reduce: worker nodes now process each group of output data, per key, in parallel.
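
The three steps above can be sketched in plain Python on a single machine. This is only an illustration under simplifying assumptions: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample sentence are made up for this example, and nothing is actually distributed across worker nodes.

```python
from collections import defaultdict

def map_phase(words):
    """Map: emit a (key, 1) pair for every word."""
    return [(word, 1) for word in words]

def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key into one group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Reduce: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample input, not taken from StarWars.txt
sample = "a sonda vai para a base".split()
pairs = map_phase(sample)
groups = shuffle_phase(pairs)
counts = reduce_phase(groups)
print(counts)
```

In a real framework each phase would run on many nodes at once; here the phases simply run one after another, which is enough to show how the data changes shape from words, to pairs, to groups, to counts.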

In [1]:
# Import libraries
import pandas as pd
import re

In [2]:
# Read the whole file and split it into words
# (the original loop kept only the words of the last line)
with open('StarWars.txt') as f:
    texto = f.read().split()

In [3]:
# Map: emit a (word, 1) pair for every word
def myfunc(n):
    return n, 1

x = map(myfunc, texto)
list(x)

[('Três', 1),
 ('anos', 1),
 ('após', 1),
 ('a', 1),
 ('destruição', 1),
 ('da', 1),
 ('Estrela', 1),
 ('da', 1),
 ('Morte,', 1),
 ('a', 1),
 ('Aliança', 1),
 ('Rebelde', 1),
 ('continua', 1),
 ('fugindo', 1),
 ('da', 1),
 ('ameaça', 1),
 ('do', 1),
 ('Império', 1),
 ('Galáctico,', 1),
 ('que', 1),
 ('manda', 1),
 ('sondas', 1),
 ('espiãs', 1),
 ('por', 1),
 ('toda', 1),
 ('a', 1),
 ('Galáxia', 1),
 ('em', 1),
 ('busca', 1),
 ('dos', 1),
 ('rebeldes.', 1),
 ('Uma', 1),
 ('das', 1),
 ('sondas', 1),
 ('vai', 1),
 ('para', 1),
 ('o', 1),
 ('planeta', 1),
 ('gelado', 1),
 ('de', 1),
 ('Hoth,', 1),
 ('onde', 1),
 ('a', 1),
 ('Aliança', 1),
 ('Rebelde', 1),
 ('montou', 1),
 ('a', 1),
 ('base', 1),
 ('Echo.', 1),
 ('Han', 1),
 ('Solo', 1),
 ('planeja', 1),
 ('partir', 1),
 ('para', 1),
 ('pagar', 1),
 ('o', 1),
 ('dinheiro', 1),
 ('que', 1),
 ('deve', 1),
 ('ao', 1),
 ('gangster', 1),
 ('Jabba', 1),
 ('the', 1),
 ('Hutt,', 1),
 ('embora', 1),
 ('tenha', 1),
 ('sentimentos', 1),
 ('pela', 1),


In [6]:
# Shuffle
# Build a DataFrame from the word list so that characters adding no value can be stripped
df = pd.DataFrame({'Texto': texto})
# Remove commas, parentheses and periods
df['Texto'] = df['Texto'].apply(lambda x: re.sub(',','',x))
df['Texto'] = df['Texto'].apply(lambda x: x.replace('(',''))
df['Texto'] = df['Texto'].apply(lambda x: x.replace(')',''))
df['Texto'] = df['Texto'].apply(lambda x: x.replace('.',''))
# Every word counts as one occurrence
df['Contagem'] = 1
# Sort so that identical words end up next to each other
df = df.sort_values('Texto',ignore_index=True)
df.head(12)

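|   | Texto | Contagem |
|---|-------|----------|
| 0 | - | 1 |
| 1 | - | 1 |
| 2 | - | 1 |
| 3 | 5 | 1 |
| 4 | AT-AT | 1 |
| 5 | AT-AT | 1 |
| 6 | AT-ST | 1 |
| 7 | Aliança | 1 |
| 8 | Aliança | 1 |
| 9 | Aliança | 1 |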
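
Sorting works as a shuffle because it places equal keys next to each other, so the sorted stream can be cut into one group per key. The sketch below shows that idea with `itertools.groupby` instead of pandas; the `pairs` list is a hypothetical stand-in for the map output, not data from the notebook.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (word, 1) pairs, as a map step might emit them
pairs = [("Rebelde", 1), ("Aliança", 1), ("sondas", 1), ("Aliança", 1)]

# Sorting by key places equal keys next to each other ...
sorted_pairs = sorted(pairs, key=itemgetter(0))

# ... so groupby can cut the sorted stream into one group per key
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(sorted_pairs, key=itemgetter(0))
}
print(grouped)
```

Note that `groupby` only merges adjacent items, so skipping the sort would split a repeated key into several groups.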


In [5]:
# And finally: Reduce
dict(df['Texto'].value_counts())

{'e': 65,
 'a': 40,
 'que': 38,
 'de': 35,
 'o': 35,
 'Luke': 30,
 'em': 23,
 'da': 23,
 'se': 21,
 'para': 19,
 'Vader': 19,
 'seu': 17,
 'Han': 13,
 'um': 13,
 'com': 13,
 'os': 13,
 'uma': 12,
 'é': 12,
 'Leia': 11,
 'ao': 10,
 'ele': 9,
 'do': 8,
 'Lando': 8,
 'Força': 7,
 'Falcon': 7,
 'caverna': 7,
 'não': 7,
 'por': 7,
 'na': 6,
 'Império': 6,
 'cidade': 5,
 'à': 5,
 'C-3PO': 5,
 'sua': 5,
 'consegue': 5,
 'encontra': 5,
 'eles': 5,
 'Yoda': 5,
 'Skywalker': 5,
 'Darth': 4,
 'Rebelde': 4,
 'Sombrio': 4,
 'Bespin': 4,
 'das': 4,
 'diz': 4,
 'Obi-Wan': 4,
 'Lado': 4,
 'Aliança': 4,
 'planeta': 4,
 'no': 4,
 'Millenium': 4,
 'vão': 4,
 'Carbonita': 4,
 'Solo': 4,
 'Ele': 4,
 'mestre': 4,
 'Imperador': 4,
 'Chewie': 4,
 'fica': 4,
 'onde': 4,
 'velho': 3,
 'após': 3,
 'tentar': 3,
 'num': 3,
 'R2': 3,
 'Ao': 3,
 'mas': 3,
 'sala': 3,
 'vai': 3,
 'nave': 3,
 'ferido': 3,
 'tenta': 3,
 'criatura': 3,
 'dos': 3,
 '-': 3,
 'treinamento': 3,
 'manda': 3,
 'sobrevive': 3,
 'Dagobah': 3,
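
`value_counts()` performs the whole reduction in one call. To make the reduce step itself visible, the same totals can be built by folding (word, 1) pairs into a dictionary with `functools.reduce`, the counterpart of the built-in `map` used earlier. This is a minimal sketch: the helper `add_pair` and the `pairs` list are hypothetical and not part of the notebook.

```python
from functools import reduce

def add_pair(counts, pair):
    """Fold one (word, 1) pair into the running totals."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

# Hypothetical (word, 1) pairs standing in for the map output
pairs = [("Luke", 1), ("Vader", 1), ("Luke", 1), ("Yoda", 1), ("Luke", 1)]

# Start from an empty dict and add each pair in turn
totals = reduce(add_pair, pairs, {})
print(totals)
```

Because the keys are compared as exact strings, capitalized and lowercase forms of a word are counted separately, which is also why `'Ele'` and `'ele'` appear as distinct entries in the output above.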
 