# Spark - RDD - Basic Transformations

In [17]:
#!pip install findspark

In [6]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="RDDBasics")


In [2]:
sc

''

# Creating an RDD with 3 lines of text

* __Resilient__ - the data and the functions to run are sent to several machines, but each chunk of data is unique. Spark keeps the raw data and the execution plan (including the functions), and if one of the workers fails it sends that work to another one.
* __Distributed__ - the data is split across workers. A worker is not a slave: a worker is a process (PID), which means that a single "slave" machine can, and usually does, run more than one worker. That is what makes it possible to use all the processors of that machine.
* __Dataset__ - unstructured data: it accepts any type of data.

In [8]:
rdd_lines = sc.parallelize(["linea 1 Python",
                            "linea 2 Python",
                            "linea 3 Spark"] )


In [None]:
type(rdd_lines)

In [None]:
rdd_lines

In [19]:
rdd_lines.collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

## Note: rdd_lines.collect() is a "code smell"
In other words, your code smells bad, because you are bringing 100% of the data into Python. Are you sure you want to do that?

In [20]:
type(rdd_lines.collect())

list

In [21]:
rdd_numbers = sc.parallelize([1,2,3,3] )

In [22]:
rdd_numbers

ParallelCollectionRDD[10] at readRDDFromFile at PythonRDD.scala:274

.collect() returns the whole RDD as a Python list

In [23]:
rdd_lines.collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [24]:
rdd_numbers.collect() # Action that returns the whole RDD

[1, 2, 3, 3]

In [25]:
rdd_lines.take(10)

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [26]:
rdd_numbers.take(10)

[1, 2, 3, 3]

In [27]:
sc.parallelize([1,2,3,3]).take(10)

[1, 2, 3, 3]

In [28]:
contador = rdd_numbers.count()
contador

4

In [29]:
contador

4

* An __action__ ends the processing in Spark and returns a result to the master (the driver program) as a Python data type.
* A __transformation__ does not trigger Spark processing (it is lazily evaluated), which means that if I check the type of the result, it is a Spark object, not a Python one.
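The difference between the two bullets above can be sketched in plain Python, without Spark. Python's built-in `map` is also lazy, so it works as a rough analogy (it is not how Spark is implemented). The names `calls` and `double` are made up for this sketch.

```python
# Plain-Python analogy for lazy transformations vs. eager actions.
# `calls` and `double` are hypothetical names used only in this sketch.
calls = []

def double(x):
    calls.append(x)      # record that the function really ran
    return x * 2

pipeline = map(double, [1, 2, 3, 3])   # "transformation": nothing runs yet
calls_before = len(calls)              # still 0, no element was processed

result = list(pipeline)                # "action": forces the computation
calls_after = len(calls)               # now 4

print(type(pipeline).__name__, calls_before)   # map 0
print(type(result).__name__, result)           # list [2, 4, 6, 6]
```

As with an RDD, checking the type before the "action" shows a lazy object rather than a list of values.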

## map vs. flatMap transformation
flatMap applies the function and then flattens the result by one level. If the function returns a plain string instead of a list, flattening breaks the string into its characters.
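A minimal pure-Python sketch of the same idea, assuming `itertools.chain.from_iterable` as a stand-in for the flattening step (this emulates the behaviour, it does not call Spark). The helper `split_words` is a hypothetical name.

```python
from itertools import chain

lines = ["linea 1 Python", "linea 2 Python", "linea 3 Spark"]

def split_words(line):
    return line.split(" ")

# map: one output element per input element (here, one list per line)
mapped = [split_words(line) for line in lines]

# flatMap: apply the function, then flatten the results by one level
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

# flatMap with the identity function iterates over each string,
# so every line is broken into its characters
chars = list(chain.from_iterable(lines))

print(mapped)
print(flat_mapped)
print(chars[:5])   # ['l', 'i', 'n', 'e', 'a']
```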

In [30]:
rdd_lines.map(lambda x:x.split(" ")).take(4)

[['linea', '1', 'Python'], ['linea', '2', 'Python'], ['linea', '3', 'Spark']]

In [31]:
rdd_lines.flatMap(lambda x:x.split(" ")).take(10)

['linea', '1', 'Python', 'linea', '2', 'Python', 'linea', '3', 'Spark']

In [32]:
rdd_lines.map(lambda x: x.split(' ')).collect()

[['linea', '1', 'Python'], ['linea', '2', 'Python'], ['linea', '3', 'Spark']]

In [33]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python","linea 3 Spark"]).map(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).count()

2

In [34]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python","linea 3 Spark"]).flatMap(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).count()

3

In [35]:
rdd_lines.flatMap(lambda x: x.split(' ')).collect()

['linea', '1', 'Python', 'linea', '2', 'Python', 'linea', '3', 'Spark']

In [36]:
rdd_lines.map(lambda x:x).collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [37]:
rdd_lines.flatMap(lambda x:x).collect()

['l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '1',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '2',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '3',
 ' ',
 'S',
 'p',
 'a',
 'r',
 'k']

## filter transformation

In [38]:
for item in ["linea 1 Python","linea 2 Python","linea 3 Spark"]:
    print("Python" in item)

True
True
False


In [39]:
list(filter(lambda x: "Python" in x, ["linea 1 Python","linea 2 Python","linea 3 Spark"]))
# This is plain Python, not Spark!

['linea 1 Python', 'linea 2 Python']

In [40]:
rdd_lines.filter(lambda x: "Python" in x).take(10)

['linea 1 Python', 'linea 2 Python']

# Combining flatMap and filter

In [41]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python y Python","linea 3 Spark"]). \
flatMap(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).take(10)

['Python', 'Python', 'Python', 'Python']

In [42]:
rdd_lines.flatMap(lambda x: x.split(' ')).filter(lambda x: "Python" in x).collect()

['Python', 'Python']

In [43]:
rdd_lines.filter(lambda x: "Python" in x).flatMap(lambda x: x.split(' ')).collect()

['linea', '1', 'Python', 'linea', '2', 'Python']

In [44]:
#sc.p.map.filter.flatMap.map.......

### Cache
* Caches in Spark live on the workers (in the memory of the worker process, the PID).
* Caches are volatile (when Spark shuts down, everything is lost).
* Asking for a cache is only a suggestion: if Spark cannot do it, it will not!
* If cached data is not used, Spark eventually evicts it.
* If cached data gets in the way of new processing, Spark evicts it or spills it to disk, depending on the storage level.

In [45]:
rdd_lines.cache()

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

## distinct transformation

In [46]:
rdd_numbers = sc.parallelize([3,3,2,1] )
rdd_numbers.take(10)

[3, 3, 2, 1]

In [47]:
rdd_numbers.distinct().collect()

[1, 2, 3]

In [48]:
rdd_numbers.map(lambda x:x**2).filter(lambda x:x>3).distinct().count()#.take(10)

2

## sample without replacement

In [49]:
rdd_numbers.sample(False,0.5).collect()

[3, 3, 2, 1]

## sample with replacement

In [50]:
rdd_numbers.sample(True,3).collect()

[3, 3, 3, 3, 3, 2, 1, 1]

---

# Transformation SET OPERATIONS

In [51]:
rdd_numbers.collect()

[3, 3, 2, 1]

In [52]:
rdd_more_numbers = sc.parallelize([3,4,2,5])
rdd_more_numbers.collect()

[3, 4, 2, 5]

## union - a plain concatenation that keeps duplicates, like lista1 + lista2 (or lista1.extend(lista2))

In [53]:
rdd_numbers.union(rdd_more_numbers).collect()

[3, 3, 2, 1, 3, 4, 2, 5]

## intersection - INNER JOIN

In [54]:
rdd_numbers.intersection(rdd_more_numbers).collect()

[2, 3]

## subtract - like a LEFT ANTI JOIN (elements of the left RDD that are not in the right one)

In [55]:
rdd_numbers.subtract(rdd_more_numbers).collect()

[1]
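The three set operations so far can be emulated in plain Python to make their handling of duplicates explicit. This is only a sketch of the semantics on ordinary lists, not Spark code; in Spark the order of the results is not guaranteed, so the intersection is sorted here just to get a stable output.

```python
left = [3, 3, 2, 1]     # same values as rdd_numbers
right = [3, 4, 2, 5]    # same values as rdd_more_numbers

# union: plain concatenation, duplicates are kept
union = left + right

# intersection: common elements, duplicates removed
intersection = sorted(set(left) & set(right))

# subtract: elements of `left` that do not appear in `right`
right_values = set(right)
subtract = [x for x in left if x not in right_values]

print(union)         # [3, 3, 2, 1, 3, 4, 2, 5]
print(intersection)  # [2, 3]
print(subtract)      # [1]
```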

## cartesian product

In [56]:
rdd_numbers.cartesian(rdd_more_numbers).collect()

[(3, 3),
 (3, 4),
 (3, 2),
 (3, 5),
 (3, 3),
 (3, 4),
 (3, 2),
 (3, 5),
 (2, 3),
 (2, 4),
 (2, 2),
 (2, 5),
 (1, 3),
 (1, 4),
 (1, 2),
 (1, 5)]

In [57]:
rdd_numbers.cartesian(rdd_more_numbers).map(lambda x: x[0]/x[1]).take(100)

[1.0,
 0.75,
 1.5,
 0.6,
 1.0,
 0.75,
 1.5,
 0.6,
 0.6666666666666666,
 0.5,
 1.0,
 0.4,
 0.3333333333333333,
 0.25,
 0.5,
 0.2]

In [58]:
rdd_numbers.cartesian(rdd_more_numbers).map(lambda x:x[0]+x[1]).collect()

[6, 7, 5, 8, 6, 7, 5, 8, 5, 6, 4, 7, 4, 5, 3, 6]

## Exercise 1) sum rdd1 + rdd2

Expected result:
['(1+3)=4',
 '(1+4)=5',
 '(2+3)=5',
 '(2+4)=6',
 '(1+2)=3',
 '(1+5)=6',
 '(2+2)=4',
 '(2+5)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8',
 '(3+2)=5',
 '(3+5)=8']


In [59]:
rdd1 = sc.parallelize([1,2,3,3])

In [60]:
rdd2 = sc.parallelize([3,4,2,5])

In [61]:
sc.parallelize([(1,2),(3,4)]).map(lambda x: "({}+{})={}".format(x[0],x[1],x[0]+x[1])).take(10)

['(1+2)=3', '(3+4)=7']
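One possible approach to Exercise 1, sketched in plain Python: `itertools.product` stands in for `cartesian`, and the same formatting lambda is applied to every pair. In PySpark the equivalent chain would presumably be `rdd1.cartesian(rdd2).map(...)`; the order of the pairs there depends on how the data is partitioned, so only the contents (not the order) should be compared with the expected result.

```python
from itertools import product

rdd1_values = [1, 2, 3, 3]   # same values as rdd1
rdd2_values = [3, 4, 2, 5]   # same values as rdd2

# every (a, b) combination, formatted like the expected output
pairs = product(rdd1_values, rdd2_values)
result = ["({}+{})={}".format(a, b, a + b) for a, b in pairs]

print(len(result))   # 16
print(result[:4])    # ['(1+3)=4', '(1+4)=5', '(1+2)=3', '(1+5)=6']
```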

## Exercise 2) Explain what the code is doing: the input (values and technology/type), the technologies involved, and the output (values and technology/type)

rdd = sc.parallelize([1,2,3,4])

rdd.map(lambda x: x * 2).collect()

rdd (sc.parallelize([1,2,3,4])) - __transformation__/action? Python/__Spark__?

rdd.map(lambda x: x * 2) - transformation/action? Python/Spark?

rdd.map(lambda x: x * 2).collect() - transformation/action? Python/Spark?

## Exercise 3) What is wrong with the following code and how to fix it?

len(sc.parallelize([1,2,3,4]).map(lambda x: x * 2).collect())

In [62]:
len(sc.parallelize([1,2,3,4]).map(lambda x: x * 2).collect())

4

---

# Spark - RDD - Basic Actions

## collect

In [63]:
rdd_numbers = sc.parallelize([1,2,3,3])

In [64]:
rdd_numbers.collect()

[1, 2, 3, 3]

## count

In [65]:
rdd_numbers.count()

4

## countByValue - same as value_counts() in DataFrame in Pandas

In [66]:
rdd_more_numbers = sc.parallelize([4,5,2,7])

In [67]:
rdd_many_numbers = rdd_numbers.union(rdd_more_numbers)

In [68]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [69]:
rdd_many_numbers.countByValue()

defaultdict(int, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 7: 1})

In [70]:
#df.value_counts()

In [71]:
rdd_lines.flatMap(lambda x:x.split(" ")).countByValue()#take(10)

defaultdict(int, {'linea': 3, '1': 1, 'Python': 2, '2': 1, '3': 1, 'Spark': 1})

## Exercise 4) Count the number of occurrences of each word in rdd_lines

In [72]:
rdd_lines = sc.parallelize(["linea 1 Python","linea 2 Python","linea 3 Spark"] )

In [73]:
rdd_lines.flatMap(lambda x: x.split(' ')).countByValue()#.take(10) 

defaultdict(int, {'linea': 3, '1': 1, 'Python': 2, '2': 1, '3': 1, 'Spark': 1})

### Transformations - 0 or more? Because the result of a transformation on a Spark object is always the same kind of object; in this case, an RDD becomes another RDD.
### Actions - 0 or 1? Because an action returns the result to Python as a Python object, and nothing of Spark is running any more!

## take - same as head() in DataFrame in Pandas

In [74]:
rdd_many_numbers.take(2)

[1, 2]

## top - return the highest values

In [75]:
rdd_more_numbers = sc.parallelize([3,4,5,2,5])

In [76]:
rdd_more_numbers.top(3)

[5, 5, 4]

In [77]:
rdd_more_numbers.take(3)

[3, 4, 5]

### Exercise 5) Take the 3 largest unique values.

In [78]:
rdd_more_numbers.distinct().top(3)

[5, 4, 3]

In [79]:
rdd_lines.distinct().take(3)

['linea 2 Python', 'linea 1 Python', 'linea 3 Spark']

In [80]:
rdd_lines.distinct().top(2)

['linea 3 Spark', 'linea 2 Python']

## takeOrdered

In [81]:
rdd_more_numbers.collect()

[3, 4, 5, 2, 5]

In [82]:
rdd_more_numbers.take(3)

[3, 4, 5]

In [83]:
rdd_more_numbers.takeOrdered(3,lambda x: -x) # Descending

[5, 5, 4]

In [84]:
rdd_more_numbers.takeOrdered(3,lambda x: x) # Ascending

[2, 3, 4]

In [85]:
rdd_more_numbers.takeOrdered(3) # Ascending

[2, 3, 4]

### Exercise 6) Take 3 values in order: even numbers first in descending order, then odd numbers in ascending order.

In [86]:
rdd_more_numbers.takeOrdered(3,lambda x: -x if x % 2 == 0 else x)

[4, 2, 3]

### Exercise 7) Take 3 values in order: even numbers first in descending order, then odd numbers, also in descending order.

In [87]:
contador = rdd_more_numbers.count()
sc.parallelize(rdd_more_numbers.takeOrdered(contador, lambda x:-x)).takeOrdered(3, lambda x:x%2)

[4, 2, 5]
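The two-step solution above relies on the second `takeOrdered` preserving the order produced by the first. A single tuple key avoids that dependency: tuples compare element by element, so `(x % 2, -x)` puts evens first and sorts each group in descending order. The sketch below checks the key with `heapq.nsmallest` in plain Python; passing the same function as the key of `takeOrdered` should behave the same way, but that is an assumption not run against a cluster here. The helper name is hypothetical.

```python
import heapq

values = [3, 4, 5, 2, 5]   # same data as rdd_more_numbers

def evens_first_descending(x):
    # (0, ...) sorts before (1, ...), so evens come first;
    # -x makes larger numbers come first within each group
    return (x % 2, -x)

top3 = heapq.nsmallest(3, values, key=evens_first_descending)
print(top3)   # [4, 2, 5]
```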

---

# Persist

In [88]:
print(dir())

['In', 'Out', '_', '_10', '_11', '_12', ..., '_i50', '_i51', '_i52', ...]  (output truncated)

In [89]:
a = [1,2,3,3] # In almost every programming language, variables persist within their scope!

In [90]:
print(dir())

['In', 'Out', '_', '_10', '_11', '_12', ..., '_i50', '_i51', '_i52', ...]  (output truncated)

In [91]:
rdd_numbers = sc.parallelize([1,2,3,3])
# A Spark RDD is volatile! Its data only exists while an action is running and is then discarded

In [92]:
rdd_more_numbers.persist()  # the parentheses are required; without them the method is only referenced, never called

ParallelCollectionRDD[95] at readRDDFromFile at PythonRDD.scala:274

In [93]:
rdd_more_numbers.takeOrdered(rdd_more_numbers.count()) # Ascending

[2, 3, 4, 5, 5]

In [94]:
rdd_more_numbers.unpersist()  # the parentheses are required here as well

ParallelCollectionRDD[95] at readRDDFromFile at PythonRDD.scala:274

---

In [95]:
rdd_more_numbers.takeOrdered(3,lambda x: -x**4 if x % 2 == 0 else -x) # Evens first, then odds, both in descending order

[4, 2, 5]

## takeSample

In [96]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [97]:
rdd_many_numbers.count()

8

In [98]:
rdd_many_numbers.take(10)

[1, 2, 3, 3, 4, 5, 2, 7]

In [99]:
rdd_many_numbers.takeSample(False,20,seed=321) #Without replacement

[7, 1, 5, 2, 3, 2, 3, 4]

In [100]:
rdd_many_numbers.takeSample(True,20,seed=321) #With replacement

[7, 3, 2, 2, 3, 4, 7, 4, 4, 7, 4, 3, 2, 7, 2, 2, 2, 2, 3, 1]

In [101]:
for semilla in range(20):
    print(rdd_many_numbers.takeSample(True,10,seed=semilla)) #With replacement

[3, 2, 4, 5, 3, 1, 1, 4, 1, 2]
[3, 2, 1, 4, 7, 2, 4, 3, 3, 7]
[2, 3, 1, 3, 1, 1, 7, 3, 3, 3]
[3, 3, 1, 2, 3, 2, 3, 3, 1, 2]
[2, 2, 5, 2, 4, 4, 2, 1, 7, 7]
[3, 4, 3, 7, 7, 5, 7, 2, 7, 2]
[4, 2, 1, 2, 2, 2, 4, 4, 5, 2]
[5, 4, 3, 4, 7, 5, 1, 5, 5, 4]
[3, 7, 4, 3, 2, 4, 3, 3, 3, 2]
[7, 3, 3, 7, 7, 5, 3, 2, 2, 2]
[5, 4, 2, 1, 3, 2, 5, 1, 4, 3]
[5, 3, 5, 1, 5, 7, 5, 2, 2, 4]
[5, 2, 3, 7, 2, 2, 2, 1, 3, 7]
[4, 1, 3, 3, 2, 2, 3, 3, 7, 3]
[4, 7, 1, 1, 3, 3, 3, 5, 5, 4]
[4, 5, 5, 4, 7, 3, 1, 4, 3, 3]
[4, 2, 2, 1, 2, 2, 2, 2, 2, 1]
[2, 1, 4, 7, 5, 2, 3, 1, 3, 3]
[5, 2, 1, 1, 4, 1, 2, 3, 4, 4]
[7, 3, 5, 2, 7, 5, 5, 3, 3, 7]


### Exercise 8) Create 10 random lists with 10 elements each from rdd_many_numbers, using different seeds, and join them all into a single RDD. Note: can any of them be persisted?

In [102]:
rdd_nuevo = sc.parallelize([])
rdd_many_numbers.persist()  # reused by every takeSample below, so it is worth persisting
for semilla in range(10):
    rdd_nuevo = rdd_nuevo.union(sc.parallelize(rdd_many_numbers.takeSample(True,10,seed=semilla)))
rdd_many_numbers.unpersist()

UnionRDD[88] at union at NativeMethodAccessorImpl.java:0

In [103]:
rdd_nuevo.persist()  # collect() and count() both reuse it
print(rdd_nuevo.collect())
print(rdd_nuevo.count())
rdd_nuevo.unpersist()

[3, 2, 4, 5, 3, 1, 1, 4, 1, 2, 3, 2, 1, 4, 7, 2, 4, 3, 3, 7, 2, 3, 1, 3, 1, 1, 7, 3, 3, 3, 3, 3, 1, 2, 3, 2, 3, 3, 1, 2, 2, 2, 5, 2, 4, 4, 2, 1, 7, 7, 3, 4, 3, 7, 7, 5, 7, 2, 7, 2, 4, 2, 1, 2, 2, 2, 4, 4, 5, 2, 5, 4, 3, 4, 7, 5, 1, 5, 5, 4, 3, 7, 4, 3, 2, 4, 3, 3, 3, 2, 7, 3, 3, 7, 7, 5, 3, 2, 2, 2]
100

UnionRDD[227] at union at NativeMethodAccessorImpl.java:0

---

# WE START HERE, March 4

# Spark - RDD - Reduce Actions - Reducing the whole list to a single value

## reduce

In [104]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


SELECT COUNT(*) FROM tabla1 - COUNT is an aggregate function in SQL; in Python/Spark we call it a reducer, or reduce.

In [105]:
rdd = sc.parallelize([1, 2, 3, 4]) 
rdd.reduce(lambda a, b: a + b)

10

In [106]:
(1*2)*(3*4)

24

In [107]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [108]:
rdd_many_numbers.reduce(lambda a, b: a * b)

5040

## fold - the same as reduce, but you can provide a starting value

In [115]:
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+b)

40

In [116]:
1+25+8+4+2

40

In [117]:
sc.parallelize([1,25,8,4,2]).fold(1,lambda a,b:a+b)

45
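The result 45 (instead of 41) is explained by how fold uses the starting value: it is applied once in every partition and once more when the partial results are merged. The plain-Python sketch below emulates that behaviour. The split into 4 partitions is an assumption chosen because it reproduces the output above; the real layout depends on the cluster configuration. `fold_like_spark` is a hypothetical helper, not a Spark function.

```python
from functools import reduce
from operator import add

def fold_like_spark(partitions, zero, op):
    # fold each partition starting from `zero` ...
    partials = [reduce(op, part, zero) for part in partitions]
    # ... then fold the partial results, starting from `zero` once more
    return reduce(op, partials, zero)

# Assumed split of [1, 25, 8, 4, 2] into 4 partitions (hypothetical layout)
partitions = [[1], [25], [8], [4, 2]]

print(fold_like_spark(partitions, 0, add))  # 40
print(fold_like_spark(partitions, 1, add))  # 45 = 40 + 4 partitions + 1 final merge
```

For this reason the starting value should be neutral for the operation (0 for a sum, 1 for a product).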

In [118]:
min([1,2,3,4,5])

1

In [119]:
max([1,2,3,4,5])

5

## aggregate

In [127]:
sc.parallelize([1,2,3,4,5]).reduce(lambda a,b:(a+b)/2)

3.375

In [142]:
# sc.parallelize([1,2,3,4,5]).aggregate(ZERO_VALUES, REDUCE_WITHIN_WORKER, REDUCE_ACROSS_WORKERS)

# sc.parallelize([1,2,3,4,5]).aggregate((0), lambda accumulator, row_value: x, lambda x, y: x)
t = sc.parallelize([1,2,3,4,5]).aggregate((0,0,1), \
                                      lambda x,y:(x[0]+y,x[1]+1,x[2]*y), \
                                      lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]*y[2]))
# x is the accumulator: (sum, count, product)
# y is each value (first lambda) or another accumulator (second lambda)


In [143]:
t[0]/t[1]

3.0

In [144]:
t[2]**(1/t[1])

2.605171084697352

In [120]:
t = sc.parallelize([1,2,3,4,5]).aggregate(
  (0, 0), # INITIAL VALUE OF THE TWO COUNTERS (sum, count)
  (lambda acc, value: (acc[0] + value, acc[1] + 1)), # REDUCE within the WORKER
  (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) # REDUCE across WORKERS

In [121]:
t

(15, 5)

In [122]:
t[0]/t[1]

3.0
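The two functions passed to aggregate can be emulated in plain Python to see where each one runs. This is a sketch of the semantics only: the partition layout is made up, and `aggregate_like_spark` is a hypothetical helper. It also shows why the earlier `reduce` with `(a+b)/2` is not a mean: pairwise averaging is not associative, so its result depends on how the values are grouped (this sequential run gives 4.0625, while the Spark run above gave 3.375).

```python
from functools import reduce

def aggregate_like_spark(partitions, zero, seq_op, comb_op):
    # seq_op folds the values of one partition into an accumulator
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # comb_op merges the accumulators coming from the partitions
    return reduce(comb_op, partials, zero)

partitions = [[1, 2], [3], [4, 5]]   # hypothetical layout of [1, 2, 3, 4, 5]

total, count = aggregate_like_spark(
    partitions,
    (0, 0),                                           # (sum, count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # within a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # across partitions
)
mean = total / count
print(total, count, mean)   # 15 5 3.0

# Pairwise averaging is not associative, so it does not compute the mean
wrong = reduce(lambda a, b: (a + b) / 2, [1, 2, 3, 4, 5])
print(wrong)   # 4.0625
```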

## reduce by Key

In [145]:
[(1,2), (3,4), (3,6)]

[(1, 2), (3, 4), (3, 6)]

In [161]:
d = {1:2, 3:4,3:6}
d

{1: 2, 3: 6}

In [162]:
rdd = sc.parallelize({1:2, 3:4,3:6})  # Pitfall: iterating a dict yields only its keys, so the values are lost
rdd.collect()

[1, 3]

In [148]:
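# reduceByKey needs (key, value) tuples, so the pairs are passed as a list of tuples rather than a dict
sc.parallelize([(1,2), (3,4), (3,6)]).reduceByKey(lambda a, b: a + b).collect()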

[(1, 2), (3, 10)]

In [149]:
[("Python",20),("Python",50),("Python",35),("Spark",23)]

[('Python', 20), ('Python', 50), ('Python', 35), ('Spark', 23)]

In [157]:
sc.parallelize([(None,30),(4+4j,50),(4+4j,35),(4+4j,23)]).\
reduceByKey(lambda x,y: x+y).\
collect()

[(None, 30), ((4+4j), 108)]

In [166]:
sc.parallelize([("Python",20),("Python",50),("Python",35),("Spark",23)]).\
reduceByKey(lambda x,y: x+y).\
collect()

[('Python', 105), ('Spark', 23)]

In [167]:
[("Python",(20,3)),("Python",(50,4)),("Python",(35,2)),("Spark",(20,3))]

[('Python', (20, 3)),
 ('Python', (50, 4)),
 ('Python', (35, 2)),
 ('Spark', (20, 3))]

In [168]:
sc.parallelize([("Python",(20,3)),("Python",(50,4)),("Python",(35,2)),("Spark",(20,3))]).\
reduceByKey(lambda x,y: (x[0]+y[0],x[1]+y[1])).\
collect()

[('Python', (105, 9)), ('Spark', (20, 3))]
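What reduceByKey does with these pairs can be emulated with a dictionary in plain Python: values that share a key are merged two at a time with the given function. This is a sketch of the semantics, not Spark's implementation (Spark merges within each partition first and then across partitions, and does not guarantee the order of the keys). `reduce_by_key` is a hypothetical helper.

```python
def reduce_by_key(pairs, op):
    merged = {}
    for key, value in pairs:
        # first value for a key is stored as is; later ones are merged with op
        merged[key] = op(merged[key], value) if key in merged else value
    return list(merged.items())

pairs = [("Python", (20, 3)), ("Python", (50, 4)),
         ("Python", (35, 2)), ("Spark", (20, 3))]

totals = reduce_by_key(pairs, lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(totals)   # [('Python', (105, 9)), ('Spark', (20, 3))]
```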

## Persist (Caching)

In [None]:
rdd.persist()

In [169]:
rdd.count()

2

## Cache

In [None]:
rdd_cached_lines = rdd_lines.cache()

In [None]:
rdd_cached_lines.collect()

In [None]:
rdd_cached_lines.count()

---

# Example 1

In [170]:
lines = sc.parallelize(["linea 500 Python","linea 404 Python","linea 404 Spark"] )

In [171]:
lines.map(lambda x: x.split(' ')).filter(lambda x : "404" in x).map(lambda word : (word, 1)).collect()

[(['linea', '404', 'Python'], 1), (['linea', '404', 'Spark'], 1)]

In [172]:
lines.flatMap(lambda x: x.split(' ')).collect()

['linea', '500', 'Python', 'linea', '404', 'Python', 'linea', '404', 'Spark']

In [173]:
lines.map(lambda x: x.split(' ')).collect()

[['linea', '500', 'Python'],
 ['linea', '404', 'Python'],
 ['linea', '404', 'Spark']]

In [176]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x).count()

2

In [178]:
lines.flatMap(lambda x: x.split(' ')).\
filter(lambda x : "404" in x).\
map(lambda word : (word, 1)).collect()

[('404', 1), ('404', 1)]

In [179]:
from operator import add
# add is equivalent to lambda x, y: x + y

In [185]:
lines.flatMap(lambda x: x.split(' ')).\
filter(lambda x : "404" in x).\
map(lambda word : (word, 1)).\
reduceByKey(add).collect()

[('404', 2)]

In [186]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x or "500" in x).count()

3

In [187]:
from operator import add

In [194]:
rdd = lines.flatMap(lambda x: x.split(' '))
rdd.map(lambda x:(x,1)).reduceByKey(add).collect()

[('Python', 2), ('linea', 3), ('Spark', 1), ('500', 1), ('404', 2)]

In [196]:
lines.flatMap(lambda x: x.split(' '))\
.filter(lambda x : "404" in x or "500" in x).map(lambda word : (word, 1)) \
.reduceByKey(add).collect()

[('500', 1), ('404', 2)]

In [197]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x or "500" in x).map(lambda word : (word, 1)) \
.reduceByKey(lambda x,y: x+y).collect()

[('500', 1), ('404', 2)]

# Exercise - Using % 3, sum the numbers between 1 and 1000000 grouped by remainder: n % 3 == 0, n % 3 == 1 and n % 3 == 2. Time limit: until 11:00
expected answer: (166666833333, 166667166667, 166666500000)

# Exercise 2) Repeat the previous exercise, merging keys 0 and 1
expected answer: (333334000000, 166666500000)
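One way to check both expected answers in plain Python is to emulate the map to (key, value) pairs followed by a sum per key. In PySpark the same idea would presumably be written as `sc.parallelize(range(1, 1000001)).map(lambda n: (n % 3, n)).reduceByKey(add)`; that chain is an assumption and is not run here. `sum_by_key` is a hypothetical helper.

```python
from collections import defaultdict

def sum_by_key(pairs):
    # emulates reduceByKey(add): one running total per key
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

numbers = range(1, 1000001)

# Exercise 1: the key is the remainder of the division by 3
by_remainder = sum_by_key((n % 3, n) for n in numbers)
answer_1 = tuple(by_remainder[k] for k in (0, 1, 2))
print(answer_1)

# Exercise 2: remainders 0 and 1 share one key, remainder 2 keeps its own
merged = sum_by_key((0 if n % 3 in (0, 1) else 2, n) for n in numbers)
answer_2 = (merged[0], merged[2])
print(answer_2)
```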