# Spark - RDD - Basic Transformations

In [1]:
#!pip install findspark

In [1]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="RDDBasics")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/02 09:43:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
sc

# Creating an RDD with 3 lines of text

* __Resilient__ - The data and the functions to run are sent to several machines, but each chunk of data is unique. Spark keeps the raw data and the execution plan (which includes the functions), so if one of the workers fails, its work is sent to another worker.
* __Distributed__ - The data is split across workers. A worker is not a "slave" machine; a worker is a process (PID), which means a single "slave" can run, and usually does run, more than one worker. This is how all the processors of that machine get used.
* __Dataset__ - Unstructured data: it accepts any type of data.

In [4]:
rdd_lines = sc.parallelize(["linea 1 Python",
                            "linea 2 Python",
                            "linea 3 Spark"] )

In [5]:
type(rdd_lines)

pyspark.rdd.RDD

In [6]:
rdd_lines

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274

In [7]:
rdd_lines.collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

## Note: rdd_lines.collect() is a "code smell"
In other words, your code smells bad because you are bringing 100% of the data into Python. Are you sure you want to do that?

In [8]:
type(rdd_lines.collect())

list

In [9]:
rdd_numbers = sc.parallelize([1,2,3,3] )

In [10]:
rdd_numbers

ParallelCollectionRDD[2] at readRDDFromFile at PythonRDD.scala:274

.collect() returns the whole RDD

In [11]:
rdd_lines.collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [12]:
rdd_numbers.collect() # Action that returns the whole RDD

[1, 2, 3, 3]

In [13]:
rdd_lines.take(10)

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [14]:
rdd_numbers.take(10)

[1, 2, 3, 3]

In [15]:
sc.parallelize([1,2,3,3]).take(10)

[1, 2, 3, 3]

In [16]:
contador = rdd_numbers.count()
contador

4

In [17]:
contador

4

* An __action__ ends the processing in Spark and returns a result to the master as a Python data type.
* A __transformation__ does not trigger Spark processing (it is lazily evaluated), which means that if I check its type, it will be a Spark object, not a Python one.
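
The lazy behaviour described above can be illustrated without Spark. The following is only a plain-Python analogy, not Spark code: the built-in `map` stands in for a transformation and `list()` stands in for an action. The names `double`, `calls` and `calls_before_action` are made up for this sketch.

```python
# Plain-Python analogy (no Spark involved): map() is lazy like a
# transformation, and list() forces the evaluation like an action.
calls = []  # records every element the "transformation" actually touches


def double(x):
    calls.append(x)
    return x * 2


lazy = map(double, [1, 2, 3, 3])   # "transformation": nothing runs yet
calls_before_action = len(calls)   # no element has been processed so far

result = list(lazy)                # "action": the work happens here
print(type(lazy).__name__, calls_before_action, result)
```

Before `list()` is called, `lazy` is still a `map` object, much like a transformed RDD is still a Spark object until an action runs.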

## map vs. flatMap transformation
flatMap flattens the result by one level. If the function returns a plain string (as happens with a one-dimensional list of strings), the string itself is broken into characters.
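
As a rough plain-Python sketch of the difference (the helpers `py_map` and `py_flat_map` are hypothetical names for this illustration, not part of PySpark), `map` keeps one output per input while `flatMap` flattens one level of nesting:

```python
from itertools import chain

lines = ["linea 1 Python", "linea 2 Python", "linea 3 Spark"]


def py_map(func, data):
    """Like RDD.map: exactly one output element per input element."""
    return [func(x) for x in data]


def py_flat_map(func, data):
    """Like RDD.flatMap: func returns an iterable, which is flattened one level."""
    return list(chain.from_iterable(func(x) for x in data))


nested = py_map(lambda x: x.split(" "), lines)     # list of lists
flat = py_flat_map(lambda x: x.split(" "), lines)  # single flat list of words
chars = py_flat_map(lambda x: x, ["ab", "cd"])     # strings are iterable: split into characters
print(nested, flat, chars, sep="\n")
```

The last call shows why an identity function under flatMap breaks each string into single characters.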

In [18]:
rdd_lines.map(lambda x:x.split(" ")).take(4)

[['linea', '1', 'Python'], ['linea', '2', 'Python'], ['linea', '3', 'Spark']]

In [19]:
rdd_lines.flatMap(lambda x:x.split(" ")).take(10)

['linea', '1', 'Python', 'linea', '2', 'Python', 'linea', '3', 'Spark']

In [20]:
rdd_lines.map(lambda x: x.split(' ')).collect()

[['linea', '1', 'Python'], ['linea', '2', 'Python'], ['linea', '3', 'Spark']]

In [21]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python","linea 3 Spark"]).map(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).count()

2

In [22]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python","linea 3 Spark"]).flatMap(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).count()

3

In [23]:
rdd_lines.flatMap(lambda x: x.split(' ')).collect()

['linea', '1', 'Python', 'linea', '2', 'Python', 'linea', '3', 'Spark']

In [24]:
rdd_lines.map(lambda x:x).collect()

['linea 1 Python', 'linea 2 Python', 'linea 3 Spark']

In [25]:
rdd_lines.flatMap(lambda x:x).collect()

['l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '1',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '2',
 ' ',
 'P',
 'y',
 't',
 'h',
 'o',
 'n',
 'l',
 'i',
 'n',
 'e',
 'a',
 ' ',
 '3',
 ' ',
 'S',
 'p',
 'a',
 'r',
 'k']

## filter transformation

In [26]:
for item in ["linea 1 Python","linea 2 Python","linea 3 Spark"]:
    print("Python" in item)

True
True
False


In [27]:
list(filter(lambda x: "Python" in x ,["linea 1 Python","linea 2 Python","linea 3 Spark"])) 
# This is plain Python, not Spark!

['linea 1 Python', 'linea 2 Python']

In [28]:
rdd_lines.filter(lambda x: "Python" in x).take(10)

['linea 1 Python', 'linea 2 Python']

# Combining flatMap and filter

In [29]:
sc.parallelize(["linea 1 Python","linea 2 Python y Python y Python","linea 3 Spark"]). \
flatMap(lambda x:x.split(' ')). \
filter(lambda x: "Python" in x).take(10)

['Python', 'Python', 'Python', 'Python']

In [30]:
rdd_lines.flatMap(lambda x: x.split(' ')).filter(lambda x: "Python" in x).collect()

['Python', 'Python']

In [31]:
rdd_lines.filter(lambda x: "Python" in x).flatMap(lambda x: x.split(' ')).collect()

['linea', '1', 'Python', 'linea', '2', 'Python']

In [32]:
#sc.p.map.filter.flatMap.map.......

### Cache
* Caches in Spark live on the workers (in the memory of the worker process, the PID)
* Caches are volatile (when Spark shuts down, everything is lost)
* Asking for a cache is only a suggestion: if Spark cannot do it, it will not!
* If cached data goes unused, Spark eventually drops it
* If caches get in the way of new processing, Spark evicts them from memory (they are spilled to disk only when the storage level allows it)

In [33]:
rdd_lines.cache()

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:274

## distinct transformation

In [34]:
rdd_numbers = sc.parallelize([3,3,2,1] )
rdd_numbers.take(10)

[3, 3, 2, 1]

In [35]:
rdd_numbers.distinct().collect()

[1, 2, 3]

In [36]:
rdd_numbers. \
map(lambda x:x**2). \
filter(lambda x:x>3). \
distinct(). \
count()#.take(10)#

2

## sample without replacement

In [37]:
rdd_numbers.sample(False,0.5).collect()

[3]

## sample with replacement

In [38]:
rdd_numbers.sample(True,3).collect()

[3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 1]

---

# Transformation SET OPERATIONS

In [39]:
rdd_numbers.collect()

[3, 3, 2, 1]

In [40]:
rdd_more_numbers = sc.parallelize([3,4,2,5])
rdd_more_numbers.collect()

[3, 4, 2, 5]

## union - like a plain concatenation: lista1.extend(lista2)

In [41]:
rdd_numbers.union(rdd_more_numbers).collect()

[3, 3, 2, 1, 3, 4, 2, 5]

## intersection - like an INNER JOIN (duplicates are removed)

In [42]:
rdd_numbers.intersection(rdd_more_numbers).collect()

[2, 3]

## subtract - like a LEFT ANTI JOIN (left elements that are not in the right RDD)

In [43]:
rdd_numbers.subtract(rdd_more_numbers).collect()

[1]
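
The three set-style operations above can be mirrored on local lists. This is a plain-Python sketch of the semantics only, using the same values as `rdd_numbers` and `rdd_more_numbers`. Spark does not guarantee the element order of `intersection` or `subtract`, so `sorted` is used here just for readability.

```python
a = [3, 3, 2, 1]   # same values as rdd_numbers
b = [3, 4, 2, 5]   # same values as rdd_more_numbers

union = a + b                                  # keeps duplicates, like RDD.union
intersection = sorted(set(a) & set(b))         # common values, duplicates removed
subtract = [x for x in a if x not in set(b)]   # left elements absent from the right

print(union, intersection, subtract)
```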

## cartesian product

In [44]:
rdd_more_numbers.take(10)

[3, 4, 2, 5]

In [45]:
rdd_numbers.cartesian(rdd_more_numbers).collect()

[(3, 3),
 (3, 4),
 (3, 2),
 (3, 5),
 (3, 3),
 (3, 4),
 (3, 2),
 (3, 5),
 (2, 3),
 (2, 4),
 (2, 2),
 (2, 5),
 (1, 3),
 (1, 4),
 (1, 2),
 (1, 5)]

In [46]:
rdd_numbers.cartesian(rdd_more_numbers).map(lambda x: x[0]/x[1]).take(100)



[1.0,
 0.75,
 1.5,
 0.6,
 1.0,
 0.75,
 1.5,
 0.6,
 0.6666666666666666,
 0.5,
 1.0,
 0.4,
 0.3333333333333333,
 0.25,
 0.5,
 0.2]

In [47]:
rdd_numbers.cartesian(rdd_more_numbers).map(lambda x:x[0]+x[1]).collect()

[6, 7, 5, 8, 6, 7, 5, 8, 5, 6, 4, 7, 4, 5, 3, 6]

## Exercise 1) sum rdd1 + rdd2

Expected result:
['(1+3)=4',
 '(1+4)=5',
 '(2+3)=5',
 '(2+4)=6',
 '(1+2)=3',
 '(1+5)=6',
 '(2+2)=4',
 '(2+5)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8',
 '(3+2)=5',
 '(3+5)=8']


In [48]:
rdd1 = sc.parallelize([1,2,3,3])

In [49]:
rdd2 = sc.parallelize([3,4,2,5])

In [50]:
rdd1.cartesian(rdd2).map(lambda x:'({}+{})={}'.format(x[0],x[1],x[0]+x[1])).collect()

['(1+3)=4',
 '(1+4)=5',
 '(1+2)=3',
 '(1+5)=6',
 '(2+3)=5',
 '(2+4)=6',
 '(2+2)=4',
 '(2+5)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8']

In [51]:
rdd1.cartesian(rdd2).map(lambda x: f"({x[0]}+{x[1]})={x[0]+x[1]}").collect()

['(1+3)=4',
 '(1+4)=5',
 '(1+2)=3',
 '(1+5)=6',
 '(2+3)=5',
 '(2+4)=6',
 '(2+2)=4',
 '(2+5)=7',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8',
 '(3+3)=6',
 '(3+4)=7',
 '(3+2)=5',
 '(3+5)=8']

## Exercise 2) Explain what the code is doing: input (values and technology/type), technologies involved, and output (values and technology/type)

rdd = sc.parallelize([1,2,3,4]) - __transformation__ / action? Returns: Python/__Spark__?

map(lambda x: x * 2) - __transformation__ / action? Returns: Python/__Spark__?

collect() - transformation / __action__? Returns: __Python__/Spark?

rdd.map(lambda x: x * 2).collect()

## Exercise 3) What is wrong with the following code and how to fix it?

len(sc.parallelize([1,2,3,4]).map(lambda x: x * 2).collect())

In [52]:
sc.parallelize([1,2,3,4]).map(lambda x: x * 2).count()

4

In [53]:
len(sc.parallelize([1,2,3,4]).map(lambda x: x * 2).collect())

4

---

# Spark - RDD - Basic Actions

## collect

In [54]:
rdd_numbers = sc.parallelize([1,2,3,3])

In [55]:
rdd_numbers.collect()

[1, 2, 3, 3]

## count

In [56]:
rdd_numbers.count()

4

## countByValue - same as value_counts() in DataFrame in Pandas

In [57]:
rdd_more_numbers = sc.parallelize([4,5,2,7])

In [58]:
rdd_many_numbers = rdd_numbers.union(rdd_more_numbers)

In [59]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [60]:
rdd_many_numbers.countByValue()

defaultdict(int, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 7: 1})
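
Because `countByValue` is an action, its result is an ordinary Python dictionary. For a small local list, `collections.Counter` from the standard library produces the same counts. The sketch below uses the same values as `rdd_many_numbers`. It is a comparison aid, not a replacement for the Spark call on distributed data.

```python
from collections import Counter

values = [1, 2, 3, 3, 4, 5, 2, 7]   # same values as rdd_many_numbers

counts = Counter(values)            # value -> number of occurrences
print(dict(counts))
```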

In [61]:
#df.value_counts()
a= rdd_lines.flatMap(lambda x:x.split(" ")).countByValue()

In [62]:
a

defaultdict(int, {'linea': 3, '1': 1, 'Python': 2, '2': 1, '3': 1, 'Spark': 1})

In [63]:
a['Python']

2

In [64]:
rdd_lines.flatMap(lambda x:x.split(" ")).countByValue()#take(10)

defaultdict(int, {'linea': 3, '1': 1, 'Python': 2, '2': 1, '3': 1, 'Spark': 1})

## Exercise 4) Count the number of occurrences of each word in rdd_lines

In [65]:
rdd_lines = sc.parallelize(["linea 1 Python","linea 2 Python","linea 3 Spark"] )

In [66]:
rdd_lines.flatMap(lambda x: x.split(' ')).countByValue()#.take(10) 

defaultdict(int, {'linea': 3, '1': 1, 'Python': 2, '2': 1, '3': 1, 'Spark': 1})

### Transformations: 0 or more? Because the result of transforming a Spark object is always the same kind of object; in this case, an RDD becomes another RDD.
### Actions: 0 or 1? Because an action returns the answer to Python as a Python object; nothing of Spark is running anymore!

## take - same as head() in DataFrame in Pandas

In [67]:
rdd_many_numbers.take(2)

[1, 2]

## top - return the highest values

In [68]:
rdd_more_numbers = sc.parallelize([3,4,5,2,5])

In [69]:
rdd_more_numbers.top(3)

[5, 5, 4]

In [70]:
rdd_more_numbers.take(3)

[3, 4, 5]

### Exercise 5) Get the 3 largest unique values.

In [71]:
rdd_more_numbers.distinct().top(3)

[5, 4, 3]

In [72]:
rdd_lines.distinct().take(3)

['linea 1 Python', 'linea 3 Spark', 'linea 2 Python']

In [73]:
rdd_lines.distinct().top(2)

['linea 3 Spark', 'linea 2 Python']

## takeOrdered

In [74]:
rdd_more_numbers.collect()

[3, 4, 5, 2, 5]

In [75]:
rdd_more_numbers.take(3)

[3, 4, 5]

In [76]:
rdd_more_numbers.takeOrdered(3,lambda x: -x) # Descending

[5, 5, 4]

In [77]:
rdd_more_numbers.takeOrdered(3,lambda x: x) # Ascending

[2, 3, 4]

In [78]:
rdd_more_numbers.takeOrdered(3) # Ascending

[2, 3, 4]

### Exercise 6) Take 3 values in order: even numbers first in descending order, then odd numbers in ascending order.

In [79]:
sc.parallelize([5, 2, 5, 4, 3]).collect()

[5, 2, 5, 4, 3]

In [80]:
sc.parallelize([5, 2, 5, 4, 3]).takeOrdered(10,lambda x:x%2)

[2, 4, 5, 5, 3]

### Exercise 7) Take 3 values in order: even numbers first in descending order, followed by odd numbers, also in descending order.

In [81]:
contador =  rdd_more_numbers.count()

In [82]:
sc.parallelize(rdd_more_numbers.takeOrdered(contador,lambda x:-x)).takeOrdered(3,lambda x: x%2)

[4, 2, 5]
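
One possible way to express both ordering exercises with a single sort is a tuple key: the first component puts even numbers ahead of odd ones, the second orders values inside each group. The sketch below uses plain Python's `sorted` on a local list with the same values as `rdd_more_numbers`. The same key functions could be passed as the second argument of `takeOrdered`, which also sorts by key, but that is left untested here.

```python
data = [3, 4, 5, 2, 5]  # same values as rdd_more_numbers

# Evens first (descending), then odds (descending).
evens_desc_odds_desc = sorted(data, key=lambda x: (x % 2, -x))[:3]

# Evens first (descending), then odds (ascending).
evens_desc_odds_asc = sorted(
    data, key=lambda x: (x % 2, -x if x % 2 == 0 else x)
)[:3]

print(evens_desc_odds_desc, evens_desc_odds_asc)
```

Compared with sorting twice or raising values to a power, a tuple key states the two criteria explicitly.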

---

# Persist

In [83]:
print(dir())

['In', 'Out', '_', '_10', '_11', '_12', '_13', '_14', '_15', '_16', '_17', '_18', '_19', '_2', '_20', '_21', '_22', '_23', '_24', '_25', '_27', '_28', '_29', '_30', '_31', '_33', '_34', '_35', '_36', '_37', '_38', '_39', '_40', '_41', '_42', '_43', '_44', '_45', '_46', '_47', '_5', '_50', '_51', '_52', '_53', '_55', '_56', '_59', '_6', '_60', '_62', '_63', '_64', '_66', '_67', '_69', '_7', '_70', '_71', '_72', '_73', '_74', '_75', '_76', '_77', '_78', '_79', '_8', '_80', '_82', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i10', '_i11', '_i12', '_i13', '_i14', '_i15', '_i16', '_i17', '_i18', '_i19', '_i2', '_i20', '_i21', '_i22', '_i23', '_i24', '_i25', '_i26', '_i27', '_i28', '_i29', '_i3', '_i30', '_i31', '_i32', '_i33', '_i34', '_i35', '_i36', '_i37', '_i38', '_i39', '_i4', '_i40', '_i41', '_i42', '_i43', '_i44', '_i45', '_i46', '_i47', '_i48', '_i49', '_i5', '_i50', '_i51', '_i52', '_i53', '_i54', '

In [84]:
a = [1,2,3,3] # In almost every programming language, variables persist within their scope!

In [85]:
print(dir())

['In', 'Out', '_', '_10', '_11', '_12', '_13', '_14', '_15', '_16', '_17', '_18', '_19', '_2', '_20', '_21', '_22', '_23', '_24', '_25', '_27', '_28', '_29', '_30', '_31', '_33', '_34', '_35', '_36', '_37', '_38', '_39', '_40', '_41', '_42', '_43', '_44', '_45', '_46', '_47', '_5', '_50', '_51', '_52', '_53', '_55', '_56', '_59', '_6', '_60', '_62', '_63', '_64', '_66', '_67', '_69', '_7', '_70', '_71', '_72', '_73', '_74', '_75', '_76', '_77', '_78', '_79', '_8', '_80', '_82', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i10', '_i11', '_i12', '_i13', '_i14', '_i15', '_i16', '_i17', '_i18', '_i19', '_i2', '_i20', '_i21', '_i22', '_i23', '_i24', '_i25', '_i26', '_i27', '_i28', '_i29', '_i3', '_i30', '_i31', '_i32', '_i33', '_i34', '_i35', '_i36', '_i37', '_i38', '_i39', '_i4', '_i40', '_i41', '_i42', '_i43', '_i44', '_i45', '_i46', '_i47', '_i48', '_i49', '_i5', '_i50', '_i51', '_i52', '_i53', '_i54', '

In [86]:
rdd_numbers = sc.parallelize([1,2,3,3]) 
# A Spark RDD is volatile! It only exists while an action is running and then it disappears

In [87]:
rdd_more_numbers.persist() # the parentheses are required; without them the method is only referenced, not called

ParallelCollectionRDD[105] at readRDDFromFile at PythonRDD.scala:274

In [88]:
rdd_more_numbers.takeOrdered(rdd_more_numbers.count()) # Ascending

[2, 3, 4, 5, 5]

In [89]:
rdd_more_numbers.unpersist() # the parentheses are required here as well

ParallelCollectionRDD[105] at readRDDFromFile at PythonRDD.scala:274

---

In [90]:
rdd_more_numbers.takeOrdered(3,lambda x: -x**4 if x % 2 == 0 else -x) # -x**4 pushes the evens ahead of the odds; both groups come out descending

[4, 2, 5]

## takeSample

In [91]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [92]:
rdd_many_numbers.count()

8

In [93]:
rdd_many_numbers.take(10)

[1, 2, 3, 3, 4, 5, 2, 7]

In [94]:
rdd_many_numbers.takeSample(False,20,seed=321) #Without replacement

[7, 1, 5, 2, 3, 2, 3, 4]

In [95]:
rdd_many_numbers.takeSample(True,20,seed=321) #With replacement

[2, 3, 2, 2, 2, 7, 3, 2, 4, 3, 3, 2, 4, 2, 2, 1, 5, 7, 3, 1]

In [96]:
for semilla in range(20):
    print(rdd_many_numbers.takeSample(True,10,seed=semilla)) #With replacement

[1, 4, 5, 1, 3, 5, 2, 1, 1, 7]
[7, 7, 1, 2, 5, 1, 4, 5, 2, 3]
[5, 1, 7, 5, 2, 3, 5, 2, 2, 2]
[3, 2, 1, 4, 2, 2, 2, 3, 1, 5]
[5, 4, 4, 7, 4, 7, 3, 3, 1, 3]
[2, 5, 4, 2, 7, 4, 2, 3, 7, 7]
[7, 2, 2, 7, 2, 5, 7, 4, 1, 3]
[3, 5, 2, 2, 2, 7, 2, 2, 7, 4]
[4, 2, 3, 3, 3, 3, 1, 4, 3, 1]
[2, 5, 3, 3, 5, 3, 3, 3, 3, 3]
[7, 5, 7, 3, 2, 3, 3, 3, 5, 3]
[3, 5, 3, 2, 3, 2, 5, 3, 2, 3]
[4, 1, 2, 2, 3, 7, 1, 2, 4, 3]
[4, 3, 2, 7, 5, 4, 3, 3, 2, 2]
[7, 3, 7, 2, 5, 4, 1, 4, 3, 7]
[7, 2, 7, 2, 4, 2, 3, 4, 4, 4]
[7, 5, 2, 3, 5, 1, 3, 2, 1, 3]
[3, 5, 5, 3, 1, 3, 5, 7, 2, 1]
[2, 1, 1, 1, 5, 1, 2, 3, 5, 5]
[2, 4, 3, 4, 1, 3, 5, 3, 7, 3]


### Exercise 8) Create 10 random lists with 10 elements of rdd_many_numbers using different seeds and join them all into a single RDD. Note: can any of them be persisted?

---

# Spark - RDD - Reduce Actions - Reducing the whole list to a single value

## reduce

In [97]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))


SELECT COUNT(*) FROM tabla1 - the COUNT operation is an aggregator in SQL; in Python/Spark we call it a reducer, or reduce

In [98]:
rdd = sc.parallelize([1, 2, 3, 4]) 
rdd.reduce(lambda a, b: a + b)

10

In [99]:
(1*2)*(3*4)

24

In [100]:
rdd_many_numbers.collect()

[1, 2, 3, 3, 4, 5, 2, 7]

In [101]:
rdd_many_numbers.reduce(lambda a, b: a * b)

5040

## fold - the same as reduce, but you can provide a starting value

In [102]:
sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+b)

40

In [103]:
1+25+8+4+2

40

In [104]:
sc.parallelize([1,25,8,4,2]).fold(1,lambda a,b:a+b)

53
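
The result 53 (rather than 41) is consistent with the zero value being used once per partition and once more when the partial results are merged. The sketch below emulates that in plain Python. `fold_partitions` is a hypothetical helper, and the layout of 12 partitions is an assumption about the session above, not something the notebook states.

```python
from functools import reduce


def fold_partitions(zero, func, partitions):
    """Fold each partition starting from zero, then fold the partials from zero again."""
    partials = [reduce(func, part, zero) for part in partitions]
    return reduce(func, partials, zero)


def add(a, b):
    return a + b


data = [1, 25, 8, 4, 2]

# Assumed layout: 12 partitions, 5 holding one element each and 7 empty.
partitions = [[x] for x in data] + [[] for _ in range(7)]

with_zero_0 = fold_partitions(0, add, partitions)
with_zero_1 = fold_partitions(1, add, partitions)   # 40 + 12 partitions + 1 final merge
print(with_zero_0, with_zero_1)
```

This is why the zero value should be the neutral element of the operation (0 for addition, 1 for multiplication).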

In [105]:
min([1,2,3,4,5])

1

In [106]:
max([1,2,3,4,5])

5

## aggregate

In [107]:
sc.parallelize([1,2,3,4,5]).reduce(lambda a,b:(a+b)/2)

4.0625
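
The value 4.0625 is not the mean of the list (which is 3). `reduce` expects a commutative and associative function, and `(a+b)/2` is not associative, so the result depends on how the elements are grouped into partitions. A plain-Python sketch of that effect, where `reduce_partitions` is a hypothetical helper and the partition layouts are made up for illustration:

```python
from functools import reduce


def avg2(a, b):
    """Pairwise average: not associative, so grouping changes the result."""
    return (a + b) / 2


def reduce_partitions(func, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions if part]
    return reduce(func, partials)


one_per_partition = reduce_partitions(avg2, [[1], [2], [3], [4], [5]])
two_partitions = reduce_partitions(avg2, [[1, 2], [3, 4, 5]])
print(one_per_partition, two_partitions)
```

The two layouts give different answers, and neither is the real mean, which is why the mean is better computed from a (sum, count) pair.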

In [108]:
# sc.parallelize([1,2,3,4,5]).aggregate(INITIAL_VALUES, REDUCE_WITHIN_WORKER, REDUCE_ACROSS_WORKERS)

# sc.parallelize([1,2,3,4,5]).aggregate((0), lambda accumulator, line_value: accumulator, lambda x, y: x)
t = sc.parallelize([1,2,3,4,5]).aggregate((0,0,1), \
                                      lambda x,y:(x[0]+y,x[1]+1,x[2]*y), \
                                      lambda x,y:(x[0]+y[0],x[1]+y[1],x[2]*y[2]))
# In the first lambda, x is the accumulator (sum, count, product) and y is each value
# In the second lambda, x and y are both accumulators, one per worker


In [109]:
t[0]/t[1]

3.0

In [110]:
t[2]**(1/t[1])

2.605171084697352

In [111]:
t = sc.parallelize([1,2,3,4,5]).aggregate(
  (0, 0), # initial value of the two counters (sum, count)
  (lambda acc, value: (acc[0] + value, acc[1] + 1)), # reduce within each worker
  (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))) # reduce across workers

In [112]:
t

(15, 5)

In [113]:
t[0]/t[1]

3.0
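
To make the roles of the three arguments more concrete, here is a plain-Python emulation of the (sum, count) aggregation. `aggregate_partitions` is a hypothetical helper and the three-partition layout is an arbitrary assumption. Any split of the data gives the same result because both functions are associative.

```python
from functools import reduce


def aggregate_partitions(zero, seq_op, comb_op, partitions):
    """Apply seq_op inside each partition, then comb_op across the partial results."""
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)


def seq_op(acc, value):
    # acc is (running sum, running count); value is one element of the data
    return (acc[0] + value, acc[1] + 1)


def comb_op(acc1, acc2):
    # both arguments are (sum, count) accumulators coming from different partitions
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])


total, count = aggregate_partitions((0, 0), seq_op, comb_op, [[1, 2], [3], [4, 5]])
mean = total / count
print(total, count, mean)
```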

## reduce by Key

In [114]:
[(1,2), (3,4), (3,6)]

[(1, 2), (3, 4), (3, 6)]

In [115]:
d = {1:2, 3:4,3:6}
d

{1: 2, 3: 6}

In [116]:
rdd = sc.parallelize({1:2, 3:4,3:6})
rdd.collect()

[1, 3]
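
Only the keys appear in the output above because iterating over a Python dict yields its keys, and the duplicated key 3 had already collapsed to a single entry. Key-based operations such as `reduceByKey` need `(key, value)` tuples instead. A plain-Python sketch of the difference, where `reduce_by_key` is a hypothetical local stand-in and not the Spark implementation:

```python
d = {1: 2, 3: 4, 3: 6}    # duplicate key 3: the last value wins
keys_only = list(d)        # iterating a dict yields only its keys
pairs = list(d.items())    # (key, value) tuples


def reduce_by_key(func, key_value_pairs):
    """Merge the values of equal keys with func, like reduceByKey does per key."""
    merged = {}
    for key, value in key_value_pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())


summed = reduce_by_key(lambda a, b: a + b, [(1, 2), (3, 4), (3, 6)])
print(keys_only, pairs, summed)
```

Passing a list of tuples (or `d.items()` when duplicate keys are not needed) is what key-based operations expect.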

In [117]:
rdd.reduceByKey(lambda a, b: a + b).collect()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 161.0 failed 1 times, most recent failure: Lost task 5.0 in stage 161.0 (TID 2862) (fedora executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 676, in process
    out_iter = func(split_index, iterator)
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 3472, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 540, in func
    return f(iterator)
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 2554, in combineLocally
    merger.mergeValues(iterator)
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/shuffle.py", line 253, in mergeValues
    for k, v in iterator:
TypeError: cannot unpack non-iterable int object

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1211)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1589)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:1589)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 686, in main
    process()
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 676, in process
    out_iter = func(split_index, iterator)
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 3472, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 540, in func
    return f(iterator)
  File "/home/javiortig/.conda/envs/procesamiento_datos/lib/python3.9/site-packages/pyspark/rdd.py", line 2554, in combineLocally
    merger.mergeValues(iterator)
  File "/home/javiortig/.local/lib/python3.10/site-packages/pyspark/python/lib/pyspark.zip/pyspark/shuffle.py", line 253, in mergeValues
    for k, v in iterator:
TypeError: cannot unpack non-iterable int object

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:559)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:765)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:747)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:512)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1211)
	at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	... 1 more


In [None]:
[("Python",20),("Python",50),("Python",35),("Spark",23)]

In [None]:
sc.parallelize([(None,30),(4+4j,50),(4+4j,35),(4+4j,23)]).\
reduceByKey(lambda x,y: x+y).\
collect()

In [None]:
sc.parallelize([("Python",20),("Python",50),("Python",35),("Spark",23)]).\
reduceByKey(lambda x,y: x+y).\
collect()

In [None]:
[("Python",(20,3)),("Python",(50,4)),("Python",(35,2)),("Spark",(20,3))]

In [None]:
sc.parallelize([("Python",(20,3)),("Python",(50,4)),("Python",(35,2)),("Spark",(20,3))]).\
reduceByKey(lambda x,y: (x[0]+y[0],x[1]+y[1])).\
collect()

## Persist (Caching)

In [None]:
rdd.persist() # the parentheses are required to actually call persist

In [None]:
rdd.count()

## Cache

In [None]:
rdd_cached_lines = rdd_lines.cache()

In [None]:
rdd_cached_lines.collect()

In [None]:
rdd_cached_lines.count()

---

# Example 1

In [None]:
lines = sc.parallelize(["linea 500 Python","linea 404 Python","linea 404 Spark"] )

In [None]:
lines.map(lambda x: x.split(' ')).filter(lambda x : "404" in x).map(lambda word : (word, 1)).collect()

In [None]:
lines.flatMap(lambda x: x.split(' ')).collect()

In [None]:
lines.map(lambda x: x.split(' ')).collect()

In [None]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x).collect()

In [None]:
lines.flatMap(lambda x: x.split(' ')).\
filter(lambda x : "404" in x).\
map(lambda word : (word, 1)).collect()

In [None]:
from operator import add
# add is equivalent to: lambda x, y: x + y

In [None]:
lines.flatMap(lambda x: x.split(' ')).\
filter(lambda x : "404" in x).\
map(lambda word : (word, 1)).\
reduceByKey(add).collect()
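
Since the cells in this example have no recorded output, a plain-Python equivalent of the pipeline can help check what to expect. This sketch mirrors each step (flatMap, filter, map, reduceByKey) on a local list. It is an illustration of the logic, not Spark code.

```python
from collections import Counter
from itertools import chain

lines = ["linea 500 Python", "linea 404 Python", "linea 404 Spark"]

words = chain.from_iterable(line.split(" ") for line in lines)  # flatMap
matches = (w for w in words if "404" in w)                       # filter
pairs = ((w, 1) for w in matches)                                # map to (word, 1)

counts = Counter()                                               # reduceByKey(add)
for word, one in pairs:
    counts[word] += one

result = list(counts.items())
print(result)
```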

In [None]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x or "500" in x).count()

In [None]:
from operator import add

In [None]:
rdd = lines.flatMap(lambda x: x.split(' '))
rdd.map(lambda x:(x,1)).reduceByKey(add).collect()

In [None]:
lines.flatMap(lambda x: x.split(' '))\
.filter(lambda x : "404" in x or "500" in x).map(lambda word : (word, 1)) \
.reduceByKey(add).collect()

In [None]:
lines.flatMap(lambda x: x.split(' ')).filter(lambda x : "404" in x or "500" in x).map(lambda word : (word, 1)) \
.reduceByKey(lambda x,y: x+y).collect()

# Exercise - Using %3, sum the numbers between 1 and 1000000 grouped by remainder (%3==0, %3==1 or %3==2). Until 11:00
Expected answer: (166666833333, 166667166667, 166666500000)

In [121]:
# range() excludes its end value, so 1000001 is needed to include 1000000
sc.parallelize(range(1, 1000001)).map(lambda x: (x%3, x)).reduceByKey(lambda x, y: x+y).map(lambda x: x[1]).collect()

[166666833333, 166667166667, 166666500000]

# Exercise 2) Repeat the previous one, merging keys 0 and 1
Expected answer: (333334000000, 166666500000)

In [124]:
# range() excludes its end value, so 1000001 is needed to include 1000000
sc.parallelize(range(1, 1000001)).map(lambda x: ('0or1' if x%3!=2 else '2', x)).reduceByKey(lambda x, y: x+y).map(lambda x: x[1]).collect()

[333334000000, 166666500000]
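
The expected answers of both exercises can be cross-checked in plain Python. The sketch assumes "between 1 and 1000000" includes 1000000 itself, which is what the expected answers imply, hence `range(1, 1_000_001)`.

```python
totals = {0: 0, 1: 0, 2: 0}
for x in range(1, 1_000_001):   # 1 to 1000000 inclusive
    totals[x % 3] += x          # group by remainder, like (x % 3, x) + reduceByKey

by_remainder = (totals[0], totals[1], totals[2])
merged_0_and_1 = (totals[0] + totals[1], totals[2])
print(by_remainder, merged_0_and_1)
```

With `range(1, 1000000)` the value 1000000 (remainder 1) is left out, which makes the remainder-1 sum exactly 1000000 smaller than expected.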