# Basic information II

## Reshaping RDDs

In [1]:
from pyspark import SparkContext

# create a spark context
sc = SparkContext("local", "Basic informations II")

In [2]:
rdd = sc.parallelize([("a", 3),("b", 6),("a", 6),("c", 9),("a", 12),("c", 15)])
rdd1 = sc.parallelize([4, 5, 7, 8, 9, 35, 4, 23, 5, 6, 7, 8])

```.reduceByKey()``` - merge the values of each key in the RDD using the given function

In [3]:
rdd.reduceByKey(lambda x, y: x*y).collect()

[('a', 216), ('b', 6), ('c', 135)]
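To make the merge step concrete, here is a plain-Python sketch of what ```.reduceByKey()``` computes for the pairs above. It only models the result, not how Spark executes it (Spark merges within each partition first, then across partitions), and the helper name `reduce_by_key` is hypothetical.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, modelling rdd.reduceByKey(func)."""
    merged = {}
    for key, value in pairs:
        # The first value seen for a key is stored as-is; later ones are merged in.
        merged[key] = func(merged[key], value) if key in merged else value
    return sorted(merged.items())

pairs = [("a", 3), ("b", 6), ("a", 6), ("c", 9), ("a", 12), ("c", 15)]
product_by_key = reduce_by_key(pairs, lambda x, y: x * y)
sum_by_key = reduce_by_key(pairs, lambda x, y: x + y)
print(product_by_key)
print(sum_by_key)
```

Because Spark may merge values in any order, the function passed to ```.reduceByKey()``` should be associative and commutative, as multiplication and addition are here.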

```.reduce()``` - merge all the values of the RDD with the given function and return a single value

In [4]:
rdd1.reduce(lambda x,y: x+y)

121

## Grouping

```.groupByKey()``` - group the values that share the same key

In [5]:
rdd.groupByKey().mapValues(list).collect()

[('a', [3, 6, 12]), ('b', [6]), ('c', [9, 15])]

```.groupBy()``` - return an RDD whose elements are grouped by the result of the given function

In [6]:
rdd1.groupBy(lambda x: x % 2).mapValues(list).collect()

[(0, [4, 8, 4, 6, 8]), (1, [5, 7, 9, 35, 23, 5, 7])]
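The grouping above can be modelled in plain Python as follows. This is a minimal sketch of the semantics only; the helper `group_by` is hypothetical, and in Spark the order of the groups and of the values inside each group is not guaranteed.

```python
from collections import defaultdict

def group_by(values, func):
    """Group values by func(value), modelling rdd.groupBy(func).mapValues(list)."""
    groups = defaultdict(list)
    for value in values:
        groups[func(value)].append(value)
    return sorted(groups.items())

values = [4, 5, 7, 8, 9, 35, 4, 23, 5, 6, 7, 8]
by_parity = group_by(values, lambda x: x % 2)
print(by_parity)

# A typical follow-up step: aggregate each group after grouping.
sum_by_parity = {key: sum(group) for key, group in by_parity}
print(sum_by_parity)
```

When the goal is only an aggregate per key, such as the sums above, the Spark documentation recommends ```.reduceByKey()``` or ```.aggregateByKey()``` over grouping first, since they combine values before the shuffle.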

```.aggregate()``` - aggregate the elements of each partition, then combine the results of all partitions

Arguments:
- ```zeroValue``` - initial value of the accumulator
- ```seqOp``` - function that accumulates the elements within each partition
- ```combOp``` - function that combines the results of all partitions

In [7]:
zeroValue = (0,0)
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))

rdd1.aggregate(zeroValue, seqOp, combOp)

(121, 12)
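The two-stage evaluation can be sketched in plain Python. The split of `rdd1` into three partitions below is an assumption made for illustration (Spark decides the real layout), and the helper `aggregate` is hypothetical. The sketch also shows the usual reason for carrying a `(sum, count)` pair: computing the mean.

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    """Model rdd.aggregate: seq_op inside each partition, comb_op across them."""
    # Stage 1: fold the elements of every partition into one accumulator each.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # Stage 2: merge the per-partition accumulators into the final result.
    return reduce(comb_op, per_partition, zero_value)

# Assumed partition layout of rdd1, for illustration only.
partitions = [[4, 5, 7, 8], [9, 35, 4, 23], [5, 6, 7, 8]]

seq_op = lambda acc, value: (acc[0] + value, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
mean = total / count
print(total, count, mean)
```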

```.aggregateByKey()``` - aggregate the values of each key, using the same three arguments as ```.aggregate()```


In [8]:
rdd.aggregateByKey(zeroValue, seqOp, combOp).collect()

[('a', (21, 3)), ('b', (6, 1)), ('c', (24, 2))]

```.fold()``` - aggregate the elements of each partition, and then the results of all partitions, using the given function and a neutral zero value (0 for addition, 1 for multiplication)


In [9]:
rdd1.fold(0, lambda x,y: x + y)

121

In [10]:
rdd1.fold(1, lambda x,y: x * y)

54528768000
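The zero value passed to ```.fold()``` needs to be neutral for the operation, because it is applied once per partition and once more when the partition results are merged. The plain-Python sketch below illustrates this; the two-partition layout is assumed for illustration and the helper `fold` is hypothetical.

```python
from functools import reduce

def fold(partitions, zero_value, op):
    """Model rdd.fold: the same op and zero value are used in both stages."""
    per_partition = [reduce(op, part, zero_value) for part in partitions]
    return reduce(op, per_partition, zero_value)

# Assumed partition layout of rdd1, for illustration only.
partitions = [[4, 5, 7, 8, 9, 35], [4, 23, 5, 6, 7, 8]]
add = lambda x, y: x + y

neutral = fold(partitions, 0, add)      # 0 is neutral for addition
non_neutral = fold(partitions, 1, add)  # 1 is added three times: 2 partitions + 1 merge
print(neutral, non_neutral)
```

With a non-neutral zero value the result depends on the number of partitions, so the same code can return different answers on different cluster configurations.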

```.foldByKey()``` - merge the values of each key using the given function and a neutral zero value

In [11]:
rdd.foldByKey(0, lambda x,y: x + y).collect()

[('a', 21), ('b', 6), ('c', 24)]

```.keyBy()``` - create ```(key, value)``` tuples, where the key is the result of applying the given function to each element

In [12]:
sc.parallelize(range(0,10),2).keyBy(lambda x: x).collect()

[(0, 0),
 (1, 1),
 (2, 2),
 (3, 3),
 (4, 4),
 (5, 5),
 (6, 6),
 (7, 7),
 (8, 8),
 (9, 9)]

In [13]:
sc.parallelize(range(0,10),2).keyBy(lambda x: x+x).collect()

[(0, 0),
 (2, 1),
 (4, 2),
 (6, 3),
 (8, 4),
 (10, 5),
 (12, 6),
 (14, 7),
 (16, 8),
 (18, 9)]
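In both outputs above the original element ends up as the value and the function result as the key. A plain-Python sketch of that mapping, using a hypothetical helper `key_by` and a made-up list of words:

```python
def key_by(values, func):
    """Model rdd.keyBy(func): pair each element with the key func(element)."""
    return [(func(value), value) for value in values]

# Hypothetical sample data: key each word by its length.
words = ["spark", "rdd", "key", "value"]
by_length = key_by(words, len)
print(by_length)
```

The resulting pairs can then be passed to the key-based operations shown earlier, such as ```.groupByKey()``` or ```.reduceByKey()```.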

## Some maths operations

```.subtract(p)``` - return each value in **self** that is not contained in **p**

In [14]:
rdd2 = sc.parallelize([13, 5, 18])
rdd3 = sc.parallelize([20, 11, 18])
print("rdd2: ", rdd2.collect())
print("rdd3: ", rdd3.collect())

rdd2:  [13, 5, 18]
rdd3:  [20, 11, 18]


In [15]:
rdd2.subtract(rdd3).collect()

[13, 5]

In [16]:
rdd3.subtract(rdd2).collect()

[20, 11]

```.subtractByKey(p)``` - return each ```(key, value)``` pair in **self** whose key is not contained in **p**

In [17]:
rdd4 = sc.parallelize([("g", 3),("b", 6),("e", 6),("f", 9),("a", 12),("d", 18)])
print("rdd: ", rdd.collect())
print("rdd4: ", rdd4.collect())

rdd:  [('a', 3), ('b', 6), ('a', 6), ('c', 9), ('a', 12), ('c', 15)]
rdd4:  [('g', 3), ('b', 6), ('e', 6), ('f', 9), ('a', 12), ('d', 18)]


In [18]:
rdd.subtractByKey(rdd4).collect()

[('c', 9), ('c', 15)]

In [19]:
rdd4.subtractByKey(rdd).collect()

[('g', 3), ('d', 18), ('e', 6), ('f', 9)]
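The difference between the two subtraction operations can be sketched in plain Python: ```.subtract()``` compares whole elements, while ```.subtractByKey()``` compares only keys and ignores values. The helpers `subtract` and `subtract_by_key` are hypothetical models, and unlike these lists, Spark does not guarantee the order of the result.

```python
def subtract(left, right):
    """Keep the elements of left that do not appear in right."""
    excluded = set(right)
    return [value for value in left if value not in excluded]

def subtract_by_key(left, right):
    """Keep the (key, value) pairs of left whose key is not a key of right."""
    excluded_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in excluded_keys]

rdd2_values = [13, 5, 18]
rdd3_values = [20, 11, 18]
print(subtract(rdd2_values, rdd3_values))

rdd_pairs = [("a", 3), ("b", 6), ("a", 6), ("c", 9), ("a", 12), ("c", 15)]
rdd4_pairs = [("g", 3), ("b", 6), ("e", 6), ("f", 9), ("a", 12), ("d", 18)]
# ("a", 3) is dropped even though rdd4 only holds ("a", 12): only the key matters.
print(subtract_by_key(rdd_pairs, rdd4_pairs))
```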

```.cartesian()``` - return the Cartesian product of this RDD and another one, that is, every pair ```(a, b)``` with ```a``` from the first RDD and ```b``` from the second

In [20]:
rdd2.cartesian(rdd3).collect()

[(13, 20),
 (13, 11),
 (13, 18),
 (5, 20),
 (5, 11),
 (5, 18),
 (18, 20),
 (18, 11),
 (18, 18)]
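For small local lists, the same pairs can be produced with `itertools.product` from the Python standard library. This is only a local analogue for checking the idea; the useful takeaway is that the result has `len(left) * len(right)` elements, so the Cartesian product grows quickly on large RDDs.

```python
from itertools import product

left = [13, 5, 18]
right = [20, 11, 18]

# Every element of left paired with every element of right.
pairs = list(product(left, right))
print(pairs)
print(len(pairs) == len(left) * len(right))
```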

## Sorting

```.sortBy(f)``` - sort the RDD by the given function ```f(x)```

In [21]:
rdd.sortBy(lambda x: x[0]).collect()

[('a', 3), ('a', 6), ('a', 12), ('b', 6), ('c', 9), ('c', 15)]

In [22]:
rdd.sortBy(lambda x: x[1]).collect()

[('a', 3), ('b', 6), ('a', 6), ('c', 9), ('a', 12), ('c', 15)]

```.sortByKey()``` - sort the RDD by keys

In [23]:
rdd4.sortByKey().collect()

[('a', 12), ('b', 6), ('d', 18), ('e', 6), ('f', 9), ('g', 3)]
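The sort functions behave like Python's built-in `sorted` with a `key` function, as the plain-Python sketch below shows on the contents of `rdd4`. Both ```.sortBy()``` and ```.sortByKey()``` also accept `ascending=False` for descending order, which corresponds to `reverse=True` here.

```python
pairs = [("g", 3), ("b", 6), ("e", 6), ("f", 9), ("a", 12), ("d", 18)]

# Analogue of rdd4.sortByKey(): order by the first item of each pair.
by_key = sorted(pairs, key=lambda kv: kv[0])

# Analogue of rdd4.sortBy(lambda x: x[1], ascending=False): largest value first.
by_value_desc = sorted(pairs, key=lambda kv: kv[1], reverse=True)

print(by_key)
print(by_value_desc)
```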

## Increasing and decreasing the number of partitions

If you want to increase or decrease the number of partitions in your RDD, consider:

  - ```.repartition(n)``` - return a new RDD with n partitions. This function always uses a shuffle to redistribute the data, so it can both increase and decrease the number of partitions.
  - ```.coalesce(n)``` - return a new RDD reduced to n partitions. By default this function avoids a shuffle by merging existing partitions, so it performs better than ```.repartition()``` when you only need fewer partitions.

In [24]:
rdd5 = sc.parallelize([1,3,12,42,55,5,6,8,34,54,7,9,93,5], 1)
print("Before repartition:", rdd5.getNumPartitions())
rdd5 = rdd5.repartition(2)
print("After repartition:", rdd5.getNumPartitions())

Before repartition: 1
After repartition: 2


In [25]:
rdd6 = sc.parallelize([1,3,12,42,55,5,6,8,34,54,7,9,93,5], 4)
print("Before coalesce:", rdd6.getNumPartitions())
rdd6 = rdd6.coalesce(1)
print("After coalesce:", rdd6.getNumPartitions())

Before coalesce: 4
After coalesce: 1
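The difference between the two functions can be illustrated with a deliberately simplified plain-Python model. The partition layout, the helper names `coalesce` and `repartition`, and the way elements are assigned to partitions are all assumptions for illustration; Spark's real placement logic is different. The point is only that coalescing merges whole partitions, while repartitioning moves individual elements.

```python
def coalesce(partitions, n):
    """Merge whole partitions into at most n groups; no element is redistributed."""
    n = min(n, len(partitions))  # without a shuffle the count cannot grow
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every element across n new partitions (models a shuffle)."""
    flat = [value for part in partitions for value in part]
    return [flat[i::n] for i in range(n)]

# Assumed layout of the 14 elements over 4 partitions, for illustration only.
partitions = [[1, 3, 12], [42, 55, 5], [6, 8, 34, 54], [7, 9, 93, 5]]

print(coalesce(partitions, 2))     # neighbouring partitions are concatenated
print(len(coalesce(partitions, 8)))  # still 4: coalescing alone cannot add partitions
print(repartition(partitions, 8))  # elements are spread over 8 partitions
```

In PySpark, ```.coalesce()``` takes an optional `shuffle` argument that defaults to `False`; passing `shuffle=True` is what allows it to increase the number of partitions.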


## How to save my RDD

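Use ```.saveAsTextFile()``` to save your RDD. Spark writes the output as a directory of part files at the given path, and the call fails if that path already exists.

PS: This function did not work for me in Jupyter Notebook; it did work when I tried it in JupyterLab and IPython.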

In [27]:
rdd.saveAsTextFile("sample_data/202-rdd.txt")

In [28]:
rdd_temp = sc.textFile("sample_data/202-rdd.txt")
rdd_temp.collect()

["('a', 3)", "('b', 6)", "('a', 6)", "('c', 9)", "('a', 12)", "('c', 15)"]
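Note that the elements come back as strings, not tuples, because the text file stores each element's string representation. One way to recover the original tuples is `ast.literal_eval` from the Python standard library, sketched here on a local copy of the output above. In Spark the analogous step would presumably be `rdd_temp.map(ast.literal_eval)`; treat that as an assumption to check against your own data.

```python
import ast

# Local copy of the lines returned by rdd_temp.collect() above.
lines = ["('a', 3)", "('b', 6)", "('a', 6)", "('c', 9)", "('a', 12)", "('c', 15)"]

# literal_eval safely parses Python literals (tuples, strings, numbers).
pairs = [ast.literal_eval(line) for line in lines]
print(pairs)
print(type(pairs[0]), type(pairs[0][1]))
```

`ast.literal_eval` only works for elements that are plain Python literals; for anything more complex, a structured format is a safer choice than text.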

It is also possible to save a key-value RDD as a Hadoop file with the following function:

```.saveAsHadoopFile("path", "org.apache.hadoop.mapred.TextOutputFormat")```

Finish this notebook with ```sc.stop()``` to shut down the SparkContext.

In [29]:
sc.stop()