# Spark RDD transformations

1. **map(func)** 

Returns a new distributed dataset formed by passing each element of the source through the function func.

In [1]:
rdd = sc.parallelize(["b", "a", "c"])

In [2]:
sorted(rdd.map(lambda x: (x, 1)).collect())

[('a', 1), ('b', 1), ('c', 1)]

2. **filter(func)**

Returns a new dataset formed by selecting those elements of the source for which func returns true.

In [3]:
rdd = sc.parallelize([1, 4, 7, 8, 9])

# Keep only the numbers divisible by 2
rdd.filter(lambda x: x % 2 == 0).collect()

[4, 8]

3. **flatMap(func)**

Similar to map, but each input item can be mapped to 0 or more output items, so func should return an iterable (such as a list or range) rather than a single item. The results are flattened into one dataset.

In [4]:
# rdd is still [1, 4, 7, 8, 9] from In [3]; range(1, 1) is empty, so 1 contributes nothing
rdd.flatMap(lambda x: range(1, x)).collect()

[1, 2, 3, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8]
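The difference between map and flatMap can be sketched in plain Python, with no Spark required. The helpers `local_map` and `local_flat_map` are hypothetical stand-ins written for this illustration, not part of the Spark API; they only mimic the semantics on an ordinary list.

```python
from itertools import chain

# Hypothetical local stand-ins for RDD.map and RDD.flatMap (not Spark APIs).
def local_map(func, items):
    # One output element per input element.
    return [func(x) for x in items]

def local_flat_map(func, items):
    # Apply func to each element, then concatenate the resulting iterables.
    return list(chain.from_iterable(func(x) for x in items))

data = [1, 4, 7, 8, 9]
nested = local_map(lambda x: list(range(1, x)), data)
flat = local_flat_map(lambda x: range(1, x), data)

print(nested[:2])  # [[], [1, 2, 3]]
print(flat[:3])    # [1, 2, 3]
```

With map the result keeps one (possibly empty) list per input element, while flatMap merges those lists into a single flat sequence.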

4. **union(rdd)**

Returns the union of this RDD and another one. Duplicate elements are kept.


In [5]:
rdd = sc.parallelize([1, 2, 3])

# Union the above RDD with itself. 
rdd.union(rdd).collect()

[1, 2, 3, 1, 2, 3]
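As the output shows, union behaves like concatenation rather than a mathematical set union. A plain-Python analogue, offered only as a sketch of the semantics on ordinary lists (the variable names are made up for this illustration):

```python
# Plain-Python analogue: RDD.union behaves like list concatenation,
# not like a mathematical set union.
left = [1, 2, 3]
right = [1, 2, 3]

concatenated = left + right               # analogue of rdd.union(rdd)
deduplicated = sorted(set(concatenated))  # analogue of following up with distinct()

print(concatenated)  # [1, 2, 3, 1, 2, 3]
print(deduplicated)  # [1, 2, 3]
```

To get set-like behaviour in Spark, follow union with distinct(), described in the next item.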

5. **distinct([numPartitions])**

Returns a new RDD containing the distinct elements in this RDD.

In [6]:
# The order of collect() after distinct() is not guaranteed, so sort for a stable result
sorted(sc.parallelize([1, 1, 1, 2, 3]).distinct().collect())

[1, 2, 3]

6. **cartesian(other)**


Returns the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in this RDD and b is in other.

In [7]:
# rdd is still [1, 2, 3] from In [5]
rdd.cartesian(rdd).collect()

[(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)]
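The same pairing can be reproduced locally with `itertools.product`, which makes it easy to see that the result has one pair for every combination of elements. This is only a plain-Python sketch of the semantics, not how Spark computes the product across partitions.

```python
from itertools import product

left = [1, 2, 3]
right = [1, 2, 3]

# Every (a, b) with a taken from left and b taken from right.
pairs = list(product(left, right))

print(len(pairs))  # 9
print(pairs[:4])   # [(1, 1), (1, 2), (1, 3), (2, 1)]
```

The number of pairs is the product of the two input sizes, so cartesian on large RDDs produces very large results.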