## RDD (Resilient Distributed Datasets)

1. It is a distributed data structure in Spark used for parallel data processing.
2. It is fault-tolerant and efficiently processes large datasets across a cluster.

#### Characteristics
1. Immutable: Each transformation creates new RDD.
2. Distributed: Data is partitioned and processed in parallel.
3. Resilient: Tracks the lineage of transformations so that lost partitions can be recomputed.
4. Lazy evaluation: Spark builds an execution plan and evaluates transformations only when an action needs the result.
5. Fault-tolerant operations: map, filter, reduce, count, etc.
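The immutability, lineage, and lazy-evaluation properties above can be illustrated with a toy model in plain Python. This is only a sketch: `ToyRDD` is a hypothetical class invented here, not part of Spark, and it ignores partitioning and the cluster entirely.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # original input data, never modified
        self._lineage = tuple(lineage)   # recorded transformations, in order

    def map(self, fn):
        # Transformation: return a NEW object with a longer lineage; nothing runs yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded, not executed.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded lineage over a copy of the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records every element the mapped function actually touches


def double(x):
    calls.append(x)
    return x * 2


base = ToyRDD([2, 4, 1, 5, 6, 7])
doubled = base.map(double)                # lazy: `double` has not run yet
calls_before_action = len(calls)          # still 0 at this point
result = doubled.filter(lambda x: x > 8).collect()  # the action triggers the work
```

Because the result can always be replayed from the source and the recorded steps, a lost result can be recomputed rather than restored from a copy, which is the idea behind RDD fault tolerance.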

#### Transformations
1. Create a new RDD by applying a computation to an existing RDD.
2. Lazily evaluated.
3. Examples: map, filter, reduceByKey, sortBy, join, etc.

#### Actions
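1. Return a result to the driver or perform a side effect (such as writing output) on an RDD.
2. Eagerly evaluated: calling an action triggers execution.
3. Examples: collect, count, first, foreach, saveAsTextFile.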

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp-RDD") \
    .getOrCreate()

sc = spark.sparkContext

### Creating RDD using Iterable

In [3]:
myList = [2, 4, 1, 5, 6, 7]

rdd = sc.parallelize(myList)

In [4]:
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

### Sample RDD Transformations and Actions

#### Actions

In [5]:
rdd.collect()

[2, 4, 1, 5, 6, 7]

In [6]:
myList = [(1, "Paul", 32), (2, "Tina", 45), (3, "John", 28)]
rdd = sc.parallelize(myList)

rdd.collect()

[(1, 'Paul', 32), (2, 'Tina', 45), (3, 'John', 28)]

In [7]:
rdd.count()

3

In [8]:
rdd.first()

(1, 'Paul', 32)

In [9]:
myList = ["mobile", "pc", "laptop", "monitor", "mouse"]
rdd = sc.parallelize(myList)

rdd.collect()

['mobile', 'pc', 'laptop', 'monitor', 'mouse']

#### Transformations

In [10]:
rdd.map(lambda x: x.upper()).collect()

['MOBILE', 'PC', 'LAPTOP', 'MONITOR', 'MOUSE']

In [11]:
rdd.filter(lambda x: x[0] == "m").collect()

['mobile', 'monitor', 'mouse']

In [12]:
myList = [2, 1, 3, 5, 6, 4]
rdd = sc.parallelize(myList)

rdd.collect()

[2, 1, 3, 5, 6, 4]

In [13]:
rdd.filter(lambda x: x%2 == 0).collect()

[2, 6, 4]

In [14]:
# Note: reduce is an action, not a transformation; it returns a value to the driver.
rdd.reduce(lambda x, y: x + y)

21

In [15]:
rdd.collect()

[2, 1, 3, 5, 6, 4]

In [16]:
rdd.sortBy(lambda x:x).collect()

[1, 2, 3, 4, 5, 6]

In [17]:
myList = [(1, "Paul", 32), (2, "Tina", 45), (3, "John", 28)]
rdd = sc.parallelize(myList)

rdd.collect()

[(1, 'Paul', 32), (2, 'Tina', 45), (3, 'John', 28)]

In [18]:
rdd.sortBy(keyfunc=lambda x:x[2]).collect()

[(3, 'John', 28), (1, 'Paul', 32), (2, 'Tina', 45)]

In [19]:
rdd.sortBy(keyfunc=lambda x:x[2], ascending=False).collect()

[(2, 'Tina', 45), (1, 'Paul', 32), (3, 'John', 28)]

In [20]:
sc.stop()

In [21]:
spark.stop()