## RDD (Resilient Distributed Datasets)

1. An RDD is a distributed data structure in Spark used for parallel data processing.
2. It is fault-tolerant and efficiently processes large datasets across a cluster.

#### Characteristics
1. Immutable: Each transformation creates a new RDD; the original is never modified.
2. Distributed: Data is partitioned and processed in parallel.
3. Resilient: The lineage of transformations is tracked, so a lost partition can be recomputed.
4. Lazy evaluation: The execution plan is optimized, and transformations are evaluated only when an action needs the result.
5. Fault-tolerant operations: Operations such as map, filter, reduce, and count can be recomputed from the lineage after a failure.
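
To make the immutable, lazy, and lineage-tracking properties above concrete, here is a minimal plain-Python sketch. `ToyRDD` is a hypothetical single-machine class invented for illustration; it is not part of PySpark and does not reflect how Spark is implemented internally (there are no partitions and no cluster). It only models the idea that transformations are recorded first and replayed when an action runs.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not a PySpark class)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)     # original input, never modified
        self._lineage = tuple(lineage)   # recorded steps, not yet executed

    def map(self, fn):
        # Transformation: returns a NEW object and leaves self untouched.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: only records the step, nothing is computed here.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:  # "filter"
                data = [x for x in data if fn(x)]
        return data

    def lineage(self):
        # Names of the recorded steps, in order.
        return [kind for kind, _ in self._lineage]


calls = []  # records every time the mapped function actually runs


def double(x):
    calls.append(x)
    return x * 2


base = ToyRDD([2, 4, 1, 5, 6, 7])
doubled = base.map(double)               # nothing executes yet
big = doubled.filter(lambda x: x > 8)    # still nothing executes

calls_before_action = len(calls)         # 0, because transformations are lazy
result = big.collect()                   # the action triggers the work

print(calls_before_action)  # 0
print(result)               # [10, 12, 14]
print(big.lineage())        # ['map', 'filter']
print(base.collect())       # [2, 4, 1, 5, 6, 7] (the base data is unchanged)
```

Because the steps are kept alongside the source data, a result can in principle be rebuilt by replaying them, which is the intuition behind RDD resilience.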

#### Transformations
1. Create a new RDD by applying a computation to an existing RDD.
2. Lazily evaluated.
3. Examples: map, filter, reduceByKey, sortBy, join, etc.
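
Of the transformations listed above, `reduceByKey` is not demonstrated in the cells that follow, so here is a rough plain-Python sketch of what it computes: values that share a key are combined with a two-argument function. The helper `reduce_by_key` is hypothetical and runs on a single machine; it illustrates the semantics only, not Spark's distributed execution. The data mirrors the employee tuples used later in this notebook.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, fn):
    """Combine all values that share a key using fn (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sorted only to make the output deterministic for this sketch.
    return sorted((key, reduce(fn, values)) for key, values in grouped.items())


employees = [
    (1, "Paul", 32000, "HR"),
    (2, "Tina", 45000, "HR"),
    (3, "John", 28000, "IT"),
    (4, "Mike", 36000, "IT"),
    (5, "David", 34000, "Sales"),
]

# Build (key, value) pairs: department is the key, salary is the value.
dept_salary = [(emp[3], emp[2]) for emp in employees]

totals = reduce_by_key(dept_salary, lambda a, b: a + b)
print(totals)  # [('HR', 77000), ('IT', 64000), ('Sales', 34000)]

# The PySpark counterpart (not executed here) would look like:
#   rdd.map(lambda emp: (emp[3], emp[2])).reduceByKey(lambda a, b: a + b).collect()
# Spark does not guarantee the order of the resulting pairs.
```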

#### Actions
1. Return a result to the driver or write output from an RDD.
2. Eager evaluation: calling an action triggers execution of the pending transformations.
3. Examples: collect, count, first, foreach, saveAsTextFile.

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MySparkApp-RDD") \
    .getOrCreate()

sc = spark.sparkContext

### Creating RDD using Iterable

In [3]:
myList = [2, 4, 1, 5, 6, 7]

rdd = sc.parallelize(myList)

In [4]:
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289

### Actions

In [5]:
rdd.collect()

[2, 4, 1, 5, 6, 7]

In [6]:
myList = [(1, "Paul", 32), (2, "Tina", 45), (3, "John", 28)]
rdd = sc.parallelize(myList)

rdd.collect()

[(1, 'Paul', 32), (2, 'Tina', 45), (3, 'John', 28)]

In [7]:
rdd.count()

3

In [8]:
rdd.first()

(1, 'Paul', 32)

In [9]:
myList = ["mobile", "pc", "laptop", "monitor", "mouse"]
rdd = sc.parallelize(myList)

rdd.collect()

['mobile', 'pc', 'laptop', 'monitor', 'mouse']

### Transformations

### map()

In [29]:
data = [(1, "Paul", 32000, "HR"), (2, "Tina", 45000, "HR"), (3, "John", 28000, "IT"), (4, "Mike", 36000, "IT"), (5, "David", 34000, "Sales")]

rdd = sc.parallelize(data)
rdd.collect()

[(1, 'Paul', 32000, 'HR'),
 (2, 'Tina', 45000, 'HR'),
 (3, 'John', 28000, 'IT'),
 (4, 'Mike', 36000, 'IT'),
 (5, 'David', 34000, 'Sales')]

In [23]:
rdd.map(lambda emp: (emp[2] + emp[2] * 0.4)).collect()

[44800.0, 63000.0, 39200.0, 50400.0, 47600.0]

In [24]:
rdd.map(lambda emp: {
    "id": emp[0],
    "name": emp[1],
    "salary": emp[2],
    "dept": emp[3],
    "increment": emp[2] * 0.4
}).collect()

[{'id': 1,
  'name': 'Paul',
  'salary': 32000,
  'dept': 'HR',
  'increment': 12800.0},
 {'id': 2,
  'name': 'Tina',
  'salary': 45000,
  'dept': 'HR',
  'increment': 18000.0},
 {'id': 3,
  'name': 'John',
  'salary': 28000,
  'dept': 'IT',
  'increment': 11200.0},
 {'id': 4,
  'name': 'Mike',
  'salary': 36000,
  'dept': 'IT',
  'increment': 14400.0},
 {'id': 5,
  'name': 'David',
  'salary': 34000,
  'dept': 'Sales',
  'increment': 13600.0}]

### filter()

In [30]:
rdd.filter(lambda emp: emp[2] > 35000).collect()

[(2, 'Tina', 45000, 'HR'), (4, 'Mike', 36000, 'IT')]

In [31]:
myList = [2, 1, 3, 5, 6, 4]
rdd_sample = sc.parallelize(myList)

rdd_sample.collect()

[2, 1, 3, 5, 6, 4]

In [32]:
rdd_sample.filter(lambda x: x%2 == 0).collect()

[2, 6, 4]

### reduce() (an action, not a transformation)

In [None]:
rdd_sample.reduce(lambda x,y: x+y)

21
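
PySpark documents `reduce` as taking a commutative and associative binary operator, because each partition is reduced on its own and the partial results are then combined. The plain-Python sketch below imitates that two-step process to show why the requirement matters. `partitioned_reduce` is a hypothetical helper, and its even chunking is an assumption made for illustration; it is not how Spark actually assigns elements to partitions.

```python
from functools import reduce


def partitioned_reduce(data, fn, num_partitions):
    """Reduce each chunk separately, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(fn, part) for part in partitions]
    return reduce(fn, partials)


values = [2, 1, 3, 5, 6, 4]

# Addition is commutative and associative: the partition count does not matter.
add_two = partitioned_reduce(values, lambda x, y: x + y, 2)
add_three = partitioned_reduce(values, lambda x, y: x + y, 3)
print(add_two, add_three)  # 21 21

# Subtraction is neither: the answer depends on how the data is split.
sub_sequential = reduce(lambda x, y: x - y, values)
sub_two = partitioned_reduce(values, lambda x, y: x - y, 2)
sub_three = partitioned_reduce(values, lambda x, y: x - y, 3)
print(sub_sequential, sub_two, sub_three)  # -17 3 1
```

With an operator like subtraction, the value returned by `reduce` on a real RDD could change with the number of partitions, so it is safer to stick to operators such as sum, min, or max.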

### sortBy

In [34]:
rdd_sample.sortBy(lambda x:x).collect()

[1, 2, 3, 4, 5, 6]

In [39]:
myList = [(1, "Paul", 32), (2, "Tina", 45), (3, "John", 28), (4, "Mike", 25), (5, "David", 36)]
rdd = sc.parallelize(myList)

rdd.collect()

[(1, 'Paul', 32),
 (2, 'Tina', 45),
 (3, 'John', 28),
 (4, 'Mike', 25),
 (5, 'David', 36)]

In [40]:
rdd.sortBy(keyfunc=lambda x:x[2]).collect()

[(4, 'Mike', 25),
 (3, 'John', 28),
 (1, 'Paul', 32),
 (5, 'David', 36),
 (2, 'Tina', 45)]

In [41]:
rdd.sortBy(keyfunc=lambda x:x[2], ascending=False).collect()

[(2, 'Tina', 45),
 (5, 'David', 36),
 (1, 'Paul', 32),
 (3, 'John', 28),
 (4, 'Mike', 25)]

### Convert RDD to DataFrame

In [42]:
data = [(1, "Paul", 32000, "HR"), (2, "Tina", 45000, "HR"), (3, "John", 28000, "IT"), (4, "Mike", 36000, "IT"), (5, "David", 34000, "Sales")]

rdd = sc.parallelize(data)
rdd.collect()

[(1, 'Paul', 32000, 'HR'),
 (2, 'Tina', 45000, 'HR'),
 (3, 'John', 28000, 'IT'),
 (4, 'Mike', 36000, 'IT'),
 (5, 'David', 34000, 'Sales')]

In [43]:
df = rdd.toDF(schema=["id", "name", "salary", "department"])
df.show()

+---+-----+------+----------+
| id| name|salary|department|
+---+-----+------+----------+
|  1| Paul| 32000|        HR|
|  2| Tina| 45000|        HR|
|  3| John| 28000|        IT|
|  4| Mike| 36000|        IT|
|  5|David| 34000|     Sales|
+---+-----+------+----------+



#### DataFrame to RDD

In [44]:
rdd = df.rdd.map(lambda emp: tuple(emp))
rdd.collect()

[(1, 'Paul', 32000, 'HR'),
 (2, 'Tina', 45000, 'HR'),
 (3, 'John', 28000, 'IT'),
 (4, 'Mike', 36000, 'IT'),
 (5, 'David', 34000, 'Sales')]

### How to use map on DataFrame?
1. Convert DataFrame to RDD.
2. Apply map (with whatever transformation logic is needed) to get a new mapped RDD.
3. Convert the mapped RDD to a new DataFrame using the toDF() function.

In [49]:
df.show()

+---+-----+------+----------+
| id| name|salary|department|
+---+-----+------+----------+
|  1| Paul| 32000|        HR|
|  2| Tina| 45000|        HR|
|  3| John| 28000|        IT|
|  4| Mike| 36000|        IT|
|  5|David| 34000|     Sales|
+---+-----+------+----------+



In [51]:
mapped_rdd = df.rdd.map(lambda emp: (
        "EMP00" + str(emp[0]), 
        emp[1].upper(),
        emp[2],
        emp[3],
        emp[2] * 0.4
    )
)

mapped_rdd.collect()

[('EMP001', 'PAUL', 32000, 'HR', 12800.0),
 ('EMP002', 'TINA', 45000, 'HR', 18000.0),
 ('EMP003', 'JOHN', 28000, 'IT', 11200.0),
 ('EMP004', 'MIKE', 36000, 'IT', 14400.0),
 ('EMP005', 'DAVID', 34000, 'Sales', 13600.0)]

In [52]:
mapped_df = mapped_rdd.toDF(schema=["id", "name", "salary", "department", "increment"])
mapped_df.show()

+------+-----+------+----------+---------+
|    id| name|salary|department|increment|
+------+-----+------+----------+---------+
|EMP001| PAUL| 32000|        HR|  12800.0|
|EMP002| TINA| 45000|        HR|  18000.0|
|EMP003| JOHN| 28000|        IT|  11200.0|
|EMP004| MIKE| 36000|        IT|  14400.0|
|EMP005|DAVID| 34000|     Sales|  13600.0|
+------+-----+------+----------+---------+



In [53]:
sc.stop()

In [54]:
spark.stop()