In [None]:
What is RDD (Resilient Distributed Dataset)?
An RDD (Resilient Distributed Dataset) is the core data abstraction in PySpark: a fault-tolerant,
distributed collection of objects. RDDs are immutable, so once an RDD is created it cannot be changed.
Each RDD is also split into logical partitions, which lets Spark compute on it in parallel across the nodes
of a cluster.

An RDD resembles a Python list, with one key difference: an RDD is processed by several processes spread
across multiple physical servers (the nodes of a cluster), whereas a Python collection lives and is processed in a single process.

In [None]:
PySpark RDD Benefits
PySpark is widely adopted in the machine learning and data science communities because of its advantages over single-process Python programs.

In-Memory Processing
PySpark loads data from disk, processes it in memory, and can keep it in memory between steps. This is the main difference between PySpark and MapReduce, which is I/O intensive. Between transformations, we can also cache or persist an RDD in memory to reuse earlier computations.

Immutability
PySpark RDDs are immutable: once an RDD is created, you cannot modify it. When we apply a transformation to an RDD, PySpark creates a new RDD and records how it was derived from the original (the RDD lineage).

Fault Tolerance
PySpark typically reads from fault-tolerant data stores such as HDFS and S3. If a partition of an RDD is lost, Spark recomputes it from the source data by replaying the RDD lineage. In addition, when a PySpark application runs on a cluster, failed tasks are automatically retried a configurable number of times, so the application can still finish.

Lazy Evaluation
PySpark does not evaluate RDD transformations as the driver encounters them. Instead, it records each transformation in a DAG (directed acyclic graph) and evaluates them all when it reaches the first RDD action.

Partitioning
When you create an RDD from data, PySpark partitions its elements automatically. By default, the number of partitions is derived from the default parallelism, which in local mode is the number of cores made available to Spark.

In [None]:
RDD Creation
You can create an RDD in two ways:

1. Parallelizing an existing collection
2. Referencing a dataset in an external storage system

Using sparkContext.parallelize()
You can create an RDD with the parallelize() function of SparkContext (sparkContext.parallelize()). This function distributes an existing collection from your driver program to form a parallelized RDD. Use this approach when the data is already in memory, whether it was loaded from a file or from a database.
All of the data must be present in the driver program before the RDD is created.

In [18]:
# Imports
from pyspark.sql import SparkSession
from pyspark import SparkContext

# Spark session & context
spark = (SparkSession
         .builder
         .master("local")
         .appName("SparkbyExamples.com")
         .getOrCreate())

# Create RDD from parallelize    
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)

In [None]:
Using sparkContext.textFile()
Use the textFile() method to read a text file into an RDD, with one record per line.

In [9]:
# Create RDD from external Data source
rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

In [None]:
Using sparkContext.wholeTextFiles()
The wholeTextFiles() function returns a pair RDD in which the key is the file path and the value is the file content.
Besides text files, we can also create RDDs from CSV, JSON, and other formats.

In [10]:
# Read each file into the RDD as a single (path, content) record
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")

In [None]:
Create an empty RDD using sparkContext.emptyRDD()
Using the emptyRDD() method on sparkContext, we can create an RDD with no data. This method creates an empty RDD with no partitions.

In [11]:
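# Create an empty RDD with no partitions.
# emptyRDD is a method, so it must be called with parentheses.
# A separate name keeps the earlier `rdd` available for later cells.
emptyRdd = spark.sparkContext.emptyRDD()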

In [13]:
# Create an empty RDD with partitions
emptyRdd2 = spark.sparkContext.parallelize([], 10)  # This creates 10 partitions

In [16]:
# getNumPartitions() is an RDD method that returns the number of partitions the dataset is split into.

# Get partition count
print("Initial partition count:"+str(rdd.getNumPartitions()))

Initial partition count:1


In [21]:
# Set partitions manually
spark.sparkContext.parallelize([1,2,3,4,56,7,8,9,12,3], 10)

ParallelCollectionRDD[9] at readRDDFromFile at PythonRDD.scala:274

In [22]:
# Repartition and Coalesce
# Sometimes we need to change the number of partitions of an RDD. PySpark provides two ways to do this: repartition(), which performs a full shuffle of data across all nodes, and coalesce(), which minimizes data movement when reducing the partition count. For example, if the data is in 4 partitions, coalesce(2) only moves the data of 2 of them.
# Both methods take the target number of partitions, as shown below. Note that repartition() is an expensive operation because it shuffles data across all nodes in the cluster.

# Repartition the RDD
reparRdd = rdd.repartition(4)
print("re-partition count:"+str(reparRdd.getNumPartitions()))

re-partition count:4
