### Introduction to RDD and ways to create RDD

#### What is an RDD?

A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark, and therefore of PySpark.
* Resilient - fault tolerant: a lost partition can be recomputed from the RDD's lineage. RDDs are also immutable
* Distributed Dataset - a collection split into logical partitions that may be processed on different nodes

Initially the data is present in the driver. When we create an RDD using the parallelize() method, the data is divided into logical partitions and distributed across the cluster (the executors). The driver keeps track of where each partition is located.


In [1]:
# creating spark session 
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-to-rdd").getOrCreate()
spark.getActiveSession()

24/10/31 15:49:24 WARN Utils: Your hostname, padmanabhan-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/10/31 15:49:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/31 15:49:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
# creating the Spark context - since Spark 2.0, SparkSession is the entry point, so we create the session first and then get the SparkContext from it

sc = spark.sparkContext

In [5]:
data = [1,2,3,4,5,6,7,8,9,10]
# create logical partitions and distribute the partitions
rdd = sc.parallelize(data)       
print(rdd)

# note: Spark won't run transformations until it encounters an action such as collect(), saveAsTextFile(), etc.

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:289


In [8]:
# Getting details about the RDD

print(rdd.getNumPartitions())
print("Action : Get first element from RDD = ",rdd.first())

4


Action : Get first element from RDD =  1

In [9]:
# another example of creating rdd using the list of tuples
data2 = [('padhu',2000),('karthik',5000),('selva',1000)]
rdd2 = sc.parallelize(data2)
print(rdd2.getNumPartitions())

In [26]:
rdd2.take(1)

# o/p : a list holding the first element of the RDD

[('padhu', 2000)]

In [16]:
rdd2.collect()

# default - 4 partitions for rdd2 in this environment
# glom() groups the elements of each partition into a list, so we can see which data sits in which partition

partitions = rdd2.glom().collect()

print(partitions)

[[], [('padhu', 2000)], [('karthik', 5000)], [('selva', 1000)]]

In [30]:
# overriding default partition count

rdd3 = sc.parallelize(data2,numSlices = 2)
partition3 = rdd3.glom().collect()
print(partition3)
print(rdd3.getNumPartitions())
print("check whether rdd is empty : ",rdd3.isEmpty())

[[('padhu', 2000)], [('karthik', 5000), ('selva', 1000)]]
2
check whether rdd is empty :  False
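The two layouts printed by glom() (an empty first partition with 4 slices, and a 1 + 2 split with 2 slices) are consistent with a simple even-slicing rule over index ranges. The sketch below is plain Python and does not call Spark. The helper names `slice_positions` and `split_into_partitions` are hypothetical, and the rule itself is an assumption about how the elements are cut. PySpark also serializes elements in batches before slicing, so larger lists may be laid out differently from what this sketch predicts.

```python
def slice_positions(length, num_slices):
    """Yield (start, end) index pairs that cut `length` items into `num_slices` ranges."""
    for i in range(num_slices):
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end


def split_into_partitions(items, num_slices):
    """Return one list per slice; a slice is empty when start == end."""
    return [items[start:end] for start, end in slice_positions(len(items), num_slices)]


data2 = [('padhu', 2000), ('karthik', 5000), ('selva', 1000)]

four = split_into_partitions(data2, 4)  # compare with rdd2.glom().collect()
two = split_into_partitions(data2, 2)   # compare with rdd3.glom().collect()

print(four)
print(two)
```

With three elements and four slices the first index range is empty, which is one way to explain the leading `[]` in the output of rdd2.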


In [29]:
# take(n) - returns the first n elements of the RDD as a list
rdd3.take(0)

# o/p : an empty list, since 0 elements were requested

[]

#### Converting an RDD into a DataFrame

Methods:
* rdd.toDF(schema) - the schema argument is optional
* spark.createDataFrame(rdd, schema)

In [33]:
df = rdd3.toDF()
df.show()

# note: column names default to _1, _2, _3, etc.

+-------+----+
|     _1|  _2|
+-------+----+
|  padhu|2000|
|karthik|5000|
|  selva|1000|
+-------+----+



In [34]:
df.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: long (nullable = true)



In [35]:
# defining column names while converting the RDD into a DataFrame

column_name = ["name","amount"]
df = rdd3.toDF(column_name)
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- amount: long (nullable = true)



In [36]:
# Convert the RDD into a DataFrame using createDataFrame()
df = spark.createDataFrame(rdd3,schema=column_name)
df.show(truncate = False)

+-------+------+
|name   |amount|
+-------+------+
|padhu  |2000  |
|karthik|5000  |
|selva  |1000  |
+-------+------+



#### Why use StructType and StructField?

By default, when an RDD is converted into a DataFrame, the data types are inferred from the data and every column is nullable. To override these defaults, we can define the schema explicitly with StructType and StructField.

In [38]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name",StringType(),False),
    StructField("amount",IntegerType(),True)])

df4 = spark.createDataFrame(rdd3, schema=schema)

# printSchema() and show() print directly and return None, so they are not wrapped in print()
df4.printSchema()
df4.show(truncate=True)

root
 |-- name: string (nullable = false)
 |-- amount: integer (nullable = true)

+-------+------+
|   name|amount|
+-------+------+
|  padhu|  2000|
|karthik|  5000|
|  selva|  1000|
+-------+------+