An RDD (Resilient Distributed Dataset) is a core building block of PySpark. It is a fault-tolerant, immutable, distributed collection of objects: once an RDD is created, it cannot be changed. The data in an RDD is split into partitions, which can be processed in parallel on the nodes of a cluster.

In [2]:
from pyspark.sql import SparkSession

# create spark session
spark = SparkSession.builder.master("local[1]").appName("SparkRDD").getOrCreate()

In [5]:
'''
There are three ways to create an RDD:
1. Using sparkContext.parallelize(): distributes an existing Python collection
2. Using sparkContext.textFile(): reads a text file, one record per line
3. Using sparkContext.wholeTextFiles(): reads each file into a single (filename, content) record, with the filename as key
'''

numbers=[1,2,3,1,4,5,8,9]
rdd=spark.sparkContext.parallelize(numbers)
print(rdd)
print(type(rdd))
rdd.collect()

ParallelCollectionRDD[1] at readRDDFromFile at PythonRDD.scala:289
<class 'pyspark.rdd.RDD'>


[1, 2, 3, 1, 4, 5, 8, 9]

In [8]:
rdd2 = spark.sparkContext.textFile("book.txt")

In [10]:
rdd3 = spark.sparkContext.wholeTextFiles("book.txt")

In [15]:
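# creating an empty RDD, without any partitions
# emptyRDD is a method, so it must be called; without () the variable would hold the method itself
rdd4 = spark.sparkContext.emptyRDD()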

In [18]:
# creating empty RDD, with partitions
rdd5=spark.sparkContext.parallelize([],10)

In [21]:
# When an RDD is created with any of the methods above, SparkContext automatically divides the data into partitions.
# By default, on a single machine or laptop (local mode), parallelize() uses as many partitions as the number of
# cores given to Spark, e.g. 1 for master("local[1]") as configured in this notebook.
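As a mental model only, the automatic split can be pictured as cutting the collection into contiguous slices of near-equal size. The sketch below is plain Python, not Spark code; `split_into_partitions` is a hypothetical helper name, and Spark's real slicing happens inside the engine and may lay the data out differently. To see the actual contents of each partition in the notebook, `rdd.glom().collect()` can be used.

```python
def split_into_partitions(data, num_partitions):
    """Cut `data` into `num_partitions` contiguous slices of near-equal size."""
    n = len(data)
    return [
        data[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]

numbers = [1, 2, 3, 1, 4, 5, 8, 9]

# One partition: everything stays together (the local[1] situation)
one_core = split_into_partitions(numbers, 1)

# Three partitions: contiguous, near-equal slices
three_slices = split_into_partitions(numbers, 3)

# An empty collection can still have partitions; they are simply all empty
empty_ten = split_into_partitions([], 10)

print(one_core)
print(three_slices)
print(len(empty_ten))
```

The last call mirrors the difference between an empty RDD with no partitions and an empty RDD that was created with an explicit partition count.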

In [22]:
# get number of partitions
print("Initial partition count:"+str(rdd.getNumPartitions()))

Initial partition count:1


In [23]:
# set the number of partitions explicitly by passing it as the second argument (10 here)
spark.sparkContext.parallelize([1,2,3,4,56,7,8,9,12,3], 10)

ParallelCollectionRDD[13] at readRDDFromFile at PythonRDD.scala:289

In [24]:
# the original rdd is not affected by the new RDD created above: it still has 1 partition
print(rdd.getNumPartitions())

1


In [26]:
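# Repartition and Coalesce
# repartition() performs a full shuffle, so it can increase or decrease the number of partitions
repartitioned_rdd=rdd.repartition(5)
print(repartitioned_rdd.getNumPartitions())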

5


In [27]:
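# coalesce() avoids a shuffle by default, so it can only reduce the number of partitions;
# asking for 5 when the RDD has 1 leaves it at 1
coalesced_rdd=rdd.coalesce(5)
print(coalesced_rdd.getNumPartitions())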

1
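The different results (5 versus 1) come from whether data is shuffled. A simplified plain-Python model of the two behaviours is sketched below; `coalesce_model` and `repartition_model` are hypothetical names, and the round-robin redistribution is an assumption made for illustration, not Spark's exact algorithm.

```python
def coalesce_model(partitions, target):
    """Merge neighbouring partitions down to `target`; never split existing ones."""
    if target >= len(partitions):
        return partitions  # no shuffle, so the partition count cannot grow
    merged = [[] for _ in range(target)]
    for index, part in enumerate(partitions):
        merged[index * target // len(partitions)].extend(part)
    return merged

def repartition_model(partitions, target):
    """Redistribute every element (a full shuffle), so any target count works."""
    shuffled = [[] for _ in range(target)]
    flat = [item for part in partitions for item in part]
    for index, item in enumerate(flat):
        shuffled[index % target].append(item)
    return shuffled

single = [[1, 2, 3, 1, 4, 5, 8, 9]]          # one partition, as in this notebook
four = [[1, 2], [3, 1], [4, 5], [8, 9]]      # four partitions

print(len(coalesce_model(single, 5)))        # stays at 1
print(len(repartition_model(single, 5)))     # becomes 5
print(coalesce_model(four, 2))               # merging down works without a shuffle
```

In PySpark itself, `coalesce(n, shuffle=True)` can also be used to increase the partition count; `repartition(n)` is the shorthand for that.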


# PySpark RDD Operations: the core transformations and actions performed on an RDD
* RDD transformations: lazy operations that, instead of updating an RDD, return another RDD. Examples are flatMap(), map(), reduceByKey(), filter() and sortByKey().
* RDD actions: operations that trigger the computation and return a result to the driver (or write it to storage) rather than another RDD.
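The lazy/eager split can be illustrated with Python's built-in `map`, which is also lazy. This is only an analogy in plain Python, not Spark code: `traced_upper` and `log` are hypothetical names used to show when the work actually happens.

```python
log = []

def traced_upper(word):
    """Upper-case a word and record that the function was actually called."""
    log.append(word)
    return word.upper()

words = ["spark", "rdd", "action"]

# "Transformation": builds a lazy pipeline, nothing is computed yet
pipeline = map(traced_upper, words)
calls_before = len(log)

# "Action": consuming the pipeline forces the computation
result = list(pipeline)
calls_after = len(log)

print(calls_before, calls_after, result)
```

In the same way, calling `map()` or `filter()` on an RDD only records what should be done; the work starts when an action such as `collect()` or `count()` is called.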

In [30]:
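# load the text file; each line becomes one record
rdd_new = spark.sparkContext.textFile("book.txt")

In [41]:
# flatMap: split every line on single spaces and flatten the results into one RDD of words
rdd1=rdd_new.flatMap(lambda x:x.split(" "))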

In [42]:
# map: pair every word with an initial count of 1
rdd2 = rdd1.map(lambda x: (x,1))
rdd2.collect()

[('Self-Employment:', 1),
 ('Building', 1),
 ('an', 1),
 ('Internet', 1),
 ('Business', 1),
 ('of', 1),
 ('One', 1),
 ('Achieving', 1),
 ('Financial', 1),
 ('and', 1),
 ('Personal', 1),
 ('Freedom', 1),
 ('through', 1),
 ('a', 1),
 ('Lifestyle', 1),
 ('Technology', 1),
 ('Business', 1),
 ('By', 1),
 ('Frank', 1),
 ('Kane', 1),
 ('', 1),
 ('', 1),
 ('', 1),
 ('Copyright', 1),
 ('�', 1),
 ('2015', 1),
 ('Frank', 1),
 ('Kane.', 1),
 ('', 1),
 ('All', 1),
 ('rights', 1),
 ('reserved', 1),
 ('worldwide.', 1),
 ('', 1),
 ('', 1),
 ('CONTENTS', 1),
 ('Disclaimer', 1),
 ('Preface', 1),
 ('Part', 1),
 ('I:', 1),
 ('Making', 1),
 ('the', 1),
 ('Big', 1),
 ('Decision', 1),
 ('Overcoming', 1),
 ('Inertia', 1),
 ('Fear', 1),
 ('of', 1),
 ('Failure', 1),
 ('Career', 1),
 ('Indoctrination', 1),
 ('The', 1),
 ('Carrot', 1),
 ('on', 1),
 ('a', 1),
 ('Stick', 1),
 ('Ego', 1),
 ('Protection', 1),
 ('Your', 1),
 ('Employer', 1),
 ('as', 1),
 ('a', 1),
 ('Security', 1),
 ('Blanket', 1),
 ('Why', 1),
 ('it�

In [44]:
# reduceByKey: count the occurrences of each word by summing the 1s per key
reduced_rdd=rdd2.reduceByKey(lambda a,b:a+b)
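The flatMap, map and reduceByKey steps can be traced by hand on a tiny input. The sketch below is a plain-Python equivalent under the assumption of three made-up lines (they are not taken from book.txt). It also shows why an empty-string key appears in the counts: `split(" ")` on an empty line yields `['']`, whereas `split()` with no argument drops empty tokens.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "", "be quick"]   # hypothetical sample input

# flatMap: split each line, then flatten all the pieces into one list
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with 1
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with a + b
counts = defaultdict(int)
for word, one in pairs:
    counts[word] = counts[word] + one
counts = dict(counts)

# Variant: split() without an argument never produces empty tokens
clean_words = list(chain.from_iterable(line.split() for line in lines))

print(counts)
print(clean_words)
```

Whether empty tokens should be dropped depends on the analysis; in the RDD pipeline the same effect could be obtained by filtering out empty strings before counting.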


In [47]:
# filter: keep only the words that occur more than 5 times
reduced_rdd.filter(lambda x:x[1]>5).collect()

[('an', 172),
 ('Internet', 13),
 ('Business', 19),
 ('of', 941),
 ('One', 12),
 ('and', 901),
 ('Freedom', 7),
 ('through', 55),
 ('a', 1148),
 ('By', 9),
 ('Frank', 10),
 ('Kane', 7),
 ('', 199),
 ('�', 174),
 ('All', 13),
 ('the', 1176),
 ('The', 88),
 ('on', 399),
 ('Your', 62),
 ('as', 297),
 ('it�s', 28),
 ('it', 311),
 ('in', 552),
 ('Not', 7),
 ('No', 14),
 ('to', 1789),
 ('You', 144),
 ('When', 31),
 ('How', 29),
 ('Is', 17),
 ('for', 500),
 ('I', 322),
 ('Even', 35),
 ('Sundog', 20),
 ('Software', 12),
 ('your', 1339),
 ("Don't", 22),
 ('Have', 9),
 ('Go', 7),
 ('It', 42),
 ('Online', 8),
 ('Search', 6),
 ('New', 6),
 ('What', 38),
 ('you', 1267),
 ('do', 156),
 ('with', 292),
 ('career', 18),
 ('is', 531),
 ('own', 114),
 ('This', 57),
 ('book', 31),
 ('how', 127),
 ('my', 199),
 ('life', 21),
 ('by', 109),
 ('quitting', 8),
 ('job,', 24),
 ('creating', 14),
 ('growing', 9),
 ('income', 17),
 ('self-employment.', 10),
 ('learned', 10),
 ('along', 10),
 ('way', 42),
 ('could'

In [48]:
reduced_rdd.sortByKey().collect()

[('', 199),
 ('"CRM"', 1),
 ('"Display', 1),
 ('"Flexibility', 1),
 ('"Good', 1),
 ('"How', 1),
 ('"I', 1),
 ('"Lean', 1),
 ('"Measure', 1),
 ('"Office', 1),
 ('"Oh,', 1),
 ('"Only', 1),
 ('"Plan', 1),
 ('"Shark', 2),
 ('"Silver', 1),
 ('"SilverLining".', 1),
 ('"The', 4),
 ('"URL', 1),
 ('"Why', 1),
 ('"WordPress', 1),
 ('"Y', 1),
 ('"account', 1),
 ('"acqui-hires."', 1),
 ('"action"', 2),
 ('"ad', 1),
 ('"adaptive"', 1),
 ('"advertorial"', 1),
 ('"ageism', 1),
 ('"audience"', 1),
 ('"brand', 1),
 ('"broad', 2),
 ('"call', 1),
 ('"call-outs"', 1),
 ('"campaign', 1),
 ('"click', 1),
 ('"come', 1),
 ('"cost', 1),
 ('"designed"', 1),
 ('"dimensions"', 1),
 ('"goals".', 1),
 ('"growth', 3),
 ('"hobby"', 1),
 ('"how', 3),
 ('"landing', 1),
 ('"like"', 2),
 ('"liking"', 1),
 ('"link', 2),
 ('"market', 1),
 ('"moderately', 1),
 ('"modern"', 1),
 ('"moonlighting"', 2),
 ('"negative', 1),
 ('"organic', 1),
 ('"remarketing', 1),
 ('"remnant', 1),
 ('"responsive"?', 1),
 ('"retirement."', 1),
 (

# RDD Actions

In [53]:
# Action - count
print(f'Total number of words in the file is: {rdd1.count()}')

Total number of words in the file is: 46448


In [62]:
# Action - first
firstRec = rdd2.first()
print(f'the number of occurrences of {firstRec[0]} is {firstRec[1]}')

the number of occurrences of Self-Employment: is 1


In [59]:
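# Action - max: for strings this is the largest value in character (code point) order, not the longest word
datMax = rdd1.max()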
datMax

'�would'
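The result of `max()` and the order produced by `sortByKey()` both follow Python's string comparison, which goes character by character on Unicode code points. A small plain-Python sketch with a hypothetical token list shows the effect: the empty string sorts first, a leading double quote sorts before letters, uppercase before lowercase, and the replacement character U+FFFD (shown as `�`) after all ASCII characters.

```python
tokens = ['would', '"CRM"', 'Zebra', '\ufffdwould', '']   # hypothetical sample

ordered = sorted(tokens)   # same ordering rule a key sort on strings would follow
largest = max(tokens)      # largest by code point, not by length

print(ordered)
print(repr(largest))
```

This is why a token starting with `�` comes out as the maximum, and why the sorted output begins with the empty string followed by quoted words.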

In [66]:
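# Action - reduce: sum the 1s of all (word, 1) pairs
# the reducer must return the same (key, count) shape it receives, so the running sum stays in position 1
totalWordCount = rdd2.reduce(lambda a,b: ("total", a[1]+b[1]))
print("dataReduce Record : "+str(totalWordCount[1]))

dataReduce Record : 46448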
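A function passed to `reduce()` receives the running result as its first argument, so it has to return a value shaped like its inputs. The sketch below uses `functools.reduce` on a hypothetical list of four pairs to compare a reducer that puts the sum in the wrong slot with one that keeps a consistent `(label, sum)` shape. It is plain Python, not Spark; in Spark the reducer should additionally be associative and commutative, because partitions are combined in no fixed order.

```python
from functools import reduce

pairs = [("a", 1), ("b", 1), ("c", 1), ("d", 1)]   # hypothetical (word, 1) pairs

# Inconsistent shape: the sum lands in slot 0, but the next step reads slot 1,
# so the running total is lost and the result is stuck at (2, 1)
lopsided = reduce(lambda a, b: (a[1] + b[1], a[1]), pairs)

# Consistent shape: the running sum is always kept in slot 1
total = reduce(lambda a, b: ("total", a[1] + b[1]), pairs)

print(lopsided)
print(total)
```

When only the total is needed, mapping the pairs to their counts first and reducing with `a + b` avoids the tuple bookkeeping altogether.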


In [67]:
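# Action - saveAsTextFile: writes one part file per partition into the target directory, which must not already exist
rdd2.saveAsTextFile("/wordCount")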