---


<center><h1> Persistence in RDDs </h1></center>

---

* Spark RDD persistence is an optimization technique that saves the result of an RDD evaluation so that later actions can reuse it instead of recomputing it.
* Persisting an RDD that is used more than once has some advantages. It makes the whole system
  * Time efficient
  * Cost efficient
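
To make the time saving concrete, here is a minimal plain-Python sketch of the idea (no Spark required). `LazyDataset` is a hypothetical toy class, not part of PySpark, and it is only an assumption-level model: it recomputes its result on every `collect()` unless `persist()` was called first, which mimics how an un-persisted RDD may be re-evaluated by each action.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not a PySpark class)."""

    def __init__(self, compute):
        self._compute = compute   # function that builds the data
        self._persist = False     # has persist() been requested?
        self._cache = None        # stored result, once computed
        self.evaluations = 0      # how many times compute() actually ran

    def persist(self):
        # Only sets a flag; nothing is computed at this point.
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cache is not None:
            return self._cache    # served from the cache
        self.evaluations += 1
        result = self._compute()
        if self._persist:
            self._cache = result  # keep the result for later actions
        return result


def expensive():
    # Stand-in for a long chain of transformations.
    return [n * n for n in range(5)]


plain = LazyDataset(expensive)
plain.collect()
plain.collect()
print(plain.evaluations)    # 2

cached = LazyDataset(expensive).persist()
cached.collect()
cached.collect()
print(cached.evaluations)   # 1
```

The un-persisted dataset runs `expensive()` once per action, while the persisted one runs it only for the first action.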

----

#### `Create the Spark Context`

----

In [1]:
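# importing the libraries
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

In [2]:
# create the spark context (getOrCreate reuses an already running context, if any)
conf = SparkConf().setAppName("persistence_example")
sc = SparkContext.getOrCreate(conf)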

---

In this notebook, we are going to use 2 different text files:

* **`student_data.txt`**
     * **Roll No**
     * **Gender**
     * **City**
    
---    
    
* **`marks.txt`**
     * **Roll No**
     * **Subject Code**
     * **Marks**
    
---

#### `Read the Student Dataset`

--- 

In [3]:
# create the rdd of the student dataset
students_data = sc.textFile('dataset/student_data.txt')

In [4]:
# top 5 rows
students_data.take(5)

['IND2021100000 Others Delhi',
 'IND2021100001 Female Gurgaon',
 'IND2021100002 Male Mumbai',
 'IND2021100003 Male Pune',
 'IND2021100004 Male Mumbai']

---

In the text file, the columns of each row are separated by a single space. So, we first use the `map` transformation to split each row into its fields.
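
As a small plain-Python illustration of what the split does to one row (the second string, with a doubled space, is a made-up example and does not come from the dataset):

```python
line = 'IND2021100000 Others Delhi'
print(line.split(' '))     # ['IND2021100000', 'Others', 'Delhi']

# Hypothetical malformed row with two spaces after the roll no.
double = 'IND2021100000  Others Delhi'
print(double.split(' '))   # ['IND2021100000', '', 'Others', 'Delhi']
print(double.split())      # ['IND2021100000', 'Others', 'Delhi']
```

`split(' ')` is fine as long as the file uses exactly one space between columns. If that cannot be guaranteed, `split()` without an argument collapses runs of whitespace.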

---

#### `Split each row by space`

---

In [5]:
# split each row by space
students_data = students_data.map(lambda x: x.split(' '))

In [6]:
# view the top 5 rows
students_data.take(5)

[['IND2021100000', 'Others', 'Delhi'],
 ['IND2021100001', 'Female', 'Gurgaon'],
 ['IND2021100002', 'Male', 'Mumbai'],
 ['IND2021100003', 'Male', 'Pune'],
 ['IND2021100004', 'Male', 'Mumbai']]

---

Now, we will turn it into a pair RDD, with the student `roll no` as the key and the tuple (`gender`, `city`) as the value.

---

#### `Creating the Pair RDDs`

---

In [7]:
# creating the Pair RDDs
students_data = students_data.map(lambda x: (x[0], (x[1], x[2])))

In [8]:
# view the top 5 rows
students_data.take(5)

[('IND2021100000', ('Others', 'Delhi')),
 ('IND2021100001', ('Female', 'Gurgaon')),
 ('IND2021100002', ('Male', 'Mumbai')),
 ('IND2021100003', ('Male', 'Pune')),
 ('IND2021100004', ('Male', 'Mumbai'))]

---

#### `Read the marks data`


---

In [9]:
# create the rdd of the marks data
marks = sc.textFile("dataset/marks.txt")

In [10]:
# view the top 5 rows
marks.take(5)

['IND2021100000 PC001 56',
 'IND2021100000 PC005 81',
 'IND2021100000 PC033 83',
 'IND2021100001 PC001 58',
 'IND2021100001 PC005 81']

---

The columns of the marks file are also separated by a single space. So, we again use the `map` transformation to split each row.

---

#### `Split each row by space`

----

In [11]:
# split each row by space
marks = marks.map(lambda x: x.split(' '))

In [12]:
# view the top 5 rows
marks.take(5)

[['IND2021100000', 'PC001', '56'],
 ['IND2021100000', 'PC005', '81'],
 ['IND2021100000', 'PC033', '83'],
 ['IND2021100001', 'PC001', '58'],
 ['IND2021100001', 'PC005', '81']]

---

We can see that the marks column in the above RDD is still a string. So, in the next step, we will cast it to an integer and, in the same `map`, build a pair RDD keyed by the roll no.
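
The same conversion can be sketched as a named function, which is easier to read and test than a lambda. `to_marks_pair` is a hypothetical helper introduced only for illustration. It assumes every row has exactly three fields and that the third one is a valid integer.

```python
def to_marks_pair(fields):
    """Turn [roll_no, subject_code, marks] into (roll_no, (subject_code, int(marks)))."""
    roll_no, subject_code, raw_marks = fields
    return (roll_no, (subject_code, int(raw_marks)))


row = ['IND2021100000', 'PC001', '56']
print(to_marks_pair(row))   # ('IND2021100000', ('PC001', 56))
```

In the notebook, such a function could be passed to `map` in place of the lambda, for example `marks.map(to_marks_pair)`.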

---

#### `Type cast marks into integer and make pair RDD`

---

In [13]:
# type cast marks into integer 
marks_pair = marks.map(lambda x: (x[0], (x[1], int(x[2]))))

In [14]:
# view the top 5 rows
marks_pair.take(5)

[('IND2021100000', ('PC001', 56)),
 ('IND2021100000', ('PC005', 81)),
 ('IND2021100000', ('PC033', 83)),
 ('IND2021100001', ('PC001', 58)),
 ('IND2021100001', ('PC005', 81))]

---

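Now, for each roll no, we will compute the total marks. To do that, we use `reduceByKey`, which merges all the values that share a key with the given function. Here, the function concatenates the subject codes and adds up the marks. Because `reduceByKey` shuffles the data, the rows come back in a different order than in the input.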
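
The per-key merge can be sketched in plain Python. `reduce_by_key` below is a hypothetical helper that only illustrates the semantics on a local list. It is not how Spark implements `reduceByKey`, which combines values inside each partition and then across partitions.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (illustration only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# First five rows of the marks pair RDD shown above.
marks_pair_sample = [
    ('IND2021100000', ('PC001', 56)),
    ('IND2021100000', ('PC005', 81)),
    ('IND2021100000', ('PC033', 83)),
    ('IND2021100001', ('PC001', 58)),
    ('IND2021100001', ('PC005', 81)),
]

# Same merge function as in the notebook cell below.
combine = lambda x, y: (x[0] + " " + y[0], x[1] + y[1])

totals = reduce_by_key(marks_pair_sample, combine)
print(totals['IND2021100000'])   # ('PC001 PC005 PC033', 220)
print(totals['IND2021100001'])   # ('PC001 PC005', 139)
```

Adding the marks is order-independent, but the string concatenation is not. On a cluster, the order of the subject codes in the merged string can therefore depend on how the rows are partitioned.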

---

In [15]:
# total marks using the reduceByKey
total_marks = marks_pair.reduceByKey(lambda x, y: (x[0]+ " " + y[0] , x[1]+y[1]))

In [16]:
# view the top 5 rows
total_marks.take(5)

[('IND2021100001', ('PC001 PC005 PC033', 221)),
 ('IND2021100002', ('PC001 PC005 PC033', 226)),
 ('IND2021100005', ('PC001 PC005 PC033', 191)),
 ('IND2021100006', ('PC001 PC005 PC033', 192)),
 ('IND2021100007', ('PC001 PC005 PC033', 222))]

---

#### `Join the two Pair RDDs`

---

Next, we will join the 2 pair RDDs, `students_data` and `total_marks`, on their common key (the roll no).
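
`join` on pair RDDs keeps only the keys that are present on both sides and pairs up their values. A minimal plain-Python sketch of that behaviour follows. `inner_join` is a hypothetical helper, and the third student is deliberately left without a totals row in this small sample to show that unmatched keys are dropped.

```python
def inner_join(left_pairs, right_pairs):
    """Pair up values of matching keys from two lists of (key, value) tuples."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in right_by_key.get(key, [])
    ]


students_sample = [
    ('IND2021100001', ('Female', 'Gurgaon')),
    ('IND2021100002', ('Male', 'Mumbai')),
    ('IND2021100003', ('Male', 'Pune')),   # no totals row in this sample
]
totals_sample = [
    ('IND2021100001', ('PC001 PC005 PC033', 221)),
    ('IND2021100002', ('PC001 PC005 PC033', 226)),
]

joined = inner_join(students_sample, totals_sample)
for record in joined:
    print(record)
# ('IND2021100001', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 221)))
# ('IND2021100002', (('Male', 'Mumbai'), ('PC001 PC005 PC033', 226)))
```

Each joined record has the shape `(roll_no, ((gender, city), (subjects, total)))`, which is the nesting the later filters index into.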


---

In [17]:
# join the pair RDDs
students_with_marks = students_data.join(total_marks)

In [18]:
# view the top 5
students_with_marks.take(5)

[('IND2021100002', (('Male', 'Mumbai'), ('PC001 PC005 PC033', 226))),
 ('IND2021100005', (('Others', 'Mumbai'), ('PC001 PC005 PC033', 191))),
 ('IND2021100015', (('Others', 'Pune'), ('PC001 PC005 PC033', 212))),
 ('IND2021100022', (('Male', 'Gurgaon'), ('PC001 PC005 PC033', 240))),
 ('IND2021100025', (('Male', 'Pune'), ('PC001 PC005 PC033', 179)))]

---

So far, we have done the following transformations on the 2 datasets.

---

![](images/level-1.png)

---

---

### Now, we will do 2 more transformations on the joined data.

   - We will find out the data of the `female` students.
   - Next, we will find out the data of `female` students who are from `Pune`
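
The two filters index into the nested record, where `x[1][0][0]` is the gender and `x[1][0][1]` is the city. A plain-Python sketch with named unpacking makes that structure explicit. `is_female` and `is_from_pune` are hypothetical helper names, and the sample rows are taken from the outputs in this notebook.

```python
def is_female(record):
    # record = (roll_no, ((gender, city), (subjects, total)))
    _, ((gender, _), _) = record
    return gender == "Female"


def is_from_pune(record):
    _, ((_, city), _) = record
    return city == "Pune"


sample = [
    ('IND2021100002', (('Male', 'Mumbai'), ('PC001 PC005 PC033', 226))),
    ('IND2021100031', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 198))),
    ('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
    ('IND2021100025', (('Male', 'Pune'), ('PC001 PC005 PC033', 179))),
]

females = list(filter(is_female, sample))
females_from_pune = list(filter(is_from_pune, females))

print([r[0] for r in females])             # ['IND2021100031', 'IND2021100116']
print([r[0] for r in females_from_pune])   # ['IND2021100116']
```

The second filter runs on the output of the first one, just as the Pune filter in the notebook is chained onto the female-students RDD.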

---

---

![](images/level-2.png)

---

---

#### `Data of female students`

---

In [19]:
# data of female students
students_with_data_female = students_with_marks.filter(lambda x : x[1][0][0] == "Female")

In [20]:
# collect the data
students_with_data_female.collect()

[('IND2021100031', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 198))),
 ('IND2021100038', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 225))),
 ('IND2021100047', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 195))),
 ('IND2021100052', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 209))),
 ('IND2021100063', (('Female', 'Delhi'), ('PC001 PC005 PC033', 207))),
 ('IND2021100074', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100077', (('Female', 'Bengaluru'), ('PC001 PC005 PC033', 202))),
 ('IND2021100082', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 222))),
 ('IND2021100097', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100099', (('Female', 'Delhi'), ('PC001 PC005 PC033', 174))),
 ('IND2021100103', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 210))),
 ('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
 ('IND2021100130', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 202))),
 ('IND2021100150', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 

---

![](images/female_data_1.png)

---


#### `Data of female students from Pune`

---

In [21]:
# data of female students from pune
students_with_data_pune = students_with_data_female.filter(lambda x: x[1][0][1] == "Pune")

In [22]:
# collect the data
students_with_data_pune.collect()

[('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
 ('IND2021100169', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021100289', (('Female', 'Pune'), ('PC001 PC005 PC033', 213))),
 ('IND2021100290', (('Female', 'Pune'), ('PC001 PC005 PC033', 199))),
 ('IND2021100321', (('Female', 'Pune'), ('PC001 PC005 PC033', 232))),
 ('IND2021100734', (('Female', 'Pune'), ('PC001 PC005 PC033', 202))),
 ('IND2021100752', (('Female', 'Pune'), ('PC001 PC005 PC033', 200))),
 ('IND2021100763', (('Female', 'Pune'), ('PC001 PC005 PC033', 193))),
 ('IND2021100860', (('Female', 'Pune'), ('PC001 PC005 PC033', 222))),
 ('IND2021101104', (('Female', 'Pune'), ('PC001 PC005 PC033', 213))),
 ('IND2021101147', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021101262', (('Female', 'Pune'), ('PC001 PC005 PC033', 203))),
 ('IND2021101394', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021101490', (('Female', 'Pune'), ('PC001 PC005 PC033', 215))),
 ('IND2021101541', (

---

![](images/female_from_pune_1.png)

---

----

Now, we can answer many more questions from the joined pair RDD, for example:

* How many students are from Mumbai?
* How many male students are from Gurgaon?
* What are the average marks of male and female students?
* Who are the top 10 students in Pune?

And a lot more. Without persistence, the joined RDD may be recomputed from its lineage each time we run an action on it. So, let's persist the joined pair RDD so that these repeated operations take less time.
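
To illustrate two of the questions above, here is a plain-Python sketch on a handful of joined records copied from the outputs in this notebook. It only shows the logic. On the real data, the same questions would be expressed as transformations and actions on the joined pair RDD.

```python
from collections import defaultdict

joined_sample = [
    ('IND2021100002', (('Male', 'Mumbai'), ('PC001 PC005 PC033', 226))),
    ('IND2021100005', (('Others', 'Mumbai'), ('PC001 PC005 PC033', 191))),
    ('IND2021100022', (('Male', 'Gurgaon'), ('PC001 PC005 PC033', 240))),
    ('IND2021100031', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 198))),
    ('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
]

# How many students in the sample are from Mumbai?
mumbai_count = sum(1 for _, ((_, city), _) in joined_sample if city == "Mumbai")
print(mumbai_count)   # 3

# Average total marks per gender in the sample.
totals = defaultdict(lambda: [0, 0])   # gender -> [sum of marks, number of students]
for _, ((gender, _), (_, total)) in joined_sample:
    totals[gender][0] += total
    totals[gender][1] += 1

averages = {gender: marks_sum / count for gender, (marks_sum, count) in totals.items()}
print(averages["Male"])     # 233.0
print(averages["Female"])   # 213.5
```

Every one of these questions starts from the same joined data, which is why that RDD is a good candidate for persistence.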


----


#### `Persist the Joined Pair RDD`



----

---

![](images/level-3.png)

---

In [24]:
# mark the joined pair RDD for persistence
students_with_marks.persist()

PythonRDD[25] at RDD at PythonRDD.scala:53

---

Note that `persist()` is lazy: it only marks the RDD to be stored (with the default storage level `MEMORY_ONLY` when no level is passed). The data is actually stored the first time an action computes it, so the first action on the persisted RDD still pays the full cost, and later actions can reuse the stored partitions.

**Now, we will do the same transformations on the persisted RDD**

 - We will find out the data of the `female` students.
 - Next, we will find out the data of `female` students who are from `Pune`


---

#### `Data of female students from persisted RDD`

---

In [25]:
# data of female students 
students_with_data_female_persist = students_with_marks.filter(lambda x : x[1][0][0] == "Female")

In [26]:
# collect the data
students_with_data_female_persist.collect()

[('IND2021100031', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 198))),
 ('IND2021100038', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 225))),
 ('IND2021100047', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 195))),
 ('IND2021100052', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 209))),
 ('IND2021100063', (('Female', 'Delhi'), ('PC001 PC005 PC033', 207))),
 ('IND2021100074', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100077', (('Female', 'Bengaluru'), ('PC001 PC005 PC033', 202))),
 ('IND2021100082', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 222))),
 ('IND2021100097', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100099', (('Female', 'Delhi'), ('PC001 PC005 PC033', 174))),
 ('IND2021100103', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 210))),
 ('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
 ('IND2021100130', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 202))),
 ('IND2021100150', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 

---

![](images/data_of_female_persist_1.png)

---

In [28]:
# run the same action again; the persisted RDD can now be read from the cache
students_with_data_female_persist.collect()

[('IND2021100031', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 198))),
 ('IND2021100038', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 225))),
 ('IND2021100047', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 195))),
 ('IND2021100052', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 209))),
 ('IND2021100063', (('Female', 'Delhi'), ('PC001 PC005 PC033', 207))),
 ('IND2021100074', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100077', (('Female', 'Bengaluru'), ('PC001 PC005 PC033', 202))),
 ('IND2021100082', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 222))),
 ('IND2021100097', (('Female', 'Delhi'), ('PC001 PC005 PC033', 190))),
 ('IND2021100099', (('Female', 'Delhi'), ('PC001 PC005 PC033', 174))),
 ('IND2021100103', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 210))),
 ('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
 ('IND2021100130', (('Female', 'Mumbai'), ('PC001 PC005 PC033', 202))),
 ('IND2021100150', (('Female', 'Gurgaon'), ('PC001 PC005 PC033', 

---

![](images/data_with_female_persist_again.png)
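
To compare the first and the second run in numbers, one option is to time the action from Python. The helper below is a minimal sketch. `time_action` is a hypothetical name, and the workload is a stand-in so that the snippet runs without Spark.

```python
import time


def time_action(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook one could pass an RDD action instead,
# e.g. time_action(students_with_data_female_persist.collect)
result, seconds = time_action(lambda: sum(range(1000)))
print(result)   # 499500
print(f"took {seconds:.6f} s")
```

Wall-clock timings vary from run to run, so a single measurement should be read as a rough indication only.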

----

----

#### `Data of female students from Pune from persisted RDD`

---

In [29]:
# data of female students from pune
students_with_data_pune_persist = students_with_data_female_persist.filter(lambda x: x[1][0][1] == "Pune")

In [30]:
# collect the data
students_with_data_pune_persist.collect()

[('IND2021100116', (('Female', 'Pune'), ('PC001 PC005 PC033', 229))),
 ('IND2021100169', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021100289', (('Female', 'Pune'), ('PC001 PC005 PC033', 213))),
 ('IND2021100290', (('Female', 'Pune'), ('PC001 PC005 PC033', 199))),
 ('IND2021100321', (('Female', 'Pune'), ('PC001 PC005 PC033', 232))),
 ('IND2021100734', (('Female', 'Pune'), ('PC001 PC005 PC033', 202))),
 ('IND2021100752', (('Female', 'Pune'), ('PC001 PC005 PC033', 200))),
 ('IND2021100763', (('Female', 'Pune'), ('PC001 PC005 PC033', 193))),
 ('IND2021100860', (('Female', 'Pune'), ('PC001 PC005 PC033', 222))),
 ('IND2021101104', (('Female', 'Pune'), ('PC001 PC005 PC033', 213))),
 ('IND2021101147', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021101262', (('Female', 'Pune'), ('PC001 PC005 PC033', 203))),
 ('IND2021101394', (('Female', 'Pune'), ('PC001 PC005 PC033', 204))),
 ('IND2021101490', (('Female', 'Pune'), ('PC001 PC005 PC033', 215))),
 ('IND2021101541', (

---

![](images/data_of_pune_persist_1.png)


---