![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 04.03 - Optimization: RDD Persistence

## Background

In this exercise you will look at how best to use the `cache` method to persist frequently used RDDs.

Data will be from the Cassandra `videos` table with the following definition:

Note: There are some columns which may have a null value: `avg_rating` and `description`. Be sure to use the `Option[]` data type for those columns.
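
As a minimal sketch of what the note above means in practice, the plain Scala snippet below models nullable columns as `Option` fields and reads them safely. The `VideoRow` class and the two sample rows are made up for illustration and are not taken from the table; no Spark or Cassandra connection is needed to run it.

```scala
// Hypothetical row type: columns that may be null are wrapped in Option.
case class VideoRow(title: String, description: Option[String], avgRating: Option[Float])

// Made-up sample rows standing in for data read from Cassandra.
val rows = Seq(
  VideoRow("A", Some("Santa brings a gift"), Some(4.5f)),
  VideoRow("B", None, None) // null columns arrive as None
)

// exists returns false for None, so rows without a description are skipped.
val withSanta = rows.filter(_.description.exists(_.contains("Santa")))

// getOrElse substitutes a default when the value is missing.
val ratings = rows.map(_.avgRating.getOrElse(0.0f))

println(withSanta.map(_.title))
println(ratings)
```

Using `exists` avoids building an empty default string, while `getOrElse("")` followed by `contains` gives the same result for this kind of check.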

***

## Directions

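#### 1. Write queries that count the videos tagged `christmas` whose description contains "Santa", and those whose description contains "gift".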

    case class Video(title: String, description: Option[String])

    println(sc.cassandraTable[Video]("killr_video", "videos_by_tag")
      .where("tag = 'christmas'")
      .filter(vid => vid.description.getOrElse("").contains("Santa"))
      .count)

    println(sc.cassandraTable[Video]("killr_video", "videos_by_tag")
      .where("tag = 'christmas'")
      .filter(vid => vid.description.getOrElse("").contains("gift"))
      .count)

Output:

    7
    6


#### 2. Open the "Storage" tab of the application UI (http://localhost:4040/storage). 

Note that the tab lists no RDDs: Spark did not persist any intermediate results, so each query read the data from Cassandra again.
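
The same observation can also be made from code rather than from the UI. The sketch below assumes the `sc` SparkContext that the notebook already provides, and uses a small in-memory RDD of made-up strings as a stand-in for the Cassandra table so that it runs without a cluster.

```scala
import org.apache.spark.storage.StorageLevel

// Stand-in data (hypothetical); in the exercise this would be the Cassandra-backed RDD.
val descriptions = sc.parallelize(Seq("Santa comes to town", "A gift for you", "Snow day"))

// Two independent actions on the same uncached RDD.
println(descriptions.filter(_.contains("Santa")).count())
println(descriptions.filter(_.contains("gift")).count())

// The RDD was never marked for persistence, so its storage level is NONE.
println(descriptions.getStorageLevel == StorageLevel.NONE)
```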

#### 3. Now, optimize your solution by caching the intermediate RDD that both queries share.

    val cassandraVideos = sc.cassandraTable[Video]("killr_video", "videos_by_tag")
      .where("tag = 'christmas'")
      .cache()

    println(cassandraVideos.filter(vid => vid.description.getOrElse("").contains("Santa")).count)

    println(cassandraVideos.filter(vid => vid.description.getOrElse("").contains("gift")).count)

Output:

    7
    6


#### 4. Refresh the "Storage" tab and observe how Spark persisted your cached results.
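
What the "Storage" tab shows can be cross-checked from the notebook. This is a sketch under the same assumptions as the exercise code (an existing `sc`), again with a small in-memory RDD of made-up strings in place of the Cassandra table. It also shows `unpersist`, which the exercise does not use, as one way to release the cached data afterwards.

```scala
import org.apache.spark.storage.StorageLevel

// Stand-in data (hypothetical) marked for caching.
val cached = sc.parallelize(Seq("Santa comes to town", "A gift for you", "Snow day")).cache()

// For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
println(cached.getStorageLevel == StorageLevel.MEMORY_ONLY)

// cache() only marks the RDD; the partitions are stored when the first action runs.
cached.count()

// The RDD is now registered with the SparkContext as persistent.
println(sc.getPersistentRDDs.contains(cached.id))

// Release the cached partitions once the RDD is no longer needed.
cached.unpersist()
println(cached.getStorageLevel == StorageLevel.NONE)
```

If the data might not fit in memory, `persist` with another storage level such as `StorageLevel.MEMORY_AND_DISK` is an alternative to `cache`.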