![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 02.02 - Essentials: Three Ways to Create an RDD

## Background

In this exercise you will practice creating an RDD in three different ways:

1. Parallelizing a collection
2. Loading data from storage
3. Transforming an RDD

One of the data sources you'll be working with is a Cassandra table, `killr_video.videos`, with the table definition:

You'll also be working with a CSV file, `/root/data/video-years.csv`, which contains three columns in the following order: `video_id`, `title`, `year`.

***

## Directions

#### 1. Create a List object with the following movie titles, and assign it to a `val`:

In [1]:
val list = List("Winnie the Pooh", "The Tigger Movie", "Pirates of the Caribbean", "Apollo 13", "Mall Cop", "Mockingjay - Part 1", "The Good Dinosaur", "Lava", "The Peanuts Movie")

#### 2. From the List object, create an RDD using the `parallelize` method. 

In [2]:
sc.parallelize(list)

org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:20

The output should show that you now have an `org.apache.spark.rdd.RDD` object!
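As a small follow-up sketch, the RDD can be assigned to a `val` and inspected. This assumes `sc` is the SparkContext that the notebook already provides; the name `titlesRDD`, the shortened list, and the choice of 3 partitions are illustrative and not part of the exercise.

```scala
// Assumption: `sc` is the SparkContext supplied by the notebook.
// A shortened copy of the exercise list keeps the example small.
val titles = List("Winnie the Pooh", "The Tigger Movie", "Pirates of the Caribbean")

// The optional second argument asks Spark to split the collection
// into that many partitions (3 is an arbitrary illustrative value).
val titlesRDD = sc.parallelize(titles, 3)

// Number of partitions the RDD was split into.
println(titlesRDD.partitions.length)

// count() is an action: it runs a job and returns the number of elements.
println(titlesRDD.count())
```

Nothing is computed when `parallelize` is called; work only happens once an action such as `count()` runs.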

#### 3. Create the RDD again, but use the `take` action to retrieve the first 3 elements of the RDD.

In [3]:
sc.parallelize(list).take(3)

Array[String] = Array(Winnie the Pooh, The Tigger Movie, Pirates of the Caribbean)

The `take(n)` action returns an Array containing the first `n` elements of the RDD, which in this case are Strings.
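For comparison, here is a minimal sketch of `take` next to two related actions, `first` and `collect`. It assumes `sc` is the notebook's SparkContext; `titlesRDD` and the other `val` names are hypothetical.

```scala
// Assumption: `sc` is the SparkContext supplied by the notebook.
val titlesRDD = sc.parallelize(
  List("Winnie the Pooh", "The Tigger Movie", "Pirates of the Caribbean", "Apollo 13"))

// first() returns a single element rather than an Array.
val firstTitle: String = titlesRDD.first()

// take(n) returns at most n elements as a local Array.
val firstTwo: Array[String] = titlesRDD.take(2)

// collect() returns every element, so it is only safe on small RDDs.
val everything: Array[String] = titlesRDD.collect()

println(firstTitle)
println(firstTwo.mkString(", "))
println(everything.length)
```

On a large RDD, `take` is usually the safer way to peek at the data, because `collect` pulls the whole dataset back to the driver.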

#### 4. Now let's create an RDD from the Cassandra `killr_video.videos` table and assign it to a `val`. Retrieve the first element in the RDD.

In [4]:
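import com.datastax.spark.connector._ // provides cassandraTable; harmless if the notebook already imports it
val tableRDD = sc.cassandraTable("killr_video", "videos")
tableRDD.take(1)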

The returned Array should contain a single CassandraRow whose columns and values match the table definition above.
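Once you have CassandraRow objects, individual values can be read by column name. The sketch below is only an illustration: it assumes `sc` is the notebook's SparkContext and that the table has `video_id` and `title` columns with `title` stored as text, which should be checked against the actual table definition. The names `titlesOnly` and `firstTitles` are hypothetical.

```scala
import com.datastax.spark.connector._

// Assumption: the table has `video_id` and `title` columns.
// select(...) restricts the columns fetched from Cassandra.
val titlesOnly = sc.cassandraTable("killr_video", "videos").select("video_id", "title")

// Assumption: `title` is a text column, so getString can read it.
val firstTitles: Array[String] = titlesOnly.take(3).map(row => row.getString("title"))

firstTitles.foreach(println)
```

Selecting only the columns you need reduces the amount of data moved from Cassandra to Spark.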

#### 5. Finally, let's create an RDD using a transformation. Run the next code box, and then filter the resulting `csvRDD` to create an RDD with only the movies released in 2002 or later. Print out the contents of the RDD.

In [5]:
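// Read the file as an RDD of lines and split each line into an array of fields
val csvArray = sc.textFile("file:///root/data/video-years.csv").map(line => line.split(",") )
// Keep only rows with exactly three fields and convert them to (video_id, title, year) tuples
val csvRDD = csvArray collect { case Array(video_id: String, title: String, year: String) => (video_id,title,year.toInt) }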

In [6]:
csvRDD.filter{ case (video_id, title, year) => year >= 2002 }.collect().foreach(println)

(b6d734d1-9aea-11e5-a6ca-8b496c707234,"Stuart Little 2",2002)
(b6c5cfb5-9aea-11e5-a6ca-8b496c707234,"Piglet's Big Movie",2003)
(b6c31093-9aea-11e5-a6ca-8b496c707234,"The Country Bears",2002)
(b6d89469-9aea-11e5-a6ca-8b496c707234,"The Jungle Book 2",2003)
(b6c31095-9aea-11e5-a6ca-8b496c707234,"Tarzan & Jane",2002)
(b6d9cce1-9aea-11e5-a6ca-8b496c707234,"Finding Nemo",2003)
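The parse-then-filter pattern from step 5 can also be tried on a plain Scala collection, without Spark, because `List` offers `map`, `collect` with a partial function, and `filter` with the same shape as the RDD methods used here. The following is a minimal sketch; the sample lines and ids are made up for illustration and are not taken from the exercise file.

```scala
// Hypothetical sample lines in the same video_id,title,year layout.
val sampleLines = List(
  "id-1,\"Stuart Little 2\",2002",
  "id-2,\"Toy Story\",1995",
  "malformed line without enough fields",
  "id-3,\"Finding Nemo\",2003"
)

// Split each line on commas, as the exercise does with the RDD.
val fields = sampleLines.map(line => line.split(","))

// collect with a partial function keeps only the arrays that match the
// pattern (exactly three fields) and converts them to typed tuples.
val parsed = fields.collect {
  case Array(videoId, title, year) => (videoId, title, year.toInt)
}

// filter keeps only the tuples whose year is 2002 or later.
val recent = parsed.filter { case (_, _, year) => year >= 2002 }

recent.foreach(println)
```

The malformed line is silently dropped by `collect` because it does not split into three fields. Note that `year.toInt` would still throw on a three-field row whose last field is not a number (a header row, for example), and a title containing a comma would split into more than three fields and be dropped.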
