![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 02.02 - Essentials: Three Ways to Create an RDD

## Background

In this exercise you will practice creating an RDD from a few different data sources: 

1. Parallelizing a collection
2. Loading data from storage
3. Transforming an RDD

One of the data sources you'll be working with is a Cassandra table, `killr_video.videos`, with the table definition:

You'll also be working with a CSV file, `/root/data/video-years.csv`, which contains columns in the following order:

***

## Directions

#### 1. Create a list with the following movie titles and assign it to a variable:

In [2]:
li = ["Winnie the Pooh", "The Tigger Movie", "Pirates of the Caribbean", "Apollo 13", "Mallcop", "Mockingjay - Part 1", "The Good Dinosaur", "Lava", "The Peanuts Movie"]

#### 2. From the List object, create an RDD using the `parallelize` method. 

In [3]:
sc.parallelize(li)

ParallelCollectionRDD[1] at parallelize at PythonRDD.scala:396

The output should show that you now have an RDD object, displayed here as a `ParallelCollectionRDD`.

#### 3. Create the RDD again, but use the `take` action to retrieve the first 3 elements of the RDD.

In [4]:
sc.parallelize(li).take(3)

['Winnie the Pooh', 'The Tigger Movie', 'Pirates of the Caribbean']

The `take` action returns a list with the elements from the RDD, which in this case are strings.

#### 4. Now let's create an RDD from the Cassandra `killr_video.videos` table and assign it to a variable. Retrieve the first element in the RDD.

In [5]:
tableRDD = sc.cassandraTable("killr_video", "videos")
tableRDD.take(1)

[Row(avg_rating=7.0, description=u"After being wrongfully expelled from Harvard University, American Matt Buckner flees to his sister's home in England. Once there, he is befriended by her charming and dangerous brother-in-law, Pete Dunham, and introduced to the underworld of British football hooliganism. Matt learns to stand his ground through a friendship that develops against the backdrop of this secret and often violent world. 'Green Street Hooligans' is a story of loyalty, trust and the sometimes brutal consequences of living close to the edge.", genres=set([u'Crime']), mpaa_rating=u'R', release_date=datetime.datetime(2005, 9, 9, 0, 0), release_year=2005, title=u'Green Street Hooligans', user_id=u'6b234a61-faa6-4b4e-bba1-7c283fb8cae5', video_id=u'ece8de8f-a5e2-11e5-ba92-a45e60eb67c5')]

The retrieved list should contain a single `Row` with columns and values matching the table definition above.

#### 5. Finally, let's create an RDD using a transformation. The next code box loads the CSV file, splits each line into fields, and keys each record by its first column. It then filters the result to keep only the movies released in 2002 or later and collects them into `csvArray`. Print out the contents.

In [6]:
csvArray = sc.textFile("file:///root/data/video-years.csv") \
    .map(lambda line: line.split(",")) \
    .map(lambda movie: (movie[0], movie)) \
    .filter(lambda pair: int(pair[1][2]) >= 2002) \
    .collect()

In [7]:
for _, v in csvArray:
    print(v)

[u'b6d734d1-9aea-11e5-a6ca-8b496c707234', u'"Stuart Little 2"', u'2002']
[u'b6c5cfb5-9aea-11e5-a6ca-8b496c707234', u'"Piglet\'s Big Movie"', u'2003']
[u'b6c31093-9aea-11e5-a6ca-8b496c707234', u'"The Country Bears"', u'2002']
[u'b6d89469-9aea-11e5-a6ca-8b496c707234', u'"The Jungle Book 2"', u'2003']
[u'b6c31095-9aea-11e5-a6ca-8b496c707234', u'"Tarzan & Jane"', u'2002']
[u'b6d9cce1-9aea-11e5-a6ca-8b496c707234', u'"Finding Nemo"', u'2003']
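
Note that the titles in the output above still carry their surrounding double quotes, because `line.split(",")` knows nothing about CSV quoting. It would also split a title that contains a comma. The following is a minimal sketch of one way to handle this with Python's standard `csv` module. The helper names `parse_line` and `released_since` are hypothetical and not part of the exercise, and the sample line is simply shaped like the output above. The sketch assumes Python 3; on a Python 2 kernel the `csv` module only handles ASCII-safe input reliably.

```python
import csv

def parse_line(line):
    """Split one CSV line into fields, honoring double-quoted values."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))

def released_since(pair, year):
    """Return True if the (video_id, fields) pair has a release year >= year."""
    _, fields = pair
    # The release year is assumed to be the third column, as in the exercise.
    return int(fields[2]) >= year

# Hypothetical sample line: id, quoted title, year.
sample = 'b6d734d1-9aea-11e5-a6ca-8b496c707234,"Stuart Little 2",2002'
fields = parse_line(sample)
pair = (fields[0], fields)
print(fields[1])
print(released_since(pair, 2002))
```

In the pipeline from step 5, these helpers could replace the lambdas as `.map(parse_line)` and `.filter(lambda pair: released_since(pair, 2002))`.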
