![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 07.03 - Spark-Cassandra Connector Optimizations: Joining Tables

## Background

In this exercise, you will optimize working Spark code by using the `joinWithCassandraTable` method.

You'll be working with the `videos_by_actor` table and the `actor` table.

***

## Directions

Begin by creating a local list of two actors and parallelizing it into an RDD.

In [2]:
case class ActorYear(actor_name: String, release_year: Int)

val actors2014 = sc.parallelize(List(ActorYear("Johnny Depp",2014), 
                                    ActorYear("Bruce Willis",2014)))

Next, join the RDD to the `videos_by_actor` table using the `joinWithCassandraTable` method. By default, this method joins on the partition key only, so the `release_year` field of `ActorYear` is ignored and the results include videos from every year.

In [3]:
actors2014.joinWithCassandraTable("killr_video","videos_by_actor").takeSample(false,10).foreach(println)

(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 2003, character_name: William Rose Bailey, title: Tears of the Sun, video_id: ece961ee-a5e2-11e5-af15-a45e60eb67c5})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 1991, character_name: James Urbanski, title: Mortal Thoughts, video_id: ecf18d87-a5e2-11e5-b5e6-a45e60eb67c5})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 2010, character_name: Mr. Church, title: The Expendables, video_id: ecf38335-a5e2-11e5-a8d5-a45e60eb67c5})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 1992, character_name: Dr. Ernest Menville, title: Death Becomes Her, video_id: ece92b66-a5e2-11e5-90e6-a45e60eb67c5})
(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 2013, character_name: John McClane, title: A Good Day to Die Hard, video_id: ecf58f1e-a5e2-11e5-a494-a45e60eb67c5})
(ActorYear(Bruce
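
To see why the output above contains years other than 2014, here is a minimal plain-Scala sketch of the join semantics. It does not use the connector at all: `PartitionKeyJoinSketch`, `VideoRow`, and `joinOnPartitionKey` are hypothetical names, the in-memory `Map` stands in for the `videos_by_actor` table, and it assumes `actor_name` is the partition key. The sample rows are taken from the outputs shown in this exercise.

```scala
object PartitionKeyJoinSketch {
  case class ActorYear(actor_name: String, release_year: Int)
  case class VideoRow(actor_name: String, release_year: Int, title: String)

  // In-memory stand-in for videos_by_actor, grouped by the assumed
  // partition key (actor_name). This is NOT the connector API.
  val videosByActor: Map[String, Seq[VideoRow]] = Seq(
    VideoRow("Bruce Willis", 2014, "The Prince"),
    VideoRow("Bruce Willis", 2010, "The Expendables"),
    VideoRow("Johnny Depp", 2014, "Tusk"),
    VideoRow("Tom Hardy", 2009, "Thick as Thieves")
  ).groupBy(_.actor_name)

  // Join on the partition key only: each left-side element pulls back
  // every row in its partition, whatever its own release_year says.
  def joinOnPartitionKey(left: Seq[ActorYear]): Seq[(ActorYear, VideoRow)] =
    for {
      actor <- left
      video <- videosByActor.getOrElse(actor.actor_name, Seq.empty)
    } yield (actor, video)

  val joined: Seq[(ActorYear, VideoRow)] = joinOnPartitionKey(
    Seq(ActorYear("Johnny Depp", 2014), ActorYear("Bruce Willis", 2014)))

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

In this sketch, Bruce Willis is paired with both his 2014 and his 2010 video, while Tom Hardy's partition is never touched because no left-side element asks for it.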

Now let's change the join condition. We can use the `on` method to choose the join columns, provided that any column we add beyond the partition key is a clustering column. Here, adding `release_year` limits the results to videos released in 2014.

In [7]:
actors2014.joinWithCassandraTable("killr_video","videos_by_actor")
    .on(SomeColumns("actor_name", "release_year"))
    .takeSample(false,10)
    .foreach(println)

(ActorYear(Bruce Willis,2014),CassandraRow{actor_name: Bruce Willis, release_year: 2014, character_name: Omar, title: The Prince, video_id: ed01818c-a5e2-11e5-8efd-a45e60eb67c5})
(ActorYear(Johnny Depp,2014),CassandraRow{actor_name: Johnny Depp, release_year: 2014, character_name: Guy Lapointe, title: Tusk, video_id: ed01abe6-a5e2-11e5-89d1-a45e60eb67c5})
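
The effect of joining on both columns can be sketched in plain Scala. This is a model of the semantics, not the connector itself: `ClusteringColumnJoinSketch`, `VideoRow`, and `joinOnActorAndYear` are hypothetical names, the in-memory `Map` stands in for `videos_by_actor`, and it assumes `actor_name` is the partition key and `release_year` a clustering column.

```scala
object ClusteringColumnJoinSketch {
  case class ActorYear(actor_name: String, release_year: Int)
  case class VideoRow(actor_name: String, release_year: Int, title: String)

  // In-memory stand-in for videos_by_actor (NOT the connector API).
  val videosByActor: Map[String, Seq[VideoRow]] = Seq(
    VideoRow("Bruce Willis", 2014, "The Prince"),
    VideoRow("Bruce Willis", 2010, "The Expendables"),
    VideoRow("Johnny Depp", 2014, "Tusk")
  ).groupBy(_.actor_name)

  // Models a join on actor_name AND release_year: look up the partition,
  // then keep only the rows whose clustering column matches the left side.
  def joinOnActorAndYear(left: Seq[ActorYear]): Seq[(ActorYear, VideoRow)] =
    for {
      actor <- left
      video <- videosByActor.getOrElse(actor.actor_name, Seq.empty)
      if video.release_year == actor.release_year
    } yield (actor, video)

  val joined: Seq[(ActorYear, VideoRow)] = joinOnActorAndYear(
    Seq(ActorYear("Johnny Depp", 2014), ActorYear("Bruce Willis", 2014)))

  def main(args: Array[String]): Unit = joined.foreach(println)
}
```

Only the 2014 rows survive in this sketch, which mirrors the two-row result above.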


Join two Cassandra tables using the `joinWithCassandraTable` method. Make a point of starting with the table that has the lower cardinality. In this case there are more videos than there are actors, so we start with the `actor` table.

In [11]:
sc.cassandraTable("killr_video", "actor")
    .joinWithCassandraTable("killr_video","videos_by_actor")
    .on(SomeColumns("actor_name"))
    .takeSample(false,10)
    .foreach(println)

(CassandraRow{actor_name: Tom Hardy},CassandraRow{actor_name: Tom Hardy, release_year: 2009, character_name: Michaels, title: Thick as Thieves, video_id: eced3f30-a5e2-11e5-abf1-a45e60eb67c5})
(CassandraRow{actor_name: Kim Darby},CassandraRow{actor_name: Kim Darby, release_year: 1969, character_name: Mattie Ross, title: True Grit, video_id: ecee2cd7-a5e2-11e5-a5a1-a45e60eb67c5})
(CassandraRow{actor_name: Mackenzie Crook},CassandraRow{actor_name: Mackenzie Crook, release_year: 2003, character_name: Ragetti, title: Pirates of the Caribbean: The Curse of the Black Pearl, video_id: ece5c0de-a5e2-11e5-9271-a45e60eb67c5})
(CassandraRow{actor_name: John Goodman},CassandraRow{actor_name: John Goodman, release_year: 2009, character_name: Julie 'Baby Feet' Balboni, title: The Princess and the Frog, video_id: ecef014c-a5e2-11e5-a7de-a45e60eb67c5})
(CassandraRow{actor_name: Burt Reynolds},CassandraRow{actor_name: Burt Reynolds, release_year: 1976, character_name: Gator McKlusky, title: Gator, vide
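
One way to reason about which table to start from is to count keyed lookups. The sketch below is plain Scala under a deliberately simplified cost model that is an assumption, not a measurement: one lookup per left-side row. `JoinDirectionSketch` and `joinCounting` are hypothetical names, and the small in-memory collections are illustrative stand-ins for the `actor` and `videos_by_actor` tables.

```scala
object JoinDirectionSketch {
  case class VideoRow(actor_name: String, release_year: Int, title: String)

  // Illustrative in-memory data; the real tables are far larger.
  val actors: Seq[String] = Seq("Tom Hardy", "Kim Darby", "Bruce Willis")
  val videos: Seq[VideoRow] = Seq(
    VideoRow("Tom Hardy", 2009, "Thick as Thieves"),
    VideoRow("Kim Darby", 1969, "True Grit"),
    VideoRow("Bruce Willis", 2014, "The Prince"),
    VideoRow("Bruce Willis", 2010, "The Expendables"),
    VideoRow("Bruce Willis", 2003, "Tears of the Sun")
  )

  // Returns the joined pairs plus the number of keyed lookups issued,
  // assuming one lookup per left-side row.
  def joinCounting[L, R](left: Seq[L], lookup: L => Seq[R]): (Seq[(L, R)], Int) = {
    val pairs = left.flatMap(l => lookup(l).map(r => (l, r)))
    (pairs, left.size)
  }

  val videosByActor: Map[String, Seq[VideoRow]] = videos.groupBy(_.actor_name)
  val actorsByName: Map[String, Seq[String]] = actors.groupBy(identity)

  // Drive the join from the smaller side (actors) ...
  val (fromActors, actorLookups) =
    joinCounting[String, VideoRow](actors, a => videosByActor.getOrElse(a, Seq.empty))

  // ... versus driving it from the larger side (videos).
  val (fromVideos, videoLookups) =
    joinCounting[VideoRow, String](videos, v => actorsByName.getOrElse(v.actor_name, Seq.empty))

  def main(args: Array[String]): Unit =
    println(s"lookups from actors: $actorLookups, from videos: $videoLookups")
}
```

Both directions produce the same number of joined pairs in this sketch, but driving from the smaller side issues fewer lookups, which is the intuition behind starting with the `actor` table.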