![DataStax Academy](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/datastax-academy.svg "DataStax Academy")

# Exercise 20 - Tuning Partitioning: Controlling Partitioning

## Background

In this exercise, you'll run long-running queries, inspect how they execute in the Spark UI, and see how the number of partitions affects performance.

You'll be working with the `videos_by_two_actors` table in the `killr_video` keyspace.

***

## Directions

#### 1. Open the Spark web UI tool in a browser at http://localhost:4040. You should see a view similar to this one:

![Diagram 1](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-01.jpg "Diagram 1")

#### 2. Run the code cell below and refresh the Spark UI in your browser a few times while the query executes.

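```scala
// Read only the actor1 column from the Cassandra table.
val primaryActor = sc.cassandraTable("killr_video", "videos_by_two_actors").select("actor1")
// Count occurrences of each first character, then count the resulting records.
primaryActor.map(r => (r.getString(0).substring(0,1), 1)).reduceByKey((x,y) => x + y).count
```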

The query takes the first character of each `actor1` value, counts how many times each character appears, and then returns the number of resulting records (one per distinct first character). The query takes a while to execute (20 to 60 seconds, depending on your hardware).
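To make the shape of this computation concrete, here is a minimal sketch that mirrors the same steps on a plain Scala collection rather than an RDD. The sample names are hypothetical stand-ins for `actor1` values, and `groupBy` followed by a sum stands in for `reduceByKey`. It illustrates the logic only, not how Spark distributes the work.

```scala
// Hypothetical sample values standing in for the actor1 column.
val sampleActors = Seq("Tom Hanks", "Tim Allen", "Meg Ryan", "Mel Gibson", "Al Pacino")

// Step 1 (map): pair the first character of each name with a 1.
val localPairs = sampleActors.map(name => (name.substring(0, 1), 1))

// Step 2 (local analogue of reduceByKey): sum the 1s for each first character.
val countsByInitial = localPairs.groupBy(_._1).map { case (initial, group) =>
  (initial, group.map(_._2).sum)
}

// Step 3 (count): one record remains per distinct first character.
val distinctInitials = countsByInitial.size

println(countsByInitial)   // per-initial totals
println(distinctInitials)  // number of distinct initials
```

With this sample, the final count is the number of distinct initials (T, M, and A), not the number of names.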

In your browser, you should see a view similar to this as the query executes:

![Diagram 2](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-02.jpg "Diagram 2")

When the query completes, you should see a view similar to this:

![Diagram 3](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-03.jpg "Diagram 3")

#### 3. Click on the job description to view more details about the query's execution. Which stage took longer?

![Diagram 4](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-04.jpg "Diagram 4")

Take note that `map` took significantly longer than `count`.
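One way to see where Spark draws the boundary between these two stages is to print the RDD lineage with `toDebugString`. The sketch below assumes it runs in the same session (so `sc` is available) and uses a small hypothetical in-memory sample in place of the Cassandra table. The shuffle introduced by `reduceByKey` splits the job in two: everything before it, including reading the source data, is pipelined into the stage the UI labels `map`, which is likely why that stage dominates the run time.

```scala
// Assumes the notebook's SparkContext `sc`; the names are hypothetical sample data.
val lineageSample = sc.parallelize(Seq("Tom Hanks", "Tim Allen", "Meg Ryan"))

val lineageReduced = lineageSample
  .map(name => (name.substring(0, 1), 1))
  .reduceByKey((x, y) => x + y)

// toDebugString returns the lineage; indentation changes at each shuffle boundary.
val lineage = lineageReduced.toDebugString
println(lineage)
```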

#### 4. Now execute the following block of code.

```scala
// Same aggregation, but print the first three (character, count) records.
primaryActor.map(r => (r.getString(0).substring(0,1), 1)).reduceByKey((x,y) => x + y).take(3).foreach(println)
```

This code prints the first three records instead of counting them. Notice that it still takes a long time. Monitor the job in the browser as the query executes.

#### 5. After the query finishes, review the details in the browser. For this job, which stage took longer?

![Diagram 5](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-05.jpg "Diagram 5")

Again, `map` took the bulk of the time.

#### 6. Re-run the query, but this time force Spark to use 10,000 partitions for the reduce step instead of the default. Wait for it to complete.

```scala
// The second argument to reduceByKey requests 10,000 partitions for the reduced RDD.
val shuffled = primaryActor.map(r => (r.getString(0).substring(0,1), 1)).reduceByKey((x,y) => x + y, 10000).count
```

#### 7. Open the details of this latest job in the browser. Which stage took longer?

Notice that the `count` stage took much longer this time because of the large increase in the number of partitions.

![Diagram 6](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-06.jpg "Diagram 6")
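The second argument to `reduceByKey` sets the number of partitions in the resulting RDD, and Spark runs one task per partition in the stage that follows. The sketch below, which assumes the session's `sc` and uses hypothetical sample data, checks the partition counts through `partitions.length`. It requests 100 partitions rather than 10,000 so that it finishes quickly.

```scala
// Assumes the notebook's SparkContext `sc`; the names are hypothetical sample data.
val actorSample = sc.parallelize(Seq("Tom Hanks", "Tim Allen", "Meg Ryan", "Al Pacino"), 4)
val initialPairs = actorSample.map(name => (name.substring(0, 1), 1))

// Without a second argument, Spark chooses the partition count itself.
val defaultReduced = initialPairs.reduceByKey((x, y) => x + y)

// With a second argument, the result has exactly that many partitions.
val wideReduced = initialPairs.reduceByKey((x, y) => x + y, 100)

println(actorSample.partitions.length)     // partitions in the input RDD
println(defaultReduced.partitions.length)  // partitions Spark chose
println(wideReduced.partitions.length)     // partitions requested explicitly
```

Both versions produce the same records; only the number of tasks needed to produce them changes.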

#### 8. Click on the `count` stage link in the description (this page may take a while to load). Then click on the "Show Additional Metrics" option. Which metric is larger: Duration or Scheduler Delay?

![Diagram 8](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-08.jpg "Diagram 8")

![Diagram 9](https://s3.amazonaws.com/datastaxtraining/vq8Jr36Gk48v/20-09.jpg "Diagram 9")

Notice the amount of time to complete each task (Duration) is smaller than the amount of time to schedule each task (Scheduler Delay).
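A rough way to see why scheduling overhead dominates: the keys here are single characters, so there are only a few dozen distinct keys to spread across 10,000 partitions. The sketch below is plain Scala and approximates hash partitioning as a non-negative `hashCode` modulo the partition count, which is an assumption about the partitioner in use. The key set `A` to `Z` is also hypothetical. Under those assumptions nearly every reduce task has nothing to process, yet each one still has to be scheduled.

```scala
// Hypothetical key set: one key per uppercase first letter.
val keys = ('A' to 'Z').map(_.toString)
val numPartitions = 10000

// Approximation of hash partitioning: non-negative hashCode modulo partition count.
def partitionFor(key: String): Int = java.lang.Math.floorMod(key.hashCode, numPartitions)

// Partitions that receive at least one key; the rest correspond to empty tasks.
val nonEmpty = keys.map(partitionFor).distinct.size
val empty = numPartitions - nonEmpty

println(s"non-empty partitions: $nonEmpty")
println(s"empty partitions: $empty")
```

With this key set, 26 partitions hold data and the remaining 9,974 are empty, which suggests that a partition count closer to the number of distinct keys or available cores would avoid most of the scheduling cost.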