<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Spark Lab

---



### Some useful Spark resources

- [Spark CSV](https://github.com/databricks/spark-csv)
- [Pyspark programming guide](https://spark.apache.org/docs/0.9.0/python-programming-guide.html)
- [Download and run Spark](https://github.com/mahmoudparsian/pyspark-tutorial/blob/master/howto/download_install_run_spark.md)

---

**In this lab, we will use Spark to dig into the Bay Area Bike Share data.**

Our goal is to calculate the average number of trips per hour, using the Caltrain station as the starting point.

In [1]:
import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warnings
from pyspark.sql import SQLContext

In [2]:
try:
    # we try to create a SparkContext that runs locally with 4 worker threads
    # (pass 'local[*]' instead to use all available cores)
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


### 1. Load the Bay Area Bike Share trip data


> **Note:** This dataset comes from: http://www.bayareabikeshare.com/open-data

In [3]:
# A:
trips = sc.textFile('./data/201508_trip_data.csv')

In [4]:
# trips.first()

### 2. What kind of object is the data loaded as?

In [5]:
# A:

### 3. Split csv lines

In Spark, we can build complex pipelines that only get executed when we ask for their results.

In plain Python, each step of a calculation is executed immediately. In Spark, defining the pipeline and executing it are separate steps: transformations such as `map` and `filter` are lazy, and nothing runs until an action such as `collect` is called.

In other words, we can define the pipeline with all its steps, and only when we call `collect` (or another action) will the data flow through it. To get familiar with this workflow, we will build our pipeline in small steps.
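The difference between defining and executing a pipeline can be sketched without Spark. The block below is only a plain-Python analogy: a lazy `map` and a generator expression stand in for RDD transformations, and `list(...)` stands in for `collect`. The sample lines and the `split_line` helper are made up for illustration.

```python
# Plain-Python analogy only: lazy iterators stand in for RDD transformations.
calls = []  # records every line that the "transformation" actually processes

def split_line(line):
    """Split one CSV line and remember that we were called."""
    calls.append(line)
    return line.split(',')

lines = ['1,70,a', '2,50,b', '3,70,c']  # made-up sample data

# Defining the pipeline does no work yet, similar to rdd.map(...).filter(...)
pipeline = (fields for fields in map(split_line, lines) if fields[1] == '70')
calls_before = len(calls)

# Consuming the pipeline triggers the computation, similar to rdd.collect()
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)
```

No line is split until the pipeline is consumed, which mirrors how Spark defers work until an action is called.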

**Apply a map to `trips` that splits each line at commas and save the result to an RDD.**

> **Hint:** if you want to check that you're doing things right, you can collect the result and display the first few lines.

In [6]:
# A:
# trips = trips.map(...
#  don't forget to collect!

### 4. Filter for Caltrain station

In Spark we can also create filters using the `filter` method.

**Select station number 70 by filtering on the 5th column.** 

We will do all the following analysis just on this station, which corresponds to the most popular starting point. Save this to a variable called `station_70`.
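One detail that is easy to miss: after splitting on commas, every field is a string, so the comparison must be against `'70'` rather than the integer `70`. The sketch below shows this on a few made-up rows in plain Python; the header names and the `starts_at` helper are assumptions for illustration, not part of the lab's data or API.

```python
# Made-up rows, already split on commas; index 4 is the 5th column.
rows = [
    ['Trip ID', 'Duration', 'Start Date', 'Start Station', 'Start Terminal'],
    ['1', '600', '8/31/2015 23:26', 'Station A', '70'],
    ['2', '300', '8/31/2015 23:11', 'Station B', '50'],
    ['3', '450', '9/1/2015 8:05', 'Station A', '70'],
]

def starts_at(fields, terminal='70'):
    """Return True if the 5th column equals the given terminal (as a string)."""
    return fields[4] == terminal

station_rows = [r for r in rows if starts_at(r)]

# Comparing against the integer 70 matches nothing, because fields are strings.
wrong = [r for r in rows if r[4] == 70]

print(len(station_rows), len(wrong))
```

A predicate like this can be passed to an RDD's `filter` method. As a side effect, the header row is dropped too, since its 5th field is not `'70'`.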

In [7]:
# A:
# station_70 = trips.filter(...

### 5. Trips by day - hour (mapper)

Let's analyse the trips by the hour. We can do this by running a MapReduce job in Spark. First we emit a tuple with a count of 1 for each (date, hour) key, and then we sum the counts by key.

**Emit tuples of `((date, hour), 1)` by applying a map to `station_70` that extracts the relevant data from each line.**
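As a hint, the mapper can be written as an ordinary Python function and tried on a few rows before it is handed to Spark. The sketch below assumes the start date sits in the 3rd column (index 2) and looks like `8/31/2015 23:26`; both are assumptions to check against `trips.first()`. The sample rows and the `to_day_hour_key` name are made up.

```python
from datetime import datetime

def to_day_hour_key(fields, date_col=2, fmt='%m/%d/%Y %H:%M'):
    """Return ((date, hour), 1) for one already-split row.

    date_col and fmt are assumptions about the file layout.
    """
    start = datetime.strptime(fields[date_col], fmt)
    return ((start.date().isoformat(), start.hour), 1)

# Made-up rows, already split on commas
sample_rows = [
    ['1', '600', '8/31/2015 23:26', 'Station A', '70'],
    ['2', '300', '8/31/2015 23:11', 'Station A', '70'],
    ['3', '450', '9/1/2015 8:05', 'Station A', '70'],
]

emitted = [to_day_hour_key(row) for row in sample_rows]
print(emitted)
```

Because it is a plain function, the same mapper could then be passed to `station_70.map(...)`.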

In [8]:
# A:
# trips_by_day_hour = station_70.map(...

### 6. Trips by day - hour (reducer)

Use the `reduceByKey` method to obtain the number of trips per (date, hour).
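Conceptually, `reduceByKey` groups the values that share a key and folds each group with a two-argument function. The following is a plain-Python emulation of that idea on made-up pairs, not Spark's implementation; `reduce_by_key` is a hypothetical helper.

```python
from collections import defaultdict
from functools import reduce
from operator import add

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (emulation only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# Made-up ((date, hour), 1) pairs, as a mapper might emit them
pairs = [
    (('2015-08-31', 23), 1),
    (('2015-08-31', 23), 1),
    (('2015-09-01', 8), 1),
]

counts = reduce_by_key(pairs, add)
print(counts)
```

In Spark, the function passed to `reduceByKey` plays the role of `add` here: it takes two values for the same key and returns their combination.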

In [9]:
# A:
# trips_by_day_hour = trips_by_day_hour.reduceByKey(...

### 7. Trips by hour (mapper)

Let's further group the trips by hour. We'll do this with a second MapReduce job.

First we will discard the day and emit tuples of (hour, count). You can achieve this with a map.

In [10]:
# A:

### 8. Trips by hour (reducer)

Now calculate the average number of trips by hour using the `combineByKey` method.

> You can find a suggestion on how to do it [here](http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/).
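`combineByKey` takes three functions: one that creates an accumulator from the first value seen for a key, one that merges another value into an accumulator, and one that merges two accumulators coming from different partitions. The sketch below emulates that contract in plain Python to compute a per-key average via `(sum, count)` accumulators. The `combine_by_key` helper and the two sample "partitions" are made up, and this is not how Spark implements it internally.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate the combineByKey contract over a list of partitions."""
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            if key in acc:
                acc[key] = merge_value(acc[key], value)   # key seen before
            else:
                acc[key] = create_combiner(value)         # first value for key
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, comb in acc.items():
            merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

# Made-up (hour, count) pairs spread over two partitions
partitions = [
    [(8, 10), (9, 4), (8, 20)],
    [(8, 30), (9, 6)],
]

sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),                         # create (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # add one value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge two accumulators
)
averages = {hour: total / n for hour, (total, n) in sum_count.items()}
print(averages)
```

The same three lambdas, in the same order, are the arguments that an RDD's `combineByKey` expects; dividing the sum by the count is then a final map over the result.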

In [11]:
# A:

### 9. `collect()` the results.
We can finally collect our results and sort them.

In [12]:
# A:

### 10. [Bonus] Using the Spark `sqlContext`

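Besides the SparkContext, Spark also exposes a SQLContext, which lets us run SQL queries on structured data such as DataFrames.

A SQLContext has already been created for you in the setup cell at the top of this notebook and is stored in `sqlContext`. Do not create another one, or unspecified behavior may occur. If you are running inside the `pyspark` shell instead, the `sqlContext` provided there may be a HiveContext.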

**Run a query using the sqlContext to obtain the average duration of a trip originating from the Caltrain station.**

*Note: you might have to rename the columns*

In [13]:
# A:
tripsSql = sqlContext.read.format('com.databricks.spark.csv').options(header='true',
                inferschema='true').load('./data/201508_trip_data.csv')