This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week14` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; the instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit is via JupyterHub: you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

## Problem 14.1. Spark

In this problem, we will perform basic data processing tasks within Spark using the concept of Resilient Distributed Datasets (RDDs).

In [None]:
import pyspark
from pyspark import SparkConf, SparkContext

from nose.tools import assert_equal, assert_is_instance

We run Spark in [local mode](http://spark.apache.org/docs/latest/programming-guide.html#local-vs-cluster-modes) from within our Docker container.

In [None]:
sc = SparkContext('local[*]')

We create a new RDD by reading in the data as a text file. We use the ratings data from [MovieLens](http://grouplens.org/datasets/movielens/latest/). See [Week 6 Lesson 1](https://github.com/UI-DataScience/info490-sp16/blob/master/Week6/notebooks/intro2rs.ipynb) for more information on this data set.

In [None]:
text_file = sc.textFile('/home/data_scientist/data/ml-latest-small/ratings.csv')

assert_is_instance(text_file, pyspark.rdd.RDD)

- Write a function that creates a new RDD by transforming `text_file` into an RDD with columns of appropriate data types.
- The function accepts a `pyspark.rdd.RDD` instance (e.g., `text_file` in the above code cell) and returns another RDD instance, `pyspark.rdd.PipelinedRDD`.
- `ratings.csv` contains a header row. Use the `head` command or otherwise to inspect the file.
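
The header-skipping and type-conversion pattern can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the helper name `parse_line` is hypothetical, the header text is assumed to be `userId,movieId,rating,timestamp`, and the two data lines are copied from the expected output in the test cell. With an RDD, the same idea would presumably be expressed with the `filter` and `map` transformations rather than a list comprehension.

```python
# Stand-in for the lines of a CSV file with a header row.
# The header text is an assumption; inspect the real file to confirm it.
lines = [
    'userId,movieId,rating,timestamp',
    '1,16,4.0,1217897793',
    '1,24,1.5,1217895807',
]

def parse_line(line):
    """Split one CSV line and cast each field to a suitable type."""
    user, movie, rating, timestamp = line.split(',')
    return (int(user), int(movie), float(rating), int(timestamp))

header = lines[0]

# Drop the header row, then convert every remaining line to a typed tuple.
rows = [parse_line(line) for line in lines if line != header]
print(rows)
```

Comparing each line against the header is one simple way to skip it; it relies on no data line being identical to the header.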

In [None]:
def read_ratings_csv(rdd):
    '''
    Creates an RDD by transforming `ratings.csv`
    into columns with appropriate data types.
    
    Parameters
    ----------
    rdd: A pyspark.rdd.RDD instance.
    
    Returns
    -------
    A pyspark.rdd.PipelinedRDD instance.
    '''
    
    # YOUR CODE HERE
    
    return rdd

In [None]:
ratings = read_ratings_csv(text_file)
print(ratings.take(3))

In [None]:
assert_is_instance(ratings, pyspark.rdd.PipelinedRDD)
assert_equal(ratings.count(), 105339)
assert_equal(len(ratings.first()), 4)
assert_equal(
    ratings.take(5),
    [(1, 16, 4.0, 1217897793),
     (1, 24, 1.5, 1217895807),
     (1, 32, 4.0, 1217896246),
     (1, 47, 4.0, 1217896556),
     (1, 50, 4.0, 1217896523)]
    )

For simplicity, we might want to restrict our analysis to favorable ratings only. Since the movies are rated on a five-star system, we take this to mean ratings greater than three.

- Write a function that selects rows whose rating is greater than 3.
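
As a rough sketch of the selection rule, the snippet below applies a strict `rating > 3` predicate to a small list of tuples in plain Python. The names `RATING_INDEX` and `favorable_rows` are hypothetical, and the last row is made up solely to show the boundary case; the other rows come from the expected output of the earlier test cell. An RDD offers a `filter` transformation that takes a predicate in the same way.

```python
# Position of the rating field in each (user, movie, rating, timestamp) tuple.
RATING_INDEX = 2

rows = [
    (1, 16, 4.0, 1217897793),
    (1, 24, 1.5, 1217895807),
    (1, 32, 4.0, 1217896246),
    (2, 16, 3.0, 0),  # made-up row: a rating of exactly 3 is not favorable
]

# Keep only rows whose rating is strictly greater than 3.
favorable_rows = list(filter(lambda row: row[RATING_INDEX] > 3, rows))
print(favorable_rows)
```

Note that the comparison is strict, so a rating of exactly 3.0 is excluded.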

In [None]:
def filter_favorable_ratings(rdd):
    '''
    Selects rows whose rating is greater than 3.
    
    Parameters
    ----------
    rdd: A pyspark.rdd.RDD instance.
    
    Returns
    -------
    A pyspark.rdd.PipelinedRDD instance.
    '''
    
    # YOUR CODE HERE
    
    return rdd

In [None]:
favorable = filter_favorable_ratings(ratings)

In [None]:
assert_is_instance(favorable, pyspark.rdd.PipelinedRDD)
assert_equal(favorable.count(), 64160)

We might also want to select only those movies that have been reviewed by multiple people.

- Write a function that returns the number of reviews for a given movie.
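
Counting the reviews for one movie amounts to selecting the rows with a matching movie ID and counting them. The plain-Python sketch below uses a hypothetical helper, `count_reviews`, on a small list; the last row is made up so that one movie has two reviews. With an RDD, a `filter` followed by the `count()` action would give the same kind of result, and `count()` returns a plain Python `int`.

```python
# Position of the movie ID in each (user, movie, rating, timestamp) tuple.
MOVIE_INDEX = 1

rows = [
    (1, 16, 4.0, 1217897793),
    (1, 32, 4.0, 1217896246),
    (2, 16, 5.0, 0),  # made-up row so that movie 16 has two reviews
]

def count_reviews(rows, movie_id):
    """Return how many rows refer to the given movie ID."""
    return sum(1 for row in rows if row[MOVIE_INDEX] == movie_id)

print(count_reviews(rows, 16))
print(count_reviews(rows, 0))
```

A movie ID that never appears simply yields a count of zero, with no special case needed.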

In [None]:
def find_n_reviews(rdd, movie_id):
    '''
    Finds the number of reviews for a movie.
    
    Parameters
    ----------
    rdd: A pyspark.rdd.RDD instance.
    movie_id: An int.
    
    Returns
    -------
    An int.
    '''
    
    # YOUR CODE HERE
    
    return n_reviews

In [None]:
n_toy_story = find_n_reviews(favorable, 1)
print(n_toy_story)

In [None]:
assert_is_instance(n_toy_story, int)

test = [find_n_reviews(favorable, n) for n in range(5)]
assert_equal(test, [0, 172, 44, 18, 3])

## Cleanup

We must stop the SparkContext in order to release the Spark resources before exiting this notebook.

In [None]:
sc.stop()