# 303 Spark - Movielens

The goal of this lab is to run some analysis on a different dataset, [MovieLens](https://grouplens.org/datasets/movielens/).

- [Spark programming guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [RDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html)
- [PairRDD APIs](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/PairRDDFunctions.html)

This lab's notebook is in the ```material``` folder; the solutions will be released in the same folder.

The cluster configuration should be the same as in 301 and 302.

Download the dataset [here](https://big.csr.unibo.it/downloads/bigdata/ml-dataset.zip), unzip it and upload the files to S3.

- ml_movies.csv (<u>movieId</u>:Long, title:String, genres:String) 
    - genres are separated by pipes (e.g., "comedy|drama|action")
    - each movie is associated with many ratings

- ml_ratings.csv (<u>userId</u>:Long, <u>movieId</u>:Long, rating:Double, year:Int)
    - each rating is associated with many tags
- ml_tags.csv (<u>userId</u>:Long, <u>movieId</u>:Long, <u>tag</u>:String, year:Int) 
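
As a minimal sketch of how a line in this layout can be split, the snippet below uses a made-up sample line (an assumption, not taken from the dataset) and the same comma-outside-quotes idea used by the parser in this notebook. The names `sampleLine`, `commaOutsideQuotes`, `fields`, and `genres` are illustrative only.

```scala
// Hypothetical sample line in the ml_movies.csv layout: movieId,title,genres
val sampleLine = "11,\"American President, The (1995)\",Comedy|Drama|Romance"

// Split only on commas that are followed by an even number of double quotes,
// i.e. commas that are outside a quoted title.
val commaOutsideQuotes = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
val fields = sampleLine.split(commaOutsideQuotes)

// The pipe is a regex metacharacter, so it must be escaped when splitting genres.
val genres = fields(2).split("\\|").toList
```

Here `fields` has three elements even though the title contains a comma, and `genres` holds the three genre names.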

In [None]:
%%configure -f
{"executorMemory":"8G", "numExecutors":2, "executorCores":3, "conf": {"spark.dynamicAllocation.enabled": "false"}}

In [None]:
val bucketname = "unibo-bd2122-egallinucci"

val path_ml_movies = "s3a://"+bucketname+"/first-datasets/ml-movies.csv"
val path_ml_ratings = "s3a://"+bucketname+"/first-datasets/ml-ratings.csv"
val path_ml_tags = "s3a://"+bucketname+"/first-datasets/ml-tags.csv"

sc.applicationId

"SPARK UI: Enable forwarding of port 20888 and connect to http://localhost:20888/proxy/" + sc.applicationId + "/"

In [None]:
import java.util.Calendar
import org.apache.spark.sql.SaveMode
import org.apache.spark.HashPartitioner

object MovieLensParser {

  val noGenresListed = "(no genres listed)"
  val commaRegex = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
  val pipeRegex = "\\|(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"
  val quotes = "\""
  
  /** Convert from timestamp (String) to year (Int) */
  def yearFromTimestamp(timestamp: String): Int = {
    val cal = Calendar.getInstance()
    cal.setTimeInMillis(timestamp.trim.toLong * 1000L)
    cal.get(Calendar.YEAR)
  }

  /** Function to parse movie records
   *
   *  @param line line that has to be parsed
   *  @return tuple containing movieId, title, and genres; None in case of input errors
   */
  def parseMovieLine(line: String): Option[(Long, String, String)] = {
    try {
      val input = line.split(commaRegex)
      var title = input(1).trim
      title = if(title.startsWith(quotes)) title.substring(1) else title
      title = if(title.endsWith(quotes)) title.substring(0, title.length - 1) else title
      Some((input(0).trim.toLong, title, input(2).trim))
    } catch {
      case _: Exception => None
    }
  }

  /** Function to parse rating records
   *
   *  @param line line that has to be parsed
   *  @return tuple containing userId, movieId, rating, and year; None in case of input errors
   */
  def parseRatingLine(line: String): Option[(Long, Long, Double, Int)] = {
    try {
      val input = line.split(commaRegex)
      Some((input(0).trim.toLong, input(1).trim.toLong, input(2).trim.toDouble, yearFromTimestamp(input(3))))
    } catch {
      case _: Exception => None
    }
  }

  /** Function to parse tag records
   *
   *  @param line line that has to be parsed
   *  @return tuple containing userId, movieId, tag, and year; None in case of input errors
   */
  def parseTagLine(line: String) : Option[(Long, Long, String, Int)] = {
    try {
      val input = line.split(commaRegex)
      Some((input(0).trim.toLong, input(1).trim.toLong, input(2), yearFromTimestamp(input(3))))
    } catch {
      case _: Exception => None
    }
  }

}

In [None]:
val rddMovies = sc.textFile(path_ml_movies).flatMap(MovieLensParser.parseMovieLine)
val rddRatings = sc.textFile(path_ml_ratings).flatMap(MovieLensParser.parseRatingLine)
val rddTags = sc.textFile(path_ml_tags).flatMap(MovieLensParser.parseTagLine)

## 303-1 Datasets exploration

Cache the datasets and answer the following questions:

- How many (distinct) users, movies, ratings, and tags?
- How many (distinct) genres?
- On average, how many ratings per user?
- On average, how many ratings per movie?
- On average, how many genres per movie?
- What is the range of ratings?
- Which years are covered? (print a sorted list)
- On average, how many ratings per year?
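
The questions above mostly reduce to `distinct`, `count`, and simple ratios. A minimal sketch on toy data follows; it assumes the notebook's `sc` is available, and `toyRatings` is a hypothetical stand-in with the same tuple shape as `rddRatings`, not the real dataset.

```scala
// Toy ratings shaped like rddRatings: (userId, movieId, rating, year).
val toyRatings = sc.parallelize(Seq(
  (1L, 10L, 4.0, 2000),
  (1L, 20L, 3.0, 2001),
  (2L, 10L, 5.0, 2001)
)).cache() // cached because several actions below reuse it

val numUsers  = toyRatings.map(_._1).distinct().count()
val numMovies = toyRatings.map(_._2).distinct().count()

// Average ratings per user = total ratings / distinct users.
val avgRatingsPerUser = toyRatings.count().toDouble / numUsers

// Range of ratings.
val minRating = toyRatings.map(_._3).min()
val maxRating = toyRatings.map(_._3).max()

// Distinct years, sorted on the driver (the list of years is small).
val years = toyRatings.map(_._4).distinct().collect().sorted
```

The same pattern applies to the per-movie and per-year averages by changing the field used as the denominator.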

## 303-2 Compute the average rating for each movie

- Export the result to S3
- Do not start from cached RDDs
- Evaluate:
  - Join-and-Aggregate vs Aggregate-and-Join
  - The better of the two vs. a broadcast join
- Use Tableau to check the results
  - Download the file from S3 rather than connecting Tableau to S3 directly
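
To illustrate the alternatives being compared, here is a hedged sketch of aggregate-then-join and of a broadcast lookup on toy data. It assumes the notebook's `sc`; `toyMovieTitles` and `toyMovieRatings` are hypothetical, already keyed by `movieId`, and much smaller than the real RDDs.

```scala
// Toy inputs keyed by movieId.
val toyMovieTitles  = sc.parallelize(Seq((10L, "Movie A"), (20L, "Movie B")))
val toyMovieRatings = sc.parallelize(Seq((10L, 4.0), (10L, 5.0), (20L, 3.0)))

// Aggregate first: keep (sum, count) per movieId, then derive the average.
val avgByMovie = toyMovieRatings
  .aggregateByKey((0.0, 0L))(
    (acc, r) => (acc._1 + r, acc._2 + 1),   // fold one rating into the accumulator
    (a, b) => (a._1 + b._1, a._2 + b._2))   // merge accumulators across partitions
  .mapValues { case (sum, cnt) => sum / cnt }

// Option 1: regular (shuffle) join with the movie titles.
val viaJoin = avgByMovie.join(toyMovieTitles)
  .map { case (_, (avg, title)) => (title, avg) }

// Option 2: broadcast the small movie table and look titles up locally,
// which avoids shuffling for the join itself.
val bTitles = sc.broadcast(toyMovieTitles.collectAsMap())
val viaBroadcast = avgByMovie.flatMap { case (id, avg) =>
  bTitles.value.get(id).map(title => (title, avg))
}
```

Both variants should produce the same pairs; the difference to look for in the Spark UI is the number of stages and the amount of shuffled data.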

In [None]:
val path_output_avgRatPerMovie = "s3a://"+bucketname+"/spark/avgRatPerMovie"
// rdd.coalesce(1).toDF().write.format("csv").mode(SaveMode.Overwrite).save(path_output_avgRatPerMovie)

sc.getPersistentRDDs.foreach(_._2.unpersist())

## 303-3 Genres

Make a chart of the best-rated genres (by average rating), export the result to S3, then use Tableau to check it.

Use cached RDDs.

Two possible workflows:

1. Pre-aggregation (3 shuffles)

  - Aggregate ratings by movieId
  - Join with movies and map to genres
  - Aggregate by genres
  
2. Join & aggregate (2 shuffles)

  - Join with movies and map to genres
  - Aggregate by genres
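
A minimal sketch of the pre-aggregation workflow on toy data follows, assuming the notebook's `sc`. The inputs `toyGenresByMovie` and `toyRatingsByMovie` are hypothetical. One assumption worth stating: the genre average here is taken over all individual ratings, so the first step keeps `(sum, count)` rather than a per-movie average, since averaging averages would weight movies instead of ratings.

```scala
// Toy inputs keyed by movieId.
val toyGenresByMovie  = sc.parallelize(Seq((10L, "Comedy|Drama"), (20L, "Drama")))
val toyRatingsByMovie = sc.parallelize(Seq((10L, 4.0), (10L, 5.0), (20L, 3.0)))

// Step 1: pre-aggregate ratings per movie as (sum, count).
val sumCountByMovie = toyRatingsByMovie
  .mapValues(r => (r, 1L))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

// Step 2: join with movies and emit one record per genre.
val sumCountByGenre = sumCountByMovie.join(toyGenresByMovie).flatMap {
  case (_, (sumCount, genres)) => genres.split("\\|").map(g => (g, sumCount))
}

// Step 3: aggregate by genre, compute the average, and sort from best to worst.
val avgByGenre = sumCountByGenre
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
  .mapValues { case (sum, cnt) => sum / cnt }
  .sortBy(_._2, ascending = false)
```

The join-and-aggregate variant would skip step 1 and map each joined rating to `(genre, (rating, 1L))` directly.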



In [None]:
val path_output_avgRatPerGenre = "s3a://"+bucketname+"/spark/avgRatPerGenre"

for ((k,v) <- sc.getPersistentRDDs) {
  v.unpersist()
}

## 303-4 Tags

What can you find out about tags?
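
One possible starting point, sketched on toy data, is counting how often each tag is used. It assumes the notebook's `sc`; `toyTags` is a hypothetical stand-in shaped like `rddTags`, and lowercasing tags before counting is an assumption about how duplicates should be treated.

```scala
// Toy tags shaped like rddTags: (userId, movieId, tag, year).
val toyTags = sc.parallelize(Seq(
  (1L, 10L, "Funny", 2005),
  (2L, 10L, "funny", 2006),
  (2L, 20L, "dark", 2006)
))

// Normalize the tag text, count occurrences, and sort by frequency.
val tagCounts = toyTags
  .map { case (_, _, tag, _) => (tag.trim.toLowerCase, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
```

From here, other angles include tags per movie, tags per year, or joining tags with ratings.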