# <center> Introduction to Spark In-Memory Computing with PySpark </center>

In [None]:
!module list

In [None]:
import sys
import os

sys.path.insert(0, '/usr/hdp/2.6.0.3-8/spark2/python')
sys.path.insert(0, '/usr/hdp/2.6.0.3-8/spark2/python/lib/py4j-0.10.4-src.zip')

os.environ['SPARK_HOME'] = '/usr/hdp/2.6.0.3-8/spark2/'
os.environ['SPARK_CONF_DIR'] = '/etc/hadoop/synced_conf/spark2/'
os.environ['PYSPARK_PYTHON'] = '/software/anaconda3/4.2.0/bin/python'

import pyspark
conf = pyspark.SparkConf()
conf.setMaster("yarn")
conf.set("spark.driver.memory","4g")
conf.set("spark.executor.memory","60g")
conf.set("spark.executor.instances","3")  # number of executors requested from YARN
conf.set("spark.executor.cores","12")

sc = pyspark.SparkContext(conf=conf)

In [None]:
sc

### Movie Ratings

An independent movie company is looking to invest in a new movie project. With limited financing, the company wants to 
analyze audience reactions, particularly toward various movie genres, in order to identify the most promising 
movie project to focus on. The company relies on data collected from a publicly available recommendation service 
by [MovieLens](http://dl.acm.org/citation.cfm?id=2827872). This 
[dataset](http://files.grouplens.org/datasets/movielens/ml-10m-README.html) contains **24404096** ratings and **668953** tag applications across **40110** movies. These data were created by **247753** users between January 09, 1995 and January 29, 2016. This dataset was generated on October 17, 2016.

From this dataset, several analyses are possible, including the following:
1.   Find movies which have the highest average ratings over the years and identify the corresponding genre.
2.   Find genres which have the highest average ratings over the years.
3.   Find users who rate movies most frequently in order to contact them for in-depth marketing analysis.

These types of analyses, which are somewhat ambiguous, demand the ability to process large amounts of data in a 
relatively short time for decision support purposes. In these situations, the size of the data typically 
makes analysis on a single machine impossible and analysis using a remote storage system impractical. For 
the remainder of the lesson, we will learn how HDFS provides the basis to store massive amounts of data and enables 
the programming approach used to analyze these data.

In [None]:
!hdfs dfs -ls /repository/movielens

In [None]:
!hdfs dfs -cat  /repository/movielens/README.txt

In [None]:
!hdfs dfs -cat  /repository/movielens/links.csv \
    2>/dev/null | head -n 5

In [None]:
!hdfs dfs -cat  /repository/movielens/movies.csv \
    2>/dev/null | head -n 5

In [None]:
!hdfs dfs -cat  /repository/movielens/ratings.csv \
    2>/dev/null | head -n 5

In [None]:
!hdfs dfs -cat  /repository/movielens/tags.csv \
    2>/dev/null | head -n 5

In [None]:
ratings = sc.textFile("/repository/movielens/ratings.csv")

In [None]:
ratings.cache()

In [None]:
%%time
ratings.count()

In [None]:
%%time
ratings.count()

In [None]:
%%time
ratings.count()

### 4.1 Find movies which have the highest average ratings over the years and identify the corresponding genre

- Find the average ratings of all movies over the years
- Identify the corresponding genres for each movie

In [None]:
ratings.take(5)

In [None]:
ratingHeader = ratings.first() #extract header
print(ratingHeader)

In [None]:
ratingsOnly = ratings.filter(lambda x:x != ratingHeader)

In [None]:
ratingsOnly.take(5)

In [None]:
movieRatings = ratingsOnly.map(lambda line: (line.split(",")[1], float(line.split(",")[2])))

In [None]:
movieRatings.take(5)

**Possible approaches to aggregating data:**
- groupByKey and mapValues
- reduceByKey and countByKey
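
Before running either approach on the cluster, it can help to see what each one computes per key. The following is a minimal plain-Python sketch, not Spark code: the `sample` pairs are made up for illustration, and the dictionaries only stand in for what `groupByKey`/`mapValues` and `reduceByKey`/`countByKey` produce.

```python
from collections import Counter, defaultdict

# Hypothetical (movieId, rating) pairs; not taken from the real dataset.
sample = [("1", 4.0), ("2", 3.0), ("1", 5.0), ("2", 2.0), ("1", 3.0)]

# Approach 1: collect every rating per key, then average each list
# (mirrors groupByKey followed by mapValues).
grouped = defaultdict(list)
for movie_id, rating in sample:
    grouped[movie_id].append(rating)
avg_by_grouping = {k: sum(v) / len(v) for k, v in grouped.items()}

# Approach 2: keep only a running sum and a count per key
# (mirrors reduceByKey for the sums and countByKey for the counts).
sums = defaultdict(float)
counts = Counter()
for movie_id, rating in sample:
    sums[movie_id] += rating
    counts[movie_id] += 1
avg_by_reducing = {k: sums[k] / counts[k] for k in sums}

print(avg_by_grouping)
print(avg_by_reducing)
```

Both approaches give the same averages. The second never holds all ratings for a key at once, which is generally why reduce-style aggregation tends to scale better than grouping in Spark.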

**groupByKey and mapValues**

In [None]:
groupByKeyRatings = movieRatings.groupByKey()

groupByKeyRatings.take(5)

In [None]:
mapValuesToListRatings = groupByKeyRatings.mapValues(list)
mapValuesToListRatings.take(5)

In [None]:
avgRatings01 = mapValuesToListRatings.mapValues(lambda V: sum(V) / float(len(V)))

avgRatings01.take(5)

Is this correct?

In [None]:
(3.5 + 3.5 + 2.5 + 3.5 + 2.0 + 3.5 + 2.5 + 3.0) / 8

**reduceByKey and countByKey**

In [None]:
countsByKey = movieRatings.countByKey()

countsByKey

In [None]:
def sumValues(x,y):
    return (x + y)

sumRatings = movieRatings.reduceByKey(sumValues)

sumRatings.take(5)

In [None]:
import operator

sumRatings = movieRatings.reduceByKey(operator.add)
sumRatings.take(5)

In [None]:
avgRatings02 = sumRatings.map(lambda x: (x[0], x[1] / countsByKey.get(x[0])))

avgRatings02.take(5)
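
`countByKey` returns an ordinary dictionary to the driver, which the lambda above then has to ship to the executors. An alternative that keeps everything distributed is to reduce `(sum, count)` pairs in a single pass. The sketch below shows only the combining logic in plain Python, with `functools.reduce` standing in for `reduceByKey` on a single key. The helper name `add_pairs` is hypothetical, and the values are the eight ratings from the manual check earlier.

```python
from functools import reduce

def add_pairs(a, b):
    """Combine two (sum, count) pairs; associative, so usable as a reducer."""
    return (a[0] + b[0], a[1] + b[1])

# Ratings for one movie, each turned into a (rating, 1) pair.
ratings_for_movie = [3.5, 3.5, 2.5, 3.5, 2.0, 3.5, 2.5, 3.0]
pairs = [(r, 1) for r in ratings_for_movie]

total, count = reduce(add_pairs, pairs)
average = total / count
print(total, count, average)
```

In PySpark, the same function should be usable as `movieRatings.mapValues(lambda r: (r, 1)).reduceByKey(add_pairs).mapValues(lambda p: p[0] / p[1])`.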

How do we augment the movie ratings data with title information?

In [None]:
movies = sc.textFile("/repository/movielens/movies.csv")

In [None]:
movieHeader = movies.first() #extract header
print(movieHeader)

In [None]:
movies = movies.filter(lambda x:x != movieHeader)

movies.take(5)

In [None]:
# Titles may contain commas, so split the ID off the left and the genres off the right
movieInfo = movies.map(lambda line: (line.split(",", 1)[0], tuple(line.split(",", 1)[1].rsplit(",", 1))))

movieInfo.take(5)
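
Titles in `movies.csv` can contain commas, in which case the field is wrapped in double quotes, so splitting on every comma can misplace the genres. A minimal sketch using Python's standard `csv` module is shown below. The helper name `parse_movie_line` is hypothetical, and the two sample lines are illustrative examples of the `movieId,title,genres` layout.

```python
import csv

def parse_movie_line(line):
    """Parse one 'movieId,title,genres' line into (movieId, (title, genres))."""
    movie_id, title, genres = next(csv.reader([line]))
    return (movie_id, (title, genres))

plain = "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy"
quoted = '11,"American President, The (1995)",Comedy|Drama|Romance'

print(parse_movie_line(plain))
print(parse_movie_line(quoted))
print(quoted.split(",")[2])  # naive split: a title fragment, not the genres
```

Assuming every line has exactly three fields, the helper could be applied to the RDD as `movies.map(parse_movie_line)`.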

In [None]:
augmentedRatings = avgRatings01.join(movieInfo)

augmentedRatings.take(5)
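
`join` pairs up records from the two RDDs that share a key and drops keys that appear on only one side. As a rough illustration of the result shape, here is a plain-Python sketch of an inner join on small made-up data. `inner_join` is a hypothetical helper, not part of PySpark.

```python
def inner_join(left, right):
    """Return (key, (left_value, right_value)) for keys found on both sides."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Made-up averages and movie info, for illustration only.
averages = [("1", 4.0), ("2", 2.5), ("99", 5.0)]
info = [
    ("1", ("Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy")),
    ("2", ("Jumanji (1995)", "Adventure|Children|Fantasy")),
]

joined = inner_join(averages, info)
for row in joined:
    print(row)
```

Each row has the form `(movieId, (average, (title, genres)))`, which is why an ordering key on the joined data uses `x[1][0]` to reach the average.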

*Movies with the highest average ratings:*

In [None]:
augmentedRatings.takeOrdered(10, key = lambda x : -x[1][0])

*Movies with the lowest average ratings:*

In [None]:
augmentedRatings.takeOrdered(10, key = lambda x : x[1][0])

### Challenge

- Augment the mapping process of WordCount with a function that strips punctuation and normalizes capitalization before the unique words are counted

### Challenge:

1. Make appropriate changes so that only movies with an average rating higher than 3.75 are collected.
2. Further enhance your modification so that only movies with an average rating higher than 3.75 and at least 1000 ratings are collected.

### 4.2 Find genres which have the highest average ratings over the years

- Identify the genres associated with a movie and its rating
- Each movie can have multiple genres. How do we flip the key/value pair so that the genre becomes the key?

In [None]:
movieRatings.take(5)

In [None]:
movieInfo.take(5)

In [None]:
augmentedInfo = movieRatings.join(movieInfo)

In [None]:
augmentedInfo.take(5)

In [None]:
def extractGenreRating(t):
    # t has the form (movieId, (rating, (title, genres))); emit one (genre, rating) pair per genre
    rating = t[1][0]
    genreList = t[1][1][1].split("|")
    return [(genre, rating) for genre in genreList]

print(extractGenreRating((u'1', (3.0, (u'Toy Story (1995)', u'Adventure|Animation|Children|Comedy|Fantasy')))))

In [None]:
genreRatings = augmentedInfo.flatMap(extractGenreRating)

In [None]:
genreRatings.take(5)
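
`flatMap` differs from `map` in that each input record may yield several output records, which are then flattened into a single collection. The following is a minimal plain-Python sketch of that difference, with `itertools.chain` standing in for the flattening step. `extract_genre_rating` is a local restatement of the extraction function, and the records are made up for illustration, with shortened genre lists.

```python
from itertools import chain

def extract_genre_rating(record):
    """Turn (movieId, (rating, (title, genres))) into one (genre, rating) pair per genre."""
    rating, (_title, genres) = record[1]
    return [(genre, rating) for genre in genres.split("|")]

records = [
    ("1", (3.0, ("Toy Story (1995)", "Adventure|Animation"))),
    ("2", (4.0, ("Jumanji (1995)", "Adventure|Fantasy"))),
]

mapped = [extract_genre_rating(r) for r in records]  # like map: one list per record
flat = list(chain.from_iterable(mapped))             # like flatMap: one flat list of pairs

print(mapped)
print(flat)
```

The flat list of `(genre, rating)` pairs has the same key/value shape as `movieRatings`, so the aggregation approaches used for movies apply to genres as well.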

### Challenge:

Complete the remaining portion of task 4.2: calculate the average rating of each genre over the years.

### 4.3 Find users who rate movies most frequently in order to contact them for in-depth marketing analysis

- How do you define "frequently"?
    - At least once per week?

In [None]:
# Key each rating by user ID and keep its timestamp (fourth column)
userRatings = ratingsOnly.map(lambda line: (line.split(",")[0], float(line.split(",")[3])))

In [None]:
ratingGroupByUsers = userRatings.groupByKey().mapValues(list)
ratingGroupByUsers.take(5)

In [None]:
avgRatingFreq = ratingGroupByUsers.mapValues(lambda V: (max(V) - min(V)) / float(len(V)))
avgRatingFreq.take(5)

In [None]:
x = [1346139060.0,
   1346139098.0,
   1346139113.0,
   1346139053.0,
   1346139234.0,
   1346139006.0,
   1346139209.0,
   1346139147.0,
   1346138998.0,
   1346139206.0,
   1346139224.0,
   1346139174.0,
   1346139152.0,
   1346139230.0,
   1346139181.0,
   1346139159.0,
   1346139314.0]
(max(x) - min(x)) / float(len(x))
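
One way to make "frequently" concrete is to compare a user's average gap between ratings with the number of seconds in a week. The sketch below assumes the timestamps are Unix seconds and uses hypothetical names (`SECONDS_PER_WEEK`, `average_gap`, `rates_weekly`) together with a shortened sample of the timestamps listed above.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def average_gap(timestamps):
    """Average seconds per rating: time span divided by the number of ratings."""
    return (max(timestamps) - min(timestamps)) / float(len(timestamps))

def rates_weekly(timestamps):
    """True if the user's average gap between ratings is at most one week."""
    return average_gap(timestamps) <= SECONDS_PER_WEEK

# A few of the timestamps from the example list.
x = [1346139060.0, 1346139098.0, 1346138998.0, 1346139314.0]

print(average_gap(x))
print(rates_weekly(x))
```

A user who rated everything in a single short session also passes this test, so a practical definition would probably add a minimum number of ratings or a minimum active time span.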

In [None]:
# A smaller average gap means more frequent rating, so take the 10 smallest values
topUsers = avgRatingFreq.takeOrdered(10, key=lambda x: x[1])

In [None]:
topUsers

In [None]:
sc.stop()