# Project 1: Movie Recommendation

This system recommends movies given a user, an average-rating cutoff, and the id of the movie the recommendations should be based on.

For example, if the user picks a movie with the genres (Action, Adventure) and an average rating of 3.0, we only recommend movies that are Action, Adventure, or both, and that have an average rating of 3.0 or above.

We just import some modules here and initialize a SparkContext.

In [2]:
import os

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster('local[*]').setAppName('bds_project1')
sc = SparkContext.getOrCreate(conf=conf)

We initialize variables for the user, the id of the movie they rated, and the rating used as the cutoff: avg_rating is later compared against each movie's average rating.

In [3]:
user_id = 1
mid = 31
avg_rating = 4

We import the movies CSV to grab data about the movie, mainly its ID and genre.

We also drop the CSV header by filtering for every line that isn't the header line.

The resulting movies_csv is then mapped into this structure: (movieid, (name, set(genres)))
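
As a rough illustration of that structure, the sketch below parses single lines in plain Python, without Spark. The sample lines and the helper name `parse_movie_line` are made up for this example. It assumes a movieId,title,genres layout in which titles containing commas are wrapped in double quotes, which is why it uses the `csv` module rather than a bare `split(',')`.

```python
import csv

def parse_movie_line(line):
    """Turn one CSV line into (movieid, (name, set(genres))). Hypothetical helper."""
    # csv.reader respects double quotes, so a comma inside a quoted title
    # does not split the field.
    movie_id, name, genres = next(csv.reader([line]))
    return (int(movie_id), (name, set(genres.split('|'))))

# Made-up sample lines in the assumed movieId,title,genres layout.
plain = parse_movie_line('1,Some Movie (2000),Action|Adventure')
quoted = parse_movie_line('2,"Sequel, The (2001)",Comedy|Drama')
print(plain)
print(quoted)
```

The quoted line shows why the parsing choice matters: splitting on every comma would leave only part of the title in the name field and put the rest where the genres belong.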

The reason why we use a set for genres is to be able to use set.intersection() to see if there are any overlapping genres with the movie that the user rated. We use that overlap to decide whether to recommend this movie.
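
A minimal sketch of that overlap check in plain Python, using made-up genre sets rather than anything read from the dataset:

```python
# Hypothetical genre sets for a source movie and two candidates.
source_genres = {'Action', 'Adventure'}
candidate_a = {'Adventure', 'Fantasy'}
candidate_b = {'Documentary'}

overlap_a = source_genres.intersection(candidate_a)
overlap_b = source_genres.intersection(candidate_b)

# A non-empty set is truthy and an empty set is falsy, so the result of
# intersection() can be used directly as a filter predicate.
print(overlap_a, bool(overlap_a))
print(overlap_b, bool(overlap_b))
```

Because the intersection is used only for its truthiness, sharing a single genre is enough for a candidate to pass.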

In [4]:
import csv

mcsv_path = os.path.join(os.getcwd(), 'datasets', 'movies.csv')
movies_csv = sc.textFile(mcsv_path)
movies_header = movies_csv.first()
movies_csv = movies_csv.filter(lambda line: line != movies_header)
# Parse with csv.reader rather than split(',') so that quoted titles
# containing commas stay in a single field.
movies_csv = movies_csv.map(lambda line: next(csv.reader([line]))).map(
    lambda x: (int(x[0]), (x[1], set(x[2].split('|')))))

We import the ratings CSV to grab ratings of each movie by user.

We remove the header of the CSV file as before, then map the ratings to this structure:
    (movieid, (rating, 1))

red_ratings then sums the ratings and the counts for each movie, giving:
    (movieid, (rating_sum, num_ratings))

averaged_ratings then divides each movie's rating sum by its number of ratings and rounds the result to one decimal place to get:
    (movieid, average_rating)

Lastly, good_movies keeps only the movies whose average rating is at least avg_rating.
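
The same sum-and-count pattern can be sketched in plain Python without Spark. The ratings below are made up, and `cutoff` stands in for avg_rating:

```python
from collections import defaultdict

# Hypothetical (movieid, rating) pairs.
ratings = [(1, 4.0), (1, 3.0), (2, 5.0), (1, 3.5), (2, 4.0)]
cutoff = 4.0  # plays the role of avg_rating

# Same idea as reduceByKey: keep a running (rating_sum, count) per movie.
totals = defaultdict(lambda: (0.0, 0))
for movie_id, rating in ratings:
    rating_sum, count = totals[movie_id]
    totals[movie_id] = (rating_sum + rating, count + 1)

# Divide the sum by the count, then keep only movies at or above the cutoff.
averaged = {m: round(s / n, 1) for m, (s, n) in totals.items()}
good = {m: avg for m, avg in averaged.items() if avg >= cutoff}
print(averaged)
print(good)
```

Carrying the pair (sum, count) instead of a running average keeps the combining step associative, which is what lets partial results be merged in any order.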

In [5]:
rcsv_path = os.path.join(os.getcwd(), 'datasets', 'ratings.csv')
ratings_csv = sc.textFile(rcsv_path)
ratings_header = ratings_csv.first()
ratings_csv = ratings_csv.filter(lambda line: line != ratings_header)
restruct_ratings = ratings_csv.map(lambda line: line.split(',')).map(
    lambda x: (int(x[1]), (float(x[2]), 1)))
red_ratings = restruct_ratings.reduceByKey(
    lambda x, y: (x[0] + y[0], x[1] + y[1]))
averaged_ratings = red_ratings.mapValues(lambda x: round(x[0]/x[1], 1))
good_movies = averaged_ratings.filter(lambda rating: rating[1] >= avg_rating)

We look up the rated movie in movies_csv to get its genres.

In [6]:
src_movie = movies_csv.filter(lambda movie: movie[0] == mid).first()

We define rec_similar to keep only the movies that are similar to the given movie. We say a movie is similar if it shares one or more genres.

In [10]:
def rec_similar(mid, mcsv):
    """Recommend movies that share at least one genre with the given movie

    Args:
        mid (int): movie id to use as the source for recommendations
        mcsv (RDD): movies RDD of (movieid, (name, set(genres)))
    Returns:
        RDD: lazily evaluated RDD of (movieid, name) for similar movies
    """
    # Look up the source genres from mid instead of relying on a global.
    src_genres = mcsv.filter(lambda movie: movie[0] == mid).first()[1][1]
    return mcsv.filter(
        lambda movie: movie[1][1].intersection(src_genres)).map(
            lambda x: (x[0], x[1][0]))

We run rec_similar to get movies we want to recommend.

In [11]:
res = rec_similar(mid, movies_csv)

We want only movies that both share a genre with the source movie and meet the rating cutoff, and we also want to show each movie's average rating, so we join the similar-movies RDD with the good_movies RDD on the movie id.

Afterwards, we take a random sample of 5 movies to give to the user.

Lastly, we don't want to show the movie ids, so we print only the second item of each tuple, which is itself a tuple of the movie name and its average rating.
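
The join, sample, and strip-the-id steps can be pictured with a plain-Python sketch. The two dictionaries are hypothetical stand-ins for the keyed RDDs, and `random.sample` plays the role of the random draw:

```python
import random

# Hypothetical stand-ins for the two keyed collections.
similar = {1: 'Movie A', 2: 'Movie B', 3: 'Movie C'}  # movieid -> name
good = {2: 4.5, 3: 4.1, 4: 4.8}                       # movieid -> average rating

# Inner join: keep only ids present on both sides, pairing name with rating.
joined = [(movie_id, (similar[movie_id], good[movie_id]))
          for movie_id in similar if movie_id in good]

# Sample without replacement, capped at the number of available rows.
picks = random.sample(joined, min(5, len(joined)))

# Drop the ids and keep only the (name, average_rating) pairs.
print([row[1] for row in picks])
```

Movie 1 has no qualifying rating and movie 4 does not share a genre, so neither survives the join. Only ids present on both sides remain.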

In [12]:
recs = res.join(good_movies).takeSample(False, 5)
print([movie[1] for movie in recs])

[('Drunken Angel (Yoidore tenshi) (1948)', 4.5), ('Limbo (1999)', 4.4), ('Love Is the Devil (1998)', 4.2), ("Sharky's Machine (1981)", 4.0), ("Satan's Brew (Satansbraten) (1976)", 4.5)]
