# Data 606 - Capstone Project
### Movie Recommendation System

Goal of this project: to explore recommendation systems on the MovieLens dataset using both a content-based method and a collaborative filtering method built on machine learning algorithms.

Outline for this project:
1. EDA
2. Data preparation
3. Content-based classification
4. Collaborative filtering

#### **About the Dataset:**
This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of these files follow.

#### **User Ids**
MovieLens users were selected at random for inclusion. Their ids have been anonymized. User ids are consistent between ratings.csv and tags.csv (i.e., the same id refers to the same user across the two files).

#### **Movie Ids**
Only movies with at least one rating or tag are included in the dataset. These movie ids are consistent with those used on the MovieLens web site (e.g., id 1 corresponds to the URL https://movielens.org/movies/1). Movie ids are consistent between ratings.csv, tags.csv, movies.csv, and links.csv (i.e., the same id refers to the same movie across these four data files).

#### **Ratings Data**
Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

userId,movieId,rating,timestamp

Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
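As a minimal sketch of the format described above, the following plain-Python snippet parses one ratings row, checks the rating against the half-star scale, and converts the epoch-seconds timestamp to a UTC datetime. The sample row and the helper name `parse_rating_row` are illustrative assumptions, not part of the notebook.

```python
import csv
import io
from datetime import datetime, timezone

# Illustrative sample in the ratings.csv layout; the row values are assumptions for this sketch.
SAMPLE = "userId,movieId,rating,timestamp\n1,1,4.0,964982703\n"


def parse_rating_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    rating = float(row["rating"])
    # Valid ratings are 0.5, 1.0, ..., 5.0: twice the rating is a whole number in 1..10.
    if not (rating * 2).is_integer() or not 0.5 <= rating <= 5.0:
        raise ValueError(f"rating outside the half-star scale: {rating}")
    # Timestamps are seconds since 1970-01-01 00:00:00 UTC.
    rated_at = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
    return {
        "userId": int(row["userId"]),
        "movieId": int(row["movieId"]),
        "rating": rating,
        "rated_at": rated_at,
    }


rows = [parse_rating_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(rows[0]["rated_at"].isoformat())
```

Passing `tz=timezone.utc` keeps the result independent of the machine's local time zone, which matters when only the calendar date is kept.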

#### **Tags Data**

Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

userId,movieId,tag,timestamp

Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user.
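Because tags are free text, the same idea can be spelled in more than one way (for example with different capitalization). The sketch below shows one possible normalization before counting tags. Treating case variants as the same tag is an assumption of this example, and `normalize_tag` is a hypothetical helper.

```python
from collections import Counter


def normalize_tag(tag):
    """Trim surrounding whitespace and fold case so 'Mafia' and 'mafia' match."""
    return tag.strip().casefold()


# Illustrative tags; merging case variants is an assumption of this sketch.
sample_tags = ["mafia", "Mafia", "Al Pacino", "gangster"]
tag_counts = Counter(normalize_tag(t) for t in sample_tags)
print(tag_counts.most_common(1))
```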

Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.


#### **Movies Data**
Each line of this file after the header row represents one movie, and has the following format:

movieId,title,genres

Genres are a pipe-separated list, and are selected from the following:

- Action
- Adventure
- Animation
- Children
- Comedy
- Crime
- Documentary
- Drama
- Fantasy
- Film-Noir
- Horror
- Musical
- Mystery
- Romance
- Sci-Fi
- Thriller
- War
- Western
- (no genres listed)
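A minimal sketch of how a pipe-separated genres value could be split into individual genres. Mapping the `(no genres listed)` placeholder to an empty list is an assumption of this example, and `parse_genres` is a hypothetical helper.

```python
NO_GENRES = "(no genres listed)"


def parse_genres(genres):
    """Split a pipe-separated genres string into a list of genre names."""
    # Assumption: the placeholder value means the movie has no genres at all.
    if genres == NO_GENRES:
        return []
    return genres.split("|")


print(parse_genres("Comedy|Drama|Romance"))
print(parse_genres(NO_GENRES))
```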

#### **Links Data**
Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following format:

movieId,imdbId,tmdbId

movieId is an identifier for movies used by https://movielens.org. E.g., the movie Toy Story has the link https://movielens.org/movies/1.

imdbId is an identifier for movies used by http://www.imdb.com. E.g., the movie Toy Story has the link http://www.imdb.com/title/tt0114709/.

tmdbId is an identifier for movies used by https://www.themoviedb.org. E.g., the movie Toy Story has the link https://www.themoviedb.org/movie/862.
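The link patterns above can be sketched as a small helper. When imdbId is read as an integer, leading zeros are lost, so the example pads it back to seven digits. That width is inferred from the Toy Story link and is an assumption. `build_links` is a hypothetical name.

```python
def build_links(movie_id, imdb_id, tmdb_id=None):
    """Build external URLs for one row of links.csv."""
    links = {
        "movielens": f"https://movielens.org/movies/{movie_id}",
        # Assumption: IMDb title ids are zero-padded to at least 7 digits after "tt".
        "imdb": f"http://www.imdb.com/title/tt{int(imdb_id):07d}/",
    }
    # tmdbId may be missing for some movies, so only add the link when it is present.
    if tmdb_id is not None:
        links["tmdb"] = f"https://www.themoviedb.org/movie/{int(tmdb_id)}"
    return links


toy_story = build_links(1, 114709, 862)
for site, url in toy_story.items():
    print(site, url)
```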

#### **References:**
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

In [1]:
#Import Required Libraries
import pandas as pd
import os
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import isnan, when, count, col
from pyspark.sql.functions import to_timestamp
from pyspark.sql import functions as f
from pyspark.sql import types as t


In [3]:
spark = SparkSession \
    .builder \
    .appName("Data_606_Project") \
    .config(conf=SparkConf()) \
    .getOrCreate()

In [8]:
path = 'C:\\Users\\KIM\\Documents\\GitHub\\movieRecommendationSystem\\MRS_repo\\data\\raw\\Data3_movielens\\'

In [9]:
df_links = spark.read.csv(path + 'links.csv', header=True, inferSchema=True)
df_movies = spark.read.csv(path + 'movies.csv', header=True, inferSchema=True)
df_ratings = spark.read.csv(path + 'ratings.csv', header=True, inferSchema=True)
df_tags = spark.read.csv(path + 'tags.csv', header=True, inferSchema=True)

In [10]:
#Create a Dictionary for running functions conveniently.
df_dict = {'df_links' : df_links, 'df_movies':df_movies, 'df_ratings':df_ratings, 'df_tags':df_tags}

In [11]:
#Show a preview and the total number of records of each DataFrame
for name, df in df_dict.items():
  print(name)
  df.show()
  print('Total Records in',name,':',df.count())
  print('')

df_links
+-------+------+------+
|movieId|imdbId|tmdbId|
+-------+------+------+
|      1|114709|   862|
|      2|113497|  8844|
|      3|113228| 15602|
|      4|114885| 31357|
|      5|113041| 11862|
|      6|113277|   949|
|      7|114319| 11860|
|      8|112302| 45325|
|      9|114576|  9091|
|     10|113189|   710|
|     11|112346|  9087|
|     12|112896| 12110|
|     13|112453| 21032|
|     14|113987| 10858|
|     15|112760|  1408|
|     16|112641|   524|
|     17|114388|  4584|
|     18|113101|     5|
|     19|112281|  9273|
|     20|113845| 11517|
+-------+------+------+
only showing top 20 rows

Total Records in df_links : 9742

df_movies
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Com

In [12]:
#Print the schema of each DataFrame
for name,df in df_dict.items():
  print(name, 'Schema:')
  df.printSchema()

df_links Schema:
root
 |-- movieId: integer (nullable = true)
 |-- imdbId: integer (nullable = true)
 |-- tmdbId: integer (nullable = true)

df_movies Schema:
root
 |-- movieId: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: string (nullable = true)

df_ratings Schema:
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)

df_tags Schema:
root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- tag: string (nullable = true)
 |-- timestamp: integer (nullable = true)



In [13]:
df_ratings

DataFrame[userId: int, movieId: int, rating: double, timestamp: int]

In [14]:
#The timestamp columns of df_ratings and df_tags were inferred as integers (Unix epoch seconds).
#Cast them to timestamp, then truncate to a date (yyyy-MM-dd).
#Note: df_dict still references the original DataFrames with integer timestamps.
df_ratings = df_ratings.withColumn('timestamp', f.to_date(f.col('timestamp').cast(t.TimestampType())))

df_tags = df_tags.withColumn('timestamp', f.to_date(f.col('timestamp').cast(t.TimestampType())))


In [15]:
df_ratings.printSchema()
df_tags.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: date (nullable = true)

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- tag: string (nullable = true)
 |-- timestamp: date (nullable = true)



In [16]:
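#Function that counts the NaN and null values in each column of a DataFrame
def checkdfnan(df):
  #show() returns None, so keep the counts DataFrame and return it after displaying it
  nan_counts = df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns])
  nan_counts.show()
  return nan_counts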
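
The check above counts, per column, the values that are null or NaN. A plain-Python sketch of the same counting logic, without Spark, may make the rule easier to follow. The sample rows contain hypothetical missing values and `count_missing` is an illustrative name.

```python
import math


def count_missing(rows, columns):
    """Count None and NaN values per column in a list of dict rows."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            # A value is missing if it is None or a float NaN.
            if value is None or (isinstance(value, float) and math.isnan(value)):
                counts[c] += 1
    return counts


# Hypothetical rows in the links.csv layout; the missing tmdbId values are made up.
sample_rows = [
    {"movieId": 1, "imdbId": 114709, "tmdbId": 862},
    {"movieId": 2, "imdbId": 113497, "tmdbId": None},
    {"movieId": 3, "imdbId": 113228, "tmdbId": float("nan")},
]
missing = count_missing(sample_rows, ["movieId", "imdbId", "tmdbId"])
print(missing)
```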

In [17]:
for key,value in df_dict.items():
  print(key)
  checkdfnan(value)

df_links
+-------+------+------+
|movieId|imdbId|tmdbId|
+-------+------+------+
|      0|     0|     8|
+-------+------+------+

df_movies
+-------+-----+------+
|movieId|title|genres|
+-------+-----+------+
|      0|    0|     0|
+-------+-----+------+

df_ratings
+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     0|        0|
+------+-------+------+---------+

df_tags
+------+-------+---+---------+
|userId|movieId|tag|timestamp|
+------+-------+---+---------+
|     0|      0|  0|        0|
+------+-------+---+---------+



In [19]:
df_links.show()

+-------+------+------+
|movieId|imdbId|tmdbId|
+-------+------+------+
|      1|114709|   862|
|      2|113497|  8844|
|      3|113228| 15602|
|      4|114885| 31357|
|      5|113041| 11862|
|      6|113277|   949|
|      7|114319| 11860|
|      8|112302| 45325|
|      9|114576|  9091|
|     10|113189|   710|
|     11|112346|  9087|
|     12|112896| 12110|
|     13|112453| 21032|
|     14|113987| 10858|
|     15|112760|  1408|
|     16|112641|   524|
|     17|114388|  4584|
|     18|113101|     5|
|     19|112281|  9273|
|     20|113845| 11517|
+-------+------+------+
only showing top 20 rows



In [18]:
df_movies.show()

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
|     11|American Presiden...|Comedy|Drama|Romance|
|     12|Dracula: Dead and...|       Comedy|Horror|
|     13|        Balto (1995)|Adventure|Animati...|
|     14|        Nixon (1995)|               Drama|
|     15|Cutthroat Island ...|Action|Adventure|...|
|     16|       Casino (1995)|         Crime|Drama|
|     17|Sen

In [None]:
df_movies_ratings = df_movies.join(df_ratings, df_movies.movieId == df_ratings.movieId, 'inner').drop(df_movies.movieId)
df_movies_ratings.show()

+--------------------+--------------------+------+-------+------+----------+
|               title|              genres|userId|movieId|rating| timestamp|
+--------------------+--------------------+------+-------+------+----------+
|    Toy Story (1995)|Adventure|Animati...|     1|      1|   4.0|2000-07-30|
|Grumpier Old Men ...|      Comedy|Romance|     1|      3|   4.0|2000-07-30|
|         Heat (1995)|Action|Crime|Thri...|     1|      6|   4.0|2000-07-30|
|Seven (a.k.a. Se7...|    Mystery|Thriller|     1|     47|   5.0|2000-07-30|
|Usual Suspects, T...|Crime|Mystery|Thr...|     1|     50|   5.0|2000-07-30|
|From Dusk Till Da...|Action|Comedy|Hor...|     1|     70|   3.0|2000-07-30|
|Bottle Rocket (1996)|Adventure|Comedy|...|     1|    101|   5.0|2000-07-30|
|   Braveheart (1995)|    Action|Drama|War|     1|    110|   4.0|2000-07-30|
|      Rob Roy (1995)|Action|Drama|Roma...|     1|    151|   5.0|2000-07-30|
|Canadian Bacon (1...|          Comedy|War|     1|    157|   5.0|2000-07-30|

In [None]:
df_movies_ratings.count()

100836
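
The joined count equals the 100836 ratings given in the dataset description, which suggests that every rating matched a movie and no rating was dropped by the inner join. The plain-Python sketch below illustrates that reasoning, assuming movieId is unique in the movies table. The last rating row and its movieId are hypothetical, added to show an unmatched row being dropped.

```python
def inner_join(movies, ratings):
    """Inner-join rating rows to movie rows on movieId."""
    # Assumption: movieId is unique among movies, so each rating matches at most one movie.
    movies_by_id = {m["movieId"]: m for m in movies}
    joined = []
    for r in ratings:
        movie = movies_by_id.get(r["movieId"])
        if movie is not None:  # ratings without a matching movie are dropped
            joined.append({**movie, **r})
    return joined


movies = [
    {"movieId": 1, "title": "Toy Story (1995)"},
    {"movieId": 6, "title": "Heat (1995)"},
]
ratings = [
    {"userId": 1, "movieId": 1, "rating": 4.0},
    {"userId": 1, "movieId": 6, "rating": 4.0},
    {"userId": 2, "movieId": 999, "rating": 3.0},  # hypothetical rating with no matching movie
]
joined = inner_join(movies, ratings)
print(len(ratings), len(joined))
```

When the two counts are equal, every rating found its movie. Here the hypothetical third rating has no match, so the joined result is one row shorter.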

In [20]:
df_tags.show()

+------+-------+-----------------+----------+
|userId|movieId|              tag| timestamp|
+------+-------+-----------------+----------+
|     2|  60756|            funny|2015-10-24|
|     2|  60756|  Highly quotable|2015-10-24|
|     2|  60756|     will ferrell|2015-10-24|
|     2|  89774|     Boxing story|2015-10-24|
|     2|  89774|              MMA|2015-10-24|
|     2|  89774|        Tom Hardy|2015-10-24|
|     2| 106782|            drugs|2015-10-24|
|     2| 106782|Leonardo DiCaprio|2015-10-24|
|     2| 106782|  Martin Scorsese|2015-10-24|
|     7|  48516|     way too long|2007-01-24|
|    18|    431|        Al Pacino|2016-05-01|
|    18|    431|         gangster|2016-05-01|
|    18|    431|            mafia|2016-05-01|
|    18|   1221|        Al Pacino|2016-04-26|
|    18|   1221|            Mafia|2016-04-26|
|    18|   5995|        holocaust|2016-02-17|
|    18|   5995|       true story|2016-02-17|
|    18|  44665|     twist ending|2016-03-02|
|    18|  52604|  Anthony Hopkins|