<a href="https://colab.research.google.com/github/philippengani/movie_recommendation/blob/master/Airliquide_google_collab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's set up Spark in your Colab environment. Run the cell below!


In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
!pip install keras
!pip install scikit-surprise
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!pip install findspark

openjdk-8-jdk-headless is already the newest version (8u292-b10-0ubuntu1~18.04).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [2]:
import pandas as pd
import pyspark
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
from pyspark.sql import *
from pyspark.sql.types import *
# Note: this star import shadows Python built-ins such as sum, min and max
from pyspark.sql.functions import *
import findspark
findspark.add_packages('mysql:mysql-connector-java:8.0.11')
import keras
import os.path
from os import path
from zipfile import ZipFile


# **1. Download and extract the MovieLens dataset**

In [3]:
# Download the data from https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
dataset_path = "/content/dataset"

if not path.exists(dataset_path):
  !mkdir /content/dataset
  #!wget -P /content/dataset https://files.grouplens.org/datasets/movielens/ml-25m.zip
  !wget -P /content/dataset https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
else:
  print("Dataset already exists. No need to download.")


Dataset already exists. No need to download.


In [4]:
# Only extract the data the first time the script is run.
movielens_dir = dataset_path + "/ml-latest-small"
movielens_zipped_file = dataset_path + "/ml-latest-small.zip"

if not path.exists(movielens_dir):
    with ZipFile(movielens_zipped_file, "r") as z:
        # Extract files
        print("Extracting all the files now...")
        z.extractall(path=dataset_path)
        print("Done!")
else:
    print("Dataset already exists. No need to extract.")


Dataset already exists. No need to extract.


# **2. Set up the big data environment with PySpark**

In [5]:
# create the Spark configuration
conf = SparkConf().set("spark.ui.port", "4050")
# create (or reuse) the context, so that re-running this cell does not fail
sc = SparkContext.getOrCreate(conf=conf)

# create the session
spark = SparkSession \
    .builder \
    .appName("Movie recommendation") \
    .getOrCreate()


In [6]:
spark

# **3. Data exploration and cleaning**


**a. Reading the downloaded movie dataset into PySpark DataFrames**



In [7]:
ratings_file = movielens_dir + "/ratings.csv"
movies_file = movielens_dir + "/movies.csv"


# Define the dataset schema 

from pyspark.sql.types import *

ratings_df_schema = StructType(
  [StructField('userId', IntegerType()),
   StructField('movieId', IntegerType()),
   StructField('rating', DoubleType())]
)
movies_df_schema = StructType(
  [StructField('movieId', IntegerType()),
   StructField('title', StringType()),
   StructField('genres', StringType())]
)

# Create the PySpark DataFrames and cache them in memory

ratings_df = spark.read\
                  .options(header=True, inferSchema=False)\
                  .schema(ratings_df_schema)\
                  .csv(ratings_file)
movies_df = spark.read\
                .options(header=True, inferSchema=False)\
                .schema(movies_df_schema)\
                .csv(movies_file)

ratings_df.cache()
movies_df.cache()

DataFrame[movieId: int, title: string, genres: string]

In [8]:
ratings_df.show(10, truncate=False)
movies_df.show(10, truncate=False)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|1     |1      |4.0   |
|1     |3      |4.0   |
|1     |6      |4.0   |
|1     |47     |5.0   |
|1     |50     |5.0   |
|1     |70     |3.0   |
|1     |101    |5.0   |
|1     |110    |4.0   |
|1     |151    |5.0   |
|1     |157    |5.0   |
+------+-------+------+
only showing top 10 rows

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure|Animation|Children|Comedy|Fantasy|
|2      |Jumanji (1995)                    |Adventure|Children|Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy|Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy|Drama|Romance                       |
|5      |Father

In [9]:
ratings_df


DataFrame[userId: int, movieId: int, rating: double]

In [10]:
movies_df

DataFrame[movieId: int, title: string, genres: string]

# **4. Building the collaborative filtering model**



> For this project, we will use the Surprise (Simple Python RecommendatIon System Engine) library.


> This is because it is fast on a dataset of this size and ships with built-in SVD algorithms, which are matrix factorization algorithms.





In [11]:
from surprise import Reader, Dataset, SVD, SVDpp
from surprise import accuracy
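
To make the idea behind matrix factorization more concrete, here is a minimal NumPy sketch of a biased factorization trained with stochastic gradient descent. It predicts a rating as the global mean plus a user bias, an item bias, and the dot product of two latent vectors. The toy ratings, the `fit_biased_mf` name, and the hyperparameter values are assumptions chosen for illustration; this is not Surprise's actual implementation.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
                  lr=0.05, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + p[u] . q[i] with plain SGD."""
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])              # global mean rating
    b_u = np.zeros(n_users)                               # user biases
    b_i = np.zeros(n_items)                               # item biases
    p = rng.normal(0.0, 0.1, size=(n_users, n_factors))   # user latent factors
    q = rng.normal(0.0, 0.1, size=(n_items, n_factors))   # item latent factors

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()                         # update both vectors from old values
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])

    def predict(u, i):
        # Clip the estimate to the 1-5 rating scale used in this notebook
        est = mu + b_u[u] + b_i[i] + p[u] @ q[i]
        return float(np.clip(est, 1.0, 5.0))

    return predict

# Hypothetical (user index, item index, rating) triples
toy_ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]
predict = fit_biased_mf(toy_ratings, n_users=3, n_items=3)
print(predict(0, 2))  # estimate for a pair that is absent from the toy data
```

The learned biases and factors are what allow the model to score (user, movie) pairs it has never seen, which is the property the recommendation step relies on.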

In [12]:
# Surprise's Dataset.load_from_df expects a pandas DataFrame (not a Spark one), so convert the PySpark DataFrame to pandas
ratings_df_pd = ratings_df.toPandas()

In [13]:
ratings_pd = pd.read_csv(ratings_file)
movies_pd = pd.read_csv(movies_file)

In [14]:
ratings_pd

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [15]:
movies_pd

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In [16]:
ratings_pd.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [17]:
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(ratings_pd[['userId', 'movieId', 'rating']], reader=reader)

svd = SVD(n_factors=50)
svd_plusplus = SVDpp(n_factors=50)

In [18]:
# Build the training set and fit the model

trainset = dataset.build_full_trainset()

svd.fit(trainset)  # older Surprise versions used svd.train

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f44fc326390>

In [20]:
# Map each movieId to its title
id_2_names = dict(zip(movies_pd['movieId'], movies_pd['title']))

In [28]:
def create_test_set(user_id):
    """Build the anti-testset for one user: every movie the user has not rated."""
    fill = trainset.global_mean  # placeholder "true" rating for unrated movies
    u = trainset.to_inner_uid(user_id)

    # trainset.ur[u] holds the (inner item id, rating) pairs of user u
    user_items = {item_inner_id for item_inner_id, _ in trainset.ur[u]}

    return [(trainset.to_raw_uid(u), trainset.to_raw_iid(i), fill)
            for i in trainset.all_items() if i not in user_items]

Now let us create a function that returns the top 10 recommended movies for a user.

In [29]:
def top_recommendations(user_id, num_recommender=10):
    """Return the num_recommender unrated movies with the highest estimated ratings."""
    test_set = create_test_set(user_id)
    predictions = svd.test(test_set)  # we can change to SVD++ later

    # Collect (movie id, estimated rating) pairs, best estimate first
    recommendation = [(int(movie_id), estimated_rating)
                      for _, movie_id, _, estimated_rating, _ in predictions]
    recommendation.sort(key=lambda x: x[1], reverse=True)

    # Keep exactly as many movies as requested
    top = recommendation[:num_recommender]

    return pd.DataFrame({'movie_names': [id_2_names[movie_id] for movie_id, _ in top],
                         'rating': [rating for _, rating in top]})

In [35]:
# Now let us get the top 10 recommended movies for user 600

top_recommendations(600, num_recommender=10)

Unnamed: 0,movie_names,rating
0,Jaws (1975),4.11713
1,Glory (1989),4.116117
2,12 Angry Men (1957),4.103434
3,Seven Samurai (Shichinin no samurai) (1954),4.089558
4,"Outlaw Josey Wales, The (1976)",4.060724
5,"Streetcar Named Desire, A (1951)",4.03
6,"Boot, Das (Boat, The) (1981)",3.989496
7,Three Colors: Red (Trois couleurs: Rouge) (1994),3.968871
8,Man Bites Dog (C'est arrivé près de chez vous)...,3.957942
9,One Flew Over the Cuckoo's Nest (1975),3.920642
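
The recommendation step boils down to two operations: keep only the movies the user has not rated yet, then pick the N highest estimated ratings. The sketch below shows that logic in plain Python with `heapq.nlargest`, which avoids sorting the full candidate list. The `rated` and `estimates` dictionaries and the `top_n_unseen` name are made-up stand-ins for the training set and the model's predictions.

```python
import heapq

def top_n_unseen(user_id, rated, estimates, n=10):
    """Return the n unseen (movie_id, estimate) pairs with the highest estimates."""
    seen = rated.get(user_id, set())
    # Lazily drop every movie the user has already rated
    candidates = ((movie_id, est) for movie_id, est in estimates.items()
                  if movie_id not in seen)
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical data: user 600 has rated movies 1 and 3
rated = {600: {1, 3}}
estimates = {1: 4.9, 2: 3.1, 3: 4.7, 4: 4.2, 5: 3.8}

print(top_n_unseen(600, rated, estimates, n=2))  # [(4, 4.2), (5, 3.8)]
```

Movies 1 and 3 have the highest estimates but are skipped because the user has already rated them.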


# Model evaluation

In [36]:
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()

predictions_svd = svd.test(testset)

In [38]:
print('SVD - RMSE:', accuracy.rmse(predictions_svd, verbose=False))
print('SVD - MAE:', accuracy.mae(predictions_svd, verbose=False))

SVD - RMSE: 0.48835513445255857
SVD - MAE: 0.37746508523653777


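From the evaluation metrics above, the model obtained an RMSE of about 0.488 and an MAE of about 0.377. Note, however, that `build_anti_testset()` contains only (user, movie) pairs with no known rating and uses the global mean as their "true" rating, so these numbers measure how far the predictions deviate from the global mean rather than the error on real held-out ratings. A held-out split or cross-validation is needed to judge how good the model really is.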
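
For reference, RMSE and MAE are computed from pairs of true and predicted ratings, so they are most informative when the true ratings come from a held-out split that the model never saw during training (in Surprise, `surprise.model_selection.train_test_split` produces such a split). The following sketch computes both metrics with NumPy. The `rmse` and `mae` helpers and the ratings are made up for illustration and are not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical held-out ratings and the model's estimates for them
held_out_true = [4.0, 3.0, 5.0, 2.0]
held_out_pred = [3.5, 3.0, 4.0, 3.0]

print("RMSE:", rmse(held_out_true, held_out_pred))  # RMSE: 0.75
print("MAE:", mae(held_out_true, held_out_pred))    # MAE: 0.625
```

RMSE penalizes large errors more heavily than MAE, which is why it is the larger of the two here.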

## Testing the model on a user
Now let's use SVD to predict the rating that the user with ID 600 would give to a movie (here, the one with movie ID 1).

In [40]:
svd.predict(600, 1)

Prediction(uid=600, iid=1, r_ui=None, est=3.3225038943247345, details={'was_impossible': False})

For the movie with ID 1, I get an estimated rating of about 3.32. The recommender system works purely on the basis of the assigned movie ID and predicts ratings based on how other users have rated the movie.

# Conclusion
In this notebook, I attempted to build a model-based collaborative filtering movie recommendation system that relies on latent features learned by a matrix factorization method called SVD. Because it summarizes the underlying features driving the raw data in a small set of factors, it can scale better to massive datasets than memory-based (neighborhood) approaches, and it can make recommendations tailored to each user's tastes.