# Movie Lens Recommender - Big Data Framework Project
Date : 08/01/2023

Hadi Jamal Ahmad - Julie Ngan - M2 BDML-APP

### EFREI Paris


The goal is to build a model (ANNOY) to recommend movies to a watcher. 

There are two types of recommendations : 
- Content-based filtering : recommend movies to a watcher based on what they have already watched
- Collaborative filtering : recommend movies to a watcher based on what other watchers with similar preferences liked
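
To make the collaborative idea concrete, here is a minimal pure-Python sketch of a user-based prediction. The `ratings` dictionary, the user and movie names, and the helpers `cosine` and `predict` are all invented for illustration; they are not part of the project code, and the project itself may use a different similarity measure.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating}. All values are invented.
ratings = {
    "alice": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob":   {"A": 4.5, "B": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "C": 5.0, "D": 2.0},
}

def cosine(u, v):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    norm_u = sqrt(sum(u[m] ** 2 for m in common))
    norm_v = sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of the other users' ratings for `movie`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        weight = cosine(ratings[user], their)
        num += weight * their[movie]
        den += weight
    return num / den if den else None

# alice has not rated D; bob (similar tastes) weighs more than carol
print(predict("alice", "D"))
```

In this toy data the prediction lands closer to bob's rating than to carol's, because alice's ratings resemble bob's more.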

For this project we use PySpark. 

Our IMDB dataset contains :

- Movies Metadata
- Credits
- Links
- Ratings
- Keywords

For this project, drag and drop the CSV files into a data folder.

1. Make a good analysis of the dataset.
2. Build regression models to predict movie revenue and vote averages.
3. Use collaborative filtering to build a movie recommendation system with two functions:
   - 3.1 Suggest top N movies similar to a given movie title.
   - 3.2 Predict user ratings for movies the user has not rated. You may use a test set to measure prediction accuracy; the test ratings are treated as unrated during training.
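
Function 3.1 can be sketched as an item-based nearest-neighbour search. The example below is a brute-force illustration in plain Python; the `movie_ratings` data and the helpers `item_similarity` and `top_n_similar` are hypothetical. A library such as ANNOY would replace the exhaustive scan with an approximate nearest-neighbour index.

```python
from math import sqrt

# Hypothetical ratings: movie title -> {userId: rating}. All values are invented.
movie_ratings = {
    "Movie A": {1: 5.0, 2: 4.0, 3: 1.0},
    "Movie B": {1: 4.5, 2: 4.0, 3: 1.5},
    "Movie C": {1: 1.0, 2: 2.0, 3: 5.0},
    "Movie D": {1: 3.0, 2: 3.0, 3: 3.0},
}

def item_similarity(a, b):
    """Cosine similarity between two movies' rating vectors (missing rating = 0)."""
    users = set(a) | set(b)
    dot = sum(a.get(u, 0.0) * b.get(u, 0.0) for u in users)
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_n_similar(title, n=2):
    """Return the n movie titles most similar to `title`, best first."""
    target = movie_ratings[title]
    scores = [(item_similarity(target, other_ratings), other)
              for other, other_ratings in movie_ratings.items()
              if other != title]
    return [other for _, other in sorted(scores, reverse=True)[:n]]

print(top_n_similar("Movie A"))
```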

In [None]:
!pip install koalas


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting koalas
  Downloading koalas-1.8.2-py3-none-any.whl (390 kB)
Installing collected packages: koalas
Successfully installed koalas-1.8.2


### 1. Analysis

In [1]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
os.environ["SPARK_VERSION"] = "3.0.1"                          #your SPARK_VERSION

In [2]:
!wget -q https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

In [3]:
!tar xf spark-3.0.1-bin-hadoop3.2.tgz

In [4]:
!pip install -q findspark

In [5]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [8]:
import pandas as pd

def read(filename):
  return (spark.read.format("csv").options(header="true")
    .load(filename))


#### Preprocessing

In [6]:
# count missing (NaN or null) values per column
def missing_values(sparkdf):
  sparkdf.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in sparkdf.columns if c != 'Date']).show()

import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler

#https://medium.com/@connectwithghosh/basic-data-preparation-in-pyspark-capping-normalizing-and-scaling-252ee7acba7d
#TODO : finish
# normalize if not same magnitude
def normalize(df_new):
  assembler = VectorAssembler().setInputCols\
              (df_new.columns).setOutputCol("features")
  transformed = assembler.transform(df_new)
  scaler = MinMaxScaler(inputCol="features",\
          outputCol="scaledFeatures")
  scalerModel =  scaler.fit(transformed.select("features"))
  scaledData = scalerModel.transform(transformed)
  return scaledData

# correlation between imdb and tmdb
# avg ratings
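
The `normalize` function above relies on `MinMaxScaler`, which with its default range rescales each feature to [0, 1] using (x - min) / (max - min). A minimal plain-Python sketch of that formula follows; `min_max_scale` is a hypothetical helper written for illustration and the input values are arbitrary.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so map every
        # value to the middle of the range (0.5), as MinMaxScaler does.
        return [0.5 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([0.5, 3.0, 5.0]))
```

The smallest value maps to 0.0 and the largest to 1.0, so columns of very different magnitude become comparable.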

#### Ratings

Things to do in the analysis:
- Look at the distribution of the rating values
- Take it into account for the recommendation sample
For example, 3/5 would be the median.

In [9]:
ratings = read("ratings_small.csv")

In [10]:
ratings.show()

+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|     31|   2.5|1260759144|
|     1|   1029|   3.0|1260759179|
|     1|   1061|   3.0|1260759182|
|     1|   1129|   2.0|1260759185|
|     1|   1172|   4.0|1260759205|
|     1|   1263|   2.0|1260759151|
|     1|   1287|   2.0|1260759187|
|     1|   1293|   2.0|1260759148|
|     1|   1339|   3.5|1260759125|
|     1|   1343|   2.0|1260759131|
|     1|   1371|   2.5|1260759135|
|     1|   1405|   1.0|1260759203|
|     1|   1953|   4.0|1260759191|
|     1|   2105|   4.0|1260759139|
|     1|   2150|   3.0|1260759194|
|     1|   2193|   2.0|1260759198|
|     1|   2294|   2.0|1260759108|
|     1|   2455|   2.5|1260759113|
|     1|   2968|   1.0|1260759200|
|     1|   3671|   3.0|1260759117|
+------+-------+------+----------+
only showing top 20 rows



In [11]:
ratings.count()

100004

In [12]:
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import desc, stddev, mean, min, max
from pyspark.sql.functions import col,isnan, when, count
from scipy.stats import pearsonr
from pyspark.mllib.stat import Statistics

missing_values(ratings)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     0|      0|     0|        0|
+------+-------+------+---------+



### Distribution of rating values

Ratings range from 0.5, which means the watcher hated the movie, to 5.

The dataset contains 100 004 ratings (one row per watcher and movie pair).

In [13]:
ratings.summary().show()

+-------+------------------+------------------+------------------+--------------------+
|summary|            userId|           movieId|            rating|           timestamp|
+-------+------------------+------------------+------------------+--------------------+
|  count|            100004|            100004|            100004|              100004|
|   mean| 347.0113095476181|12548.664363425463| 3.543608255669773|1.1296390869392424E9|
| stddev|195.16383797819535|26369.198968815268|1.0580641091070326|1.9168582602710962E8|
|    min|                 1|                 1|               0.5|          1000003813|
|    25%|             182.0|            1028.0|               3.0|        9.65846894E8|
|    50%|             367.0|            2406.0|               4.0|       1.110421811E9|
|    75%|             520.0|            5418.0|               4.0|       1.296192325E9|
|    max|                99|             99992|               5.0|           999892199|
+-------+------------------+------------------+------------------+--------------------+
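
The `min` and `max` rows above look inconsistent with the percentiles (for instance a maximum `userId` of 99, below the 25% quartile of 182). A likely explanation is that `read` does not infer a schema, so every column is a string: `min` and `max` then compare text lexicographically, while the mean and percentiles are computed on values cast to numbers. The sketch below reproduces the effect in plain Python with made-up ids.

```python
# Made-up ids stored as strings, as they would be after reading a CSV
# without a schema.
user_ids = ["1", "99", "182", "520", "671"]

# String comparison goes character by character, so "99" sorts after "671".
print(max(user_ids))                  # lexicographic maximum: 99
print(max(int(u) for u in user_ids))  # numeric maximum: 671
```

In PySpark, passing `inferSchema="true"` to the CSV reader options, or casting the columns (for example `col("userId").cast("int")`), should give numeric columns and consistent statistics.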

#### Keywords

In [14]:
keywords = read("keywords.csv")

The keywords column holds a stringified list of dictionaries, each with an id and a name key.

In [15]:
keywords.show()

+-----+--------------------+
|   id|            keywords|
+-----+--------------------+
|  862|[{'id': 931, 'nam...|
| 8844|"[{'id': 10090, '...|
|15602|[{'id': 1495, 'na...|
|31357|[{'id': 818, 'nam...|
|11862|[{'id': 1009, 'na...|
|  949|[{'id': 642, 'nam...|
|11860|[{'id': 90, 'name...|
|45325|                  []|
| 9091|[{'id': 949, 'nam...|
|  710|[{'id': 701, 'nam...|
| 9087|[{'id': 833, 'nam...|
|12110|[{'id': 3633, 'na...|
|21032|[{'id': 1994, 'na...|
|10858|[{'id': 840, 'nam...|
| 1408|[{'id': 911, 'nam...|
|  524|[{'id': 383, 'nam...|
| 4584|[{'id': 420, 'nam...|
|    5|"[{'id': 612, 'na...|
| 9273|[{'id': 409, 'nam...|
|11517|[{'id': 380, 'nam...|
+-----+--------------------+
only showing top 20 rows
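
Because each `keywords` cell is a string rather than a real list, it has to be parsed before the keyword names can be used. A minimal sketch, assuming the cells are valid Python literals; the sample value and the helper `keyword_names` are invented for illustration.

```python
import ast

# Made-up cell value in the same shape as the `keywords` column.
raw = "[{'id': 1, 'name': 'toy'}, {'id': 2, 'name': 'friendship'}]"

def keyword_names(cell):
    """Parse a stringified list of dicts and return the keyword names."""
    # ast.literal_eval only evaluates Python literals; json.loads would
    # reject the single-quoted strings used in this column.
    return [entry["name"] for entry in ast.literal_eval(cell)]

print(keyword_names(raw))
print(keyword_names("[]"))
```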



In [16]:
missing_values(keywords)

+---+--------+
| id|keywords|
+---+--------+
|  0|       0|
+---+--------+



The keywords file has 46 419 rows, each holding the keyword list of one movie id.

In [17]:
keywords.summary().show()

+-------+------------------+--------------------+
|summary|                id|            keywords|
+-------+------------------+--------------------+
|  count|             46419|               46419|
|   mean|109769.95187315538|                null|
| stddev|113045.78025568495|                null|
|    min|               100|"[{'id': 1005, 'n...|
|    25%|           26809.0|                null|
|    50%|           61178.0|                null|
|    75%|          159897.0|                null|
|    max|              9999|[{'id': 9991, 'na...|
+-------+------------------+--------------------+



#### Links between IMDB and TMDB movies

In [30]:
links = read("links_small.csv")

In [31]:
links.show()

+-------+-------+------+
|movieId| imdbId|tmdbId|
+-------+-------+------+
|      1|0114709|   862|
|      2|0113497|  8844|
|      3|0113228| 15602|
|      4|0114885| 31357|
|      5|0113041| 11862|
|      6|0113277|   949|
|      7|0114319| 11860|
|      8|0112302| 45325|
|      9|0114576|  9091|
|     10|0113189|   710|
|     11|0112346|  9087|
|     12|0112896| 12110|
|     13|0112453| 21032|
|     14|0113987| 10858|
|     15|0112760|  1408|
|     16|0112641|   524|
|     17|0114388|  4584|
|     18|0113101|     5|
|     19|0112281|  9273|
|     20|0113845| 11517|
+-------+-------+------+
only showing top 20 rows



In [32]:
links.summary().show()

+-------+------------------+-----------------+-----------------+
|summary|           movieId|           imdbId|           tmdbId|
+-------+------------------+-----------------+-----------------+
|  count|              9125|             9125|             9112|
|   mean|31123.291835616437|479824.3923287671|39104.54554433714|
| stddev| 40782.63360397416|743177.3608435562|62814.51980132846|
|    min|                 1|          0000417|              100|
|    25%|            2850.0|          88846.0|           9451.0|
|    50%|            6290.0|         119778.0|          15850.0|
|    75%|          164979.0|        5794766.0|         416437.0|
|    max|             99992|          5794766|             9995|
+-------+------------------+-----------------+-----------------+



In [33]:
missing_values(links)

+-------+------+------+
|movieId|imdbId|tmdbId|
+-------+------+------+
|      0|     0|    13|
+-------+------+------+



In [35]:
# Delete nan values
links = links.filter(~links.tmdbId.isNull())
links.summary().show()

+-------+------------------+------------------+-----------------+
|summary|           movieId|            imdbId|           tmdbId|
+-------+------------------+------------------+-----------------+
|  count|              9112|              9112|             9112|
|   mean|31104.671861281826|479826.15046093066|39104.54554433714|
| stddev| 40775.25279748233| 743376.3730049542|62814.51980132846|
|    min|                 1|           0000417|              100|
|    25%|            2850.0|           88814.0|           9451.0|
|    50%|            6287.0|          119783.0|          15850.0|
|    75%|          164979.0|         5794766.0|         416437.0|
|    max|             99992|           5794766|             9995|
+-------+------------------+------------------+-----------------+
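
The links table is what connects the ratings (keyed by `movieId`) to the other files: the `id` values shown in keywords (862, 8844, 15602, ...) appear to line up with the `tmdbId` column. A plain-Python sketch of that join follows; the two links rows are copied from the output above, while the ratings rows are invented.

```python
# First two rows of the links output above: (movieId, imdbId, tmdbId)
links_rows = [("1", "0114709", "862"), ("2", "0113497", "8844")]
# Invented ratings rows: (userId, movieId, rating)
ratings_rows = [("7", "1", "4.0"), ("7", "2", "3.5"), ("9", "3", "5.0")]

# Lookup table from movieId to tmdbId
movie_to_tmdb = {movie_id: tmdb_id for movie_id, _imdb_id, tmdb_id in links_rows}

# Inner join: ratings whose movieId has no tmdbId are dropped
joined = [(user, movie_to_tmdb[movie], float(rating))
          for user, movie, rating in ratings_rows
          if movie in movie_to_tmdb]
print(joined)
```

With Spark DataFrames the same step would be a `join` on `movieId`; the dictionary version only illustrates the logic.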



#### Metadata


*   Belongs to Collection
*   Budget
*   Genres
*   Homepage
*   Original Language
*   Adult or Not
*   Original Title
*   Overview
*   Popularity
*   Poster Path
*   Production Companies
*   Production Countries
*   Release Date
*   Revenue
*   Runtime
*   Spoken Languages
*   Status (released)
*   Tagline
*   Title
*   Video
*   Vote Average
*   Vote Count



In [54]:
movies_metadata = read("movies_metadata.csv")
movies_metadata.show()

+-----+---------------------+--------+--------------------+--------------------+-----+---------+-----------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+--------------------+--------+--------------------+-----------------+
|adult|belongs_to_collection|  budget|              genres|            homepage|   id|  imdb_id|original_language|      original_title|            overview|popularity|         poster_path|production_companies|production_countries|        release_date|             revenue|             runtime|    spoken_languages|  status|             tagline|               title|   video|        vote_average|       vote_count|
+-----+---------------------+--------+--------------------+--------------------+-----+---------+-----------------+--------------------+--------------------+----------+-----

In [55]:
movies_metadata.summary().show()

+-------+-------------------+---------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+-----------------------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+---------------+------------------+------------------+--------------------+
|summary|              adult|belongs_to_collection|              budget|              genres|            homepage|                  id|             imdb_id| original_language|                     original_title|            overview|          popularity|         poster_path|production_companies|production_countries|        release_date|             revenue|             runtime|    spoken_languages|              status|             tagline|          title|             video|     

In [56]:
missing_values(movies_metadata)

TypeError: ignored

In [59]:
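# delete rows with missing values: thresh=24 keeps only rows with at least
# 24 non-null values, i.e. all 24 columns filled (thresh overrides how)
movies_metadata = movies_metadata.na.drop(thresh=24, how='all')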
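
In PySpark's `na.drop`, the `thresh` argument takes precedence over `how`: a row is kept only if it has at least `thresh` non-null values. The plain-Python sketch below mimics that rule on invented rows; `drop_sparse_rows` is a hypothetical helper, not a Spark function.

```python
def drop_sparse_rows(rows, thresh):
    """Keep the rows that have at least `thresh` non-None values."""
    return [row for row in rows
            if sum(value is not None for value in row) >= thresh]

# Invented three-column rows with 3, 2 and 0 non-null values
rows = [("a", 1, 2.0), ("b", None, 3.0), (None, None, None)]

print(drop_sparse_rows(rows, thresh=3))  # only the fully populated row
print(drop_sparse_rows(rows, thresh=2))  # also keeps the row with one gap
```

Setting `thresh` to the number of columns, as in the cell above, is therefore equivalent to dropping every row that contains any null.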

In [60]:
movies_metadata.show()

+-----+---------------------+--------+--------------------+--------------------+-----+---------+-----------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+---------+-------+--------------------+--------+--------------------+--------------------+-----+------------+----------+
|adult|belongs_to_collection|  budget|              genres|            homepage|   id|  imdb_id|original_language|      original_title|            overview|popularity|         poster_path|production_companies|production_countries|release_date|  revenue|runtime|    spoken_languages|  status|             tagline|               title|video|vote_average|vote_count|
+-----+---------------------+--------+--------------------+--------------------+-----+---------+-----------------+--------------------+--------------------+----------+--------------------+--------------------+--------------------+------------+---------+-------+-----------

### 2. Predict Movie Revenues and Vote Averages

In [41]:
from pyspark.ml.regression import LinearRegression
model = LinearRegression()
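
With its default parameters (no regularisation), `LinearRegression` fits an ordinary least-squares model. For a single feature the fit has a closed form, sketched below in plain Python. The helper `fit_line` and the budget and revenue figures are invented for illustration, and using budget as the predictor is only an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented budgets and revenues (in millions) lying exactly on
# revenue = 2 * budget + 10
budgets = [10.0, 20.0, 30.0, 40.0]
revenues = [30.0, 50.0, 70.0, 90.0]

slope, intercept = fit_line(budgets, revenues)
print(slope, intercept)
```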


In [48]:
# analysis of revenues
import matplotlib.pyplot as plt
import seaborn as sb
from pyspark.sql import functions as F

# movies_metadata is a Spark DataFrame read without a schema, so every column
# is a string and pandas-style item assignment is not supported.
# Strip thousands separators from revenue and cast it to a number.
# Assumption: the release year is the numeric feature of interest for the date.
# Values that cannot be parsed become null and are dropped.
numeric = (movies_metadata
    .withColumn('revenue', F.regexp_replace('revenue', ',', '').cast('double'))
    .withColumn('release_year', F.year(F.to_date('release_date')))
    .select('revenue', 'release_year')
    .na.drop())

features = ['revenue', 'release_year']
movies = numeric.toPandas()
print(movies)
plt.figure(figsize=(15, 5))
# the loop variable is named `feature` to avoid shadowing pyspark's `col`
for i, feature in enumerate(features):
    plt.subplot(1, len(features), i+1)
    sb.histplot(movies[feature], kde=True)
plt.tight_layout()
plt.show()

In [52]:
movies_metadata.show()

AttributeError: ignored

In [51]:
movies_metadata.select(mean('revenue'))

AttributeError: ignored

In [None]:
# TODO: define X_train, y_train
# y : movie revenue, vote average

# gridsearchcv

Recommender : ANNOY
Build regression models to predict movie revenue and vote averages.


Use collaborative filtering to build a movie recommendation system with two functions.