

# Building a Recommendation System Based on the Netflix Rating Dataset

# Collaborative Filtering

This task aims to build a collaborative filtering recommendation system on the [Netflix data](https://www.kaggle.com/netflix-inc/netflix-prize-data?select=probe.txt).

## 1. Download Data From Kaggle
Please follow the steps below to download the [Netflix Data](https://www.kaggle.com/netflix-inc/netflix-prize-data?select=combined_data_2.txt) to this Google Colab environment:

1. Go to your Kaggle account, scroll to the API section, and click **Expire API Token** to remove previous tokens.

2. Click **Create New API Token**. This downloads a `kaggle.json` file to your machine.

In [None]:
# 3. Install the kaggle API
! pip install kaggle



In [None]:
# 4. Upload the kaggle.json file
from google.colab import files

files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}

In [None]:
# 5. Make a directory named .kaggle (if it does not already exist) and copy kaggle.json there
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/

# Restrict the file's permissions so that only the owner can read and write it
! chmod 600 ~/.kaggle/kaggle.json


In [None]:
# 6. Download the Netflix data
!kaggle datasets download -d netflix-inc/netflix-prize-data

Downloading netflix-prize-data.zip to /content
 99% 673M/683M [00:08<00:00, 55.0MB/s]
100% 683M/683M [00:08<00:00, 82.1MB/s]


In [None]:
# 7. Unzip the downloaded file
!unzip netflix-prize-data.zip

Archive:  netflix-prize-data.zip
  inflating: README                  
  inflating: combined_data_1.txt     
  inflating: combined_data_2.txt     
  inflating: combined_data_3.txt     
  inflating: combined_data_4.txt     
  inflating: movie_titles.csv        
  inflating: probe.txt               
  inflating: qualifying.txt          


## Load the Data

The code below loads the data from `combined_data_1.txt` into a data frame. Extend the code to load from all data files.

In [None]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/45/b0/9d6860891ab14a39d4bddf80ba26ce51c2f9dc4805e5c6978ac0472c120a/pyspark-3.1.1.tar.gz (212.3MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.1-py2.py3-none-any.whl size=212767604 sha256=bb5b80413ccc687dddc0f3f53009d09cdd700cab5762df86ec4e16785f969865
  Stored in directory: /root/.cache/pip/wheels/0b/90/c0/01de724414ef122bd05f056541fb6a0ecf47c7ca655f8b3c0f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.1


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
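# Flatten the raw file into CSV rows of movie_id,user_id,rating,date.
# Skip the conversion if data.csv already exists from an earlier run.
if not os.path.isfile('data.csv'):
    file = "combined_data_1.txt"
    # Context managers close both files even if an error occurs
    with open(file) as f, open('data.csv', mode='w') as data:
        movie_id = None
        for line in f:
            line = line.strip()
            if line.endswith(':'):
                # A line such as "1:" starts the block of ratings for that movie
                movie_id = line.replace(':', '')
            elif line:
                # Every other non-empty line is "user_id,rating,date"
                data.write(movie_id + ',' + line + '\n')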
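
The cell above converts only `combined_data_1.txt`. One possible way to extend it to all four files is sketched below. The helper name `convert_to_csv`, the `RAW_FILES` list, and the output name `data_all.csv` are illustrative and not part of the original notebook. The sketch assumes every raw file uses the same layout: a `movie_id:` header line followed by `user_id,rating,date` rows.

```python
import os

# Hypothetical names, introduced only for this sketch.
RAW_FILES = [f"combined_data_{i}.txt" for i in range(1, 5)]


def convert_to_csv(paths, out_path):
    """Flatten Netflix-style rating files into rows of movie_id,user_id,rating,date.

    Returns the number of rating rows written.
    """
    rows = 0
    with open(out_path, mode="w") as out:
        for path in paths:
            movie_id = None  # reset at the start of every input file
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line.endswith(":"):
                        # Header line such as "1:" names the current movie
                        movie_id = line[:-1]
                    elif line and movie_id is not None:
                        out.write(f"{movie_id},{line}\n")
                        rows += 1
    return rows


# Run the conversion only when all raw files are present and no CSV exists yet.
if all(os.path.isfile(p) for p in RAW_FILES) and not os.path.isfile("data_all.csv"):
    print(convert_to_csv(RAW_FILES, "data_all.csv"), "ratings written")
```

Writing to a separate file keeps the single-file `data.csv` intact. To train on the full dataset, the later `spark.read.csv` call would need to point at the combined file instead.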

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Collaborative Filtering').getOrCreate()

In [None]:
from pyspark.sql.types import StructType, StructField, IntegerType, DateType
schema = StructType([
    StructField("movie_id", IntegerType(), True),
    StructField("user_id", IntegerType(), True),
    StructField("ratings", IntegerType(), True),
    StructField("data", DateType(), True)
])
data_customer = spark.read.csv('data.csv', header=False, schema=schema)
data_customer.printSchema()

root
 |-- movie_id: integer (nullable = true)
 |-- user_id: integer (nullable = true)
 |-- ratings: integer (nullable = true)
 |-- data: date (nullable = true)



## 2. Train-Test Split
Split the dataset into 80% training data and 20% test data.

In [None]:
# Split the data into a training set (80%) and a test set (20%).
# The fixed seed makes the split reproducible.
train, test = data_customer.randomSplit([0.8, 0.2], seed=42)
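
`randomSplit` assigns rows to the splits at random according to the given weights, so the resulting sizes are only approximately 80% and 20% of the data. The pure-Python sketch below illustrates that idea on toy data. It is a conceptual illustration, not Spark's actual implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to the training split with probability `train_weight`."""
    rng = random.Random(seed)  # a seeded generator makes the split reproducible
    train_part, test_part = [], []
    for row in rows:
        if rng.random() < train_weight:
            train_part.append(row)
        else:
            test_part.append(row)
    return train_part, test_part


rows = list(range(10_000))  # toy stand-in for the rating rows
train_rows, test_rows = random_split(rows)
print(len(train_rows) / len(rows))  # close to 0.8, but usually not exactly 0.8
```

Because each row is assigned independently, the split fractions vary slightly from run to run unless the seed is fixed.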

## 3. Train the Alternating Least-Squares Model
Apply the [alternating least squares method](https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html) to build a recommendation system.
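
For reference, a common textbook formulation of the objective that ALS minimises is given below. The notation is introduced here and does not come from the notebook: $r_{ui}$ is the rating user $u$ gave movie $i$, $x_u$ and $y_i$ are the latent factor vectors for the user and the movie, $\mathcal{K}$ is the set of observed (user, movie) pairs, and $\lambda$ plays the role of `regParam`. Spark's implementation may scale the regularization term differently, so treat this as a sketch rather than the exact objective Spark optimises.

```latex
\min_{X,\,Y} \; \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

With the movie factors $Y$ held fixed, the problem is a regularised least-squares problem in each $x_u$, and vice versa. The algorithm alternates between these two updates, which is where the name comes from.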

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=5, regParam=0.01, userCol="user_id", itemCol="movie_id", ratingCol="ratings",
          coldStartStrategy="drop")
model = als.fit(train)

# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="ratings",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

Root-mean-square error = 0.9560534244263446
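
The RMSE reported above is the square root of the mean squared difference between the predicted and the actual ratings. A minimal pure-Python sketch of that calculation follows; the toy values are made up for illustration and are not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long sequences of numbers."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


toy_actual = [4, 3, 5, 1]
toy_predicted = [3.5, 3.0, 4.0, 2.0]
print(rmse(toy_actual, toy_predicted))  # 0.75
```

Because RMSE is expressed in the same units as the ratings, a value of about 0.96 suggests that predictions are typically off by roughly one star.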


In [None]:
userRecs.head(5)

[Row(user_id=471, recommendations=[Row(movie_id=614, rating=7.379634857177734), Row(movie_id=4205, rating=7.203495979309082), Row(movie_id=2761, rating=7.112646102905273), Row(movie_id=3869, rating=6.783904075622559), Row(movie_id=2207, rating=6.342108249664307), Row(movie_id=4090, rating=6.155520439147949), Row(movie_id=51, rating=6.1271820068359375), Row(movie_id=2280, rating=6.025995254516602), Row(movie_id=296, rating=5.9571428298950195), Row(movie_id=4178, rating=5.8679704666137695)]),
 Row(user_id=1088, recommendations=[Row(movie_id=972, rating=6.44008731842041), Row(movie_id=203, rating=6.2677717208862305), Row(movie_id=3237, rating=5.8581366539001465), Row(movie_id=2187, rating=5.857769012451172), Row(movie_id=2110, rating=5.718303680419922), Row(movie_id=1376, rating=5.711306571960449), Row(movie_id=343, rating=5.620838165283203), Row(movie_id=3361, rating=5.602970600128174), Row(movie_id=219, rating=5.596135139465332), Row(movie_id=451, rating=5.569637775421143)]),
 Row(user_

In [None]:
movieRecs.head(5)

[Row(movie_id=1580, recommendations=[Row(user_id=764123, rating=7.450042724609375), Row(user_id=858224, rating=7.156418323516846), Row(user_id=1997375, rating=7.127230167388916), Row(user_id=215456, rating=6.986635208129883), Row(user_id=2130657, rating=6.90012264251709), Row(user_id=1880586, rating=6.887922286987305), Row(user_id=1224184, rating=6.870545387268066), Row(user_id=661511, rating=6.84704065322876), Row(user_id=244979, rating=6.819540977478027), Row(user_id=225291, rating=6.814430236816406)]),
 Row(movie_id=471, recommendations=[Row(user_id=1142460, rating=7.536010265350342), Row(user_id=1751732, rating=7.111346244812012), Row(user_id=2531996, rating=7.043582439422607), Row(user_id=383247, rating=6.697164535522461), Row(user_id=2629109, rating=6.644122123718262), Row(user_id=874653, rating=6.576974391937256), Row(user_id=1417639, rating=6.544713020324707), Row(user_id=677779, rating=6.535009384155273), Row(user_id=1810051, rating=6.529563903808594), Row(user_id=858224, rati
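
Several predicted ratings in the outputs above exceed 5, even though Netflix ratings run from 1 to 5 stars. This happens because the ALS predictions are not constrained to the rating scale. One common post-processing step, which is not part of the original notebook, is to clip predictions to the valid range. A minimal sketch with a hypothetical helper `clip_rating`:

```python
def clip_rating(prediction, low=1.0, high=5.0):
    """Clamp a raw predicted rating to the valid star range [low, high]."""
    return max(low, min(high, prediction))


raw_predictions = [7.379634857177734, 4.2, 0.3]  # first value is from the output above
print([clip_rating(p) for p in raw_predictions])  # [5.0, 4.2, 1.0]
```

Clipping is useful when reporting scores or computing error metrics. For ranking recommendations, the raw scores are still needed, since clipping would turn every prediction above 5 into a tie.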