# What is Model-Based Collaborative Filtering: Matrix Factorization?

- Model-Based Collaborative Filtering: Matrix Factorization is a technique used in recommender systems to predict user preferences or ratings for items based on the past behavior of users and the items they have interacted with. It belongs to the broader field of collaborative filtering, which is a popular approach for generating personalized recommendations.

- In model-based collaborative filtering, the idea is to learn a mathematical model of users and items from a matrix of observed interactions. This matrix is known as the utility matrix or rating matrix: rows represent users, columns represent items, and each cell holds the user's rating of, or preference for, that item. The matrix is usually sparse because most users have rated or interacted with only a small fraction of the items.

- Matrix factorization aims to decompose the utility matrix into two lower-rank matrices, typically referred to as the user matrix and the item matrix. Each user is represented by a vector of latent factors (features) and each item is represented by another vector of latent factors. The idea behind matrix factorization is that the latent factors capture the underlying characteristics or features of users and items that influence their preferences.

- By factorizing the utility matrix, the model learns to estimate missing ratings: the predicted rating for a user-item pair is the dot product of the corresponding user and item vectors. The model is trained on the known ratings, and the optimization process minimizes the error between the actual and predicted ratings.

- Once the model is trained, it can make recommendations by identifying items that have high predicted ratings for a particular user but that the user has not yet interacted with. This allows the system to suggest relevant, personalized items based on patterns learned from all users' ratings.

- Model-based collaborative filtering using matrix factorization has been successful in addressing the sparsity problem and providing accurate recommendations in various domains, such as movies, books, music, and e-commerce. It has been widely used in real-world applications and is considered one of the fundamental techniques in recommender systems.

![](https://cdn-images-1.medium.com/fit/t/1600/480/1*2i-GJO7JX0Yz6498jUvhEg.png)

# Dataset Story

- The dataset contains films and the ratings given to these films.
- It holds about 20,000,000 ratings for about 27,000 films.

**The dataset consists of two CSV files**

- **1st CSV file: movie.csv**
- movieId: Unique film number
- title: Film name
- genres: Film genres, separated by `|`

- **2nd CSV file: rating.csv**

- userId: Unique user number
- movieId: Unique film number
- rating: Rating given to the film by the user
- timestamp: Evaluation date

# Road Map

1. Preparation of Data Set
2. Modelling
3. Model Tuning
4. Final Model and Forecast

# 1. Preparation of Data Set

In [1]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Installing collected packages: surprise
Successfully installed surprise-0.1

In [2]:
# import Required Libraries

import pandas as pd
import numpy as np

from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate

In [3]:
# Adjusting Row Column Settings

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)

In [4]:
# Loading the Data Set

movie = pd.read_csv('/kaggle/input/movielens-20m-dataset/movie.csv')
rating = pd.read_csv('/kaggle/input/movielens-20m-dataset/rating.csv')

In [5]:
# Merging movie and rating data sets

df = movie.merge(rating, how="left", on="movieId")
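A short aside on the `how="left"` choice: a left merge keeps every film, including films nobody has rated, and fills their rating columns with `NaN`. That is one likely reason `userId` shows up as `float64` in the type listing further down. The sketch below uses two small made-up frames (`movie_toy` and `rating_toy` are hypothetical names) to show the effect.

```python
import pandas as pd

# Made-up miniature versions of the two tables.
movie_toy = pd.DataFrame({"movieId": [1, 2, 3],
                          "title": ["A", "B", "C"]})
rating_toy = pd.DataFrame({"userId": [10, 11, 10],
                           "movieId": [1, 1, 2],
                           "rating": [4.0, 5.0, 3.5]})

# Left merge: film 3 has no ratings, so it survives with NaN in userId and rating.
left = movie_toy.merge(rating_toy, how="left", on="movieId")

# Inner merge: only films with at least one rating remain.
inner = movie_toy.merge(rating_toy, how="inner", on="movieId")

print(left)
print("left userId dtype: ", left["userId"].dtype)
print("inner userId dtype:", inner["userId"].dtype)
```

If unrated films are not needed downstream, an inner merge or a `dropna(subset=["userId"])` after the left merge would avoid carrying the empty rows along.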

In [6]:
# Preliminary examination of the data set

def check_df(dataframe, head=5):
    print('##################### Shape #####################')
    print(dataframe.shape)
    print('##################### Types #####################')
    print(dataframe.dtypes)
    print('##################### Head #####################')
    print(dataframe.head(head))
    print('##################### Tail #####################')
    print(dataframe.tail(head))
    print('##################### NA #####################')
    print(dataframe.isnull().sum())
    print('##################### Quantiles #####################')
    print(dataframe.describe([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)

##################### Shape #####################
(20000797, 6)
##################### Types #####################
movieId        int64
title         object
genres        object
userId       float64
rating       float64
timestamp     object
dtype: object
##################### Head #####################
   movieId             title                                       genres  userId  rating            timestamp
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     3.0     4.0  1999-12-11 13:36:47
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     6.0     5.0  1997-03-13 17:50:52
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy     8.0     4.0  1996-06-05 13:37:51
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    10.0     4.0  1999-11-25 02:44:47
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy    11.0     4.5  2009-01-02 01:13:41
##################### Tail ####

In [7]:
# Creating Sample Data Set

In [8]:
movie_ids = [130219, 356, 4422, 541]

In [9]:
# Titles listed in the same order as movie_ids above
movies = ["The Dark Knight (2011)",
          "Forrest Gump (1994)",
          "Cries and Whispers (Viskningar och rop) (1972)",
          "Blade Runner (1982)"]

In [10]:
# Creating Subset of Movies

In [11]:
sample_df = df[df.movieId.isin(movie_ids)]

In [12]:
sample_df.head()

         movieId                title                    genres  userId  rating            timestamp
2457839      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     4.0     4.0  1996-08-24 09:28:42
2457840      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     7.0     4.0  2002-01-16 19:02:55
2457841      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     8.0     5.0  1996-06-05 13:44:19
2457842      356  Forrest Gump (1994)  Comedy|Drama|Romance|War     9.0     4.0  2001-07-01 20:26:38
2457843      356  Forrest Gump (1994)  Comedy|Drama|Romance|War    10.0     3.0  1999-11-25 02:32:02


In [13]:
sample_df.shape

(97343, 6)

In [14]:
# Creating User-Movie Matrix

In [15]:
user_movie_df = sample_df.pivot_table(index=["userId"],
                                      columns=["title"],
                                      values="rating")

In [16]:
user_movie_df.head()

title   Blade Runner (1982)  Cries and Whispers (Viskningar och rop) (1972)  Forrest Gump (1994)  The Dark Knight (2011)
userId
1.0                     4.0                                             NaN                  NaN                     NaN
2.0                     5.0                                             NaN                  NaN                     NaN
3.0                     5.0                                             NaN                  NaN                     NaN
4.0                     NaN                                             NaN                  4.0                     NaN
7.0                     NaN                                             NaN                  4.0                     NaN


In [17]:
user_movie_df.shape

(76918, 4)
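The user-movie matrix makes the sparsity problem visible. With 97,343 ratings spread over 76,918 × 4 = 307,672 cells, only about 32% of the cells are filled, assuming each user rated each film at most once. The sketch below shows one way to measure this on a small made-up frame (`toy` is a hypothetical stand-in for `sample_df`).

```python
import pandas as pd

# Made-up ratings in the same long format as sample_df.
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Film A", "Film B", "Film A", "Film B", "Film C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Same pivot as above: one row per user, one column per film.
toy_matrix = toy.pivot_table(index="userId", columns="title", values="rating")

n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
n_known = int(toy_matrix.notna().sum().sum())
sparsity = 1 - n_known / n_cells  # share of cells without a rating

print(toy_matrix)
print(f"{n_known} of {n_cells} cells are filled, sparsity = {sparsity:.2%}")
```

The same three lines applied to `user_movie_df` would give the sparsity of the real sample.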

In [18]:
# Creating Reader Object

In [19]:
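# Note: MovieLens 20M ratings range from 0.5 to 5.0 in half-star steps, so
# rating_scale=(0.5, 5) would describe the data more accurately.
# Predictions are clipped to whatever range is given here.
reader = Reader(rating_scale=(1, 5))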

In [20]:
# Converting Data Set to Surprise Library Format

In [21]:
data = Dataset.load_from_df(sample_df[['userId',
                                       'movieId',
                                       'rating']], reader)

# 2. Modelling

In [22]:
# Splitting the Data Set into Training and Test Sets

In [23]:
trainset, testset = train_test_split(data, test_size=.25)

In [24]:
# Creating SVD Model Object

In [25]:
svd_model = SVD()

In [26]:
# Training the Model

In [27]:
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7bf257594520>

In [28]:
# Making Predictions and Calculating Error

In [29]:
predictions = svd_model.test(testset)

In [30]:
accuracy.rmse(predictions)

RMSE: 0.9407


0.940716005481333
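For readers who want to see what the RMSE figure measures: it is the square root of the mean squared difference between actual and predicted ratings, so an RMSE of about 0.94 means predictions are typically off by a little under one star. A minimal sketch with made-up numbers (the lists `actual` and `predicted` are illustrative, not taken from the model above):

```python
import math

# Made-up actual and predicted ratings for four user-item pairs.
actual = [4.0, 3.5, 5.0, 2.0]
predicted = [3.5, 4.0, 4.5, 3.0]


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more than small ones."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error: the average size of a miss."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
print(f"RMSE: {toy_rmse:.4f}")
print(f"MAE:  {toy_mae:.4f}")
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions miss by a lot.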

In [31]:
# Example Prediction

In [32]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.09   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.091159841450607, details={'was_impossible': False})

In [33]:
svd_model.predict(uid=1.0, iid=356, verbose=True)

user: 1.0        item: 356        r_ui = None   est = 4.17   {'was_impossible': False}


Prediction(uid=1.0, iid=356, r_ui=None, est=4.173918186252513, details={'was_impossible': False})

In [34]:
sample_df[sample_df["userId"] == 1]

         movieId                title                  genres  userId  rating            timestamp
3612352      541  Blade Runner (1982)  Action|Sci-Fi|Thriller     1.0     4.0  2005-04-02 23:30:03
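It may help to see how an estimate such as the ones above is put together. According to the Surprise documentation, the biased SVD estimate is the global mean plus a user bias, an item bias, and the dot product of the item and user factor vectors, and the result is clipped to the rating scale by default. The sketch below reproduces that formula with made-up numbers. `predict_rating` is a hypothetical helper, not part of the library, and none of the values come from the trained model.

```python
import numpy as np


def predict_rating(mu, b_u, b_i, p_u, q_i, rating_scale=(1.0, 5.0)):
    """Biased matrix factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + float(np.dot(q_i, p_u))
    low, high = rating_scale
    return min(high, max(low, raw))


# Made-up model parameters for one user and one item.
mu = 3.5                      # global mean rating
b_u = 0.2                     # this user rates slightly above average
b_i = 0.3                     # this film is rated slightly above average
p_u = np.array([0.1, -0.3])   # user latent factors
q_i = np.array([0.5, 0.2])    # item latent factors

est = predict_rating(mu, b_u, b_i, p_u, q_i)
print(f"est = {est:.2f}")

# A raw estimate above the scale is clipped to the upper bound.
clipped = predict_rating(4.5, 0.5, 0.5, p_u, q_i)
print(f"clipped = {clipped:.2f}")
```

For a user or item that did not appear in the training set, the library treats the corresponding bias and factors as zero, so the estimate falls back toward the global mean.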


# 3. Model Tuning

In [35]:
# Model Tuning and Finding Best Parameters

In [36]:
param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}

In [37]:
gs = GridSearchCV(SVD,
                  param_grid,
                  measures=['rmse', 'mae'],
                  cv=3,
                  n_jobs=-1,
                  joblib_verbose=True)

In [38]:
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:   17.8s finished


In [39]:
gs.best_score['rmse']

0.9298710044364663

In [40]:
gs.best_params['rmse']

{'n_epochs': 5, 'lr_all': 0.002}
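The "27 out of 27" in the log above follows from the grid itself: three values for `n_epochs` times three values for `lr_all` gives nine combinations, and each is fitted once per fold with `cv=3`. The sketch below expands the same `param_grid` with `itertools.product` to make the count explicit. It only enumerates the combinations and does not train anything.

```python
from itertools import product

# Same grid and fold count as in the cells above.
param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}
cv = 3

# Every combination of one value per parameter.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

n_fits = len(combinations) * cv

for combo in combinations:
    print(combo)
print(f"{len(combinations)} combinations x {cv} folds = {n_fits} fits")
```

Since the best values found, `n_epochs=5` and `lr_all=0.002`, both sit at the lower edge of the grid, it could be worth extending the grid to smaller values in a follow-up search.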

# 4. Final Model and Forecast

In [41]:
dir(svd_model)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'bi',
 'biased',
 'bsl_options',
 'bu',
 'compute_baselines',
 'compute_similarities',
 'default_prediction',
 'estimate',
 'fit',
 'get_neighbors',
 'init_mean',
 'init_std_dev',
 'lr_bi',
 'lr_bu',
 'lr_pu',
 'lr_qi',
 'n_epochs',
 'n_factors',
 'predict',
 'pu',
 'qi',
 'random_state',
 'reg_bi',
 'reg_bu',
 'reg_pu',
 'reg_qi',
 'sgd',
 'sim_options',
 'test',
 'trainset',
 'verbose']

In [42]:
svd_model.n_epochs

20

In [43]:
svd_model = SVD(**gs.best_params['rmse'])

In [44]:
svd_model

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7bf257867d00>
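The `**` in `SVD(**gs.best_params['rmse'])` unpacks the dictionary into keyword arguments, so the call is equivalent to `SVD(n_epochs=5, lr_all=0.002)`, and every parameter not in the dictionary keeps its default. The sketch below shows the mechanism with `make_model`, a hypothetical stand-in for the constructor that simply returns its settings. Its defaults mirror the documented SVD defaults.

```python
def make_model(n_factors=100, n_epochs=20, lr_all=0.005):
    """Hypothetical stand-in for a model constructor; returns its settings."""
    return {"n_factors": n_factors, "n_epochs": n_epochs, "lr_all": lr_all}


# Best parameters as reported by the grid search above.
best_params = {'n_epochs': 5, 'lr_all': 0.002}

unpacked = make_model(**best_params)             # dictionary unpacking
explicit = make_model(n_epochs=5, lr_all=0.002)  # the same call written out

print(unpacked)
print(unpacked == explicit)
```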

In [45]:
# Build a trainset from all ratings; a separate name keeps `data` reusable
full_trainset = data.build_full_trainset()

In [46]:
svd_model.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7bf257867d00>

In [47]:
# Making Final Predictions

In [48]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.23   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.229107535884852, details={'was_impossible': False})
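To go from single predictions to recommendations, one common approach is to score every item for a user, drop the items the user has already rated, and keep the highest-scoring remainder. The sketch below shows that step on made-up numbers. `top_n_unseen` is a hypothetical helper, and the `predicted` array stands in for estimates that would come from calling `svd_model.predict` once per item.

```python
import numpy as np


def top_n_unseen(predicted, already_rated, n=2):
    """Return indices of the n best-scoring items the user has not rated."""
    # Push already-rated items to the bottom so they are never recommended.
    scores = np.where(already_rated, -np.inf, predicted)
    order = np.argsort(-scores)  # highest score first
    return [int(i) for i in order[:n] if np.isfinite(scores[i])]


# Made-up predicted ratings for four items and one user.
predicted = np.array([4.2, 3.1, 4.8, 2.5])
already_rated = np.array([False, False, True, False])

recommended = top_n_unseen(predicted, already_rated, n=2)
print(recommended)
```

Item 2 has the highest predicted rating but is skipped because the user has already rated it.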