Matrix factorization assumes that latent features drive what we like. For a movie, for instance, the genre, the actors, the color palette, or even the hour at which it was watched may explain why someone likes it. The same features describe the movie as well. Both the user features and the movie features carry weights of different magnitudes.

Given one matrix of user features and one matrix of movie features, it is assumed that the dot products of the feature vectors make up the user rating matrix.
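
As a minimal sketch of this assumption, the snippet below multiplies two small feature matrices to obtain a rating matrix. The names `P` and `Q` and all numbers are made up for illustration; they do not come from the dataset.

```python
import numpy as np

# Hypothetical latent feature weights: 3 users x 2 features.
P = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])

# Hypothetical latent feature weights: 4 movies x 2 features.
Q = np.array([[5.0, 1.0],
              [4.0, 2.0],
              [1.0, 2.0],
              [3.0, 0.5]])

# R_hat[u, i] is the dot product of user u's and movie i's feature vectors.
R_hat = P @ Q.T

print(R_hat.shape)  # (3, 4)
print(R_hat[2, 0])  # 1*5 + 1*1 = 6.0
```

Each row of `R_hat` holds one user's predicted ratings for all four movies.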

![resim.png](attachment:98adcf50-dd35-40cb-bb65-d4b603d67f5f.png)

This is the loss function. Let $r_{ui}$ be the rating user $u$ gave to item $i$, $q_i$ the feature weights of the item, and $p_u$ the feature weights of the user. Since the dot product of $q_i$ and $p_u$ approximates the rating, the difference between that product and the actual rating, i.e. the error, should be as small as possible. The aim is to find the $p$ and $q$ values that minimize the sum of squared errors. These values are found by gradient descent.
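
The minimization described above can be sketched with plain NumPy. This is a simplified illustration, not what the `surprise` library runs internally: it uses only the plain squared error described in the text, without bias terms or regularization. The function names `factorize` and `sse`, the learning rate, the number of factors, and the toy ratings are all assumptions made for the example.

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2, lr=0.01, n_epochs=500, seed=0):
    """Fit user and item feature weights by stochastic gradient descent.

    ratings: list of (user_index, item_index, rating) triples, known ratings only.
    """
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user feature weights
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item feature weights

    for _ in range(n_epochs):
        for u, i, r_ui in ratings:
            err = r_ui - P[u] @ Q[i]   # error of the current prediction
            p_old = P[u].copy()        # keep the old value for the item update
            # Step against the gradient of err**2 (the factor 2 is folded into lr).
            P[u] += lr * err * Q[i]
            Q[i] += lr * err * p_old
    return P, Q

def sse(ratings, P, Q):
    """Sum of squared errors over the known ratings."""
    return sum((r - P[u] @ Q[i]) ** 2 for u, i, r in ratings)

# Made-up ratings: (user index, item index, rating).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = factorize(toy_ratings, n_users=3, n_items=3, n_epochs=0)  # initial weights only
P, Q = factorize(toy_ratings, n_users=3, n_items=3)                # trained weights

print("SSE before training:", sse(toy_ratings, P0, Q0))
print("SSE after training: ", sse(toy_ratings, P, Q))
```

The loop visits only the known ratings, so missing entries never contribute to the error.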

After finding p and q, the missing ratings are calculated from their dot products.
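
A small sketch of that last step, assuming the feature weights are already known. The matrices `P`, `Q` and `R` below are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical learned feature weights (2 users, 2 items, 2 latent features).
P = np.array([[2.0, 1.0],
              [1.0, 3.0]])
Q = np.array([[1.0, 2.0],
              [3.0, 0.0]])

# Observed ratings; np.nan marks a rating the user has not given.
R = np.array([[4.0, np.nan],
              [np.nan, 3.0]])

R_hat = P @ Q.T                              # predicted rating for every user-item pair
R_filled = np.where(np.isnan(R), R_hat, R)   # keep known ratings, fill only the gaps

print(R_filled)
```

Here the two gaps are filled with 6.0 (user 0, item 1) and 7.0 (user 1, item 0), while the known ratings stay untouched.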

This project aims to predict users' missing anime ratings by matrix factorization, using the Anime Recommendations Database.


In [None]:
import numpy as np 
import pandas as pd 

In [None]:
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate
pd.set_option('display.max_columns', None)

In [None]:
df1 = pd.read_csv("/kaggle/input/anime-recommendations-database/anime.csv")
df2 = pd.read_csv("/kaggle/input/anime-recommendations-database/rating.csv")

In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
# The two datasets are merged so that the anime, the users and their ratings are gathered in one dataframe.
# rating_x is the anime's overall rating, whereas rating_y is the individual user's rating.
df = df1.merge(df2, how="left", on="anime_id")
df.tail()
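
The `_x` and `_y` suffixes come from pandas: when both frames have a column with the same name that is not a merge key, `merge` renames them. The toy frames below are made-up stand-ins for `anime.csv` and `rating.csv`, meant only to show that behavior.

```python
import pandas as pd

# Toy stand-ins for the two CSV files; all values are made up.
anime = pd.DataFrame({"anime_id": [1, 2], "name": ["A", "B"], "rating": [8.5, 7.0]})
user_ratings = pd.DataFrame({"user_id": [10, 11], "anime_id": [1, 1], "rating": [9, -1]})

merged = anime.merge(user_ratings, how="left", on="anime_id")

print(merged.columns.tolist())
print(merged)
```

Anime `B` has no user ratings, so the left merge keeps it with `NaN` in `user_id` and `rating_y`. That `NaN` also turns `user_id` into a float column, which is likely why user ids appear as `1.0` and `3.0` later in the notebook.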

In [None]:
df.shape

In [None]:
# Only a few anime will be used, so the dataframe is sorted by member count to find the most popular ones.
df1.sort_values("members",ascending=False).head()

In [None]:
# The most popular anime (by member count) are chosen.
anime_ids = [1535,16498, 11757, 5114]

In [None]:
# Subset of the dataframe containing only the chosen anime ids.
sample_df = df[df.anime_id.isin(anime_ids)]

In [None]:
sample_df.head()

In [None]:
# A user - anime pivot table is constructed to apply matrix factorization. 
user_anime_df = sample_df.pivot_table(index=["user_id"], columns=["name"], values="rating_y")
user_anime_df.head()
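
To see what the pivot produces, here is the same call on a tiny made-up frame. The column names match the notebook; the values are illustrative only.

```python
import pandas as pd

# Long format: one row per (user, anime) rating; values are made up.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "name": ["A", "B", "A", "B"],
    "rating_y": [10, 8, 7, 9],
})

# Wide format: one row per user, one column per anime.
pivot = toy.pivot_table(index=["user_id"], columns=["name"], values="rating_y")

print(pivot)
```

User 2 never rated `B` and user 3 never rated `A`, so those cells are `NaN`. These are the entries matrix factorization is meant to fill.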

In [None]:
sample_df["rating_y"].describe()

In [None]:
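# The scale starts at -1 because rating_y contains -1 values (see describe() above).
# In this dataset, -1 marks an anime the user watched without assigning a rating.
reader = Reader(rating_scale=(-1, 10))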

In [None]:
data = Dataset.load_from_df(sample_df[['user_id', 'anime_id', 'rating_y']], reader)

In [None]:
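# Hold out 25% of the ratings for testing.
trainset, testset = train_test_split(data, test_size=.25)

# Fit the matrix factorization model on the training ratings.
svd_model = SVD()
svd_model.fit(trainset)

# Predict the held-out ratings.
predictions = svd_model.test(testset)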

In [None]:
len(predictions)

In [None]:
predictions[0:10]

In [None]:
accuracy.rmse(predictions)
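
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The hand computation below uses made-up pairs, not the notebook's predictions.

```python
import math

# Hypothetical (actual, predicted) rating pairs.
pairs = [(10.0, 9.0), (8.0, 8.0), (7.0, 9.0), (6.0, 7.0)]

squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 4))  # sqrt((1 + 0 + 4 + 1) / 4) = sqrt(1.5), about 1.2247
```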

In [None]:
cross_validate(svd_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)

In [None]:
svd_model.predict(uid=1.0, iid=1535, verbose=True)

In [None]:
svd_model.predict(uid=1.0, iid=5114, verbose=True)

In [None]:
# A grid search with cross-validation looks for hyperparameters that give lower error values.
# The parameters of the best result are then used for fitting the model.

param_grid = {'n_epochs': [50, 100], 'lr_all': [0.005, 0.009]}

gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=10, n_jobs=-1, joblib_verbose=True)

gs.fit(data)

gs.best_score['rmse']
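
A grid search tries every combination of the listed values. The sketch below only enumerates those combinations with the standard library, to show how many candidates the grid above contains; it does not reproduce how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {'n_epochs': [50, 100], 'lr_all': [0.005, 0.009]}

# Every combination of the listed values is one candidate configuration.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

for combo in combinations:
    print(combo)

print(len(combinations))  # 2 * 2 = 4
```

With `cv=10`, each of the 4 candidates is evaluated on 10 folds, so the search amounts to 40 model fits.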




In [None]:
gs.best_params['rmse']

In [None]:
gs.best_estimator["rmse"]

In [None]:
svd_model = gs.best_estimator["rmse"]
svd_model.fit(data.build_full_trainset())

In [None]:
# The ratings of the users with ids 1 and 3 will be predicted. Their actual ratings are as follows.
user_anime_df.head(2)

In [None]:
df1.loc[df1["anime_id"] == 1535]["name"]

In [None]:
# User 1.0 did not rate Death Note, so the rating is predicted.
svd_model.predict(uid=1.0, iid=1535, verbose=True)

In [None]:
df1.loc[df1["anime_id"] == 5114]["name"]

In [None]:
# User 1.0 did not rate Fullmetal Alchemist either, so the rating is predicted.
svd_model.predict(uid=1.0, iid=5114, verbose=True)

In [None]:
df1.loc[df1["anime_id"] == 16498]["name"]

In [None]:
# The actual rating of user 1.0 for Shingeki no Kyojin is -1, and the predicted rating is -1 as well.
svd_model.predict(uid=1.0, iid=16498, verbose=True)

In [None]:
df1.loc[df1["anime_id"] == 11757]["name"]

In [None]:
# The actual rating of the user 1.0 for Sword Art Online is 10.0, the predicted rating is 9.0.
svd_model.predict(uid=1.0, iid= 11757, verbose=True)

In [None]:
# The actual rating of the user 3.0 for Death Note is 10.0, the predicted rating is 9.97.
svd_model.predict(uid=3.0, iid=1535, verbose=True)

In [None]:
# The actual rating of the user 3.0 for Fullmetal Alchemist is 10.0, the predicted rating is 9.97.
svd_model.predict(uid=3.0, iid=5114, verbose=True)

In [None]:
# The actual rating of the user 3.0 for Shingeki no Kyojin is 10.0, the predicted rating is 9.96.
svd_model.predict(uid=3.0, iid=16498, verbose=True)

In [None]:
# The actual rating of user 3.0 for Sword Art Online is 9.0, and the predicted rating is 9.00 as well.
svd_model.predict(uid=3.0, iid=11757, verbose=True)