## **Model-Based Collaborative Filtering: Matrix Factorization**

### **Dataset Story**

* The dataset is provided by MovieLens.

* It contains movies and the ratings given to those movies.

* The dataset contains about 20,000,000 ratings for about 27,000 movies.

* Dataset: https://grouplens.org/datasets/movielens/

In [1]:
!pip install surprise

Successfully installed scikit-surprise-1.1.1 surprise-0.1


In [2]:
import pandas as pd
from surprise import Reader, SVD, Dataset, accuracy
from surprise.model_selection import GridSearchCV, train_test_split, cross_validate
pd.set_option('display.max_columns', None)

#### **Step 1: Data Preparation**

In [3]:
movie = pd.read_csv('/content/drive/MyDrive/DSMLBC10/week_7 (10.11.22-16.11.22)/datasets/movie_lens_dataset/movie.csv')
rating = pd.read_csv('/content/drive/MyDrive/DSMLBC10/week_7 (10.11.22-16.11.22)/datasets/movie_lens_dataset/rating.csv')
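# Left join keeps every movie, including those without ratings
# (their userId and rating are NaN, which is why userId is a float).
df = movie.merge(rating, how="left", on="movieId")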
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41


In [4]:
# Work on four movies only; the titles are listed in the same order as the ids.
movie_ids = [130219, 356, 4422, 541]
movies = ["The Dark Knight (2011)",
          "Forrest Gump (1994)",
          "Cries and Whispers (Viskningar och rop) (1972)",
          "Blade Runner (1982)"]
sample_df = df[df.movieId.isin(movie_ids)]
sample_df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
2457839,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.0,4.0,1996-08-24 09:28:42
2457840,356,Forrest Gump (1994),Comedy|Drama|Romance|War,7.0,4.0,2002-01-16 19:02:55
2457841,356,Forrest Gump (1994),Comedy|Drama|Romance|War,8.0,5.0,1996-06-05 13:44:19
2457842,356,Forrest Gump (1994),Comedy|Drama|Romance|War,9.0,4.0,2001-07-01 20:26:38
2457843,356,Forrest Gump (1994),Comedy|Drama|Romance|War,10.0,3.0,1999-11-25 02:32:02


In [5]:
sample_df.shape

(97343, 6)

In [6]:
user_movie_df = sample_df.pivot_table(index=["userId"],
                                      columns=["title"],
                                      values="rating")
user_movie_df.shape

(76918, 4)
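
The two shapes above give a rough sense of how sparse the user-movie matrix is. Below is a minimal sketch of that arithmetic, assuming every row of `sample_df` is a distinct (user, movie) rating. The variable names are illustrative and not part of the notebook.

```python
# Shapes reported by the cells above.
n_ratings = 97343              # rows in sample_df
n_users, n_movies = 76918, 4   # shape of user_movie_df

# Fraction of user-movie cells that hold an observed rating
# (assumes each row of sample_df is a distinct user-movie pair).
n_cells = n_users * n_movies
density = n_ratings / n_cells

print(f"cells: {n_cells}, filled: {density:.1%}")
```

Even with only four movies, roughly two thirds of the cells are empty. These are the gaps that the model-based approach fills with predicted ratings.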

In [7]:
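# Note: MovieLens ratings run from 0.5 to 5.0 in half-star steps, so
# rating_scale=(0.5, 5) would match the data more closely. (1, 5) is kept
# here because the outputs below were produced with it.
reader = Reader(rating_scale=(1, 5))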

data = Dataset.load_from_df(sample_df[['userId',
                                       'movieId',
                                       'rating']], reader)
data

<surprise.dataset.DatasetAutoFolds at 0x7f2308285e50>

#### **Step 2: Modeling**

In [8]:
trainset, testset = train_test_split(data, test_size=.25)
svd_model = SVD()
svd_model.fit(trainset)
predictions = svd_model.test(testset)
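
Surprise's `SVD` estimates a rating as `est = mu + b_u + b_i + q_i · p_u` (global mean, user bias, item bias, and the dot product of latent factors) and learns these terms by stochastic gradient descent on a regularized squared error. The following is a simplified sketch of that idea in plain Python, not Surprise's actual implementation. The toy ratings, the `fit_mf` and `rmse` helpers, and their hyperparameters are assumptions chosen for illustration.

```python
import random

def fit_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit a biased matrix factorization with SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    bu = {u: 0.0 for u in users}                         # user biases
    bi = {i: 0.0 for i in items}                         # item biases
    # Latent factors start as small random numbers.
    pu = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    qi = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        dot = sum(p * q for p, q in zip(pu[u], qi[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error, shrunk by regularization.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                p, q = pu[u][f], qi[i][f]
                pu[u][f] += lr * (err * q - reg * p)
                qi[i][f] += lr * (err * p - reg * q)
    return predict

def rmse(predict, ratings):
    """Root mean squared error of `predict` on the given triples."""
    return (sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)) ** 0.5

# Made-up ratings: (userId, movieId, rating).
toy = [(1, 541, 4.0), (1, 356, 5.0), (2, 541, 2.0),
       (2, 356, 3.0), (3, 356, 4.0), (3, 4422, 1.0)]

untrained = fit_mf(toy, n_epochs=0)
trained = fit_mf(toy, n_epochs=200)
print(rmse(untrained, toy), rmse(trained, toy))
print(trained(3, 541))  # estimate for a pair with no observed rating
```

The last line is the point of the method: once the factors are learned, the same formula yields an estimate for user-movie pairs that were never rated.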

In [9]:
accuracy.rmse(predictions)

RMSE: 0.9354


0.9353961798282115
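
`accuracy.rmse` reports the root mean squared error between the true ratings and the estimates stored in `predictions`. Here is a small sketch of the same formula on made-up (true, estimated) pairs. The numbers are illustrative and not taken from the test set.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, 3.0)]

# RMSE = sqrt(mean((r_ui - est)^2))
squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 4))
```

Read this way, an RMSE of about 0.94 means the estimates miss the true rating by a little under one star on a typical test example.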

In [10]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 3.87   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=3.86777281590454, details={'was_impossible': False})

In [11]:
svd_model.predict(uid=1.0, iid=356, verbose=True)

user: 1.0        item: 356        r_ui = None   est = 4.23   {'was_impossible': False}


Prediction(uid=1.0, iid=356, r_ui=None, est=4.226473402015522, details={'was_impossible': False})

In [12]:
sample_df[sample_df["userId"] == 1]

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
3612352,541,Blade Runner (1982),Action|Sci-Fi|Thriller,1.0,4.0,2005-04-02 23:30:03


#### **Step 3: Model Tuning**

In [13]:
param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}

In [14]:
gs = GridSearchCV(SVD,
                  param_grid,
                  measures=['rmse', 'mae'],
                  cv=3,
                  n_jobs=-1,
                  joblib_verbose=True)
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:  2.1min finished
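
The grid has 3 × 3 = 9 parameter combinations, and with `cv=3` each one is fitted three times, which matches the 27 tasks in the joblib log. The short sketch below enumerates the same combinations with `itertools.product`. It only illustrates what the search iterates over and is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}
cv = 3

# Every combination of one value per parameter.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]

n_fits = len(combinations) * cv
print(len(combinations), n_fits)
```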


In [15]:
gs.best_score['rmse']

0.9307239424125795

In [16]:
gs.best_params['rmse']

{'n_epochs': 5, 'lr_all': 0.002}
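
Both best values sit at the lower edge of their search ranges, which can indicate that the optimum lies outside the grid. Below is a hedged sketch of a check for this situation. `edge_params` is a hypothetical helper written for this example and is not part of Surprise.

```python
def edge_params(param_grid, best_params):
    """Return the parameters whose best value is the smallest or largest tried."""
    return [name for name, values in param_grid.items()
            if best_params[name] in (min(values), max(values))]

param_grid = {'n_epochs': [5, 10, 20],
              'lr_all': [0.002, 0.005, 0.007]}
best_params = {'n_epochs': 5, 'lr_all': 0.002}

print(edge_params(param_grid, best_params))
```

When a parameter is flagged, extending the grid in that direction (here, fewer epochs or a smaller learning rate) is a reasonable next experiment.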

#### **Step 4: Final Model and Prediction**

In [17]:
# Inspect the model fitted in Step 2; it still uses the default n_epochs.
dir(svd_model)
svd_model.n_epochs

20

In [18]:
svd_model = SVD(**gs.best_params['rmse'])

In [19]:
# Train the final model on all available ratings.
full_trainset = data.build_full_trainset()
svd_model.fit(full_trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f22a67e0c10>

In [20]:
svd_model.predict(uid=1.0, iid=541, verbose=True)

user: 1.0        item: 541        r_ui = None   est = 4.21   {'was_impossible': False}


Prediction(uid=1.0, iid=541, r_ui=None, est=4.212630303654173, details={'was_impossible': False})
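
User 1 actually rated Blade Runner (movieId 541) 4.0, so both estimates for this pair can be compared with that value. The sketch below does the arithmetic with the numbers printed in this notebook. The final model was trained on the full dataset, which presumably includes this very rating, so the comparison is an illustration and not a held-out evaluation.

```python
actual = 4.0                      # user 1's rating for movieId 541
est_default = 3.86777281590454    # SVD() fitted on the 75% training split
est_tuned = 4.212630303654173     # tuned SVD fitted on the full dataset

for label, est in [("default", est_default), ("tuned", est_tuned)]:
    print(f"{label}: est={est:.2f}, abs error={abs(actual - est):.2f}")
```

A single prediction says little about model quality; the cross-validated RMSE from Step 3 is the more reliable basis for comparing the two configurations.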