### Task description:

Use the Surprise package to build recommendations.

- use the [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) dataset,
- any models from the package may be used,
- achieve an RMSE of 0.87 or lower on the test set.
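For reference, the target metric is the root-mean-square error between true and predicted ratings. Below is a minimal plain-Python sketch of that metric, assuming the ratings arrive as two equal-length sequences. The function name `rmse` and the sample values are illustrative only; the notebook itself relies on `surprise.accuracy.rmse`.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root-mean-square error between two equal-length sequences."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("sequences must have the same length")
    if not true_ratings:
        raise ValueError("sequences must not be empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: the errors are 1, 0, -1 and 0, so the MSE is 0.5.
example = rmse([5, 3, 4, 1], [4, 3, 5, 1])
print(round(example, 4))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.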

In [1]:
import wget
import zipfile

import pandas as pd
import numpy as np

from surprise import (BaselineOnly, CoClustering, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, 
                      NormalPredictor, NMF, SlopeOne, SVD, SVDpp)
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import train_test_split
from surprise import accuracy
from surprise.model_selection import GridSearchCV

In [2]:
dataset = 'ml-1m'
url = f'https://files.grouplens.org/datasets/movielens/{dataset}.zip'
wget.download(url, 'MovieLens.zip')

100% [..........................................................................] 5917549 / 5917549

'MovieLens.zip'

In [3]:
with zipfile.ZipFile("MovieLens.zip","r") as zip_ref:
    zip_ref.extractall()

In [4]:
users = pd.read_csv(f'./{dataset}/users.dat', sep='::',
                    names = ['userID', 'gender', 'age', 'occupation', 'zip-code'], engine='python')
users.name = 'users'
movies = pd.read_csv(f'./{dataset}/movies.dat', sep='::',
                     names = ['movieId', 'title', 'genres'], encoding='latin-1', engine='python')
movies.name = 'movies'
ratings = pd.read_csv(f'./{dataset}/ratings.dat', sep='::',
                      names = ['userId', 'movieId', 'rating', 'timestamp'], engine='python')
ratings.name = 'ratings'

In [5]:
def get_analises(dataset) -> None:
    print(dataset.name)
    dataset.info()
    print(f'Duplicate rows: {dataset.duplicated().sum()}')
    print('------------')

In [6]:
get_analises(users)
get_analises(movies)
get_analises(ratings)

users
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   userID      6040 non-null   int64 
 1   gender      6040 non-null   object
 2   age         6040 non-null   int64 
 3   occupation  6040 non-null   int64 
 4   zip-code    6040 non-null   object
dtypes: int64(3), object(2)
memory usage: 236.1+ KB
Duplicate rows: 0
------------
movies
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  3883 non-null   int64 
 1   title    3883 non-null   object
 2   genres   3883 non-null   object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB
Duplicate rows: 0
------------
ratings
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column    

In [7]:
movies_with_ratings = movies.merge(ratings, on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)
movies_with_ratings.head()

   movieId             title                       genres  userId  rating  timestamp
0        1  Toy Story (1995)  Animation|Children's|Comedy       1       5  978824268
1        1  Toy Story (1995)  Animation|Children's|Comedy       6       4  978237008
2        1  Toy Story (1995)  Animation|Children's|Comedy       8       4  978233496
3        1  Toy Story (1995)  Animation|Children's|Comedy       9       5  978225952
4        1  Toy Story (1995)  Animation|Children's|Comedy      10       5  978226474


In [8]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [9]:
dataset.dropna(inplace=True)
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 3 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   uid     1000209 non-null  int64 
 1   iid     1000209 non-null  object
 2   rating  1000209 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 22.9+ MB
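Since the item id (`iid`) here is the movie title rather than `movieId`, two distinct movies sharing a title would be merged into a single item. Below is a small sketch of a uniqueness check, written in plain Python on invented toy rows. The helper name `titles_with_multiple_ids` is hypothetical and not part of the notebook.

```python
from collections import defaultdict


def titles_with_multiple_ids(movie_rows):
    """Return {title: sorted ids} for titles attached to more than one movie id."""
    ids_by_title = defaultdict(set)
    for movie_id, title in movie_rows:
        ids_by_title[title].add(movie_id)
    return {t: sorted(ids) for t, ids in ids_by_title.items() if len(ids) > 1}


# Toy rows invented for illustration: (movieId, title)
toy_movies = [(1, "Toy Story (1995)"), (2, "Jumanji (1995)"), (3, "Jumanji (1995)")]
print(titles_with_multiple_ids(toy_movies))
```

With the notebook's data, the rows could come from `movies[['movieId', 'title']].itertuples(index=False)`; an empty result would mean the title is a safe stand-in for the id.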


In [10]:
reader = Reader(rating_scale=(ratings.rating.min(), ratings.rating.max()))
data = Dataset.load_from_df(dataset, reader)

In [11]:
trainset, testset = train_test_split(data, test_size=.2, random_state = 21)
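As a conceptual illustration of what an 80/20 hold-out split with a fixed seed does, here is a plain-Python sketch. It is not Surprise's implementation, and its shuffling will not reproduce the same partition as `train_test_split`; the function name `split_ratings` and the toy ratings are assumptions made for the example.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=21):
    """Shuffle (user, item, rating) tuples and hold out a fraction for testing."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# 100 invented ratings: 10 users x 10 items, values in 1..5
toy = [(u, i, (u + i) % 5 + 1) for u in range(10) for i in range(10)]
train, test = split_ratings(toy)
print(len(train), len(test))
```

Fixing the seed matters here because every model below is compared on the same held-out ratings.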

In [25]:
models = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
models_name = ['SVD', 'SVDpp', 'SlopeOne', 'NMF', 'NormalPredictor', 'KNNBaseline', 'KNNBasic', 'KNNWithMeans', 'KNNWithZScore', 'BaselineOnly', 'CoClustering']

In [26]:
for idx, model in enumerate(models):
    model.fit(trainset)
    print(f'Model {models_name[idx]}: rmse = {accuracy.rmse(model.test(testset), verbose=False)}')

Model SVD: rmse = 0.8713557957245915
Model SVDpp: rmse = 0.858187148764517
Model SlopeOne: rmse = 0.9034563191669724
Model NMF: rmse = 0.9141843997913802
Model NormalPredictor: rmse = 1.505418036133454
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Model KNNBaseline: rmse = 0.8921297881542896
Computing the msd similarity matrix...
Done computing similarity matrix.
Model KNNBasic: rmse = 0.9192505209922023
Computing the msd similarity matrix...
Done computing similarity matrix.
Model KNNWithMeans: rmse = 0.9265590995828431
Computing the msd similarity matrix...
Done computing similarity matrix.
Model KNNWithZScore: rmse = 0.9278355686797725
Estimating biases using als...
Model BaselineOnly: rmse = 0.9056271143463166
Model CoClustering: rmse = 0.9150876784937805


With default parameters, SVDpp reaches the required metric: an RMSE of about 0.858 on the test set, below the 0.87 threshold. Plain SVD narrowly misses it at about 0.871.
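For context, the two matrix-factorization models differ in one term of the prediction rule, as described in the Surprise documentation. Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and $y_j$ are the implicit item factors that only SVD++ learns. Whether this extra term fully explains the gap observed here is not established by this notebook.

```latex
\hat{r}_{ui}^{\text{SVD}} = \mu + b_u + b_i + q_i^{\top} p_u
\qquad
\hat{r}_{ui}^{\text{SVD++}} = \mu + b_u + b_i + q_i^{\top}\Bigl(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\Bigr)
```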

Let's try to improve the RMSE further by searching over additional hyperparameter values for SVDpp.

In [29]:
params = {
    'n_epochs': [10, 30, 50],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.005, 0.4, 0.6],
}

**Warning:** this GridSearchCV run takes several hours, since every one of the 27 parameter combinations is fitted on 5 folds.
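The cost of the search follows directly from the size of the grid. The sketch below enumerates the combinations with `itertools.product`, repeating the `params` dictionary from the cell above; the names `cv_folds`, `combinations` and `total_fits` are introduced here only for illustration.

```python
from itertools import product

params = {
    'n_epochs': [10, 30, 50],
    'lr_all': [0.002, 0.005, 0.01],
    'reg_all': [0.005, 0.4, 0.6],
}
cv_folds = 5  # same number of folds as cv=5 in the grid search

# One dict per point of the grid, in the order the lists are written.
combinations = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combinations) * cv_folds
print(len(combinations), total_fits)
```

Dropping one value from any of the three lists would cut the number of SVDpp fits from 135 to 90.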

In [None]:
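# Cross-validated search over the grid defined above.
# Caveat: `data` also contains the ratings held out in `testset`, so the RMSE
# reported below may be optimistic with respect to hyperparameter selection.
# refit=False: the best estimator is fitted on `trainset` right below anyway.
grid_search = GridSearchCV(SVDpp, params, measures=['rmse'], cv=5, refit=False)
grid_search.fit(data)

# SVDpp instance configured with the parameters that gave the best mean RMSE.
algo = grid_search.best_estimator['rmse']
algo.fit(trainset)

predictions = algo.test(testset)
accur_score = accuracy.rmse(predictions, verbose=True)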

In [None]:
# Surprise exposes `best_params` (a dict keyed by measure), not scikit-learn's `best_params_`.
print(f"Best parameters: {grid_search.best_params['rmse']}")
print(f'Best model quality after parameter tuning: RMSE = {accur_score}')