### Task

Build a recommender system that combines several approaches. Use the [ml-latest](https://grouplens.org/datasets/movielens/latest/) dataset.

In [44]:
import pandas as pd
import numpy as np

from surprise import KNNBasic, KNNWithMeans, SVD, NMF
from surprise import Dataset
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split, GridSearchCV, cross_validate

from scipy.sparse import csr_matrix

from sklearn.neighbors import NearestNeighbors

In [3]:
links = pd.read_csv('Data/links.csv')
movies = pd.read_csv('Data/movies.csv')
ratings = pd.read_csv('Data/ratings.csv')
tags = pd.read_csv('Data/tags.csv')

In [4]:
# join the movie titles table with the ratings table
movies_with_ratings = movies.join(ratings.set_index('movieId'), on='movieId').reset_index(drop=True)
movies_with_ratings.dropna(inplace=True)

In [5]:
movies_with_ratings.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,964982700.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,847435000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,1106636000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,1510578000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,1305696000.0


In [6]:
dataset = pd.DataFrame({
    'uid': movies_with_ratings.userId,
    'iid': movies_with_ratings.title,
    'rating': movies_with_ratings.rating
})

In [7]:
dataset.head()

,uid,iid,rating
0,1.0,Toy Story (1995),4.0
1,5.0,Toy Story (1995),4.0
2,7.0,Toy Story (1995),4.5
3,15.0,Toy Story (1995),2.5
4,17.0,Toy Story (1995),4.5


In [8]:
dataset[dataset.uid==1.0] 

,uid,iid,rating
0,1.0,Toy Story (1995),4.0
325,1.0,Grumpier Old Men (1995),4.0
433,1.0,Heat (1995),4.0
2107,1.0,Seven (a.k.a. Se7en) (1995),5.0
2379,1.0,"Usual Suspects, The (1995)",5.0
...,...,...,...
56820,1.0,Shaft (2000),4.0
57280,1.0,X-Men (2000),5.0
57461,1.0,What About Bob? (1991),4.0
59174,1.0,Transformers: The Movie (1986),4.0


The idea behind the hybrid system is as follows:
- build a list of the top-rated movies for the user in question
- find movies similar to that top list (k-nearest neighbors algorithm)
- predict the user's rating for each candidate movie (a regression task)
- return the n movies with the highest predicted rating
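
The steps above can be sketched on toy data, without any of the libraries used in the notebook. Everything in this snippet is hypothetical: `toy_ratings`, `toy_neighbors`, and `toy_predict` are made-up stand-ins for the real rating table, the neighbor model, and the rating predictor.

```python
# Toy sketch of the hybrid pipeline; all data and helpers are invented for illustration.

# Ratings already given by one user: title -> rating
toy_ratings = {"A": 5.0, "B": 4.5, "C": 2.0}

# Stand-in for the similarity model: title -> similar titles
toy_neighbors = {
    "A": ["B", "D", "E"],
    "B": ["A", "E", "F"],
    "C": ["G"],
}

# Stand-in for the rating predictor: title -> predicted rating for this user
toy_predict = {"D": 4.1, "E": 4.7, "F": 3.2, "G": 2.5}


def hybrid_recommend(ratings, neighbors, predict, n_seeds=2, n_out=2):
    # Step 1: take the user's highest-rated movies as seeds
    seeds = sorted(ratings, key=ratings.get, reverse=True)[:n_seeds]
    # Step 2: collect movies similar to the seeds, skipping ones already rated
    candidates = {m for s in seeds for m in neighbors[s] if m not in ratings}
    # Step 3: score every candidate with the predictor
    scored = [(m, predict[m]) for m in candidates]
    # Step 4: keep the n_out candidates with the highest predicted rating
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n_out]


recs = hybrid_recommend(toy_ratings, toy_neighbors, toy_predict)
print(recs)
```

The candidate set is limited to neighbors of the seed movies, so the predictor only has to score a handful of titles instead of the whole catalog.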

In [9]:
movie_pivot = dataset.pivot_table( index='iid', columns='uid', values='rating', fill_value=0)
movie_pivot.head()

iid \ uid,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,...,601.0,602.0,603.0,604.0,605.0,606.0,607.0,608.0,609.0,610.0
'71 (2014),0,0.0,0.0,0,0,0,0.0,0,0,0.0,...,0.0,0,0,0,0.0,0.0,0,0.0,0,4.0
'Hellboy': The Seeds of Creation (2004),0,0.0,0.0,0,0,0,0.0,0,0,0.0,...,0.0,0,0,0,0.0,0.0,0,0.0,0,0.0
'Round Midnight (1986),0,0.0,0.0,0,0,0,0.0,0,0,0.0,...,0.0,0,0,0,0.0,0.0,0,0.0,0,0.0
'Salem's Lot (2004),0,0.0,0.0,0,0,0,0.0,0,0,0.0,...,0.0,0,0,0,0.0,0.0,0,0.0,0,0.0
'Til There Was You (1997),0,0.0,0.0,0,0,0,0.0,0,0,0.0,...,0.0,0,0,0,0.0,0.0,0,0.0,0,0.0


In [12]:
movie_sparse=csr_matrix(movie_pivot)

#### Model 1: finding similar movies

In [13]:
model=NearestNeighbors(n_neighbors=20,algorithm='brute',metric='cosine')

In [14]:
model.fit(movie_sparse)

NearestNeighbors(algorithm='brute', metric='cosine', n_neighbors=20)
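
To make the `metric='cosine'` choice more concrete, here is a small standard-library sketch of cosine distance (one minus cosine similarity) between two rows of a movie-by-user matrix. The three rating rows are invented; in the notebook the rows come from `movie_pivot`, where 0 stands for "not rated".

```python
import math


def cosine_distance(u, v):
    # 1 - cos(angle between u and v); 0 means same direction, 1 means orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical rows: each movie's ratings from four users (0 = not rated)
movie_a = [4.0, 0.0, 5.0, 0.0]
movie_b = [2.0, 0.0, 2.5, 0.0]   # rated by the same users, at half the scores
movie_c = [0.0, 3.0, 0.0, 4.0]   # rated by a disjoint set of users

d_ab = cosine_distance(movie_a, movie_b)
d_ac = cosine_distance(movie_a, movie_c)
print(d_ab, d_ac)
```

Because the measure depends only on direction, movies rated by the same users in the same proportions come out as near-identical even when the absolute ratings differ, while movies with no raters in common end up at distance 1.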

For example, for Toy Story (1995) the following movies are selected as similar:

In [15]:
film_title = 'Toy Story (1995)'
distances,suggestions=model.kneighbors(movie_pivot.loc[film_title,:].values.reshape(1,-1))

In [16]:
for i in range(len(suggestions[0])):
    if movie_pivot.index[suggestions[0][i]] == film_title:
        continue
    print(movie_pivot.index[suggestions[0][i]])

Toy Story 2 (1999)
Jurassic Park (1993)
Independence Day (a.k.a. ID4) (1996)
Star Wars: Episode IV - A New Hope (1977)
Forrest Gump (1994)
Lion King, The (1994)
Star Wars: Episode VI - Return of the Jedi (1983)
Mission: Impossible (1996)
Groundhog Day (1993)
Back to the Future (1985)
Shrek (2001)
Aladdin (1992)
Apollo 13 (1995)
Pulp Fiction (1994)
Star Wars: Episode V - The Empire Strikes Back (1980)
Willy Wonka & the Chocolate Factory (1971)
Men in Black (a.k.a. MIB) (1997)
Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
Shawshank Redemption, The (1994)


#### Model 2: predicting a movie's rating

Let's look at several candidate algorithms from the surprise package.



In [17]:
reader = Reader(rating_scale=(dataset.rating.min(), dataset.rating.max()))
data = Dataset.load_from_df(dataset, reader)

In [18]:
trainset, testset = train_test_split(data, test_size=0.2, random_state=146)

In [20]:
algo_dict = {'KNN basic':KNNBasic(), 'KNN with Means':KNNWithMeans(), 'SVD':SVD(), 'NMF': NMF()}
for name_i, algo_i in algo_dict.items():
    algo_i.fit(trainset)
    print(f'{name_i}: RMSE = {accuracy.rmse(algo_i.test(testset), verbose=False)}')

Computing the msd similarity matrix...
Done computing similarity matrix.
KNN basic: RMSE = 0.947176837616969
Computing the msd similarity matrix...
Done computing similarity matrix.
KNN with Means: RMSE = 0.8997806257254258
SVD: RMSE = 0.8790708591030749
NMF: RMSE = 0.9263741116600157
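
For reference, the RMSE reported above is the square root of the mean squared difference between predicted and true ratings. A minimal sketch with invented numbers (not taken from the dataset):

```python
import math


def rmse(true_ratings, predicted_ratings):
    # Root mean squared error over paired true/predicted ratings
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true and predicted ratings for four (user, movie) pairs
true_r = [4.0, 3.0, 5.0, 2.0]
pred_r = [3.0, 3.0, 4.0, 4.0]

score = rmse(true_r, pred_r)
print(score)
```

Squaring the errors makes a single two-star miss weigh as much as four one-star misses, so RMSE penalizes large mistakes more heavily than mean absolute error would.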


Among the algorithms with default parameters, SVD gives the lowest error. Let's tune the SVD hyperparameters to reduce RMSE further.

In [28]:
param_grid_svd = {'n_factors': [50, 80, 100, 120],
                  'n_epochs': [5, 10, 20, 40], 
                  'lr_all': [0.002, 0.005, 0.01, 0.1],
                  'reg_all': [0.02, 0.05, 0.1]
                 }

In [31]:
gs_svd = GridSearchCV(SVD, param_grid_svd, measures=['RMSE'], cv=5)
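
It is worth estimating the cost of this search before running it. The sketch below only counts combinations with `itertools.product`; it copies the grid from above and assumes, as is usual for an exhaustive grid search with k-fold cross-validation, that every combination is fitted once per fold.

```python
from itertools import product

# Same grid as in the notebook
param_grid_svd = {
    'n_factors': [50, 80, 100, 120],
    'n_epochs': [5, 10, 20, 40],
    'lr_all': [0.002, 0.005, 0.01, 0.1],
    'reg_all': [0.02, 0.05, 0.1],
}
cv = 5

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid_svd.values()))
n_fits = len(combinations) * cv

print(len(combinations), n_fits)
```

With this many fits the search can take a long time on one core; the `n_jobs` argument of `GridSearchCV` may help spread the work across cores.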

In [32]:
gs_svd.fit(data)

In [33]:
print(gs_svd.best_score['rmse'])
print(gs_svd.best_params['rmse'])

0.8509806487620647
{'n_factors': 120, 'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.1}


In [45]:
algo = gs_svd.best_estimator['rmse']

In [46]:
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8581  0.8520  0.8475  0.8561  0.8452  0.8518  0.0049  
Fit time          15.31   15.91   16.20   15.95   15.85   15.85   0.29    
Test time         0.23    0.61    0.24    0.25    0.26    0.32    0.14    


{'test_rmse': array([0.85808225, 0.85203287, 0.84749996, 0.85607172, 0.84515184]),
 'fit_time': (15.311824321746826,
  15.91179871559143,
  16.204999446868896,
  15.948999881744385,
  15.848542213439941),
 'test_time': (0.2343752384185791,
  0.6069996356964111,
  0.2390003204345703,
  0.2479996681213379,
  0.26399970054626465)}

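Hyperparameter tuning did improve RMSE: the cross-validated mean is about 0.852, compared with 0.879 for SVD with default parameters on the hold-out split above. The two numbers come from different splits, so the comparison is indicative rather than exact.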

In [47]:
algo.predict(uid=146.0, iid='Toy Story (1995)').est

3.777625118778029

#### Гибридная система

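I now have two trained models. Let's try to combine them into a single system. As far as I can tell, `algo` was last fitted inside `cross_validate`, so it has seen only the training part of the final fold; refitting it on `data.build_full_trainset()` would let it use all the ratings.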

In [52]:
def recommend_for_user(userid):
    # movies the user has rated, highest rating first
    top_full = dataset[dataset['uid'] == userid].sort_values(by='rating', ascending=False)['iid'].values
    seen = set(top_full)
    # keep at most the 5 highest-rated movies as seeds (slicing handles shorter lists)
    top_short = top_full[:5]

    # model 1: collect neighbors of every seed movie
    similar_films = set()
    for film in top_short:
        distances, suggestions = model.kneighbors(movie_pivot.loc[film, :].values.reshape(1, -1))
        for idx in suggestions[0]:
            title = movie_pivot.index[idx]
            # skip movies the user has already rated
            if title not in seen:
                similar_films.add(title)

    # model 2: predict the user's rating for every candidate movie
    film_score = {film: algo.predict(uid=userid, iid=film).est for film in similar_films}

    # return the 10 best candidates (or fewer, if fewer were found)
    return sorted(film_score.items(), key=lambda x: x[1], reverse=True)[:10]
  

Let's get some recommendations.

In [53]:
recommend_for_user(1)

[('Hope and Glory (1987)', 4.834473016324756),
 ('Sweet Hereafter, The (1997)', 4.830951655785683),
 ('Killing Fields, The (1984)', 4.803965718393876),
 ('You Only Live Twice (1967)', 4.729469236962127),
 ('Player, The (1992)', 4.71595007516749),
 ('Terminator 2: Judgment Day (1991)', 4.709324507477813),
 ('Manchurian Candidate, The (1962)', 4.684323838303428),
 ('Deliverance (1972)', 4.680399724464322),
 ('Killing Zoe (1994)', 4.677243194691662),
 ('Chariots of Fire (1981)', 4.66981608257467)]

In [56]:
# recommendations for the user
pd.DataFrame(recommend_for_user(42), columns=['iid', 'rating'])

,iid,rating
0,"Green Mile, The (1999)",4.386603
1,Raiders of the Lost Ark (Indiana Jones and the...,4.269404
2,To Catch a Thief (1955),4.219821
3,Being There (1979),4.176704
4,All the President's Men (1976),4.12388
5,"Sixth Sense, The (1999)",4.108503
6,Pink Floyd: The Wall (1982),4.091039
7,American Beauty (1999),4.090321
8,Indiana Jones and the Last Crusade (1989),4.075591
9,Deconstructing Harry (1997),4.070354


In [55]:
# movies this user actually watched (top 10 by rating)
dataset[dataset['uid']==42].sort_values(by='rating', ascending=False)[:10]

,uid,iid,rating
38596,42.0,Saving Private Ryan (1998),5.0
19461,42.0,"Time to Kill, A (1996)",5.0
23823,42.0,On Golden Pond (1981),5.0
23663,42.0,Top Gun (1986),5.0
23412,42.0,"Doors, The (1991)",5.0
23282,42.0,Platoon (1986),5.0
23157,42.0,Reservoir Dogs (1992),5.0
22621,42.0,Swingers (1996),5.0
46862,42.0,American Pie (1999),5.0
47061,42.0,Eyes Wide Shut (1999),5.0


In [57]:
pd.DataFrame(recommend_for_user(55), columns=['iid', 'rating'])

,iid,rating
0,Kiss Kiss Bang Bang (2005),3.734794
1,Fight Club (1999),3.690361
2,"Matrix, The (1999)",3.590073
3,"Godfather, The (1972)",3.583304
4,American History X (1998),3.55288
5,Inglourious Basterds (2009),3.54203
6,Inception (2010),3.525467
7,"Usual Suspects, The (1995)",3.523621
8,Goodfellas (1990),3.487587
9,Monty Python and the Holy Grail (1975),3.452661


In [242]:
dataset[dataset['uid']==55].sort_values(by='rating', ascending=False)[:10]

,uid,iid,rating
44837,55.0,"Lock, Stock & Two Smoking Barrels (1998)",5.0
83534,55.0,"Departed, The (2006)",5.0
78678,55.0,Layer Cake (2004),5.0
59230,55.0,Snatch (2000),5.0
53235,55.0,"Boondock Saints, The (2000)",5.0
82146,55.0,Lucky Number Slevin (2006),4.5
79711,55.0,Crash (2004),4.5
85546,55.0,"Bourne Ultimatum, The (2007)",4.0
28857,55.0,Highlander (1986),4.0
29486,55.0,Gandhi (1982),4.0
