В данной работе вам предстоит познакомиться с практической стороной матричных разложений.
Работа поделена на 4 задания:
1. Вам необходимо реализовать SVD разложения используя SGD на explicit данных
2. Вам необходимо реализовать матричное разложения используя ALS на implicit данных
3. Вам необходимо реализовать матричное разложения используя BPR(pair-wise loss) на implicit данных
4. Вам необходимо реализовать матричное разложения используя WARP(list-wise loss) на implicit данных

Мягкий дедлайн 28 Сентября (пишутся замечания, выставляется оценка, есть возможность исправить до жесткого дедлайна)

Жесткий дедлайн 7 Октября (Итоговая проверка)

In [1]:
%pip install implicit lightfm

Processing /home/jupyter/.cache/pip/wheels/f0/cd/a5/b07914aa223c05ed61880d4c59f64a7febf117dbd2c2cbcf49/lightfm-1.15-cp37-cp37m-linux_x86_64.whl
Collecting numpy
  Using cached numpy-1.19.1-cp37-cp37m-manylinux2010_x86_64.whl (14.5 MB)
Collecting requests
  Using cached requests-2.22.0-py2.py3-none-any.whl (57 kB)
Collecting scipy>=0.17.0
  Using cached scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
Collecting chardet<3.1.0,>=3.0.2
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Collecting idna<2.9,>=2.5
  Using cached idna-2.6-py2.py3-none-any.whl (56 kB)
Processing /home/jupyter/.cache/pip/wheels/44/7e/7d/a17324ea207cfbe76aca878b5b8ca0aa932cf55d163329be37/implicit-0.4.4-cp37-cp37m-linux_x86_64.whl
Collecting tqdm>=4.27
  Using cached tqdm-4.45.0-py2.py3-none-any.wh

In [6]:
import os
import shutil
import zipfile
import urllib.request

filename = 'ml-1m.zip'
urllib.request.urlretrieve('http://files.grouplens.org/datasets/movielens/ml-1m.zip', 'ml-1m.zip')

('ml-1m.zip', <http.client.HTTPMessage at 0x7ff841ffed90>)

In [7]:
import zipfile
from tqdm import tqdm

fname = './ml-1m.zip'
path = './'

with zipfile.ZipFile(fname, 'r') as zf:
    for entry in tqdm(zf.infolist(), desc='Extracting '):
        try:
            zf.extract(entry, path)
        except zipfile.error as e:
            pass

Extracting : 100%|██████████| 5/5 [00:04<00:00,  1.23it/s]


In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

В данной работе мы будем работать с explicit датасетом movieLens, в котором представленны пары user_id movie_id и rating выставленный пользователем фильму

Скачать датасет можно по ссылке https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('ml-1m/ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

In [3]:
movie_info = pd.read_csv('ml-1m/movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

поправим индексацию:

In [4]:
min(ratings['user_id']), min(ratings['movie_id']), min(movie_info['movie_id'])

(1, 1, 1)

In [5]:
ratings['user_id'] -= 1
ratings['movie_id'] -= 1
movie_info['movie_id'] -= 1

In [6]:
min(ratings['user_id']), min(ratings['movie_id']), min(movie_info['movie_id'])

(0, 0, 0)

Explicit данные

In [7]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,0,1192,5,978300760
1,0,660,3,978302109
2,0,913,3,978301968
3,0,3407,4,978300275
4,0,2354,5,978824291
5,0,1196,3,978302268
6,0,1286,5,978302039
7,0,2803,5,978300719
8,0,593,4,978302268
9,0,918,4,978301368


Для того, чтобы преобразовать текущий датасет в Implicit, давайте считать что позитивная оценка это оценка >=4

In [8]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [9]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,0,1192,5,978300760
3,0,3407,4,978300275
4,0,2354,5,978824291
6,0,1286,5,978302039
7,0,2803,5,978300719
8,0,593,4,978302268
9,0,918,4,978301368
10,0,594,5,978824268
11,0,937,4,978301752
12,0,2397,4,978302281


Удобнее работать с sparse матричками, давайте преобразуем DataFrame в CSR матрицы

In [10]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()

В качестве примера воспользуемся ALS разложением из библиотеки implicit

Зададим размерность латентного пространства равным 64, это же определяет размер user/item эмбедингов

In [11]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



В качестве loss здесь всеми любимый RMSE

In [14]:
#!L

model.fit(user_item_t_csr)

HBox(children=(FloatProgress(value=0.0), HTML(value='')))




Построим похожие фильмы по 0 movie_id = Истории игрушек

In [15]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,0,Toy Story (1995),Animation|Children's|Comedy
1,1,Jumanji (1995),Adventure|Children's|Fantasy
2,2,Grumpier Old Men (1995),Comedy|Romance
3,3,Waiting to Exhale (1995),Comedy|Drama
4,4,Father of the Bride Part II (1995),Comedy


In [16]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

Как мы видим, симилары действительно оказались симиларами.

Качество симиларов часто является хорошим способом проверить качество алгоритмов.

P.S. Если хочется поглубже разобраться в том как разные алгоритмы формируют разные латентные пространства, рекомендую загружать полученные вектора в tensorBoard и смотреть на сформированное пространство

In [17]:
get_similars(0, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '1838    Mulan (1998)',
 '2252    Pleasantville (1998)']

Давайте теперь построим рекомендации для юзеров

Как мы видим юзеру нравится фантастика, значит и в рекомендациях ожидаем увидеть фантастику

In [18]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [19]:
get_user_history(3, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

Получилось! 

Мы действительно порекомендовали пользователю фантастику и боевики, более того встречаются продолжения тех фильмов, которые он высоко оценил

In [186]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [21]:
get_recommendations(3, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '2502    Matrix, The (1999)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '3402    Close Encounters of the Third Kind (1977)',
 '847    Godfather, The (1972)',
 '1884    French Connection, The (1971)',
 '2460    Planet of the Apes (1968)']

Теперь ваша очередь реализовать самые популярные алгоритмы матричных разложений

Что будет оцениваться:
1. Корректность алгоритма
2. Качество получившихся симиларов
3. Качество итоговых рекомендаций для юзера

In [172]:
movie_info.head()

Unnamed: 0,movie_id,name,category
0,0,Toy Story (1995),Animation|Children's|Comedy
1,1,Jumanji (1995),Adventure|Children's|Fantasy
2,2,Grumpier Old Men (1995),Comedy|Romance
3,3,Waiting to Exhale (1995),Comedy|Drama
4,4,Father of the Bride Part II (1995),Comedy


In [187]:
get_recommendations_cat = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["category"].to_string() 
                                                   for x in model.recommend(user_id, user_item_csr)]

In [22]:
N_USERS  = max(ratings['user_id']) + 1
N_MOVIES = max(ratings['movie_id']) + 1

N_USERS, N_MOVIES

(6040, 3952)

In [203]:
class Recommender:
    def __init__(self, user_vecs, item_vecs, user_bias=None, item_bias=None, n_rec=10):
        self.user_vecs = user_vecs
        self.item_vecs = item_vecs
        
        self.user_bias = user_bias if (user_bias is not None) else np.zeros(len(user_vecs))
        self.item_bias = item_bias if (item_bias is not None) else np.zeros(len(item_vecs))
        
        self.n_rec = n_rec
    
    def recommend(self, user_id, user_item_csr):
        seen = (user_item_csr[user_id] != 0).toarray()[0]
        similarity = np.dot(self.item_vecs, self.user_vecs[user_id]) + self.item_bias
        similarity[seen] = 0
        recommended = similarity.argsort()[::-1][:self.n_rec]
        return recommended[:,np.newaxis]
    
    def similar_items(self, item_id):
        similarity = np.dot(self.item_vecs, self.item_vecs[item_id]) / np.linalg.norm(self.item_vecs)
        similars = similarity.argsort()[::-1][:self.n_rec]
        return similars[:,np.newaxis]

In [141]:
import numpy as np
import numpy.random as npr

### Задание 1. Не использую готовые решения, реализовать SVD разложение используя SGD на explicit данных

In [331]:
#!M
import time

class SvdModel:
    def __init__(self, n_users, n_items, dim=64, n_rec=10):
        super().__init__()
        scale = 1. / np.sqrt(dim)
        self.user_vecs = npr.uniform(0, scale, size=[n_users, dim])
        self.user_bias = npr.uniform(0, scale, size=[n_users])
        
        self.item_vecs = npr.uniform(0, scale, size=[n_items, dim])
        self.item_bias = npr.uniform(0, scale, size=[n_items])
        
        self.n_rec = n_rec
        self.bias = 0
    
    
    def fit(self, ratings_df, lr=1e-2, alpha=0.01, n_epoch=20):
        n = len(ratings_df)
        batch_size = 64
        
        users   = np.asarray(ratings_df['user_id'])
        movies  = np.asarray(ratings_df['movie_id']) 
        ratings = np.asarray(ratings_df['rating'], dtype=np.float32)
        
        for epoch in range(n_epoch):
            epoch_idx = npr.permutation(n)
            mse = []
            
            
            for batch_start in range(0, n - batch_size, batch_size):
                idx = epoch_idx[batch_start: batch_start + batch_size]
                
                error = np.sum(self.user_vecs[users[idx]] * self.item_vecs[movies[idx]], axis=1) + \
                        self.user_bias[users[idx]] + self.item_bias[movies[idx]] + \
                        self.bias - \
                        ratings[idx]
                mse.append(np.average(error ** 2))
                
                cur_user_vecs = self.user_vecs[users[idx]]
                cur_item_vecs = self.item_vecs[movies[idx]]
                
                self.user_vecs[users[idx]]  -= lr * (error[:, np.newaxis] * cur_item_vecs + alpha * cur_user_vecs)
                self.user_bias[users[idx]]  -= lr * (error + alpha * self.user_bias[users[idx]])
                
                self.item_vecs[movies[idx]] -= lr * (error[:, np.newaxis] * cur_user_vecs + alpha * cur_item_vecs)
                self.item_bias[movies[idx]] -= lr * (error + alpha * self.item_bias[movies[idx]])
                
                self.bias -= (lr * np.average(error) + alpha * self.bias)
                
            mse = np.average(mse)
            if (epoch & (epoch - 1)) == 0:
                print(f"epoch {epoch} mse {mse}")
        
    
    def to_recommender(self):
        return Recommender(self.user_vecs, self.item_vecs, 
                           self.user_bias, self.item_bias)

In [332]:
#!M

model = SvdModel(N_USERS, N_MOVIES, dim=128)
model.fit(ratings, n_epoch=100)

epoch 0 mse 1.2582111817763608
epoch 1 mse 0.8666532502894305
epoch 2 mse 0.8338975349737427
epoch 4 mse 0.7781776025058177
epoch 8 mse 0.6141882440915957
epoch 16 mse 0.32303921441264316
epoch 32 mse 0.17497652565279356
epoch 64 mse 0.12945716659964115


In [333]:
#!M

get_recommendations(3, model.to_recommender())

['3366    Double Indemnity (1944)',
 '3352    Animal House (1978)',
 "941    It's a Wonderful Life (1946)",
 '898    Some Like It Hot (1959)',
 '1215    Sting, The (1973)',
 '1263    High Noon (1952)',
 '1186    Lawrence of Arabia (1962)',
 '2868    Palm Beach Story, The (1942)',
 '893    It Happened One Night (1934)',
 '847    Godfather, The (1972)']

In [334]:
#!M

get_recommendations_cat(3, model.to_recommender())

['3366    Crime|Film-Noir',
 '3352    Comedy',
 '941    Drama',
 '898    Comedy|Crime',
 '1215    Comedy|Crime',
 '1263    Western',
 '1186    Adventure|War',
 '2868    Comedy',
 '893    Comedy',
 '847    Action|Crime|Drama']

In [335]:
#!M

get_similars(0, model.to_recommender())

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '1838    Mulan (1998)',
 '2167    Simon Birch (1998)',
 '3400    Inherit the Wind (1960)',
 '3327    Muppet Movie, The (1979)',
 '935    My Man Godfrey (1936)',
 '584    Aladdin (1992)',
 '1261    Great Dictator, The (1940)']

### Задание 2. Не использую готовые решения, реализовать матричное разложение используя ALS на implicit данных

In [224]:
#!L
import torch
device = torch.device('cuda:0')

class AlsModel:
    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        scale = 1. / np.sqrt(dim)
        self.dim = dim
        self.n_users = n_users
        self.n_items = n_items
        self.user_vecs = scale * torch.rand(n_users, dim).to(device)
        self.item_vecs = scale * torch.rand(n_items, dim).to(device)
    
    def fit(self, user_item_csr, alpha=20., reg_a=1e-2, n_epoch=10):
        reg_eye = torch.eye(self.dim).to(device) * reg_a 
        
        p = torch.tensor(user_item_csr.toarray()).float().to(device)
        C = p * alpha + 1.
        
        for epoch in range(n_epoch):
            U = self.user_vecs
            I = self.item_vecs
            
            if epoch % 2 == 0:
                for u in range(self.n_users):
                    inv_term = torch.inverse(torch.matmul(I.T * C[u], I) + reg_eye) 
                    self.user_vecs[u] = torch.mv(torch.matmul(inv_term, I.T * C[u]), p[u])
            else:
                for i in range(self.n_items):
                    inv_term = torch.inverse(torch.matmul(U.T * C[:, i], U) + reg_eye)
                    self.item_vecs[i] = torch.mv(torch.matmul(inv_term, U.T * C[:, i]), p[:, i])
                    
            mse = ((torch.matmul(self.user_vecs, self.item_vecs.T) - p) ** 2).mean().cpu().detach()
            print(f"epoch {epoch} mse {mse}")
            
        
    def to_recommender(self):
        return Recommender(self.user_vecs.cpu().numpy(), 
                           self.item_vecs.cpu().numpy())
    
model = AlsModel(N_USERS, N_MOVIES)
model.fit(user_item_csr)

epoch 0 mse 0.11402388662099838
epoch 1 mse 0.058488879352808
epoch 2 mse 0.052558403462171555
epoch 3 mse 0.05063788965344429
epoch 4 mse 0.04885832965373993
epoch 5 mse 0.04835929721593857
epoch 6 mse 0.04769299179315567
epoch 7 mse 0.047509025782346725
epoch 8 mse 0.047176916152238846
epoch 9 mse 0.04709593206644058


In [225]:
#!L

get_recommendations(3, model.to_recommender())

['1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1182    Aliens (1986)',
 '2502    Matrix, The (1999)',
 '585    Terminator 2: Judgment Day (1991)',
 '453    Fugitive, The (1993)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '1203    Godfather: Part II, The (1974)',
 '847    Godfather, The (1972)',
 '1568    Hunt for Red October, The (1990)']

In [226]:
#!L

get_recommendations_cat(3, model.to_recommender())

['1178    Action|Adventure|Drama|Sci-Fi|War',
 '1182    Action|Sci-Fi|Thriller|War',
 '2502    Action|Sci-Fi|Thriller',
 '585    Action|Sci-Fi|Thriller',
 '453    Action|Thriller',
 '1271    Action|Adventure',
 '1192    Action|Adventure|Romance|Sci-Fi|War',
 '1203    Action|Crime|Drama',
 '847    Action|Crime|Drama',
 '1568    Action|Thriller']

In [227]:
#!L

get_similars(0, model.to_recommender())

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '584    Aladdin (1992)',
 '33    Babe (1995)',
 '1245    Groundhog Day (1993)',
 '1726    As Good As It Gets (1997)',
 '2252    Pleasantville (1998)',
 '360    Lion King, The (1994)',
 '591    Beauty and the Beast (1991)']

### Задание 3. Не использую готовые решения, реализовать матричное разложение BPR на implicit данных

In [314]:
#!M

def sigmoid(x):                                        
    return np.exp(x) / (1 + np.exp(x))

class BprModel:
    def __init__(self, n_users, n_items, dim=128):
        scale = 1. / np.sqrt(dim)
        self.dim = dim
        self.n_users = n_users
        self.n_items = n_items
        self.user_vecs = npr.uniform(0, scale, size=[n_users, dim])
        self.item_vecs = npr.uniform(0, scale, size=[n_items, dim])
    
    def fit(self, ratings_df, n_epoch=40, lr=5e-2, alpha=1e-5):
        ratings_df = ratings[ratings_df['rating'] >= 4].reset_index(drop=True)
        n = len(ratings_df)
        
        users   = np.asarray(ratings_df['user_id'])
        movies  = np.asarray(ratings_df['movie_id']) 
        batch_size = 1
        
        for epoch in range(n_epoch):
            epoch_idx = npr.permutation(n)
            negative_movies = npr.randint(0, self.n_items, size=[n])
            loss = np.zeros(n)
            
            for batch_start in range(0, n - batch_size, batch_size):
                idx = epoch_idx[batch_start: batch_start + batch_size]
                neg_id   = negative_movies[idx]
                movie_id = movies[idx]
                user_id  = users[idx]
                
                d_item   = self.item_vecs[movie_id] - self.item_vecs[neg_id]
                user_vec = self.user_vecs[user_id]
                r        = np.sum(user_vec * d_item, axis=1)
                dloss_dr = sigmoid(-r)
                
                self.user_vecs[user_id]  -= lr * (- dloss_dr[:,np.newaxis] * d_item   + alpha * self.user_vecs[user_id])
                self.item_vecs[movie_id] -= lr * (- dloss_dr[:,np.newaxis] * user_vec + alpha * self.item_vecs[movie_id])
                self.item_vecs[neg_id]   -= lr * (  dloss_dr[:,np.newaxis] * user_vec + alpha * self.item_vecs[neg_id])
                loss[idx] = np.log(sigmoid(-r))
            print(f"epoch {epoch} loss {np.average(loss)}")
        
    def to_recommender(self):
        return Recommender(self.user_vecs, self.item_vecs)
        
model = BprModel(N_USERS, N_MOVIES)        
model.fit(ratings)

epoch 0 loss -1.6919997810033656
epoch 1 loss -2.471665962889375
epoch 2 loss -2.7108009882091064
epoch 3 loss -2.9047088157728744
epoch 4 loss -3.122593215287271
epoch 5 loss -3.3319668817124235
epoch 6 loss -3.529872429485596
epoch 7 loss -3.6957221790058816
epoch 8 loss -3.8404653672106988
epoch 9 loss -3.98745042911269
epoch 10 loss -4.116169293815311
epoch 11 loss -4.235388019592235
epoch 12 loss -4.350852974398026
epoch 13 loss -4.459619662519827
epoch 14 loss -4.573262016385992
epoch 15 loss -4.689779577912499
epoch 16 loss -4.793650160872265
epoch 17 loss -4.895275566936719
epoch 18 loss -4.984846593195887
epoch 19 loss -5.0848430130173226
epoch 20 loss -5.190242626214744
epoch 21 loss -5.283504511698155
epoch 22 loss -5.368297806383104
epoch 23 loss -5.455083004098275
epoch 24 loss -5.533794563614727
epoch 25 loss -5.638346433577409
epoch 26 loss -5.718547286425198
epoch 27 loss -5.796899849900889
epoch 28 loss -5.866390817558798
epoch 29 loss -5.924083197354956
epoch 30 loss 

In [315]:
#!M
get_recommendations(3, model.to_recommender())

['847    Godfather, The (1972)',
 '108    Braveheart (1995)',
 '2502    Matrix, The (1999)',
 '585    Terminator 2: Judgment Day (1991)',
 '1192    Star Wars: Episode VI - Return of the Jedi (1983)',
 '589    Silence of the Lambs, The (1991)',
 '1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1203    Godfather: Part II, The (1974)',
 '1179    Princess Bride, The (1987)']

In [316]:
#!M

get_recommendations_cat(3, model.to_recommender())

['847    Action|Crime|Drama',
 '108    Action|Drama|War',
 '2502    Action|Sci-Fi|Thriller',
 '585    Action|Sci-Fi|Thriller',
 '1192    Action|Adventure|Romance|Sci-Fi|War',
 '589    Drama|Thriller',
 '1182    Action|Sci-Fi|Thriller|War',
 '1178    Action|Adventure|Drama|Sci-Fi|War',
 '1203    Action|Crime|Drama',
 '1179    Action|Adventure|Comedy|Romance']

In [317]:
#!M

get_similars(0, model.to_recommender())

['0    Toy Story (1995)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '2502    Matrix, The (1999)',
 '3045    Toy Story 2 (1999)',
 '1179    Princess Bride, The (1987)',
 '2789    American Beauty (1999)',
 "523    Schindler's List (1993)",
 '847    Godfather, The (1972)',
 '33    Babe (1995)',
 '589    Silence of the Lambs, The (1991)']

In [318]:
#!M
bpr_model = model

### Задание 4. Не использую готовые решения, реализовать матричное разложение WARP на implicit данных

In [358]:
#!M

class WarpModel:
    def __init__(self, n_users, n_items, dim=128):
        scale = 1. / np.sqrt(dim)
        self.dim = dim
        self.n_users = n_users
        self.n_items = n_items
        self.user_vecs = npr.uniform(0, scale, size=[n_users, dim])
        self.item_vecs = npr.uniform(0, scale, size=[n_items, dim])
        self.n_samples = 20
    
    def fit(self, ratings_df, lr=1e-2, n_epoch=10):
        ratings_df = ratings[ratings_df['rating'] >= 4].reset_index(drop=True)
        n = len(ratings_df)
        
        users  = np.asarray(ratings_df['user_id'])
        movies = np.asarray(ratings_df['movie_id'])
        
        user_item = np.zeros([self.n_users, self.n_items], dtype=np.long)
        user_item[users, movies] = 1
        
        unseen_items = [np.where(user_item[user_id] == 0)[0] for user_id in range(self.n_users)]
        
        for epoch in range(n_epoch):
            epoch_idx = npr.permutation(n)
            
            cur_users  = users[ epoch_idx]
            cur_movies = movies[epoch_idx]
            
            losses = []
            for user_id, pos_id in zip(cur_users, cur_movies):
                user_vec = self.user_vecs[user_id] 
                
                neg_idx = npr.permutation(unseen_items[user_id])[:self.n_samples]
                pos_score  = np.dot(self.item_vecs[pos_id],  user_vec)
                neg_scores = np.dot(self.item_vecs[neg_idx], user_vec)
                diffs = 1 + neg_scores - pos_score
                neg_poses = np.where(diffs > 0)[0]
                if len(neg_poses) == 0:
                    losses.append(0)
                    continue
                p = neg_poses[0]
                neg_id = neg_idx[p]
                c = np.log(len(neg_idx) / (p + 1))
                
                self.user_vecs[user_id] -= lr * c * (self.item_vecs[neg_id] - self.item_vecs[pos_id])
                self.item_vecs[pos_id]  -= lr * c * (-user_vec)
                self.item_vecs[neg_id]  -= lr * c * user_vec
                
                losses.append(c * diffs[p])
                
            print(f"epoch {epoch} loss {np.average(losses)}")
            
    def to_recommender(self):
        return Recommender(self.user_vecs, self.item_vecs)
    
warp_model = WarpModel(N_USERS, N_MOVIES)        
warp_model.fit(ratings)

epoch 0 loss 1.5790660721970202
epoch 1 loss 1.468198112299358
epoch 2 loss 1.2044436070671822
epoch 3 loss 1.0327376186413235
epoch 4 loss 0.9041340271692613
epoch 5 loss 0.7945979141565905
epoch 6 loss 0.714255196984392
epoch 7 loss 0.6558161482586752
epoch 8 loss 0.6109913783859467
epoch 9 loss 0.5789974896828386


In [359]:
#!M

get_recommendations(3, warp_model.to_recommender())

['585    Terminator 2: Judgment Day (1991)',
 '847    Godfather, The (1972)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '2789    American Beauty (1999)',
 '108    Braveheart (1995)',
 '1203    Godfather: Part II, The (1974)',
 '2502    Matrix, The (1999)',
 '1250    Back to the Future (1985)',
 '604    Fargo (1996)',
 '1284    Butch Cassidy and the Sundance Kid (1969)']

In [360]:
#!M

get_recommendations_cat(3, warp_model.to_recommender())

['585    Action|Sci-Fi|Thriller',
 '847    Action|Crime|Drama',
 '1178    Action|Adventure|Drama|Sci-Fi|War',
 '2789    Comedy|Drama',
 '108    Action|Drama|War',
 '1203    Action|Crime|Drama',
 '2502    Action|Sci-Fi|Thriller',
 '1250    Comedy|Sci-Fi',
 '604    Crime|Drama|Thriller',
 '1284    Action|Comedy|Western']

In [361]:
#!M
get_similars(0, warp_model.to_recommender())

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '584    Aladdin (1992)',
 '591    Beauty and the Beast (1991)',
 '33    Babe (1995)',
 '352    Forrest Gump (1994)',
 '1245    Groundhog Day (1993)',
 '1959    Saving Private Ryan (1998)',
 '360    Lion King, The (1994)']

In [None]:
#!M
