### Matrix factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment is split into 4 tasks:
1. Implement SVD factorization using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR (pair-wise loss) on implicit data
4. Implement matrix factorization using WARP (list-wise loss) on implicit data

Soft deadline: September 28 (feedback is written, a grade is given, and fixes are possible until the hard deadline)

Hard deadline: October 5 (final review)

In [36]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

In this assignment we work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [37]:
ratings = pd.read_csv(
  'ml-1m/ratings.dat', delimiter='::', header=None, 
  names=['user_id', 'movie_id', 'rating', 'timestamp'], 
  usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [38]:
movie_info = pd.read_csv(
  'ml-1m/movies.dat', delimiter='::', header=None, 
  names=['movie_id', 'name', 'category'], engine='python')

Explicit data

In [39]:
ratings.head(10)

index,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


To turn the current dataset into an implicit one, let's treat a rating >= 4 as positive feedback.

In [40]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [41]:
implicit_ratings.head(10)

index,user_id,movie_id,rating
0,1,1193,5
3,1,3408,4
4,1,2355,5
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4
10,1,595,5
11,1,938,4
12,1,2398,4


Sparse matrices are more convenient to work with, so let's convert the DataFrame into CSR matrices.

In [42]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
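
To make the conversion above more concrete, here is a small sketch on made-up toy IDs (not MovieLens data; all `toy_*` names are hypothetical). It illustrates two properties the later cells rely on: `coo_matrix` infers its shape from the largest index plus one, so with 1-based IDs row 0 and column 0 stay empty, and a CSR row exposes a user's liked movie IDs through `.indices`.

```python
import numpy as np
import scipy.sparse as sp

# Toy interactions (hypothetical IDs): user 1 liked movies 2 and 4, user 3 liked movie 2.
toy_users = np.array([1, 1, 3])
toy_movies = np.array([2, 4, 2])

toy_user_item = sp.coo_matrix((np.ones_like(toy_users), (toy_users, toy_movies))).tocsr()

# The shape is (max user id + 1, max movie id + 1); index 0 is never used.
print(toy_user_item.shape)
# Column indices of the non-zero entries in row 1 are the movies user 1 liked.
print(sorted(toy_user_item[1].indices.tolist()))
# The transpose is item-by-user: row 2 lists the users who liked movie 2.
print(sorted(toy_user_item.T.tocsr()[2].indices.tolist()))
```

This is also why the embedding tables in the tasks below are allocated with `max id + 1` rows rather than with the number of distinct IDs.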

As an example, we use the ALS factorization from the implicit library.

We set the dimensionality of the latent space to 64; this also determines the size of the user/item embeddings.

In [43]:
FACTOR_RANK = 64
implicit_model = (
  implicit.als.AlternatingLeastSquares(factors=FACTOR_RANK, iterations=100, calculate_training_loss=True))

The loss here is everyone's favorite, RMSE.

In [44]:
implicit_model.fit(user_item_t_csr)





Let's find movies similar to movie_id = 1, Toy Story.

In [45]:
movie_info.head(5)

index,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [46]:
get_similars = (
  lambda item_id, model : [
    movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
    for x in model.similar_items(item_id)])

As we can see, the similar items really do turn out to be similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to dig deeper into how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and inspecting the resulting space.
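
One possible way to get vectors into TensorBoard's Embedding Projector is to write them as TSV files. The sketch below assumes the projector's upload format (one tab-separated vector per line, plus a separate metadata file with one label per line); the function name, the file names and the random demo vectors are hypothetical.

```python
import os
import tempfile

import numpy as np

def export_for_projector(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv (hypothetical file names) into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    vectors_path = os.path.join(out_dir, "vectors.tsv")
    metadata_path = os.path.join(out_dir, "metadata.tsv")
    # One embedding per line, components separated by tabs, no header.
    np.savetxt(vectors_path, np.asarray(vectors), delimiter="\t")
    # One label per line, in the same order as the vectors.
    with open(metadata_path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(f"{label}\n")
    return vectors_path, metadata_path

# Demo with random vectors standing in for real item embeddings.
demo_vectors = np.random.default_rng(0).normal(size=(3, 4))
demo_labels = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]
demo_dir = tempfile.mkdtemp()
vectors_path, metadata_path = export_for_projector(demo_vectors, demo_labels, demo_dir)
print(vectors_path, metadata_path)
```

With a trained model one would pass its item embedding matrix and the matching movie names instead of the demo arrays.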

In [47]:
get_similars(1, implicit_model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '1838    Mulan (1998)',
 '360    Lion King, The (1994)',
 '2618    Tarzan (1999)',
 '1526    Hercules (1997)']

Now let's build recommendations for users.

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations as well.

In [48]:
get_user_history = (
  lambda user_id, implicit_ratings : [
    movie_info[movie_info["movie_id"] == x]["name"].to_string() 
    for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]])

In [49]:
get_user_history(4, implicit_ratings)

['3399    Hustler, The (1961)',
 '2882    Fistful of Dollars, A (1964)',
 '1196    Alien (1979)',
 '1023    Die Hard (1988)',
 '257    Star Wars: Episode IV - A New Hope (1977)',
 '1959    Saving Private Ryan (1998)',
 '476    Jurassic Park (1993)',
 '1180    Raiders of the Lost Ark (1981)',
 '1885    Rocky (1976)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '3349    Thelma & Louise (1991)',
 '3633    Mad Max (1979)',
 '2297    King Kong (1933)',
 '1366    Jaws (1975)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '1220    Terminator, The (1984)']

It worked!

We really did recommend science fiction and action movies to the user; moreover, the list includes sequels to movies the user rated highly.

In [50]:
get_recommendations = (
  lambda user_id, model : [
    movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
    for x in model.recommend(user_id, user_item_csr)])

In [51]:
get_recommendations(4, implicit_model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '2502    Matrix, The (1999)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '3402    Close Encounters of the Third Kind (1977)',
 '1179    Princess Bride, The (1987)',
 '2460    Planet of the Apes (1968)',
 '847    Godfather, The (1972)']

Now it is your turn to implement the most popular matrix factorization algorithms.

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for the user

In [52]:
def test_model(model):
  movie_id = 1
  print(f'Similar to {movie_info[movie_info["movie_id"] == movie_id]["name"].to_string()} are:')
  movie_similars = get_similars(movie_id, model)
  print('\n'.join(movie_similars))
  user_id = 4
  print(f'Recommended movies for the user {user_id} are:')
  recommended_movies = get_recommendations(user_id, model)
  print('\n'.join(recommended_movies))

In [53]:
test_model(implicit_model)

Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
33    Babe (1995)
584    Aladdin (1992)
2315    Babe: Pig in the City (1998)
1838    Mulan (1998)
360    Lion King, The (1994)
2618    Tarzan (1999)
1526    Hercules (1997)
Recommended movies for the user 4 are:
585    Terminator 2: Judgment Day (1991)
1271    Indiana Jones and the Last Crusade (1989)
2502    Matrix, The (1999)
1284    Butch Cassidy and the Sundance Kid (1969)
1182    Aliens (1986)
1178    Star Wars: Episode V - The Empire Strikes Back...
3402    Close Encounters of the Third Kind (1977)
1179    Princess Bride, The (1987)
2460    Planet of the Apes (1968)
847    Godfather, The (1972)


### Task 1. Without using ready-made solutions, implement SVD factorization using SGD on explicit data

In [54]:
RANDOM_SEED = 17
EPOCHS = 10

In [55]:
from sklearn.neighbors import KDTree 

class RecommenderEngine:
  def __init__(self):
    self.movies = None
    self.movie_embeddings = None
    self.user_embeddings = None
    self.user_biases = None
    self.movie_biases = None
    self.gen_bias = None
    self.items_kdtree = None
    
  def fit(self, *args, **kwargs):
    assert self.movies is not None
    assert self.movie_embeddings is not None
    assert self.user_embeddings is not None
    self.items_kdtree = KDTree(self.movie_embeddings)
  
  def similar_items(self, movie_id, n=10):
    distances, nearest_items = self.items_kdtree.query([self.movie_embeddings[movie_id]], n)
    similarities = -distances
    return list(zip(nearest_items.flatten(), similarities.flatten()))
  
  def recommend(self, user_id, user_item_csr, n=10):
    predicted_ratings = []
    liked_set = set(user_item_csr[user_id].indices)
    for movie_id in self.movies:
      if movie_id in liked_set:
        continue
      rating = np.dot(self.user_embeddings[user_id], self.movie_embeddings[movie_id])
      if self.user_biases is not None:
        rating += self.user_biases[user_id]
      if self.movie_biases is not None:
        rating += self.movie_biases[movie_id]
      if self.gen_bias is not None:
        rating += self.gen_bias
      predicted_ratings.append((movie_id, rating))
    predicted_ratings.sort(key=lambda movie_rate: movie_rate[1], reverse=True)
    return predicted_ratings[:n]
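
The `similar_items` method above ranks neighbours by Euclidean distance through a KD-tree. Embeddings learned with a dot-product model are also often compared by cosine similarity, which ignores vector length. The following is a minimal sketch of that alternative, not part of the class above; `cosine_top_k` and the toy matrix are hypothetical.

```python
import numpy as np

def cosine_top_k(embeddings, item_id, k=10):
    """Return (item_id, cosine similarity) pairs for the k most similar rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # guard against all-zero rows
    sims = unit @ unit[item_id]
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy embeddings: rows 0 and 1 point in almost the same direction, row 3 is opposite.
toy_embeddings = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_neighbours = cosine_top_k(toy_embeddings, 0, k=3)
print(toy_neighbours)
```

The brute-force matrix product costs O(number of items) per query, which is acceptable for a catalogue of a few thousand movies.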

In [61]:
import tqdm

class SVDThroughSGD(RecommenderEngine):
  def __init__(self, rank=64, lr=1e-3, max_epochs=100, eps=1e-5, lambd=1e-5):
    super().__init__()
    self.rank = rank
    self.lr = lr
    self.max_epochs = max_epochs
    self.eps = eps
    self.lambd = lambd
  
  def fit(self, ratings, seed=RANDOM_SEED):
    np.random.seed(seed)
    users = ratings['user_id'].unique() 
    movies = ratings['movie_id'].unique()
    print('Users:', len(users))
    print('Movies:', len(movies))
    print('Ratings:', len(ratings), flush=True)
    user_ids_size = np.max(users) + 1
    movie_ids_size = np.max(movies) + 1
    init_limit = 1 / np.sqrt(self.rank)
    user_embeddings = np.random.uniform(0, init_limit, (user_ids_size, self.rank))
    movie_embeddings = np.random.uniform(0, init_limit, (movie_ids_size, self.rank))
    user_biases = np.zeros(user_ids_size)
    movie_biases = np.zeros(movie_ids_size)
    gen_bias = np.mean(ratings['rating'])
    ratings_index = np.array(ratings.index)
    for epoch in range(self.max_epochs):
      # Shuffle indexes to simulate uniform sampling for SGD.
      np.random.shuffle(ratings_index)
      mse_sum = 0
      mse_count = 0
      for rating_index in tqdm.tqdm(ratings_index):
        rating_row = ratings.loc[rating_index]  # ratings_index holds index labels, so use .loc
        user_id = rating_row['user_id']
        movie_id = rating_row['movie_id']
        rating = rating_row['rating']
        user_embedding = user_embeddings[user_id]
        movie_embedding = movie_embeddings[movie_id]
        user_bias = user_biases[user_id]
        movie_bias = movie_biases[movie_id]
        error = (np.dot(user_embedding, movie_embedding) + user_bias + movie_bias + gen_bias) - rating
        user_embedding_delta = self.lr * (error * movie_embedding + self.lambd * user_embedding)
        movie_embedding_delta = self.lr * (error * user_embedding + self.lambd * movie_embedding)
        user_bias_delta = self.lr * (error + self.lambd * user_bias)
        movie_bias_delta = self.lr * (error + self.lambd * movie_bias)
        user_embeddings[user_id] -= user_embedding_delta
        movie_embeddings[movie_id] -= movie_embedding_delta
        user_biases[user_id] -= user_bias_delta
        movie_biases[movie_id] -= movie_bias_delta
        mse_sum += error ** 2
        mse_count += 1
      mse = mse_sum / mse_count
      print(f'Epoch: {epoch + 1}, train MSE={mse}', flush=True)
      if mse < self.eps:
        break
    self.movies = movies
    self.user_embeddings = user_embeddings
    self.movie_embeddings = movie_embeddings
    self.movie_biases = movie_biases
    self.user_biases = user_biases
    self.gen_bias = gen_bias
    super().fit()
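
The same update rule can be written against plain NumPy arrays, which avoids a pandas row lookup for every rating (typically a large share of the cost in a loop like the one above). This is a minimal sketch under that assumption, run on synthetic ratings; `sgd_epoch`, `P`, `Q`, `bu`, `bi` and `mu` are hypothetical names, not part of the class.

```python
import numpy as np

def sgd_epoch(user_ids, movie_ids, values, P, Q, bu, bi, mu, lr, lambd, rng):
    """One SGD pass over all ratings; updates P, Q, bu, bi in place, returns the train MSE."""
    order = rng.permutation(len(values))
    sq_err = 0.0
    for k in order:
        u, i = user_ids[k], movie_ids[k]
        p_u = P[u].copy()  # keep the pre-update user vector for the item update
        err = p_u @ Q[i] + bu[u] + bi[i] + mu - values[k]
        P[u] -= lr * (err * Q[i] + lambd * p_u)
        Q[i] -= lr * (err * p_u + lambd * Q[i])
        bu[u] -= lr * (err + lambd * bu[u])
        bi[i] -= lr * (err + lambd * bi[i])
        sq_err += err ** 2
    return sq_err / len(values)

# Synthetic ratings generated from user and movie effects (not MovieLens data).
rng = np.random.default_rng(17)
n_users, n_movies, rank = 20, 15, 4
user_ids = rng.integers(0, n_users, size=300)
movie_ids = rng.integers(0, n_movies, size=300)
values = 3.0 + rng.normal(size=n_users)[user_ids] + rng.normal(size=n_movies)[movie_ids]

P = rng.uniform(0, 1 / np.sqrt(rank), (n_users, rank))
Q = rng.uniform(0, 1 / np.sqrt(rank), (n_movies, rank))
bu, bi, mu = np.zeros(n_users), np.zeros(n_movies), values.mean()

history = [sgd_epoch(user_ids, movie_ids, values, P, Q, bu, bi, mu, 0.05, 1e-2, rng)
           for _ in range(5)]
print(history)
```

For the real data the three input arrays would come from the `user_id`, `movie_id` and `rating` columns converted with `.to_numpy()`.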

In [62]:
svd_model = SVDThroughSGD(rank=FACTOR_RANK, lr=1e-2, lambd=1e-2, eps=1e-3, max_epochs=EPOCHS)
svd_model.fit(ratings)

Users: 6040
Movies: 3706
Ratings: 1000209


100%|██████████| 1000209/1000209 [03:03<00:00, 5459.22it/s]

Epoch: 1, train MSE=0.9070532182019198



100%|██████████| 1000209/1000209 [03:02<00:00, 5485.31it/s]

Epoch: 2, train MSE=0.8260301885808143



100%|██████████| 1000209/1000209 [03:02<00:00, 5481.32it/s]

Epoch: 3, train MSE=0.8063661627217299



100%|██████████| 1000209/1000209 [03:00<00:00, 5534.59it/s]

Epoch: 4, train MSE=0.7855569607478481



100%|██████████| 1000209/1000209 [03:03<00:00, 5440.53it/s]

Epoch: 5, train MSE=0.7593830454952499



100%|██████████| 1000209/1000209 [03:01<00:00, 5502.03it/s]

Epoch: 6, train MSE=0.7278272380264692



100%|██████████| 1000209/1000209 [03:01<00:00, 5506.91it/s]

Epoch: 7, train MSE=0.6935764914451396



100%|██████████| 1000209/1000209 [03:01<00:00, 5515.26it/s]

Epoch: 8, train MSE=0.6578326241122451



100%|██████████| 1000209/1000209 [03:01<00:00, 5520.75it/s]

Epoch: 9, train MSE=0.6217505845350675



100%|██████████| 1000209/1000209 [03:01<00:00, 5521.41it/s]

Epoch: 10, train MSE=0.5863738460966565





In [63]:
test_model(svd_model)

Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
2618    Tarzan (1999)
2021    Rescuers, The (1977)
1459    Cats Don't Dance (1997)
1551    Air Bud (1997)
565    Little Big League (1994)
1988    Incredible Journey, The (1963)
Recommended movies for the user 4 are:
1189    To Kill a Mockingbird (1962)
2836    Sanjuro (1962)
847    Godfather, The (1972)
900    Casablanca (1942)
3020    Bicycle Thief, The (Ladri di biciclette) (1948)
1950    Seven Samurai (The Magnificent Seven) (Shichin...
1186    Lawrence of Arabia (1962)
3269    For All Mankind (1989)
892    Rear Window (1954)
3238    City Lights (1931)


### Task 2. Without using ready-made solutions, implement matrix factorization using ALS on implicit data

In [64]:
class ALS(RecommenderEngine):
  def __init__(self, rank=64, max_iterations=100, eps=1e-5, lambd=1e-5, confidence=10):
    super().__init__()
    self.rank = rank
    self.max_iterations = max_iterations
    self.eps = eps
    self.lambd = lambd
    self.confidence = confidence

  def fit(self, user_item_csr, seed=RANDOM_SEED):
    np.random.seed(seed)
    cx = sp.coo_matrix(user_item_csr)
    users = np.unique(cx.row)
    movies = np.unique(cx.col)
    print('Users:', len(users))
    print('Movies:', len(movies))
    print('Ratings:', user_item_csr.getnnz(), flush=True)
    
    init_limit = 1 / np.sqrt(self.rank)
    user_ids_size = np.max(users) + 1
    movie_ids_size = np.max(movies) + 1
    user_embeddings = np.random.uniform(0, init_limit, (user_ids_size, self.rank))
    movie_embeddings = np.random.uniform(0, init_limit, (movie_ids_size, self.rank))

    for iteration in range(self.max_iterations):
      if iteration % 2 == 0:
        # Fix movie embeddings, optimize user embeddings.
        movie_dot_movie = (movie_embeddings * movie_embeddings).sum(axis=1, keepdims=True)
        for user_id in tqdm.tqdm(users):
          x_i = np.array(user_item_csr[user_id, :].todense()).reshape((-1, 1))
          c = 1 + self.confidence * x_i
          left_side = (c * movie_dot_movie + self.lambd).sum(axis=0)
          right_side = (c * x_i * movie_embeddings).sum(axis=0)
          user_embeddings[user_id] = right_side / left_side
      else:
        # Fix user embeddings, optimize movie embeddings.
        user_dot_user = (user_embeddings * user_embeddings).sum(axis=1, keepdims=True)
        for movie_id in tqdm.tqdm(movies):
          x_j = np.array(user_item_csr[:, movie_id].todense()).reshape((-1, 1))
          c = 1 + self.confidence * x_j
          left_side = (c * user_dot_user + self.lambd).sum(axis=0)
          right_side = (c * x_j * user_embeddings).sum(axis=0)
          movie_embeddings[movie_id] = right_side / left_side
      mse_sum = 0
      mse_count = 0
      for user_id, movie_id, rating in zip(cx.row, cx.col, cx.data):
        error = 1 - np.dot(user_embeddings[user_id], movie_embeddings[movie_id])
        mse_sum += error**2
        mse_count += 1
      mse = mse_sum / mse_count
      print(f'Iteration: {iteration + 1}, end MSE={mse}', flush=True)
      if mse < self.eps:
        break
    self.movies = movies
    self.user_embeddings = user_embeddings
    self.movie_embeddings = movie_embeddings
    super().fit()
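
The class above updates each embedding with an element-wise ratio. The implicit-feedback ALS formulation usually described in the literature instead solves a small `rank x rank` linear system per user, `(Y^T C_u Y + lambda * I) x_u = Y^T C_u p_u`, with confidence `c_ui = 1 + alpha` for observed items and `1` otherwise. Below is a minimal sketch of that textbook update for binary interactions, offered for comparison rather than as the author's method; `als_user_step` and the toy matrix are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def als_user_step(user_item_csr, item_factors, alpha, lambd):
    """Recompute every user vector with the item factors held fixed."""
    n_users = user_item_csr.shape[0]
    rank = item_factors.shape[1]
    YtY = item_factors.T @ item_factors  # shared by all users
    user_factors = np.zeros((n_users, rank))
    for u in range(n_users):
        start, end = user_item_csr.indptr[u], user_item_csr.indptr[u + 1]
        liked = user_item_csr.indices[start:end]
        if len(liked) == 0:
            continue  # no positives: the regularized solution is the zero vector
        Y_u = item_factors[liked]
        # Y^T C_u Y = Y^T Y + alpha * sum of y_i y_i^T over the liked items.
        A = YtY + alpha * (Y_u.T @ Y_u) + lambd * np.eye(rank)
        # Y^T C_u p_u = (1 + alpha) * sum of y_i over the liked items.
        b = (1.0 + alpha) * Y_u.sum(axis=0)
        user_factors[u] = np.linalg.solve(A, b)
    return user_factors

# Toy block-structured data: users 0 and 1 like items 0 and 1, user 2 likes items 2 and 3.
toy = sp.csr_matrix(np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]))
rng = np.random.default_rng(17)
items = rng.uniform(0, 1 / np.sqrt(2), (toy.shape[1], 2))
for _ in range(10):
    users_f = als_user_step(toy, items, alpha=10.0, lambd=1e-2)
    # The item update is the same computation on the transposed matrix.
    items = als_user_step(toy.T.tocsr(), users_f, alpha=10.0, lambd=1e-2)
scores = users_f @ items.T
print(np.round(scores, 2))
```

Precomputing `Y^T Y` once per half-step keeps the per-user cost proportional to the number of items that user liked.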

In [65]:
als_model = ALS(rank=FACTOR_RANK, confidence=1e4, lambd=1e-2, eps=1e-3, max_iterations=EPOCHS)
als_model.fit(user_item_csr)

Users: 6038
Movies: 3533
Ratings: 575281


100%|██████████| 6038/6038 [00:03<00:00, 1644.45it/s]


Iteration: 1, end MSE=0.06459164032772788


100%|██████████| 3533/3533 [00:06<00:00, 538.02it/s]


Iteration: 2, end MSE=0.0002011533726035453


In [66]:
test_model(als_model)

Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
1245    Groundhog Day (1993)
584    Aladdin (1992)
3045    Toy Story 2 (1999)
38    Clueless (1995)
33    Babe (1995)
3184    Wayne's World (1992)
2252    Pleasantville (1998)
2225    Antz (1998)
360    Lion King, The (1994)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
1353    Star Trek: The Wrath of Khan (1982)
2460    Planet of the Apes (1968)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
108    Braveheart (1995)
585    Terminator 2: Judgment Day (1991)
3630    Starman (1984)
2502    Matrix, The (1999)
1271    Indiana Jones and the Last Crusade (1989)
1355    Star Trek IV: The Voyage Home (1986)


### Task 3. Without using ready-made solutions, implement BPR matrix factorization on implicit data

In [67]:
from collections import namedtuple

PositiveSample = namedtuple('PositiveSample', ('user_id', 'movie_id'))

class BPR(RecommenderEngine):
  def __init__(self, rank=64, lr=1e-3, max_epochs=100, eps=1e-5, lambd=1e-5):
    super().__init__()
    self.rank = rank
    self.lr = lr
    self.max_epochs = max_epochs
    self.eps = eps
    self.lambd = lambd
    
  @staticmethod
  def _create_positives_dataset(user_item_csr, user_ids):
    D = []
    for user_id in user_ids:
      for movie_id in user_item_csr[user_id, :].indices:
        D.append(PositiveSample(user_id, movie_id))
    return D

  def fit(self, user_item_csr, seed=RANDOM_SEED):
    np.random.seed(seed)
    cx = sp.coo_matrix(user_item_csr)
    users = np.unique(cx.row)
    movies = np.unique(cx.col)
    print('Users:', len(users))
    print('Movies:', len(movies))
    print('Ratings:', user_item_csr.getnnz(), flush=True)
    
    init_limit = 1 / np.sqrt(self.rank)
    user_ids_size = np.max(users) + 1
    movie_ids_size = np.max(movies) + 1
    user_embeddings = np.random.uniform(0, init_limit, (user_ids_size, self.rank))
    movie_embeddings = np.random.uniform(0, init_limit, (movie_ids_size, self.rank))
    
    D = BPR._create_positives_dataset(user_item_csr, users)
    print('Dataset len:', len(D), flush=True)
    prev_logloss = None
    for epoch in range(self.max_epochs):
      # Shuffle indexes to simulate uniform sampling for BPR.
      np.random.shuffle(D)
      logloss_sum = 0
      logloss_count = 0
      for user_id, prefers_id in tqdm.tqdm(D):
        while True:
          over_id = np.random.choice(movies, 1)[0]
          if user_item_csr[user_id, over_id] == 0:
            break
        x_ui = np.dot(user_embeddings[user_id], movie_embeddings[prefers_id])
        x_uj = np.dot(user_embeddings[user_id], movie_embeddings[over_id])
        x_uij = x_ui - x_uj
        user_grad = movie_embeddings[prefers_id] - movie_embeddings[over_id]
        prefers_grad = user_embeddings[user_id]
        over_grad = -user_embeddings[user_id]
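        # sigm = sigmoid(-x_uij) = 1 - sigmoid(x_uij): gradient magnitude of -log(sigmoid(x_uij)).
        sigm = np.exp(-x_uij) / (1 + np.exp(-x_uij))
        # The L2 term enters the gradient with a plus sign so that it shrinks the weights.
        user_delt = self.lr * (-sigm * user_grad + self.lambd * user_embeddings[user_id])
        prefers_delt = self.lr * (-sigm * prefers_grad + self.lambd * movie_embeddings[prefers_id])
        over_delt = self.lr * (-sigm * over_grad + self.lambd * movie_embeddings[over_id])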
        user_embeddings[user_id] -= user_delt
        movie_embeddings[prefers_id] -= prefers_delt
        movie_embeddings[over_id] -= over_delt
        
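        # Note: sigm = sigmoid(-x_uij), so this tracks log(1 - sigmoid(x_uij)), which decreases as the ranking improves.
        logloss_sum += np.log(sigm)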
        logloss_count += 1
      logloss = logloss_sum / logloss_count
      print(f'Epoch: {epoch + 1}, train logloss={logloss}', flush=True)
      if prev_logloss is not None and np.abs(logloss - prev_logloss) / np.abs(prev_logloss) < self.eps:
        break
      else:
        prev_logloss = logloss
    self.movies = movies
    self.user_embeddings = user_embeddings
    self.movie_embeddings = movie_embeddings
    super().fit()
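
A single BPR update can be isolated into a small function, which makes the gradient signs easy to check. The sketch below minimizes `-log(sigmoid(x_uij)) + (lambd / 2) * ||theta||^2` for one (user, preferred item, other item) triple and uses `scipy.special.expit` as a numerically stable sigmoid; `bpr_step` and the random vectors are hypothetical.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

def bpr_step(user_vec, pos_vec, neg_vec, lr, lambd):
    """One in-place SGD step for a single triple; returns the loss before the update."""
    u = user_vec.copy()  # pre-update user vector, reused for the item updates
    x_uij = u @ (pos_vec - neg_vec)
    g = expit(-x_uij)  # 1 - sigmoid(x_uij), the gradient magnitude
    pos_update = lr * (g * u - lambd * pos_vec)
    neg_update = lr * (-g * u - lambd * neg_vec)
    user_vec += lr * (g * (pos_vec - neg_vec) - lambd * u)
    pos_vec += pos_update
    neg_vec += neg_update
    return float(-np.log(expit(x_uij)))

rng = np.random.default_rng(17)
user, pos, neg = (rng.uniform(0, 0.5, 8) for _ in range(3))
losses = [bpr_step(user, pos, neg, lr=0.1, lambd=1e-6) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the step on the same triple pushes the preferred item's score above the other item's score, so the loss shrinks toward zero.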

In [68]:
bpr_model = BPR(rank=FACTOR_RANK, lr=1e-1, lambd=1e-6, eps=1e-3, max_epochs=EPOCHS)
bpr_model.fit(user_item_csr)

Users: 6038
Movies: 3533
Ratings: 575281
Dataset len: 575281


100%|██████████| 575281/575281 [00:58<00:00, 9798.05it/s] 

Epoch: 1, train logloss=-2.231225870962534



100%|██████████| 575281/575281 [01:00<00:00, 9532.89it/s] 

Epoch: 2, train logloss=-2.9711421417812134



100%|██████████| 575281/575281 [00:58<00:00, 9859.13it/s] 

Epoch: 3, train logloss=-3.500623415479499



100%|██████████| 575281/575281 [00:58<00:00, 9854.34it/s] 

Epoch: 4, train logloss=-4.013380501955161



100%|██████████| 575281/575281 [00:59<00:00, 9712.95it/s] 

Epoch: 5, train logloss=-4.407699608855628



100%|██████████| 575281/575281 [00:59<00:00, 9694.08it/s] 

Epoch: 6, train logloss=-4.751563375050813



100%|██████████| 575281/575281 [00:59<00:00, 9701.97it/s] 

Epoch: 7, train logloss=-5.092849675276153



100%|██████████| 575281/575281 [00:57<00:00, 9920.78it/s] 

Epoch: 8, train logloss=-5.429901185057159



100%|██████████| 575281/575281 [00:57<00:00, 9940.45it/s] 

Epoch: 9, train logloss=-5.760098347991164



100%|██████████| 575281/575281 [00:57<00:00, 9975.51it/s] 

Epoch: 10, train logloss=-6.080524487294817





In [69]:
test_model(bpr_model)

Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
2252    Pleasantville (1998)
584    Aladdin (1992)
33    Babe (1995)
363    Mask, The (1994)
1245    Groundhog Day (1993)
3045    Toy Story 2 (1999)
496    Mrs. Doubtfire (1993)
2849    Ferris Bueller's Day Off (1986)
2012    Little Mermaid, The (1989)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
847    Godfather, The (1972)
1284    Butch Cassidy and the Sundance Kid (1969)
604    Fargo (1996)
1884    French Connection, The (1971)
1203    Godfather: Part II, The (1974)
1575    L.A. Confidential (1997)
585    Terminator 2: Judgment Day (1991)
2789    American Beauty (1999)
740    Dr. Strangelove or: How I Learned to Stop Worr...


### Task 4. Without using ready-made solutions, implement WARP matrix factorization on implicit data

In [58]:
class WARP(RecommenderEngine):
  def __init__(self, rank=64, lr=1e-3, max_epochs=100, eps=1e-5, lambd=1e-5, M=1, max_negatives=10):
    super().__init__()
    self.rank = rank
    self.lr = lr
    self.max_epochs = max_epochs
    self.eps = eps
    self.lambd = lambd
    self.M = M
    self.max_negatives = max_negatives
    
  @staticmethod
  def _create_positives_dataset(user_item_csr, user_ids):
    D = []
    for user_id in user_ids:
      for movie_id in user_item_csr[user_id, :].indices:
        D.append(PositiveSample(user_id, movie_id))
    return D

  def fit(self, user_item_csr, seed=RANDOM_SEED):
    np.random.seed(seed)
    cx = sp.coo_matrix(user_item_csr)
    users = np.unique(cx.row)
    movies = np.unique(cx.col)
    print('Users:', len(users))
    print('Movies:', len(movies))
    print('Ratings:', user_item_csr.getnnz(), flush=True)
    
    init_limit = 1 / np.sqrt(self.rank)
    user_ids_size = np.max(users) + 1
    movie_ids_size = np.max(movies) + 1
    user_embeddings = np.random.uniform(0, init_limit, (user_ids_size, self.rank))
    movie_embeddings = np.random.uniform(0, init_limit, (movie_ids_size, self.rank))
    
    D = WARP._create_positives_dataset(user_item_csr, users)
    print('Dataset len:', len(D), flush=True)
    prev_loss = None
    for epoch in range(self.max_epochs):
      # Shuffle indexes to simulate uniform sampling for WARP.
      np.random.shuffle(D)
      loss_sum = 0
      loss_count = 0
      for user_id, prefers_id in tqdm.tqdm(D):
        score_prefers = np.dot(user_embeddings[user_id], movie_embeddings[prefers_id])
        for n in range(1, self.max_negatives + 1):
          while True:
            over_id = np.random.choice(movies, 1)[0]
            if user_item_csr[user_id, over_id] == 0:
              break
          score_over = np.dot(user_embeddings[user_id], movie_embeddings[over_id])
          if self.M + score_over - score_prefers > 0:
            # Update since we've found a negative which our model ranks higher (including margin) than the positive.
            rank_approx = np.floor(self.max_negatives / n)
            rank_loss = np.log(rank_approx)
            loss = rank_loss * (self.M + score_over - score_prefers)
            user_grad = movie_embeddings[over_id] - movie_embeddings[prefers_id]
            prefers_grad = -user_embeddings[user_id]
            over_grad = user_embeddings[user_id]
            user_delt = self.lr * (rank_loss * user_grad + self.lambd * user_embeddings[user_id])
            prefers_delt = self.lr * (rank_loss * prefers_grad + self.lambd * movie_embeddings[prefers_id])
            over_delt = self.lr * (rank_loss * over_grad + self.lambd * movie_embeddings[over_id])
            user_embeddings[user_id] -= user_delt
            movie_embeddings[prefers_id] -= prefers_delt
            movie_embeddings[over_id] -= over_delt

            loss_sum += loss
            loss_count += 1
            
            break
        else:
          loss_sum += 0.0
          loss_count += 1

      loss = loss_sum / loss_count if loss_count else 0.0
      print(f'Epoch: {epoch + 1}, train loss={loss}', flush=True)
      self.movies = movies
      self.user_embeddings = user_embeddings
      self.movie_embeddings = movie_embeddings
      super().fit()
      # Show intermediate results for this instance after every epoch (not the global warp_model).
      test_model(self)
      if prev_loss is not None and np.abs(loss - prev_loss) / np.abs(prev_loss) < self.eps:
        break
      else:
        prev_loss = loss
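
The class above weights an update by `log(floor(max_negatives / n))`, which is zero whenever more than half of the allowed draws were needed to find a violating negative. The WARP loss as it is commonly described estimates the rank of the positive item as `floor((number of items - 1) / number of draws)` and weights the update by the harmonic sum `1 + 1/2 + ... + 1/rank`. The sketch below shows only that weighting, as an assumption about the usual formulation rather than a replacement for the code above; `warp_rank_weight` is a hypothetical name.

```python
import numpy as np

def warp_rank_weight(n_items, n_trials):
    """Weight for a violating negative found after n_trials sampled negatives."""
    rank_estimate = (n_items - 1) // n_trials
    # L(k) = 1 + 1/2 + ... + 1/k; an empty sum (rank estimate 0) gives weight 0.
    return float(np.sum(1.0 / np.arange(1, rank_estimate + 1)))

# Fewer draws mean the positive is probably ranked low, hence a larger weight.
weights = [warp_rank_weight(n_items=11, n_trials=n) for n in (1, 2, 5, 10)]
print(weights)
```

With this weighting a violation found on the last allowed draw still produces a non-zero update as long as the rank estimate is at least 1.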

In [59]:
warp_model = WARP(rank=FACTOR_RANK, lr=1e-2, lambd=1e-6, eps=1e-3, max_epochs=EPOCHS)
warp_model.fit(user_item_csr)

Users: 6038
Movies: 3533
Ratings: 575281
Dataset len: 575281


100%|██████████| 575281/575281 [02:36<00:00, 3679.98it/s]

Epoch: 1, train loss=1.1551241952360298




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
1195    GoodFellas (1990)
907    Wizard of Oz, The (1939)
1220    Terminator, The (1984)
1203    Godfather: Part II, The (1974)
900    Casablanca (1942)
293    Pulp Fiction (1994)
2847    Total Recall (1990)
740    Dr. Strangelove or: How I Learned to Stop Worr...
912    2001: A Space Odyssey (1968)
Recommended movies for the user 4 are:
2789    American Beauty (1999)
1178    Star Wars: Episode V - The Empire Strikes Back...
604    Fargo (1996)
847    Godfather, The (1972)
2502    Matrix, The (1999)
2928    Being John Malkovich (1999)
108    Braveheart (1995)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
589    Silence of the Lambs, The (1991)
523    Schindler's List (1993)


100%|██████████| 575281/575281 [03:08<00:00, 3054.29it/s]

Epoch: 2, train loss=1.0670129306581517




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
2502    Matrix, The (1999)
108    Braveheart (1995)
476    Jurassic Park (1993)
352    Forrest Gump (1994)
453    Fugitive, The (1993)
1081    E.T. the Extra-Terrestrial (1982)
1250    Back to the Future (1985)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1220    Terminator, The (1984)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
2789    American Beauty (1999)
589    Silence of the Lambs, The (1991)
585    Terminator 2: Judgment Day (1991)
523    Schindler's List (1993)
315    Shawshank Redemption, The (1994)
293    Pulp Fiction (1994)
1575    L.A. Confidential (1997)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
2693    Sixth Sense, The (1999)


100%|██████████| 575281/575281 [03:15<00:00, 2948.18it/s]

Epoch: 3, train loss=0.9921389227464892




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1179    Princess Bride, The (1987)
1959    Saving Private Ryan (1998)
1250    Back to the Future (1985)
2647    Ghostbusters (1984)
2502    Matrix, The (1999)
1180    Raiders of the Lost Ark (1981)
589    Silence of the Lambs, The (1991)
2286    Bug's Life, A (1998)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
2789    American Beauty (1999)
589    Silence of the Lambs, The (1991)
2693    Sixth Sense, The (1999)
1179    Princess Bride, The (1987)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
108    Braveheart (1995)
847    Godfather, The (1972)
585    Terminator 2: Judgment Day (1991)
0    Toy Story (1995)


100%|██████████| 575281/575281 [03:30<00:00, 2728.22it/s]

Epoch: 4, train loss=0.8468496391129058




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
1179    Princess Bride, The (1987)
2728    Big (1988)
33    Babe (1995)
1899    Breakfast Club, The (1985)
2693    Sixth Sense, The (1999)
1250    Back to the Future (1985)
1245    Groundhog Day (1993)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
2789    American Beauty (1999)
847    Godfather, The (1972)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
585    Terminator 2: Judgment Day (1991)
604    Fargo (1996)
315    Shawshank Redemption, The (1994)
1179    Princess Bride, The (1987)
2502    Matrix, The (1999)
589    Silence of the Lambs, The (1991)


100%|██████████| 575281/575281 [03:44<00:00, 2562.53it/s]

Epoch: 5, train loss=0.7439385024267261




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
1179    Princess Bride, The (1987)
1245    Groundhog Day (1993)
33    Babe (1995)
2918    Who Framed Roger Rabbit? (1988)
1250    Back to the Future (1985)
584    Aladdin (1992)
2728    Big (1988)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
2502    Matrix, The (1999)
847    Godfather, The (1972)
589    Silence of the Lambs, The (1991)
1179    Princess Bride, The (1987)
585    Terminator 2: Judgment Day (1991)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
453    Fugitive, The (1993)
2789    American Beauty (1999)
537    Blade Runner (1982)


100%|██████████| 575281/575281 [03:57<00:00, 2422.17it/s]

Epoch: 6, train loss=0.6758595071790702




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
1179    Princess Bride, The (1987)
584    Aladdin (1992)
33    Babe (1995)
1250    Back to the Future (1985)
2918    Who Framed Roger Rabbit? (1988)
1245    Groundhog Day (1993)
1058    Willy Wonka and the Chocolate Factory (1971)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
604    Fargo (1996)
2789    American Beauty (1999)
847    Godfather, The (1972)
589    Silence of the Lambs, The (1991)
293    Pulp Fiction (1994)
523    Schindler's List (1993)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
585    Terminator 2: Judgment Day (1991)
2693    Sixth Sense, The (1999)


100%|██████████| 575281/575281 [04:05<00:00, 2344.48it/s]

Epoch: 7, train loss=0.6258361717779746




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
33    Babe (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
1245    Groundhog Day (1993)
2918    Who Framed Roger Rabbit? (1988)
1179    Princess Bride, The (1987)
360    Lion King, The (1994)
1178    Star Wars: Episode V - The Empire Strikes Back...
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
847    Godfather, The (1972)
585    Terminator 2: Judgment Day (1991)
2502    Matrix, The (1999)
108    Braveheart (1995)
2789    American Beauty (1999)
1182    Aliens (1986)
523    Schindler's List (1993)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
453    Fugitive, The (1993)


100%|██████████| 575281/575281 [04:12<00:00, 2279.51it/s]

Epoch: 8, train loss=0.5816527068620632




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
33    Babe (1995)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
360    Lion King, The (1994)
1245    Groundhog Day (1993)
1179    Princess Bride, The (1987)
2252    Pleasantville (1998)
2918    Who Framed Roger Rabbit? (1988)
Recommended movies for the user 4 are:
585    Terminator 2: Judgment Day (1991)
1178    Star Wars: Episode V - The Empire Strikes Back...
1192    Star Wars: Episode VI - Return of the Jedi (1983)
847    Godfather, The (1972)
589    Silence of the Lambs, The (1991)
2502    Matrix, The (1999)
453    Fugitive, The (1993)
537    Blade Runner (1982)
108    Braveheart (1995)
2789    American Beauty (1999)


100%|██████████| 575281/575281 [04:21<00:00, 2196.13it/s]

Epoch: 9, train loss=0.5483960923453649




Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
33    Babe (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
1245    Groundhog Day (1993)
2252    Pleasantville (1998)
352    Forrest Gump (1994)
1180    Raiders of the Lost Ark (1981)
1250    Back to the Future (1985)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
847    Godfather, The (1972)
585    Terminator 2: Judgment Day (1991)
2502    Matrix, The (1999)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1182    Aliens (1986)
589    Silence of the Lambs, The (1991)
1203    Godfather: Part II, The (1974)
1568    Hunt for Red October, The (1990)
453    Fugitive, The (1993)


100%|██████████| 575281/575281 [04:29<00:00, 2131.15it/s]

Epoch: 10, train loss=0.5154498543197303





Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
33    Babe (1995)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
1245    Groundhog Day (1993)
2252    Pleasantville (1998)
1250    Back to the Future (1985)
1180    Raiders of the Lost Ark (1981)
1081    E.T. the Extra-Terrestrial (1982)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
847    Godfather, The (1972)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
2502    Matrix, The (1999)
585    Terminator 2: Judgment Day (1991)
1182    Aliens (1986)
108    Braveheart (1995)
589    Silence of the Lambs, The (1991)
1568    Hunt for Red October, The (1990)
453    Fugitive, The (1993)


In [60]:
test_model(warp_model)

Similar to 0    Toy Story (1995) are:
0    Toy Story (1995)
3045    Toy Story 2 (1999)
33    Babe (1995)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
1245    Groundhog Day (1993)
2252    Pleasantville (1998)
1250    Back to the Future (1985)
1180    Raiders of the Lost Ark (1981)
1081    E.T. the Extra-Terrestrial (1982)
Recommended movies for the user 4 are:
1178    Star Wars: Episode V - The Empire Strikes Back...
847    Godfather, The (1972)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
2502    Matrix, The (1999)
585    Terminator 2: Judgment Day (1991)
1182    Aliens (1986)
108    Braveheart (1995)
589    Silence of the Lambs, The (1991)
1568    Hunt for Red October, The (1990)
453    Fugitive, The (1993)
