### Matrix Factorizations

In this assignment you will get hands-on experience with matrix factorizations.
The assignment consists of 4 tasks:
1. Implement SVD factorization using SGD on explicit data
2. Implement matrix factorization using ALS on implicit data
3. Implement matrix factorization using BPR (pair-wise loss) on implicit data
4. Implement matrix factorization using WARP (list-wise loss) on implicit data

Soft deadline: September 28 (comments are given, a grade is assigned, and you can make corrections before the hard deadline)

Hard deadline: October 5 (final review)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

In this assignment we will work with the explicit MovieLens dataset, which contains (user_id, movie_id) pairs together with the rating the user gave to the movie.

The dataset can be downloaded from https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [3]:
movie_info = pd.read_csv('movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit data

In [4]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


In [5]:
ratings = ratings.sort_values(by = ['user_id', 'movie_id']) 

In [6]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,5
25,1,48,5
39,1,150,5
44,1,260,4
23,1,527,5
49,1,531,4
33,1,588,4
8,1,594,4
10,1,595,5
51,1,608,4


In [7]:
print(np.max(ratings['user_id']))
print(np.max(ratings['movie_id']))

6040
3952


To convert the current dataset into an implicit one, let's treat a rating >= 4 as positive feedback

In [8]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [9]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,5
25,1,48,5
39,1,150,5
44,1,260,4
23,1,527,5
49,1,531,4
33,1,588,4
8,1,594,4
10,1,595,5
51,1,608,4


In [10]:
implicit_ratings = implicit_ratings.assign(rating=1)  # returns a new DataFrame, so no SettingWithCopyWarning


In [11]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,1
25,1,48,1
39,1,150,1
44,1,260,1
23,1,527,1
49,1,531,1
33,1,588,1
8,1,594,1
10,1,595,1
51,1,608,1


Sparse matrices are more convenient to work with, so let's convert the DataFrame into CSR matrices

In [12]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()
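
A tiny illustration of the same conversion on made-up data may help; the `toy_users` and `toy_movies` arrays are hypothetical. Because the ids are 1-based, row 0 and column 0 of the resulting matrix stay empty, which is harmless here but worth remembering when indexing.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical 1-based ids for three positive interactions.
toy_users = np.array([1, 1, 2])
toy_movies = np.array([1, 3, 2])

# COO is convenient for construction; CSR is efficient for row slicing.
toy_coo = sp.coo_matrix((np.ones_like(toy_users), (toy_users, toy_movies)))
toy_csr = toy_coo.tocsr()

print(toy_csr.shape)       # (max user id + 1, max movie id + 1)
print(toy_csr[1].indices)  # column indices of the movies user 1 liked
```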

As an example, we will use the ALS factorization from the implicit library

We set the dimensionality of the latent space to 64; this also determines the size of the user/item embeddings

In [13]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



The loss here is everyone's favorite RMSE
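
For reference, here is a minimal sketch of RMSE restricted to observed entries, which is how the hand-written models later in this notebook monitor training. The function name and the toy arrays are illustrative assumptions, and this is not necessarily the exact quantity that `implicit` reports as its training loss.

```python
import numpy as np

def rmse_on_observed(pred, rows, cols, values):
    """RMSE computed only over the observed (row, col) entries."""
    diff = pred[rows, cols] - values
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical 2x2 prediction matrix and three observed ratings.
toy_pred = np.array([[4.0, 2.0], [3.0, 5.0]])
toy_rows = np.array([0, 0, 1])
toy_cols = np.array([0, 1, 1])
toy_vals = np.array([5.0, 2.0, 4.0])

toy_rmse = rmse_on_observed(toy_pred, toy_rows, toy_cols, toy_vals)
print(toy_rmse)
```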

In [14]:
model.fit(user_item_t_csr)

HBox(children=(FloatProgress(value=0.0), HTML(value='')))




Let's find movies similar to movie_id = 1, Toy Story

In [15]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

As we can see, the similar items really are similar.

The quality of similar items is often a good way to check the quality of an algorithm.

P.S. If you want to understand more deeply how different algorithms shape different latent spaces, I recommend loading the resulting vectors into TensorBoard and inspecting the space they form
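
One common way to look up similar items from learned embeddings is cosine similarity between item vectors. The sketch below assumes a hypothetical `item_factors` array with one row per item; it is not the internal implementation of `model.similar_items`.

```python
import numpy as np

def top_k_similar(item_factors, item_idx, k=3):
    """Indices of the k items closest to item_idx by cosine similarity."""
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / np.maximum(norms, 1e-12)  # guard against zero rows
    scores = unit @ unit[item_idx]
    return np.argsort(-scores)[:k]

# Hypothetical 4 items with 2-dimensional embeddings.
toy_items = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
toy_neighbors = top_k_similar(toy_items, item_idx=0, k=2)
print(toy_neighbors)
```

The query item itself comes first, since it has similarity 1 with itself.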

In [17]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '360    Lion King, The (1994)',
 '2315    Babe: Pig in the City (1998)',
 '1838    Mulan (1998)',
 '2618    Tarzan (1999)',
 '1526    Hercules (1997)']

Now let's build recommendations for users

As we can see, the user likes science fiction, so we expect to see science fiction in the recommendations too

In [18]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [19]:
get_user_history(4, implicit_ratings)

['257    Star Wars: Episode IV - A New Hope (1977)',
 '476    Jurassic Park (1993)',
 '1023    Die Hard (1988)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '1180    Raiders of the Lost Ark (1981)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '1196    Alien (1979)',
 '1220    Terminator, The (1984)',
 '1366    Jaws (1975)',
 '1885    Rocky (1976)',
 '1959    Saving Private Ryan (1998)',
 '2297    King Kong (1933)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '2882    Fistful of Dollars, A (1964)',
 '3349    Thelma & Louise (1991)',
 '3399    Hustler, The (1961)',
 '3633    Mad Max (1979)']

It worked!

We did recommend science fiction and action movies to the user; moreover, the list includes sequels to movies the user rated highly
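
Conceptually, recommending from a factorization amounts to scoring every item by the dot product with the user vector and dropping items the user has already seen. A minimal sketch under that assumption follows; the function and the toy arrays are hypothetical and do not reproduce the library's `recommend` method.

```python
import numpy as np

def recommend_top_k(user_vec, item_factors, seen_items, k=2):
    """Score every item by dot product and return the best unseen ones."""
    scores = (item_factors @ user_vec).astype(float)
    scores[list(seen_items)] = -np.inf  # never recommend what was already seen
    return np.argsort(-scores)[:k]

toy_user = np.array([1.0, 0.5])
toy_item_factors = np.array([[1.0, 1.0], [0.9, 0.0], [0.0, 1.0], [0.2, 0.2]])
toy_recs = recommend_top_k(toy_user, toy_item_factors, seen_items={0}, k=2)
print(toy_recs)
```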

In [20]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [21]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '2502    Matrix, The (1999)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1182    Aliens (1986)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '3402    Close Encounters of the Third Kind (1977)',
 '847    Godfather, The (1972)',
 '2460    Planet of the Apes (1968)',
 '2880    Dr. No (1962)']

Now it's your turn to implement the most popular matrix factorization algorithms

What will be graded:
1. Correctness of the algorithm
2. Quality of the resulting similar items
3. Quality of the final recommendations for a user

### Task 1. Without using off-the-shelf solutions, implement SVD factorization using SGD on explicit data

In [22]:
# class svd:
#     def __init__(self, latent=64, regn=0.01, lr=0.000001 / 5.0, bias_lr = 1e-1, max_iter=300000, eps=0.0):
#         self.latent = latent
#         self.regn = regn
#         self.lr = lr
#         self.max_iter=max_iter
#         self.eps = eps
#         self.bias_lr = bias_lr
        
#     def fit(self, df, movie_info):
#         #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
#         self.df = df
#         self.movie_info = movie_info
        
#         self.total_users = np.max(df['user_id'])
#         self.total_items = np.max(df['movie_id'])
#         total_len = len(df)
        
#         W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
#         H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
#         B_u = np.zeros((self.total_users, 1))
#         B_i = np.zeros((1, self.total_items))
#         mu = self.df['rating'].mean()
        
        
#         cur_iter = 0
        
#         while cur_iter <= self.max_iter:
#             if cur_iter % 20 == 0:
#                 pred = W@H + B_u + B_i + mu
#                 rmse = np.linalg.norm(pred[df['user_id']-1, df['movie_id']-1] - df['rating'])
#                 rmse /= np.sqrt(total_len)
#                 if rmse <= self.eps:
#                     print(f"Let's stop on iter {cur_iter}!")
#                     break
                    
#                 print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
                
#             error = W @ H + B_u + B_i + mu
#             error[df['user_id']-1, df['movie_id']-1] -= df['rating']
        
#             W = W - self.lr * (self.regn * W + error @ H.T)
#             H = H - self.lr * (self.regn * H + W.T @ error)
            
#             B_u = B_u - self.bias_lr * (np.mean(error, axis=1).reshape(-1, 1) + self.regn * B_u)
#             B_i = B_i - self.bias_lr * (np.mean(error, axis=0).reshape(1, -1) + self.regn * B_i)
            
#             cur_iter += 1
            
#             self.w, self.h, self.b_u, self.b_i, self.mu = W, H, B_u, B_i, mu
        
#     def get_history(self, cur_user_id):
#         history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
#         return history
        
#     def get_recommendations(self, cur_user_id, best_k=5):
#         all_predicted_ratings = (self.w @ self.h + self.b_u + self.b_i + self.mu)[cur_user_id - 1]
        
#         known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
#         need_to_predict = [i for i in movie_info["movie_id"] if i not in known_ratings]
        
#         need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
#         res = np.array(need_to_predict[:best_k])
        
#         best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
#         return best_names
    
#     def get_similars(self, cur_movie_id, best_k=5):       
#         cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
#         all_films = [i for i in movie_info["movie_id"]]
        
#         all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
#         res = np.array(all_films[:best_k])
        
#         best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
#         return best_names

In [23]:
class svd:
    def __init__(self, latent=64, regn=0.05, lr=0.01, bias_lr=0.01, max_iter=300000, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        self.bias_lr = bias_lr
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        B_u = np.zeros((self.total_users, 1))
        B_i = np.zeros((1, self.total_items))
        mu = self.df['rating'].mean()
        
        
        cur_iter = 0
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 100000 == 0:
                rmse = np.linalg.norm((W@H + B_u + B_i + mu)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
                            
            cur_id = np.random.randint(low=0, high=total_len)
            i = df.iloc[cur_id]['user_id'] - 1
            j = df.iloc[cur_id]['movie_id'] - 1
            value = df.iloc[cur_id]['rating']
            
            error = W[i, :] @ H[:, j] + B_u[i] + B_i[0, j] + mu - value
            
            # keep the pre-update user vector so both factor updates use the same iterate
            w_old = W[i, :].copy()
            W[i, :] = w_old * (1 - self.lr * self.regn) - self.lr * error * H[:, j]
            H[:, j] = H[:, j] * (1 - self.lr * self.regn) - self.lr * error * w_old
            B_u[i] = B_u[i] - self.bias_lr * (error + self.regn * B_u[i])
            B_i[0, j] = B_i[0, j] - self.bias_lr * (error + self.regn * B_i[0, j])
            
            cur_iter += 1
            
        self.w, self.h, self.b_u, self.b_i, self.mu = W, H, B_u, B_i, mu
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h + self.b_u + self.b_i + self.mu)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in self.movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in self.movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
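
The per-rating update inside `fit` can be isolated into a small function, which makes it easy to check on toy numbers. This is a sketch only: `sgd_step` and its toy inputs are hypothetical, a single learning rate is used for factors and biases, and both factor updates are computed from the values before the step.

```python
import numpy as np

def sgd_step(w_u, h_i, b_u, b_i, mu, rating, lr=0.01, reg=0.05):
    """One SGD update for a single (user, item, rating) triple."""
    err = w_u @ h_i + b_u + b_i + mu - rating
    # Gradient of 0.5 * err**2 + 0.5 * reg * (squared norms of the parameters).
    new_w = w_u - lr * (err * h_i + reg * w_u)
    new_h = h_i - lr * (err * w_u + reg * h_i)
    new_b_u = b_u - lr * (err + reg * b_u)
    new_b_i = b_i - lr * (err + reg * b_i)
    return new_w, new_h, new_b_u, new_b_i, err

toy_w, toy_h = np.array([0.1, 0.2]), np.array([0.3, 0.1])
toy_w2, toy_h2, toy_bu2, toy_bi2, err_before = sgd_step(toy_w, toy_h, 0.0, 0.0, 3.5, 5.0)
err_after = toy_w2 @ toy_h2 + toy_bu2 + toy_bi2 + 3.5 - 5.0
print(err_before, err_after)
```

With a small learning rate, the absolute error on the sampled rating shrinks after the step.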

# Changelog

I tried rewriting SGD so that the error is computed over the whole dataset at once (that is, I treat missing ratings as zeros). I played with that variant; it overfits very quickly, which makes the learning rate hard to tune.
The recommendations are still poor.

Rolled back to the old one-row-at-a-time SGD, tweaked the parameters a bit, and left it running overnight; it doesn't seem to have helped much :(

In [24]:
%%time

# for all-in train
#model = svd(latent=128, regn=1e-3, lr=1e-6, bias_lr=1e-5, max_iter=100)

# for pointwise train
model = svd(max_iter=50000000)

model.fit(ratings, movie_info)

current: 0/50000000, rmse: 1.1453288253375367
current: 100000/50000000, rmse: 1.0060925446820992
current: 200000/50000000, rmse: 0.9677147430510444
current: 300000/50000000, rmse: 0.949669852905966
current: 400000/50000000, rmse: 0.9389639164903757
current: 500000/50000000, rmse: 0.9315487852721415
current: 600000/50000000, rmse: 0.9266569434156475
current: 700000/50000000, rmse: 0.9226740603633001
current: 800000/50000000, rmse: 0.9197601124651456
current: 900000/50000000, rmse: 0.9172803618575288
current: 1000000/50000000, rmse: 0.9152093513695219
current: 1100000/50000000, rmse: 0.9132787302883907
current: 1200000/50000000, rmse: 0.9117147519089416
current: 1300000/50000000, rmse: 0.9104492637483208
current: 1400000/50000000, rmse: 0.9089297670814532
current: 1500000/50000000, rmse: 0.9080977502253093
current: 1600000/50000000, rmse: 0.9072392666219039
current: 1700000/50000000, rmse: 0.9061715442886558
current: 1800000/50000000, rmse: 0.9052136308886759
current: 1900000/50000000, r

current: 15800000/50000000, rmse: 0.7861720524891522
current: 15900000/50000000, rmse: 0.7854133122116048
current: 16000000/50000000, rmse: 0.7846975225213291
current: 16100000/50000000, rmse: 0.783728007290901
current: 16200000/50000000, rmse: 0.7830628935126949
current: 16300000/50000000, rmse: 0.7824085956377149
current: 16400000/50000000, rmse: 0.7817352119980125
current: 16500000/50000000, rmse: 0.7809872451990252
current: 16600000/50000000, rmse: 0.7802635791817509
current: 16700000/50000000, rmse: 0.779262311812784
current: 16800000/50000000, rmse: 0.7787431213258748
current: 16900000/50000000, rmse: 0.7779388979918948
current: 17000000/50000000, rmse: 0.776896841368273
current: 17100000/50000000, rmse: 0.7759585624386066
current: 17200000/50000000, rmse: 0.7755024216241311
current: 17300000/50000000, rmse: 0.7746959193555155
current: 17400000/50000000, rmse: 0.7742100898552152
current: 17500000/50000000, rmse: 0.7732197785782083
current: 17600000/50000000, rmse: 0.7725245835229

current: 31300000/50000000, rmse: 0.696276868560214
current: 31400000/50000000, rmse: 0.6959514540741275
current: 31500000/50000000, rmse: 0.6956921000623757
current: 31600000/50000000, rmse: 0.6952807768580267
current: 31700000/50000000, rmse: 0.6947665457297052
current: 31800000/50000000, rmse: 0.6944289080866801
current: 31900000/50000000, rmse: 0.694152772454785
current: 32000000/50000000, rmse: 0.6938128084739582
current: 32100000/50000000, rmse: 0.6936569577812468
current: 32200000/50000000, rmse: 0.693045087909284
current: 32300000/50000000, rmse: 0.6928477494245995
current: 32400000/50000000, rmse: 0.6924110773419716
current: 32500000/50000000, rmse: 0.6920911341155277
current: 32600000/50000000, rmse: 0.6916024088051735
current: 32700000/50000000, rmse: 0.6915489209480536
current: 32800000/50000000, rmse: 0.6909810501126851
current: 32900000/50000000, rmse: 0.6906243082335436
current: 33000000/50000000, rmse: 0.690332697984782
current: 33100000/50000000, rmse: 0.69001949360362

current: 46800000/50000000, rmse: 0.6603549589380208
current: 46900000/50000000, rmse: 0.6600859725966448
current: 47000000/50000000, rmse: 0.6598998140075305
current: 47100000/50000000, rmse: 0.6599166572344606
current: 47200000/50000000, rmse: 0.6600893774012475
current: 47300000/50000000, rmse: 0.6599513257064589
current: 47400000/50000000, rmse: 0.6598367656105857
current: 47500000/50000000, rmse: 0.6596865514915873
current: 47600000/50000000, rmse: 0.6593343093379636
current: 47700000/50000000, rmse: 0.6592623916700194
current: 47800000/50000000, rmse: 0.6590558132543185
current: 47900000/50000000, rmse: 0.6590466190559496
current: 48000000/50000000, rmse: 0.6590825215832857
current: 48100000/50000000, rmse: 0.6589748609574819
current: 48200000/50000000, rmse: 0.6587852565633607
current: 48300000/50000000, rmse: 0.6588415064074743
current: 48400000/50000000, rmse: 0.6587339723426011
current: 48500000/50000000, rmse: 0.6584159079163778
current: 48600000/50000000, rmse: 0.6582088465

In [25]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1178    Star Wars: Episode V - The Empire Strikes Back...
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3458    Predator (1987)
3633    Mad Max (1979)

So we recommend him:
1185    12 Angry Men (1957)
1189    To Kill a Mockingbird (1962)
3026    Grapes of Wrath, The (1940)
1186    Lawrence of Arabia (1962)
1745    Hana-bi (1997)
2169    Seven Beauties (Pasqualino Settebellezze) (1976)
2961    Yojimbo (1961)
3366    D

In [26]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
3538    One Little Indian (1973)
526    Second Best (1994)
2610    Finding North (1999)
3223    Big Combo, The (1955)
697    Sunset Park (1996)
973    Small Wonders (1996)
584    Aladdin (1992)


### Task 2. Without using off-the-shelf solutions, implement matrix factorization using ALS on implicit data

In [27]:
class als:
    def __init__(self, latent=64, regn=0.001, lr=0.001, max_iter=300, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 10 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
                        
            
            V = W @ H
            V[df['user_id']-1, df['movie_id']-1] = V[df['user_id']-1, df['movie_id']-1] - self.df['rating']
            
            if cur_iter % 2 == 0:
                W = W - self.lr * (V @ H.T + self.regn * W)
            else:
                H = H - self.lr * (W.T @ V + self.regn * H)
            
            cur_iter += 1
            
        self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in self.movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in self.movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
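
Note that the class above alternates gradient steps on `W` and `H`. Textbook ALS instead solves each half-problem in closed form as a ridge regression. The sketch below shows that variant on a small dense matrix; it is an unweighted simplification (no confidence weights, as used in implicit-feedback ALS), and `als_fit` with all its parameters is a hypothetical name introduced for illustration.

```python
import numpy as np

def als_fit(R, latent=2, reg=0.1, n_iter=5, seed=0):
    """Alternating least squares on a small dense matrix R (users x items)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.random((n_users, latent))
    H = rng.random((latent, n_items))
    ridge = reg * np.eye(latent)
    losses = []
    for _ in range(n_iter):
        # Fix H and solve the ridge regression for all user rows at once.
        W = np.linalg.solve(H @ H.T + ridge, H @ R.T).T
        # Fix W and solve the ridge regression for all item columns at once.
        H = np.linalg.solve(W.T @ W + ridge, W.T @ R)
        loss = np.sum((R - W @ H) ** 2) + reg * (np.sum(W ** 2) + np.sum(H ** 2))
        losses.append(loss)
    return W, H, losses

# Hypothetical 3 users x 4 items binary interaction matrix.
toy_R = np.array([[1, 1, 0, 0], [1, 0, 0, 1], [0, 0, 1, 1]], dtype=float)
toy_W, toy_H, toy_losses = als_fit(toy_R)
print(toy_losses)
```

Because each half-step exactly minimizes the regularized objective over one factor, the recorded loss cannot increase from one iteration to the next, and no learning rate needs tuning.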

# Changelog

Didn't change anything :)

In [28]:
%%time

model = als(max_iter=100)

model.fit(implicit_ratings, movie_info)

current: 0/100, rmse: 0.7503608384221314
current: 10/100, rmse: 0.90597258132075
current: 20/100, rmse: 0.8607868042077562
current: 30/100, rmse: 0.8353893755121795
current: 40/100, rmse: 0.8250661544029566
current: 50/100, rmse: 0.8179918211742806
current: 60/100, rmse: 0.8074990433571867
current: 70/100, rmse: 0.7929483615984505
current: 80/100, rmse: 0.7779010477515267
current: 90/100, rmse: 0.7642025935470674
current: 100/100, rmse: 0.7525292119541259
CPU times: user 39.6 s, sys: 9.31 s, total: 48.9 s
Wall time: 11.8 s


In [29]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
1178    Star Wars: Episode V - The Empire Strikes Back...
1192    Star Wars: Episode VI - Return of the Jedi (1983)
585    Terminator 2: Judgment Day (1991)
1182    Aliens (1986)
2502    Matrix, The (1999)
108    Braveheart (1995)
847    Godfather, The (1972)
1271    Indiana Jones and the Last Crusade (1989)
1284    Butch Cassidy and the Sundance Kid (1969)
1203    Godfather: Part II,

In [30]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
1245    Groundhog Day (1993)
584    Aladdin (1992)
1250    Back to the Future (1985)
33    Babe (1995)
1179    Princess Bride, The (1987)
1287    When Harry Met Sally... (1989)
2327    Shakespeare in Love (1998)


### Task 3. Without using off-the-shelf solutions, implement BPR matrix factorization on implicit data

In [31]:
class bpr:
    def __init__(self, latent=100, regn=0.0001, lr=0.0001, max_iter=300, eps=0.0, sample_size=5):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter = max_iter
        self.eps = eps
        self.sample_size = sample_size
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        self.total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
        
        all_movies = [i for i in movie_info["movie_id"]]
        
        seen_movies_by_user = {}
        not_seen_movies_by_user = {}
        for u in range(1, self.total_users + 1):
            seen_movies_by_user[u] = np.array(self.df[self.df['user_id'] == u]['movie_id'])
            not_seen_movies_by_user[u] = np.array([i for i in all_movies if i not in seen_movies_by_user[u]])
            
        print('Precalc is completed!')
        
        while cur_iter <= self.max_iter:
            if cur_iter % 1 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(self.total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
            
            for u in range(1, self.total_users + 1):
                for i in seen_movies_by_user[u]:
                    neg_sample = np.random.choice(not_seen_movies_by_user[u], size=self.sample_size, replace=False)
                    for j in neg_sample:

                        x_uij = (W[u - 1, :] @ H[:, i - 1]) - (W[u - 1, :] @ H[:, j - 1])

                        exp_x = np.exp(x_uij)
                        sigmoid = 1 / (1 + exp_x)

                        W[u - 1, :] += self.lr * (sigmoid * (H[:, i - 1] - H[:, j - 1]) - self.regn * W[u - 1, :])

                        H[:, i - 1] += self.lr * (sigmoid * (W[u - 1, :]) - self.regn * H[:, i - 1])

                        H[:, j - 1] += self.lr * (sigmoid * (-W[u - 1, :]) - self.regn * H[:, j - 1])
            
            cur_iter += 1
            
        self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in self.movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in self.movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
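
The inner update of `fit` can be checked in isolation on a single (user, positive item, negative item) triple. The sketch below mirrors that update rule but computes all three gradients from the values before the step; `bpr_step` and the toy vectors are hypothetical.

```python
import numpy as np

def bpr_step(w_u, h_i, h_j, lr=0.1, reg=0.01):
    """One ascent step on ln(sigmoid(x_uij)) with L2 regularization."""
    x_uij = w_u @ (h_i - h_j)
    # d/dx ln(sigmoid(x)) = 1 - sigmoid(x) = 1 / (1 + exp(x))
    g = 1.0 / (1.0 + np.exp(x_uij))
    new_w = w_u + lr * (g * (h_i - h_j) - reg * w_u)
    new_h_i = h_i + lr * (g * w_u - reg * h_i)
    new_h_j = h_j + lr * (-g * w_u - reg * h_j)
    return new_w, new_h_i, new_h_j

toy_w = np.array([0.5, 0.1])
toy_hi = np.array([0.2, 0.3])   # item the user liked
toy_hj = np.array([0.4, 0.1])   # sampled item the user has not seen
x_before = toy_w @ (toy_hi - toy_hj)
toy_w2, toy_hi2, toy_hj2 = bpr_step(toy_w, toy_hi, toy_hj)
x_after = toy_w2 @ (toy_hi2 - toy_hj2)
print(x_before, x_after)
```

After the step the positive item scores higher relative to the negative one, which is exactly what the pairwise objective asks for.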

# Changelog

Rewrote the training loop in terms of epochs. One epoch is a pass over all users and all of their watched movies; for each pair I sample five unwatched movies. Training became a bit more lively, but there are still problems with overfitting. I could not find a decent lr at which the model learns but does not overfit.
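
Sampling unwatched movies does not require precomputing the full list of unseen items per user. One alternative is rejection sampling against a set of seen ids, sketched below. It assumes item ids form a contiguous range from 1 to `n_items`, which is only approximately true for MovieLens, and `sample_negatives` is a hypothetical helper rather than part of the class above.

```python
import numpy as np

def sample_negatives(seen, n_items, size, rng):
    """Draw `size` distinct 1-based item ids the user has not interacted with."""
    seen = set(seen)
    if size > n_items - len(seen):
        raise ValueError("not enough unseen items to sample from")
    negatives = set()
    while len(negatives) < size:
        candidate = int(rng.integers(1, n_items + 1))  # ids in [1, n_items]
        if candidate not in seen:
            negatives.add(candidate)
    return sorted(negatives)

toy_seen = {1, 2, 3}
toy_neg = sample_negatives(toy_seen, n_items=20, size=5, rng=np.random.default_rng(0))
print(toy_neg)
```

Since most users have seen only a small fraction of the catalogue, few candidates are rejected, and membership checks against a set are constant-time.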

In [32]:
%%time

model = bpr(max_iter=10)

model.fit(implicit_ratings, movie_info)

Precalc is completed!
current: 0/10, rmse: 0.7500708345285706
current: 1/10, rmse: 0.7172349614431065
current: 2/10, rmse: 0.6855922688001498
current: 3/10, rmse: 0.654825636866169
current: 4/10, rmse: 0.6249289536875398
current: 5/10, rmse: 0.5961073230020715
current: 6/10, rmse: 0.5689334490471091
current: 7/10, rmse: 0.5441393658368533
current: 8/10, rmse: 0.522755020522763
current: 9/10, rmse: 0.5059630377248402
current: 10/10, rmse: 0.49495727623936014
CPU times: user 22min 17s, sys: 1.07 s, total: 22min 18s
Wall time: 22min 9s


In [33]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
2789    American Beauty (1999)
1178    Star Wars: Episode V - The Empire Strikes Back...
589    Silence of the Lambs, The (1991)
2502    Matrix, The (1999)
2693    Sixth Sense, The (1999)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
604    Fargo (1996)
523    Schindler's List (1993)
585    Terminator 2: Judgment Day (1991)
315    Shawshank Redemption, The (1994)


In [34]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
2928    Being John Malkovich (1999)
293    Pulp Fiction (1994)
1081    E.T. the Extra-Terrestrial (1982)
1245    Groundhog Day (1993)
352    Forrest Gump (1994)
49    Usual Suspects, The (1995)
1539    Men in Black (1997)
2647    Ghostbusters (1984)
476    Jurassic Park (1993)


### Task 4. Without using ready-made solutions, implement WARP matrix factorization on implicit data

In [22]:
class warp:
    def __init__(self, latent=64, regn=1, lr=0.0001, max_iter=300, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        self.total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
            
        
        all_movies = [i for i in movie_info["movie_id"]]
        
        seen_movies_by_user = {}
        not_seen_movies_by_user = {}
        for u in range(1, self.total_users + 1):
            seen_movies_by_user[u] = np.array(self.df[self.df['user_id'] == u]['movie_id'])
            # Every catalogue movie the user has not interacted with (candidate negatives)
            not_seen_movies_by_user[u] = np.setdiff1d(all_movies, seen_movies_by_user[u])
            
        print('Precalc is completed!')
        
        while cur_iter <= self.max_iter:
            if cur_iter % 1 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(self.total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
            
            for u in range(1, self.total_users + 1):
                for i in seen_movies_by_user[u]:
                    #u_id = np.random.randint(self.total_len)
                    #u = self.df.iloc[u_id]['user_id']
                    #i = np.random.choice(seen_movies_by_user[u])

                    pred = W[u - 1] @ H[:, i - 1]
                    unseen_for_user = not_seen_movies_by_user[u]

                    q = 0
                    for j in np.random.permutation(unseen_for_user):
                        q += 1
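                        # Margin violation: negative j scores within 1 of positive i
                        if W[u - 1] @ H[:, j - 1] + 1 > pred:
                            # Rank-based WARP weight: fewer draws q means a larger step
                            log_proba = np.log(len(unseen_for_user) / q)

                            # Snapshot the vectors so every update uses pre-step values
                            w_u = W[u - 1].copy()
                            h_i = H[:, i - 1].copy()
                            h_j = H[:, j - 1].copy()

                            # Hinge-loss gradient step with L2 shrinkage (regularization is subtracted)
                            W[u - 1] = w_u - self.lr * (log_proba * (h_j - h_i) + self.regn * w_u)
                            H[:, i - 1] = h_i + self.lr * (log_proba * w_u - self.regn * h_i)
                            H[:, j - 1] = h_j - self.lr * (log_proba * w_u + self.regn * h_j)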
                            break
            
            cur_iter += 1
            
            self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in self.movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in self.movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names

# Changelog

As with bpr, I switched training to epochs. Overfitting is even more pronounced here, and I again could not find a suitable lr.
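
For reference, a single WARP update for one (user, positive item) pair can be sketched in isolation. This is a minimal sketch under stated assumptions rather than the exact code of the `warp` class: the function name `warp_step`, the 0-based indexing, and the `margin` parameter are hypothetical, and the rank weight `log(len(unseen) / q)` follows the class above rather than any library implementation.

```python
import numpy as np

def warp_step(W, H, u, i, unseen, lr=0.01, reg=0.0, margin=1.0, rng=None):
    """One WARP update for user row u and positive item column i (0-based).

    W: (n_users, k) user factors, H: (k, n_items) item factors.
    Returns the number of negatives drawn, or None if no violator was found.
    (hypothetical helper; names are illustrative)
    """
    rng = np.random.default_rng() if rng is None else rng
    pos_score = W[u] @ H[:, i]
    for q, j in enumerate(rng.permutation(unseen), start=1):
        if W[u] @ H[:, j] + margin > pos_score:
            # Fewer draws before a violator -> positive is ranked low -> larger weight.
            weight = np.log(len(unseen) / q)
            # Copy so that all three updates use the pre-step vectors.
            w_u, h_i, h_j = W[u].copy(), H[:, i].copy(), H[:, j].copy()
            W[u] = w_u - lr * (weight * (h_j - h_i) + reg * w_u)
            H[:, i] = h_i + lr * (weight * w_u - reg * h_i)
            H[:, j] = h_j - lr * (weight * w_u + reg * h_j)
            return q
    return None

# Toy usage: one user, three items, item 0 is the positive.
W = np.array([[1.0, 0.0]])
H = np.zeros((2, 3))
draws = warp_step(W, H, u=0, i=0, unseen=np.array([1, 2]), lr=0.1,
                  rng=np.random.default_rng(0))
```

The step pushes the positive item's score up and the violating negative's score down, and it is skipped entirely when no sampled negative comes within the margin.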

In [32]:
%%time

model = warp(max_iter=3, lr=0.00005)

model.fit(implicit_ratings, movie_info)

Precalc is completed!
current: 0/3, rmse: 0.7502451686229439
current: 1/3, rmse: 0.6803607792755957
current: 2/3, rmse: 0.6083980227203888
current: 3/3, rmse: 0.5479798085689465
CPU times: user 4min 19s, sys: 448 ms, total: 4min 19s
Wall time: 4min 16s


In [33]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
2789    American Beauty (1999)
1178    Star Wars: Episode V - The Empire Strikes Back...
589    Silence of the Lambs, The (1991)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
2693    Sixth Sense, The (1999)
2502    Matrix, The (1999)
604    Fargo (1996)
523    Schindler's List (1993)
585    Terminator 2: Judgment Day (1991)
315    Shawshank Redemption, The (1994)


In [34]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
1539    Men in Black (1997)
49    Usual Suspects, The (1995)
1245    Groundhog Day (1993)
1182    Aliens (1986)
2647    Ghostbusters (1984)
2928    Being John Malkovich (1999)
293    Pulp Fiction (1994)
476    Jurassic Park (1993)
453    Fugitive, The (1993)
