### Матричные факторизации

В данной работе вам предстоит познакомиться с практической стороной матричных разложений.
Работа поделена на 4 задания:
1. Вам необходимо реализовать SVD разложения используя SGD на explicit данных
2. Вам необходимо реализовать матричное разложения используя ALS на implicit данных
3. Вам необходимо реализовать матричное разложения используя BPR(pair-wise loss) на implicit данных
4. Вам необходимо реализовать матричное разложения используя WARP(list-wise loss) на implicit данных

Мягкий дедлайн 28 Сентября (пишутся замечания, выставляется оценка, есть возможность исправить до жесткого дедлайна)

Жесткий дедлайн 5 Октября (Итоговая проверка)

In [1]:
import implicit
import pandas as pd
import numpy as np
import scipy.sparse as sp

from lightfm.datasets import fetch_movielens

В данной работе мы будем работать с explicit датасетом movieLens, в котором представленны пары user_id movie_id и rating выставленный пользователем фильму

Скачать датасет можно по ссылке https://grouplens.org/datasets/movielens/1m/

In [2]:
ratings = pd.read_csv('ratings.dat', delimiter='::', header=None, 
        names=['user_id', 'movie_id', 'rating', 'timestamp'], 
        usecols=['user_id', 'movie_id', 'rating'], engine='python')

In [3]:
movie_info = pd.read_csv('movies.dat', delimiter='::', header=None, 
        names=['movie_id', 'name', 'category'], engine='python')

Explicit данные

In [4]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
0,1,1193,5
1,1,661,3
2,1,914,3
3,1,3408,4
4,1,2355,5
5,1,1197,3
6,1,1287,5
7,1,2804,5
8,1,594,4
9,1,919,4


In [5]:
ratings = ratings.sort_values(by = ['user_id', 'movie_id']) 

In [6]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,5
25,1,48,5
39,1,150,5
44,1,260,4
23,1,527,5
49,1,531,4
33,1,588,4
8,1,594,4
10,1,595,5
51,1,608,4


In [7]:
print(np.max(ratings['user_id']))
print(np.max(ratings['movie_id']))

6040
3952


Для того, чтобы преобразовать текущий датасет в Implicit, давайте считать что позитивная оценка это оценка >=4

In [8]:
implicit_ratings = ratings.loc[(ratings['rating'] >= 4)]

In [9]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,5
25,1,48,5
39,1,150,5
44,1,260,4
23,1,527,5
49,1,531,4
33,1,588,4
8,1,594,4
10,1,595,5
51,1,608,4


In [10]:
implicit_ratings['rating'] = 1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [11]:
implicit_ratings.head(10)

Unnamed: 0,user_id,movie_id,rating
40,1,1,1
25,1,48,1
39,1,150,1
44,1,260,1
23,1,527,1
49,1,531,1
33,1,588,1
8,1,594,1
10,1,595,1
51,1,608,1


Удобнее работать с sparse матричками, давайте преобразуем DataFrame в CSR матрицы

In [12]:
users = implicit_ratings["user_id"]
movies = implicit_ratings["movie_id"]
user_item = sp.coo_matrix((np.ones_like(users), (users, movies)))
user_item_t_csr = user_item.T.tocsr()
user_item_csr = user_item.tocsr()

В качестве примера воспользуемся ALS разложением из библиотеки implicit

Зададим размерность латентного пространства равным 64, это же определяет размер user/item эмбедингов

In [13]:
model = implicit.als.AlternatingLeastSquares(factors=64, iterations=100, calculate_training_loss=True)



В качестве loss здесь всеми любимый RMSE

In [14]:
model.fit(user_item_t_csr)

HBox(children=(FloatProgress(value=0.0), HTML(value='')))




Построим похожие фильмы по 1 movie_id = Истории игрушек

In [15]:
movie_info.head(5)

Unnamed: 0,movie_id,name,category
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [16]:
get_similars = lambda item_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                        for x in model.similar_items(item_id)]

Как мы видим, симилары действительно оказались симиларами.

Качество симиларов часто является хорошим способом проверить качество алгоритмов.

P.S. Если хочется поглубже разобраться в том как разные алгоритмы формируют разные латентные пространства, рекомендую загружать полученные вектора в tensorBoard и смотреть на сформированное пространство

In [17]:
get_similars(1, model)

['0    Toy Story (1995)',
 '3045    Toy Story 2 (1999)',
 "2286    Bug's Life, A (1998)",
 '33    Babe (1995)',
 '584    Aladdin (1992)',
 '2315    Babe: Pig in the City (1998)',
 '360    Lion King, The (1994)',
 '1526    Hercules (1997)',
 '2692    Iron Giant, The (1999)',
 '2252    Pleasantville (1998)']

Давайте теперь построим рекомендации для юзеров

Как мы видим юзеру нравится фантастика, значит и в рекомендациях ожидаем увидеть фантастику

In [18]:
get_user_history = lambda user_id, implicit_ratings : [movie_info[movie_info["movie_id"] == x]["name"].to_string() 
                                            for x in implicit_ratings[implicit_ratings["user_id"] == user_id]["movie_id"]]

In [19]:
get_user_history(4, implicit_ratings)

['257    Star Wars: Episode IV - A New Hope (1977)',
 '476    Jurassic Park (1993)',
 '1023    Die Hard (1988)',
 '1081    E.T. the Extra-Terrestrial (1982)',
 '1180    Raiders of the Lost Ark (1981)',
 '1183    Good, The Bad and The Ugly, The (1966)',
 '1196    Alien (1979)',
 '1220    Terminator, The (1984)',
 '1366    Jaws (1975)',
 '1885    Rocky (1976)',
 '1959    Saving Private Ryan (1998)',
 '2297    King Kong (1933)',
 '2623    Run Lola Run (Lola rennt) (1998)',
 '2878    Goldfinger (1964)',
 '2882    Fistful of Dollars, A (1964)',
 '3349    Thelma & Louise (1991)',
 '3399    Hustler, The (1961)',
 '3633    Mad Max (1979)']

Получилось! 

Мы действительно порекомендовали пользователю фантастику и боевики, более того встречаются продолжения тех фильмов, которые он высоко оценил

In [20]:
get_recommendations = lambda user_id, model : [movie_info[movie_info["movie_id"] == x[0]]["name"].to_string() 
                                               for x in model.recommend(user_id, user_item_csr)]

In [21]:
get_recommendations(4, model)

['585    Terminator 2: Judgment Day (1991)',
 '1271    Indiana Jones and the Last Crusade (1989)',
 '1284    Butch Cassidy and the Sundance Kid (1969)',
 '1182    Aliens (1986)',
 '2502    Matrix, The (1999)',
 '1178    Star Wars: Episode V - The Empire Strikes Back...',
 '1892    Rain Man (1988)',
 '1884    French Connection, The (1971)',
 '1179    Princess Bride, The (1987)',
 '847    Godfather, The (1972)']

Теперь ваша очередь реализовать самые популярные алгоритмы матричных разложений

Что будет оцениваться:
1. Корректность алгоритма
2. Качество получившихся симиларов
3. Качество итоговых рекомендаций для юзера

### Задание 1. Не использую готовые решения, реализовать SVD разложение используя SGD на explicit данных

In [22]:
class svd:
    def __init__(self, latent=64, regn=0.01, lr=0.01, max_iter=300000, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        B_u = np.zeros((self.total_users, 1))
        B_i = np.zeros((1, self.total_items))
        mu = self.df['rating'].mean()
        
        
        cur_iter = 0
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 100000 == 0:
                rmse = np.linalg.norm((W@H + B_u + B_i + mu)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
                            
            cur_id = np.random.randint(low=0, high=total_len)
            i = df.iloc[cur_id]['user_id'] - 1
            j = df.iloc[cur_id]['movie_id'] - 1
            value = df.iloc[cur_id]['rating']
            
            error = W[i, :] @ H[:, j] + B_u[i] + B_i[0, j] + mu - value
            
            W[i, :] = W[i, :] * (1 - self.lr * self.regn) - self.lr * error * H[:, j].T
            H[:, j] = H[:, j] * (1 - self.lr * self.regn) - self.lr * error * W[i, :].T
            B_u[i] = B_u[i] - self.lr * (error + self.regn * B_u[i])
            B_i[0, j] = B_i[0, j] - self.lr * (error + self.regn * B_i[0, j])
            
            cur_iter += 1
            
        self.w, self.h, self.b_u, self.b_i, self.mu = W, H, B_u, B_i, mu
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h + self.b_u + self.b_i + self.mu)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names

In [23]:
%%time

model = svd(max_iter=int(5e6))

model.fit(ratings, movie_info)

current: 0/5000000, rmse: 1.1455727758877459
current: 100000/5000000, rmse: 1.0047668431154533
current: 200000/5000000, rmse: 0.96678444659504
current: 300000/5000000, rmse: 0.9483664043786628
current: 400000/5000000, rmse: 0.9374847442532682
current: 500000/5000000, rmse: 0.9306308830307863
current: 600000/5000000, rmse: 0.9252260995577941
current: 700000/5000000, rmse: 0.9212778588353278
current: 800000/5000000, rmse: 0.9180147792794007
current: 900000/5000000, rmse: 0.9152667769400795
current: 1000000/5000000, rmse: 0.9128157885714528
current: 1100000/5000000, rmse: 0.9108351233864429
current: 1200000/5000000, rmse: 0.909008550102539
current: 1300000/5000000, rmse: 0.9074733627001527
current: 1400000/5000000, rmse: 0.9061616502900375
current: 1500000/5000000, rmse: 0.904666253206701
current: 1600000/5000000, rmse: 0.9033278184047804
current: 1700000/5000000, rmse: 0.9021855598465327
current: 1800000/5000000, rmse: 0.9007944967859004
current: 1900000/5000000, rmse: 0.899685013826737


In [24]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1178    Star Wars: Episode V - The Empire Strikes Back...
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3458    Predator (1987)
3633    Mad Max (1979)

So we recommend him:
2836    Sanjuro (1962)
1950    Seven Samurai (The Magnificent Seven) (Shichin...
1194    Third Man, The (1949)
1132    Wrong Trousers, The (1993)
911    Citizen Kane (1941)
941    It's a Wonderful Life (1946)
1176    One Flew Over the Cuckoo's Nes

In [25]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
3045    Toy Story 2 (1999)
584    Aladdin (1992)
2012    Little Mermaid, The (1989)
2031    Splash (1984)
2286    Bug's Life, A (1998)
2728    Big (1988)
360    Lion King, The (1994)
1985    Honey, I Shrunk the Kids (1989)
2338    Cocoon (1985)


### Задание 2. Не использую готовые решения, реализовать матричное разложение используя ALS на implicit данных

In [26]:
class als:
    def __init__(self, latent=64, regn=0.001, lr=0.001, max_iter=300, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 10 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
                        
            
            V = W @ H
            V[df['user_id']-1, df['movie_id']-1] = V[df['user_id']-1, df['movie_id']-1] - self.df['rating']
            
            if cur_iter % 2 == 0:
                W = W - self.lr * (V @ H.T + self.regn * W)
            else:
                H = H - self.lr * (W.T @ V + self.regn * H)
            
            cur_iter += 1
            
        self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names

In [27]:
%%time

model = als(max_iter=100)

model.fit(implicit_ratings, movie_info)

current: 0/100, rmse: 0.7506248239631461
current: 10/100, rmse: 0.9058937026374565
current: 20/100, rmse: 0.8609916828066873
current: 30/100, rmse: 0.8358536589955023
current: 40/100, rmse: 0.8256677691905722
current: 50/100, rmse: 0.8189308492088085
current: 60/100, rmse: 0.8086712269406495
current: 70/100, rmse: 0.7934376022770077
current: 80/100, rmse: 0.7777090590675776
current: 90/100, rmse: 0.7644041119936322
current: 100/100, rmse: 0.7531534957478487
CPU times: user 39.4 s, sys: 9.47 s, total: 48.9 s
Wall time: 11.8 s


In [28]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
1178    Star Wars: Episode V - The Empire Strikes Back...
585    Terminator 2: Judgment Day (1991)
847    Godfather, The (1972)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1568    Hunt for Red October, The (1990)
1182    Aliens (1986)
108    Braveheart (1995)
1271    Indiana Jones and the Last Crusade (1989)
1931    Lethal Weapon (1987)
1353    Star Trek: The Wrath of Kh

In [29]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
1245    Groundhog Day (1993)
3045    Toy Story 2 (1999)
2286    Bug's Life, A (1998)
584    Aladdin (1992)
2647    Ghostbusters (1984)
2918    Who Framed Roger Rabbit? (1988)
33    Babe (1995)
2849    Ferris Bueller's Day Off (1986)
1854    There's Something About Mary (1998)


### Задание 3. Не использую готовые решения, реализовать матричное разложение BPR на implicit данных

In [38]:
class bpr:
    def __init__(self, latent=64, regn=0.1, lr=0.01, max_iter=300, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        self.total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
        
        all_movies = [i for i in movie_info["movie_id"]]
        
        seen_movies_by_user = {}
        not_seen_movies_by_user = {}
        for u in range(1, self.total_users + 1):
            seen_movies_by_user[u] = np.array(self.df[self.df['user_id'] == u]['movie_id'])
            not_seen_movies_by_user[u] = np.array([i for i in all_movies if i not in seen_movies_by_user[u]])
            
        print('Precalc is completed!')
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 100000 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(self.total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
            
            u_id = np.random.randint(self.total_len)
            u = self.df.iloc[u_id]['user_id']
            i = np.random.choice(seen_movies_by_user[u])
            j = np.random.choice(not_seen_movies_by_user[u])
            
            x_uij = (W[u - 1, :] @ H[:, i - 1]) - (W[u - 1, :] @ H[:, j - 1])
            
            exp_x = np.exp(-x_uij)
            sigmoid = exp_x / (1 + exp_x)
            
            W[u - 1, :] += self.lr * (sigmoid * (H[:, i - 1] - H[:, j - 1]) - self.regn * W[u - 1, :])
            
            H[:, i - 1] += self.lr * (sigmoid * (W[u - 1, :]) - self.regn * H[:, i - 1])
            
            H[:, j - 1] += self.lr * (sigmoid * (-W[u - 1, :]) - self.regn * H[:, j - 1])
            
            cur_iter += 1
            
        self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names

In [39]:
%%time

model = bpr(max_iter=int(1e6))

model.fit(implicit_ratings, movie_info)

Precalc is completed!
current: 0/1000000, rmse: 0.7509700817353809
current: 100000/1000000, rmse: 0.6919820173974665
current: 200000/1000000, rmse: 0.6458982368551557
current: 300000/1000000, rmse: 0.6042568268040082
current: 400000/1000000, rmse: 0.5656521007450733
current: 500000/1000000, rmse: 0.532614458735839
current: 600000/1000000, rmse: 0.5067481743739117
current: 700000/1000000, rmse: 0.49033871698428033
current: 800000/1000000, rmse: 0.4830683086965228
current: 900000/1000000, rmse: 0.48327563722441325
current: 1000000/1000000, rmse: 0.4877005041745
CPU times: user 3min 54s, sys: 955 ms, total: 3min 55s
Wall time: 3min 52s


In [40]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
1178    Star Wars: Episode V - The Empire Strikes Back...
589    Silence of the Lambs, The (1991)
1179    Princess Bride, The (1987)
2789    American Beauty (1999)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
1575    L.A. Confidential (1997)
604    Fargo (1996)
585    Terminator 2: Judgment Day (1991)
847    Godfather, The (1972)
2502    Matrix, The (1999)


In [41]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
1250    Back to the Future (1985)
2693    Sixth Sense, The (1999)
476    Jurassic Park (1993)
1245    Groundhog Day (1993)
1539    Men in Black (1997)
315    Shawshank Redemption, The (1994)
1220    Terminator, The (1984)
293    Pulp Fiction (1994)
847    Godfather, The (1972)


### Задание 4. Не использую готовые решения, реализовать матричное разложение WARP на implicit данных

In [34]:
class warp:
    def __init__(self, latent=64, regn=1, lr=0.0001, max_iter=300, eps=0.0):
        self.latent = latent
        self.regn = regn
        self.lr = lr
        self.max_iter=max_iter
        self.eps = eps
        
    def fit(self, df, movie_info):
        #V = sp.coo_matrix((df['rating'], (df['user_id'], df['movie_id']))).tocsr()
        
        self.df = df
        self.movie_info = movie_info
        
        self.total_users = np.max(df['user_id'])
        self.total_items = np.max(df['movie_id'])
        self.total_len = len(df)
        
        W = np.random.random((self.total_users, self.latent)) * (1 / np.sqrt(self.latent))
        H = np.random.random((self.latent, self.total_items)) * (1 / np.sqrt(self.latent))
        
        cur_iter = 0
            
        def partial_BPR(x_uij, partial_x):
            exp_x = np.exp(-x_uij)
            return exp_x / (1 + exp_x) * partial_x
        
        all_movies = [i for i in movie_info["movie_id"]]
        
        seen_movies_by_user = {}
        not_seen_movies_by_user = {}
        for u in range(1, self.total_users + 1):
            seen_movies_by_user[u] = np.array(self.df[self.df['user_id'] == u]['movie_id'])
            not_seen_movies_by_user[u] = np.array([i for i in all_movies if i not in seen_movies_by_user[u]])
            
        print('Precalc is completed!')
        
        while cur_iter <= self.max_iter + 1:
            if cur_iter % 100000 == 0:
                rmse = np.linalg.norm((W@H)[df['user_id']-1, df['movie_id']-1] - df['rating'])
                rmse /= np.sqrt(self.total_len)
                if rmse <= self.eps:
                    print(f"Let's stop on iter {cur_iter}!")
                    break
                    
                print(f'current: {cur_iter}/{self.max_iter}, rmse: {rmse}')
            
            u_id = np.random.randint(self.total_len)
            u = self.df.iloc[u_id]['user_id']
            i = np.random.choice(seen_movies_by_user[u])
            
            pred = W[u - 1] @ H[:, i - 1]
            unseen_for_user = not_seen_movies_by_user[u]
            
            q = 0
            for j in np.random.permutation(unseen_for_user):
                q += 1
                if W[u - 1] @ H[:, j - 1] + 1 > pred:
                    log_proba = np.log(len(unseen_for_user) / q)
                    
                    W[u - 1] = W[u - 1] - self.lr * log_proba * (H[:, j - 1] - H[:, i - 1]) + self.regn * self.lr * W[u - 1]
                    H[:, i - 1] = H[:, i - 1] + self.lr * log_proba * (W[u - 1]) + self.regn * self.lr * H[:, i - 1]
                    H[:, j - 1] = H[:, j - 1] - self.lr * log_proba * (W[u - 1]) + self.regn * self.lr * H[:, j - 1]
                    break
            
            cur_iter += 1
            
        self.w, self.h = W, H
        
    def get_history(self, cur_user_id):
        history = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in self.df[self.df["user_id"] == cur_user_id]["movie_id"]]
        return history
        
    def get_recommendations(self, cur_user_id, best_k=5):
        all_predicted_ratings = (self.w @ self.h)[cur_user_id - 1]
        
        known_ratings = np.array(self.df.loc[self.df['user_id'] == cur_user_id]['movie_id'])
        
        need_to_predict = [i for i in movie_info["movie_id"] if i not in known_ratings]
        
        need_to_predict.sort(key=lambda x: all_predicted_ratings[x - 1], reverse=True)
        
        res = np.array(need_to_predict[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names
    
    def get_similars(self, cur_movie_id, best_k=5):       
        cur_movie_embedding = self.h[:, cur_movie_id - 1]
        
        all_films = [i for i in movie_info["movie_id"]]
        
        all_films.sort(key=lambda x: np.linalg.norm(cur_movie_embedding - self.h[:, x - 1]))
        
        res = np.array(all_films[:best_k])
        
        best_names = [self.movie_info[self.movie_info["movie_id"] == x]["name"].to_string() for x in res]
        
        return best_names

In [35]:
%%time

model = warp(max_iter=1500000)

model.fit(implicit_ratings, movie_info)

Precalc is completed!
current: 0/1500000, rmse: 0.7505114678367546
current: 100000/1500000, rmse: 0.7265920463430235
current: 200000/1500000, rmse: 0.702190795535047
current: 300000/1500000, rmse: 0.6770941167005018
current: 400000/1500000, rmse: 0.6518765675305429
current: 500000/1500000, rmse: 0.6265250710724803
current: 600000/1500000, rmse: 0.6018156914036056
current: 700000/1500000, rmse: 0.5785972101317821
current: 800000/1500000, rmse: 0.558082441479489
current: 900000/1500000, rmse: 0.5419789308299597
current: 1000000/1500000, rmse: 0.531699782558143
current: 1100000/1500000, rmse: 0.5289940063387015
current: 1200000/1500000, rmse: 0.5345046268098664
current: 1300000/1500000, rmse: 0.5506559590585585
current: 1400000/1500000, rmse: 0.5776895661135728
current: 1500000/1500000, rmse: 0.6160193584882127
CPU times: user 6min 55s, sys: 1.78 s, total: 6min 57s
Wall time: 6min 52s


In [36]:
history = model.get_history(4)

print("This guy's history is:")
for token in history:
    print(token)
    
recommendations = model.get_recommendations(4, 10)

print("\nSo we recommend him:")
for token in recommendations:
    print(token)

This guy's history is:
257    Star Wars: Episode IV - A New Hope (1977)
476    Jurassic Park (1993)
1023    Die Hard (1988)
1081    E.T. the Extra-Terrestrial (1982)
1180    Raiders of the Lost Ark (1981)
1183    Good, The Bad and The Ugly, The (1966)
1196    Alien (1979)
1220    Terminator, The (1984)
1366    Jaws (1975)
1885    Rocky (1976)
1959    Saving Private Ryan (1998)
2297    King Kong (1933)
2623    Run Lola Run (Lola rennt) (1998)
2878    Goldfinger (1964)
2882    Fistful of Dollars, A (1964)
3349    Thelma & Louise (1991)
3399    Hustler, The (1961)
3633    Mad Max (1979)

So we recommend him:
2789    American Beauty (1999)
1178    Star Wars: Episode V - The Empire Strikes Back...
589    Silence of the Lambs, The (1991)
2502    Matrix, The (1999)
2693    Sixth Sense, The (1999)
604    Fargo (1996)
523    Schindler's List (1993)
1192    Star Wars: Episode VI - Return of the Jedi (1983)
585    Terminator 2: Judgment Day (1991)
847    Godfather, The (1972)


In [37]:
#best = model.get_similars(260, 10)
best = model.get_similars(1, 10)

for token in best:
    print(token)

0    Toy Story (1995)
476    Jurassic Park (1993)
2928    Being John Malkovich (1999)
1575    L.A. Confidential (1997)
2647    Ghostbusters (1984)
49    Usual Suspects, The (1995)
537    Blade Runner (1982)
1245    Groundhog Day (1993)
1220    Terminator, The (1984)
293    Pulp Fiction (1994)
