# Homework on recommender systems

In this homework you are asked to implement a user-based recommender system. You also need to implement several helper functions, whose templates can be found in `utils.py`.

Requirements:
- An implementation of a function from `utils.py` counts only if all the corresponding tests in `test.py` pass. To run the tests: <font color='red'>pytest test.py</font>. The tests require the `numpy`, `scipy`, `pytest` and `hypothesis` libraries.
- Plagiarism is forbidden. If part of the assignment is found to be copied, both the person who copied and the person who shared the work get 0 points.
- If you use code from open sources, cite the links you took the solution from. Otherwise the code may be treated as plagiarism.
- You may not use the `scipy` library or the `numpy.linalg.norm` function in your solution.

Running the tests may produce the warnings PearsonRConstantInputWarning and PearsonRNearConstantInputWarning. They can be ignored.

Maximum score for the assignment: 10 points <br>
Deadline: ??? <br>
Penalty: ??? - will there be a penalty in the course? <br>
<br>
To speed up grading, write the number of points you obtained here: ...

## 1. Similarity metric
<b>1.1. Implementing the metrics (2 points)</b>

The first thing to settle when implementing the user-based approach is the metric that decides how similar two users are. You are asked to implement two metrics: one based on the Euclidean distance, and the Pearson correlation coefficient. Templates for both functions are in `utils.py`. Remember to check your implementation against the tests.

Euclidean distance:
\begin{equation}
d(p,q)=\sqrt{(p_1-q_1)^2+(p_2-q_2)^2+\dots+(p_n-q_n)^2} = \sqrt{\sum_{k=1}^n (p_k-q_k)^2}
\end{equation}

Here $d(p, q) \in [0, \infty)$, and we want $sim(p, q) \to 1$ as $d(p, q) \to 0$. With that in mind, the final formula is:
\begin{equation}
sim(p, q) = \frac{1}{1 + d(p, q)}
\end{equation}
This formula also avoids any division by zero, since the denominator is at least 1.
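
As a minimal sketch of the formula above (not the reference solution expected in `utils.py`), the similarity can be computed in plain Python without `scipy` or `numpy.linalg.norm`. The name `euclidean_similarity_sketch` is hypothetical and chosen only for this illustration.

```python
import math
from typing import Sequence


def euclidean_similarity_sketch(p: Sequence[float], q: Sequence[float]) -> float:
    """Return 1 / (1 + d(p, q)), where d is the Euclidean distance."""
    if len(p) != len(q):
        raise ValueError("p and q need to have the same length")
    # d(p, q) = sqrt(sum_k (p_k - q_k)^2)
    distance = math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))
    # The denominator is at least 1, so no division by zero can occur.
    return 1.0 / (1.0 + distance)


print(euclidean_similarity_sketch([1, 2], [1, 2]))  # identical vectors give 1.0
print(euclidean_similarity_sketch([0, 0], [3, 4]))  # d = 5, so the result is 1 / 6
```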

Pearson correlation coefficient:
\begin{equation}
r_{xy} = \frac {\sum_{i=1}^{m} \left( x_i-\bar{x} \right)\left( y_i-\bar{y} \right)}{\sqrt{\sum_{i=1}^{m} \left( x_i-\bar{x} \right)^2 \sum_{i=1}^{m} \left( y_i-\bar{y} \right)^2}}
\end{equation}
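
The Pearson formula above can be sketched the same way in plain Python. The function name `pearson_similarity_sketch` is hypothetical, and returning `nan` for constant input (where the denominator is zero) is an assumption made for this sketch; the `utils.py` tests may expect different behaviour.

```python
import math
from typing import Sequence


def pearson_similarity_sketch(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation r_xy computed directly from the definition."""
    if len(x) != len(y):
        raise ValueError("x and y need to have the same length")
    if len(x) == 0:
        raise ValueError("x and y must not be empty")
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    denom = math.sqrt(var_x * var_y)
    if denom == 0:
        # Assumption: correlation is undefined for constant input.
        return float("nan")
    return cov / denom


print(pearson_similarity_sketch([1, 2], [6, 5]))     # opposite trends
print(pearson_similarity_sketch([10, 9.5], [6, 5]))  # same trend
```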

<b>1.2. (1 point)</b>

Consider users $u$ and $v$. They correspond to vectors $x_u$ and $x_v$, where $x_u[i] = r_{ui}$ and $x_v[i] = r_{vi}$. As shown in the lecture, the similarity between $x_u$ and $x_v$ is computed only over the indices $i$ for which both $r_{ui}$ and $r_{vi}$ exist. That is:
\begin{equation}
sim(u, v) = sim(x_uI_{uv}, x_vI_{uv}),
\end{equation}
where $I_{uv} = [i | \exists r_{ui} \& \exists r_{vi}]$. If $I_{uv} = \emptyset$, then $sim(u, v) \to -\infty$.

Implement two new methods that reuse your `euclidean_distance` and `pearson_distance` and add these conditions on $x_u$ and $x_v$. We assume that $x_u[i] = 0$ if $\nexists r_{ui}$. The same holds for $x_v$.

You may either write new functions or use decorators.
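
The decorator option could look roughly like the sketch below. The names `on_common_items` and `toy_similarity` are hypothetical; the toy similarity is only a stand-in so the example runs on its own, and in the notebook the decorator would wrap the real similarity functions instead.

```python
import functools
from typing import Callable, Sequence


def on_common_items(sim: Callable[[Sequence[float], Sequence[float]], float]):
    """Restrict a similarity function to items rated by both users.

    A rating of 0 is treated as "no rating", as stated in the task.
    """
    @functools.wraps(sim)
    def wrapper(x: Sequence[float], y: Sequence[float]) -> float:
        if len(x) != len(y):
            raise ValueError("x and y need to have the same length")
        common = [(xi, yi) for xi, yi in zip(x, y) if xi != 0 and yi != 0]
        if not common:
            # I_uv is empty: no shared items, so similarity is -infinity.
            return -float("inf")
        xs, ys = zip(*common)
        return sim(list(xs), list(ys))
    return wrapper


@on_common_items
def toy_similarity(x, y):
    # Stand-in similarity for the demo: 1 / (1 + sum of absolute differences).
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(x, y)))


print(toy_similarity([1, 1, 0], [1, 0, 4]))  # only index 0 is shared
print(toy_similarity([1, 0], [0, 2]))        # no shared items
```

A decorator keeps the masking logic in one place, so the same wrapper can be applied to both the Euclidean and the Pearson similarity.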

Since `utils.py` has no `pearson_distance` method, the assignment most likely meant `euclidean_similarity` and `pearson_similarity`, so those are the ones I will modify.

In [None]:
from google.colab import files
uploaded = files.upload()
from utils import euclidean_similarity, pearson_similarity

Saving utils.py to utils.py


In [None]:
from utils import euclidean_distance

In [None]:
import numpy as np

In [None]:
import numpy.ma as ma
x = np.array([1, 1, 0])
y = np.array([1, 0, 4])
masked_x = ma.masked_array(x, mask = x == 0, fill_value=0)
masked_y = ma.masked_array(y, mask = y == 0, fill_value=0)

In [None]:
(masked_x + masked_y).mask, ((masked_x + masked_y).mask == True).all()
# (masked_x + masked_y).mask

(array([False,  True,  True]), False)

This is how I check that at least one film has a rating from both of the users between whom the distance is computed.

In [None]:
# your code (ﾉ>ω<)ﾉ :｡･:*:･ﾟ’★,｡･:*:･ﾟ’☆
def euclidean_similarity_1(x: np.array, y: np.array) -> float:
    if type(y) != np.ndarray:
        y_shape = y.shape[-1]
        y = y.todense()
    else:
        y_shape = len(y)
    if len(x) != y_shape:
        raise ValueError("x and y need to have the same length")
    masked_x = ma.masked_array(x, mask = x == 0, fill_value=0)
    masked_y = ma.masked_array(y, mask = y == 0, fill_value=0)
    if ((masked_x + masked_y).mask == True).all():  # every pair x[i], y[i] is missing at least one
        return -float('inf')                        # rating, so the metric cannot be computed
    return euclidean_similarity(masked_x,masked_y)


def pearson_similarity_1(x: np.array, y: np.array) -> float:
    if type(y) != np.ndarray:
        y_shape = y.shape[-1]
        y = y.todense()
    else:
        y_shape = len(y)
    if len(x) != y_shape:
        raise ValueError("x and y need to have the same length")
    masked_x = ma.masked_array(x, mask = x == 0, fill_value=0)
    masked_y = ma.masked_array(y, mask = y == 0, fill_value=0)
    if ((masked_x + masked_y).mask == True).all():
        return -float('inf')
    return pearson_similarity(masked_x, masked_y)


def euclidean_distance_1(x: np.array, y: np.array) -> float:
    return 1 - euclidean_similarity_1(x, y)

def pearson_distance_1(x: np.array, y: np.array) -> float:
    return 1 - pearson_similarity_1(x, y)

## 2. User-based method
<b>2.1. (3 points)</b>

Implement the user-based approach by implementing the methods of the `UserBasedRecommendation` class, which is built on `NearestNeighbors`. Either the Euclidean metric or the Pearson correlation coefficient may be used to find similar users.

Keep in mind that `NearestNeighbors` searches for the minimum distance between elements, so it makes sense to initialise `NearestNeighbors` with a metric that is the inverse of the similarity, i.e. one where $d(u, v) \to 0$ as $sim(u, v) \to 1$. For example: $d(u, v) = 1 - sim(u, v)$
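
To illustrate why the similarity has to be turned into a distance, here is a brute-force stand-in for a nearest-neighbour search in plain Python. It is only a sketch of the idea, not how `NearestNeighbors` works internally; `k_nearest_sketch` and `toy_sim` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def k_nearest_sketch(
    query: Sequence[float],
    others: Sequence[Sequence[float]],
    sim: Callable[[Sequence[float], Sequence[float]], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k rows with the smallest distance d = 1 - sim to the query."""
    distances = [(idx, 1.0 - sim(query, row)) for idx, row in enumerate(others)]
    # Smallest distance first, i.e. largest similarity first.
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]


def toy_sim(x, y):
    # Stand-in similarity: 1 / (1 + sum of absolute differences).
    return 1.0 / (1.0 + sum(abs(a - b) for a, b in zip(x, y)))


users = [[5, 4], [1, 1], [5, 5]]
print(k_nearest_sketch([5, 4], users, toy_sim, k=2))  # rows 0 and 2 are closest
```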

In [None]:
from sklearn.neighbors import NearestNeighbors
from typing import Optional
from scipy.sparse import csr_matrix

from collections import Counter


class UserBasedRecommendation:
    def __init__(self, metric: str = 'euclidean', n_recommendations: int = 5, alpha: float = 0.8):
        """
        Args:
            metric: name of metric: ['euclidean', 'pearson']
            n_recommendations: number of recommendations. Also can be specified self.make_recommendation
            alpha: similarity threshold: if sim(u, v) > alpha then u and v are similar
        """
        self.alpha = alpha
        if metric == 'euclidean':
            self.metric = euclidean_distance_1
        elif metric == 'pearson':
            self.metric = pearson_distance_1
        else:
            raise ValueError("metric must be 'euclidean' or 'pearson'")
        self.model_NN = NearestNeighbors(metric=self.metric, radius=self.alpha)
        self.n_recommendations = n_recommendations

    def fit(self, X: np.array):
        """
        Args:
            X: matrix N x M where X[u, i] = r_{ui} if r_{ui} exists else X[u, i] = 0
        """
        self.X = X
        self.X_matrix = csr_matrix(self.X)
        self.model_NN.fit(self.X_matrix)

    def __find_closest_users(self, user_id: int, n_closest_users: int):
        self.n_closest_users = n_closest_users
        distances, indices = self.model_NN.kneighbors(self.X.iloc[user_id,:].values.reshape(1, -1), n_neighbors = self.n_closest_users)
        userid_to_distance = list(zip(self.X.index[indices.flatten()[1:]], distances.flatten()[1:]))
        self.closest_users = [x[0] for x in userid_to_distance]

    def make_recommendation(self, user_id: int, n_recommendations: Optional[int] = None):
        """
        Args:
            user_id: user id to whom you want to recommend
            n_recommendations: number of recommendations
        """
        if n_recommendations:
            self.n_recommendations = n_recommendations
        self.__find_closest_users(user_id=user_id, n_closest_users=30)
        movie_with_rating = []
        # NOTE: the loop stops after the first n_recommendations columns, so only
        # those films are scored and sorted; the whole catalogue is not ranked.
        for movie_id in self.X.columns:
            if len(movie_with_rating) > self.n_recommendations-1:
                break
            else:
                # use the fitted matrix self.X rather than the global X
                normalized_rating = self.X.loc[self.closest_users, movie_id].sum() / len(self.closest_users)
                movie_with_rating.append((movie_id, normalized_rating))

        if len(movie_with_rating) < self.n_recommendations:
            print('For user with id = ', user_id, 'fewer than', self.n_recommendations, 'recommendations were found;\n',
              'only', len(movie_with_rating), 'films are available')
        movie_with_rating.sort(key=lambda x: x[1], reverse=True)
        result = [x[0] for x in movie_with_rating]
        return result

<b>2.2. (1 point)</b>

Give an example where different metrics lead to different recommendations. Explain your example.

In [None]:
user_1, user_2, user_3  = np.array([1, 2]), np.array([10, 9.5]), np.array([6, 5])
euclidean_distance_1(user_1, user_3), pearson_distance_1(user_1, user_3), euclidean_distance_1(user_2, user_3), pearson_distance_1(user_2, user_3)
print('Euclidean distance between users 1 and 3 = ', euclidean_distance_1(user_1, user_3))
print('Euclidean distance between users 2 and 3 = ', euclidean_distance_1(user_2, user_3))
print('The Euclidean distance between users 2 and 3 is larger => the algorithm will recommend\
 user 1 to user 3\n')
print('Pearson distance between users 1 and 3 = ', pearson_distance_1(user_1, user_3))
print('Pearson distance between users 2 and 3 = ', pearson_distance_1(user_2, user_3))
print('The Pearson distance between users 1 and 3 is larger => the algorithm will recommend\
 user 2 to user 3')

Euclidean distance between users 1 and 3 =  0.8536075183380212
Euclidean distance between users 2 and 3 =  0.8575660343433718
The Euclidean distance between users 2 and 3 is larger => the algorithm will recommend user 1 to user 3

Pearson distance between users 1 and 3 =  2.0
Pearson distance between users 2 and 3 =  0.0
The Pearson distance between users 1 and 3 is larger => the algorithm will recommend user 2 to user 3


<b>Explanation:</b> For clarity, number the items from 1. User 3, like user 2, liked item 1 more and item 2 slightly less, whereas user 1 did not much like either item and liked item 1 less than item 2, i.e. the dependence is reversed. The Euclidean metric takes none of this into account, so in the first case the algorithm decided that users 1 and 3 are similar only because their ratings fall in roughly the same range, which is a rather odd assumption. The algorithm based on the Pearson correlation coefficient does take these factors into account, so it finds user 2 as the neighbour with similar interests for user 3. Hence the different recommendations.

## 3. Quality evaluation
<b>3.1. (1 point)</b>

Implement Average Precision at k and Mean Average Precision at k. The templates are in `utils.py`.
\begin{align*}
AP@K = \frac{1}{m}\sum_{k=1}^K P(k)*rel(k), \\
MAP@K = \frac{1}{|U|}\sum_{u=1}^{|U|}(AP@K)_u
\end{align*}
where $P(k)$ is Precision at k, and $rel(k) = 1$ if the recommendation at position $k$ is relevant, otherwise $rel(k) = 0$.
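
The two formulas can be sketched in plain Python as follows. The normaliser $m$ is not defined in the text above, so this sketch assumes a common convention, $m = \min(|\text{actual}|, K)$; the templates in `utils.py` may define it differently. The names `apk_sketch` and `mapk_sketch` are hypothetical.

```python
from typing import Hashable, List, Sequence


def apk_sketch(actual: Sequence[Hashable], predicted: Sequence[Hashable], k: int = 10) -> float:
    """Average Precision at k for one user."""
    if not actual:
        return 0.0
    predicted = list(predicted)[:k]
    hits = 0
    score = 0.0
    for i, item in enumerate(predicted):
        # Count a relevant item only once, even if it is recommended twice.
        if item in actual and item not in predicted[:i]:
            hits += 1
            score += hits / (i + 1)  # P(i + 1) * rel(i + 1)
    # Assumption: m = min(number of relevant items, k).
    return score / min(len(actual), k)


def mapk_sketch(actual: List[Sequence[Hashable]], predicted: List[Sequence[Hashable]], k: int = 10) -> float:
    """Mean of AP@k over all users."""
    if not actual:
        return 0.0
    return sum(apk_sketch(a, p, k) for a, p in zip(actual, predicted)) / len(actual)


print(apk_sketch([1, 2, 3], [1, 4, 2], k=3))     # hits at positions 1 and 3
print(mapk_sketch([[1], [2]], [[1], [3]], k=1))  # one user hit, one missed
```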

---

## 4. Applying the model
<b>4.1. (2 points)</b>

Download the `ratings_small.csv` dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset#ratings_small.csv

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
from google.colab import drive
drive.mount('content', force_remount=True)

Mounted at content


In [None]:
data = pd.read_csv('content/MyDrive/рекомендательныесистемы/рексис/ratings_small.csv', index_col=False)
data.shape

(100004, 4)

In [None]:
data.userId.min(), data.userId.max(), len(data.userId.unique())

(1, 671, 671)

In [None]:
data.movieId.min(), data.movieId.max(), len(data.movieId.unique())

(1, 163949, 9066)

In [None]:
data

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


To make the data easier to work with, renumber the users and films so that the numbering starts at 0 and is contiguous.

In [None]:
data.userId

0           1
1           1
2           1
3           1
4           1
         ... 
99999     671
100000    671
100001    671
100002    671
100003    671
Name: userId, Length: 100004, dtype: int64

In [None]:
# your code (ﾉ>ω<)ﾉ :｡･:*:･ﾟ’★,｡･:*:･ﾟ’☆
userid_to_idx = {userid : idx for idx, userid in enumerate(data.userId.unique())}
idx_to_userid = {idx : userid for userid, idx  in userid_to_idx.items()}
movieid_to_idx = {movieid : idx for idx, movieid in enumerate(data.movieId.unique())}
idx_to_movieid = {idx : movieid for movieid, idx  in movieid_to_idx.items()}

In [None]:
userid_to_idx[1], idx_to_userid[0], movieid_to_idx[31], idx_to_movieid[0]

(0, 1, 0, 31)

In [None]:
data.userId = data.userId.apply(lambda x: userid_to_idx[x])
data.movieId = data.movieId.apply(lambda x: movieid_to_idx[x])

In [None]:
data.userId.min(), data.userId.max(), len(data.userId.unique())

(0, 670, 671)

In [None]:
data.movieId.min(), data.movieId.max(), len(data.movieId.unique())

(0, 9065, 9066)

Remove 5 ratings for each of the most active users

In [None]:
active_users = data.userId.value_counts()[:10].index
test_parts = []
for user_id in active_users:
    _, test = train_test_split(data[data.userId == user_id], test_size=5, random_state=42)
    test_parts.append(test)  # DataFrame.append was removed in pandas 2.0, so collect and concat
    data = data[~((data.userId == user_id) & (data.movieId.isin(test.movieId.values)))]
test_data = pd.concat(test_parts, ignore_index=True)
data.shape, test_data.shape

((99954, 4), (50, 4))

In [None]:
test_data

Unnamed: 0,userId,movieId,rating,timestamp
0,546,6495,4.0,1028129718
1,546,2241,4.0,1039880724
2,546,1874,4.5,1468681977
3,546,8021,4.0,1242992741
4,546,4401,3.5,1342849917
5,563,5124,5.0,974708761
6,563,2735,5.0,974844711
7,563,2578,3.0,974714208
8,563,2532,3.0,974843382
9,563,5008,4.0,974839307


Convert the data into a table `X` that `UserBasedRecommendation` can work with, where $X_{ui} = r_{ui}$ if user $u$ rated film $i$, and $X_{ui} = 0$ if user $u$ did not rate film $i$.

You may find `csr_matrix` useful.

In [None]:
# your code (ﾉ>ω<)ﾉ :｡･:*:･ﾟ’★,｡･:*:･ﾟ’☆
from scipy.sparse import csr_matrix
X = data.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
X_matrix = csr_matrix(X.values)
X_matrix

<671x9061 sparse matrix of type '<class 'numpy.float64'>'
	with 99954 stored elements in Compressed Sparse Row format>

In [None]:
------------------

In fact, before writing the UserBasedRecommendation class I first tried applying Nearest Neighbors directly, to understand how everything would work.

In [None]:
model_NN = NearestNeighbors(n_neighbors=5, metric=euclidean_distance_1, radius=float('inf'), p=-float('inf'))
model_NN.fit(X_matrix)

NearestNeighbors(metric=<function euclidean_distance_1 at 0x7f5da44e9e60>,
                 p=-inf, radius=inf)

The one thing I did not manage to do is tune the alpha hyperparameter. After reading the documentation, I concluded that the `p` hyperparameter of NearestNeighbors is not needed here; it would matter only if I passed 'minkowski' as the `metric` of NearestNeighbors. So I decided that the `alpha` hyperparameter of UserBasedRecommendation corresponds to the `radius` hyperparameter of NearestNeighbors. However, changing `radius` had no effect on the model (in the example above I passed infinity).
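
A likely reason for this observation: in scikit-learn, the `radius` argument of `NearestNeighbors` is the default for `radius_neighbors` queries, while `kneighbors` does not use it. One possible way to apply `alpha` anyway is to filter the neighbours returned by `kneighbors` by distance. The sketch below assumes $d = 1 - sim$, so $sim > \alpha$ is equivalent to $d < 1 - \alpha$; the helper `filter_by_alpha` and the neighbour list are hypothetical toy values, not results from the dataset.

```python
from typing import List, Tuple


def filter_by_alpha(neighbours: List[Tuple[int, float]], alpha: float) -> List[int]:
    """Keep users whose similarity exceeds alpha, given (user_id, distance) pairs."""
    threshold = 1.0 - alpha  # sim > alpha  <=>  distance < 1 - alpha
    return [user for user, distance in neighbours if distance < threshold]


# Toy (user_id, distance) pairs, made up for the illustration.
neighbours = [(10, 0.0), (11, 0.3), (12, 0.9)]
print(filter_by_alpha(neighbours, alpha=0.5))  # users 10 and 11 pass the threshold
```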

In [None]:
# query_index = np.random.choice(X.shape[0])
query_index = 621
print(query_index)
distances, indices = model_NN.kneighbors(X.iloc[query_index,:].values.reshape(1, -1), n_neighbors = 100)

621


Find similar users for the user with id = 621

In [None]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print('Closest users for {0}:\n'.format(X.index[query_index]))
    else:
        print('{0}: {1}, with distance of {2}:'.format(i, X.index[indices.flatten()[i]], distances.flatten()[i]))

Closest users for 621:

1: 618, with distance of 0.0:
2: 54, with distance of 0.0:
3: 621, with distance of 0.0:
4: 52, with distance of 0.0:
5: 542, with distance of 0.0:
6: 535, with distance of 0.0:
7: 247, with distance of 0.0:
8: 84, with distance of 0.0:
9: 434, with distance of 0.0:
10: 45, with distance of 0.0:
11: 545, with distance of 0.0:
12: 568, with distance of 0.0:
13: 633, with distance of 0.0:
14: 368, with distance of 0.0:
15: 547, with distance of 0.0:
16: 241, with distance of 0.0:
17: 636, with distance of 0.0:
18: 540, with distance of 0.0:
19: 615, with distance of 0.0:
20: 452, with distance of 0.0:
21: 592, with distance of 0.0:
22: 400, with distance of 0.0:
23: 222, with distance of 0.0:
24: 458, with distance of 0.0:
25: 224, with distance of 0.0:
26: 408, with distance of 0.0:
27: 463, with distance of 0.0:
28: 537, with distance of 0.0:
29: 464, with distance of 0.0:
30: 120, with distance of 0.0:
31: 415, with distance of 0.0:
32: 81, with distance of 0.0

Good: for user 621 I now know how to get a list of tuples of user ids and their distances to user 621:

In [None]:
list(zip(X.index[indices.flatten()[1:]], distances.flatten()[1:]))

[(618, 0.0),
 (54, 0.0),
 (621, 0.0),
 (52, 0.0),
 (542, 0.0),
 (535, 0.0),
 (247, 0.0),
 (84, 0.0),
 (434, 0.0),
 (45, 0.0),
 (545, 0.0),
 (568, 0.0),
 (633, 0.0),
 (368, 0.0),
 (547, 0.0),
 (241, 0.0),
 (636, 0.0),
 (540, 0.0),
 (615, 0.0),
 (452, 0.0),
 (592, 0.0),
 (400, 0.0),
 (222, 0.0),
 (458, 0.0),
 (224, 0.0),
 (408, 0.0),
 (463, 0.0),
 (537, 0.0),
 (464, 0.0),
 (120, 0.0),
 (415, 0.0),
 (81, 0.0),
 (390, 0.0),
 (388, 0.0),
 (172, 0.0),
 (439, 0.0),
 (170, 0.0),
 (365, 0.0),
 (510, 0.0),
 (278, 0.0),
 (279, 0.0),
 (445, 0.0),
 (190, 0.0),
 (328, 0.0),
 (665, 0.0),
 (189, 0.0),
 (638, 0.0),
 (318, 0.0),
 (668, 0.0),
 (187, 0.0),
 (3, 0.0),
 (551, 0.0),
 (183, 0.0),
 (661, 0.0),
 (342, 0.0),
 (335, 0.0),
 (655, 0.0),
 (639, 0.0),
 (360, 0.0),
 (91, 0.0),
 (565, 0.0),
 (645, 0.0),
 (529, 0.0),
 (27, 0.0),
 (28, 0.0),
 (196, 0.0),
 (195, 0.0),
 (348, 0.0),
 (347, 0.0),
 (652, 0.0),
 (26, 0.0),
 (506, 0.0),
 (484, 0.33333333333333337),
 (609, 0.33333333333333337),
 (322, 0.33333333

And this is how I can get the N nearest:

In [None]:
[x[0] for x in list(zip(X.index[indices.flatten()[1:]], distances.flatten()[1:]))]

[618,
 54,
 621,
 52,
 542,
 535,
 247,
 84,
 434,
 45,
 545,
 568,
 633,
 368,
 547,
 241,
 636,
 540,
 615,
 452,
 592,
 400,
 222,
 458,
 224,
 408,
 463,
 537,
 464,
 120,
 415,
 81,
 390,
 388,
 172,
 439,
 170,
 365,
 510,
 278,
 279,
 445,
 190,
 328,
 665,
 189,
 638,
 318,
 668,
 187,
 3,
 551,
 183,
 661,
 342,
 335,
 655,
 639,
 360,
 91,
 565,
 645,
 529,
 27,
 28,
 196,
 195,
 348,
 347,
 652,
 26,
 506,
 484,
 609,
 322,
 642,
 139,
 363,
 624,
 90,
 216,
 444,
 409,
 332,
 266,
 234,
 313,
 442,
 223,
 171,
 495,
 466,
 507,
 178,
 512,
 181,
 157,
 498,
 490]

In [None]:
----------

In [None]:
X

movieId,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,9026,9027,9028,9029,9030,9031,9032,9033,9034,9035,9036,9037,9038,9039,9040,9041,9042,9043,9044,9045,9046,9047,9048,9049,9050,9051,9052,9053,9054,9055,9056,9057,9058,9059,9060,9061,9062,9063,9064,9065
userId
0,2.5,3.0,3.0,2.0,4.0,2.0,2.0,2.0,3.5,2.0,2.5,1.0,4.0,4.0,3.0,2.0,2.0,2.5,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,5.0,4.0,4.0,3.0,3.0,4.0,3.0,5.0,4.0,3.0,3.0,3.0,3.0,3.0,3.0,5.0,1.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,5.0,4.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,4.0,5.0,4.0,0.0,4.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,5.0,0.0,0.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
X.shape

(671, 9061)

Now, having found similar users, I can compute the average ratings of all films and sort them, obtaining not just the films that the nearest users watched but the ones they liked most.


In [None]:
closest_users = [i for i in range(10)]
result = []
for movie_id in X.columns:
    result.append((movie_id, X.loc[closest_users, movie_id].sum() / len(closest_users)))

This gives me a list of tuples with film ids and their average ratings given by the nearest users.

In [None]:
result.sort(key=lambda x: x[1], reverse=True)
result

[(57, 2.4),
 (99, 2.3),
 (105, 2.1),
 (106, 2.1),
 (88, 1.85),
 (180, 1.8),
 (49, 1.75),
 (179, 1.75),
 (27, 1.7),
 (79, 1.7),
 (89, 1.7),
 (202, 1.6),
 (402, 1.6),
 (119, 1.55),
 (59, 1.5),
 (92, 1.45),
 (24, 1.4),
 (60, 1.4),
 (91, 1.4),
 (189, 1.4),
 (225, 1.4),
 (143, 1.35),
 (298, 1.35),
 (72, 1.3),
 (141, 1.3),
 (191, 1.3),
 (64, 1.25),
 (157, 1.25),
 (158, 1.25),
 (195, 1.25),
 (23, 1.2),
 (90, 1.2),
 (113, 1.2),
 (187, 1.2),
 (197, 1.2),
 (341, 1.2),
 (459, 1.2),
 (75, 1.15),
 (111, 1.15),
 (118, 1.15),
 (129, 1.15),
 (132, 1.15),
 (391, 1.15),
 (20, 1.1),
 (58, 1.1),
 (120, 1.1),
 (128, 1.1),
 (130, 1.1),
 (170, 1.1),
 (204, 1.1),
 (127, 1.0),
 (213, 1.0),
 (10, 0.95),
 (264, 0.95),
 (287, 0.95),
 (334, 0.95),
 (335, 0.95),
 (377, 0.95),
 (12, 0.9),
 (21, 0.9),
 (22, 0.9),
 (29, 0.9),
 (55, 0.9),
 (83, 0.9),
 (95, 0.9),
 (101, 0.9),
 (122, 0.9),
 (139, 0.9),
 (140, 0.9),
 (153, 0.9),
 (154, 0.9),
 (155, 0.9),
 (161, 0.9),
 (164, 0.9),
 (172, 0.9),
 (176, 0.9),
 (177, 0.9),
 (1

In [None]:
------------------

For the users whose films were removed, find the top 100 films that each of them should watch, using `UserBasedRecommendation`. Remember to tune the alpha parameter.

Using the `MAP@5`, `MAP@10` and `MAP@100` metrics, determine how effective the user-based recommender system is for this task.

In [None]:
from utils import mapk, apk

In [None]:
test_data

Unnamed: 0,userId,movieId,rating,timestamp
0,546,6495,4.0,1028129718
1,546,2241,4.0,1039880724
2,546,1874,4.5,1468681977
3,546,8021,4.0,1242992741
4,546,4401,3.5,1342849917
5,563,5124,5.0,974708761
6,563,2735,5.0,974844711
7,563,2578,3.0,974714208
8,563,2532,3.0,974843382
9,563,5008,4.0,974839307


In [None]:
list(test_data['movieId'][test_data.userId == 29])

[2698, 2700, 1045, 2646, 2329]

For the Euclidean distance metric

In [None]:
actual = []
predicted = []
adviser = UserBasedRecommendation()
adviser.fit(X)
for user_id in test_data.userId.unique():
    watched_movies = list(test_data['movieId'][test_data.userId == user_id])
    actual.append(watched_movies)
    recommended_movies = adviser.make_recommendation(user_id=user_id, n_recommendations=100)
    predicted.append(recommended_movies)

In [None]:
mapk(actual, predicted, 5)

0.01

In [None]:
mapk(actual, predicted, 10)

0.01

In [None]:
mapk(actual, predicted, 100)

0.01023529411764706

In [None]:
for i in range(10):
    for j in range(100):
      if predicted[i][j] in actual[i]:
          print('Correct! Film', predicted[i][j], 'was predicted correctly for user', i)

Correct! Film 72 was predicted correctly for user 3
Correct! Film 49 was predicted correctly for user 7


Out of 1000 recommended films (10 users with 100 recommendations each), the model got only 2 right, which is not a great result...

For the Pearson metric

In [None]:
actual = []
predicted = []
adviser = UserBasedRecommendation(metric='pearson')
adviser.fit(X)
for user_id in test_data.userId.unique():
    watched_movies = list(test_data['movieId'][test_data.userId == user_id])
    actual.append(watched_movies)
    recommended_movies = adviser.make_recommendation(user_id=user_id, n_recommendations=100)
    predicted.append(recommended_movies)

In [None]:
mapk(actual, predicted, 5)

0.02

In [None]:
mapk(actual, predicted, 10)

0.0225

In [None]:
mapk(actual, predicted, 100)

0.0225

In [None]:
for i in range(10):
    for j in range(100):
      if predicted[i][j] in actual[i]:
          print('Correct! Film', predicted[i][j], 'was predicted correctly for user', i)

Correct! Film 72 was predicted correctly for user 3
Correct! Film 49 was predicted correctly for user 7


How can the model be improved?

<b>Answer:</b> I tried different values of the n_closest_users hyperparameter; MAP@K increased after that, though only slightly. Standardising the data (scaling: centring, z-score, etc.) could also help. We also made no use of the timestamp column, which could improve the model, since people tend to watch certain films at particular times of year (New Year films before New Year, war films before 9 May, etc.). It may be worth trying other metrics, such as Spearman correlation or cosine distance, or a different model (item-based, or the baseline from the lecture notes). Another option is neural networks, deep learning and hybrid models.
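
As an illustration of the centring idea mentioned in the answer, the sketch below subtracts each user's mean rating, computed over rated items only. The helper `mean_center_sketch` is hypothetical and was not part of the experiments above, so no claim is made about how much it would change MAP@K.

```python
from typing import List, Sequence


def mean_center_sketch(ratings: Sequence[float]) -> List[float]:
    """Subtract the user's mean over rated items; unrated items (0) stay 0."""
    rated = [r for r in ratings if r != 0]
    if not rated:
        return [0.0 for _ in ratings]
    mean = sum(rated) / len(rated)
    # Caveat: a rating equal to the mean also becomes 0.0 and then looks
    # like "no rating", so a separate mask would be needed in practice.
    return [r - mean if r != 0 else 0.0 for r in ratings]


print(mean_center_sketch([5, 0, 3, 4]))  # the mean of the rated items is 4
```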