# Практическое задание Урока 2

### Тема "Бейзлайны и детерминированные алгоритмы item-item."

### Содержание домашнего задания:
- [Подключение библиотек и скриптов](#python)
- [Загрузка данных](#load_data)
- [Оценивание](#metric)
- [Задание 0](#task_0)
- [Задание 1](#task_1)
- [Задание 2](#task_2)
- [Задание 3](#task_3)
- [Задание 4](#task_4)
- [Ссылки](#linls)

----
### Подключение библиотек и скриптов<a class="anchor" id="python"></a>

In [1]:
import json
import pickle
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from scipy import spatial
from scipy.stats import pearsonr

# Для работы с матрицами
from scipy.sparse import csr_matrix, coo_matrix, save_npz, load_npz

# Детерминированные алгоритмы
from implicit.nearest_neighbours import ItemItemRecommender, CosineRecommender, TFIDFRecommender, BM25Recommender

# Метрики
from implicit.evaluation import train_test_split
from implicit.evaluation import precision_at_k, mean_average_precision_at_k, AUC_at_k, ndcg_at_k

from IPython.display import display

In [2]:
def mse(x):
    mean = x.mean()
    return np.mean((x - mean)**2)

In [3]:
def convert_text_to_int_list(df, feature):
    if isinstance(df[feature].values[0], str):
        print(f"convert '{feature}' from str to list.")
        df[feature] = df[feature].map(lambda x: x[1:-1].split(', ')).apply(lambda x: list(map(int, x)))

In [4]:
def precision_at_k_(recommended_list, bought_list, k=5):
    
    bought_list = np.array(bought_list)
    #print(bought_list)
    recommended_list = np.array(recommended_list)
    
    bought_list = bought_list  # Тут нет [:k] !!
    recommended_list = recommended_list[:k]
    
    flags = np.isin(bought_list, recommended_list)
    
    precision = flags.sum() / len(recommended_list)
    
    
    return precision

def precision_at_k_score(df, actual=None, feature=None):
    r = df.apply(lambda x: precision_at_k_(x[feature], x[actual],  5), axis=1)
    return r.mean()

In [5]:
def data_to_sparse_matrix(data_train):
    user_item_matrix = pd.pivot_table(data_train, 
                                      index='user_id', columns='item_id', 
                                      values='quantity',
                                      aggfunc='count', 
                                      fill_value=0,)

    user_item_matrix[user_item_matrix > 0] = 1        # так как в итоге хотим предсказать 
    user_item_matrix = user_item_matrix.astype(float) # необходимый тип матрицы для implicit

    userids = user_item_matrix.index.values
    itemids = user_item_matrix.columns.values

    matrix_userids = np.arange(len(userids))
    matrix_itemids = np.arange(len(itemids))

    id_to_itemid = dict(zip(matrix_itemids, itemids))
    id_to_userid = dict(zip(matrix_userids, userids))

    itemid_to_id = dict(zip(itemids, matrix_itemids))
    userid_to_id = dict(zip(userids, matrix_userids))

    # переведем в формат saprse matrix
    sparse_user_item = csr_matrix(user_item_matrix)
    
    return sparse_user_item, id_to_itemid, userid_to_id

----

![cosine_similarity.png](images/cosine_similarity.png)

In [6]:
def cosine_similarity(a, b):
    return 1 - spatial.distance.cosine(a, b) # == (a*b).sum()/np.sqrt((a**2).sum())/np.sqrt((b**2).sum())

----

#### Pearson’s correlation looks like this [[*]](#Pearson_s_correlation):

> $$ \Large sim(u,v) = \frac{\sum(r_{ui}-\overline{r}_{u})(r_{vi}-\overline{r}_{v})}
{\sqrt{\sum{(r_{ui}-\overline{r}_{u})^2}} \sqrt{\sum{(r_{vi}-\overline{r}_{v})^2}}} $$ 

#### Computing Similarity [[*]](#Pearson_s_correlation)
> Similarity computation between users is the main task in collaborative filtering algorithms. The similarity between users ( also known as the distance between users) is a mathematical method to quantify how different or similar users are to each other. For a User-User CF algorithm, similarity, $sim_{xy}$ between the users $x$ and $y$ who have both rated the same items is calculated first. To calculate this similarity different metrics are used. We will be using correlation-based similarity metrics to compute the similarity between user $x$ and user $y$, $sim_{xy}$ using **Pearson correlation**:

> $$ \Large sim_{xy,person} = \frac{\sum_{i=1}^{n}(x_i-\overline{x})(y_{i}-\overline{y})}
{\sqrt{\sum_{i=1}^{n}(x_i-\overline{x})^2} \sqrt{\sum_{i=1}^{n}(y_{i}-\overline{y})^2}} $$ 

In [7]:
def pearson_similarity(x, y):
    return pearsonr(x, y)[0]

----
### Cosine and Pearson Similarities<a class="anchor" id="cosine_pearson"></a>

In [8]:
item_names = ['Молоко', 'Йогурт', 'Мясо']
user_names = ['Мария', 'Анна', 'Глеб', 'Никита']
users_items = np.array([[1,1,0],[1,1,0],[0,0,1],[1,0,0]])
df = pd.DataFrame(data=users_items, columns=item_names, index=user_names)
df

Unnamed: 0,Молоко,Йогурт,Мясо
Мария,1,1,0
Анна,1,1,0
Глеб,0,0,1
Никита,1,0,0


In [9]:
users_items = df.values
users_items

array([[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

In [10]:
milk, yogurt = users_items[:,0], users_items[:,1]
milk, yogurt

(array([1, 1, 0, 1]), array([1, 1, 0, 0]))

In [11]:
cosine, pearson = cosine_similarity(milk, yogurt), pearson_similarity(milk, yogurt)
print(f"similarities:\n    cosine:  {cosine:.5f}\n    pearson: {pearson:.5f}")

similarities:
    cosine:  0.81650
    pearson: 0.57735


In [12]:
def similar(matrix, axis=None, id=None, similarity='cosine'):
    assert axis in [0, 1], "axis should be 0 or 1"
    assert similarity in ['cosine', 'pearson']
    if similarity == 'cosine':
        func = cosine_similarity
    else:
        func = pearson_similarity
    target = matrix[id] if axis == 0 else matrix[:,id]
    target_ = target
    return np.apply_along_axis(lambda x: func(target, x), 1 if axis == 0 else 0, matrix)

In [13]:
# похожесть с Марией
similar(users_items, axis=0, id=0, similarity='cosine')

array([1.        , 1.        , 0.        , 0.70710678])

In [14]:
# похожесть с Йогуртом
similar(users_items, axis=1, id=1, similarity='cosine')

array([0.81649658, 1.        , 0.        ])

----
##### Other Stuff

In [15]:
def sort_array(x, column=None, flip=False):
    x = x[np.argsort(x[:,column])]
    if flip:
        x = np.flip(x, axis=0)
    return x

def top_k_similar(matrix, axis=None, id=None, k=None, verbose=False, similarity='cosine'):
    assert axis in [0, 1], "axis should be 0 or 1"
    if axis == 1:
        matrix = matrix.T
    assert id is not None and id in range(matrix.shape[0])
    assert k is not None
    sims = similar(matrix, axis=0, id=id, similarity=similarity)
    ids_sims = np.array(list(zip(range(len(sims)), sims)))
    ids_sims = sort_array(ids_sims, column=1, flip=True)
    ids_sims = ids_sims[ids_sims[:,0] != id][:k]
    ids, sims = ids_sims[:,0].astype(np.int), ids_sims[:,1]
    if verbose:
        print(f"top_k_similar: axis {axis} id {id} k {k} matrix.shape {matrix.shape}")
        print(f"  ids {ids}")
        print(f"  sims {sims}")
    return ids, sims

In [16]:
df

Unnamed: 0,Молоко,Йогурт,Мясо
Мария,1,1,0
Анна,1,1,0
Глеб,0,0,1
Никита,1,0,0


In [17]:
ids, _ = top_k_similar(users_items, axis=1, id=1, k=1, similarity='cosine', verbose=True)
print(f"на {item_names[1]} похоже {item_names[int(ids[0])]}")

top_k_similar: axis 1 id 1 k 1 matrix.shape (3, 4)
  ids [0]
  sims [0.81649658]
на Йогурт похоже Молоко


In [18]:
ids, _ = top_k_similar(users_items, axis=1, id=1, k=1, similarity='pearson', verbose=True)
print(f"на {item_names[1]} похоже {item_names[int(ids[0])]}")

top_k_similar: axis 1 id 1 k 1 matrix.shape (3, 4)
  ids [0]
  sims [0.57735027]
на Йогурт похоже Молоко


In [19]:
ids, _ = top_k_similar(users_items, axis=0, id=0, k=2, similarity='cosine', verbose=True)
print(f"на {user_names[0]} похожи {user_names[int(ids[0])]} и {user_names[int(ids[1])]}")

top_k_similar: axis 0 id 0 k 2 matrix.shape (4, 3)
  ids [1 3]
  sims [1.         0.70710678]
на Мария похожи Анна и Никита


In [20]:
ids, _ = top_k_similar(users_items, axis=0, id=0, k=2, similarity='pearson', verbose=True)
print(f"на {user_names[0]} похожи {user_names[int(ids[0])]} и {user_names[int(ids[1])]}")

top_k_similar: axis 0 id 0 k 2 matrix.shape (4, 3)
  ids [1 3]
  sims [1.  0.5]
на Мария похожи Анна и Никита


In [21]:
def print_array(x):
    display(pd.DataFrame(x))

----
##### Mean Average Precision (MAP)
- https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py

In [22]:
def apk(actual, predicted, k=10):
    """
    Computes the average precision at k.

    This function computes the average prescision at k between two lists of
    items.

    Parameters
    ----------
    actual : list
             A list of elements that are to be predicted (order doesn't matter)
    predicted : list
                A list of predicted elements (order does matter)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The average precision at k over the input lists

    """
    if len(predicted)>k:
        predicted = predicted[:k]

    score = 0.0
    num_hits = 0.0

    for i,p in enumerate(predicted):
        if p in actual and p not in predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i+1.0)

    if not actual:
        return 0.0

    return score / min(len(actual), k)

def mapk(actual, predicted, k=10):
    """
    Computes the mean average precision at k.

    This function computes the mean average prescision at k between two lists
    of lists of items.

    Parameters
    ----------
    actual : list
             A list of lists of elements that are to be predicted 
             (order doesn't matter in the lists)
    predicted : list
                A list of lists of predicted elements
                (order matters in the lists)
    k : int, optional
        The maximum number of predicted elements

    Returns
    -------
    score : double
            The mean average precision at k over the input lists

    """
    return np.mean([apk(a,p,k) for a,p in zip(actual, predicted)])

----
### Загрузка данных<a class="anchor" id="load_data"></a>

In [23]:
data = pd.read_csv('./data/retail_train.csv')
data.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [24]:
users, items, interactions = data.user_id.nunique(), data.item_id.nunique(), data.shape[0]

print('# users: ', users)
print('# items: ', items)
print('# interactions: ', interactions)
interactions / (users*items)

# users:  2499
# items:  89051
# interactions:  2396804


0.010770291654185115

##### 'user_id' и 'item_id'

In [25]:
userids = data['user_id'].unique()
itemids = data['item_id'].unique()

len(userids), len(itemids)

(2499, 89051)

##### Разделение на train и test

In [26]:
test_size_weeks = 3

data_train = data[data['week_no'] < data['week_no'].max() - test_size_weeks].copy()
data_train_copy = data_train.copy()
data_test = data[data['week_no'] >= data['week_no'].max() - test_size_weeks].copy()
data_train.shape[0], data_test.shape[0]

(2278490, 118314)

In [27]:
test_users = data_test['user_id'].unique().shape[0]
new_test_users = len(set(data_test['user_id']) - set(data_train['user_id']))
print(f"В тестовом дата сете {test_users} юзеров")
print(f"В тестовом дата сете {new_test_users} новых юзеров")

В тестовом дата сете 2042 юзеров
В тестовом дата сете 0 новых юзеров


In [28]:
test_items = data_test['item_id'].unique().shape[0]
new_test_items = len(set(data_test['item_id']) - set(data_train['item_id']))
print(f"В тестовом дата сете {test_items} продуктов")
print(f"В тестовом дата сете {new_test_items} новых продуктов")

В тестовом дата сете 24329 продуктов
В тестовом дата сете 2186 новых продуктов


##### popularity и top_5000

In [29]:
popularity = data_train.groupby('item_id')['quantity'].sum().reset_index()
popularity.rename(columns={'quantity': 'n_sold'}, inplace=True)

top_5000 = popularity.sort_values('n_sold', ascending=False).head(5000).item_id.tolist()

popularity.sort_values('n_sold', ascending=False).head(3)

Unnamed: 0,item_id,n_sold
55470,6534178,190227964
55430,6533889,15978434
55465,6534166,12439291


In [30]:
seminar = pd.read_csv('./data/preds.csv')# закгрузка predict с семианара
for feature in seminar.select_dtypes(include='object').columns:
    convert_text_to_int_list(seminar, feature)
seminar.head(2)

convert 'actual' from str to list.
convert 'random_recommendation' from str to list.
convert 'popular_recommendation' from str to list.
convert 'itemitem' from str to list.
convert 'cosine' from str to list.
convert 'tfidf' from str to list.
convert 'own_purchases' from str to list.


Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]"


----
### Оценивание<a class="anchor" id="metric"></a>
За выполнени каждого задания 1 балл

4 балла -> отл

3 балла -> хор

И тд

----
### Задание 0. Товар 999999<a class="anchor" id="task_0"></a>

##### Создадим датафрейм с покупками юзеров на тестовом датасете (последние 3 недели)

In [31]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result['actual'] = result['actual'].apply(lambda x: list(x))
result_copy = result.copy()
result.head(3)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


##### 999999

In [32]:
%%time

data_train = data_train_copy.copy()

# Заведем фиктивный item_id (если юзер НЕ покупал товары из топ-5000, то он "купил" такой товар)
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

CPU times: user 492 ms, sys: 2.49 s, total: 2.98 s
Wall time: 2.99 s


In [33]:
sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

In [34]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - кол-во билжайших соседей

model.fit(sparse_user_item.T.tocsr(),  # На вход item-user matrix
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.16 s, sys: 462 ms, total: 2.62 s
Wall time: 935 ms


In [35]:
%%time
result['itemitem999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # на вход user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])
result.head(3)

CPU times: user 101 ms, sys: 2.47 ms, total: 103 ms
Wall time: 102 ms


Unnamed: 0,user_id,actual,itemitem999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]"


##### No 999999 (все 'item_id' как есть)

In [36]:
NO_999999_PATHFILE, ITEMID_PATHFILE, USERID_PATHFILE = './data/no-999999.npz', './data/no-999999-itemid.pkl', './data/no-999999-userid.pkl'

if not Path(NO_999999_PATHFILE).is_file():
    data_train = data_train_copy.copy()
    %time sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)
    save_npz(NO_999999_PATHFILE, sparse_user_item, compressed=True)
    with open(ITEMID_PATHFILE, "wb") as f:
        pickle.dump(id_to_itemid, f)
    with open(USERID_PATHFILE, "wb") as f:
        pickle.dump(userid_to_id, f)
else:
    sparse_user_item = load_npz(NO_999999_PATHFILE)
    with open(ITEMID_PATHFILE, "rb") as f:
        id_to_itemid = pickle.load(f)
    with open(USERID_PATHFILE, "rb") as f:
        userid_to_id = pickle.load(f)

In [37]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - кол-во билжайших соседей

model.fit(sparse_user_item.T.tocsr(),  # На вход item-user matrix
          show_progress=True)

  0%|          | 0/86865 [00:00<?, ?it/s]

CPU times: user 26.4 s, sys: 2.9 s, total: 29.3 s
Wall time: 11 s


In [38]:
%%time
result['itemitemNo999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # на вход user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

CPU times: user 227 ms, sys: 4.03 ms, total: 231 ms
Wall time: 232 ms


In [39]:
result.head(3)

Unnamed: 0,user_id,actual,itemitem999999,itemitemNo999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 981760, 1098066, 826249, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 1098066]"


##### Drop 999999 (только 'item_id' из top_5000)

In [40]:
%%time

data_train = data_train_copy.copy()

data_train = data_train[data_train['item_id'].isin(top_5000)]

sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

CPU times: user 3.78 s, sys: 292 ms, total: 4.07 s
Wall time: 4.08 s


Список userid для которых невозможно сделать рекомендации.

In [41]:
userids = [userid for userid in result['user_id'] if not userid in userid_to_id]
userids

[650, 729, 954, 1987, 2364]

Добавим пользователей к top_5000

In [42]:
%%time

data_train = data_train_copy.copy()

cond = (data_train['item_id'].isin(top_5000)) | (data_train['user_id'].isin(userids))
data_train = data_train[cond]

sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

CPU times: user 4.64 s, sys: 4.47 s, total: 9.11 s
Wall time: 9.11 s


In [43]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - кол-во билжайших соседей

model.fit(sparse_user_item.T.tocsr(),  # На вход item-user matrix
          show_progress=True)

  0%|          | 0/7834 [00:00<?, ?it/s]

CPU times: user 1.88 s, sys: 181 ms, total: 2.06 s
Wall time: 905 ms


In [44]:
%%time
result['itemitemDrop999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # на вход user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

CPU times: user 93.3 ms, sys: 0 ns, total: 93.3 ms
Wall time: 91.6 ms


In [45]:
result.head(3)

Unnamed: 0,user_id,actual,itemitem999999,itemitemNo999999,itemitemDrop999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 840361]","[1082185, 981760, 1127831, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 981760, 1098066, 826249, 995242]","[1082185, 981760, 1098066, 826249, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 1098066]","[1082185, 981760, 1127831, 995242, 1098066]"


##### Ответы:

- На вебинаре мы использовали товар 999999 - что это за товар? Зачем он нужен?

> "Товар 999999" - это фиктивный item_id для непопулярных товаров (если юзер НЕ покупал товары из топ-5000, то он "купил" такой товар).

- Используя этот товар мы смещаем качество рекомендаций. В какую сторону?

> В худшую сторону, но улучшаем скорость обработки.

- Можно ли удалить этот товар?

> При удалении "Товара 999999" мы будем проводить анализ только пл топ-5000 товаров.

- Уберите этот товар и сравните с качеством на семинаре.

In [46]:
for feature in [x for x in result.columns if not x in ['user_id', 'actual']]:
    mean = precision_at_k_score(result, actual='actual', feature=feature)
    print(f"{feature:24}: Mean {mean:.5f}")

itemitem999999          : Mean 0.13692
itemitemNo999999        : Mean 0.15406
itemitemDrop999999      : Mean 0.15504


----
### Задание 1. Weighted Random Recommendation<a class="anchor" id="task_1"></a>

Напишите код для случайных рекоммендаций, в которых вероятность рекомендовать товар прямо пропорциональна логарифму продаж
- Можно сэмплировать товары случайно, но пропорционально какому-либо весу
- Например, прямопропорционально популярности. вес = log(sales_sum товара)
- Придумайте пример 3 весов, посчитайте weighted_random_recommendation для разных весов

In [47]:
def weights_log_sales_volume(df):
    df = df[['item_id', 'quantity', 'sales_value']].copy()
    df['sales_sum'] = df['quantity'] * df['sales_value']
    sales = df.groupby('item_id')['sales_sum'].sum().reset_index()
    sales.sort_values('sales_sum', ascending=False, inplace=True)
    #sales['weight'] = sales['sales_sum'].apply(np.log).replace(-np.inf, 0.0)
    sales['weight'] = np.where(sales['sales_sum']<0, 0.0, sales['sales_sum'])
    sales['weight'] = sales['weight'] / sales['weight'].sum()
    sales = sales[['item_id', 'weight']]
    return sales

In [48]:
def weighted_random_recommendation(items_weights, n=5):
    """Случайные рекоммендации
    
    Input
    -----
    items_weights: pd.DataFrame
        Датафрейм со столбцами item_id, weight. Сумма weight по всем товарам = 1
    """
    
    items, weights = items_weights['item_id'].values, items_weights['weight'].values
    
    recs = np.random.choice(items, size=n, replace=False, p=weights)
    
    return recs.tolist()

Сделайте предсказания

In [49]:
%%time

items_weights = weights_log_sales_volume(data_train)

result_ = seminar.copy()

result_['weighted_random_recommendation'] = result_['user_id'].apply(lambda x: weighted_random_recommendation(items_weights, n=5))
result_.head(2)

CPU times: user 842 ms, sys: 0 ns, total: 842 ms
Wall time: 838 ms


Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534178, 6544236, 6534166, 6533889, 1426702]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6534178, 6533889, 1404121, 6544236, 1426702]"


----
### Задание 2. Расчет метрик<a class="anchor" id="task_2"></a>
- Рассчитайте Precision@5 для каждого алгоритма (с вебинара и weighted_random_recommendation) с помощью функции из вебинара 1.

In [50]:
result_.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534178, 6544236, 6534166, 6533889, 1426702]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6534178, 6533889, 1404121, 6544236, 1426702]"


In [51]:
for feature in [x for x in result_.columns if not x in ['user_id', 'actual']]:
    mean = precision_at_k_score(result_, actual='actual', feature=feature)
    print(f"{feature:32}: Mean {mean:.5f}")

random_recommendation           : Mean 0.00108
popular_recommendation          : Mean 0.15524
itemitem                        : Mean 0.13692
cosine                          : Mean 0.13291
tfidf                           : Mean 0.13898
own_purchases                   : Mean 0.17969
weighted_random_recommendation  : Mean 0.04574


- Какой алгоритм показывает лучшее качество? Почему?

##### Ответ:
- own_purchases, tfidf и popular_recommendation дают лучшее качество.
- Почему? Не могу ответить. На лекции я задавал вопрос по поводу того, какие модели (алгоритмы) используются для рекомендательных систем. Пришлите пожалуста ссылки где можно почитать какие из них лучше или хуже.

----
### Задание 3. Улучшение бейзлайнов и ItemItem<a class="anchor" id="task_3"></a>

- Попробуйте улучшить бейзлайны, считая их на топ-5000 товаров

Подсчет на топ-5000 приведен в задании 0.

- Попробуйте улучшить разные варианты ItemItemRecommender, выбирая число соседей $K$.

In [52]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result['actual'] = result['actual'].apply(lambda x: list(x))
result_copy = result.copy()
result.head(3)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


In [53]:
%%time

data_train = data_train_copy.copy()

# Заведем фиктивный item_id (если юзер НЕ покупал товары из топ-5000, то он "купил" такой товар)
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

CPU times: user 590 ms, sys: 30.2 ms, total: 620 ms
Wall time: 617 ms


In [54]:
def neighbours_recommender(result, data_train, n_neighbours=None):
    model = ItemItemRecommender(K=n_neighbours, num_threads=4) # K - кол-во билжайших соседей
    model.fit(sparse_user_item.T.tocsr(),  # На вход item-user matrix
              show_progress=False)
    result[str(n_neighbours)+'_neighbours'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # на вход user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

In [55]:
%%time
sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)
for n_neighbours in range(2, 16): #, 2):
    print(f"n_neighbours {n_neighbours}")
    neighbours_recommender(result, data_train, n_neighbours=n_neighbours)

n_neighbours 2
n_neighbours 3
n_neighbours 4
n_neighbours 5
n_neighbours 6
n_neighbours 7
n_neighbours 8
n_neighbours 9
n_neighbours 10
n_neighbours 11
n_neighbours 12
n_neighbours 13
n_neighbours 14
n_neighbours 15
CPU times: user 32.8 s, sys: 604 ms, total: 33.5 s
Wall time: 12.7 s


In [56]:
result.head(2)

Unnamed: 0,user_id,actual,2_neighbours,3_neighbours,4_neighbours,5_neighbours,6_neighbours,7_neighbours,8_neighbours,9_neighbours,10_neighbours,11_neighbours,12_neighbours,13_neighbours,14_neighbours,15_neighbours
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 995242, 1029743, 840361]","[999999, 1082185, 981760, 995242, 1029743]","[999999, 1082185, 981760, 995242, 1127831]","[999999, 1082185, 981760, 1127831, 995242]","[999999, 1082185, 981760, 1127831, 995242]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 1098066, 6534178, 826249]","[999999, 1082185, 981760, 1098066, 6534178]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 995242]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]"


In [57]:
for feature in [x for x in result.columns if not x in ['user_id', 'actual']]:
    mean = precision_at_k_score(result, actual='actual', feature=feature)
    print(f"{feature:24}: Mean {mean:.5f}")

2_neighbours            : Mean 0.19201
3_neighbours            : Mean 0.18609
4_neighbours            : Mean 0.14496
5_neighbours            : Mean 0.13692
6_neighbours            : Mean 0.14202
7_neighbours            : Mean 0.14486
8_neighbours            : Mean 0.14721
9_neighbours            : Mean 0.14848
10_neighbours           : Mean 0.15093
11_neighbours           : Mean 0.15220
12_neighbours           : Mean 0.15338
13_neighbours           : Mean 0.15318
14_neighbours           : Mean 0.15348
15_neighbours           : Mean 0.15318


<hr>

- Попробуйте стратегии ансамблирования изученных алгоритмов

In [58]:
ansamble_features = ['itemitem', 'cosine', 'tfidf']

In [59]:
result_.head(1)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534178, 6544236, 6534166, 6533889, 1426702]"


In [60]:
def merge(df, features=None):
    return np.apply_along_axis(lambda x: np.array(x.tolist()).ravel(), 1, df[features])

def ansamble_func_at_k(x, drop_999999=True, k=None):
    if drop_999999:
        x = x[x!=999999]
    unique, counts = np.unique(x, return_counts=True)
    x = np.asarray((counts, unique)).T
    x = np.flip(np.sort(x, axis=0)[:,1])
    if not k is None:
        x = x[:k]
    return list(x)

In [61]:
x = merge(result_, features=ansamble_features)
r = []
#x = np.apply_along_axis(ansamble_func_at_k, 1, x, drop_999999=True)
for x in x:
    x = ansamble_func_at_k(x, drop_999999=True)
    r.append(x)
result_['ansamble_func_at_k'] = r #x.tolist()
result_.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation,ansamble_func_at_k
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534178, 6544236, 6534166, 6533889, 1426702]","[1127831, 1098066, 1082185, 995242, 981760]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6534178, 6533889, 1404121, 6544236, 1426702]","[1098066, 1082185, 995242, 981760, 826249]"


In [62]:
for feature in [x for x in result_.columns if not x in ['user_id', 'actual']]:
    mean = precision_at_k_score(result_, actual='actual', feature=feature)
    print(f"{feature:30}: Mean {mean:.5f}")

random_recommendation         : Mean 0.00108
popular_recommendation        : Mean 0.15524
itemitem                      : Mean 0.13692
cosine                        : Mean 0.13291
tfidf                         : Mean 0.13898
own_purchases                 : Mean 0.17969
weighted_random_recommendation: Mean 0.04574
ansamble_func_at_k            : Mean 0.16383


----
### Задание 4. Улучшение детерминированных алгоритмов<a class="anchor" id="task_4"></a>

In [63]:
class Matrix:
    def __init__(self, df, index=None, columns=None, values=None, aggfunc='count'):
        self.df = df.copy()
        self.matrix = pd.pivot_table(df, index=index, columns=columns, values=values,
                                     aggfunc=aggfunc, fill_value=0,)
        self.matrix[self.matrix > 0] = 1        # так как в итоге хотим предсказать 
        self.matrix = self.matrix.astype(float) # необходимый тип матрицы для implicit
        self.sparce = csr_matrix(self.matrix)
        self.row_ids = self.matrix.index.values
        self.col_ids = self.matrix.columns.values
        self.row2ids = dict(zip(self.row_ids, np.arange(len(self.row_ids)), ))
        self.col2ids = dict(zip(self.col_ids, np.arange(len(self.col_ids))))
    def get(self):
        return self.sparce
    def id(self, tr=None, row=None, col=None):
        """ translates:
        m2r - `matrix id' to `real id'
        r2m - `real id' to `matrix id'
        """
        assert tr in ['m2r', 'r2m']
        assert not row is None or not col is None
        if tr == 'm2r':
            if not row is None:
                return self.row_ids[row]
            elif not col is None:
                return self.col_ids[col]
        else:
            if not row is None:
                return self.row2ids[row]
            elif not col is None:
                return self.col2ids[col]            
    def ids(self, tr=None, rows=None, cols=None):
        """ translates:
        m2r - `matrix IDs' to `real IDs'
        r2m - `real IDs' to `matrix IDs'
        """
        assert not rows is None or not cols is None
        if not rows is None:
            return np.array([self.id(tr=tr, row=x) for x in rows])
        elif not cols is None:
            return np.array([self.id(tr=tr, col=x) for x in cols])

In [64]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result['actual'] = result['actual'].apply(lambda x: list(x))
result_copy = result.copy()
result.head(2)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."


##### Алгоритм из вебинара (CosineSimilarity)

In [65]:
%%time

data_train = data_train_copy.copy()

# Заведем фиктивный item_id (если юзер НЕ покупал товары из топ-5000, то он "купил" такой товар)
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

CPU times: user 587 ms, sys: 40.1 ms, total: 627 ms
Wall time: 623 ms


In [66]:
matrix = Matrix(data_train, index='user_id', columns='item_id', values='quantity')

In [67]:
%%time

model = CosineRecommender(K=5, num_threads=4) # K - кол-во билжайших соседей

model.fit(matrix.get().T.tocsr(), show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 2.06 s, sys: 212 ms, total: 2.27 s
Wall time: 888 ms


In [68]:
%%time

result['cosine_webinar'] = result['user_id'].\
    apply(lambda x: [matrix.id(tr='m2r', col=rec[0]) for rec in 
                    model.recommend(userid=matrix.id(tr='r2m', row=x), 
                                    user_items=matrix.get(),   # на вход user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

CPU times: user 136 ms, sys: 667 µs, total: 136 ms
Wall time: 136 ms


In [69]:
result.head(2)

Unnamed: 0,user_id,actual,cosine_webinar
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[1082185, 999999, 981760, 1127831, 1098066]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[1082185, 1098066, 981760, 999999, 826249]"


##### На семинаре мы рассматривали:

Далее $U \equiv N_i(u) $

$$ \Large r_{u,i} =  \frac{\sum\limits_{v \in U}\operatorname{sim}(u,v)r_{v, i}}{\sum\limits_{v \in U} \operatorname{sim}(u,v)}$$
$$ \iff $$
$$ \Large r_{u,i} =  \frac{1}{S}\sum\limits_{v \in U}\operatorname{sim}(u,v)r_{v, i}$$
$$ \Large S = \sum\limits_{v \in U} \operatorname{sim}(u,v)$$

$N_i^k$ - пользователи, которые оценили товар i, k самых похожих на u пользователя

##### Алгоритм:
- *Шаг 1:* Ищем K ближайших юзеров к целевому юзеру  
- *Шаг 2*: predict "скора" товара = среднему "скору" этого товара у его соседей  
- *Шаг 3*: Сортируем товары по убыванию predict-ов и берем топ-k

In [70]:
userid, k_users = 100, 5
itemid, k_items = 840361, 5

In [71]:
print(f"userid {userid} matrix_id {matrix.id(tr='r2m', row=userid)}")
userids, sims = top_k_similar(matrix.get().toarray(), axis=0, id=matrix.id(tr='r2m', row=userid), k=k_users, verbose=True)
userids = matrix.ids(tr='m2r', rows=userids)
print(f"покупатели похожие на {userid}: {userids}\nпохожести: {sims}")

userid 100 matrix_id 99
top_k_similar: axis 0 id 99 k 5 matrix.shape (2499, 5001)
  ids [ 967 2038   18  587 1335]
  sims [0.1844662  0.1790975  0.1782881  0.17407766 0.17377009]
покупатели похожие на 100: [ 968 2039   19  588 1336]
похожести: [0.1844662  0.1790975  0.1782881  0.17407766 0.17377009]


In [72]:
print(f"itemid {itemid} matrix_id {matrix.id(tr='r2m', col=itemid)}")
itemids, sims = top_k_similar(matrix.get().toarray(), axis=1, id=matrix.id(tr='r2m', col=itemid), k=k_items, verbose=True)
itemids = matrix.ids(tr='m2r', cols=itemids)
print(f"товары похожие на {itemid}: {itemids}\nпохожести: {sims}")

itemid 840361 matrix_id 300
top_k_similar: axis 1 id 300 k 5 matrix.shape (5001, 2499)
  ids [3408 2381 2148 2307 3587]
  sims [0.72800918 0.71694575 0.6900548  0.63426835 0.6260751 ]
товары похожие на 840361: [1082185  999999  981760  995242 1098066]
похожести: [0.72800918 0.71694575 0.6900548  0.63426835 0.6260751 ]


In [73]:
def top_k_items_by_neighbours(matrix, userid, k_items=None, k_users=None, verbose=False):
    userids, sims = top_k_similar(matrix.get().toarray(), axis=0, id=matrix.id(tr='r2m', row=userid), k=k_users)
    if verbose:
        print(f"top {k_users} пользователей похожих на {matrix.ids(tr='m2r', rows=userids)}:")
        print(f"  пользователи: {userids}\n  похожести: {sims}")
    neighbours = userids
    # items_scores = matrix.get().toarray()[neighbours].sum(axis=0)
    items_scores = (sims[:,None]*matrix.get().toarray()[neighbours]).sum(axis=0)/sims.sum()
    ids_items_scores = np.array(list(zip(range(len(items_scores)), items_scores)))
    ids_items_scores = sort_array(ids_items_scores, column=1, flip=True)
    ids_items_scores = ids_items_scores[:k_items]
    items, scores = ids_items_scores[:,0].astype(np.int), ids_items_scores[:,1]
    items = matrix.ids(tr='m2r', cols=items)
    if verbose:
        print(f"top {k_items} товаров, рекомендованных для пользователя {userid}:")
        print(f"  товары: {items}\n  scores {scores}")
    return itemids

top_k_items_by_neighbours(matrix, userid, k_items=k_items, k_users=k_users, verbose=True);

top 5 пользователей похожих на [ 968 2039   19  588 1336]:
  пользователи: [ 967 2038   18  587 1335]
  похожести: [0.1844662  0.1790975  0.1782881  0.17407766 0.17377009]
top 5 товаров, рекомендованных для пользователя 100:
  товары: [ 849843 1098066 9527290 1068719 5569230]
  scores [1. 1. 1. 1. 1.]


In [75]:
%%time
print(f"k_items {k_items} k_users {k_users}")
result['my_recommender_1'] = result['user_id'].\
    apply(lambda userid: top_k_items_by_neighbours(matrix, userid, k_items=k_items, k_users=k_users))

k_items 5 k_users 5
CPU times: user 7min 49s, sys: 6min 8s, total: 13min 58s
Wall time: 13min 58s


##### Предлагается улучшить эту формулу и учесть средние предпочтения `всех пользователей`:

$$  \Large r_{u,i} = \mu + \bar{r_u} + \frac{1}{S}\sum\limits_{v \in U}\operatorname{sim}(u,v)(r_{v, i}-\bar{r_{v}} - \mu)$$

In [78]:
def top_k_items_by_neighbours_2(matrix, userid, k_items=None, k_users=None, verbose=False):
    userids, sims = top_k_similar(matrix.get().toarray(), axis=0, id=matrix.id(tr='r2m', row=userid), k=k_users)
    if verbose:
        print(f"top {k_users} пользователей похожих на {matrix.ids(tr='m2r', rows=userids)}:")
        print(f"  пользователи: {userids}\n  похожести: {sims}")
    neighbours = userids
    r_vi = matrix.get().toarray()[neighbours]
    r_v = matrix.get().toarray()[neighbours].mean(axis=0)
    mu = matrix.get().toarray().mean(axis=0)
    k = r_vi - r_v - mu
    items_scores = (sims[:,None]*k).sum(axis=0)
    ids_items_scores = np.array(list(zip(range(len(items_scores)), items_scores)))
    ids_items_scores = sort_array(ids_items_scores, column=1, flip=True)
    ids_items_scores = ids_items_scores[:k_items]
    items, scores = ids_items_scores[:,0].astype(np.int), ids_items_scores[:,1]
    items = matrix.ids(tr='m2r', cols=items)
    if verbose:
        print(f"top {k_items} товаров, рекомендованных для пользователя {userid}:")
        print(f"  товары: {items}\n  scores {scores}")
    return itemids

top_k_items_by_neighbours_2(matrix, userid, k_items=k_items, k_users=k_users, verbose=True);

top 5 пользователей похожих на [ 968 2039   19  588 1336]:
  пользователи: [ 967 2038   18  587 1335]
  похожести: [0.1844662  0.1790975  0.1782881  0.17407766 0.17377009]
top 5 товаров, рекомендованных для пользователя 100:
  товары: [1087362  984314 1049920 1388206 2690723]
  scores [ 0.00082993  0.00011789 -0.00023813 -0.00035602 -0.00035602]


In [79]:
%%time
print(f"k_items {k_items} k_users {k_users}")
result['my_recommender_2'] = result['user_id'].\
    apply(lambda userid: top_k_items_by_neighbours_2(matrix, userid, k_items=k_items, k_users=k_users))

k_items 5 k_users 5
CPU times: user 10min 16s, sys: 12min 15s, total: 22min 32s
Wall time: 22min 32s


##### Ответы:

- Какие смысл имееют $ \mu $ и $ \bar{r_u}$ ?
  - $ \mu $ - среднее по всем пользователям.
  - $ \bar{r_u}$ - среднее по K пользователям.

- Реализуйте алгоритм, прогнозирующий рейтинги на основе данной формулы, на numpy (векторизованно!)
  - Выше в тетрадке.

- В качестве схожести возьмите CosineSimilarity.

- Примените к user_item_matrix. В качестве рейтингов возьмите количество или стоимость купленного товара. 

- Данный алгоритм предсказывает рейтинги. Как на основании предсказанных рейтингов предсказать факт покупки Предложите вариант.

- Посчитайте accuracy@5 и сравните с алгоритмами, разобранными на вебинаре.

##### Примечания:
- Не удалось (пока не успел) разобраться почему `my_recommender_1` и `my_recommender_2` выдают одинаковые результаты, несмотря на то, что логика рассчетов разная.
- Не удалось найти в Интернете что значит Accuracy@K поэтому использовал MAP@K вместо Accuracy@K.

In [80]:
result.head(3)

Unnamed: 0,user_id,actual,cosine_webinar,my_recommender_1,my_recommender_2
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 999999, 981760, 995242, 1098066]","[1082185, 999999, 981760, 995242, 1098066]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 999999, 981760, 995242, 1098066]","[1082185, 999999, 981760, 995242, 1098066]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 999999, 981760, 995242, 1098066]","[1082185, 999999, 981760, 995242, 1098066]"


In [81]:
K = 5
actual = result['actual'].values
for feature in [x for x in result.columns if not x in ['user_id', 'actual']]:
    predicted = result[feature].values
    score_ = mapk(actual, predicted, k=K)
    mean = precision_at_k_score(result, actual='actual', feature=feature)
    print(f"{feature:24}: MAP@{K} {score_:.5f} Mean {mean:.5f}")

cosine_webinar          : MAP@5 0.10513 Mean 0.13291
my_recommender_1        : MAP@5 0.10328 Mean 0.14143
my_recommender_2        : MAP@5 0.10328 Mean 0.14143


### Ссылки:<a class="anchor" id="linls"></a>
- Коэффициент Жаккара https://ru.wikipedia.org/wiki/Коэффициент_Жаккара https://en.wikipedia.org/wiki/Jaccard_index
- Pearson correlation coefficient https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
<a class="anchor" id="Pearson_s_correlation"></a>
- USER-USER Collaborative filtering Recommender System in Python https://medium.com/@tomar.ankur287/user-user-collaborative-filtering-recommender-system-51f568489727
- User-User Collaborative Filtering For Jokes Recommendation https://towardsdatascience.com/user-user-collaborative-filtering-for-jokes-recommendation-b6b1e4ec8642
- Calculating Pearson Correlation Coefficient in Python with Numpy https://stackabuse.com/calculating-pearson-correlation-coefficient-in-python-with-numpy/
- Recommendation Systems: Collaborative Filtering just with numpy and pandas, A-Z https://medium.com/@sam.mail2me/recommendation-systems-collaborative-filtering-just-with-numpy-and-pandas-a-z-fa9868a95da2

- How TFIDF scoring in Content Based Recommender works https://medium.com/@shengyuchen/how-tfidf-scoring-in-content-based-recommender-works-5791e36ee8da
- Building a movie content based recommender using tf-idf https://towardsdatascience.com/content-based-recommender-systems-28a1dbd858f5
- Using TF.IDF for article tag recommender systems in Python https://medium.com/@shaswatlenka/using-tf-idf-for-article-tag-recommender-systems-in-python-d1cf74e28b6a
- Recommender Engine — Under The Hood https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072
- An In-Depth Introduction to Sparse Matrix https://medium.com/swlh/an-in-depth-introduction-to-sparse-matrix-a5972d7e8c86
- Recommending GitHub Repositories with Google BigQuery and the implicit library https://towardsdatascience.com/recommending-github-repositories-with-google-bigquery-and-the-implicit-library-e6cce666c77
- evaluation.ipynb recommendations.ipynb https://gist.github.com/jbochi/2e8ddcc5939e70e5368326aa034a144e
- How can I measure the accuracy of a recommender system? https://www.quora.com/How-can-I-measure-the-accuracy-of-a-recommender-system
- What you wanted to know about Mean Average Precision http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/
- A Survey of Accuracy Evaluation Metrics of Recommendation Tasks https://www.jmlr.org/papers/volume10/gunawardana09a/gunawardana09a.pdf
- Evaluation Metrics for Recommender Systems https://github.com/statisticianinstilettos/recmetrics https://towardsdatascience.com/evaluation-metrics-for-recommender-systems-df56c6611093
- EVALUATION METRICS https://github.com/benhamner/Metrics