# Lesson 2 practical assignment

### Topic: "Baselines and deterministic item-item algorithms."

### Homework contents:
- [Importing libraries and scripts](#python)
- [Loading data](#load_data)
- [Grading](#metric)
- [Task 0](#task_0)
- [Task 1](#task_1)
- [Task 2](#task_2)
- [Task 3](#task_3)
- [Task 4](#task_4)
- [References](#linls)

### Importing libraries and scripts<a class="anchor" id="python"></a>

In [1]:
import json
import pickle
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Sparse matrix utilities
from scipy.sparse import csr_matrix, coo_matrix, save_npz, load_npz

# Deterministic algorithms
from implicit.nearest_neighbours import ItemItemRecommender, CosineRecommender, TFIDFRecommender, BM25Recommender

# Metrics
from implicit.evaluation import train_test_split
from implicit.evaluation import precision_at_k, mean_average_precision_at_k, AUC_at_k, ndcg_at_k

from IPython.display import display

In [2]:
def mse(x):
    # Mean squared deviation from the mean, i.e. the (population) variance of x
    mean = x.mean()
    return np.mean((x - mean)**2)

In [3]:
def convert_text_to_int_list(df, feature):
    if isinstance(df[feature].values[0], str):
        print(f"convert '{feature}' from str to list.")
        df[feature] = df[feature].map(lambda x: x[1:-1].split(', ')).apply(lambda x: list(map(int, x)))
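
The slicing approach above assumes every cell looks like `[1, 2, 3]` with `', '` separators; an empty list `[]` would fail in `int('')`. A minimal alternative sketch, assuming the stored strings are valid JSON arrays (`parse_int_list` is a hypothetical helper, not part of the notebook):

```python
import json

def parse_int_list(text):
    """Parse a string such as '[1, 2, 3]' into a list of ints."""
    return [int(value) for value in json.loads(text)]

parsed = parse_int_list("[821867, 834484, 856942]")
empty = parse_int_list("[]")  # an empty list is handled without errors
```

It could be applied column-wise with `df[feature].map(parse_int_list)`.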

In [4]:
def precision_at_k_(recommended_list, bought_list, k=5):
    """Precision@k: share of the top-k recommendations that were bought."""
    bought_list = np.array(bought_list)            # note: NOT truncated to [:k]
    recommended_list = np.array(recommended_list)[:k]

    # Count the bought items that appear among the top-k recommendations
    flags = np.isin(bought_list, recommended_list)

    return flags.sum() / len(recommended_list)

def score(df, actual=None, feature=None):
    # Mean precision@5 over users and its spread across users
    r = df.apply(lambda x: precision_at_k_(x[feature], x[actual],  5), axis=1)
    return r.mean(), mse(r)
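
As a quick sanity check of the metric, here is a small worked example on made-up lists. The function is a standalone sketch (the name `precision_at_k` and the sample data are illustrative). It counts hits over the recommended items, which gives the same value as the notebook's version when neither list contains duplicates.

```python
import numpy as np

def precision_at_k(recommended, bought, k=5):
    """Share of the top-k recommended items that were actually bought."""
    top_k = np.asarray(recommended)[:k]
    hits = np.isin(top_k, np.asarray(bought))
    return hits.sum() / len(top_k)

recommended = [999999, 1082185, 981760, 1127831, 995242]
bought = [1082185, 995242, 840361]

p_at_5 = precision_at_k(recommended, bought, k=5)  # 2 hits out of 5
p_at_2 = precision_at_k(recommended, bought, k=2)  # 1 hit out of 2
```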

In [5]:
def data_to_sparse_matrix(data_train):
    user_item_matrix = pd.pivot_table(data_train, 
                                      index='user_id', columns='item_id', 
                                      values='quantity',
                                      aggfunc='count', 
                                      fill_value=0,)

    user_item_matrix[user_item_matrix > 0] = 1        # binarize: we only need the fact of a purchase
    user_item_matrix = user_item_matrix.astype(float) # matrix dtype required by implicit

    userids = user_item_matrix.index.values
    itemids = user_item_matrix.columns.values

    matrix_userids = np.arange(len(userids))
    matrix_itemids = np.arange(len(itemids))

    id_to_itemid = dict(zip(matrix_itemids, itemids))
    id_to_userid = dict(zip(matrix_userids, userids))

    itemid_to_id = dict(zip(itemids, matrix_itemids))
    userid_to_id = dict(zip(userids, matrix_userids))

    # convert to sparse matrix format
    sparse_user_item = csr_matrix(user_item_matrix)
    
    return sparse_user_item, id_to_itemid, userid_to_id
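
The function above builds a dense pivot table before converting it to CSR. A possible alternative, sketched here on made-up IDs, builds the binary user-item matrix directly from index arrays. This illustrates the same idea under the assumption that only the fact of a purchase matters; it is not the notebook's implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical interaction log: one (user_id, item_id) pair per purchase row
user_ids = np.array([2375, 2375, 1364, 1364, 1364])
item_ids = np.array([1004906, 1033142, 1004906, 1004906, 999999])

# Map raw IDs to consecutive row/column indices (sorted by ID)
users, user_idx = np.unique(user_ids, return_inverse=True)
items, item_idx = np.unique(item_ids, return_inverse=True)

# Duplicate (row, col) pairs are summed on construction, then binarized
matrix = csr_matrix((np.ones(len(user_ids)), (user_idx, item_idx)),
                    shape=(len(users), len(items)))
matrix.data[:] = 1.0

id_to_itemid = dict(enumerate(items.tolist()))
userid_to_id = {uid: i for i, uid in enumerate(users.tolist())}
```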

### Загрузка данных<a class="anchor" id="load_data"></a>

In [6]:
data = pd.read_csv('./data/retail_train.csv')
data.head(2)

Unnamed: 0,user_id,basket_id,day,item_id,quantity,sales_value,store_id,retail_disc,trans_time,week_no,coupon_disc,coupon_match_disc
0,2375,26984851472,1,1004906,1,1.39,364,-0.6,1631,1,0.0,0.0
1,2375,26984851472,1,1033142,1,0.82,364,0.0,1631,1,0.0,0.0


In [7]:
users, items, interactions = data.user_id.nunique(), data.item_id.nunique(), data.shape[0]

print('# users: ', users)
print('# items: ', items)
print('# interactions: ', interactions)
interactions / (users*items)

# users:  2499
# items:  89051
# interactions:  2396804


0.010770291654185115

### 'user_id' and 'item_id'

In [8]:
userids = data['user_id'].unique()
itemids = data['item_id'].unique()

len(userids), len(itemids)

(2499, 89051)

##### Train/test split

In [9]:
test_size_weeks = 3

data_train = data[data['week_no'] < data['week_no'].max() - test_size_weeks].copy()
data_train_copy = data_train.copy()
data_test = data[data['week_no'] >= data['week_no'].max() - test_size_weeks].copy()
data_train.shape[0], data_test.shape[0]

(2278490, 118314)
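
One detail of the split is worth checking: the condition `week_no >= max - test_size_weeks` is inclusive on both ends, so it selects `test_size_weeks + 1` distinct week numbers. The sketch below uses hypothetical week numbers 1..95 (the real range is not shown in this notebook). Whether three or four weeks were intended is not stated in the source.

```python
import numpy as np

week_no = np.arange(1, 96)   # hypothetical week numbers 1..95
test_size_weeks = 3
cutoff = week_no.max() - test_size_weeks

inclusive_test_weeks = week_no[week_no >= cutoff]  # condition used in the split
strict_test_weeks = week_no[week_no > cutoff]      # variant with exactly 3 weeks
```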

In [10]:
test_users = data_test['user_id'].unique().shape[0]
new_test_users = len(set(data_test['user_id']) - set(data_train['user_id']))
print(f"The test set contains {test_users} users")
print(f"The test set contains {new_test_users} new users")

The test set contains 2042 users
The test set contains 0 new users


In [11]:
test_items = data_test['item_id'].unique().shape[0]
new_test_items = len(set(data_test['item_id']) - set(data_train['item_id']))
print(f"The test set contains {test_items} items")
print(f"The test set contains {new_test_items} new items")

The test set contains 24329 items
The test set contains 2186 new items


##### popularity and top_5000

In [12]:
popularity = data_train.groupby('item_id')['quantity'].sum().reset_index()
popularity.rename(columns={'quantity': 'n_sold'}, inplace=True)

top_5000 = popularity.sort_values('n_sold', ascending=False).head(5000).item_id.tolist()

popularity.sort_values('n_sold', ascending=False).head(3)

Unnamed: 0,item_id,n_sold
55470,6534178,190227964
55430,6533889,15978434
55465,6534166,12439291


In [13]:
seminar = pd.read_csv('./data/preds.csv')  # load the predictions from the seminar
for feature in seminar.select_dtypes(include='object').columns:
    convert_text_to_int_list(seminar, feature)
seminar.head(2)

convert 'actual' from str to list.
convert 'random_recommendation' from str to list.
convert 'popular_recommendation' from str to list.
convert 'itemitem' from str to list.
convert 'cosine' from str to list.
convert 'tfidf' from str to list.
convert 'own_purchases' from str to list.


Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]"


### Grading<a class="anchor" id="metric"></a>
Each completed task is worth 1 point

4 points -> excellent

3 points -> good

And so on

### Task 0. Item 999999<a class="anchor" id="task_0"></a>

##### Build a dataframe with users' purchases in the test set (the last 3 weeks)

In [14]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result['actual'] = result['actual'].apply(lambda x: list(x))
result_copy = result.copy()
result.head(3)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


##### 999999

In [15]:
%%time

data_train = data_train_copy.copy()

# Introduce a fictitious item_id: every purchase of an item outside the top-5000 is recorded as a purchase of item 999999
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

CPU times: user 460 ms, sys: 351 ms, total: 811 ms
Wall time: 808 ms


In [16]:
sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

In [17]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - number of nearest neighbours

model.fit(sparse_user_item.T.tocsr(),  # input: item-user matrix
          show_progress=True)

  0%|          | 0/5001 [00:00<?, ?it/s]

CPU times: user 1.99 s, sys: 255 ms, total: 2.24 s
Wall time: 842 ms


In [18]:
%%time
result['itemitem999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])
result.head(3)

CPU times: user 83.5 ms, sys: 5.35 ms, total: 88.8 ms
Wall time: 87.2 ms


Unnamed: 0,user_id,actual,itemitem999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]"


##### No 999999 (all 'item_id' values as is)

In [19]:
NO_999999_PATHFILE, ITEMID_PATHFILE, USERID_PATHFILE = './data/no-999999.npz', './data/no-999999-itemid.pkl', './data/no-999999-userid.pkl'

if not Path(NO_999999_PATHFILE).is_file():
    data_train = data_train_copy.copy()
    %time sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)
    save_npz(NO_999999_PATHFILE, sparse_user_item, compressed=True)
    with open(ITEMID_PATHFILE, "wb") as f:
        pickle.dump(id_to_itemid, f)
    with open(USERID_PATHFILE, "wb") as f:
        pickle.dump(userid_to_id, f)
else:
    sparse_user_item = load_npz(NO_999999_PATHFILE)
    with open(ITEMID_PATHFILE, "rb") as f:
        id_to_itemid = pickle.load(f)
    with open(USERID_PATHFILE, "rb") as f:
        userid_to_id = pickle.load(f)

In [20]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - number of nearest neighbours

model.fit(sparse_user_item.T.tocsr(),  # input: item-user matrix
          show_progress=True)

  0%|          | 0/86865 [00:00<?, ?it/s]

CPU times: user 25.6 s, sys: 1.86 s, total: 27.4 s
Wall time: 9.99 s


In [21]:
%%time
result['itemitemNo999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

CPU times: user 215 ms, sys: 5.67 ms, total: 221 ms
Wall time: 219 ms


In [22]:
result.head(3)

Unnamed: 0,user_id,actual,itemitem999999,itemitemNo999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 981760, 1098066, 826249, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 1098066]"


##### Drop 999999 (only 'item_id' values from top_5000)

In [23]:
%%time

data_train = data_train_copy.copy()

data_train = data_train[data_train['item_id'].isin(top_5000)]

sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

CPU times: user 3.47 s, sys: 307 ms, total: 3.77 s
Wall time: 3.77 s


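List of user IDs for which no recommendations can be made: these test users have no top_5000 purchases in the training data, so they are missing from userid_to_id.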

In [24]:
userids = [userid for userid in result['user_id'] if not userid in userid_to_id]
userids

[650, 729, 954, 1987, 2364]

Add these users' purchases back alongside the top_5000 items

In [25]:
%%time

data_train = data_train_copy.copy()

cond = (data_train['item_id'].isin(top_5000)) | (data_train['user_id'].isin(userids))
data_train = data_train[cond]

sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)

CPU times: user 4.21 s, sys: 1.85 s, total: 6.06 s
Wall time: 6.06 s


In [26]:
%%time

model = ItemItemRecommender(K=5, num_threads=4) # K - number of nearest neighbours

model.fit(sparse_user_item.T.tocsr(),  # input: item-user matrix
          show_progress=True)

  0%|          | 0/7834 [00:00<?, ?it/s]

CPU times: user 1.79 s, sys: 83.7 ms, total: 1.88 s
Wall time: 778 ms


In [27]:
%%time
result['itemitemDrop999999'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

CPU times: user 88.7 ms, sys: 3.33 ms, total: 92.1 ms
Wall time: 90.3 ms


In [28]:
result.head(3)

Unnamed: 0,user_id,actual,itemitem999999,itemitemNo999999,itemitemDrop999999
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 840361]","[1082185, 981760, 1127831, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 981760, 1098066, 826249, 995242]","[1082185, 981760, 1098066, 826249, 995242]"
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107...","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 981760, 1127831, 995242, 1098066]","[1082185, 981760, 1127831, 995242, 1098066]"


##### Answers:

- In the webinar we used item 999999. What is this item? Why is it needed?

> "Item 999999" is a fictitious item_id that stands in for all unpopular items: every purchase of an item outside the top-5000 is recorded as a purchase of this item.

- By using this item we bias the quality of recommendations. In which direction?

> For the worse (Precision@5 is 0.13692 with the item versus 0.15406 and 0.15504 without it), but processing becomes faster.

- Can this item be removed?

> If "Item 999999" is removed, the analysis covers only the top-5000 items.

- Remove this item and compare with the quality from the seminar.

In [29]:
for feature in [x for x in result.columns if not x in ['user_id', 'actual']]:
    mean, mse_ = score(result, actual='actual', feature=feature)
    print(f"{feature:24}: Mean {mean:.5f}")

itemitem999999          : Mean 0.13692
itemitemNo999999        : Mean 0.15406
itemitemDrop999999      : Mean 0.15504
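
A possible alternative to retraining without the fictitious item, sketched here under assumptions, is to keep it in training, request one extra recommendation, and filter it out afterwards. `drop_fake_item` is a hypothetical helper and the input list is made up. The `filter_items` argument visible in the `model.recommend` calls above may serve a similar purpose, but that is not verified here.

```python
FAKE_ITEM_ID = 999999

def drop_fake_item(recommendations, n=5, fake_id=FAKE_ITEM_ID):
    """Remove the fictitious item and keep the first n remaining items."""
    return [item for item in recommendations if item != fake_id][:n]

# Hypothetical output of a recommender asked for N + 1 = 6 items
raw = [999999, 1082185, 981760, 1127831, 995242, 840361]
clean = drop_fake_item(raw, n=5)
```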


### Task 1. Weighted Random Recommendation<a class="anchor" id="task_1"></a>

Write code for random recommendations in which the probability of recommending an item is directly proportional to the logarithm of its sales
- Items can be sampled randomly, but in proportion to some weight
- For example, in direct proportion to popularity: weight = log(sales_sum of the item)
- Come up with 3 example weights and compute weighted_random_recommendation for the different weights

In [30]:
def weights_log_sales_volume(df):
    # Note: despite the name, the weight below is proportional to sales_sum itself;
    # the log variant is commented out (log is negative for sales_sum < 1).
    df = df[['item_id', 'quantity', 'sales_value']].copy()
    df['sales_sum'] = df['quantity'] * df['sales_value']
    sales = df.groupby('item_id')['sales_sum'].sum().reset_index()
    sales.sort_values('sales_sum', ascending=False, inplace=True)
    #sales['weight'] = sales['sales_sum'].apply(np.log).replace(-np.inf, 0.0)
    sales['weight'] = np.where(sales['sales_sum']<0, 0.0, sales['sales_sum'])  # clip negative totals to zero
    sales['weight'] = sales['weight'] / sales['weight'].sum()                  # normalize so the weights sum to 1
    sales = sales[['item_id', 'weight']]
    return sales

In [31]:
def weighted_random_recommendation(items_weights, n=5):
    """Random recommendations
    
    Input
    -----
    items_weights: pd.DataFrame
        Dataframe with columns item_id, weight. The weights over all items sum to 1
    """
    
    items, weights = items_weights['item_id'].values, items_weights['weight'].values
    
    recs = np.random.choice(items, size=n, replace=False, p=weights)
    
    return recs.tolist()
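
The task asks for three example weights. Below is a minimal sketch of three candidates on made-up sales totals: linear, logarithmic, and square-root weighting. The names `normalize`, `w_linear`, `w_log`, `w_sqrt` and the sample numbers are illustrative assumptions. `np.log1p` is used instead of `np.log` so that totals below 1 do not produce negative weights, which `np.random.choice` would reject as probabilities.

```python
import numpy as np

def normalize(raw):
    """Clip negative values to zero and scale so the weights sum to 1."""
    raw = np.clip(np.asarray(raw, dtype=float), 0.0, None)
    return raw / raw.sum()

# Hypothetical per-item sales totals
sales_sum = np.array([1000.0, 100.0, 10.0, 0.5])
items = np.array([101, 102, 103, 104])  # hypothetical item ids

w_linear = normalize(sales_sum)          # proportional to sales
w_log = normalize(np.log1p(sales_sum))   # proportional to log(1 + sales)
w_sqrt = normalize(np.sqrt(sales_sum))   # a middle ground between the two

# Sample without replacement, as in weighted_random_recommendation
rng = np.random.default_rng(0)
recs = rng.choice(items, size=2, replace=False, p=w_log).tolist()
```

The logarithm flattens the distribution, so the best seller dominates the samples far less than under linear weighting.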

Make predictions

In [32]:
%%time

items_weights = weights_log_sales_volume(data_train)

result_ = seminar.copy()

result_['weighted_random_recommendation'] = result_['user_id'].apply(lambda x: weighted_random_recommendation(items_weights, n=5))
result_.head(2)

CPU times: user 750 ms, sys: 4.54 ms, total: 755 ms
Wall time: 748 ms


Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534166, 6534178, 6544236, 5703832, 6533889]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6544236, 6534178, 6533889, 6534166, 1404121]"


### Task 2. Computing metrics<a class="anchor" id="task_2"></a>
- Compute Precision@5 for every algorithm (those from the webinar and weighted_random_recommendation) using the function from webinar 1.

In [33]:
result_.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534166, 6534178, 6544236, 5703832, 6533889]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6544236, 6534178, 6533889, 6534166, 1404121]"


In [34]:
for feature in [x for x in result_.columns if not x in ['user_id', 'actual']]:
    mean, mse_ = score(result_, actual='actual', feature=feature)
    print(f"{feature:32}: Mean {mean:.5f}")

random_recommendation           : Mean 0.00108
popular_recommendation          : Mean 0.15524
itemitem                        : Mean 0.13692
cosine                          : Mean 0.13291
tfidf                           : Mean 0.13898
own_purchases                   : Mean 0.17969
weighted_random_recommendation  : Mean 0.04603


- Which algorithm shows the best quality? Why?

##### Answer:
- own_purchases (0.17969), popular_recommendation (0.15524), and tfidf (0.13898) give the best quality.
- Why? I cannot answer. In the lecture I asked which models (algorithms) are used for recommender systems. Please send links where I can read about which of them perform better or worse.

### Task 3. Improving the baselines and ItemItem<a class="anchor" id="task_3"></a>

- Try to improve the baselines by computing them on the top-5000 items

The computation on the top-5000 items is given in Task 0.

- Try to improve the different ItemItemRecommender variants by choosing the number of neighbours $K$.

In [35]:
result = data_test.groupby('user_id')['item_id'].unique().reset_index()
result.columns=['user_id', 'actual']
result['actual'] = result['actual'].apply(lambda x: list(x))
result_copy = result.copy()
result.head(3)

Unnamed: 0,user_id,actual
0,1,"[821867, 834484, 856942, 865456, 889248, 90795..."
1,3,"[835476, 851057, 872021, 878302, 879948, 90963..."
2,6,"[920308, 926804, 946489, 1006718, 1017061, 107..."


In [36]:
%%time

data_train = data_train_copy.copy()

# Introduce a fictitious item_id: every purchase of an item outside the top-5000 is recorded as a purchase of item 999999
data_train.loc[~data_train['item_id'].isin(top_5000), 'item_id'] = 999999

CPU times: user 529 ms, sys: 66 ms, total: 595 ms
Wall time: 584 ms


In [37]:
def neighbours_recommender(result, data_train, n_neighbours=None):
    # Note: relies on the globals sparse_user_item, id_to_itemid and userid_to_id;
    # the data_train argument is not used inside the function.
    model = ItemItemRecommender(K=n_neighbours, num_threads=4) # K - number of nearest neighbours
    model.fit(sparse_user_item.T.tocsr(),  # input: item-user matrix
              show_progress=False)
    result[str(n_neighbours)+'_neighbours'] = result['user_id'].\
    apply(lambda x: [id_to_itemid[rec[0]] for rec in 
                    model.recommend(userid=userid_to_id[x], 
                                    user_items=sparse_user_item.tocsr(),   # input: user-item matrix
                                    N=5, 
                                    filter_already_liked_items=False, 
                                    filter_items=None, 
                                    recalculate_user=True)])

In [38]:
%%time
sparse_user_item, id_to_itemid, userid_to_id = data_to_sparse_matrix(data_train)
for n_neighbours in range(2, 16):
    print(f"n_neighbours {n_neighbours}")
    neighbours_recommender(result, data_train, n_neighbours=n_neighbours)

n_neighbours 2
n_neighbours 3
n_neighbours 4
n_neighbours 5
n_neighbours 6
n_neighbours 7
n_neighbours 8
n_neighbours 9
n_neighbours 10
n_neighbours 11
n_neighbours 12
n_neighbours 13
n_neighbours 14
n_neighbours 15
CPU times: user 30.7 s, sys: 567 ms, total: 31.3 s
Wall time: 11.7 s


In [39]:
result.head(2)

Unnamed: 0,user_id,actual,2_neighbours,3_neighbours,4_neighbours,5_neighbours,6_neighbours,7_neighbours,8_neighbours,9_neighbours,10_neighbours,11_neighbours,12_neighbours,13_neighbours,14_neighbours,15_neighbours
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[999999, 1082185, 995242, 1029743, 840361]","[999999, 1082185, 981760, 995242, 1029743]","[999999, 1082185, 981760, 995242, 1127831]","[999999, 1082185, 981760, 1127831, 995242]","[999999, 1082185, 981760, 1127831, 995242]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]","[999999, 1082185, 981760, 995242, 840361]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[999999, 1082185, 1098066, 6534178, 826249]","[999999, 1082185, 981760, 1098066, 6534178]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 995242]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]","[999999, 1082185, 981760, 1098066, 826249]"


In [40]:
for feature in [x for x in result.columns if not x in ['user_id', 'actual']]:
    mean, mse_ = score(result, actual='actual', feature=feature)
    print(f"{feature:24}: Mean {mean:.5f}")

2_neighbours            : Mean 0.19201
3_neighbours            : Mean 0.18609
4_neighbours            : Mean 0.14496
5_neighbours            : Mean 0.13692
6_neighbours            : Mean 0.14202
7_neighbours            : Mean 0.14486
8_neighbours            : Mean 0.14721
9_neighbours            : Mean 0.14848
10_neighbours           : Mean 0.15093
11_neighbours           : Mean 0.15220
12_neighbours           : Mean 0.15338
13_neighbours           : Mean 0.15318
14_neighbours           : Mean 0.15348
15_neighbours           : Mean 0.15318


<hr>

- Try ensembling strategies for the algorithms studied

In [41]:
ansamble_features = ['itemitem', 'cosine', 'tfidf']

In [42]:
result_.head(1)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534166, 6534178, 6544236, 5703832, 6533889]"


In [43]:
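def merge(df, features=None):
    # Concatenate the recommendation lists from the given columns into one flat row per user
    return np.apply_along_axis(lambda x: np.array(x.tolist()).ravel(), 1, df[features])

def ansamble_func_at_k(x, drop_999999=True, k=None):
    if drop_999999:
        x = x[x!=999999]
    unique, counts = np.unique(x, return_counts=True)
    x = np.asarray((counts, unique)).T
    # Caution: np.sort(axis=0) sorts each column independently, so the result is
    # the unique item ids in descending id order, not items ranked by vote count.
    x = np.flip(np.sort(x, axis=0)[:,1])
    if not k is None:
        x = x[:k]
    return list(x)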
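
If the intent is to rank items by how many algorithms recommended them, one way to do it is sketched below. This is an assumption about the intended behaviour, not the code that produced the scores in this notebook; `vote_ensemble` is a hypothetical helper. Ties in vote count are broken by the best position the item reached in any list. The three input lists are the `itemitem`, `cosine`, and `tfidf` recommendations for user 1 from the table above.

```python
from collections import Counter

def vote_ensemble(rec_lists, k=5, drop_ids=(999999,)):
    """Rank items by number of votes, then by best (lowest) position."""
    votes = Counter()
    best_rank = {}
    for recs in rec_lists:
        for rank, item in enumerate(recs):
            if item in drop_ids:
                continue
            votes[item] += 1
            best_rank[item] = min(best_rank.get(item, rank), rank)
    ordered = sorted(votes, key=lambda item: (-votes[item], best_rank[item]))
    return ordered[:k]

itemitem = [999999, 1082185, 981760, 1127831, 995242]
cosine = [1082185, 999999, 981760, 1127831, 1098066]
tfidf = [1082185, 981760, 1127831, 999999, 1098066]

combined = vote_ensemble([itemitem, cosine, tfidf], k=5)
```

Here 1098066 (two votes) ranks ahead of 995242 (one vote), even though its item id is larger only by coincidence.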

In [44]:
merged = merge(result_, features=ansamble_features)
result_['ansamble_func_at_k'] = [ansamble_func_at_k(row, drop_999999=True) for row in merged]
result_.head(2)

Unnamed: 0,user_id,actual,random_recommendation,popular_recommendation,itemitem,cosine,tfidf,own_purchases,weighted_random_recommendation,ansamble_func_at_k
0,1,"[821867, 834484, 856942, 865456, 889248, 90795...","[391418, 935908, 12734426, 46957, 1092219]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1127831, 995242]","[1082185, 999999, 981760, 1127831, 1098066]","[1082185, 981760, 1127831, 999999, 1098066]","[999999, 1082185, 1029743, 995785, 1004906]","[6534166, 6534178, 6544236, 5703832, 6533889]","[1127831, 1098066, 1082185, 995242, 981760]"
1,3,"[835476, 851057, 872021, 878302, 879948, 90963...","[5592993, 1031930, 7466687, 15660158, 9836679]","[6534178, 6533889, 1029743, 6534166, 1082185]","[999999, 1082185, 981760, 1098066, 995242]","[1082185, 1098066, 981760, 999999, 826249]","[1082185, 981760, 1098066, 826249, 999999]","[999999, 1082185, 1098066, 6534178, 1127831]","[6544236, 6534178, 6533889, 6534166, 1404121]","[1098066, 1082185, 995242, 981760, 826249]"


In [45]:
for feature in [x for x in result_.columns if not x in ['user_id', 'actual']]:
    mean, mse_ = score(result_, actual='actual', feature=feature)
    print(f"{feature:30}: Mean {mean:.5f}")

random_recommendation         : Mean 0.00108
popular_recommendation        : Mean 0.15524
itemitem                      : Mean 0.13692
cosine                        : Mean 0.13291
tfidf                         : Mean 0.13898
own_purchases                 : Mean 0.17969
weighted_random_recommendation: Mean 0.04603
ansamble_func_at_k            : Mean 0.16383


### Task 4. Improving deterministic algorithms<a class="anchor" id="task_4"></a>
> # NOT DONE

At the seminar we considered the following prediction rule.

Below, $U \equiv N_i(u)$

$$r_{u,i} =  \frac{1}{S}\sum\limits_{v \in U}\operatorname{sim}(u,v)r_{v, i}$$
$$ S = \sum\limits_{v \in U} \operatorname{sim}(u,v)$$

It is proposed to improve this formula by taking the average preferences of all users into account

$$r_{u,i} = \mu + \bar{r_u} + \frac{1}{S}\sum\limits_{v \in U}\operatorname{sim}(u,v)(r_{v, i}-\bar{r_{v}} - \mu)$$

- What do $ \mu $ and $ \bar{r_u}$ mean?

- Implement an algorithm that predicts ratings from this formula in numpy (vectorized!)

- Use CosineSimilarity as the similarity measure.

- Apply it to user_item_matrix. Use the quantity or the cost of the purchased item as the rating.

- This algorithm predicts ratings. How can the fact of a purchase be predicted from the predicted ratings? Suggest an approach.

- Compute accuracy@5 and compare with the algorithms covered in the webinar.
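
A minimal vectorized sketch of the improved formula on a small dense matrix. It rests on several assumptions that the task leaves open: $\mu$ is taken as the global mean rating, $\bar{r_u}$ as user $u$'s average deviation from $\mu$, and the neighbourhood $U$ is simplified to all other users. `predict_ratings` is a hypothetical name. One possible way to turn ratings into purchase predictions, also only a suggestion, is to recommend the top-N items by predicted rating.

```python
import numpy as np

def predict_ratings(R):
    """Mean-centered user-based prediction with cosine similarity."""
    R = np.asarray(R, dtype=float)
    mu = R.mean()                          # assumed: global mean rating
    user_bias = R.mean(axis=1) - mu        # assumed: r_u as deviation from mu
    centered = R - user_bias[:, None] - mu # r_vi - r_v - mu

    # Cosine similarity between user rows
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                # avoid division by zero
    unit = R / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)             # a user is not their own neighbour

    S = sim.sum(axis=1, keepdims=True)
    S[S == 0] = 1.0
    return mu + user_bias[:, None] + (sim @ centered) / S

# Toy example: two users with identical purchase quantities
R = [[2, 0, 1],
     [2, 0, 1]]
pred = predict_ratings(R)
top_2 = np.argsort(-pred, axis=1)[:, :2]   # indices of the 2 best items per user
```

For real data the matrix is sparse and far larger, so a production version would need sparse operations and a restricted neighbourhood.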

# References:<a class="anchor" id="linls"></a>
- Jaccard index https://ru.wikipedia.org/wiki/Коэффициент_Жаккара https://en.wikipedia.org/wiki/Jaccard_index
- How TFIDF scoring in Content Based Recommender works https://medium.com/@shengyuchen/how-tfidf-scoring-in-content-based-recommender-works-5791e36ee8da
- Building a movie content based recommender using tf-idf https://towardsdatascience.com/content-based-recommender-systems-28a1dbd858f5
- Using TF.IDF for article tag recommender systems in Python https://medium.com/@shaswatlenka/using-tf-idf-for-article-tag-recommender-systems-in-python-d1cf74e28b6a
- Recommender Engine — Under The Hood https://towardsdatascience.com/recommender-engine-under-the-hood-7869d5eab072
- An In-Depth Introduction to Sparse Matrix https://medium.com/swlh/an-in-depth-introduction-to-sparse-matrix-a5972d7e8c86
- Recommending GitHub Repositories with Google BigQuery and the implicit library https://towardsdatascience.com/recommending-github-repositories-with-google-bigquery-and-the-implicit-library-e6cce666c77
- evaluation.ipynb recommendations.ipynb https://gist.github.com/jbochi/2e8ddcc5939e70e5368326aa034a144e