## Homework `maximum 25 points (previously 10)`

## Grading criteria
`❗️Grading is based on the metric value + code review + service implementation.`

You do NOT have to complete every item to get 25 points. Any total above 25 points counts as 25.


### 1. Beat the leaderboard metric `map@10 = 0.063` of the userKnn model from the seminar (`4 points`)


### 2. Provide notebook(s) with experiments (`16 points`)

What you can do:
   - make the number of recommendations no fewer than N (`2 points`)
   - hyperparameter tuning (for example, the vector distance or the type of kNN model (implicit/rectools/...)) (`4 points`)
   - other ways of ranking the items of similar users (`2 points`)
   - experiments with offline validation (`2 points`)
   - cold users await you in the test set. Make recommendations for them (take a look at <a href="https://rectools.readthedocs.io/en/latest/api/rectools.models.popular.html"> rectools.models.popular</a>) (`2 points`)
   - model blending (`4 points`)
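
As an illustration of the blending item above, here is a minimal sketch of one possible approach: weighted reciprocal-rank blending of two ranked lists. The function name `blend_rankings`, the weights, and the item ids are hypothetical and are not taken from the seminar code.

```python
from collections import defaultdict


def blend_rankings(ranked_lists, weights, n=10):
    """Blend several ranked item lists using weighted reciprocal-rank scores."""
    scores = defaultdict(float)
    for items, weight in zip(ranked_lists, weights):
        for rank, item in enumerate(items, start=1):
            # an item placed higher in a list contributes a larger score
            scores[item] += weight / rank
    # sort by score (descending); ties are broken by item id for determinism
    ordered = sorted(scores, key=lambda item: (-scores[item], item))
    return ordered[:n]


knn_recs = [101, 102, 103]      # hypothetical kNN recommendations
popular_recs = [102, 201, 202]  # hypothetical popular recommendations
blended = blend_rankings([knn_recs, popular_recs], weights=[0.7, 0.3], n=4)
print(blended)
```

The weights would normally be tuned on offline validation, like any other hyperparameter.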


### 3. Wrap the model in a service.
- **preferred online variant**: train the model in a notebook, save the trained model (pickle, dill), load it when the service starts, and request recommendations "on the fly" (`9 points`)
- or the offline variant: precompute recommendations for all users, save them, and serve them on request (`4 points`)
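
The offline variant can be sketched roughly as follows, assuming the precomputed recommendations fit in memory as a plain dictionary. The helper names (`save_recos`, `load_recos`, `recommend_offline`) and the file name are hypothetical; popular items serve as the fallback for users without precomputed recommendations.

```python
import pickle
import tempfile
from pathlib import Path


def save_recos(recos, path):
    """Persist a {user_id: [item_id, ...]} mapping to disk."""
    with open(path, "wb") as f:
        pickle.dump(recos, f)


def load_recos(path):
    """Load the mapping once, e.g. at service start-up."""
    with open(path, "rb") as f:
        return pickle.load(f)


def recommend_offline(user_id, recos, fallback, n=10):
    """Look up precomputed recommendations; unknown users get the fallback."""
    return list(recos.get(user_id, fallback))[:n]


# tiny demo with made-up ids
demo_path = Path(tempfile.mkdtemp()) / "recos.pickle"
save_recos({1: [10, 11, 12]}, demo_path)
loaded = load_recos(demo_path)
print(recommend_offline(1, loaded, fallback=[99, 98], n=2))
print(recommend_offline(42, loaded, fallback=[99, 98], n=2))
```

Only unpickle files you created yourself: loading a pickle executes code embedded in it.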
   

### Good homework code means:
- comments and explanations. In ipynb, use the power of markdown. 
In scripts, write comments and docstrings. 
- easy readability and reproducibility
- the PEP8 standard 
- justification of the validation scheme
- analysis of the quality metric 
  

In [1]:
!pip install rectools==0.2.0 implicit > /dev/null

In [2]:
import pandas as pd
import numpy as np
import requests
from tqdm.auto import tqdm
import scipy as sp
from scipy.stats import mode
from scipy.sparse import csr_matrix
from itertools import islice, cycle
from pprint import pprint
from implicit.nearest_neighbours import CosineRecommender, TFIDFRecommender, ItemItemRecommender, BM25Recommender
from rectools.metrics import MAP, Precision, Recall, MeanInvUserFreq, Serendipity, calc_metrics
from rectools.model_selection import TimeRangeSplit
import warnings
warnings.filterwarnings("ignore")

from rectools import Columns
from rectools.dataset import Dataset

from userknn1 import UserKnn
import dill
import pickle

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

In [3]:
np.random.seed(41)

# Get KION dataset

In [4]:
# download the dataset in chunks
url = "https://storage.yandexcloud.net/itmo-recsys-public-data/kion_train.zip"

req = requests.get(url, stream=True)

with open('kion_train.zip', "wb") as fd:
    total_size_in_bytes = int(req.headers.get('Content-Length', 0))
    progress_bar = tqdm(desc='kion dataset download', total=total_size_in_bytes, unit='iB', unit_scale=True)
    for chunk in req.iter_content(chunk_size=2 ** 20):
        progress_bar.update(len(chunk))
        fd.write(chunk)

kion dataset download:   0%|          | 0.00/78.8M [00:00<?, ?iB/s]

In [5]:
!unzip kion_train.zip

Archive:  kion_train.zip
   creating: kion_train/
  inflating: kion_train/interactions.csv  
  inflating: __MACOSX/kion_train/._interactions.csv  
  inflating: kion_train/users.csv    
  inflating: __MACOSX/kion_train/._users.csv  
  inflating: kion_train/items.csv    
  inflating: __MACOSX/kion_train/._items.csv  


In [6]:
interactions = pd.read_csv('kion_train/interactions.csv')
users = pd.read_csv('kion_train/users.csv')
items = pd.read_csv('kion_train/items.csv')

In [7]:
# rename columns, convert timestamp, watched_pct as weight
interactions.rename(columns={'last_watch_dt': Columns.Datetime,
                            'watched_pct': Columns.Weight}, 
                    inplace=True) 

interactions['datetime'] = pd.to_datetime(interactions['datetime'])

In [8]:
interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5476251 entries, 0 to 5476250
Data columns (total 5 columns):
 #   Column     Dtype         
---  ------     -----         
 0   user_id    int64         
 1   item_id    int64         
 2   datetime   datetime64[ns]
 3   total_dur  int64         
 4   weight     float64       
dtypes: datetime64[ns](1), float64(1), int64(3)
memory usage: 208.9 MB


# userKNN model: CV folds

Comparison of the `CosineRecommender` and `TFIDFRecommender` models across folds.



In [10]:
# setting for cv 
n_folds = 7
unit = "W"
n_units = 1

last_date = interactions[Columns.Datetime].max().normalize()
start_date = last_date - pd.Timedelta(n_folds * n_units + 1, unit=unit)  
print(f"Start date and last date of the test fold: {start_date, last_date}")

Start date and last date of the test fold: (Timestamp('2021-06-27 00:00:00'), Timestamp('2021-08-22 00:00:00'))


### Test fold borders

In [11]:
periods = n_folds + 1
freq = f"{n_units}{unit}"
print(
    f"start_date: {start_date}\n"
    f"last_date: {last_date}\n"
    f"periods: {periods}\n"
    f"freq: {freq}\n"
)
    
date_range = pd.date_range(start=start_date, periods=periods, freq=freq, tz=last_date.tz)
print(f"Test fold borders: {date_range.values.astype('datetime64[D]')}")

# generator of folds
cv = TimeRangeSplit(
    date_range=date_range,
    filter_already_seen=True,
    filter_cold_items=True,
    filter_cold_users=True,
)
print(f"Real number of folds: {cv.get_n_splits(interactions)}")

start_date: 2021-06-27 00:00:00
last_date: 2021-08-22 00:00:00
periods: 8
freq: 1W

Test fold borders: ['2021-06-27' '2021-07-04' '2021-07-11' '2021-07-18' '2021-07-25'
 '2021-08-01' '2021-08-08' '2021-08-15']
Real number of folds: 7
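
The fold borders printed above can be reproduced with the standard library alone. This is a sketch of the date arithmetic only, not of `TimeRangeSplit` itself; the helper name `weekly_fold_borders` is hypothetical.

```python
from datetime import date, timedelta


def weekly_fold_borders(last_date, n_folds):
    """Return n_folds + 1 weekly borders, starting n_folds + 1 weeks before last_date."""
    start = last_date - timedelta(weeks=n_folds + 1)
    return [start + timedelta(weeks=i) for i in range(n_folds + 1)]


borders = weekly_fold_borders(date(2021, 8, 22), n_folds=7)
# each test fold spans two consecutive borders
folds = list(zip(borders[:-1], borders[1:]))
for start, end in folds:
    print(start, end)
```

With these settings the last border is 2021-08-15, so the final week before `last_date` is not covered by any test fold.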


In [16]:
# metrics to compute
metrics = {
    "MAP@10": MAP(k=10),
    "prec@10": Precision(k=10),
    "recall@10": Recall(k=10),
    "novelty": MeanInvUserFreq(k=10),
    "serendipity": Serendipity(k=10),
}

# models from the seminar, plus BM25Recommender
models = {
    "cosine_itemknn": CosineRecommender(),
    #"tfidf_itemknn": TFIDFRecommender(),
    #"bm25": BM25Recommender()
}
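
For intuition about the main metric, here is a minimal pure-Python sketch of MAP@k under one common convention: the sum of precision values at hit positions, divided by the number of relevant items. This convention is an assumption; the exact normalisation used by `rectools.metrics.MAP` should be checked in its documentation. The function names are hypothetical.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: precision at each hit position, averaged over relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision_at_k(recos_by_user, relevant_by_user, k=10):
    """MAP@k: AP@k averaged over all users that have test interactions."""
    users = list(relevant_by_user)
    if not users:
        return 0.0
    total = sum(
        average_precision_at_k(recos_by_user.get(user, []), relevant_by_user[user], k)
        for user in users
    )
    return total / len(users)


# made-up example: hits at ranks 2 and 4 out of 2 relevant items
ap = average_precision_at_k([1, 2, 3, 4], relevant={2, 4}, k=10)
print(ap)
```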


# Model training by fold

In [14]:
results = []

fold_iterator = cv.split(interactions, collect_fold_stats=True)

for i_fold, (train_ids, test_ids, fold_info) in enumerate(fold_iterator):
    print(f"\n==================== Fold {i_fold}")
    print(fold_info)

    df_train = interactions.iloc[train_ids].copy()
    df_test = interactions.iloc[test_ids][Columns.UserItem].copy()

    catalog = df_train[Columns.Item].unique()
    
    for model_name, model in models.items():
        userknn_model = UserKnn(model=model, N_users=50)
        userknn_model.fit(df_train)
        recos = userknn_model.predict(df_test)
    
        metric_values = calc_metrics(
            metrics,
            reco=recos,
            interactions=df_test,
            prev_interactions=df_train,
            catalog=catalog,
        )
    
        fold = {"fold": i_fold, "model": model_name}
        fold.update(metric_values)
        results.append(fold)


{'Start date': Timestamp('2021-06-27 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-04 00:00:00', freq='W-SUN'), 'Train': 2533586, 'Train users': 536802, 'Train items': 14092, 'Test': 237414, 'Test users': 98930, 'Test items': 5947}


  0%|          | 0/536802 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-04 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-11 00:00:00', freq='W-SUN'), 'Train': 2886800, 'Train users': 595902, 'Train items': 14357, 'Test': 211146, 'Test users': 86167, 'Test items': 6209}


  0%|          | 0/595902 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-11 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-18 00:00:00', freq='W-SUN'), 'Train': 3192875, 'Train users': 640144, 'Train items': 14711, 'Test': 214489, 'Test users': 84234, 'Test items': 6313}


  0%|          | 0/640144 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-18 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-25 00:00:00', freq='W-SUN'), 'Train': 3506106, 'Train users': 687200, 'Train items': 14928, 'Test': 231207, 'Test users': 87632, 'Test items': 6491}


  0%|          | 0/687200 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-25 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-01 00:00:00', freq='W-SUN'), 'Train': 3838180, 'Train users': 734701, 'Train items': 15061, 'Test': 249396, 'Test users': 93092, 'Test items': 6611}


  0%|          | 0/734701 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-08-01 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'Train': 4203885, 'Train users': 788721, 'Train items': 15212, 'Test': 264039, 'Test users': 98161, 'Test items': 6609}


  0%|          | 0/788721 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}


  0%|          | 0/842129 [00:00<?, ?it/s]

In [15]:
df_metrics = pd.DataFrame(results)
df_metrics

| | fold | model | prec@10 | recall@10 | MAP@10 | novelty | serendipity |
|---|---|---|---|---|---|---|---|
| 0 | 0 | cosine_itemknn | 0.004879 | 0.027282 | 0.004696 | 7.783925 | 3.1e-05 |
| 1 | 1 | cosine_itemknn | 0.004807 | 0.028028 | 0.004765 | 7.8136 | 3.3e-05 |
| 2 | 2 | cosine_itemknn | 0.004103 | 0.023102 | 0.004082 | 7.95327 | 3.7e-05 |
| 3 | 3 | cosine_itemknn | 0.003865 | 0.020485 | 0.003623 | 8.063779 | 4.4e-05 |
| 4 | 4 | cosine_itemknn | 0.0037 | 0.019591 | 0.003579 | 8.118989 | 4.7e-05 |
| 5 | 5 | cosine_itemknn | 0.003669 | 0.01941 | 0.003403 | 8.126134 | 4.3e-05 |
| 6 | 6 | cosine_itemknn | 0.00333 | 0.017233 | 0.003142 | 8.185844 | 4.3e-05 |


In [19]:
model.similar_items(1, N=11)

[(1, 0.9999999999999987),
 (78101, 0.28067570844923023),
 (273835, 0.2706231506959187),
 (496026, 0.26518576139191),
 (359238, 0.25514595333753737),
 (4656, 0.2546084638985143),
 (159253, 0.2544200743322844),
 (66832, 0.25157730271331386),
 (198831, 0.25157730271331386),
 (152958, 0.24751933820372524),
 (73433, 0.24722693486691322)]

Colab crashes periodically, so the models have to be trained in parts, one model at a time.


In [20]:
# models from the seminar, plus BM25Recommender
models = {
    #"cosine_itemknn": CosineRecommender(),
    "tfidf_itemknn": TFIDFRecommender(),
    #"bm25": BM25Recommender()
}

In [22]:
fold_iterator = cv.split(interactions, collect_fold_stats=True)

for i_fold, (train_ids, test_ids, fold_info) in enumerate(fold_iterator):
    print(f"\n==================== Fold {i_fold}")
    print(fold_info)

    df_train = interactions.iloc[train_ids].copy()
    df_test = interactions.iloc[test_ids][Columns.UserItem].copy()

    catalog = df_train[Columns.Item].unique()
    
    for model_name, model in models.items():
        userknn_model = UserKnn(model=model, N_users=50)
        userknn_model.fit(df_train)
        recos = userknn_model.predict(df_test)
    
        metric_values = calc_metrics(
            metrics,
            reco=recos,
            interactions=df_test,
            prev_interactions=df_train,
            catalog=catalog,
        )
    
        fold = {"fold": i_fold, "model": model_name}
        fold.update(metric_values)
        results.append(fold)


{'Start date': Timestamp('2021-06-27 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-04 00:00:00', freq='W-SUN'), 'Train': 2533586, 'Train users': 536802, 'Train items': 14092, 'Test': 237414, 'Test users': 98930, 'Test items': 5947}


  0%|          | 0/536802 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-04 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-11 00:00:00', freq='W-SUN'), 'Train': 2886800, 'Train users': 595902, 'Train items': 14357, 'Test': 211146, 'Test users': 86167, 'Test items': 6209}


  0%|          | 0/595902 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-11 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-18 00:00:00', freq='W-SUN'), 'Train': 3192875, 'Train users': 640144, 'Train items': 14711, 'Test': 214489, 'Test users': 84234, 'Test items': 6313}


  0%|          | 0/640144 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-18 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-07-25 00:00:00', freq='W-SUN'), 'Train': 3506106, 'Train users': 687200, 'Train items': 14928, 'Test': 231207, 'Test users': 87632, 'Test items': 6491}


  0%|          | 0/687200 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-07-25 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-01 00:00:00', freq='W-SUN'), 'Train': 3838180, 'Train users': 734701, 'Train items': 15061, 'Test': 249396, 'Test users': 93092, 'Test items': 6611}


  0%|          | 0/734701 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-08-01 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'Train': 4203885, 'Train users': 788721, 'Train items': 15212, 'Test': 264039, 'Test users': 98161, 'Test items': 6609}


  0%|          | 0/788721 [00:00<?, ?it/s]


{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}


  0%|          | 0/842129 [00:00<?, ?it/s]

In [23]:
df_metrics = pd.DataFrame(results)
df_metrics

| | fold | model | prec@10 | recall@10 | MAP@10 | novelty | serendipity |
|---|---|---|---|---|---|---|---|
| 0 | 0 | cosine_itemknn | 0.004879 | 0.027282 | 0.004696 | 7.783925 | 3.1e-05 |
| 1 | 1 | cosine_itemknn | 0.004807 | 0.028028 | 0.004765 | 7.8136 | 3.3e-05 |
| 2 | 2 | cosine_itemknn | 0.004103 | 0.023102 | 0.004082 | 7.95327 | 3.7e-05 |
| 3 | 3 | cosine_itemknn | 0.003865 | 0.020485 | 0.003623 | 8.063779 | 4.4e-05 |
| 4 | 4 | cosine_itemknn | 0.0037 | 0.019591 | 0.003579 | 8.118989 | 4.7e-05 |
| 5 | 5 | cosine_itemknn | 0.003669 | 0.01941 | 0.003403 | 8.126134 | 4.3e-05 |
| 6 | 6 | cosine_itemknn | 0.00333 | 0.017233 | 0.003142 | 8.185844 | 4.3e-05 |
| 7 | 0 | tfidf_itemknn | 0.008546 | 0.04834 | 0.008829 | 7.799313 | 3.5e-05 |
| 8 | 1 | tfidf_itemknn | 0.008504 | 0.05056 | 0.009345 | 7.827066 | 3.9e-05 |
| 9 | 2 | tfidf_itemknn | 0.00683 | 0.038295 | 0.007287 | 7.952497 | 4.2e-05 |


Interestingly, the metrics are higher on the earlier folds. The tfidf model has the better metric, so let's train it on the whole dataset and compare the metrics.
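
Because the tfidf run covers only folds 0 to 2, a like-for-like comparison should average MAP@10 over the folds both models share. A minimal sketch with the MAP@10 values copied from the table above (in the notebook itself the same could be done by filtering `df_metrics` and grouping by `model`):

```python
from statistics import mean

# MAP@10 per fold, copied from the results table
map_at_10 = {
    "cosine_itemknn": {
        0: 0.004696, 1: 0.004765, 2: 0.004082, 3: 0.003623,
        4: 0.003579, 5: 0.003403, 6: 0.003142,
    },
    "tfidf_itemknn": {0: 0.008829, 1: 0.009345, 2: 0.007287},
}

# keep only the folds evaluated for every model
shared_folds = set.intersection(*(set(by_fold) for by_fold in map_at_10.values()))

mean_on_shared = {
    model: mean(by_fold[fold] for fold in shared_folds)
    for model, by_fold in map_at_10.items()
}
print(sorted(shared_folds), mean_on_shared)
```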

## TFIDF model on a single fold

In [9]:
# train test split 
# test = last 1 week 
from rectools.model_selection import TimeRangeSplit

n_folds = 1
unit = "W"
n_units = 1
periods = n_folds + 1
freq = f"{n_units}{unit}"

last_date = interactions[Columns.Datetime].max().normalize()
start_date = last_date - pd.Timedelta(n_folds * n_units + 1, unit=unit)  
print(f"Start date and last date of the test fold: {start_date, last_date}")
    
date_range = pd.date_range(start=start_date, periods=periods, freq=freq, tz=last_date.tz)
print(f"Test fold borders: {date_range.values.astype('datetime64[D]')}")

# generator of folds
cv = TimeRangeSplit(
    date_range=date_range,
    filter_already_seen=True,
    filter_cold_items=True,
    filter_cold_users=True,
)
print(f"Real number of folds: {cv.get_n_splits(interactions)}")

Start date and last date of the test fold: (Timestamp('2021-08-08 00:00:00'), Timestamp('2021-08-22 00:00:00'))
Test fold borders: ['2021-08-08' '2021-08-15']
Real number of folds: 1


In [10]:
# we have just 1 test fold - no need to iterate over fold
train_ids, test_ids, fold_info = next(cv.split(interactions, collect_fold_stats=True))

In [11]:
train = interactions.loc[train_ids]
test = interactions.loc[test_ids]

In [9]:
# metrics to compute
metrics = {
    "MAP@10": MAP(k=10),
    "prec@10": Precision(k=10),
    "recall@10": Recall(k=10),
    "novelty": MeanInvUserFreq(k=10),
    "serendipity": Serendipity(k=10),
}

# models from the seminar, plus BM25Recommender
models = {
    #"cosine_itemknn": CosineRecommender(),
    "tfidf_itemknn": TFIDFRecommender(),
    #"bm25": BM25Recommender()
}


In [None]:
results = []

print(f"one fold")
print(fold_info)

df_train = train.copy()
df_test = test.copy()

catalog = df_train[Columns.Item].unique()
    
for model_name, model in models.items():
    userknn_model = UserKnn(model=model, N_users=50)
    userknn_model.fit(df_train)
    recos = userknn_model.predict(df_test)

    metric_values = calc_metrics(
        metrics,
        reco=recos,
        interactions=df_test,
        prev_interactions=df_train,
        catalog=catalog,
    )

    fold = {"fold": "one", "model": model_name}  # "one" is a label; the bare name `one` was undefined
    fold.update(metric_values)
    results.append(fold)

In [12]:
metric_values

{'prec@10': 0.004905719580714434,
 'recall@10': 0.024260674034016428,
 'MAP@10': 0.0055949355981462605,
 'novelty': 8.314307116228239,
 'serendipity': 6.460547203542146e-05}

Here MAP@10 is lower. Let's look at another model.

# BM25

In [16]:
metrics = {
    "MAP@10": MAP(k=10),
    "prec@10": Precision(k=10),
    "recall@10": Recall(k=10),
    "novelty": MeanInvUserFreq(k=10),
    "serendipity": Serendipity(k=10),
}

# models from the seminar, plus BM25Recommender
models = {
    #"cosine_itemknn": CosineRecommender(),
    #"tfidf_itemknn": TFIDFRecommender(),
    "bm25": BM25Recommender()
}

In [17]:
print(f"one fold")
print(fold_info)

df_train = train.copy()
df_test = test.copy()

catalog = df_train[Columns.Item].unique()
    
for model_name, model in models.items():
    userknn_model = UserKnn(model=model, N_users=50)
    userknn_model.fit(df_train)
    recos = userknn_model.predict(df_test)

    metric_values = calc_metrics(
        metrics,
        reco=recos,
        interactions=df_test,
        prev_interactions=df_train,
        catalog=catalog,
    )

    print(model_name)
    print(metric_values)

one fold
{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}


  0%|          | 0/842129 [00:00<?, ?it/s]

bm25
{'prec@10': 0.006820744633909573, 'recall@10': 0.03257176075886181, 'MAP@10': 0.01015513826742717, 'novelty': 7.890427732457493, 'serendipity': 5.1971240378355066e-05}


The bm25 model has the best MAP@10, so let's save it.

In [19]:
# save model
import dill

with open('/content/drive/MyDrive/data/knn_bm25_item.dill', 'wb') as f:
    dill.dump(model, f)

# Popular items on folds

Let's validate on folds using a reduced dataset, since the computation takes a very long time.

In [38]:
class PopularReco():
    def __init__(self, max_K=10, days=7, item_column='item_id', dt_column=Columns.Datetime):
        self.max_K = max_K
        self.days = days
        self.item_column = item_column
        self.dt_column = dt_column
        self.recommendations = []
        
    def fit(self, df):
        min_date = df[self.dt_column].max().normalize() - pd.DateOffset(days=self.days)
        self.recommendations = df.loc[df[self.dt_column] > min_date, self.item_column].value_counts().head(self.max_K).index.values
    
    def recommend(self, users=None, N=10):
        recs = self.recommendations[:N]
        if users is None:
            return recs
        else:
            return list(islice(cycle([recs]), len(users)))

In [30]:
# setting for cv 
n_folds = 3
unit = "W"
n_units = 1

last_date = interactions[Columns.Datetime].max().normalize()
start_date = last_date - pd.Timedelta(n_folds * n_units, unit=unit)  
print(f"Start date and last date of the test fold: {start_date, last_date}")

Start date and last date of the test fold: (Timestamp('2021-08-01 00:00:00'), Timestamp('2021-08-22 00:00:00'))


In [25]:
last_date = interactions[Columns.Datetime].max().normalize()
folds = 3
start_date = last_date - pd.Timedelta(days=folds*7)
start_date, last_date

(Timestamp('2021-08-01 00:00:00'), Timestamp('2021-08-22 00:00:00'))

In [31]:
periods = n_folds + 1
freq = f"{n_units}{unit}"
print(
    f"start_date: {start_date}\n"
    f"last_date: {last_date}\n"
    f"periods: {periods}\n"
    f"freq: {freq}\n"
)
    
date_range = pd.date_range(start=start_date, periods=periods, freq=freq, tz=last_date.tz)
print(f"Test fold borders: {date_range.values.astype('datetime64[D]')}")

# generator of folds
cv = TimeRangeSplit(
    date_range=date_range,
    filter_already_seen=True,
    filter_cold_items=True,
    filter_cold_users=True,
)

start_date: 2021-08-01 00:00:00
last_date: 2021-08-22 00:00:00
periods: 4
freq: 1W

Test fold borders: ['2021-08-01' '2021-08-08' '2021-08-15' '2021-08-22']


In [34]:
folds_with_stats = list(cv.split( 
    interactions, 
    collect_fold_stats=True
))

folds_info_with_stats = pd.DataFrame([info for _, _, info in folds_with_stats])

In [35]:
folds_info_with_stats

| | Start date | End date | Train | Train users | Train items | Test | Test users | Test items |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-08-01 | 2021-08-08 | 4203885 | 788721 | 15212 | 264039 | 98161 | 6609 |
| 1 | 2021-08-08 | 2021-08-15 | 4587708 | 842129 | 15404 | 276699 | 101983 | 6715 |
| 2 | 2021-08-15 | 2021-08-22 | 4985269 | 896791 | 15565 | 297228 | 109382 | 6705 |


In [36]:
top_N = 10
last_n_days = 7

In [40]:
final_results = []
validation_results = pd.DataFrame()

for i_fold, (train_idx, test_idx, info) in enumerate(folds_with_stats):
    print(f"\n==================== Fold {i_fold}")
    print(fold_info)

    train = interactions.iloc[train_ids].copy()
    test = interactions.iloc[test_ids][Columns.UserItem].copy()

    catalog = train[Columns.Item].unique()
        
    pop_model = PopularReco(days=last_n_days, dt_column=Columns.Datetime)
    pop_model.fit(train)

    recs = pd.DataFrame({'user_id': test['user_id'].unique()})
    recs['item_id'] = pop_model.recommend(recs['user_id'], N=top_N)
    recs = recs.explode('item_id')
    recs['rank'] = recs.groupby('user_id').cumcount() + 1

    metric_values = calc_metrics(
        metrics,
        reco=recs,
        interactions=test,
        prev_interactions=train,
        catalog=catalog,
    )

    fold = {"fold": i_fold}
    fold.update(metric_values)
    results.append(fold)
    
    df_metrics = pd.DataFrame(results)


{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}

{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}

{'Start date': Timestamp('2021-08-08 00:00:00', freq='W-SUN'), 'End date': Timestamp('2021-08-15 00:00:00', freq='W-SUN'), 'Train': 4587708, 'Train users': 842129, 'Train items': 15404, 'Test': 276699, 'Test users': 101983, 'Test items': 6715}


In [41]:
df_metrics

| | fold | prec@10 | recall@10 | MAP@10 | novelty | serendipity |
|---|---|---|---|---|---|---|
| 0 | 0 | 0.037411 | 0.196584 | 0.075371 | 4.262035 | 2.7e-05 |
| 1 | 1 | 0.037411 | 0.196584 | 0.075371 | 4.262035 | 2.7e-05 |
| 2 | 2 | 0.037411 | 0.196584 | 0.075371 | 4.262035 | 2.7e-05 |


As expected, popular items give a high metric.

# For the service

Let's save the popular-recommendations model, although it has to be changed slightly for the service.

In [42]:
class PopularRecoS():
    def __init__(self, max_K=10, days=7, item_column='item_id', dt_column=Columns.Datetime):
        self.max_K = max_K
        self.days = days
        self.item_column = item_column
        self.dt_column = dt_column
        self.recommendations = []
        
    def fit(self, df):
        min_date = df[self.dt_column].max().normalize() - pd.DateOffset(days=self.days)
        self.recommendations = df.loc[df[self.dt_column] > min_date, self.item_column].value_counts().head(self.max_K).index.values
    
    def recommend(self, N=10):
        recs = self.recommendations[:N]
        return recs

In [44]:
pop_model_7 = PopularRecoS(days=7)
pop_model_7.fit(interactions)

In [46]:
pop_model_7.recommend(10)

array([ 9728, 15297, 10440, 14488, 13865, 12192,   341,  4151,  3734,
         512])

In [47]:
with open('/content/drive/MyDrive/data/pop_model_7.dill', 'wb') as f:
    dill.dump(pop_model_7, f)

We have two models: bm25 and pop_7.

In [18]:
with open('/content/drive/MyDrive/data/knn_bm25_item.dill', 'rb') as f:
    knn_bm25 = dill.load(f)

In [19]:
with open('/content/drive/MyDrive/data/pop_model_7.dill', 'rb') as f:
    pop_7 = dill.load(f)

Let's create a function that pads the list with popular items up to 10 recommendations.

In [55]:
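def get_knn_pop(user_id, N=10):
    """Return kNN recs for the user, padded with popular items up to N."""
    recs = knn_bm25.similar_items(user_id)
    recs = [x[0] for x in recs]
    
    pop = pop_7.recommend(N=N)
    
    if len(recs) < N:
        # pad only with popular items that are not already in the list
        recs.extend(x for x in pop if x not in recs)
    
    return recs[:N]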

In [59]:
get_knn_pop(1000)

[23489, 74931, 143573, 153891, 209245, 336535, 463373, 532033, 558371, 574157]

In [60]:
pop_7.recommend()

array([ 9728, 15297, 10440, 14488, 13865, 12192,   341,  4151,  3734,
         512])

In [None]:
# TODO: don't forget cold-user handling in the service: recommend popular items to them, i.e. first check whether the user is among the known unique users
# before loading into the service, retrain the bm25 model on all the data, without the validation split, since fresher data is better

In [61]:
def get_rec(user_id):
    users = interactions['user_id'].unique()
    if user_id in users:
        return get_knn_pop(user_id)
    else:
        return pop_7.recommend()

In [None]:
# please clarify the task about beating the metric: what are the restrictions on the model? which one may be used?
# to beat the metric, is it allowed to take only the top popular items? or to change the ratio, e.g. 3 kNN items and 7 popular ones?

The task on another way of ranking the items of similar users is in a separate notebook: recsys_hw_3_knn_useritem.ipynb

Saving the list of users for the service

In [12]:
users_list = train['user_id'].unique()

In [14]:
users_list.shape

(842129,)

In [15]:
import pickle

In [16]:
with open("users_list.pickle", "wb") as f:
    pickle.dump(users_list, f)

In [17]:
with open("users_list.pickle", "rb") as f:
    users_list = pickle.load(f)

In [20]:
# combine everything into one function
def get_rec(user_id, model_knn, model_pop, users_list, N=10):
    """
    Check whether the user is in the users list.
    If yes, return kNN recs, padded with popular recs when there are fewer than N.
    If no, return popular recs.
    """
    # TODO: retrain the model on all of interactions; for now it is trained on train only!
    pop = model_pop.recommend(N)

    if user_id in users_list:
        # use the models passed as arguments rather than the globals
        recs = model_knn.similar_items(user_id)
        recs = [x[0] for x in recs]
    
        if len(recs) < N:
            # pad only with popular items that are not already in the list
            recs.extend(x for x in pop if x not in recs)
    
        return recs[:N]
        
    else:
        return pop

In [23]:
get_rec(10, knn_bm25, pop_7, users_list)

[728827,
 730172,
 730509,
 738566,
 740412,
 741784,
 746066,
 753254,
 756747,
 763235]
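
To check the warm and cold branches without the trained models, the same logic can be exercised with small stand-ins. This is a sketch only: `StubKnn`, `StubPopular`, and `recommend_for_service` are hypothetical names, and the stubs merely mimic the call shapes used in this notebook (`similar_items` returning `(id, score)` pairs, `recommend(N)` returning a list). Known users are kept in a `set`, which gives constant-time membership checks, unlike scanning an array of about 842 thousand ids on every request.

```python
class StubKnn:
    """Stand-in for the pickled kNN model; returns (id, score) pairs."""

    def __init__(self, neighbours):
        self.neighbours = neighbours

    def similar_items(self, user_id, N=10):
        return self.neighbours.get(user_id, [])[:N]


class StubPopular:
    """Stand-in for the popular model; returns the top-N popular ids."""

    def __init__(self, items):
        self.items = items

    def recommend(self, N=10):
        return self.items[:N]


def recommend_for_service(user_id, model_knn, model_pop, known_users, n=10):
    """Cold users get popular items; warm users get kNN recs padded with popular."""
    popular = list(model_pop.recommend(n))
    if user_id not in known_users:
        return popular[:n]
    recs = [item for item, _score in model_knn.similar_items(user_id, N=n)]
    for item in popular:
        if len(recs) >= n:
            break
        if item not in recs:  # avoid duplicates when padding
            recs.append(item)
    return recs[:n]


knn = StubKnn({1: [(11, 0.9), (12, 0.8)]})
pop = StubPopular([12, 21, 22, 23])
known = {1}  # in the service: set(users_list)
print(recommend_for_service(1, knn, pop, known, n=4))
print(recommend_for_service(2, knn, pop, known, n=3))
```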