# Supporting documentation for the task

## Analysis of the provided dataset

1. Error detection. Malformed view URLs found in dataset_news_1.xlsx:
```
mos.ru/news/item/89421073/ /
mos.ru/news/item/9468/
mos.ru/news/item/94670073/ /
mos.ru/news/item/94501073/душ/
mos.ru/news/item/89957073/ Их/
mos.ru/news/item/94852073/%5c/
mos.ru/news/item/94479073/ (https:/app.aif.ru/owa/redir.aspx/
mos.ru/news/item/94792073/ /
mos.ru/news/item/94897073/+/
mos.ru/news/item/94953073/ /
mos.ru/news/item/91919073/-/
```
2. Analysis of the data:
- 239 users, 5812 news items, 26,446 views
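
Most of the malformed URLs listed above still carry the numeric id right after `/news/item/`, followed by stray characters. The sketch below shows one possible way to pull the id out with a regular expression instead of relying on the position of the path segment. The function name `extract_news_id` and the fallback to the last numeric segment are assumptions made for illustration; a truncated id such as `9468` is returned as-is and cannot be repaired this way.

```python
import re
from typing import Optional

# Assumption: news ids follow "/news/item/"; other sections (e.g. /mayor/themes/)
# are handled by falling back to the last purely numeric path segment.
_ITEM_RE = re.compile(r"/news/item/(\d+)")
_LAST_NUM_RE = re.compile(r"/(\d+)/?$")


def extract_news_id(url: str) -> Optional[int]:
    """Return the numeric news id found in the URL, or None if there is none."""
    match = _ITEM_RE.search(url) or _LAST_NUM_RE.search(url.strip())
    return int(match.group(1)) if match else None


for sample in (
    "mos.ru/news/item/94501073/душ/",
    "mos.ru/news/item/94479073/ (https:/app.aif.ru/owa/redir.aspx/",
    "mos.ru/mayor/themes/12299/7401050/",
):
    print(sample, "->", extract_news_id(sample))
```

Returning `None` rather than `0` for unparseable URLs keeps such rows easy to filter out before merging.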

## Tested hypotheses and how the recommender system solution works

In [1]:
import pandas as pd
import numpy as np
import json
import datetime
from scipy.sparse import csr_matrix
from implicit.nearest_neighbours import ItemItemRecommender
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import precision_at_k, mean_average_precision_at_k

Loading the data

In [181]:
def get_news_id_from_url(url: str) -> int:
    """
    Extract the news id from a URL.
    """
    parts = url.split('/')
    try:
        return int(parts[-2])
    except (ValueError, IndexError):
        for part in parts:
            # Empirically, broken URLs occur only for ids of the 073 type,
            # so we simply pick out the path segment that contains it.
            if '073' in part:
                return int(part)
        return 0

    
df_views = pd.read_excel('/app/data/dataset_news_1.xlsx')
df_news = pd.read_json('/app/data/news.json', encoding="utf_8_sig")
df_views['news_id'] = df_views['url_clean'].apply(get_news_id_from_url)
df_news['unique_views'] = df_news['id'].apply(lambda x: df_views[df_views.news_id == x].user_id.nunique())
merged = df_views.merge(df_news, left_on='news_id', right_on='id')

In [182]:
df_news[df_news['unique_views'] == 0]

Unnamed: 0,id,title,importance,published_at,created_at,updated_at,is_deferred_publication,status,ya_rss,active_from,...,territory_area_id,territory_district_id,preview_text,full_text,url,preview,text,promo,images,unique_views
10,91359073,Москва окажет финансовую поддержку анимационны...,,2021-05-28 09:01:00,2021-05-27 17:18:38,2021-05-28 12:21:05,0.0,public,1,,...,,,Город выделил 100 миллионов рублей на стимулир...,<p>Правительство Москвы учредило два новых гра...,/news/item/91359073/,,,,,0
11,95361073,Планируйте маршрут: на Савеловском и Белорусск...,,2021-08-31 12:01:00,2021-08-31 11:55:25,2021-08-31 12:00:44,0.0,public,0,,...,,,Изменения связаны с проведением путевых работ.,<p>В расписании пригородных поездов Савеловско...,/news/item/95361073/,,,,,0
32,79697073,В Южном Бутове построят детский сад,,2020-09-10 20:11:01,2020-09-10 18:01:55,2020-09-10 20:11:01,0.0,public,0,,...,53501.0,4500.0,"В здании будут кабинеты логопеда и психолога, ...",<p>Детский сад на 235 мест появится на пересеч...,/news/item/79697073/,,,,,0
37,87907073,Экскурсии и квесты: день открытых дверей в биб...,,2021-03-17 09:01:03,2021-03-16 20:50:35,2021-03-17 09:00:26,0.0,public,1,,...,2501.0,1500.0,В 130 читальнях и 76 культурных центрах можно ...,<p>Культурные центры и библиотеки столичного Д...,/news/item/87907073/,,,,,0
40,58340073,1500 цветников ко Дню города: фестиваль «Цвето...,,2019-07-08 13:04:00,2019-07-08 11:38:20,2020-06-11 19:33:47,0.0,public,1,,...,,,Создать авторский цветник смогут все желающие....,<p>В столице продолжается фестиваль ландшафтно...,/news/item/58340073/,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6499,7401050,Москва стала лидером Национального рейтинга со...,,2021-06-04 13:47:00,2021-06-04 13:36:44,2021-06-04 13:46:03,,public,1,,...,,,,,/mayor/themes/12299/7401050/,Большинство отраслей бизнеса в Москве после пи...,<p>Москва третий год подряд становится лидером...,0.0,"[{'id': 3161483281, 'title': 'Фото М. Денисова...",0
6508,7267050,В столице определили победителей конкурса «Учи...,,2021-04-12 20:33:00,2021-04-12 20:26:16,2021-04-12 20:32:33,,public,1,,...,,,,,/mayor/themes/15299/7267050/,Всего в конкурсе приняли участие 720 учителей ...,<p>Подведены итоги городского конкурса &laquo;...,0.0,"[{'id': 3039817281, 'title': '', 'copyright': ...",0
6509,7503050,Образовательный туризм: Москва подписала Согла...,,2021-07-16 17:52:00,2021-07-16 17:46:23,2021-07-16 17:51:38,,public,1,,...,,,,,/mayor/themes/11299/7503050/,В рамках Соглашения школьники из Москвы смогут...,<p>В столице подписано трехстороннее Соглашени...,0.0,"[{'id': 3217642281, 'title': 'Фото М. Мишина. ...",0
6527,5602050,Новые музыкальные инструменты и отремонтирован...,,2019-04-24 18:19:00,2019-04-24 17:46:20,2020-06-11 19:33:41,,public,1,,...,,,,,/mayor/themes/3299/5602050/,Всего в рамках проекта «Искусство — детям» пла...,<p>В детской школе искусств имени С.Т. Рихтера...,0.0,"[{'id': 1971092281, 'title': '', 'copyright': ...",0


In [183]:
df_users_content = pd.DataFrame(index=df_views['user_id'].unique())

def get_ids_from_data(tags, spheres):
    ids_from_data = set()
    all_contents = tags + spheres
    for content in all_contents:
        for i in content:
            ids_from_data.add(i.get('id'))
    return ids_from_data

def get_user_spheres_and_tags(user_id):
    user_news_ids = df_views[df_views.user_id == user_id]['news_id'].unique()
    user_tags = df_news[df_news['id'].isin(user_news_ids)].tags.sum()
    user_spheres = df_news[df_news['id'].isin(user_news_ids)].spheres.sum()
    user_spheres_tags = user_tags + user_spheres
    user_spheres_tags = [x.get('id') for x in user_spheres_tags]
    return np.unique(user_spheres_tags, return_counts=True)

content_ids = get_ids_from_data(df_news.tags.values, df_news.spheres.values)
for c_id in content_ids:
    df_users_content[c_id] = 0

for ui in df_views['user_id'].unique():
    if not ui:
        print(ui)
    user_st = get_user_spheres_and_tags(ui)
    for i in range(user_st[0].shape[0]):
        df_users_content.loc[ui, user_st[0][i]] = user_st[1][i]



In [128]:
df_users_content



Unnamed: 0,4489217,20873217,819217,44040217,56328217,54493217,29917217,5341217,18055217,57180217,...,51937217,48267217,50528217,5472217,18186217,8159217,49119217,4063217,53215217,6324217
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,2,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,4,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
274,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
275,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
276,0,0,0,3,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
277,0,0,0,5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [184]:
df_news['spheres_tags'] = df_news['spheres'] + df_news['tags']
df_news['spheres_tags_ids'] = df_news['spheres_tags'].apply(lambda x: [i.get('id') for i in x])

def get_interes(user_id, news_id):
    news_tags_and_spheres = df_news[df_news['id'] == news_id]['spheres_tags_ids'].sum()
    interes = df_users_content.loc[user_id, news_tags_and_spheres].sum()
    return interes

df_news['spheres_tags_ids'] 

0                      [1299, 170217, 16324217, 40823217]
1       [2299, 231299, 255299, 10217, 462217, 4790217,...
2              [6299, 332217, 4318217, 6601217, 30170217]
3                     [4299, 4000217, 12252217, 40016217]
4       [3299, 292299, 242299, 248299, 49217, 1074217,...
                              ...                        
6549    [231299, 238299, 57217, 136217, 144217, 367217...
6550    [1299, 15299, 145217, 308217, 587217, 4019217,...
6551    [18299, 244299, 19217, 127217, 4019217, 658621...
6552    [4299, 18299, 231299, 244299, 28217, 127217, 2...
6553    [15299, 150217, 151217, 854217, 4019217, 43599...
Name: spheres_tags_ids, Length: 6554, dtype: object

In [46]:
%%time
df_users_news_interes = pd.DataFrame(
    index=df_views['user_id'].unique(), columns=df_news['id'].unique(), data=0
)

for index in df_users_news_interes.index:
    for column in df_users_news_interes.columns:
        df_users_news_interes.loc[index, column] = get_interes(index, column)



CPU times: user 27min 23s, sys: 731 ms, total: 27min 24s
Wall time: 27min 22s
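
The interest score filled in by the loop above is, for each user and news item, the sum of the user's per-tag view counts over that item's sphere and tag ids. The following pure-Python sketch reproduces the computation on toy data; the names `build_tag_counts` and `interest` and the sample ids are hypothetical and only serve the illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_tag_counts(viewed_news_ids: Iterable[int],
                     news_tags: Dict[int, List[int]]) -> Counter:
    """Count how often each tag/sphere id occurs among the user's unique viewed news."""
    counts: Counter = Counter()
    for news_id in set(viewed_news_ids):
        counts.update(news_tags.get(news_id, []))
    return counts


def interest(tag_counts: Counter, news_id: int,
             news_tags: Dict[int, List[int]]) -> int:
    """Sum the user's tag counts over the tags attached to one news item."""
    return sum(tag_counts[tag] for tag in news_tags.get(news_id, []))


# Toy data (invented): news id -> list of sphere/tag ids.
toy_news_tags = {101: [1299, 170217], 102: [170217, 2299], 103: [2299]}
toy_counts = build_tag_counts([101, 102, 101], toy_news_tags)
print(interest(toy_counts, 103, toy_news_tags))
print(interest(toy_counts, 101, toy_news_tags))
```

Because the score is a sum of count times tag membership, it could presumably be computed for all pairs at once as a product of the user-by-tag count table and a tag-by-news indicator matrix, which may avoid the cell-by-cell loop.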


In [49]:
%%time
df_users_news_interes.to_json(path_or_buf='/app/data/users_news_interes.json', orient="index")

CPU times: user 158 ms, sys: 30 ms, total: 188 ms
Wall time: 202 ms


In [185]:
df_users_news_interes

Unnamed: 0,75178073,80375073,41116073,94978073,64742073,42454073,78167073,95199073,67109073,94753073,...,6959050,6151050,4782050,7564050,7418050,7163050,6965050,5484050,7239050,6751050
1,0,24,1,7,14,10,36,0,7,35,...,10,4,13,10,29,22,15,17,48,14
2,18,65,1,18,37,15,108,3,6,57,...,13,28,17,14,67,69,41,22,93,16
3,23,141,4,34,66,46,188,5,26,150,...,101,67,64,78,163,126,97,98,233,69
4,30,98,5,23,38,34,106,23,16,120,...,53,48,31,47,96,79,74,59,155,37
5,84,63,16,22,68,54,113,14,22,225,...,123,118,66,89,105,68,182,96,174,70
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
274,31,47,11,6,107,91,90,1,22,84,...,56,46,64,32,81,83,109,41,90,63
275,18,69,10,15,23,19,48,0,28,67,...,45,46,24,44,67,30,53,57,98,25
276,4,7,1,12,0,0,16,0,1,9,...,1,4,7,2,7,8,13,0,17,6
277,8,10,1,13,1,1,17,0,0,12,...,3,9,9,3,10,10,19,1,22,9


In [186]:
def get_top_news_for_user(user_id, n=10):
    result = df_users_news_interes.loc[user_id].sort_values(ascending=False)[:n]
    return result

def get_top_new_news_for_user(user_id, n=10):
    new_news_ids = df_news[df_news['unique_views'] == 0]['id'].values
    result = df_users_news_interes.loc[user_id, new_news_ids].sort_values(ascending=False)[:n]
    return result 
print('top_news_for_user:\n', get_top_news_for_user(2), '\n')
print('top_new_news_for_user\n', get_top_new_news_for_user(2))


top_news_for_user:
 71394073    140
7465050     139
83304073    137
93999073    135
75988073    131
81175073    128
94418073    128
92502073    125
86064073    123
89100073    121
Name: 2, dtype: int64 

top_new_news_for_user
 75988073    131
81175073    128
7457050     116
4658050     113
86448073    112
86871073    108
7117050     106
86863073    103
78602073    103
7225050     103
Name: 2, dtype: int64


In [187]:
df_news[df_news['unique_views'] == 0]['id'].values.shape

(745,)

In [188]:
final_df = merged.drop(['importance', 'is_deferred_publication', 'status', 'ya_rss', 'active_from',
                       'active_to', 'search', 'display_image', 'icon_id', 'canonical_url', 'canonical_updated_at',
                       'is_powered', 'has_image', 'attach', 'active_from_timestamp', 'active_to_timestamp',
                       'image', 'counter', 'preview_text', 'images'],
                      axis=1)
final_df['title_age'] = (pd.Timestamp.now() - final_df['published_at']).dt.days
final_df['age_param'] = 1 / final_df['title_age']
final_df['age_view_param'] = final_df['unique_views'] / final_df['title_age']
users, items, interactions = final_df.user_id.nunique(), final_df.id.nunique(), final_df.shape[0]
print('# users: ', users)
print('# items: ', items)
print('# interactions: ', interactions)
final_df = final_df[['user_id', 'news_id', 'date_time', 'age_param', 'unique_views', 'age_view_param', 'title_age']]
final_df.head()

# users:  239
# items:  5809
# interactions:  26442


Unnamed: 0,user_id,news_id,date_time,age_param,unique_views,age_view_param,title_age
0,1,94006073,2021-08-01 18:51:19,0.010526,38,0.4,95
1,2,94006073,2021-08-04 13:08:19,0.010526,38,0.4,95
2,3,94006073,2021-08-29 12:40:07,0.010526,38,0.4,95
3,6,94006073,2021-08-02 09:04:55,0.010526,38,0.4,95
4,11,94006073,2021-08-02 17:16:23,0.010526,38,0.4,95


In [191]:
df_views[df_views['user_id']==1]

Unnamed: 0,date_time,url_clean,user_id,news_id
0,2021-08-01 18:51:19,mos.ru/news/item/94006073/,1,94006073
1,2021-08-01 18:57:28,mos.ru/news/item/94000073/,1,94000073
2,2021-08-04 08:49:49,mos.ru/news/item/94062073/,1,94062073
3,2021-08-04 08:49:49,mos.ru/news/item/94063073/,1,94063073
4,2021-08-04 08:49:57,mos.ru/news/item/93893073/,1,93893073
5,2021-08-04 09:15:14,mos.ru/news/item/94098073/,1,94098073
6,2021-08-04 09:15:38,mos.ru/news/item/94106073/,1,94106073
7,2021-08-04 16:55:38,mos.ru/news/item/94108073/,1,94108073
8,2021-08-04 16:55:45,mos.ru/news/item/94132073/,1,94132073
9,2021-08-04 16:56:02,mos.ru/news/item/94057073/,1,94057073


In [89]:
final_df.describe()

Unnamed: 0,user_id,news_id,age_param,unique_views,age_view_param
count,26442.0,26442.0,26442.0,26442.0,26442.0
mean,133.698056,78673130.0,0.009928,20.42198,0.239715
std,81.158833,30991220.0,0.003836,17.108284,0.20933
min,1.0,179050.0,0.000269,1.0,0.000269
25%,60.0,86515320.0,0.008403,4.0,0.028986
50%,137.0,94144070.0,0.011111,18.0,0.216495
75%,193.0,94620070.0,0.012658,33.0,0.397849
max,278.0,95372070.0,0.015385,69.0,0.810127


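age_param is inversely proportional to the number of days since the news item was published. We use it to estimate how current a news item was when it was viewed. Note that the cell above measures the age from the notebook run time (`pd.Timestamp.now()`), not from each view's timestamp, so the value is only an approximation of that.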
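
As a worked example of the formula, the sketch below computes the same quantity against an explicit reference date, which makes the result reproducible between runs. The `reference` argument and the `min_age_days` clamp (which guards against division by zero for items published on the reference day) are assumptions added for illustration and are not part of the original cell.

```python
from datetime import datetime


def age_param(published_at: datetime, reference: datetime,
              min_age_days: int = 1) -> float:
    """Inverse of the item's age in whole days, with the age clamped to min_age_days."""
    age_days = max((reference - published_at).days, min_age_days)
    return 1.0 / age_days


# An item published on 2021-08-01 is 95 days old on 2021-11-04, so the weight is 1/95.
print(age_param(datetime(2021, 8, 1, 9, 0), datetime(2021, 11, 4, 9, 0)))
```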

We split the data into training and test sets: views from the first three weeks of August (the first 20 days, by view date) go to train, the rest to test.

In [194]:
test_size_days = 20

data_train = final_df[final_df['date_time'].dt.day < final_df['date_time'].dt.day.min() + test_size_days]
data_test = final_df[final_df['date_time'].dt.day >= final_df['date_time'].dt.day.min() + test_size_days]
print("Number of views in train: ", data_train.shape[0])
print("Number of views in test: ", data_test.shape[0])

Number of views in train:  20483
Number of views in test:  5959
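
The cell above compares day-of-month values, which works as long as all views fall within a single calendar month. A minimal sketch of the same split with an explicit cutoff timestamp is shown below; it would also behave sensibly if the data crossed a month boundary. The function name `split_by_cutoff` and the dictionary-based rows are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def split_by_cutoff(views: List[Dict], train_days: int = 20) -> Tuple[List[Dict], List[Dict]]:
    """Put views from the first `train_days` calendar days into train, the rest into test."""
    first = min(view["date_time"] for view in views)
    # Cutoff is midnight `train_days` days after the first day that has any view.
    cutoff = datetime(first.year, first.month, first.day) + timedelta(days=train_days)
    train = [view for view in views if view["date_time"] < cutoff]
    test = [view for view in views if view["date_time"] >= cutoff]
    return train, test


toy_views = [
    {"user_id": 1, "date_time": datetime(2021, 8, 1, 18, 51)},
    {"user_id": 1, "date_time": datetime(2021, 8, 20, 23, 59)},
    {"user_id": 2, "date_time": datetime(2021, 8, 21, 0, 0)},
    {"user_id": 2, "date_time": datetime(2021, 9, 2, 10, 0)},
]
train_rows, test_rows = split_by_cutoff(toy_views)
print(len(train_rows), len(test_rows))
```

With a day-of-month comparison the view dated 2 September would have landed in train; with a cutoff timestamp it goes to test.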


We prepare the resulting dataset used to check the recommendations.

In [195]:
result = data_test.groupby('user_id')['news_id'].unique().reset_index()
result.columns = ['user_id', 'history']
result['history'] = result['history'].apply(lambda x: list(x))
result.head(5)

Unnamed: 0,user_id,history
0,2,"[94339073, 94351073]"
1,3,"[94006073, 94108073, 94642073, 94860073, 75790..."
2,4,"[94953073, 95030073, 95023073, 95149073, 95151..."
3,5,"[94482073, 94953073, 95149073, 94898073, 75970..."
4,6,"[94953073, 95030073, 95149073, 95076073, 95148..."


We prepare the matrices for training and testing the model. user_item_matrix_test contains all of the source data and is used to validate the model. user_item_matrix is the training matrix and contains only the training data. Both matrices have one row per unique user and one column per unique news item. Each cell holds age_param (how current the news item is at recommendation time) as the weight. If a user has no views of a news item, the cell is set to 0.

In [196]:
user_item_matrix_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='age_param', fill_value=0
)


ui_av_matrix_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='age_view_param', fill_value=0
)


user_item_matrix_test = user_item_matrix_test.astype(float) 
ui_av_matrix_test = ui_av_matrix_test.astype(float)

user_item_matrix = user_item_matrix_test.copy(deep=True)
ui_av_matrix = ui_av_matrix_test.copy(deep=True)

for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    user_item_matrix.loc[user_id, news_id] = 0
    ui_av_matrix.loc[user_id, news_id] = 0
    


sparse_user_item = csr_matrix(user_item_matrix).T.tocsr()
sparse_user_item_test = csr_matrix(user_item_matrix_test).T.tocsr()

sparse_ui_av_item = csr_matrix(ui_av_matrix).T.tocsr()
sparse_ui_av_test = csr_matrix(ui_av_matrix_test).T.tocsr()

print("Train matrix shape: ", user_item_matrix.shape)
print("Test matrix shape: ", user_item_matrix_test.shape)

user_item_matrix.describe()

Train matrix shape:  (239, 5809)
Test matrix shape:  (239, 5809)


news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,0.0,1e-06,1e-06,0.0,2e-06,0.0,2e-06,0.0,2e-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,1.7e-05,0.0,2.1e-05,2.2e-05,0.0,2.5e-05,0.0,2.5e-05,0.0,2.6e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.000269,0.0,0.00033,0.000333,0.0,0.000379,0.0,0.000383,0.0,0.000398,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [199]:
user_item_matrix.loc[1].max()

0.012658227848101266

In [200]:
final_df['interes'] = final_df.apply(lambda x: df_users_news_interes.loc[x.user_id, x.news_id], axis='columns')
final_df.head()

Unnamed: 0,user_id,news_id,date_time,age_param,unique_views,age_view_param,title_age,interes
0,1,94006073,2021-08-01 18:51:19,0.010526,38,0.4,95,36
1,2,94006073,2021-08-04 13:08:19,0.010526,38,0.4,95,87
2,3,94006073,2021-08-29 12:40:07,0.010526,38,0.4,95,163
3,6,94006073,2021-08-02 09:04:55,0.010526,38,0.4,95,98
4,11,94006073,2021-08-02 17:16:23,0.010526,38,0.4,95,37


In [201]:
final_df['interes'] = final_df.apply(lambda x: df_users_news_interes.loc[x.user_id, x.news_id], axis='columns')
final_df['content_param'] = final_df['interes'] / final_df['title_age']

uc_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='content_param', fill_value=0
)

uc_test = uc_test.astype(float) 
uc_train = uc_test.copy(deep=True)


for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    uc_train.loc[user_id, news_id] = 0
    

sparse_uc = csr_matrix(uc_train).T.tocsr()
sparse_uc_test = csr_matrix(uc_test).T.tocsr()


print("Train matrix shape: ", uc_train.shape)
print("Test matrix shape: ", uc_test.shape)

uc_test.describe()

Train matrix shape:  (239, 5809)
Test matrix shape:  (239, 5809)


news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,2e-05,4e-06,0.00012,3e-06,1.1e-05,2.2e-05,6e-06,2.2e-05,2.6e-05,8.7e-05,...,0.012037,0.002253,0.020985,0.008239,0.006437,0.020985,0.00412,0.006244,0.002575,0.001674
std,0.000313,6.4e-05,0.00186,4.3e-05,0.000172,0.000343,9.9e-05,0.000347,0.000401,0.001338,...,0.186093,0.03483,0.230025,0.127379,0.099515,0.230025,0.063689,0.096529,0.039806,0.025874
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.004839,0.000986,0.028751,0.000666,0.002664,0.005305,0.001534,0.005362,0.006206,0.020684,...,2.876923,0.538462,2.753846,1.969231,1.538462,2.753846,0.984615,1.492308,0.615385,0.4


In [202]:
user_item_matrix_test.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,1e-06,1e-06,1e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,...,6.4e-05,6.4e-05,0.000129,6.4e-05,6.4e-05,0.000129,6.4e-05,6.4e-05,6.4e-05,6.4e-05
std,1.7e-05,2.1e-05,2.1e-05,2.2e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.6e-05,...,0.000995,0.000995,0.001404,0.000995,0.000995,0.001404,0.000995,0.000995,0.000995,0.000995
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.000269,0.000329,0.00033,0.000333,0.000381,0.000379,0.000383,0.000383,0.000388,0.000398,...,0.015385,0.015385,0.015385,0.015385,0.015385,0.015385,0.015385,0.015385,0.015385,0.015385


In [203]:
ui_av_matrix_test.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,1e-06,1e-06,1e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,...,6.4e-05,6.4e-05,0.000257,6.4e-05,6.4e-05,0.000257,6.4e-05,6.4e-05,6.4e-05,6.4e-05
std,1.7e-05,2.1e-05,2.1e-05,2.2e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.6e-05,...,0.000995,0.000995,0.002809,0.000995,0.000995,0.002809,0.000995,0.000995,0.000995,0.000995
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.000269,0.000329,0.00033,0.000333,0.000381,0.000379,0.000383,0.000383,0.000388,0.000398,...,0.015385,0.015385,0.030769,0.015385,0.015385,0.030769,0.015385,0.015385,0.015385,0.015385


In [93]:
user_item_matrix_test.tail()

user_id / news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
274,0.0,0.0,0.000331,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
277,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
278,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Function for getting recommendations from a trained model.

In [229]:
user_ids = list(user_item_matrix_test.index.values)
news_ids = list(user_item_matrix_test.columns.values)
def recomend_test_user(user_id, model, n=20, data=sparse_user_item_test):
    user_index = user_ids.index(user_id)
    recommendations = model.recommend(user_index, data, N=n)
    result = [news_ids[x[0]] for x in recommendations]
    return result

In [231]:
recomend_test_user(1, model_iir)

[]

### Training the ItemItemRecommender model
An algorithm based on the nearest-neighbours method.

[ItemItemRecommender model](https://github.com/benfred/implicit/blob/main/implicit/nearest_neighbours.py#L12)
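
Neighbourhood-based recommenders score unseen items by their similarity to the items a user has already interacted with. The sketch below is a simplified pure-Python illustration of that idea, using dot-product similarity and the top `k` neighbours per item. It is not the library's code, and the function names and toy weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Ratings = Dict[Hashable, Dict[int, float]]  # user -> {item: weight}


def item_similarities(ratings: Ratings, k: int = 2) -> Dict[int, List[Tuple[int, float]]]:
    """For every item keep its k most similar items by dot product over shared users."""
    by_item: Dict[int, Dict[Hashable, float]] = defaultdict(dict)
    for user, items in ratings.items():
        for item, weight in items.items():
            by_item[item][user] = weight
    sims = {}
    for i, users_i in by_item.items():
        scores = {}
        for j, users_j in by_item.items():
            if i == j:
                continue
            dot = sum(w * users_j[u] for u, w in users_i.items() if u in users_j)
            if dot > 0:
                scores[j] = dot
        sims[i] = sorted(scores.items(), key=lambda pair: -pair[1])[:k]
    return sims


def recommend_for_user(user: Hashable, ratings: Ratings,
                       sims: Dict[int, List[Tuple[int, float]]],
                       n: int = 5) -> List[Tuple[int, float]]:
    """Score unseen items as the sum of weight * similarity over the user's seen items."""
    seen = ratings[user]
    scores: Dict[int, float] = defaultdict(float)
    for item, weight in seen.items():
        for neighbour, sim in sims.get(item, []):
            if neighbour not in seen:
                scores[neighbour] += weight * sim
    return sorted(scores.items(), key=lambda pair: -pair[1])[:n]


toy = {"a": {1: 1.0, 2: 1.0}, "b": {1: 1.0, 2: 1.0, 3: 1.0}, "c": {2: 1.0}}
toy_sims = item_similarities(toy, k=2)
print(recommend_for_user("a", toy, toy_sims))
print(recommend_for_user("c", toy, toy_sims))
```

Since each score is a product of interaction weights, small weights such as age_param lead to very small absolute scores; only the ranking matters.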

In [214]:
%%time

model_iir = ItemItemRecommender(K=20, num_threads=4) 
# model_iir.fit(csr_matrix(user_item_matrix).T, show_progress=True)
model_iir.fit(sparse_user_item, show_progress=True)

model_iir.recommend(12, sparse_user_item, N=20)

  0%|          | 0/5809 [00:00<?, ?it/s]

CPU times: user 322 ms, sys: 141 ms, total: 464 ms
Wall time: 211 ms


[(4999, 7.320704240034766e-09),
 (5003, 7.320704240034766e-09),
 (4983, 7.238449136214149e-09),
 (4993, 7.238449136214149e-09),
 (4997, 7.238449136214149e-09),
 (4954, 7.158021923589548e-09),
 (4942, 7.158021923589548e-09),
 (4903, 7.079362342011641e-09),
 (4913, 7.079362342011641e-09),
 (4813, 7.0024127513376e-09),
 (4841, 7.0024127513376e-09),
 (4767, 6.927117990570531e-09),
 (663, 6.927117990570531e-09),
 (4773, 6.927117990570531e-09),
 (4810, 6.927117990570531e-09),
 (4590, 6.781283927611151e-09),
 (4708, 6.781283927611151e-09),
 (4728, 6.781283927611151e-09),
 (4734, 6.781283927611151e-09),
 (4729, 6.710645553365201e-09)]

In [215]:
def get_similar_news(news_id, model, count=10):
    item_id = news_ids.index(news_id)
    result = model.similar_items(item_id, count)
    news_ids_result = [news_ids[x[0]] for x in result]
    news_ids_result.append(news_id)
    df_result = df_news[df_news['id'].isin(news_ids_result)]
    return df_result[['id', 'title', 'unique_views', 'published_at']]

get_similar_news(80375073, model_iir_с)

Unnamed: 0,id,title,unique_views,published_at
1,80375073,Для пассажиров закрытого участка Арбатско-Покр...,1,2020-09-26 09:04:00
96,94349073,Сказки в стиле Пикассо и путевые рисунки: каки...,23,2021-08-08 09:03:00
1406,94419073,Лучшие в 2020-м. Рассказываем о победителях ко...,62,2021-08-10 07:01:03
3073,94317073,Тоннели между станциями «Лианозово» и «Физтех»...,26,2021-08-08 09:05:00
3374,94136073,"Палеоарт, Булгаков и княгиня Волконская. Выста...",42,2021-08-03 10:01:00
3435,94417073,В Москве начала работать онлайн-платформа «Кар...,66,2021-08-10 07:03:00
3641,94234073,Открытие трех станций БКЛ улучшит транспортную...,48,2021-08-05 07:01:03
3759,94058073,Волшебные леденцы и богомол Великолепный. Шест...,17,2021-08-01 12:11:00
4036,94415073,От ИТ-технологий до флористики: в «Технограде»...,50,2021-08-10 09:01:00
6210,7547050,Сергей Собянин посетил обновленный Дом ученых ...,38,2021-08-04 13:49:03


In [112]:
news_ids[5286], news_ids[2], news_ids[5268]
news_ids.index(94681073)
df_news.head()

Unnamed: 0,id,title,importance,published_at,created_at,updated_at,is_deferred_publication,status,ya_rss,active_from,...,preview_text,full_text,url,preview,text,promo,images,unique_views,spheres_tags,spheres_tags_ids
0,75178073,Открыта запись на электронное голосование по и...,,2020-06-05 09:00:00,2020-06-04 22:14:43,2020-06-24 09:11:55,0.0,public,1,,...,Электронное голосование доступно для граждан Р...,"<p>На порталах <a href=""https://www.mos.ru/pgu...",/news/item/75178073/,,,,,1,"[{'id': 1299, 'title': 'Социальная сфера', 'sp...","[1299, 170217, 16324217, 40823217]"
1,80375073,Для пассажиров закрытого участка Арбатско-Покр...,,2020-09-26 09:04:00,2020-09-25 16:49:20,2020-09-26 09:04:00,0.0,public,1,,...,Они будут следовать от станции метро «Молодежн...,<p>С 26 сентября по 5 октября закрылся участок...,/news/item/80375073/,,,,,1,"[{'id': 2299, 'title': 'Транспорт', 'special':...","[2299, 231299, 255299, 10217, 462217, 4790217,..."
2,41116073,Москвичка Арина Аверина выиграла чемпионат Евр...,,2018-06-04 12:04:00,2018-06-04 09:51:15,2020-06-11 19:33:47,0.0,public,0,,...,"Арина Аверина победила с результатом 79,250 ба...",<p>Московские спортсменки Арина и Дина Аверины...,/news/item/41116073/,,,,,2,"[{'id': 6299, 'title': 'Спорт', 'special': 0, ...","[6299, 332217, 4318217, 6601217, 30170217]"
3,94978073,Многофункциональный комплекс со спортивными и ...,,2021-08-23 09:23:51,2021-08-23 09:07:27,2021-08-23 09:23:01,0.0,public_oiv,0,,...,Новое здание появится на Волгоградском проспекте.,"<p style=""text-align: justify;"">На юго-востоке...",/news/item/94978073/,,,,,3,"[{'id': 4299, 'title': 'Строительство и реконс...","[4299, 4000217, 12252217, 40016217]"
4,64742073,«По масштабам Вселенной 90 лет — это миг». Инт...,,2019-11-05 10:00:00,2019-11-01 21:28:46,2021-01-22 13:50:55,0.0,public,1,,...,"Фаина Рублева — о новейших технологиях, подаре...",<p>5 ноября 1929 года в Москве открылся первый...,/news/item/64742073/,,,,,2,"[{'id': 3299, 'title': 'Культура', 'special': ...","[3299, 292299, 242299, 248299, 49217, 1074217,..."


In [216]:
%%time

model_iir_v = ItemItemRecommender(K=20, num_threads=4) 
model_iir_v.fit(sparse_ui_av_item, show_progress=True)

model_iir_v.recommend(12, sparse_ui_av_item, N=20)

  0%|          | 0/5809 [00:00<?, ?it/s]

CPU times: user 271 ms, sys: 181 ms, total: 452 ms
Wall time: 204 ms


[(4841, 3.2211098656152963e-07),
 (4767, 2.8401183761339173e-07),
 (4993, 2.8229951631235184e-07),
 (4813, 2.6609168455082885e-07),
 (4999, 2.48903944161182e-07),
 (4728, 2.3734493746639023e-07),
 (4983, 1.9543812667778207e-07),
 (663, 1.9395930373597484e-07),
 (4773, 1.9395930373597484e-07),
 (4590, 1.8309466604550104e-07),
 (4734, 1.7631338211788992e-07),
 (4729, 1.1408097440720842e-07),
 (4942, 1.0021230693025367e-07),
 (4671, 9.962195460665866e-08),
 (4708, 9.49379749865561e-08),
 (4810, 9.005253387741689e-08),
 (5003, 8.784845088041717e-08),
 (4519, 7.976081571999783e-08),
 (4324, 7.911497915546343e-08),
 (4954, 7.873824115948503e-08)]

In [220]:
%%time

model_iir_с = ItemItemRecommender(K=20, num_threads=4) 
model_iir_с.fit(sparse_uc, show_progress=True)

model_iir_с.recommend(12, sparse_uc, N=20)

  0%|          | 0/5809 [00:00<?, ?it/s]

CPU times: user 279 ms, sys: 172 ms, total: 451 ms
Wall time: 202 ms


[(4728, 0.0005784774254448692),
 (4610, 0.00047963640637732423),
 (4477, 0.00044803530962572393),
 (4590, 0.00040862660690999274),
 (4993, 0.00038362332732107745),
 (4519, 0.0003830727915610603),
 (4841, 0.0003761976226528612),
 (4710, 0.00037609280303105774),
 (4767, 0.000369637943094834),
 (4492, 0.0003668275705783067),
 (4942, 0.0003481805024072427),
 (4729, 0.0003459069356937627),
 (4734, 0.0003372400310040301),
 (4999, 0.00033483437053071006),
 (4135, 0.0003322191360921986),
 (625, 0.0003249102165509404),
 (4643, 0.0003125988519175596),
 (4983, 0.0003047966162277054),
 (4614, 0.0002952684043480688),
 (4403, 0.00029073563532996873)]

In [223]:
recomend_test_user(1, model_iir_с, n=5, data=sparse_uc)

[]

### Training the ALS (Alternating Least Squares) model
Alternating least squares algorithm.

[AlternatingLeastSquares model](https://github.com/benfred/implicit/blob/main/implicit/als.py#L7)

In [114]:
%%time
model_als = AlternatingLeastSquares(factors=100,  # number of latent factors
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model_als.fit(sparse_user_item, show_progress=True)

model_als.recommend(12, user_items=sparse_user_item, N=20)



  0%|          | 0/15 [00:00<?, ?it/s]

CPU times: user 18.8 s, sys: 18.6 s, total: 37.4 s
Wall time: 3.19 s


[(5049, 0.73590624),
 (4934, 0.6579679),
 (5173, 0.619991),
 (4723, 0.5993882),
 (5305, 0.58116615),
 (5304, 0.5605637),
 (5250, 0.44830734),
 (4879, 0.43192726),
 (685, 0.42494482),
 (698, 0.4170365),
 (4817, 0.41369218),
 (5192, 0.40941116),
 (4842, 0.34828132),
 (678, 0.34163445),
 (4826, 0.33572203),
 (4771, 0.32941884),
 (5054, 0.324797),
 (5347, 0.32169178),
 (5111, 0.30733085),
 (695, 0.30613372)]

In [134]:
result['itemitem'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir), axis='columns')
result['diff_iir'] = result.apply(lambda x: len(set(x["history"]) & set(x["itemitem"])), axis='columns')
result['als'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_als), axis='columns')
result['diff_als'] = result.apply(lambda x: len(set(x["history"]) & set(x["als"])), axis='columns')

result.head()

Unnamed: 0,user_id,history,itemitem,diff_iir,als,diff_als
0,2,"[94339073, 94351073]","[94633073, 94643073, 94614073, 94645073, 75670...",0,"[94131073, 94115073, 94792073, 93978073, 94679...",0
1,3,"[94006073, 94108073, 94642073, 94860073, 75790...","[94860073, 94765073, 94865073, 94724073, 94852...",2,"[94479073, 94294073, 94482073, 94469073, 94279...",0
2,4,"[94953073, 95030073, 95023073, 95149073, 95151...","[94702073, 94659073, 94705073, 94681073, 94688...",0,"[94702073, 94701073, 94707073, 94687073, 94779...",0
3,5,"[94482073, 94953073, 95149073, 94898073, 75970...","[94900073, 94874073, 94913073, 94875073, 94876...",0,"[94415073, 94479073, 7575050, 94702073, 947920...",0
4,6,"[94953073, 95030073, 95149073, 95076073, 95148...","[94702073, 94659073, 94705073, 94681073, 94688...",0,"[94638073, 94193073, 94479073, 94346073, 94197...",0


In [135]:
result.describe()

Unnamed: 0,user_id,diff_iir,diff_als
count,163.0,163.0,163.0
mean,131.349693,0.226994,0.067485
std,77.642146,0.580555,0.296665
min,2.0,0.0,0.0
25%,68.5,0.0,0.0
50%,129.0,0.0,0.0
75%,190.0,0.0,0.0
max,275.0,3.0,2.0


In [136]:
positive_result_iir_count = result[result['diff_iir'] > 0].shape[0]
positive_result_als_count = result[result['diff_als'] > 0].shape[0]
print("Number of hits for the IIR model: ", positive_result_iir_count)
print("Number of hits for the ALS model: ", positive_result_als_count)

Number of hits for the IIR model:  26
Number of hits for the ALS model:  9
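
The hit counts above are easier to compare as a share of users: a user counts as a hit when at least one recommended item also appears in that user's `history`. The block below is a minimal sketch of that calculation on made-up data; the helper name `hit_rate` and the toy dictionaries are hypothetical and not part of the notebook.

```python
def hit_rate(histories, recommendations):
    """Share of users with at least one recommended item in their history."""
    users = list(histories)
    hits = sum(
        1 for user in users
        if set(histories[user]) & set(recommendations.get(user, []))
    )
    return hits / len(users)


# Toy data for illustration only, not taken from the dataset.
toy_histories = {1: [10, 11], 2: [12], 3: [13, 14], 4: [15]}
toy_recs = {1: [11, 20], 2: [21, 22], 3: [14, 13], 4: [23]}

toy_hit_rate = hit_rate(toy_histories, toy_recs)
print(toy_hit_rate)  # users 1 and 3 are hits, so 2 / 4 = 0.5
```

With the 163 rows reported by `result.describe()`, the counts above correspond to roughly 26 / 163 ≈ 0.16 for IIR and 9 / 163 ≈ 0.055 for ALS.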


In [117]:
map_iir = mean_average_precision_at_k(model_iir, sparse_user_item.T, sparse_user_item_test.T, K=5)
map_als = mean_average_precision_at_k(model_als, sparse_user_item.T, sparse_user_item_test.T, K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir)
print("mean average precision at k for model ALS: ", map_als)

mean average precision at k for model ItemItemRecommender:  0.012147838214783824
mean average precision at k for model ALS:  0.00800557880055788


IIR variant with unique views used as the weight:

In [118]:
map_iir_v = mean_average_precision_at_k(model_iir_v, sparse_ui_av_item.T, sparse_ui_av_test.T, K=5)

print("mean average precision at k for model ItemItemRecommender: ", map_iir_v)


mean average precision at k for model ItemItemRecommender:  0.013347280334728037


In [124]:
map_iir_с = mean_average_precision_at_k(model_iir_с, sparse_uc.T, sparse_uc_test.T, K=5)

print("mean average precision at k for model ItemItemRecommender: ", map_iir_с)

mean average precision at k for model ItemItemRecommender:  0.01260808926080893


In [127]:
precision_iiс = precision_at_k(model_iir, sparse_uc.T, sparse_uc_test.T, K=5)
precision_iiс

0.028451882845188285

In [119]:
precision_iir = precision_at_k(model_iir, sparse_user_item.T, sparse_user_item_test.T, K=5)
precision_als = precision_at_k(model_als, sparse_user_item.T, sparse_user_item_test.T, K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir)
print("precision at k for model ALS: ", precision_als)

precision at k for model ItemItemRecommender:  0.024267782426778243
precision at k for model ALS:  0.012552301255230125
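
For reference, the two metrics used above can be written out for a single user. This is a sketch of the textbook definitions, assuming a ranked recommendation list and a set of held-out relevant items; the function names and toy values are hypothetical, and the `implicit` implementation may normalise or aggregate across users differently in detail.

```python
def precision_at_k_sketch(recommended, relevant, k=5):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


def average_precision_at_k_sketch(recommended, relevant, k=5):
    """Average of precision values at each rank where a relevant item occurs."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0


# Toy example: relevant items sit at ranks 2 and 5.
toy_recommended = [101, 102, 103, 104, 105]
toy_relevant = {102, 105, 999}

toy_precision = precision_at_k_sketch(toy_recommended, toy_relevant)
toy_ap = average_precision_at_k_sketch(toy_recommended, toy_relevant)
print(toy_precision, toy_ap)
```

Mean average precision is then the mean of the per-user average precision values. Because it rewards relevant items that appear near the top of the list, it can rank models differently from plain precision.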


In [None]:
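def get_result_df(model):
    # Unfinished draft cell. The last two lines of the function are an assumed
    # completion: the original collected both lists but never stored or returned them.
    result = dict()
    users = df_views['user_id'].unique()
    for user in users:
        # Top-5 recommendations from the model for this user
        model_news_for_user = recomend_test_user(user, model, n=5)
        # Top-5 news items of interest for the same user
        top_interes_news_for_user = get_top_new_news_for_user(user, n=5)
        result[user] = {'model': model_news_for_user, 'top': top_interes_news_for_user}
    return pd.DataFrame.from_dict(result, orient='index')
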

In [180]:
recomend_test_user(8, model_iir_с, n=5)

[94577073, 94634073, 94681073, 94688073, 94349073]

On the current metrics, ItemItemRecommender outperforms ALS (MAP@5 of about 0.012 versus 0.008, precision@5 of about 0.024 versus 0.013). With more data and a deeper study of the subject area, better results should be achievable. In addition, a hybrid form of collaborative filtering (combining content analysis with the nearest-neighbours method) could be used to improve the quality of the recommender system.
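
One simple way to combine a collaborative model with a content-based one is to blend their scores per item. The block below is only an illustrative sketch, not something implemented in this project: the function names, the `alpha` weight, and the toy scores are all assumptions.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1] so that the two sources are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (value - lo) / (hi - lo) for item, value in scores.items()}


def blend_scores(cf_scores, content_scores, alpha=0.5):
    """Rank items by alpha * collaborative score + (1 - alpha) * content score."""
    cf = min_max_normalize(cf_scores)
    content = min_max_normalize(content_scores)
    items = set(cf) | set(content)
    blended = {
        item: alpha * cf.get(item, 0.0) + (1 - alpha) * content.get(item, 0.0)
        for item in items
    }
    # Sort by descending score, breaking ties by item id for determinism.
    return sorted(blended, key=lambda item: (-blended[item], item))


# Hypothetical scores: item id -> score from each source.
toy_cf = {1: 0.9, 2: 0.5, 3: 0.1}
toy_content = {2: 0.8, 3: 0.2, 4: 0.6}

toy_ranking = blend_scores(toy_cf, toy_content, alpha=0.5)
print(toy_ranking)
```

Item 4 has no collaborative score at all and still enters the ranking through its content score. That property is what makes a hybrid useful for fresh news items with few views.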

## Description of the news auto-tagging algorithm

The tagging algorithm is based on the [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) measure. It selects the most heavily weighted words by comparing how often each word occurs in the document with how often it occurs in the full corpus. The full corpus is built from all news items, their tags, and their spheres, with stop words excluded. The output of the algorithm is a set of tags and spheres for the supplied news text.
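
The weighting idea can be illustrated with a minimal TF-IDF sketch. It uses the plain textbook formulas (term frequency multiplied by the log of the inverse document frequency) on a made-up corpus. The function names and data are hypothetical, and the repository's own `compute_tf` and `compute_idf` may differ in detail.

```python
import math
from collections import Counter


def tf_sketch(tokens):
    """Relative frequency of each token within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


def idf_sketch(corpus):
    """log(N / df) for each token, where df is the number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def top_terms(tokens, idf, n=3):
    """Return the n tokens with the highest TF-IDF weight in the document."""
    tf = tf_sketch(tokens)
    scores = {token: tf[token] * idf.get(token, 0.0) for token in tf}
    return sorted(scores, key=lambda token: (-scores[token], token))[:n]


# Toy corpus of already tokenised documents, for illustration only.
toy_corpus = [
    ["metro", "station", "opening", "moscow"],
    ["park", "festival", "moscow"],
    ["metro", "line", "construction", "moscow"],
]
toy_idf = idf_sketch(toy_corpus)
toy_keywords = top_terms(toy_corpus[1], toy_idf, n=2)
print(toy_keywords)
```

A word such as "moscow" that appears in every document gets an IDF of zero, so it never becomes a tag candidate. Words specific to one document rise to the top.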

[Text-processing functions](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/text_manipulation.py):
- get_text_on_pattern_replacement_func - strips HTML tags
- get_lst_of_normalized_tokens_without_stopwords - normalizes words and removes stop words
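
The two preprocessing steps can be sketched with the standard library alone. This is an assumption-laden illustration rather than the repository code: the regular expressions, the tiny stop-word set, and the function names are hypothetical, and the lemmatization step (which would need a morphological analyzer) is left out.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")        # naive HTML tag matcher
TOKEN_RE = re.compile(r"[a-zа-яё]+")   # runs of Latin or Cyrillic letters
TOY_STOPWORDS = {"и", "в", "на", "the", "a", "in"}  # illustrative subset only


def strip_html(text):
    """Replace HTML tags with spaces so adjacent words do not merge."""
    return TAG_RE.sub(" ", text)


def tokens_without_stopwords(text, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenise, and drop stop words (no lemmatization here)."""
    tokens = TOKEN_RE.findall(strip_html(text).lower())
    return [token for token in tokens if token not in stopwords]


sample = "<p>В Москве открыли <b>новый</b> парк</p>"
cleaned = tokens_without_stopwords(sample)
print(cleaned)
```

In a full pipeline each remaining token would also be reduced to its normal form before the TF-IDF weights are computed, so that inflected forms of the same word are counted together.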

[Functions for building the corpus of all available materials, tags, and spheres](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/work_with_files.py#L164)

[Core functions of the algorithm](https://github.com/mandrianova/mos-news/blob/master/auto_markup/model.py):
- get_result_tag_and_spheres_for_title_preview_fulltext - returns the resulting tags and spheres
- compute_idf - computes IDF
- compute_tf - computes TF
- get_named_objects_without_stopwords - extracts named entities to enrich the results (returns addresses, titles, names, organizations)


Technologies used:
- [nltk](https://github.com/nltk/nltk "NLTK, the Natural Language Toolkit: a suite of text-processing tools")
- [pymorphy2](https://github.com/kmike/pymorphy2/blob/92d546f042ff14601376d3646242908d5ab786c1/docs/index.rst "pymorphy2 morphological analyzer: reduces words to their normal form, among other things")
- [natasha](https://github.com/natasha/natasha "library for processing Russian-language text")