# Supporting documentation for the task

## Analysis of the provided dataset

1. Error detection. List of malformed view URLs found in dataset_news_1.xlsx:
```
mos.ru/news/item/89421073/ /
mos.ru/news/item/9468/
mos.ru/news/item/94670073/ /
mos.ru/news/item/94501073/душ/
mos.ru/news/item/89957073/ Их/
mos.ru/news/item/94852073/%5c/
mos.ru/news/item/94479073/ (https:/app.aif.ru/owa/redir.aspx/
mos.ru/news/item/94792073/ /
mos.ru/news/item/94897073/+/
mos.ru/news/item/94953073/ /
mos.ru/news/item/91919073/-/
```
2. Analysis of the data:
- 239 users, 5,812 news items, 26,446 views
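The malformed URLs listed above share one pattern: extra characters after the numeric id segment. A minimal sketch of how such rows could be flagged and the id still recovered, assuming every valid URL has exactly the form `mos.ru/news/item/<digits>/`. The helper names `extract_news_id` and `is_malformed` are hypothetical and not part of the original notebook. This check would not catch a truncated id such as `9468`, which is syntactically valid.

```python
import re

# Hypothetical helper: the first run of digits that follows "/news/item/".
NEWS_ID_RE = re.compile(r"/news/item/(\d+)")


def extract_news_id(url: str):
    """Return the numeric news id, or None when no id segment is present."""
    match = NEWS_ID_RE.search(url)
    return int(match.group(1)) if match else None


def is_malformed(url: str) -> bool:
    """Assumption: a well-formed URL is exactly mos.ru/news/item/<digits>/."""
    return re.fullmatch(r"mos\.ru/news/item/\d+/", url) is None


samples = [
    "mos.ru/news/item/89421073/ /",
    "mos.ru/news/item/94852073/%5c/",
    "mos.ru/news/item/94006073/",
]
for url in samples:
    print(url, extract_news_id(url), is_malformed(url))
```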

## Tested hypotheses and how the recommender solution works

In [1]:
import pandas as pd
import numpy as np
import json
import datetime
from scipy.sparse import csr_matrix
from implicit.nearest_neighbours import ItemItemRecommender
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import precision_at_k, mean_average_precision_at_k

Load the data.

In [2]:
def get_news_id_from_url(url: str) -> int:
    """
    Extract the news id from a URL.
    """
    parts = url.split('/')
    try:
        return int(parts[-2])
    except (ValueError, IndexError):
        for part in parts:
            # Empirically, only URLs whose id ends in 073 turned out to be
            # malformed, so fall back to the first segment containing '073'.
            if '073' in part:
                return int(part)
        return 0

    
df_views = pd.read_excel('/app/data/dataset_news_1.xlsx')
df_news = pd.read_json('/app/data/news.json', encoding="utf_8_sig")
df_views['news_id'] = df_views['url_clean'].apply(get_news_id_from_url)
df_news['unique_views'] = df_news['id'].apply(lambda x: df_views[df_views.news_id == x].user_id.nunique())
merged = df_views.merge(df_news, left_on='news_id', right_on='id')

In [3]:
final_df = merged.drop(['importance', 'is_deferred_publication', 'status', 'ya_rss', 'active_from',
                       'active_to', 'search', 'display_image', 'icon_id', 'canonical_url', 'canonical_updated_at',
                       'is_powered', 'has_image', 'attach', 'active_from_timestamp', 'active_to_timestamp',
                       'image', 'counter', 'preview_text', 'images'],
                      axis=1)
final_df['title_age'] = (final_df['published_at'].max() - final_df['published_at']).dt.days + 1
final_df['age_param'] = 1 / final_df['title_age']
final_df['age_view_param'] = final_df['unique_views'] / final_df['title_age']

users, items, interactions = final_df.user_id.nunique(), final_df.id.nunique(), final_df.shape[0]
print('# users: ', users)
print('# items: ', items)
print('# interactions: ', interactions)
final_df = final_df[['user_id', 'news_id', 'date_time', 'age_param', 'unique_views', 'age_view_param', 'title_age']]
final_df.head()

# users:  239
# items:  5809
# interactions:  26442


Unnamed: 0,user_id,news_id,date_time,age_param,unique_views,age_view_param,title_age
0,1,94006073,2021-08-01 18:51:19,0.032258,38,1.225806,31
1,2,94006073,2021-08-04 13:08:19,0.032258,38,1.225806,31
2,3,94006073,2021-08-29 12:40:07,0.032258,38,1.225806,31
3,6,94006073,2021-08-02 09:04:55,0.032258,38,1.225806,31
4,11,94006073,2021-08-02 17:16:23,0.032258,38,1.225806,31


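age_param is inversely proportional to the number of days since the news item was published. We use it as a proxy for how current a news item is when it is viewed. Note that in the code the age is measured against the latest published_at in the dataset, not against each view's timestamp, so the newest item has title_age = 1 and age_param = 1.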
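As a worked example of the formula, here is a small sketch on three made-up publication dates (the dates are illustrative, not taken from the dataset). An item published 30 days before the newest one gets title_age = 31 and age_param = 1/31 ≈ 0.032258, the same value that appears in the table above.

```python
import pandas as pd

# Illustrative publication dates; only the column names follow the notebook.
df = pd.DataFrame(
    {"published_at": pd.to_datetime(["2021-08-31", "2021-08-30", "2021-08-01"])}
)

# Age in days relative to the newest item, starting from 1 to avoid division by zero.
df["title_age"] = (df["published_at"].max() - df["published_at"]).dt.days + 1
df["age_param"] = 1 / df["title_age"]
print(df)
```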

Split the data into train and test sets. Train takes roughly the first three weeks of August by view date (the first 20 days); everything else goes to test.

In [4]:
test_size_days = 20  # despite the name: number of days, counted from the first view day, that go into train

data_train = final_df[final_df['date_time'].dt.day < final_df['date_time'].dt.day.min() + test_size_days]
data_test = final_df[final_df['date_time'].dt.day >= final_df['date_time'].dt.day.min() + test_size_days]
print("Number of views in train: ", data_train.shape[0])
print("Number of views in test: ", data_test.shape[0])

Number of views in train:  20483
Number of views in test:  5959
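The split above compares day-of-month values, which works only while all views fall within a single calendar month. A hedged sketch of the same split using a timestamp cutoff, which does not rely on that assumption. The toy frame and the name `train_days` are made up for illustration.

```python
import pandas as pd

# Toy view log; only the column names follow the notebook.
views = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "date_time": pd.to_datetime([
        "2021-08-01 18:51:19", "2021-08-20 23:59:59",
        "2021-08-21 00:00:00", "2021-08-29 12:40:07",
    ]),
})

train_days = 20  # hypothetical name: length of the training window in days

# Midnight of the first view day plus the training window.
cutoff = views["date_time"].min().normalize() + pd.Timedelta(days=train_days)
train = views[views["date_time"] < cutoff]
test = views[views["date_time"] >= cutoff]
print(cutoff, len(train), len(test))
```

On data confined to one month, this gives the same partition as the day-of-month comparison.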


Prepare the result dataset used to check the recommendations.

In [5]:
result = data_test.groupby('user_id')['news_id'].unique().reset_index()
result.columns = ['user_id', 'history']
result['history'] = result['history'].apply(lambda x: list(x))
result.head(5)

Unnamed: 0,user_id,history
0,2,"[94339073, 94351073]"
1,3,"[94006073, 94108073, 94642073, 94860073, 75790..."
2,4,"[94953073, 95030073, 95023073, 95149073, 95151..."
3,5,"[94482073, 94953073, 95149073, 94898073, 75970..."
4,6,"[94953073, 95030073, 95149073, 95076073, 95148..."


Prepare the matrices for training and testing the model. user_item_matrix_test contains all of the source data and is used to evaluate the model. user_item_matrix is the training matrix and contains only the training data. Both have one row per unique user and one column per unique news item. Each cell holds age_param (how current the news item is). If a user has no view of a news item, the cell is filled with 0. The sparse versions are transposed to items × users before being passed to the models.

In [6]:
user_item_matrix_test = pd.pivot_table(final_df, 
                                  index='user_id', columns='news_id', 
                                  values='age_param', 
                                  fill_value=0                                       
                                 )
user_item_matrix = user_item_matrix_test.copy(deep=True)
for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    user_item_matrix.loc[user_id, news_id] = 0
    

user_item_matrix = user_item_matrix.astype(float) 

sparse_user_item = csr_matrix(user_item_matrix).T
sparse_user_item_test = csr_matrix(user_item_matrix_test).T

print("Train matrix shape: ", sparse_user_item.shape)
print("Test matrix shape: ", sparse_user_item_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)
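The cell above zeroes the held-out cells one row at a time with iterrows(). A minimal sketch of the same masking done in one step with positional indexing. The toy data and the `is_test` column are assumptions for illustration only.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy interactions; `is_test` marks the rows that belong to the test period.
interactions = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "news_id": [10, 20, 10, 30, 20],
    "age_param": [0.5, 1.0, 0.5, 0.25, 1.0],
    "is_test": [False, True, False, False, True],
})

# Full user x item matrix (plays the role of user_item_matrix_test).
full = pd.pivot_table(
    interactions, index="user_id", columns="news_id",
    values="age_param", fill_value=0,
).astype(float)

# Translate held-out (user, item) labels into row/column positions and zero them.
held_out = interactions[interactions["is_test"]]
rows = full.index.get_indexer(held_out["user_id"])
cols = full.columns.get_indexer(held_out["news_id"])
values = full.to_numpy(copy=True)
values[rows, cols] = 0.0
train = pd.DataFrame(values, index=full.index, columns=full.columns)

# Items x users layout, as in the notebook.
item_user_train = csr_matrix(train.to_numpy()).T.tocsr()
print(item_user_train.shape, item_user_train.nnz)
```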


In [7]:
user_item_matrix.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,0.0,1e-06,1e-06,0.0,2e-06,0.0,2e-06,0.0,2e-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,1.8e-05,0.0,2.2e-05,2.2e-05,0.0,2.5e-05,0.0,2.5e-05,0.0,2.6e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.000274,0.0,0.000338,0.00034,0.0,0.000388,0.0,0.000393,0.0,0.000408,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
user_item_matrix_test.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,1e-06,1e-06,1e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,...,7.7e-05,7.7e-05,0.000155,7.7e-05,7.7e-05,0.000155,7.7e-05,7.7e-05,7.7e-05,7.7e-05
std,1.7e-05,2.1e-05,2.1e-05,2.2e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.6e-05,...,0.001198,0.001198,0.00169,0.001198,0.001198,0.00169,0.001198,0.001198,0.001198,0.001198
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.00027,0.00033,0.000332,0.000334,0.000382,0.000381,0.000385,0.000385,0.00039,0.0004,...,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519


Function that returns recommendations from a trained model.

In [8]:
user_ids = list(user_item_matrix_test.index.values)
news_ids = list(user_item_matrix_test.columns.values)
def recomend_test_user(user_id, model, data=sparse_user_item, n=20):
    user_index = user_ids.index(user_id)
    recommendations = model.recommend(user_index, data.T.tocsr(), N=n)
    result = [news_ids[x[0]] for x in recommendations]
    return result

### Training the ItemItemRecommender model
An algorithm based on the nearest-neighbours method.

[ItemItemRecommender model](https://github.com/benfred/implicit/blob/main/implicit/nearest_neighbours.py#L12)
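To illustrate the general idea behind item-based nearest neighbours, here is a small NumPy sketch on a toy matrix. It uses cosine similarity and is not a reproduction of the library's internals. The function names and the toy data are made up for illustration.

```python
import numpy as np

# Toy user x item interaction matrix (rows: users, columns: items).
user_item = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
])


def item_cosine_similarity(matrix):
    """Cosine similarity between item columns, with self-similarity removed."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for unseen items
    normalized = matrix / norms
    sim = normalized.T @ normalized
    np.fill_diagonal(sim, 0.0)
    return sim


def recommend(user_index, matrix, sim, k_neighbours=2, n=2):
    """Score items by summed similarity to the user's items, using top-k neighbours only."""
    pruned = np.zeros_like(sim)
    for item in range(sim.shape[0]):
        top = np.argsort(sim[item])[::-1][:k_neighbours]
        pruned[item, top] = sim[item, top]
    scores = matrix[user_index] @ pruned
    scores[matrix[user_index] > 0] = -np.inf  # do not re-recommend seen items
    return list(np.argsort(scores)[::-1][:n])


sim = item_cosine_similarity(user_item)
print(recommend(0, user_item, sim))
```

The `k_neighbours` argument plays a role similar to `K` in the model: it limits how many neighbours each item keeps.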

In [27]:
%%time

model_iir = ItemItemRecommender(K=10) 
model_iir.fit(sparse_user_item, show_progress=True)

model_iir.recommend(0, sparse_user_item.T.tocsr(), N=20)

map_iir = mean_average_precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
print(map_iir)

0.026485355648535568
CPU times: user 257 ms, sys: 267 ms, total: 523 ms
Wall time: 236 ms


In [15]:
user_item_matrix.head()

user_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
sparse_user_item.T.toarray()[0].max()

0.06666666666666667

In [10]:
for ui in df_views['user_id'].unique():
    rec = recomend_test_user(ui, model_iir, n=5)
    print(ui, rec)

1 [94634073, 94703073, 7575050, 94702073, 94849073]
2 [94634073, 94801073, 94849073, 94703073, 7575050]
3 [94849073, 94860073, 94634073, 94639073, 94876073]
4 [94801073, 94860073, 94750073, 94876073, 94805073]
5 [94805073, 94634073, 94705073, 94779073, 94750073]
6 [94849073, 94876073, 94805073, 94703073, 94750073]
7 [94741073, 7575050, 94801073, 94702073, 94779073]
8 [7575050, 94849073, 94876073, 94801073, 94750073]
9 [94703073, 7575050, 94849073, 94647073, 7579050]
10 [7575050, 94849073, 94702073, 7578050, 7552050]
11 [94849073, 94801073, 7575050, 94702073, 94634073]
13 [94634073, 94679073, 94801073, 94647073, 94750073]
14 [7575050, 94634073, 7574050, 94849073, 94702073]
16 [7575050, 94849073, 94702073, 94801073, 94750073]
17 [94779073, 94639073, 94849073, 94702073, 7574050]
18 [94801073, 94634073, 94849073, 7575050, 94702073]
19 [94849073, 7575050, 7574050, 94647073, 94634073]
20 [7575050, 94702073, 94801073, 7574050, 94860073]
21 [7575050, 94849073, 94702073, 94634073, 94647073]
22 

### Training the ALS (Alternating Least Squares) model
Matrix factorization fitted by alternating least squares.

[AlternatingLeastSquares model](https://github.com/benfred/implicit/blob/main/implicit/als.py#L7)
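As a rough illustration of the alternating scheme, here is a simplified dense sketch in NumPy. It treats every cell as observed and omits the confidence weighting that implicit-feedback ALS uses, so it is not the library's algorithm. The function `als_dense` and the toy matrix are hypothetical.

```python
import numpy as np


def als_dense(ratings, factors=2, regularization=0.1, iterations=20, seed=0):
    """Simplified ALS: alternate closed-form ridge solutions for users and items."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_f = rng.normal(scale=0.1, size=(n_users, factors))
    item_f = rng.normal(scale=0.1, size=(n_items, factors))
    reg = regularization * np.eye(factors)
    for _ in range(iterations):
        # Fix item factors and solve for all user factors at once.
        user_f = np.linalg.solve(item_f.T @ item_f + reg, item_f.T @ ratings.T).T
        # Fix user factors and solve for all item factors at once.
        item_f = np.linalg.solve(user_f.T @ user_f + reg, user_f.T @ ratings).T
    return user_f, item_f


ratings = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
user_f, item_f = als_dense(ratings)
error = np.linalg.norm(ratings - user_f @ item_f.T)
print(error)
```

Each half-step is an ordinary regularized least-squares problem, which is why the method alternates instead of optimizing both factor matrices jointly.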

In [11]:
%%time
model_als = AlternatingLeastSquares(factors=100,  # number of latent factors
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model_als.fit(sparse_user_item, show_progress=True)

model_als.recommend(12, user_items=sparse_user_item.T.tocsr(), N=20)



CPU times: user 25 s, sys: 28 s, total: 53 s
Wall time: 4.5 s


[(691, 0.111535564),
 (699, 0.09608271),
 (5047, 0.081598945),
 (5388, 0.081111476),
 (5246, 0.08086336),
 (5042, 0.07241055),
 (677, 0.06686108),
 (686, 0.0625821),
 (681, 0.06256615),
 (697, 0.061226755),
 (5422, 0.05439251),
 (5423, 0.052573234),
 (5131, 0.0525547),
 (5044, 0.052017674),
 (4875, 0.047687773),
 (5285, 0.045575123),
 (4725, 0.044333383),
 (4712, 0.04294884),
 (4976, 0.04271647),
 (4858, 0.042591494)]

In [12]:
result['itemitem'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir), axis='columns')
result['diff_iir'] = result.apply(lambda x: len(set(x["history"]) & set(x["itemitem"])), axis='columns')
result['als'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_als), axis='columns')
result['diff_als'] = result.apply(lambda x: len(set(x["history"]) & set(x["als"])), axis='columns')

result.head()

Unnamed: 0,user_id,history,itemitem,diff_iir,als,diff_als
0,2,"[94339073, 94351073]","[94634073, 94801073, 94849073, 94703073, 75750...",0,"[94765073, 94722073, 94642073, 94753073, 94568...",0
1,3,"[94006073, 94108073, 94642073, 94860073, 75790...","[94849073, 94860073, 94634073, 94639073, 94876...",2,"[94154073, 94167073, 94669073, 94525073, 94480...",1
2,4,"[94953073, 95030073, 95023073, 95149073, 95151...","[94801073, 94860073, 94750073, 94876073, 94805...",0,"[94349073, 7557050, 7529050, 94338073, 9486807...",0
3,5,"[94482073, 94953073, 95149073, 94898073, 75970...","[94805073, 94634073, 94705073, 94779073, 94750...",0,"[94410073, 7538050, 94705073, 94062073, 936090...",0
4,6,"[94953073, 95030073, 95149073, 95076073, 95148...","[94849073, 94876073, 94805073, 94703073, 94750...",1,"[7553050, 94338073, 94668073, 94156073, 948480...",0


In [13]:
result.describe()

Unnamed: 0,user_id,diff_iir,diff_als
count,163.0,163.0,163.0
mean,131.349693,0.527607,0.251534
std,77.642146,1.112848,0.622059
min,2.0,0.0,0.0
25%,68.5,0.0,0.0
50%,129.0,0.0,0.0
75%,190.0,1.0,0.0
max,275.0,8.0,4.0


In [14]:
positive_result_iir_count = result[result['diff_iir'] > 0].shape[0]
positive_result_als_count = result[result['diff_als'] > 0].shape[0]
print("Number of users with at least one hit, IIR model: ", positive_result_iir_count)
print("Number of users with at least one hit, ALS model: ", positive_result_als_count)

Number of users with at least one hit, IIR model:  44
Number of users with at least one hit, ALS model:  30


In [15]:
map_iir = mean_average_precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
map_als = mean_average_precision_at_k(model_als, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir)
print("mean average precision at k for model ALS: ", map_als)

mean average precision at k for model ItemItemRecommender:  0.026485355648535568
mean average precision at k for model ALS:  0.006234309623430963


In [16]:
precision_iir = precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
precision_als = precision_at_k(model_als, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir)
print("precision at k for model ALS: ", precision_als)

precision at k for model ItemItemRecommender:  0.03682008368200837
precision at k for model ALS:  0.008368200836820083
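For reference, here is a sketch of how precision@k and average precision@k can be computed for a single user from a ranked list. These are the common textbook definitions, and the library's exact normalisation may differ. The function names and the toy ids are made up for illustration.

```python
def precision_at_k_simple(recommended, relevant, k=5):
    """Share of the top-k recommendations that the user actually viewed."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def average_precision_at_k_simple(recommended, relevant, k=5):
    """Average of precision values taken at each rank where a hit occurs."""
    hits = 0
    score = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    denom = min(k, len(relevant))
    return score / denom if denom else 0.0


recommended = [10, 20, 30, 40, 50]  # ranked recommendations
relevant = {20, 50, 60}             # items the user viewed in the test period

print(precision_at_k_simple(recommended, relevant))
print(average_precision_at_k_simple(recommended, relevant))
```

Averaging the second function over all users gives a mean average precision. Unlike plain precision, it rewards placing hits near the top of the list.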


On the current metrics, ItemItemRecommender outperforms ALS, although with more data and a deeper dive into the domain better results should be achievable. A hybrid form of collaborative filtering, combining content analysis with the nearest-neighbours method, could also improve the quality of the recommender system.

#### Trying to add content and view-count weighting to the model

In [17]:
df_users_content = pd.DataFrame(index=df_views['user_id'].unique())

def get_ids_from_data(tags, spheres):
    ids_from_data = set()
    all_contents = tags + spheres
    for content in all_contents:
        for i in content:
            ids_from_data.add(i.get('id'))
    return ids_from_data

def get_user_spheres_and_tags(user_id):
    user_news_ids = df_views[df_views.user_id == user_id]['news_id'].unique()
    user_tags = df_news[df_news['id'].isin(user_news_ids)].tags.sum()
    user_spheres = df_news[df_news['id'].isin(user_news_ids)].spheres.sum()
    user_spheres_tags = user_tags + user_spheres
    user_spheres_tags = [x.get('id') for x in user_spheres_tags]
    return np.unique(user_spheres_tags, return_counts=True)

content_ids = get_ids_from_data(df_news.tags.values, df_news.spheres.values)
for c_id in content_ids:
    df_users_content[c_id] = 0

for ui in df_views['user_id'].unique():
    if not ui:
        print(ui)
    user_st = get_user_spheres_and_tags(ui)
    for i in range(user_st[0].shape[0]):
        df_users_content.loc[ui, user_st[0][i]] = user_st[1][i]



In [18]:
df_news['spheres_tags'] = df_news['spheres'] + df_news['tags']
df_news['spheres_tags_ids'] = df_news['spheres_tags'].apply(lambda x: [i.get('id') for i in x])

def get_interes(user_id, news_id):
    news_tags_and_spheres = df_news[df_news['id'] == news_id]['spheres_tags_ids'].sum()
    interes = df_users_content.loc[user_id, news_tags_and_spheres].sum()
    return interes

df_news['spheres_tags_ids'] 

0                      [1299, 170217, 16324217, 40823217]
1       [2299, 231299, 255299, 10217, 462217, 4790217,...
2              [6299, 332217, 4318217, 6601217, 30170217]
3                     [4299, 4000217, 12252217, 40016217]
4       [3299, 292299, 242299, 248299, 49217, 1074217,...
                              ...                        
6549    [231299, 238299, 57217, 136217, 144217, 367217...
6550    [1299, 15299, 145217, 308217, 587217, 4019217,...
6551    [18299, 244299, 19217, 127217, 4019217, 658621...
6552    [4299, 18299, 231299, 244299, 28217, 127217, 2...
6553    [15299, 150217, 151217, 854217, 4019217, 43599...
Name: spheres_tags_ids, Length: 6554, dtype: object

In [19]:
%%time

def get_users_news_interes():
    df_users_news_interes = pd.DataFrame(
        index=df_views['user_id'].unique(), columns=df_news['id'].unique(), data=0
    )

    for index in df_users_news_interes.index:
        for column in df_users_news_interes.columns:
            df_users_news_interes.loc[index, column] = get_interes(index, column)
    df_users_news_interes.to_json(path_or_buf='/app/data/users_news_interes.json', orient="index")
    return df_users_news_interes

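try:
    df_users_news_interes = pd.read_json('/app/data/users_news_interes.json', orient="index")
except (FileNotFoundError, ValueError):
    # No usable cache on disk: compute the matrix and save it.
    print("cache file not found, recomputing")
    df_users_news_interes = get_users_news_interes()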
    
df_users_news_interes.head()

CPU times: user 2.55 s, sys: 108 ms, total: 2.65 s
Wall time: 2.64 s


Unnamed: 0,75178073,80375073,41116073,94978073,64742073,42454073,78167073,95199073,67109073,94753073,...,6959050,6151050,4782050,7564050,7418050,7163050,6965050,5484050,7239050,6751050
1,0,24,1,7,14,10,36,0,7,35,...,10,4,13,10,29,22,15,17,48,14
2,18,65,1,18,37,15,108,3,6,57,...,13,28,17,14,67,69,41,22,93,16
3,23,141,4,34,66,46,188,5,26,150,...,101,67,64,78,163,126,97,98,233,69
4,30,98,5,23,38,34,106,23,16,120,...,53,48,31,47,96,79,74,59,155,37
5,84,63,16,22,68,54,113,14,22,225,...,123,118,66,89,105,68,182,96,174,70
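The interest score of a user for a news item is the sum of that user's view counts over the item's tags and spheres. The nested loop in get_users_news_interes computes it cell by cell. A hedged sketch of the same quantity as a single matrix product, on toy data and assuming each tag id occurs at most once per news item. The names `user_tag_counts`, `news_tags` and `news_tag_indicator` are hypothetical.

```python
import pandas as pd

# Toy users x tags view counts (plays the role of df_users_content).
user_tag_counts = pd.DataFrame(
    [[3, 0, 1], [0, 2, 2]],
    index=[1, 2],
    columns=[1299, 170217, 4019217],
)

# Toy mapping from news id to its tag/sphere ids.
news_tags = {101: [1299, 4019217], 102: [170217], 103: []}

# News x tags indicator matrix: 1 where the news item carries the tag.
news_tag_indicator = pd.DataFrame(
    0, index=list(news_tags), columns=user_tag_counts.columns
)
for news_id, tag_ids in news_tags.items():
    if tag_ids:
        news_tag_indicator.loc[news_id, tag_ids] = 1

# Users x news interest scores in one product.
interest = user_tag_counts @ news_tag_indicator.T
print(interest)
```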


In [20]:
def get_top_news_for_user(user_id, n=10):
    result = df_users_news_interes.loc[user_id].sort_values(ascending=False)[:n]
    return result

def get_top_new_news_for_user(user_id, n=10):
    new_news_ids = df_news[df_news['unique_views'] == 0]['id'].values
    result = df_users_news_interes.loc[user_id, new_news_ids].sort_values(ascending=False)[:n]
    return result 
print('top_news_for_user:\n', get_top_news_for_user(2), '\n')
print('top_new_news_for_user\n', get_top_new_news_for_user(2))

top_news_for_user:
 71394073    140
7465050     139
83304073    137
93999073    135
75988073    131
81175073    128
94418073    128
92502073    125
86064073    123
89100073    121
Name: 2, dtype: int64 

top_new_news_for_user
 75988073    131
81175073    128
7457050     116
4658050     113
86448073    112
86871073    108
7117050     106
86863073    103
78602073    103
7225050     103
Name: 2, dtype: int64


In [53]:
df_news[df_news['unique_views'] == 0]['id'].values.shape

(745,)

In [21]:
ui_av_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='age_view_param', fill_value=0
)

ui_av_test = ui_av_test.astype(float)
ui_av = ui_av_test.copy(deep=True)

for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    ui_av.loc[user_id, news_id] = 0

sparse_ui_av = csr_matrix(ui_av).T
sparse_ui_av_test = csr_matrix(ui_av_test).T

print("Train matrix shape: ", sparse_ui_av.shape)
print("Test matrix shape: ", sparse_ui_av_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)


In [22]:
final_df['interes'] = final_df.apply(lambda x: df_users_news_interes.loc[x.user_id, x.news_id], axis='columns')
final_df['content_param'] = final_df['interes'] / final_df['title_age']

uc_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='content_param', fill_value=0
)

uc_test = uc_test.astype(float) 
uc_train = uc_test.copy(deep=True)


for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    uc_train.loc[user_id, news_id] = 0
    

sparse_uc = csr_matrix(uc_train).T
sparse_uc_test = csr_matrix(uc_test).T


print("Train matrix shape: ", sparse_uc.shape)
print("Test matrix shape: ", sparse_uc_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)


In [40]:
model_iir_v = ItemItemRecommender(K=11) 
model_iir_v.fit(sparse_ui_av, show_progress=True)

print(recomend_test_user(1, model_iir_v, data=sparse_ui_av, n=20))

map_iir_v = mean_average_precision_at_k(model_iir_v, sparse_ui_av.T.tocsr(), sparse_ui_av_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_v)
precision_iir_v = precision_at_k(model_iir_v, sparse_ui_av.T.tocsr(), sparse_ui_av_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_v)

model_iir_v.recommend(0, sparse_ui_av.T.tocsr(), N=20)

[7575050, 94634073, 94702073, 94849073, 94705073, 7571050, 94703073, 94417073, 94639073, 94419073, 7579050, 94860073, 7552050, 94750073, 94638073, 94801073, 94582073, 7557050, 94061073, 94479073]


mean average precision at k for model ItemItemRecommender:  0.02266387726638773


precision at k for model ItemItemRecommender:  0.03096234309623431


[(694, 9410.217634464865),
 (5245, 9125.849900290877),
 (5303, 9117.15141408889),
 (5423, 7631.985221926664),
 (5305, 4361.470964917585),
 (690, 4272.149704157203),
 (5304, 3962.42206733907),
 (5049, 3221.0475687397625),
 (5250, 2130.2686083927574),
 (5051, 1976.9521081620874),
 (698, 1842.2418985308987),
 (5431, 1818.5674529272746),
 (671, 1654.7104087329506),
 (5340, 1503.8984204416422),
 (5249, 1381.13525390625),
 (5384, 742.0282212501137),
 (5197, 668.9013157894735),
 (676, 630.2704502553082),
 (4723, 535.1800001844836),
 (5102, 519.7964267162006)]

In [92]:
for ui in df_views['user_id'].unique():
    rec = recomend_test_user(ui, model_iir_v, n=5, data=sparse_ui_av)
    print(ui, rec)

1 [7575050, 94634073, 94702073, 94849073, 94705073]
2 [7575050, 94849073, 94702073, 94634073, 94417073]
3 [94849073, 7575050, 94634073, 94702073, 94417073]
4 [94417073, 94860073, 7572050, 94801073, 94750073]
5 [94634073, 94705073, 7574050, 94647073, 94419073]
6 [94849073, 7574050, 7571050, 94750073, 94703073]
7 [7575050, 94702073, 94705073, 7574050, 94639073]
8 [7575050, 94849073, 7557050, 94417073, 94702073]
9 [7575050, 94702073, 94849073, 94634073, 94647073]
10 [7575050, 7552050, 94849073, 94702073, 94417073]
11 [94849073, 7575050, 94702073, 94417073, 94634073]
13 [94634073, 94417073, 94647073, 94705073, 94639073]
14 [7575050, 94849073, 94702073, 94634073, 7574050]
16 [7575050, 94849073, 94702073, 94634073, 94705073]
17 [94702073, 7575050, 94849073, 7574050, 94639073]
18 [94634073, 7575050, 94702073, 94849073, 7574050]
19 [7575050, 94849073, 94634073, 94702073, 7574050]
20 [7575050, 94702073, 7574050, 94860073, 94419073]
21 [7575050, 94702073, 94849073, 94634073, 94647073]
22 [757505

In [56]:
model_iir_c = ItemItemRecommender(K=5) 
model_iir_c.fit(sparse_uc, show_progress=True)

print(recomend_test_user(1, model_iir_c, data=sparse_uc, n=20))

map_iir_c = mean_average_precision_at_k(model_iir_c, sparse_uc.T.tocsr(), sparse_uc_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_c)
precision_iir_c = precision_at_k(model_iir_c, sparse_uc.T.tocsr(), sparse_uc_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_c)

model_iir_c.recommend(0, sparse_uc.T.tocsr(), N=20)

[7579050, 7575050, 94913073, 94849073, 94717073, 94838073, 94757073, 94681073, 94897073, 94634073, 94703073, 94418073, 94893073, 94906073, 94720073, 94661073, 94871073, 94840073, 94754073, 94801073]


mean average precision at k for model ItemItemRecommender:  0.025118549511854944


precision at k for model ItemItemRecommender:  0.04435146443514645


[(698, 85005.71438139578),
 (694, 70470.90235620523),
 (5475, 44530.109676828986),
 (5423, 44060.39308086562),
 (5313, 37965.79738461663),
 (5417, 14761.012898642362),
 (5347, 11759.38101438492),
 (5286, 10906.20219275108),
 (5463, 10236.542533282078),
 (5245, 6055.739320036591),
 (5304, 3840.249604807629),
 (5050, 3187.8465710388964),
 (5459, 1566.5654552883666),
 (5471, 1529.4385650737704),
 (5315, 979.2873886772309),
 (5270, 929.1277032852797),
 (5441, 863.3483885751091),
 (5418, 827.8866279824073),
 (5344, 803.2625024254885),
 (5384, 559.7477104154333)]

In [115]:
model_iir_ct = ItemItemRecommender(K=200) 
model_iir_ct.fit(sparse_uc, show_progress=True)

for ui in df_views['user_id'].unique():
    rec = recomend_test_user(ui, model_iir_ct, n=5, data=sparse_uc)
    print(ui, rec)

1 [7579050, 7575050, 94913073, 94849073, 94717073]
2 [7579050, 7575050, 94717073, 94913073, 94849073]
3 [7579050, 7575050, 94849073, 94913073, 94717073]
4 [7579050, 7572050, 94681073, 94838073, 94567073]
5 [7579050, 94897073, 94913073, 94634073, 94449073]
6 [7579050, 94849073, 94897073, 94913073, 94449073]
7 [7579050, 7575050, 94913073, 94897073, 94717073]
8 [7579050, 7575050, 94717073, 94913073, 94849073]
9 [7579050, 7575050, 94849073, 94703073, 94567073]
10 [7575050, 7579050, 94849073, 7571050, 7572050]
11 [7579050, 7575050, 94849073, 94913073, 94717073]
13 [94897073, 94681073, 94733073, 94449073, 94679073]
14 [7575050, 94913073, 94717073, 94849073, 7572050]
16 [7575050, 7579050, 94849073, 94913073, 94717073]
17 [7575050, 94913073, 94849073, 94717073, 94897073]
18 [7575050, 7579050, 94849073, 94913073, 94681073]
19 [7575050, 7579050, 94717073, 7572050, 94849073]
20 [7579050, 7575050, 94913073, 94681073, 94717073]
21 [7575050, 94913073, 94849073, 94717073, 7572050]
22 [7579050, 757505

In [69]:
result['user_view'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_v, data=sparse_ui_av), axis='columns')
result['user_content'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_c, data=sparse_uc), axis='columns')

result['diff_iir_v'] = result.apply(lambda x: len(set(x["history"]) & set(x["user_view"])), axis='columns')
result['diff_iir_c'] = result.apply(lambda x: len(set(x["history"]) & set(x["user_content"])), axis='columns')


positive_result_iir_v_count = result[result['diff_iir_v'] > 0].shape[0]
positive_result_iir_c_count = result[result['diff_iir_c'] > 0].shape[0]
print("Number of users with at least one hit, view model: ", positive_result_iir_v_count)
print("Number of users with at least one hit, content model: ", positive_result_iir_c_count)

Number of users with at least one hit, view model:  48
Number of users with at least one hit, content model:  51


In [103]:
result.head()

Unnamed: 0,user_id,history,itemitem,diff_iir,als,diff_als,user_view,user_content,diff_iir_v,diff_iir_c
0,2,"[94339073, 94351073]","[7575050, 94849073, 94702073, 94634073, 948010...",0,"[94572073, 94722073, 94765073, 94639073, 94753...",0,"[7575050, 94849073, 94702073, 94634073, 944170...","[7579050, 7575050, 94717073, 94913073, 9484907...",0,0
1,3,"[94006073, 94108073, 94642073, 94860073, 75790...","[94849073, 94634073, 94860073, 94639073, 94702...",2,"[94530073, 94639073, 94132073, 94871073, 75360...",2,"[94849073, 7575050, 94634073, 94702073, 944170...","[7579050, 7575050, 94849073, 94913073, 9471707...",3,1
2,4,"[94953073, 95030073, 95023073, 95149073, 95151...","[94860073, 94801073, 94750073, 94876073, 94805...",0,"[7539050, 94620073, 94847073, 7557050, 9422207...",0,"[94417073, 94860073, 7572050, 94801073, 947500...","[7579050, 7572050, 94681073, 94838073, 9456707...",0,0
3,5,"[94482073, 94953073, 95149073, 94898073, 75970...","[94705073, 94634073, 94805073, 94779073, 94750...",1,"[7569050, 94805073, 94261073, 94108073, 756305...",0,"[94634073, 94705073, 7574050, 94647073, 944190...","[7579050, 94897073, 94913073, 94634073, 944490...",1,0
4,6,"[94953073, 95030073, 95149073, 95076073, 95148...","[94849073, 94750073, 94876073, 94805073, 94703...",1,"[7539050, 7532050, 93893073, 94219073, 9415207...",0,"[94849073, 7574050, 7571050, 94750073, 9470307...","[7579050, 94849073, 94897073, 94913073, 944490...",1,1


In [70]:
result.describe()

Unnamed: 0,user_id,diff_iir,diff_als,diff_iir_ci,diff_iir_v,diff_iir_c
count,163.0,163.0,163.0,163.0,163.0,163.0
mean,131.349693,0.527607,0.251534,0.368098,0.558282,0.496933
std,77.642146,1.112848,0.622059,0.874587,1.128192,0.983658
min,2.0,0.0,0.0,0.0,0.0,0.0
25%,68.5,0.0,0.0,0.0,0.0,0.0
50%,129.0,0.0,0.0,0.0,0.0,0.0
75%,190.0,1.0,0.0,0.0,1.0,1.0
max,275.0,8.0,4.0,7.0,8.0,6.0


In [101]:
def get_similar_news(news_id, model, count=10):
    item_id = news_ids.index(news_id)
    result = model.similar_items(item_id, count)
    news_ids_result = [news_ids[x[0]] for x in result]
    news_ids_result.append(news_id)
    df_result = df_news[df_news['id'].isin(news_ids_result)]
    return df_result[['id', 'title', 'unique_views', 'published_at']]

get_similar_news(80375073, model_iir_c)

Unnamed: 0,id,title,unique_views,published_at
1,80375073,Для пассажиров закрытого участка Арбатско-Покр...,1,2020-09-26 09:04:00
96,94349073,Сказки в стиле Пикассо и путевые рисунки: каки...,23,2021-08-08 09:03:00
301,94414073,С ракеткой и мячом на свежем воздухе: в каких ...,46,2021-08-11 07:01:00
1406,94419073,Лучшие в 2020-м. Рассказываем о победителях ко...,62,2021-08-10 07:01:03
3073,94317073,Тоннели между станциями «Лианозово» и «Физтех»...,26,2021-08-08 09:05:00
3435,94417073,В Москве начала работать онлайн-платформа «Кар...,66,2021-08-10 07:03:00
3641,94234073,Открытие трех станций БКЛ улучшит транспортную...,48,2021-08-05 07:01:03
4036,94415073,От ИТ-технологий до флористики: в «Технограде»...,50,2021-08-10 09:01:00
5167,94576073,Москва представит стенд Created in Moscow на М...,9,2021-08-13 08:36:06
5500,94422073,На нескольких улицах в трех районах Москвы вре...,22,2021-08-10 09:04:00


In [57]:
final_df['uc_param'] = final_df['interes'] * final_df['unique_views'] / final_df['title_age']

uci_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='uc_param', fill_value=0
)

uci_test = uci_test.astype(float) 
uci_train = uci_test.copy(deep=True)


# Hold out the test interactions: zero them in the train matrix only
for index, row in data_test.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    uci_train.loc[user_id, news_id] = 0
    

sparse_uci = csr_matrix(uci_train).T
sparse_uci_test = csr_matrix(uci_test).T


print("Train matrix shape: ", sparse_uci.shape)
print("Test matrix shape: ", sparse_uci_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)


In [74]:
model_iir_ci = ItemItemRecommender(K=10) 
model_iir_ci.fit(sparse_uci, show_progress=True)

print(recomend_test_user(1, model_iir_ci, data=sparse_uci, n=20))

map_iir_ci = mean_average_precision_at_k(model_iir_ci, sparse_uci.T.tocsr(), sparse_uci_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_ci)
precision_iir_ci = precision_at_k(model_iir_ci, sparse_uci.T.tocsr(), sparse_uci_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_ci)

model_iir_ci.recommend(0, sparse_uci.T.tocsr(), N=20)

[7575050, 94849073, 7579050, 94419073, 94634073, 7557050, 94703073, 94417073, 94681073, 7571050, 94717073, 94582073, 7562050, 7536050, 94415073, 94801073, 94702073, 7553050, 94232073, 94754073]


mean average precision at k for model ItemItemRecommender:  0.02482566248256625


precision at k for model ItemItemRecommender:  0.0401673640167364


[(694, 9132623834.216827),
 (5423, 7669176610.834228),
 (698, 6761830807.079614),
 (5051, 5575051443.234564),
 (5245, 5555506873.610384),
 (676, 5376835363.409877),
 (5304, 3180672912.672043),
 (5049, 2514723337.973042),
 (5286, 2316507363.149209),
 (690, 1745569223.996699),
 (5313, 621690289.3605655),
 (5197, 484780961.74371475),
 (681, 428654957.92143273),
 (657, 201739642.51408753),
 (5047, 149185723.02005893),
 (5384, 112946084.12294464),
 (5303, 20789955.055093143),
 (672, 12165675.362053677),
 (4875, 6714467.022636193),
 (5344, 6553611.923076923)]
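For reference, the precision at k figure printed above can be read as the share of a user's top-k recommendations that appear among the held-out views, averaged over users. The sketch below is a plain-Python illustration of that textbook definition, not the code of the `implicit` library; the library's own `precision_at_k` may normalize differently, so its numbers need not match this sketch exactly. The function names and the toy ids are hypothetical.

```python
def precision_at_k_sketch(recommended, relevant, k=5):
    """Share of the top-k recommended items that are in the held-out set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)


def mean_precision_at_k(recs_by_user, relevant_by_user, k=5):
    """Average the per-user precision over all users that received recommendations."""
    scores = [
        precision_at_k_sketch(recs, relevant_by_user.get(user, set()), k)
        for user, recs in recs_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical ids): user 1 viewed two of the five recommended items,
# user 2 viewed none of them.
recs_by_user = {
    1: [101, 102, 103, 104, 105],
    2: [201, 202, 203, 204, 205],
}
relevant_by_user = {1: {102, 105, 999}, 2: set()}

user_1_score = precision_at_k_sketch(recs_by_user[1], relevant_by_user[1])
overall = mean_precision_at_k(recs_by_user, relevant_by_user)
```

With most users having no hit at all in the top 5, an average of only a few percent is what this definition would be expected to give.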

In [75]:
result['view_content'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_ci, data=sparse_uci), axis='columns')
result['diff_iir_ci'] = result.apply(lambda x: len(set(x["history"]) & set(x["view_content"])), axis='columns')

positive_result_iir_ci_count = result[result['diff_iir_ci'] > 0].shape[0]
print("Number of users with hits for the content model: ", positive_result_iir_ci_count)

Number of users with hits for the content model:  45
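The hit count in this cell is a set intersection per user followed by a count of users with a non-empty intersection. A minimal plain-Python sketch of the same idea, assuming `history` holds the views held out for testing and `view_content` holds the recommended ids (the rows and the helper name `count_hits` are made up for illustration):

```python
def count_hits(history, recommended):
    """Number of recommended items that the user actually viewed."""
    return len(set(history) & set(recommended))


# Hypothetical rows mirroring the columns of the `result` frame
rows = [
    {"user_id": 2, "history": [11, 12, 13], "view_content": [13, 14, 15, 16, 11]},
    {"user_id": 5, "history": [21], "view_content": [31, 32, 33, 34, 35]},
    {"user_id": 9, "history": [41, 42], "view_content": [42, 51, 52, 53, 54]},
]

hits = [count_hits(row["history"], row["view_content"]) for row in rows]

# Users with at least one hit, the quantity printed by the cell above
users_with_hits = sum(1 for h in hits if h > 0)
```

Counting users rather than total hits keeps a single very active user from dominating the comparison between models.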


In [76]:
result.describe()

Unnamed: 0,user_id,diff_iir,diff_als,diff_iir_ci,diff_iir_v,diff_iir_c
count,163.0,163.0,163.0,163.0,163.0,163.0
mean,131.349693,0.527607,0.251534,0.478528,0.558282,0.496933
std,77.642146,1.112848,0.622059,0.970792,1.128192,0.983658
min,2.0,0.0,0.0,0.0,0.0,0.0
25%,68.5,0.0,0.0,0.0,0.0,0.0
50%,129.0,0.0,0.0,0.0,0.0,0.0
75%,190.0,1.0,0.0,1.0,1.0,1.0
max,275.0,8.0,4.0,6.0,8.0,6.0


In [80]:
import csv
model_prod = ItemItemRecommender(K=10) 

model_prod.fit(sparse_user_item_test, show_progress=True)

model_prod.recommend(0, sparse_user_item_test.T.tocsr(), N=20)

# newline='' prevents the csv module from emitting blank rows on Windows
with open('result.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=';')
    writer.writerow(['user_id', 'news_id_1', 'news_id_2', 'news_id_3', 'news_id_4', 'news_id_5'])
    for user in df_views.user_id.unique():
        writer.writerow([user, *recomend_test_user(user, model_prod, data=sparse_user_item_test, n=5)])

## Description of the news auto-tagging algorithm

The tagging algorithm is based on the [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) measure. It selects the most significant words by comparing how often each word occurs in a document with how often it occurs across the full corpus. The full corpus is built from all news items, their tags, and their spheres, excluding stop words. The algorithm returns a set of tags and spheres for the news text passed to it.
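The TF-IDF weighting described above can be sketched in a few lines of plain Python. This is an illustration of the general technique on a toy corpus, not the code from the repository: the names `compute_tf_sketch`, `compute_idf_sketch` and `top_keywords` are hypothetical, and the documents are assumed to be already tokenized, normalized and cleaned of stop words.

```python
import math
from collections import Counter


def compute_tf_sketch(tokens):
    """Term frequency: the share of each token within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def compute_idf_sketch(corpus):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / freq) for word, freq in doc_freq.items()}


def top_keywords(tokens, idf, k=3):
    """Return the k tokens of a document with the highest TF-IDF weight."""
    tf = compute_tf_sketch(tokens)
    scores = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    # Sort by descending score; ties are broken alphabetically for determinism
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus (made-up tokens)
corpus = [
    ["metro", "station", "opening", "moscow"],
    ["park", "festival", "moscow"],
    ["metro", "tunnel", "moscow"],
]
idf = compute_idf_sketch(corpus)
keywords = top_keywords(corpus[0], idf, k=2)
```

A word that occurs in every document, such as "moscow" here, gets an IDF of zero and therefore never becomes a keyword, which is the behavior the measure is chosen for.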

[Text-processing functions](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/text_manipulation.py):
- get_text_on_pattern_replacement_func - strips HTML tags
- get_lst_of_normalized_tokens_without_stopwords - normalizes words and removes stop words

[Functions for building the corpus of all available materials, tags, and spheres](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/work_with_files.py#L164)

[Main functions of the algorithm](https://github.com/mandrianova/mos-news/blob/master/auto_markup/model.py):
- get_result_tag_and_spheres_for_title_preview_fulltext - returns the resulting tags and spheres
- compute_idf - computes IDF
- compute_tf - computes TF
- get_named_objects_without_stopwords - extracts named entities (addresses, titles, personal names, organizations) to enrich the results


Technologies used:
- [nltk](https://github.com/nltk/nltk "NLTK, the Natural Language Toolkit: a suite of text-processing tools")
- [pymorphy2](https://github.com/kmike/pymorphy2/blob/92d546f042ff14601376d3646242908d5ab786c1/docs/index.rst "pymorphy2 morphological analyzer: reduces words to their normal form, among other things")
- [natasha](https://github.com/natasha/natasha "a library for processing Russian-language text")