# Supporting documentation for the task

## Analysis of the provided dataset

1. Error detection. Malformed view URLs found in dataset_news_1.xlsx:
```
mos.ru/news/item/89421073/ /
mos.ru/news/item/9468/
mos.ru/news/item/94670073/ /
mos.ru/news/item/94501073/душ/
mos.ru/news/item/89957073/ Их/
mos.ru/news/item/94852073/%5c/
mos.ru/news/item/94479073/ (https:/app.aif.ru/owa/redir.aspx/
mos.ru/news/item/94792073/ /
mos.ru/news/item/94897073/+/
mos.ru/news/item/94953073/ /
mos.ru/news/item/91919073/-/
```
2. Overview of the data:
- 239 users, 5812 news items, 26 446 views
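
The malformed URLs listed above could be flagged automatically. The sketch below is only an illustration: the function name `is_valid_view_url` is hypothetical, and the rule that a valid id ends in `073` or `050` is an assumption inferred from the ids that appear in the tables of this document, not a documented format.

```python
import re

# Assumption: a well-formed view URL is "mos.ru/news/item/<digits>/" and the id
# ends in 073 or 050, as the ids shown elsewhere in this document do.
VALID_URL = re.compile(r"mos\.ru/news/item/\d+(?:073|050)/")


def is_valid_view_url(url: str) -> bool:
    """Return True when the whole URL matches the assumed pattern."""
    return VALID_URL.fullmatch(url) is not None


suspicious = [
    "mos.ru/news/item/9468/",
    "mos.ru/news/item/94501073/душ/",
    "mos.ru/news/item/94897073/+/",
]
flagged = [u for u in suspicious if not is_valid_view_url(u)]
```

A whitelist pattern like this catches both trailing garbage and truncated ids, at the cost of rejecting any legitimate id type that was not seen in the sample.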

## Tested hypotheses and how the recommender solution works

In [10]:
import pandas as pd
import numpy as np
import json
import datetime
from scipy.sparse import csr_matrix
from implicit.nearest_neighbours import ItemItemRecommender
from implicit.als import AlternatingLeastSquares
from implicit.evaluation import precision_at_k, mean_average_precision_at_k

Load the data.

In [11]:
def get_news_id_from_url(url: str) -> int:
    """
    Extract the news id from a view URL; return 0 if no id can be recovered.
    """
    parts = url.split('/')
    try:
        return int(parts[-2])
    except (ValueError, IndexError):
        for part in parts:
            # Empirically, only URLs with ids of the 073 type are broken,
            # so we look for the part that contains '073' and use it as the id.
            if '073' in part:
                return int(part)
        return 0

    
df_views = pd.read_excel('/app/data/dataset_news_1.xlsx')
df_news = pd.read_json('/app/data/news.json', encoding="utf_8_sig")
df_views['news_id'] = df_views['url_clean'].apply(get_news_id_from_url)
df_news['unique_views'] = df_news['id'].apply(lambda x: df_views[df_views.news_id == x].user_id.nunique())
merged = df_views.merge(df_news, left_on='news_id', right_on='id')

In [12]:
final_df = merged.drop(['importance', 'is_deferred_publication', 'status', 'ya_rss', 'active_from',
                       'active_to', 'search', 'display_image', 'icon_id', 'canonical_url', 'canonical_updated_at',
                       'is_powered', 'has_image', 'attach', 'active_from_timestamp', 'active_to_timestamp',
                       'image', 'counter', 'preview_text', 'images'],
                      axis=1)
final_df['title_age'] = (final_df['published_at'].max() - final_df['published_at']).dt.days + 1
final_df['age_param'] = 1 / final_df['title_age']
final_df['age_view_param'] = final_df['unique_views'] / final_df['title_age']

users, items, interactions = final_df.user_id.nunique(), final_df.id.nunique(), final_df.shape[0]
print('# users: ', users)
print('# items: ', items)
print('# interactions: ', interactions)
final_df = final_df[['user_id', 'news_id', 'date_time', 'age_param', 'unique_views', 'age_view_param', 'title_age']]
final_df.head()

# users:  239
# items:  5809
# interactions:  26442


index,user_id,news_id,date_time,age_param,unique_views,age_view_param,title_age
0,1,94006073,2021-08-01 18:51:19,0.032258,38,1.225806,31
1,2,94006073,2021-08-04 13:08:19,0.032258,38,1.225806,31
2,3,94006073,2021-08-29 12:40:07,0.032258,38,1.225806,31
3,6,94006073,2021-08-02 09:04:55,0.032258,38,1.225806,31
4,11,94006073,2021-08-02 17:16:23,0.032258,38,1.225806,31


In [13]:
final_df = final_df.sort_values(by=['user_id', 'date_time'], ascending=[True, False])
# Hold out the 20 most recent views of every user.
# DataFrame.append was removed in pandas 2.0, so groupby().head() is used instead.
data_test_2 = (
    final_df.groupby('user_id')
    .head(20)[['user_id', 'news_id', 'date_time']]
    .reset_index(drop=True)
)
data_test_2

index,user_id,news_id,date_time
0,1,7574050,2021-08-17 18:22:01
1,1,94605073,2021-08-17 18:21:35
2,1,94701073,2021-08-17 18:21:31
3,1,7573050,2021-08-17 18:19:32
4,1,94679073,2021-08-17 18:18:24
...,...,...,...
4775,278,94275073,2021-08-07 07:40:51
4776,278,94293073,2021-08-07 07:40:47
4777,278,94292073,2021-08-07 07:40:44
4778,278,94286073,2021-08-07 07:40:39


age_param is inversely proportional to the number of days since the news item was published. We use this value to gauge how current a news item is at the time of viewing.
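
A small worked example of the `age_param` calculation, using the same formula as the cell that builds `final_df`. The two news items and their dates are made up for illustration.

```python
import pandas as pd

# Two hypothetical news items: the newest one and one published 30 days earlier.
news = pd.DataFrame({
    "id": [1, 2],
    "published_at": pd.to_datetime(["2021-08-31", "2021-08-01"]),
})

# Age in days relative to the newest publication, starting from 1 to avoid division by zero.
news["title_age"] = (news["published_at"].max() - news["published_at"]).dt.days + 1
news["age_param"] = 1 / news["title_age"]
```

The newest item gets `age_param = 1.0`, and the item that is 31 days "old" gets about 0.032258, which is the value shown for `title_age = 31` in the table above.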

Split the data into train and test sets. Train takes the first three weeks of August by view date, the rest goes to test.

In [84]:
test_size_days = 20

data_train = final_df[final_df['date_time'].dt.day < final_df['date_time'].dt.day.min() + test_size_days]
data_test = final_df[final_df['date_time'].dt.day >= final_df['date_time'].dt.day.min() + test_size_days]
print("Number of views in train: ", data_train.shape[0])
print("Number of views in test: ", data_test.shape[0])

Number of views in train:  20483
Number of views in test:  5959
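
The split above compares day-of-month values, which is only safe while all views fall within a single month. A minimal sketch of the same split with an explicit cutoff timestamp is shown below; `split_by_cutoff` and the sample rows are hypothetical.

```python
import pandas as pd


def split_by_cutoff(views, train_days, time_col="date_time"):
    """Split views into train/test around a cutoff `train_days` after the first day."""
    start = views[time_col].min().normalize()  # midnight of the first day with views
    cutoff = start + pd.Timedelta(days=train_days)
    train = views[views[time_col] < cutoff]
    test = views[views[time_col] >= cutoff]
    return train, test


views = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "date_time": pd.to_datetime([
        "2021-08-01 18:51:19", "2021-08-20 09:00:00",
        "2021-08-21 00:00:00", "2021-09-02 12:00:00",
    ]),
})
train, test = split_by_cutoff(views, train_days=20)
```

Within one month this gives the same boundary as the day-of-month comparison, and it keeps working if the log later spans several months.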


Prepare the resulting dataset used to check the recommendations.

In [14]:
result = data_test_2.groupby('user_id')['news_id'].unique().reset_index()
result.columns = ['user_id', 'history']
result['history'] = result['history'].apply(lambda x: list(x))
result.head(5)

index,user_id,history
0,1,"[7574050, 94605073, 94701073, 7573050, 9467907..."
1,2,"[94351073, 94339073, 94860073, 89645073, 91643..."
2,3,"[95258073, 7612050, 94190073, 95266073, 952890..."
3,4,"[95239073, 95280073, 95279073, 95160073, 95141..."
4,5,"[95239073, 7608050, 91456073, 85817073, 950110..."


Prepare the matrices for training and testing the model. user_item_matrix_test contains all of the source data and is used to evaluate the model. user_item_matrix is the training matrix and contains only the data used for fitting: the held-out views from data_test_2 are set to 0. Both matrices have one row per unique user and one column per unique news item. Each cell holds age_param (how current the news item is at the time recommendations are generated). Where a user has no view of a news item, the value is 0.

In [15]:
user_item_matrix_test = pd.pivot_table(final_df, 
                                  index='user_id', columns='news_id', 
                                  values='age_param', 
                                  fill_value=0                                       
                                 )
user_item_matrix = user_item_matrix_test.copy(deep=True)
for index, row in data_test_2.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    user_item_matrix.loc[user_id, news_id] = 0
    

user_item_matrix = user_item_matrix.astype(float) 

sparse_user_item = csr_matrix(user_item_matrix).T
sparse_user_item_test = csr_matrix(user_item_matrix_test).T

print("Train matrix shape: ", sparse_user_item.shape)
print("Test matrix shape: ", sparse_user_item_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)
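
The same construction on a toy dataset makes the masking step easier to follow. This is a sketch with made-up views; `held_out` stands in for `data_test_2`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

views = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "news_id": [10, 20, 10, 30],
    "age_param": [0.5, 1.0, 0.5, 0.25],
})

# Full matrix: users in rows, news items in columns, 0 where there was no view.
full = pd.pivot_table(
    views, index="user_id", columns="news_id", values="age_param", fill_value=0
).astype(float)

# Training matrix: a copy with the held-out (user, news) pairs zeroed out.
train = full.copy(deep=True)
held_out = [(1, 20)]
for user_id, news_id in held_out:
    train.loc[user_id, news_id] = 0.0

# Transposed to item x user, the layout used for fitting in this notebook.
sparse_train = csr_matrix(train.values).T
sparse_full = csr_matrix(full.values).T
```

Both matrices keep the same shape; only the number of stored non-zero entries differs, which is why the train and test shapes printed above are identical.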


In [7]:
user_item_matrix.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,0.0,1e-06,1e-06,0.0,2e-06,0.0,2e-06,0.0,2e-06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
std,1.8e-05,0.0,2.2e-05,2.2e-05,0.0,2.5e-05,0.0,2.5e-05,0.0,2.6e-05,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.000274,0.0,0.000338,0.00034,0.0,0.000388,0.0,0.000393,0.0,0.000408,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
user_item_matrix_test.describe()

news_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
count,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,...,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0,239.0
mean,1e-06,1e-06,1e-06,1e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,...,7.7e-05,7.7e-05,0.000155,7.7e-05,7.7e-05,0.000155,7.7e-05,7.7e-05,7.7e-05,7.7e-05
std,1.7e-05,2.1e-05,2.1e-05,2.2e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.5e-05,2.6e-05,...,0.001198,0.001198,0.00169,0.001198,0.001198,0.00169,0.001198,0.001198,0.001198,0.001198
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,0.00027,0.00033,0.000332,0.000334,0.000382,0.000381,0.000385,0.000385,0.00039,0.0004,...,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519


A function for getting recommendations from a trained model.

In [16]:
user_ids = list(user_item_matrix_test.index.values)
news_ids = list(user_item_matrix_test.columns.values)
def recomend_test_user(user_id, model, data=sparse_user_item, n=20):
    user_index = user_ids.index(user_id)
    recommendations = model.recommend(user_index, data.T.tocsr(), N=n)
    result = [news_ids[x[0]] for x in recommendations]
    return result
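
The key step in the function above is translating matrix positions back into news ids. A library-free sketch of that mapping, assuming the model returns `(item_index, score)` pairs as in the outputs shown in this notebook; `indices_to_news_ids` and the sample values are hypothetical.

```python
def indices_to_news_ids(recommendations, news_ids):
    """Map (item_index, score) pairs to news ids, preserving the ranking order."""
    return [news_ids[item_index] for item_index, _score in recommendations]


# Column labels of a toy user-item matrix and a toy model output.
toy_news_ids = [101, 102, 103, 104]
toy_recommendations = [(3, 0.9), (0, 0.4), (2, 0.1)]
toy_result = indices_to_news_ids(toy_recommendations, toy_news_ids)
```

The mapping is only valid while `news_ids` is taken from the same matrix, in the same column order, that the model was fitted on.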

### Training the ItemItemRecommender model
An algorithm based on the nearest-neighbours method.

[ItemItemRecommender model](https://github.com/benfred/implicit/blob/main/implicit/nearest_neighbours.py#L12)
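
To illustrate the general idea behind an item-item nearest-neighbours recommender, here is a simplified NumPy sketch. It is not the library's implementation: the function `item_item_scores` is hypothetical, similarity is a plain dot product, and everything is dense.

```python
import numpy as np


def item_item_scores(user_item, k):
    """Score every item for every user with a top-k item-item similarity model."""
    user_item = np.asarray(user_item, dtype=float)
    # Item-item similarity as the dot product of item columns.
    sim = user_item.T @ user_item
    # Keep only the k largest similarities in each row (the item itself included).
    pruned = np.zeros_like(sim)
    for i, row in enumerate(sim):
        top = np.argsort(-row, kind="stable")[:k]
        pruned[i, top] = row[top]
    # A user's score for an item is the weighted sum over the items they viewed.
    return user_item @ pruned


# 3 users x 3 items; user 2 has only viewed item 0.
toy = [[1, 1, 0],
       [0, 1, 1],
       [1, 0, 0]]
scores = item_item_scores(toy, k=2)
```

The parameter `k` plays the role of `K` in the cells below: a small value keeps only the strongest neighbours of each item, a large value lets weak co-occurrences contribute to the score.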

In [22]:
%%time

def test_param_model(model, data_train, data_test, k=1):
    model = ItemItemRecommender(K=k) 
    model.fit(data_train, show_progress=False)
    map_k = mean_average_precision_at_k(model, data_train.T.tocsr(), data_test.T.tocsr(), K=10, show_progress=False)
    return map_k


def get_the_best_k(model, data_train, data_test):
    d = []
    for i in range(4, 300):
        d.append(test_param_model(model, data_train, data_test, k=i))
    h = np.array(d)
    the_best_k = h.max()
    return np.where(h==the_best_k)[0][0] + 4, the_best_k, h





model_iir = ItemItemRecommender(K=25) 
model_iir.fit(sparse_user_item, show_progress=True)

model_iir.recommend(0, sparse_user_item.T.tocsr(), N=20)
map_iir = mean_average_precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=10)
print(map_iir)

0.05899564986385068
CPU times: user 438 ms, sys: 197 ms, total: 635 ms
Wall time: 280 ms


In [23]:
%%time
k, map_k, all_k = get_the_best_k(model_iir, sparse_user_item, sparse_user_item_test)

print("k: ", k, "map_k: ", map_k)

k:  25 map_k:  0.05899564986385068
CPU times: user 1min 40s, sys: 1min 51s, total: 3min 32s
Wall time: 1min 13s
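
The search for the best `K` is a one-dimensional grid search. A compact generic sketch using `np.argmax` is shown below; `best_param` is a hypothetical helper and the toy score function merely stands in for the MAP computation.

```python
import numpy as np


def best_param(candidates, score_fn):
    """Return (best_candidate, best_score, all_scores) for a 1-D grid search."""
    candidates = list(candidates)
    scores = np.array([score_fn(c) for c in candidates], dtype=float)
    best = int(np.argmax(scores))  # index of the first maximum
    return candidates[best], float(scores[best]), scores


# Toy score with a single peak at k=25.
best_k, best_score, all_scores = best_param(range(4, 300), lambda k: -abs(k - 25))
```

Returning the candidate itself rather than an index avoids having to add the range offset by hand.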


In [25]:
all_k

array([0.05055722, 0.05086289, 0.05272365, 0.04984708, 0.05191107,
       0.04985422, 0.05253985, 0.0541924 , 0.05445026, 0.05487282,
       0.05440426, 0.05506276, 0.05597397, 0.05504466, 0.05571146,
       0.05567875, 0.05578402, 0.05536345, 0.05678904, 0.05734276,
       0.05775121, 0.05899565, 0.05708292, 0.05651308, 0.05598011,
       0.05623647, 0.05630869, 0.05450222, 0.05395862, 0.05501428,
       0.05428405, 0.05441954, 0.05335226, 0.05429485, 0.0537901 ,
       0.05488361, 0.05416434, 0.05344175, 0.05330677, 0.05431344,
       0.0533594 , 0.05236667, 0.05131201, 0.05187637, 0.05171714,
       0.05224215, 0.05290181, 0.05328186, 0.05348277, 0.05427658,
       0.05377798, 0.05302434, 0.05290081, 0.05339128, 0.05407402,
       0.05320416, 0.05355184, 0.05355283, 0.05342316, 0.0535163 ,
       0.05421764, 0.05448363, 0.05448429, 0.05324334, 0.05416799,
       0.05446072, 0.05459205, 0.05408548, 0.05414043, 0.05389885,
       0.05440941, 0.05438301, 0.05502955, 0.05490137, 0.05468

In [15]:
user_item_matrix.head()

user_id,179050,1221050,1261050,1319050,1918050,1931050,1988050,1996050,2040050,2232050,...,95333073,95334073,95335073,95336073,95338073,95340073,95341073,95343073,95370073,95372073
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000274,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Training the ALS (Alternating Least Squares) model
The alternating least squares algorithm.

[AlternatingLeastSquares model](https://github.com/benfred/implicit/blob/main/implicit/als.py#L7)
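
For reference, implicit-feedback ALS models are usually described (following Hu, Koren and Volinsky) as minimising a confidence-weighted squared error. The notation below is not from the original notebook and is given only as a sketch of the objective:

$$
\min_{X, Y} \sum_{u, i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
$$

Here $p_{ui} = 1$ if user $u$ interacted with item $i$ and $0$ otherwise, $c_{ui}$ is a confidence weight that grows with the observed value (in this notebook, the matrix entries such as age_param), and $x_u$, $y_i$ are the latent user and item vectors. In the cell below, `factors` corresponds to the dimension of these vectors, `regularization` to $\lambda$, and `iterations` to the number of alternating updates of $X$ and $Y$.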

In [27]:
%%time
model_als = AlternatingLeastSquares(factors=100,  # number of latent factors
                                regularization=0.001,
                                iterations=15, 
                                calculate_training_loss=True, 
                                num_threads=4)

model_als.fit(sparse_user_item, show_progress=True)

model_als.recommend(12, user_items=sparse_user_item.T.tocsr(), N=20)



CPU times: user 40.7 s, sys: 1min, total: 1min 41s
Wall time: 8.57 s


[(4964, 0.009680003),
 (4304, 0.0079981),
 (4069, 0.007138895),
 (4246, 0.0071357414),
 (4353, 0.0066889226),
 (4794, 0.0066147298),
 (4984, 0.00573875),
 (3869, 0.0056389887),
 (4362, 0.005637466),
 (4537, 0.005626984),
 (4420, 0.0055864826),
 (4310, 0.0054562315),
 (4843, 0.0053236485),
 (5277, 0.0051469505),
 (643, 0.0051239133),
 (4725, 0.0050973594),
 (4288, 0.005091626),
 (4765, 0.0050508603),
 (4933, 0.0049698055),
 (4793, 0.004931599)]

In [28]:
result['itemitem'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir), axis='columns')
result['diff_iir'] = result.apply(lambda x: len(set(x["history"]) & set(x["itemitem"])), axis='columns')
result['als'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_als), axis='columns')
result['diff_als'] = result.apply(lambda x: len(set(x["history"]) & set(x["als"])), axis='columns')

result.head()

index,user_id,history,itemitem,diff_iir,als,diff_als
0,1,"[7574050, 94605073, 94701073, 7573050, 9467907...","[7585050, 94869073, 95131073, 95075073, 950900...",0,"[94182073, 94224073, 94193073, 94449073, 94293...",1
1,2,"[94351073, 94339073, 94860073, 89645073, 91643...","[7585050, 95075073, 94849073, 7602050, 9501907...",0,"[94295073, 94558073, 94605073, 94423073, 75430...",1
2,3,"[95258073, 7612050, 94190073, 95266073, 952890...","[94869073, 95079073, 95075073, 95208073, 95207...",1,"[7559050, 7553050, 94192073, 94558073, 7577050...",1
3,4,"[95239073, 95280073, 95279073, 95160073, 95141...","[94869073, 95075073, 95207073, 95079073, 95194...",7,"[7543050, 94349073, 94256073, 94098073, 755905...",0
4,5,"[95239073, 7608050, 91456073, 85817073, 950110...","[94869073, 95076073, 95093073, 95011073, 75900...",2,"[7546050, 94606073, 93496073, 7492050, 9321107...",0


In [48]:
result.describe()

statistic,user_id,diff_iir,diff_als
count,239.0,239.0,239.0
mean,137.866109,1.794979,0.623431
std,80.63195,2.091353,0.987546
min,1.0,0.0,0.0
25%,69.5,0.0,0.0
50%,136.0,1.0,0.0
75%,207.5,3.0,1.0
max,278.0,9.0,7.0


In [29]:
positive_result_iir_count = result[result['diff_iir'] > 0].shape[0]
positive_result_als_count = result[result['diff_als'] > 0].shape[0]
print("Number of users with at least one hit, IIR model: ", positive_result_iir_count)
print("Number of users with at least one hit, ALS model: ", positive_result_als_count)

Number of users with at least one hit, IIR model:  145
Number of users with at least one hit, ALS model:  87
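
The hit counts above are set intersections between each user's held-out history and the recommended list. A minimal library-free sketch of the same calculation; the function names and sample data are hypothetical.

```python
def count_hits(history, recommended):
    """Number of recommended items that also appear in the held-out history."""
    return len(set(history) & set(recommended))


def users_with_hits(histories, recommendations):
    """Number of users for whom at least one recommendation was a hit."""
    return sum(
        1
        for user, history in histories.items()
        if count_hits(history, recommendations.get(user, [])) > 0
    )


toy_histories = {1: [10, 20, 30], 2: [40, 50], 3: [60]}
toy_recs = {1: [20, 30, 99], 2: [98, 97], 3: [60, 61]}
toy_users_hit = users_with_hits(toy_histories, toy_recs)
```

Because sets are used, a news item that appears several times in a history is counted once, the same as in the `diff_iir` and `diff_als` columns.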


In [30]:
map_iir = mean_average_precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
map_als = mean_average_precision_at_k(model_als, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir)
print("mean average precision at k for model ALS: ", map_als)

mean average precision at k for model ItemItemRecommender:  0.08509065550906553
mean average precision at k for model ALS:  0.024044630404463047


In [51]:
precision_iir = precision_at_k(model_iir, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
precision_als = precision_at_k(model_als, sparse_user_item.T.tocsr(), sparse_user_item_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir)
print("precision at k for model ALS: ", precision_als)

precision at k for model ItemItemRecommender:  0.1322175732217573
precision at k for model ALS:  0.03933054393305439
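
For readers unfamiliar with the two metrics, here is a sketch of their textbook per-user definitions. The library may normalise or aggregate across users differently, so these helpers (hypothetical names) are not expected to reproduce the numbers above exactly.

```python
def precision_at(recommended, relevant, k):
    """Share of the top-k recommendations that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def average_precision_at(recommended, relevant, k):
    """Average of precision values taken at the rank of every hit in the top-k."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denominator = min(k, len(relevant))
    return total / denominator if denominator else 0.0


toy_recommended = [10, 20, 30, 40, 50]
toy_relevant = {20, 50, 99}
toy_precision = precision_at(toy_recommended, toy_relevant, k=5)
toy_ap = average_precision_at(toy_recommended, toy_relevant, k=5)
```

Precision ignores where in the list the hits occur, while average precision rewards hits near the top; the mean over all users gives MAP.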


On the current metrics ItemItemRecommender outperforms ALS, although with more data and a deeper study of the domain better results could be achieved. In addition, a hybrid type of collaborative filtering (combining content analysis with the nearest-neighbours method) could be used to improve the quality of the recommender system.

#### Adding content and view significance to the model

In [31]:
df_users_content = pd.DataFrame(index=df_views['user_id'].unique())

def get_ids_from_data(tags, spheres):
    ids_from_data = set()
    all_contents = tags + spheres
    for content in all_contents:
        for i in content:
            ids_from_data.add(i.get('id'))
    return ids_from_data

def get_user_spheres_and_tags(user_id):
    user_news_ids = df_views[df_views.user_id == user_id]['news_id'].unique()
    user_tags = df_news[df_news['id'].isin(user_news_ids)].tags.sum()
    user_spheres = df_news[df_news['id'].isin(user_news_ids)].spheres.sum()
    user_spheres_tags = user_tags + user_spheres
    user_spheres_tags = [x.get('id') for x in user_spheres_tags]
    return np.unique(user_spheres_tags, return_counts=True)

content_ids = get_ids_from_data(df_news.tags.values, df_news.spheres.values)
for c_id in content_ids:
    df_users_content[c_id] = 0

for ui in df_views['user_id'].unique():
    if not ui:
        print(ui)
    user_st = get_user_spheres_and_tags(ui)
    for i in range(user_st[0].shape[0]):
        df_users_content.loc[ui, user_st[0][i]] = user_st[1][i]
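
The per-user tag and sphere counts built above can also be expressed with `collections.Counter`. This is a sketch on toy data: `user_tag_counts` is a hypothetical helper, and `news_tags` is assumed to map a news id to the combined list of its tag and sphere records.

```python
from collections import Counter


def user_tag_counts(viewed_news_ids, news_tags):
    """Count how often each tag/sphere id occurs among the news a user viewed."""
    counts = Counter()
    for news_id in set(viewed_news_ids):  # each viewed news item counts once
        for record in news_tags.get(news_id, []):
            counts[record["id"]] += 1
    return counts


toy_news_tags = {
    1: [{"id": 1299}, {"id": 170217}],
    2: [{"id": 1299}],
}
toy_counts = user_tag_counts([1, 2, 2], toy_news_tags)
```

The resulting counter is the row of `df_users_content` for that user, with missing ids implicitly equal to 0.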



In [53]:
df_news['spheres_tags'] = df_news['spheres'] + df_news['tags']
df_news['spheres_tags_ids'] = df_news['spheres_tags'].apply(lambda x: [i.get('id') for i in x])

def get_interes(user_id, news_id):
    news_tags_and_spheres = df_news[df_news['id'] == news_id]['spheres_tags_ids'].sum()
    interes = df_users_content.loc[user_id, news_tags_and_spheres].sum()
    return interes

df_news['spheres_tags_ids'] 

0                      [1299, 170217, 16324217, 40823217]
1       [2299, 231299, 255299, 10217, 462217, 4790217,...
2              [6299, 332217, 4318217, 6601217, 30170217]
3                     [4299, 4000217, 12252217, 40016217]
4       [3299, 292299, 242299, 248299, 49217, 1074217,...
                              ...                        
6549    [231299, 238299, 57217, 136217, 144217, 367217...
6550    [1299, 15299, 145217, 308217, 587217, 4019217,...
6551    [18299, 244299, 19217, 127217, 4019217, 658621...
6552    [4299, 18299, 231299, 244299, 28217, 127217, 2...
6553    [15299, 150217, 151217, 854217, 4019217, 43599...
Name: spheres_tags_ids, Length: 6554, dtype: object

In [54]:
%%time

def get_users_news_interes():
    df_users_news_interes = pd.DataFrame(
        index=df_views['user_id'].unique(), columns=df_news['id'].unique(), data=0
    )

    for index in df_users_news_interes.index:
        for column in df_users_news_interes.columns:
            df_users_news_interes.loc[index, column] = get_interes(index, column)
    df_users_news_interes.to_json(path_or_buf='/app/data/users_news_interes.json', orient="index")
    return df_users_news_interes

try:
    df_users_news_interes = pd.read_json('/app/data/users_news_interes.json', orient="index")
except (FileNotFoundError, ValueError):
    print("cache file not found, rebuilding")
    df_users_news_interes = get_users_news_interes()
    
df_users_news_interes.head()

CPU times: user 2.66 s, sys: 0 ns, total: 2.66 s
Wall time: 2.69 s


user_id,75178073,80375073,41116073,94978073,64742073,42454073,78167073,95199073,67109073,94753073,...,6959050,6151050,4782050,7564050,7418050,7163050,6965050,5484050,7239050,6751050
1,0,24,1,7,14,10,36,0,7,35,...,10,4,13,10,29,22,15,17,48,14
2,18,65,1,18,37,15,108,3,6,57,...,13,28,17,14,67,69,41,22,93,16
3,23,141,4,34,66,46,188,5,26,150,...,101,67,64,78,163,126,97,98,233,69
4,30,98,5,23,38,34,106,23,16,120,...,53,48,31,47,96,79,74,59,155,37
5,84,63,16,22,68,54,113,14,22,225,...,123,118,66,89,105,68,182,96,174,70
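
The interest table is the product of a user-by-tag count matrix and a news-by-tag membership matrix, so the nested loop can be replaced by a single matrix multiplication. The sketch below uses toy data and a hypothetical helper `interest_matrix`; it assumes `news_tags` maps each news id to a list of tag/sphere ids.

```python
import numpy as np
import pandas as pd


def interest_matrix(user_tag_counts, news_tags):
    """Interest of every user in every news item as a sum of matching tag counts."""
    tag_ids = list(user_tag_counts.columns)
    tag_position = {tag: i for i, tag in enumerate(tag_ids)}
    news_ids = list(news_tags)
    # membership[n, t] = how many times tag t is attached to news item n
    membership = np.zeros((len(news_ids), len(tag_ids)))
    for row, news_id in enumerate(news_ids):
        for tag in news_tags[news_id]:
            membership[row, tag_position[tag]] += 1
    scores = user_tag_counts.to_numpy(dtype=float) @ membership.T
    return pd.DataFrame(scores, index=user_tag_counts.index, columns=news_ids)


toy_user_tags = pd.DataFrame({1299: [3, 0], 170217: [1, 2]}, index=[1, 2])
toy_news = {100: [1299, 170217], 200: [170217]}
toy_interest = interest_matrix(toy_user_tags, toy_news)
```

On the full data this avoids calling a per-cell function for every user and news pair, which is the slow part of the loop-based version.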


In [55]:
def get_top_news_for_user(user_id, n=10):
    result = df_users_news_interes.loc[user_id].sort_values(ascending=False)[:n]
    return result

def get_top_new_news_for_user(user_id, n=10):
    new_news_ids = df_news[df_news['unique_views'] == 0]['id'].values
    result = df_users_news_interes.loc[user_id, new_news_ids].sort_values(ascending=False)[:n]
    return result 
print('top_news_for_user:\n', get_top_news_for_user(2), '\n')
print('top_new_news_for_user\n', get_top_new_news_for_user(2))

top_news_for_user:
 71394073    140
7465050     139
83304073    137
93999073    135
75988073    131
81175073    128
94418073    128
92502073    125
86064073    123
89100073    121
Name: 2, dtype: int64 

top_new_news_for_user
 75988073    131
81175073    128
7457050     116
4658050     113
86448073    112
86871073    108
7117050     106
86863073    103
78602073    103
7225050     103
Name: 2, dtype: int64


In [56]:
df_news[df_news['unique_views'] == 0]['id'].values.shape

(745,)

In [57]:
ui_av_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='age_view_param', fill_value=0
)

ui_av_test = ui_av_test.astype(float)
ui_av = ui_av_test.copy(deep=True)

for index, row in data_test_2.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    ui_av.loc[user_id, news_id] = 0

sparse_ui_av = csr_matrix(ui_av).T
sparse_ui_av_test = csr_matrix(ui_av_test).T

print("Train matrix shape: ", sparse_ui_av.shape)
print("Test matrix shape: ", sparse_ui_av_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)


In [58]:
final_df['interes'] = final_df.apply(lambda x: df_users_news_interes.loc[x.user_id, x.news_id], axis='columns')
final_df['content_param'] = final_df['interes'] / final_df['title_age']

uc_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='content_param', fill_value=0
)

uc_test = uc_test.astype(float) 
uc_train = uc_test.copy(deep=True)


for index, row in data_test_2.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    uc_train.loc[user_id, news_id] = 0
    

sparse_uc = csr_matrix(uc_train).T
sparse_uc_test = csr_matrix(uc_test).T


print("Train matrix shape: ", sparse_uc.shape)
print("Test matrix shape: ", sparse_uc_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)
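
The three cell weights used in the matrices so far can be summarised as follows. The symbols are shorthand introduced here for the column names in the code, with $n$ a news item and $u$ a user:

$$
\text{title\_age}(n) = \bigl(\max_{m} \text{published\_at}(m) - \text{published\_at}(n)\bigr)_{\text{days}} + 1
$$

$$
\text{age\_param}(n) = \frac{1}{\text{title\_age}(n)}, \qquad
\text{age\_view\_param}(n) = \frac{\text{unique\_views}(n)}{\text{title\_age}(n)}, \qquad
\text{content\_param}(u, n) = \frac{\text{interes}(u, n)}{\text{title\_age}(n)}
$$

All three decay with the age of the news item; they differ only in the numerator, which is constant, popularity-based, or based on the user's tag and sphere profile.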


In [62]:
model_iir_v = ItemItemRecommender(K=7) 
model_iir_v.fit(sparse_ui_av, show_progress=True)

print(recomend_test_user(1, model_iir_v, data=sparse_ui_av, n=20))

map_iir_v = mean_average_precision_at_k(model_iir_v, sparse_ui_av.T.tocsr(), sparse_ui_av_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_v)
precision_iir_v = precision_at_k(model_iir_v, sparse_ui_av.T.tocsr(), sparse_ui_av_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_v)

model_iir_v.recommend(0, sparse_ui_av.T.tocsr(), N=20)

[94849073, 94634073, 94417073, 7575050, 94702073, 7552050, 94705073, 7585050, 7557050, 94419073, 94647073, 94061073, 94639073, 94479073, 7574050, 94703073, 7536050, 94190073, 94953073, 7579050]


mean average precision at k for model ItemItemRecommender:  0.04177126917712691


precision at k for model ItemItemRecommender:  0.07112970711297072


[(5423, 1962.7163354187967),
 (5245, 1313.5285843844213),
 (5049, 1237.7771106031857),
 (694, 1173.658874638665),
 (5303, 945.6835590783653),
 (671, 868.8670742504448),
 (5305, 645.4059605746017),
 (704, 595.6223249502226),
 (676, 498.56528989942035),
 (5051, 446.8995834302231),
 (5258, 436.210611057656),
 (4723, 387.7429911928713),
 (5250, 271.4890011890606),
 (5102, 245.6999366116791),
 (693, 215.38386634347228),
 (5304, 214.3959805094716),
 (657, 189.71603749849862),
 (4838, 133.2838173786586),
 (5508, 112.2821301775148),
 (698, 86.71888329099133)]

In [60]:
%%time
k, map_k, all_k = get_the_best_k(model_iir_v, sparse_ui_av, sparse_ui_av_test)

print("k: ", k, "map_k: ", map_k)

k:  1 map_k:  0.07870292887029284
CPU times: user 1min 29s, sys: 1min 36s, total: 3min 6s
Wall time: 1min 11s


In [61]:
all_k

array([0.07870293, 0.04315202, 0.03892608, 0.03916318, 0.04073919,
       0.03870293, 0.04177127, 0.03991632, 0.03750349, 0.03864714,
       0.0392887 , 0.03956764, 0.03942817, 0.03966527, 0.03891213,
       0.037894  , 0.03728033, 0.03688982, 0.03750349, 0.03796374,
       0.03772664, 0.03748954, 0.03804742, 0.03820084, 0.03827057,
       0.03847978, 0.03810321, 0.03835425, 0.03843794, 0.03847978,
       0.03831241, 0.03825662, 0.03808926, 0.03875872, 0.03924686,
       0.03841004, 0.03835425, 0.03839609, 0.03849372, 0.03874477,
       0.03834031, 0.03800558, 0.03804742, 0.03808926, 0.0381311 ,
       0.0381311 , 0.03817294, 0.03817294, 0.03757322, 0.03771269,
       0.03761506, 0.0376848 , 0.03772664, 0.03786611, 0.03786611,
       0.03782427, 0.038159  , 0.038159  , 0.038159  , 0.038159  ,
       0.038159  , 0.03788006, 0.03788006, 0.03788006, 0.03788006,
       0.03788006, 0.03788006, 0.03788006, 0.03788006, 0.03788006,
       0.03788006, 0.03788006, 0.03788006, 0.03788006, 0.03788

In [92]:
for ui in df_views['user_id'].unique():
    rec = recomend_test_user(ui, model_iir_v, n=5, data=sparse_ui_av)
    print(ui, rec)

1 [7575050, 94634073, 94702073, 94849073, 94705073]
2 [7575050, 94849073, 94702073, 94634073, 94417073]
3 [94849073, 7575050, 94634073, 94702073, 94417073]
4 [94417073, 94860073, 7572050, 94801073, 94750073]
5 [94634073, 94705073, 7574050, 94647073, 94419073]
6 [94849073, 7574050, 7571050, 94750073, 94703073]
7 [7575050, 94702073, 94705073, 7574050, 94639073]
8 [7575050, 94849073, 7557050, 94417073, 94702073]
9 [7575050, 94702073, 94849073, 94634073, 94647073]
10 [7575050, 7552050, 94849073, 94702073, 94417073]
11 [94849073, 7575050, 94702073, 94417073, 94634073]
13 [94634073, 94417073, 94647073, 94705073, 94639073]
14 [7575050, 94849073, 94702073, 94634073, 7574050]
16 [7575050, 94849073, 94702073, 94634073, 94705073]
17 [94702073, 7575050, 94849073, 7574050, 94639073]
18 [94634073, 7575050, 94702073, 94849073, 7574050]
19 [7575050, 94849073, 94634073, 94702073, 7574050]
20 [7575050, 94702073, 7574050, 94860073, 94419073]
21 [7575050, 94702073, 94849073, 94634073, 94647073]
22 [757505

In [71]:
model_iir_c = ItemItemRecommender(K=31) 
model_iir_c.fit(sparse_uc, show_progress=True)

print(recomend_test_user(1, model_iir_c, data=sparse_uc, n=20))

map_iir_c = mean_average_precision_at_k(model_iir_c, sparse_uc.T.tocsr(), sparse_uc_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_c)
precision_iir_c = precision_at_k(model_iir_c, sparse_uc.T.tocsr(), sparse_uc_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_c)

model_iir_c.recommend(0, sparse_uc.T.tocsr(), N=20)

[93590073, 95098073, 95003073, 95166073, 95131073, 95194073, 95105073, 94896073, 7604050, 7593050, 95015073, 94979073, 95031073, 95148073, 95024073, 95146073, 95109073, 7603050, 95182073, 95155073]


mean average precision at k for model ItemItemRecommender:  0.028786610878661082


precision at k for model ItemItemRecommender:  0.049372384937238493


[(4463, 41553.60006859217),
 (5631, 40854.72570590222),
 (5549, 39125.28199044329),
 (5687, 37685.55128797327),
 (5656, 37126.97948226483),
 (5708, 32831.92668509282),
 (5636, 32543.926351019814),
 (5462, 31543.11183438643),
 (722, 30944.961794165687),
 (711, 30418.273169775206),
 (5557, 29931.59818192894),
 (5532, 29440.346014663395),
 (5572, 28135.599273823616),
 (5669, 28028.203025332026),
 (5565, 27409.494184677165),
 (5667, 27381.920873533843),
 (5639, 26881.747961158217),
 (721, 25339.94213354576),
 (5700, 24389.193862122596),
 (5676, 20836.164365264452)]

In [66]:
%%time
k, map_k, all_k = get_the_best_k(model_iir_c, sparse_uc, sparse_uc_test)

print("k: ", k, "map_k: ", map_k)

k:  1 map_k:  0.03270571827057182
CPU times: user 1min 26s, sys: 1min 36s, total: 3min 3s
Wall time: 1min 11s


In [70]:
np.where(all_k>0.0289)

(array([ 0,  1, 31, 32, 33, 34]),)

In [115]:
model_iir_ct = ItemItemRecommender(K=200) 
model_iir_ct.fit(sparse_uc, show_progress=True)

for ui in df_views['user_id'].unique():
    rec = recomend_test_user(ui, model_iir_ct, n=5, data=sparse_uc)
    print(ui, rec)

1 [7579050, 7575050, 94913073, 94849073, 94717073]
2 [7579050, 7575050, 94717073, 94913073, 94849073]
3 [7579050, 7575050, 94849073, 94913073, 94717073]
4 [7579050, 7572050, 94681073, 94838073, 94567073]
5 [7579050, 94897073, 94913073, 94634073, 94449073]
6 [7579050, 94849073, 94897073, 94913073, 94449073]
7 [7579050, 7575050, 94913073, 94897073, 94717073]
8 [7579050, 7575050, 94717073, 94913073, 94849073]
9 [7579050, 7575050, 94849073, 94703073, 94567073]
10 [7575050, 7579050, 94849073, 7571050, 7572050]
11 [7579050, 7575050, 94849073, 94913073, 94717073]
13 [94897073, 94681073, 94733073, 94449073, 94679073]
14 [7575050, 94913073, 94717073, 94849073, 7572050]
16 [7575050, 7579050, 94849073, 94913073, 94717073]
17 [7575050, 94913073, 94849073, 94717073, 94897073]
18 [7575050, 7579050, 94849073, 94913073, 94681073]
19 [7575050, 7579050, 94717073, 7572050, 94849073]
20 [7579050, 7575050, 94913073, 94681073, 94717073]
21 [7575050, 94913073, 94849073, 94717073, 7572050]
22 [7579050, 757505

In [72]:
result['user_view'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_v, data=sparse_ui_av), axis='columns')
result['user_content'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_c, data=sparse_uc), axis='columns')

result['diff_iir_v'] = result.apply(lambda x: len(set(x["history"]) & set(x["user_view"])), axis='columns')
result['diff_iir_c'] = result.apply(lambda x: len(set(x["history"]) & set(x["user_content"])), axis='columns')


positive_result_iir_v_count = result[result['diff_iir_v'] > 0].shape[0]
positive_result_iir_c_count = result[result['diff_iir_c'] > 0].shape[0]
print("Number of users with at least one hit, view model: ", positive_result_iir_v_count)
print("Number of users with at least one hit, content model: ", positive_result_iir_c_count)

Number of users with at least one hit, view model:  139
Number of users with at least one hit, content model:  98


In [75]:
result

index,user_id,history,itemitem,diff_iir,als,diff_als,user_view,user_content,diff_iir_v,diff_iir_c
0,1,"[7574050, 94605073, 94701073, 7573050, 9467907...","[7585050, 94869073, 95131073, 95075073, 950900...",0,"[94293073, 94235073, 94239073, 94224073, 94338...",1,"[94849073, 94634073, 94417073, 7575050, 947020...","[93590073, 95098073, 95003073, 95166073, 95131...",2,0
1,2,"[94351073, 94339073, 94860073, 89645073, 91643...","[7585050, 95075073, 94849073, 7602050, 9501907...",0,"[94346073, 94642073, 94703073, 94190073, 94775...",0,"[94849073, 7575050, 94634073, 94702073, 944170...","[95098073, 93590073, 95166073, 95131073, 95003...",0,0
2,3,"[95258073, 7612050, 94190073, 95266073, 952890...","[94869073, 95079073, 95075073, 95208073, 95207...",1,"[94192073, 94098073, 7523050, 7565050, 9455807...",0,"[94849073, 7575050, 7585050, 94634073, 9507507...","[95003073, 95166073, 95194073, 95131073, 94896...",0,1
3,4,"[95239073, 95280073, 95279073, 95160073, 95141...","[94869073, 95075073, 95207073, 95079073, 95194...",7,"[94750073, 7568050, 93056073, 94190073, 939370...",0,"[95075073, 7583050, 94869073, 7593050, 9441707...","[95166073, 95194073, 95131073, 95105073, 95182...",1,3
4,5,"[95239073, 7608050, 91456073, 85817073, 950110...","[94869073, 95076073, 95093073, 95011073, 75900...",2,"[94499073, 94556073, 94289073, 93713073, 92865...",0,"[94634073, 7574050, 94705073, 95021073, 944190...","[93590073, 95003073, 95166073, 95194073, 95098...",0,0
...,...,...,...,...,...,...,...,...,...,...
234,274,"[80821073, 81985073, 54965073, 64744073, 66506...","[94898073, 95129073, 94869073, 95076073, 95021...",0,"[94555073, 7572050, 94849073, 94360073, 944140...",0,"[94849073, 7575050, 94639073, 7574050, 9464707...","[95194073, 7605050, 93590073, 95146073, 951660...",0,0
235,275,"[76376073, 92684073, 95149073, 66613073, 68712...","[94869073, 95075073, 95090073, 95093073, 94898...",0,"[94279073, 7570050, 94899073, 94583073, 757305...",0,"[94702073, 94634073, 95075073, 94639073, 94953...","[93590073, 95166073, 95098073, 95003073, 95194...",0,0
236,276,"[93931073, 94113073, 94011073, 94259073, 94306...","[7613050, 95329073, 7615050, 7585050, 95210073...",0,"[94449073, 94687073, 94559073, 7549050, 944790...",0,"[94849073, 7557050, 94703073, 7585050, 7575050...","[7603050, 93590073, 7593050, 95131073, 9497907...",0,0
237,277,"[93248073, 94256073, 93910073, 93317073, 93931...","[95075073, 94953073, 7615050, 95329073, 952640...",0,"[94112073, 94233073, 94506073, 94327073, 94840...",0,"[94849073, 94634073, 94417073, 7552050, 946470...","[93590073, 95098073, 7593050, 95166073, 951460...",0,0


In [74]:
result.describe()

| | user_id | diff_iir | diff_als | diff_iir_v | diff_iir_c |
|---|---|---|---|---|---|
| count | 239.0 | 239.0 | 239.0 | 239.0 | 239.0 |
| mean | 137.866109 | 1.794979 | 0.623431 | 1.25523 | 1.037657 |
| std | 80.63195 | 2.091353 | 0.987546 | 1.497099 | 1.74734 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 69.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 136.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 75% | 207.5 | 3.0 | 1.0 | 2.0 | 1.0 |
| max | 278.0 | 9.0 | 7.0 | 6.0 | 10.0 |


In [80]:
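def get_similar_news(news_id, model, count=10):
    """Return a DataFrame with the news items most similar to news_id, plus news_id itself."""
    # Map the news ID to its index in the item matrix
    item_id = news_ids.index(news_id)
    # Each element of the result starts with the index of a similar item
    result = model.similar_items(item_id, count)
    news_ids_result = [news_ids[x[0]] for x in result]
    # Make sure the queried item is included in the output
    news_ids_result.append(news_id)
    df_result = df_news[df_news['id'].isin(news_ids_result)]
    return df_result[['id', 'title', 'unique_views', 'published_at']]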

get_similar_news(7585050, model_iir)

| | id | title | unique_views | published_at |
|---|---|---|---|---|
| 427 | 95090073 | В Москве завершилась реконструкция 12 прудов | 31 | 2021-08-25 07:06:00 |
| 2428 | 95075073 | На mos.ru заработал сервис проверки статуса за... | 35 | 2021-08-25 10:16:03 |
| 2595 | 95067073 | Миллион москвичей подали заявление на онлайн-г... | 26 | 2021-08-24 17:34:00 |
| 5780 | 7592050 | На МЦК в тестовом режиме пустили двухэтажный п... | 26 | 2021-08-25 11:47:00 |
| 5824 | 7586050 | Детские сады большинства районов Москвы начнут... | 33 | 2021-08-24 15:57:00 |
| 5878 | 7602050 | Сергей Собянин: Московская программа современн... | 29 | 2021-08-26 09:06:00 |
| 6073 | 7585050 | Учебный год в московских школах начнется в очн... | 40 | 2021-08-24 14:38:00 |
| 6175 | 7583050 | Сергей Собянин: По просьбам родителей организо... | 38 | 2021-08-24 09:06:00 |
| 6474 | 7590050 | Еще три государственные услуги в социальной сф... | 29 | 2021-08-25 09:01:00 |
| 6483 | 7593050 | Сергей Собянин открыл новую школу в Щербинке | 35 | 2021-08-25 13:08:00 |


In [81]:
final_df['uc_param'] = final_df['interes'] * final_df['unique_views'] / final_df['title_age']

uci_test = pd.pivot_table(
    final_df, index='user_id', columns='news_id', 
    values='uc_param', fill_value=0
)

uci_test = uci_test.astype(float) 
uci_train = uci_test.copy(deep=True)


for index, row in data_test_2.iterrows():
    user_id = row['user_id']
    news_id = row['news_id']
    uci_train.loc[user_id, news_id] = 0
    

sparse_uci = csr_matrix(uci_train).T
sparse_uci_test = csr_matrix(uci_test).T


print("Train matrix shape: ", sparse_uci.shape)
print("Test matrix shape: ", sparse_uci_test.shape)

Train matrix shape:  (5809, 239)
Test matrix shape:  (5809, 239)


In [82]:
%%time

k, map_k, all_k = get_the_best_k(model_iir, sparse_uci, sparse_uci_test)

print("k: ", k, "map_k: ", map_k)

k:  1 map_k:  0.10490934449093443
CPU times: user 1min 27s, sys: 1min 34s, total: 3min 1s
Wall time: 1min 11s


In [92]:
np.where(all_k>0.0443)

(array([0, 1, 2, 3, 4, 5, 6, 9]),)

In [94]:
model_iir_ci = ItemItemRecommender(K=10) 
model_iir_ci.fit(sparse_uci, show_progress=True)

print(recomend_test_user(1, model_iir_ci, data=sparse_uci, n=20))

map_iir_ci = mean_average_precision_at_k(model_iir_ci, sparse_uci.T.tocsr(), sparse_uci_test.T.tocsr(), K=5)
print("mean average precision at k for model ItemItemRecommender: ", map_iir_ci)
precision_iir_ci = precision_at_k(model_iir_ci, sparse_uci.T.tocsr(), sparse_uci_test.T.tocsr(), K=5)
print("precision at k for model ItemItemRecommender: ", precision_iir_ci)

model_iir_ci.recommend(0, sparse_uci.T.tocsr(), N=20)

  0%|          | 0/5809 [00:00<?, ?it/s]

[7575050, 94849073, 95003073, 94979073, 7579050, 93590073, 94419073, 7593050, 7572050, 94634073, 7557050, 94896073, 95105073, 94417073, 95098073, 94703073, 94681073, 95015073, 7604050, 7603050]


  0%|          | 0/239 [00:00<?, ?it/s]

mean average precision at k for model ItemItemRecommender:  0.04443514644351461


  0%|          | 0/239 [00:00<?, ?it/s]

precision at k for model ItemItemRecommender:  0.08200836820083682


[(694, 2129481218.456305),
 (5423, 2058752234.4075217),
 (5549, 1560107575.2125626),
 (5532, 1427678623.4340715),
 (698, 1363747474.3890853),
 (4463, 1282176573.9386518),
 (5051, 1060982292.2212607),
 (711, 1052875644.5084926),
 (691, 822993931.0283371),
 (5245, 801525897.365884),
 (676, 588213349.883648),
 (5462, 537451681.2074642),
 (5636, 384570166.1085277),
 (5049, 339134201.8625768),
 (5631, 264007426.8843199),
 (5304, 190933138.9372635),
 (5286, 142251050.85093185),
 (5557, 119650407.51949678),
 (722, 110150104.25808407),
 (721, 45296491.64803805)]
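
For reference, the two metrics printed above can be sketched per user as follows. This uses one common definition that normalizes by `min(k, number of relevant items)`. That normalization is an assumption here, and the exact computation in `implicit` should be checked against the library's documentation. The function names and toy lists are illustrative only.

```python
def precision_at_k_sketch(recommended, relevant, k=5):
    """Share of the top-k recommendations that are relevant, capped by len(relevant)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / min(k, len(relevant))


def average_precision_at_k_sketch(recommended, relevant, k=5):
    """Sum of precision values at each rank holding a relevant item, normalized."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / min(k, len(relevant))


# Toy data: ranked recommendations and held-out items for one user
recommended = [10, 20, 30, 40, 50]
relevant = [20, 50, 99]

p_at_5 = precision_at_k_sketch(recommended, relevant)
ap_at_5 = average_precision_at_k_sketch(recommended, relevant)
print(p_at_5, ap_at_5)
```

Mean average precision is then the mean of the per-user values. Unlike precision, it rewards placing relevant items nearer the top of the list.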

In [95]:
result['view_content'] = result.apply(lambda x: recomend_test_user(x['user_id'], model_iir_ci, data=sparse_uci), axis='columns')
result['diff_iir_ci'] = result.apply(lambda x: len(set(x["history"]) & set(x["view_content"])), axis='columns')

positive_result_iir_ci_count = result[result['diff_iir_ci'] > 0].shape[0]
print("Number of users with at least one hit for the view_content model: ", positive_result_iir_ci_count)

Number of users with at least one hit for the view_content model:  141


In [96]:
result.describe()

| | user_id | diff_iir | diff_als | diff_iir_v | diff_iir_c | diff_iir_ci |
|---|---|---|---|---|---|---|
| count | 239.0 | 239.0 | 239.0 | 239.0 | 239.0 | 239.0 |
| mean | 137.866109 | 1.794979 | 0.623431 | 1.25523 | 1.037657 | 1.125523 |
| std | 80.63195 | 2.091353 | 0.987546 | 1.497099 | 1.74734 | 1.277199 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 69.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 136.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 75% | 207.5 | 3.0 | 1.0 | 2.0 | 1.0 | 2.0 |
| max | 278.0 | 9.0 | 7.0 | 6.0 | 10.0 | 7.0 |


In [34]:
import csv
model_prod = ItemItemRecommender(K=25) 

model_prod.fit(sparse_user_item_test, show_progress=True)

model_prod.recommend(0, sparse_user_item_test.T.tocsr(), N=20)

with open('result_task10.csv', 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
    writer = csv.writer(csvfile, delimiter=';')
    writer.writerow([
        'user_id', 'news_id_1', 'news_id_2', 'news_id_3', 'news_id_4', 'news_id_5',
        'news_id_6', 'news_id_7', 'news_id_8', 'news_id_9', 'news_id_10',
        'news_id_11', 'news_id_12', 'news_id_13', 'news_id_14', 'news_id_15',
        'news_id_16', 'news_id_17', 'news_id_18', 'news_id_19', 'news_id_20',
    ])
    for user in df_views.user_id.unique():
        writer.writerow([user, *recomend_test_user(user, model_prod, data=sparse_user_item_test, n=20)])

  0%|          | 0/5809 [00:00<?, ?it/s]
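
The 20 `news_id_N` header cells written above can also be generated rather than typed out. Here is a small sketch that writes to an in-memory buffer and uses made-up recommendations instead of calling `recomend_test_user`:

```python
import csv
import io

N_RECS = 20  # number of recommendation columns, as in the header above

header = ['user_id'] + [f'news_id_{i}' for i in range(1, N_RECS + 1)]

# Hypothetical recommendations: user_id -> list of N_RECS news IDs
recommendations = {1: list(range(100, 120)), 2: list(range(200, 220))}

buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=';')
writer.writerow(header)
for user, recs in recommendations.items():
    writer.writerow([user, *recs])

lines = buffer.getvalue().splitlines()
print(lines[0])
```

Deriving the header from `N_RECS` keeps the column count in sync if the number of recommendations per user changes.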

## Description of the automatic news tagging algorithm

The tagging algorithm is based on the [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) measure. It selects the highest-weighted words by comparing how often each word occurs in a document with how often it occurs across the full corpus. The full corpus is built from all news items together with their tags and spheres, excluding stop words. The algorithm returns a set of tags and spheres for the news text passed to it.
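
To make the idea concrete, here is a minimal TF-IDF sketch in pure Python. It is an illustration under simplifying assumptions (pre-tokenized toy documents, IDF computed as `log(N / df)`), not the repository's own code. All function names below are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each token within one document."""
    counts = Counter(tokens)
    return {token: count / len(tokens) for token, count in counts.items()}


def inverse_document_frequencies(corpus):
    """log(N / df) for every token, where df is the number of documents containing it."""
    n_docs = len(corpus)
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def top_keywords(tokens, idf, limit=3):
    """Tokens of one document ranked by TF-IDF weight, highest first."""
    tf = term_frequencies(tokens)
    weights = {token: tf[token] * idf.get(token, 0.0) for token in tf}
    return sorted(weights, key=lambda token: (-weights[token], token))[:limit]


# Toy corpus: already tokenized, normalized, and stripped of stop words
corpus = [
    ["school", "moscow", "open"],
    ["park", "moscow", "pond"],
    ["school", "moscow", "year", "school"],
]

idf = inverse_document_frequencies(corpus)
keywords = top_keywords(corpus[2], idf, limit=2)
print(keywords)
```

A word that appears in every document, such as `moscow` here, gets an IDF of zero and never becomes a keyword, however often it occurs in a single text.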

[Text-processing functions](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/text_manipulation.py):
- get_text_on_pattern_replacement_func - strips HTML tags
- get_lst_of_normalized_tokens_without_stopwords - normalizes words and removes stop words
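
The two steps listed above can be approximated as follows. This sketch only lowercases tokens and uses a tiny hard-coded stop-word set. Lemmatization (for example with pymorphy2, which the project lists among its technologies) is deliberately left out. Treat the function names and the stop-word list as placeholders, not as the repository's code.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
TOKEN_PATTERN = re.compile(r"\w+")

# Placeholder stop words; a real list would be much longer
STOP_WORDS = {"в", "на", "и", "the", "a", "in"}


def strip_html(text):
    """Replace HTML tags with spaces and decode entities such as &nbsp;."""
    return html.unescape(TAG_PATTERN.sub(" ", text))


def tokens_without_stopwords(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(strip_html(text).lower())
    return [token for token in tokens if token not in STOP_WORDS]


sample = "<p>В Москве открыли <b>новую</b> школу</p>"
print(tokens_without_stopwords(sample))
```

Replacing tags with spaces rather than deleting them keeps words on either side of a tag from being glued together.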

[Functions for building the corpus of all available materials, tags, and spheres](https://github.com/mandrianova/mos-news/blob/master/auto_markup/support_for_model/work_with_files.py#L164)

[Core functions of the algorithm](https://github.com/mandrianova/mos-news/blob/master/auto_markup/model.py):
- get_result_tag_and_spheres_for_title_preview_fulltext - returns the resulting tags and spheres
- compute_idf - computes IDF
- compute_tf - computes TF
- get_named_objects_without_stopwords - extracts named entities to enrich the results (returns addresses, titles, person names, and organizations)


Technologies used:
- [nltk](https://github.com/nltk/nltk "NLTK (the Natural Language Toolkit), a set of text-processing tools")
- [pymorphy2](https://github.com/kmike/pymorphy2/blob/92d546f042ff14601376d3646242908d5ab786c1/docs/index.rst "The pymorphy2 morphological analyzer: reduces words to their normal form, among other things")
- [natasha](https://github.com/natasha/natasha "a library for processing Russian-language text")