# Homework 3

Completed by: Gleb Khaikin, IAD 5

In this assignment you will continue working with the data from the seminar: [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

In [2]:
import pandas as pd
import numpy as np
import math

## Loading and preprocessing the data

Let us load the data and preprocess it as in the seminar.

In [3]:
!wget -q -N https://www.dropbox.com/s/z8syrl5trawxs0n/articles.zip?dl=0 -O articles.zip
!unzip -o -q articles.zip

In [4]:
articles_df = pd.read_csv('articles/shared_articles.csv')
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']
articles_df.head()

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,CONTENT SHARED,-2826566343807132236,4340306774493623681,8940341205206233829,,,,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


In [5]:
interactions_df = pd.read_csv('articles/users_interactions.csv')
interactions_df.head()

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,


In [6]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [7]:
# Define a dictionary that sets the strength of each interaction type
event_type_strength = {'VIEW': 1.0,
                       'LIKE': 2.0, 
                       'BOOKMARK': 2.5, 
                       'FOLLOW': 3.0,
                       'COMMENT CREATED': 4.0}

interactions_df['eventStrength'] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

We keep only the users who interacted with at least five distinct articles.

In [8]:
users_interactions_count_df = (interactions_df
                               .groupby(['personId', 'contentId'])
                               .first()
                               .reset_index()
                               .groupby('personId').size())

print('# users:', len(users_interactions_count_df))

# users: 1895


In [9]:
users_with_enough_interactions_df = users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[['personId']]
print('# users with at least 5 interactions:',len(users_with_enough_interactions_df))

# users with at least 5 interactions: 1140
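The filter above counts distinct articles per user, so repeated events on the same article count once. A minimal sketch of the same idea on toy data; the frame `toy_df`, the IDs, and the lowered threshold of 2 are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Toy interaction log (hypothetical data): user "u1" touches article "a1" twice.
toy_df = pd.DataFrame({
    'personId':  ['u1', 'u1', 'u1', 'u2'],
    'contentId': ['a1', 'a1', 'a2', 'a1'],
})

# Number of distinct articles per user; duplicates of a (user, article) pair count once.
articles_per_user = toy_df.groupby('personId')['contentId'].nunique()

# Keep users who reach the threshold (2 here instead of 5, to suit the toy data).
min_articles = 2
selected_users = articles_per_user[articles_per_user >= min_articles].index.tolist()
print(selected_users)
```

Here `nunique()` plays the role of the `groupby(...).first()` followed by `size()` chain used in the notebook.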


We keep only the interactions that belong to the selected users.

In [10]:
# `Series.isin` replaces `np.in1d`, which is deprecated in recent NumPy versions
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df['personId'])]

In [11]:
print('# interactions before:', interactions_df.shape)
print('# interactions after:', interactions_from_selected_users_df.shape)

# interactions before: (72312, 9)
# interactions after: (69868, 9)


We sum all of a user's interactions with each article and smooth the result by taking the base-2 logarithm of one plus the sum.

In [12]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)
    
interactions_full_df = (interactions_from_selected_users_df
                        .groupby(['personId', 'contentId'])
                        .eventStrength.sum()
                        .apply(smooth_user_preference)
                        .reset_index().set_index(['personId', 'contentId']))

interactions_full_df['last_timestamp'] = (interactions_from_selected_users_df
                                          .groupby(['personId', 'contentId'])['timestamp']
                                          .last())
        
interactions_full_df = interactions_full_df.reset_index()

In [13]:
interactions_full_df.head()

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
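To make the smoothing concrete, here is a small worked example using the same formula, log2(1 + x). The specific event combinations are illustrative assumptions:

```python
import math

def smooth_user_preference(x):
    # Same formula as in the notebook: log base 2 of (1 + summed strength)
    return math.log(1 + x, 2)

single_view = 1.0            # one VIEW
view_and_like = 1.0 + 2.0    # one VIEW plus one LIKE
ten_views = 10 * 1.0         # ten VIEWs of the same article

for label, strength in [('single view', single_view),
                        ('view + like', view_and_like),
                        ('ten views', ten_views)]:
    print(label, round(smooth_user_preference(strength), 3))
```

A summed strength of 1.0 maps to log2(2) = 1, a sum of 3.0 maps to log2(4) = 2, and ten views map to only log2(11), roughly 3.46, so heavy repetition is damped rather than growing linearly.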


We split the sample into training and test sets by time.

In [14]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print('# interactions on Train set: %d' % len(interactions_train_df))
print('# interactions on Test set: %d' % len(interactions_test_df))

# interactions on Train set: 29329
# interactions on Test set: 9777
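The time-based split can be sketched on toy data as follows. The frame `toy_df` and the threshold value 200 are hypothetical; only the comparison logic mirrors the notebook:

```python
import pandas as pd

# Toy (user, article) pairs with the timestamp of their last interaction.
toy_df = pd.DataFrame({
    'personId': ['u1', 'u1', 'u2', 'u2'],
    'contentId': ['a1', 'a2', 'a1', 'a3'],
    'last_timestamp': [100, 250, 150, 300],
})

toy_split_ts = 200
toy_train = toy_df[toy_df.last_timestamp < toy_split_ts]
toy_test = toy_df[toy_df.last_timestamp >= toy_split_ts]

# Every training interaction precedes every test interaction.
print(toy_train.last_timestamp.max(), toy_test.last_timestamp.min())
```

Splitting by time rather than at random keeps the evaluation closer to real use, where the model only sees the past when it recommends.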


In [15]:
interactions_train_df.head()

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093


To make quality evaluation convenient, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [16]:
interactions = (interactions_train_df
                .groupby('personId')['contentId'].agg(lambda x: list(x))
                .reset_index()
                .rename(columns={'contentId': 'true_train'})
                .set_index('personId'))

interactions['true_test'] = (interactions_test_df
                             .groupby('personId')['contentId'].agg(lambda x: list(x)))

# Fill missing values with empty lists
interactions.loc[pd.isnull(interactions.true_test), 
                 'true_test'] = [list() for x
                                 in range(len(interactions.loc[pd.isnull(interactions.true_test),
                                                               'true_test']))]

interactions.head()

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."
-1032019229384696495,"[-1006791494035379303, -1039912738963181810, -...","[-1415040208471067980, -2555801390963402198, -..."
-108842214936804958,"[-1196068832249300490, -133139342397538859, -1...","[-2780168264183400543, -3060116862184714437, -..."
-1130272294246983140,"[-1150591229250318592, -1196068832249300490, -...","[-1606980109000976010, -1663441888197894674, -..."
-1160159014793528221,"[-133139342397538859, -387651900461462767, 377...",[-3462051751080362224]


## The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. To evaluate recommendation quality we will use the *precision@10* metric, as in the seminar.
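For reference, precision@k for a single user is the share of the top-k recommended items that the user actually interacted with in the test period. A minimal sketch on plain Python lists; the function name `precision_at_k_from_lists` and the toy IDs are illustrative and not part of LightFM:

```python
def precision_at_k_from_lists(recommended, relevant, k=10):
    """Fraction of the first k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    relevant_set = set(relevant)
    hits = sum(1 for item in top_k if item in relevant_set)
    return hits / k

# Ten recommendations for one user, two of which appear in the user's test list.
recommended = ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10']
true_test = ['a2', 'a7', 'a42']
print(precision_at_k_from_lists(recommended, true_test, k=10))
```

The reported metric is the mean of this value over users, which is why values on the order of 0.01 correspond to roughly one hit per ten users.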

In [17]:
!pip install lightfm
from lightfm import LightFM
from lightfm.evaluation import precision_at_k



## Task 1. (2 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [18]:
from scipy.sparse import csr_matrix
from tqdm.notebook import tqdm
from lightfm.data import Dataset

In [19]:
people_id = interactions_full_df['personId'].unique()
contents_id = interactions_full_df['contentId'].unique()[pd.Series(interactions_full_df['contentId'].unique()).isin(articles_df['contentId'])]

In [20]:
new_interactions_train_df = interactions_train_df[interactions_train_df['contentId'].isin(contents_id)]
new_interactions_test_df = interactions_test_df[interactions_test_df['contentId'].isin(contents_id)]

In [21]:
df_train = Dataset()

df_train.fit(users=people_id, items=contents_id)

interactions_train, data_train = df_train.build_interactions(zip(new_interactions_train_df['personId'], 
                                                                 new_interactions_train_df['contentId'], 
                                                                 new_interactions_train_df['eventStrength']))

In [22]:
df_test = Dataset()

df_test.fit(users=people_id, items=contents_id)

interactions_test, data_test = df_test.build_interactions(zip(new_interactions_test_df['personId'], 
                                                              new_interactions_test_df['contentId'], 
                                                              new_interactions_test_df['eventStrength']))

We obtained the matrices `data_train` and `data_test`, each of size number of users by number of articles.
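To illustrate what such a matrix contains, here is a sketch that builds an equivalent sparse matrix by hand with SciPy instead of `lightfm.data.Dataset`. The users, articles, and triples are toy values chosen for illustration:

```python
from scipy.sparse import coo_matrix

users = ['u1', 'u2', 'u3']
items = ['a1', 'a2', 'a3', 'a4']

# Map raw IDs to row and column positions.
user_index = {u: i for i, u in enumerate(users)}
item_index = {a: j for j, a in enumerate(items)}

# (user, article, interaction strength) triples; absent pairs stay zero.
triples = [('u1', 'a1', 1.0), ('u1', 'a3', 2.0), ('u3', 'a4', 1.5)]
rows = [user_index[u] for u, _, _ in triples]
cols = [item_index[a] for _, a, _ in triples]
vals = [s for _, _, s in triples]

toy_matrix = coo_matrix((vals, (rows, cols)),
                        shape=(len(users), len(items))).tocsr()
print(toy_matrix.shape)
print(toy_matrix.toarray())
```

User `u2` has no interactions here, so the second row is empty. Only the three non-zero entries are stored, which is what keeps the full user-by-article matrix manageable.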

## Task 2. (1 point)

Train a LightFM model with `loss='warp'` and compute *precision@10* on the test set.

We train the model.

In [23]:
lightfm = LightFM(loss='warp', random_state=77)
lightfm.fit(data_train, epochs=60);

We compute *precision@10* on the test set:

In [25]:
print(f'Test precision@10: {precision_at_k(lightfm, data_test, train_interactions=data_train, k=10).mean():.3}')

Test precision@10: 0.00703


## Task 3. (3 points)

When calling the `fit` method, LightFM lets you pass item feature descriptions via `item_features`. Let us use this. We will derive the features from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by feature dimension, train LightFM with `loss='warp'`, and compute precision@10 on the test set.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from pandas.api.types import CategoricalDtype

We create the dataset `new_articles_df`, in which the articles from `articles_df` are sorted so that their order matches the column order of the matrices `data_train` and `data_test`. We also drop from `new_articles_df` the articles of `articles_df` that do not appear in `data_train` and `data_test`.

In [27]:
cat_size_order = CategoricalDtype(interactions_full_df['contentId'].unique(), ordered=True)
new_articles_df = articles_df.copy()
new_articles_df['contentId'] = new_articles_df['contentId'].astype(cat_size_order)
new_articles_df.sort_values(by='contentId', inplace=True, ascending=True)
new_articles_df = new_articles_df[new_articles_df['contentId'].notna()]
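The row order of the feature matrix has to follow the item order of the interaction matrices, otherwise features get attached to the wrong articles. A sketch of the same alignment on toy data, using a left merge instead of an ordered categorical; the names `item_order` and `toy_articles` are hypothetical, and the sketch assumes each `contentId` occurs once in the articles table:

```python
import pandas as pd

# Column order of the interaction matrix (toy values).
item_order = ['a3', 'a1', 'a2']

# Articles table in arbitrary order, with one article nobody interacted with.
toy_articles = pd.DataFrame({
    'contentId': ['a1', 'a2', 'a3', 'a9'],
    'text': ['first', 'second', 'third', 'never interacted with'],
})

# A left merge keeps the order of the left keys and drops the unused article.
aligned = pd.DataFrame({'contentId': item_order}).merge(
    toy_articles, on='contentId', how='left')

# Sanity checks: same order as the matrix columns, and no article is missing text.
assert aligned['contentId'].tolist() == item_order
assert aligned['text'].notna().all()
print(aligned)
```

A check of this kind is cheap and guards against a silent mismatch between the feature rows and the matrix columns.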

In [52]:
tfidf_vec = TfidfVectorizer(min_df=0.15)
feat = tfidf_vec.fit_transform(new_articles_df['text'])
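A fractional `min_df` drops every term that appears in fewer than that share of documents, which is what keeps the vocabulary small here. A toy illustration with four made-up documents and `min_df=0.5`, so a term must occur in at least two of them:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "bitcoin blockchain market",
    "bitcoin wallet",
    "google cloud",
    "bitcoin cloud",
]

# With 4 documents and min_df=0.5, a term needs a document frequency of at least 2.
toy_vec = TfidfVectorizer(min_df=0.5)
toy_feat = toy_vec.fit_transform(toy_docs)

toy_vocab = sorted(toy_vec.vocabulary_)
print(toy_vocab)
print(toy_feat.shape)
```

Only `bitcoin` (three documents) and `cloud` (two documents) survive. With `min_df=0.15` on the real corpus, a term must occur in at least 15% of the articles, so rare but potentially distinctive words are discarded.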

In [54]:
lightfm = LightFM(loss='warp', random_state=77)
lightfm.fit(data_train, item_features=feat, epochs=30);

In [56]:
print(f'Test precision@10: {precision_at_k(lightfm, data_test, train_interactions=data_train, item_features=feat, k=10).mean():.3}')

Test precision@10: 0.0109


## Task 4. (2 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert it to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

Since the dataset contains several languages, we preprocess each of them separately: `spacy` provides lemmatizers for many languages.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download("stopwords")

import string

In [None]:
!pip install spacy
!spacy download en
!spacy download pt
!spacy download es

In [35]:
import spacy

# Load the lemmatizers for each language
nlp_en = spacy.load('en', disable=['parser', 'ner'])
nlp_pt = spacy.load('pt', disable=['parser', 'ner'])
nlp_es = spacy.load('es', disable=['parser', 'ner'])

In [36]:
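# Use a new name so that the imported `stopwords` module is not shadowed
stop_words = stopwords.words('english') + stopwords.words('portuguese') + stopwords.words('spanish')
punctuation = list(string.punctuation)
noise = set(stop_words + punctuation)  # a set makes membership tests fast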

In [37]:
def process_text(text, lang):
    # Strip surrounding whitespace, lowercase, and lemmatize the text
    # with the pipeline for the given language
    if lang == 'en':
        output = nlp_en((text.strip()).lower())
    elif lang == 'pt':
        output = nlp_pt((text.strip()).lower())
    else:
        # Any other language code falls back to the Spanish pipeline
        output = nlp_es((text.strip()).lower())

    output = " ".join([token.lemma_ for token in output])

    # Remove noise: stop words and punctuation
    output = " ".join([word for word in word_tokenize(output) if word not in noise])
    
    return output
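To show the effect of the noise-removal step in isolation, here is a simplified stand-in that lowercases the text and filters stop words and punctuation, without any lemmatization. The tiny stop-word set and the regex tokenizer are assumptions made for this sketch; the notebook itself relies on spaCy and NLTK:

```python
import re
import string

# A deliberately tiny stop-word list, for illustration only.
toy_stop_words = {'the', 'a', 'of', 'is'}
toy_noise = toy_stop_words | set(string.punctuation)

def simple_clean(text):
    # Split into word tokens and single punctuation marks after lowercasing.
    tokens = re.findall(r"\w+|[^\w\s]", text.strip().lower())
    # Drop stop words and punctuation, keep everything else in order.
    return " ".join(tok for tok in tokens if tok not in toy_noise)

print(simple_clean("  The Future of Bitcoin is here!  "))
```

Lemmatization would additionally merge inflected forms (for example, plural and singular nouns) into one token before TF-IDF is computed.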

We transform the texts:

In [None]:
# Use .loc instead of chained indexing so the assignment reliably modifies
# new_articles_df; texts in other languages are left unchanged
for lang in ['en', 'pt', 'es']:
    mask = new_articles_df['lang'] == lang
    new_articles_df.loc[mask, 'text'] = (new_articles_df.loc[mask, 'text']
                                         .apply(process_text, lang=lang))

In [39]:
tfidf_vec = TfidfVectorizer(min_df=0.15)
feat_processed = tfidf_vec.fit_transform(new_articles_df['text'])

In [40]:
lightfm = LightFM(loss='warp', random_state=77)
lightfm.fit(data_train, item_features=feat_processed, epochs=60);

In [42]:
print(f'Test precision@10: {precision_at_k(lightfm, data_test, train_interactions=data_train, item_features=feat_processed, k=10).mean():.3}')

Test precision@10: 0.0103


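Did the prediction quality improve?

Unfortunately, text preprocessing slightly worsened prediction quality: precision@10 fell from 0.0109 to 0.0103. Note that the two runs also used different numbers of epochs (30 and 60), so the comparison is not strictly like-for-like.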

## Task 5. (2 points)

Tune the hyperparameters of the LightFM model (`no_components`, etc.) to improve model quality.

In [43]:
losses = ['logistic', 'bpr', 'warp', 'warp-kos']

for loss in losses:
    lightfm = LightFM(loss=loss, random_state=77)
    lightfm.fit(data_train, item_features=feat)
    # Compute the metric first to avoid a line continuation inside the f-string
    precision = precision_at_k(lightfm, data_test, train_interactions=data_train,
                               item_features=feat, k=10).mean()
    print(f'Test precision@10 with loss {loss}: {precision:.3}')

Test precision@10 with loss logistic: 0.000611
Test precision@10 with loss bpr: 0.00244
Test precision@10 with loss warp: 0.00346
Test precision@10 with loss warp-kos: 0.00428


In [44]:
no_components = np.arange(2, 40)

for n in no_components:
    lightfm = LightFM(loss='warp-kos', no_components=n, random_state=77)
    lightfm.fit(data_train, item_features=feat)
    # Compute the metric first to avoid a line continuation inside the f-string
    precision = precision_at_k(lightfm, data_test, train_interactions=data_train,
                               item_features=feat, k=10).mean()
    print(f'Test precision@10 with no_components of {n}: {precision:.3}')

Test precision@10 with no_components of 2: 0.00916
Test precision@10 with no_components of 3: 0.00804
Test precision@10 with no_components of 4: 0.00794
Test precision@10 with no_components of 5: 0.00407
Test precision@10 with no_components of 6: 0.00448
Test precision@10 with no_components of 7: 0.00804
Test precision@10 with no_components of 8: 0.00499
Test precision@10 with no_components of 9: 0.0053
Test precision@10 with no_components of 10: 0.00428
Test precision@10 with no_components of 11: 0.00692
Test precision@10 with no_components of 12: 0.00815
Test precision@10 with no_components of 13: 0.00621
Test precision@10 with no_components of 14: 0.00937
Test precision@10 with no_components of 15: 0.00519
Test precision@10 with no_components of 16: 0.00825
Test precision@10 with no_components of 17: 0.00713
Test precision@10 with no_components of 18: 0.00835
Test precision@10 with no_components of 19: 0.00611
Test precision@10 with no_components of 20: 0.00418
Test precision@10 wit

In [45]:
ks = np.arange(2, 40)

for k in ks:
    lightfm = LightFM(loss='warp-kos', no_components=36, k=k, random_state=77)
    lightfm.fit(data_train, item_features=feat)
    # Compute the metric first to avoid a line continuation inside the f-string
    precision = precision_at_k(lightfm, data_test, train_interactions=data_train,
                               item_features=feat, k=10).mean()
    print(f'Test precision@10 with k of {k}: {precision:.3}')

Test precision@10 with k of 2: 0.00458
Test precision@10 with k of 3: 0.00631
Test precision@10 with k of 4: 0.00591
Test precision@10 with k of 5: 0.0122
Test precision@10 with k of 6: 0.0104
Test precision@10 with k of 7: 0.0103
Test precision@10 with k of 8: 0.0118
Test precision@10 with k of 9: 0.0123
Test precision@10 with k of 10: 0.00855
Test precision@10 with k of 11: 0.00855
Test precision@10 with k of 12: 0.00855
Test precision@10 with k of 13: 0.00855
Test precision@10 with k of 14: 0.00855
Test precision@10 with k of 15: 0.00855
Test precision@10 with k of 16: 0.00855
Test precision@10 with k of 17: 0.00855
Test precision@10 with k of 18: 0.00855
Test precision@10 with k of 19: 0.00855
Test precision@10 with k of 20: 0.00855
Test precision@10 with k of 21: 0.00855
Test precision@10 with k of 22: 0.00855
Test precision@10 with k of 23: 0.00855
Test precision@10 with k of 24: 0.00855
Test precision@10 with k of 25: 0.00855
Test precision@10 with k of 26: 0.00855
Test precisio

In [46]:
ns = np.arange(2, 40)

for n in ns:
    lightfm = LightFM(loss='warp-kos', no_components=36, k=9, n=n, random_state=77)
    lightfm.fit(data_train, item_features=feat)
    # Compute the metric first to avoid a line continuation inside the f-string
    precision = precision_at_k(lightfm, data_test, train_interactions=data_train,
                               item_features=feat, k=10).mean()
    print(f'Test precision@10 with n of {n}: {precision:.3}')

Test precision@10 with n of 2: 0.0116
Test precision@10 with n of 3: 0.0112
Test precision@10 with n of 4: 0.0127
Test precision@10 with n of 5: 0.013
Test precision@10 with n of 6: 0.00998
Test precision@10 with n of 7: 0.0104
Test precision@10 with n of 8: 0.00835
Test precision@10 with n of 9: 0.0105
Test precision@10 with n of 10: 0.0123
Test precision@10 with n of 11: 0.0127
Test precision@10 with n of 12: 0.0113
Test precision@10 with n of 13: 0.00947
Test precision@10 with n of 14: 0.0113
Test precision@10 with n of 15: 0.0123
Test precision@10 with n of 16: 0.013
Test precision@10 with n of 17: 0.0103
Test precision@10 with n of 18: 0.00682
Test precision@10 with n of 19: 0.00998
Test precision@10 with n of 20: 0.00591
Test precision@10 with n of 21: 0.00855
Test precision@10 with n of 22: 0.00957
Test precision@10 with n of 23: 0.00957
Test precision@10 with n of 24: 0.00611
Test precision@10 with n of 25: 0.00794
Test precision@10 with n of 26: 0.00468
Test precision@10 with 

In [60]:
learning_schedules = ['adagrad', 'adadelta']

for learning_schedule in learning_schedules:
    lightfm = LightFM(loss='warp-kos', no_components=36, k=9, n=16, 
                      learning_schedule=learning_schedule, random_state=77)
    lightfm.fit(data_train, item_features=feat)
    # Compute the metric first to avoid a line continuation inside the f-string
    precision = precision_at_k(lightfm, data_test, train_interactions=data_train,
                               item_features=feat, k=10).mean()
    print(f'Test precision@10 with learning_schedule {learning_schedule}: {precision:.3}')

Test precision@10 with learning_schedule adagrad: 0.0119
Test precision@10 with learning_schedule adadelta: 0.0125


In [70]:
lightfm = LightFM(loss='warp-kos', no_components=36, k=9, n=16, 
                  learning_schedule='adadelta', random_state=77)

lightfm.fit(data_train, item_features=feat);

In [72]:
print(f'Test precision@10: {precision_at_k(lightfm, data_test, train_interactions=data_train, item_features=feat, k=10).mean():.3}')

Test precision@10: 0.0125
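The search above tunes one hyperparameter at a time while holding the others fixed. One alternative is to score combinations jointly. The sketch below shows a generic grid search; `grid_search` and `toy_score` are hypothetical helpers, and `toy_score` is a stand-in for a function that would train LightFM with the given parameters and return precision@10, ideally on a validation split held out from the training data rather than on the test set:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(no_components, k):
    # Stand-in objective with its maximum at no_components=20, k=5.
    return -abs(no_components - 20) - abs(k - 5)

best_params, best_score = grid_search(
    {'no_components': [10, 20, 30], 'k': [3, 5, 9]},
    toy_score,
)
print(best_params, best_score)
```

A full grid grows multiplicatively with each added parameter, so in practice the value lists are kept short or the combinations are sampled at random.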
