# Introduction

In this assignment you will continue working with the data from the seminar, [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop). If you do not have a Kaggle account, you can download the dataset [here](https://drive.google.com/file/d/1rLSr49zx6RPZIn7PV_LQr9KnnpPhrr0K/view?usp=sharing).

# Loading and preprocessing the data

In [1]:
import math

import numpy as np
import pandas as pd

Load the data and preprocess it as in the seminar.

In [2]:
articles_df = pd.read_csv("shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [3]:
interactions_df = pd.read_csv("users_interactions.csv")
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [4]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [5]:
# dictionary that defines the strength of each interaction type
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(
    lambda x: event_type_strength[x]
)

Keep only the users who interacted with at least five articles.

In [6]:
users_interactions_count_df = (
    interactions_df.groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId")
    .size()
)
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = users_interactions_count_df[
    users_interactions_count_df >= 5
].reset_index()[["personId"]]
print("# users with at least 5 interactions:", len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


Keep only the interactions that belong to the selected users.

In [7]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)
]

In [8]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

# interactions before: (72312, 9)
# interactions after: (69868, 9)


Combine all of a user's interactions with each article and smooth the result by taking its logarithm.

In [9]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)


interactions_full_df = (
    interactions_from_selected_users_df.groupby(["personId", "contentId"])
    .eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index()
    .set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = interactions_from_selected_users_df.groupby(
    ["personId", "contentId"]
)["timestamp"].last()

interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
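
As a quick check of the smoothing step, the sketch below applies the same `log2(1 + x)` transform to a few summed event strengths built from the weights in `event_type_strength`. The example sums are illustrative and are not taken from the dataset.

```python
import math


def smooth_user_preference(x):
    # same transform as in the notebook: log base 2 of (1 + x)
    return math.log(1 + x, 2)


# illustrative sums of event strengths for one (user, article) pair
examples = {
    "one VIEW": 1.0,
    "two VIEW events": 1.0 + 1.0,
    "VIEW + LIKE": 1.0 + 2.0,
}

smoothed = {name: smooth_user_preference(total) for name, total in examples.items()}
for name, value in smoothed.items():
    print(f"{name}: {value:.6f}")
```

Repeated weak signals therefore grow slowly: a second view raises the value from 1.0 to about 1.585, which is the non-integer strength visible in the table above.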


Split the data into train and test sets by time.

In [10]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp < split_ts
].copy()
interactions_test_df = interactions_full_df.loc[
    interactions_full_df.last_timestamp >= split_ts
].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675


To make quality evaluation easier, store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [11]:
interactions = (
    interactions_train_df.groupby("personId")["contentId"]
    .agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = interactions_test_df.groupby("personId")["contentId"].agg(
    lambda x: list(x)
)

# fill missing values with empty lists
interactions["true_test"] = interactions["true_test"].apply(
    lambda x: x if isinstance(x, list) else []
)

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."


# The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements popular algorithms. As in the seminar, recommendation quality will be measured with the *precision@10* metric.

In [12]:
# !pip install lightfm

In [12]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

## Task 1 (1.5 points)

Models in LightFM work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the entry at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [13]:
# Your code here
# data_train =
# data_test =

ratings_train = pd.pivot_table(
    interactions_train_df,
    values='eventStrength',
    index='personId',
    columns='contentId').fillna(0)
ratings_train

contentId,-1006791494035379303,-1021685224930603833,-1022885988494278200,-1024046541613287684,-1033806831489252007,-1038011342017850,-1039912738963181810,-1046621686880462790,-1051830303851697653,-1055630159212837930,...,9217155070834564627,921770761777842242,9220445660318725468,9222265156747237864,943818026930898372,957332268361319692,966067567430037498,972258375127367383,980458131533897249,98528655405030624
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-1007001694607905623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
-1032019229384696495,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,2.321928,0.0,0.0,0.0,0.0,0.0
-108842214936804958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,2.0,0.0,0.0,0.0
-1130272294246983140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.000000,0.0,0.0,0.0,0.0,0.0
-1160159014793528221,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953707509720613429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
983095443598229476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
989049974880576288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0
997469202936578234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0


In [14]:
ratings_train.head(2)

contentId,-1006791494035379303,-1021685224930603833,-1022885988494278200,-1024046541613287684,-1033806831489252007,-1038011342017850,-1039912738963181810,-1046621686880462790,-1051830303851697653,-1055630159212837930,...,9217155070834564627,921770761777842242,9220445660318725468,9222265156747237864,943818026930898372,957332268361319692,966067567430037498,972258375127367383,980458131533897249,98528655405030624
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-1007001694607905623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1032019229384696495,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,3.0,0.0,0.0,0.0,2.321928,0.0,0.0,0.0,0.0,0.0


In [15]:
ratings_test = pd.pivot_table(
    interactions_test_df,
    values='eventStrength',
    index='personId',
    columns='contentId').fillna(0)
ratings_test

contentId,-1021685224930603833,-1022885988494278200,-1072987232233605661,-1101361754763388054,-1104501717379772664,-1119244241345534741,-1123543351704082417,-1124738136890721085,-1129449063360470561,-1138633255366005559,...,9208127165664287660,9209629151177723638,9209886322932807692,9213260650272029784,921770761777842242,9220445660318725468,962287586799267519,966067567430037498,967143806332397325,991271693336573226
personId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
-1007001694607905623,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1032019229384696495,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-108842214936804958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1119397949556155765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
-1130272294246983140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
953707509720613429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
983095443598229476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
989049974880576288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997469202936578234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
from scipy.sparse import csr_matrix

In [17]:
data_train = csr_matrix(ratings_train)
data_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [18]:
data_test = csr_matrix(ratings_test)
data_test.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [19]:
from lightfm.data import Dataset

In this example, we’ll use LightFM’s built-in Dataset class to build an interaction dataset from raw data. The goal is to demonstrate how to go from raw data (lists of interactions and perhaps item and user features) to scipy.sparse matrices that can be used to fit a LightFM model. https://making.lyst.com/lightfm/docs/examples/dataset.html

In [20]:
dataset = Dataset()
dataset.fit(interactions_full_df.personId, interactions_full_df.contentId)

In [21]:
# (user, item, strength) triples for the train and test splits
X = (
    interactions_train_df[["personId", "contentId", "eventStrength"]].apply(tuple, axis=1),
    interactions_test_df[["personId", "contentId", "eventStrength"]].apply(tuple, axis=1),
)

In [22]:
# build_interactions returns (interactions, weights); index 1 selects the weights matrix
data_train = dataset.build_interactions(X[0])[1]
data_test = dataset.build_interactions(X[1])[1]

In [53]:
data_train

<1140x2984 sparse matrix of type '<class 'numpy.float32'>'
	with 29329 stored elements in COOrdinate format>
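
The `pivot_table` route above builds the train and test tables separately, so their rows and columns need not line up, whereas `Dataset` maps both splits onto one shared index. A minimal sketch of that shared-index idea with plain `scipy.sparse`, using a tiny made-up interaction list (the helper name `build_matrix` and the toy data are assumptions for illustration):

```python
import numpy as np
from scipy.sparse import coo_matrix

# toy (user, item, strength) triples; made up for illustration
train = [("u1", "a", 1.0), ("u1", "b", 2.0), ("u2", "a", 1.0)]
test = [("u2", "c", 3.0)]

# one shared index over users and items from BOTH splits
users = sorted({u for u, _, _ in train + test})
items = sorted({i for _, i, _ in train + test})
user_idx = {u: n for n, u in enumerate(users)}
item_idx = {i: n for n, i in enumerate(items)}


def build_matrix(triples):
    # rows are users, columns are items, values are interaction strengths
    rows = [user_idx[u] for u, _, _ in triples]
    cols = [item_idx[i] for _, i, _ in triples]
    vals = [s for _, _, s in triples]
    return coo_matrix(
        (vals, (rows, cols)), shape=(len(users), len(items)), dtype=np.float32
    )


toy_train = build_matrix(train)
toy_test = build_matrix(test)
print(toy_train.shape, toy_test.shape)
```

Because both matrices share one shape and one column order, a model fitted on the train matrix can be evaluated on the test matrix without re-indexing.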

In [54]:
interactions_train_df[["personId", "contentId", "eventStrength"]]

Unnamed: 0,personId,contentId,eventStrength
0,-1007001694607905623,-5065077552540450930,1.0
2,-1007001694607905623,-793729620925729327,1.0
6,-1032019229384696495,-1006791494035379303,1.0
7,-1032019229384696495,-1039912738963181810,1.0
8,-1032019229384696495,-1081723567492738167,2.0
...,...,...,...
39099,997469202936578234,9112765177685685246,2.0
39100,998688566268269815,-1255189867397298842,1.0
39101,998688566268269815,-401664538366009049,1.0
39103,998688566268269815,6881796783400625893,1.0


## Task 2 (0.5 points)

Train a LightFM model with `loss="warp"` and compute *precision@10* on the test set.

In [23]:
intr = interactions.reset_index()
intr.head(3)

Unnamed: 0,personId,true_train,true_test
0,-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."
1,-1032019229384696495,"[-1006791494035379303, -1039912738963181810, -...","[-1415040208471067980, -2555801390963402198, -..."
2,-108842214936804958,"[-1196068832249300490, -133139342397538859, -1...","[-2780168264183400543, -3060116862184714437, -..."


In [24]:
intr['personId']

0       -1007001694607905623
1       -1032019229384696495
2        -108842214936804958
3       -1130272294246983140
4       -1160159014793528221
                ...         
1107      953707509720613429
1108      983095443598229476
1109      989049974880576288
1110      997469202936578234
1111      998688566268269815
Name: personId, Length: 1112, dtype: object

In [25]:
# Your code here
from lightfm import LightFM

mlightFM = LightFM(k=10, loss = 'warp')
mlightFM.fit(data_train, epochs = 100)

<lightfm.lightfm.LightFM at 0x7f1ee54f8190>

In [26]:
precision_at_k(mlightFM, data_test, data_train, 10).mean()

0.006211813

## Task 3 (2 points)

When calling the `fit` method, LightFM lets you pass feature descriptions of items via `item_features`. Let us use this. We will derive the features from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by number of features, train LightFM with `loss="warp"`, and compute precision@10 on the test set.
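
For intuition about what TF-IDF features contain, here is a small pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row, which to my understanding matches the defaults of `TfidfVectorizer`; the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

# toy corpus; made up for illustration
docs = ["bitcoin price rises", "bitcoin wallet", "cloud price"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# document frequency: in how many documents each term occurs
df = Counter(tok for doc in tokenized for tok in set(doc))

# smoothed inverse document frequency (assumed variant, see the lead-in)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    # L2-normalize so every non-empty document has unit length
    return [v / norm if norm > 0 else 0.0 for v in raw]


toy_feat = [tfidf_row(doc) for doc in tokenized]
print(len(toy_feat), "documents x", len(vocab), "terms")
```

Terms that occur in fewer documents (such as "rises") receive a larger IDF weight than terms shared across documents (such as "bitcoin").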

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Your code here
# feat =
# articles in the order in which they first appear in the interactions
df_new = pd.DataFrame(interactions_full_df.contentId.unique(), columns = ['contentId'])
adf = articles_df.copy()
ff = pd.merge(df_new, adf, on = 'contentId', how = 'left')
ff['text'] = ff['text'].fillna('no text')  # fill missing values
tr = TfidfVectorizer()
feat = tr.fit_transform(ff.text)

In [45]:
mlightFM2 = LightFM(k=10, loss = 'warp')
mlightFM2.fit(data_train, epochs = 100, item_features = feat)

<lightfm.lightfm.LightFM at 0x7f8e103d4890>

In [46]:
precision_at_k(mlightFM2, data_test, data_train, 10, item_features = feat).mean()

0.006720978

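With the outputs shown above, precision@10 changed only marginally (0.0062 without features versus 0.0067 with raw TF-IDF features), so the text features gave no clear gain, probably because the number of epochs was not tuned and the text was not preprocessed.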

## Task 4 (1.5 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert to lowercase, remove stop words, reduce words to their normal form, and so on), then train the model and evaluate its quality on the test data.

In [None]:
# !pip install langdetect

In [38]:
# Your code here
from string import punctuation

import nltk
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("rslp")

# stop words plus punctuation for each supported language
en = stopwords.words("english") + list(punctuation)
pt = stopwords.words("portuguese") + list(punctuation)
# es = stopwords.words("spanish") + list(punctuation)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/nmakhanova/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/nmakhanova/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/nmakhanova/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package rslp to /home/nmakhanova/nltk_data...
[nltk_data]   Package rslp is already up-to-date!


In [31]:
adf_new = articles_df.copy()
adf_new['text'] = articles_df['text']
adf_new = adf_new[['contentId','lang','text']]
adf_new = pd.merge(df_new, adf_new, on = 'contentId', how = 'left' )
adf_new['text'] = adf_new['text'].fillna('unknown')
adf_new['lang'] = adf_new['lang'].fillna('no text')
adf_new.head()

Unnamed: 0,contentId,lang,text
0,-5065077552540450930,pt,A AXA se manteve na liderança do ranking de ma...
1,-6623581327558800021,en,"About a decade ago, a handful of Google's most..."
2,-793729620925729327,en,"Posted by Sam Thorogood , Developer Programs E..."
3,1469580151036142903,en,This is one of the great discussions among dev...
4,7270966256391553686,en,We are excited to announce the release of .NET...


In [33]:
adf_new['lang'].unique()

array(['pt', 'en', 'la', 'es', 'no text', 'ja'], dtype=object)

It would make sense to add dedicated processing for every language, but this lowered the quality, so I did not do it.

In [34]:
lmtzr = WordNetLemmatizer()
stemmer = nltk.stem.RSLPStemmer()

In [35]:
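def my_tokenizer(x):
    # split the text into word tokens
    wt = word_tokenize(x)
    if detect(x) == 'pt':
        # Portuguese: drop stop words and non-alphabetic tokens, then stem
        preprocessed = [stemmer.stem(word) for word in wt if word not in pt and word.isalpha()]
    else:
        # all other languages are treated as English: drop stop words, then lemmatize
        preprocessed = [lmtzr.lemmatize(word) for word in wt if word not in en and word.isalpha()]
    return preprocessed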

In [39]:
stopwords_all =  stopwords.words('portuguese') + list(punctuation) + stopwords.words('english') #+ stopwords.words('spanish') + stopwords.words('portuguese')
tr2 = TfidfVectorizer(lowercase = True, tokenizer = my_tokenizer, stop_words = stopwords_all)
feat2 = tr2.fit_transform(adf_new['text'])

In [56]:
feat2

<2984x47781 sparse matrix of type '<class 'numpy.float64'>'
	with 769123 stored elements in Compressed Sparse Row format>

In [62]:
mlightFM3 = LightFM(k=10, loss = 'warp')
mlightFM3.fit(data_train, item_features = feat2, epochs = 100)

<lightfm.lightfm.LightFM at 0x7f8dc4f9c950>

In [63]:
precision_at_k(mlightFM3, data_test, data_train, item_features = feat2, k = 10).mean()

0.006211813

Did the prediction quality improve?

###### In my run the quality only dropped (0.0062 versus 0.0067 with raw text). Something needs to be done about it; most likely the model hyperparameters need tuning.

## Task 5 (1.5 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve model quality.

In [65]:
# Your code here

# code taken in full from https://stackoverflow.com/questions/49896816/how-do-i-optimize-the-hyperparameters-of-lightfm
import itertools



def sample_hyperparameters():
    """
    Yield possible hyperparameter choices.
    """

    while True:
        yield {
            "no_components": np.random.randint(10, 200), 
            "learning_schedule": np.random.choice(["adagrad", "adadelta"]),  
            "loss": np.random.choice(["bpr", "warp", "warp-kos"]),
            "learning_rate": np.random.exponential(0.05),
            "item_alpha": np.random.exponential(1e-7),
            "user_alpha": np.random.exponential(1e-7),
            "max_sampled": np.random.randint(5, 50),
            "num_epochs": np.random.randint(5, 50),
        }

def random_search(train, test, num_samples=10, num_threads=1):
    """
    Sample random hyperparameters, fit a LightFM model, and evaluate it
    on the test set.

    Parameters
    ----------

    train: np.float32 coo_matrix of shape [n_users, n_items]
        Training data.
    test: np.float32 coo_matrix of shape [n_users, n_items]
        Test data.
    num_samples: int, optional
        Number of hyperparameter choices to evaluate.


    Returns
    -------

    generator of (precision, hyperparameter dict, fitted model)

    """

    for hyperparams in itertools.islice(sample_hyperparameters(), num_samples):
        num_epochs = hyperparams.pop("num_epochs")

        model = LightFM(**hyperparams, k = 10, random_state = 0) 
        model.fit(train, epochs = num_epochs, num_threads=num_threads)

        score = precision_at_k(model, test, train, k = 10, num_threads=num_threads).mean()

        hyperparams["num_epochs"] = num_epochs

        yield (score, hyperparams, model)

In [66]:
(score, hyperparams, model) = max(random_search(data_train, data_test, num_threads = 2), key=lambda x: x[0])
print("Best score {} at {}".format(score, hyperparams))

Best score 0.006619144696742296 at {'no_components': 192, 'learning_schedule': 'adagrad', 'loss': 'warp-kos', 'learning_rate': 0.01735654765949916, 'item_alpha': 5.586662097490282e-07, 'user_alpha': 5.310484860464415e-07, 'max_sampled': 41, 'num_epochs': 11}


In [59]:
%%time
m_final = LightFM(random_state = 0,
                   no_components = 100, 
                   learning_schedule = 'adagrad', 
                   loss = 'warp', 
                   learning_rate = 0.025, 
                   item_alpha =  1.4921059666702605e-09, 
                   user_alpha =  2.1366172295379337e-08,
                   max_sampled = 19)
m_final.fit(data_train, epochs = 26)

CPU times: user 2.29 s, sys: 0 ns, total: 2.29 s
Wall time: 2.29 s


<lightfm.lightfm.LightFM at 0x7f1ed9d43c50>

In [82]:
precision_at_k(m_final, data_test, data_train, k = 10).mean()

0.008655804

In [69]:
m_final.fit(data_train, item_features = feat, epochs = 26)
precision_at_k(m_final, data_test, data_train, item_features = feat, k = 10).mean()

0.008248473

In [70]:
m_final.fit(data_train, item_features = feat2, epochs = 26)
precision_at_k(m_final, data_test, data_train, item_features = feat2, k = 10).mean()

0.007535642

Finally, the quality is higher than in the first iteration: precision@10 reaches 0.0087 without item features, compared with 0.0062 for the Task 2 baseline.

## Task 6 (1 point)

Implement functions that compute the following metrics:
* precision@k
* recall@k
* NDCG@k
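
For reference, one common set of definitions with binary relevance is given below. The notation is introduced here and is an assumption: $R_u$ is the set of relevant (test) items of user $u$, $\hat{y}_{u,1}, \dots, \hat{y}_{u,k}$ are the top-$k$ recommended items, and $[\cdot]$ equals 1 when the condition holds and 0 otherwise. Other variants of NDCG exist.

$$\text{precision@}k = \frac{1}{k}\sum_{j=1}^{k} [\hat{y}_{u,j} \in R_u], \qquad \text{recall@}k = \frac{1}{|R_u|}\sum_{j=1}^{k} [\hat{y}_{u,j} \in R_u],$$

$$\text{DCG@}k = \sum_{j=1}^{k} \frac{[\hat{y}_{u,j} \in R_u]}{\log_2(j + 1)}, \qquad \text{NDCG@}k = \frac{\text{DCG@}k}{\sum_{j=1}^{\min(k, |R_u|)} \frac{1}{\log_2(j + 1)}}.$$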



In [57]:
interactions_test_df[["personId", "contentId", "eventStrength"]]

Unnamed: 0,personId,contentId,eventStrength
1,-1007001694607905623,-6623581327558800021,1.000000
3,-1007001694607905623,1469580151036142903,1.000000
4,-1007001694607905623,7270966256391553686,1.584963
5,-1007001694607905623,8729086959762650511,1.000000
16,-1032019229384696495,-1415040208471067980,2.700440
...,...,...,...
39090,997469202936578234,-7047448754687279385,2.584963
39093,997469202936578234,2834072258350675251,1.000000
39098,997469202936578234,8869347744613364434,2.000000
39102,998688566268269815,3456674717452933449,2.584963


In [46]:
def precision_at_k_1(y_true, y_score, k, pos_label=1):
    from sklearn.utils import column_or_1d
    from sklearn.utils.multiclass import type_of_target
    
    y_true_type = type_of_target(y_true)
    if not (y_true_type == "binary"):
        raise ValueError("y_true must be a binary column.")
    
    y_true_arr = column_or_1d(y_true)
    y_score_arr = column_or_1d(y_score)
    
    y_true_arr = y_true_arr == pos_label
    
    desc_sort_order = np.argsort(y_score_arr)[::-1]
    y_true_sorted = y_true_arr[desc_sort_order]
    y_score_sorted = y_score_arr[desc_sort_order]
    
    true_positives = y_true_sorted[:k].sum()
    
    return true_positives / k

In [47]:
def recall_at_k_2(actual, predicted, k):
    # keep only the top-k predictions (fewer if the list is shorter)
    predicted = predicted[:k]

    # recall is undefined without relevant items; return 0.0 by convention
    if len(actual) == 0:
        return 0.0

    # share of relevant items that appear among the top-k predictions
    true_positive = sum(1 for item in predicted if item in actual)
    return true_positive / len(actual)

In [48]:
def ndcg_at_k_2(actual, predicted, k):
    # keep only the top-k predictions (fewer if the list is shorter)
    predicted = predicted[:k]

    # binary relevance: the gain 2**rel - 1 is 1 for a hit and 0 otherwise
    dcg = 0.0
    for i, item in enumerate(predicted):
        if item in actual:
            dcg += 1 / math.log2(i + 2)

    # ideal DCG: all relevant items (at most k) placed at the top
    idcg = sum(1 / math.log2(i + 2) for i in range(min(k, len(actual))))

    # NDCG is undefined without relevant items; return 0.0 by convention
    return dcg / idcg if idcg > 0 else 0.0


## Task 7 (1 point)

Compute the implemented metrics for $k=10$ for the best model obtained in the previous steps.

Find existing implementations of these metrics in the lightfm and sklearn libraries. Compare your values with the results of the built-in library metrics.

In [75]:
# Your code here
from lightfm.evaluation import precision_at_k
from lightfm.evaluation import recall_at_k
from sklearn.metrics import ndcg_score
# m_final is the best model, with the best precision@10 of 0.008655804

In [79]:
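# NOTE: the second argument should hold model scores; here it is the train matrix,
# so this value does not measure the quality of m_final
ndcg_score(data_test.toarray(), data_train.toarray(), k = 10).mean()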

0.00035815469388297594

In [83]:
precision_at_k(m_final, data_test, data_train, k = 10).mean()

0.008655804

In [84]:
recall_at_k(m_final, data_test, data_train, k = 10).mean()

0.011226796638258508

In [51]:
# precision@10 for the first user; NOTE: the train row is used in place of model scores
precision_at_k_1(data_test.toarray()[0], data_train.toarray()[0], k = 10)

0.2

In [50]:
data_test.toarray()[0]

array([0., 1., 0., ..., 0., 0., 0.], dtype=float32)

In [52]:
len(data_test.toarray())

1140

In [43]:
data_train.toarray()

array([[1., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [None]:
# scores = model.predict(<new-user-index>, np.arange(n_items),user_features=new_user_feature)

I ran into difficulties with obtaining predictions and, being short on time, did not manage to finish this task.
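
One possible way to finish this step, sketched under assumptions: obtain a dense score matrix of shape (users, items), mask out the items already seen in train, take the top-$k$ items per user, and compare them with the test items. Below, the score matrix is random stand-in data; with LightFM it would, to my understanding, be filled from `model.predict`, which takes parallel arrays of user and item indices. The helper name `precision_at_k_dense` is hypothetical.

```python
import numpy as np


def precision_at_k_dense(scores, train_mask, test_mask, k=10):
    # scores: (n_users, n_items) float array, higher means more relevant
    # train_mask / test_mask: boolean arrays of the same shape
    scores = np.where(train_mask, -np.inf, scores)  # never recommend seen items
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best items
    hits = np.take_along_axis(test_mask, top_k, axis=1)  # True where a hit
    has_test = test_mask.any(axis=1)  # skip users without test interactions
    return hits[has_test].sum(axis=1).mean() / k


rng = np.random.default_rng(0)
n_users, n_items = 4, 20
toy_scores = rng.normal(size=(n_users, n_items))  # stand-in for model scores
toy_train = rng.random((n_users, n_items)) < 0.2
toy_test = (rng.random((n_users, n_items)) < 0.2) & ~toy_train

print(precision_at_k_dense(toy_scores, toy_train, toy_test, k=5))
```

The same top-$k$ index array can be reused for recall@k and NDCG@k, so the model scores only need to be computed once.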

## Task 8 (1 point)

Implement the ALS algorithm and apply it to the problem of this notebook.

**ALS**

The task is to build a latent factor model for collaborative filtering:

$$ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \to \min_{P,Q}$$

The sum runs over all pairs $(u, i)$ for which the rating $r_{ui}$ is known (and only over them); $p_u$ and $q_i$ are the latent representations of user $u$ and item $i$, respectively, and the matrices $P$ and $Q$ are formed by stacking the vectors $p_u$ and $q_i$ as columns.

The ALS (Alternating Least Squares) approach solves the problem by alternately fixing the matrices $P$ and $Q$: it turns out that, once one of the matrices is fixed, the problem for the other has a closed-form solution. For a fixed $Q$, setting the gradient with respect to $p_u$ to zero gives

$$\nabla_{p_u} \bigg[ \sum_{u,i} (r_{ui} - \langle p_u, q_i \rangle)^2 \bigg] = -\sum_{i} 2(r_{ui} - \langle p_u, q_i \rangle)q_i = 0$$

Using the identity $a^Tbc = cb^Ta$, we obtain
$$\sum_{i} r_{ui}q_i - \sum_i q_i q_i^T p_u = 0.$$
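
Spelled out for one term of the sum: the inner product is a scalar, so it can be moved to the other side of the vector $q_i$ and regrouped:
$$\langle p_u, q_i \rangle q_i = (p_u^T q_i)\, q_i = q_i\, (q_i^T p_u) = (q_i q_i^T)\, p_u.$$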

Each column of the matrix $P$ can then be found by the formula
$$p_u = \bigg( \sum_i q_i q_i^T\bigg)^{-1}\sum_ir_{ui}q_i \;\; \forall u,$$

and similarly for the columns of the matrix $Q$:
$$q_i = \bigg( \sum_u p_u p_u^T\bigg)^{-1}\sum_ur_{ui}p_u \;\; \forall i.$$

In this way we can solve the optimization problem by alternately fixing one of the matrices $P$ or $Q$ and optimizing over the other.
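
A minimal NumPy sketch of these alternating updates, under stated assumptions: the sums run only over observed entries (encoded by a boolean mask), the latent vectors are stored as rows rather than columns, and a small ridge term `reg` is added before solving to keep each system invertible, which is not part of the formulas above. Function and parameter names are illustrative.

```python
import numpy as np


def als(R, mask, n_factors=2, n_iters=20, reg=1e-3, seed=0):
    # R: (n_users, n_items) ratings; mask: True where r_ui is observed
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # row u is p_u
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # row i is q_i
    eye = reg * np.eye(n_factors)  # ridge term (assumption, see the lead-in)

    for _ in range(n_iters):
        # fix Q, solve for every p_u over the items observed for user u
        for u in range(n_users):
            Qu = Q[mask[u]]
            P[u] = np.linalg.solve(Qu.T @ Qu + eye, Qu.T @ R[u, mask[u]])
        # fix P, solve for every q_i over the users who rated item i
        for i in range(n_items):
            Pi = P[mask[:, i]]
            Q[i] = np.linalg.solve(Pi.T @ Pi + eye, Pi.T @ R[mask[:, i], i])
    return P, Q


def observed_mse(R, mask, P, Q):
    # mean squared error over observed entries only
    return float(((R - P @ Q.T)[mask] ** 2).mean())


# toy rank-2 matrix with roughly 70% of the entries observed
rng = np.random.default_rng(1)
R_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
toy_mask = rng.random(R_true.shape) < 0.7
P, Q = als(R_true, toy_mask)
print(observed_mse(R_true, toy_mask, P, Q))
```

For the notebook's data, `R` would hold the interaction strengths and `mask` the positions of known interactions; recommendations then come from ranking the rows of `P @ Q.T`.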

**The original paper that formulates ALS for explicit feedback:**

* Bell, R.M. and Koren, Y., 2007, October. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In Seventh IEEE International Conference on Data Mining (ICDM 2007) (pp. 43-52). IEEE.

**The original paper on ALS for implicit data, which became better known:**

* Hu, Y., Koren, Y. and Volinsky, C., 2008, December. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining (pp. 263-272). IEEE.


In [None]:
# Your code here