## Recommender systems


Data: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop/

### Loading and preprocessing the data

In [1]:
import pandas as pd
import numpy as np
import math

import random

random.seed(1000)

In [10]:
! pip install --user kaggle

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [13]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 68 bytes


In [14]:
!kaggle datasets download -d gspmoreira/articles-sharing-reading-from-cit-deskdrop
!unzip articles-sharing-reading-from-cit-deskdrop.zip -d articles

Downloading articles-sharing-reading-from-cit-deskdrop.zip to /content
  0% 0.00/8.20M [00:00<?, ?B/s]
100% 8.20M/8.20M [00:00<00:00, 140MB/s]
Archive:  articles-sharing-reading-from-cit-deskdrop.zip
  inflating: articles/shared_articles.csv  
  inflating: articles/users_interactions.csv  


In [26]:
articles_df = pd.read_csv("articles/shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [27]:
interactions_df = pd.read_csv("articles/users_interactions.csv")
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [28]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [29]:
event_type_strength = {
   "VIEW": 1.0,
   "LIKE": 2.0, 
   "BOOKMARK": 2.5, 
   "FOLLOW": 3.0,
   "COMMENT CREATED": 4.0,  
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

Keep only the users who interacted with at least five distinct articles.

In [30]:
users_interactions_count_df = (
    interactions_df
    .groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId").size())
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[["personId"]]
print("# users with at least 5 interactions:",len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


Keep only the interactions that belong to the selected users.

In [31]:
# Select the rows whose personId is among the users that passed the filter
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)]

In [32]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

# interactions before: (72312, 9)
# interactions after: (69868, 9)


Aggregate all of a user's interactions with each article by summing their strengths, then smooth the result with a logarithm, log2(1 + x).

In [33]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"]).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"])["timestamp"].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
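
As a quick worked example of the smoothing step, the sketch below recomputes log2(1 + x) for a few summed event strengths, using the weights from `event_type_strength`. The helper name `smoothed_strength` and the sample event lists are made up for illustration.

```python
import math

# Weights copied from event_type_strength above
EVENT_TYPE_STRENGTH = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

def smoothed_strength(events):
    """Sum the weights of one user's events on one article, then apply log2(1 + x)."""
    total = sum(EVENT_TYPE_STRENGTH[e] for e in events)
    return math.log2(1 + total)

single_view = smoothed_strength(["VIEW"])                                   # log2(1 + 1)
two_views = smoothed_strength(["VIEW", "VIEW"])                             # log2(1 + 2)
view_like_comment = smoothed_strength(["VIEW", "LIKE", "COMMENT CREATED"])  # log2(1 + 7)
```

The value 1.584963 in the table above is log2(3), which corresponds to a summed strength of 2.0 (for example two views, or a single like).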


Split the data into training and test sets by time.

In [34]:
# Unix timestamp that separates the training period from the test period
split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675
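
The threshold `split_ts` above is hard-coded. As a sketch only, one way such a threshold could be derived is to take a quantile of the interaction timestamps. The helpers `quantile_split_ts` and `time_split` and the 0.75 fraction are assumptions for illustration, not how the value above was chosen.

```python
def quantile_split_ts(timestamps, train_fraction=0.75):
    """Return a timestamp such that roughly train_fraction of the events are strictly earlier."""
    ordered = sorted(timestamps)
    cut = int(len(ordered) * train_fraction)
    cut = min(max(cut, 0), len(ordered) - 1)  # keep the index inside the list
    return ordered[cut]

def time_split(rows, split_ts):
    """Split (timestamp, payload) rows into train (< split_ts) and test (>= split_ts)."""
    train = [r for r in rows if r[0] < split_ts]
    test = [r for r in rows if r[0] >= split_ts]
    return train, test

# Toy rows: (last_timestamp, contentId)
toy_rows = [(10, "a"), (20, "b"), (30, "c"), (40, "d"),
            (50, "e"), (60, "f"), (70, "g"), (80, "h")]
toy_split_ts = quantile_split_ts([r[0] for r in toy_rows])
toy_train, toy_test = time_split(toy_rows, toy_split_ts)
```

Because the split is by time, every training interaction precedes every test interaction, which avoids leaking future behaviour into training.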


To make evaluation easier, store the data in a format where each row corresponds to a user and the columns hold the ground-truth items (and, later, the predictions) as lists.

In [35]:
interactions = (
    interactions_train_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = (
    interactions_test_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
)


interactions.loc[pd.isnull(interactions.true_test), "true_test"] = [
    "" for x in range(len(interactions.loc[pd.isnull(interactions.true_test), "true_test"]))]

interactions.head(1)

personId,true_train,true_test
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."
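
With per-user lists in this shape, a ranking metric can be computed directly. The sketch below is a plain-Python precision@k. The function name and the toy recommendation list are assumptions for illustration, and the scores reported later in this notebook come from `lightfm.evaluation.precision_at_k`, not from this helper.

```python
def precision_at_k_lists(true_items, predicted_items, k=10):
    """Fraction of the top-k predicted items that appear in the user's true items."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = predicted_items[:k]
    relevant = set(true_items)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

toy_true_test = ["a", "b", "c"]
toy_predicted = ["x", "a", "y", "c", "z"]  # ranked best-first
p_at_5 = precision_at_k_lists(toy_true_test, toy_predicted, k=5)
```

Here two of the five recommended items ("a" and "c") are in the user's test set, so precision@5 is 2 / 5.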


### LightFM

In [36]:
!pip install lightfm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lightfm
  Downloading lightfm-1.16.tar.gz (310 kB)
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.16-cp37-cp37m-linux_x86_64.whl size=705357 sha256=061023c600ae0f718c4a74eafd54d052f19c693ce865ade878257275b3a57a04
  Stored in directory: /root/.cache/pip/wheels/f8/56/28/5772a3bd3413d65f03aa452190b00898b680b10028a1021914
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.16


In [37]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

LightFM models work with sparse matrices.

In [38]:
from lightfm.data import Dataset

In [39]:
dataset = Dataset()
dataset.fit(interactions_full_df.personId, interactions_full_df.contentId)

# (personId, contentId, eventStrength) triples for the train and test sets
interaction_cols = ["personId", "contentId", "eventStrength"]
X = (
    interactions_train_df[interaction_cols].apply(tuple, axis=1),
    interactions_test_df[interaction_cols].apply(tuple, axis=1),
)


In [40]:
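# build_interactions returns a pair (interactions, weights);
# index [1] selects the weights matrix, which carries eventStrength
data_train = dataset.build_interactions(X[0])[1]
data_test = dataset.build_interactions(X[1])[1]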
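
To make the sparse representation concrete, the following sketch maps raw user and item ids to consecutive integer indices and stores only the observed (row, column, weight) triplets, which is the idea behind the COO format. It is a simplified stand-in for what `Dataset` does, not its actual implementation; `build_index` and `build_coo_triplets` are hypothetical helpers.

```python
def build_index(values):
    """Map each distinct value to a consecutive integer, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

def build_coo_triplets(triples):
    """Turn a list of (user_id, item_id, weight) tuples into COO-style row/col/data lists."""
    user_index = build_index(u for u, _, _ in triples)
    item_index = build_index(i for _, i, _ in triples)
    row_idx = [user_index[u] for u, _, _ in triples]
    col_idx = [item_index[i] for _, i, _ in triples]
    weights = [w for _, _, w in triples]
    shape = (len(user_index), len(item_index))
    return row_idx, col_idx, weights, shape

toy_triples = [("u1", "a", 1.0), ("u1", "b", 2.0), ("u2", "a", 1.58)]
row_idx, col_idx, weights, shape = build_coo_triplets(toy_triples)
```

The three lists have the layout that `scipy.sparse.coo_matrix((weights, (row_idx, col_idx)), shape=shape)` accepts; only three of the four cells of this 2 x 2 matrix are stored.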

In [41]:
model_lightFM = LightFM(loss = 'warp', k = 10)

model_lightFM.fit(data_train, epochs = 20)

<lightfm.lightfm.LightFM at 0x7fbf47fe0a50>

In [42]:
precision_at_k(model_lightFM, data_test, data_train, 10).mean()

0.0061099795

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

articles_text_full_df = pd.merge(pd.DataFrame(interactions_full_df["contentId"].unique(), columns = ["contentId"]), articles_df[["contentId", "text"]], on = "contentId", how = "left") 

articles_text_full_df.text.isna().sum() # check whether every article has text

8

In [45]:
articles_text_full_df["text"] = articles_text_full_df.text.fillna("")

articles_text_full_df.head()

Unnamed: 0,contentId,text
0,-5065077552540450930,A AXA se manteve na liderança do ranking de ma...
1,-6623581327558800021,"About a decade ago, a handful of Google's most..."
2,-793729620925729327,"Posted by Sam Thorogood , Developer Programs E..."
3,1469580151036142903,This is one of the great discussions among dev...
4,7270966256391553686,We are excited to announce the release of .NET...


In [46]:
vectorizer = TfidfVectorizer()
feat = vectorizer.fit_transform(articles_text_full_df["text"])

In [47]:
model_lightFM2 = LightFM(loss = 'warp', k = 10)

model_lightFM2.fit(data_train, epochs = 20, item_features = feat)

<lightfm.lightfm.LightFM at 0x7fbf37a71550>

In [48]:
precision_at_k(model_lightFM2, data_test, data_train, 10, item_features = feat).mean()

0.0063136453
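
The item features used above are TF-IDF vectors, one row per article. As a rough illustration of what those weights mean, the sketch below computes TF-IDF by hand with a smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 normalisation, which is intended to mirror the default settings of `TfidfVectorizer`. The whitespace tokenisation, the helper `tfidf_rows`, and the toy documents are simplifications and assumptions.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document: smoothed TF-IDF, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

toy_docs = ["bitcoin price rises", "bitcoin wallet", "google cloud platform"]
toy_tfidf = tfidf_rows(toy_docs)
```

In the second toy document, "wallet" receives a larger weight than "bitcoin" because it occurs in fewer documents.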

### Text preprocessing

So far the raw article text was used. Now we preprocess it: lower-casing, tokenization, stop-word removal, and stemming.

In [49]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import nltk
import re

In [50]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [51]:
# Build the stop-word set, the stemmer and the regex once, not on every call
STOP_WORDS = set(stopwords.words('portuguese')) | set(stopwords.words('english'))
# The second positional argument of SnowballStemmer is ignore_stopwords, so this
# is a Portuguese stemmer; it is applied to the English texts as well
STEMMER = SnowballStemmer('portuguese', ignore_stopwords=True)
WORD_RE = re.compile(r"^[a-z]+$")

def tokenizer_ow(strok: str) -> str:
    # lower-case the text and split it into words and punctuation marks
    tokens = word_tokenize(str(strok).lower())
    # keep tokens made only of the letters a-z (drops digits, punctuation and accented words)
    tokens = [t for t in tokens if WORD_RE.search(t)]
    # drop Portuguese and English stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # stem the remaining words and join them back into one string
    return " ".join(STEMMER.stem(t) for t in tokens)
    

In [52]:
tokenizer_ow(articles_text_full_df.text[0])

'axa mantev ranking maior segur europ segund estud divulg fundacion mapfr nest tot eur ranking traz allianz general prudential zurich talanx cnp credit agricol aviv mapfr dez maior segurdor europ registr eur cresciment ano anterior segment vid eur lider allianz axa zurich quesit marg solvenc mapfr ranking segu prudential axa aviv talanx'

In [53]:
articles_text_full_df["text"] = articles_text_full_df["text"].apply(tokenizer_ow)

In [54]:
articles_text_full_df["text"].head()

0    axa mantev ranking maior segur europ segund es...
1    decad ago handful googl talented engineers sta...
2    posted sam thorogood develop programs engin cl...
3    one great discussions among developers documen...
4    excited announc releas cor cor entity framewor...
Name: text, dtype: object

In [55]:
vectorizer = TfidfVectorizer()
feat = vectorizer.fit_transform(articles_text_full_df["text"])

In [56]:
model_lightFM3 = LightFM(loss = 'warp', k = 10)

model_lightFM3.fit(data_train, epochs = 20, item_features = feat)

<lightfm.lightfm.LightFM at 0x7fbf33aa9f90>

In [57]:
precision_at_k(model_lightFM3, data_test, data_train, 10, item_features = feat).mean()

0.0066191447

Quality improved, but only marginally: precision@10 rose from 0.0063 to 0.0066.


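### Tuning LightFM hyperparameters

Here we tune the following model parameters: learning_rate, no_components, n, learning_schedule, rho, and item_alpha (item_alpha and user_alpha stay at 0.0 in every run below). Note that, according to the LightFM documentation, k and n are only used by the k-OS loss (loss = 'warp-kos'), learning_rate is the initial rate of the adagrad schedule, and rho belongs to the adadelta schedule.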


In [58]:
model_lightFM4 = LightFM(loss = 'warp', k = 10, no_components=20, n=7, learning_schedule='adadelta', learning_rate=0.08, item_alpha=0.0, user_alpha=0.0)

model_lightFM4.fit(data_train, epochs = 20, item_features = feat)

<lightfm.lightfm.LightFM at 0x7fbf379a6550>

In [59]:
precision_at_k(model_lightFM4, data_test, data_train, 10, item_features = feat).mean()

0.0075356415

Now let's increase no_components, switch learning_schedule to adagrad, and increase learning_rate.

In [60]:
model_lightFM5 = LightFM(loss = 'warp', k = 10, no_components=30, n=7, learning_schedule='adagrad', learning_rate=0.15, item_alpha=0.0, user_alpha=0.0)

model_lightFM5.fit(data_train, epochs = 20, item_features = feat)

<lightfm.lightfm.LightFM at 0x7fbf379adad0>

In [61]:
precision_at_k(model_lightFM5, data_test, data_train, 10, item_features = feat).mean()

0.006211813

Now let's decrease no_components and increase n.

In [62]:
model_lightFM6 = LightFM(loss = 'warp', k = 10, no_components=10, n=15, learning_schedule='adagrad', learning_rate=0.15, item_alpha=0.0, user_alpha=0.0)

model_lightFM6.fit(data_train, epochs = 20, item_features = feat)

precision_at_k(model_lightFM6, data_test, data_train, 10, item_features = feat).mean()

0.005193483

Now let's keep no_components at 10 and decrease both n and learning_rate.

In [63]:
model_lightFM7 = LightFM(loss = 'warp', k = 10, no_components=10, n=10, learning_schedule='adagrad', learning_rate=0.1, item_alpha=0.0, user_alpha=0.0)

model_lightFM7.fit(data_train, epochs = 20, item_features = feat)

precision_at_k(model_lightFM7, data_test, data_train, 10, item_features = feat).mean()

0.0066191447

Now let's increase no_components, decrease n, switch back to adadelta with rho = 0.5, and train for 25 epochs instead of 20.

In [64]:
model_lightFM9 = LightFM(loss = 'warp', k = 10, no_components=20, n=7, learning_schedule='adadelta', learning_rate=0.1, item_alpha=0.0, user_alpha=0.0, rho = 0.5)

model_lightFM9.fit(data_train, epochs = 25, item_features = feat)

precision_at_k(model_lightFM9, data_test, data_train, 10, item_features = feat).mean()

0.00784114

This configuration gives the best quality of all the models tried (precision@10 ≈ 0.0078).
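
The manual trials above could also be organised as a small grid search. The sketch below is one possible way to do that; `grid_search` is a hypothetical helper, and `toy_evaluate` is a made-up scoring function standing in for "train a LightFM model with these parameters and return its mean precision@10".

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Score every parameter combination; return (best_params, best_score, all_results)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((params, evaluate(params)))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results

# Hypothetical stand-in for training and evaluating a model (higher is better)
def toy_evaluate(params):
    return -abs(params["no_components"] - 20) - abs(params["rho"] - 0.5)

toy_grid = {"no_components": [10, 20, 30], "rho": [0.5, 0.95]}
best_params, best_score, all_results = grid_search(toy_grid, toy_evaluate)
```

In this notebook, `evaluate` would construct `LightFM(loss='warp', **params)`, fit it on `data_train` with `item_features=feat`, and return `precision_at_k(...).mean()`. Passing a fixed `random_state` to `LightFM` would make the runs easier to compare, since the scores above differ only in the third or fourth decimal place.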
