# Recommender System
## Training a Test Model

**Task description**

At this step, we train a model for a social network post recommender system using deep learning methods.

**Data description**

The `user_data` dataset contains information about all users of the social network.

- age: user's age (from the profile),
- city: user's city (from the profile),
- country: user's country (from the profile),
- exp_group: experimental group (an encrypted category),
- gender: user's gender,
- user_id: unique user identifier,
- os: operating system of the device used to access the social network,
- source: whether the user came to the app from organic traffic or from ads.

The `post_text_df` dataset contains the posts: a unique ID for each post together with its text and topic.

- post_id: unique post identifier,
- text: text content of the post,
- topic: main topic.

The `feed_data` dataset contains the history of posts viewed by each user during the study period.

- timestamp: time when the view occurred,
- user_id: unique identifier of the user who viewed the post,
- post_id: unique identifier of the viewed post,
- action: action type (view or like),
- target: for views, 1 if a like followed almost immediately after the view, otherwise 0; for like actions the value is missing.
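
The semantics of `target` can be illustrated on a tiny invented table. This is only a sketch: the rows below are made up and `toy_feed` is a hypothetical stand-in for `feed_data`. It shows why only `view` rows are usable as training labels and how the share of positive labels could be checked.

```python
import numpy as np
import pandas as pd

# Toy stand-in for feed_data; the values are invented for illustration only.
toy_feed = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "post_id": [10, 10, 11, 10, 12],
    "action": ["view", "like", "view", "view", "view"],
    "target": [1.0, np.nan, 0.0, 0.0, 0.0],
})

# Keep only views: like rows carry no target and cannot be used as labels.
views = toy_feed[toy_feed["action"] == "view"]

# Share of views that were followed by a like (the positive-class rate).
like_rate = views["target"].mean()
print(len(views), like_rate)
```

On real data the positive-class rate is typically well below one half, which is one reason to judge the model by a ranking metric rather than by accuracy.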

**Workflow**

- Load the datasets;
- Encode the features (binary encoding, One Hot Encoding, TF-IDF);
- Prepare the datasets for model training;
- Split the data into training and test sets;
- Train the model and measure its quality;
- Save the model.

### Implementation

Import the libraries.

In [1]:
import pandas as pd
from sqlalchemy import create_engine

Create a connection to the PostgreSQL database.

In [2]:
# connection parameters removed
engine = create_engine(
   
)

Load the `user_data` dataset.

In [62]:
user_data = pd.read_sql('SELECT * FROM public.user_data', con=engine, params=None)
user_data

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


Load the `post_text_df` dataset.

In [44]:
engine.dispose()
post_text_df = pd.read_sql('SELECT * FROM public.post_text_df', con=engine, params=None)
post_text_df

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business
...,...,...,...
7018,7315,"OK, I would not normally watch a Farrelly brot...",movie
7019,7316,I give this movie 2 stars purely because of it...,movie
7020,7317,I cant believe this film was allowed to be mad...,movie
7021,7318,The version I saw of this film was the Blockbu...,movie


We create new post features using deep learning methods: compute embeddings of the post texts, cluster the texts, and compute the distance from each post to every cluster center.

Compute the post embeddings.

In [45]:
from transformers import AutoTokenizer
from transformers import BertModel  # https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel
from transformers import RobertaModel  # https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaModel
from transformers import DistilBertModel  # https://huggingface.co/docs/transformers/model_doc/distilbert#transformers.DistilBertModel


def get_model(model_name):
    assert model_name in ['bert', 'roberta', 'distilbert']

    checkpoint_names = {
        'bert': 'bert-base-cased',  # https://huggingface.co/bert-base-cased
        'roberta': 'roberta-base',  # https://huggingface.co/roberta-base
        'distilbert': 'distilbert-base-cased'  # https://huggingface.co/distilbert-base-cased
    }

    model_classes = {
        'bert': BertModel,
        'roberta': RobertaModel,
        'distilbert': DistilBertModel
    }

    return AutoTokenizer.from_pretrained(checkpoint_names[model_name]), model_classes[model_name].from_pretrained(checkpoint_names[model_name])

In [46]:
tokenizer, model = get_model('distilbert')

Create a dataset for the posts.

In [47]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding


class PostDataset(Dataset):
    def __init__(self, texts, tokenizer):
        super().__init__()

        self.texts = tokenizer.batch_encode_plus(
            texts,
            add_special_tokens=True,
            return_token_type_ids=False,
            return_tensors='pt',
            truncation=True,
            padding=True
        )
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        return {'input_ids': self.texts['input_ids'][idx], 'attention_mask': self.texts['attention_mask'][idx]}

    def __len__(self):
        return len(self.texts['input_ids'])


dataset = PostDataset(post_text_df['text'].values.tolist(), tokenizer)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

loader = DataLoader(dataset, batch_size=32, collate_fn=data_collator, pin_memory=True, shuffle=False)

In [48]:
import torch
from tqdm import tqdm


@torch.inference_mode()
def get_embeddings_labels(model, loader):
    model.eval()

    total_embeddings = []

    for batch in tqdm(loader):
        batch = {key: batch[key].to(device) for key in ['attention_mask', 'input_ids']}

        # Use the hidden state of the first token ([CLS]) as the text embedding.
        embeddings = model(**batch)['last_hidden_state'][:, 0, :]

        total_embeddings.append(embeddings.cpu())

    return torch.cat(total_embeddings, dim=0)

In [49]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

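print(device)
# get_device_name() raises on CPU-only machines, so query it only for CUDA.
if device.type == 'cuda':
    print(torch.cuda.get_device_name(device))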

model = model.to(device)

cuda:0
Tesla T4


In [50]:
embeddings = get_embeddings_labels(model, loader).numpy()

embeddings

100%|██████████| 220/220 [01:44<00:00,  2.10it/s]


array([[ 3.63150924e-01,  4.89373170e-02, -2.64081299e-01, ...,
        -1.41593695e-01,  1.59184076e-02,  9.18315927e-05],
       [ 2.36416340e-01, -1.59501180e-01, -3.27798218e-01, ...,
        -2.89936006e-01,  1.19365506e-01, -1.62340223e-03],
       [ 3.75191480e-01, -1.13943778e-01, -2.40546927e-01, ...,
        -3.38919282e-01,  5.86939603e-02, -2.12654602e-02],
       ...,
       [ 3.40382725e-01,  6.64922222e-02, -1.63184673e-01, ...,
        -8.65626708e-02,  2.03403935e-01,  3.20908204e-02],
       [ 4.32092279e-01,  1.10915396e-02, -1.17306016e-01, ...,
         7.54014924e-02,  1.02739781e-01,  1.52742360e-02],
       [ 3.04277599e-01, -7.62158036e-02, -6.77587986e-02, ...,
        -5.43485284e-02,  2.44383752e-01, -1.41485315e-02]], dtype=float32)
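
The expression `last_hidden_state[:, 0, :]` keeps only the hidden state of the first token as the embedding of the whole text. A common alternative is mean pooling over the non-padding tokens. The sketch below compares the two on NumPy arrays standing in for the model output; the shapes and values are invented, and mean pooling is shown only as an option, not as what the notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the model output: 2 texts, 4 token positions, hidden size 3.
last_hidden_state = rng.normal(size=(2, 4, 3))
# 1 marks real tokens, 0 marks padding (the second text has two padded positions).
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

# First-token pooling, as in the notebook: take position 0 of every sequence.
cls_embeddings = last_hidden_state[:, 0, :]

# Mean pooling: average the hidden states over real tokens only.
mask = attention_mask[:, :, None]                  # (batch, seq, 1)
summed = (last_hidden_state * mask).sum(axis=1)    # (batch, hidden)
counts = mask.sum(axis=1)                          # (batch, 1)
mean_embeddings = summed / counts

print(cls_embeddings.shape, mean_embeddings.shape)
```

Both variants produce one vector per text, so either could feed the PCA and clustering steps without changes to the rest of the pipeline.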

Cluster the texts.

In [51]:

from sklearn.decomposition import PCA

# Center each embedding dimension; PCA also centers the data per feature internally.
centered = embeddings - embeddings.mean(axis=0)

pca = PCA(n_components=50)
pca_decomp = pca.fit_transform(centered)
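
One way to sanity-check the choice of `n_components` is to look at the cumulative explained variance. The sketch below uses a synthetic random matrix (`fake_embeddings`, an invented stand-in) rather than the real embeddings, so the printed number says nothing about the actual data; it only shows the mechanics.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the embedding matrix: 500 rows, 64 features.
fake_embeddings = rng.normal(size=(500, 64))

pca_check = PCA(n_components=10, random_state=0)
reduced = pca_check.fit_transform(fake_embeddings)

# Cumulative share of variance retained by the first k components.
cumulative = np.cumsum(pca_check.explained_variance_ratio_)
print(reduced.shape, round(float(cumulative[-1]), 3))
```

On the real embeddings, the same `explained_variance_ratio_` attribute of the fitted `pca` object would show how much information the 50 retained components keep.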

In [52]:
from sklearn.cluster import KMeans

n_clusters = 15

kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(pca_decomp)

post_text_df['TextCluster'] = kmeans.labels_

dists_columns = [f'DistanceToCluster_{i}' for i in range(n_clusters)]

dists_df = pd.DataFrame(
    data=kmeans.transform(pca_decomp),
    columns=dists_columns
)

dists_df.head()

Unnamed: 0,DistanceToCluster_0,DistanceToCluster_1,DistanceToCluster_2,DistanceToCluster_3,DistanceToCluster_4,DistanceToCluster_5,DistanceToCluster_6,DistanceToCluster_7,DistanceToCluster_8,DistanceToCluster_9,DistanceToCluster_10,DistanceToCluster_11,DistanceToCluster_12,DistanceToCluster_13,DistanceToCluster_14
0,3.376088,3.460485,3.587103,3.412926,1.752419,3.657207,2.820774,2.347432,3.467215,3.574177,3.442537,3.410246,3.390423,2.955081,2.204343
1,3.326862,3.121793,3.417043,3.220188,1.716712,3.45777,2.543294,2.320527,3.242741,3.267741,2.977005,3.32842,3.36731,2.809945,2.213508
2,3.257704,3.11222,3.484241,3.290414,1.61069,3.444299,2.872854,2.385547,3.391403,3.24516,2.967417,3.360057,3.495586,3.036035,3.014316
3,3.52079,3.783128,3.693471,3.699594,2.344504,3.146952,3.364932,2.80527,4.059343,3.804429,3.714694,3.740502,3.754162,3.242586,3.373845
4,3.037258,2.773268,3.149175,2.84779,1.702607,3.156846,2.131284,2.019922,3.236338,2.933604,2.641417,2.81265,2.804676,2.64044,2.913074
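
The `DistanceToCluster_i` columns come from `kmeans.transform`, which returns the distance from every row to every cluster center. The sketch below, on synthetic points standing in for `pca_decomp`, reproduces these distances by hand and checks that the predicted cluster is the nearest center. The data and the small `n_clusters=3` are assumptions made for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 5))   # synthetic stand-in for pca_decomp

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# transform() gives the distance from each row to every cluster center.
dists = km.transform(points)

# The same Euclidean distances computed by hand from cluster_centers_.
manual = np.linalg.norm(
    points[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)

# The predicted cluster is the index of the nearest center.
nearest = dists.argmin(axis=1)
print(np.allclose(dists, manual), (nearest == km.predict(points)).all())
```

Because the distances carry more information than the hard cluster label alone, keeping both as features lets the downstream model use whichever is more useful.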


Combine the post features into a single dataset.

In [53]:
post_text_df = pd.concat((post_text_df, dists_df), axis=1)
post_text_df

Unnamed: 0,post_id,text,topic,TextCluster,DistanceToCluster_0,DistanceToCluster_1,DistanceToCluster_2,DistanceToCluster_3,DistanceToCluster_4,DistanceToCluster_5,DistanceToCluster_6,DistanceToCluster_7,DistanceToCluster_8,DistanceToCluster_9,DistanceToCluster_10,DistanceToCluster_11,DistanceToCluster_12,DistanceToCluster_13,DistanceToCluster_14
0,1,UK economy facing major risks\n\nThe UK manufa...,business,4,3.376088,3.460485,3.587103,3.412926,1.752419,3.657207,2.820774,2.347432,3.467215,3.574177,3.442537,3.410246,3.390423,2.955081,2.204343
1,2,Aids and climate top Davos agenda\n\nClimate c...,business,4,3.326862,3.121793,3.417043,3.220188,1.716712,3.457770,2.543294,2.320527,3.242741,3.267741,2.977005,3.328420,3.367310,2.809945,2.213508
2,3,Asian quake hits European shares\n\nShares in ...,business,4,3.257704,3.112220,3.484241,3.290414,1.610690,3.444299,2.872854,2.385547,3.391403,3.245160,2.967417,3.360057,3.495586,3.036035,3.014316
3,4,India power shares jump on debut\n\nShares in ...,business,4,3.520790,3.783128,3.693471,3.699594,2.344504,3.146952,3.364932,2.805270,4.059343,3.804429,3.714694,3.740502,3.754162,3.242586,3.373845
4,5,Lacroix label bought by US firm\n\nLuxury good...,business,4,3.037258,2.773268,3.149175,2.847790,1.702607,3.156846,2.131284,2.019922,3.236338,2.933604,2.641417,2.812650,2.804676,2.640440,2.913074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7018,7315,"OK, I would not normally watch a Farrelly brot...",movie,11,3.130865,1.825135,3.218858,2.012034,2.948652,2.135622,2.336439,2.747971,3.395160,2.739128,3.049363,1.280399,1.793777,2.833709,3.341163
7019,7316,I give this movie 2 stars purely because of it...,movie,11,2.931342,1.840502,2.867923,1.785840,2.943121,1.946208,2.234463,2.460762,3.381366,2.405871,3.195972,0.925918,1.461003,2.520706,3.177790
7020,7317,I cant believe this film was allowed to be mad...,movie,11,2.831295,1.988769,2.734966,2.180366,3.179315,2.387480,2.446443,2.819374,3.460753,2.168348,3.153391,1.500393,2.006320,2.562688,3.393071
7021,7318,The version I saw of this film was the Blockbu...,movie,12,3.428010,1.557200,3.535438,1.848580,3.194752,1.782534,2.316764,3.001987,3.408694,3.113902,3.216329,1.486679,1.032388,3.111778,3.432442


Free the memory held by variables that are no longer needed.

In [15]:
model.cpu()

del model
del tokenizer

del dataset
del loader

del embeddings
del centered
del pca
del pca_decomp

In [16]:
import gc

In [17]:
gc.collect()

220

Load 9,000,000 view rows from the `feed_data` dataset.

In [19]:
engine.dispose()
feed_data = pd.read_sql('''SELECT * FROM public.feed_data where action = 'view' limit 9000000''', con=engine, params=None)
feed_data

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-10-25 20:45:43,132130,1866,view,0
1,2021-10-25 20:46:33,132130,5582,view,0
2,2021-10-25 20:48:56,132130,582,view,0
3,2021-10-25 20:49:41,132130,1344,view,0
4,2021-10-25 20:52:40,132130,6852,view,0
...,...,...,...,...,...
8999995,2021-11-11 06:34:38,113135,5769,view,0
8999996,2021-11-11 06:37:07,113135,2512,view,0
8999997,2021-11-11 06:39:06,113135,1243,view,0
8999998,2021-11-11 06:39:19,113135,2361,view,1
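
Loading millions of rows in one call can exhaust memory. A possible alternative is to read the result in chunks via the `chunksize` parameter of `pd.read_sql` and downcast each chunk before concatenating. The sketch below uses an in-memory SQLite database with invented rows as a stand-in for PostgreSQL; with some database drivers the full result may still be buffered on the client unless server-side cursors are enabled.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for PostgreSQL; the rows are invented.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "user_id": range(10),
    "post_id": range(100, 110),
    "action": ["view"] * 8 + ["like"] * 2,
    "target": [0, 1, 0, 0, 1, 0, 0, 0, None, None],
}).to_sql("feed_data", conn, index=False)

query = "SELECT user_id, post_id, target FROM feed_data WHERE action = 'view'"

chunks = []
# With chunksize set, read_sql yields DataFrames of at most that many rows.
for chunk in pd.read_sql(query, conn, chunksize=3):
    # Downcast per chunk to keep the peak memory footprint low.
    chunk["target"] = chunk["target"].astype("int8")
    chunks.append(chunk)

feed_sample = pd.concat(chunks, ignore_index=True)
conn.close()
print(feed_sample.shape)
```

Selecting only the needed columns in the query, as done here, also avoids transferring the constant `action` column.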


Drop the `action` column: after the filter it contains only `view`.

In [20]:
feed_data.drop('action', axis=1, inplace=True)

Create time features: extract the month and the hour from the action timestamp.

In [21]:
feed_data["timestamp"] = pd.to_datetime(feed_data["timestamp"])

feed_data['month'] = feed_data["timestamp"].dt.month.astype(int)
feed_data['hour'] = feed_data["timestamp"].dt.hour.astype(int)

feed_data.drop('timestamp', axis=1, inplace=True)

In [22]:
feed_data

Unnamed: 0,user_id,post_id,target,month,hour
0,132130,1866,0,10,20
1,132130,5582,0,10,20
2,132130,582,0,10,20
3,132130,1344,0,10,20
4,132130,6852,0,10,20
...,...,...,...,...,...
8999995,113135,5769,0,11,6
8999996,113135,2512,0,11,6
8999997,113135,1243,0,11,6
8999998,113135,2361,1,11,6


Merge all tables into one, then drop the identifier columns and the raw text.

In [23]:
merged_df = pd.merge(feed_data, user_data, on='user_id', how='inner')

In [26]:
df = pd.merge(merged_df, post_text_df, on='post_id', how='inner')

In [29]:
df.drop(['user_id', 'post_id', 'text'], axis=1, inplace=True)

In [None]:
df

Unnamed: 0,target,month,hour,gender,age,country,city,exp_group,os,source,topic,TotalTfIdf,MaxTfIdf,MeanTfIdf,TextCluster,DistanceToCluster_0,DistanceToCluster_1,DistanceToCluster_2
0,0,12,15,1,26,Russia,Moscow,3,Android,organic,movie,8.905982,0.288883,0.000193,1,2.572731,1.748103,3.004548
1,0,12,15,1,26,Russia,Moscow,3,Android,organic,politics,8.975713,0.450692,0.000195,0,1.977838,2.436120,2.828584
2,0,12,15,1,26,Russia,Moscow,3,Android,organic,tech,8.943814,0.419935,0.000194,0,2.683954,2.914568,3.390737
3,0,12,15,1,26,Russia,Moscow,3,Android,organic,movie,10.567959,0.299109,0.000230,1,2.645297,2.062866,3.378429
4,0,12,15,1,26,Russia,Moscow,3,Android,organic,movie,5.601328,0.568047,0.000122,1,2.655525,1.426013,2.593185
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8999995,0,12,12,0,32,Ukraine,Novomoskovsk,4,iOS,organic,movie,6.188172,0.316797,0.000134,1,2.777361,1.592560,2.774757
8999996,0,12,12,0,32,Ukraine,Novomoskovsk,4,iOS,organic,business,12.652122,0.204141,0.000275,0,2.363652,3.170294,3.213653
8999997,0,12,12,0,32,Ukraine,Novomoskovsk,4,iOS,organic,movie,6.512940,0.240806,0.000141,1,2.747341,1.674026,2.640491
8999998,0,12,12,0,32,Ukraine,Novomoskovsk,4,iOS,organic,movie,4.720363,0.339974,0.000103,1,2.774836,1.869421,2.795029
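
An inner merge can silently change the number of rows: it drops feed rows with no match and duplicates them if the key is not unique on the right side. The sketch below, on invented toy tables, shows one way to guard against both with the `validate` argument of `pd.merge` and a row-count check. The table contents are assumptions.

```python
import pandas as pd

# Toy tables; the values are invented for illustration.
feed = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "post_id": [10, 11, 10, 12],
    "target": [0, 1, 0, 0],
})
users = pd.DataFrame({"user_id": [1, 2], "age": [34, 37]})

# validate raises MergeError if user_id is not unique on the right side,
# which would otherwise silently duplicate feed rows.
merged = pd.merge(feed, users, on="user_id", how="inner", validate="many_to_one")

# An inner join drops feed rows whose user is missing from the user table.
dropped = len(feed) - len(merged)
print(len(merged), dropped)
```

Comparing `len(feed_data)` with `len(merged_df)` and `len(df)` after each merge in the notebook would reveal any such loss.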


Split the data into features and the target variable, then into training and test sets.

In [30]:
X = df.drop('target', axis=1)
y = df['target']

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
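
The split above is purely random. When positives are rare, passing `stratify=y` to `train_test_split` keeps the share of positives the same in both parts. The sketch below demonstrates this on synthetic data with 10% positives; the arrays are invented, and on a dataset of several million rows the difference from a plain random split is likely to be small.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 100 + [0] * 900)   # 10% positives

# stratify=y_toy preserves the class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```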

Train the model with the CatBoost library.

In [35]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.8-cp312-cp312-manylinux2014_x86_64.whl (99.2 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.8


In [36]:
from catboost import CatBoostClassifier

In [37]:
object_cols = [
    'topic', 'TextCluster', 'gender', 'country',
    'city', 'exp_group', 'hour', 'month',
    'os', 'source'
]

In [39]:
catboost = CatBoostClassifier(iterations=200,
                              learning_rate=1,
                              depth=2,
                              random_seed=12345612,
                              thread_count=-1,
                              task_type="GPU",
                              verbose=10,
                              cat_features=object_cols)

In [40]:
catboost.fit(X_train, y_train)

0:	learn: 0.3620159	total: 318ms	remaining: 1m 3s
10:	learn: 0.3503648	total: 3.37s	remaining: 58s
20:	learn: 0.3487492	total: 6.32s	remaining: 53.9s
30:	learn: 0.3482955	total: 9.01s	remaining: 49.1s
40:	learn: 0.3474757	total: 11.7s	remaining: 45.2s
50:	learn: 0.3470533	total: 14.8s	remaining: 43.3s
60:	learn: 0.3467422	total: 17.7s	remaining: 40.4s
70:	learn: 0.3464980	total: 20.7s	remaining: 37.6s
80:	learn: 0.3463216	total: 23.6s	remaining: 34.7s
90:	learn: 0.3461664	total: 26.3s	remaining: 31.5s
100:	learn: 0.3459855	total: 29.1s	remaining: 28.5s
110:	learn: 0.3458691	total: 31.7s	remaining: 25.4s
120:	learn: 0.3458170	total: 34.2s	remaining: 22.3s
130:	learn: 0.3457259	total: 36.7s	remaining: 19.4s
140:	learn: 0.3456656	total: 39.5s	remaining: 16.5s
150:	learn: 0.3456076	total: 42.3s	remaining: 13.7s
160:	learn: 0.3455217	total: 45s	remaining: 10.9s
170:	learn: 0.3454548	total: 47.5s	remaining: 8.06s
180:	learn: 0.3453788	total: 50.2s	remaining: 5.27s
190:	learn: 0.3453051	total

<catboost.core.CatBoostClassifier at 0x7c30c4889a90>

Evaluate the model quality with ROC AUC.

In [41]:
from sklearn.metrics import roc_auc_score

print(f"Train ROC AUC: {roc_auc_score(y_train, catboost.predict_proba(X_train)[:, 1])}")
print(f"Test ROC AUC: {roc_auc_score(y_test, catboost.predict_proba(X_test)[:, 1])}")


Train ROC AUC: 0.6791472261543421
Test ROC AUC: 0.6791403705526892
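
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this interpretation on a few invented labels and scores by counting correctly ordered pairs and comparing the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented labels and scores for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.05])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Under this reading, a value near 0.68 means the model orders a liked view above a non-liked one in roughly two out of three pairs.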


Save the model.

In [None]:
catboost.save_model('model_test', format="cbm")

Save the tables with post and user features.

In [57]:
post_text_df.to_sql('m_3_posts_features_v7', con=engine, if_exists='replace', index=False)

23

In [64]:
user_data.to_sql('m_3_users_features_v7', con=engine, if_exists='replace', index=False)

205
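
After writing a feature table it can be worth reading it back to confirm that it round-trips. The sketch below does this against an in-memory SQLite database standing in for PostgreSQL; the table name `posts_features_demo` and the rows are hypothetical. It also shows that `if_exists='replace'` recreates the table, so running the cell twice does not append duplicates.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for PostgreSQL; the table name is hypothetical.
conn = sqlite3.connect(":memory:")
features = pd.DataFrame({"post_id": [1, 2, 3], "TextCluster": [4, 4, 11]})

# if_exists="replace" drops and recreates the table on every call.
features.to_sql("posts_features_demo", conn, if_exists="replace", index=False)
features.to_sql("posts_features_demo", conn, if_exists="replace", index=False)

restored = pd.read_sql("SELECT * FROM posts_features_demo", conn)
conn.close()
print(len(restored), list(restored.columns))
```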