# Rec Sys

## Assumptions and what we check in the project solution

- In practice we want to produce recommendations quickly. We therefore require the algorithm to take no more than ~0.5 seconds per request and to use no more than ~4 GB of memory (the figures are approximate).
- The set of users is fixed; no new users will be added.
- The checker will evaluate the model within the same time period that you see in the database.
- Models are not retrained when the services are used. We expect your code to import an already trained model and apply it.
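
To check a candidate against these limits locally, something like the following can be used. It is only a sketch: `measure` and `fake_recommend` are hypothetical names, and `tracemalloc` counts only memory allocated by Python itself, so the process-level figure (RSS) will be higher.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced memory in MB)."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - started
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / 1024 ** 2


def fake_recommend(user_id: int, limit: int = 5) -> list:
    # hypothetical stand-in for the real recommendation function
    return [(user_id + i) % 7000 for i in range(limit)]


result, seconds, peak_mb = measure(fake_recommend, 200, limit=5)
print(f"{seconds:.4f} s, peak {peak_mb:.2f} MB")
```

A single call is a rough estimate; averaging over many different users gives a more reliable number.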

## 0. Notes & Ideas

In [1]:
# - 

- On tables. I suggest not overcomplicating the system and working only with the posts table. When you build new post features, save them into a single table of the same size, i.e. roughly 7,000 rows.
- On RAM. For training, 5 million rows from the feed table are enough; that is correct.

**A few hints:**
- The whole point of building a machine learning model is precisely that we do not have to serve all the data.

- In the service you will need to load all rows with likes in order to filter out the posts that the given user has already liked.

- Work directly in the database with an SQL query to build the required 5-million-row dataset. For training you do not need to run the processing over all the data.

- In the service, load only rows with action = 'like'.
- On the query at service startup. Reduce the selected columns to the two specific columns the service needs, and the query must include action = 'like'.
- On user features: engineered post features will give the largest quality gain, so I suggest starting with them.
- The timestamp parameter can be used as a basis for your own features, such as hour of day, day of month, etc. It does not take part in filtering.

- You can load the liked posts straight into a dataframe and do the sorting on dataframes.
- It is better not to do the final project in JupyterHub, because it does not have enough computing resources. Try training the model on Google Colab or Kaggle.
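
The hints about likes can be sketched without any database. The snippet assumes the startup query has already returned `(user_id, post_id)` pairs for rows with action = 'like'; `build_liked_index` and `drop_liked` are hypothetical helper names, not part of the project template.

```python
from collections import defaultdict


def build_liked_index(like_rows):
    """Map user_id -> set of post_ids from (user_id, post_id) rows with action = 'like'."""
    liked = defaultdict(set)
    for user_id, post_id in like_rows:
        liked[user_id].add(post_id)
    return liked


def drop_liked(ranked_post_ids, liked_by_user, limit):
    """Keep ranking order, skip posts the user already liked, return at most `limit` ids."""
    result = []
    for post_id in ranked_post_ids:
        if len(result) >= limit:
            break
        if post_id not in liked_by_user:
            result.append(post_id)
    return result


# toy data: user 200 liked posts 11 and 13
like_rows = [(200, 11), (200, 13), (201, 12)]
liked = build_liked_index(like_rows)

ranked = [11, 12, 13, 14, 15]  # best candidate first
top = drop_liked(ranked, liked.get(200, set()), limit=2)
print(top)  # [12, 14]
```

Building the index once at startup keeps the per-request work to a few set lookups.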

- Why do you need the feed table to serve recommendations? The feed table is needed for training, because it has the target column and the user-post interactions.
- The service that produces recommendations does not need user-post interactions at all. It only needs the users and the posts themselves.
- For the LMS you should return the table that contains information about users.

- Q: should we keep only the pure user and post features, without identifiers?  
A: the final model must not use these ids. The dataframe passed to the model must not contain any ids.

* The `time` parameter: you can use it in the model features, for example to derive hour of day, day of week, etc.
You may skip this argument if your model works well without such features.
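
A possible way to turn the `time` argument into features is sketched below. The function name and the exact feature set are assumptions; whichever features are chosen must also be computed the same way from `timestamp` at training time.

```python
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Derive simple calendar features from a timestamp."""
    return {
        "hour": ts.hour,              # 0..23
        "weekday": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "day_of_month": ts.day,       # 1..31
        "is_weekend": int(ts.weekday() >= 5),
    }


features = time_features(datetime(2021, 10, 15, 16, 2, 50))
print(features)
```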

- all queries will have to be removed from the endpoint; they will certainly take longer than 0.5 s
- try taking more data, for example 1M or 700k records.
- it may also be worth training a shallower model

## 1. Loading data from the database (DB) and data overview

In the first stage we connect to the database, export the required data and load it into JupyterHub for analysis. The goal at this point is to understand the structure of the data, spot possible missing values or anomalies, and get a general picture of the distribution and composition of the data. The analysis covers both the features and the target variable.



In [33]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [2]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

### USER_DATA

In [5]:
user_df = pd.read_sql('SELECT * FROM "user_data"', con=engine)

In [6]:
user_df

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


In [6]:
user_df.user_id.nunique()

163205

In [7]:
user_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    163205 non-null  int64 
 1   gender     163205 non-null  int64 
 2   age        163205 non-null  int64 
 3   country    163205 non-null  object
 4   city       163205 non-null  object
 5   exp_group  163205 non-null  int64 
 6   os         163205 non-null  object
 7   source     163205 non-null  object
dtypes: int64(4), object(4)
memory usage: 44.5 MB


In [8]:
user_df.gender.value_counts()

1    89980
0    73225
Name: gender, dtype: int64

In [9]:
user_df.age.value_counts()

20    10280
21    10139
19     9802
22     9049
18     9034
      ...  
86        1
83        1
85        1
92        1
95        1
Name: age, Length: 76, dtype: int64

In [10]:
user_df.age.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])

count    163205.000000
mean         27.195405
std          10.239158
min          14.000000
1%           14.000000
5%           16.000000
25%          19.000000
50%          24.000000
75%          33.000000
95%          48.000000
99%          58.000000
max          95.000000
Name: age, dtype: float64

In [11]:
user_df.country.value_counts()

Russia         143035
Ukraine          8273
Belarus          3293
Kazakhstan       3172
Turkey           1606
Finland          1599
Azerbaijan       1542
Estonia           178
Latvia            175
Cyprus            170
Switzerland       162
Name: country, dtype: int64

In [12]:
user_df.city.nunique()

3915

In [13]:
user_df.os.value_counts()

Android    105972
iOS         57233
Name: os, dtype: int64

In [14]:
user_df.source.value_counts()

ads        101685
organic     61520
Name: source, dtype: int64

In [15]:
user_df.exp_group.value_counts()

3    32768
0    32723
1    32638
2    32614
4    32462
Name: exp_group, dtype: int64

### POST_TEXT_DF

In [7]:
post_df = pd.read_sql('SELECT * FROM "post_text_df"', con=engine)

In [8]:
post_df

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business
...,...,...,...
7018,7315,"OK, I would not normally watch a Farrelly brot...",movie
7019,7316,I give this movie 2 stars purely because of it...,movie
7020,7317,I cant believe this film was allowed to be mad...,movie
7021,7318,The version I saw of this film was the Blockbu...,movie


In [18]:
post_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   post_id  7023 non-null   int64 
 1   text     7023 non-null   object
 2   topic    7023 non-null   object
dtypes: int64(1), object(2)
memory usage: 9.8 MB


In [19]:
post_df.post_id.nunique()

7023

In [20]:
print(post_df.text[:2].values)

['UK economy facing major risks\n\nThe UK manufacturing sector will continue to face serious challenges over the next two years, the British Chamber of Commerce (BCC) has said.\n\nThe groups quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced major risks and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.\n\nManufacturers domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.\n\nDespite some positive news for the export sector, there are worrying signs for manufacturing, the BCC said. These results reinforce our concern over the sectors persistent inabil

In [21]:
print(post_df.text[:1].values[0], sep='\n')

UK economy facing major risks

The UK manufacturing sector will continue to face serious challenges over the next two years, the British Chamber of Commerce (BCC) has said.

The groups quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced major risks and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.

Manufacturers domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.

Despite some positive news for the export sector, there are worrying signs for manufacturing, the BCC said. These results reinforce our concern over the sectors persistent inability to sus

In [22]:
post_df.topic.value_counts()

movie            3000
covid            1799
business          510
sport             510
politics          417
tech              401
entertainment     386
Name: topic, dtype: int64

### POST_TEXT_PCA

In [9]:
post_df_pca = pd.read_sql('SELECT * FROM "post_text"', con=engine)

In [10]:
post_df_pca

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.046990
1,2,business,-0.102748,-0.316642,0.021755,-0.045080,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667
3,4,business,-0.075025,-0.231227,0.010022,-0.035137,0.000382
4,5,business,-0.086689,-0.252862,0.017764,-0.032102,0.054850
...,...,...,...,...,...,...,...
7018,7315,movie,-0.202135,0.246869,-0.003796,-0.270650,0.034498
7019,7316,movie,-0.212412,0.296487,-0.005319,-0.208802,0.099140
7020,7317,movie,-0.187134,0.195667,0.013325,0.380744,0.118247
7021,7318,movie,-0.143072,0.082273,0.005458,0.219139,0.029975


In [25]:
post_df_pca.post_id.nunique()

7023

In [26]:
post_df_pca.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   post_id  7023 non-null   int64  
 1   topic    7023 non-null   object 
 2   PCA_1    7023 non-null   float64
 3   PCA_2    7023 non-null   float64
 4   PCA_3    7023 non-null   float64
 5   PCA_4    7023 non-null   float64
 6   PCA_5    7023 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 759.9 KB


### FEED_DATA

In [27]:
feed_data_size = pd.read_sql('SELECT COUNT(*) as table_size FROM "feed_data"', con=engine)

In [28]:
feed_data_size

Unnamed: 0,table_size
0,76892800


In [29]:
dates_range_df = pd.read_sql('SELECT MIN(timestamp) AS min_date, MAX(timestamp) AS max_date FROM "feed_data"', con=engine)

In [30]:
dates_range_df

Unnamed: 0,min_date,max_date
0,2021-10-01 06:01:40,2021-12-29 23:51:06


In [86]:
feed_df = pd.read_sql("SELECT * FROM feed_data WHERE action != 'like' LIMIT 50000", con=engine)

In [87]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-10-15 16:02:50,98488,5174,view,0
1,2021-10-15 16:04:44,98488,6536,view,0
2,2021-10-15 16:06:21,98488,5936,view,1
3,2021-10-15 16:08:19,98488,6096,view,0
4,2021-10-15 16:11:07,98488,1683,view,1
...,...,...,...,...,...
49995,2021-10-26 07:52:24,19548,3325,view,0
49996,2021-10-26 07:54:42,19548,1530,view,0
49997,2021-10-26 07:57:09,19548,1332,view,0
49998,2021-10-26 07:59:04,19548,1852,view,0


In [33]:
feed_df.timestamp.min()

Timestamp('2021-10-01 08:19:39')

In [34]:
feed_df.timestamp.max()

Timestamp('2021-12-29 23:25:55')

In [35]:
feed_df.iloc[30:40]

Unnamed: 0,timestamp,user_id,post_id,action,target
30,2021-12-13 18:38:03,84189,343,view,0
31,2021-12-13 18:39:10,84189,4939,view,0
32,2021-12-13 18:41:46,84189,6123,view,0
33,2021-12-13 18:44:37,84189,4907,view,1
34,2021-12-13 18:47:21,84189,2651,view,0
35,2021-12-13 18:48:34,84189,6032,view,0
36,2021-12-13 18:50:04,84189,618,view,0
37,2021-12-13 18:51:19,84189,4025,view,0
38,2021-12-13 18:53:54,84189,4909,view,0
39,2021-12-13 18:54:14,84189,5401,view,0


In [36]:
feed_df.timestamp.nunique()

46526

In [37]:
feed_df.user_id.nunique()

121

In [38]:
feed_df.post_id.nunique()

6801

In [39]:
feed_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  50000 non-null  datetime64[ns]
 1   user_id    50000 non-null  int64         
 2   post_id    50000 non-null  int64         
 3   action     50000 non-null  object        
 4   target     50000 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 4.4 MB


In [40]:
feed_df.isna().sum()

timestamp    0
user_id      0
post_id      0
action       0
target       0
dtype: int64

In [41]:
feed_df.action.value_counts()

view    50000
Name: action, dtype: int64

In [42]:
feed_df.target.value_counts()

0    44028
1     5972
Name: target, dtype: int64

## 2. Feature engineering and building the training set

At this stage we create new features that may be useful for the model. Features can include information about the user (for example age, gender, interaction history), information about the posts (texts, topics, categories), and additional statistics such as like frequency or user engagement. Once the features are generated, we build the training set, which contains all the data needed to train the model.
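
As an illustration of a "like frequency" statistic, here is a dependency-free sketch that aggregates feed rows per post. `post_like_stats` is a hypothetical helper; in the notebook itself a pandas `groupby` on `post_id` would do the same job. To avoid leaking the target, such statistics should be computed on the training rows only.

```python
from collections import Counter


def post_like_stats(feed_rows):
    """Aggregate (user_id, post_id, target) rows into per-post views, likes and like rate."""
    views = Counter()
    likes = Counter()
    for _user_id, post_id, target in feed_rows:
        views[post_id] += 1
        likes[post_id] += target
    return {
        post_id: {
            "views": views[post_id],
            "likes": likes[post_id],
            "like_rate": likes[post_id] / views[post_id],
        }
        for post_id in views
    }


# toy rows in the shape of feed_df after dropping `timestamp` and `action`
rows = [(1, 10, 1), (2, 10, 0), (3, 10, 1), (1, 20, 0)]
stats = post_like_stats(rows)
print(stats[10])
```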

In [43]:
user_df.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [44]:
post_df.head(3)

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business


In [45]:
post_df_pca.head(3)

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.04699
1,2,business,-0.102748,-0.316642,0.021755,-0.04508,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667


In [46]:
feed_df.head(3)

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-11-26 14:01:12,84189,1776,view,0
1,2021-11-26 14:03:43,84189,1486,view,0
2,2021-11-26 14:04:26,84189,438,view,0


In [47]:
feed_df.drop('action', axis=1, inplace=True)

In [48]:
feed_df.head(3)

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-11-26 14:01:12,84189,1776,0
1,2021-11-26 14:03:43,84189,1486,0
2,2021-11-26 14:04:26,84189,438,0


In [49]:
feed_df.drop('timestamp', axis=1, inplace=True)
feed_df.head(3)

Unnamed: 0,user_id,post_id,target
0,84189,1776,0
1,84189,1486,0
2,84189,438,0


In [50]:
df = feed_df.merge(user_df, on='user_id', how='inner')

In [51]:
df = df.merge(post_df_pca, on='post_id', how='inner')

In [52]:
df

Unnamed: 0,user_id,post_id,target,gender,age,country,city,exp_group,os,source,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,84189,1776,0,1,30,Russia,Kaliningrad,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
1,38801,1776,0,0,17,Russia,Chita,1,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
2,19477,1776,0,1,15,Russia,Tomsk,2,Android,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
3,38802,1776,0,0,38,Russia,Novotroitsk,2,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
4,38806,1776,0,0,14,Russia,Moscow,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,19504,4521,0,1,17,Russia,Saint Petersburg,4,iOS,ads,movie,-0.237518,0.343240,0.002746,-0.089310,0.072245
49996,70468,6614,0,0,45,Russia,Moscow,2,Android,ads,movie,-0.204158,0.253604,-0.002414,-0.211505,-0.002567
49997,70468,4920,1,0,45,Russia,Moscow,2,Android,ads,movie,-0.161189,0.183307,0.000495,-0.073823,0.041760
49998,38824,5499,0,0,22,Russia,Barnaul,1,iOS,ads,movie,-0.162954,0.132514,-0.004940,-0.129042,-0.056654


In [53]:
X = df.drop(['target', 'user_id', 'post_id'], axis=1)
y = df.target

In [54]:
X.head(3)

Unnamed: 0,gender,age,country,city,exp_group,os,source,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,30,Russia,Kaliningrad,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
1,0,17,Russia,Chita,1,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
2,1,15,Russia,Tomsk,2,Android,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542


In [55]:
y[0:3]

0    0
1    0
2    0
Name: target, dtype: int64

In [56]:
# these are just basic features to prove the concept and get a baseline solution
# I will come back later to do the feature engineering step properly
# tbc..

## 3. Training the model and evaluating its quality

Using the training set, we train the model, choosing an algorithm and its parameters. After training we tune the model and check its quality on a validation set. Quality is measured with metrics such as precision, recall or ROC-AUC. This stage shows how well the model can make predictions and where it can be improved.

It is important to understand that a higher local ROC-AUC does not always guarantee a better hitrate in the LMS. We therefore advise checking how changes in your validation metric affect the hitrate in the LMS, to make sure the effect is positive.
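
The exact hitrate formula used by the LMS is not given here. A common definition, assumed in the sketch below, counts a user as a hit when at least one of the first k recommended posts was liked. `hitrate_at_k` is a hypothetical helper for local validation.

```python
def hitrate_at_k(recommendations, liked, k=5):
    """Share of users with at least one liked post among their first k recommendations.

    recommendations: dict user_id -> list of post_ids, best first
    liked: dict user_id -> set of post_ids the user liked in the evaluation period
    """
    if not recommendations:
        return 0.0
    hits = 0
    for user_id, post_ids in recommendations.items():
        if set(post_ids[:k]) & liked.get(user_id, set()):
            hits += 1
    return hits / len(recommendations)


recs = {200: [1, 2, 3], 201: [4, 5, 6], 202: [7, 8, 9]}
liked = {200: {3}, 201: {10}, 202: {7, 8}}
print(hitrate_at_k(recs, liked, k=2))  # only user 202 is a hit
```

Unlike ROC-AUC, this metric only looks at the top of each user's ranking, which is one reason the two can move in different directions.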

In [57]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42, 
                                                    test_size=0.25)

In [58]:
cat_cols = X.select_dtypes(include='object').columns.to_list()
cat_cols

['country', 'city', 'os', 'source', 'topic']

In [59]:
from catboost import CatBoostClassifier


catboost = CatBoostClassifier()
catboost.fit(X_train, y_train, cat_features=cat_cols, verbose=100)

Learning rate set to 0.048422
0:	learn: 0.6575133	total: 88ms	remaining: 1m 27s
100:	learn: 0.3535494	total: 2.1s	remaining: 18.7s
200:	learn: 0.3482397	total: 4.3s	remaining: 17.1s
300:	learn: 0.3426794	total: 6.94s	remaining: 16.1s
400:	learn: 0.3377525	total: 9.63s	remaining: 14.4s
500:	learn: 0.3335170	total: 12.3s	remaining: 12.2s
600:	learn: 0.3290967	total: 14.9s	remaining: 9.9s
700:	learn: 0.3250677	total: 17.8s	remaining: 7.6s
800:	learn: 0.3213377	total: 20.7s	remaining: 5.13s
900:	learn: 0.3176110	total: 23.3s	remaining: 2.56s
999:	learn: 0.3142838	total: 26s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f7a50e09490>

In [60]:
# Predict class labels
y_pred = catboost.predict(X_test)

# Predict probabilities
y_pred_proba = catboost.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [61]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.88328
Precision: 0.3076923076923077
Recall: 0.002751031636863824
F1 Score: 0.0054533060668029995
ROC-AUC Score: 0.6440508505011305

Confusion Matrix:
 [[11037     9]
 [ 1450     4]]
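
The printed metrics can be recomputed by hand from the confusion matrix above, which helps when reading them. The sketch assumes the scikit-learn layout: rows are true classes, columns are predicted classes.

```python
# confusion matrix from the output above: [[tn, fp], [fn, tp]]
tn, fp = 11037, 9
fn, tp = 1450, 4
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# accuracy of a trivial model that always predicts class 0
baseline_accuracy = (tn + fp) / total

print(accuracy, precision, recall, f1, baseline_accuracy)
```

With about 12% positives, always predicting 0 gives slightly higher accuracy than the model, so accuracy says little here. For ranking posts, the predicted probabilities (and ROC-AUC) matter more than labels at the default threshold.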


In [62]:
feature_importance = catboost.get_feature_importance()
feature_names = catboost.feature_names_

print('FEATURE IMPORTANCE report')
for n, i in zip(feature_names, feature_importance):
    print(n, round(i, 2), sep=': ')

FEATURE IMPORTANCE report
gender: 1.63
age: 14.44
country: 3.34
city: 13.95
exp_group: 6.18
os: 1.25
source: 1.46
topic: 7.61
PCA_1: 8.87
PCA_2: 9.63
PCA_3: 8.85
PCA_4: 11.47
PCA_5: 11.31


In [63]:
# this is just a baseline solution
# right now I do not care about model quality and metrics
# tbc..

## 4. Saving the trained model

Once the model has been trained and its quality meets the requirements, we save it in the format required by the model/library. This file becomes the basis for all further use of the model, since it contains everything needed for predictions, including weights and parameters.



In [64]:
catboost.save_model('catboost_model.cbm')

In [65]:
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

<catboost.core.CatBoostClassifier at 0x7f7a228deee0>

In [66]:
# Predict class labels
y_pred = loaded_model.predict(X_test)

# Predict probabilities
y_pred_proba = loaded_model.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [67]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.88328
Precision: 0.3076923076923077
Recall: 0.002751031636863824
F1 Score: 0.0054533060668029995
ROC-AUC Score: 0.6440508505011305

Confusion Matrix:
 [[11037     9]
 [ 1450     4]]


In [68]:
feature_importance = loaded_model.get_feature_importance()
feature_names = loaded_model.feature_names_

print('FEATURE IMPORTANCE report')
for n, i in zip(feature_names, feature_importance):
    print(n, round(i, 2), sep=': ')

FEATURE IMPORTANCE report
gender: 1.63
age: 14.44
country: 3.34
city: 13.95
exp_group: 6.18
os: 1.25
source: 1.46
topic: 7.61
PCA_1: 8.87
PCA_2: 9.63
PCA_3: 8.85
PCA_4: 11.47
PCA_5: 11.31


In [69]:
# Got the same results as before, which proves that saving the model to a file and loading it back works!

### 4.1 Step 5 - loading model for LMS checker

In [70]:
import os
import pandas as pd
import numpy as np


FILE_NAME = '/catboost_model.cbm'

# getting path to a model
def get_model_path(path: str) -> str:
    if os.environ.get("IS_LMS") == "1":  # check whether the code runs in the LMS or locally. A bit of magic
        MODEL_PATH = '/workdir/user_input/model'
    else:
        MODEL_PATH = path + FILE_NAME
    return MODEL_PATH


# loading the model
def load_models():
    model_path = get_model_path("/home/karpov/mle/03_ml/22_rec_sys")
    from catboost import CatBoostClassifier
    loaded_model = CatBoostClassifier()
    loaded_model.load_model(model_path)
    return loaded_model
    

# Creating data for checker
num_rows = 10
data = {
    'gender': np.random.choice([1, 0], size=num_rows),  # Randomly choose gender
    'age': np.random.randint(18, 65, size=num_rows),  # Random age between 18 and 65
    'country': np.random.choice(['USA', 'Canada', 'UK', 'Germany'], size=num_rows),  # Random country
    'city': np.random.choice(['New York', 'Toronto', 'London', 'Berlin'], size=num_rows),  # Random city
    'exp_group': np.random.choice([1, 2, 3, 4], size=num_rows),  # Random experiment group
    'os': np.random.choice(['Windows', 'Mac', 'Linux'], size=num_rows),  # Random OS
    'source': np.random.choice(['Google', 'Facebook', 'Direct'], size=num_rows),  # Random source
    'topic': np.random.choice(['Sports', 'Politics', 'Technology', 'Entertainment'], size=num_rows),  # Random topic
    'PCA_1': np.random.normal(0, 1, size=num_rows),  # Random PCA component 1
    'PCA_2': np.random.normal(0, 1, size=num_rows),  # Random PCA component 2
    'PCA_3': np.random.normal(0, 1, size=num_rows),  # Random PCA component 3
    'PCA_4': np.random.normal(0, 1, size=num_rows),  # Random PCA component 4
    'PCA_5': np.random.normal(0, 1, size=num_rows),  # Random PCA component 5
}
X_train_fake = pd.DataFrame(data)


# loading the model and making some prediction on fake data for LMS checker
model = load_models()
model.predict(X_train_fake)
model.predict_proba(X_train_fake)
print('Success!')

Success!


## 5. Building a service that uses the model

Here we build a service that makes it possible to interact with the model in real time. The service goes through the following steps:

- Loading the model: at startup the service loads the previously saved model from a file.
- Getting the features: the service receives requests with a user_id and, based on it, either builds the features needed for the prediction or loads them from the tables you uploaded to the Karpov Courses database. The features at prediction time must match the features used when the model was trained.
- Prediction: using the loaded model and the features, the service makes a prediction, i.e. determines the posts the user is likely to like.
- Returning the response: the service returns a response with the prediction results.


Important: for the checking system (the checker) to be able to test the service correctly, the service itself and the model must be uploaded together.
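
The requirement that prediction-time features match the training features can be enforced explicitly. Below is a minimal sketch with a hypothetical `align_row` helper; in the notebook the list of names could be taken from the trained model's `feature_names_` attribute.

```python
def align_row(row: dict, feature_names: list) -> list:
    """Return the values of `row` in the exact order the model was trained on."""
    missing = [name for name in feature_names if name not in row]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [row[name] for name in feature_names]


# shortened, illustrative feature list
feature_names = ["gender", "age", "os", "topic", "PCA_1"]

user = {"user_id": 200, "gender": 1, "age": 34, "os": "Android"}
post = {"post_id": 1, "topic": "business", "PCA_1": -0.098651}

# ids are present in the merged row but are not passed to the model
row = align_row({**user, **post}, feature_names)
print(row)
```

Failing loudly on a missing column is usually easier to debug than a silent shift in column order.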

### 5.1 Step 6 - Getting features

In [71]:
user_df.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [72]:
user_df.to_sql('nktn_lx_step6_draft', con=engine, if_exists='replace', index=False) # write the table

205

In [73]:
user_df_draft = pd.read_sql('SELECT * FROM "nktn_lx_step6_draft"', con=engine)

In [74]:
user_df_draft.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [75]:
user_df.user_id.nunique()

163205

In [76]:
user_df_draft.user_id.nunique()

163205

In [77]:
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)

In [78]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step6_draft"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [79]:
user_df_chunks = load_features()

In [80]:
user_df_chunks.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [81]:
user_df_chunks.user_id.nunique()

163205

## 6. Uploading the service to the LMS for checking (the checker)

After development is finished, the service and the model are uploaded to the LMS, where an automatic checker tests them. The checker verifies that the service meets the requirements, makes correct predictions, runs without errors, and responds to requests quickly enough. Passing the check confirms that the model is ready for production use.

### 6.1 Step 7 - Checking API draft

In [None]:
import os
import random
from typing import List

import pandas as pd
from sqlalchemy import create_engine
from fastapi import FastAPI
from schema import PostGet
from datetime import datetime


engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

# step 5 - start
# getting path to a model
def get_model_path(path: str) -> str:
    if os.environ.get("IS_LMS") == "1":  # check whether the code runs in the LMS or locally. A bit of magic
        MODEL_PATH = '/workdir/user_input/model'
    else:
        MODEL_PATH = path
    return MODEL_PATH


# loading the model
def load_models():
    model_path = get_model_path("/Users/nikitin_a/PycharmProjects/l22_rec_sys/catboost_model.cbm")
    from catboost import CatBoostClassifier
    loaded_model = CatBoostClassifier()
    loaded_model.load_model(model_path)
    return loaded_model


# loading the model
model = load_models()


# step 6 - start
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)


def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step7_draft"'
    loaded_features_df = batch_load_sql(QUERY)
    return loaded_features_df


# loading dataframe with features
features_df = load_features()


# step 7 - start
posts_df = pd.read_sql('SELECT * FROM "post_text_df"', con=engine)

app = FastAPI()

@app.get("/post/recommendations/", response_model=List[PostGet])
def recommended_posts(
        id: int,
        time: datetime,
        limit: int = 10) -> List[PostGet]:
    # copy the slice so that adding a column does not touch features_df
    user_df = features_df[features_df['user_id'] == id].copy()
    user_features_df = user_df.drop(['target', 'user_id', 'post_id'], axis=1)

    y_pred_proba = model.predict_proba(user_features_df)
    y_pred_proba_positive = y_pred_proba[:, 1]

    user_df['probability'] = y_pred_proba_positive
    # sort_values returns a new frame, so the result has to be assigned back
    user_df = user_df.sort_values('probability', ascending=False) \
        .drop_duplicates(subset='post_id', keep='first')

    ## TO-DO:
    # take the already liked posts into account, i.e.
    # the service has to load all rows with likes in order to filter out the posts the given user has already liked.


    top_posts_ids = user_df.head(limit).post_id.to_list()
    if len(top_posts_ids) < limit:
        random_items = limit - len(top_posts_ids)
        top_posts_ids.extend(random.sample(posts_df.post_id.to_list(), k=random_items))

    top_posts_df = posts_df.query('post_id in @top_posts_ids')

    result = [
        PostGet(
            id=row['post_id'],  # the response describes posts, so return the post id
            text=row.get('text', ''),
            topic=row.get('topic', '')
        )
        for _, row in top_posts_df.iterrows()
    ]

    return result


In [None]:
# this is just a service draft, not the working service itself
# I'm just checking that the data model is correct and that I receive responses from the API
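
One detail of the draft above: `random.sample` draws from all post ids, so a padded id can repeat one that the model already selected. A possible fix is sketched below; `pad_with_random` is a hypothetical helper, not part of the draft.

```python
import random


def pad_with_random(top_ids, all_ids, limit, rng=random):
    """Fill top_ids up to `limit` with random ids that are not already in the list."""
    padded = list(top_ids[:limit])
    missing = limit - len(padded)
    if missing > 0:
        chosen = set(padded)
        pool = [post_id for post_id in all_ids if post_id not in chosen]
        padded.extend(rng.sample(pool, k=min(missing, len(pool))))
    return padded


rng = random.Random(42)  # fixed seed only to make the example repeatable
result = pad_with_random([5, 7], all_ids=list(range(1, 11)), limit=5, rng=rng)
print(result)
```

The model's picks stay at the front in their original order, and the random ids only fill the remaining slots.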

### 6.2 - The Whole Pipeline

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import gc
import psutil

In [2]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

#### Building the 5M-row feed_df

For each user, collect the posts with likes.

In [3]:
query = """
SELECT 
  f.timestamp,
  f.user_id,
  f.post_id,
  f.target
FROM (
  SELECT
    fd.timestamp,
    fd.user_id,
    fd.post_id,
    fd.target,
    ROW_NUMBER() OVER(PARTITION BY fd.user_id ORDER BY fd.target DESC) rn
  FROM 
    feed_data fd
  WHERE 
    fd.action != 'like'
) AS f
WHERE 
  f.rn <=15
"""

In [4]:
feed_df_likes = pd.read_sql(query, con=engine)
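
To make the sampling logic of the query easier to follow, here is a rough pandas equivalent on a toy frame. It is only a sketch: the data are made up, `PER_USER` stands in for the limit of 15, and the helper `top_n_per_user` is hypothetical. SQL does not define the order of rows with equal `target`, so the exact rows picked by the database may differ.

```python
import pandas as pd

# Toy stand-in for the view rows of feed_data (values are made up)
views = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 2, 2],
    'post_id': [10, 11, 12, 13, 10, 14],
    'target':  [0, 1, 0, 1, 0, 0],
})

PER_USER = 2  # the notebook query uses 15


def top_n_per_user(df: pd.DataFrame, n: int, liked_first: bool) -> pd.DataFrame:
    """Mimic ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY target ...) <= n."""
    ordered = df.sort_values('target', ascending=not liked_first, kind='stable')
    return ordered.groupby('user_id', sort=False).head(n)


liked_sample = top_n_per_user(views, PER_USER, liked_first=True)
viewed_sample = top_n_per_user(views, PER_USER, liked_first=False)
print(liked_sample)
print(viewed_sample)
```

User 2 has no liked views in the toy data, so the "liked" sample still contains `target = 0` rows for that user. The same effect is visible in the notebook output, where the likes frame includes rows with `target` equal to 0.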

In [5]:
feed_df_likes

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
2448070,2021-11-07 06:41:39,168552,4760,1
2448071,2021-11-23 14:57:04,168552,3817,1
2448072,2021-12-07 18:22:13,168552,7063,1
2448073,2021-12-07 18:37:22,168552,3428,0


In [6]:
feed_df_likes.user_id.nunique()

163205

In [7]:
feed_df_likes.post_id.nunique()

6831

In [8]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 390.02 MB
Memory usage after gc: 390.02 MB


For each user, collect posts without likes: up to 15 views per user, ordered so that non-liked ones (`target = 0`) come first.

In [9]:
query = """
SELECT 
  f.timestamp,
  f.user_id,
  f.post_id,
  f.target
FROM (
  SELECT
    fd.timestamp,
    fd.user_id,
    fd.post_id,
    fd.target,
    ROW_NUMBER() OVER(PARTITION BY fd.user_id ORDER BY fd.target ASC) rn
  FROM 
    feed_data fd
  WHERE 
    fd.action != 'like'
) AS f
WHERE 
  f.rn <=15
"""

In [10]:
feed_df_views = pd.read_sql(query, con=engine)

In [11]:
feed_df_views

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-29 19:12:00,200,6738,0
1,2021-10-29 19:15:39,200,5007,0
2,2021-10-29 19:15:54,200,4998,0
3,2021-10-29 19:18:36,200,620,0
4,2021-10-29 19:19:30,200,5684,0
...,...,...,...,...
2448070,2021-10-14 11:03:56,168552,2829,0
2448071,2021-12-20 18:47:39,168552,3205,0
2448072,2021-10-14 11:02:20,168552,4428,0
2448073,2021-12-07 18:58:26,168552,1229,0


In [12]:
feed_df_views.user_id.nunique()

163205

In [13]:
feed_df_views.post_id.nunique()

6831

In [14]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 391.88 MB
Memory usage after gc: 391.88 MB


Combine the likes and the plain views.

In [15]:
feed_df = pd.concat([feed_df_likes, feed_df_views], ignore_index=True)

In [16]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [27]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

del feed_df_likes
del feed_df_views

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 728.68 MB


NameError: name 'feed_df_likes' is not defined

In [18]:
feed_df = feed_df.drop_duplicates()

In [19]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [20]:
feed_df.groupby('user_id', as_index=False).agg({'target': 'sum'}).sort_values('target', ascending=True).reset_index(drop=True).iloc[:14000]

Unnamed: 0,user_id,target
0,121046,0
1,162510,0
2,52233,0
3,89263,1
4,34612,1
...,...,...
13995,117605,15
13996,98752,15
13997,117730,15
13998,91970,15


In [21]:
feed_df.user_id.nunique()

163205

In [22]:
feed_df.post_id.nunique()

6831

In [23]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [24]:
feed_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4888776 entries, 0 to 4896149
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   user_id    int64         
 2   post_id    int64         
 3   target     int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 186.5 MB
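
Since memory is a hard constraint for the service, one option worth trying is to downcast the integer columns. The sketch below uses a toy frame and assumes the value ranges seen in the outputs above (ids well below the int32 limit, `target` being 0 or 1); it is an illustration, not a measured result for the real table.

```python
import numpy as np
import pandas as pd

# Toy frame with the same integer columns as feed_df (values are made up)
sample = pd.DataFrame({
    'user_id': np.arange(200, 1200, dtype='int64'),
    'post_id': np.arange(1, 1001, dtype='int64'),
    'target': np.tile([0, 1], 500).astype('int64'),
})

before = sample.memory_usage(deep=True).sum()

# Assumed value ranges: ids fit in int32, target is 0/1 and fits in int8
compact = sample.astype({'user_id': 'int32', 'post_id': 'int32', 'target': 'int8'})
after = compact.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```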


In [25]:
feed_df.target.sum()

2381338

In [28]:
feed_df.to_sql('nktn_lx_step8_feed_df', con=engine, if_exists='replace', index=False) # write the table to the database

776

In [3]:
def batch_load_sql(query: str) -> pd.DataFrame:
    """Load a query result in chunks to limit peak memory usage."""
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    chunks = []
    # The context manager closes the connection even if reading a chunk fails
    with engine.connect().execution_options(stream_results=True) as conn:
        for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
            chunks.append(chunk_dataframe)
    return pd.concat(chunks, ignore_index=True)

In [4]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step8_feed_df"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [5]:
feed_df_db = load_features()

In [6]:
feed_df_db

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4888771,2021-10-14 11:03:56,168552,2829,0
4888772,2021-12-20 18:47:39,168552,3205,0
4888773,2021-10-14 11:02:20,168552,4428,0
4888774,2021-12-07 18:58:26,168552,1229,0


In [7]:
feed_df_db.user_id.nunique()

163205

In [8]:
feed_df_db.post_id.nunique()

6831

In [9]:
feed_df_db.target.sum()

2381338

In [10]:
feed_df_db.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   user_id    int64         
 2   post_id    int64         
 3   target     int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 149.2 MB


In [11]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")


# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 328.97 MB
Memory usage after gc: 329.16 MB


In [26]:
feed_df['month'] = feed_df['timestamp'].dt.month          
feed_df['day'] = feed_df['timestamp'].dt.day              
feed_df['day_of_week'] = feed_df['timestamp'].dt.dayofweek  
feed_df['hour_of_day'] = feed_df['timestamp'].dt.hour     

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  feed_df['month'] = feed_df['timestamp'].dt.month

In [27]:
feed_df.drop(['timestamp'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  feed_df.drop(['timestamp'], axis=1, inplace=True)
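
The warnings above appear because columns are being added to a frame that pandas treats as a possible copy of a slice (here, apparently the result of `drop_duplicates()`). A minimal sketch of one way to avoid them, on made-up rows that mirror the first two records of `feed_df`: take an explicit copy and build the calendar features with `assign` instead of in-place assignment.

```python
import pandas as pd

raw = pd.DataFrame({
    'timestamp': pd.to_datetime(['2021-10-02 14:07:30', '2021-12-29 14:55:04']),
    'user_id': [200, 200],
    'post_id': [6264, 4200],
    'target': [1, 1],
})

# An explicit .copy() makes ownership unambiguous before adding columns
deduped = raw.drop_duplicates().copy()

features = deduped.assign(
    month=deduped['timestamp'].dt.month,
    day=deduped['timestamp'].dt.day,
    day_of_week=deduped['timestamp'].dt.dayofweek,  # Monday = 0
    hour_of_day=deduped['timestamp'].dt.hour,
).drop(columns=['timestamp'])

print(features)
```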


In [29]:
feed_df

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day
0,200,6264,1,10,2,5,14
1,200,4200,1,12,29,2,14
2,200,3567,1,12,29,2,14
3,200,3539,1,12,29,2,15
4,200,994,1,12,29,2,15
...,...,...,...,...,...,...,...
4896145,168552,2829,0,10,14,3,11
4896146,168552,3205,0,12,20,0,18
4896147,168552,4428,0,10,14,3,11
4896148,168552,1229,0,12,7,1,18


In [None]:
feed_df.to_sql('nktn_lx_step8_feed_df', con=engine, if_exists='replace', index=False) # write the table to the database

Load the posts that are missing.

In [23]:
# all_posts = set(post_df.post_id.to_list())
# len(all_posts)

# feed_posts = set(feed_df.post_id.unique())
# len(feed_posts)

# missing_posts = all_posts.difference(feed_posts)
# len(missing_posts)

# query = """
# SELECT 
#   f.timestamp,
#   f.user_id,
#   f.post_id,
#   f.target
# FROM 
#   feed_data f
# WHERE 
#   --f.action != 'like'
#     --AND 
#     post_id IN %(id_list)s
# """

# missed_posts_df = pd.read_sql(query, con=engine, params={'id_list': tuple(missing_posts)})

# missed_posts_df.post_id.nunique()

#### Getting user_df

In [27]:
user_df = pd.read_sql('SELECT * FROM "user_data"', con=engine)

In [28]:
user_df

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


In [29]:
user_df.user_id.nunique()

163205

In [30]:
user_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    163205 non-null  int64 
 1   gender     163205 non-null  int64 
 2   age        163205 non-null  int64 
 3   country    163205 non-null  object
 4   city       163205 non-null  object
 5   exp_group  163205 non-null  int64 
 6   os         163205 non-null  object
 7   source     163205 non-null  object
dtypes: int64(4), object(4)
memory usage: 44.5 MB


#### Getting post_df_pca

In [31]:
post_df_pca = pd.read_sql('SELECT * FROM "post_text"', con=engine)

In [32]:
post_df_pca

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.046990
1,2,business,-0.102748,-0.316642,0.021755,-0.045080,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667
3,4,business,-0.075025,-0.231227,0.010022,-0.035137,0.000382
4,5,business,-0.086689,-0.252862,0.017764,-0.032102,0.054850
...,...,...,...,...,...,...,...
7018,7315,movie,-0.202135,0.246869,-0.003796,-0.270650,0.034498
7019,7316,movie,-0.212412,0.296487,-0.005319,-0.208802,0.099140
7020,7317,movie,-0.187134,0.195667,0.013325,0.380744,0.118247
7021,7318,movie,-0.143072,0.082273,0.005458,0.219139,0.029975


In [33]:
post_df_pca.post_id.nunique()

7023

In [34]:
post_df_pca.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   post_id  7023 non-null   int64  
 1   topic    7023 non-null   object 
 2   PCA_1    7023 non-null   float64
 3   PCA_2    7023 non-null   float64
 4   PCA_3    7023 non-null   float64
 5   PCA_4    7023 non-null   float64
 6   PCA_5    7023 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 759.9 KB


#### Three dataframes

USER_DF

In [39]:
bins = [0, 18, 30, 45, 60, np.inf]
labels = [18, 30, 45, 60, 99]

user_df['age_category'] = pd.cut(user_df['age'], bins=bins, labels=labels, right=False)

user_df.drop(['age'], axis=1, inplace=True)

In [40]:
user_df = user_df.astype({'age_category': 'object'})

In [41]:
user_df.head(3)

Unnamed: 0,user_id,gender,country,city,exp_group,os,source,age_category
0,200,1,Russia,Degtyarsk,3,Android,ads,45
1,201,0,Russia,Abakan,0,Android,ads,45
2,202,1,Russia,Smolensk,4,Android,ads,18


In [42]:
# convert features to categorical (gender, exp_group, age_category)

POST_DF_PCA

In [43]:
post_df_pca.head(3)

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.04699
1,2,business,-0.102748,-0.316642,0.021755,-0.04508,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667


DF

In [44]:
df = feed_df.merge(user_df, on='user_id', how='inner')

In [45]:
df = df.merge(post_df_pca, on='post_id', how='inner')

In [46]:
df.head()

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
1,532,6264,1,10,27,2,13,1,Russia,Moscow,4,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
2,809,6264,1,11,7,6,21,0,Russia,Izhevsk,2,Android,ads,30,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
3,985,6264,0,12,1,2,18,0,Russia,Korolëv,0,Android,ads,30,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
4,1078,6264,0,12,12,6,7,0,Russia,Vidnoye,3,iOS,ads,18,movie,-0.224576,0.157913,0.023313,0.306276,0.005328


In [47]:
# TODO???
# convert int to object, and object to category???
df['gender'] = df['gender'].astype(str)
df['exp_group'] = df['exp_group'].astype(str)
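
Regarding the open question in the TODO, the sketch below compares the memory footprint of string columns with the pandas `category` dtype on a small made-up frame. It only illustrates the dtype conversion; whether the downstream model handles `category` columns as intended should be checked separately.

```python
import pandas as pd

demo = pd.DataFrame({
    'gender': [1, 0, 1, 0] * 250,
    'exp_group': [3, 0, 4, 1] * 250,
    'city': ['Moscow', 'Abakan', 'Smolensk', 'Moscow'] * 250,
})

# Same conversion as in the cell above: integer codes become strings
as_object = demo.astype({'gender': str, 'exp_group': str})
# Convert every column to the category dtype
as_category = as_object.astype('category')

obj_bytes = as_object.memory_usage(deep=True).sum()
cat_bytes = as_category.memory_usage(deep=True).sum()
print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")

# select_dtypes(include='object') alone would not match category columns
cat_cols = as_category.select_dtypes(include=['object', 'category']).columns.to_list()
print(cat_cols)
```

If the columns are converted this way, any later `select_dtypes(include='object')` call used to build the list of categorical features would need to include `'category'` as well.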

In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int64  
 1   post_id       int64  
 2   target        int64  
 3   month         int64  
 4   day           int64  
 5   day_of_week   int64  
 6   hour_of_day   int64  
 7   gender        object 
 8   country       object 
 9   city          object 
 10  exp_group     object 
 11  os            object 
 12  source        object 
 13  age_category  object 
 14  topic         object 
 15  PCA_1         float64
 16  PCA_2         float64
 17  PCA_3         float64
 18  PCA_4         float64
 19  PCA_5         float64
dtypes: float64(5), int64(7), object(8)
memory usage: 783.3+ MB


In [49]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

del feed_df
del user_df
del post_df_pca

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 2792.34 MB
Memory usage after gc: 2792.34 MB


SAVING DF TO DB AND READING IT

In [None]:
df.to_sql('nktn_lx_step8_features', con=engine, if_exists='replace', index=False) # write the table to the database

In [None]:
# RESTART KERNEL HERE

In [77]:
def batch_load_sql(query: str) -> pd.DataFrame:
    """Load a query result in chunks to limit peak memory usage."""
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    chunks = []
    # The context manager closes the connection even if reading a chunk fails
    with engine.connect().execution_options(stream_results=True) as conn:
        for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
            chunks.append(chunk_dataframe)
    return pd.concat(chunks, ignore_index=True)

In [78]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step8_features"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [79]:
user_df_chunks = load_features()

In [80]:
user_df_chunks.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [81]:
user_df_chunks.user_id.nunique()

163205

In [46]:
X = df.drop(['target', 'user_id', 'post_id'], axis=1)
y = df.target

In [47]:
X.head(3)

Unnamed: 0,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
1,10,27,2,13,1,Russia,Moscow,4,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
2,11,7,6,21,0,Russia,Izhevsk,2,Android,ads,30,movie,-0.224576,0.157913,0.023313,0.306276,0.005328


In [48]:
y[0:3]

0    1
1    1
2    1
Name: target, dtype: int64

In [49]:
import gc
import psutil


# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Delete the dataframe
del feed_df
del user_df
del post_df_pca

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after deletion: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 3724.02 MB
Memory usage after deletion: 3724.02 MB


MODEL

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42, 
                                                    test_size=0.2)

In [None]:
cat_cols = X.select_dtypes(include='object').columns.to_list()
cat_cols

In [61]:
from catboost import CatBoostClassifier


catboost = CatBoostClassifier()
#catboost.fit(X_train, y_train, cat_features=cat_cols, verbose=100)

In [60]:
# Predict class labels
y_pred = catboost.predict(X_test)

# Predict probabilities
y_pred_proba = catboost.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [61]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.88328
Precision: 0.3076923076923077
Recall: 0.002751031636863824
F1 Score: 0.0054533060668029995
ROC-AUC Score: 0.6440508505011305

Confusion Matrix:
 [[11037     9]
 [ 1450     4]]
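
As a sanity check, the reported metrics can be recomputed by hand from the confusion matrix. The short sketch below does only that arithmetic and adds one comparison that is not in the original output: the accuracy of a trivial model that always predicts class 0.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
tn, fp = 11037, 9
fn, tp = 1450, 4

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.5f} precision={precision:.4f} recall={recall:.5f} f1={f1:.5f}")
print(f"always-predict-0 accuracy={majority_baseline:.5f}")
```

The matrix sums to 12,500 test rows, and the trivial baseline reaches a slightly higher accuracy than the model. For this imbalanced sample accuracy therefore says little, and ROC-AUC or a ranking metric is the more informative number.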


In [62]:
feature_importance = catboost.get_feature_importance()
feature_names = catboost.feature_names_

print('FEATURE IMPORTANCE report')
for n, i in zip(feature_names, feature_importance):
    print(n, round(i, 2), sep=': ')

FEATURE IMPORTANCE report
gender: 1.63
age: 14.44
country: 3.34
city: 13.95
exp_group: 6.18
os: 1.25
source: 1.46
topic: 7.61
PCA_1: 8.87
PCA_2: 9.63
PCA_3: 8.85
PCA_4: 11.47
PCA_5: 11.31


In [None]:
### TODO!
# remove all connection credentials before pushing to git
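
One common way to address this TODO is to read the connection settings from environment variables instead of hard-coding them. The sketch below is an assumption-laden illustration: the variable names (`STARTML_DB_USER` and so on) and the helper `build_db_url` are hypothetical, and the demo values exist only so that the snippet runs on its own.

```python
import os


def build_db_url() -> str:
    """Assemble the connection URL from environment variables (names are hypothetical)."""
    user = os.environ["STARTML_DB_USER"]
    password = os.environ["STARTML_DB_PASSWORD"]
    host = os.environ.get("STARTML_DB_HOST", "localhost")
    port = os.environ.get("STARTML_DB_PORT", "5432")
    name = os.environ.get("STARTML_DB_NAME", "startml")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Demo values only; in practice these are set outside the notebook
os.environ.setdefault("STARTML_DB_USER", "demo_user")
os.environ.setdefault("STARTML_DB_PASSWORD", "demo_password")

db_url = build_db_url()
# engine = create_engine(db_url)  # as in the cells above
print(db_url.replace(os.environ["STARTML_DB_PASSWORD"], "***"))
```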