# Rec Sys

## Assumptions and what we are checking in the project solution

- In practice, recommendations must be generated quickly. The algorithm should therefore take no more than ~0.5 seconds per request and use no more than ~4 GB of memory (both figures are approximate).
- The set of users is fixed, and no new users will be added.
- The checker evaluates the model on the same time period that is present in the database.
- Models are not re-trained when using the services. We expect your code to import an already trained model and apply it.
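
A rough way to check a candidate against these budgets locally is to time a single call and track its peak Python allocations. The sketch below uses only the standard library. `check_budget`, `fake_recommend`, and the default limits are illustrative names and values, not part of the checker. Note that `tracemalloc` only sees Python-level allocations, so it likely underestimates memory held by native libraries.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def check_budget(func: Callable[..., Any], *args: Any,
                 max_seconds: float = 0.5,
                 max_bytes: int = 4 * 1024 ** 3) -> Dict[str, Any]:
    """Run func once and report elapsed time and peak traced memory."""
    tracemalloc.start()
    started = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - started
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "result": result,
        "seconds": elapsed,
        "peak_bytes": peak,
        "within_budget": elapsed <= max_seconds and peak <= max_bytes,
    }


def fake_recommend(user_id: int, limit: int = 5) -> list:
    # Hypothetical stand-in for a real recommendation call.
    return [user_id + i for i in range(limit)]


report = check_budget(fake_recommend, 200)
print(report["within_budget"], round(report["seconds"], 4))
```

A single call is a noisy measurement; averaging over many user ids would give a more reliable picture of per-request latency.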

## 1. Loading Data from Database (DB) and Data Overview

In the first step, we connect to the database, extract the necessary data, and load it into Jupyter Hub for analysis. The goal at this stage is to understand the data structure, identify any possible missing values or anomalies, and gain a general understanding of the data distribution and composition. The analysis includes studying the features and the target variable.
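
The per-column checks done cell by cell in this section can also be collected into one summary table. The helper below is a minimal sketch: `overview` is a hypothetical name, and the toy frame contains made-up values that only mimic the shape of `user_data`.

```python
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing-value count, cardinality and memory."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
        "memory_bytes": df.memory_usage(deep=True, index=False),
    })


# Toy frame with made-up values, not real rows from the database.
toy = pd.DataFrame({
    "user_id": [200, 201, 202],
    "age": [34, None, 17],
    "os": ["Android", "Android", "iOS"],
})
summary = overview(toy)
print(summary)
```

Columns with a high `n_unique` and an `object` dtype (such as `city`) tend to dominate memory, which matters given the memory budget stated in the assumptions.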


In [None]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

In [None]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

### USER_DATA

In [None]:
user_df = pd.read_sql('SELECT * FROM "user_data"', con=engine)

In [None]:
user_df

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


In [None]:
user_df.user_id.nunique()

163205

In [None]:
user_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    163205 non-null  int64 
 1   gender     163205 non-null  int64 
 2   age        163205 non-null  int64 
 3   country    163205 non-null  object
 4   city       163205 non-null  object
 5   exp_group  163205 non-null  int64 
 6   os         163205 non-null  object
 7   source     163205 non-null  object
dtypes: int64(4), object(4)
memory usage: 44.5 MB


In [None]:
user_df.gender.value_counts()

1    89980
0    73225
Name: gender, dtype: int64

In [None]:
user_df.age.value_counts()

20    10280
21    10139
19     9802
22     9049
18     9034
      ...  
86        1
83        1
85        1
92        1
95        1
Name: age, Length: 76, dtype: int64

In [None]:
user_df.age.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99])

count    163205.000000
mean         27.195405
std          10.239158
min          14.000000
1%           14.000000
5%           16.000000
25%          19.000000
50%          24.000000
75%          33.000000
95%          48.000000
99%          58.000000
max          95.000000
Name: age, dtype: float64

In [None]:
user_df.country.value_counts()

Russia         143035
Ukraine          8273
Belarus          3293
Kazakhstan       3172
Turkey           1606
Finland          1599
Azerbaijan       1542
Estonia           178
Latvia            175
Cyprus            170
Switzerland       162
Name: country, dtype: int64

In [None]:
user_df.city.nunique()

3915

In [None]:
user_df.os.value_counts()

Android    105972
iOS         57233
Name: os, dtype: int64

In [None]:
user_df.source.value_counts()

ads        101685
organic     61520
Name: source, dtype: int64

In [None]:
user_df.exp_group.value_counts()

3    32768
0    32723
1    32638
2    32614
4    32462
Name: exp_group, dtype: int64

### POST_TEXT_DF

In [None]:
post_df = pd.read_sql('SELECT * FROM "post_text_df"', con=engine)

In [None]:
post_df

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business
...,...,...,...
7018,7315,"OK, I would not normally watch a Farrelly brot...",movie
7019,7316,I give this movie 2 stars purely because of it...,movie
7020,7317,I cant believe this film was allowed to be mad...,movie
7021,7318,The version I saw of this film was the Blockbu...,movie


In [None]:
post_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   post_id  7023 non-null   int64 
 1   text     7023 non-null   object
 2   topic    7023 non-null   object
dtypes: int64(1), object(2)
memory usage: 9.8 MB


In [None]:
post_df.post_id.nunique()

7023

In [None]:
print(post_df.text[:2].values)

['UK economy facing major risks\n\nThe UK manufacturing sector will continue to face serious challenges over the next two years, the British Chamber of Commerce (BCC) has said.\n\nThe groups quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced major risks and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.\n\nManufacturers domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.\n\nDespite some positive news for the export sector, there are worrying signs for manufacturing, the BCC said. These results reinforce our concern over the sectors persistent inabil

In [None]:
print(post_df.text[:1].values[0], sep='\n')

UK economy facing major risks

The UK manufacturing sector will continue to face serious challenges over the next two years, the British Chamber of Commerce (BCC) has said.

The groups quarterly survey of companies found exports had picked up in the last three months of 2004 to their best levels in eight years. The rise came despite exchange rates being cited as a major concern. However, the BCC found the whole UK economy still faced major risks and warned that growth is set to slow. It recently forecast economic growth will slow from more than 3% in 2004 to a little below 2.5% in both 2005 and 2006.

Manufacturers domestic sales growth fell back slightly in the quarter, the survey of 5,196 firms found. Employment in manufacturing also fell and job expectations were at their lowest level for a year.

Despite some positive news for the export sector, there are worrying signs for manufacturing, the BCC said. These results reinforce our concern over the sectors persistent inability to sus

In [None]:
post_df.topic.value_counts()

movie            3000
covid            1799
business          510
sport             510
politics          417
tech              401
entertainment     386
Name: topic, dtype: int64

### POST_TEXT_PCA

In [None]:
post_df_pca = pd.read_sql('SELECT * FROM "post_text"', con=engine)

In [None]:
post_df_pca

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.046990
1,2,business,-0.102748,-0.316642,0.021755,-0.045080,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667
3,4,business,-0.075025,-0.231227,0.010022,-0.035137,0.000382
4,5,business,-0.086689,-0.252862,0.017764,-0.032102,0.054850
...,...,...,...,...,...,...,...
7018,7315,movie,-0.202135,0.246869,-0.003796,-0.270650,0.034498
7019,7316,movie,-0.212412,0.296487,-0.005319,-0.208802,0.099140
7020,7317,movie,-0.187134,0.195667,0.013325,0.380744,0.118247
7021,7318,movie,-0.143072,0.082273,0.005458,0.219139,0.029975


In [None]:
post_df_pca.post_id.nunique()

7023

In [None]:
post_df_pca.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   post_id  7023 non-null   int64  
 1   topic    7023 non-null   object 
 2   PCA_1    7023 non-null   float64
 3   PCA_2    7023 non-null   float64
 4   PCA_3    7023 non-null   float64
 5   PCA_4    7023 non-null   float64
 6   PCA_5    7023 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 759.9 KB


### FEED_DATA

In [None]:
feed_data_size = pd.read_sql('SELECT COUNT(*) as table_size FROM "feed_data"', con=engine)

In [None]:
feed_data_size

Unnamed: 0,table_size
0,76892800


In [None]:
dates_range_df = pd.read_sql('SELECT MIN(timestamp) AS min_date, MAX(timestamp) AS max_date FROM "feed_data"', con=engine)

In [None]:
dates_range_df

Unnamed: 0,min_date,max_date
0,2021-10-01 06:01:40,2021-12-29 23:51:06


In [None]:
feed_df = pd.read_sql("SELECT * FROM feed_data WHERE action != 'like' LIMIT 50000", con=engine)

In [None]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-10-15 16:02:50,98488,5174,view,0
1,2021-10-15 16:04:44,98488,6536,view,0
2,2021-10-15 16:06:21,98488,5936,view,1
3,2021-10-15 16:08:19,98488,6096,view,0
4,2021-10-15 16:11:07,98488,1683,view,1
...,...,...,...,...,...
49995,2021-10-26 07:52:24,19548,3325,view,0
49996,2021-10-26 07:54:42,19548,1530,view,0
49997,2021-10-26 07:57:09,19548,1332,view,0
49998,2021-10-26 07:59:04,19548,1852,view,0


In [None]:
feed_df.timestamp.min()

Timestamp('2021-10-01 08:19:39')

In [None]:
feed_df.timestamp.max()

Timestamp('2021-12-29 23:25:55')

In [None]:
feed_df.iloc[30:40]

Unnamed: 0,timestamp,user_id,post_id,action,target
30,2021-12-13 18:38:03,84189,343,view,0
31,2021-12-13 18:39:10,84189,4939,view,0
32,2021-12-13 18:41:46,84189,6123,view,0
33,2021-12-13 18:44:37,84189,4907,view,1
34,2021-12-13 18:47:21,84189,2651,view,0
35,2021-12-13 18:48:34,84189,6032,view,0
36,2021-12-13 18:50:04,84189,618,view,0
37,2021-12-13 18:51:19,84189,4025,view,0
38,2021-12-13 18:53:54,84189,4909,view,0
39,2021-12-13 18:54:14,84189,5401,view,0


In [None]:
feed_df.timestamp.nunique()

46526

In [None]:
feed_df.user_id.nunique()

121

In [None]:
feed_df.post_id.nunique()

6801

In [None]:
feed_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   timestamp  50000 non-null  datetime64[ns]
 1   user_id    50000 non-null  int64         
 2   post_id    50000 non-null  int64         
 3   action     50000 non-null  object        
 4   target     50000 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 4.4 MB


In [None]:
feed_df.isna().sum()

timestamp    0
user_id      0
post_id      0
action       0
target       0
dtype: int64

In [None]:
feed_df.action.value_counts()

view    50000
Name: action, dtype: int64

In [None]:
feed_df.target.value_counts()

0    44028
1     5972
Name: target, dtype: int64

## 2. Feature engineering and training dataset formation

At this stage, we create new features that may be useful for the model. Features can include information about the user (for example, age, gender, interaction history), information about posts (texts, topics, categories), as well as additional statistics, such as the frequency of likes or user engagement. After feature generation, a training dataset is formed, which contains all the necessary data for subsequent model training.
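
As an illustration of the "additional statistics" mentioned above, the sketch below derives per-user and per-post like rates from an interaction log and attaches them to each row. The data is a made-up toy log and the feature names (`user_like_rate`, `post_like_rate`, and so on) are hypothetical. One assumption worth stating: computing such rates on the same rows that are later used for training leaks the target, so in practice they would probably be computed on an earlier time window.

```python
import pandas as pd

# Toy interaction log with made-up values; the columns mimic feed_data.
feed = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "post_id": [10, 11, 12, 10, 12, 11],
    "target":  [1, 0, 0, 1, 1, 0],
})

# Per-user engagement: number of views and share of views that got a like.
user_stats = (
    feed.groupby("user_id")["target"]
    .agg(user_views="count", user_like_rate="mean")
    .reset_index()
)

# Per-post popularity: the same statistics from the post's point of view.
post_stats = (
    feed.groupby("post_id")["target"]
    .agg(post_views="count", post_like_rate="mean")
    .reset_index()
)

# Attach both sets of statistics to every (user, post) row.
enriched = (
    feed.merge(user_stats, on="user_id", how="left")
        .merge(post_stats, on="post_id", how="left")
)
print(enriched)
```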

In [None]:
user_df.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [None]:
post_df.head(3)

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business


In [None]:
post_df_pca.head(3)

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.04699
1,2,business,-0.102748,-0.316642,0.021755,-0.04508,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667


In [None]:
feed_df.head(3)

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-11-26 14:01:12,84189,1776,view,0
1,2021-11-26 14:03:43,84189,1486,view,0
2,2021-11-26 14:04:26,84189,438,view,0


In [None]:
feed_df.drop('action', axis=1, inplace=True)

In [None]:
feed_df.head(3)

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-11-26 14:01:12,84189,1776,0
1,2021-11-26 14:03:43,84189,1486,0
2,2021-11-26 14:04:26,84189,438,0


In [None]:
feed_df.drop('timestamp', axis=1, inplace=True)
feed_df.head(3)

Unnamed: 0,user_id,post_id,target
0,84189,1776,0
1,84189,1486,0
2,84189,438,0


In [None]:
df = feed_df.merge(user_df, on='user_id', how='inner')

In [None]:
df = df.merge(post_df_pca, on='post_id', how='inner')

In [None]:
df

Unnamed: 0,user_id,post_id,target,gender,age,country,city,exp_group,os,source,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,84189,1776,0,1,30,Russia,Kaliningrad,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
1,38801,1776,0,0,17,Russia,Chita,1,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
2,19477,1776,0,1,15,Russia,Tomsk,2,Android,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
3,38802,1776,0,0,38,Russia,Novotroitsk,2,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
4,38806,1776,0,0,14,Russia,Moscow,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,19504,4521,0,1,17,Russia,Saint Petersburg,4,iOS,ads,movie,-0.237518,0.343240,0.002746,-0.089310,0.072245
49996,70468,6614,0,0,45,Russia,Moscow,2,Android,ads,movie,-0.204158,0.253604,-0.002414,-0.211505,-0.002567
49997,70468,4920,1,0,45,Russia,Moscow,2,Android,ads,movie,-0.161189,0.183307,0.000495,-0.073823,0.041760
49998,38824,5499,0,0,22,Russia,Barnaul,1,iOS,ads,movie,-0.162954,0.132514,-0.004940,-0.129042,-0.056654


In [None]:
X = df.drop(['target', 'user_id', 'post_id'], axis=1)
y = df.target

In [None]:
X.head(3)

Unnamed: 0,gender,age,country,city,exp_group,os,source,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,30,Russia,Kaliningrad,0,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
1,0,17,Russia,Chita,1,iOS,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542
2,1,15,Russia,Tomsk,2,Android,ads,sport,-0.090315,-0.195647,0.008877,-0.015622,-0.105542


In [None]:
y[0:3]

0    0
1    0
2    0
Name: target, dtype: int64

In [1]:
# These are just basic features to prove the concept and get a baseline solution.
# A proper feature engineering step comes later (see section 6 below).

## 3. Model training and quality evaluation

Using the training dataset, we train the model, selecting an algorithm and its parameters. After training, we tune the model and check its quality on a validation dataset. Quality evaluation is performed using metrics, such as precision, recall, or ROC-AUC. This stage helps to determine how well the model is able to make predictions and where it can be improved.

It is important to understand that increasing the local ROC-AUC does not always guarantee an improvement in hitrate in the LMS. Therefore, we advise you to check how changes in your validation metric affect the hitrate in the LMS to ensure a positive impact.
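
To compare a local metric with hitrate, it can help to compute an offline approximation of hitrate on a validation set. The sketch below assumes that hitrate@k means "the share of users for whom at least one of the top-k recommended posts was liked"; the LMS may define or compute it differently. `hitrate_at_k` and the toy dictionaries are illustrative.

```python
from typing import Dict, List, Set


def hitrate_at_k(recommended: Dict[int, List[int]],
                 liked: Dict[int, Set[int]],
                 k: int = 5) -> float:
    """Share of users with at least one liked post among their top-k."""
    if not recommended:
        return 0.0
    hits = 0
    for user_id, post_ids in recommended.items():
        user_likes = liked.get(user_id, set())
        if any(post_id in user_likes for post_id in post_ids[:k]):
            hits += 1
    return hits / len(recommended)


# Made-up recommendations (ordered by score) and ground-truth likes.
recs = {1: [10, 11, 12], 2: [13, 14, 15], 3: [16, 17, 18]}
likes = {1: {12}, 2: {99}, 3: {16, 18}}
print(hitrate_at_k(recs, likes, k=2))
print(hitrate_at_k(recs, likes, k=3))
```

Unlike ROC-AUC, this metric depends only on what ends up in each user's top-k, which is one reason the two can move in different directions.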

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    test_size=0.25)

In [None]:
cat_cols = X.select_dtypes(include='object').columns.to_list()
cat_cols

['country', 'city', 'os', 'source', 'topic']

In [None]:
from catboost import CatBoostClassifier


catboost = CatBoostClassifier()
catboost.fit(X_train, y_train, cat_features=cat_cols, verbose=100)

Learning rate set to 0.048422
0:	learn: 0.6575133	total: 88ms	remaining: 1m 27s
100:	learn: 0.3535494	total: 2.1s	remaining: 18.7s
200:	learn: 0.3482397	total: 4.3s	remaining: 17.1s
300:	learn: 0.3426794	total: 6.94s	remaining: 16.1s
400:	learn: 0.3377525	total: 9.63s	remaining: 14.4s
500:	learn: 0.3335170	total: 12.3s	remaining: 12.2s
600:	learn: 0.3290967	total: 14.9s	remaining: 9.9s
700:	learn: 0.3250677	total: 17.8s	remaining: 7.6s
800:	learn: 0.3213377	total: 20.7s	remaining: 5.13s
900:	learn: 0.3176110	total: 23.3s	remaining: 2.56s
999:	learn: 0.3142838	total: 26s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7f7a50e09490>

In [None]:
# Predict class labels
y_pred = catboost.predict(X_test)

# Predict probabilities
y_pred_proba = catboost.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.88328
Precision: 0.3076923076923077
Recall: 0.002751031636863824
F1 Score: 0.0054533060668029995
ROC-AUC Score: 0.6440508505011305

Confusion Matrix:
 [[11037     9]
 [ 1450     4]]


In [None]:
feature_importance = catboost.get_feature_importance()
feature_names = catboost.feature_names_

print('FEATURE IMPORTANCE report')
for n, i in zip(feature_names, feature_importance):
    print(n, round(i, 2), sep=': ')

FEATURE IMPORTANCE report
gender: 1.63
age: 14.44
country: 3.34
city: 13.95
exp_group: 6.18
os: 1.25
source: 1.46
topic: 7.61
PCA_1: 8.87
PCA_2: 9.63
PCA_3: 8.85
PCA_4: 11.47
PCA_5: 11.31


In [None]:
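# This is just a baseline solution; model quality is not the focus yet.
# With roughly 12% positives, the default 0.5 threshold predicts almost no likes
# (recall is about 0.003), so ROC-AUC is the more informative number here.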

## 4. Saving the trained model

After the model has been successfully trained and its quality meets the requirements, we save it in a specific format that the model/library requires. This file will become the basis for further use of the model, as it contains all the necessary data for predictions, including weights and parameters.

In [None]:
catboost.save_model('catboost_model.cbm')

In [None]:
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

<catboost.core.CatBoostClassifier at 0x7f7a228deee0>

In [None]:
# Predict class labels
y_pred = loaded_model.predict(X_test)

# Predict probabilities
y_pred_proba = loaded_model.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.88328
Precision: 0.3076923076923077
Recall: 0.002751031636863824
F1 Score: 0.0054533060668029995
ROC-AUC Score: 0.6440508505011305

Confusion Matrix:
 [[11037     9]
 [ 1450     4]]


In [None]:
# Read the importances from the loaded model to verify the round trip
feature_importance = loaded_model.get_feature_importance()
feature_names = loaded_model.feature_names_

print('FEATURE IMPORTANCE report')
for n, i in zip(feature_names, feature_importance):
    print(n, round(i, 2), sep=': ')

FEATURE IMPORTANCE report
gender: 1.63
age: 14.44
country: 3.34
city: 13.95
exp_group: 6.18
os: 1.25
source: 1.46
topic: 7.61
PCA_1: 8.87
PCA_2: 9.63
PCA_3: 8.85
PCA_4: 11.47
PCA_5: 11.31


In [None]:
# Same results as before, which shows that saving the model to a file and loading it back works.

### 4.1 Step 5 - loading model for LMS checker

In [None]:
import os
import pandas as pd
import numpy as np


FILE_NAME = '/catboost_model.cbm'

# getting path to a model
def get_model_path(path: str) -> str:
    if os.environ.get("IS_LMS") == "1":  # checking that the code is running in LMS
        MODEL_PATH = '/workdir/user_input/model'
    else:
        MODEL_PATH = path + FILE_NAME
    return MODEL_PATH


# loading the model
def load_models():
    model_path = get_model_path("/home/karpov/mle/03_ml/22_rec_sys")
    from catboost import CatBoostClassifier
    loaded_model = CatBoostClassifier()
    loaded_model.load_model(model_path)
    return loaded_model


# Creating data for checker
num_rows = 10
data = {
    'gender': np.random.choice([1, 0], size=num_rows),  # Randomly choose gender
    'age': np.random.randint(18, 65, size=num_rows),  # Random age between 18 and 65
    'country': np.random.choice(['USA', 'Canada', 'UK', 'Germany'], size=num_rows),  # Random country
    'city': np.random.choice(['New York', 'Toronto', 'London', 'Berlin'], size=num_rows),  # Random city
    'exp_group': np.random.choice([1, 2, 3, 4], size=num_rows),  # Random experiment group
    'os': np.random.choice(['Windows', 'Mac', 'Linux'], size=num_rows),  # Random OS
    'source': np.random.choice(['Google', 'Facebook', 'Direct'], size=num_rows),  # Random source
    'topic': np.random.choice(['Sports', 'Politics', 'Technology', 'Entertainment'], size=num_rows),  # Random topic
    'PCA_1': np.random.normal(0, 1, size=num_rows),  # Random PCA component 1
    'PCA_2': np.random.normal(0, 1, size=num_rows),  # Random PCA component 2
    'PCA_3': np.random.normal(0, 1, size=num_rows),  # Random PCA component 3
    'PCA_4': np.random.normal(0, 1, size=num_rows),  # Random PCA component 4
    'PCA_5': np.random.normal(0, 1, size=num_rows),  # Random PCA component 5
}
X_train_fake = pd.DataFrame(data)


# loading the model and making some prediction on fake data for LMS checker
model = load_models()
model.predict(X_train_fake)
model.predict_proba(X_train_fake)
print('Success!')

Success!


## 5. Developing a service for using the model

Here, we create a service that will allow interaction with the model in real time. The service includes the following steps:

- Loading the model: when starting, the service loads the previously saved model from a file.
- Obtaining features: the service receives requests with a user_id, based on which it generates the necessary features for prediction or loads them from tables that you have uploaded to the LMS database. The features at the time of prediction must match the features that were present at the time of model training.
- Prediction: using the loaded model and the obtained features, the service makes a prediction — determines the posts that the user is likely to like.
- Returning the response: the service returns a response with the prediction results.

Important: for the testing system (checker) to test the service correctly, the service and the model must be uploaded together.
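
The steps listed above can be sketched end to end on toy data. This is not the project's service: `StubModel` is a hypothetical stand-in for the trained classifier, `recommend` is an illustrative name, and all values are made up. The sketch also assumes that already liked posts are excluded from the candidates before scoring.

```python
from typing import List, Set

import numpy as np
import pandas as pd


class StubModel:
    """Hypothetical stand-in for a trained classifier with predict_proba."""

    def predict_proba(self, features: pd.DataFrame) -> np.ndarray:
        # Made-up score: PCA_1 squashed into (0, 1), purely illustrative.
        positive = 1.0 / (1.0 + np.exp(-features["PCA_1"].to_numpy()))
        return np.column_stack([1.0 - positive, positive])


def recommend(user_id: int, users: pd.DataFrame, posts: pd.DataFrame,
              liked: Set[int], model, limit: int = 5) -> List[int]:
    # 1. Obtain features: pair the user's row with every candidate post
    #    (cross join via a temporary constant key).
    user_row = users[users["user_id"] == user_id].assign(_key=1)
    candidates = posts[~posts["post_id"].isin(liked)].assign(_key=1)
    pairs = user_row.merge(candidates, on="_key").drop(columns="_key")
    # 2. Predict: pass only feature columns, as at training time.
    features = pairs.drop(columns=["user_id", "post_id"])
    pairs["probability"] = model.predict_proba(features)[:, 1]
    # 3. Return the ids of the highest-scoring posts.
    top = pairs.sort_values("probability", ascending=False).head(limit)
    return top["post_id"].to_list()


users = pd.DataFrame({"user_id": [200, 201], "age": [34, 37]})
posts = pd.DataFrame({"post_id": [1, 2, 3, 4],
                      "PCA_1": [0.1, 0.9, -0.5, 0.4]})
model = StubModel()  # in a real service the model is loaded once at start-up
print(recommend(200, users, posts, liked={2}, model=model, limit=2))
```

Scoring every candidate post per request is the simplest design; whether it fits the latency budget depends on the number of posts and the model, so it is worth measuring.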

### 5.1 Step 6 - Getting features

In [None]:
user_df.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [None]:
user_df.to_sql('nktn_lx_step6_draft', con=engine, if_exists='replace', index=False)  # write the table

205

In [None]:
user_df_draft = pd.read_sql('SELECT * FROM "nktn_lx_step6_draft"', con=engine)

In [None]:
user_df_draft.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [None]:
user_df.user_id.nunique()

163205

In [None]:
user_df_draft.user_id.nunique()

163205

In [None]:
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)

In [None]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step6_draft"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [None]:
user_df_chunks = load_features()

In [None]:
user_df_chunks.head(3)

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads


In [None]:
user_df_chunks.user_id.nunique()

163205

## 6. Loading the service into the LMS for testing (checker)

After development is complete, the service and model are loaded into the LMS, where an automated checker performs testing. The checker verifies whether the service meets the requirements, whether it makes correct predictions, whether it works without errors, and how quickly it responds to requests. Successful completion of the check confirms the model's readiness for use in production.
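
Before uploading, the shape of a response can be sanity-checked locally. The exact criteria the checker applies are not described here, so the function below is only a guess at reasonable checks: the expected number of posts, no duplicate ids, and the `id`, `text`, and `topic` fields used by the response schema. `validate_response` and the sample payloads are hypothetical.

```python
from typing import Any, Dict, List


def validate_response(payload: List[Dict[str, Any]], limit: int) -> List[str]:
    """Return a list of problems found in a recommendations response."""
    problems = []
    if len(payload) != limit:
        problems.append(f"expected {limit} posts, got {len(payload)}")
    ids = [item.get("id") for item in payload]
    if len(set(ids)) != len(ids):
        problems.append("duplicate post ids")
    for item in payload:
        missing = {"id", "text", "topic"} - item.keys()
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
    return problems


# Made-up payloads in the shape the endpoint is expected to return.
good = [
    {"id": 1, "text": "first post", "topic": "business"},
    {"id": 2, "text": "second post", "topic": "movie"},
]
bad = [
    {"id": 1, "text": "first post", "topic": "business"},
    {"id": 1, "text": "same id again"},
]
print(validate_response(good, limit=2))
print(validate_response(bad, limit=2))
```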

### 6.1 Step 7 - Checking API draft

In [None]:
import os
import random
from typing import List

import pandas as pd
from sqlalchemy import create_engine
from fastapi import FastAPI
from schema import PostGet
from datetime import datetime


engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

# step 5 - start
# getting path to a model
def get_model_path(path: str) -> str:
    if os.environ.get("IS_LMS") == "1":  
        MODEL_PATH = '/workdir/user_input/model'
    else:
        MODEL_PATH = path
    return MODEL_PATH


# loading the model
def load_models():
    model_path = get_model_path("/Users/nikitin_a/PycharmProjects/l22_rec_sys/catboost_model.cbm")
    from catboost import CatBoostClassifier
    loaded_model = CatBoostClassifier()
    loaded_model.load_model(model_path)
    return loaded_model


# loading the model
model = load_models()


# step 6 - start
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)


def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step7_draft"'
    loaded_features_df = batch_load_sql(QUERY)
    return loaded_features_df


# loading dataframe with features
features_df = load_features()


# step 7 - start
posts_df = pd.read_sql('SELECT * FROM "post_text_df"', con=engine)

app = FastAPI()

@app.get("/post/recommendations/", response_model=List[PostGet])
def recommended_posts(
        id: int,
        time: datetime,
        limit: int = 10) -> List[PostGet]:
    # .copy() avoids SettingWithCopyWarning when the probability column is added
    user_df = features_df[features_df['user_id'] == id].copy()
    user_features_df = user_df.drop(['target', 'user_id', 'post_id'], axis=1)

    y_pred_proba = model.predict_proba(user_features_df)
    user_df['probability'] = y_pred_proba[:, 1]  # probability of class 1

    # sort_values returns a new frame, so the result must be assigned back
    user_df = user_df.sort_values('probability', ascending=False) \
        .drop_duplicates(subset='post_id', keep='first')

    # TODO: filter out posts the user has already liked. The service will need
    # to load all rows with likes in order to exclude them from the candidates.

    top_posts_ids = user_df.head(limit).post_id.to_list()
    if len(top_posts_ids) < limit:
        # pad with random posts that are not already in the list
        random_items = limit - len(top_posts_ids)
        other_ids = posts_df.loc[~posts_df.post_id.isin(top_posts_ids), 'post_id'].to_list()
        top_posts_ids.extend(random.sample(other_ids, k=random_items))

    top_posts_df = posts_df.query('post_id in @top_posts_ids')

    result = [
        PostGet(
            id=row['post_id'],  # id of the recommended post
            text=row.get('text', ''),
            topic=row.get('topic', '')
        )
        for _, row in top_posts_df.iterrows()
    ]

    return result


In [None]:
# this is just a service draft, not the working service itself
# I'm just checking that the data model is correct and I receive responses from the API

### 6.2 - Getting `feed_df` 5 mln rows

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import gc
import psutil

In [3]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

Getting up to 15 viewed posts per user, liked posts first (users with fewer than 15 likes are topped up with non-liked views).

In [None]:
query = """
SELECT
  f.timestamp,
  f.user_id,
  f.post_id,
  f.target
FROM (
  SELECT
    fd.timestamp,
    fd.user_id,
    fd.post_id,
    fd.target,
    ROW_NUMBER() OVER(PARTITION BY fd.user_id ORDER BY fd.target DESC) rn
  FROM
    feed_data fd
  WHERE
    fd.action != 'like'
) AS f
WHERE
  f.rn <=15
"""

In [None]:
feed_df_likes = pd.read_sql(query, con=engine)

In [None]:
feed_df_likes

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
2448070,2021-11-07 06:41:39,168552,4760,1
2448071,2021-11-23 14:57:04,168552,3817,1
2448072,2021-12-07 18:22:13,168552,7063,1
2448073,2021-12-07 18:37:22,168552,3428,0


In [None]:
feed_df_likes.user_id.nunique()

163205

In [None]:
feed_df_likes.post_id.nunique()

6831

In [None]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 390.02 MB
Memory usage after gc: 390.02 MB


Getting up to 15 viewed posts per user, posts without likes first.

In [None]:
query = """
SELECT
  f.timestamp,
  f.user_id,
  f.post_id,
  f.target
FROM (
  SELECT
    fd.timestamp,
    fd.user_id,
    fd.post_id,
    fd.target,
    ROW_NUMBER() OVER(PARTITION BY fd.user_id ORDER BY fd.target ASC) rn
  FROM
    feed_data fd
  WHERE
    fd.action != 'like'
) AS f
WHERE
  f.rn <=15
"""

In [None]:
feed_df_views = pd.read_sql(query, con=engine)

In [None]:
feed_df_views

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-29 19:12:00,200,6738,0
1,2021-10-29 19:15:39,200,5007,0
2,2021-10-29 19:15:54,200,4998,0
3,2021-10-29 19:18:36,200,620,0
4,2021-10-29 19:19:30,200,5684,0
...,...,...,...,...
2448070,2021-10-14 11:03:56,168552,2829,0
2448071,2021-12-20 18:47:39,168552,3205,0
2448072,2021-10-14 11:02:20,168552,4428,0
2448073,2021-12-07 18:58:26,168552,1229,0


In [None]:
feed_df_views.user_id.nunique()

163205

In [None]:
feed_df_views.post_id.nunique()

6831

In [None]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 391.88 MB
Memory usage after gc: 391.88 MB


Concatenating the liked and non-liked samples into a single dataframe.

In [None]:
feed_df = pd.concat([feed_df_likes, feed_df_views], ignore_index=True)

In [None]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [None]:
feed_df = feed_df.drop_duplicates()
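
`drop_duplicates()` removes rows that match an earlier row in every column; in this notebook the frame shrinks from 4,896,150 to 4,888,776 rows, so 7,374 rows were exact duplicates. A small sketch on toy frames (the `likes` and `views` frames here are made up for illustration) shows how to count the removed rows and why the index has gaps afterwards.

```python
import pandas as pd

likes = pd.DataFrame({"user_id": [1, 1], "post_id": [10, 11], "target": [1, 1]})
views = pd.DataFrame({"user_id": [1, 1], "post_id": [11, 12], "target": [1, 0]})

combined = pd.concat([likes, views], ignore_index=True)
deduplicated = combined.drop_duplicates()

# Rows removed because every column matched an earlier row.
n_removed = len(combined) - len(deduplicated)
print(n_removed)

# drop_duplicates keeps the original index labels, so the index has gaps;
# reset_index(drop=True) restores a contiguous RangeIndex.
deduplicated = deduplicated.reset_index(drop=True)
print(deduplicated)
```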

In [None]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [None]:
(
    feed_df.groupby('user_id', as_index=False)
    .agg({'target': 'sum'})
    .sort_values('target', ascending=True)
    .reset_index(drop=True)
    .iloc[:14000]
)

Unnamed: 0,user_id,target
0,121046,0
1,162510,0
2,52233,0
3,89263,1
4,34612,1
...,...,...
13995,117605,15
13996,98752,15
13997,117730,15
13998,91970,15


In [None]:
feed_df.user_id.nunique()

163205

In [None]:
feed_df.post_id.nunique()

6831

In [None]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4896145,2021-10-14 11:03:56,168552,2829,0
4896146,2021-12-20 18:47:39,168552,3205,0
4896147,2021-10-14 11:02:20,168552,4428,0
4896148,2021-12-07 18:58:26,168552,1229,0


In [None]:
feed_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4888776 entries, 0 to 4896149
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   user_id    int64         
 2   post_id    int64         
 3   target     int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 186.5 MB


In [None]:
feed_df.target.sum()

2381338

In [None]:
feed_df.to_sql('nktn_lx_step8_feed_df', con=engine, if_exists='replace', index=False) # write the table

776

### 6.3 - Feature engineering 

In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import gc
import psutil

In [2]:
engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

In [3]:
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)
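
`batch_load_sql` relies on `pd.read_sql` returning an iterator of dataframes when `chunksize` is set. The sketch below shows that behaviour in isolation; it uses an in-memory SQLite database and a made-up `toy_posts` table as stand-ins for the PostgreSQL server, so it is an illustration under those assumptions rather than the project's loading code.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the PostgreSQL server (assumption for the demo).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"post_id": range(10)}).to_sql("toy_posts", conn, index=False)

chunk_sizes = []
chunks = []
# With chunksize set, read_sql yields DataFrames of at most that many rows.
for chunk in pd.read_sql("SELECT * FROM toy_posts ORDER BY post_id", conn, chunksize=4):
    chunk_sizes.append(len(chunk))
    chunks.append(chunk)
conn.close()

toy_df = pd.concat(chunks, ignore_index=True)
print(chunk_sizes, len(toy_df))
```

The final `pd.concat` still materialises the whole table in memory, so chunking limits the size of each fetch rather than the size of the result.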

In [4]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step8_feed_df"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [5]:
feed_df = load_features()

In [6]:
feed_df

Unnamed: 0,timestamp,user_id,post_id,target
0,2021-10-02 14:07:30,200,6264,1
1,2021-12-29 14:55:04,200,4200,1
2,2021-12-29 14:58:19,200,3567,1
3,2021-12-29 15:03:05,200,3539,1
4,2021-12-29 15:18:42,200,994,1
...,...,...,...,...
4888771,2021-10-14 11:03:56,168552,2829,0
4888772,2021-12-20 18:47:39,168552,3205,0
4888773,2021-10-14 11:02:20,168552,4428,0
4888774,2021-12-07 18:58:26,168552,1229,0


In [7]:
feed_df.user_id.nunique()

163205

In [8]:
feed_df.post_id.nunique()

6831

In [9]:
feed_df.target.sum()

2381338

In [10]:
feed_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 4 columns):
 #   Column     Dtype         
---  ------     -----         
 0   timestamp  datetime64[ns]
 1   user_id    int64         
 2   post_id    int64         
 3   target     int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 149.2 MB


In [11]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")


# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 513.30 MB
Memory usage after gc: 504.11 MB


In [12]:
feed_df['month'] = feed_df['timestamp'].dt.month
feed_df['day'] = feed_df['timestamp'].dt.day
feed_df['day_of_week'] = feed_df['timestamp'].dt.dayofweek
feed_df['hour_of_day'] = feed_df['timestamp'].dt.hour

In [13]:
feed_df.drop(['timestamp'], axis=1, inplace=True)
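
As a quick sanity check of the `.dt` accessors used above, the sketch below decomposes the first timestamp of the table. `dayofweek` counts from Monday = 0, so a Saturday yields 5. The `toy` frame and the `parts` dictionary are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({"timestamp": pd.to_datetime(["2021-10-02 14:07:30"])})

parts = {
    "month": int(toy["timestamp"].dt.month.iloc[0]),
    "day": int(toy["timestamp"].dt.day.iloc[0]),
    "day_of_week": int(toy["timestamp"].dt.dayofweek.iloc[0]),  # Monday = 0
    "hour_of_day": int(toy["timestamp"].dt.hour.iloc[0]),
}
print(parts)
```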

In [14]:
feed_df

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day
0,200,6264,1,10,2,5,14
1,200,4200,1,12,29,2,14
2,200,3567,1,12,29,2,14
3,200,3539,1,12,29,2,15
4,200,994,1,12,29,2,15
...,...,...,...,...,...,...,...
4888771,168552,2829,0,10,14,3,11
4888772,168552,3205,0,12,20,0,18
4888773,168552,4428,0,10,14,3,11
4888774,168552,1229,0,12,7,1,18


In [None]:
#feed_df.to_sql('nktn_lx_step8_feed_df', con=engine, if_exists='replace', index=False) # write the table

#### Getting user_df

In [15]:
user_df = pd.read_sql('SELECT * FROM "user_data"', con=engine)

In [16]:
user_df

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads
...,...,...,...,...,...,...,...,...
163200,168548,0,36,Russia,Kaliningrad,4,Android,organic
163201,168549,0,18,Russia,Tula,2,Android,organic
163202,168550,1,41,Russia,Yekaterinburg,4,Android,organic
163203,168551,0,38,Russia,Moscow,3,iOS,organic


In [17]:
user_df.user_id.nunique()

163205

In [18]:
user_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163205 entries, 0 to 163204
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   user_id    163205 non-null  int64 
 1   gender     163205 non-null  int64 
 2   age        163205 non-null  int64 
 3   country    163205 non-null  object
 4   city       163205 non-null  object
 5   exp_group  163205 non-null  int64 
 6   os         163205 non-null  object
 7   source     163205 non-null  object
dtypes: int64(4), object(4)
memory usage: 44.5 MB


#### Getting post_df_pca

In [19]:
post_df_pca = pd.read_sql('SELECT * FROM "post_text"', con=engine)

In [20]:
post_df_pca

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.046990
1,2,business,-0.102748,-0.316642,0.021755,-0.045080,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667
3,4,business,-0.075025,-0.231227,0.010022,-0.035137,0.000382
4,5,business,-0.086689,-0.252862,0.017764,-0.032102,0.054850
...,...,...,...,...,...,...,...
7018,7315,movie,-0.202135,0.246869,-0.003796,-0.270650,0.034498
7019,7316,movie,-0.212412,0.296487,-0.005319,-0.208802,0.099140
7020,7317,movie,-0.187134,0.195667,0.013325,0.380744,0.118247
7021,7318,movie,-0.143072,0.082273,0.005458,0.219139,0.029975


In [21]:
post_df_pca.post_id.nunique()

7023

In [22]:
post_df_pca.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7023 entries, 0 to 7022
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   post_id  7023 non-null   int64  
 1   topic    7023 non-null   object 
 2   PCA_1    7023 non-null   float64
 3   PCA_2    7023 non-null   float64
 4   PCA_3    7023 non-null   float64
 5   PCA_4    7023 non-null   float64
 6   PCA_5    7023 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 759.9 KB


#### Merging three dataframes

USER_DF

In [23]:
bins = [0, 18, 30, 45, 60, np.inf]
labels = [18, 30, 45, 60, 99]

user_df['age_category'] = pd.cut(user_df['age'], bins=bins, labels=labels, right=False)

user_df.drop(['age'], axis=1, inplace=True)
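
With `right=False` each interval is closed on the left, so every label is the exclusive upper bound of its bin, and 99 marks the open-ended last bin. The sketch below applies the same `bins` and `labels` to a few boundary ages; the `ages` series is made up for illustration.

```python
import numpy as np
import pandas as pd

bins = [0, 18, 30, 45, 60, np.inf]
labels = [18, 30, 45, 60, 99]

ages = pd.Series([17, 18, 34, 59, 60])
# right=False gives the intervals [0, 18), [18, 30), [30, 45), [45, 60), [60, inf).
age_category = pd.cut(ages, bins=bins, labels=labels, right=False)
age_labels = age_category.astype(int).tolist()
print(age_labels)
```

This agrees with the table above: the user aged 34 lands in category 45 and the user aged 17 in category 18.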

In [24]:
user_df = user_df.astype({'age_category': 'object'})

In [25]:
user_df.head(3)

Unnamed: 0,user_id,gender,country,city,exp_group,os,source,age_category
0,200,1,Russia,Degtyarsk,3,Android,ads,45
1,201,0,Russia,Abakan,0,Android,ads,45
2,202,1,Russia,Smolensk,4,Android,ads,18


POST_DF_PCA

In [27]:
post_df_pca.head(3)

Unnamed: 0,post_id,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,1,business,-0.098651,-0.312493,0.023472,-0.029465,-0.04699
1,2,business,-0.102748,-0.316642,0.021755,-0.04508,0.137338
2,3,business,-0.089932,-0.211479,0.010823,-0.033651,-0.037667


DF

In [28]:
df = feed_df.merge(user_df, on='user_id', how='inner')

In [29]:
df = df.merge(post_df_pca, on='post_id', how='inner')

In [30]:
df.head()

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
1,200,4200,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.853775,0.12555,0.226212,0.000183,0.050433
2,200,3567,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.24459,-0.00703,-0.248296,0.002506,0.010041
3,200,3539,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.259425,0.005846,0.072833,0.002658,-0.030134
4,200,994,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,politics,-0.112084,-0.312764,0.031349,-0.073475,0.387709


In [31]:
df['gender'] = df['gender'].astype(str)
df['exp_group'] = df['exp_group'].astype(str)

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int64  
 1   post_id       int64  
 2   target        int64  
 3   month         int32  
 4   day           int32  
 5   day_of_week   int32  
 6   hour_of_day   int32  
 7   gender        object 
 8   country       object 
 9   city          object 
 10  exp_group     object 
 11  os            object 
 12  source        object 
 13  age_category  object 
 14  topic         object 
 15  PCA_1         float64
 16  PCA_2         float64
 17  PCA_3         float64
 18  PCA_4         float64
 19  PCA_5         float64
dtypes: float64(5), int32(4), int64(3), object(8)
memory usage: 671.4+ MB


In [33]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

del feed_df
del user_df
del post_df_pca

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 1882.97 MB
Memory usage after gc: 1882.97 MB


In [35]:
df.to_csv('250117features.csv')

SAVING DF TO DB AND READING IT

In [None]:
# RESET KERNEL

In [4]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 183.23 MB
Memory usage after gc: 183.23 MB


In [5]:
df = pd.read_csv('250117features.csv')

In [6]:
df

Unnamed: 0.1,Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,...,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,0,200,6264,1,10,2,5,14,1,Russia,...,3,Android,ads,45,movie,-0.224576,0.157913,0.023313,0.306276,0.005328
1,1,200,4200,1,12,29,2,14,1,Russia,...,3,Android,ads,45,covid,0.853775,0.125550,0.226212,0.000183,0.050433
2,2,200,3567,1,12,29,2,14,1,Russia,...,3,Android,ads,45,covid,0.244590,-0.007030,-0.248296,0.002506,0.010041
3,3,200,3539,1,12,29,2,15,1,Russia,...,3,Android,ads,45,covid,0.259425,0.005846,0.072833,0.002658,-0.030134
4,4,200,994,1,12,29,2,15,1,Russia,...,3,Android,ads,45,politics,-0.112084,-0.312764,0.031349,-0.073475,0.387709
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888771,4888771,168552,2829,0,10,14,3,11,1,Russia,...,4,Android,organic,18,covid,0.078897,-0.070991,-0.107583,-0.007093,-0.038227
4888772,4888772,168552,3205,0,12,20,0,18,1,Russia,...,4,Android,organic,18,covid,0.342201,0.052768,0.101260,-0.011941,-0.025448
4888773,4888773,168552,4428,0,10,14,3,11,1,Russia,...,4,Android,organic,18,movie,-0.087613,0.000636,-0.005936,-0.000200,-0.067521
4888774,4888774,168552,1229,0,12,7,1,18,1,Russia,...,4,Android,organic,18,politics,-0.100333,-0.254534,0.014260,-0.039871,0.028564


In [9]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [10]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int64  
 1   post_id       int64  
 2   target        int64  
 3   month         int64  
 4   day           int64  
 5   day_of_week   int64  
 6   hour_of_day   int64  
 7   gender        int64  
 8   country       object 
 9   city          object 
 10  exp_group     int64  
 11  os            object 
 12  source        object 
 13  age_category  int64  
 14  topic         object 
 15  PCA_1         float64
 16  PCA_2         float64
 17  PCA_3         float64
 18  PCA_4         float64
 19  PCA_5         float64
dtypes: float64(5), int64(10), object(5)
memory usage: 2.0 GB


In [15]:
# Define a dictionary to map columns to their optimized data types
dtype_mapping = {
    'user_id': 'int32',        # int32 is sufficient for user IDs
    'post_id': 'int16',        # int16 is sufficient for post IDs (they stay below 32768)
    'target': 'int8',          # target is binary or small range, so int8 is enough
    'month': 'int8',           # month ranges from 1 to 12
    'day': 'int8',             # day ranges from 1 to 31
    'day_of_week': 'int8',     # day_of_week ranges from 0 to 6
    'hour_of_day': 'int8',     # hour_of_day ranges from 0 to 23
    'gender': 'object',        # gender is binary; kept as object so it is treated as categorical
    'country': 'object',       # country is categorical
    'city': 'object',          # city is categorical
    'exp_group': 'object',     # exp_group is a group label; kept as object so it is treated as categorical
    'os': 'object',            # os is categorical
    'source': 'object',        # source is categorical
    'age_category': 'int8',    # age_category takes the values 18, 30, 45, 60, 99
    'topic': 'object',         # topic is categorical
    'PCA_1': 'float16',        # float16 is sufficient for PCA components
    'PCA_2': 'float16',        # float16 is sufficient for PCA components
    'PCA_3': 'float16',        # float16 is sufficient for PCA components
    'PCA_4': 'float16',        # float16 is sufficient for PCA components
    'PCA_5': 'float16'         # float16 is sufficient for PCA components
}

# Convert columns to optimized data types
df = df.astype(dtype_mapping)

# Check memory usage after optimization
print(df.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int32  
 1   post_id       int16  
 2   target        int8   
 3   month         int8   
 4   day           int8   
 5   day_of_week   int8   
 6   hour_of_day   int8   
 7   gender        object 
 8   country       object 
 9   city          object 
 10  exp_group     object 
 11  os            object 
 12  source        object 
 13  age_category  int8   
 14  topic         object 
 15  PCA_1         float16
 16  PCA_2         float16
 17  PCA_3         float16
 18  PCA_4         float16
 19  PCA_5         float16
dtypes: float16(5), int16(1), int32(1), int8(6), object(7)
memory usage: 1.9 GB
None
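
The total barely moves (2.0 GB to 1.9 GB) because the columns stored as `object` account for most of the deep memory footprint. The sketch below compares `object` with the pandas `category` dtype on toy data. Whether categorical columns suit the rest of the pipeline (for example the later `select_dtypes(include='object')` call that picks the categorical features) is not tested in this notebook, so treat it as an illustration only.

```python
import pandas as pd

# 100,000 strings drawn from four distinct city names (toy data).
cities = pd.Series(["Moscow", "Tula", "Abakan", "Smolensk"] * 25_000, dtype="object")

object_mb = cities.memory_usage(deep=True) / 1024 ** 2
category_mb = cities.astype("category").memory_usage(deep=True) / 1024 ** 2

print(f"object: {object_mb:.2f} MB, category: {category_mb:.2f} MB")
```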


In [16]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 2347.93 MB
Memory usage after gc: 2347.93 MB


In [17]:
df

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224609,0.157959,0.023315,0.306396,0.005329
1,200,4200,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.854004,0.125610,0.226196,0.000183,0.050446
2,200,3567,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.244629,-0.007030,-0.248291,0.002506,0.010040
3,200,3539,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.259521,0.005848,0.072815,0.002659,-0.030136
4,200,994,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,politics,-0.112061,-0.312744,0.031342,-0.073486,0.387695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888771,168552,2829,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.078918,-0.070984,-0.107605,-0.007092,-0.038239
4888772,168552,3205,0,12,20,0,18,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.342285,0.052765,0.101257,-0.011940,-0.025452
4888773,168552,4428,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,movie,-0.087585,0.000636,-0.005936,-0.000200,-0.067505
4888774,168552,1229,0,12,7,1,18,1,Russia,Ivanteyevka,4,Android,organic,18,politics,-0.100342,-0.254639,0.014259,-0.039886,0.028564


In [18]:
df.to_sql('nktn_lx_step8_features', con=engine, if_exists='replace', index=False) # write the table

776

In [22]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step8_features"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [23]:
df_db = load_features()

In [24]:
df_db.shape

(4888776, 20)

In [25]:
df.shape

(4888776, 20)

In [26]:
df_db.user_id.nunique()

163205

In [27]:
df.user_id.nunique()

163205

In [28]:
df_db.post_id.nunique()

6831

In [29]:
df.post_id.nunique()

6831

In [30]:
df_db

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224609,0.157959,0.023315,0.306396,0.005329
1,200,4200,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.854004,0.125610,0.226196,0.000183,0.050446
2,200,3567,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.244629,-0.007030,-0.248291,0.002506,0.010040
3,200,3539,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.259521,0.005848,0.072815,0.002659,-0.030136
4,200,994,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,politics,-0.112061,-0.312744,0.031342,-0.073486,0.387695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888771,168552,2829,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.078918,-0.070984,-0.107605,-0.007092,-0.038239
4888772,168552,3205,0,12,20,0,18,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.342285,0.052765,0.101257,-0.011940,-0.025452
4888773,168552,4428,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,movie,-0.087585,0.000636,-0.005936,-0.000200,-0.067505
4888774,168552,1229,0,12,7,1,18,1,Russia,Ivanteyevka,4,Android,organic,18,politics,-0.100342,-0.254639,0.014259,-0.039886,0.028564


In [31]:
df

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224609,0.157959,0.023315,0.306396,0.005329
1,200,4200,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.854004,0.125610,0.226196,0.000183,0.050446
2,200,3567,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.244629,-0.007030,-0.248291,0.002506,0.010040
3,200,3539,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.259521,0.005848,0.072815,0.002659,-0.030136
4,200,994,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,politics,-0.112061,-0.312744,0.031342,-0.073486,0.387695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888771,168552,2829,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.078918,-0.070984,-0.107605,-0.007092,-0.038239
4888772,168552,3205,0,12,20,0,18,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.342285,0.052765,0.101257,-0.011940,-0.025452
4888773,168552,4428,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,movie,-0.087585,0.000636,-0.005936,-0.000200,-0.067505
4888774,168552,1229,0,12,7,1,18,1,Russia,Ivanteyevka,4,Android,organic,18,politics,-0.100342,-0.254639,0.014259,-0.039886,0.028564


In [32]:
df_db.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int64  
 1   post_id       int64  
 2   target        int64  
 3   month         int64  
 4   day           int64  
 5   day_of_week   int64  
 6   hour_of_day   int64  
 7   gender        int64  
 8   country       object 
 9   city          object 
 10  exp_group     int64  
 11  os            object 
 12  source        object 
 13  age_category  int64  
 14  topic         object 
 15  PCA_1         float64
 16  PCA_2         float64
 17  PCA_3         float64
 18  PCA_4         float64
 19  PCA_5         float64
dtypes: float64(5), int64(10), object(5)
memory usage: 746.0+ MB


### 6.4 - Fitting a model

In [2]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import gc
import psutil

In [3]:
def batch_load_sql(query: str) -> pd.DataFrame:
    CHUNKSIZE = 200000
    engine = create_engine(
        "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
    )
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
        chunks.append(chunk_dataframe)
    conn.close()
    return pd.concat(chunks, ignore_index=True)

In [3]:
def load_features() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "nktn_lx_step8_features"'
    features_df = batch_load_sql(QUERY)
    return features_df

In [4]:
df = load_features()

In [5]:
# Define a dictionary to map columns to their optimized data types
dtype_mapping = {
    'user_id': 'int32',        # int32 is sufficient for user IDs
    'post_id': 'int16',        # int16 is sufficient for post IDs (they stay below 32768)
    'target': 'int8',          # target is binary or small range, so int8 is enough
    'month': 'int8',           # month ranges from 1 to 12
    'day': 'int8',             # day ranges from 1 to 31
    'day_of_week': 'int8',     # day_of_week ranges from 0 to 6
    'hour_of_day': 'int8',     # hour_of_day ranges from 0 to 23
    'gender': 'object',        # gender is binary; kept as object so it is treated as categorical
    'country': 'object',       # country is categorical
    'city': 'object',          # city is categorical
    'exp_group': 'object',     # exp_group is a group label; kept as object so it is treated as categorical
    'os': 'object',            # os is categorical
    'source': 'object',        # source is categorical
    'age_category': 'int8',    # age_category takes the values 18, 30, 45, 60, 99
    'topic': 'object',         # topic is categorical
    'PCA_1': 'float16',        # float16 is sufficient for PCA components
    'PCA_2': 'float16',        # float16 is sufficient for PCA components
    'PCA_3': 'float16',        # float16 is sufficient for PCA components
    'PCA_4': 'float16',        # float16 is sufficient for PCA components
    'PCA_5': 'float16'         # float16 is sufficient for PCA components
}

# Convert columns to optimized data types
df = df.astype(dtype_mapping)

# Check memory usage after optimization
print(df.info(memory_usage='deep'))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888776 entries, 0 to 4888775
Data columns (total 20 columns):
 #   Column        Dtype  
---  ------        -----  
 0   user_id       int32  
 1   post_id       int16  
 2   target        int8   
 3   month         int8   
 4   day           int8   
 5   day_of_week   int8   
 6   hour_of_day   int8   
 7   gender        object 
 8   country       object 
 9   city          object 
 10  exp_group     object 
 11  os            object 
 12  source        object 
 13  age_category  int8   
 14  topic         object 
 15  PCA_1         float16
 16  PCA_2         float16
 17  PCA_3         float16
 18  PCA_4         float16
 19  PCA_5         float16
dtypes: float16(5), int16(1), int32(1), int8(6), object(7)
memory usage: 1.9 GB
None
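
Downcasting is only safe when every value fits the target type. The sketch below checks integer ranges with `np.iinfo` and shows the rounding that `float16` introduces, which is why `-0.224576` appears as `-0.224609` in the tables printed after the cast. The `fits` helper and the sample ID values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def fits(series: pd.Series, dtype: str) -> bool:
    """Return True if every value of an integer series fits into `dtype`."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)


post_ids = pd.Series([1, 994, 7319])
user_ids = pd.Series([200, 168552])

print(fits(post_ids, "int16"))  # int16 holds values up to 32767
print(fits(user_ids, "int16"))  # user IDs above 32767 do not fit
print(fits(user_ids, "int32"))

# float16 keeps roughly three significant decimal digits.
rounded = float(np.float16(-0.224576))
print(rounded)
```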


In [6]:
df

Unnamed: 0,user_id,post_id,target,month,day,day_of_week,hour_of_day,gender,country,city,exp_group,os,source,age_category,topic,PCA_1,PCA_2,PCA_3,PCA_4,PCA_5
0,200,6264,1,10,2,5,14,1,Russia,Degtyarsk,3,Android,ads,45,movie,-0.224609,0.157959,0.023315,0.306396,0.005329
1,200,4200,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.854004,0.125610,0.226196,0.000183,0.050446
2,200,3567,1,12,29,2,14,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.244629,-0.007030,-0.248291,0.002506,0.010040
3,200,3539,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,covid,0.259521,0.005848,0.072815,0.002659,-0.030136
4,200,994,1,12,29,2,15,1,Russia,Degtyarsk,3,Android,ads,45,politics,-0.112061,-0.312744,0.031342,-0.073486,0.387695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4888771,168552,2829,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.078918,-0.070984,-0.107605,-0.007092,-0.038239
4888772,168552,3205,0,12,20,0,18,1,Russia,Ivanteyevka,4,Android,organic,18,covid,0.342285,0.052765,0.101257,-0.011940,-0.025452
4888773,168552,4428,0,10,14,3,11,1,Russia,Ivanteyevka,4,Android,organic,18,movie,-0.087585,0.000636,-0.005936,-0.000200,-0.067505
4888774,168552,1229,0,12,7,1,18,1,Russia,Ivanteyevka,4,Android,organic,18,politics,-0.100342,-0.254639,0.014259,-0.039886,0.028564


In [7]:
df.user_id.nunique()

163205

In [8]:
X = df.drop(['target', 'user_id', 'post_id'], axis=1)
y = df.target

MODEL

In [9]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=42,
                                                    test_size=0.2)

In [10]:
cat_cols = X.select_dtypes(include='object').columns.to_list()
cat_cols

['gender', 'country', 'city', 'exp_group', 'os', 'source', 'topic']

In [11]:
# !pip install catboost
# !pip install scikit-learn
# !pip install ipywidgets
# !jupyter nbextension enable --py widgetsnbextension

In [12]:
# Check memory usage
print(f"Memory usage: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

del df

# Force garbage collection
gc.collect()

# Check memory usage again
print(f"Memory usage after gc: {psutil.Process().memory_info().rss / 1024 ** 2:.2f} MB")

Memory usage: 2995.51 MB
Memory usage after gc: 2995.51 MB


In [15]:
from catboost import CatBoostClassifier


catboost = CatBoostClassifier(
    eval_metric='AUC',
    random_seed=42,
    auto_class_weights="Balanced"
)

catboost.fit(X_train, y_train, cat_features=cat_cols, verbose=10)

Learning rate set to 0.352243
0:	total: 5.63s	remaining: 1h 33m 46s
100:	total: 8m 12s	remaining: 1h 13m 2s
200:	total: 16m 27s	remaining: 1h 5m 25s
300:	total: 25m 18s	remaining: 58m 45s
400:	total: 33m 43s	remaining: 50m 23s
500:	total: 42m 12s	remaining: 42m 2s
600:	total: 50m 45s	remaining: 33m 41s
700:	total: 59m 18s	remaining: 25m 17s
800:	total: 1h 7m 45s	remaining: 16m 50s
900:	total: 1h 16m 25s	remaining: 8m 23s
999:	total: 1h 24m 49s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7c51da6c7650>

In [16]:
# Predict class labels
y_pred = catboost.predict(X_test)

# Predict probabilities
y_pred_proba = catboost.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [17]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix


# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.6930594135960301
Precision: 0.6727211260075752
Recall: 0.7192412117747928
F1 Score: 0.6952038064866677
ROC-AUC Score: 0.766532549636372

Confusion Matrix:
 [[335382 166510]
 [133603 342261]]
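
The reported metrics can be recomputed by hand from the confusion matrix, which is a useful cross-check. scikit-learn prints true classes as rows and predicted classes as columns, so the first row is `[TN, FP]` and the second is `[FN, TP]`.

```python
# Values copied from the confusion matrix printed above.
tn, fp = 335382, 166510
fn, tp = 133603, 342261

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

ROC-AUC cannot be derived this way because it depends on the predicted probabilities, not only on the thresholded labels.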


In [20]:
feature_importance = catboost.get_feature_importance()
feature_names = catboost.feature_names_

# Collect feature names and their importances into one dataframe
fi_df = pd.DataFrame({'feature_name': feature_names,
                      'feature_importance': feature_importance})

In [21]:
print('FEATURE IMPORTANCE report')
fi_df.sort_values('feature_importance', ascending=False)

FEATURE IMPORTANCE report


Unnamed: 0,feature_name,feature_importance
1,day,14.672514
10,age_category,12.964909
0,month,9.930182
6,city,9.106883
3,hour_of_day,7.52064
7,exp_group,6.911694
16,PCA_5,6.856466
12,PCA_1,6.834706
15,PCA_4,6.447573
13,PCA_2,6.424823


In [22]:
catboost.save_model('catboost_model.cbm')

In [23]:
loaded_model = CatBoostClassifier()
loaded_model.load_model('catboost_model.cbm')

<catboost.core.CatBoostClassifier at 0x7c520de57650>

In [24]:
# Predict class labels
y_pred = loaded_model.predict(X_test)

# Predict probabilities
y_pred_proba = loaded_model.predict_proba(X_test)
y_pred_proba_positive = y_pred_proba[:, 1]  # Probability of class 1

In [25]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba_positive)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC-AUC Score:", roc_auc)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.6930594135960301
Precision: 0.6727211260075752
Recall: 0.7192412117747928
F1 Score: 0.6952038064866677
ROC-AUC Score: 0.766532549636372

Confusion Matrix:
 [[335382 166510]
 [133603 342261]]


In [26]:
# Use the model restored from disk to confirm that it matches the original one
feature_importance = loaded_model.get_feature_importance()
feature_names = loaded_model.feature_names_

# Collect feature names and their importances into one dataframe
fi_df = pd.DataFrame({'feature_name': feature_names,
                      'feature_importance': feature_importance})

In [27]:
print('FEATURE IMPORTANCE report')
fi_df.sort_values('feature_importance', ascending=False)

FEATURE IMPORTANCE report


Unnamed: 0,feature_name,feature_importance
1,day,14.672514
10,age_category,12.964909
0,month,9.930182
6,city,9.106883
3,hour_of_day,7.52064
7,exp_group,6.911694
16,PCA_5,6.856466
12,PCA_1,6.834706
15,PCA_4,6.447573
13,PCA_2,6.424823


### 6.5 - Experiments with a baseline model (top-n most-liked posts recommended to everyone)

In [4]:
def load_features(query: str) -> pd.DataFrame:
    features_df = batch_load_sql(query)
    return features_df

In [9]:
request = """
SELECT
  post_id,
  COUNT(*) as likes_cnt
FROM feed_data
WHERE action = 'like'
GROUP BY
  post_id
"""

In [10]:
most_likes_df = load_features(request)

In [11]:
most_likes_df

Unnamed: 0,post_id,likes_cnt
0,6114,640
1,4790,721
2,273,1119
3,3936,709
4,5468,668
...,...,...
6826,4827,2497
6827,7227,705
6828,790,1099
6829,2850,725


In [13]:
most_likes_df.sort_values('likes_cnt', ascending=False).head(20).post_id.to_list()

[1141,
 1634,
 1707,
 1883,
 1685,
 1857,
 1479,
 1508,
 1572,
 1760,
 1916,
 1686,
 1361,
 3404,
 1829,
 1776,
 1819,
 1461,
 1460,
 1902]

In [14]:
most_likes_df.sort_values('likes_cnt', ascending=False).head(20)

Unnamed: 0,post_id,likes_cnt
529,1141,2968
5875,1634,2963
651,1707,2952
851,1883,2929
4372,1685,2927
5156,1857,2925
4852,1479,2920
5394,1508,2920
5613,1572,2920
2534,1760,2918


In [15]:
# top-20 most-liked post IDs, copied from the output above
most_likes_posts = [1141, 1634, 1707, 1883, 1685, 1857, 1479, 1508, 1572, 1760,
                    1916, 1686, 1361, 3404, 1829, 1776, 1819, 1461, 1460, 1902]

In [24]:
user_likes = [1141, 1634, 1883]
limit = 5

[i for i in most_likes_posts if i not in user_likes][:limit]

[1707, 1685, 1857, 1479, 1508]
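
The baseline above can be wrapped in a small function. This is a minimal sketch, not the code used in the service: the name `recommend_most_liked` and its parameters are hypothetical, and the liked posts are converted to a set only to make the membership check cheap for users with many likes.

```python
from typing import Iterable, List


def recommend_most_liked(popular_posts: List[int],
                         already_liked: Iterable[int],
                         limit: int = 5) -> List[int]:
    """Return the first `limit` popular posts the user has not liked yet."""
    liked = set(already_liked)  # set membership is O(1) per lookup
    # popular_posts is assumed to be sorted by like count, descending
    return [post_id for post_id in popular_posts if post_id not in liked][:limit]


# Same inputs as in the cell above
most_likes_posts = [1141, 1634, 1707, 1883, 1685, 1857, 1479, 1508, 1572, 1760,
                    1916, 1686, 1361, 3404, 1829, 1776, 1819, 1461, 1460, 1902]
user_likes = [1141, 1634, 1883]

recommended = recommend_most_liked(most_likes_posts, user_likes, limit=5)
print(recommended)
```

With the inputs from the cell above this reproduces the same five post IDs, in the same popularity order.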

### 6.6 - Rec Sys service API

In [None]:
import os
import random
from typing import List

import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from fastapi import FastAPI
from schema import PostGet
from datetime import datetime

#print('starting script..')


engine = create_engine(
    "postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
    "postgres.lab.karpov.courses:6432/startml"
)

# step 5 - start
# getting path to a model
def get_model_path(path: str) -> str:
    # inside the LMS checker the model is read from a fixed location;
    # locally the path passed by the caller is used
    if os.environ.get("IS_LMS") == "1":
        return '/workdir/user_input/model'
    return path


# loading the model
def load_models():
    from catboost import CatBoostClassifier
    model_path = get_model_path("/Users/nikitin_a/PycharmProjects/l22_rec_sys/catboost_model.cbm")
    loaded_model = CatBoostClassifier()
    loaded_model.load_model(model_path)
    return loaded_model


# load the already trained model once at startup, so requests only run inference
model = load_models()
#print('model loaded..')


# step 6 - start
def batch_load_sql(query: str) -> pd.DataFrame:
    # read the result in chunks so that a large table is streamed
    # instead of being fetched in a single round trip
    CHUNKSIZE = 200000
    # reuse the module-level engine rather than creating a second one
    conn = engine.connect().execution_options(stream_results=True)
    chunks = []
    try:
        for chunk_dataframe in pd.read_sql(query, conn, chunksize=CHUNKSIZE):
            chunks.append(chunk_dataframe)
    finally:
        # release the connection even if loading a chunk fails
        conn.close()
    return pd.concat(chunks, ignore_index=True)


def load_users() -> pd.DataFrame:
    QUERY = 'SELECT * FROM "user_data"'
    loaded_users_df = batch_load_sql(QUERY)
    return loaded_users_df


def load_liked() -> pd.DataFrame:
    QUERY = "SELECT user_id, post_id FROM feed_data WHERE action = 'like'"
    loaded_likes_df = batch_load_sql(QUERY)
    return loaded_likes_df


# loading dataframe with features, likes and posts
users_df = load_users()
#print('users_df loaded..')

posts_df = pd.read_sql('SELECT * FROM "post_text_df"', con=engine)
#print('posts_df loaded..')

posts_df_pca = pd.read_sql('SELECT * FROM "post_text"', con=engine)
#print('posts_df_pca loaded..')

likes_df = load_liked()
#print('likes_df loaded..')


app = FastAPI()
#print('starting FastAPI endpoint..')

@app.get("/post/recommendations/", response_model=List[PostGet])
def recommended_posts(
        id: int,
        time: datetime,
        limit: int = 10) -> List[PostGet]:

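    # copy the slice so that adding a column does not modify (or warn about) users_df
    user_df = users_df[users_df['user_id'] == id].copy()

    # bucket age into categories labelled by the upper bound of each bucket;
    # this has to match the bucketing the model was trained with
    bins = [0, 18, 30, 45, 60, np.inf]
    labels = [18, 30, 45, 60, 99]
    user_df['age_category'] = pd.cut(user_df['age'], bins=bins, labels=labels, right=False)
    user_df = user_df.drop(columns=['age']).reset_index(drop=True)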

    #print(user_df.head())
    #print(user_df.shape)

    user_features_df = pd.DataFrame([{'timestamp': time}])
    user_features_df['month'] = user_features_df['timestamp'].dt.month
    user_features_df['day'] = user_features_df['timestamp'].dt.day
    user_features_df['day_of_week'] = user_features_df['timestamp'].dt.dayofweek
    user_features_df['hour_of_day'] = user_features_df['timestamp'].dt.hour

    user_features_df = pd.concat([user_features_df, user_df], axis=1)

    #print(user_features_df.head())
    #print(user_features_df.shape)

    user_features_df = user_features_df.merge(posts_df_pca, how='cross')
    user_features_df.drop(['timestamp'], axis=1, inplace=True)

    #print(user_features_df.head())
    #print(user_features_df.shape)

    # Define a dictionary to map columns to their optimized data types
    dtype_mapping = {
        'user_id': 'int32',        # int32 is sufficient for user IDs
        'post_id': 'int16',        # int16 is sufficient for post IDs (low thousands here)
        'target': 'int8',          # target is binary; not present at inference time
        'month': 'int8',           # month ranges from 1 to 12
        'day': 'int8',             # day ranges from 1 to 31
        'day_of_week': 'int8',     # day_of_week ranges from 0 to 6
        'hour_of_day': 'int8',     # hour_of_day ranges from 0 to 23
        'gender': 'object',        # gender is kept as object (treated as categorical)
        'country': 'object',       # country is categorical
        'city': 'object',          # city is categorical
        'exp_group': 'object',     # exp_group is kept as object (treated as categorical)
        'os': 'object',            # os is categorical
        'source': 'object',        # source is categorical
        'age_category': 'int8',    # bucket labels 18..99 fit in int8
        'topic': 'object',         # topic is categorical
        'PCA_1': 'float16',        # float16 is sufficient for PCA components
        'PCA_2': 'float16',        # float16 is sufficient for PCA components
        'PCA_3': 'float16',        # float16 is sufficient for PCA components
        'PCA_4': 'float16',        # float16 is sufficient for PCA components
        'PCA_5': 'float16'         # float16 is sufficient for PCA components
    }

    # Convert columns to optimized data types. Keys that are not columns of the
    # frame (such as 'target') are skipped, because astype raises a KeyError for them.
    dtype_mapping = {col: dtype for col, dtype in dtype_mapping.items()
                     if col in user_features_df.columns}
    user_features_df = user_features_df.astype(dtype_mapping)

    #print(user_features_df.head())
    #print(user_features_df.shape)
    #print(user_features_df.info(memory_usage='deep'))

    y_pred_proba = model.predict_proba(user_features_df.drop(['user_id', 'post_id'], axis=1))
    y_pred_proba_positive = y_pred_proba[:, 1]

    user_df = user_features_df
    user_df['probability'] = y_pred_proba_positive

    # debug output, disabled like the other prints to keep request latency down
    #print(user_df.sort_values('probability', ascending=False).drop_duplicates(subset='post_id', keep='first').head())
    #print(user_df.shape)

    # exclude posts the user has already liked, then rank the rest by predicted probability
    user_likes = likes_df[likes_df['user_id'] == id].post_id.to_list()
    filtered_df = user_df[~user_df['post_id'].isin(user_likes)]
    filtered_df = filtered_df.sort_values('probability', ascending=False) \
        .drop_duplicates(subset='post_id', keep='first')

    #print(filtered_df.head())
    #print(filtered_df.shape)

    top_posts_ids = filtered_df.head(limit).post_id.to_list()

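    if len(top_posts_ids) < limit:
        # pad with most-liked posts, skipping posts that are already liked or already selected
        most_likes_posts = [1141, 1634, 1707, 1883, 1685, 1857, 1479, 1508, 1572, 1760,
                            1916, 1686, 1361, 3404, 1829, 1776, 1819, 1461, 1460, 1902]
        excluded = set(user_likes) | set(top_posts_ids)
        most_likes_filt = [i for i in most_likes_posts if i not in excluded][:limit]
        # never ask random.sample for more items than are available
        random_items = min(limit - len(top_posts_ids), len(most_likes_filt))
        top_posts_ids.extend(random.sample(most_likes_filt, k=random_items))

    #print(top_posts_ids)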

    # checking random model
    #top_posts_ids = []
    #top_posts_ids.extend(random.sample(posts_df.post_id.to_list(), k=limit))

    # checking same posts for everyone
    # top_posts_ids = [i for i in range(1, len(top_posts_ids)+1)]
    # print(top_posts_ids)

    # checking most liked posts recommendation
    # most_likes_posts = [1141, 1634, 1707, 1883, 1685, 1857, 1479, 1508, 1572, 1760,
    #                     1916, 1686, 1361, 3404, 1829, 1776, 1819, 1461, 1460, 1902]
    # most_likes_filt = [i for i in most_likes_posts if i not in user_likes]
    # top_posts_ids = most_likes_filt[:limit]
    # print(top_posts_ids)

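    # select the recommended posts and restore the ranked order
    # (a plain membership filter returns them in posts_df order instead)
    top_posts_df = posts_df[posts_df['post_id'].isin(top_posts_ids)].copy()
    rank = {post_id: position for position, post_id in enumerate(top_posts_ids)}
    top_posts_df['rank'] = top_posts_df['post_id'].map(rank)
    top_posts_df = top_posts_df.sort_values('rank')
    top_posts_df['user_id'] = id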

    result = [
        PostGet(
            #id=row['user_id'],
            id=row['post_id'],
            text=row.get('text', ''),
            topic=row.get('topic', '')
        )
        for _, row in top_posts_df.iterrows()
    ]

    return result
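
The fallback step of the handler (padding the ranked list with most-liked posts when the model yields fewer than `limit` candidates) can be isolated into a pure function, which makes it easy to check without a database or a model. The sketch below is one possible variant, not the service code: the name `pad_recommendations` is hypothetical, and it fills the gap deterministically in popularity order, whereas the handler above uses `random.sample`.

```python
from typing import Iterable, List


def pad_recommendations(ranked_ids: List[int],
                        popular_ids: List[int],
                        liked_ids: Iterable[int],
                        limit: int) -> List[int]:
    """Return at most `limit` post IDs: ranked ones first, then popular ones."""
    result = list(ranked_ids)[:limit]
    # never recommend a post twice or a post the user has already liked
    excluded = set(liked_ids) | set(result)
    for post_id in popular_ids:
        if len(result) >= limit:
            break
        if post_id not in excluded:
            result.append(post_id)
            excluded.add(post_id)
    return result


# The model produced only two candidates, so two popular posts are appended
padded = pad_recommendations(
    ranked_ids=[10, 11],
    popular_ids=[1141, 10, 1634, 1707],
    liked_ids=[1141],
    limit=4,
)
print(padded)
```

If there are not enough popular posts left, the function returns a shorter list instead of raising, so the caller decides how to handle that case.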

In [None]:
### TODO!
# delete all connection credentials before pushing to git
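
One common way to handle this TODO is to read the connection settings from environment variables instead of hard-coding them. The sketch below is an assumption, not part of the project: the function `build_db_url` and the variable names `DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT` and `DB_NAME` are hypothetical and would need to match however the deployment actually provides its secrets.

```python
import os
from urllib.parse import quote_plus


def build_db_url() -> str:
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = os.environ["DB_USER"]          # required, raises KeyError if missing
    password = os.environ["DB_PASSWORD"]  # required
    name = os.environ["DB_NAME"]          # required
    host = os.environ.get("DB_HOST", "localhost")
    port = os.environ.get("DB_PORT", "5432")
    # escape characters such as '@' or ':' that would otherwise break the URL
    return (f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{name}")


# Intended usage in the service (not executed here):
# engine = create_engine(build_db_url())
```

With this in place the notebook and the service file contain no secrets, and the values can be supplied through the shell or an untracked `.env` file.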