# RePlay Tutorial
This notebook introduces the RePlay library, covering:
- data preprocessing
- data splitting
- model training and inference
- model optimization
- model saving and loading
- model comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import pandas as pd
from pyspark.sql.functions import rand

from replay.data_preparator import DataPreparator, Indexer
from replay.experiment import Experiment
from replay.metrics import Coverage, HitRate, NDCG, MAP
from replay.model_handler import save, load, save_indexer, load_indexer
from replay.models import ALSWrap, KNN, SLIM
from replay.session_handler import State
from replay.splitters import UserSplitter
from replay.utils import convert2spark

In [5]:
K = 5
SEED=1234

In [7]:
spark = State().session
spark.sparkContext.setLogLevel('ERROR')

In [8]:
spark

## 0. Data preprocessing <a name='data-preparator'></a>
We will use the MovieLens 1M dataset as an example.

In [9]:
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["userId", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

In [10]:
df.head(2)

| | userId | item_id | relevance | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |


In [11]:
users.head(2)

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |


### 0.1. DataPreparator

RePlay uses Spark DataFrames as its internal data format.

The interaction log must contain columns with user and item identifiers. The original identifiers should be named `user_id` and `item_id`. In section [0.2. Indexing](#indexing) they are converted to integer identifiers named `user_idx` and `item_idx`. The optional interaction log columns are ``relevance`` and ``timestamp``.

DataFrames with user or item features should have a `user_id` or `item_id` column, respectively.

The `DataPreparator` class converts pandas DataFrames to Spark format and preprocesses the data: it renames or creates the required and optional interaction log columns, checks for nulls, and parses dates. This step is optional. If your data is already in a Spark DataFrame, you can rename the columns listed above yourself, and you are confident in the completeness and quality of the data, skip it.

In [12]:
preparator = DataPreparator()

#### Interactions log preprocessing

In [13]:
log = preparator.transform(columns_mapping={'user_id': 'userId',
                                      'item_id': 'item_id',
                                      'relevance': 'relevance',
                                      'timestamp': 'timestamp'
                                     }, 
                           data=df)

01-Jul-22 20:02:24, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.
                                                                                

In [14]:
log.show(2)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2001-01-01 01:12:40|
|      1|    661|      3.0|2001-01-01 01:35:09|
+-------+-------+---------+-------------------+
only showing top 2 rows



In [15]:
log.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- relevance: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)



In [17]:
State().logger.info('*')

01-Jul-22 20:02:45, replay, INFO: *


As you can see, `userId` was renamed to `user_id` and `timestamp` was converted to `TimestampType`.
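To make these two changes concrete, here is a minimal plain-Python sketch of what the preparation amounts to for a single row: renaming columns through a mapping and parsing a Unix timestamp. It is an illustration only, not RePlay's implementation. The helper `prepare_row` is hypothetical, and the timestamp is parsed in UTC here, whereas the Spark output above appears to be rendered in a local time zone.

```python
from datetime import datetime, timezone

# Same direction as columns_mapping in the notebook: target name -> source name.
COLUMNS_MAPPING = {
    "user_id": "userId",
    "item_id": "item_id",
    "relevance": "relevance",
    "timestamp": "timestamp",
}


def prepare_row(raw_row, columns_mapping):
    """Hypothetical helper: rename columns and parse the Unix timestamp as UTC."""
    row = {target: raw_row[source] for target, source in columns_mapping.items()}
    row["relevance"] = float(row["relevance"])
    row["timestamp"] = datetime.fromtimestamp(row["timestamp"], tz=timezone.utc)
    return row


# First row of the ratings file shown earlier in the notebook.
raw = {"userId": 1, "item_id": 1193, "relevance": 5, "timestamp": 978300760}
prepared = prepare_row(raw, COLUMNS_MAPPING)
print(prepared)
```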

#### Feature dataframe preprocessing
You can also use `DataPreparator` to transform feature dataframes:

In [18]:
user_feat = preparator.transform(columns_mapping={'user_id': 'user_id'}, 
                           data=users)
user_feat.show(2)

01-Jul-22 20:02:49, replay, INFO: Column with ids of users or items is absent in mapping. The dataframe will be treated as a users'/items' features dataframe.


+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



Using `DataPreparator` is optional: you can convert a pandas DataFrame to Spark with ``convert2spark`` from ``replay.utils`` and rename the columns manually.

In [19]:
# the same result without DataPreparator
convert2spark(users).show(2)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



<a id='indexing'></a>
### 0.2. Indexing

RePlay models require the user and item identifier _(id)_ columns to be named `user_idx` and `item_idx`. These _ids_ must be integers starting at zero without gaps. This matters for models that use sparse matrices and derive the matrix size from the largest user and item index seen. Storing _ids_ as integers also reduces memory usage compared to string _ids_.
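A small sketch of why gaps matter, using the two raw item ids from the rows shown earlier. The dense encoding below assigns indexes in sorted order purely for illustration; it is not the ordering the `Indexer` uses.

```python
# Raw item ids from the first two log rows; they are sparse, not 0..n-1.
raw_item_ids = [1193, 661]

# A matrix indexed directly by raw id needs max(id) + 1 columns.
columns_with_raw_ids = max(raw_item_ids) + 1

# Re-encoding to consecutive integers keeps one column per distinct item.
dense = {item_id: idx for idx, item_id in enumerate(sorted(set(raw_item_ids)))}
columns_with_dense_ids = max(dense.values()) + 1

print(columns_with_raw_ids, columns_with_dense_ids)  # 1194 2
```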

Convert the user and item _ids_ in both the interaction log and the feature dataframes. RePlay provides the `Indexer` class to convert the _ids_ and to convert them back after recommendations are generated (predict). The `Indexer` stores label encoders for users and items, so it can also transform the ids of users and items that arrive after the `Indexer` was fitted.

In [20]:
indexer = Indexer(user_col='user_id', item_col='item_id')

Pass all available user and item ids from the log and the features to the `Indexer`. The _ids_ may repeat: indexes are assigned in order of label frequency, so the most frequent label gets index 0.
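The frequency-ordered encoding described here can be sketched in plain Python as follows. This is a toy stand-in, not the `Indexer` implementation: the function names are hypothetical, and breaking ties by the label value is an assumption made only to keep the example deterministic.

```python
from collections import Counter


def fit_index(labels):
    """Map each label to an index; the most frequent label gets index 0."""
    counts = Counter(labels)
    # Assumption: ties in frequency are broken by the label value itself.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def transform(labels, mapping):
    return [mapping[label] for label in labels]


def inverse_transform(indexes, mapping):
    reverse = {idx: label for label, idx in mapping.items()}
    return [reverse[idx] for idx in indexes]


user_ids = [10, 42, 42, 7, 42, 10]   # repeated ids are fine
mapping = fit_index(user_ids)         # {42: 0, 10: 1, 7: 2}
encoded = transform(user_ids, mapping)
decoded = inverse_transform(encoded, mapping)
print(mapping, encoded)
```

The round trip through `inverse_transform` mirrors how recommendations are mapped back to the original ids at the end of the notebook.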

In [21]:
%%time
indexer.fit(users=log.select('user_id').unionByName(user_feat.select('user_id')),
           items=log.select('item_id'))

                                                                                

CPU times: user 40.6 ms, sys: 11 ms, total: 51.6 ms
Wall time: 2.47 s


In [22]:
State().logger.info('*')

01-Jul-22 20:03:04, replay, INFO: *


In [23]:
%%time
log_replay = indexer.transform(df=log)
log_replay.show(2)

+--------+--------+---------+-------------------+
|user_idx|item_idx|relevance|          timestamp|
+--------+--------+---------+-------------------+
|    4131|      43|      5.0|2001-01-01 01:12:40|
|    4131|     585|      3.0|2001-01-01 01:35:09|
+--------+--------+---------+-------------------+
only showing top 2 rows

CPU times: user 46.5 ms, sys: 13.9 ms, total: 60.4 ms
Wall time: 1.53 s


In [24]:
%%time
user_feat_replay = indexer.transform(df=user_feat)
user_feat_replay.show(2)

+--------+------+---+----------+--------+
|user_idx|gender|age|occupation|zip_code|
+--------+------+---+----------+--------+
|    4131|     F|  1|        10|   48067|
|    2364|     M| 56|        16|   70072|
+--------+------+---+----------+--------+
only showing top 2 rows

CPU times: user 29.9 ms, sys: 8.17 ms, total: 38.1 ms
Wall time: 339 ms


In [25]:
State().logger.info('*')

01-Jul-22 20:03:12, replay, INFO: *


### 0.3. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems. Splitters return cached dataframes, so they are computed once and reused for model training, inference, and metric calculation.

`UserSplitter` moves ``item_test_size`` items for each of ``user_test_size`` users to the test dataset.
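A minimal sketch of this kind of split on a toy log, assuming interactions are `(user, item)` tuples. It is not the `UserSplitter` implementation: `user_split` is a hypothetical helper, and the cold user and cold item filtering is left out.

```python
import random


def user_split(log, user_test_size, item_test_size, seed):
    """Hold out item_test_size random interactions for user_test_size random users."""
    rng = random.Random(seed)
    by_user = {}
    for row in log:
        by_user.setdefault(row[0], []).append(row)

    test_users = set(rng.sample(sorted(by_user), user_test_size))
    train, test = [], []
    for user, rows in by_user.items():
        if user in test_users:
            shuffled = rows[:]
            rng.shuffle(shuffled)
            test.extend(shuffled[:item_test_size])
            train.extend(shuffled[item_test_size:])
        else:
            train.extend(rows)
    return train, test


# Toy log: 6 users with 4 interactions each.
log = [(user, item) for user in range(6) for item in range(4)]
train, test = user_split(log, user_test_size=2, item_test_size=3, seed=1234)
print(len(train), len(test))  # 18 6
```

With this scheme the test set holds `user_test_size * item_test_size` rows whenever every selected user has enough interactions, which matches the 2500 test rows (500 users, 5 items each) reported in the notebook.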

In [26]:
%%time
splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log_replay)
print(train.count(), test.count())

                                                                                

997709 2500
CPU times: user 31.4 ms, sys: 12.7 ms, total: 44.1 ms
Wall time: 9.82 s


In [27]:
State().logger.info('*')

01-Jul-22 20:03:26, replay, INFO: *


In [28]:
test.is_cached

True

## 1. Models training

#### SLIM

In [29]:
slim = SLIM(seed=SEED)

In [30]:
%%time
slim.fit(log=train)



CPU times: user 1.69 s, sys: 172 ms, total: 1.86 s
Wall time: 51 s


                                                                                

In [31]:
State().logger.info('*')

01-Jul-22 20:04:21, replay, INFO: *


In [32]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_idx').distinct(),
    log=train,
    filter_seen_items=True
)

                                                                                

CPU times: user 39 ms, sys: 26.6 ms, total: 65.6 ms
Wall time: 3.78 s


In [33]:
State().logger.info('*')

01-Jul-22 20:04:25, replay, INFO: *


In [26]:
recs.show(2)



+--------+--------+------------------+
|user_idx|item_idx|         relevance|
+--------+--------+------------------+
|     161|      25|  1.32220150668077|
|     161|      42|1.2345164154171402|
+--------+--------+------------------+
only showing top 2 rows
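The `predict` call above asks for `k` items per user and sets `filter_seen_items=True`. A minimal sketch of that selection step for one user, assuming the model has already produced a score per item; the scores below are made up and `top_k_unseen` is a hypothetical helper, not part of RePlay.

```python
import heapq


def top_k_unseen(scores, seen_items, k):
    """Return the k highest-scored (item, score) pairs not in seen_items."""
    candidates = (
        (item, score) for item, score in scores.items() if item not in seen_items
    )
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


scores = {10: 0.9, 11: 0.8, 12: 0.5, 13: 0.1}  # toy item scores for one user
seen = {11}                                     # items the user already interacted with
top = top_k_unseen(scores, seen, k=2)
print(top)  # [(10, 0.9), (12, 0.5)]
```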





## 2. Models evaluation

RePlay implements several popular recommender quality metrics. Use the metrics on their own, or calculate a chosen set of metrics and compare models with the ``Experiment`` class.
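For orientation, here is a plain-Python sketch of three of the metrics used in this section, following their common textbook definitions with binary relevance. RePlay's own implementations may differ in details, so treat the functions below as illustrative assumptions rather than the library's code.

```python
import math


def hit_rate_at_k(recommended, relevant, k):
    """1.0 if at least one of the top-k recommended items is relevant."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommended, catalog, k):
    """Share of catalog items that appear in any user's top-k list."""
    recommended_items = {item for recs in all_recommended for item in recs[:k]}
    return len(recommended_items & catalog) / len(catalog)


recommended = [3, 1, 7]   # ranked recommendations for one user
relevant = {1, 9}         # that user's test items
hit1 = hit_rate_at_k(recommended, relevant, k=1)
hit3 = hit_rate_at_k(recommended, relevant, k=3)
ndcg3 = ndcg_at_k(recommended, relevant, k=3)
cov2 = coverage_at_k([[3, 1, 7], [1, 2, 3]], catalog=set(range(1, 9)), k=2)
print(hit1, hit3, round(ndcg3, 4), cov2)
```

Per-user values such as `hit3` and `ndcg3` are typically averaged over all test users to get a single number per model.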

In [39]:
metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K],
                            Coverage(train): K
                           })

In [40]:
%%time
metrics.add_result("SLIM", recs)
metrics.results

                                                                                

CPU times: user 191 ms, sys: 138 ms, total: 328 ms
Wall time: 26.3 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM | 0.155963 | 0.272 | 0.548 | 0.1005 | 0.17213 |


## 3. Hyperparameters optimization

#### 3.1 Search

In [41]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

                                                                                

In [42]:
%%time
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=15)

[I 2022-07-01 16:38:46,082] A new study created in memory with name: no-name-53e3bf72-a34a-444b-a335-53c4aae324c2
[I 2022-07-01 16:40:26,612] Trial 0 finished with value: 0.18106487814533648 and parameters: {'beta': 0.01, 'lambda_': 0.01}. Best is trial 0 with value: 0.18106487814533648.
[I 2022-07-01 16:42:36,609] Trial 1 finished with value: 0.18370690579398508 and parameters: {'beta': 0.3933517988592923, 'lambda_': 2.8789378669505645e-06}. Best is trial 1 with value: 0.18370690579398508.
[I 2022-07-01 16:45:23,918] Trial 2 finished with value: 0.17667836348216465 and parameters: {'beta': 1.114924230260106e-06, 'lambda_': 6.623838963770643e-05}. Best is trial 1 with value: 0.18370690579398508.
[I 2022-07-01 16:46:40,399] Trial 3 finished with value: 0.18481367198384718 and parameters: {'beta': 0.0763700677939602, 'lambda_': 0.0005897851592959778}. Best is trial 3 with value: 0.18481367198384718.
[I 2022-07-01 16:48

CPU times: user 48.2 s, sys: 9.45 s, total: 57.6 s
Wall time: 30min


In [43]:
best_params

{'beta': 0.0763700677939602, 'lambda_': 0.0005897851592959778}
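The log above comes from an Optuna study run with a budget of 15 trials. To show the general shape of a budgeted search, the sketch below uses plain random search over a log-uniform space as a simple stand-in for Optuna's sampler. The search bounds, the toy objective, and the function names are all hypothetical; in the notebook the objective is NDCG@K measured on `val_opt`.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(objective, space, budget, seed):
    """Evaluate `budget` random parameter sets and keep the best one."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        params = {
            name: log_uniform(rng, low, high) for name, (low, high) in space.items()
        }
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    # Made-up function that peaks at beta = 0.1 and lambda_ = 0.001.
    return (
        -(math.log10(params["beta"]) + 1) ** 2
        - (math.log10(params["lambda_"]) + 3) ** 2
    )


# Hypothetical bounds, chosen only for this illustration.
space = {"beta": (1e-6, 1.0), "lambda_": (1e-6, 1.0)}
best_params, best_value = random_search(toy_objective, space, budget=15, seed=1234)
print(best_params, best_value)
```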

#### 3.2 Compare with previous

In [44]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(log=train)

    recs = model.predict(
        k=K,
        users=test.select('user_idx').distinct(),
        log=train,
        filter_seen_items=True
    )

    experiment.add_result(name, recs)
    return recs

In [51]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, 'SLIM_optimized')
recs.cache() #caching for further processing
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 2.38 s, sys: 1.1 s, total: 3.48 s
Wall time: 3min 41s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.149757 | 0.274 | 0.548 | 0.100607 | 0.172422 |
| SLIM | 0.155963 | 0.272 | 0.548 | 0.1005 | 0.17213 |


## 4. Getting final recommendations 

### Return to original user and item identifiers

In [52]:
%%time
recs = indexer.inverse_transform(recs)
recs.show(2)



+-------+-------+-----------------+
|item_id|user_id|        relevance|
+-------+-------+-----------------+
|    356|    509| 1.31493935041048|
|   3175|    509|1.205293743165952|
+-------+-------+-----------------+
only showing top 2 rows

CPU times: user 1.3 s, sys: 573 ms, total: 1.87 s
Wall time: 16.3 s


                                                                                

### Convert to pandas or save

In [53]:
recs_pd = recs.toPandas()
recs_pd.head(2)

| | item_id | user_id | relevance |
|---|---|---|---|
| 0 | 356 | 509 | 1.314939 |
| 1 | 3175 | 509 | 1.205294 |


In [65]:
%%time
recs.write.parquet(path='./slim_recs.parquet', mode='overwrite')



CPU times: user 47.1 ms, sys: 54.7 ms, total: 102 ms
Wall time: 16.6 s


                                                                                

## 4. Save and load

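RePlay lets you save and load fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder with all the necessary parameters and data. The fitted `Indexer` is saved and restored separately with `save_indexer` and `load_indexer`.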
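As a rough picture of what "a folder with parameters and data" can look like, here is a self-contained sketch that writes a toy model to a directory and reads it back. The file names, the JSON format, and the `save_model` and `load_model` helpers are hypothetical and are not RePlay's actual on-disk layout.

```python
import json
import tempfile
from pathlib import Path


def save_model(params, item_weights, path):
    """Write hyperparameters and fitted data as separate files in one folder."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "params.json").write_text(json.dumps(params))
    (folder / "weights.json").write_text(json.dumps(item_weights))


def load_model(path):
    """Read back what save_model wrote."""
    folder = Path(path)
    params = json.loads((folder / "params.json").read_text())
    item_weights = json.loads((folder / "weights.json").read_text())
    return params, item_weights


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_model"
    # Toy values, not taken from the notebook's results.
    save_model({"beta": 0.5, "lambda_": 0.25}, {"0": [0.5, 0.1]}, target)
    saved_files = sorted(p.name for p in target.iterdir())
    loaded_params, loaded_weights = load_model(target)

print(saved_files, loaded_params)
```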

In [96]:
%%time
save_indexer(indexer, './indexer_ml1')
indexer = load_indexer('./indexer_ml1')

CPU times: user 556 ms, sys: 263 ms, total: 819 ms
Wall time: 6.66 s


In [55]:
%%time
save(slim, path='./slim_best_params')
slim_loaded = load('./slim_best_params')

                                                                                

In [57]:
slim_loaded.beta, slim_loaded.lambda_

(0.0763700677939602, 0.0005897851592959778)

In [56]:
%%time
pred_from_loaded = slim_loaded.predict(k=K,
    users=test.select('user_idx').distinct(),
    log=train,
    filter_seen_items=True)
pred_from_loaded.show(2)



+--------+--------+------------------+
|user_idx|item_idx|         relevance|
+--------+--------+------------------+
|     161|      25|1.2804809774833301|
|     161|      42|1.1790474873064987|
+--------+--------+------------------+
only showing top 2 rows

CPU times: user 99.5 ms, sys: 67.1 ms, total: 167 ms
Wall time: 14.9 s


                                                                                

In [98]:
%%time
recs = indexer.inverse_transform(pred_from_loaded)
recs.show(2)



+-------+-------+------------------+
|user_id|item_id|         relevance|
+-------+-------+------------------+
|    509|    356|1.2804809774833301|
|    509|   3175|1.1790474873064987|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 562 ms, sys: 225 ms, total: 786 ms
Wall time: 8.05 s


                                                                                

## 5. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [62]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, 'ALS')
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 1.13 s, sys: 1.39 s, total: 2.52 s
Wall time: 5min 27s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.149757 | 0.274 | 0.548 | 0.100607 | 0.172422 |
| SLIM | 0.155963 | 0.272 | 0.548 | 0.1005 | 0.17213 |
| ALS | 0.192121 | 0.218 | 0.532 | 0.09006 | 0.15834 |


#### KNN
A commonly used item-based recommender.

In [63]:
%%time
recs = fit_predict_evaluate(KNN(num_neighbours=100), metrics, 'KNN')
metrics.results.sort_values('NDCG@5', ascending=False)

                                                                                

CPU times: user 572 ms, sys: 1.08 s, total: 1.66 s
Wall time: 6min 3s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.149757 | 0.274 | 0.548 | 0.100607 | 0.172422 |
| SLIM | 0.155963 | 0.272 | 0.548 | 0.1005 | 0.17213 |
| ALS | 0.192121 | 0.218 | 0.532 | 0.09006 | 0.15834 |
| KNN | 0.051268 | 0.176 | 0.376 | 0.059327 | 0.107508 |


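## 6. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them and pass them to ``Experiment``. In the cell below, the KNN recommendations from the previous section stand in for an external model's output, which is why the `my_model` row matches the `KNN` row.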

In [64]:
metrics.add_result("my_model", recs)
metrics.results.sort_values("NDCG@5", ascending=False)

                                                                                

| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.149757 | 0.274 | 0.548 | 0.100607 | 0.172422 |
| SLIM | 0.155963 | 0.272 | 0.548 | 0.1005 | 0.17213 |
| ALS | 0.192121 | 0.218 | 0.532 | 0.09006 | 0.15834 |
| KNN | 0.051268 | 0.176 | 0.376 | 0.059327 | 0.107508 |
| my_model | 0.051268 | 0.176 | 0.376 | 0.059327 | 0.107508 |
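Recommendations from an external source presumably need the same shape as the `recs` output shown earlier in this notebook (`user_idx`, `item_idx`, `relevance`), with the same indexes as the test set. The sketch below reads such a file and keeps the top `k` rows per user. The CSV content and the `read_top_k` helper are made up for illustration, and converting the result to a Spark DataFrame before calling `add_result` is left out.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export from another recommender.
EXTERNAL_CSV = """user_idx,item_idx,relevance
0,10,0.9
0,11,0.4
0,12,0.7
1,10,0.2
1,13,0.8
"""


def read_top_k(csv_text, k):
    """Parse (user_idx, item_idx, relevance) rows and keep the k best per user."""
    per_user = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_user[int(row["user_idx"])].append(
            (int(row["item_idx"]), float(row["relevance"]))
        )
    rows = []
    for user, items in sorted(per_user.items()):
        items.sort(key=lambda pair: pair[1], reverse=True)
        rows.extend((user, item, score) for item, score in items[:k])
    return rows


external_recs = read_top_k(EXTERNAL_CSV, k=2)
print(external_recs)
```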
