# RePlay Tutorial
This notebook introduces the RePlay library and covers:
- data preprocessing
- re-indexing of dataset users and items
- data splitting
- model training and inference
- model optimization
- model saving and loading
- model comparison

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%config Completer.use_jedi = False

In [3]:
import warnings
from optuna.exceptions import ExperimentalWarning
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=ExperimentalWarning)

In [4]:
import pandas as pd

from replay.data_preparator import DataPreparator, Indexer
from replay.experiment import Experiment
from replay.metrics import Coverage, HitRate, NDCG, MAP
from replay.model_handler import save, load, save_indexer, load_indexer
from replay.models import ALSWrap, KNN, SLIM
from replay.session_handler import State
from replay.splitters import UserSplitter
from replay.utils import convert2spark, get_log_info

In [5]:
K = 5
SEED=1234

In [6]:
spark = State().session
spark.sparkContext.setLogLevel('ERROR')

22/07/04 16:30:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/07/04 16:30:30 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
22/07/04 16:30:30 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
22/07/04 16:30:30 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
22/07/04 16:30:30 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [7]:
spark

## 0. Data preprocessing <a name='data-preparator'></a>
We will use MovieLens 1m as an example.

In [8]:
df = pd.read_csv("data/ml1m_ratings.dat", sep="\t", names=["userId", "item_id", "relevance", "timestamp"])
users = pd.read_csv("data/ml1m_users.dat", sep="\t", names=["user_id", "gender", "age", "occupation", "zip_code"])

In [9]:
df.head(2)

| | userId | item_id | relevance | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |


In [10]:
users.head(2)

| | user_id | gender | age | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |


### 0.1. DataPreparator

RePlay's internal data format is a Spark DataFrame.

An interaction log must contain columns with user and item identifiers. The original identifiers should be named `user_id` and `item_id`. In section [0.3. Indexing](#indexing) they are converted to integer identifiers named `user_idx` and `item_idx`. Optional columns of the interaction log are ``relevance`` and the interaction ``timestamp``.

DataFrames with user or item features should have a `user_id` or `item_id` column respectively.

The `DataPreparator` class converts pandas DataFrames to Spark format and preprocesses the data: it renames or creates the required and optional interaction log columns, checks for nulls, and parses dates. This step is optional: if your data is already a Spark DataFrame, the columns are named as described above, and you are confident in the completeness and quality of the data, you can skip it.
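
To make these preprocessing steps concrete, here is a minimal pure-Python sketch of renaming columns by a mapping, rejecting nulls, and parsing Unix timestamps. `prepare_rows` is a hypothetical helper written for illustration, not part of RePlay, and it assumes timestamps are seconds since the epoch in UTC.

```python
from datetime import datetime, timezone

def prepare_rows(rows, columns_mapping):
    """Hypothetical helper: rename columns, check for nulls, parse timestamps.

    `columns_mapping` maps target names to source names, mirroring the
    argument of the same name used in this tutorial.
    """
    prepared = []
    for row in rows:
        new_row = {}
        for target, source in columns_mapping.items():
            value = row[source]
            if value is None:
                raise ValueError(f"null value in column {source!r}")
            new_row[target] = value
        if "timestamp" in new_row:
            # Assumption: integer timestamps are seconds since the epoch (UTC).
            new_row["timestamp"] = datetime.fromtimestamp(
                new_row["timestamp"], tz=timezone.utc
            )
        if "relevance" in new_row:
            new_row["relevance"] = float(new_row["relevance"])
        prepared.append(new_row)
    return prepared

raw = [{"userId": 1, "item_id": 1193, "relevance": 5, "timestamp": 978300760}]
mapping = {"user_id": "userId", "item_id": "item_id",
           "relevance": "relevance", "timestamp": "timestamp"}
prepared = prepare_rows(raw, mapping)
print(prepared[0]["timestamp"].isoformat())
```

The real class works on Spark DataFrames, so treat this only as a picture of the transformation, not of the implementation.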

In [11]:
preparator = DataPreparator()

#### Interactions log preprocessing

In [12]:
%%time
log = preparator.transform(columns_mapping={'user_id': 'userId',
                                      'item_id': 'item_id',
                                      'relevance': 'relevance',
                                      'timestamp': 'timestamp'
                                     }, 
                           data=df)

04-Jul-22 16:30:32, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


CPU times: user 51.8 ms, sys: 44.2 ms, total: 96.1 ms
Wall time: 4.94 s


In [13]:
log.show(2)

+-------+-------+---------+-------------------+
|user_id|item_id|relevance|          timestamp|
+-------+-------+---------+-------------------+
|      1|   1193|      5.0|2000-12-31 22:12:40|
|      1|    661|      3.0|2000-12-31 22:35:09|
+-------+-------+---------+-------------------+
only showing top 2 rows



In [14]:
log.printSchema()

root
 |-- user_id: long (nullable = true)
 |-- item_id: long (nullable = true)
 |-- relevance: double (nullable = true)
 |-- timestamp: timestamp (nullable = true)



In [16]:
get_log_info(log, user_col='user_id', item_col='item_id')

                                                                                

'total lines: 1000209, total users: 6040, total items: 3706'

As you can see, `userId` was renamed to `user_id` and `timestamp` was converted to `TimestampType`.

#### Feature dataframe preprocessing
You can also use `DataPreparator` to transform feature dataframes:

In [17]:
user_feat = preparator.transform(columns_mapping={'user_id': 'user_id'}, 
                           data=users)
user_feat.show(2)

04-Jul-22 16:31:00, replay, INFO: Column with ids of users or items is absent in mapping. The dataframe will be treated as a users'/items' features dataframe.


+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



Using `DataPreparator` is optional: you can convert a pandas DataFrame to Spark with ``convert2spark`` from ``replay.utils`` and rename the columns manually.

In [18]:
# the same result without DataPreparator
convert2spark(users).show(2)

+-------+------+---+----------+--------+
|user_id|gender|age|occupation|zip_code|
+-------+------+---+----------+--------+
|      1|     F|  1|        10|   48067|
|      2|     M| 56|        16|   70072|
+-------+------+---+----------+--------+
only showing top 2 rows



### 0.2. Filtering
It is common to filter the interaction log by interaction date or rating value, or to remove items or users with a small number of interactions. RePlay provides several filters in the `replay.filters` module.
Here we keep ratings greater than or equal to 3 and remove users with 4 or fewer interactions.

In [19]:
from replay.filters import filter_by_min_count, filter_out_low_ratings

In [20]:
log = filter_out_low_ratings(log, value=3)
get_log_info(log, user_col='user_id', item_col='item_id')

                                                                                

'total lines: 836478, total users: 6039, total items: 3628'

In [21]:
%%time
log = filter_by_min_count(log, num_entries=5, group_by='user_id')
get_log_info(log, user_col='user_id', item_col='item_id')

04-Jul-22 16:31:29, replay, INFO: current threshold removes 1.1954887038272376e-06% of data

CPU times: user 23.6 ms, sys: 7.31 ms, total: 30.9 ms
Wall time: 9.08 s


                                                                                

'total lines: 836477, total users: 6038, total items: 3628'
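
The two filters used above can be pictured with a small pure-Python sketch on toy data. The helper names `filter_low_ratings` and `filter_min_count` are hypothetical and only mimic the behaviour described in this section (keep ratings at or above a threshold, keep users with at least `num_entries` interactions).

```python
from collections import Counter

def filter_low_ratings(log, value):
    """Keep interactions whose relevance is at least `value`."""
    return [row for row in log if row[2] >= value]

def filter_min_count(log, num_entries, key_index=0):
    """Keep rows whose key (the user by default) occurs at least `num_entries` times."""
    counts = Counter(row[key_index] for row in log)
    return [row for row in log if counts[row[key_index]] >= num_entries]

# Toy (user_id, item_id, relevance) triples.
toy_log = [
    ("u1", "i1", 5), ("u1", "i2", 4), ("u1", "i3", 2),
    ("u2", "i1", 3), ("u3", "i2", 1),
]
kept = filter_low_ratings(toy_log, value=3)
kept = filter_min_count(kept, num_entries=2)
print(kept)  # [('u1', 'i1', 5), ('u1', 'i2', 4)]
```

Note that the order of the filters matters: counting interactions after the rating filter can remove users who would have passed the count threshold on the unfiltered log.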

<a id='indexing'></a>
### 0.3. Indexing

RePlay models require the columns with user and item identifiers _(ids)_ to be named `user_idx` and `item_idx`. These _ids_ should be integers starting at zero without gaps. This is important for models that use sparse matrices and define the matrix size by the largest seen user and item index. Storing _ids_ as integers also helps to reduce memory usage compared to string _ids_.

You should convert user and item _ids_ in both the interaction log and the feature dataframes. RePlay offers the Indexer class to perform the _ids_ conversion and to convert them back after recommendations are generated (predict). The Indexer stores label encoders for users and items and can also transform ids of users and items that appear after the Indexer was fit.

In [22]:
indexer = Indexer(user_col='user_id', item_col='item_id')

Collect all available user and item ids from the log and the feature dataframes and pass them to the Indexer. The _ids_ may repeat; indexes are assigned in order of label frequency, so the most frequent label gets index 0.

In [23]:
%%time
indexer.fit(users=log.select('user_id').unionByName(user_feat.select('user_id')),
           items=log.select('item_id'))



CPU times: user 38.6 ms, sys: 1.85 ms, total: 40.5 ms
Wall time: 4.29 s
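
The frequency-ordered encoding described above can be sketched in plain Python. `FrequencyIndexer` is a hypothetical class used only for illustration; in particular, breaking ties by label value is an assumption of this sketch and not a documented property of the RePlay Indexer.

```python
from collections import Counter

class FrequencyIndexer:
    """Hypothetical label encoder: the most frequent label gets index 0."""

    def fit(self, labels):
        counts = Counter(labels)
        # Sort by descending frequency; ties are broken by label (an assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        self.label_to_idx = {label: idx for idx, label in enumerate(ordered)}
        self.idx_to_label = ordered
        return self

    def transform(self, labels):
        return [self.label_to_idx[label] for label in labels]

    def inverse_transform(self, indexes):
        return [self.idx_to_label[idx] for idx in indexes]

user_ids = [33, 7, 33, 1200, 7, 33]
indexer_sketch = FrequencyIndexer().fit(user_ids)
encoded = indexer_sketch.transform(user_ids)
decoded = indexer_sketch.inverse_transform(encoded)
print(encoded)  # [0, 1, 0, 2, 1, 0]
```

The resulting indexes are contiguous integers starting at zero, which is what sparse-matrix-based models need, and `inverse_transform` recovers the original ids.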


                                                                                

In [24]:
%%time
log_replay = indexer.transform(df=log)
log_replay.show(2)



+--------+--------+---------+-------------------+
|user_idx|item_idx|relevance|          timestamp|
+--------+--------+---------+-------------------+
|    1650|      42|      5.0|2000-12-11 09:29:40|
|    1650|     242|      4.0|2000-12-11 09:38:29|
+--------+--------+---------+-------------------+
only showing top 2 rows

CPU times: user 47.1 ms, sys: 10.6 ms, total: 57.7 ms
Wall time: 5.4 s


                                                                                

In [25]:
%%time
user_feat_replay = indexer.transform(df=user_feat)
user_feat_replay.show(2)

+--------+------+---+----------+--------+
|user_idx|gender|age|occupation|zip_code|
+--------+------+---+----------+--------+
|    3861|     F|  1|        10|   48067|
|    2301|     M| 56|        16|   70072|
+--------+------+---+----------+--------+
only showing top 2 rows

CPU times: user 14.3 ms, sys: 10.3 ms, total: 24.6 ms
Wall time: 704 ms


### 0.4. Split

RePlay provides data splitters that reproduce validation schemes widely used in recommender systems. Splitters return cached dataframes, so they are computed once and re-used for model training, inference and metrics calculation.

`UserSplitter` moves ``item_test_size`` items for each of ``user_test_size`` users to the test dataset.

In [26]:
%%time
splitter = UserSplitter(
    drop_cold_items=True,
    drop_cold_users=True,
    item_test_size=K,
    user_test_size=500,
    seed=SEED,
    shuffle=True
)
train, test = splitter.split(log_replay)
print(train.count(), test.count())

                                                                                

833977 2500
CPU times: user 40.8 ms, sys: 11.9 ms, total: 52.7 ms
Wall time: 14.4 s


In [27]:
test.is_cached

True
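
The split above holds out K = 5 items for each of 500 users, which is why the test set has 2500 rows. A minimal sketch of the same idea in plain Python follows; `user_split` is a hypothetical function, and it deliberately omits the cold user and cold item handling that `drop_cold_users` and `drop_cold_items` control.

```python
import random
from collections import defaultdict

def user_split(log, item_test_size, user_test_size, seed):
    """Hypothetical splitter: hold out items for a random subset of users.

    `log` is a list of (user, item) pairs. For `user_test_size` randomly
    chosen users, `item_test_size` randomly chosen interactions go to test.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, item in log:
        by_user[user].append(item)

    test_users = set(rng.sample(sorted(by_user), user_test_size))
    train, test = [], []
    for user in sorted(by_user):
        items = by_user[user][:]
        if user in test_users:
            rng.shuffle(items)  # random choice of held-out items
            test.extend((user, item) for item in items[:item_test_size])
            train.extend((user, item) for item in items[item_test_size:])
        else:
            train.extend((user, item) for item in items)
    return train, test

toy_log = [(u, i) for u in range(10) for i in range(8)]
toy_train, toy_test = user_split(toy_log, item_test_size=3, user_test_size=4, seed=1234)
print(len(toy_train), len(toy_test))  # 68 12
```

Fixing the seed makes the split reproducible, which matters when several models are compared on the same test set.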

## 1. Models training

#### SLIM

In [28]:
slim = SLIM(seed=SEED)

In [29]:
%%time
slim.fit(log=train)



CPU times: user 1.29 s, sys: 87.1 ms, total: 1.38 s
Wall time: 18.1 s




In [30]:
%%time

recs = slim.predict(
    k=K,
    users=test.select('user_idx').distinct(),
    log=train,
    filter_seen_items=True
)

04-Jul-22 16:32:37, replay, INFO: This model can't predict cold items, they will be ignored


CPU times: user 23.1 ms, sys: 3.08 ms, total: 26.1 ms
Wall time: 2.89 s
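
The `k` and `filter_seen_items` arguments of `predict` can be read as "take the top-k scored items, skipping those the user has already interacted with". A toy sketch of that selection, with a hypothetical `top_k_unseen` helper and made-up scores:

```python
def top_k_unseen(scores, seen_items, k):
    """Return the k highest-scored items the user has not interacted with."""
    candidates = [(item, score) for item, score in scores.items()
                  if item not in seen_items]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Made-up relevance scores for one user.
scores = {"i1": 0.9, "i2": 0.8, "i3": 0.7, "i4": 0.1}
top = top_k_unseen(scores, seen_items={"i1"}, k=2)
print(top)  # [('i2', 0.8), ('i3', 0.7)]
```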


In [None]:
recs.show(2)



## 2. Models evaluation

RePlay implements several popular recommendation quality metrics. You can use individual metrics directly, or calculate a chosen set of metrics and compare models with the ``Experiment`` class.

In [None]:
metrics = Experiment(test, {NDCG(): K,
                            MAP() : K,
                            HitRate(): [1, K],
                            Coverage(train): K
                           })

In [None]:
%%time
metrics.add_result("SLIM", recs)
metrics.results
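
For intuition about what these metrics measure, here is a pure-Python sketch of HitRate@k, NDCG@k and Coverage@k for toy data. The definitions below are common textbook forms with binary relevance; they are assumptions of this sketch, and RePlay's exact normalization may differ.

```python
import math

def hit_rate_at_k(recommended, relevant, k):
    """1.0 if any of the top-k recommended items is relevant, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (an assumption of this sketch)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def coverage_at_k(all_recommended, catalog, k):
    """Share of catalog items that appear in at least one top-k list."""
    shown = {item for recs in all_recommended for item in recs[:k]}
    return len(shown & set(catalog)) / len(catalog)

recs_toy = ["a", "b", "c"]   # ranked recommendations for one user
truth = {"a", "d"}           # items the user interacted with in test
print(hit_rate_at_k(recs_toy, truth, k=1), ndcg_at_k(recs_toy, truth, k=3))
```

Per-user values like these are averaged over all test users to get a single number per model.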

## 3. Hyperparameters optimization

#### 3.1 Search

In [None]:
# data split for hyperparameters optimization
train_opt, val_opt = splitter.split(train)

In [35]:
%%time
best_params = slim.optimize(train_opt, val_opt, criterion=NDCG(), k=K, budget=15)

[I 2022-07-04 16:34:31,527] Trial 1 finished with value: 0.13122731248619673 and parameters: {'beta': 0.5448504249427771, 'lambda_': 0.5222474594354795}. Best is trial 0 with value: 0.17994896373696406.
04-Jul-22 16:34:42, replay, INFO: This model can't predict cold items, they will be ignored
[I 2022-07-04 16:34:55,339] Trial 2 finished with value: 0.1711360902472643 and parameters: {'beta': 2.3570008528335234e-05, 'lambda_': 0.06060884085743946}. Best is trial 0 with value: 0.17994896373696406.
04-Jul-22 16:35:05, replay, INFO: This model can't predict cold items, they will be ignored
[I 2022-07-04 16:35:17,379] Trial 3 finished with value: 0.16711672386011686 and parameters: {'beta': 0.018465462835785512, 'lambda_': 0.09411386761886086}. Best is trial 0 with value: 0.17994896373696406.
04-Jul-22 16:35:30, replay, INFO: This model can't predict cold items, they will be ignored
[I 2022-07-04 16:35:49,074] Trial 4 finished with value: 0.1

CPU times: user 20.5 s, sys: 1.55 s, total: 22.1 s
Wall time: 7min 39s


In [36]:
best_params

{'beta': 0.22493478895964944, 'lambda_': 0.0003740072564484256}
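
The search loop behind `optimize` can be pictured as "sample parameters, score them on the validation split, keep the best". The sketch below uses plain random search rather than the Optuna sampler that produced the log above, and both the sampling bounds and `toy_objective` are made-up stand-ins for fitting on `train_opt` and scoring NDCG on `val_opt`.

```python
import math
import random

def random_search(objective, budget, seed):
    """Evaluate `budget` random parameter sets and return the best one."""
    rng = random.Random(seed)
    best_params, best_value = None, -math.inf
    for _ in range(budget):
        # Log-uniform sampling over [1e-6, 5]; the bounds are assumptions.
        params = {
            "beta": 10 ** rng.uniform(-6, math.log10(5)),
            "lambda_": 10 ** rng.uniform(-6, math.log10(5)),
        }
        value = objective(**params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value

def toy_objective(beta, lambda_):
    """Made-up score surface with its maximum at beta=0.1, lambda_=1e-3."""
    return -((math.log10(beta) + 1) ** 2) - (math.log10(lambda_) + 3) ** 2

found_params, found_value = random_search(toy_objective, budget=15, seed=1234)
print(found_params, found_value)
```

Sampling on a log scale is a common choice for regularization strengths, because useful values often span several orders of magnitude, as the trial log above also suggests.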

#### 3.2 Compare with previous

In [37]:
def fit_predict_evaluate(model, experiment, name):
    model.fit(log=train)

    recs = model.predict(
        k=K,
        users=test.select('user_idx').distinct(),
        log=train,
        filter_seen_items=True
    )

    experiment.add_result(name, recs)
    return recs

In [38]:
%%time
recs = fit_predict_evaluate(SLIM(**best_params, seed=SEED), metrics, 'SLIM_optimized')
recs.cache() #caching for further processing
metrics.results.sort_values('NDCG@5', ascending=False)

04-Jul-22 16:41:26, replay, INFO: This model can't predict cold items, they will be ignored

CPU times: user 1.52 s, sys: 130 ms, total: 1.65 s
Wall time: 46.9 s




| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.130926 | 0.25 | 0.556 | 0.106327 | 0.178962 |
| SLIM | 0.149394 | 0.25 | 0.562 | 0.10512 | 0.178352 |


## 4. Getting final recommendations 

### Return to original user and item identifiers

In [39]:
%%time
recs = indexer.inverse_transform(recs)
recs.show(2)



+-------+-------+------------------+
|user_id|item_id|         relevance|
+-------+-------+------------------+
|     33|   1036|0.9192236214439711|
|     33|    457|0.8906340395289785|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 390 ms, sys: 99.9 ms, total: 490 ms
Wall time: 10.2 s


                                                                                

### Convert to pandas or save

In [40]:
recs_pd = recs.toPandas()
recs_pd.head(2)

| | user_id | item_id | relevance |
|---|---|---|---|
| 0 | 33 | 1036 | 0.919224 |
| 1 | 33 | 457 | 0.890634 |


In [41]:
%%time
recs.write.parquet(path='./slim_recs.parquet', mode='overwrite')



CPU times: user 9.33 ms, sys: 1.28 ms, total: 10.6 ms
Wall time: 2.03 s




## 4. Save and load

RePlay allows saving and loading fitted models with the `save` and `load` functions of the `model_handler` module. A model is saved as a folder containing all necessary parameters and data.
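
As a rough picture of "a folder with parameters", the sketch below writes a parameter dictionary to a JSON file inside a directory and reads it back. The helper names and the file name `init_args.json` are hypothetical choices for this illustration; the actual on-disk layout used by `save` is not described here and may differ.

```python
import json
import tempfile
from pathlib import Path

def save_params(params, path):
    """Hypothetical saver: write parameters into a folder as JSON."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "init_args.json").write_text(json.dumps(params))

def load_params(path):
    """Hypothetical loader: read the parameters back from the folder."""
    return json.loads((Path(path) / "init_args.json").read_text())

with tempfile.TemporaryDirectory() as tmp:
    original = {"beta": 0.22493478895964944, "lambda_": 0.0003740072564484256}
    save_params(original, Path(tmp) / "slim_sketch")
    restored = load_params(Path(tmp) / "slim_sketch")

print(restored == original)  # True
```

A fitted model also needs its learned data (for example, similarity matrices), which is why a whole folder rather than a single file is used.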

In [42]:
%%time
save_indexer(indexer, './indexer_ml1')
indexer = load_indexer('./indexer_ml1')

CPU times: user 362 ms, sys: 73.8 ms, total: 436 ms
Wall time: 2.4 s


In [43]:
%%time
save(slim, path='./slim_best_params')
slim_loaded = load('./slim_best_params')

                                                                                

CPU times: user 62.1 ms, sys: 26.5 ms, total: 88.7 ms
Wall time: 11.3 s


In [44]:
slim_loaded.beta, slim_loaded.lambda_

(0.22493478895964944, 0.0003740072564484256)

In [45]:
%%time
pred_from_loaded = slim_loaded.predict(k=K,
    users=test.select('user_idx').distinct(),
    log=train,
    filter_seen_items=True)
pred_from_loaded.show(2)

04-Jul-22 16:42:27, replay, INFO: This model can't predict cold items, they will be ignored

+--------+--------+------------------+
|user_idx|item_idx|         relevance|
+--------+--------+------------------+
|     619|      46| 0.893503728709634|
|     619|      18|0.8330520553824681|
+--------+--------+------------------+
only showing top 2 rows

CPU times: user 38.3 ms, sys: 9.99 ms, total: 48.3 ms
Wall time: 11.8 s


                                                                                

In [46]:
%%time
recs = indexer.inverse_transform(pred_from_loaded)
recs.show(2)



+-------+-------+------------------+
|user_id|item_id|         relevance|
+-------+-------+------------------+
|     33|   1036| 0.893503728709634|
|     33|   1617|0.8330520553824681|
+-------+-------+------------------+
only showing top 2 rows

CPU times: user 317 ms, sys: 94.4 ms, total: 411 ms
Wall time: 9.9 s




## 5. Other RePlay models

#### ALS
A commonly used matrix factorization algorithm.

In [47]:
%%time
recs = fit_predict_evaluate(ALSWrap(rank=100, seed=SEED), metrics, 'ALS')
metrics.results.sort_values('NDCG@5', ascending=False)

04-Jul-22 16:43:07, replay, INFO: This model can't predict cold users, they will be ignored
04-Jul-22 16:43:07, replay, INFO: This model can't predict cold items, they will be ignored
                                                                                

CPU times: user 346 ms, sys: 44.1 ms, total: 390 ms
Wall time: 1min 37s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.130926 | 0.25 | 0.556 | 0.106327 | 0.178962 |
| SLIM | 0.149394 | 0.25 | 0.562 | 0.10512 | 0.178352 |
| ALS | 0.185777 | 0.202 | 0.526 | 0.093427 | 0.161389 |


#### KNN
A commonly used item-based recommender.

In [48]:
%%time
recs = fit_predict_evaluate(KNN(num_neighbours=100), metrics, 'KNN')
metrics.results.sort_values('NDCG@5', ascending=False)

04-Jul-22 16:44:49, replay, INFO: This model can't predict cold items, they will be ignored
                                                                                

CPU times: user 182 ms, sys: 43.7 ms, total: 226 ms
Wall time: 49.5 s


| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.130926 | 0.25 | 0.556 | 0.106327 | 0.178962 |
| SLIM | 0.149394 | 0.25 | 0.562 | 0.10512 | 0.178352 |
| ALS | 0.185777 | 0.202 | 0.526 | 0.093427 | 0.161389 |
| KNN | 0.046582 | 0.154 | 0.402 | 0.06092 | 0.111169 |


## 6. Compare RePlay models with others
To evaluate recommendations obtained from other sources, read them into a dataframe and pass them to ``Experiment``. In the example below, random relevance values stand in for an external model.

In [49]:
import pyspark.sql.functions as sf

In [50]:
metrics.add_result("my_model", recs.withColumn("relevance", sf.rand()))
metrics.results.sort_values("NDCG@5", ascending=False)

                                                                                

| | Coverage@5 | HitRate@1 | HitRate@5 | MAP@5 | NDCG@5 |
|---|---|---|---|---|---|
| SLIM_optimized | 0.130926 | 0.25 | 0.556 | 0.106327 | 0.178962 |
| SLIM | 0.149394 | 0.25 | 0.562 | 0.10512 | 0.178352 |
| ALS | 0.185777 | 0.202 | 0.526 | 0.093427 | 0.161389 |
| KNN | 0.046582 | 0.154 | 0.402 | 0.06092 | 0.111169 |
| my_model | 0.046582 | 0.068 | 0.402 | 0.047313 | 0.094533 |
