# Feature generation with RePlay

This notebook presents the RePlay functionality for feature preprocessing and for generating new user and item features from existing features and interaction history. RePlay offers the following classes:

* CatFeaturesTransformer - one-hot encoding for categorical features
* LogStatFeaturesProcessor - generates statistical user and item features based on historical interactions
* ConditionalPopularityProcessor - computes user and item popularity conditioned on the categorical feature value of a given user-item pair
* HistoryBasedFeaturesProcessor - applies LogStatFeaturesProcessor and ConditionalPopularityProcessor as a pipeline


### Fit 

To train a feature generator, use the `.fit()` method.

### Transform the data

The `.transform()` method transforms data using the statistics computed on the training dataset.
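
The fit/transform split can be illustrated with a toy transformer. The sketch below is not part of RePlay: `MeanCenterer` is a hypothetical class that only shows why statistics are learned once from the training data and then reused on any other data.

```python
from statistics import mean


class MeanCenterer:
    """Hypothetical toy transformer, not a RePlay class."""

    def fit(self, values):
        # Learn the statistic from the training data only.
        self.mean_ = mean(values)
        return self

    def transform(self, values):
        # Reuse the stored training statistic; nothing is recomputed here.
        return [v - self.mean_ for v in values]


train = [1.0, 2.0, 3.0]
new_data = [10.0]

centerer = MeanCenterer().fit(train)
centered_new = centerer.transform(new_data)  # shifted by the train mean
```

The RePlay processors used in this notebook follow the same two-step usage: `.fit()` on the training log, then `.transform()` on the data to be enriched.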

In [2]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import random
from pyspark.sql import functions as sf
from replay.session_handler import get_spark_session, State 

from replay.data_preparator import DataPreparator, Indexer
from replay.utils import convert2spark

spark = State().session
spark.sparkContext.setLogLevel('ERROR')
spark

## Get started

Download the **MovieLens** dataset and preprocess it with `DataPreparator` and `Indexer`.

In [3]:
ratings = pd.read_csv("./data/ml1m_ratings.dat", sep="\t", names=["userId", "itemId","relevance","timestamp"])

For each interaction, we add a categorical `month` column derived from its timestamp.

In [4]:
# Convert the Unix timestamp (seconds) to a datetime and keep only the month number
ratings["month"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.month

In [7]:
dp = DataPreparator()
log = dp.transform(data=ratings,
                  columns_mapping={
                      "user_id": "userId",
                      "item_id":  "itemId",
                      "relevance": "relevance",
                      "timestamp": "timestamp"
                  })

log.show(2)

11-Nov-22 20:19:04, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


+-------+-------+---------+-------------------+-----+
|user_id|item_id|relevance|          timestamp|month|
+-------+-------+---------+-------------------+-----+
|      1|   1193|      5.0|2001-01-01 01:12:40|   12|
|      1|    661|      3.0|2001-01-01 01:35:09|   12|
+-------+-------+---------+-------------------+-----+
only showing top 2 rows



In [8]:
indexer = Indexer(user_col='user_id', item_col='item_id')
indexer.fit(users=log.select('user_id'),
            items=log.select('item_id'))
log = indexer.transform(df=log)
log.show(2)

+--------+--------+---------+-------------------+-----+
|user_idx|item_idx|relevance|          timestamp|month|
+--------+--------+---------+-------------------+-----+
|    4131|      43|      5.0|2001-01-01 01:12:40|   12|
|    4131|     585|      3.0|2001-01-01 01:35:09|   12|
+--------+--------+---------+-------------------+-----+
only showing top 2 rows



We keep only the users with `user_idx < 20` and drop all interactions from month 12, so that month 12 is absent from the training data.

In [9]:
log_20_users = log.where("user_idx < 20 and month != 12")

Let's create a dataframe with user attributes

In [10]:
gender = ["M","F"]
age = [20,30,40]

user_features = spark.createDataFrame(
    [(i, age[random.randint(0, 2)], gender[random.randint(0, 1)]) for i in range(20)]
).toDF("user_idx", "age", "gender")

## class CatFeaturesTransformer()

Transforms the categorical features in `cat_cols_list` with one-hot encoding and removes the original columns.
    
Parameters:
* `cat_cols_list` - List of categorical columns
* `alias` - Prefix of the generated column names (default is "ohe")

In [11]:
from replay.data_preparator import CatFeaturesTransformer
cft = CatFeaturesTransformer(["month"])
cft.fit(log_20_users)

In [12]:
log_trsfrm = cft.transform(log_20_users)

#### Before

In [13]:
log_20_users.show(1, vertical=True)

-RECORD 0------------------------
 user_idx  | 16                  
 item_idx  | 366                 
 relevance | 4.0                 
 timestamp | 2001-01-10 21:07:43 
 month     | 1                   
only showing top 1 row



#### After

In [14]:
log_trsfrm.show(1, vertical=True)

-RECORD 0---------------------------
 user_idx     | 16                  
 item_idx     | 366                 
 relevance    | 4.0                 
 timestamp    | 2001-01-10 21:07:43 
 ohe_month_9  | 0                   
 ohe_month_1  | 1                   
 ohe_month_5  | 0                   
 ohe_month_2  | 0                   
 ohe_month_6  | 0                   
 ohe_month_3  | 0                   
 ohe_month_10 | 0                   
 ohe_month_7  | 0                   
 ohe_month_4  | 0                   
 ohe_month_11 | 0                   
 ohe_month_8  | 0                   
only showing top 1 row



### Processing of cold users and items
If the dataframe contains new values that were not present in the training data, those values are ignored (all encoded columns will be zero).

To show this, we add a row with the value 12 in the `month` column to the DataFrame. This value was absent from the training data.

In [15]:
user_with_12_month_attribute = log.where("month == 12").limit(1)

user_idx, item_idx = user_with_12_month_attribute.select("user_idx", "item_idx").first()

log_21_users = log_20_users.union(user_with_12_month_attribute)

In [16]:
log_trsfrm_21_users = cft.transform(log_21_users)

As we can see, for the row with a month value of 12, all encoded columns are **0**.

In [17]:
log_trsfrm_21_users.where(f"user_idx == {user_idx} and item_idx == {item_idx}").show(vertical=True)

-RECORD 0---------------------------
 user_idx     | 4131                
 item_idx     | 43                  
 relevance    | 5.0                 
 timestamp    | 2001-01-01 01:12:40 
 ohe_month_9  | 0                   
 ohe_month_1  | 0                   
 ohe_month_5  | 0                   
 ohe_month_2  | 0                   
 ohe_month_6  | 0                   
 ohe_month_3  | 0                   
 ohe_month_10 | 0                   
 ohe_month_7  | 0                   
 ohe_month_4  | 0                   
 ohe_month_11 | 0                   
 ohe_month_8  | 0                   
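
The all-zeros behaviour for unseen values can be sketched in plain Python. This is only an illustration of the idea, not RePlay's implementation: `ToyOneHot` is a hypothetical class, and only the `<alias>_<column>_<value>` naming is taken from the output above.

```python
class ToyOneHot:
    """Hypothetical stand-in for the behaviour described above, not RePlay code."""

    def __init__(self, column, alias="ohe"):
        self.column = column
        self.alias = alias

    def fit(self, rows):
        # Remember the categories seen in the training data.
        self.categories_ = sorted({row[self.column] for row in rows})
        return self

    def transform(self, rows):
        encoded = []
        for row in rows:
            # Drop the original column and add one indicator per known category.
            new_row = {k: v for k, v in row.items() if k != self.column}
            for cat in self.categories_:
                name = f"{self.alias}_{self.column}_{cat}"
                new_row[name] = int(row[self.column] == cat)
            encoded.append(new_row)
        return encoded


train_rows = [{"user_idx": 0, "month": 1}, {"user_idx": 1, "month": 5}]
new_rows = [{"user_idx": 2, "month": 12}]  # month 12 was never seen in fit

encoder = ToyOneHot("month").fit(train_rows)
seen = encoder.transform(train_rows)
unseen = encoder.transform(new_rows)  # every ohe_month_* column is 0
```

Because the set of output columns is fixed at fit time, an unseen value cannot create a new column; it simply matches none of the existing indicators.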



## class LogStatFeaturesProcessor()

Calculate user and item features based on historical interactions.

Generated features:

* `(u/i)_log_num_interact` - logarithm of the number of interactions
* `(u/i)_log_interact_days_count` - logarithm of the number of unique dates with user-item interactions 
* `(u/i)_min_interact_date` - min interaction timestamp
* `(u/i)_max_interact_date` - max interaction timestamp
* `(u/i)_std` - standard deviation of relevance values for a user/item
* `(u/i)_mean` - mean relevance values for a user/item
* `(u/i)_quantile_05` - 0.05 quantile of relevance for a user/item
* `(u/i)_quantile_5` - 0.5 quantile (median) of relevance for a user/item
* `(u/i)_quantile_95` - 0.95 quantile of relevance for a user/item
* `(u/i)_history_length_days` - number of days between the min and max interaction dates
* `(u/i)_last_interaction_gap_days` - number of days since last interaction
* `(u/i)_mean_log_num_interact` - average logarithm of the number of interactions of the items/users that the user/item interacted with
* `(u/i)_(i/u)_log_num_interact_diff` - difference between the logarithm of the number of user/item interactions and **(u/i)_mean_log_num_interact** for this user/item
* `na_(u/i)_log_features` - flag indicating a cold user/item that is absent from the training log
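
As a rough illustration of the count-based features in the list above, the sketch below computes `u_log_num_interact`, `u_mean_log_num_interact` and `u_i_log_num_interact_diff` on a toy log in plain Python. The toy data and variable names are made up for the example, and the natural logarithm is an assumption; RePlay's own implementation runs on Spark and may differ in details.

```python
import math
from collections import Counter, defaultdict

# Toy interaction log: (user, item) pairs.
toy_log = [("u1", "a"), ("u1", "b"), ("u2", "a"),
           ("u3", "a"), ("u3", "b"), ("u3", "c")]

# Number of interactions per user and per item.
user_counts = Counter(user for user, _ in toy_log)
item_counts = Counter(item for _, item in toy_log)

# log_num_interact for users and items (natural logarithm assumed).
u_log_num_interact = {u: math.log(n) for u, n in user_counts.items()}
i_log_num_interact = {i: math.log(n) for i, n in item_counts.items()}

# Items each user interacted with.
items_of_user = defaultdict(list)
for user, item in toy_log:
    items_of_user[user].append(item)

# u_mean_log_num_interact: average log-popularity of the user's items.
u_mean_log_num_interact = {
    u: sum(i_log_num_interact[i] for i in items) / len(items)
    for u, items in items_of_user.items()
}

# Difference between the user's own log-activity and that average.
u_i_log_num_interact_diff = {
    u: u_log_num_interact[u] - u_mean_log_num_interact[u]
    for u in user_counts
}
```

A negative difference, as for `u2` here, means the user is less active than the items they touch are popular.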
    

<br>
<br>


* `u_mean_log_num_interact`   $$mean\;log\;num\;interact\;(u) = \frac{\Sigma_{i \in I_u} log(itr(i)) }{\|I_u\|}$$ <br>
Where:<br>
  $\;\;i$ - item<br>$\;\;I_u$ - the set of items that the user interacted with and $\;\;\|I_u\|$ is their number<br>$\;\;{itr(i)}$ - number of interactions of item $i$<br><br><br>
  
* `u_i_log_num_interact_diff`   $$log\;num\;interact\;diff\;(u) = log(itr(u)) - mean\;log\;num\;interact\;(u)$$ <br>Where:<br>
  $\;\;u$ - user<br>$\;\;{itr(u)}$ - number of interactions of user $u$<br><br><br>
  
* `abnormality`:
  $$Abnormality(u) = \frac{\Sigma_{r \in R_u} | n_{u,r} - \overline{n_{r}} | }{\| R_u \|}$$ <br>Where:<br>
  $\;\;n_{u,r}$ - represents the rating that user $u$ assigned to resource $r$<br>$\;\;\overline{n_{r}}$ - the average rating of $r$<br>$\;\;R_u$ - the set of resources rated by $u$ and $\|R_u\|$ is their number<br><br>

* `abnormalityCR` 
  $$AbnormalityCR(u) = \frac{\Sigma_{r \in R_u} (( n_{u,r} - \overline{n_{r}} ) * contr(r))^2 }{\| R_u \|}$$ <br>
  
  $$contr(r) = 1 - \frac{\sigma_r - \sigma_{min} }{\sigma_{max} - \sigma_{min}}$$<br>Where:<br>
  $\;\;\sigma_r$ - the standard deviation of the ratings associated with the resource $r$<br>$\;\;\sigma_{min}$ and $\;\;\sigma_{max}$ are respectively the smallest and the largest standard deviation values among resources
  
[More about abnormality, abnormalityCR](https://hal.inria.fr/hal-01254172/document)
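
The two abnormality formulas can be checked by hand on a tiny example. The sketch below is a plain-Python illustration under stated assumptions, not RePlay's code: the toy ratings are invented, the population standard deviation is used for $\sigma_r$, and `contr` falls back to 1 when all resources have the same deviation (the formulas above do not specify either choice).

```python
from collections import defaultdict
from statistics import mean, pstdev

# Toy ratings: (user, resource, rating).
toy_ratings = [
    ("u1", "r1", 5.0), ("u2", "r1", 1.0),
    ("u1", "r2", 4.0), ("u2", "r2", 2.0),
    ("u1", "r3", 3.0), ("u2", "r3", 3.0),
]

by_resource = defaultdict(list)
by_user = defaultdict(list)
for user, resource, rating in toy_ratings:
    by_resource[resource].append(rating)
    by_user[user].append((resource, rating))

# Average rating and rating deviation of every resource.
res_mean = {r: mean(v) for r, v in by_resource.items()}
res_std = {r: pstdev(v) for r, v in by_resource.items()}  # assumption: population std
std_min, std_max = min(res_std.values()), max(res_std.values())


def contr(resource):
    """Weight in [0, 1]: resources with consensual ratings count more."""
    if std_max == std_min:  # assumption: avoid division by zero
        return 1.0
    return 1 - (res_std[resource] - std_min) / (std_max - std_min)


# Mean absolute deviation of the user's ratings from the resource averages.
abnormality = {
    u: sum(abs(n - res_mean[r]) for r, n in rated) / len(rated)
    for u, rated in by_user.items()
}

# Same idea, but squared and weighted by the contribution of each resource.
abnormality_cr = {
    u: sum(((n - res_mean[r]) * contr(r)) ** 2 for r, n in rated) / len(rated)
    for u, rated in by_user.items()
}
```

Here `r1` has the largest deviation, so its contribution is 0 and the strong disagreement of `u1` on it does not raise `abnormality_cr`, while it still counts fully in `abnormality`.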

In [18]:
from replay.history_based_fp import LogStatFeaturesProcessor
lf = LogStatFeaturesProcessor()
lf.fit(log_20_users)

In [19]:
log_trsfrm = lf.transform(log_20_users)

#### Before

In [20]:
log_20_users.show(1, vertical=True)

-RECORD 0------------------------
 user_idx  | 16                  
 item_idx  | 366                 
 relevance | 4.0                 
 timestamp | 2001-01-10 21:07:43 
 month     | 1                   
only showing top 1 row



#### After

In [22]:
log_trsfrm.show(1, vertical=True)

-RECORD 0-------------------------------------------
 item_idx                    | 366                  
 user_idx                    | 16                   
 relevance                   | 4.0                  
 timestamp                   | 2001-01-10 21:07:43  
 month                       | 1                    
 u_log_num_interact          | 6.736966958001855    
 u_log_interact_days_count   | 4.795790545596741    
 u_min_interact_date         | 2001-01-10 20:59:24  
 u_max_interact_date         | 2003-02-27 15:31:39  
 u_std                       | 0.9460530559956203   
 u_mean                      | 3.561091340450771    
 u_quantile_05               | 2.0                  
 u_quantile_5                | 4.0                  
 u_quantile_95               | 5.0                  
 u_history_length_days       | 778                  
 u_last_interaction_gap_days | 0                    
 abnormality                 | 0.5423858158630311   
 abnormalityCR               | 0.1907112888748

### Processing of cold users and items

There are 3 possible scenarios:

1. Cold user - a user that was not present in the training log.
    All item statistics will be present, but the user statistics will be `0`.<br>Flag `na_u_log_features` will be `1`.
<br>
<br>
2. Cold item - an item that was not present in the training log.
    All user statistics will be present, but the item statistics will be `0`.<br>Flag `na_i_log_features` will be `1`.
<br>
<br>
3. A cold user paired with a cold item - neither the user nor the item was present in the training log.
    All statistics will be `0`.<br>Flags `na_u_log_features` and `na_i_log_features` will both be `1`.
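
The three scenarios can be mimicked with a small lookup in plain Python. This is a hedged sketch of the fill-with-zero-and-flag idea only: `add_features`, the single `u_mean`/`i_mean` statistics and the toy ids are hypothetical, and RePlay performs the equivalent step with Spark joins.

```python
# Statistics learned during fit, keyed by user and by item (toy values).
user_stats = {"u1": {"u_mean": 3.5}}
item_stats = {"a": {"i_mean": 4.0}}


def add_features(user, item):
    """Hypothetical helper mimicking the behaviour described above."""
    row = {"user_idx": user, "item_idx": item}
    # Known ids get their stored statistics; unknown ids get zeros.
    row.update(user_stats.get(user, {"u_mean": 0.0}))
    row.update(item_stats.get(item, {"i_mean": 0.0}))
    # Flags mark which side was absent from the training log.
    row["na_u_log_features"] = int(user not in user_stats)
    row["na_i_log_features"] = int(item not in item_stats)
    return row


cold_user = add_features("u_new", "a")      # scenario 1
cold_item = add_features("u1", "i_new")     # scenario 2
cold_both = add_features("u_new", "i_new")  # scenario 3
```

The flags let a downstream model tell a statistic that is genuinely `0` apart from one that was filled in because no history exists.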

#### Add cold user

In [23]:
user_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("u_idx") == sf.col("user_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .limit(1)
)

user_idx, item_idx = user_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(user_cold)

In [24]:
log_trsfrm = lf.transform(log_21_users)

In [26]:
log_trsfrm.where(f"user_idx == {user_idx}").show(1, vertical=True)

                                                                                

-RECORD 0------------------------------------------
 item_idx                    | 2001                
 user_idx                    | 38                  
 relevance                   | 3.0                 
 timestamp                   | 2000-07-12 08:05:23 
 month                       | 7                   
 u_log_num_interact          | 0.0                 
 u_log_interact_days_count   | 0.0                 
 u_min_interact_date         | 1970-01-01 03:00:00 
 u_max_interact_date         | 1970-01-01 03:00:00 
 u_std                       | 0.0                 
 u_mean                      | 0.0                 
 u_quantile_05               | 0.0                 
 u_quantile_5                | 0.0                 
 u_quantile_95               | 0.0                 
 u_history_length_days       | 0                   
 u_last_interaction_gap_days | 0                   
 abnormality                 | 0.0                 
 abnormalityCR               | 0.0                 
 u_mean_i_lo

#### Add cold item

In [27]:
item_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("i_idx") == sf.col("item_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .filter("user_idx < 20")
    .limit(1)
)

user_idx, item_idx = item_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(item_cold)

In [28]:
log_trsfrm = lf.transform(log_21_users)

In [29]:
log_trsfrm.where(f"item_idx == {item_idx}").show(1, vertical=True)

-RECORD 0------------------------------------------
 item_idx                    | 3078                
 user_idx                    | 4                   
 relevance                   | 2.0                 
 timestamp                   | 2000-12-07 03:23:32 
 month                       | 12                  
 u_log_num_interact          | 6.911747300251674   
 u_log_interact_days_count   | 2.6390573296152584  
 u_min_interact_date         | 2000-11-22 03:47:32 
 u_max_interact_date         | 2002-05-13 02:30:58 
 u_std                       | 0.8582482337323498  
 u_mean                      | 3.0468127490039842  
 u_quantile_05               | 2.0                 
 u_quantile_5                | 3.0                 
 u_quantile_95               | 4.0                 
 u_history_length_days       | 537                 
 u_last_interaction_gap_days | 290                 
 abnormality                 | 0.696032092299538   
 abnormalityCR               | 0.30599330777137257 
 u_mean_i_lo

#### Add cold item and user

In [30]:
item_user_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("i_idx") == sf.col("item_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .filter("user_idx > 20")
    .limit(1)
)

user_idx, item_idx = item_user_cold.select("user_idx", "item_idx").first()

log_21_users  = log_20_users.union(item_user_cold)

In [31]:
log_trsfrm = lf.transform(log_21_users)

In [32]:
log_trsfrm.where(f"item_idx == {item_idx} and user_idx == {user_idx}").show(1, vertical=True)

                                                                                

-RECORD 0------------------------------------------
 item_idx                    | 3078                
 user_idx                    | 1335                
 relevance                   | 1.0                 
 timestamp                   | 2000-12-02 02:41:14 
 month                       | 12                  
 u_log_num_interact          | 0.0                 
 u_log_interact_days_count   | 0.0                 
 u_min_interact_date         | 1970-01-01 03:00:00 
 u_max_interact_date         | 1970-01-01 03:00:00 
 u_std                       | 0.0                 
 u_mean                      | 0.0                 
 u_quantile_05               | 0.0                 
 u_quantile_5                | 0.0                 
 u_quantile_95               | 0.0                 
 u_history_length_days       | 0                   
 u_last_interaction_gap_days | 0                   
 abnormality                 | 0.0                 
 abnormalityCR               | 0.0                 
 u_mean_i_lo

## class ConditionalPopularityProcessor()

Calculate popularity based on user or item categorical features.
If user features are provided, item features will be generated and vice versa.

Parameters:
* `cat_features_list` - List of columns with categorical features to use
    for conditional popularity calculation
    
Generated features:
* `(u/i)_pop_by_<cat>` - popularity of a user or item within each category
* `na_(u/i)_pop_by_<cat>` - flag indicating that there is no historical data to calculate the popularity of a user or item within the category

In [33]:
from replay.history_based_fp import ConditionalPopularityProcessor
cpp = ConditionalPopularityProcessor(["age","gender"])

In [34]:
cpp.fit(log_20_users, user_features)

In [35]:
log_trsfrm = cpp.transform(log_20_users.join(user_features, on="user_idx"))

In [37]:
log_trsfrm.show(1, vertical=True)

-RECORD 0---------------------------------
 item_idx           | 1658                
 gender             | F                   
 age                | 30                  
 user_idx           | 0                   
 relevance          | 5.0                 
 timestamp          | 2000-08-04 00:14:32 
 month              | 8                   
 i_pop_by_age       | 0.5                 
 na_i_pop_by_age    | false               
 i_pop_by_gender    | 0.8333333333333334  
 na_i_pop_by_gender | false               
only showing top 1 row



Popularity is calculated as the share of a user's/item's interactions that fall into a given category of items/users.

We trained the ConditionalPopularityProcessor() on categorical user features. Therefore, we can observe how the interactions with a certain item are distributed among different groups of users.

Since popularity is a proportion of an item's interactions, the popularity values of any item sum to 1 across the values of a feature.

In [38]:
cpp.conditional_pop_dict["age"].where("item_idx == 634").show()
cpp.conditional_pop_dict["gender"].where("item_idx == 634").show()

+--------+---+------------+
|item_idx|age|i_pop_by_age|
+--------+---+------------+
|     634| 20|         0.2|
|     634| 40|         0.3|
|     634| 30|         0.5|
+--------+---+------------+

+--------+------+---------------+
|item_idx|gender|i_pop_by_gender|
+--------+------+---------------+
|     634|     F|            0.6|
|     634|     M|            0.4|
+--------+------+---------------+
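
The proportions in such tables can be reproduced with a toy calculation. The sketch below is plain Python under stated assumptions, not RePlay's implementation: the rows are invented, and `lookup` is a hypothetical helper that also shows the zero-plus-flag treatment of pairs with no history.

```python
from collections import Counter

# Toy log joined with a user feature: (item, age of the interacting user).
toy_rows = [(634, 20), (634, 30), (634, 30), (634, 40), (7, 20)]

pair_counts = Counter(toy_rows)                      # interactions per (item, age)
item_counts = Counter(item for item, _ in toy_rows)  # interactions per item

# Share of an item's interactions that came from each age group.
i_pop_by_age = {
    (item, age): n / item_counts[item]
    for (item, age), n in pair_counts.items()
}

# For a fixed item the shares add up to 1.
total_634 = sum(p for (item, _), p in i_pop_by_age.items() if item == 634)


def lookup(item, age):
    """Return (popularity, na_flag); unseen pairs get 0.0 and a raised flag."""
    if (item, age) in i_pop_by_age:
        return i_pop_by_age[(item, age)], False
    return 0.0, True


known = lookup(634, 30)
unknown = lookup(7, 40)  # item 7 was never rated by the age-40 group
```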



#### Add item with cold features

If we apply `.transform()` to a `user/item` - `item/user categorical feature value` pair that was absent during training, the popularity feature will be 0 and the `na_(u/i)_pop_by_<cat>` flag will be **true**, indicating the absence of historical data for this `user/item` - `feature value` combination.

In [39]:
item_cold = (
    log.join(
        log_20_users.select(sf.col("user_idx").alias("u_idx"), sf.col("item_idx").alias("i_idx")),
        on=sf.col("i_idx") == sf.col("item_idx"),
        how="left"
    )
    .filter(sf.col("i_idx").isNull())
    .select(log.columns)
    .filter("user_idx < 20")
    .limit(1)
)

user_idx, item_idx = item_cold.select("user_idx", "item_idx").first()

user_feature_for_21_users = user_features.union(
    spark.createDataFrame(
        [[item_idx, 35, gender[random.randint(0,1)]]]
    ).toDF("user_idx", "age" , "gender")
)

log_21_users  = log_20_users.union(item_cold)

In [40]:
log_trsfrm = cpp.transform(log_21_users.join(user_feature_for_21_users, on="user_idx"))

In [41]:
log_trsfrm.where(f"item_idx == {item_idx}").show(1, vertical=True)

-RECORD 0---------------------------------
 item_idx           | 3078                
 gender             | M                   
 age                | 30                  
 user_idx           | 4                   
 relevance          | 2.0                 
 timestamp          | 2000-12-07 03:23:32 
 month              | 12                  
 i_pop_by_age       | 0.0                 
 na_i_pop_by_age    | true                
 i_pop_by_gender    | 0.0                 
 na_i_pop_by_gender | true                



## class HistoryBasedFeaturesProcessor()

This class combines the functionality of LogStatFeaturesProcessor() and ConditionalPopularityProcessor() for more convenient feature generation.

See the LogStatFeaturesProcessor and ConditionalPopularityProcessor documentation
for a detailed description of the generated features.

Parameters:
* `use_log_features` - if **True**, statistical log-based features
    are generated by LogStatFeaturesProcessor
* `use_conditional_popularity` - if **True**, conditional popularity
    features are generated by ConditionalPopularityProcessor
* `user_cat_features_list` - list of user categorical features
    used to calculate item conditional popularity features
* `item_cat_features_list` - list of item categorical features
    used to calculate user conditional popularity features

If `use_log_features` is `True`, features are generated with the LogStatFeaturesProcessor class.

In [42]:
from replay.history_based_fp import HistoryBasedFeaturesProcessor

In [43]:
hbf = HistoryBasedFeaturesProcessor(
    use_log_features=True,
    use_conditional_popularity=False
)

In [44]:
hbf.fit(log_20_users)

In [45]:
hbf.transform(log_20_users).show(1, vertical=True)

-RECORD 0-------------------------------------------
 item_idx                    | 366                  
 user_idx                    | 16                   
 relevance                   | 4.0                  
 timestamp                   | 2001-01-10 21:07:43  
 month                       | 1                    
 u_log_num_interact          | 6.736966958001855    
 u_log_interact_days_count   | 4.795790545596741    
 u_min_interact_date         | 2001-01-10 20:59:24  
 u_max_interact_date         | 2003-02-27 15:31:39  
 u_std                       | 0.9460530559956203   
 u_mean                      | 3.561091340450771    
 u_quantile_05               | 2.0                  
 u_quantile_5                | 4.0                  
 u_quantile_95               | 5.0                  
 u_history_length_days       | 778                  
 u_last_interaction_gap_days | 0                    
 abnormality                 | 0.5423858158630311   
 abnormalityCR               | 0.1907112888748

If `use_conditional_popularity` is `True` and the lists of user/item categorical features are passed, features are generated with the ConditionalPopularityProcessor class.

In [46]:
hbf = HistoryBasedFeaturesProcessor(
    use_log_features=False,
    use_conditional_popularity=True,
    user_cat_features_list=["age","gender"]
)

In [47]:
hbf.fit(log_20_users, user_features=user_features)

In [48]:
hbf.transform(log_20_users.join(user_features, on="user_idx")).show(1, vertical=True)

-RECORD 0---------------------------------
 item_idx           | 1658                
 gender             | F                   
 age                | 30                  
 user_idx           | 0                   
 relevance          | 5.0                 
 timestamp          | 2000-08-04 00:14:32 
 month              | 8                   
 i_pop_by_age       | 0.5                 
 na_i_pop_by_age    | false               
 i_pop_by_gender    | 0.8333333333333334  
 na_i_pop_by_gender | false               
only showing top 1 row



We can also use the full functionality of the class, and get all the features we are interested in.

In [49]:
hbf = HistoryBasedFeaturesProcessor(
    use_log_features=True,
    use_conditional_popularity=True,
    user_cat_features_list=["age","gender"]
)

In [50]:
hbf.fit(log_20_users, user_features=user_features)

In [51]:
hbf.transform(log_20_users.join(user_features, on="user_idx")).show(1, vertical=True)

-RECORD 0-------------------------------------------
 item_idx                    | 18                   
 gender                      | F                    
 age                         | 40                   
 user_idx                    | 18                   
 relevance                   | 5.0                  
 timestamp                   | 2000-05-09 19:45:35  
 month                       | 5                    
 u_log_num_interact          | 7.085064293952548    
 u_log_interact_days_count   | 3.7376696182833684   
 u_min_interact_date         | 2000-05-09 19:45:35  
 u_max_interact_date         | 2002-10-05 03:36:10  
 u_std                       | 0.7457986974338456   
 u_mean                      | 3.6758793969849246   
 u_quantile_05               | 2.0                  
 u_quantile_5                | 4.0                  
 u_quantile_95               | 5.0                  
 u_history_length_days       | 879                  
 u_last_interaction_gap_days | 145            

                                                                                