# RecSys Challenge 2022 Research Notebook

## Introduction

Authors : 
- Henri Jamet <henri.jamet@epita.fr>
- Corentin Duchene <corentin.duchene@epita.fr>
- Adrien Merat <adrien.merat@epita.fr>
- Erwan Goudard <erwan.goudard@epita.fr>

Project :
- http://www.recsyschallenge.com/2022/index.html#about
- http://www.recsyschallenge.com/2022/dataset.html

## Data Exploration

Download the challenge data and unpack it into the `data` folder at the root of this project. The folder should then contain:

```shell
data
├── README.txt
├── README_win.txt
├── candidate_items.csv
├── item_features.csv
├── test_final_sessions.csv
├── test_leaderboard_sessions.csv
├── train_purchases.csv
└── train_sessions.csv
```
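
Before loading anything, it can help to confirm that the archive was unpacked where the notebook expects it. The sketch below is a minimal check, assuming the layout listed above; the helper name `missing_data_files` is ours and not part of the challenge material.

```python
from pathlib import Path

# CSV file names taken from the directory listing above.
EXPECTED_FILES = [
    "candidate_items.csv",
    "item_features.csv",
    "test_final_sessions.csv",
    "test_leaderboard_sessions.csv",
    "train_purchases.csv",
    "train_sessions.csv",
]

def missing_data_files(data_dir):
    """Return the expected CSV files that are absent from data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]

# The notebook reads from "../data" relative to its own location.
missing = missing_data_files(Path("..") / "data")
print("Missing files:", missing if missing else "none")
```

An empty list means every CSV file used below is in place.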


**Let's load our data**

In [1]:
import pandas as pd
import os

PATH_DATA = os.path.join("..", "data")

train_session_df = pd.read_csv(os.path.join(PATH_DATA, "train_sessions.csv"))
train_purchase_df = pd.read_csv(os.path.join(PATH_DATA, "train_purchases.csv"))
candidate_items_df = pd.read_csv(os.path.join(PATH_DATA, "candidate_items.csv"))
item_features_df = pd.read_csv(os.path.join(PATH_DATA, "item_features.csv"))
test_final_sessions_df = pd.read_csv(os.path.join(PATH_DATA, "test_final_sessions.csv"))
test_leaderboard_sessions_df = pd.read_csv(os.path.join(PATH_DATA, "test_leaderboard_sessions.csv"))

# Avoid naming this variable `dict`, which would shadow the Python builtin.
dataframes = {
    "train_session_df": train_session_df,
    "train_purchase_df": train_purchase_df,
    "candidate_items_df": candidate_items_df,
    "item_features_df": item_features_df,
    "test_final_sessions_df": test_final_sessions_df,
    "test_leaderboard_sessions_df": test_leaderboard_sessions_df,
}
# Print the first row of each dataframe.
for name, df in dataframes.items():
    print(name)
    print(df.head(1), end="\n\n")

train_session_df
   session_id  item_id                     date
0           3     9655  2020-12-18 21:25:00.373

train_purchase_df
   session_id  item_id                     date
0           3    15085  2020-12-18 21:26:47.986

candidate_items_df
   item_id
0        4

item_features_df
   item_id  feature_category_id  feature_value_id
0        2                   56               365

test_final_sessions_df
   session_id  item_id                     date
0          61    27088  2021-06-01 08:12:39.664

test_leaderboard_sessions_df
   session_id  item_id                     date
0          26    19185  2021-06-16 09:53:54.158



*How many different items exist?*

In [2]:
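distinct_item_number = len(item_features_df.item_id.unique())
print("Unique item number :", distinct_item_number)
# Note: this check is true by construction as long as item_id has no missing values,
# since nunique() and len(unique()) count the same ids.
print("Item id are unique : ", item_features_df.item_id.nunique() == len(item_features_df.item_id.unique()))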

Unique item number : 23691
Item id are unique :  True


*How many different sessions exist?*

In [3]:
distinct_session_number = len(pd.concat([train_session_df.session_id, train_purchase_df.session_id]).unique())
print("Unique session number :", distinct_session_number)

Unique session number : 1000000


*Does a session always look at an item before buying it?*

In [4]:
import numpy as np

print("A user never looks at an item before buying it.")
pd.merge(train_purchase_df, train_session_df, on=['session_id','item_id'], how='left', indicator='Exist')["Exist"].value_counts()

A user never looks at an item before buying it.


left_only     1000000
right_only          0
both                0
Name: Exist, dtype: int64

*Can a session look at items without buying any?*

In [5]:
print("Every session bought exactly one item.")

pd.merge(train_purchase_df, train_session_df, on=['session_id'], how='left', indicator='Exist')["Exist"].value_counts()

Every session bought exactly one item.


both          4743820
left_only           0
right_only          0
Name: Exist, dtype: int64
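
A left merge that starts from `train_purchase_df` can only ever report `both` or `left_only`, so the counts above confirm that every purchase comes with viewed rows but cannot reveal sessions that viewed items without buying. A minimal sketch of an outer merge on session ids shows how `right_only` would expose such sessions. The toy frames below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: session 3 views an item but never buys.
toy_purchases = pd.DataFrame({"session_id": [1, 2], "item_id": [10, 20]})
toy_sessions = pd.DataFrame({"session_id": [1, 1, 2, 3], "item_id": [11, 12, 21, 31]})

# Compare the two sets of session ids only, one row per session.
merged = pd.merge(
    toy_purchases[["session_id"]].drop_duplicates(),
    toy_sessions[["session_id"]].drop_duplicates(),
    on="session_id",
    how="outer",
    indicator="Exist",
)

# Sessions present only on the right side viewed items without purchasing.
view_only_sessions = merged.loc[merged["Exist"] == "right_only", "session_id"].tolist()
print(view_only_sessions)
```

Running the same merge on the real frames would settle the question directly; we have not recorded that result here.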

*How many items does a user look at on average?*

In [6]:
print("Average number of items seen by user :", train_session_df.groupby("session_id").count()["item_id"].mean())

Average number of items seen by user : 4.74382
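
Note that `groupby(...).count()` counts rows, so an item viewed twice in the same session is counted twice. If the question is about distinct items, `nunique` is the closer match. The sketch below illustrates the difference on invented data; whether the two averages differ on the real dataset is not checked here.

```python
import pandas as pd

# Toy sessions: session 1 views item 10 twice.
toy_sessions = pd.DataFrame({
    "session_id": [1, 1, 1, 2],
    "item_id": [10, 10, 11, 20],
})

# Rows per session versus distinct items per session.
views_per_session = toy_sessions.groupby("session_id")["item_id"].count()
distinct_items_per_session = toy_sessions.groupby("session_id")["item_id"].nunique()

print("Average views per session:", views_per_session.mean())
print("Average distinct items per session:", distinct_items_per_session.mean())
```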


*How many features exist?*

In [7]:
print("The number of different item features is :")
print(len(item_features_df["feature_category_id"].unique()))

The number of different item features is :
73


*How many items have the same feature?*

In [8]:
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(8,6), dpi= 100, facecolor='w', edgecolor='k')
nbr = item_features_df.groupby("feature_category_id").count()["item_id"]
plt.bar(x=nbr.index, height=nbr.values)
plt.title("Number of item rows per feature category")

Text(0.5, 1.0, 'Number of item rows per feature category')

*How many features and feature values do items generally have in common?*

This question is fairly complex, so we first write a function that computes these counts for every item relative to a given one. We will probably need this function sooner or later anyway.

In [9]:
item_features_df.head(1)

   item_id  feature_category_id  feature_value_id
0        2                   56               365


In [10]:

def get_item_similarity(item_features_df, item_id):
    """Count the features and feature values every other item shares with the given item.

    Args:
        item_features_df (pd.DataFrame): The item features dataframe.
        item_id (int): The id of the reference item.

    Returns:
        (pd.DataFrame): item | similar_feature_id | similar_value_id, or None if the item is unknown.
    """
    item_df = item_features_df[item_features_df.item_id == item_id]
    if len(item_df) == 0:
        return None
    # Rows of any item that uses a feature category also used by the reference item.
    same_feature_id = item_features_df[item_features_df.feature_category_id.isin(item_df.feature_category_id)]
    # Among those rows, keep the ones whose value id appears among the reference item's values.
    # Note: values are matched regardless of the category they belong to.
    same_value_id = same_feature_id[same_feature_id.feature_value_id.isin(item_df.feature_value_id)]
    feature_counts = same_feature_id.groupby("item_id").count().feature_category_id.drop(index=item_id)
    value_counts = same_value_id.groupby("item_id").count().feature_value_id.drop(index=item_id)
    return pd.concat([feature_counts.rename("similar_feature_id"), value_counts.rename("similar_value_id")], axis=1)

In [15]:
import numpy as np

ESTIMATION_SAMPLE_NUMBER = 1000

similar_feature_id = []
similar_value_id = []
for _, item_id in item_features_df.sample(ESTIMATION_SAMPLE_NUMBER).item_id.items():
    similarity = get_item_similarity(item_features_df, item_id)
    # Check for an unknown item before doing any arithmetic on the result.
    if similarity is None:
        continue
    # Sum the counts over all other items, then average over the number of items.
    similarity = similarity.sum() / distinct_item_number
    similar_feature_id.append(similarity.similar_feature_id)
    similar_value_id.append(similarity.similar_value_id)

average_similar_feature = np.mean(similar_feature_id)
average_similar_value = np.mean(similar_value_id)
print("Average number of similar features between items :", average_similar_feature)
print("Average number of similar feature values between items :", average_similar_value)
    

Average number of similar features between items : 12.587787978557259
Average number of similar feature values between items : 4.896744122240514
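
`get_item_similarity` counts a value as shared whenever its id appears among the reference item's values, whatever category it belongs to. If value ids are only meaningful within their category (an assumption that the dataset description would need to confirm), a stricter count matches on the `(feature_category_id, feature_value_id)` pair. The helper `count_shared_pairs` below is a hypothetical variant, shown on invented data.

```python
import pandas as pd

def count_shared_pairs(item_features_df, item_id):
    """Count, per other item, the (category, value) pairs shared with item_id (hypothetical helper)."""
    pair_columns = ["feature_category_id", "feature_value_id"]
    reference = item_features_df.loc[item_features_df.item_id == item_id, pair_columns].drop_duplicates()
    others = item_features_df[item_features_df.item_id != item_id]
    # An inner merge keeps only rows whose category AND value both match the reference item.
    matched = others.merge(reference, on=pair_columns, how="inner")
    return matched.groupby("item_id").size().rename("shared_pairs")

# Toy data: item 3 reuses item 1's value ids, but under different categories.
toy_features = pd.DataFrame({
    "item_id":             [1, 1, 2, 2, 3, 3],
    "feature_category_id": [56, 47, 56, 47, 56, 47],
    "feature_value_id":    [365, 123, 365, 999, 123, 365],
})

shared = count_shared_pairs(toy_features, item_id=1)
print(shared.to_dict())
```

In this toy example only item 2 shares a full pair with item 1, while item 3 shares none even though both of its value ids also occur on item 1.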


## Data Transformation

**Let's add a score to every user-item pair**

Each session is treated as a user: a viewed item gets a rating of 1 and a purchased item a rating of 2.

In [38]:
train_rating_df = pd.concat([train_session_df.assign(rating=1), train_purchase_df.assign(rating=2)])
train_set_df = train_rating_df.rename(columns={"session_id" : "user_id", "rating" : "raw_ratings"}).sample(len(train_rating_df))
train_set_df

         user_id  item_id                     date  raw_ratings
470253    440610    27225  2020-11-01 16:49:27.499            1
2284733  2135908     2100  2020-06-10 12:54:42.431            1
2106092  1969663    17795  2020-04-18 08:34:39.384            1
2580411  2409668    27629   2020-08-03 21:55:53.88            1
4306011  4029781    23863  2020-03-15 09:18:56.558            1
...          ...      ...                      ...          ...
760059    711678    23789  2020-07-05 15:37:01.773            1
1488354  1392963    23019   2021-05-30 15:17:29.18            1
1863175  1741976    27988  2020-08-05 14:31:36.785            1
2527594  2359624     6689   2020-05-25 07:35:21.28            1


**Let's create a Surprise training dataset**

In [42]:
TRAIN_SET_REDUCED_SIZE = 10000

import surprise

rating_reader = surprise.Reader(rating_scale=(1, 2))
train_set = surprise.dataset.Dataset.load_from_df(df=train_set_df[["user_id", "item_id", "raw_ratings"]], reader=rating_reader)
train_set_reduced = surprise.dataset.Dataset.load_from_df(df=train_set_df[["user_id", "item_id", "raw_ratings"]].iloc[:TRAIN_SET_REDUCED_SIZE], reader=rating_reader)

## Metrics

## Models

### Let's train a simple Surprise model

In [43]:
model = surprise.SVD()

surprise.model_selection.cross_validate(model, train_set_reduced, measures=["RMSE"], cv=5, verbose=True, n_jobs=-1)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.3844  0.3736  0.3823  0.3809  0.3844  0.3811  0.0040  
Fit time          0.24    0.26    0.23    0.24    0.23    0.24    0.01    
Test time         0.01    0.01    0.01    0.01    0.01    0.01    0.00    


{'test_rmse': array([0.38444556, 0.37363956, 0.38234942, 0.3808684 , 0.38437717]),
 'fit_time': (0.2350616455078125,
  0.2608654499053955,
  0.225067138671875,
  0.23749852180480957,
  0.22806382179260254),
 'test_time': (0.007378578186035156,
  0.0071811676025390625,
  0.0075910091400146484,
  0.0077953338623046875,
  0.007299184799194336)}

### Let's compare our models

In [45]:
# One instance of each Surprise algorithm to compare.
model_list = [
    surprise.NormalPredictor(), surprise.BaselineOnly(),
    surprise.KNNBaseline(), surprise.KNNBasic(), surprise.KNNWithMeans(), surprise.KNNWithZScore(),
    surprise.SlopeOne(), surprise.SVD(), surprise.SVDpp(), surprise.NMF(), surprise.CoClustering(),
]

result = {}
for model in model_list:
    scores = surprise.model_selection.cross_validate(model, train_set_reduced, measures=["RMSE"], cv=5, verbose=False)
    result[model.__class__.__name__] = scores["test_rmse"].mean()

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matr

In [46]:
# ------------------------- BEST ALGORITHMS WITH RMSE ------------------------ #
sorted(result.items(), key=lambda x: x[1])

[('KNNBaseline', 0.3799799321860557),
 ('KNNBasic', 0.3801997471291728),
 ('NMF', 0.3802030481463565),
 ('BaselineOnly', 0.38027947203177714),
 ('KNNWithZScore', 0.38036789238157875),
 ('KNNWithMeans', 0.3803827426169235),
 ('SlopeOne', 0.38046389634037225),
 ('CoClustering', 0.38051632761777154),
 ('SVD', 0.38066332557871174),
 ('SVDpp', 0.3812685052348684),
 ('NormalPredictor', 0.4739045570136701)]
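
All models except `NormalPredictor` land within about 0.002 of each other, which suggests checking what a constant prediction would score. The back-of-the-envelope sketch below uses only figures reported earlier (1,000,000 purchases rated 2 and 4,743,820 views rated 1) and assumes that the reduced sample has roughly the same rating mix as the full set. For a constant prediction equal to the mean rating, the RMSE is simply the standard deviation of the ratings.

```python
import math

N_PURCHASES = 1_000_000   # rows rated 2 (train_purchases.csv)
N_VIEWS = 4_743_820       # rows rated 1 (train_sessions.csv)

# Share of rows carrying the rating 2.
p_purchase = N_PURCHASES / (N_PURCHASES + N_VIEWS)

# Ratings take two values one unit apart, so their standard deviation is sqrt(p * (1 - p)).
mean_rating = 1 + p_purchase
constant_rmse = math.sqrt(p_purchase * (1 - p_purchase))

print("Mean rating:", round(mean_rating, 4))
print("RMSE of always predicting the mean:", round(constant_rmse, 4))
```

The result is about 0.379, very close to the cross-validated scores above. Under the stated assumption, this hints that RMSE on a 1-or-2 rating says little about how well a model ranks items.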

### Now, let's run a grid search on our best models

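Surprise and scikit-learn both provide a very similar grid-search class. However, to keep full control over its behaviour and to be able to run this code in parallel on the LSE cluster, we chose to reimplement the grid search ourselves. Note that, as written, `MyCrossValidation` fits each configuration on the full reduced training set and scores it on that same data, so the RMSE values it reports are training errors rather than cross-validated ones.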

In [49]:
import sklearn.model_selection

class MyCrossValidation:
    def __init__(self, model, params):
        self.model_list = [
            (model(**args, verbose=False), args)
            for args in list(sklearn.model_selection.ParameterGrid(params))
        ]
        self.full_train_set = train_set_reduced.build_full_trainset()

    def __train_test_model(self, model, params, verbose=1):
        model.fit(
            self.full_train_set,
        )
        predictions = model.test(self.full_train_set.build_testset())
        score = surprise.accuracy.rmse(
            predictions, verbose=True if verbose == 2 else False
        )
        if verbose == 1:
            print("Params {} :".format(str(params)), score)
        return (params, score)

    def __call__(self, verbose=1):
        res = []
        while len(self.model_list):
            model, params = self.model_list.pop()
            if verbose == 1:
                print("{} left".format(len(self.model_list)), end=" --- ")
            res.append(self.__train_test_model(model, params, verbose))
            del model
        return sorted(res, key=lambda x: x[1])

In [55]:
# ---------------------------- GRID SEARCH FOR NMF --------------------------- #
params = {
    "biased" : [False],
    "reg_bu" : [0.05, 0.1, 0.5],
    "reg_bi" : [0.005, 0.01, 0.05],
    "reg_qi" : [0.005, 0.01, 0.05],
    "reg_pu" : [0.0005, 0.001, 0.005],
}

best_nmf_models = MyCrossValidation(surprise.prediction_algorithms.matrix_factorization.NMF, params)()
best_nmf_models[0]

80 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.05} : 0.1689494216094797
79 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.01} : 0.630855463621955
78 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.005} : 0.8178975901586434
77 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.05} : 0.35472184334815393
76 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.01} : 0.8512590494453293
75 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.005} : 0.8915309489655097
74 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.0005, 'reg_qi': 0.05} : 0.4024654028030344
73 left --- Params {'biased': False, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.0005, 'reg_qi': 0.01} : 0.8660040652416524
72 left --- Params {

({'biased': False,
  'reg_bi': 0.01,
  'reg_bu': 0.1,
  'reg_pu': 0.005,
  'reg_qi': 0.05},
 0.167929892879959)
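
The first progress line above reads `80 left`, meaning 81 configurations were queued. The sketch below reproduces that count with `itertools.product`, as a stand-in for what `ParameterGrid` does inside `MyCrossValidation`. The helper `expand_grid` is ours, and sorting the keys simply mimics the alphabetical order seen in the printed parameters.

```python
from itertools import product

def expand_grid(param_grid):
    """Yield one dict per combination of parameter values, keys in sorted order."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))

# Same grid as the NMF search above.
nmf_grid = {
    "biased": [False],
    "reg_bu": [0.05, 0.1, 0.5],
    "reg_bi": [0.005, 0.01, 0.05],
    "reg_qi": [0.005, 0.01, 0.05],
    "reg_pu": [0.0005, 0.001, 0.005],
}

combinations = list(expand_grid(nmf_grid))
print(len(combinations))   # 1 * 3 * 3 * 3 * 3 combinations
print(combinations[-1])    # last combination, the first one popped by the search loop
```

The grid size is the product of the list lengths, so each extra three-value parameter triples the number of models to fit.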

In [56]:
# ---------------------------- GRID SEARCH FOR SVD --------------------------- #
params = {
    "biased" : [False],
    "init_std_dev" : [0.5, 1, 5],
    "lr_all" : [0.001],
    "reg_bu" : [0.05, 0.1, 0.5],
    "reg_bi" : [0.005, 0.01, 0.05],
    "reg_qi" : [0.005, 0.01, 0.05],
    "reg_pu" : [0.0005, 0.001, 0.005],
}

best_svd_models = MyCrossValidation(surprise.prediction_algorithms.matrix_factorization.SVD, params)()
best_svd_models[0]

242 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.05} : 0.90553626109996
241 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.01} : 0.9055762371550319
240 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.005, 'reg_qi': 0.005} : 0.9045399352893427
239 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.05} : 0.9050008062032734
238 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.01} : 0.9041406931049152
237 left --- Params {'biased': False, 'init_std_dev': 5, 'lr_all': 0.001, 'reg_bi': 0.05, 'reg_bu': 0.5, 'reg_pu': 0.001, 'reg_qi': 0.005} : 0.905466696711087
236 left --- Params {'biased': False, 'init_std

({'biased': False,
  'init_std_dev': 1,
  'lr_all': 0.001,
  'reg_bi': 0.01,
  'reg_bu': 0.1,
  'reg_pu': 0.001,
  'reg_qi': 0.005},
 0.10217656988077073)

## Evaluation

**Let's create a function that scores items by their similarity to a given item.** For this, we reuse `get_item_similarity`, defined in the Data Exploration section. We found that the average number of shared features between items is about 2.6 times the average number of shared feature values (12.59 versus 4.90), which we round to 3. Hence, we score items with the following formula:

$$
score = \dfrac {similar\_feature\_number + 3 \times similar\_value\_number} {average\_similar\_feature + 3 \times average\_similar\_value}
$$

In [32]:
def get_item_similarity_score(item_features_df, item_id):
    similarity = get_item_similarity(item_features_df, item_id)
    if similarity is None:
        return None
    
    res = (similarity.similar_feature_id + 3 * similarity.similar_value_id) / (similarity.similar_feature_id.mean() + 3 * similarity.similar_value_id.mean())
    res /= max(res)
    return res

In [33]:
get_item_similarity_score(item_features_df, item_id=2).describe()

count    22205.000000
mean         0.345297
std          0.159369
min          0.102041
25%          0.224490
50%          0.285714
75%          0.408163
max          1.000000
dtype: float64
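
One detail of `get_item_similarity_score` is worth spelling out: because the result is divided by its own maximum, the denominator of the formula cancels, and the final score depends only on the weighted sum in the numerator. A minimal pure-Python sketch illustrates this; the three items and their counts are made up.

```python
# Invented counts of shared features and shared feature values for three items.
shared_features = {"a": 10, "b": 6, "c": 2}
shared_values = {"a": 4, "b": 1, "c": 0}
VALUE_WEIGHT = 3  # weight applied to shared values in the formula

def weighted_sum(item):
    """Numerator of the score for one item."""
    return shared_features[item] + VALUE_WEIGHT * shared_values[item]

# Denominator, computed from the per-item means as the function does.
mean_features = sum(shared_features.values()) / len(shared_features)
mean_values = sum(shared_values.values()) / len(shared_values)
denominator = mean_features + VALUE_WEIGHT * mean_values

# Score as in the function: divide by the denominator, then by the maximum.
raw_scores = {item: weighted_sum(item) / denominator for item in shared_features}
top_score = max(raw_scores.values())
normalised = {item: score / top_score for item, score in raw_scores.items()}

# Shortcut: divide the weighted sums directly by their maximum.
top_sum = max(weighted_sum(item) for item in shared_features)
shortcut = {item: weighted_sum(item) / top_sum for item in shared_features}

print(normalised)
print(shortcut)
```

Both dictionaries hold the same values up to floating-point rounding, so the averages only matter if the final division by the maximum is dropped.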

The users of the platform in the test set are completely unknown to us, so a model fitted on the training users cannot predict for them directly. Therefore, we will proceed as follows:
1. Identify all similar users in the training set.
2. Make predictions for each of them and weight the results by their similarity to the user to be predicted.
3. Further improve the results by using the similarity between items.
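
These steps are not implemented in this notebook yet. The following is only a rough sketch of steps 1 and 2 under assumptions of our own: user similarity is taken to be the Jaccard overlap of viewed item sets, and each similar training session votes for its purchased item with that weight. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def jaccard(items_a, items_b):
    """Jaccard overlap between two sets of item ids (assumed similarity measure)."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0

def recommend(test_items, train_views, train_purchases, top_k=3):
    """Rank purchased items by similarity-weighted votes from training sessions."""
    votes = defaultdict(float)
    for session_id, viewed in train_views.items():
        weight = jaccard(test_items, viewed)
        if weight > 0:
            votes[train_purchases[session_id]] += weight
    # Highest vote first; ties broken by item id for a stable order.
    ranked = sorted(votes.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_k]]

# Toy training data: viewed items and purchased item per session.
train_views = {1: {10, 11}, 2: {10, 12}, 3: {30, 31}}
train_purchases = {1: 100, 2: 200, 3: 300}

recommendations = recommend({10, 11}, train_views, train_purchases)
print(recommendations)
```

Step 3 could then rerank this list with `get_item_similarity_score`, for example by boosting candidates that resemble the items viewed in the test session.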