<a href="https://colab.research.google.com/github/glutamatt/semi_perso_user_cold_start/blob/colab/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!rm -fr repository deezer data
!git clone https://github.com/glutamatt/semi_perso_user_cold_start.git repository
!cd repository && git checkout colab && cd -

Cloning into 'repository'...
remote: Enumerating objects: 157, done.[K
remote: Counting objects: 100% (157/157), done.[K
remote: Compressing objects: 100% (76/76), done.[K
remote: Total 157 (delta 87), reused 146 (delta 80), pack-reused 0[K
Receiving objects: 100% (157/157), 601.29 KiB | 18.22 MiB/s, done.
Resolving deltas: 100% (87/87), done.
Branch 'colab' set up to track remote branch 'colab' from 'origin'.
Switched to a new branch 'colab'
/content


# Test de contenu

impec 

## Installation

### Code

```Bash
git clone https://github.com/deezer/carousel_bandits
cd carousel_bandits
```

Requirements: python 3, matplotlib, numpy, pandas, scipy, seaborn

Media services providers, such as the music streaming platform [Deezer](https://www.deezer.com/), often leverage **swipeable carousels** to recommend personalized content to their users. These carousels are ranked lists of _L_ items or **cards** from a substantially larger catalog (of size _K_), e.g. _L_ albums, artists or playlists recommended on the homepage of the Deezer app. Only a few cards, say _L_init_ < _L_, are initially displayed to users, who can **swipe** the screen to see additional cards.

Selecting the most relevant content to display in carousels is a challenging task, as the catalog is large and as users have different preferences. Also, ranking matters: some cards might not be seen by some users due to the swipeable structure.

<p align="center">
  <img height="350" src="https://raw.githubusercontent.com/deezer/carousel_bandits/master/images/carousel.png">
</p>

In [2]:
!pip install fastparquet

Collecting fastparquet
[?25l  Downloading https://files.pythonhosted.org/packages/bb/ce/ce91d349d724a879e2c6d5cf18e06c6cd44ce1b27be66232eee445badc4b/fastparquet-0.6.3.tar.gz (318kB)
[K     |████████████████████████████████| 327kB 8.4MB/s 
Collecting thrift>=0.11.0
[?25l  Downloading https://files.pythonhosted.org/packages/97/1e/3284d19d7be99305eda145b8aa46b0c33244e4a496ec66440dac19f8274d/thrift-0.13.0.tar.gz (59kB)
[K     |████████████████████████████████| 61kB 8.8MB/s 
[?25hCollecting cramjam>=2.3.0
[?25l  Downloading https://files.pythonhosted.org/packages/9d/62/fd7cfea4e64b85709a4f0d5dadccebdcb30e0131a2c57b2a7640cd98bde7/cramjam-2.3.2-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.5MB)
[K     |████████████████████████████████| 1.5MB 26.6MB/s 
[?25hCollecting fsspec
[?25l  Downloading https://files.pythonhosted.org/packages/0e/3a/666e63625a19883ae8e1674099e631f9737bd5478c4790e5ad49c5ac5261/fsspec-2021.6.1-py3-none-any.whl (115kB)
[K     |██████████████████████████

In [3]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn
import time
import statistics
import pickle
import random
from sklearn.metrics import ndcg_score, dcg_score
from tqdm import tqdm
from sklearn.cluster import KMeans
from sklearn.preprocessing import Normalizer

In [4]:
def generate(dataset_path, master_path):

    #songs

    song_embeddings_path = dataset_path + "/song_embeddings.parquet"
    song_embeddings = pd.read_parquet(song_embeddings_path, engine = 'fastparquet').fillna(0)

    if not os.path.exists(master_path+"/m_song_dict.pkl"):
        song_dict = {}
        for idx, row in song_embeddings.iterrows():
            song_dict[row['song_index']] = idx
        pickle.dump(song_dict, open("{}/m_song_dict.pkl".format(master_path), "wb"))
    else:
        song_dict = pickle.load(open("{}/m_song_dict.pkl".format(master_path), "rb"))


    # user embeddings (target = only for train users)

    user_embeddings = pd.read_parquet(dataset_path + "/user_embeddings.parquet", engine = 'fastparquet')
    list_embeddings = ["embedding_"+str(i) for i in range(len(user_embeddings["svd_embeddings"][0]))]
    user_embeddings[list_embeddings] = pd.DataFrame(user_embeddings.svd_embeddings.tolist(), index= user_embeddings.index)
    embeddings_train = user_embeddings[list_embeddings].values

    # user features train

    features_train_path = dataset_path + "/user_features_train.parquet"
    features_train = pd.read_parquet(features_train_path, engine = 'fastparquet').fillna(0)
    features_train = features_train.sort_values("user_index")
    features_train = features_train.reset_index(drop=True)#to check it is ok for train data

    # training dataset creation

    dataset = "train"
    if not os.path.exists(master_path+"/"):
        os.mkdir(master_path+"/")
    if not os.path.exists(master_path+"/"+dataset+"/"):
        os.mkdir(master_path+"/"+dataset+"/")
    for idx in range(len(features_train)):
        x_train = torch.FloatTensor(features_train.iloc[idx,2:])
        y_train = torch.FloatTensor(user_embeddings[list_embeddings].iloc[idx,:])
        pickle.dump(x_train, open("{}/{}/x_train_{}.pkl".format(master_path, dataset, idx), "wb"))
        pickle.dump(y_train, open("{}/{}/y_train_{}.pkl".format(master_path, dataset, idx), "wb"))

    # user features validation & test

    for dataset in dataset_eval :
        features_validation_path = dataset_path + "/user_features_" + dataset + ".parquet"
        features_validation = pd.read_parquet(features_validation_path, engine = 'fastparquet').fillna(0)
        features_validation = features_validation.sort_values("user_index")
        features_validation = features_validation.reset_index(drop=True)

        if not os.path.exists(master_path+"/"+dataset+"/"):
            os.mkdir(master_path+"/"+dataset+"/"+"/")
        for i in range(len(features_validation)):
            x_validation = torch.FloatTensor(features_validation.iloc[i,2:])
            y_validation = [song_dict[song_index]  for song_index in features_validation["d1d30_songs"][i]]
            groundtruth_validation_list = [1.0 * (song in y_validation) for song in range(len(song_embeddings))]
            pickle.dump(x_validation, open("{}/{}/x_validation_{}.pkl".format(master_path, dataset, i), "wb"))
            pickle.dump(y_validation, open("{}/{}/y_listened_songs_validation_{}.pkl".format(master_path, dataset, i), "wb"))
            pickle.dump(groundtruth_validation_list, open("{}/{}/groundtruth_list_{}.pkl".format(master_path, dataset, i), "wb"))



In [5]:
master_path= "./deezer"
dataset_path = "repository/data"
dataset_eval = ["validation", "test"]
if not os.path.exists("{}/".format(master_path)):
  os.mkdir("{}/".format(master_path))
  # preparing dataset. It needs about XXGB of your hard disk space.
  generate(dataset_path, master_path)
else:
  print("%s already exists" % master_path)

In [6]:
!apt install -qqy tree && tree deezer

The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Selecting previously unselected package tree.
(Reading database ... 160815 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
deezer
├── m_song_dict.pkl
├── test
│   ├── groundtruth_list_0.pkl
│   ├── groundtruth_list_1.pkl
│   ├── groundtruth_list_2.pkl
│   ├── groundtruth_list_3.pkl
│   ├── groundtruth_list_4.pkl
│   ├── x_validation_0.pkl
│   ├── x_validation_1.pkl
│   ├── x_validation_2.pkl
│   ├── x_validation_3.pkl
│   ├── x_validation_4.pkl
│   ├── y_listened_songs_validation_0.pkl
│   ├── y_listened_songs_validation_1.pkl
│   ├── y_listened_songs_validation_2.pkl
│   ├── y_listened_songs_validation_3.pkl
│   └

In [7]:
import torch
import torch.nn.functional as F
import torch.nn

class RegressionTripleHidden(torch.nn.Module):
    def __init__(self, input_dim, output_dim, first_hidden_dim = 400, second_hidden_dim = 300, third_hidden_dim = 200, drop_out = 0):
        super(RegressionTripleHidden, self).__init__()
        self.input_dim = input_dim
        self.first_hidden_dim  = first_hidden_dim
        self.second_hidden_dim  = second_hidden_dim
        self.third_hidden_dim  = third_hidden_dim
        self.output_dim = output_dim
        self.dpin = torch.nn.Dropout(drop_out)

        self.fc1 = torch.nn.Linear(self.input_dim, self.first_hidden_dim)
        self.fc1_bn = torch.nn.BatchNorm1d(self.first_hidden_dim)

        self.fc2 = torch.nn.Linear(self.first_hidden_dim, self.second_hidden_dim)
        self.fc2_bn = torch.nn.BatchNorm1d(self.second_hidden_dim)

        self.fc3 = torch.nn.Linear(self.second_hidden_dim, self.third_hidden_dim)
        self.fc3_bn = torch.nn.BatchNorm1d(self.third_hidden_dim)

        self.fc4 = torch.nn.Linear(self.third_hidden_dim, self.output_dim)

    def forward(self, x):
        hidden1 = self.fc1_bn(F.relu((self.fc1(self.dpin(x)))))
        hidden2 = self.fc2_bn(F.relu(self.fc2(hidden1)))
        hidden3 = self.fc3_bn(F.relu(self.fc3(hidden2)))
        output = F.normalize(self.fc4(hidden3), dim = 1)
        return output

In [8]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn
import time
import pickle
import random
#from model import RegressionTripleHidden
#from options import config

def training(dataset_path, master_path, eval=True, model_save=True, model_filename=None):#XXXX change None
    if config['use_cuda']:
        cuda = torch.device(0)

    input_dim = 2579 #dataset.shape[1]
    target_dim = config['embeddings_dim']

    nb_epochs = config['nb_epochs']
    learning_rate = config['learning_rate']
    reg_param = config['reg_param']
    drop_out = config['drop_out']
    batch_size = config['batch_size']
    eval_every = config['eval_every']
    k_val = config['k_val']

    regression_model = RegressionTripleHidden(input_dim=input_dim, output_dim = target_dim).cuda(device = cuda)
    criterion = torch.nn.MSELoss()
    optimizer = torch.optim.Adam(regression_model.parameters(), lr = learning_rate, weight_decay=reg_param )

    if not os.path.exists(model_filename):

        # Load training dataset.
        training_set_size = int(len(os.listdir("{}/train".format(master_path))) / 2)
        train_xs = []
        train_ys = []
        for idx in range(training_set_size):
            train_xs.append(pickle.load(open("{}/train/x_train_{}.pkl".format(master_path, idx), "rb")))
            train_ys.append(pickle.load(open("{}/train/y_train_{}.pkl".format(master_path, idx), "rb")))
        total_dataset = list(zip(train_xs, train_ys))
        del(train_xs, train_ys)

        if eval:

            # Load validation dataset.

            validation_set_size = int(len(os.listdir("{}/validation".format(master_path))) / 3)
            validation_xs = []
            listened_songs_validation_ys = []
            for idx in range(validation_set_size):
                validation_xs.append(pickle.load(open("{}/validation/x_validation_{}.pkl".format(master_path, idx), "rb")))
                listened_songs_validation_ys.append(pickle.load(open("{}/validation/y_listened_songs_validation_{}.pkl".format(master_path, idx), "rb")))
            total_validation_dataset = list(zip(validation_xs, listened_songs_validation_ys))
            del(validation_xs, listened_songs_validation_ys)

            # Load song embeddings for evaluation

            song_embeddings_path = dataset_path + "/song_embeddings.parquet"
            song_embeddings = pd.read_parquet(song_embeddings_path, engine = 'fastparquet').fillna(0)
            list_features = ["feature_"+str(i) for i in range(len(song_embeddings["features_svd"][0]))]
            song_embeddings[list_features] = pd.DataFrame(song_embeddings.features_svd.tolist(), index= song_embeddings.index)
            song_embeddings_values = song_embeddings[list_features].values
            song_embeddings_values_ = torch.FloatTensor(song_embeddings_values.astype(np.float32))

        training_set_size = len(total_dataset)
        print("training set size : "+str(training_set_size))
        print("validation set size : "+str(validation_set_size))
        print("regression model : "+ str(regression_model))
        print("training running")
        loss_train = []
        for nb in range(nb_epochs):
            print("nb epoch : "+str(nb))
            start_time_epoch = time.time()
            random.Random(nb).shuffle(total_dataset)
            a,b = zip(*total_dataset)
            num_batch = int(training_set_size / batch_size)
            max_loc = batch_size*num_batch
            current_loss = 0
            regression_model = regression_model.to(device = cuda)
            for i in range(num_batch):
                optimizer.zero_grad()
                batch_features_tensor = torch.stack(a[batch_size*i:batch_size*(i+1)]).cuda(device = cuda)
                batch_target_tensor = torch.stack(b[batch_size*i:batch_size*(i+1)]).cuda(device = cuda)
                output_tensor = regression_model(batch_features_tensor)
                loss = criterion(output_tensor, batch_target_tensor)
                loss.backward()
                optimizer.step()
                loss_train.append(loss.item())
            print('epoch ' + str(nb) +  " training loss : "+ str(sum(loss_train)/float(len(loss_train))))
            print("--- seconds ---" + str(time.time() - start_time_epoch))

            if nb != 0 and (nb % eval_every == 0 or nb == nb_epochs - 1):
                print('testing model')
                start_time_eval = time.time()
                reg = regression_model.eval()
                reg = reg.to(device=cuda)
                validation_set_size = len(total_validation_dataset)
                a,b = zip(*total_validation_dataset)
                num_batch_validation = int(validation_set_size / batch_size)
                current_recalls = []
                with torch.set_grad_enabled(False):
                    for i in range(num_batch_validation):
                        batch_features_tensor_validation = torch.stack(a[batch_size*i:batch_size*(i+1)]).cuda(device = cuda)
                        predictions_validation = reg(batch_features_tensor_validation)
                        groundtruth_validation = list(b[batch_size*i:batch_size*(i+1)])
                        predictions_songs_validation = torch.mm(predictions_validation.cpu(), song_embeddings_values_.transpose(0, 1))
                        recommendations_validation = (predictions_songs_validation.topk(k= k_val, dim = 1)[1]).tolist()
                        recalls = list(map(lambda x, y: len(set(x) & set(y))/float(min(len(x), 50)), groundtruth_validation, recommendations_validation))
                        current_recalls.extend(recalls)
                print('epoch ' + str(nb) +  " recall test : "+ str(sum(current_recalls) / float(len(current_recalls))) )
                print("--- %s seconds ---" + str(time.time() - start_time_eval))

        if model_save:
            torch.save(regression_model.state_dict(), master_path + "/"+model_filename+".pt")

    else:
        trained_state_dict = torch.load(master_path + "/"+model_filename+".pt")
        RegressionTripleHidden.load_state_dict(trained_state_dict)


In [9]:
config = {
    # user
    'embeddings_dim': 128,
    # cuda setting
    'use_cuda': True,
    # model setting
    'nb_epochs': 130,
    'learning_rate': 0.00001,
    'batch_size': 2,#512,XXX to change
    'reg_param': 0,
    'drop_out': 0,
    # model training
    'eval_every': 10,
    'k_val': 3,#50, XXX to change
}

training(dataset_path, master_path, eval=True, model_save=True, model_filename="20210709_svd_sample")

training set size : 5
validation set size : 5
regression model : RegressionTripleHidden(
  (dpin): Dropout(p=0, inplace=False)
  (fc1): Linear(in_features=2579, out_features=400, bias=True)
  (fc1_bn): BatchNorm1d(400, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc2_bn): BatchNorm1d(300, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc3): Linear(in_features=300, out_features=200, bias=True)
  (fc3_bn): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc4): Linear(in_features=200, out_features=128, bias=True)
)
training running
nb epoch : 0
epoch 0 training loss : 0.014968918170779943
--- seconds ---0.2746601104736328
nb epoch : 1
epoch 1 training loss : 0.01514828996732831
--- seconds ---0.0073049068450927734
nb epoch : 2
epoch 2 training loss : 0.015030663926154375
--- seconds ---0.0058879852294921875
nb epoch : 3
epoch 3 training loss : 0.014923