<a href="https://colab.research.google.com/github/glutamatt/semi_perso_user_cold_start/blob/colab/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Content test

Looks good.

## Installation

### Code

```Bash
git clone https://github.com/deezer/carousel_bandits
cd carousel_bandits
```

Requirements: Python 3, matplotlib, numpy, pandas, scipy, seaborn

Media service providers, such as the music streaming platform [Deezer](https://www.deezer.com/), often use **swipeable carousels** to recommend personalized content to their users. A carousel is a ranked list of _L_ items, or **cards**, drawn from a substantially larger catalog of size _K_, e.g. _L_ albums, artists or playlists recommended on the homepage of the Deezer app. Only a few cards, say _L_init_ < _L_, are displayed initially; users can **swipe** the screen to see additional cards.

Selecting the most relevant content to display in a carousel is challenging: the catalog is large and users have different preferences. Ranking also matters, because the swipeable structure means that some users never see the cards placed beyond the first _L_init_ positions.
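
To make the carousel setting concrete, here is a minimal sketch in plain Python. The names `build_carousel` and `seen_cards`, the toy scores, and the assumption that each swipe reveals a fixed number of extra cards are illustrative only; none of them come from the Deezer implementation.

```python
def build_carousel(scores, L):
    """Return the indices of the L highest-scoring items, best first.

    `scores` holds one relevance score per catalog item (K items in total).
    """
    ranked = sorted(range(len(scores)), key=lambda item: scores[item], reverse=True)
    return ranked[:L]


def seen_cards(carousel, L_init, n_swipes=0, cards_per_swipe=1):
    """Return the cards a user has seen after `n_swipes` swipes.

    Assumption for illustration: each swipe reveals `cards_per_swipe` more cards.
    """
    n_seen = min(len(carousel), L_init + n_swipes * cards_per_swipe)
    return carousel[:n_seen]


# Toy catalog of K = 8 items with made-up scores.
scores = [0.10, 0.80, 0.30, 0.95, 0.05, 0.60, 0.40, 0.20]
carousel = build_carousel(scores, L=5)
print(carousel)                                    # [3, 1, 5, 6, 2]
print(seen_cards(carousel, L_init=3))              # [3, 1, 5]
print(seen_cards(carousel, L_init=3, n_swipes=1))  # [3, 1, 5, 6]
```

In this toy run, cards 6 and 2 are in the carousel but stay invisible to a user who never swipes, which is why the order of the first _L_init_ cards carries most of the weight.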

<p align="center">
  <img height="350" src="https://raw.githubusercontent.com/deezer/carousel_bandits/master/images/carousel.png">
</p>

In [1]:
!pip install fastparquet



In [2]:
import os
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
import torch.nn
import time
import statistics
import pickle
import random
from sklearn.metrics import ndcg_score, dcg_score
from tqdm import tqdm
from sklearn.cluster import KMeans
from sklearn.preprocessing import Normalizer

In [3]:
!mkdir data
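# NOTE: the URL below points to user_features_validation.parquet, but the file is saved as
# song_embeddings.parquet. generate() later reads a 'song_index' column from that file,
# so this mismatch is a likely cause of the KeyError raised further down.
!wget -O data/song_embeddings.parquet https://github.com/glutamatt/semi_perso_user_cold_start/raw/master/data/user_features_validation.parquet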

mkdir: cannot create directory ‘data’: File exists
--2021-07-08 15:07:44--  https://github.com/glutamatt/semi_perso_user_cold_start/raw/master/data/user_features_validation.parquet
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/glutamatt/semi_perso_user_cold_start/master/data/user_features_validation.parquet [following]
--2021-07-08 15:07:44--  https://raw.githubusercontent.com/glutamatt/semi_perso_user_cold_start/master/data/user_features_validation.parquet
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 770115 (752K) [application/octet-stream]
Saving to: ‘data/song_embeddings.parquet’



In [14]:
def generate(master_path):
    dataset_path = os.path.join(os.getcwd(), "data")

    #songs

    song_embeddings_path = dataset_path + "/song_embeddings.parquet"
    song_embeddings = pd.read_parquet(song_embeddings_path, engine = 'fastparquet').fillna(0)

    # Map each song_index to its row position in song_embeddings, caching the mapping on disk.
    song_dict_path = os.path.join(master_path, "m_song_dict.pkl")
    if not os.path.exists(song_dict_path):
        song_dict = {row["song_index"]: idx for idx, row in song_embeddings.iterrows()}
        with open(song_dict_path, "wb") as f:
            pickle.dump(song_dict, f)
    else:
        with open(song_dict_path, "rb") as f:
            song_dict = pickle.load(f)


    # user embeddings (target = only for train users)

    user_embeddings = pd.read_parquet(dataset_path + "/user_embeddings.parquet", engine = 'fastparquet')
    list_embeddings = ["embedding_"+str(i) for i in range(len(user_embeddings["svd_embeddings"][0]))]
    user_embeddings[list_embeddings] = pd.DataFrame(user_embeddings.svd_embeddings.tolist(), index= user_embeddings.index)
    embeddings_train = user_embeddings[list_embeddings].values

    # user features train

    features_train_path = dataset_path + "/user_features_train.parquet"
    features_train = pd.read_parquet(features_train_path, engine = 'fastparquet').fillna(0)
    features_train = features_train.sort_values("user_index")
    features_train = features_train.reset_index(drop=True)  # TODO: check this is OK for the train data

    # training dataset creation

    state = "train"
    # makedirs creates master_path and the split subdirectory in one call.
    os.makedirs(os.path.join(master_path, state), exist_ok=True)
    for idx in range(len(features_train)):
        # The first two columns are skipped; the remaining columns are the input features.
        x_train = torch.FloatTensor(features_train.iloc[idx, 2:])
        # Assumes user_embeddings rows follow the same order as the sorted features_train.
        y_train = torch.FloatTensor(user_embeddings[list_embeddings].iloc[idx, :])
        with open("{}/{}/x_train_{}.pkl".format(master_path, state, idx), "wb") as f:
            pickle.dump(x_train, f)
        with open("{}/{}/y_train_{}.pkl".format(master_path, state, idx), "wb") as f:
            pickle.dump(y_train, f)

    # user features validation & test

    states = ["validation", "test"]
    for state in states :
        features_validation_path = dataset_path + "/user_features_" + state + ".parquet"
        features_validation = pd.read_parquet(features_validation_path, engine = 'fastparquet').fillna(0)
        features_validation = features_validation.sort_values("user_index")
        features_validation = features_validation.reset_index(drop=True)

        os.makedirs(os.path.join(master_path, state), exist_ok=True)
        for i in range(len(features_validation)):
            x_validation = torch.FloatTensor(features_validation.iloc[i, 2:])
            # Row positions (in song_embeddings) of the songs this user listened to.
            y_validation = [song_dict[song_index] for song_index in features_validation["d1d30_songs"][i]]
            # Binary relevance over the whole catalog; the set makes each membership test O(1).
            listened = set(y_validation)
            groundtruth_validation_list = [1.0 * (song in listened) for song in range(len(song_embeddings))]
            with open("{}/{}/x_validation_{}.pkl".format(master_path, state, i), "wb") as f:
                pickle.dump(x_validation, f)
            with open("{}/{}/y_listened_songs_validation_{}.pkl".format(master_path, state, i), "wb") as f:
                pickle.dump(y_validation, f)
            with open("{}/{}/groundtruth_list_{}.pkl".format(master_path, state, i), "wb") as f:
                pickle.dump(groundtruth_validation_list, f)
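
The validation loop in `generate` stores, for each user, a binary ground-truth list with one entry per song in the catalog. The sketch below reproduces that construction on a toy catalog and shows one way such a list could be scored against model predictions. The `groundtruth_vector`, `dcg` and `ndcg` helpers are hand-written in plain Python for illustration (they are not scikit-learn's `ndcg_score`, which the notebook imports), and all numbers are made up.

```python
import math


def groundtruth_vector(listened_positions, catalog_size):
    """Binary relevance list: 1.0 where the user listened to the song, else 0.0."""
    listened = set(listened_positions)  # set lookup is O(1) per song
    return [1.0 if song in listened else 0.0 for song in range(catalog_size)]


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(groundtruth, scores, k=None):
    """DCG of the ranking induced by `scores`, normalised by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [groundtruth[i] for i in order][:k]
    ideal = sorted(groundtruth, reverse=True)[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


# Toy example: a catalog of 6 songs; the user listened to songs 1 and 4.
truth = groundtruth_vector([1, 4], catalog_size=6)
print(truth)  # [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]

perfect = [0.1, 0.9, 0.2, 0.0, 0.8, 0.3]  # both listened songs ranked first
print(ndcg(truth, perfect))  # 1.0
```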


In [15]:
master_path= "./deezer"
if not os.path.exists("{}/".format(master_path)):
  os.mkdir("{}/".format(master_path))
  # preparing dataset. It needs about XXGB of your hard disk space.
  generate(master_path)
else:
  print("%s already exists" % master_path)

KeyError: ignored
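
One consequence of the guard in cell 15: the output directory is created before `generate` runs, so if `generate` raises (as with the `KeyError` above), a re-run finds the directory and skips generation even though the dataset is incomplete. A possible alternative, sketched here under assumptions, is to write a marker file only after generation succeeds. `prepare_once`, the `.done` file name, and the two stand-in generators are hypothetical and not part of the notebook.

```python
import os
import tempfile


def prepare_once(master_path, generate_fn, marker_name=".done"):
    """Run `generate_fn(master_path)` unless a previous run completed.

    Returns True if generation ran, False if it was skipped.
    """
    marker = os.path.join(master_path, marker_name)
    if os.path.exists(marker):
        print("%s already prepared" % master_path)
        return False
    os.makedirs(master_path, exist_ok=True)
    generate_fn(master_path)      # may raise; the marker is then never written
    with open(marker, "w") as f:  # reached only after a successful run
        f.write("ok\n")
    return True


def failing_generate(path):
    # Stand-in for a generate() call that fails part-way through.
    raise KeyError("song_index")


def working_generate(path):
    # Stand-in for a generate() call that succeeds.
    with open(os.path.join(path, "x_train_0.pkl"), "wb") as f:
        f.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "deezer")
    try:
        prepare_once(target, failing_generate)
    except KeyError:
        pass                                       # first attempt fails
    print(prepare_once(target, working_generate))  # True: retried, not skipped
    print(prepare_once(target, working_generate))  # False: marker now exists
```

With this pattern a failed run leaves no marker behind, so the next execution of the cell retries instead of reporting that the directory already exists.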