## Resources:  
* https://making.lyst.com/lightfm/docs/lightfm.html
* https://towardsdatascience.com/how-to-build-a-movie-recommender-system-in-python-using-lightfm-8fa49d7cbe3b
* https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/


In [1]:

import pandas as pd
import numpy as np
from time import time

from lightfm import LightFM
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k
from lightfm.cross_validation import random_train_test_split
from lightfm.data import Dataset

from scipy.sparse import csr_matrix



In [2]:
plays = pd.read_csv('datasets/user_artists.dat', sep='\t')
artists = pd.read_csv('datasets/artists.dat', sep='\t', usecols=['id','name'])

# Merge artist and user preference data
ap = pd.merge(artists, plays, how="inner", left_on="id", right_on="artistID")
ap = ap.rename(columns={"weight": "playCount"})

# Group artist by name
artist_rank = ap.groupby(['name']) \
    .agg({'userID' : 'count', 'playCount' : 'sum'}) \
    .rename(columns={"userID" : 'totalUsers', "playCount" : "totalPlays"}) \
    .sort_values(['totalPlays'], ascending=False)

artist_rank['avgPlays'] = artist_rank['totalPlays'] / artist_rank['totalUsers']
print(artist_rank)

                    totalUsers  totalPlays     avgPlays
name                                                   
Britney Spears             522     2393140  4584.559387
Depeche Mode               282     1301308  4614.567376
Lady Gaga                  611     1291387  2113.563011
Christina Aguilera         407     1058405  2600.503686
Paramore                   399      963449  2414.659148
...                        ...         ...          ...
Morris                       1           1     1.000000
Eddie Kendricks              1           1     1.000000
Excess Pressure              1           1     1.000000
My Mine                      1           1     1.000000
A.M. Architect               1           1     1.000000

[17632 rows x 3 columns]


In [3]:
#---------------------------------------------------------------------------------------------------
print(80*("_"))
print("\nPlays info:\n")
plays.info()
print(80*("_"))
print("\nap info:\n")
ap.info()
print(80*("_"))
print("\nartist info:\n")
artists.info()

________________________________________________________________________________

Plays info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92834 entries, 0 to 92833
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   userID    92834 non-null  int64
 1   artistID  92834 non-null  int64
 2   weight    92834 non-null  int64
dtypes: int64(3)
memory usage: 2.1 MB
________________________________________________________________________________

ap info:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92834 entries, 0 to 92833
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         92834 non-null  int64 
 1   name       92834 non-null  object
 2   userID     92834 non-null  int64 
 3   artistID   92834 non-null  int64 
 4   playCount  92834 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 4.2+ MB
___________________________________________________

In [4]:
# Merge into ap matrix
ap = ap.join(artist_rank, on="name", how="inner") \
    .sort_values(['playCount'], ascending=False)

# Preprocessing
pc = ap.playCount
play_count_scaled = (pc - pc.min()) / (pc.max() - pc.min())
ap = ap.assign(playCountScaled=play_count_scaled)
#print(ap)

# Build a user-artist rating matrix 
ratings_df = ap.pivot(index='userID', columns='artistID', values='playCountScaled')
ratings = ratings_df.fillna(0).values

# Show sparsity. Strictly speaking this is a density: the percentage of non-zero values in the matrix.
sparsity = float(len(ratings.nonzero()[0])) / (ratings.shape[0] * ratings.shape[1]) * 100
print(f"sparsity: {sparsity:.2f} %")


sparsity: 0.28 %
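
As the comment in the cell above points out, the printed value is really a density. A minimal sketch of the distinction, using a small made-up matrix (the values are hypothetical, not the Last.fm data):

```python
import numpy as np

# Toy 3x4 user-item matrix (hypothetical values).
toy = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.2, 0.0, 0.0, 0.0],
])

density = np.count_nonzero(toy) / toy.size * 100  # share of non-zero cells
sparsity = 100.0 - density                        # share of zero cells
print(f"density: {density:.2f} %, sparsity: {sparsity:.2f} %")
```

With this convention, the 0.28 % reported above would correspond to a sparsity of roughly 99.72 %.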


In [11]:
ratings.shape

(1892, 17632)

In [5]:
print(ap.info())
ap.artistID

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92834 entries, 2800 to 63982
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               92834 non-null  int64  
 1   name             92834 non-null  object 
 2   userID           92834 non-null  int64  
 3   artistID         92834 non-null  int64  
 4   playCount        92834 non-null  int64  
 5   totalUsers       92834 non-null  int64  
 6   totalPlays       92834 non-null  int64  
 7   avgPlays         92834 non-null  float64
 8   playCountScaled  92834 non-null  float64
dtypes: float64(2), int64(6), object(1)
memory usage: 7.1+ MB
None


2800        72
35843      792
27302      511
8152       203
26670      498
         ...  
38688      913
32955      697
71811     4988
91319    17080
63982     3201
Name: artistID, Length: 92834, dtype: int64

In [6]:
# Build a sparse matrix (TODO: can a COO sparse matrix be created directly?)
X = csr_matrix(ratings)

n_users, n_items = ratings_df.shape
print("rating matrix shape", ratings_df.shape)

user_ids = ratings_df.index.values
artist_names = ap.sort_values("artistID")["name"].unique()

rating matrix shape (1892, 17632)
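
The cell above asks whether a COO matrix can be built directly, without going through the dense pivot table. One possible approach, sketched here on hypothetical long-format triples (the arrays `user_raw`, `artist_raw` and `values` are made up and stand in for the `userID`, `artistID` and `playCountScaled` columns):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical long-format interactions: (user id, artist id, scaled play count).
user_raw = np.array([2, 2, 5, 9])
artist_raw = np.array([51, 72, 51, 89])
values = np.array([0.8, 0.1, 0.4, 1.0])

# Map raw ids to contiguous row/column indices (sorted by id).
user_ids, rows = np.unique(user_raw, return_inverse=True)
artist_ids, cols = np.unique(artist_raw, return_inverse=True)

# Build the sparse matrix straight from the triples.
X_coo = coo_matrix((values, (rows, cols)),
                   shape=(len(user_ids), len(artist_ids)))
print(X_coo.shape, X_coo.nnz)
```

This avoids materializing the dense users x artists array. One thing to keep in mind with either route: min-max scaling maps the smallest play count to exactly 0, and a zero is indistinguishable from a missing interaction once the dense matrix is converted to sparse form.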


In [7]:
# Build data references + train test
Xcoo = X.tocoo()
data = Dataset()
data.fit(np.arange(n_users), np.arange(n_items))
interactions, weights = data.build_interactions(zip(Xcoo.row, Xcoo.col, Xcoo.data)) 
train, test = random_train_test_split(interactions)

# Disabled attempt to restore the play-count weights (`interactions` is binary; the weights are returned separately in `weights`)
#train = train_.tocsr()
#test = test_.tocsr()
#train[train==1] = X[train==1]
#test[test==1] = X[test==1]

# To be completed...

In [8]:
# Train
model = LightFM(learning_rate=0.05, loss='warp')
model.fit(train, epochs=10, num_threads=2)

<lightfm.lightfm.LightFM at 0x7fcc2f240550>

In [9]:
# Evaluate
train_precision = precision_at_k(model, train, k=10).mean()
test_precision = precision_at_k(model, test, k=10, train_interactions=train).mean()

train_auc = auc_score(model, train).mean()
test_auc = auc_score(model, test, train_interactions=train).mean()

print('Precision: train %.2f, test %.2f.' % (train_precision, test_precision))
print('AUC: train %.2f, test %.2f.' % (train_auc, test_auc))

Precision: train 0.37, test 0.13.
AUC: train 0.96, test 0.86.
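
To make the precision figures above easier to read, here is a minimal NumPy sketch of what precision at k measures for a single user: the fraction of the k highest-scored items that are known positives. The function name and the toy arrays are illustrative assumptions; this is not LightFM's implementation.

```python
import numpy as np

def precision_at_k_single(scores, positives, k=10):
    """Fraction of the k highest-scored items that are known positives."""
    top_k = np.argsort(-scores)[:k]          # indices of the k best scores
    return np.isin(top_k, positives).mean()  # share of them that are positives

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # hypothetical scores for 5 items
positives = np.array([0, 3])                  # hypothetical held-out positives
p_at_3 = precision_at_k_single(scores, positives, k=3)
print(f"precision@3 = {p_at_3:.2f}")
```

In the evaluation cell, passing `train_interactions=train` asks LightFM to leave known training positives out of the test ranking, so they are not counted as wrong recommendations.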


In [10]:
# Predict
scores = model.predict(0, np.arange(n_items))
top_items = artist_names[np.argsort(-scores)]
print(top_items)

['Depeche Mode' 'Radiohead' 'Muse' ... 'Daville' 'Akira Senju/Akira Senju'
 'International Observer']
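
The ranking above covers every artist, including those the user has already played. Since the goal of the project is to recommend artists not yet listened to, one option is to mask the known items before sorting. A minimal sketch with hypothetical scores and a hypothetical helper `top_n_unseen`:

```python
import numpy as np

def top_n_unseen(scores, seen_items, n=3):
    """Indices of the n best-scored items, skipping items already seen."""
    masked = np.asarray(scores, dtype=float).copy()
    masked[seen_items] = -np.inf           # push known items to the end
    return np.argsort(-masked)[:n]

scores_demo = np.array([0.9, 0.1, 0.8, 0.3, 0.7])  # hypothetical model scores
seen_demo = np.array([0, 2])                       # items the user already played
recommended = top_n_unseen(scores_demo, seen_demo, n=3)
print(recommended)
```

In the notebook, the seen items for a user could be taken from the non-zero columns of that user's row in `train`, assuming the row index matches the internal user id.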


In [35]:
test.shape
train.shape


(1892, 17632)

In [72]:
param_loss= ["warp", "bpr", "warp-kos", "logistic"]
resultats= ""
dictionnaire_param_loss= {}
for ploss in param_loss:
    tps_deb= time()

    # Train
    model = LightFM(learning_rate=0.05, loss= ploss)
    model.fit(train, epochs=10, num_threads=2)

    # Evaluate
    train_precision = precision_at_k(model, train, k=10).mean()
    test_precision = precision_at_k(model, test, k=10, train_interactions=train).mean()

    train_auc = auc_score(model, train).mean()
    test_auc = auc_score(model, test, train_interactions=train).mean()
    
    # Predict
    scores = model.predict(0, np.arange(n_items))
    top_items = artist_names[np.argsort(-scores)]
    
    tps_fin= time()


    ch= "Méthode "+ ploss + ":"+ f"\nPrecision: train {train_precision:.2f}, test {test_precision:.2f}." + \
    f"\nAUC: train {train_auc:.2f}, test {test_auc:.2f}." +\
    f"\nRecommandation:\n{top_items}" + "\n\n"
    resultats+= ch
    dictionnaire_param_loss["Méthode " + ploss]={"Précision train": train_precision, 
                                                "Précision test": test_precision, "AUC train": train_auc, 
                                                "AUC test": test_auc, "Temps": tps_fin-tps_deb,
                                                "Recommandation": top_items}

print(resultats)

Méthode warp:
Precision: train 0.39, test 0.13.
AUC: train 0.97, test 0.85.
Recommandation:
['Duran Duran' 'Depeche Mode' 'Radiohead' ... 'DOGinTheパラレルワールドオーケストラ'
 'Donavon Hill' 'Envus']

Méthode bpr:
Precision: train 0.39, test 0.12.
AUC: train 0.85, test 0.78.
Recommandation:
['Depeche Mode' 'Duran Duran' 'Pet Shop Boys' ... 'Ke$ha' 'Miley Cyrus'
 'Avril Lavigne']

Méthode warp-kos:
Precision: train 0.34, test 0.12.
AUC: train 0.89, test 0.82.
Recommandation:
['Depeche Mode' 'The Beatles' 'Madonna' ... 'Elveda Rumeli' 'Hatice'
 'Styles & Breeze']

Méthode logistic:
Precision: train 0.21, test 0.07.
AUC: train 0.89, test 0.81.
Recommandation:
['Lady Gaga' 'Britney Spears' 'Katy Perry' ... 'Secret Shine'
 'Heinrich Ignaz Franz von Biber' 'やなわらばー']




### The `warp` loss gives the best results, most clearly on test AUC (0.85 versus 0.82 for `warp-kos`). It is therefore fixed, and the remaining parameters are to be optimized with GridSearchCV.

Resource: 
matrix factorization tutorial + grid search CV
    https://www.ethanrosenthal.com/2016/10/19/implicit-mf-part-1/

In [74]:
dictionnaire_param_loss

{'Méthode warp': {'Précision train': 0.3878579,
  'Précision test': 0.13297759,
  'AUC train': 0.96636134,
  'AUC test': 0.8542546,
  'Temps': 36.927067279815674,
  'Recommandation': array(['Duran Duran', 'Depeche Mode', 'Radiohead', ...,
         'DOGinTheパラレルワールドオーケストラ', 'Donavon Hill', 'Envus'], dtype=object)},
 'Méthode bpr': {'Précision train': 0.38531283,
  'Précision test': 0.12406617,
  'AUC train': 0.8501012,
  'AUC test': 0.77569395,
  'Temps': 37.497663497924805,
  'Recommandation': array(['Depeche Mode', 'Duran Duran', 'Pet Shop Boys', ..., 'Ke$ha',
         'Miley Cyrus', 'Avril Lavigne'], dtype=object)},
 'Méthode warp-kos': {'Précision train': 0.34390244,
  'Précision test': 0.12033084,
  'AUC train': 0.88780546,
  'AUC test': 0.8173134,
  'Temps': 39.19034719467163,
  'Recommandation': array(['Depeche Mode', 'The Beatles', 'Madonna', ..., 'Elveda Rumeli',
         'Hatice', 'Styles & Breeze'], dtype=object)},
 'Méthode logistic': {'Précision train': 0.20540828,
  'Préci

In [75]:
#param_learning_rate= np.arange(0.01, 0.2, 0.02)
#param_epoch= np.arange(5,100,5)
param_learning_rate= [0.01, 0.05, 0.1]
param_epoch= [5, 10, 15]
ploss= "warp"

#print(param_learning_rate)
#print(param_epoch)

resultats= ""
dictionnaire_learningrate_epoch= {}

for learning_rate in param_learning_rate:
    for epoch in param_epoch:
        
        tps_deb= time()

        # Train
        model = LightFM(learning_rate= learning_rate, loss= ploss)
        model.fit(train, epochs= epoch, num_threads=2)

        # Evaluate
        train_precision = precision_at_k(model, train, k=10).mean()
        test_precision = precision_at_k(model, test, k=10, train_interactions=train).mean()

        train_auc = auc_score(model, train).mean()
        test_auc = auc_score(model, test, train_interactions=train).mean()

        # Predict
        scores = model.predict(0, np.arange(n_items))
        top_items = artist_names[np.argsort(-scores)]

        tps_fin= time()


        # Label each result with the parameters tested in this iteration
        ch= f"learning_rate {learning_rate}, epochs {epoch}:" + f"\nPrecision: train {train_precision:.2f}, test {test_precision:.2f}." + \
        f"\nAUC: train {train_auc:.2f}, test {test_auc:.2f}." +\
        f"\nRecommandation:\n{top_items}" + "\n\n"
        resultats+= ch
        dictionnaire_learningrate_epoch["learning_rate" + str(learning_rate) + " epoch " + str(epoch)]= \
            {"Précision train": train_precision, "Précision test": test_precision, "AUC train": train_auc, \
             "AUC test": test_auc, "Temps": tps_fin-tps_deb, "Recommandation": top_items}


In [77]:
dictionnaire_learningrate_epoch


{'learning_rate0.01 epoch 5': {'Précision train': 0.23865324,
  'Précision test': 0.082337245,
  'AUC train': 0.8701217,
  'AUC test': 0.79707277,
  'Temps': 38.11367702484131,
  'Recommandation': array(['Lady Gaga', 'The Beatles', 'Britney Spears', ...,
         'Three 6 Mafia feat. Flo Rida, Sean Kingston & Tiësto',
         'Mackenzie 1st', '7000$'], dtype=object)},
 'learning_rate0.01 epoch 10': {'Précision train': 0.302386,
  'Précision test': 0.10640342,
  'AUC train': 0.89082557,
  'AUC test': 0.809808,
  'Temps': 36.344521284103394,
  'Recommandation': array(['The Beatles', 'Coldplay', 'Radiohead', ..., 'Coresplittaz',
         'Shiva in Exile', 'Deborah Harry'], dtype=object)},
 'learning_rate0.01 epoch 15': {'Précision train': 0.21399789,
  'Précision test': 0.070437565,
  'AUC train': 0.89287406,
  'AUC test': 0.8102775,
  'Temps': 36.31895351409912,
  'Recommandation': array(['Lady Gaga', 'Britney Spears', 'Katy Perry', ...,
         'Shaik Abu Baker Al-Shatiri', 'Juju', 'W

In [68]:
pd.options.display.max_columns = None

In [70]:
print(resultats)

Méthode warp:
Precision: train 0.38, test 0.13.
AUC: train 0.97, test 0.86.
Recommandation:
['Coldplay' 'Muse' 'The Killers' ... 'Ekaros' 'The Easybeats' 'Swift Guad']

Méthode bpr:
Precision: train 0.37, test 0.12.
AUC: train 0.85, test 0.78.
Recommandation:
['Coldplay' 'Depeche Mode' 'Muse' ... 'Rihanna' 'Ashley Tisdale'
 'Miley Cyrus']

Méthode warp-kos:
Precision: train 0.35, test 0.13.
AUC: train 0.89, test 0.82.
Recommandation:
['Pet Shop Boys' 'Madonna' 'Goldfrapp' ... 'Centr' 'Ассаи' 'True Star']

Méthode logistic:
Precision: train 0.20, test 0.07.
AUC: train 0.89, test 0.81.
Recommandation:
['Lady Gaga' 'Britney Spears' 'Rihanna' ... 'Patrizio Buanne'
 'The Union Underground' 'Bootsy Collins']




In [16]:
model = LightFM()
param={"no_components":[10,25,50,75,100,150,200], "loss": ['warp'], "learning_rate":[0.05, 0.07]}


#train_csr = train_set.tocsr()
#test_set = test_set.tocsr()



(1892, 17632)

In [34]:
grid.fit(train)

ValueError: 'auc_score' is not a valid scoring value. Use sorted(sklearn.metrics.SCORERS.keys()) to get valid options.
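
The error above arises because scikit-learn's grid search expects one of its own scorer names. A simple alternative is a hand-written loop over all parameter combinations. The sketch below is generic: `manual_grid_search` and `toy_evaluate` are hypothetical names, and the toy objective only stands in for a function that would fit `LightFM(**params)` on `train` and return the mean test AUC.

```python
from itertools import product

def manual_grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best (score, params)."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)            # higher is better
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Stand-in objective for illustration only (peaks at 50 components, lr 0.05).
def toy_evaluate(params):
    return -abs(params["no_components"] - 50) - abs(params["learning_rate"] - 0.05)

param_grid = {"no_components": [10, 25, 50, 75], "learning_rate": [0.05, 0.07]}
best_score, best_params = manual_grid_search(param_grid, toy_evaluate)
print(best_params)
```

Because the evaluation function is passed in, the same loop works with AUC, precision or recall as the selection criterion.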

In [77]:
from IPython.display import display 

pd.options.display.max_columns = None


In [80]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)





# Recommender Systems

Build, understand, and tune a recommender system.

# Description

## Familiarization

Recommender systems are traditionally used, as the name suggests, to recommend content to users.
For example, they can recommend movies to users based on the ones they have already watched, or music, or videos, or power "more like this" features.

We will start by following and reproducing the steps of this tutorial: 

*  https://www.datacamp.com/community/tutorials/recommender-systems-python

Assuming you have little RAM, we will stop at the point where the `compute_sim` variable is computed.


**Step 1: simple recommender**
What is the memory complexity of this operation?
(Use `cosine_similarity`, which uses less memory; it still needs 8 GB, which is feasible on Colab.)
Does it fit on your machine?

What is the author trying to do with this computation?
How can we work around this problem?
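
As one possible way to reason about these questions: a dense similarity matrix over n items holds n² values, so its memory grows quadratically. The sketch below estimates that size and shows one workaround, computing only the top-k neighbours chunk by chunk so the full matrix is never stored. Function names, the chunk size and the toy features are assumptions for illustration.

```python
import numpy as np

def dense_similarity_gib(n_items, bytes_per_value=8):
    """Memory needed by a dense n_items x n_items matrix, in GiB."""
    return n_items ** 2 * bytes_per_value / 2 ** 30

def top_k_cosine(features, k=2, chunk_size=2):
    """Indices of the k most similar rows for each row, chunk by chunk."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)  # L2-normalize rows
    neighbours = np.empty((len(unit), k), dtype=int)
    for start in range(0, len(unit), chunk_size):
        stop = min(start + chunk_size, len(unit))
        sims = unit[start:stop] @ unit.T   # only chunk_size x n values in memory
        # Exclude each row's similarity with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        neighbours[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return neighbours

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbours = top_k_cosine(features, k=1, chunk_size=3)
print(neighbours.ravel(), dense_similarity_gib(32768))
```

The chunk size trades memory for the number of matrix products; only the k retained indices per item are kept.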


**Step 2: content-based recommender**

Implement the second part while avoiding the matrix product.

**Step 3: improvements**

Code the two improvements:
1. Introduce a popularity filter: this recommender would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating, and return the top 10 movies.
2. Use PCA with 100 components to speed up your similarity search. Are the results consistent?
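
The popularity filter in the first improvement can be sketched as follows, assuming the IMDB weighted rating WR = v/(v+m) · R + m/(v+m) · C, where v is the number of votes, R the average rating, m the minimum-votes threshold and C the mean rating over all movies. The helper names and toy numbers are hypothetical.

```python
import numpy as np

def weighted_rating(vote_count, vote_average, m, C):
    """IMDB-style weighted rating: shrinks ratings with few votes toward C."""
    v = np.asarray(vote_count, dtype=float)
    R = np.asarray(vote_average, dtype=float)
    return v / (v + m) * R + m / (v + m) * C

def popularity_filter(similar_idx, vote_count, vote_average, m, C, top_n=10):
    """Re-rank candidate movie indices by weighted rating, keep the top_n."""
    similar_idx = np.asarray(similar_idx)
    wr = weighted_rating(vote_count[similar_idx], vote_average[similar_idx], m, C)
    return similar_idx[np.argsort(-wr)[:top_n]]

# Toy data: three candidate movies (hypothetical vote counts and averages).
vote_count = np.array([1000, 10, 500])
vote_average = np.array([8.0, 9.5, 6.0])
best = popularity_filter([2, 1, 0], vote_count, vote_average, m=100, C=7.0, top_n=2)
print(best)
```

In the exercise, `similar_idx` would hold the 30 most similar movies returned by the content-based step and `top_n` would be 10.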


## LastFM Project

Mr. Pontier contacts you for help building a recommender system. He has a database of (anonymized) user data containing the artists his users listen to on his platform, along with their play counts. Mr. Pontier wants to recommend to his users artists they have not listened to yet, based on their musical preferences.

Mr. Pontier wants to use the LightFM library, for which he already has a driver that loads his data and that he provides to you, a real bonus.
Mr. Pontier has seen that the documentation covers several models; he wants to evaluate the models on a train/test split and use the best one.

For the evaluation, he wants to compare AUC, precision, and recall (see the LightFM documentation), presented in a table.


#### Bonus 1

Compare the AUC of the best LightFM model with that of a PCA (TruncatedSVD).
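
A rough sketch of how such a baseline could be scored, under stated assumptions: `np.linalg.svd` stands in for scikit-learn's TruncatedSVD, and AUC is computed per user as the fraction of (positive, negative) item pairs ranked in the right order. The matrix and helper names are hypothetical.

```python
import numpy as np

def truncated_svd_scores(matrix, n_components):
    """Rank-n_components reconstruction of the matrix, used as item scores."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]

def user_auc(scores, positives_mask):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos, neg = scores[positives_mask], scores[~positives_mask]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy train matrix (3 users x 4 items) and one held-out positive for user 0.
train_demo = np.array([[1.0, 0.0, 1.0, 0.0],
                       [1.0, 1.0, 1.0, 0.0],
                       [0.0, 0.0, 0.0, 1.0]])
svd_scores = truncated_svd_scores(train_demo, n_components=2)
held_out = np.array([False, True, False, False])
print(user_auc(svd_scores[0], held_out))
```

For a fair comparison with LightFM, the same train/test split should be used and known training positives excluded from the ranking.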


#### Bonus 2

Since training must be as fast as possible while still achieving the best results, you are asked to find the number of iterations needed to reach 95% of the maximum AUC value on the test set.
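
One way to read this criterion, sketched on a made-up AUC curve: record the test AUC after each epoch, then take the first epoch whose AUC is at least 95% of the maximum observed. The helper name and the numbers are hypothetical; in the notebook the curve might be collected by calling `model.fit_partial(train, epochs=1)` repeatedly and evaluating `auc_score` after each call.

```python
import numpy as np

def epochs_to_reach(auc_per_epoch, fraction=0.95):
    """First epoch (counted from 1) whose AUC reaches fraction * max AUC."""
    auc = np.asarray(auc_per_epoch, dtype=float)
    threshold = fraction * auc.max()
    return int(np.argmax(auc >= threshold)) + 1  # argmax gives the first True

# Hypothetical test AUC after epochs 1..6.
auc_curve = [0.60, 0.75, 0.82, 0.85, 0.86, 0.86]
n_epochs = epochs_to_reach(auc_curve)
print(n_epochs)
```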


### Background research

Which recommender system will you put in place?

What is LightFM?

What is an "implicit feedback" recommender system? And an "explicit feedback" one?


### Resources: 

* LightFM: https://github.com/lyst/lightfm
* Last.fm dataset: https://grouplens.org/datasets/hetrec-2011/
* https://towardsdatascience.com/recommendation-system-in-python-lightfm-61c85010ce17
  
  

# Code snippets ...

In [None]:
import sklearn
sorted(sklearn.metrics.SCORERS.keys())

In [23]:
data1= np.array([["toto","titi","tutu","tyty","tata","tete"],[10,11,12,13,14,15]]).T
df1= pd.DataFrame(data1, columns=["Artiste","id"])
print(df1)
data2= np.array([["d2toto","d2titi","d2tutu","d2tyty","d2tata","d2tete"],[10,11,12,13,14,15]]).T
df2= pd.DataFrame(data2, columns=["Utilisateur","id"])
print(df2)

  Artiste  id
0    toto  10
1    titi  11
2    tutu  12
3    tyty  13
4    tata  14
5    tete  15
  Utilisateur  id
0      d2toto  10
1      d2titi  11
2      d2tutu  12
3      d2tyty  13
4      d2tata  14
5      d2tete  15


In [24]:
df1df2 = pd.merge(df1, df2, how="inner")
df1df2

  Artiste  id  Utilisateur
0    toto  10       d2toto
1    titi  11       d2titi
2    tutu  12       d2tutu
3    tyty  13       d2tyty
4    tata  14       d2tata
5    tete  15       d2tete
