# Collaborative filtering - part 2

_Ismael Bonneau & Issam Benamara_

### Back to our collaborative filtering problem

As mentioned earlier, collaborative filtering automatically predicts ("filtering") a user's interests from the preferences of a large number of other users ("collaborative"), in order to recommend relevant products (films, series, music, items on an e-commerce site...) to that user.

<img src="images/obelix_cesar_garmonparnas.png" width="600" />

### Memory-based approach

The underlying assumption of collaborative filtering is that if a person A shares the opinion of a person B on one topic, A is more likely to share B's opinion on another topic than the opinion of a randomly chosen person.

One way to do collaborative filtering is the so-called **memory-based** approach. It uses the ratings given by each user to compute a similarity between users or between items. When the similarity is computed between users, we get a **user-based collaborative filtering** system.

To recommend items to a given user, the idea is therefore to **find the most similar users**, then to recommend what those users liked on average.

For example, the user most similar to Obélix here would be Ordralphabétix, because he likes boar and warm cervoise, and to a lesser extent Jules César and Ocatarinetabellatchitchix, who like boar. Obélix would therefore be strongly recommended fish (**fresh**, mind you! don't say it isn't fresh) and, a little less strongly, wine and pizza (César) and chestnuts (Ocatarinetabellatchitchix). Garmonparnas the Greek is not part of Obélix's neighbourhood because they have no preferences in common.
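The Obélix example can be sketched in a few lines of Python. This is only an illustration of the neighbourhood idea, using the number of shared preferences as a crude similarity. The likes of Garmonparnas are not given in the text, so the two items used for him are hypothetical placeholders, and `recommend_by_overlap` is a name introduced here.

```python
from collections import Counter

# Likes taken from the example above. Garmonparnas' tastes are not listed
# in the text, so his two items are hypothetical placeholders.
likes = {
    "Obélix": {"boar", "warm cervoise"},
    "Ordralphabétix": {"boar", "warm cervoise", "fish"},
    "Jules César": {"boar", "wine", "pizza"},
    "Ocatarinetabellatchitchix": {"boar", "chestnuts"},
    "Garmonparnas": {"olives", "feta"},  # hypothetical
}


def recommend_by_overlap(target, likes):
    """Score items unknown to `target` by the overlap of each neighbour."""
    scores = Counter()
    for other, items in likes.items():
        if other == target:
            continue
        weight = len(items & likes[target])  # number of shared preferences
        if weight == 0:
            continue  # no common preference: not in the neighbourhood
        for item in items - likes[target]:
            scores[item] += weight
    return scores


scores = recommend_by_overlap("Obélix", likes)
print(scores)
```

Fish gets the highest score because Ordralphabétix shares two preferences with Obélix, while wine, pizza and chestnuts come from neighbours who share only one.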

More seriously, several questions arise. How similarities between users are computed is a crucial part of the algorithm, since it determines the result. Several measures exist, for example the **Pearson correlation** coefficient (the "classic" correlation of statistics), which indicates how similarly two users rated the same items, cosine similarity, and many others. In every case the goal is the same: find the best way to identify the users most similar to a given one. We can then apply a ${K}$-NN algorithm (${K}$ nearest neighbours) to determine a "group" or neighbourhood of close users who share the "same tastes". The next step is to **aggregate the preferences** (ratings) of this neighbourhood in order to produce a recommendation. Here too, several solutions exist.
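As an illustration of the similarity step, here is a minimal sketch of a Pearson similarity restricted to co-rated items, followed by a ${K}$ nearest neighbours selection. It is not the method used in the rest of this notebook (which relies on word2vec embeddings); the function names, the `min_common` threshold and the toy ratings are assumptions made for the example.

```python
import math


def pearson(ratings_a, ratings_b, min_common=2):
    """Pearson correlation computed on the items both users rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < min_common:
        return None  # not enough co-rated items to say anything
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return None  # constant ratings: correlation is undefined
    return cov / (norm_a * norm_b)


def nearest_neighbours(user, all_ratings, k):
    """Return the k users with the highest Pearson similarity to `user`."""
    scored = []
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        sim = pearson(all_ratings[user], other_ratings)
        if sim is not None:
            scored.append((sim, other))
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


# Made-up ratings: user -> {series: rating out of 10}
ratings = {
    "u1": {"A": 9, "B": 7, "C": 3},
    "u2": {"A": 10, "B": 8, "C": 4},  # same ordering of tastes as u1
    "u3": {"A": 2, "B": 5, "C": 9},   # opposite tastes
    "u4": {"D": 8},                   # nothing in common with u1
}
print(nearest_neighbours("u1", ratings, k=1))
```

Note that Pearson ignores a constant offset between users: `u2` rates everything one point higher than `u1` and is still a perfect match.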

-------------------

### Our goal:

We are going to implement a memory-based collaborative filtering algorithm built on the user neighbourhood. In the end this algorithm will be hybrid and will include an item similarity measure, to improve prediction quality and address cold-start problems.

### Data:

We start from a base of ${m = 48705}$ users who rated ${n = 892}$ series. The data were extracted from <a href="https://www.imdb.com/">imdb</a> (see script <a href="https://github.com/ismaelbonneau/movie_recommender/blob/master/scraping/scrap.py">scraping/scraping.py</a>) and are summarised in a matrix of size ${n \times m}$ in which each entry ${(i, u)}$ holds the rating that user ${u}$ gave to item (series) ${i}$, out of 10 (the site uses a 10-star rating system).
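To make the layout concrete, here is a toy stand-in for the ratings matrix: rows are series, columns are users, and `NaN` marks a series the user did not rate. The user names and values are made up for the example; only the orientation mirrors the real data.

```python
import numpy as np
import pandas as pd

# Toy ratings matrix: rows = series, columns = users, NaN = not rated.
# Names and values are made up; only the layout matches userratings.csv.
toy = pd.DataFrame(
    {
        "alice": [9.0, np.nan, 4.0],
        "bob": [np.nan, np.nan, 7.0],
        "carol": [10.0, 8.0, np.nan],
    },
    index=[0, 1, 2],
)

n_series, n_users = toy.shape
ratings_per_user = toy.count()          # non-NaN entries in each column
ratings_per_series = toy.count(axis=1)  # non-NaN entries in each row
density = toy.count().sum() / toy.size  # share of cells that hold a rating

print(ratings_per_user)
print(f"density: {density:.2f}")
```

On real data the density is typically very low, which is why most cells printed by `df.head()` are empty.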

In [1]:
import pandas as pd
import gensim
import numpy as np
from sklearn.metrics import mean_squared_error
import warnings
import pickle
import random
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# Dataframe : line 'serie', column 'user', cell 'Rating/10 or NaN'
df = pd.read_csv('userratings.csv').drop('Unnamed: 0', axis='columns')
df.head()

Unnamed: 0,ToddTee,bkoganbing,betwana,fabiogaucho,killer1h,rasadi27,michael_cure,MashedA,drarthurwells,cureton-281-120941,...,jimjohnson-57331,tatianavoloshka,hectorgarcia-41182,allisonbryan-30611,timothyquaid,eduardoellis,Chris_Tsimpoukas,ToxicAvox,DinoLord94,bratdawg
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,10.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,9.0,,,,,,,,,...,,,,,,,,,,


In [0]:
# Dataframe to convert between the different namings of series
nameConvert = pd.read_csv('namesConvert.csv')

In [0]:
# Series titles ordered by the same order as in the ratings csv
titles = pd.read_csv('titles.csv')
# Load the pre-computed similarity matrix
simMat = np.load("sim_15_.npy")

In [0]:
# Keep.csv contains the lines/columns to keep from the similarity matrix
# keep only the ones that match the series we have in userratings.csv
keep = pd.read_csv('keep.csv')
keeps = keep['index'].values
simMat = simMat[keeps][:,keeps]
series = keep['seriesname'].values

In [0]:
# Corpus building
# both functions take a dataframe : line 'serie', column 'user', cell 'rate'

# Builds each document as the usernames ordered from the highest rating one to the lowest rating one for each serie
# ie:
# we get nbSeries documents
# each document is the sequence of all the users that rated it, ordered from the ones who liked it to the ones who hated it
#           user C     user D     user A      user P
# serie X     10         8          8          3
# the document will be [ 'C', 'D', 'A', 'P' ]
def buildSequences(new):
    result = dict()
    dfs = dict()
    for index, row in new.iterrows():
        x = row.dropna()
        y = pd.Series.argsort(x)
        p = np.array(x.axes).ravel()
        cols = np.flip(p[y]).tolist()
        result[index] = cols
        dfs[index] = x[cols].values
    return result, dfs


# Builds each document as the cluster of users that gave the same rating to the same serie
# we get at most nbSeries*10 documents ( since there are 10 possible grades )
# each document is the usernames of those who gave the exact same rating to a given serie
#           user C     user D     user A      user P
# serie X     10         8          8          3
# serie Y      2         10         2          2
# docs:
# [ 'C' ] who gave 10 to serie X
# [ 'D', 'A'] who gave 8 to serie X
# [ 'P' ] who gave 3 to serie X
# [ 'C', 'A', 'P' ] who gave 2 to serie Y
# [ 'D' ] who gave 10 to serie Y
# so we treat each serie alone and cluster its users based on their ratings

def buildClusters(new):
    result = []
    tt = new.T
    for i in range(len(new)):
        result += list(map(list, list(tt.groupby([i]).groups.values())))
    return result
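The two corpus strategies can be checked on the example given in the comments above. The sketch below is a simplified re-implementation on plain dictionaries, not the pandas code of `buildSequences` and `buildClusters`; the helper names `sequences` and `clusters` are introduced here, and the order of tied users or of the documents may differ from what the pandas versions produce.

```python
from collections import defaultdict

# Ratings of the worked example in the comments: series -> {user: rating}
ratings = {
    "X": {"C": 10, "D": 8, "A": 8, "P": 3},
    "Y": {"C": 2, "D": 10, "A": 2, "P": 2},
}


def sequences(ratings):
    """One document per series: users sorted from highest to lowest rating."""
    return {
        serie: sorted(users, key=lambda u: -users[u])  # stable sort on ties
        for serie, users in ratings.items()
    }


def clusters(ratings):
    """One document per (series, rating): users who gave that exact rating."""
    docs = []
    for users in ratings.values():
        groups = defaultdict(list)
        for user, note in users.items():
            groups[note].append(user)
        docs.extend(groups.values())
    return docs


print(sequences(ratings))
print(clusters(ratings))
```

In both cases a "document" is a list of user names, so word2vec later treats users as words and learns that users appearing in the same context rate series similarly.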

In [0]:
# User similarity calculations
# both functions input :
#    model : w2v model
#    topn : how many users to consider as similar
#    df : the dataframe containing the ratings
#    cross : how many series in common we want 
# output :
#    dictionary : key 'user', value 'list of users'


# Returns, for each user, the list of users that are 'similar' to it.
# This version walks through the other users in decreasing word2vec similarity and
# keeps the first topn that have rated at least `cross` series in common with the user
def getSims(model,topn,df,cross):
    result = dict()
    users = model.wv.vocab
    yes = 0
    bla =0
    for user in users:
        bla += 1
        print(bla,len(users))
        rows = model.wv.similar_by_word(user,topn=len(users))
        tmp = []
        count = 0
        for row,_ in rows:
            m = df[[row,user]]
            xx = m.replace({0:np.nan})
            xx = xx.dropna()
            if len(xx)>=cross:
                tmp.append(row)
                count += 1
            if count >= topn:
                break
        if len(tmp) < topn:
            yes += 1
        result[user] = tmp
    print(yes)
    return result

# Returns for each user, the list of users that are 'similar' to it
# cross and df are unused here ( kept so that both functions share the same signature )
def getSimsNormal(model,topn, df, cross):
    result = dict()
    users = model.wv.vocab
    for user in users:
        rows = model.wv.similar_by_word(user,topn=topn)
        result[user] = [x for x,_ in rows]
    return result
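`similar_by_word` ranks the other users by cosine similarity between embedding vectors. A minimal sketch of that ranking, without gensim and with made-up two-dimensional vectors, could look like this (the names `cosine` and `most_similar` are local to this example):

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(user, vectors, topn):
    """Return the topn users whose vectors are closest to `user`'s vector."""
    scored = [
        (cosine(vectors[user], vec), other)
        for other, vec in vectors.items()
        if other != user
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:topn]]


# Made-up 2-d embeddings; real ones have vec_size dimensions.
vectors = {
    "u1": [1.0, 0.0],
    "u2": [0.9, 0.1],
    "u3": [0.0, 1.0],
    "u4": [-1.0, 0.0],
}
print(most_similar("u1", vectors, topn=2))
```

The result plays the role of one entry of the `sims` dictionary: a user mapped to an ordered list of neighbours.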

In [0]:
# predicts the ratings of a user based on its similar users
# input:
#     user : which user to predict
#     sims : users similarity dictionary
#     df : ratings dataframe
#     dropnan : show only intersection between predicted and known if True
# output:
#     dataframe containing two columns 'Predicted' and 'Original',
#     showing in each line the two ratings we know or predicted for each serie
def predictUser(user, sims, df, dropnan = True):
    others = df[sims[user]]
    others = others.replace({0:np.nan})
    m = others.mean(axis=1)
    xx = pd.DataFrame(m,columns=['Predicted'])
    xx['Original'] = df[user]
    xx = xx.replace({0:np.nan})
    if dropnan:
        xx = xx.dropna()
    return xx
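The aggregation used by `predictUser` is a plain mean over the neighbours' ratings, where missing ratings are skipped. The toy example below reproduces that step on a three-series dataframe; the column names and values are made up.

```python
import numpy as np
import pandas as pd

# Rows are series, columns are users; NaN means "not rated" (made-up values).
toy = pd.DataFrame(
    {
        "target": [9.0, np.nan, 4.0],
        "n1": [8.0, 6.0, np.nan],
        "n2": [10.0, np.nan, np.nan],
    }
)
neighbours = ["n1", "n2"]

# Row-wise mean: NaN values are ignored, a row with no rating at all stays NaN.
predicted = toy[neighbours].mean(axis=1)
table = pd.DataFrame({"Predicted": predicted, "Original": toy["target"]})

# Same role as dropnan=True: keep only series with both values available.
comparable = table.dropna()
print(table)
print(comparable)
```

Series 1 gets a prediction but has no original rating (a candidate for recommendation), series 2 has an original rating but no prediction, and only series 0 can be used to measure the error.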

In [0]:
# MSE in a two columns dataframe
def msePredicted(df):
    A = df['Predicted'].T
    B = df['Original'].T
    return mean_squared_error(A,B)

# Calculates the MSE for given user similarities
# output:
#    totalMSE
#    skipped : number of users without measurable predictions
#    lengths : number of predictions made for each user
def mseAllPrediction(df, sims):
    users = list(sims.keys())
    lengths = []
    total = 0
    skipped = 0
    for user in users:
        pred = predictUser(user, sims, df)
        lengths.append(len(pred))
        if pred.empty:
            skipped +=1
        else:
            total += msePredicted(pred)
    return total/len(users), skipped, lengths
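A small worked example helps to read the reported figure. `mseAllPrediction` sums the per-user MSE and divides by the total number of users, including the skipped ones, so skipped users pull the average down. The sketch below uses made-up (predicted, original) pairs and also computes the average over scored users only, which is an alternative shown for comparison and not what the notebook reports.

```python
def mse(pairs):
    """Mean squared error over (predicted, original) pairs."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# Made-up (predicted, original) pairs for three users.
per_user = {
    "u1": [(8.0, 9.0), (6.0, 4.0)],  # squared errors 1 and 4 -> MSE 2.5
    "u2": [(7.0, 7.0)],              # perfect prediction -> MSE 0
    "u3": [],                        # nothing comparable: skipped
}

total, skipped = 0.0, 0
for pairs in per_user.values():
    if not pairs:
        skipped += 1
    else:
        total += mse(pairs)

avg_all = total / len(per_user)                 # denominator used by mseAllPrediction
avg_scored = total / (len(per_user) - skipped)  # alternative: scored users only
print(avg_all, avg_scored, skipped)
```

The gap between the two averages grows with the number of skipped users, which is worth keeping in mind when comparing runs with different `skipped` counts.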

In [0]:
# modifies the ratings dataframe in place to add a new rating
# for 'user', concerning 'serie', give the rating 'note', in the dataframe 'df'
# the serie name is given as the IMDB title of the serie, the function
# converts and finds the corresponding line in the dataframe using 'titles'
def rate(user,serie,note,df,titles):
    hoh = np.where( titles.values.ravel() == serie)[0][0]
    print("Before")
    print(df.loc[hoh,[user]])
    df.loc[hoh,[user]] = note
    print("After")
    print(df.loc[hoh,[user]])

In [0]:
# Recommends a list of series (ordered) to a specific user
# input:
#    user : which user to recommend to
#    sims : user similarity dictionary
#    df : ratings dataframe
#    titles : to show clean serie titles
#    itemSim : series similarity matrix ( prebuilt, imported and filtered beforehand )
#    k : how many series we want to recommend
#    simOnly : recommend only based on serie similarities ( most similar to most liked serie first,
#              then if k most similar to best one have been watched we move to second most liked one ..)
#    similOdds : probability of recommending by serie similarity instead of collaborative filtering
#    coldUntil : how many ratings a user needs to make to stop considering it cold start 
#                we recommend the best rated series in the beginning
#    minNbRates : how many ratings required for a serie to be considered in best rated series
# output:
#    dataframe : line 'serie', columns 'Predicted rating' & 'serie title'
#                series are ordered from best rated to lowest rated
#                when recommending using similarities the predicted rating is not considered
#                ( so the order may look wrong ): the result is ordered by serie similarity instead
def recommend(user, sims, df, titles, itemSim,k,simOnly=False,similOdds=0.0,coldUntil=2, minNbRates = 1000):
  if df[user].count() < coldUntil:
    yuy = df[df.count(axis=1)>minNbRates]
    res = titles.loc[yuy.mean(axis=1).sort_values(ascending=False).index[:k]]
    res['Predicted'] = np.nan
    return res[['Predicted','title']]
  xx = predictUser(user, sims, df, dropnan = False)
  bestRate = xx['Original'].max()
  simAscending = False
  if bestRate < 6:
    simAscending = True
  xx['title'] = titles['title']
  hih = xx.dropna(subset=['Predicted'])
  res = hih.drop(list(hih.dropna().index))[['Predicted','title']].sort_values(by=['Predicted'],ascending=False)
  takeSims = False
  if random.uniform(0, 1)<=similOdds:
    takeSims = True
  if res.empty or simOnly or takeSims:
    print("Using Similarities")
    seen = xx.dropna(subset=['Original'])
    preferred = list(seen.sort_values(by=['Original'],ascending=simAscending).index)
    for current in range(len(preferred)):  # iterate over rated series only, avoids an IndexError
      s = itemSim[preferred[current]]
      if not simAscending:
        s = -s
      proposed = (s).argsort()[1:k+1]
      b = list(proposed)
      a = set(seen.index)
      for e in a:
        if e in b:
          b.remove(e)
      res = xx.iloc[b][['Predicted','title']]
      if not res.empty:
        return res
  return res
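The branching logic of `recommend` can be summarised as a choice between three strategies. The sketch below isolates that decision; `choose_strategy` and its argument names are hypothetical, and it draws with `rng.random() < simil_odds` rather than the `random.uniform(0, 1) <= similOdds` test used above, so that odds of 0 and 1 are fully deterministic.

```python
import random


def choose_strategy(n_ratings, n_cf_candidates, cold_until=2,
                    simil_odds=0.0, sim_only=False, rng=random):
    """Pick which recommender to use for one user (hypothetical helper).

    n_ratings       : number of series the user has rated
    n_cf_candidates : number of unseen series with a collaborative prediction
    """
    if n_ratings < cold_until:
        return "popular"  # cold start: fall back to the best rated series
    if n_cf_candidates == 0 or sim_only or rng.random() < simil_odds:
        return "item-similarity"
    return "collaborative"


rng = random.Random(0)  # seeded generator so the draw is reproducible
for odds in (0.0, 0.3, 1.0):
    picks = [choose_strategy(10, 5, simil_odds=odds, rng=rng) for _ in range(1000)]
    print(odds, picks.count("item-similarity") / len(picks))
```

Mixing in item similarity with some probability is what makes the recommender hybrid: it adds variety and still works when the neighbourhood yields no prediction.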

In [0]:
# Shows the ratings that this user originally gave
# ( relies on the global 'titles' dataframe loaded above )
def showUserOriginal(user,df):
    rated = df[user].dropna()
    shown = titles.loc[rated.index].copy()
    shown['rate'] = rated
    return shown

In [0]:
# Main pipeline: builds the corpus, trains word2vec and computes everything needed to recommend later
# input:
#     df : ratings dataframe
#     vec_size : w2v vector size
#     window : how many 'users' to consider as neighbours when calculating
#     sim : how many users to consider as similar ( corresponds to topn in other functions )
#     iters : w2v iterations
#     min_count : minimum number of appearances of a user in the corpus
#                 ( which means that min_count should be <= coldUntil in recommend )
#     clustering : build rating clusters or user sequences to build the corpus
# output:
#     sims : user similarity dictionary
#     t : MSE for all the predictions based on this model
#     s : number of users for whom no prediction error could be measured
#         ( set of predicted ratings INTER set of originally given ratings = EMPTY )
#     lengths : the number of series predicted for each user
#     poi : for each user, number of ratings and number of predictions made
#     bed : ratio nb prediction / nb ratings for each user
def madafaka(df, vec_size = 500, window = 4, sim = 200, iters = 200, min_count = 1, clustering = False):
    if clustering:
        corpus = buildClusters(df)
    else:
        seqs, _ = buildSequences(df)
        corpus = list(seqs.values())
    model = gensim.models.Word2Vec(corpus, size=vec_size, window=window, iter=iters, workers=4, min_count=min_count)
    sims = getSimsNormal(model,sim,df,0)
    t,s,lengths = mseAllPrediction(df, sims)
    print("MSE , Zero predictions")
    print(t,s)
    lengths = pd.Series(lengths)
    print(lengths.describe())
    counts = df.count()
  
    poi = pd.DataFrame(counts).T
    poi = poi[list(sims.keys())]
    lengths.index = list(sims.keys())
    poi = poi.append(lengths,ignore_index=True)
    bed = poi.iloc[1]/poi.iloc[0]
    print("Prediction percentage")
    print(bed.describe())
  
    return sims, t, s, lengths, poi, bed

In [0]:
vec_size = 1000
window = 1000
sim = 10
iters = 200
min_count = 5
clustering = True
sims, t, s, lengths, poi, bed = madafaka(df, vec_size = vec_size, window = window, sim = sim, iters = iters, min_count = min_count, clustering = clustering )

MSE , Zero predictions
4.161723359802841 5
count    911.000000
mean       4.655324
std        3.772918
min        0.000000
25%        3.000000
50%        4.000000
75%        6.000000
max       45.000000
dtype: float64
Prediction percentage
count    911.000000
mean       0.486942
std        0.201340
min        0.000000
25%        0.333333
50%        0.461538
75%        0.600000
max        1.000000
dtype: float64


In [0]:
user = "issamou"
rate(user,'Lost',9,df,titles)

Before
issamou   NaN
Name: 342, dtype: float64
After
issamou    9.0
Name: 342, dtype: float64


In [0]:
user = "issamou"
showUserOriginal(user,df)

Unnamed: 0,title,rate
0,Batman,9.0
113,The Simpsons,8.0
156,Friends,10.0
247,Son of the Beach,3.0
254,X-Men: Evolution,9.0
268,Smallville,8.0
299,Teen Titans,7.0
400,Hannah Montana,4.0
413,Wolverine and the X-Men,8.0
517,Batman: The Brave and the Bold,8.0


In [0]:
# coldUntil >= min_count ( the one passed to madafaka )
r = recommend(user, sims, df, titles, simMat, k=7, simOnly=False, similOdds=0.3, coldUntil=1, minNbRates = 100)
r

Using Similarities


Unnamed: 0,Predicted,title
456,,Samantha Who?
382,,What About Brian
210,,Will & Grace
631,,Melissa & Joey
566,,Modern Family
383,,The New Adventures of Old Christine
391,,The Class


In [0]:
titles.sample(10)

Unnamed: 0,title
168,JAG
530,How the Earth Was Made
254,X-Men: Evolution
111,Family Matters
314,Arrested Development
229,The Vice
731,Men at Work
334,CSI: NY
844,Drunk History
446,Painkiller Jane


In [0]:
list(sims.items())[3]

('Igenlode Wordsmith',
 ['theseanofsydney',
  'pinkmuggle123',
  'emmers591',
  'djpl-719-841133',
  'nickbaizel',
  'manixanoka',
  'rumblinglove',
  'mcgregoronline',
  'highbob',
  'Kmclellan'])

In [0]:
# user who like 'serie' the most but who rated at least 'mini' series
serie = 'Hannibal'
hoh = np.where( titles.values.ravel() == serie)[0][0]
print(serie,hoh)
mini = 10
nieme = 0
while(True):
  huh = list(df.iloc[hoh][pd.Series.argsort(df.iloc[hoh]) == nieme].keys())[0]
  if df[huh].count()<mini:
    nieme += 1
  else:
    break
print('user : ',huh, df[huh].count())

Hannibal 778
user :  deloudelouvain 55


In [0]:
# 'user' and how many series he rated, and showing the original and predicted rating for 'serie'
user = 'poe426'
hoh = 778
print('user : ',user)
xx = predictUser(user, sims, df, dropnan = False)
xx['title'] = titles['title']
print('noted : ',len(xx.dropna(subset=['Original'])))
hih = xx.dropna(subset=['Predicted'])
xx.iloc[hoh]

user :  poe426
noted :  15


Predicted         NaN
Original          NaN
title        Hannibal
Name: 778, dtype: object

In [0]:
vec_size = 500
window = 4
sim = 200
iters = 200
min_count = 1
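# NOTE: 'corpus' must already exist at notebook scope here ( built with buildSequences or
# buildClusters ); the one created inside madafaka is local to that function
model = gensim.models.Word2Vec(corpus, size=vec_size, window=window, iter=iters, workers=4, min_count=min_count)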
sims = getSimsNormal(model,sim,df,0)
t,s,lengths = mseAllPrediction(df, sims)
print(len(sims))
print(t,s)
lengths = pd.Series(lengths)
lengths.describe()

48704
1.4360507188877563 174


count    48704.000000
mean         1.294267
std          1.679641
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         82.000000
dtype: float64

In [0]:
# Some parameters and their results
vec_size = 500
window = 4
sim = 100
iters = 200
without min_count
910
4.720264335690516 0
count    910.000000
mean      10.196703
std       11.853608
min        3.000000
25%        5.000000
50%        7.000000
75%       10.000000
max      207.000000

vec_size = 500
window = 4
sim = 100
iters = 200
min_count = 4
1423
4.3631012618304705 0
count    1423.000000
mean        7.881940
std         9.512453
min         2.000000
25%         4.000000
50%         5.000000
75%         8.000000
max       178.000000

vec_size = 500
window = 4
sim = 100
iters = 200
min_count = 3
2580
3.6229894607755724 0
count    2580.000000
mean        5.590698
std         6.879334
min         1.000000
25%         3.000000
50%         4.000000
75%         5.000000
max       135.000000

vec_size = 500
window = 4
sim = 50
iters = 200
min_count = 3
2580
3.3837499149680212 0
count    2580.000000
mean        5.190698
std         5.400378
min         1.000000
25%         3.000000
50%         3.000000
75%         5.000000
max        74.000000

vec_size = 500
window = 4
sim = 10
iters = 200
min_count = 3
2580
2.8071500462288896 6
count    2580.000000
mean        3.598450
std         2.522792
min         0.000000
25%         2.000000
50%         3.000000
75%         4.000000
max        30.000000


vec_size = 500
window = 4
sim = 100
iters = 200
min_count = 2
6597
2.791165190867845 0
count    6597.000000
mean        3.331060
std         4.247586
min         1.000000
25%         2.000000
50%         2.000000
75%         3.000000
max       113.000000

vec_size = 500
window = 4
sim = 100
iters = 200
min_count = 1
48704
1.055816536226366 329
count    48704.000000
mean         1.258562
std          1.432282
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         55.000000

vec_size = 500
window = 4
sim = 200
iters = 200
min_count = 1
48704
1.4360507188877563 174
count    48704.000000
mean         1.294267
std          1.679641
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         82.000000