<h1>Recommender system using active learning and recursive algorithm</h1>

In this notebook we implement an active learning approach to collaborative filtering that suggests movies to a new user. The system has two layers. The first uses active learning to select the first movies the user is asked to rate when registering. The second applies the recursive prediction algorithm to nearest-neighbor based collaborative filtering to predict the ratings of the active (new) user and suggest the movies with the highest predicted ratings. The data set comes from [MovieLens], a movie recommendation service.

In [None]:
import numpy as np
import random
import pandas as pd

url_links = 'https://raw.githubusercontent.com/haith-gi/Recommender-system-atelier-IA-/main/dataset/links.csv'
url_movies = 'https://raw.githubusercontent.com/haith-gi/Recommender-system-atelier-IA-/main/dataset/movies.csv'
url_ratings = 'https://raw.githubusercontent.com/haith-gi/Recommender-system-atelier-IA-/main/dataset/ratings.csv'
url_tags = 'https://raw.githubusercontent.com/haith-gi/Recommender-system-atelier-IA-/main/dataset/tags.csv'

links = pd.read_csv(url_links)
movies = pd.read_csv(url_movies)
ratings = pd.read_csv(url_ratings)
tags = pd.read_csv(url_tags)

In [None]:
links.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


In [None]:
movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [None]:
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [None]:
tags.head(3)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992


<h2>1. Preprocessing data</h2>
In this part, we prepare the data structures used by the collaborative filtering algorithm. We start with the user-item rating matrix: one row per user and one column per movie, with a missing value wherever a user has not rated a movie. The ratings of the active (new) user are later predicted from this matrix, taking into account the similarities with other users and their ratings.
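
As a rough illustration of the target structure, the sketch below builds the same kind of user-item matrix from a tiny made-up ratings table with `DataFrame.pivot`. The toy values and the names `toy_ratings` and `toy_matrix` are assumptions for illustration only; the notebook itself fills the matrix with explicit loops.

```python
import pandas as pd

# Toy ratings in the same long format as ratings.csv (values are made up)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 3.5, 2.0, 5.0],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

Most cells of such a matrix are empty, which is why the later steps always check for missing values before using a rating.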

In [None]:
rating_df = ratings[['userId']]
rating_df = rating_df.drop_duplicates()
rating_df.reset_index(drop=True, inplace=True)

v = [pd.NA for i in range(len(rating_df.userId))]

for i in (movies.movieId):
  rating_df[str(i)] = v

rating_df

  


Unnamed: 0,userId,1,2,3,4,5,6,7,8,9,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,1,,,,,,,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,,,,,,,,...,,,,,,,,,,
4,5,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606,,,,,,,,,,...,,,,,,,,,,
606,607,,,,,,,,,,...,,,,,,,,,,
607,608,,,,,,,,,,...,,,,,,,,,,
608,609,,,,,,,,,,...,,,,,,,,,,


In [None]:
#fill in the rating values
#userIds are consecutive and start at 1, so user u is stored in row u-1
for row in ratings.itertuples(index=False):
  rating_df.at[row.userId - 1, str(row.movieId)] = float(row.rating)

rating_df

Unnamed: 0,userId,1,2,3,4,5,6,7,8,9,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,1,4.0,,4.0,,,4.0,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,,,,,,,,...,,,,,,,,,,
4,5,4.0,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606,2.5,,,,,,2.5,,,...,,,,,,,,,,
606,607,4.0,,,,,,,,,...,,,,,,,,,,
607,608,2.5,2.0,2.0,,,,,,,...,,,,,,,,,,
608,609,3.0,,,,,,,,,...,,,,,,,,,,


<h2>2. Active Learning implementation</h2>

<h3>2.1. Variance Strategy:</h3>

This strategy selects the items with the
highest variance, hence, it favours the items that have
been rated diversely by the users on the assumption that
the variance gives an indication of the uncertainty of the
system about that item’s ratings.
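
A minimal sketch of this criterion on made-up data: for each item, take the population variance of the ratings it has received, then rank the items from most to least contested. The toy matrix and the names `toy`, `item_variance` and `ranked` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are items, NaN means "not rated"
toy = pd.DataFrame({
    "A": [5.0, 0.5, np.nan, 5.0],   # ratings disagree strongly
    "B": [4.0, 4.0, 4.0, np.nan],   # ratings agree
    "C": [np.nan, 3.0, 2.0, np.nan],
})

# Population variance per item (ddof=0), computed over observed ratings only
item_variance = toy.var(ddof=0)

# Items sorted by decreasing variance: the first ones are proposed first
ranked = item_variance.sort_values(ascending=False).index.tolist()
print(ranked)
```

Note that an item rated by only a handful of users can still reach a high variance, so the ranking says nothing about how popular an item is.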

In [None]:
#variance strategy

def item_var(i):
  U = rating_df.loc[~rating_df[str(i)].isnull()].reset_index(drop=True) # the set of users who rated item i
  card_U = len(U.index) # cardinality of the set of users who rated item i
  if card_U !=0:
    somme = 0
    avr = U[str(i)].mean()
    for index, row in U.iterrows():
      s = (row[str(i)] - avr)**2
      somme = somme+s
    v = (1/(card_U))*somme
    return(v)
  else:
    return 0


items_list = [col for col in rating_df.columns if col != "userId"] #every column except userId is an item

items_var_dict_not_sorted = {}
for i in items_list:
  items_var_dict_not_sorted[i]= item_var(i) #map each item to its variance

items_var_dict = dict(sorted(items_var_dict_not_sorted.items(), key=lambda item: item[1], reverse = True)) #items sorted by decreasing variance
#items_var_dict

In [None]:
print("Using the Variance approach of active learning, the user will be asked to rate the following movies :\n")

def movies_to_rate_var(n):
  #the n items with the highest variance
  return list(items_var_dict)[:int(n)]

movies_to_rate_var(5)

Using the Variance approach of active learning, the user will be asked to rate the following movies :



['2068', '32892', '70946', '484', '3223']

In [None]:
rating_df

Unnamed: 0,userId,1,2,3,4,5,6,7,8,9,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
0,1,4.0,,4.0,,,4.0,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,4,,,,,,,,,,...,,,,,,,,,,
4,5,4.0,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,606,2.5,,,,,,2.5,,,...,,,,,,,,,,
606,607,4.0,,,,,,,,,...,,,,,,,,,,
607,608,2.5,2.0,2.0,,,,,,,...,,,,,,,,,,
608,609,3.0,,,,,,,,,...,,,,,,,,,,


In [None]:
#The new user rates the movies proposed by the variance active learning strategy
def user_ratings(number_of_items_to_rate_by_variance):

  global rating_df
  u = rating_df["userId"].iloc[-1] #the "userId" of the new user (taken from the last row of rating_df)
  print("To propose you the best movies to watch, please rate the following items...")
  l=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]
  
  for i in movies_to_rate_var(number_of_items_to_rate_by_variance):
    #Interactive version (disabled so that the notebook runs unattended):
    #v = True
    #while v:
    #  print("Rate from 0 to 5 the item ", i, " :  ")
    #  x = float(input())
    #  if (x<=5 and x>=0):
    #    v = False
    
    rating_df.at[u-1, i] = random.choice(l) #simulated rating drawn at random

In [None]:
#This function adds the new user and asks for ratings (the ratings are stored in the new user's row of the dataframe)
def user_subscribe():

  global rating_df
  rating_df_copy = rating_df.copy()
  new_row = rating_df.iloc[-1:]
  new_user={}

  for column in rating_df:
      new_user[column]=pd.NA

  new_user["userId"]= new_row["userId"].iloc[-1] + 1
  new_user_df = pd.DataFrame(new_user, index=[0])
  rating_df = pd.concat([rating_df_copy, new_user_df], ignore_index = True, axis = 0)

  #call the rating function (the number of movies to rate is fixed to 20 here instead of being read from the user)
  print("Please insert the number of movies you want to rate: ")
  number_of_items_to_rate_by_variance = 20
  print("The number of rated movies is ", number_of_items_to_rate_by_variance)
  user_ratings(number_of_items_to_rate_by_variance )
  
  return(rating_df.iloc[-1:])

In [None]:
#example
user_subscribe()

Please insert the number of movies you want to rate: 
The number of rated movies is  20
To propose you the best movies to watch, please rate the following items...


Unnamed: 0,userId,1,2,3,4,5,6,7,8,9,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
610,611,,,,,,,,,,...,,,,,,,,,,


In [None]:
# example of rating (item: 32892)
rating_df["32892"].iloc[-1]

1.5

<h2>3. Nearest-Neighbor Based Collaborative Filtering</h2>

To predict the rating of a given item for an active user, a subset of users is chosen based on their similarity to the active user (the nearest-neighbor users), and their ratings of the given item are aggregated to generate the predicted value.
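
The aggregation step can be sketched as a similarity-weighted average of the neighbors' deviations from their own mean rating, added to the mean rating of the active user. This mirrors the formula used in the `BaselinePrediction` function of this notebook; the function name `predict_from_neighbors` and the toy numbers are assumptions for illustration.

```python
def predict_from_neighbors(active_mean, neighbors):
    """Mean-centered weighted average.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples.
    """
    num = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    den = sum(abs(sim) for sim, _, _ in neighbors)
    if den == 0:
        # No informative neighbor: fall back to the active user's mean
        return active_mean
    return active_mean + num / den

# Two hypothetical neighbors who rated the target item
neighbors = [(0.8, 4.5, 3.5), (0.2, 2.0, 3.0)]
prediction = predict_from_neighbors(3.0, neighbors)
print(round(prediction, 2))
```

Centering each neighbor's rating on that neighbor's own mean compensates for users who rate systematically high or low.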

<h3>3.1. User Similarity</h3>

We choose the Pearson correlation as the metric for user similarity.
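
A minimal sketch of this similarity for two users stored as dictionaries mapping item ids to ratings. Following the implementation in this notebook, deviations are taken from each user's mean over all of their ratings and summed over co-rated items only. The helper name `pearson_similarity` and the toy profiles are assumptions for illustration.

```python
import math

def pearson_similarity(ratings_x, ratings_y):
    """Pearson-style similarity between two users given as {item: rating} dicts."""
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return 0.0
    mean_x = sum(ratings_x.values()) / len(ratings_x)
    mean_y = sum(ratings_y.values()) / len(ratings_y)
    num = sum((ratings_x[j] - mean_x) * (ratings_y[j] - mean_y) for j in common)
    den_x = math.sqrt(sum((ratings_x[j] - mean_x) ** 2 for j in common))
    den_y = math.sqrt(sum((ratings_y[j] - mean_y) ** 2 for j in common))
    if den_x == 0 or den_y == 0:
        # A user with no deviation on the co-rated items carries no signal
        return 0.0
    return num / (den_x * den_y)

alice = {"m1": 5.0, "m2": 3.0, "m3": 1.0}
bob = {"m1": 4.0, "m2": 3.0, "m3": 2.0}    # same taste as alice, milder ratings
carol = {"m1": 1.0, "m2": 3.0, "m3": 5.0}  # opposite taste
print(pearson_similarity(alice, bob), pearson_similarity(alice, carol))
```

Values close to 1 indicate users whose ratings move together, values close to -1 indicate opposite tastes, and 0 is returned when nothing can be concluded.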

In [None]:
#This function returns the list of items rated by both users x and y

def items_rated_by_boths_users(x,y):

  l=[]

  for j in movies.movieId:
    r1 = rating_df[str(j)].iloc[int(x-1)] #the rating of item j by user x (user u is stored in row u-1)
    r2 = rating_df[str(j)].iloc[int(y-1)]
    if pd.isna(r1) or pd.isna(r2):  #pd.isna is required: using pd.NA directly in a condition raises "TypeError: boolean value of NA is ambiguous"
      continue
    else:
      l.append(j) #movie rated by both users

  return l

In [None]:
items_rated_by_boths_users(611,567)

[26171, 32892]

In [None]:
#This function returns the average rating given by the user x

def average_rating_by_user(x):

  l=[]
  
  for j in movies.movieId:
    r = rating_df[str(j)].iloc[int(x-1)] #the rating of item j by user x
    if not pd.isna(r): #keep every observed rating, including integer values such as 4
      l.append(float(r))
  
  return(np.nanmean(l))

In [None]:
#This function returns the similarity between two users, using the Pearson correlation as the metric

def similarity_users(x, y):
  
  l = items_rated_by_boths_users(x,y)
  R_x = average_rating_by_user(x)
  R_y = average_rating_by_user(y)
  
  numerateur = 0
  denominateur_1 = 0
  denominateur_2 = 0

  for j in l: 
    R_xj = rating_df[str(j)].iloc[int(x-1)]
    R_yj = rating_df[str(j)].iloc[int(y-1)]
    numerateur = numerateur + ((R_xj-R_x)*(R_yj-R_y))
    denominateur_1 = denominateur_1 + ((R_xj - R_x))**2
    denominateur_2 = denominateur_2 + ((R_yj - R_y))**2

  if denominateur_1 == 0 or denominateur_2 == 0: #no co-rated item, or no deviation from the mean: the correlation is undefined
    return 0
  result = numerateur/((np.sqrt(denominateur_1))*(np.sqrt(denominateur_2)))
  return round(float(result), 2)

In [None]:
similarity_users(611,422)

0

<h3>3.2. Selecting Neighbors</h3>

We select neighbors with the K nearest-neighbor approach, using Pearson correlation as the similarity, and more precisely with the baseline neighbor selection strategy ("BS").

Baseline strategy (BS): selects the top K nearest neighbors among the users who have rated the given item.
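
The selection rule can be sketched as a filter followed by a sort. The function name `top_k_neighbors`, the user labels and the similarity values below are assumptions for illustration.

```python
def top_k_neighbors(similarities, rated_item, k):
    """Return up to k users who rated the item, most similar first.

    similarities: {user: similarity to the active user}
    rated_item:   set of users who rated the target item
    """
    candidates = [u for u in similarities if u in rated_item]
    candidates.sort(key=lambda u: similarities[u], reverse=True)
    return candidates[:k]

sims = {"u1": 0.9, "u2": 0.1, "u3": 0.7, "u4": -0.4}
raters = {"u2", "u3", "u4"}  # u1 is the most similar user but did not rate the item
print(top_k_neighbors(sims, raters, k=2))
```

The most similar user overall is skipped here because that user has no rating for the target item, which is the defining restriction of the baseline strategy.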

In [None]:
#This function returns the list of users who have rated the item i

def users_rated_item(i):
  rated = rating_df[str(i)].notna() #True for the users who rated item i
  return rating_df.loc[rated, "userId"].tolist()

In [None]:
#This function returns the users (other than the active user x) who have rated the item i, sorted by decreasing similarity with x; the caller keeps the first K

def Baseline_strategy(x, i):
  
  n_list = users_rated_item(i)
  similarity_list_not_sorted = {}
  for u in n_list:
    if u != x:
      similarity_list_not_sorted[u] = similarity_users(x, u)

  similarity_list = {k: v for k, v in sorted(similarity_list_not_sorted.items(), key=lambda item: item[1], reverse = True)}
  return list(similarity_list.keys())

In [None]:
Baseline_strategy(611, 32892)

[567, 105]

In [None]:
#This function returns the prediction on item i for the active user x based on Baseline_strategy (it is used by RecursivePrediction)

def BaselinePrediction(x, i):

  global rating_df
  k=10 
  l = Baseline_strategy(x, i)[:k] #the (at most) k most similar users who rated item i

  R_x = average_rating_by_user(x)
  numerateur = 0
  denominateur = 0

  for y in l: 
    R_y = average_rating_by_user(y)
    R_yi = rating_df[str(i)].iloc[int(y-1)]
    s = similarity_users(x, y) #computed once per neighbor

    numerateur = numerateur + ((R_yi-R_y) * s)
    denominateur = denominateur + abs(s)

  if (denominateur == 0): #no neighbor, or only neighbors with zero similarity
    result = R_x
  else:
    result = R_x + numerateur/denominateur
  result = round(result * 2) / 2 #round to the nearest half star
  rating_df.at[int(x-1), str(i)] = result #store the prediction in the row of user x
  return result

<h2>4. The Recursive Prediction Algorithm</h2>

In [None]:
#Configuration values (set inside RecursivePrediction):
#  k = 10          neighbor size
#  threshold = 2   recursion level at which the algorithm falls back to BaselinePrediction
#                  (author's note: a good choice of the overlap size threshold would be around 10)
#  lamb_da = 0.5   weight threshold, applied to predicted (rather than observed) ratings
#Function parameters:
#  x      active user
#  i      item to be predicted
#  level  current recursion level (start with 0)

x = rating_df["userId"].iloc[-1]    #active user

#This function returns the predicted rating value of the user x for the item i

def RecursivePrediction(x, i, level): #starting with level=0
  #Configuration values
  threshold = 2
  k = 10
  lamb_da = 0.5

  if level >= threshold:
    return BaselinePrediction(x, i)

  U = Baseline_strategy(x, i)[:k] #the (at most) k nearest neighbors
  R_x = average_rating_by_user(x) #mean rating of the user being predicted at this level

  alpha = 0
  beta = 0
  for y in U:
    R_y = average_rating_by_user(y)
    R_yi = rating_df[str(i)].iloc[int(y-1)]
    s = similarity_users(x, y)
    if not pd.isna(R_yi):
      alpha = alpha + (R_yi - R_y)*s
      beta = beta + abs(s)
    else:
      #neighbor y has not rated i: predict its rating recursively and down-weight it by lamb_da
      R_yi_hat = RecursivePrediction(y, i, level + 1)
      alpha = alpha + lamb_da * (R_yi_hat - R_y)*s
      beta = beta + lamb_da * abs(s)
  if (beta == 0):
    return BaselinePrediction(x, i)
  return R_x + alpha/beta



#This function sets the predicted ratings of the active user using the recursive prediction
#It predicts the ratings of the first 100 items ordered by variance
def setting_active_user_prediction():
  global rating_df
  #for i in movies.movieId:
  for i in movies_to_rate_var(100): 
    r_u = rating_df[i].iloc[int(x-1)]
    if pd.isna(r_u):
      pred = RecursivePrediction(x, i, 0)
      rating_df.at[int(x-1), i] = round(pred * 2) / 2 #round to the nearest half star
  return(rating_df.iloc[-1:])

In [None]:
setting_active_user_prediction()


Unnamed: 0,userId,1,2,3,4,5,6,7,8,9,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
610,611,,,,,,,,,,...,,,,,,,,,,


In [None]:
#display the dataframe containing the predictions
print("Display of 100 predictions")
df_pred = rating_df.iloc[-1:]
df_pred = df_pred.dropna(axis=1, how='all')
df_pred

Display of 100 predictions


Unnamed: 0,userId,102,213,484,487,619,879,984,1107,1415,...,120635,121097,156726,158783,159858,160565,172547,173145,184253,185029
610,611,1.0,2.0,3.5,2.5,1.5,1.5,1.0,0.0,1.0,...,2.0,0.0,2.5,3.0,1.5,1.0,3.5,1.5,2.0,2.0


*Haithem BEN DRISSI*

*Master Informatique M2 – MODO*

*Département MIDO*

*Université Paris Dauphine-PSL*
