This notebook was created by Josselin Deloste.

<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Collaborative Filtering in Recommender Systems</div>

This notebook explains how collaborative filtering is used in recommender systems and experiments with it through simple examples.

1. [Recommender systems](#sec1)
2. [Item-based collaborative filtering](#sec2)
3. [User-based collaborative filtering](#sec3)


# 1. <a id="sec1"></a>Recommender Systems

Recommender systems are everywhere on the websites you visit every day. You have almost certainly been recommended a movie to watch (on Netflix or Amazon Prime), a song to listen to (on Deezer or Spotify), an item to buy (on eBay or Amazon), or a person to follow (on Instagram, Facebook or Twitter). All of these suggestions are produced by recommender systems.

Recommender systems therefore have obvious applications in streaming, e-commerce and social networks.



**Idea:**

<img src="fig1.jpg" width="600px">  
Word of mouth has always been one of the most powerful marketing tools! Have you liked something recently? It is highly likely that your best friend will appreciate it too, or that you would like a similar product!


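**Some notations:**
- $b_{u,i}$ : baseline prediction for user $u$ and item $i$
- $r_{u,i}$ : rating given by user $u$ to item $i$
- $\mu$ : overall average rating
- $\overline{\mu_{u}}$ : average rating given by user $u$
- $\overline{\mu_{i}}$ : average rating received by item $i$
- $I_{u}$ : set of items rated by user $u$
- $U_{i}$ : set of users who rated item $i$
- $\beta_{u}, \beta_{i}$ : user and item damping terms, which shrink the baselines of users or items with few ratings towards zero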

Baseline predictors:
- $b_{u} := \frac{1}{\mid I_{u} \mid + \beta_{u}} \sum_{i \in I_{u}}(r_{u,i} - \mu)$ : user baseline predictor
- $b_{i} := \frac{1}{\mid U_{i} \mid + \beta_{i}} \sum_{u \in U_{i}}(r_{u,i} - b_{u} - \mu)$ : item baseline predictor
- $b_{u,i} := \mu + b_{u} + b_{i}$ : overall baseline predictor
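The three baseline formulas can be sketched on a tiny rating table. This is only an illustration: the `toy` data is made up, the column names mirror those of `ratings.csv`, the helper names `baseline_predictors` and `baseline` are hypothetical, and the default damping values of 0 are an arbitrary assumption.

```python
import pandas as pd

# Toy ratings (hypothetical data, same column names as ratings.csv)
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [5.0, 3.0, 4.0, 2.0, 1.0],
})

def baseline_predictors(df, beta_u=0.0, beta_i=0.0):
    """Return (mu, b_u, b_i) following the damped-mean formulas."""
    mu = df["rating"].mean()
    # b_u: damped mean of (r_ui - mu) over the items rated by u
    dev_u = (df["rating"] - mu).groupby(df["userId"])
    b_u = dev_u.sum() / (dev_u.count() + beta_u)
    # b_i: damped mean of (r_ui - b_u - mu) over the users who rated i
    resid = df["rating"] - df["userId"].map(b_u) - mu
    dev_i = resid.groupby(df["movieId"])
    b_i = dev_i.sum() / (dev_i.count() + beta_i)
    return mu, b_u, b_i

def baseline(mu, b_u, b_i, user, item):
    """b_ui = mu + b_u + b_i; unseen users or items contribute 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

mu, b_u, b_i = baseline_predictors(toy)
pred = baseline(mu, b_u, b_i, user=3, item=10)
print(mu, pred)
```

Raising `beta_u` or `beta_i` pulls the corresponding baselines towards zero, which mostly matters for users or items with very few ratings.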

Remark: we can first predict a user's preference for each candidate item, then produce the recommendations by ranking the candidates by predicted preference.
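The two steps of this remark (predict, then rank) can be sketched as follows. The scores in `predicted` are invented for the example and the function name `recommend_top_n` is hypothetical; any predictor could supply the scores.

```python
import pandas as pd

def recommend_top_n(predicted, already_rated, n=3):
    """Rank unseen items by predicted preference and keep the n best."""
    # Items the user has already rated are not candidates
    candidates = predicted.drop(labels=list(already_rated), errors="ignore")
    return candidates.sort_values(ascending=False).head(n).index.tolist()

# Hypothetical predicted preferences of one user for five items
predicted = pd.Series({"a": 4.2, "b": 3.1, "c": 4.8, "d": 2.0, "e": 3.9})
top = recommend_top_n(predicted, already_rated={"c"}, n=3)
print(top)
```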

# 2. <a id="sec2"></a> Item-based collaborative filtering 

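**Idea:**
- A user who liked a specific item is likely to like similar items, where two items are similar if the same users tend to like both. Unlike content-based filtering, this similarity is computed from user behaviour rather than from item attributes.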
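A common way to measure how similar two items are is the cosine of the angle between their user columns. The sketch below uses two made-up 0/1 columns (`item_a` and `item_b` are hypothetical) and checks the definition against `scipy.spatial.distance.cosine`, which returns a distance rather than a similarity.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Two hypothetical item columns: 1 if the user (row) liked the item, else 0
item_a = np.array([1, 0, 1, 1, 0])
item_b = np.array([1, 0, 0, 1, 1])

# Cosine similarity from its definition: dot product over product of norms
manual = item_a @ item_b / (np.linalg.norm(item_a) * np.linalg.norm(item_b))

# scipy returns a distance, so the similarity is 1 - distance
from_scipy = 1 - cosine(item_a, item_b)

print(manual, from_scipy)
```

Here two of the three users who liked each item liked both, so the similarity is 2/3.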

In [2]:
import pandas as pd
from scipy.spatial.distance import cosine

In [7]:
data = pd.read_csv('./Divers/data.csv')

print(data)

       user  a perfect circle  abba  ac/dc  adam green  aerosmith  afi  air  \
0         1                 0     0      0           0          0    0    0   
1        33                 0     0      0           1          0    0    0   
2        42                 0     0      0           0          0    0    0   
3        51                 0     0      0           0          0    0    0   
4        62                 0     0      0           0          0    0    0   
5        75                 0     0      0           0          0    0    0   
6       130                 0     0      0           0          0    0    0   
7       141                 0     0      0           0          0    0    0   
8       144                 0     0      0           0          0    0    0   
9       150                 0     0      0           0          0    0    0   
10      205                 0     0      0           0          0    0    0   
11      247                 0     0      0          

In [6]:
ratings = pd.read_csv('./Movies_data/ratings.csv')

print(ratings.shape)
print(ratings.head())
max(ratings['userId'])

(100234, 4)
   userId  movieId  rating  timestamp
0       1        1     5.0  847117005
1       1        2     3.0  847642142
2       1       10     3.0  847641896
3       1       32     4.0  847642008
4       1       34     4.0  847641956


718

In [9]:
movies_data = ratings.drop(columns=["userId", "timestamp"])

# Empty user x movie frame. `index` and `columns` must be collections, not scalars.
# `movies` is not defined in this notebook, so the movie ids are taken from `ratings` (assumption).
movies_temp = pd.DataFrame(index=range(1, ratings['userId'].max() + 1),
                           columns=sorted(ratings['movieId'].unique()))

print(movies_data.shape)
print(movies_data.head())

In [12]:
item_data = data.drop(columns='user')

In [13]:
# Empty item x item frame to hold the similarities
data_bis = pd.DataFrame(index=item_data.columns,columns=item_data.columns)
for i in range(0,len(data_bis.columns)) :
    # Compare item i with every item j
    for j in range(0,len(data_bis.columns)) :
        # scipy's cosine() is a distance, so 1 - distance is the similarity
        data_bis.iloc[i,j] = 1-cosine(item_data.iloc[:,i],item_data.iloc[:,j])
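
The double loop above calls `cosine` once per pair of items. A possible alternative, sketched here on a made-up frame, normalizes every column once and obtains all similarities with a single matrix product. The function name `item_cosine_similarity` and the `toy_items` data are hypothetical, and giving all-zero items a similarity of 0 is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def item_cosine_similarity(item_df):
    """Item-item cosine similarity for a users x items DataFrame."""
    values = item_df.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    # Assumption: an all-zero item gets similarity 0 instead of NaN
    norms[norms == 0] = 1.0
    normalized = values / norms
    # Dot products of unit-length columns are the cosine similarities
    sim = normalized.T @ normalized
    return pd.DataFrame(sim, index=item_df.columns, columns=item_df.columns)

# Hypothetical 0/1 listening data: 5 users (rows), 3 items (columns)
toy_items = pd.DataFrame({
    "x": [1, 0, 1, 1, 0],
    "y": [1, 0, 0, 1, 1],
    "z": [0, 1, 0, 0, 0],
})
toy_sim = item_cosine_similarity(toy_items)
print(toy_sim)
```

Unlike `data_bis`, the result has a float dtype, so it can be sorted and compared without going through object columns.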
        

In [14]:
print(data_bis.shape)
print(data_bis.head().iloc[0:5,0:5])
#data_neighbours.head(6)

(285, 285)
                 a perfect circle       abba      ac/dc adam green  aerosmith
a perfect circle                1          0  0.0179172  0.0515539  0.0627765
abba                            0          1  0.0522788  0.0250706  0.0610563
ac/dc                   0.0179172  0.0522788          1   0.113154   0.177153
adam green              0.0515539  0.0250706   0.113154          1  0.0566365
aerosmith               0.0627765  0.0610563   0.177153  0.0566365          1


In [15]:
the_subways = data_bis['the subways']
print(the_subways.sort_values(ascending = False)[1:10])

the kooks         0.308343
the fratellis     0.281546
the wombats       0.262071
arctic monkeys    0.262071
mando diao        0.243599
bloc party        0.240192
kings of leon     0.220493
billy talent      0.217262
deichkind         0.205963
Name: the subways, dtype: object


# 3. <a id="sec3"></a> User-based collaborative filtering

**Idea:**
- The active user's preference for an item can be predicted by aggregating the opinions of a selected set of other users, typically those most similar to the active user.
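
One possible reading of this idea is a nearest-neighbour predictor: find the users most similar to the active user among those who rated the item, then add their similarity-weighted deviations to the active user's mean. The sketch below is an illustration under assumptions, not the method used in the cells of this notebook: the rating matrix `R` is made up, `predict_user_based` is a hypothetical name, and the choices of mean-centred cosine similarity and `k=2` neighbours are arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical users x items rating matrix; NaN marks a missing rating
R = pd.DataFrame(
    {
        "i1": [5, 4, 1, np.nan],
        "i2": [3, np.nan, 1, 2],
        "i3": [4, 5, 2, np.nan],
        "i4": [np.nan, 4, 1, 4],
    },
    index=["u1", "u2", "u3", "u4"],
    dtype=float,
)

def predict_user_based(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users who rated `item`."""
    means = R.mean(axis=1)                        # per-user mean, NaN skipped
    centered = R.sub(means, axis=0).fillna(0.0)   # mean-centre; missing -> 0
    norms = np.sqrt((centered ** 2).sum(axis=1)).replace(0.0, 1.0)
    unit = centered.div(norms, axis=0)
    sims = unit @ unit.loc[user]                  # cosine similarity to `user`
    # Candidate neighbours: other users who have rated `item`
    mask = R[item].notna().to_numpy() & (R.index != user)
    neighbours = sims.loc[R.index[mask]].sort_values(ascending=False).head(k)
    weight = neighbours.abs().sum()
    if weight == 0:
        return means[user]                        # fall back to the user's mean
    deviations = R.loc[neighbours.index, item] - means[neighbours.index]
    return means[user] + (neighbours * deviations).sum() / weight

pred = predict_user_based(R, "u1", "i4", k=2)
print(pred)
```

Working with deviations from each user's mean compensates for users who rate systematically high or low.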

In [80]:
def getScore(history, similarities):
   return sum(history*similarities)/sum(similarities)
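
To see what `getScore` computes, here is a small worked example with invented numbers: the history and similarity values are hypothetical, and only the artist names come from the output above. The function is repeated so that the snippet runs on its own.

```python
import pandas as pd

def getScore(history, similarities):
    # Similarity-weighted average of a user's 0/1 history
    return sum(history * similarities) / sum(similarities)

# Hypothetical user: listens to two of the three neighbouring artists
history = pd.Series([1, 0, 1], index=["the kooks", "the fratellis", "the wombats"])
# Hypothetical similarities of those artists to the candidate artist
sims = pd.Series([0.5, 0.3, 0.2], index=history.index)

score = getScore(history, sims)
print(score)
```

The score is (0.5 + 0.2) / (0.5 + 0.3 + 0.2) = 0.7: the more similar the artists the user already listens to, the closer the score gets to 1.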

In [108]:
similarities = data
data_neighbours = pd.DataFrame(index=data_bis.columns,columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(0,len(data_bis.columns)):
    data_neighbours.iloc[i,:10] = data_bis.iloc[0:,i].sort_values(ascending=False)[1:11].index

In [110]:
for i in range(0,len(similarities.index)):
    for j in range(1,len(similarities.columns)):
        user = similarities.index[i]
        product = similarities.columns[j]
 
        if data.iloc[i][j] == 1:
            similarities.iloc[i][j] = 0
        else:
            product_top_names = data_neighbours.iloc[product][1:10]
            product_top_sims = data_bis.iloc[product].sort_values(ascending=False)[1:10]
            user_purchases = data.iloc[user,product_top_names]
 
            data_sims.iloc[i][j] = getScore(user_purchases,product_top_sims)

TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [a perfect circle] of <class 'str'>

In [None]:
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:,0] = data_sims.iloc[:,0]

# For each user, keep the names of the 6 highest-scoring items (the 'user' column is skipped)
for i in range(0,len(data_sims.index)):
    scores = data_sims.iloc[i,1:].astype(float)
    data_recommend.iloc[i,1:] = scores.sort_values(ascending=False).iloc[:6].index

print(data_recommend.iloc[:10,:4])


