## Collaborative Filtering

To address some of the limitations of content-based filtering, collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B. Furthermore, the embeddings can be learned automatically, without relying on hand-engineering of features.

### A Movie Recommendation Example
Consider a movie recommendation system in which the training data consists of a feedback matrix in which:

Each row represents a user.
Each column represents an item (a movie).
The feedback about movies falls into one of two categories:
Users specify how much they liked a particular movie by providing a numerical rating.
When a user visits the homepage, the system should recommend movies based on both:
similarity to movies the user has liked in the past (item-item filtering) and movies that similar users liked (usser-item filtering)

In the below example we have 610 users and 9724 movies


In [25]:
#import necessary libraries

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

__author__ = "Kanchan Pandhare"
__email__ = "kanchan.pandhare08@gmail.com"


movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

#Combine movies and ratings to a single dataframe movie_ratings
movie_ratings  = pd.merge(movies,ratings,)
print(movie_ratings.head())


   movieId             title                                       genres  \
0        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
1        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
2        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
3        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   
4        1  Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy   

   userId  rating   timestamp  
0       1     4.0   964982703  
1       5     4.0   847434962  
2       7     4.5  1106635946  
3      15     2.5  1510577970  
4      17     4.5  1305696483  



## Item-Item Similarity

Item-item collaborative filtering, or item-based is a form of collaborative filtering for recommender systems based on the similarity between items calculated using people's ratings of those items.

### Movie Matrix
For Item-Item Similarity we have movie Ids in rows and user Ids in columns. This will form a matrix of (no_of_movies) x (no_of_users)



In [26]:
movie_matrix = movie_ratings.pivot(index = 'movieId', columns = 'userId', values = 'rating').fillna(0)
print(movie_matrix)


userId   1    2    3    4    5    6    7    8    9    10   ...  601  602  603  \
movieId                                                    ...                  
1        4.0  0.0  0.0  0.0  4.0  0.0  4.5  0.0  0.0  0.0  ...  4.0  0.0  4.0   
2        0.0  0.0  0.0  0.0  0.0  4.0  0.0  4.0  0.0  0.0  ...  0.0  4.0  0.0   
3        4.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
4        0.0  0.0  0.0  0.0  0.0  3.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
5        0.0  0.0  0.0  0.0  0.0  5.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...   
193581   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
193583   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
193585   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
193587   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0   
193609   0.0  0.0  0.0  0.0 

In [27]:
movie_similarity =  cosine_similarity(movie_matrix)
np.fill_diagonal(movie_similarity,0)
ratings_matrix_items = pd.DataFrame( movie_similarity )
print(ratings_matrix_items)


          0         1         2         3         4         5         6     \
0     0.000000  0.410562  0.296917  0.035573  0.308762  0.376316  0.277491   
1     0.410562  0.000000  0.282438  0.106415  0.287795  0.297009  0.228576   
2     0.296917  0.282438  0.000000  0.092406  0.417802  0.284257  0.402831   
3     0.035573  0.106415  0.092406  0.000000  0.188376  0.089685  0.275035   
4     0.308762  0.287795  0.417802  0.188376  0.000000  0.298969  0.474002   
...        ...       ...       ...       ...       ...       ...       ...   
9719  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
9720  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
9721  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
9722  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
9723  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   

          7         8         9     ...  9714  9715  9716  9717

### Similar Movies Function 
Here the function takes movie name as input and finds the movie id from the movie name. Later it finds the similar movies associated with the movie name and returns the movies with the similarity score as output

### Recommend Movies As per Item Similarity
In this function we take user id for user A as input and find the movies watched by the user A and having the highest rating to get the movies liked by user A and pass these movies to the above similarMovies function to get the similar movies and sort it in descending order to get the most similar movies in sorted_movies_as_per_userChoice. We also have user2movies to get the movies already watched by user A. Based on this we recommend the movies in sorted_movies_as_per_userChoice but also at the same time we check if they are not already watched by the user A

In [29]:

def similarMovies(movieName): 
    """
    recomendates similar movies
   :param data: name of the movie 
   """
    try:
        #user_inp=input('Enter the reference movie title based on which recommendations are to be made: ')
        inp= movies[movies['title']==movieName].index.tolist()
        movies['similarity'] = ratings_matrix_items.iloc[inp[0]]
        movies.columns = ['movie_id', 'title', 'release_date','similarity']
    except:
        print("Sorry, the movie is not in the database!")

def recommendMoviesAsperItemSimilarity(user_id):
    """
     Recommending movie which user hasn't watched as per Item Similarity
    :param user_id: user_id to whom movie needs to be recommended
    :return: movieIds to user 
    """
    user_movie= movie_ratings[(movie_ratings.userId==user_id) & movie_ratings.rating.isin([5,4.5,4])][['title']]
    user_movie=user_movie.iloc[0,0]
    similarMovies(user_movie)
    sorted_movies_as_per_userChoice= movies.sort_values( ["similarity"], ascending = False )

    #print(sorted_movies_as_per_userChoice.head())
    sorted_movies_as_per_userChoice=sorted_movies_as_per_userChoice[sorted_movies_as_per_userChoice['similarity'] >=0.25]['movie_id']
    recommended_movies=list()
    df_recommended_item=pd.DataFrame()
    user2Movies= ratings[ratings['userId']== user_id]['movieId']
    #print(user2Movies)
    best10 = []
    for movieId in sorted_movies_as_per_userChoice:
            if movieId not in user2Movies:
                df_new= ratings[(ratings.movieId==movieId)]
                df_recommended_item=pd.concat([df_recommended_item,df_new])
            best10=df_recommended_item.sort_values(["rating"], ascending = False )[1:11] 
    return best10['movieId']

def movieIdToTitle(listMovieIDs):
    """
     Converting movieId to titles
    :param user_id: List of movies
    :return: movie titles
    """
    movie_titles= []
    for id in listMovieIDs:
        movie_titles.append(movies[movies['movie_id']==id]['title'].to_string())
    return movie_titles

In [30]:
#Call the function to find the similar movies
movieIdToTitle(recommendMoviesAsperItemSimilarity(2))

['459    Rudy (1993)',
 '314    Forrest Gump (1994)',
 '140    First Knight (1995)',
 '157    Nine Months (1995)',
 '322    Lion King, The (1994)',
 '2524    Taking of Pelham One Two Three, The (1974)',
 '314    Forrest Gump (1994)',
 '277    Shawshank Redemption, The (1994)',
 '322    Lion King, The (1994)',
 '257    Pulp Fiction (1994)']

## User-Item Similarity
The method identifies users that are similar to the queried user and estimate the desired rating to be the weighted average of the ratings of these similar users.

### User Matrix
For User-Item Similarity we have user Ids in rows and movies Ids in columns. This will form a matrix of (no_of_users} x (no_of_movies).




In [24]:
user_matrix  = movie_ratings.pivot(index = 'userId', columns = 'movieId', values = 'rating').fillna(0)
print(user_matrix)

movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     0.0     4.0     0.0     0.0     4.0     0.0     0.0   
2           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
5           4.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
...         ...     ...     ...     ...     ...     ...     ...     ...   
606         2.5     0.0     0.0     0.0     0.0     0.0     2.5     0.0   
607         4.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
608         2.5     2.0     2.0     0.0     0.0     0.0     0.0     0.0   
609         3.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
610         5.0     0.0     0.0     0.0     0.0     5.0     0.0     0.0   

movieId  9       10     

Here I have used the cosine similarity to calculate the weight given in the above formula. This gives the similarity score of a user A with other users. The resulting matrix is of size (no_of_users) x (no_of_users)

We also set the diagonal elements to 0 so as to avoid the same movie to be shown in recommendation as every given movie will have the similarity score with the same movie as equal to 1


In [31]:
user_similarity =  cosine_similarity(user_matrix)
np.fill_diagonal(user_similarity,0)
ratings_matrix_users = pd.DataFrame( user_similarity )
print(ratings_matrix_users)



          0         1         2         3         4         5         6    \
0    0.000000  0.027283  0.059720  0.194395  0.129080  0.128152  0.158744   
1    0.027283  0.000000  0.000000  0.003726  0.016614  0.025333  0.027585   
2    0.059720  0.000000  0.000000  0.002251  0.005020  0.003936  0.000000   
3    0.194395  0.003726  0.002251  0.000000  0.128659  0.088491  0.115120   
4    0.129080  0.016614  0.005020  0.128659  0.000000  0.300349  0.108342   
..        ...       ...       ...       ...       ...       ...       ...   
605  0.164191  0.028429  0.012993  0.200395  0.106435  0.102123  0.200035   
606  0.269389  0.012948  0.019247  0.131746  0.152866  0.162182  0.186114   
607  0.291097  0.046211  0.021128  0.149858  0.135535  0.178809  0.323541   
608  0.093572  0.027565  0.000000  0.032198  0.261232  0.214234  0.090840   
609  0.145321  0.102427  0.032119  0.107683  0.060792  0.052668  0.193219   

          7         8         9    ...       600       601       602  \
0  

### User's Similarity
After having the cosine similarity matrix which gives the similarity score for each user against another user, we then find the most similar user to the input user by using idxmax which gives the user index for highest score against each user. The below output shows that, the similar user corresponding to user 0 is 265.

In [15]:
similar_users = ratings_matrix_users.idxmax(axis=1)
print(similar_users)


0      265
1      365
2      312
3      390
4      469
      ... 
605    473
606    569
607    479
608    339
609    248
Length: 610, dtype: int64


### Recommend Movies As per User Similarity
In this function we pass the user id as input and find the similar user corresponding to the user id of user A. After having the similar user we get the movies watched by the similar user and we also find the movies watched/rated by user A, so that we don't show the same movies in recommendation to the user. We sort the recommendation movies in descending order of the ratings so that the most rated movie is recommended first to the user



In [41]:
def recommendMoviesAsperUserSimilarity(user_id):
    """
     Recommending movie which user hasn't watched as per Item Similarity
    :param user_id: user_id to whom movie needs to be recommended
    :return: movieIds to user 
    """
    similar_user = similar_users[user_id]
    print("User "+str(user_id)+" is similar to User " +str(similar_user))
    user2Movies= ratings[ratings['userId']== user_id]['movieId']
    df_recommended=pd.DataFrame(columns=['movieId','title','genres','userId','rating','timestamp'])
    #print(user2Movies)
    best10 = []
    similarUsersMovies = ratings[ratings['userId']== similar_user]['movieId']
    for movieId in similarUsersMovies:
            if movieId not in user2Movies:
                df_new= movie_ratings[(movie_ratings.movieId==movieId) & (movie_ratings.userId==similar_user)]
                df_recommended=pd.concat([df_recommended,df_new])
            best10=df_recommended.sort_values(["rating"], ascending = False )[1:11] 
    return best10['title']

In [42]:
print(recommendMoviesAsperUserSimilarity(2))

User 2 is similar to User 312
18580    Dr. Strangelove or: How I Learned to Stop Worr...
20299                                       Vertigo (1958)
20636                                    Casablanca (1942)
21128                         2001: A Space Odyssey (1968)
21519                      Night of the Living Dead (1968)
30192    Nosferatu (Nosferatu, eine Symphonie des Graue...
56806                           Battleship Potemkin (1925)
56525                 Road Warrior, The (Mad Max 2) (1981)
23573                    E.T. the Extra-Terrestrial (1982)
24299                                  Delicatessen (1991)
Name: title, dtype: object
