In [19]:
!pip install pandas
!pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


# Recommendation Engine
## Building a Movie Recommendation Engine using MovieLens dataset 
We will be using a MovieLens dataset. This dataset contains 100004 ratings across 9125 movies for 671 users. All selected users had at least rated 20 movies. 
We are going to build a recommendation engine which will suggest movies for a user which he hasn't watched yet based on the movies which he has already rated. We will be using k-nearest neighbour algorithm which we will implement from scratch.

In [3]:
import pandas as pd

Movie file contains information like movie id, title, genre of movies and ratings file contains data like user id, movie id, rating and timestamp in which each line after header row represents one rating of one movie by one user.

In [4]:
#change path here
movie_file = "/home/jupyter-user/movie_recommendation_engine_dataset/movie-recommendation-engine-dataset/movies.csv"
movie_data = pd.read_csv(movie_file, usecols = [0, 1])
movie_data.head()

Unnamed: 0,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [5]:
!pwd

/home/jupyter-user/movie_recommendation_engine_dataset/notebooks/movie-recommendation-engine-notebook


In [6]:
ratings_file = "/home/jupyter-user/movie_recommendation_engine_dataset/movie-recommendation-engine-dataset/ratings.csv"
ratings_info = pd.read_csv(ratings_file, usecols = [0, 1, 2])
ratings_info.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


In [7]:
movie_info = pd.merge(movie_data, ratings_info, left_on = 'movieId', right_on = 'movieId')
movie_info.head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),3,4.0
1,1,Toy Story (1995),6,5.0
2,1,Toy Story (1995),8,4.0
3,1,Toy Story (1995),10,4.0
4,1,Toy Story (1995),11,4.5


In [8]:
movie_info.loc[0:10, ['userId']]
movie_info[movie_info.title == "Toy Story (1995)"].head()

Unnamed: 0,movieId,title,userId,rating
0,1,Toy Story (1995),3,4.0
1,1,Toy Story (1995),6,5.0
2,1,Toy Story (1995),8,4.0
3,1,Toy Story (1995),10,4.0
4,1,Toy Story (1995),11,4.5


In [9]:
movie_info = pd.DataFrame.sort_values(movie_info, ['userId', 'movieId'], ascending = [0, 1])
movie_info.head()

Unnamed: 0,movieId,title,userId,rating
2568,1,Toy Story (1995),7120,4.5
3723,2,Jumanji (1995),7120,4.0
7163,7,Sabrina (1995),7120,4.0
12733,17,Sense and Sensibility (1995),7120,4.5
14103,19,Ace Ventura: When Nature Calls (1995),7120,3.0


Let us see the number of users and number of movies in our dataset

In [10]:
num_users = max(movie_info.userId)
num_movies = max(movie_info.movieId)
print(num_users)
print(num_movies)

7120
130642


how many movies were rated by each user and the number of users rated each movie

In [11]:
movie_per_user = movie_info.userId.value_counts()
movie_per_user.head()

3907    2711
2261    2644
903     2608
4358    2575
4222    2553
Name: userId, dtype: int64

In [12]:
users_per_movie = movie_info.title.value_counts()
users_per_movie.head()

Pulp Fiction (1994)                 3498
Forrest Gump (1994)                 3476
Silence of the Lambs, The (1991)    3247
Shawshank Redemption, The (1994)    3216
Jurassic Park (1993)                3129
Name: title, dtype: int64

Function to find top N favourite movies of a user

In [13]:
def fav_movies(current_user, N):
    # get rows corresponding to current user and then sort by rating in descending order
    # pick top N rows of the dataframe
    fav_movies = pd.DataFrame.sort_values(movie_info[movie_info.userId == current_user], 
                                          ['rating'], ascending = [0]) [:N]
    # return list of titles
    return list(fav_movies.title)

print(fav_movies(5, 3))
    
    

['Jerry Maguire (1996)', 'Wallace & Gromit: The Best of Aardman Animation (1996)', 'Speed (1994)']


Lets build recommendation engine now

- We will use a neighbour based collaborative filtering model. 
- The idea is to use k-nearest neighbour algorithm to find neighbours of a user
-  We will use their ratings to predict ratings of a movie not already rated by a current user.

We will represent movies watched by a user in a vector - the vector will have values for all the movies in our dataset.
If a user hasn't rated a movie, it would be represented as NaN.

In [14]:
user_movie_rating_matrix = pd.pivot_table(movie_info, values = 'rating', index=['userId'], columns=['movieId'])
user_movie_rating_matrix.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,129350,129354,129428,129707,130052,130073,130219,130462,130490,130642
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,3.5,,,,,,,,,...,,,,,,,,,,
2,,,4.0,,,,,,,,...,,,,,,,,,,
3,4.0,,,,,,,,,,...,,,,,,,,,,
4,,,,,,3.0,,,,4.0,...,,,,,,,,,,
5,,3.0,,,,,,,,,...,,,,,,,,,,


Now, we will find the similarity between 2 users by using correlation 

In [15]:
from scipy.spatial.distance import correlation
import numpy as np
def similarity(user1, user2):
    # normalizing user1 rating i.e mean rating of user1 for any movie
    # nanmean will return mean of an array after ignore NaN values 
    user1 = np.array(user1) - np.nanmean(user1) 
    user2 = np.array(user2) - np.nanmean(user2)
    
    # finding the similarity between 2 users
    # finding subset of movies rated by both the users
    common_movie_ids = [i for i in range(len(user1)) if user1[i] > 0 and user2[i] > 0]
    if(len(common_movie_ids) == 0):
        return 0
    else:
        user1 = np.array([user1[i] for i in common_movie_ids])
        user2 = np.array([user2[i] for i in common_movie_ids])
        return correlation(user1, user2)

 We will now use the similarity function to find the nearest neighbour of a current user

In [16]:
# nearest_neighbour_ratings function will find the k nearest neighbours of the current user and
# then use their ratings to predict the current users ratings for other unrated movies 

def nearest_neighbour_ratings(current_user, K):
    # Creating an empty matrix whose row index is userId and the value
    # will be the similarity of that user to the current user
    similarity_matrix = pd.DataFrame(index = user_movie_rating_matrix.index, 
                                    columns = ['similarity'])
    for i in user_movie_rating_matrix.index:
        # finding the similarity between user i and the current user and add it to the similarity matrix
        similarity_matrix.loc[i] = similarity(user_movie_rating_matrix.loc[current_user],
                                             user_movie_rating_matrix.loc[i])
    # Sorting the similarity matrix in descending order
    similarity_matrix = pd.DataFrame.sort_values(similarity_matrix,
                                                ['similarity'], ascending= [0])
    # now we will pick the top k nearest neighbou
    nearest_neighbours = similarity_matrix[:K]

    neighbour_movie_ratings = user_movie_rating_matrix.loc[nearest_neighbours.index]

    # This is empty dataframe placeholder for predicting the rating of current user using neighbour movie ratings
    predicted_movie_rating = pd.DataFrame(index = user_movie_rating_matrix.columns, columns = ['rating'])

    # Iterating all movies for a current user
    for i in user_movie_rating_matrix.columns:
        # by default, make predicted rating as the average rating of the current user
        predicted_rating = np.nanmean(user_movie_rating_matrix.loc[current_user])

        for j in neighbour_movie_ratings.index:
            # if user j has rated the ith movie
            if(user_movie_rating_matrix.loc[j,i] > 0):
                predicted_rating += ((user_movie_rating_matrix.loc[j,i] -np.nanmean(user_movie_rating_matrix.loc[j])) *
                                                    nearest_neighbours.loc[j, 'similarity']) / nearest_neighbours['similarity'].sum()

        predicted_movie_rating.loc[i, 'rating'] = predicted_rating

    return predicted_movie_rating

Predicting top N recommendations for a current user

In [17]:
def top_n_recommendations(current_user, N):
    predicted_movie_rating = nearest_neighbour_ratings(current_user, 10)
    movies_already_watched = list(user_movie_rating_matrix.loc[current_user]
                                  .loc[user_movie_rating_matrix.loc[current_user] > 0].index)
    
    predicted_movie_rating = predicted_movie_rating.drop(movies_already_watched)
    
    top_n_recommendations = pd.DataFrame.sort_values(predicted_movie_rating, ['rating'], ascending=[0])[:N]
    
    top_n_recommendation_titles = movie_data.loc[movie_data.movieId.isin(top_n_recommendations.index)]

    return list(top_n_recommendation_titles.title)

finding out the recommendations for a user

In [18]:
current_user = 140
print("User's favorite movies are : ", fav_movies(current_user, 5),
      "\nUser's top recommendations are: ", top_n_recommendations(current_user, 3))

  dist = 1.0 - uv / np.sqrt(uu * vv)


User's favorite movies are :  ['Fish Called Wanda, A (1988)', 'Amadeus (1984)', 'Amos & Andrew (1993)', 'Big Blue, The (Grand bleu, Le) (1988)', 'Kama Sutra: A Tale of Love (1996)'] 
User's top recommendations are:  ["Schindler's List (1993)", 'Lord of the Rings: The Fellowship of the Ring, The (2001)', 'Inception (2010)']


## Conclusion
We have built a movie recommendation engine using k-nearest neighbour algorithm implemented from scratch. 