Recommendation systems offer personalized suggestions for products and services by relying on users' past behaviour. There are two main types of recommendation systems:
1. Content based - which makes suggestions based on the attributes of the items in the user's own past history
2. Collaborative - which makes suggestions based on users with similar preferences and therefore depends on data from multiple users

Collaborative recommendation systems can be further classified into:
1. User based - which recommends products and services to user A based on the preferences of similar users in the database, and involves computing similarity scores between users.
2. Item based - which recommends products and services that were rated similarly, by the same users, to the items the user already likes.
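
To make the two variants concrete, here is a minimal sketch on a made-up rating matrix (the values and the `cosine_similarity` helper are illustrative assumptions, not part of the dataset used below). User based similarity compares rows of the matrix, while item based similarity compares columns.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are items, 0 = not rated.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],   # user A
    [4.0, 5.0, 0.0, 2.0],   # user B (tastes close to A)
    [1.0, 0.0, 5.0, 4.0],   # user C (tastes far from A)
])

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# User based: compare rows (users).
sim_a_b = cosine_similarity(toy_ratings[0], toy_ratings[1])
sim_a_c = cosine_similarity(toy_ratings[0], toy_ratings[2])

# Item based: compare columns (items).
sim_item0_item1 = cosine_similarity(toy_ratings[:, 0], toy_ratings[:, 1])
sim_item0_item2 = cosine_similarity(toy_ratings[:, 0], toy_ratings[:, 2])

print(f"user A vs B: {sim_a_b:.3f}, user A vs C: {sim_a_c:.3f}")
print(f"item 0 vs 1: {sim_item0_item1:.3f}, item 0 vs 2: {sim_item0_item2:.3f}")
```

In this toy data user B scores as more similar to user A than user C does, so a user based system would lean on B's ratings when recommending to A.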

In [1126]:
import re
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

In [1127]:
ratings = pd.read_csv('https://s3-us-west-2.amazonaws.com/recommender-tutorial/ratings.csv')

# OR use the local copy
# ratings = pd.read_csv('ratings.csv')

In [1128]:
ratings.head()

,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [1129]:
ratings.describe()

,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


In [1130]:
movies = pd.read_csv('https://s3-us-west-2.amazonaws.com/recommender-tutorial/movies.csv')

# OR use the local copy
# movies = pd.read_csv('movies.csv')

In [1131]:
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [1132]:
movies.describe()

,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


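Pivot the 'ratings' dataframe into a movie-by-user matrix X in which each row is a movie, each column is a user, and each cell holds that user's rating for that movie. Use 0 where the user has not rated the movie.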

In [1133]:
X = ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)
X

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
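
The pivot step can be checked on a small scale. The sketch below uses a made-up four-row frame (the `toy` data is an assumption for illustration) to show how long-format ratings become a movie-by-user matrix, and how to test for duplicate (movieId, userId) pairs, which `pivot` rejects with a `ValueError`.

```python
import pandas as pd

# Hypothetical long-format ratings, same columns as the ratings dataframe.
toy = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 5.0, 3.5, 2.0],
})

# pivot() needs each (movieId, userId) pair to be unique, so check first.
has_duplicates = toy.duplicated(subset=['movieId', 'userId']).any()

# Rows become movies, columns become users, missing ratings become 0.
toy_matrix = toy.pivot(index='movieId', columns='userId', values='rating').fillna(0)
print(toy_matrix)

# Share of cells that hold no rating.
zero_share = (toy_matrix == 0).to_numpy().mean()
print(f"share of empty cells: {zero_share:.2f}")
```

Keep in mind that filling with 0 makes "not rated" look like a very low rating, which is a simplification that distance metrics will pick up.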


Identify which movies have been rated most often and which users rate movies most frequently.

In [1134]:
movies_rated_most = ratings.groupby('movieId')['rating'].agg('count').sort_values(ascending=False)
users_rating_frequently = ratings.groupby('userId')['rating'].agg('count').sort_values(ascending=False)
movies_rated_most

movieId
356       329
318       317
296       307
593       279
2571      278
         ... 
4093        1
4089        1
58351       1
4083        1
193609      1
Name: rating, Length: 9724, dtype: int64

In [1135]:
users_rating_frequently

userId
414    2698
599    2478
474    2108
448    1864
274    1346
       ... 
442      20
569      20
320      20
576      20
53       20
Name: rating, Length: 610, dtype: int64
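
One possible use of these counts, not applied in this notebook, is to drop movies and users with very few ratings before building the matrix. The sketch below shows the idea on made-up data; the thresholds `min_movie_ratings` and `min_user_ratings` are hypothetical values chosen only for illustration.

```python
import pandas as pd

# Hypothetical long-format ratings, same columns as the ratings dataframe.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 5.0, 3.0, 3.5, 4.5, 2.0],
})

# Assumed thresholds; suitable values depend on the dataset.
min_movie_ratings = 2
min_user_ratings = 2

movie_counts = toy.groupby('movieId')['rating'].count()
user_counts = toy.groupby('userId')['rating'].count()

# Keep only movies and users that meet the thresholds.
popular_movies = movie_counts[movie_counts >= min_movie_ratings].index
active_users = user_counts[user_counts >= min_user_ratings].index

filtered = toy[toy['movieId'].isin(popular_movies) & toy['userId'].isin(active_users)]
print(filtered)
```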

Use a sparse matrix (csr_matrix) to reduce memory usage and speed up the neighbor computations, since most entries of X are 0.

In [1136]:
csr_data = csr_matrix(X.values)
X.reset_index(inplace=True)
X

userId,movieId,1,2,3,4,5,6,7,8,9,...,601,602,603,604,605,606,607,608,609,610
0,1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
1,2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
2,3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
3,4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9719,193581,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9720,193583,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9721,193585,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9722,193587,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
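
With 100,836 ratings spread over 9,724 movies and 610 users, only about 1.7% of the cells in X are non-zero. The following sketch, on randomly generated data of a similar density (the array sizes and the random fill are assumptions), compares the memory held by a dense array with the memory held by the three arrays inside a CSR matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical 1000 x 500 matrix with at most 2% of its cells filled.
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 500, size=10_000)
dense[rows, cols] = rng.integers(1, 11, size=10_000) / 2  # ratings 0.5 .. 5.0

sparse = csr_matrix(dense)

# A CSR matrix stores only the non-zero values plus two index arrays.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
density = sparse.nnz / (dense.shape[0] * dense.shape[1])

print(f"density: {density:.3%}")
print(f"dense: {dense_bytes:,} bytes, sparse: {sparse_bytes:,} bytes")
```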


# Item Based Collaborative Recommendation

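Compare the results using different algorithms, such as brute and ball_tree. Note that ball_tree does not support the cosine metric, which is why the commented-out ball_tree variant in the cell below omits the metric argument and uses the default metric instead. Other settings (especially radius and leaf_size) can also be varied to understand the impact of those hyperparameters on model output.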
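
As a rough way to compare the two algorithms on equal terms, the sketch below uses small random dense vectors (an assumption, not the rating matrix). Because ball_tree cannot use cosine distance, the rows are scaled to unit length first: for unit vectors the squared Euclidean distance equals twice the cosine distance, so both searches should rank neighbors in the same order.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

rng = np.random.default_rng(42)
toy_items = rng.random((50, 8))  # hypothetical dense item vectors

# Brute-force search with cosine distance on the raw vectors.
brute = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5)
brute.fit(toy_items)
_, brute_idx = brute.kneighbors(toy_items[:3])

# Ball tree search with Euclidean distance on L2-normalised vectors.
unit_items = normalize(toy_items)
tree = NearestNeighbors(metric='euclidean', algorithm='ball_tree',
                        n_neighbors=5, leaf_size=30)
tree.fit(unit_items)
_, tree_idx = tree.kneighbors(unit_items[:3])

same_neighbours = np.array_equal(brute_idx, tree_idx)
print("same neighbour ranking:", same_neighbours)
```

The `leaf_size` value here is simply the scikit-learn default; it affects the speed and memory of the tree rather than which neighbors are returned.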

Inputs for item based collaborative recommendation:

In [1137]:
# Inputs for item based collaborative recommendation
ibc_no_recommendation = 5
metric = 'cosine'
algorithm = 'auto'
n_neighbors = 20
n_jobs = -1
knn = NearestNeighbors(metric = metric, algorithm = algorithm, n_neighbors = n_neighbors, n_jobs = n_jobs)
# knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# knn = NearestNeighbors(algorithm='ball_tree', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)

In [1138]:
def item_based_recommendation(movie_name, ibc_no_recommendation):
    """Return the titles of the movies nearest to the first title matching movie_name."""

    # Case-insensitive substring search on the title; na=False skips missing titles.
    # re.escape treats special characters in the input (e.g. parentheses) as literal text.
    movie_matches = movies[movies['title'].str.contains(re.escape(movie_name), case=False, na=False)]
    if movie_matches.empty:
        return 'No movies found matching the input'

    # Take the first matching movie and its row position in X (and csr_data)
    base_movie = movie_matches.iloc[0]
    movie_rows = X.index[X['movieId'] == base_movie['movieId']]
    if movie_rows.empty:
        # movies has 9742 titles but only 9724 were rated, so some have no row in X
        return 'No ratings found for the matching movie'
    movie_idx = movie_rows[0]

    # Ask for one extra neighbor because the nearest neighbor is the movie itself
    distances, indices = knn.kneighbors(csr_data[movie_idx], n_neighbors=ibc_no_recommendation + 1)

    # Skip the first neighbor (the query movie) and look up the remaining titles
    recommend_frame = []
    for idx, dist in zip(indices.squeeze()[1:], distances.squeeze()[1:]):
        movie_id = X.iloc[idx]['movieId']
        recommend_frame.append({
            'Title': movies.loc[movies['movieId'] == movie_id, 'title'].values[0],
            'Distance': dist
        })

    # Remove ['Title'] from the return statement if you need the entire dataframe
    return pd.DataFrame(recommend_frame, index=range(1, ibc_no_recommendation + 1))['Title']

In [1139]:
item_based_recommendation('Gladiator', ibc_no_recommendation)

1                                   Matrix, The (1999)
2    Lord of the Rings: The Fellowship of the Ring,...
3                          Bourne Identity, The (2002)
4                           Saving Private Ryan (1998)
5        Lord of the Rings: The Two Towers, The (2002)
Name: Title, dtype: object
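
Because the function takes the first title that contains the search text, a short query can resolve to a different movie than intended. A small helper like the one below (the name `find_titles` and the toy frame are assumptions for illustration) lists every candidate so the exact title can be picked before asking for recommendations.

```python
import re
import pandas as pd

# Hypothetical stand-in for the movies dataframe.
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Toy Story 2 (1999)', 'Jumanji (1995)'],
})

def find_titles(query, movies_df):
    """Return all titles containing the query, ignoring case."""
    mask = movies_df['title'].str.contains(re.escape(query), case=False, na=False)
    return movies_df.loc[mask, 'title'].tolist()

matches = find_titles('toy story', toy_movies)
print(matches)

# re.escape lets a query with parentheses match literally.
year_matches = find_titles('(1995)', toy_movies)
print(year_matches)
```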

# User Based Collaborative Recommendation

In [1140]:
def user_based_recommendation(ratings, movies, target_user_id, X, ubc_no_recommendation):
    """Recommend movies for a user, seeded by that user's top-rated movie.

    Note: this reuses the item based neighbors of the user's top-rated movie
    rather than computing similarity between users. The X argument is not
    used in the body.
    """

    # Get the movie ratings for the target user
    user_ratings = ratings.query('userId == @target_user_id')
    if user_ratings.empty:
        print(f"No ratings found for user {target_user_id}")
        return None

    # Find the first movie with the maximum rating from this user
    movie_id_for_target_user = user_ratings.query('rating == rating.max()').iloc[0]['movieId']
    top_rated_movie = movies.loc[movies['movieId'] == movie_id_for_target_user]['title'].values[0]

    # Get recommendations based on the top-rated movie
    recommended_movies = item_based_recommendation(top_rated_movie, ubc_no_recommendation)

    print(f"Since you enjoyed watching {top_rated_movie}, you may also enjoy watching:")
    print('\n')
    print(recommended_movies)

    return recommended_movies

In [1141]:
# Inputs for user based collaborative recommendation
target_user_id = 150
ubc_no_recommendation = 5

recommended_movies = user_based_recommendation(ratings, movies, target_user_id, X, ubc_no_recommendation)

Since you enjoyed watching Twelve Monkeys (a.k.a. 12 Monkeys) (1995), you may also enjoy watching:


1                     Pulp Fiction (1994)
2       Terminator 2: Judgment Day (1991)
3    Independence Day (a.k.a. ID4) (1996)
4             Seven (a.k.a. Se7en) (1995)
5                            Fargo (1996)
Name: Title, dtype: object
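
For comparison, the user based definition given at the start (recommending from the preferences of similar users) could be sketched as below. This is a simplified illustration on made-up data, not the method used in this notebook: the function name `similar_user_recommendations`, its parameters, and the scoring rule (mean neighbor rating, with unrated movies counted as 0) are all assumptions.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical long-format ratings, same columns as the ratings dataframe.
toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 2, 2, 3, 3],
    'movieId': [10, 20, 30, 10, 20, 30, 40, 30, 50],
    'rating':  [5.0, 4.0, 1.0, 5.0, 4.5, 1.5, 5.0, 5.0, 4.0],
})

# Users as rows this time, so neighbors are users rather than movies.
user_item = toy.pivot(index='userId', columns='movieId', values='rating').fillna(0)

def similar_user_recommendations(target_user, user_item, n_neighbors=1, n_recommendations=2):
    """Rank movies the target has not rated by the ratings of similar users."""
    knn_users = NearestNeighbors(metric='cosine', algorithm='brute')
    knn_users.fit(user_item.values)

    # Ask for one extra neighbor because the target user is its own nearest neighbor.
    target_row = user_item.loc[[target_user]].values
    _, idx = knn_users.kneighbors(target_row, n_neighbors=n_neighbors + 1)
    neighbour_ids = [u for u in user_item.index[idx.ravel()] if u != target_user][:n_neighbors]

    # Mean neighbor rating per movie (unrated movies count as 0 in this sketch).
    neighbour_scores = user_item.loc[neighbour_ids].mean(axis=0)

    # Keep movies the target has not rated and at least one neighbor has.
    unseen = user_item.loc[target_user] == 0
    ranked = neighbour_scores[unseen & (neighbour_scores > 0)].sort_values(ascending=False)
    return ranked.head(n_recommendations)

recs = similar_user_recommendations(1, user_item)
print(recs)
```

In the toy data, user 2 is the closest match to user 1, so the only suggestion is the one movie that user 2 rated and user 1 has not.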
