## Recommender System using kNN

### - Ishita Kapur



### Dataset

MovieLens https://grouplens.org/datasets/movielens/20m/

### Approach

The recommender system combines two techniques, **k-Nearest Neighbor (kNN)** and **Matrix Factorization**, and is evaluated on a held-out test dataset. It draws on both collaborative filtering (matrix factorization over user ratings) and content-based filtering (genre similarity between movies). A recommender built on kNN alone can give good results, but it does not guarantee that the recommended movies are ones the user has never watched. Matrix Factorization is included to avoid this: it is used to rank only the movies the user has not yet rated.

The data is first filtered to drop users who have rated very few movies and movies that have been rated by very few users **(PreProcess.ipynb)**. A matrix is then built with one row per user and one column per movie. Matrix Factorization generates user features and movie features, and together these are used to predict ratings for movies the user has not yet watched. The factorization uses **SVD (Singular Value Decomposition)**, one of the techniques for producing feature matrices from sparse matrices. The data is split randomly into training and test data (80:20). The best feature count is chosen using RMSE as the evaluation metric, and the predicted rating matrix is computed with that feature count. For a particular user, the 20 movies with the highest predicted ratings are taken from the predicted rating matrix as candidate recommendations.
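The filtering step lives in PreProcess.ipynb, which is not reproduced here. The following is only a minimal sketch of what such a filter could look like; the function name `filter_sparse`, the thresholds, and the toy data are assumptions for illustration, not the values used for this notebook.

```python
import pandas as pd

def filter_sparse(ratings, min_user_ratings, min_movie_ratings):
    """Keep ratings from users and movies that have enough ratings (hypothetical helper)."""
    # Number of ratings per user and per movie, aligned to every row
    per_user = ratings.groupby('userId')['rating'].transform('count')
    per_movie = ratings.groupby('movieId')['rating'].transform('count')
    keep = (per_user >= min_user_ratings) & (per_movie >= min_movie_ratings)
    return ratings[keep]

toy = pd.DataFrame({
    'userId':  [1, 1, 1, 2, 2, 3],
    'movieId': [10, 20, 30, 10, 20, 10],
    'rating':  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})
# Assumed thresholds: at least 2 ratings per user and per movie
filtered_toy = filter_sparse(toy, min_user_ratings=2, min_movie_ratings=2)
print(filtered_toy)
```

This is a single pass: after rows are dropped, some users or movies may fall below the thresholds again, so a stricter filter would repeat the step until nothing changes.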

kNN then narrows these candidates down: one movie that the user likes is taken as a reference, and the candidates most similar to it are recommended to the user. Similarity is measured with cosine similarity between the genre vector of the reference movie and the genre vectors of the movies predicted by Matrix Factorization. The model finally recommends 10 movies to the user. The mean rating of the user's selected movie is compared with the mean rating of the recommended movies, and the two turn out to be almost the same. By this measure, the approach used to build the recommender system performed well.

#### Import the required libraries

In [1]:
import numpy as np
import pandas as pd

#### Read the movies data into a dataframe

In [2]:
movies = pd.read_csv('ml-20m/ml-20m/movies.csv', header=0, usecols=range(3))
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### Preprocessed data is read into a dataframe

In [3]:
filtered = pd.read_csv('Dataset/preprocessedMovieLens.csv', header=0)
filtered.head()

,userId,movieId,rating
0,2,110,4.0
1,2,589,5.0
2,2,891,2.0
3,2,1196,5.0
4,2,1214,5.0


#### Preprocessed data is randomly sampled into training and test data (80:20)

In [4]:
train_data = filtered.sample(frac=0.8,random_state=200)
test_data = filtered.drop(train_data.index)

print('\nTraining data\n ', train_data)
print('\n\nTesting data\n ', test_data)


Training data
          userId  movieId  rating
474735   99142     1247     5.0
216504   44751     1193     4.0
512114  107049      329     4.0
605784  126153      590     4.0
57555    11425      589     3.0
...        ...      ...     ...
211465   43605     1219     4.0
387016   81197     1625     4.0
105518   21592     2571     5.0
48371     9618     1231     4.0
78503    15729     1224     4.0

[532700 rows x 3 columns]


Testing data
          userId  movieId  rating
0            2      110     4.0
2            2      891     2.0
8            2     3928     5.0
9            6        1     5.0
13          10     1222     3.0
...        ...      ...     ...
665846  138483      535     5.0
665866  138489     1275     4.0
665870  138492      163     3.5
665873  138492     1304     4.5
665874  138492     2571     5.0

[133175 rows x 3 columns]
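The split above works because `drop(train_data.index)` removes exactly the sampled rows, which relies on the index labels being unique. A small sketch on made-up data (the `toy` frame is hypothetical) shows how the split can be sanity-checked, and how to spot users or movies that end up only in the test part and therefore have no training signal:

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'movieId': [10, 20, 10, 30, 20, 30, 10, 40, 20, 40],
    'rating':  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5, 4.0, 1.5, 5.0, 3.0],
})
toy_train = toy.sample(frac=0.8, random_state=200)
toy_test = toy.drop(toy_train.index)

# The two parts should be disjoint and together cover every row
overlap = toy_train.index.intersection(toy_test.index)

# Users and movies seen only in the test part
unseen_users = set(toy_test['userId']) - set(toy_train['userId'])
unseen_movies = set(toy_test['movieId']) - set(toy_train['movieId'])
print(len(toy_train), len(toy_test), len(overlap), unseen_users, unseen_movies)
```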


#### A user × movie matrix is created from the training data and the empty cells are filled with 0

In [5]:
matrix_movie_user = train_data.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print('\n\n Matrix:')
matrix_movie_user.head()



 Matrix:


userId \ movieId,1,2,4,5,6,9,10,15,16,18,...,7031,7035,7041,7249,7312,7761,7769,7953,7982,8191
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
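To make the reshaping concrete, here is a toy version of the same pivot on a handful of made-up ratings (the `toy` frame is hypothetical). Since MovieLens ratings start at 0.5, a 0 in the matrix can only mean "not rated", which also makes it easy to measure how sparse the matrix is:

```python
import pandas as pd

toy = pd.DataFrame({
    'userId':  [2, 2, 6, 10],
    'movieId': [110, 589, 1, 110],
    'rating':  [4.0, 5.0, 5.0, 3.0],
})
# One row per user, one column per movie, 0 where no rating exists
toy_matrix = toy.pivot(index='userId', columns='movieId', values='rating').fillna(0)
# Fraction of cells that hold no rating
sparsity = (toy_matrix.values == 0).mean()
print(toy_matrix)
print(sparsity)
```

Note that `pivot` raises an error if the same (userId, movieId) pair occurs more than once, so it assumes each user rated each movie at most once.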


#### Getting only the values from the matrix

In [6]:
d = matrix_movie_user.values
userIds = matrix_movie_user.index

#### The mean of each user's ratings is computed and the demeaned data is stored in another ndarray

In [7]:
mean_userratings = d.mean(axis = 1)
ratings_demeaned = d - mean_userratings.reshape(-1, 1)
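The cell above averages over every column, so unrated movies (stored as 0) pull each user's mean down. A possible variant, shown here only as a sketch on a made-up array and not used for the results in this notebook, averages over the rated cells alone:

```python
import numpy as np

d_toy = np.array([
    [5.0, 0.0, 3.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
])
rated = d_toy > 0

# Mean as computed in the notebook: zeros count as ratings
mean_all = d_toy.mean(axis=1)

# Variant: average only over the movies each user actually rated
mean_rated = d_toy.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)

# Demean the rated cells and leave unrated cells at 0
demeaned_rated = np.where(rated, d_toy - mean_rated.reshape(-1, 1), 0.0)
print(mean_all, mean_rated)
print(demeaned_rated)
```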

### Training the Recommender System

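Functions for training the model on the training dataset are implemented below. The best feature count for Matrix Factorization is chosen with an RMSE-based metric: the root-mean-square difference between the mean rating of the user's selected movie and the mean ratings of the movies recommended for that user.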

In [8]:
from scipy.sparse.linalg import svds

def performMF(data_demeaned, data, matrix, k, userIds):
    # Rank-k truncated SVD of the demeaned user x movie matrix
    U, sigma, Vt = svds(data_demeaned, k)
    sigma = np.diag(sigma)
    # Rebuild the rating matrix and add each user's mean rating back
    predicted_ratings = np.dot(np.dot(U, sigma), Vt) + data.reshape(-1, 1)
    user_predictions = pd.DataFrame(predicted_ratings, columns = matrix.columns)
    user_predictions = user_predictions.set_index(userIds)
    return user_predictions

def recommendMovies(user_predictions, movies, ratings, user_id, n_rec):
    # Note: iloc is positional, so user_id is used here as a row position, not as a userId label
    user_row_number = user_id
    sorted_user_predictions = user_predictions.iloc[user_row_number].sort_values(ascending=False)
    sorted_user_predictions.name = 'Predictions'
    # Movies the user has already rated, best rated first
    user_data = ratings[ratings.userId == user_id]
    user_full = (user_data.merge(movies, how = 'left', left_on = 'movieId', right_on = 'movieId').sort_values(['rating'], ascending=False))
    # Keep only unrated movies and return the n_rec with the highest predicted rating
    not_watched = movies[~movies['movieId'].isin(user_full['movieId'])]
    recommendations = (
        not_watched
        .merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left', on = 'movieId')
        .rename(columns = {user_row_number: 'Predictions'})
        .sort_values('Predictions', ascending = False)
        .iloc[:n_rec, :-1]
    )
    return user_full, recommendations

def performTrainingForMF(data_demeaned, data, matrix, userIds, n_features, movies, movies_genre, genres, train_data, userID, movie_name, n_rec):
    user_predictions = performMF(data_demeaned, data, matrix, n_features, userIds)
    already_rated, predictions = recommendMovies(user_predictions, movies, train_data, userID, n_rec)
    # Mean rating of the movie the user selected
    user_pref = movies_genre.loc[movies_genre['title'] == movie_name, 'mean']
    mean_user_pref = user_pref.values[0]
    predictions = predictions.set_index('movieId')
    predicted_movieIds = predictions.index
    predicted_movie = movies_genre.loc[predicted_movieIds, ['title','mean']]
    # RMSE between the mean ratings of the predicted movies and of the selected movie
    rmse = np.sqrt(((predicted_movie['mean'] - mean_user_pref) ** 2).mean())
    return rmse
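`recommendMovies` picks the user's prediction row with `iloc[user_id]`. Because the prediction frame is indexed by `userId`, positional and label-based lookups can return different rows. The toy frame below (made-up values, not taken from the notebook) sketches the difference; whether a label-based lookup was intended is an assumption on the reader's part.

```python
import pandas as pd

preds = pd.DataFrame(
    [[4.1, 2.0], [3.0, 4.8], [1.5, 3.3]],
    index=pd.Index([2, 6, 10], name='userId'),
    columns=pd.Index([110, 589], name='movieId'),
)

by_position = preds.iloc[2]   # third row, which belongs to userId 10
by_label = preds.loc[2]       # row labelled userId 2

print(by_position.name, by_label.name)
```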

#### Finding the best feature count (can vary based on the dataset split)

For user 2, 20 movies are predicted based on their chosen movie, 'Star Trek: First Contact (1996)'.

In [9]:
aboutUser = filtered.groupby('userId').agg({'rating': ['count', 'mean']})
aboutUser.columns = aboutUser.columns.droplevel()

aboutMovie = filtered.groupby('movieId').agg({'rating': ['count', 'mean']})
aboutMovie.columns = aboutMovie.columns.droplevel()

n_features = [50, 70, 100]
userID = 2
nrec_fromMF = 20
movie_name = 'Star Trek: First Contact (1996)'
movies_df = pd.read_csv('Dataset/preprocessedMovies.csv', header=0)
movies_df = movies_df.set_index('movieId')
genres = movies_df.columns[1:21]
movies_df = pd.merge(movies_df, aboutMovie['mean'], on='movieId', how='inner')
rmse_nfeatures = []
for n in n_features:
    rmse = performTrainingForMF(ratings_demeaned, mean_userratings, matrix_movie_user, userIds, n, movies, movies_df, genres, train_data, userID, movie_name, nrec_fromMF)
    rmse_nfeatures.append(rmse)
best_nfeatures = n_features[rmse_nfeatures.index(min(rmse_nfeatures))]
print(n_features)
print(rmse_nfeatures)
print('Best n Features: ', best_nfeatures)

[50, 70, 100]
[0.3799323722937944, 0.37855470828341214, 0.3874543885646134]
Best n Features:  70


#### Already watched movies and the predicted movies are listed

In [10]:
def getPredictionsforBestnFeatures(data_demeaned, data, matrix, userIds, n_features, movies, movies_genre, genres, train_data, userID, movie_name, n_rec):
    user_predictions = performMF(data_demeaned, data, matrix, n_features, userIds)
    already_rated, predictions = recommendMovies(user_predictions, movies, train_data, userID, n_rec)
    user_pref = movies_df.loc[movies_df['title'] == movie_name, 'mean']
    mean_user_pref = user_pref.values[0]
    predictions = predictions.set_index('movieId')
    return already_rated, predictions

already_rated, predictions = getPredictionsforBestnFeatures(ratings_demeaned, mean_userratings, matrix_movie_user, userIds, best_nfeatures, movies, movies_df, genres, train_data, userID, movie_name, nrec_fromMF)
print('\nAlready Rated Movies\n\n', already_rated)
print('\n\nRecommended Movies\n', predictions)


Already Rated Movies

    userId  movieId  rating                                              title  \
0       2     1196     5.0  Star Wars: Episode V - The Empire Strikes Back...   
1       2      589     5.0                  Terminator 2: Judgment Day (1991)   
3       2     1356     5.0                    Star Trek: First Contact (1996)   
5       2     1214     5.0                                       Alien (1979)   
4       2     3917     4.0                                  Hellraiser (1987)   
2       2     1970     2.0  Nightmare on Elm Street 3: Dream Warriors, A (...   

                             genres  
0           Action|Adventure|Sci-Fi  
1                     Action|Sci-Fi  
3  Action|Adventure|Sci-Fi|Thriller  
5                     Horror|Sci-Fi  
4                            Horror  
2                   Horror|Thriller  


Recommended Movies
                                              title  \
movieId                                              
1247        

#### kNN can now be used to find similar movies based on the user's selection

Here the predicted movies are compared with the user's selected movie using cosine similarity, and the most similar movies are recommended to the user.

In [11]:
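def vectorizedCosineSimilarity(user_vector, vector_data):
    # Scale every row to unit length, then take dot products.
    # All-zero rows yield NaN; the resulting warnings are suppressed only inside this block.
    with np.errstate(divide='ignore', invalid='ignore'):
        user_unit = user_vector / np.linalg.norm(user_vector, axis=1)[..., None]
        data_unit = vector_data / np.linalg.norm(vector_data, axis=1)[..., None]
        similarities = np.dot(user_unit, data_unit.T)
    return similarities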

def getRecommendedMovieDetails(data, movies_df, genres):
    movieIds = data.index.to_numpy()
    genre_list = genres.tolist()
    genre_det = movies_df.loc[movieIds.tolist(), genre_list]
    return movieIds, genre_det.values

def getDetailsUserPrefMovie(movie_name, movies_df, genres):
    user_pref_movie = movies_df.loc[movies_df['title'] == movie_name, genres]
    user_pref_vector = user_pref_movie.values
    return user_pref_vector

def performkNN(user_pref_vector, recommended_movie_vectors, n_nearest):
    # Relies on the notebook-level variables movieIds and movies_df
    similarity = vectorizedCosineSimilarity(user_pref_vector, recommended_movie_vectors)
    # Positions of the n_nearest highest similarities (argpartition does not sort them)
    ind = np.argpartition(similarity[0], -n_nearest)[-n_nearest:]
    recommended_movieIds = movieIds[ind]
    recommended_movies = movies_df.loc[recommended_movieIds, ['title', 'mean']]
    return recommended_movies
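As a worked illustration of the genre comparison, the sketch below applies cosine similarity and a top-k selection to four made-up one-hot genre vectors. The genre columns, the vectors, and the helper `cosine_sim` are hypothetical and much smaller than the genre columns in `movies_df`.

```python
import numpy as np

# Hypothetical one-hot genre vectors: [Action, Adventure, Comedy, Sci-Fi]
liked = np.array([[1, 1, 0, 1]], dtype=float)
candidates = np.array([
    [1, 0, 0, 1],   # Action|Sci-Fi
    [0, 0, 1, 0],   # Comedy
    [1, 1, 0, 1],   # Action|Adventure|Sci-Fi
    [0, 1, 1, 0],   # Adventure|Comedy
], dtype=float)

def cosine_sim(a, b):
    # Normalise rows to unit length, then take pairwise dot products
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

sims = cosine_sim(liked, candidates)[0]
k = 2
# The k most similar candidates, in no particular order
top_k = np.argpartition(sims, -k)[-k:]
# The same candidates ordered from most to least similar
top_k_sorted = top_k[np.argsort(sims[top_k])[::-1]]
print(sims, top_k_sorted)
```

Because `np.argpartition` only guarantees that the k largest values land in the last k positions, the rows printed by `performkNN` are not necessarily ordered by similarity; an extra `argsort` over the selected entries, as above, gives a ranked list.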

#### kNN is used to list the recommended movies.

The mean rating of each movie is displayed alongside its title; on average, these are close to the mean rating of the user's selected movie.

In [12]:
movie_name = 'Star Trek: First Contact (1996)'
k_nearest = 10
user_pref_vector = getDetailsUserPrefMovie(movie_name, movies_df, genres)
movieIds, recommended_movie_genre_values_MF = getRecommendedMovieDetails(predictions, movies_df, genres)
recommended_movies = performkNN(user_pref_vector, recommended_movie_genre_values_MF, k_nearest)
print(recommended_movies)

                                             title      mean
movieId                                                     
765                                    Jack (1996)  3.061271
1                                 Toy Story (1995)  3.925835
724                              Craft, The (1996)  3.065250
1224                                Henry V (1989)  4.122016
1254      Treasure of the Sierra Madre, The (1948)  4.190814
493                       Menace II Society (1993)  3.605736
1275                             Highlander (1986)  3.595580
1921                                     Pi (1998)  3.709809
1304     Butch Cassidy and the Sundance Kid (1969)  4.051922
3168                             Easy Rider (1969)  3.724755


#### Mean rating of the user's selected movie

In [13]:
rating_user_pref = movies_df.loc[movies_df['title'] == movie_name, 'mean'].values
print(rating_user_pref)

[3.66628346]


#### Mean rating of all the recommended movies

In [14]:
mean_recommendedMovies = recommended_movies['mean'].mean()
print(mean_recommendedMovies)

3.7052988406175813


### Testing the Recommender System

Test Data has been used for testing the recommender system.

In [15]:
print('Test Data\n', test_data)

matrix_movie_user_test_data = test_data.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print('\n\n Matrix:')
matrix_movie_user_test_data.head()

Test Data
         userId  movieId  rating
0            2      110     4.0
2            2      891     2.0
8            2     3928     5.0
9            6        1     5.0
13          10     1222     3.0
...        ...      ...     ...
665846  138483      535     5.0
665866  138489     1275     4.0
665870  138492      163     3.5
665873  138492     1304     4.5
665874  138492     2571     5.0

[133175 rows x 3 columns]


 Matrix:


userId \ movieId,1,2,4,5,6,9,10,15,16,18,...,7031,7035,7041,7249,7312,7761,7769,7953,7982,8191
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### The already rated movies and the predicted movies are listed

In [16]:
d_test_data = matrix_movie_user_test_data.values
userIds_test_data = matrix_movie_user_test_data.index
mean_userratings_test_data = d_test_data.mean(axis = 1)
ratings_demeaned_test_data = d_test_data - mean_userratings_test_data.reshape(-1, 1)

n_features = best_nfeatures
userID = 2
nrec_fromMF = 20
movie_name = 'Braveheart (1995)'
already_rated_test_data, predictions_test_data = getPredictionsforBestnFeatures(ratings_demeaned_test_data, mean_userratings_test_data, matrix_movie_user_test_data, userIds_test_data, n_features, movies, movies_df, genres, test_data, userID, movie_name, nrec_fromMF)
print('\nAlready Rated Movies\n\n', already_rated_test_data)
print('\n\nRecommended Movies\n', predictions_test_data)


Already Rated Movies

    userId  movieId  rating                                              title  \
2       2     3928     5.0       Abbott and Costello Meet Frankenstein (1948)   
0       2      110     4.0                                  Braveheart (1995)   
1       2      891     2.0  Halloween: The Curse of Michael Myers (Hallowe...   

             genres  
2     Comedy|Horror  
0  Action|Drama|War  
1   Horror|Thriller  


Recommended Movies
                                              title  \
movieId                                              
1250          Bridge on the River Kwai, The (1957)   
1222                      Full Metal Jacket (1987)   
3359                          Breaking Away (1979)   
1884         Fear and Loathing in Las Vegas (1998)   
1178                         Paths of Glory (1957)   
3168                             Easy Rider (1969)   
1927         All Quiet on the Western Front (1930)   
1104              Streetcar Named Desire, A (1951)   
2

#### The best predicted movies are recommended to the user using the kNN algorithm

In [17]:
movie_name = 'Braveheart (1995)'
k_nearest = 10
user_pref_vector_test = getDetailsUserPrefMovie(movie_name, movies_df, genres)
movieIds, recommended_movie_genre_values_MF_test = getRecommendedMovieDetails(predictions_test_data, movies_df, genres)
recommended_movies_test = performkNN(user_pref_vector_test, recommended_movie_genre_values_MF_test, k_nearest)
print(recommended_movies_test)

                                             title      mean
movieId                                                     
1254      Treasure of the Sierra Madre, The (1948)  4.190814
1104              Streetcar Named Desire, A (1951)  4.010700
1250          Bridge on the River Kwai, The (1957)  4.124499
1944                  From Here to Eternity (1953)  3.973186
1218     Killer, The (Die xue shuang xiong) (1989)  4.088297
428                           Bronx Tale, A (1993)  3.836240
1178                         Paths of Glory (1957)  4.263023
1224                                Henry V (1989)  4.122016
1222                      Full Metal Jacket (1987)  4.037757
1927         All Quiet on the Western Front (1930)  3.923302


#### Mean rating of the user's selected movie

In [18]:
rating_user_pref_test = movies_df.loc[movies_df['title'] == movie_name, 'mean'].values
print(rating_user_pref_test)

[4.04419729]


#### Mean rating of the recommended movies

In [19]:
mean_recommendedMovies_test = recommended_movies_test['mean'].mean()
print(mean_recommendedMovies_test)

4.056983540651212


### Evaluating the Recommender System

#### For training data

In [20]:
print('Mean rating of user\'s selected movie: ' , rating_user_pref[0])

print(recommended_movies)

print('\n\nRMSE:')

rmse_train = np.sqrt(((recommended_movies['mean'] - rating_user_pref[0]) ** 2).mean())
print(rmse_train)

Mean rating of user's selected movie:  3.666283461980621
                                             title      mean
movieId                                                     
765                                    Jack (1996)  3.061271
1                                 Toy Story (1995)  3.925835
724                              Craft, The (1996)  3.065250
1224                                Henry V (1989)  4.122016
1254      Treasure of the Sierra Madre, The (1948)  4.190814
493                       Menace II Society (1993)  3.605736
1275                             Highlander (1986)  3.595580
1921                                     Pi (1998)  3.709809
1304     Butch Cassidy and the Sundance Kid (1969)  4.051922
3168                             Easy Rider (1969)  3.724755


RMSE:
0.37949542319066215


#### For test data

In [21]:
print('Mean rating of user\'s selected movie: ' , rating_user_pref_test[0])

print(recommended_movies_test)

print('\n\nRMSE:')

rmse_test = np.sqrt(((recommended_movies_test['mean'] - rating_user_pref_test[0]) ** 2).mean())
print(rmse_test)

Mean rating of user's selected movie:  4.044197286438013
                                             title      mean
movieId                                                     
1254      Treasure of the Sierra Madre, The (1948)  4.190814
1104              Streetcar Named Desire, A (1951)  4.010700
1250          Bridge on the River Kwai, The (1957)  4.124499
1944                  From Here to Eternity (1953)  3.973186
1218     Killer, The (Die xue shuang xiong) (1989)  4.088297
428                           Bronx Tale, A (1993)  3.836240
1178                         Paths of Glory (1957)  4.263023
1224                                Henry V (1989)  4.122016
1222                      Full Metal Jacket (1987)  4.037757
1927         All Quiet on the Western Front (1930)  3.923302


RMSE:
0.12161328825993363


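#### The recommender system performed well

This can be seen from the RMSE used to evaluate the system: the mean ratings of the recommended movies deviate from the mean rating of the user's selected movie by about 0.38 on the training data and about 0.12 on the test data.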
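The RMSE above compares mean ratings of movies. A complementary check, which is not part of this notebook, is to compare the ratings predicted by the factorization with ratings that were held out. The sketch below does this on a made-up matrix; the function name `heldout_rmse`, the toy data, and the use of row and column positions instead of user and movie IDs are all assumptions for illustration.

```python
import numpy as np
from scipy.sparse.linalg import svds

def heldout_rmse(train_matrix, test_triples, k):
    """RMSE between rank-k SVD predictions and held-out ratings (hypothetical helper).

    train_matrix: users x movies array with 0 for "not rated", as in the notebook.
    test_triples: iterable of (row_position, col_position, true_rating).
    """
    user_mean = train_matrix.mean(axis=1)
    demeaned = train_matrix - user_mean.reshape(-1, 1)
    # k must be smaller than both dimensions of the matrix
    U, sigma, Vt = svds(demeaned, k=k)
    predicted = U @ np.diag(sigma) @ Vt + user_mean.reshape(-1, 1)
    errors = [predicted[r, c] - rating for r, c, rating in test_triples]
    return float(np.sqrt(np.mean(np.square(errors))))

toy_train = np.array([
    [5.0, 0.0, 3.0, 0.0, 4.0],
    [4.0, 0.0, 0.0, 2.0, 0.0],
    [0.0, 5.0, 4.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 5.0, 4.0],
])
toy_test = [(0, 1, 2.0), (1, 2, 3.0), (2, 4, 4.5)]
for k in (1, 2, 3):
    print(k, heldout_rmse(toy_train, toy_test, k))
```

Applied to the real data, the test rows would first have to be mapped from `userId` and `movieId` to positions in the training matrix, skipping users or movies that do not appear in the training data.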