# CS 143 Final Project Recommendation Algorithm Implementation

For this project we use a k-nearest neighbors (kNN) model as a baseline recommendation algorithm. All predictions are based on the MovieLens dataset (2), which contains 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Recommendations are built from movie ratings only. The cells below load `data/movies-small.csv` and `data/ratings-small.csv`, so the counts printed in this notebook describe those files.

References:

(1) https://github.com/KevinLiao159/MyDataSciencePortfolio/blob/master/movie_recommender/movie_recommendation_using_KNN.ipynb

(2) https://grouplens.org/datasets/movielens/latest/

(3) https://www.geeksforgeeks.org/implementation-k-nearest-neighbors/

# 1. Load Data

In [None]:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

In [None]:
# import movie data
movie_data = pd.read_csv("data/movies-small.csv",
    usecols=['movieId', 'title'],
    dtype={'movieId': 'int32', 'title': 'str'})

# import corresponding ratings
rating_data = pd.read_csv("data/ratings-small.csv",
    usecols=['userId', 'movieId', 'rating'],
    dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})

In [None]:
print("Number of unique movies: ", len(rating_data['movieId'].unique()))
print("Number of unique users: ", len(rating_data['userId'].unique()))

# 2. Clean Data

As with most datasets, some points do not represent the majority of users. Here, those are unpopular movies with few ratings and users who have rated very few movies. To keep these sparse rows and columns from skewing predictions, we drop movies with fewer than 50 ratings and users with fewer than 50 ratings.
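
To make the filtering step concrete, here is a minimal sketch on a made-up ratings table. The values, the name `toy_ratings`, and the threshold `min_ratings = 3` are assumptions for illustration only; the notebook itself applies a threshold of 50 to the real data.

```python
import pandas as pd

# Toy ratings table (made-up values, not taken from MovieLens)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

min_ratings = 3  # hypothetical threshold chosen for this toy example

# Count ratings per movie, then keep only the movies that meet the threshold
movie_counts = toy_ratings.groupby("movieId").size()
kept_movie_ids = movie_counts[movie_counts >= min_ratings].index
filtered = toy_ratings[toy_ratings["movieId"].isin(kept_movie_ids)]

print(filtered)
```

Movie 10 has three ratings and survives; movie 20 has only two and is dropped. The same pattern, grouped on `userId`, removes inactive users.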

In [None]:
# count ratings per movie and keep only movies with at least 50 ratings
movies_count = pd.DataFrame(rating_data.groupby('movieId').size(), columns=['count'])
popular_movie_ids = movies_count[movies_count['count'] >= 50].index
ratings_drop_movies = rating_data[rating_data.movieId.isin(popular_movie_ids)]
updated_movie_data = movie_data[movie_data.movieId.isin(popular_movie_ids)]

In [None]:
# count ratings per user (using the original ratings) and keep only users with at least 50 ratings
ratings_count = pd.DataFrame(rating_data.groupby('userId').size(), columns=['count'])
active_user_ids = ratings_count[ratings_count['count'] >= 50].index
ratings_drop_movies_users = ratings_drop_movies[ratings_drop_movies.userId.isin(active_user_ids)]

In [None]:
print("Original number of ratings: ", rating_data.shape[0])
print("Number of ratings after dropping unpopular movies: ", ratings_drop_movies.shape[0])
print("Number of ratings after dropping unpopular movies and inactive users: ", ratings_drop_movies_users.shape[0])

In [None]:
print("Number of final unique movies: ", len(ratings_drop_movies_users['movieId'].unique()))
print("Number of final unique users: ", len(ratings_drop_movies_users['userId'].unique()))

In [None]:
# create movie vs user matrix for kNN computations
movie_user_matrix = ratings_drop_movies_users.pivot(index='movieId', columns='userId', values='rating').fillna(0)
movie_user_matrix.shape
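
As a small illustration of what the pivot produces, the sketch below builds the same kind of movie-by-user matrix from a made-up table. The names `toy` and `toy_matrix` are hypothetical. Note that `fillna(0)` makes a missing rating indistinguishable from a rating of 0, which affects any distance computed on these rows. The notebook imports `csr_matrix` but does not use it; the last lines show one way it could be applied, as an assumption rather than part of the original pipeline.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings (made-up values): user 2 never rated movie 20, user 3 never rated movie 10
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0],
})

# Rows are movies, columns are users, missing ratings become 0
toy_matrix = toy.pivot(index="movieId", columns="userId", values="rating").fillna(0)
print(toy_matrix)

# Optional: store only the non-zero entries, which saves memory on large, sparse data
toy_sparse = csr_matrix(toy_matrix.values)
print(toy_sparse.nnz, "stored ratings out of", toy_matrix.size, "cells")
```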

# 3. Implement kNN Model

In [None]:
# re-index movie_data on movieId so titles can be looked up by id
movie_data = movie_data.set_index('movieId')

In [None]:
def euclidean_distance(x, y):
    # straight-line (L2) distance between two rating vectors
    return np.linalg.norm(x - y)
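
For reference, the distance used here is $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. The following sketch checks the function on two made-up rating vectors; the arrays `a` and `b` are illustrative values, not rows from the dataset.

```python
import numpy as np

def euclidean_distance(x, y):
    # straight-line (L2) distance between two rating vectors
    return np.linalg.norm(x - y)

# Two toy rating vectors over the same three users (0 means "not rated")
a = np.array([4.0, 5.0, 0.0])
b = np.array([3.0, 0.0, 2.0])

# Differences are (1, 5, -2), so the squared sum is 1 + 25 + 4 = 30
dist = euclidean_distance(a, b)
print(dist)
```

Because unrated entries are stored as 0, a movie that one user rated 5 and another movie that the same user never rated contribute a difference of 5 in that coordinate.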

In [None]:
def rec_model(data, num_recs, movie_title, movie_mapping):
    """Return the titles of the num_recs movies closest to movie_title."""

    # look up the movie id and its rating vector
    matches = movie_mapping[movie_mapping["title"] == movie_title]
    if matches.empty:
        raise ValueError(f"Title not found: {movie_title}")
    movie_id = matches.index[0]
    movie_vector = data.loc[movie_id]

    # drop the query movie so it cannot be recommended to itself
    data = data.drop(movie_id)

    # list to save (movie_id, distance) pairs
    dists = []

    # compute the distance from the query movie to every other movie
    # (iterations are independent, so this loop can be parallelized)
    for index, row in data.iterrows():
        dist = euclidean_distance(movie_vector, row)
        dists.append((index, dist))

    # sort distances in ascending order
    dists.sort(key=lambda x: x[1])

    # keep the num_recs closest movies (the query movie was already dropped above)
    top_movies = dists[:num_recs]

    # map selected movies back to titles
    titles = [movie_mapping.loc[movie[0], "title"] for movie in top_movies]

    return titles

In [None]:
rec_model(movie_user_matrix, 5, 'Pocahontas (1995)', movie_data)
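
The loop in `rec_model` computes one distance per row, and each computation is independent of the others. One possible alternative, sketched below under the assumption that the whole matrix fits in memory, is to compute all distances at once with NumPy broadcasting. The helper name `nearest_movie_ids` and the toy matrix are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def nearest_movie_ids(data, movie_id, num_recs):
    """Return the ids of the num_recs rows closest to movie_id (Euclidean distance)."""
    target = data.loc[movie_id].to_numpy()
    others = data.drop(movie_id)
    # Broadcasting subtracts the target from every row; axis=1 gives one norm per row
    dists = np.linalg.norm(others.to_numpy() - target, axis=1)
    # A stable sort keeps the original row order among tied distances
    order = np.argsort(dists, kind="stable")[:num_recs]
    return list(others.index[order])

# Toy movie-by-user matrix (made-up ratings, 0 means "not rated")
toy_matrix = pd.DataFrame(
    [[4.0, 5.0, 0.0],
     [3.0, 0.0, 2.0],
     [4.0, 4.0, 0.0]],
    index=pd.Index([10, 20, 30], name="movieId"),
    columns=[1, 2, 3],
)

closest = nearest_movie_ids(toy_matrix, 10, 1)
print(closest)
```

In this toy matrix, movie 30 differs from movie 10 by a single rating point, so it is returned ahead of movie 20. The returned ids can be mapped to titles the same way `rec_model` does.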