# Project 4

## System 1

### Proposal 1
1. This system would first center all ratings from each user by subtracting the user's mean rating from each rating. This would ensure that users who tend to rate everything consistently higher are treated the same as users who are consistently more negative.
2. Then it would select all movies in the given genre.
3. Then it would filter out all movies that have fewer than a given threshold of ratings to eliminate movies that only had a few (perhaps overly positive) ratings.
4. Then it would calculate the average centered rating for each selected movie and return the top _n_ movies.
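
The four steps above could be sketched in pandas roughly as follows. This is only an illustration: the function name `top_movies_in_genre`, the column names (`user`, `movie`, `rating`), the `movie_genres` mapping with a single genre per movie, and the toy data are all assumptions, not part of the project code.

```python
import pandas as pd

def top_movies_in_genre(ratings, movie_genres, genre, min_ratings=2, n=2):
    """Rank movies in `genre` by mean user-centered rating (hypothetical helper)."""
    ratings = ratings.copy()
    # Step 1: center each rating on its user's mean rating
    user_mean = ratings.groupby("user")["rating"].transform("mean")
    ratings["centered"] = ratings["rating"] - user_mean
    # Step 2: keep only movies in the requested genre (assumes one genre per movie)
    in_genre = ratings[ratings["movie"].map(movie_genres) == genre]
    # Step 3: drop movies with fewer than `min_ratings` ratings
    stats = in_genre.groupby("movie")["centered"].agg(["mean", "count"])
    stats = stats[stats["count"] >= min_ratings]
    # Step 4: return the top n movies by average centered rating
    return stats.sort_values("mean", ascending=False).head(n)

# Toy data, invented for illustration
ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 3],
    "movie":  [10, 20, 30, 10, 20, 30, 10, 40],
    "rating": [5, 3, 4, 2, 1, 3, 4, 2],
})
movie_genres = {10: "Comedy", 20: "Comedy", 30: "Drama", 40: "Comedy"}
top = top_movies_in_genre(ratings, movie_genres, "Comedy")
```

In this toy data, movie 40 has only one rating, so the threshold removes it before ranking.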

### Proposal 2
1. Again, this system would center all ratings by subtracting each user's mean rating.
2. Then it would calculate the average centered rating for each user for each genre.
3. Then it would select the _m_ users with the top average ratings for the given genre. This should give the users that are the biggest fans of the genre.
4. Then, using only those _m_ users, it would find the average rating for each movie and return the top _n_ movies.
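
A rough pandas sketch of this second proposal is shown here. The proposal does not say whether the final average uses centered ratings or whether it is limited to the genre; this sketch assumes both. The function name `top_movies_by_fans`, the column names, the single-genre `movie_genres` mapping, and the toy data are also assumptions made for illustration.

```python
import pandas as pd

def top_movies_by_fans(ratings, movie_genres, genre, m=2, n=2):
    """Rank movies in `genre` using only the m biggest fans (hypothetical helper)."""
    ratings = ratings.copy()
    # Step 1: center each rating on its user's mean rating
    user_mean = ratings.groupby("user")["rating"].transform("mean")
    ratings["centered"] = ratings["rating"] - user_mean
    in_genre = ratings[ratings["movie"].map(movie_genres) == genre]
    # Steps 2 and 3: average centered rating per user in the genre, keep the top m users
    fans = in_genre.groupby("user")["centered"].mean().nlargest(m).index
    # Step 4: average each movie's centered rating over the fans only (assumption)
    fan_ratings = in_genre[in_genre["user"].isin(fans)]
    movie_means = fan_ratings.groupby("movie")["centered"].mean()
    return movie_means.sort_values(ascending=False).head(n)

# Toy data, invented for illustration
ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "movie":  [10, 20, 30, 10, 20, 30, 10, 40, 30],
    "rating": [5, 4, 1, 2, 3, 5, 4, 5, 3],
})
movie_genres = {10: "Comedy", 20: "Comedy", 30: "Drama", 40: "Comedy"}
top = top_movies_by_fans(ratings, movie_genres, "Comedy")
```

Unlike Proposal 1, nothing here filters out movies with very few ratings, so a movie rated by a single fan (movie 40 in the toy data) can reach the top of the list.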

## System 2

### User based collaborative recommendations

For this algorithm we find the top _m_ users whose ratings are most similar to the new user's ratings using the cosine similarity metric. We then use their ratings to find the top-rated _n_ movies the user has not already rated.
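
To make the similarity step concrete, here is a minimal NumPy sketch of cosine similarity on centered rating vectors. The `cosine` helper, the user names, and the rating values are invented for illustration; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors (illustrative helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Centered ratings over the same three movies (toy values)
new_user = np.array([1.0, -1.0, 0.0])
existing = {
    "alice": np.array([2.0, -2.0, 0.0]),   # same taste, stronger opinions
    "bob":   np.array([-1.0, 1.0, 0.0]),   # opposite taste
    "carol": np.array([0.0, 0.0, 1.0]),    # only has an opinion on the third movie
}
sims = {name: cosine(vec, new_user) for name, vec in existing.items()}
# Most similar users first
nearest = sorted(sims, key=sims.get, reverse=True)
```

Because cosine similarity ignores vector length, a user with the same preferences but more extreme ratings still counts as a perfect match, which is why centering the ratings first matters.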

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from data import load_movies_rated, load_user_ratings, load_movies
from data import load_ratings, load_users
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle
import pandas as pd
import math
import time
from IPython.display import display, HTML

In [2]:
def user_based_collaborative(new_ratings, user_ratings=None, movies=None, num_users=100):
    """
    Get recommendations based on the user's ratings
    """
    user_ratings = user_ratings if user_ratings is not None else load_user_ratings()
    movies = movies if movies is not None else load_movies()
    # Keep only the rated movies that also appear in the training matrix
    rated_movies = [int(m) for m in new_ratings.keys() if int(m) in user_ratings.columns]
    relevant_ratings = user_ratings[rated_movies]
    # Center the new user's ratings on their own mean
    mean_rating = sum(new_ratings.values())/len(new_ratings)
    new_vector = [rating-mean_rating for m, rating in new_ratings.items() if int(m) in user_ratings.columns]
    # Rank training users by cosine similarity to the new user and keep the closest num_users
    similarities = cosine_similarity(relevant_ratings, [new_vector])
    similar_users = [i for sim, i in sorted(zip(similarities[:,0], user_ratings.index), reverse=True)[:num_users]]
    # Score each movie by its mean rating among the similar users
    movie_ratings = user_ratings.loc[similar_users].mean()
    movie_ratings.name = "rating"
    ranked_movies = movies.join(movie_ratings).sort_values('rating', ascending=False)
    # Leave out movies the user has already rated
    return ranked_movies.drop(rated_movies, errors="ignore")

### Item based collaborative recommendations

This approach uses the training ratings to produce a movie-to-movie similarity matrix using the cosine similarity metric. It then takes the movies the user rated highest and ranks the remaining movies by their mean similarity to those favorites.
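
As a small illustration of that scoring rule, the following NumPy sketch builds a movie-to-movie cosine similarity matrix and ranks movies by their mean similarity to a user's favorites. The matrix values and the choice of favorite are invented for illustration, and the similarity is computed by hand here rather than with scikit-learn.

```python
import numpy as np

# Rows are movies, columns are users; toy centered ratings
movie_ratings = np.array([
    [1.0, -1.0, 0.5],    # movie 0
    [1.0, -1.0, 0.0],    # movie 1: liked by the same users as movie 0
    [-1.0, 1.0, 0.0],    # movie 2: liked by the opposite users
])

# Cosine similarity between every pair of movies
unit = movie_ratings / np.linalg.norm(movie_ratings, axis=1, keepdims=True)
sim = unit @ unit.T

favorites = [0]                            # row positions of the user's top-rated movies
scores = sim[favorites].mean(axis=0)       # mean similarity of each movie to the favorites
scores[favorites] = -np.inf                # never recommend the favorites themselves
best = int(np.argmax(scores))
```

Here movie 1 is recommended because it was rated almost identically to the favorite, while movie 2 receives a negative score.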

In [3]:
from data import load_movie_ratings

In [4]:
def item_based_collaborative(new_ratings, movie_ratings=None, movies=None):
    """
    Get recommendations based on similar movies to the user's favorites
    """
    movie_ratings = movie_ratings if movie_ratings is not None else load_movie_ratings()
    movies = movies if movies is not None else load_movies()
    highest_rating = max(new_ratings.values())
    rated_movies = [int(movie) for movie in new_ratings if int(movie) in movie_ratings.index]
    index_list = movie_ratings.index.tolist()
    favorite_movies = [index_list.index(movie) for movie in rated_movies if new_ratings[str(movie)] == highest_rating]
    sims = cosine_similarity(movie_ratings)[favorite_movies,].mean(axis=0)
    ranked = movies.join(pd.DataFrame({"rating": sims}, index=index_list)).sort_values("rating", ascending=False)
    ranked.drop(rated_movies, axis=0, inplace=True)
    return ranked

### Testing the approaches

Since predicted ratings matter less than the system's ability to predict whether or not the user would like a movie, I will treat this as a classification problem where the truth is whether or not the user gave the movie an above-average (for that user) rating. For each split, I will use the training users as the basis of the ratings and the test users as users of the system. Each test user's ratings will be split in half, with the first half used to generate the recommendations and the second half used to test them. I will then compute the ROC-AUC of the scores assigned to the held-out movies relative to whether or not the user gave each movie an above-average rating.
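
The metric for a single user can be illustrated with a small pure-Python sketch. ROC-AUC equals the share of (positive, negative) pairs in which the positive item receives the higher score, with ties counting half, which is the quantity `roc_auc_score` reports. The `pairwise_auc` helper and the toy numbers are assumptions for illustration, and for brevity the user's mean is taken over the held-out ratings only.

```python
def pairwise_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Held-out ratings for one user and the recommender's scores (toy values)
held_out = {10: 5, 20: 2, 30: 4, 40: 1}
scores = {10: 0.9, 20: 0.4, 30: 0.3, 40: 0.1}

user_mean = sum(held_out.values()) / len(held_out)
# Positive class: movies the user rated above their own average
labels = [held_out[m] > user_mean for m in held_out]
auc = pairwise_auc(labels, [scores[m] for m in held_out])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to ranking every liked movie above every disliked one. A user whose held-out ratings are all positive or all negative has no pairs to compare, so such users have to be skipped.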

In [5]:
def get_split(all_ratings, users, index):
    all_ratings = all_ratings.join(users['split'], on="user")
    train = all_ratings[all_ratings['split'] != index]
    test = all_ratings[all_ratings['split'] == index]
    return train, test
    

In [6]:
def test_user(user_ratings, train_ratings, algorithm):
    """
    Test a user's recommendations
    """
    n_ratings = len(user_ratings)
    use_ratings = {str(int(r['movie'])): r['rating'] for i, r in user_ratings[:n_ratings//2].iterrows()}
    test_ratings = {int(r['movie']): r['rating'] for i, r in user_ratings[n_ratings//2:].iterrows()}
    recs = algorithm(use_ratings, train_ratings)
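    # Look up the score for each held-out movie; movies missing from recs become NaN
    rec_ratings = recs['rating'].reindex(list(test_ratings))
    # Movies without a score get the lowest score so they rank last
    rec_ratings = rec_ratings.fillna(recs['rating'].min())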
    mean_rating = user_ratings['rating'].mean()
    pos_rating = [r > mean_rating for r in test_ratings.values()]
    if sum(pos_rating) in {0, len(pos_rating)}:
        return None
    return roc_auc_score(pos_rating, rec_ratings)

def run_test(test_data, train_ratings, algorithm):
    score_sum = 0
    n = 0
    by_user = test_data.groupby('user')
    for u, ratings in by_user:
        user_auc = test_user(ratings, train_ratings, algorithm)
        if user_auc is not None:
            score_sum += user_auc
            n = n+1
    return score_sum/n
        
def run_tests_for_split(split, seed=19820618):
    print(f"For split {split}")
    ratings = load_ratings()
    users = load_users()
    users = shuffle(users, random_state=seed)
    users['split'] = np.repeat(range(100), math.ceil(len(users)/100))[:len(users)]
    train, test = get_split(ratings, users, split)
    user_ratings = load_user_ratings(train)
    movie_ratings = load_movie_ratings(train)
    start = time.time()
    user_based_auc = run_test(test, user_ratings, user_based_collaborative)
    timetook = time.time() - start
    print(f"\tUser based predictions had an AUC of {user_based_auc:.4f} and took {timetook:2f} seconds")
    start = time.time()
    item_based_auc = run_test(test, movie_ratings, item_based_collaborative)
    timetook = time.time() - start
    print(f"\tItem based predictions had an AUC of {item_based_auc:.4f} and took {timetook:2f} seconds")
    return user_based_auc, item_based_auc

In [7]:
ub_auc, ib_auc = zip(*[run_tests_for_split(i) for i in range(10)])

For split 0
	User based predictions had an AUC of 0.7399 and took 6.008794 seconds
	Item based predictions had an AUC of 0.7461 and took 77.623325 seconds
For split 1
	User based predictions had an AUC of 0.7174 and took 5.908779 seconds
	Item based predictions had an AUC of 0.7317 and took 81.749648 seconds
For split 2
	User based predictions had an AUC of 0.7214 and took 5.933233 seconds
	Item based predictions had an AUC of 0.7193 and took 80.095006 seconds
For split 3
	User based predictions had an AUC of 0.7157 and took 6.286104 seconds
	Item based predictions had an AUC of 0.7120 and took 79.540403 seconds
For split 4
	User based predictions had an AUC of 0.7018 and took 6.277930 seconds
	Item based predictions had an AUC of 0.7184 and took 83.305579 seconds
For split 5
	User based predictions had an AUC of 0.7233 and took 5.685016 seconds
	Item based predictions had an AUC of 0.7414 and took 83.677603 seconds
For split 6
	User based predictions had an AUC of 0.7111 and took 6.03

#### Results

In [8]:
print(f"User based Average: {sum(ub_auc)/len(ub_auc)}")
print(f"Item based Average: {sum(ib_auc)/len(ib_auc)}")
results = pd.DataFrame({"Split": range(1, 11), "User Based": ub_auc, "Item Based": ib_auc})
results.set_index("Split", inplace=True)
display(HTML(results.to_html()))

User based Average: 0.7180734878066313
Item based Average: 0.7290629974841013


| Split | User Based | Item Based |
|-------|------------|------------|
| 1 | 0.739889 | 0.746109 |
| 2 | 0.717423 | 0.73171 |
| 3 | 0.721366 | 0.719327 |
| 4 | 0.715675 | 0.711961 |
| 5 | 0.701808 | 0.718358 |
| 6 | 0.723324 | 0.741372 |
| 7 | 0.711096 | 0.715205 |
| 8 | 0.714821 | 0.732675 |
| 9 | 0.723424 | 0.738651 |
| 10 | 0.71191 | 0.735262 |


Both approaches produce similar results, with item-based prediction slightly outperforming user-based prediction (average AUC of 0.729 versus 0.718). However, the user-based predictions take over an order of magnitude less time (roughly 6 seconds versus 80 seconds per split), so that is the approach I will go with.