# User-based Collaborative Filtering

Collaborative filtering is a method that can be used when building a recommender system. Collaborative filtering is the process of using data about a pool of users to recommend items to a specific user. This collaborative filtering comes in two flavors: user-based or item-based. 
- **User-based collborative filtering** is when you find users that are similar to a specific user. Then, items that are well-liked by the other users are recommended to the specific user. 
- **Item-based collaborative filtering** is when you find an item that a specific user likes. With that information, you can find other users who liked the same item and then recommend to the specific user additional items that were liked by the other users.

In this notebook, we'll look at the MovieLens 100k dataset. This is a dataset that has many users and their ratings of various movies. We'll perform user-based collaborative filtering. That is, we will identify users who gave similar movie ratings and then recommend movies that any given user may like and has not yet seen/rated.

## Try it!

In [143]:
print("View your top recommended movies!\n")

print("User IDs are numbers from 1-943.")
while True:
    try:
        user_id = int(input("Input a User ID: "))
        if 1 <= user_id <= 943:
            break
        else:
            print("Input out of range. Please try again.")
    except ValueError:
        print("Invalid input. Please enter an integer.")

print(f"\nSome of User {user_id}'s favorite movies are:\n")
for i in top_user_movies[user_id][:10]:
    print(movie_details[movie_details['movie_id'] == i]['movie_title'].iloc[0])

print(f"\nThe top 10 recommended movies for User {user_id} are:\n")
for i in top_unwatched_recommended_movies[user_id][:10]:
    print(movie_details[movie_details['movie_id'] == i]['movie_title'].iloc[0])

View your top recommended movies!

User IDs are numbers from 1-943.
Input a User ID: 943

Some of User 943's favorite movies are:

GoldenEye (1995)
Usual Suspects, The (1995)
Clerks (1994)
Professional, The (1994)
Pulp Fiction (1994)
Shawshank Redemption, The (1994)
Forrest Gump (1994)
Fugitive, The (1993)
True Romance (1993)
Silence of the Lambs, The (1991)

The top 10 recommended movies for User 943 are:

Braveheart (1995)
Empire Strikes Back, The (1980)
Terminator 2: Judgment Day (1991)
Schindler's List (1993)
Star Wars (1977)
Seven (Se7en) (1995)
Raiders of the Lost Ark (1981)
Aladdin (1992)
Return of the Jedi (1983)
Twelve Monkeys (1995)


## Logic Used

### Imports

In [103]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

### Data Ingestion

Data is pulled into an Excel doc for this project. Movie ratings, movie details, and user details are on three different tabs

In [104]:
movie_ratings = pd.read_excel('MovieLens 100k Dataset.xlsx', sheet_name = 0)
movie_details = pd.read_excel('MovieLens 100k Dataset.xlsx', sheet_name = 1)
user_details = pd.read_excel('MovieLens 100k Dataset.xlsx', sheet_name = 2)

### User Similarity
First, I created a matrix with user IDs in the rows and movie IDs in the columns. The values in the matrix are the rating that a user gave to a movie. It is assumed in this dataset that a user has rated every movie that they have watched. If a movie is unwatched, it will have a value of 0 in the matrix.

User similarity is then calculated by using cosine similarity. Cosine similarity looks at the user vectors and provides a score from 0 to 1 in likeness of direction of a user's vector to other users.

In [105]:
user_movie_matrix = pd.pivot_table(movie_ratings, values='rating', index='user_id', columns='movie_id', fill_value=0)

cosine_sim = cosine_similarity(user_movie_matrix)
user_cosine_sim = pd.DataFrame(cosine_sim, index=user_movie_matrix.index, columns=user_movie_matrix.index)

### Find Top Similar Users
Next, for every user, I am finding the top 10 users in simlarity score. I am storing this information in a dictionary.

In [106]:
top_similar_users = {}
for index, user_similarities in user_cosine_sim.iterrows():
    sorted_similarities = user_similarities.sort_values(ascending=False)
    top10_similar_users = sorted_similarities.drop(index).index[:10]
    top_similar_users[index] = list(top10_similar_users)

### Find Top Movie by User
Here I am finding the top movies (those that are rated a 5) for each user and storing them in another dictionary.

In [107]:
top_user_movies = {}
for user_id, row in user_movie_matrix.iterrows():
    user_rated_movies = user_movie_matrix.loc[user_id]
    user_rated_movies = user_rated_movies[user_rated_movies == 5].index[:]
    top_user_movies[user_id] = list(user_rated_movies)

### Find All Recommended Movies
Then, I am going through each user's similar users and creating a list that tracks all top movies (rated a 5/5) liked by the similar users. This, again, is stored in a dictionary.

In [108]:
top_recommended_movies = {}
for user_id, user_similar_users in top_similar_users.items():
    movie_list_temp = []
    for similar_user_id in user_similar_users:
        movie_list_temp.append(top_user_movies[similar_user_id])
    movie_counter = Counter([movie_id for movie_sublist in movie_list_temp for movie_id in movie_sublist])
    sorted_movie_list = sorted(movie_counter.keys(), key=lambda x: (movie_counter[x]), reverse=True)
    top_recommended_movies[user_id] = sorted_movie_list

### Find Unwatched Recommended Movies
Finally, now we have a list of a user's already watched movies and a list of that user's recommended movies from similar users. 
We don't want to recommend movies that the given user has already seen, so we remove those from the recommended movies list.

In [109]:
top_unwatched_recommended_movies = {}
for user_id, values in top_recommended_movies.items():
    top_unwatched_recommended_movies[user_id] = [item for item in top_recommended_movies[user_id] if item not in top_user_movies[user_id]]