# Item-based Collaborative Filtering

Collaborative filtering is a technique for building recommender systems: it uses data about a pool of users to recommend items to a specific user. It comes in two flavors: user-based and item-based.
- **User-based collaborative filtering** finds users who are similar to a specific user, then recommends items that those similar users liked.
- **Item-based collaborative filtering** starts from an item that a specific user likes. It then finds other items that tend to be liked by the same users who liked that item and recommends them to the specific user.

In this notebook, we'll perform item-based collaborative filtering on the MovieLens 100k dataset, which contains 100,000 ratings of 1,682 movies by 943 users.

## Try it!

In [62]:
print("View your top recommended movies!\n")

print("User IDs are numbers from 1-943.")
while True:
    try:
        user_id = int(input("Input a User ID: "))
        if 1 <= user_id <= 943:
            break
        else:
            print("Input out of range. Please try again.")
    except ValueError:
        print("Invalid input. Please enter an integer.")

print(f"\nSome of User {user_id}'s favorite movies are:\n")
print("ID | Title")
print("----------")
for i in top_user_movies[user_id][:10]:
    movie = movie_details[movie_details['movie_id'] == i][['movie_id', 'movie_title']].values
    print(movie[0][0], movie[0][1])

while True:
    try:
        movie_id = int(input("Input a Movie ID to see similar movies: "))
        if 1 <= movie_id <= 1682:
            break
        else:
            print("Input out of range. Please try again.")
    except ValueError:
        print("Invalid input. Please enter an integer.")

print(f"\nThe top similar movies to {movie_details.loc[movie_details['movie_id'] == movie_id, 'movie_title'].values[0]} that User {user_id} has not seen are:\n")

# Movies the user has already rated (a rating of 0 means unrated).
# This is computed once, before the loop, because it does not change.
user_ratings = user_movie_matrix.loc[user_id]
user_rated_movies = set(user_ratings[user_ratings > 0].index)

movie_counter = 0
for i in top_similar_movies[movie_id]:
    if movie_counter == 10:
        break
    if i not in user_rated_movies:
        movie = movie_details[movie_details['movie_id'] == i][['movie_id', 'movie_title']].values
        print(movie[0][0], movie[0][1])
        movie_counter += 1
if movie_counter == 0:
    print("No Suggestions.")


View your top recommended movies!

User IDs are numbers from 1-943.

Some of User 12's favorite movies are:

ID | Title
----------
4 Get Shorty (1995)
15 Mr. Holland's Opus (1995)
28 Apollo 13 (1995)
69 Forrest Gump (1994)
88 Sleepless in Seattle (1993)
97 Dances with Wolves (1990)
98 Silence of the Lambs, The (1991)
132 Wizard of Oz, The (1939)
143 Sound of Music, The (1965)
157 Platoon (1986)

The top similar movies to Wizard of Oz, The (1939) that User 12 has not seen are:

423 E.T. the Extra-Terrestrial (1982)
496 It's a Wonderful Life (1946)
483 Casablanca (1942)
135 2001: A Space Odyssey (1968)
419 Mary Poppins (1964)
357 One Flew Over the Cuckoo's Nest (1975)
99 Snow White and the Seven Dwarfs (1937)
173 Princess Bride, The (1987)
210 Indiana Jones and the Last Crusade (1989)
22 Braveheart (1995)


## Logic Used

In [1]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

### Data Ingestion

For this project, the data is stored in an Excel workbook. Movie ratings, movie details, and user details are on three separate sheets.

In [53]:
# Read all three sheets in one pass; the result is a dict keyed by sheet position
sheets = pd.read_excel('MovieLens 100k Dataset.xlsx', sheet_name=[0, 1, 2])
movie_ratings = sheets[0]
movie_details = sheets[1]
user_details = sheets[2]

### Movie Similarity
First, I create a matrix with user IDs as rows and movie IDs as columns. Each value is the rating that a user gave to a movie. I assume that a user has rated every movie they have watched, so an unrated movie is treated as unwatched and gets a value of 0 in the matrix.
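
To make the matrix layout concrete, here is a minimal sketch on a made-up ratings table. The `toy_ratings` and `toy_matrix` names and all values are illustrative assumptions, not part of the MovieLens data.

```python
import pandas as pd

# Toy ratings in the same long format as movie_ratings (values are made up).
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 5],
})

# Rows become users and columns become movies.
# User/movie pairs with no rating are filled with 0.
toy_matrix = pd.pivot_table(
    toy_ratings, values="rating", index="user_id", columns="movie_id", fill_value=0
)
print(toy_matrix)
```

User 2 never rated movies 20 or 30, so those cells hold 0. Note that `pd.pivot_table` averages the values by default if the same user/movie pair appears more than once.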

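Movie similarity is then calculated using cosine similarity. Each movie is represented by its column of ratings (its vector), and cosine similarity scores how closely two movies' vectors point in the same direction. Because ratings are never negative, the score ranges from 0 (no raters in common) to 1 (same direction).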
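
As a small worked example, the sketch below computes the cosine of the angle between two vectors directly from its definition (dot product divided by the product of the vector lengths). The `cosine` helper and the three rating vectors are hypothetical and only illustrate the calculation. The notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each vector holds one movie's ratings from three hypothetical users (0 = unrated).
movie_a = [5, 4, 0]
movie_b = [3, 0, 2]
movie_c = [0, 0, 5]

print(round(cosine(movie_a, movie_b), 3))  # 0.65 (one rater in common)
print(round(cosine(movie_a, movie_c), 3))  # 0.0  (no raters in common)
```

This helper does not guard against an all-zero vector, which would cause a division by zero. That case does not arise for a movie that has at least one rating.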

In [54]:
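# Rows are users, columns are movies; unrated pairs are filled with 0
user_movie_matrix = pd.pivot_table(movie_ratings, values='rating', index='user_id', columns='movie_id', fill_value=0)

# Transpose so that each row is a movie vector, then compare every pair of movies
cosine_sim = cosine_similarity(user_movie_matrix.T)
movie_cosine_sim = pd.DataFrame(cosine_sim, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)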

### Find Top Similar Movies
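Next, for every movie, I rank the other movies by similarity score and store the 50 most similar in a dictionary. Keeping more than 10 leaves enough candidates to show 10 suggestions after the movies a user has already rated are filtered out.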

In [55]:
top_similar_movies = {}
for index, movie_similarities in movie_cosine_sim.iterrows():
    sorted_similarities = movie_similarities.sort_values(ascending=False)
    # Drop the movie itself, then keep the 50 most similar movies
    top50_similar_movies = sorted_similarities.drop(index).index[:50]
    top_similar_movies[index] = list(top50_similar_movies)
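
The ranking step in the loop above can be tried on a small scale. This is a minimal sketch on a made-up similarity matrix. `toy_sim`, `top_k_similar`, and the values are assumptions for illustration, and it uses `Series.nlargest` in place of a full sort.

```python
import pandas as pd

# Hypothetical 4x4 similarity matrix indexed by movie ID (values are made up).
ids = [10, 20, 30, 40]
toy_sim = pd.DataFrame(
    [[1.0, 0.2, 0.9, 0.5],
     [0.2, 1.0, 0.4, 0.7],
     [0.9, 0.4, 1.0, 0.1],
     [0.5, 0.7, 0.1, 1.0]],
    index=ids, columns=ids,
)

def top_k_similar(sim, k):
    """Map each movie ID to its k most similar other movies, best first."""
    return {
        movie_id: list(row.drop(movie_id).nlargest(k).index)
        for movie_id, row in sim.iterrows()
    }

toy_top = top_k_similar(toy_sim, k=2)
print(toy_top)
```

Here movie 10 maps to movies 30 and 40, in that order. Dropping the movie by its label, rather than skipping the first sorted entry, keeps the result correct even if another movie ties with a score of 1.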

### Find Top Movies by User
Here I find the top movies (those rated a 5) for each user and store them in another dictionary.

In [57]:
top_user_movies = {}
for user_id, user_ratings in user_movie_matrix.iterrows():
    # Keep the IDs of the movies this user rated a 5
    top_user_movies[user_id] = list(user_ratings[user_ratings == 5].index)
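
The interactive cell under "Try it!" combines these pieces by walking a movie's list of similar movies and skipping anything the user has already rated. One way to package that step as a reusable function is sketched below. The `recommend_unseen` name, its parameters, and the toy inputs are hypothetical and do not appear in the notebook.

```python
import pandas as pd

def recommend_unseen(user_id, movie_id, ratings_matrix, similar_movies, n=10):
    """Return up to n movies similar to movie_id that user_id has not rated."""
    user_row = ratings_matrix.loc[user_id]
    seen = set(user_row[user_row > 0].index)  # a rating of 0 means unrated
    picks = [m for m in similar_movies[movie_id] if m not in seen]
    return picks[:n]

# Toy inputs (made up): 2 users x 4 movies, and a precomputed neighbor list.
toy_matrix = pd.DataFrame(
    [[5, 0, 4, 0],
     [0, 3, 0, 5]],
    index=pd.Index([1, 2], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)
toy_similar = {10: [30, 40, 20]}

print(recommend_unseen(1, 10, toy_matrix, toy_similar, n=2))  # [40, 20]
```

User 1 has already rated movie 30, so it is skipped and the next two candidates are returned. With the notebook's own objects, the call would look like `recommend_unseen(12, 132, user_movie_matrix, top_similar_movies)`.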