### Recommendation systems basics
In the following, we go through the basics of creating recommendations with the tools available in Python. We start with some basic analysis that identifies similar movies as the basis of recommendation, and then look more specifically at collaborative filtering (we will briefly look at content-based recommendations in the last week, once we are familiar with text analysis).

In [None]:
# Here we load the required libraries
import pandas as pd
import numpy as np

In [None]:
# In the analysis we will work with a dataset we have used before, the MovieLens data
# It contains users' ratings of the movies they have watched in the past
# We have already created one type of recommendation using association rules
# Here we also start with some general ideas on how to find similar movies based on ratings,
# but eventually we want to create personalized recommendations, not only general rules

# We have one file with all the information already combined

ratings = pd.read_csv('user_ratings.csv')

print(ratings.head())

In [None]:
# Before we come to the recommendations, we can perform some data exploration
# First we look at the most frequently rated movies
# This could also be seen as the most trivial solution: recommending the most watched movie
# In this case Forrest Gump

ratings.title.value_counts()[:10]

In [None]:
# We can improve this slightly if we consider also rating
# First we can calculate the mean of the ratings given to each title
mean_rating = ratings[["title", "rating"]].groupby('title').mean()

# Order the entries by highest average rating to lowest
sorted_mean_rating = mean_rating.sort_values(by="rating", ascending=False)

# We can check the best movies
# We have a problem: these are movies that were rated by only a handful of people,
# which is most likely not a good basis for recommendations

sorted_mean_rating.head(10)

In [None]:
# To correct this problem, we can filter out movies that were not rated by a sufficient number of users

# First we create a list of movies appearing > 50 times in the dataset
movie_popularity = ratings["title"].value_counts()
popular_movies = movie_popularity[movie_popularity > 50].index

# We will have 437 movies left
print(popular_movies)

In [None]:
# We can use this list to filter the original DataFrame
popular_movies_rankings = ratings[ratings["title"].isin(popular_movies)]

# We can now calculate average rating again for this filtered list
popular_movies_average = popular_movies_rankings[["title", "rating"]].groupby('title').mean()

# Based on this, we would recommend The Shawshank Redemption, the highest-rated movie among those rated more than 50 times
print(popular_movies_average.sort_values(by="rating", ascending=False).head(10))
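
The count filter and the average can also be computed in a single pass. The following is only a minimal sketch on a made-up toy table; the titles, ratings and the `min_ratings` threshold are illustrative assumptions, not values from the MovieLens file.

```python
import pandas as pd

# Toy data standing in for the ratings file (illustrative values only)
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "C", "C"],
    "rating": [4.0, 5.0, 3.0, 5.0, 2.0, 4.0],
})

# Number of ratings and mean rating per title in one groupby
stats = toy.groupby("title")["rating"].agg(["count", "mean"])

# Keep only titles with enough ratings, then rank them by mean rating
min_ratings = 2  # hypothetical threshold
top = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(top)
```

Movie B has the highest mean but only one rating, so it is filtered out before the ranking.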

In [None]:
# A simple question to ask is: if somebody has just finished watching a specific movie, what should we recommend next?
# A simple strategy to answer this is to check which other movie has been watched
# most frequently together with that movie

# For this, we need to create all the pairs of movies that have been rated together by at least one user
# We will make use of a function that is imported here

from itertools import permutations

# To see how this works, let's try it for only one user

data_1 = ratings[ratings.userId == 1]

# Using permutations, we can create a list in which each element is an ordered pair of movies that have both been rated by the user
# If we do this for all the users, we can count how many users have rated a given pair of movies

list(permutations(data_1.title.values, 2))

In [None]:
# We can create a function that creates the pairs and arranges them in two columns of a dataframe
def find_movie_pairs(x):
  movie_pairs = pd.DataFrame(list(permutations(x.values, 2)), columns=['movie_a', 'movie_b'])
  return movie_pairs

# We can apply this to title after we group by the user column
movie_combinations = ratings.groupby('userId')['title'].apply(find_movie_pairs)

# .reset_index(drop=True)

# This takes a bit of time, and we end up with 60 million pairs
print(movie_combinations)

In [None]:
# If we group by the two columns, we will get the counts for all the pairs

combination_counts = movie_combinations.groupby(['movie_a', 'movie_b']).size()

combination_counts.head()

In [None]:
# This is a Series; we can convert it to a DataFrame and use it in that form
# We need to reset the index to turn the movie titles into the first two columns

combination_counts_df = combination_counts.to_frame(name='size').reset_index()

combination_counts_df.head()

In [None]:
# Then we can find what to recommend to somebody who has just watched Batman Begins
# First we keep the rows where movie_a is 'Batman Begins (2005)',
# then sort them by size and check the top
# The Lord of the Rings would be a good choice

combination_counts_df[combination_counts_df['movie_a'] == 'Batman Begins (2005)'].sort_values('size', ascending = False).head(10)
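
The pair-counting idea can be seen more easily on a tiny example. This is only a sketch with made-up watch histories (the `histories` dictionary and the titles are hypothetical), and it uses `collections.Counter` instead of the DataFrame approach above.

```python
from collections import Counter
from itertools import permutations

# Hypothetical watch histories: user id -> titles rated
histories = {
    1: ["A", "B", "C"],
    2: ["A", "B"],
    3: ["B", "C"],
}

# permutations yields ordered pairs, so (A, B) and (B, A) are both counted;
# that is what lets us filter on the first element alone
pair_counts = Counter()
for titles in histories.values():
    pair_counts.update(permutations(titles, 2))

# Movies most often rated together with "A"
after_a = {b: n for (a, b), n in pair_counts.items() if a == "A"}
print(sorted(after_a.items(), key=lambda kv: kv[1], reverse=True))
```

Because every unordered pair appears twice, the table is twice as large as strictly needed; the benefit is that a lookup only has to filter on one column.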

### Collaborative filtering
The last method still does not consider ratings, only frequency. A more general model, as introduced in the lecture, is collaborative filtering. It is based on the idea that we can find similar users or items based on the similarity of their ratings. For example, user-based collaborative filtering assumes that if two users gave similar ratings in the past, they are likely to rate future movies similarly.
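
To make "similarity of ratings" concrete, here is a minimal sketch that compares three made-up users with cosine similarity. It assumes the ratings are expressed as deviations from each user's own average; the user names and values are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical mean-centered ratings of three users for the same four movies
alice = np.array([1.0, -1.0, 0.5, -0.5])
bob = np.array([2.0, -2.0, 1.0, -1.0])    # same tastes as alice, stronger opinions
carol = np.array([-1.0, 1.0, -0.5, 0.5])  # opposite tastes

sim_ab = cosine(alice, bob)
sim_ac = cosine(alice, carol)
print(sim_ab, sim_ac)
```

Values close to 1 indicate users whose ratings move together, and values close to -1 indicate opposite tastes. Cosine similarity ignores the length of the vectors, so alice and bob count as fully similar even though bob's opinions are stronger.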

In [None]:
# As the basis of collaborative filtering, we need to create the rating matrix:
# Rows are users and columns are movies, and each element is the rating for the given user/movie pair if available
# This is actually a pivot table based on the columns userId, title and rating

ratings_table = ratings.pivot(index='userId', columns='title', values='rating')

# As you can see, there is a problem: pivot fails with an error because
# some users rated the same movie several times, so the index contains duplicate entries
# We need to identify and exclude these before we can create the ratings table

print(ratings_table.head())

In [None]:
# To make it simpler, we first create a DataFrame with the three columns of interest
# (.copy() gives an independent DataFrame that we can safely modify later)

ratings_test = ratings[['userId', 'title', 'rating']].copy()

# We can see that there are three duplicated values
ratings_test[ratings_test.duplicated()]


In [None]:
# We can then drop the duplicates; we keep the last entry, as that should be the more recent rating
# Assigning the result back avoids modifying a slice of the original DataFrame in place

ratings_test = ratings_test.drop_duplicates(subset=['userId', 'title'], keep='last')

# Now we should be able to create the ratings table

ratings_table = ratings_test.pivot(index='userId', columns='title', values='rating')

print(ratings_table.head())

In [None]:
# The next question is how to fill in missing values
# (Specifying 0 for a missing value is not good, why?)
# We will use a simple approach: 

# First we calculate the mean rating for each user

avg_ratings = ratings_table.mean(axis=1)

avg_ratings.head()

In [None]:
# Second, we subtract this value from the available ratings

ratings_table_centered = ratings_table.sub(avg_ratings, axis=0)

# As a check, the average rating of each user is now 0 (or a very small number due to floating-point rounding),
# which means that 0 becomes a neutral value

ratings_table_centered.mean(axis=1)

In [None]:
# We can fill in the missing values with 0 now

ratings_table_normed = ratings_table_centered.fillna(0)

ratings_table_normed.head()
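
The order of the two steps matters. The sketch below, on a made-up two-user table (all names and values are hypothetical), contrasts filling with 0 before centering and after centering.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: u1 has not rated M3
toy = pd.DataFrame(
    {"M1": [5.0, 4.0], "M2": [3.0, 2.0], "M3": [np.nan, 3.0]},
    index=["u1", "u2"],
)

# Filling first: u1's unseen M3 reads as a rating of 0,
# lower than anything u1 actually gave
filled_raw = toy.fillna(0)

# Centering first: a missing rating becomes 0, i.e. "same as this user's average"
centered = toy.sub(toy.mean(axis=1), axis=0).fillna(0)

print(filled_raw)
print(centered)
```

After centering, both users have the same row, and the missing value no longer pulls u1 away from u2.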

In [None]:
# Now that we have the table ready, we can identify similar users and movies
#  For this we need the cosine similarity measure

from sklearn.metrics.pairwise import cosine_similarity

#  We can calculate cosine similarity for all pairs of users

similarities = cosine_similarity(ratings_table_normed)

similarities

In [None]:
# We can create a dataframe based on the similarities

cosine_similarity_df = pd.DataFrame(similarities, index=ratings_table_normed.index, columns=ratings_table_normed.index)

cosine_similarity_df.head()

In [None]:
# Then we can find the most similar users for example to user 1
#  First we select the row 
cosine_similarity_1 = cosine_similarity_df.loc[1]

# We can sort these values highest to lowest
ordered_similarities = cosine_similarity_1.sort_values(ascending=False)

# Apart from user 1 itself, the most similar user is 301
# So if there is a movie that 301 has already rated and 1 has not, we can use 301's rating as the basis of recommendation
print(ordered_similarities)
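
One possible way to turn such similarities into a prediction is a similarity-weighted average over the most similar users who rated the movie. The sketch below is an assumption about how this could be done, not the method used in this notebook; `predict_centered`, the toy table and the similarity values are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mean-centered ratings (NaN = not rated)
centered = pd.DataFrame(
    {"M1": [1.0, 1.0, -1.0], "M2": [-1.0, -1.0, 1.0], "M3": [np.nan, 0.5, -0.5]},
    index=["u1", "u2", "u3"],
)
# Hypothetical similarities of u1 to the other users
sims_to_u1 = pd.Series({"u2": 0.9, "u3": -0.8})

def predict_centered(target_movie, sims, table, k=1):
    """Similarity-weighted average over the k most similar users who rated the movie."""
    rated = table[target_movie].dropna()
    neighbours = sims.reindex(rated.index).dropna()
    # Only positively similar users are used as neighbours
    neighbours = neighbours[neighbours > 0].nlargest(k)
    if neighbours.empty:
        return 0.0  # fall back to the neutral value, i.e. the user's own average
    return float((rated[neighbours.index] * neighbours).sum() / neighbours.sum())

pred = predict_centered("M3", sims_to_u1, centered)
print(pred)
```

The result is on the centered scale, so the target user's average rating would have to be added back to obtain a rating on the original scale.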

In [None]:
# We can do the same thing from the perspective of the movies
# We simply need to swap the roles of users and movies,
# which can be done by transposing the ratings table

movie_ratings = ratings_table_normed.T

similarities_movies = cosine_similarity(movie_ratings)

cosine_similarity_movies = pd.DataFrame(similarities_movies, index=movie_ratings.index, columns=movie_ratings.index)

cosine_similarity_movies.head()

In [None]:
# Now we can check which movies are most similar to Batman Begins based on ratings

cosine_similarity_bb = cosine_similarity_movies.loc['Batman Begins (2005)']


ordered_similarities = cosine_similarity_bb.sort_values(ascending=False)

#  The most similar movies include some that we would expect
print(ordered_similarities[:15])