User-based collaborative filtering. This technique uses other users' ratings to recommend items to the input user. It finds users whose preferences and opinions are similar to the input user's and then recommends items that those users liked. <br>

I am going to use a simple method here, based on the Pearson correlation and implemented with pandas. It leaves out many topics, but it is a good way to get acquainted with recommender systems. <br>

P.S. In this notebook I use an updated version of the dataset from [here](https://github.com/zygmuntz/goodbooks-10k).

In [None]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [None]:
books_df = pd.read_csv('https://raw.githubusercontent.com/kungfumas/stbi/master/books.csv')
ratings_df = pd.read_csv('https://raw.githubusercontent.com/kungfumas/stbi/master/bookratings.csv')

In [None]:
books_df.head()

In [None]:
books_df.info()

In [None]:
books_df.sample(5)

In [None]:
ratings_df.info()

In [None]:
ratings_df.head()

In [None]:
ratings_df.sample(5)

As a fan of literature, I prepared my own list of some of my favourite books (you can also use any user from the dataset):

   - Jack London _"Martin Eden"_ <br>
   - Franz Kafka _"The Trial"_ <br>
   - Stephen King _"Pet Sematary"_ <br>
   - Gabriel García Márquez _"One Hundred Years of Solitude"_ <br>
   - Charles Bukowski _"Ham on Rye"_ <br>
   - John Steinbeck _"The Grapes of Wrath"_ <br>
   - Kurt Vonnegut _"Cat's Cradle"_ <br>
   - Fyodor Dostoevsky _"Crime and Punishment"_ <br>

Below are my ratings for these books.

In [None]:
my_list = {'Martin Eden': 5,
            'Pet Sematary': 5,
            'One Hundred Years of Solitude': 5,
            'Ham on Rye': 5,
            'The Grapes of Wrath': 4, 
            "Cat's Cradle": 5,
            'Crime and Punishment': 4,
            'The Trial': 4}

In [None]:
# Create dataframe for new user (me)
user_books = pd.DataFrame(columns=['title', 'rating'], data=my_list.items())
user_books

In [None]:

# Add book_id from books_df
new_user = pd.merge(user_books, books_df, on='title', how='inner')
new_user = new_user[['book_id', 'title', 'rating']].sort_values(by='book_id')
new_user

Select the users who have rated books I have read.

In [None]:
other_users = ratings_df[ratings_df['book_id'].isin(new_user['book_id'].values)]
other_users

In [None]:
other_users['user_id'].nunique()

18637 users have rated at least one book from my list.

In [None]:
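# Sort users by the number of books they have in common with me.
# Group by the column name (not a one-element list) so that each key is a plain user_id;
# recent pandas versions return tuple keys when grouping by a list.
users_mutual_books = other_users.groupby('user_id')
users_mutual_books = sorted(users_mutual_books, key=lambda x: len(x[1]), reverse=True)
users_mutual_books[0]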

I will use only the first 100 records, that is, the users who share the most books with me.

In [None]:
top_users = users_mutual_books[:100]
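
The ranking step above can also be sketched without pandas. This is only an illustration on made-up data: the `ratings` pairs and the `my_books` set are hypothetical and do not come from the dataset.

```python
from collections import Counter

# Hypothetical (user_id, book_id) pairs; not taken from the real dataset
ratings = [(1, 10), (1, 20), (1, 30), (2, 10), (3, 20), (3, 30), (3, 40)]
my_books = {10, 20, 30}

# Count, per user, how many of my books they have also rated
mutual_counts = Counter(user for user, book in ratings if book in my_books)

# Users with the largest overlap come first; keep the top two here
top = [user for user, _ in mutual_counts.most_common(2)]
print(top)
```

Book 40 is not on my list, so it does not count towards user 3's overlap.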

Now let's calculate a similarity score for each user with the Pearson correlation; these scores will serve as weights later.

In [None]:
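# Pearson correlation
from scipy.stats import pearsonr

pearson_corr = {}

for user_id, books in top_users:
    # Sort by book_id so both rating arrays are in the same book order
    books = books.sort_values(by='book_id')
    book_list = books['book_id'].values

    # My ratings for the books this user has also rated (new_user is already sorted by book_id)
    new_user_ratings = new_user[new_user['book_id'].isin(book_list)]['rating'].values
    # books already contains only books from my list, so no further filtering is needed
    user_ratings = books['rating'].values

    # pearsonr returns (correlation, p-value); keep the correlation.
    # It is NaN when either user gave all shared books the same rating.
    corr = pearsonr(new_user_ratings, user_ratings)
    pearson_corr[user_id] = corr[0]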
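
To make the similarity score less of a black box, here is a minimal pure-Python sketch of the Pearson correlation on made-up ratings. The `pearson` helper and the three rating lists are my own illustration, not part of the dataset or of scipy; as far as I know it computes the same quantity as the first value returned by `pearsonr`.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A user who gave every book the same rating has no variance,
        # so the correlation is undefined
        return float('nan')
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings of the same four books
me = [5, 4, 5, 3]
similar_user = [4, 3, 4, 2]    # same taste, just one star stricter
opposite_user = [2, 4, 1, 5]

print(pearson(me, similar_user))   # close to 1
print(pearson(me, opposite_user))  # negative
print(pearson(me, [5, 5, 5, 5]))   # nan
```

Note that the correlation ignores how strict a user is: someone who rates everything one star lower than me still counts as a very similar user.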

In [None]:
# Get top50 users with the highest similarity indices
pearson_df = pd.DataFrame(columns=['user_id', 'similarity_index'], data=pearson_corr.items())
pearson_df = pearson_df.sort_values(by='similarity_index', ascending=False)[:50]
pearson_df

In [None]:
# Get all books for these users and add weighted book's ratings
users_rating = pearson_df.merge(ratings_df, on='user_id', how='inner')
users_rating['weighted_rating'] = users_rating['rating'] * users_rating['similarity_index']
users_rating

All that remains is to calculate the average recommendation score and select the items with the highest score. <br>

In [None]:
# Calculate sum of similarity index and weighted rating for each book
grouped_ratings = users_rating.groupby('book_id').sum()[['similarity_index', 'weighted_rating']]

In [None]:
recommend_books = pd.DataFrame()

# Add average recommendation score
recommend_books['avg_reccomend_score'] = grouped_ratings['weighted_rating']/grouped_ratings['similarity_index']
recommend_books['book_id'] = grouped_ratings.index
recommend_books = recommend_books.reset_index(drop=True)

# Keep only the books with the highest possible score (5), allowing for floating-point rounding
recommend_books = recommend_books[np.isclose(recommend_books['avg_reccomend_score'], 5)]
recommend_books
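
The score computed above is a similarity-weighted average of the neighbours' ratings. A minimal sketch on made-up numbers (the similarities, book names, and ratings below are hypothetical) shows the arithmetic:

```python
from collections import defaultdict

# Hypothetical (similarity, book, rating) triples for three similar users
neighbour_ratings = [
    (0.9, 'book_a', 5), (0.9, 'book_b', 3),
    (0.6, 'book_a', 5), (0.6, 'book_b', 4),
    (0.3, 'book_b', 5),
]

weighted_sum = defaultdict(float)
similarity_sum = defaultdict(float)
for similarity, book, rating in neighbour_ratings:
    weighted_sum[book] += similarity * rating
    similarity_sum[book] += similarity

# score = sum(similarity * rating) / sum(similarity)
scores = {book: weighted_sum[book] / similarity_sum[book] for book in weighted_sum}
print(scores)
```

A book that every neighbour rated 5 scores 5 whatever the weights are, while for `book_b` the low rating from the most similar user pulls the score down. Because the division is done in floating point, a score that should be 5 can differ from it by a rounding error, so comparing with a tolerance is safer than an exact `== 5`.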

Let's have a look at a list of 10 recommendations.

In [None]:
# Let's see our recommendations
recommendation = books_df[books_df['book_id'].isin(recommend_books['book_id'])][['authors', 'title', 'book_id']].sample(10)
recommendation

In [None]:
"""
User object attributes:

    user likes
    user played(audio)
    users liked audio(from other people)
    users interacted tags
    user lat
    user lng
    user replies
"""
import pandas as pd
import numpy as np
# Pearson correlation
from scipy.stats import pearsonr

user_info = {'Martin Eden': 5,
            'Pet Sematary': 5,
            'One Hundred Years of Solitude': 5,
            'Ham on Rye': 5,
            'The Grapes of Wrath': 4, 
            "Cat's Cradle": 5,
            'Crime and Punishment': 4,
            'The Trial': 4}

In [None]:
ratings_df

In [None]:
def user_colab(user, others, db_features):
    # Create dataframe for new user (me)
    user = pd.DataFrame(columns=['hashtag', 'likes'], data=user.items())

    # Add book_id from db_features
    new_user = pd.merge(user, db_features, on='hashtag', how='inner')
    new_user = new_user[['book_id', 'hashtag', 'likes']].sort_values(by='book_id')

    other_users = others[others['book_id'].isin(new_user['book_id'].values)]

    # Sort users by count of most mutual books with me and keep the first 100
    users_mutual_books = other_users.groupby('user_id')
    users_mutual_books = sorted(users_mutual_books, key=lambda x: len(x[1]), reverse=True)
    top_users = users_mutual_books[:100]

    pearson_corr = {}

    for user_id, features in top_users:
        # Books should be sorted so both rating arrays are in the same order
        features = features.sort_values(by='book_id')
        features_list = features['book_id'].values

        new_user_ratings = new_user[new_user['book_id'].isin(features_list)]['likes'].values
        user_ratings = features['rating'].values

        corr = pearsonr(new_user_ratings, user_ratings)
        pearson_corr[user_id] = corr[0]

    # Get top50 users with the highest similarity indices
    pearson_df = pd.DataFrame(columns=['user_id', 'similarity_index'], data=pearson_corr.items())
    pearson_df = pearson_df.sort_values(by='similarity_index', ascending=False)[:50]

    # Get all books for these users (from the ratings passed in) and add weighted ratings
    users_rating = pearson_df.merge(others, on='user_id', how='inner')
    users_rating['weighted_rating'] = users_rating['rating'] * users_rating['similarity_index']

    # Calculate sum of similarity index and weighted rating for each book
    grouped_ratings = users_rating.groupby('book_id').sum()[['similarity_index', 'weighted_rating']]

    recommend_tags = pd.DataFrame()

    # Add average recommendation score
    recommend_tags['avg_reccomend_score'] = grouped_ratings['weighted_rating']/grouped_ratings['similarity_index']
    recommend_tags['book_id'] = grouped_ratings.index
    recommend_tags = recommend_tags.reset_index(drop=True)

    # Keep only the books with the highest possible score (5), allowing for floating-point rounding
    recommend_tags = recommend_tags[np.isclose(recommend_tags['avg_reccomend_score'], 5)]
    return recommend_tags

books_df.rename(columns={"title":"hashtag"},inplace=True)
user_colab(user_info,ratings_df,books_df)

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({'user_0':[0,3,0,5,0,0,4,5,0,2], 'user_1':[0,0,3,2,5,0,4,0,3,0],
                   'user_2':[3,1,0,3,5,0,0,4,0,0], 'user_3':[4,3,4,2,0,0,0,2,0,0], 
                   'user_4':[2,0,0,0,0,4,4,3,5,0], 'user_5':[1,0,2,4,0,0,4,0,5,0], 
                   'user_6':[2,0,0,3,0,4,3,3,0,0], 'user_7':[0,0,0,3,0,2,4,3,4,0], 
                   'user_8':[5,0,0,0,5,3,0,3,0,4], 'user_9':[1,0,2,0,4,0,4,3,0,0]}, 
                  index=['movie_0','movie_1','movie_2','movie_3','movie_4','movie_5',
                         'movie_6','movie_7','movie_8','movie_9'])
df

In [None]:
from sklearn.neighbors import NearestNeighbors

In [None]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=3)

In [None]:
indices

In [None]:
distances
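
To see where the numbers in `distances` come from, here is a minimal pure-Python sketch of cosine distance applied to two rows of `df`. The `cosine_distance` helper is my own illustration; it assumes, as I understand scikit-learn's `cosine` metric, that the distance is 1 minus the cosine similarity.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two rating vectors (undefined for all-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

# Rows of df above: ratings of each movie by user_0 ... user_9
movie_3 = [5, 2, 3, 2, 0, 4, 3, 3, 0, 0]
movie_6 = [4, 4, 0, 0, 4, 4, 3, 4, 0, 4]

print(cosine_distance(movie_3, movie_3))  # a movie is at distance ~0 from itself
print(cosine_distance(movie_3, movie_6))
```

Because each movie is at distance 0 from itself, it normally appears first in its own neighbour list, which is why the cells below remove it before using the neighbours.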

In [None]:
for title in df.index:

  index_user_likes = df.index.tolist().index(title) # get an index for a movie
  sim_movies = indices[index_user_likes].tolist() # make list for similar movies
  movie_distances = distances[index_user_likes].tolist() # the list for distances of similar movies
  id_movie = sim_movies.index(index_user_likes) # get the position of the movie itself in indices and distances

  print('Similar Movies to '+str(df.index[index_user_likes])+':\n')


  sim_movies.remove(index_user_likes) # remove the movie itself in indices
  movie_distances.pop(id_movie) # remove the movie itself in distances

  j = 1
  
  for i in sim_movies:
    print(str(j)+': '+str(df.index[i])+', the distance with '+str(title)+': '+str(movie_distances[j-1]))
    j = j + 1
      
  print('\n')

In [None]:
def recommend_movie(title):

  index_user_likes = df.index.tolist().index(title) # get an index for a movie
  sim_movies = indices[index_user_likes].tolist() # make list for similar movies
  movie_distances = distances[index_user_likes].tolist() # the list for distances of similar movies
  id_movie = sim_movies.index(index_user_likes) # get the position of the movie itself in indices and distances

  print('Similar Movies to '+str(df.index[index_user_likes])+': \n')

  sim_movies.remove(index_user_likes) # remove the movie itself in indices
  movie_distances.pop(id_movie) # remove the movie itself in distances

  j = 1
    
  for i in sim_movies:
    print(str(j)+': '+str(df.index[i])+', the distance with '+str(title)+': '+str(movie_distances[j-1]))
    j = j + 1

In [None]:
recommend_movie('movie_3')

In [None]:
df

In [None]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=3)

In [None]:
index_for_movie = df.index.tolist().index('movie_0') # it returns 0
sim_movies = indices[index_for_movie].tolist() # make list for similar movies
movie_distances = distances[index_for_movie].tolist() # the list for distances of similar movies
id_movie = sim_movies.index(index_for_movie) # get the position of the movie itself in indices and distances
sim_movies.remove(index_for_movie) # remove the movie itself in indices
movie_distances.pop(id_movie) # remove the movie itself in distances

print('The Nearest Movies to movie_0:', sim_movies)
print('The Distance from movie_0:', movie_distances)

In [None]:
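# Convert cosine distance to cosine similarity (similarity = 1 - distance)
movie_similarity = [1 - x for x in movie_distances]

# Predict user_7's rating of movie_0 (column 7 of df) as the similarity-weighted
# average of user_7's ratings of the two nearest movies
predicted_rating = (movie_similarity[0]*df.iloc[sim_movies[0],7] + movie_similarity[1]*df.iloc[sim_movies[1],7])/sum(movie_similarity)
print(predicted_rating)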
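
The prediction above is a similarity-weighted mean. The sketch below generalises it to any number of neighbours. The `predict_rating` helper and its inputs are hypothetical, and unlike the cell above it skips neighbours the user has not rated, which is an assumption about how zeros should be treated.

```python
def predict_rating(neighbour_distances, neighbour_ratings):
    """Similarity-weighted mean of the neighbours' ratings, skipping unrated (0) entries."""
    numerator = 0.0
    denominator = 0.0
    for distance, rating in zip(neighbour_distances, neighbour_ratings):
        if rating == 0:      # 0 means "not rated" in this matrix
            continue
        similarity = 1 - distance
        numerator += similarity * rating
        denominator += similarity
    # With no usable neighbour there is nothing to average, so fall back to 0
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical cosine distances to two neighbours and the user's ratings of them
print(predict_rating([0.2, 0.5], [4, 2]))  # both neighbours rated
print(predict_rating([0.2, 0.5], [4, 0]))  # second neighbour unrated, so only the first counts
print(predict_rating([0.2, 0.5], [0, 0]))  # nothing rated, falls back to 0.0
```

Skipping unrated neighbours keeps a missing rating from being read as a rating of zero, which would otherwise drag the prediction down.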

In [None]:
# find the nearest neighbors using NearestNeighbors(n_neighbors=3)
number_neighbors = 3
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df.values)
distances, indices = knn.kneighbors(df.values, n_neighbors=number_neighbors)

# copy df
df1 = df.copy()

# convert user_name to user_index
user_index = df.columns.tolist().index('user_4')

# t: movie_title, m: the row number of t in df
for m,t in list(enumerate(df.index)):
  
  # find movies without ratings by user_4
  if df.iloc[m, user_index] == 0:
    sim_movies = indices[m].tolist()
    movie_distances = distances[m].tolist()
    
    # Generally, this is the case: indices[3] = [3 6 7]. The movie itself is in the first place.
    # In this case, we take off 3 from the list. Then, indices[3] == [6 7] to have the nearest NEIGHBORS in the list. 
    if m in sim_movies:
      id_movie = sim_movies.index(m)
      sim_movies.remove(m)
      movie_distances.pop(id_movie) 

    # However, if the percentage of ratings in the dataset is very low, there are too many 0s in the dataset. 
    # Some movies have all 0 ratings and the movies with all 0s are considered the same movies by NearestNeighbors(). 
    # Then,even the movie itself cannot be included in the indices. 
    # For example, indices[3] = [2 4 7] is possible if movie_2, movie_3, movie_4, and movie_7 have all 0s for their ratings.
    # In that case, we take off the farthest movie in the list. Therefore, 7 is taken off from the list, then indices[3] == [2 4].
    else:
      sim_movies = sim_movies[:number_neighbors-1]
      movie_distances = movie_distances[:number_neighbors-1]
        
    # movie_similarity = 1 - movie_distance
    movie_similarity = [1-x for x in movie_distances]
    movie_similarity_copy = movie_similarity.copy()
    nominator = 0

    # for each similar movie
    for s in range(0, len(movie_similarity)):
      
      # check if the rating of a similar movie is zero
      if df.iloc[sim_movies[s], user_index] == 0:

        # if the rating is zero, ignore the rating and the similarity in calculating the predicted rating
        if len(movie_similarity_copy) == (number_neighbors - 1):
          movie_similarity_copy.pop(s)
          
        else:
          movie_similarity_copy.pop(s-(len(movie_similarity)-len(movie_similarity_copy)))

      # if the rating is not zero, use the rating and similarity in the calculation
      else:
        nominator = nominator + movie_similarity[s]*df.iloc[sim_movies[s],user_index]

    # check that at least one similar movie has a non-zero rating from the user
    if len(movie_similarity_copy) > 0:
      
      # check that the sum of the similarities of those movies is positive
      if sum(movie_similarity_copy) > 0:
        predicted_r = nominator/sum(movie_similarity_copy)

      # Some movies are selected as neighbors even though their similarity is zero.
      # If all remaining similarities are zero, the predicted rating is set to zero as well
      else:
        predicted_r = 0

    # if all the ratings of the similar movies are zero, then the predicted rating should be zero
    else:
      predicted_r = 0

    # place the predicted rating into the copy of the original dataset
    df1.iloc[m,user_index] = predicted_r

In [None]:
def recommend_movies(user, num_recommended_movies):

  print('The list of the Movies {} Has Watched \n'.format(user))

  for m in df[df[user] > 0][user].index.tolist():
    print(m)
  
  print('\n')

  recommended_movies = []

  for m in df[df[user] == 0].index.tolist():

    index_df = df.index.tolist().index(m)
    predicted_rating = df1.iloc[index_df, df1.columns.tolist().index(user)]
    recommended_movies.append((m, predicted_rating))

  sorted_rm = sorted(recommended_movies, key=lambda x:x[1], reverse=True)
  
  print('The list of the Recommended Movies \n')
  rank = 1
  for recommended_movie in sorted_rm[:num_recommended_movies]:
    
    print('{}: {} - predicted rating:{}'.format(rank, recommended_movie[0], recommended_movie[1]))
    rank = rank + 1

In [None]:
recommend_movies('user_4',5)

In [None]:
df1 = df.copy()

def movie_recommender(user, num_neighbors, num_recommendation):
  
  number_neighbors = num_neighbors

  knn = NearestNeighbors(metric='cosine', algorithm='brute')
  knn.fit(df.values)
  distances, indices = knn.kneighbors(df.values, n_neighbors=number_neighbors)

  user_index = df.columns.tolist().index(user)

  for m,t in list(enumerate(df.index)):
    if df.iloc[m, user_index] == 0:
      sim_movies = indices[m].tolist()
      movie_distances = distances[m].tolist()
    
      if m in sim_movies:
        id_movie = sim_movies.index(m)
        sim_movies.remove(m)
        movie_distances.pop(id_movie) 

      else:
        sim_movies = sim_movies[:num_neighbors-1]
        movie_distances = movie_distances[:num_neighbors-1]
           
      movie_similarity = [1-x for x in movie_distances]
      movie_similarity_copy = movie_similarity.copy()
      nominator = 0

      for s in range(0, len(movie_similarity)):
        if df.iloc[sim_movies[s], user_index] == 0:
          if len(movie_similarity_copy) == (number_neighbors - 1):
            movie_similarity_copy.pop(s)
          
          else:
            movie_similarity_copy.pop(s-(len(movie_similarity)-len(movie_similarity_copy)))
            
        else:
          nominator = nominator + movie_similarity[s]*df.iloc[sim_movies[s],user_index]
          
      if len(movie_similarity_copy) > 0:
        if sum(movie_similarity_copy) > 0:
          predicted_r = nominator/sum(movie_similarity_copy)
        
        else:
          predicted_r = 0

      else:
        predicted_r = 0
        
      df1.iloc[m,user_index] = predicted_r
  recommend_movies(user,num_recommendation)

In [None]:
movie_recommender('user_4', 4, 5)

https://github.com/yjeong5126/movie_recommender

In [None]:
ratings = pd.read_csv('ratings.csv', usecols=['userId','movieId','rating'])
# Use the raw file URL; the github.com/.../blob/... page is HTML, not CSV
movies = pd.read_csv('https://raw.githubusercontent.com/yjeong5126/movie_recommender/master/item_based_collaborative_filtering/movies.csv', 
                     usecols=['movieId','title'])
ratings2 = pd.merge(ratings, movies, how='inner', on='movieId')