# **MOVIE RECOMMENDER SYSTEM**

The goal of this project is to build 3 types of recommender systems:

- Popularity based

- Item-based with correlation

- User-based with cosine similarity

## Reading Data & First Glance

In [2]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# import data
url = "https://drive.google.com/file/d/18TReZs3uJmJh0hIofeOXDzjOq-bnywYT/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
movies_df = pd.read_csv(path)

url = "https://drive.google.com/file/d/19A69kCZ33oTc_1oF8TX3XymJ5AGj2APC/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
ratings_df = pd.read_csv(path)

url = "https://drive.google.com/file/d/12KAAKmRT4l9QZEh4b3FAIToKCeFtlwCe/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
tags_df = pd.read_csv(path)

url = "https://drive.google.com/file/d/1MU1eYadkdX739KM2JZ_zn1HJad39XiaQ/view?usp=share_link"
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
links_df = pd.read_csv(path)
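
The four loading steps above repeat the same share-link conversion. As a minimal sketch, that conversion could be factored into a helper. The name `drive_download_url` is hypothetical (it is not part of the original notebook), and the function assumes share links of the form `https://drive.google.com/file/d/<file_id>/view?...`.

```python
def drive_download_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download URL.

    Assumes the link looks like .../file/d/<file_id>/view?..., so the
    file id is the second-to-last path segment (same logic as above).
    """
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


example = "https://drive.google.com/file/d/18TReZs3uJmJh0hIofeOXDzjOq-bnywYT/view?usp=share_link"
print(drive_download_url(example))
```

With such a helper, each dataset could be loaded in one line, for example `movies_df = pd.read_csv(drive_download_url(url))`.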

In [3]:
movies_df.sample(5)

Unnamed: 0,movieId,title,genres
3524,4815,Hearts in Atlantis (2001),Drama
1091,1416,Evita (1996),Drama|Musical
7775,91653,We Bought a Zoo (2011),Comedy|Drama
581,714,Dead Man (1995),Drama|Mystery|Western
8516,114265,Laggies (2014),Comedy|Romance


In [4]:
ratings_df.sample(5)

Unnamed: 0,userId,movieId,rating,timestamp
72162,464,72998,5.0,1275548743
91341,592,325,3.0,837350907
71753,462,3462,3.5,1123893781
95430,600,2161,3.5,1237715291
73695,474,2110,3.5,1089387033


In [5]:
tags_df.sample(5)

Unnamed: 0,userId,movieId,tag,timestamp
2434,474,31658,In Netflix queue,1137201531
1182,474,830,adultery,1137374123
3494,599,296,non-linear timeline,1498456578
996,474,28,In Netflix queue,1137201942
1694,474,2872,England,1137191745


In [6]:
links_df.sample(5)

Unnamed: 0,movieId,imdbId,tmdbId
6267,47404,452039,21712.0
6778,60161,1054485,12889.0
993,1295,96332,10644.0
9530,172233,223954,44015.0
2961,3969,223897,10647.0


# **1. Popularity-Based Recommendations**

**A function to generate the top n films based on rating**

A popularity-based, non-personalised recommender system that takes the ratings and movies datasets as input and outputs the “best” movies. The definition of “best” is a design choice; here it means the highest mean rating among movies with at least 50 ratings. These movies will appear as the top row of the WBSFLIX site.

In [7]:
def top_movies(n):

    # group the ratings by movie and get the mean rating
    df = pd.DataFrame(ratings_df.groupby('movieId')['rating'].mean())

    # add the rating count
    df['rating_count'] = ratings_df.groupby('movieId')['rating'].count()

    # merge to get the movie title
    df_1 = pd.merge(df, movies_df, on="movieId", how="inner")[["movieId", "title", "rating", "rating_count"]]

    # keep only movies with at least 50 ratings, sort by mean rating (highest first) and take the top n
    top_n = df_1[df_1['rating_count'] >= 50].sort_values(by='rating', ascending=False)[:n]

    # return the top n movies as a dataframe
    return top_n

In [8]:
top_movies(5)

Unnamed: 0,movieId,title,rating,rating_count
277,318,"Shawshank Redemption, The (1994)",4.429022,317
659,858,"Godfather, The (1972)",4.289062,192
2224,2959,Fight Club (1999),4.272936,218
974,1276,Cool Hand Luke (1967),4.27193,57
602,750,Dr. Strangelove or: How I Learned to Stop Worr...,4.268041,97
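
To make the effect of the rating-count threshold easier to see, here is a small self-contained sketch of the same idea on made-up data. The function name `top_movies_from`, the `min_ratings` parameter, and the toy dataframes are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def top_movies_from(ratings, movies, n, min_ratings=50):
    """Rank movies by mean rating, ignoring movies with too few ratings."""
    # mean rating and number of ratings per movie
    stats = (ratings.groupby('movieId')['rating']
                    .agg(rating='mean', rating_count='count')
                    .reset_index())
    # drop movies below the threshold, then keep the n best by mean rating
    popular = stats[stats['rating_count'] >= min_ratings]
    ranked = popular.sort_values('rating', ascending=False).head(n)
    # attach titles
    merged = ranked.merge(movies[['movieId', 'title']], on='movieId', how='inner')
    return merged[['movieId', 'title', 'rating', 'rating_count']]


# toy data (made up for illustration)
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 1, 2, 1],
    'movieId': [10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 5.0, 3.0, 5.0, 4.0, 5.0],
})
toy_movies = pd.DataFrame({'movieId': [10, 20, 30],
                           'title': ['Movie A', 'Movie B', 'Movie C']})

toy_top = top_movies_from(toy_ratings, toy_movies, n=2, min_ratings=2)
print(toy_top)
```

In this toy data, Movie C has a perfect mean rating but only one rating, so the threshold excludes it. This is why a minimum count is used: a handful of enthusiastic ratings should not outrank widely rated films.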


# **2. Item-Based Recommendations**

**A function that takes the name of a movie, and a number (n), and outputs the n most similar movies to the selected title.**

A similarity-based, semi-personalised recommender system that takes a movie as input and outputs a list of movies that are “similar” to it, based on rating correlations from the user-item matrix. In production, the input will be a movie that the user has watched recently or rated highly; for now, it is entered manually. These movies will appear as the second row of the WBSFLIX site.

In [9]:
def similar_movies(movie_id, n):

    # create the user-item matrix (users as rows, movies as columns)
    ratings_crosstab = pd.pivot_table(data=ratings_df, values='rating', index='userId', columns='movieId')

    # ratings of the chosen movie (missing ratings are ignored when computing correlations)
    movies_ratings = ratings_crosstab[movie_id]

    # correlate every movie's ratings with the chosen movie's ratings
    correlations = ratings_crosstab.corrwith(movies_ratings)

    # store the correlation scores and drop NaNs
    corr_score = pd.DataFrame(correlations, columns=['PearsonR'])
    corr_score.dropna(inplace=True)

    # create a ratings dataframe with the mean rating and the rating count
    rating = pd.DataFrame(ratings_df.groupby('movieId')['rating'].mean())
    rating['rating_count'] = ratings_df.groupby('movieId')['rating'].count()

    # join correlation scores and rating count
    movies_corr_summary = corr_score.join(rating['rating_count'])
    # drop the chosen movie itself (ignore it if it was already removed with the NaNs)
    movies_corr_summary.drop(movie_id, inplace=True, errors='ignore')

    # keep only movies with at least 100 ratings, sort by correlation (highest first) and take the top n
    movies_score = movies_corr_summary[movies_corr_summary['rating_count']>=100].sort_values('PearsonR', ascending=False)[:n]

    # merge with movies_df to get the title
    movie_recommendations = pd.merge(movies_score, movies_df, on="movieId", how="inner")[["movieId", "title", "PearsonR", "rating_count"]]

    # return the top n movies as a dataframe
    return movie_recommendations

In [10]:
similar_movies(1, 5)


Unnamed: 0,movieId,title,PearsonR,rating_count
0,8961,"Incredibles, The (2004)",0.643301,125
1,6377,Finding Nemo (2003),0.618701,141
2,588,Aladdin (1992),0.611892,183
3,4886,"Monsters, Inc. (2001)",0.490231,132
4,500,Mrs. Doubtfire (1993),0.446261,144
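
The correlation step can be checked by hand on a tiny example. The sketch below applies the same pivot-and-`corrwith` approach to made-up ratings; `correlated_movies`, `min_ratings`, and the toy data are hypothetical names and values used only for illustration.

```python
import pandas as pd


def correlated_movies(ratings, movie_id, n, min_ratings=1):
    """Rank other movies by Pearson correlation with `movie_id`."""
    # users as rows, movies as columns; unrated cells are NaN
    crosstab = ratings.pivot_table(values='rating', index='userId', columns='movieId')
    # correlation of every movie column with the chosen movie's column
    corr = crosstab.corrwith(crosstab[movie_id]).dropna().rename('PearsonR')
    # number of ratings per movie
    counts = crosstab.count().rename('rating_count')
    summary = pd.concat([corr, counts], axis=1, join='inner')
    # remove the chosen movie itself
    summary = summary.drop(movie_id, errors='ignore')
    summary = summary[summary['rating_count'] >= min_ratings]
    return summary.sort_values('PearsonR', ascending=False).head(n)


# toy data: movie 20 is rated much like movie 10, movie 30 the opposite way
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 4] * 3,
    'movieId': [10] * 4 + [20] * 4 + [30] * 4,
    'rating':  [5.0, 4.0, 2.0, 1.0,
                4.0, 5.0, 1.0, 2.0,
                1.0, 2.0, 4.0, 5.0],
})

toy_similar = correlated_movies(toy_ratings, movie_id=10, n=2)
print(toy_similar)
```

Here movie 20 correlates positively with movie 10 (users who liked one tended to like the other), while movie 30 is rated in exactly the opposite order and gets a correlation of -1.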


# **3. User-Based Recommendations**

**A function that takes a user's userId and a number (n), and outputs the n most recommended movies based on the cosine similarity between that user and all other users.**

A fully personalised recommender system, which will generate the third row on the WBSFLIX site.

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

In [12]:
def top_movies_user(user_id, n):

  # reshape the data so that users are rows and movies are columns
  users_items = pd.pivot_table(data=ratings_df,
                               values='rating',
                               index='userId',
                               columns='movieId')

  # replace NaNs (unrated movies) with zeros
  users_items.fillna(0, inplace=True)

  # compute cosine similarities between all pairs of users
  user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                   columns=users_items.index,
                                   index=users_items.index)

  # similarities of all other users to the input user, normalised to sum to 1
  other_similarities = user_similarities.query("userId!=@user_id")[user_id]
  weights = other_similarities / other_similarities.sum()

  # select the movies the input user has not rated (rows: all other users)
  not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]

  # dot product between the not-watched movies and the weights
  weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])

  # merge with movies_df to get the title
  recommendations = weighted_averages.merge(movies_df, left_index=True, right_on="movieId")[["movieId", "title", "predicted_rating"]]

  # sort by predicted rating (highest first) and keep the top n
  recommendations_return = recommendations.sort_values("predicted_rating", ascending=False)[:n]

  return recommendations_return

In [13]:
top_movies_user(1,5)

Unnamed: 0,movieId,title,predicted_rating
277,318,"Shawshank Redemption, The (1994)",2.654727
507,589,Terminator 2: Judgment Day (1991),2.087327
659,858,"Godfather, The (1972)",1.859548
2078,2762,"Sixth Sense, The (1999)",1.663564
3638,4993,"Lord of the Rings: The Fellowship of the Ring,...",1.62482
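
The predicted ratings above sit well below the usual 0.5 to 5 scale. A likely reason is that unrated movies are filled with 0, so every similar user who has not rated a movie pulls its weighted average down. The sketch below reproduces the weighting on a toy matrix and, as one possible variation that is not in the original notebook, divides only by the similarities of users who actually rated each movie (`rated_only=True`). The function name, the flag, and the data are hypothetical. Cosine similarity is computed directly with NumPy so that the sketch needs only pandas and NumPy, and it assumes every user has at least one rating.

```python
import numpy as np
import pandas as pd


def predict_unseen(users_items, user_id, rated_only=False):
    """Predict ratings for movies `user_id` has not rated (0 means unrated)."""
    # cosine similarity between all pairs of users
    X = users_items.to_numpy(dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    sims = pd.DataFrame((X @ X.T) / (norms @ norms.T),
                        index=users_items.index, columns=users_items.index)
    # similarities of the other users to the input user
    others = sims.loc[sims.index != user_id, user_id]
    # other users' ratings for the movies the input user has not rated
    unseen = users_items.loc[users_items.index != user_id,
                             users_items.loc[user_id] == 0]
    weighted_sum = unseen.T.dot(others)
    if rated_only:
        # variation: normalise by the similarities of actual raters only
        denom = (unseen > 0).astype(float).T.dot(others)
    else:
        # as in top_movies_user: normalise by the sum of all similarities
        denom = others.sum()
    return (weighted_sum / denom).sort_values(ascending=False)


# toy user-item matrix (made up): user 1 has rated only movie 10
toy = pd.DataFrame([[5, 0, 0], [5, 4, 0], [4, 0, 2]],
                   index=pd.Index([1, 2, 3], name='userId'),
                   columns=pd.Index([10, 20, 30], name='movieId'),
                   dtype=float)

baseline = predict_unseen(toy, 1)
adjusted = predict_unseen(toy, 1, rated_only=True)
print(baseline)
print(adjusted)
```

In the toy data only user 2 rated movie 20 (with a 4), so the adjusted prediction is 4, while the baseline prediction is lower because user 3's implicit 0 is averaged in. The ranking can differ between the two variants on real data, so this is a trade-off to evaluate rather than a drop-in fix.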


# **4. Implement a Chatbot**

In [14]:
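import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# build the user-item matrix once, outside the recommender function
users_items = pd.pivot_table(data=ratings_df,
                             values='rating',
                             index='userId',
                             columns='movieId')

# replace NaNs (unrated movies) with zeros
users_items.fillna(0, inplace=True)

# compute the user-user cosine similarities once so they can be reused for every request
user_similarities = pd.DataFrame(cosine_similarity(users_items),
                                 columns=users_items.index,
                                 index=users_items.index)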

In [15]:
def weighted_user_rec(user_id, n):
  # similarities of all other users to the input user, normalised to sum to 1
  other_similarities = user_similarities.query("userId!=@user_id")[user_id]
  weights = other_similarities / other_similarities.sum()
  # other users' ratings for the movies the input user has not rated
  not_watched_movies = users_items.loc[users_items.index!=user_id, users_items.loc[user_id,:]==0]
  # similarity-weighted average rating per movie
  weighted_averages = pd.DataFrame(not_watched_movies.T.dot(weights), columns=["predicted_rating"])
  # add the movie details and keep the n highest predictions
  recommendations = weighted_averages.merge(movies_df, left_index=True, right_on="movieId")
  top_recommendations = recommendations.sort_values("predicted_rating", ascending=False).head(n)
  return top_recommendations

In [16]:
weighted_user_rec(1, 1)

Unnamed: 0,predicted_rating,movieId,title,genres
277,2.654727,318,"Shawshank Redemption, The (1994)",Crime|Drama


In [None]:
def chat_bot():
    print("Hi! I'm your personal recommender, let me recommend you some movies! Tell me your user ID.")
    user_id = input()
    user_id = int(user_id)
    recom = weighted_user_rec(user_id, 1)
    print(f"You will probably like the movie: {list(recom['title'])[0]}")
    
chat_bot()

Hi! I'm your personal recommender, let me recommend you some movies! Tell me your user ID.
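
As written, `chat_bot` assumes the reply is an integer that exists in the ratings data; a non-numeric reply would fail in `int(...)`, and an unknown ID would fail when the similarity column is looked up. A minimal sketch of a validation step follows. `parse_user_id` and the example IDs are hypothetical and not part of the original notebook.

```python
def parse_user_id(text, known_ids):
    """Return the user ID as an int, or None if `text` is not a known ID."""
    try:
        user_id = int(text.strip())
    except ValueError:
        # the reply was not a whole number
        return None
    # accept the ID only if it exists in the data
    return user_id if user_id in known_ids else None


known = {1, 2, 3}  # made-up IDs for illustration
print(parse_user_id(" 2 ", known))   # 2
print(parse_user_id("abc", known))   # None
print(parse_user_id("99", known))    # None
```

Inside `chat_bot`, the call could be `parse_user_id(input(), users_items.index)`, asking the user again whenever it returns `None`.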
