**Data Description**

The data consists of 105339 ratings across a catalogue of 10329 movies. The average rating is 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively. The ratings come from 668 users and cover 10325 distinct movies.

- Two data files are provided:

* movies.csv

  - movieId: ID assigned to a movie
  - title: Title of a movie
  - genres: pipe-separated list of movie genres.


* ratings.csv

  - userId: ID assigned to a user
  - movieId: ID assigned to a movie
  - rating: rating given by a user to a movie
  - timestamp: time at which the rating was provided.

**Objective**

- Create a **popularity based recommender system** at a genre level. The user inputs a genre (g), a minimum ratings threshold (t) for a movie, and the number of recommendations (N). The system recommends the top N most popular movies within genre (g), ordered by average rating in descending order, where each movie has at least (t) reviews.

- Create a **content based recommender system** which recommends the top N movies whose genres are most similar to those of a given movie (m).

- Create a **collaborative based recommender system** which recommends the top N movies based on the “K” most similar users for a target user “u”.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from ipywidgets import *

In [None]:
movies=pd.read_csv('movies.csv')
ratings=pd.read_csv('ratings.csv')

In [None]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


In [None]:
#null values....NaN
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10329 entries, 0 to 10328
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  10329 non-null  int64 
 1   title    10329 non-null  object
 2   genres   10329 non-null  object
dtypes: int64(1), object(2)
memory usage: 242.2+ KB


In [None]:
movies.shape

(10329, 3)

In [None]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105339 entries, 0 to 105338
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     105339 non-null  int64  
 1   movieId    105339 non-null  int64  
 2   rating     105339 non-null  float64
 3   timestamp  105339 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.2 MB


In [None]:
ratings.shape

(105339, 4)

In [None]:
#no null values present

In [None]:
ratings['userId'].nunique()  #668 users

668

In [None]:
movies['movieId'].nunique()

10329

In [None]:
ratings['movieId'].nunique()

10325
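
The two counts differ: `movies` lists 10329 IDs while `ratings` references 10325, so four movies appear to have no ratings at all. A minimal sketch of how such rows could be listed with `isin`; the frames `toy_movies` and `toy_ratings` are made-up stand-ins, not the real data:

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (hypothetical values).
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (2000)", "B (2001)", "C (2002)"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 10, 11],
    "movieId": [1, 3, 1],
    "rating": [4.0, 3.5, 5.0],
})

# Keep the movies whose movieId never occurs in the ratings table.
unrated = toy_movies[~toy_movies["movieId"].isin(toy_ratings["movieId"])]
print(unrated["title"].tolist())
```

Applied to the real `movies` and `ratings` frames, the same expression would list the unrated titles. Any such movie cannot appear in a ratings-based recommendation, because it has no rating rows to join against.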

In [None]:
#popularity  >>>genres
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
movies['genres']=movies['genres'].str.split("|")
movies['genres']

Unnamed: 0,genres
0,"[Adventure, Animation, Children, Comedy, Fantasy]"
1,"[Adventure, Children, Fantasy]"
2,"[Comedy, Romance]"
3,"[Comedy, Drama, Romance]"
4,[Comedy]
...,...
10324,"[Animation, Children, Comedy]"
10325,[Comedy]
10326,[Comedy]
10327,[Drama]


In [None]:
movies2=movies.explode('genres')
movies2.head(15)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
1,2,Jumanji (1995),Adventure
1,2,Jumanji (1995),Children
1,2,Jumanji (1995),Fantasy
2,3,Grumpier Old Men (1995),Comedy
2,3,Grumpier Old Men (1995),Romance


In [None]:
movies2['genres'].nunique()

20

In [None]:
movies2['genres'].unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'IMAX', 'War', 'Musical', 'Documentary',
       'Western', 'Film-Noir', '(no genres listed)'], dtype=object)

In [None]:
movies2=movies2[movies2['genres']!='(no genres listed)']

In [None]:
movies2['genres'].unique()

array(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy',
       'Romance', 'Drama', 'Action', 'Crime', 'Thriller', 'Horror',
       'Mystery', 'Sci-Fi', 'IMAX', 'War', 'Musical', 'Documentary',
       'Western', 'Film-Noir'], dtype=object)

In [None]:
movies2['genres'].nunique()

19

In [None]:
#group data based upon genres and i will get average ratings for all the genres

In [None]:
merged_info=pd.merge(ratings,movies2,on=['movieId'],how='inner')  #we want to join the two datasets based upon common movieID
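
Because `movies2` holds one row per (movie, genre) pair, this inner join repeats each rating once for every genre of the rated movie. A small sketch of that effect on toy data follows; the frames and values are invented for illustration, and the named-aggregation form of `agg` is just one possible way to write the per-genre summary:

```python
import pandas as pd

toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["A (2000)", "B (2001)"],
    "genres": ["Action|Comedy", "Drama"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 2.0, 5.0],
})

# One row per (movie, genre) pair, mirroring how movies2 is built.
exploded = toy_movies.assign(genres=toy_movies["genres"].str.split("|")).explode("genres")

# Each rating of movie 1 now appears twice: once for Action, once for Comedy.
toy_merged = pd.merge(toy_ratings, exploded, on="movieId", how="inner")

# Mean rating and rating count for every (genre, title) pair.
toy_popularity = (
    toy_merged.groupby(["genres", "title"])
    .agg(ratings_mean=("rating", "mean"), ratings_counts=("rating", "size"))
    .reset_index()
)
print(len(toy_ratings), len(toy_merged))
print(toy_popularity)
```

The merged frame is therefore larger than `ratings`, and a movie with several genres is counted in each of them.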


In [None]:
popularity_df=merged_info.groupby(['genres','title']).agg({'rating':["mean","size"]}).reset_index()
popularity_df.columns=['genre','title','ratings_mean','ratings_counts']
popularity_df

Unnamed: 0,genre,title,ratings_mean,ratings_counts
0,Action,'71 (2014),3.500000,1
1,Action,'Hellboy': The Seeds of Creation (2004),3.000000,1
2,Action,10 to Midnight (1983),2.500000,1
3,Action,12 Rounds (2009),2.875000,4
4,Action,13 Assassins (Jûsan-nin no shikaku) (2010),3.500000,5
...,...,...,...,...
23093,Western,Wyatt Earp (1994),3.200000,30
23094,Western,Young Guns (1988),3.375000,36
23095,Western,Young Guns II (1990),3.083333,12
23096,Western,Young Ones (2014),2.000000,1


In [None]:
#genres
#threshold
popularity_df[(popularity_df['genre']=='Action')&(popularity_df['ratings_counts']>=50)].sort_values(by=['ratings_mean'],ascending=False).head()

Unnamed: 0,genre,title,ratings_mean,ratings_counts
1179,Action,Princess Mononoke (Mononoke-hime) (1997),4.384615,52
1076,Action,North by Northwest (1959),4.273973,73
975,Action,"Matrix, The (1999)",4.264368,261
1433,Action,Star Wars: Episode V - The Empire Strikes Back...,4.22807,228
1331,Action,Seven Samurai (Shichinin no samurai) (1954),4.217742,62


In [None]:
def TopNPopularMovies(genre,num_ratings_threshold,topN=5):
    #average rating and number of ratings for every (genre, title) pair
    popularity_df=merged_info.groupby(['genres','title']).agg({'rating':["mean","size"]}).reset_index()
    popularity_df.columns=['genre','title','ratings_mean','ratings_counts']
    #keep movies of the requested genre with enough ratings, best average first
    #(.copy() so that adding a column below does not write into a slice of popularity_df)
    topN_recommendations=popularity_df[(popularity_df['genre']==genre)&(popularity_df['ratings_counts']>=num_ratings_threshold)].sort_values(by=['ratings_mean'],ascending=False).head(topN).copy()

    #refactoring output
    topN_recommendations['SNo.']=list(range(1,len(topN_recommendations)+1))
    topN_recommendations.index=range(0,len(topN_recommendations))
    topN_recommendations.columns=['Genre',"Movie Title","Average Movie Rating","Number of Reviews","SNo."]
    return topN_recommendations[["SNo.","Movie Title","Average Movie Rating","Number of Reviews"]]

In [None]:
#test running
genre="Action"
num_ratings_threshold=100
topN=10
TopNPopularMovies(genre=genre,num_ratings_threshold=num_ratings_threshold,topN=topN)

Unnamed: 0,SNo.,Movie Title,Average Movie Rating,Number of Reviews
0,1,"Matrix, The (1999)",4.264368,261
1,2,Star Wars: Episode V - The Empire Strikes Back...,4.22807,228
2,3,Raiders of the Lost Ark (Indiana Jones and the...,4.212054,224
3,4,Inception (2010),4.18932,103
4,5,Star Wars: Episode IV - A New Hope (1977),4.188645,273
5,6,Fight Club (1999),4.188406,207
6,7,Blade Runner (1982),4.169872,156
7,8,"Princess Bride, The (1987)",4.163743,171
8,9,Aliens (1986),4.146497,157
9,10,"Dark Knight, The (2008)",4.141732,127


In [None]:
#test case 2
genre='Comedy'
num_ratings_threshold=50
topN=5
TopNPopularMovies(genre=genre,num_ratings_threshold=num_ratings_threshold,topN=topN)

Unnamed: 0,SNo.,Movie Title,Average Movie Rating,Number of Reviews
0,1,Monty Python and the Holy Grail (1975),4.301948,154
1,2,Fargo (1996),4.271144,201
2,3,Life Is Beautiful (La Vita è bella) (1997),4.253425,73
3,4,"Sting, The (1973)",4.207792,77
4,5,Annie Hall (1977),4.205882,68


In [None]:
#Content Based Recommender System

The content considered here is the genres, so we need to convert each movie's genres to a vector. There are multiple approaches to this; here we use a TF-IDF vectorizer.
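
To make the idea concrete, here is a library-free sketch of what a TF-IDF vectorizer followed by cosine similarity computes on genre strings. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which are scikit-learn's documented defaults, but it uses plain whitespace tokenisation and unigrams only. The helper names `tfidf_vectors` and `cosine` and the four toy movies are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

toy_genres = {
    "Movie A": "Action Crime Thriller",
    "Movie B": "Action Crime Thriller",
    "Movie C": "Comedy Romance",
    "Movie D": "Action Comedy",
}
vecs = dict(zip(toy_genres, tfidf_vectors(list(toy_genres.values()))))

print(round(cosine(vecs["Movie A"], vecs["Movie B"]), 3))  # identical genres
print(round(cosine(vecs["Movie A"], vecs["Movie C"]), 3))  # no shared genre
print(round(cosine(vecs["Movie A"], vecs["Movie D"]), 3))  # one shared genre
```

Movies with identical genre lists reach the maximum similarity, so with only about twenty genres many candidates can tie for the top score.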

In [None]:
movies2    #more than 1 genres>>>join together

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Children
10324,146684,Cosmic Scrat-tastrophe (2015),Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy


In [None]:
movies2.groupby("title").agg({"genres":"count"})   #Action|Drama|Fantasy....

Unnamed: 0_level_0,genres
title,Unnamed: 1_level_1
'71 (2014),4
'Hellboy': The Seeds of Creation (2004),5
'Round Midnight (1986),2
'Til There Was You (1997),2
"'burbs, The (1989)",1
...,...
loudQUIETloud: A Film About the Pixies (2006),1
xXx (2002),3
xXx: State of the Union (2005),3
¡Three Amigos! (1986),2


In [None]:
#Toy Story>>>Children, Animated, Adventurous, Funny ....Aladdin>>>Children,Animated,Adventurous

In [None]:
genres_sample=["ACTION","DRAMA","THRILLER","ROMANCE"]
" ".join(genres_sample)

'ACTION DRAMA THRILLER ROMANCE'

In [None]:
#create a string of genres in order to apply tf-idf
movies3=movies2.groupby("title").agg({"genres":lambda x:" ".join(list(x))}).reset_index()
movies3

Unnamed: 0,title,genres
0,'71 (2014),Action Drama Thriller War
1,'Hellboy': The Seeds of Creation (2004),Action Adventure Comedy Documentary Fantasy
2,'Round Midnight (1986),Drama Musical
3,'Til There Was You (1997),Drama Romance
4,"'burbs, The (1989)",Comedy
...,...,...
10315,loudQUIETloud: A Film About the Pixies (2006),Documentary
10316,xXx (2002),Action Crime Thriller
10317,xXx: State of the Union (2005),Action Crime Thriller
10318,¡Three Amigos! (1986),Comedy Western


In [None]:
#Tf-idf Vectorizer
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),min_df=0.01,stop_words='english')  #irrelevant
tf

In [None]:
#1>>>unigrams, 3>>>trigrams
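
As a rough illustration of what `ngram_range=(1,3)` means for a genre string, the sketch below enumerates word n-grams by hand. It assumes scikit-learn's default tokenisation (lowercasing, then tokens of two or more word characters) and does not model `min_df` or stop-word filtering; `word_ngrams` is a hypothetical helper.

```python
import re

def word_ngrams(text, n_min=1, n_max=3):
    """List all word n-grams of length n_min..n_max, shortest first."""
    # Lowercase, then keep runs of two or more word characters as tokens.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("Action Crime Thriller"))
print(word_ngrams("Sci-Fi"))
```

Under this tokenisation a hyphenated genre such as `Sci-Fi` is split into two tokens, and the bigrams and trigrams depend on the order in which genres were joined.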

In [None]:
tf_matrix=tf.fit_transform(movies3['genres'])

In [None]:
#cosine_similarity

In [None]:
cosine_sim=cosine_similarity(tf_matrix,tf_matrix)
cosine_sim


array([[1.        , 0.09549101, 0.04804803, ..., 0.2051825 , 0.        ,
        0.        ],
       [0.09549101, 1.        , 0.        , ..., 0.09658346, 0.08560542,
        0.06231727],
       [0.04804803, 0.        , 1.        , ..., 0.        , 0.        ,
        0.35526663],
       ...,
       [0.2051825 , 0.09658346, 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.        , 0.08560542, 0.        , ..., 0.        , 1.        ,
        0.11567848],
       [0.        , 0.06231727, 0.35526663, ..., 0.        , 0.11567848,
        1.        ]])

In [None]:
#user defined function
def recommendations_genre(movie_df,similarity_matrix,movie_title,topN=5):
    #Lookup from movie title to row position in movie_df
    indices = pd.Series(movie_df.index, index=movie_df['title'])
    #Row position of the target movie
    index = indices[movie_title]
    #Pair every movie's row position with its cosine similarity to the target
    cosine_similarity_scores = list(enumerate(similarity_matrix[index]))
    #Sort by score in descending order (the sort is stable, so ties keep row order)
    cosine_similarity_scores = sorted(cosine_similarity_scores, key=lambda x: x[1], reverse=True)
    #Skip the first entry and keep topN+1 candidates; the target itself may still be among them
    cosine_similarity_scores = cosine_similarity_scores[1:topN+2]
    #Extracting matched movies
    matching_movies = [i[0] for i in cosine_similarity_scores]
    matches_df=movie_df.iloc[matching_movies]
    #Drop the target movie if it is still present
    matches_df=matches_df[matches_df['title']!=movie_title]
    #Refactoring output (rename returns a new frame, so movie_df is left untouched)
    matches_df=matches_df.rename(columns={'title':'Movie Title'})
    matches_df['S.No']=range(1,len(matches_df)+1)
    matches_df.index=range(len(matches_df))
    return matches_df[['S.No','Movie Title']].head(topN)
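
When several movies share exactly the same genre string, their similarity to the target is tied at the maximum, so the target is not guaranteed to sort first and skipping the first entry can discard a legitimate match. One possible alternative, sketched here, drops the target by its index before taking the top N. The function name `top_n_similar` and the toy data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_n_similar(movie_df, similarity_matrix, movie_title, topN=5):
    """Return the topN titles most similar to movie_title, excluding it."""
    # Row position of the target; assumes a default RangeIndex and unique titles.
    target = movie_df.index[movie_df["title"] == movie_title][0]
    scores = pd.Series(similarity_matrix[target], index=movie_df.index)
    # Remove the target itself, then sort; mergesort keeps ties in row order.
    scores = scores.drop(target).sort_values(ascending=False, kind="mergesort")
    top = scores.head(topN)
    return pd.DataFrame({
        "S.No": range(1, len(top) + 1),
        "Movie Title": movie_df.loc[top.index, "title"].to_numpy(),
    })

toy_df = pd.DataFrame({"title": ["A", "B", "C", "D"]})
toy_sim = np.array([
    [1.0, 1.0, 0.2, 0.0],
    [1.0, 1.0, 0.2, 0.0],
    [0.2, 0.2, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])
result = top_n_similar(toy_df, toy_sim, "B", topN=2)
print(result)
```

In this toy case movie A is tied with the target B and sits before it in row order, so a slice that always skips the first sorted entry would lose A, while the index-based exclusion keeps it.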



In [None]:
#test case 1
movie_title="Silence of the Lambs, The (1991)"
topN=5
recommendations_genre(movie_df=movies3,similarity_matrix=cosine_sim,movie_title=movie_title,topN=topN)

Unnamed: 0,S.No,Movie Title
0,1,"Collector, The (2009)"
1,2,Cure (1997)
2,3,Deliver Us from Evil (2014)
3,4,FearDotCom (a.k.a. Fear.com) (a.k.a. Fear Dot ...
4,5,Fright (1972)


In [None]:
#test case 2
movie_title="Waiting to Exhale (1995)"
topN=20
recommendations_genre(movie_df=movies3,similarity_matrix=cosine_sim,movie_title=movie_title,topN=topN)

Unnamed: 0,S.No,Movie Title
0,1,10 Items or Less (2006)
1,2,101 Reykjavik (101 Reykjavík) (2000)
2,3,2 Days in Paris (2007)
3,4,3 Idiots (2009)
4,5,About Last Night... (1986)
5,6,About a Boy (2002)
6,7,"Accidental Tourist, The (1988)"
7,8,Adaptation (2002)
8,9,After Sex (2007)
9,10,Alex and Emma (2003)


In [None]:
#ipywidgets interactions
from ipywidgets import *

In [None]:
#popularity recommendations
#input informations
#dropdown menu
genres=Dropdown(options=list(set(movies2.genres)),description='Genres',style={"description_width":"initial"})
num_reviews=IntText(description="MIN REVIEWS",style={"description_width":"initial"})
num_recommendations1=IntText(description="Number Of Recommendations",style={"description_width":"initial"})

#stacking inputs in different tabs
b1=Button(description="Recommend Me",style={"description_width":"initial"})
h1=HBox([num_reviews,num_recommendations1])
popularity_tab=VBox([genres,h1,b1])


#Content based system
title=Textarea(description="Movie Title",style={"description_width":"initial"})
num_recommendations2=IntText(description="Number Of Recommendations",style={"description_width":"initial"})
#stacking inputs in different tabs


h2=HBox([title,num_recommendations2])
b2=Button(description="Recommend Me",style={"description_width":"initial"})
content_tab=VBox([h2,b2])

#Create tabs
all_tabs=[popularity_tab,content_tab]
tabs=Tab(all_tabs)

#title for the tabs
names=['Popularity Based','Content Based']
#plain loop; the loop variable must not be called `title`, which is the Textarea widget above
for i,tab_name in enumerate(names):
    tabs.set_title(i,tab_name)

display(tabs)

Tab(children=(VBox(children=(Dropdown(description='Genres', options=('Comedy', 'Documentary', 'Mystery', 'Dram…

In [None]:
#set up events for this respond button >>>>for both the tabs
#popularity tab
def b1_clicked(b):
    global output
    output=TopNPopularMovies(genre=genres.value,
                             num_ratings_threshold=num_reviews.value,
                             topN=num_recommendations1.value)
b1.on_click(b1_clicked)


##content tabs

def b2_clicked(b):
    global output
    result=recommendations_genre(movie_df=movies3,
                                        similarity_matrix=cosine_sim,movie_title=title.value,
                                        topN=num_recommendations2.value)
    output=result
b2.on_click(b2_clicked)


In [None]:
display(tabs)

Tab(children=(VBox(children=(Dropdown(description='Genres', index=4, options=('Comedy', 'Documentary', 'Myster…

In [None]:
output

Unnamed: 0,SNo.,Movie Title,Average Movie Rating,Number of Reviews
0,1,Shall We Dance (1937),4.357143,7
1,2,Top Hat (1935),4.307692,13
2,3,"Lives of Others, The (Das leben der Anderen) (...",4.306452,31
3,4,Notorious (1946),4.305556,18
4,5,Harold and Maude (1971),4.287879,33
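
The third objective, a collaborative recommender based on the K most similar users, is not implemented in the cells above. The following is only a minimal sketch of one common approach (user-user cosine similarity on a mean-centred user-item matrix), run on made-up ratings. The function name `recommend_user_based`, its parameters, and the scoring rule are assumptions, not the author's method.

```python
import numpy as np
import pandas as pd

def recommend_user_based(ratings_df, user_id, k=2, topN=3):
    """Recommend unseen movies for user_id from its k most similar users."""
    # Users as rows, movies as columns; NaN where a user has not rated.
    matrix = ratings_df.pivot_table(index="userId", columns="movieId", values="rating")
    # Centre each user's ratings on their own mean, then treat missing as 0.
    centred = matrix.sub(matrix.mean(axis=1), axis=0).fillna(0.0)
    norms = np.linalg.norm(centred.to_numpy(), axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for constant raters
    unit = centred.div(norms, axis=0)
    # Cosine similarity of the target user to every other user.
    sims = unit.dot(unit.loc[user_id]).drop(user_id)
    neighbours = sims.sort_values(ascending=False).head(k)
    # Score each movie by the similarity-weighted centred ratings of the neighbours.
    scores = centred.loc[neighbours.index].mul(neighbours, axis=0).sum(axis=0)
    # Only recommend movies the target user has not rated yet.
    unseen = matrix.columns[matrix.loc[user_id].isna()]
    return scores.loc[unseen].sort_values(ascending=False).head(topN)

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 10, 20, 30, 40, 10, 20, 50],
    "rating":  [5.0, 1.0, 4.0, 5.0, 1.0, 4.0, 5.0, 1.0, 5.0, 5.0],
})
recs = recommend_user_based(toy_ratings, user_id=1, k=1, topN=2)
print(recs)
```

Here user 2 rates the shared movies like user 1 and user 3 rates them in the opposite way, so with `k=1` only user 2 contributes and movie 40, which user 2 rated highly, ranks first. On the real `ratings` frame the matrix would be 668 users wide by roughly ten thousand movies and very sparse, so a minimum-overlap rule between users would probably be needed as well.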
