# Recommender or Recommendation system
A recommender system is a subclass of `information filtering` system that provides suggestions for the items most pertinent to a particular user.

**Cite**: https://en.wikipedia.org/wiki/Information_filtering_system

Three types of recommender systems are most commonly used:
* Popularity Based Recommender System
* Content Based Recommender System
* Collaborative Filtering based Recommender System

## Popularity based recommender system
A popularity based recommender system recommends the most popular items to every user. The most popular items are those used by the largest number of users. For example, the Netflix trending list recommends the most popular items in the user's area or around the world.
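
As a minimal sketch of this idea, assume a hypothetical interaction log with `user_id` and `item_id` columns (these names and values are made up for illustration and are not part of the dataset used later). Popularity can then be measured by counting the distinct users per item:

```python
import pandas as pd

# Hypothetical interaction log: one row per (user, item) interaction
interactions = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3, 4],
    'item_id': ['A', 'B', 'A', 'C', 'A', 'B', 'A'],
})

def most_popular(log, n=2):
    """Return the n items used by the largest number of distinct users."""
    users_per_item = log.groupby('item_id')['user_id'].nunique()
    return users_per_item.sort_values(ascending=False).head(n).index.tolist()

top_items = most_popular(interactions)
print(top_items)
```

Every user receives the same list here, which is the main limitation of this approach: nothing in it depends on the individual user's taste.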

## Content Based Recommender System
A content based recommender system recommends items similar to those the user has used in the past. For example, YouTube recommends videos based on the user's watch history.

## Collaborative Filtering based Recommender System
A collaborative filtering based recommender system builds a profile of each user from the items that user likes. It then recommends the items liked by one user to other users with a similar profile. For example, Google creates our profile based on our browsing history and then shows us the relevant ads.
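
The idea can be sketched with a tiny user-item matrix. The data, the item names and the helper `recommend_for` below are hypothetical, and picking a single nearest neighbour by cosine similarity is only one simple way to compare profiles, not a definitive implementation:

```python
import numpy as np

# Hypothetical user-item matrix: rows are users, columns are items,
# 1 means the user liked the item, 0 means no recorded interaction.
items = ['item_a', 'item_b', 'item_c', 'item_d']
likes = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
])

def recommend_for(user, likes, items):
    """Recommend items liked by the most similar other user."""
    # Cosine similarity between the target user and every user
    # (assumes every user has liked at least one item, so no norm is zero)
    norms = np.linalg.norm(likes, axis=1)
    sims = likes @ likes[user] / (norms * norms[user])
    sims[user] = -1.0                      # exclude the user themselves
    neighbour = int(np.argmax(sims))
    # Items the neighbour liked that the target user has not
    mask = (likes[neighbour] == 1) & (likes[user] == 0)
    return [item for item, keep in zip(items, mask) if keep]

recs = recommend_for(0, likes, items)
print(recs)
```

User 0 shares two liked items with user 1 and none with user 2, so user 1 is chosen as the neighbour and the one item user 0 has not seen from that neighbour is recommended.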

## Content Based Recommender System - Python
* [IMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset)


In [4]:
import numpy as np
import pandas as pd

In [21]:
data = pd.read_csv('movie_metadata.csv')
data.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [36]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 29 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   color                      5024 non-null   object 
 1   director_name              5043 non-null   object 
 2   num_critic_for_reviews     4993 non-null   float64
 3   duration                   5028 non-null   float64
 4   director_facebook_likes    4939 non-null   float64
 5   actor_3_facebook_likes     5020 non-null   float64
 6   actor_2_name               5043 non-null   object 
 7   actor_1_facebook_likes     5036 non-null   float64
 8   gross                      4159 non-null   float64
 9   genres                     5043 non-null   object 
 10  actor_1_name               5043 non-null   object 
 11  movie_title                5043 non-null   object 
 12  num_voted_users            5043 non-null   int64  
 13  cast_total_facebook_likes  5043 non-null   int64

In [22]:
data.isnull().sum(axis=0)

color                         19
director_name                104
num_critic_for_reviews        50
duration                      15
director_facebook_likes      104
actor_3_facebook_likes        23
actor_2_name                  13
actor_1_facebook_likes         7
gross                        884
genres                         0
actor_1_name                   7
movie_title                    0
num_voted_users                0
cast_total_facebook_likes      0
actor_3_name                  23
facenumber_in_poster          13
plot_keywords                153
movie_imdb_link                0
num_user_for_reviews          21
language                      12
country                        5
content_rating               303
budget                       492
title_year                   108
actor_2_facebook_likes        13
imdb_score                     0
aspect_ratio                 329
movie_facebook_likes           0
dtype: int64

In [23]:
## Fill the null values in the name columns with a placeholder
data['actor_1_name']  = data['actor_1_name'].fillna('unknown')
data['actor_2_name']  = data['actor_2_name'].fillna('unknown')
data['actor_3_name']  = data['actor_3_name'].fillna('unknown')
data['director_name'] = data['director_name'].fillna('unknown')

In [26]:
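## Turn the pipe-separated genres into space-separated words; regex=False treats '|' literally
data['genres']      = data['genres'].str.replace('|', ' ', regex=False)
## Lower-case the titles so that lookups are case-insensitive
data['movie_title'] = data['movie_title'].str.lower()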
data['genres'].head()

0    Action Adventure Fantasy Sci-Fi
1           Action Adventure Fantasy
2          Action Adventure Thriller
3                    Action Thriller
4                        Documentary
Name: genres, dtype: object

In [27]:
data['movie_title'][0]

'avatar\xa0'

In [28]:
## Remove the trailing non-breaking space ('\xa0') from each movie title
data['movie_title'] = data['movie_title'].str.strip()
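
Slicing off the last character with `.str[:-1]` is only safe if every title ends with exactly one unwanted character. As a small illustration on hypothetical titles (not rows taken from the dataset), `.str.strip()` removes a trailing `\xa0` without truncating titles that do not have one:

```python
import pandas as pd

# Hypothetical titles: two end with a non-breaking space, one does not
titles = pd.Series(['avatar\xa0', 'spectre\xa0', 'up'])

sliced   = titles.str[:-1]     # always drops the last character
stripped = titles.str.strip()  # drops only leading/trailing whitespace

print(sliced.tolist())
print(stripped.tolist())
```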

In [29]:
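## Combine the actor names, genres and director name into one text feature.
## Note: no ' ' is added between genres and director_name, so the last genre and the
## director's first name merge into a single token (see 'DocumentaryDoug' in row 4 below).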
data['combitions'] = data['actor_1_name'] + ' ' + data['actor_2_name'] + ' ' + data['actor_3_name'] + ' ' + data['genres'] + data['director_name']
data['combitions'].head()

0    CCH Pounder Joel David Moore Wes Studi Action ...
1    Johnny Depp Orlando Bloom Jack Davenport Actio...
2    Christoph Waltz Rory Kinnear Stephanie Sigman ...
3    Tom Hardy Christian Bale Joseph Gordon-Levitt ...
4    Doug Walker Rob Walker unknown DocumentaryDoug...
Name: combitions, dtype: object

In [32]:
# %pip install scikit-learn

In [33]:
## make count matrix and similarity score matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [34]:
cv = CountVectorizer()
count_mat = cv.fit_transform(data['combitions'])
sim_mat   = cosine_similarity(count_mat)
sim_mat

array([[1.       , 0.1754116, 0.1754116, ..., 0.0877058, 0.       ,
        0.       ],
       [0.1754116, 1.       , 0.2      , ..., 0.       , 0.       ,
        0.       ],
       [0.1754116, 0.2      , 1.       , ..., 0.       , 0.       ,
        0.       ],
       ...,
       [0.0877058, 0.       , 0.       , ..., 1.       , 0.1      ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.1      , 1.       ,
        0.       ],
       [0.       , 0.       , 0.       , ..., 0.       , 0.       ,
        1.       ]])

In [35]:
sim_mat.shape

(5043, 5043)
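
Each entry `sim_mat[i, j]` is the cosine of the angle between the count vectors of rows `i` and `j`: their dot product divided by the product of their lengths. The sketch below reproduces that calculation with NumPy on three hypothetical feature strings. Splitting on whitespace is a simplification of `CountVectorizer`'s default tokenizer, which also lower-cases the text, splits on punctuation and drops single-character tokens.

```python
import numpy as np

# Toy "combined feature" strings (hypothetical, not rows from the dataset)
docs = [
    'tom hardy action thriller',
    'tom hanks action drama',
    'emma stone comedy romance',
]

# Build a vocabulary and a count matrix by splitting on whitespace
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array([[doc.split().count(word) for word in vocab] for doc in docs])

# Cosine similarity: dot product divided by the product of vector lengths
norms = np.linalg.norm(counts, axis=1)
sim = (counts @ counts.T) / np.outer(norms, norms)

print(np.round(sim, 2))
```

The first two strings share two of their four words, so their similarity is 2 / (2 * 2) = 0.5. The third string shares no words with the others and scores 0 against both.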

In [37]:
# Find the index of the movie the user likes in the dataset, then fetch the row at that index from sim_mat.
# That row holds the similarity score of every movie to the chosen movie.
# E.g.
mov = 'AVATAR'
mov = mov.lower()
mov in data['movie_title'].unique()

True

In [40]:
idx = data.loc[data['movie_title'] == mov].index[0]
idx

0

In [42]:
# Sort the row of sim_mat by similarity score, highest first
# Enumerate the row first, so we don't lose the indices
sim_list = list(enumerate(sim_mat[idx]))
sim_list = sorted(sim_list, key=lambda x: x[1], reverse=True)

In [43]:
# Keep the 7 most similar movies; position 0 is skipped because it is the movie itself
sim_list = sim_list[1:8]
sim_list

[(2486, 0.41812100500354543),
 (637, 0.4003203845127179),
 (1127, 0.4003203845127179),
 (288, 0.3508232077228117),
 (3575, 0.3508232077228117),
 (39, 0.3344968040028363),
 (95, 0.3344968040028363)]

In [44]:
# Wrap the steps above in a single function
def content_based_recommender(movie_name):
    movie_name = movie_name.lower()
    if movie_name not in data['movie_title'].unique():
        print('Movie is not available')
        return
    idx = data.loc[data['movie_title'] == movie_name].index[0]
    # Pair each similarity score with its row index, then sort by score (descending)
    sim_list = sorted(enumerate(sim_mat[idx]), key=lambda x: x[1], reverse=True)
    # Skip position 0 (the movie itself) and print the next 7 titles
    for i, _ in sim_list[1:8]:
        print(data['movie_title'][i])

In [47]:
content_based_recommender('Batman')

batman returns
raiders of the lost ark
batman forever
the heat
tango & cash
flash gordon
haywire
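
The function above prints its results and assumes the queried movie sorts into position 0. A possible variant, sketched here on hypothetical titles and similarity scores, returns a list and excludes the query by its index instead. The helper name `top_similar` and its parameters are assumptions for illustration only.

```python
import numpy as np

def top_similar(titles, sim, query, n=2):
    """Return the n titles most similar to `query`, excluding `query` itself."""
    query = query.lower()
    if query not in titles:
        return []
    idx = titles.index(query)
    # Indices sorted by descending similarity; a stable sort keeps ties in row order
    order = np.argsort(-sim[idx], kind='stable')
    return [titles[i] for i in order if i != idx][:n]

# Hypothetical titles and similarity scores, for illustration only
titles = ['movie a', 'movie b', 'movie c', 'movie d']
sim = np.array([
    [1.0, 0.2, 0.7, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.7, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

print(top_similar(titles, sim, 'Movie A'))
```

Returning a list rather than printing makes the result easier to reuse, and an unknown title yields an empty list instead of a printed message.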
