# Explain the difference between a content-based and a collaborative recommender system.

## Content-based filtering:
It relies on __similarities between the features of items__. It recommends items to a user based on the items that the same user previously rated highest, so a list of features describing each item has to be generated.
* Each item has an __item profile__ (items can be retail products, people, or text):
> * Movies: genres, author, director, title, cast.
> * People: set of friends.
> * Text: set of words (TF-IDF or any other document representation).
* User profiles: we build a __user profile__ from the profiles of the items the user rated. This can be done through:
> * __Simple__: weighted average of the rated items' profiles.
> * __Variant__: normalize the weights using the user's average rating.
People differ in how they rate movies: for some, 4 out of 5 means an extraordinary movie, while for others it is just an average rating.
This is accounted for by subtracting the baseline (the user's average rating) from each of their ratings.


* Making predictions:
We use __cosine similarity__ between a user profile and the profiles of candidate movies, and recommend the movies with the highest similarity to the user profile.

* __Pros__: 
> * No need for other users’ data.
> * Able to recommend to users with unique taste.
> * Able to recommend new items, i.e. no first-rater problem.
> * Provides an explanation for recommended items.
* __Cons__:
> * Finding the appropriate features is hard.
> * Overspecialization: never recommends items outside the user's profile.
> * Cold start problem for new users.
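
The steps above can be sketched on a toy example. The feature matrix, the ratings, and every variable name below are made up for illustration; the profile uses the mean-centered variant and predictions use cosine similarity.

```python
import numpy as np

# Hypothetical item profiles: 4 movies x 3 binary genre features
item_profiles = np.array([
    [1.0, 0.0, 0.0],   # movie 0: action
    [1.0, 1.0, 0.0],   # movie 1: action + comedy
    [0.0, 0.0, 1.0],   # movie 2: drama
    [0.0, 1.0, 0.0],   # movie 3: comedy (not rated yet)
])

# Ratings of one user for movies 0, 1 and 2
rated_idx = np.array([0, 1, 2])
ratings = np.array([5.0, 4.0, 1.0])

# Variant: subtract the user's average rating, so disliked items get negative weights
weights = ratings - ratings.mean()

# User profile = average of the rated items' profiles, weighted by centered rating
user_profile = (weights[:, None] * item_profiles[rated_idx]).mean(axis=0)

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity of the user profile to every movie profile
scores = np.array([cosine(user_profile, p) for p in item_profiles])
```

In this toy setup the unrated comedy (movie 3) receives a positive score, while the drama the user rated low receives a negative one.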



## Collaborative filtering:
The underlying idea is that users like what like-minded users like. Two users are considered like-minded when they rate items similarly. Once like-minded users are identified, items that one user rated positively are recommended to the other user, and vice versa.

* __Rating prediction__:
To predict the rating of a certain item i for a certain user x, we need to find the N users most similar to x that have rated i.
To make the prediction we can use:
> * The average rating given to item i by those N users (this ignores the actual similarity values between users).
> * The weighted average of those users' ratings of the item, where each weight is the similarity between user x and the rating user y.
* To define __similarity__ we can use:
> * Jaccard Similarity.
> * Cosine Similarity.
> * Centered Cosine (Pearson Correlation).
* Matrix factorization based CF:
> >Concretely, matrix factorization based CF aims at two goals. The first goal is to __reduce the dimension__ of the rating matrix. The second goal is to discover __latent features__ underlying the rating matrix, which are then used to make recommendations. 
> A small number of latent features (on the order of 13 to 20) is often enough to represent the rating matrix.

* Pros:
> * CF uses real quality assessments (actual user ratings) rather than item descriptions.
> * Can adapt to any type of item (movies, books, articles, etc.).
> * Requires no feature selection, which removes the problem of finding the most relevant set of features.
* Cons:
> * Cold Start: Need enough users in the system to find a match.
> * Sparsity: Most users have not rated most items.
> * First Rater: Cannot recommend new items.
> * Popularity bias: Tends to recommend popular items. Amazon named this phenomenon “The Harry Potter effect”. This is not a bad thing in general but it can crowd out unique recommendations that can be made to specific users.
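
The similarity-weighted rating prediction described above can be sketched as follows. The rating matrix is invented, `centered_cosine` and `predict` are hypothetical helper names, and keeping only neighbours with positive similarity is an assumption of this sketch rather than part of the description.

```python
import numpy as np

# Hypothetical user-item matrix; np.nan marks "not rated"
R = np.array([
    [5.0, 4.0, np.nan, 1.0],   # user x (target), item 2 unknown
    [5.0, 5.0, 4.0,    1.0],   # like-minded user
    [1.0, 2.0, 1.0,    5.0],   # opposite taste
    [4.0, 4.0, 5.0,    2.0],   # like-minded user
])

def centered_cosine(a, b):
    """Pearson-style similarity over the items both users rated."""
    both = ~np.isnan(a) & ~np.isnan(b)
    a_c = a[both] - np.nanmean(a)
    b_c = b[both] - np.nanmean(b)
    denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
    return float(a_c @ b_c / denom) if denom else 0.0

def predict(R, x, i, n_neighbors=2):
    """Similarity-weighted average of the top-N neighbours' ratings of item i."""
    candidates = [y for y in range(len(R)) if y != x and not np.isnan(R[y, i])]
    sims = {y: centered_cosine(R[x], R[y]) for y in candidates}
    # keep the N most similar users that have a positive similarity
    top = sorted((y for y in sims if sims[y] > 0), key=sims.get, reverse=True)[:n_neighbors]
    if not top:
        return float(np.nanmean(R[x]))  # fall back to the user's own mean
    w = np.array([sims[y] for y in top])
    r = np.array([R[y, i] for y in top])
    return float(w @ r / w.sum())

prediction = predict(R, x=0, i=2)
```

Here the two like-minded users rated item 2 with 4 and 5, so the prediction for user x lands between those two values, closer to the rating of the more similar user.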


# Loading, preprocessing, and transformations

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/the-movies-dataset/ratings.csv
/kaggle/input/the-movies-dataset/links_small.csv
/kaggle/input/the-movies-dataset/credits.csv
/kaggle/input/the-movies-dataset/keywords.csv
/kaggle/input/the-movies-dataset/movies_metadata.csv
/kaggle/input/the-movies-dataset/ratings_small.csv
/kaggle/input/the-movies-dataset/links.csv


In [None]:
ratings = pd.read_csv('/kaggle/input/the-movies-dataset/ratings_small.csv')
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


In [None]:
movies = pd.read_csv('/kaggle/input/the-movies-dataset/movies_metadata.csv')
movies


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [None]:
links = pd.read_csv('/kaggle/input/the-movies-dataset/links_small.csv')
links

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0
...,...,...,...
9120,162672,3859980,402672.0
9121,163056,4262980,315011.0
9122,163949,2531318,391698.0
9123,164977,27660,137608.0


## Checking for null values

In [None]:
movies.isna().sum()

adult                        0
belongs_to_collection    40972
budget                       0
genres                       0
homepage                 37684
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25054
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

## Deleting rows in movies dataset with invalid id entries

In [None]:
# wrong values will become nan
movies['id'] = pd.to_numeric(movies['id'], errors='coerce')

# dropping these rows (.copy() so later column assignments act on an independent frame)
movies = movies[movies['id'].notnull()].copy()

## Joining with links dataset to get the id that can be found in the rating dataset

In [None]:
# Cast tmdbId to int and store it with dtype object
# we fill with 0 as there's no movie with id 0 in movies dataset so it doesn't matter 
links['tmdbId'] = links['tmdbId'].fillna(0).map(int)
links['tmdbId'] = links['tmdbId'].astype(object)  # np.object was removed in NumPy 1.24

movies['id'] = movies['id'].map(int)

movies = pd.merge(links,movies,right_on='id',left_on='tmdbId',how='inner')
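
The cleaning and join above can be tried on a miniature example. The two frames below are hypothetical stand-ins for `movies_metadata.csv` and `links_small.csv`, with one deliberately invalid id and one missing `tmdbId`.

```python
import pandas as pd

# Hypothetical miniature of movies_metadata: one id is not a number
movies_demo = pd.DataFrame({'id': ['862', '8844', '1997-08-20'],
                            'title': ['Toy Story', 'Jumanji', 'bad row']})
links_demo = pd.DataFrame({'movieId': [1, 2, 3],
                           'tmdbId': [862.0, 8844.0, None]})

# Non-numeric ids become NaN and the affected rows are dropped
movies_demo['id'] = pd.to_numeric(movies_demo['id'], errors='coerce')
movies_demo = movies_demo[movies_demo['id'].notnull()].copy()
movies_demo['id'] = movies_demo['id'].astype(int)

# Missing tmdbId values become 0, which matches no movie id
links_demo['tmdbId'] = links_demo['tmdbId'].fillna(0).astype(int)

# The inner join keeps only movies present in both frames
merged = pd.merge(links_demo, movies_demo, left_on='tmdbId', right_on='id', how='inner')
```

Only the two rows with a valid, matching id survive the inner join; the row with the missing `tmdbId` and the row with the malformed id both disappear.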



## Dropping some columns

In [None]:
movies['adult'].value_counts() # adult column is almost completely false so we drop it

False    9099
Name: adult, dtype: int64

In [None]:
movies['status'].value_counts() # status column is almost the same val so we drop it

Released           9078
Rumored              11
Post Production       7
In Production         1
Name: status, dtype: int64

In [None]:
movies['original_language'].value_counts()  # original_language column is almost the same val so we drop it

en    7955
fr     274
ja     188
de     114
it      98
es      88
cn      47
sv      42
zh      39
da      31
ko      31
ru      27
pt      25
nl      16
fa      15
hi      13
fi      13
cs      10
tr       8
th       7
no       7
pl       6
he       6
el       5
sr       4
hu       4
bn       3
id       3
xx       3
ro       2
vi       2
ar       2
nb       2
af       1
bs       1
et       1
lo       1
uk       1
sk       1
ps       1
is       1
bo       1
Name: original_language, dtype: int64

In [None]:
# to be used for later info about predicted movies
movies_meta = movies.drop(['imdbId','id','imdb_id'],axis=1)

# drop columns that are not needed as features
movies = movies.drop(['imdbId','tmdbId','id','adult','belongs_to_collection','homepage','imdb_id','original_language'
             ,'original_title','poster_path','production_companies','production_countries'
            ,'spoken_languages','status', 'tagline', 'title','video'],axis = 1)

ratings = ratings.drop('timestamp',axis=1)

# Extract year

In [None]:
def get_year(date):
    if not date or len(str(date))<10:
        return np.nan
    return str(date)[:4]

movies['release_date'] = movies['release_date'].map(get_year).astype(float)  # np.float was removed in NumPy 1.24
movies['release_date'] = movies['release_date'].fillna(movies['release_date'].median())

# Scaling Numerical Columns

In [None]:
# Adding profit column instead of revenue and budget
movies['budget'] = pd.to_numeric(movies['budget'], errors='coerce') # to remove erroneous values
movies['profit'] = movies['revenue'].fillna(0) - movies['budget'].fillna(0)
movies = movies.drop(['budget','revenue'],axis = 1)

In [None]:
for col in ['profit','popularity','runtime','vote_count','release_date','vote_average']:
    movies[col] = pd.to_numeric(movies[col], errors='coerce')

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(movies[['profit','popularity','runtime','vote_count','release_date','vote_average']])
movies[['profit','popularity','runtime','vote_count','release_date','vote_average']] = scaler.transform(movies[['profit','popularity','runtime','vote_count','release_date','vote_average']])
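
As a quick check of what the scaling step does, the sketch below compares `StandardScaler` with the manual formula on made-up numbers: each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up numeric columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaled = StandardScaler().fit_transform(X)

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both columns have mean 0 and standard deviation 1, so no single feature dominates a later similarity computation just because of its units.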

In [None]:
movies

Unnamed: 0,movieId,genres,overview,popularity,release_date,runtime,vote_average,vote_count,profit
0,1,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","Led by Woody, Andy's toys live happily in his ...",1.567887,0.157030,-0.812707,1.286735,5.000124,2.904924
1,2,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",When siblings Judy and Peter discover an encha...,1.041321,0.157030,-0.054501,0.517957,1.983947,1.543759
2,3,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",A family wedding reignites the ancient feud be...,0.475116,0.157030,-0.153397,0.133569,-0.348014,-0.303392
3,4,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","Cheated on, mistreated and stepped on, the wom...",-0.363455,0.157030,0.703706,-0.250820,-0.406288,0.307840
4,5,"[{'id': 35, 'name': 'Comedy'}]",Just when George Banks has recovered from his ...,0.120038,0.157030,0.011431,-0.635209,-0.266632,0.411749
...,...,...,...,...,...,...,...,...,...
9094,161944,"[{'id': 18, 'name': 'Drama'}]",A man must cope with the loss of his wife and ...,-0.771400,0.466432,-0.680845,0.614055,-0.439444,-0.378101
9095,162542,"[{'id': 53, 'name': 'Thriller'}, {'id': 10749,...","Rustom Pavri, an honourable officer of the Ind...",0.007454,1.239936,1.461913,0.902346,-0.415331,-0.312730
9096,162672,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...","Village lad Sarman is drawn to big, bad Mohenj...",-0.623581,1.239936,1.626741,0.325763,-0.414326,-0.292839
9097,163056,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",From the mind behind Evangelion comes a hit la...,0.215925,1.239936,0.472948,0.229666,-0.287731,0.275602


# Genres

In [None]:
import re

## get only the ids of the genres (every run of digits in the raw genres string)
def genre_ids(text):
    return re.findall(r'\d+', text)

## create genres_ids, the sorted list of all unique genre ids
genres_ids = set()
for val in movies['genres']:
    genres_ids.update(int(x) for x in genre_ids(val))
genres_ids = sorted(genres_ids)

genres_indx = list(range(len(genres_ids)))

## a dict that holds the id as key and its index as val
genres_dict = dict(zip(genres_ids, genres_indx))
genres_dict

# create a placeholder for the genres with columns = total unique genres
genres_cols = np.zeros((len(movies),len(genres_ids)))


movies['genres'] = movies['genres'].map(genre_ids)

for ind in range(len(movies)):
    for gen in movies['genres'][ind]:
        genres_cols[ind,genres_dict[int(gen)]] = 1
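
An alternative way to build the same kind of multi-hot genre matrix is sketched below. It assumes the `genres` strings are valid Python literals (as in the rows shown earlier) and swaps the regex for `ast.literal_eval` plus scikit-learn's `MultiLabelBinarizer`; the three sample strings are made up.

```python
import ast
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up genre strings in the same format as the genres column
raw_genres = [
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 35, 'name': 'Comedy'}]",
    "[]",
]

# Parse each string into a list of dicts and keep only the genre ids
parsed = [[g['id'] for g in ast.literal_eval(s)] for s in raw_genres]

# One column per unique genre id, 1 where the movie has that genre
mlb = MultiLabelBinarizer()
genres_demo = mlb.fit_transform(parsed)
```

`mlb.classes_` holds the genre id behind each column, which plays the role of `genres_dict` above.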

In [None]:
# final array for genres
genres_cols

array([[0., 0., 1., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
movies = movies.drop('genres',axis = 1)

# TFIDF overview column

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(max_features = 500, stop_words='english')

#Replace NaN with an empty string
movies['overview'] = movies['overview'].fillna('')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies['overview'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(9099, 500)
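
To see what the vectorizer produces, here is the same configuration on three made-up overviews with a smaller `max_features`. The checks rely on scikit-learn's default `norm='l2'`, which scales every non-empty row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up overviews; the last one mimics a missing overview filled with ''
docs = ["the toys live happily in the room",
        "a family wedding and the toys",
        ""]

vec = TfidfVectorizer(max_features=5, stop_words='english')
X = vec.fit_transform(docs)

# Length of each TF-IDF row vector
row_norms = np.linalg.norm(X.toarray(), axis=1)
```

Stop words such as 'the' never enter the vocabulary, and an empty overview becomes an all-zero row.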

In [None]:
movies = movies.drop('overview',axis=1)

# Merging the overview TF-IDF representation, the genre columns, and the movies frame itself

In [None]:
# Making sure shapes are correct and everything is fine
print(tfidf_matrix.shape)
print(genres_cols.shape)
print(movies.shape)

(9099, 500)
(9099, 20)
(9099, 7)


In [None]:
# Merging the overview representation and the genres representations as well
overview_genres = np.concatenate((tfidf_matrix.toarray(),genres_cols), axis=1)
del tfidf_matrix, genres_cols

overview_genres = pd.DataFrame(overview_genres)

In [None]:
movies = pd.concat([movies.reset_index(drop=True),overview_genres.reset_index(drop=True)], axis=1)

# Joining ratings with movies 

In [None]:
# Just making the id as type object
ratings['movieId'] = ratings['movieId'].astype(object)  # np.object was removed in NumPy 1.24

full_set = pd.merge(ratings,movies,on='movieId',how='inner')

In [None]:
full_set

Unnamed: 0,userId,movieId,rating,popularity,release_date,runtime,vote_average,vote_count,profit,0,...,510,511,512,513,514,515,516,517,518,519
0,1,31,2.5,0.236834,0.157030,-0.219328,0.037471,-0.190273,1.377557,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,31,3.0,0.236834,0.157030,-0.219328,0.037471,-0.190273,1.377557,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,31,31,4.0,0.236834,0.157030,-0.219328,0.037471,-0.190273,1.377557,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,32,31,4.0,0.236834,0.157030,-0.219328,0.037471,-0.190273,1.377557,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,36,31,3.0,0.236834,0.157030,-0.219328,0.037471,-0.190273,1.377557,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99845,664,64997,2.5,-0.635250,0.672700,-0.186363,-1.884472,-0.430401,-0.312730,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99846,664,72380,3.5,0.337591,0.878967,0.308120,-0.923500,0.172432,-0.272261,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99847,665,129,3.0,-0.700919,0.208597,-0.351190,0.614055,-0.438439,-0.303392,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
99848,665,4736,1.0,-0.188299,0.466432,0.077362,-1.500084,-0.377151,-0.436991,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


# Collaborative Filtering using Ratings 

In [None]:
ratings

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0
...,...,...,...
99999,671,6268,2.5
100000,671,6269,4.0
100001,671,6365,4.0
100002,671,6385,2.5


In [None]:
# Create the user movie matrix
user_movie_mat = ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
user_movie_mat

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161084,161155,161594,161830,161918,161944,162376,162542,162672,163949
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
668,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
669,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
670,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
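
The sparsity mentioned among the cons of collaborative filtering can be measured directly on such a pivot table. The sketch below uses a made-up ratings frame and leaves unrated cells as NaN instead of filling them with 0.

```python
import pandas as pd

# Made-up ratings: 3 users, 3 movies, 4 ratings in total
ratings_demo = pd.DataFrame({'userId': [1, 1, 2, 3],
                             'movieId': [10, 20, 10, 30],
                             'rating': [4.0, 3.0, 5.0, 2.0]})

# Without fillna(0), unrated user/movie pairs stay NaN
mat = ratings_demo.pivot_table(index='userId', columns='movieId', values='rating')

# Fraction of user/movie pairs that have no rating
sparsity = mat.isna().to_numpy().mean()
```

In the notebook's matrix the 0 fill value stands for "not rated", which is what `recommend_item` later relies on to find unseen movies.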


In [None]:
## Creating a function that measures how similar two users are based on cosine similarity

from sklearn.metrics.pairwise import cosine_similarity
import operator
def similar_users(user_id, matrix, k=3):
    # create a df of just the current user
    user = matrix[matrix.index == user_id]
    
    # and a df of all other users
    other_users = matrix[matrix.index != user_id]
    
    # calc cosine similarity between user and each other user
    similarities = cosine_similarity(user,other_users)[0].tolist()
    
    # create list of indices of these users
    indices = other_users.index.tolist()
    
    # create key/values pairs of user index and their similarity
    index_similarity = dict(zip(indices, similarities))
    
    # sort by similarity
    index_similarity_sorted = sorted(index_similarity.items(), key=operator.itemgetter(1))
    index_similarity_sorted.reverse()
    
    # grab k users off the top
    top_users_similarities = index_similarity_sorted[:k]
    users = [u[0] for u in top_users_similarities]
    
    return users
    
current_user = 1

# Showing the most similar users to user 1
similar_user_indices = similar_users(current_user, user_movie_mat)
print(similar_user_indices)

[325, 634, 341]
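
Because unrated movies are stored as 0, plain cosine similarity treats "not rated" like a very low rating. The sketch below contrasts it with the centered cosine listed earlier, on a made-up three-user matrix; it is an illustration, not what `similar_users` computes.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings; np.nan marks "not rated"
mat = pd.DataFrame(
    [[5.0, 4.0, np.nan, 1.0],
     [4.0, 5.0, 3.0, np.nan],
     [1.0, np.nan, 5.0, 5.0]],
    index=[1, 2, 3], columns=[10, 20, 30, 40])

# Plain cosine on the zero-filled matrix (the approach used above)
raw = cosine_similarity(mat.fillna(0))

# Centered cosine: subtract each user's mean rating, then treat missing as 0
centered = mat.sub(mat.mean(axis=1), axis=0).fillna(0)
cen = cosine_similarity(centered)
```

Users 1 and 3 have opposite tastes: the raw cosine between them is still positive, while the centered version is negative.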


In [None]:
## Creating a function that recommends the movie based on similarities with users

def recommend_item(user_index, similar_user_indices, matrix, items=5):
    
    # load vectors for similar users
    similar_users = matrix[matrix.index.isin(similar_user_indices)]
    # calc avg ratings across the 3 similar users
    similar_users = similar_users.mean(axis=0)
    # convert to dataframe so its easy to sort and filter
    similar_users_df = pd.DataFrame(similar_users, columns=['mean'])
    
    
    # load vector for the current user
    user_df = matrix[matrix.index == user_index]
    # transpose it so its easier to filter
    user_df_transposed = user_df.transpose()
    # rename the column as 'rating'
    user_df_transposed.columns = ['rating']
    # remove any rows without a 0 value. Movies not watched yet
    user_df_transposed = user_df_transposed[user_df_transposed['rating']==0]
    # generate a list of movies the user has not seen
    unseen_movies = user_df_transposed.index.tolist()
    
    # filter avg ratings of similar users for only movies the current user has not seen
    similar_users_df_filtered = similar_users_df[similar_users_df.index.isin(unseen_movies)]
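    # order the filtered dataframe; sorting the unfiltered similar_users_df instead
    # lets already-rated movies (31, 1129, 1172 in the recorded output below) slip through
    similar_users_df_ordered = similar_users_df_filtered.sort_values(by=['mean'], ascending=False)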
    # grab the top n movies   
    top_rec_movies = similar_users_df_ordered.head(items)
    top_rec_movies.columns = ['Expected Rating']
    return top_rec_movies #movies

current_user = 1
collaboraritive_filtering_rec = recommend_item(current_user, similar_users(current_user, user_movie_mat), user_movie_mat, )

In [None]:
# These are the movie recommendations for user 1 by collaborative filtering
movies_meta[movies_meta['movieId'].isin(collaboraritive_filtering_rec.index.to_numpy())]

Unnamed: 0,movieId,tmdbId,adult,belongs_to_collection,budget,genres,homepage,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
30,31,9909,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",,en,Dangerous Minds,Former Marine Louanne Johnson lands a gig teac...,...,1995-08-11,180000000.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,She broke the rules... and changed their lives.,Dangerous Minds,False,6.4,249.0
906,1129,1103,False,"{'id': 115838, 'name': 'Escape From ... Collec...",6000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",http://www.theofficialjohncarpenter.com/escape...,en,Escape from New York,"In 1997, the island of Manhattan has been wall...",...,1981-05-22,50244700.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,1997. New York City is now a maximum security ...,Escape from New York,False,6.9,720.0
930,1172,11216,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,it,Nuovo Cinema Paradiso,"A filmmaker recalls his childhood, when he fel...",...,1988-11-17,11990401.0,124.0,"[{'iso_639_1': 'it', 'name': 'Italiano'}]",Released,"A celebration of youth, friendship, and the ev...",Cinema Paradiso,False,8.2,834.0
1110,1371,152,False,"{'id': 151, 'name': 'Star Trek: The Original S...",35000000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",,en,Star Trek: The Motion Picture,When a destructive space entity is spotted app...,...,1979-12-06,139000000.0,132.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The human adventure is just beginning.,Star Trek: The Motion Picture,False,6.2,541.0
3265,4085,90,False,"{'id': 85861, 'name': 'Beverly Hills Cop Colle...",15000000,"[{'id': 28, 'name': 'Action'}, {'id': 35, 'nam...",,en,Beverly Hills Cop,Tough-talking Detroit cop Axel Foley heads to ...,...,1984-11-29,316360478.0,105.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Heat Is On!,Beverly Hills Cop,False,6.8,985.0


> #### to be done: Latent factors to represent users and movies 
> #### Content Based
> #### Hybrid Model
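
The latent-factor item in the list above is left open in this notebook. One possible starting point is a truncated SVD, sketched here on a small synthetic matrix. Plain SVD needs a fully observed matrix, so this is only an illustration: with a sparse rating matrix, the factors are usually fitted on the observed ratings only. All names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ratings generated from 2 latent features: 6 users x 5 movies
U_true = rng.normal(size=(6, 2))
V_true = rng.normal(size=(5, 2))
R = U_true @ V_true.T

# Truncated SVD: keep only the k strongest latent features
k = 2
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]      # shape (6, k)
item_factors = Vt[:k, :].T           # shape (5, k)

# Predicted rating = dot product of user and item factors
R_hat = user_factors @ item_factors.T
```

Since the synthetic matrix was built from two latent features, two factors are enough to reconstruct it.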

# Content Based

In [None]:
ratings = full_set[['userId','movieId','rating']]
user_profile = full_set.drop(['movieId','rating','userId'],axis=1) 

In [None]:
# multiply movies dimensions by users' ratings
user_profile = user_profile.multiply(ratings['rating'], axis=0)

In [None]:
user_profile['userId'] = ratings['userId']
user_profile = user_profile.groupby('userId').mean()
user_profile

userId,popularity,release_date,runtime,vote_average,vote_count,profit,0,1,2,3,...,510,511,512,513,514,515,516,517,518,519
1,1.285599,-1.417308,1.076627,1.719595,0.698192,0.960383,0.000000,0.000000,0.015740,0.014350,...,0.575000,0.000000,0.600000,0.125000,0.000000,0.375000,0.300000,0.100000,0.000000,0.0
2,2.507888,0.267316,1.498612,1.528295,3.291836,4.221248,0.021658,0.013773,0.008767,0.020183,...,0.573333,0.000000,0.240000,0.346667,0.040000,1.173333,0.680000,0.173333,0.000000,0.0
3,5.596328,0.839956,1.925966,3.224848,9.318085,6.853815,0.012857,0.015119,0.000000,0.014767,...,0.813725,0.147059,0.264706,0.147059,0.117647,0.647059,0.480392,0.392157,0.000000,0.0
4,2.822185,-2.342740,0.507339,2.677471,3.964036,3.627516,0.034044,0.009325,0.036304,0.034898,...,0.710784,0.039216,0.906863,0.254902,0.235294,0.563725,1.019608,0.068627,0.000000,0.0
5,2.756531,0.708613,1.279581,2.225096,6.125171,7.758421,0.019552,0.010422,0.012884,0.039472,...,0.445000,0.075000,0.190000,0.150000,0.215000,1.300000,0.715000,0.120000,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,2.187109,0.135137,1.205661,2.332764,2.139293,2.799761,0.019110,0.000000,0.000000,0.011718,...,0.500000,0.000000,0.250000,0.529412,0.073529,1.088235,0.352941,0.176471,0.044118,0.0
668,7.241525,0.035967,2.727402,4.719258,8.568948,1.305905,0.000000,0.000000,0.000000,0.000000,...,1.947368,0.000000,0.052632,0.000000,0.000000,0.000000,0.315789,0.210526,0.000000,0.0
669,2.096903,-0.268147,0.351926,1.156677,3.077223,2.505194,0.000000,0.000000,0.000000,0.000000,...,0.810811,0.000000,0.702703,0.351351,0.081081,0.567568,0.000000,0.000000,0.000000,0.0
670,3.710105,0.739121,2.151176,3.208444,8.500677,5.340117,0.031439,0.000000,0.000000,0.000000,...,1.354839,0.000000,0.709677,0.741935,0.000000,0.516129,0.258065,0.419355,0.000000,0.0
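
The profile above weights each movie by the raw rating. The mean-centered variant from the introduction could look like the sketch below; `demo` is a made-up stand-in for `full_set` with only two feature columns.

```python
import pandas as pd

# Made-up stand-in for full_set: two users, two feature columns
demo = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'rating': [5.0, 3.0, 2.0, 4.0],
    'f0':     [1.0, 0.0, 1.0, 0.0],
    'f1':     [0.0, 1.0, 0.0, 1.0],
})

# Subtract each user's average rating before weighting the features
centered = demo['rating'] - demo.groupby('userId')['rating'].transform('mean')

# Weight the features by the centered rating and average per user
profile = (demo[['f0', 'f1']]
           .multiply(centered, axis=0)
           .groupby(demo['userId'])
           .mean())
```

With centering, a feature attached to a below-average rating pulls the profile in the negative direction instead of adding a small positive weight.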


In [None]:
# set the index of movies profile to movies id
movies = movies.set_index('movieId')
movies

movieId,popularity,release_date,runtime,vote_average,vote_count,profit,0,1,2,3,...,510,511,512,513,514,515,516,517,518,519
1,1.567887,0.157030,-0.812707,1.286735,5.000124,2.904924,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.041321,0.157030,-0.054501,0.517957,1.983947,1.543759,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.475116,0.157030,-0.153397,0.133569,-0.348014,-0.303392,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,-0.363455,0.157030,0.703706,-0.250820,-0.406288,0.307840,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,0.120038,0.157030,0.011431,-0.635209,-0.266632,0.411749,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161944,-0.771400,0.466432,-0.680845,0.614055,-0.439444,-0.378101,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162542,0.007454,1.239936,1.461913,0.902346,-0.415331,-0.312730,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
162672,-0.623581,1.239936,1.626741,0.325763,-0.414326,-0.292839,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
163056,0.215925,1.239936,0.472948,0.229666,-0.287731,0.275602,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


> #### we need to divide each row in movies and user_profile by the l2 norm so that later dot product will yield the cosine similarity

In [None]:
# a function that returns l2_norm of a matrix
def l2_norm(df):
    return np.sqrt(np.square(df).sum(axis=1))

In [None]:
# divide each row by its l2 norm in users and movies
user_profile = user_profile.multiply((1 / l2_norm(user_profile).values), axis=0)
movies = movies.multiply((1 / l2_norm(movies).values), axis=0)

## Dot product(cosine similarity between each user and the movies)

In [None]:
user_movies = user_profile.dot(movies.transpose())*10
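
The claim that the dot product of L2-normalized rows equals the cosine similarity can be checked on small made-up frames. `l2_normalize` is a hypothetical helper equivalent to dividing by `l2_norm` above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up user and item profiles sharing the same three feature columns
users = pd.DataFrame([[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]], index=[1, 2])
items = pd.DataFrame([[2.0, 0.0, 1.0], [1.0, 1.0, 1.0], [0.0, 3.0, 0.0]],
                     index=[10, 20, 30])

def l2_normalize(df):
    """Divide each row by its L2 norm."""
    return df.div(np.sqrt(np.square(df).sum(axis=1)), axis=0)

# Dot product of normalized rows
via_dot = l2_normalize(users).dot(l2_normalize(items).T)

# Reference value from scikit-learn
expected = cosine_similarity(users, items)
```

Multiplying the result by 10, as done above, rescales the scores but does not change the ranking of movies for a user. A row with norm 0 would cause a division by zero, which this sketch does not handle.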

## Now for the recommendation part

In [None]:
user_movies

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,161336,161582,161594,161830,161918,161944,162542,162672,163056,163949
1,4.463928,4.792056,1.225762,1.770904,-0.723444,7.329895,0.082307,-0.636491,-0.091876,5.529907,...,-4.376665,2.749642,2.114942,-4.457539,-4.589701,-1.527258,1.620822,1.221796,2.768600,-0.489179
2,7.697109,7.523240,0.905494,2.729787,1.267896,7.530912,-1.002500,-0.722674,-0.307645,7.628051,...,-2.751370,4.435503,1.966824,-1.632955,-2.333981,-1.875299,1.664953,1.515558,3.074524,-0.311475
3,8.895757,8.217956,-0.273257,0.253060,0.057203,7.199673,-2.188909,-2.147159,-1.288621,6.961831,...,-3.143245,4.531378,0.991011,-2.387440,-2.766898,-2.808798,0.584779,-0.030144,1.921105,-0.175001
4,7.912040,7.670191,0.766526,0.803377,0.380619,6.579838,-1.533739,-1.108130,-0.459447,6.689917,...,-4.774122,3.085995,1.163622,-3.514501,-3.457263,-2.364076,-0.393485,-0.915288,1.595967,-0.995157
5,8.672067,7.913793,0.252209,1.823465,1.506002,6.347242,-1.930671,-1.789218,-1.115803,7.515367,...,-2.603128,3.609461,0.726971,-2.130679,-2.041531,-2.306991,0.511821,0.245140,1.916887,-0.322915
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,6.720289,6.322484,1.956543,3.442856,0.952853,7.726039,-0.034585,-0.704300,-0.637654,6.610476,...,-2.576712,5.081015,2.671122,-1.599943,-2.946758,-0.283527,2.576025,2.254270,3.494087,0.772530
668,7.420161,6.728341,0.539915,-0.383726,-1.308760,7.750262,-0.703750,-2.501265,-1.997081,4.548742,...,-3.455976,5.497147,1.392242,-2.761287,-3.715078,-2.232098,1.465715,0.301515,1.573701,0.484807
669,8.269239,7.369801,1.275193,1.383279,1.361124,6.934838,-1.041036,-1.752030,-0.428656,6.764168,...,-3.616328,4.607247,1.083312,-1.811304,-1.568904,-2.442673,0.435585,-0.279904,2.324759,-0.587413
670,8.706188,7.774316,-0.537962,0.577595,-0.310849,7.645443,-1.916695,-2.075372,-1.255033,6.641463,...,-2.849725,5.056890,1.226072,-2.286031,-2.868333,-2.178914,1.077272,0.493687,2.070632,-0.080149


In [None]:
# A function to recommend to users top 5 unseen movies
def recommend(user_id):
    # movies the user has already rated (a set gives fast membership tests)
    user_watched = set(ratings[ratings['userId']==user_id]['movieId'])

    # ids of all movies that have a predicted score for this user
    candidate_movies = user_movies.loc[user_id,:].index.to_numpy()  

    # get unseen movies
    user_unseen_movies = np.array(list(filter(lambda x: x not in user_watched, candidate_movies)))

    # the 5 most recommended movies 
    top_recommended = user_movies.loc[user_id,user_unseen_movies].sort_values(ascending =False)[:5]
    
    # top_recommended is a Series, so it has a name rather than columns
    top_recommended.name = 'Predicted rating'
    
    return top_recommended
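
The same selection can be written without an explicit filter loop. The sketch below uses `Index.difference` and `Series.nlargest` on made-up frames; `recommend_top` and its arguments are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Made-up score matrix (1 user x 4 movies) and rating history
scores = pd.DataFrame([[0.9, 0.1, 0.5, 0.7]], index=[1], columns=[10, 20, 30, 40])
ratings_demo = pd.DataFrame({'userId': [1, 1], 'movieId': [10, 30]})

def recommend_top(user_id, scores, ratings, n=2):
    """Return the n highest-scored movies the user has not rated yet."""
    watched = ratings.loc[ratings['userId'] == user_id, 'movieId']
    unseen = scores.columns.difference(watched)
    return scores.loc[user_id, unseen].nlargest(n).rename('Predicted rating')

top = recommend_top(1, scores, ratings_demo)
```

Movies 10 and 30 are excluded because the user already rated them, so the result ranks movie 40 ahead of movie 20.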

In [None]:
# These are the movie recommendations for user 1 by content-based filtering
content_based_recommendations = recommend(1)
movies_meta[movies_meta['movieId'].isin(content_based_recommendations.index.to_numpy())]

Unnamed: 0,movieId,tmdbId,adult,belongs_to_collection,budget,genres,homepage,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
724,904,567,False,,1000000,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,en,Rear Window,"Professional photographer L.B. ""Jeff"" Jeffries...",...,1954-08-01,36764313.0,112.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It only takes one witness to spoil the perfect...,Rear Window,False,8.2,1531.0
962,1207,595,False,,2000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,en,To Kill a Mockingbird,"In a small Alabama town in the 1930s, scrupulo...",...,1962-12-25,13129846.0,129.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,To Kill a Mockingbird,False,7.9,676.0
988,1234,9277,False,"{'id': 330605, 'name': 'The Sting Collection',...",5500000,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,en,The Sting,Set in the 1930's this intricate caper deals w...,...,1973-12-25,159616327.0,129.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,...all it takes is a little confidence.,The Sting,False,7.9,639.0
2025,2529,871,False,"{'id': 1709, 'name': 'Planet of the Apes Origi...",5800000,"[{'id': 878, 'name': 'Science Fiction'}, {'id'...",http://www.foxmovies.com/movies/planet-of-the-...,en,Planet of the Apes,"An U.S. Spaceship lands on a desolate planet,...",...,1968-02-07,33395426.0,112.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,"Somewhere in the Universe, there must be somet...",Planet of the Apes,False,7.5,958.0
2568,3198,5924,False,,12000000,"[{'id': 80, 'name': 'Crime'}, {'id': 18, 'name...",,en,Papillon,A man befriends a fellow criminal as the two o...,...,1973-12-13,53267000.0,151.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The greatest adventure of escape!,Papillon,False,7.8,445.0


# Lastly, 
> #### I would choose content-based filtering in this scenario, as there is abundant information about the movies that can help.
> #### I believe a better approach would be to combine both collaborative filtering and content-based filtering.
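
One simple way to combine the two approaches is a weighted blend of their scores, sketched below. The scores are made up, `min_max` and `hybrid` are hypothetical helpers, and the weight `alpha` is an assumption that would have to be tuned on held-out ratings.

```python
import pandas as pd

# Made-up scores for the same three movies from the two recommenders
cf_scores = pd.Series({10: 4.5, 20: 3.0, 30: 1.0})   # e.g. expected ratings
cb_scores = pd.Series({10: 2.0, 20: 8.0, 30: 5.0})   # e.g. scaled similarities

def min_max(s):
    """Rescale a score Series to [0, 1] so both sources are comparable."""
    span = s.max() - s.min()
    return (s - s.min()) / span if span else s * 0.0

def hybrid(cf, cb, alpha=0.5):
    """Blend: alpha weights collaborative scores, 1 - alpha content-based ones."""
    return alpha * min_max(cf) + (1 - alpha) * min_max(cb)

blended = hybrid(cf_scores, cb_scores, alpha=0.5)
```

With equal weights, movie 20 comes out on top because it scores reasonably well in both recommenders; setting `alpha=1` reduces the blend to the collaborative ranking.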