# Recommendation System

In [1]:
import pandas as pd
import numpy as np

In [2]:
from scipy.sparse import coo_matrix, csr_matrix
from scipy.sparse.linalg import svds, norm
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
import operator
from collections import defaultdict

## Data 

We will use the MovieLens data, which contains ratings of movies by users. The data is publicly available from the [MovieLens website](https://grouplens.org/datasets/movielens/). We are using the 100k version, which has roughly 100k ratings.

In [4]:
data_path = '../data/'

In [5]:
rating_df = pd.read_csv(data_path + 'ratings.csv', sep=',', header=0)

In [6]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
rating_df.shape

(100836, 4)

In [8]:
rating_df.userId.nunique()

610

In [9]:
rating_df.movieId.nunique()

9724

- The data has over 100k ratings by 610 users on 9724 movies

In [10]:
max(rating_df.userId), max(rating_df.movieId)

(610, 193609)

  - The movie ids are not contiguous: the largest id is 193609, although only 9724 distinct movies appear in the ratings.
  - The 9724 movies are those rated by the users with ids 1 to 610, which together give about 100k ratings.

### Movies Information

We are also provided the titles and genres of the movies in a separate file. 

In [11]:
movie_df = pd.read_csv(data_path + 'movies.csv', sep=',', header=0)

In [12]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
movie_df[movie_df['title'].str.contains('Avengers')]

Unnamed: 0,movieId,title,genres
1611,2153,"Avengers, The (1998)",Action|Adventure
6148,44020,Ultimate Avengers (2006),Action|Animation|Children|Sci-Fi
7693,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX
8551,115727,Crippled Avengers (Can que) (Return of the 5 D...,Action|Adventure
8686,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi
8693,122912,Avengers: Infinity War - Part I (2018),Action|Adventure|Sci-Fi
9153,147657,Masked Avengers (1981),Action
9488,170297,Ultimate Avengers 2 (2006),Action|Animation|Sci-Fi


## Matrix Factorization

Matrix factorization is a very popular technique for recommendation systems. We factorize the user-item matrix to obtain user factors and item factors: low-dimensional embeddings in which 'similar' users or items are mapped to 'nearby' points. Moreover, users and movies are embedded in the same space, which provides a way to compute user-movie similarity.
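In symbols, the factorization can be written as follows. The notation is an assumption introduced here for illustration and does not come from the notebook: $R$ is the $n \times m$ rating matrix with users as rows and movies as columns, and $k$ is the number of latent factors. A rank-$k$ truncated SVD gives

$$
R \approx U_k \Sigma_k V_k^\top, \qquad U_k \in \mathbb{R}^{n \times k}, \quad \Sigma_k \in \mathbb{R}^{k \times k}, \quad V_k \in \mathbb{R}^{m \times k}.
$$

Under this reading, row $u$ of $U_k$ acts as the embedding of user $u$ and row $i$ of $V_k$ as the embedding of movie $i$, so similarities can be computed between any two rows.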

Create a matrix of ratings

In [14]:
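# np.zeros (rather than the uninitialised np.ndarray) guarantees unrated cells are 0.
# Note: uint8 truncates half-star ratings (e.g. 4.5 is stored as 4).
ratings_mat = np.zeros(
    shape=(np.max(rating_df.movieId.values), np.max(rating_df.userId.values)),
    dtype=np.uint8)
ratings_mat[rating_df.movieId.values-1, rating_df.userId.values-1] = rating_df.rating.values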

In [15]:
ratings_mat.shape

(193609, 610)

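Normalize the rating matrix by subtracting each row's mean, i.e. each movie's mean over all users (unrated cells count as zeros).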

In [16]:
normalised_mat = ratings_mat - np.asarray([(np.mean(ratings_mat, 1))]).T

We will use Singular Value Decomposition (SVD) to factorize the matrix. Since the user-movie rating matrix is very sparse, it is more efficient to use the implementation from scipy.sparse, which computes only the leading singular values and vectors.

The number of latent factors is chosen to be 50, i.e. only the top 50 singular values of the SVD are kept.
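As a point of comparison, here is a minimal sketch of how the same factorization could be run on an actual sparse matrix. It is not the notebook's method: the toy ratings are made up, the ids are first remapped to contiguous positions, and the mean-centering step is skipped because subtracting row means would make the matrix dense. All variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Toy ratings standing in for ratings.csv (values are made up for illustration).
user_raw = np.array([1, 1, 2, 2, 3, 3, 4, 4])
movie_raw = np.array([10, 500, 10, 9000, 500, 9000, 10, 70000])
rating = np.array([4.0, 5.0, 3.5, 2.0, 4.5, 1.0, 5.0, 3.0])

# Map the raw, non-contiguous ids to contiguous row/column positions.
user_ids, user_pos = np.unique(user_raw, return_inverse=True)
movie_ids, movie_pos = np.unique(movie_raw, return_inverse=True)

# Rows are users, columns are movies; unrated cells stay implicit zeros.
sparse_ratings = csr_matrix(
    (rating, (user_pos, movie_pos)),
    shape=(len(user_ids), len(movie_ids)),
)

k = 2  # svds requires k < min(sparse_ratings.shape)
U_toy, S_toy, Vt_toy = svds(sparse_ratings, k=k)
print(sparse_ratings.shape, U_toy.shape, S_toy.shape, Vt_toy.shape)
```

With contiguous positions the matrix has one column per rated movie instead of one per possible movie id, so no all-zero rows or columns are carried through the decomposition.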

In [17]:
n_factors = 50

In [18]:
A = normalised_mat.T / np.sqrt(ratings_mat.shape[0] - 1)
U, S, V = svds(A, n_factors)

In [19]:
U.shape

(610, 50)

In [20]:
V.shape

(50, 193609)

In [21]:
movie_factors = V.T
user_factors = U

Let's study some examples to get a qualitative understanding. The cosine similarity between the latent factors of two movies indicates how similar the movies are.

In [22]:
idx = 260
movie_df[movie_df.movieId == idx].title.values[0],  movie_df[movie_df.movieId == 1196].title.values[0]

('Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)')

In [23]:
1.0 - cosine(movie_factors[259], movie_factors[1195])

0.8777832979913568

In [24]:
movie_df[movie_df.movieId == 1210].title.values[0], 1.0 - cosine(movie_factors[259], movie_factors[1209])

('Star Wars: Episode VI - Return of the Jedi (1983)', 0.8518636866885215)

In [25]:
movie_df[movie_df.movieId == 1].title.values[0], 1.0 - cosine(movie_factors[259], movie_factors[0])

('Toy Story (1995)', 0.20152844265131886)

The similarity of 'Star Wars: Episode IV - A New Hope' to 'Star Wars: Episode V - The Empire Strikes Back' (0.88) and to 'Star Wars: Episode VI - Return of the Jedi' (0.85) is high, and it is much lower for 'Toy Story' (0.20). Moreover, Episode V is slightly closer to Episode IV than Episode VI is.
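For reference, the quantity being compared is the cosine of the angle between two factor vectors. The sketch below computes it directly with NumPy on made-up vectors; `cosine_sim` is a hypothetical helper, not part of the notebook. The expression `1.0 - cosine(a, b)` used in the cells above yields the same quantity, because scipy's `cosine` is a distance.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_sim([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
print(same_direction, orthogonal)
```

Parallel vectors score close to 1 regardless of their lengths, and perpendicular vectors score 0, which is why the measure reflects direction in the latent space rather than magnitude.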

The following function returns the top-n movies most similar to a given movie.

In [26]:
def get_similar_movies_matrix_factorization(data, movieid, top_n=10):
    index = movieid - 1  # Row i of data holds movie id i + 1
    movie = movie_df[movie_df.movieId == movieid].title.values[0]
    movie_row = data[index].reshape(1, -1)
    similarity = cosine_similarity(movie_row, data)
    sort_indexes = np.argsort(-similarity)[0]  # Most similar first
    # Convert row indexes back to movie ids (+1) and look up the titles
    return {'movie': movie, 'sim_movies': [movie_df[movie_df.movieId == mid].title.values[0] for mid in sort_indexes[:top_n] + 1]}

In [27]:
movie_id = 260
get_similar_movies_matrix_factorization(movie_factors, movie_id)

{'movie': 'Star Wars: Episode IV - A New Hope (1977)',
 'sim_movies': ['Star Wars: Episode IV - A New Hope (1977)',
  'Star Wars: Episode V - The Empire Strikes Back (1980)',
  'Star Wars: Episode VI - Return of the Jedi (1983)',
  'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
  'Lesson Faust (1994)',
  'Touch (1997)',
  'Inferno (2016)',
  'Beverly Hills Chihuahua (2008)',
  'Matrix, The (1999)',
  'Star Wars: Episode III - Revenge of the Sith (2005)']}

In [28]:
movie_id = 1
get_similar_movies_matrix_factorization(movie_factors, movie_id)

{'movie': 'Toy Story (1995)',
 'sim_movies': ['Toy Story (1995)',
  'Toy Story 2 (1999)',
  'Adventures of Pinocchio, The (1996)',
  'Eddie (1996)',
  'Children of the Corn IV: The Gathering (1996)',
  'Twister (1996)',
  'Sudden Death (1995)',
  'Dear God (1996)',
  'Kazaam (1996)',
  'Sunset Park (1996)']}

In [29]:
user_factors.shape, movie_factors.shape

((610, 50), (193609, 50))

Since users and movies are in the same space, we can also find the movies most similar to a user. A simple recommendation model can then be defined as showing each user the movies closest to that user's factor vector.

In [30]:
def get_recommendations_matrix_factorization(userid, user_factors, movie_factors, top_n=5):
    user_vec = user_factors[userid - 1].reshape(1,-1)
    similarity = cosine_similarity(user_vec, movie_factors)
    sort_indexes = np.argsort(-similarity)[0]
    return [movie_df[movie_df.movieId == id].title.values[0] for id in sort_indexes[:top_n] + 1]   

In [31]:
top_recos = get_recommendations_matrix_factorization(1, user_factors, movie_factors)
top_recos

['Best Men (1997)',
 "Gulliver's Travels (1939)",
 'Newton Boys, The (1998)',
 'Teenage Mutant Ninja Turtles III (1993)',
 'Welcome to Woop-Woop (1997)']

## Graph Embeddings

Create a user-movie graph whose edge weights are the ratings. We will use [DeepWalk](https://arxiv.org/abs/1403.6652) to embed every node of the graph into a low-dimensional space.

In [33]:
import networkx as nx

In [34]:
user_item_edgelist = rating_df[['userId', 'movieId', 'rating']]
user_item_edgelist.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [35]:
user_item_edgelist.shape

(100836, 3)

Since user ids and movie ids both start from 1, the same id can correspond to both a user and a movie. We will therefore map the ids to unique integers.

In [36]:
user2dict = dict()
movie2dict = dict()
cnt = 0  # Next unused node id, shared by users and movies
for x in user_item_edgelist.values:
    usr = (x[0], 'user')
    movie = (x[1], 'movie')
    if usr not in user2dict:
        user2dict[usr] = cnt
        cnt += 1
    if movie not in movie2dict:
        movie2dict[movie] = cnt
        cnt += 1

In [37]:
len(user2dict), len(movie2dict)

(610, 9724)

In [38]:
len(user2dict) + len(movie2dict)

10334
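The id mapping above can also be sketched more compactly with a single dictionary keyed by `(id, kind)` tuples. This is an illustrative alternative on made-up pairs, not the notebook's code; `build_node_index` is a hypothetical helper. It assigns integers in order of first appearance, as the loop above does.

```python
def build_node_index(pairs):
    """Assign one integer per distinct (id, kind) node, in order of first appearance."""
    node_index = {}
    for user_id, movie_id in pairs:
        for node in ((user_id, "user"), (movie_id, "movie")):
            # setdefault only assigns when the node is new, so ids stay contiguous.
            node_index.setdefault(node, len(node_index))
    return node_index

toy_pairs = [(1, 1), (1, 3), (2, 1)]  # made-up (userId, movieId) pairs
toy_index = build_node_index(toy_pairs)
print(toy_index)
```

Tagging each id with its kind is what keeps user 1 and movie 1 apart, even though both carry the same raw id.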

Create a weighted user-movie graph using the Python library networkx.

In [39]:
user_movie_graph = nx.Graph()

In [40]:
for x in user_item_edgelist.values:
    usr = (x[0], 'user')
    movie = (x[1], 'movie')
    user_movie_graph.add_node(user2dict[usr])
    user_movie_graph.add_node(movie2dict[movie])
    user_movie_graph.add_edge(user2dict[usr], movie2dict[movie], weight=float(x[2]))

In [41]:
user_movie_graph.number_of_edges()

100836

In [42]:
user_movie_graph.number_of_nodes()

10334

Write the edgelist to a file. 

In [49]:
path = './user_movie_rating.csv'  # Same path that is passed to node2vec via --input below

In [50]:
nx.write_edgelist(user_movie_graph, path, delimiter=' ', data=['weight'])

### DeepWalk

We will use the implementation of DeepWalk provided in node2vec, which is a bit different from the original DeepWalk: for example, it trains the skip-gram model with negative sampling, whereas the original DeepWalk paper used hierarchical softmax.

The code and the instructions for running node2vec are provided in the [repository](https://github.com/aditya-grover/node2vec).
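To give a feel for what the walk phase does, here is a minimal sketch of a first-order weighted random walk on a toy graph. It assumes the simplest setting, where the next node depends only on the current node and its edge weights. It is not node2vec's actual code, and the function name and toy adjacency are hypothetical. The resulting walks play the role of "sentences" for the skip-gram model.

```python
import random

def weighted_random_walk(adjacency, start, walk_length, rng):
    """One walk where the next node is drawn in proportion to edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        neighbours = adjacency.get(walk[-1], {})
        if not neighbours:
            break  # dead end: stop early
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk

# Toy undirected user-movie graph: nodes 0 and 1 are users, 2 and 3 are movies.
toy_adjacency = {
    0: {2: 4.0, 3: 5.0},
    1: {2: 3.0},
    2: {0: 4.0, 1: 3.0},
    3: {0: 5.0},
}
rng = random.Random(0)
toy_walks = [weighted_random_walk(toy_adjacency, node, 5, rng) for node in toy_adjacency]
for walk in toy_walks:
    print(walk)
```

Because the graph is bipartite, every walk alternates between user and movie nodes, so movies rated by the same users tend to appear in each other's context windows.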

In [51]:
! python ../node2vec/src/main.py --help

usage: main.py [-h] [--input [INPUT]] [--output [OUTPUT]]
               [--dimensions DIMENSIONS] [--walk-length WALK_LENGTH]
               [--num-walks NUM_WALKS] [--window-size WINDOW_SIZE]
               [--iter ITER] [--workers WORKERS] [--p P] [--q Q] [--weighted]
               [--unweighted] [--directed] [--undirected]

Run node2vec.

optional arguments:
  -h, --help            show this help message and exit
  --input [INPUT]       Input graph path
  --output [OUTPUT]     Embeddings path
  --dimensions DIMENSIONS
                        Number of dimensions. Default is 128.
  --walk-length WALK_LENGTH
                        Length of walk per source. Default is 80.
  --num-walks NUM_WALKS
                        Number of walks per source. Default is 10.
  --window-size WINDOW_SIZE
                        Context size for optimization. Default is 10.
  --iter ITER           Number of epochs in SGD
  --workers WORKERS     Number of parallel workers. Default is 8.
  --p P     

In [71]:
! python ../node2vec/src/main.py --input ./user_movie_rating.csv --output ./user_movie_embeddings --dimensions 50 --walk-length 40 --weighted

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10


In [72]:
def build_embeddings(embed_file):
    embedding_dict = dict()
    with open(embed_file) as f:
        for line in f:
            record = line.rstrip().split()
            # Lines with at most two tokens (e.g. a header line) carry no vector
            if len(record) > 2:
                embedding_dict[int(record[0])] = np.array(record[1:], dtype=float)
    return embedding_dict

In [73]:
embedding_dict = build_embeddings('./user_movie_embeddings')

In [74]:
embedding_dict[0].shape

(50,)

In [75]:
len(embedding_dict)

10334

Try the same examples as before.

In [837]:
1.0 - cosine(embedding_dict[movie2dict[(260, 'movie')]], embedding_dict[movie2dict[(1196, 'movie')]])

0.9011281003170812

In [838]:
1.0 - cosine(embedding_dict[movie2dict[(260, 'movie')]], embedding_dict[movie2dict[(1210, 'movie')]])

0.9035285097641825

Build an array of embeddings for all nodes. 

In [80]:
node_vecs = [embedding_dict[x] for x in range(len(embedding_dict))]

In [81]:
node_vecs = np.array(node_vecs)
node_vecs.shape

(10334, 50)

In [82]:
reverse_movie2dict = {k:v for v,k in movie2dict.items()}

In [83]:
def get_similar_movies_graph_embeddings(movieid, movie_vecs, top_n=10):
    movie_idx = movie2dict[movieid]
    query = movie_vecs[movie_idx].reshape(1,-1)
    ranking = cosine_similarity(query, movie_vecs)
    top_ids = np.argsort(-ranking)[0]
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    sim_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return sim_movies

In [84]:
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Matrix, The (1999)',
 'Frankie and Johnny (1966)',
 'The Nut Job 2: Nutty by Nature (2017)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Spy Next Door, The (2010)',
 'Saving Private Ryan (1998)']

In [85]:
get_similar_movies_graph_embeddings((122912, 'movie'), node_vecs)[:10]

['Avengers: Infinity War - Part I (2018)',
 'Thor: Ragnarok (2017)',
 'Guardians of the Galaxy 2 (2017)',
 'Deadpool 2 (2018)',
 'Star Wars: The Last Jedi (2017)',
 'Untitled Spider-Man Reboot (2017)',
 'Jumanji: Welcome to the Jungle (2017)',
 'Justice League (2017)',
 'Black Panther (2017)',
 'Blade Runner 2049 (2017)']

In [86]:
get_similar_movies_graph_embeddings((1, 'movie'), node_vecs)[:10]

['Toy Story (1995)',
 'Mission: Impossible (1996)',
 'Nutty Professor, The (1996)',
 'Willy Wonka & the Chocolate Factory (1971)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Twister (1996)',
 'Touch (1997)',
 'Sense and Sensibility (1995)',
 "Mr. Holland's Opus (1995)",
 'The Nut Job 2: Nutty by Nature (2017)']

In [87]:
def get_recommended_movies_graph_embeddings(userid, movie_vecs, top_n=10):
    user_idx = user2dict[userid]
    query = movie_vecs[user_idx].reshape(1,-1)
    ranking = cosine_similarity(query, movie_vecs)
    top_ids = np.argsort(-ranking)[0]
    top_movie_ids = [reverse_movie2dict[j] for j in top_ids if j in reverse_movie2dict][:top_n]
    reco_movies = [movie_df[movie_df.movieId == id[0]].title.values[0] for id in top_movie_ids]
    return reco_movies

In [88]:
get_recommended_movies_graph_embeddings((1, 'user'), node_vecs, top_n=10)

['Newton Boys, The (1998)',
 'Shaft (1971)',
 'American Tail, An (1986)',
 "Gulliver's Travels (1939)",
 'Best Men (1997)',
 'Howard the Duck (1986)',
 'Quiet Man, The (1952)',
 'Three Caballeros, The (1945)',
 'Sword in the Stone, The (1963)',
 'Honey, I Shrunk the Kids (1989)']

In [93]:
idx = 1
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
recos.intersection(true_pos)

{'American Tail, An (1986)',
 "Gulliver's Travels (1939)",
 'Newton Boys, The (1998)',
 'Quiet Man, The (1952)',
 'Shaft (1971)',
 'Sword in the Stone, The (1963)',
 'Three Caballeros, The (1945)'}

In [95]:
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
mf_recos.intersection(true_pos)

{"Gulliver's Travels (1939)", 'Newton Boys, The (1998)'}

In [97]:
idx = 2
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
print(recos.intersection(true_pos))
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(mf_recos.intersection(true_pos))

{'Town, The (2010)', 'Inside Job (2010)', 'Warrior (2011)', 'Wolf of Wall Street, The (2013)', 'The Jinx: The Life and Deaths of Robert Durst (2015)'}
{'The Jinx: The Life and Deaths of Robert Durst (2015)'}


In [98]:
idx = 3
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
print(recos.intersection(true_pos))
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(mf_recos.intersection(true_pos))

{'Troll 2 (1990)', 'Android (1982)', 'Alien Contamination (1980)', 'Looker (1981)', 'The Lair of the White Worm (1988)', 'Galaxy of Terror (Quest) (1981)', 'Hangar 18 (1980)', 'Master of the Flying Guillotine (Du bi quan wang da po xue di zi) (1975)', 'Death Race 2000 (1975)', 'Clonus Horror, The (1979)'}
{'Alien Contamination (1980)', 'Galaxy of Terror (Quest) (1981)', 'Looker (1981)', 'Master of the Flying Guillotine (Du bi quan wang da po xue di zi) (1975)'}


In [99]:
idx = 4
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
print(recos.intersection(true_pos))
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(mf_recos.intersection(true_pos))

{"I'm the One That I Want (2000)", 'Flirting With Disaster (1996)', 'L.I.E. (2001)', 'Saboteur (1942)', 'Six Degrees of Separation (1993)'}
{"I'm the One That I Want (2000)", 'Beautiful Thing (1996)'}


In [105]:
idx = 10
recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])
print(recos.intersection(true_pos))
mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(mf_recos.intersection(true_pos))

{'Priceless (Hors de prix) (2006)', 'First Daughter (2004)'}
{'The Hundred-Foot Journey (2014)'}
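The overlap checks above can be wrapped in a small helper that cuts both recommendation lists at the same length. This matters for the comparison because `get_recommendations_matrix_factorization` defaults to `top_n=5`, while the graph-embedding calls pass `top_n=10`. The helper below is a hypothetical sketch with made-up titles, not part of the notebook.

```python
def hits_at_n(recommended, liked, n):
    """Number of the first n recommended titles that the user rated highly."""
    return len(set(recommended[:n]) & set(liked))

toy_recommended = ["A", "B", "C", "D", "E", "F"]  # hypothetical ranked titles
toy_liked = {"B", "E", "F", "Z"}                   # hypothetical highly rated titles
print(hits_at_n(toy_recommended, toy_liked, 5), hits_at_n(toy_recommended, toy_liked, 6))
```

Passing the same `n` to both models keeps the hit counts comparable. Since the liked titles come from the same ratings the models were fitted on, the counts are best read as a sanity check rather than a held-out evaluation.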


## Enriched network with additional information: Genres

In [52]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#### Genres of the movies can be used as an additional signal for better recommendations

In [53]:
movie_genre_edgelist = movie_df[['movieId', 'genres']]
movie_genre_edgelist.head()

Unnamed: 0,movieId,genres
0,1,Adventure|Animation|Children|Comedy|Fantasy
1,2,Adventure|Children|Fantasy
2,3,Comedy|Romance
3,4,Comedy|Drama|Romance
4,5,Comedy


In [54]:
cnt

10334

In [55]:
genre2int = dict()
for x in movie_genre_edgelist.values:
    genres = x[1].split('|')
    for genre in genres:
        if genre not in genre2int:
            # Continue numbering after the user and movie nodes
            genre2int[genre] = cnt
            cnt += 1

In [56]:
cnt

10354

In [57]:
genre2int

{'Adventure': 10334,
 'Animation': 10335,
 'Children': 10336,
 'Comedy': 10337,
 'Fantasy': 10338,
 'Romance': 10339,
 'Drama': 10340,
 'Action': 10341,
 'Crime': 10342,
 'Thriller': 10343,
 'Horror': 10344,
 'Mystery': 10345,
 'Sci-Fi': 10346,
 'War': 10347,
 'Musical': 10348,
 'Documentary': 10349,
 'IMAX': 10350,
 'Western': 10351,
 'Film-Noir': 10352,
 '(no genres listed)': 10353}

In [58]:
movie_genre_graph = nx.Graph()
for x in movie_genre_edgelist.values:
    movie = (x[0], 'movie')
    genres = x[1].split('|')
    # Only keep movies that appear in the ratings data
    if movie in movie2dict:
        for genre in genres:
            # add_edge creates missing endpoint nodes, so no explicit add_node is needed
            movie_genre_graph.add_edge(movie2dict[movie], genre2int[genre], weight=1.0)

In [59]:
movie_genre_graph.number_of_nodes()

9744

In [61]:
rating_df.movieId.nunique()

9724

In [62]:
movie_genre_graph.number_of_edges()

22046

In [63]:
list(movie_genre_graph.edges())[:5]

[(1, 10334), (1, 10335), (1, 10336), (1, 10337), (1, 10338)]

In [65]:
movie_genre_graph[1][10334]

{'weight': 1.0}

#### Combine the user-movie and movie-genre graphs

In [66]:
user_movie_genre_graph =  nx.Graph()
user_movie_genre_graph.add_weighted_edges_from([(x,y,user_movie_graph[x][y]['weight']) for x,y in user_movie_graph.edges()])
user_movie_genre_graph.add_weighted_edges_from([(x,y,movie_genre_graph[x][y]['weight']) for x,y in movie_genre_graph.edges()])

In [67]:
user_movie_genre_graph.number_of_edges()

122882

In [68]:
list(user_movie_genre_graph.edges())[0]

(0, 1)

In [69]:
user_movie_genre_edgelist_path = './user_movie_genres_edgelist.csv'

In [70]:
nx.write_edgelist(user_movie_genre_graph, user_movie_genre_edgelist_path, delimiter=' ', data=['weight'])

In [106]:
! python ../node2vec/src/main.py --input ./user_movie_genres_edgelist.csv --output ./user_movie_genre_embeddings --dimensions 50 --walk-length 40 --weighted

Walk iteration:
1 / 10
2 / 10
3 / 10
4 / 10
5 / 10
6 / 10
7 / 10
8 / 10
9 / 10
10 / 10


In [107]:
user_movie_genre_embeddings = build_embeddings('./user_movie_genre_embeddings')

In [108]:
len(user_movie_genre_embeddings)

10354

In [109]:
user_movie_genre_embeddings[0].shape

(50,)

In [110]:
1.0 - cosine(user_movie_genre_embeddings[movie2dict[(260, 'movie')]], user_movie_genre_embeddings[movie2dict[(1196, 'movie')]])

0.9183064575616585

In [111]:
1.0 - cosine(user_movie_genre_embeddings[movie2dict[(260, 'movie')]], user_movie_genre_embeddings[movie2dict[(1210, 'movie')]])

0.9206418521656725

In [112]:
node_vecs_all = [user_movie_genre_embeddings[x] for x in range(len(user_movie_genre_embeddings))]

In [113]:
len(node_vecs_all)

10354

In [114]:
node_vecs_all = np.array(node_vecs_all)

In [115]:
node_vecs_all.shape

(10354, 50)

In [116]:
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs_all)[:10]

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Matrix, The (1999)',
 'Back to the Future (1985)',
 'Terminator, The (1984)',
 'Fargo (1996)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Alien (1979)']

In [118]:
get_similar_movies_graph_embeddings((260, 'movie'), node_vecs)[:10]

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Matrix, The (1999)',
 'Frankie and Johnny (1966)',
 'The Nut Job 2: Nutty by Nature (2017)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Spy Next Door, The (2010)',
 'Saving Private Ryan (1998)']

Recommendations

In [120]:
get_recommended_movies_graph_embeddings((1, 'user'), node_vecs_all, top_n=10)

["Gulliver's Travels (1939)",
 'Best Men (1997)',
 'Great Mouse Detective, The (1986)',
 'American Tail, An (1986)',
 'Newton Boys, The (1998)',
 'Teenage Mutant Ninja Turtles II: The Secret of the Ooze (1991)',
 'Three Caballeros, The (1945)',
 'Welcome to Woop-Woop (1997)',
 'Red Dawn (1984)',
 "Charlotte's Web (1973)"]

In [121]:
get_recommended_movies_graph_embeddings((1, 'user'), node_vecs, top_n=10)

['Newton Boys, The (1998)',
 'Shaft (1971)',
 'American Tail, An (1986)',
 "Gulliver's Travels (1939)",
 'Best Men (1997)',
 'Howard the Duck (1986)',
 'Quiet Man, The (1952)',
 'Three Caballeros, The (1945)',
 'Sword in the Stone, The (1963)',
 'Honey, I Shrunk the Kids (1989)']

In [131]:
idx = 2
true_pos = set([movie_df[movie_df.movieId == id].title.values[0] for id in rating_df[(rating_df['userId'] == idx) & (rating_df['rating'] >= 4.5)].movieId.values])

mf_recos = set(get_recommendations_matrix_factorization(idx, user_factors, movie_factors))
print(len(mf_recos.intersection(true_pos)))

ge_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs, top_n=10))
print(len(ge_recos.intersection(true_pos)))

ge_enriched_recos = set(get_recommended_movies_graph_embeddings((idx, 'user'), node_vecs_all, top_n=10))
print(len(ge_enriched_recos.intersection(true_pos)))

1
5
6
