# Meta-path based Heterogeneous Graph Embeddings for Recommendations

Recommendation problems can be modeled as link prediction on graphs, and neural graph embedding methods (i.e. mapping of each node to a low-dimensional space such that node 'similar' in graph space are mapped to nearyby points in the embedding space) have been particularly succesful at various graph analytical tasks. Unsupervised graph embedding methods like [**DeepWalk**](https://arxiv.org/abs/1403.6652) and [**node2vec**](https://snap.stanford.edu/node2vec/) use word2vec models (e.g.  skip-gram) to map nodes co-occuring on a fixed window on truncated random walks to nearby points. These techniques can be used for collaborative filtering by mapping users and items to the same space, kind of an extension of matrix factorization based embedding.  

However, graphs for recommendations e.g. User-Item, User-Item-Genre, User-Item-Movie-Actor-Director etc. are often heterogeneous. Methods like DeepWalk are developed for homogeneous graphs with only one type of nodes, and might not perform optimal for heterogeneous graphs. As a direct extension of DeepWalk, [**metapath2vec**]() provides an unsupervised method to obtain node embeddings for heterogeneous graphs. **metapath2vec** generates random walks guided by a metapath (path of node types) and then uses the appropriate skip-gram like model for embedding. 

In this notebook, we will explore the use to metapath2vec embeddings for recommendation tasks. We will explore the efficacy of different metapaths. 

## Data 

We will use the MovieLens data which contains ratings of movies by users. The data which is publically available from [MovieLens Website](https://grouplens.org/datasets/movielens/). We are using 100k data which has 100k ratings. 

In [1]:
import pandas as pd
import numpy as np

In [2]:
from scipy.sparse import coo_matrix, csr_matrix
from scipy.sparse.linalg import svds, norm
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
import operator
from collections import defaultdict

In [4]:
data_path = '../data/'

In [5]:
rating_df = pd.read_csv(data_path + 'ratings.csv', sep=',', header=0)

In [6]:
rating_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
rating_df.shape

(100836, 4)

In [8]:
rating_df.userId.nunique()

610

In [9]:
rating_df.movieId.nunique()

9724

- The data has over 100k ratings by 610 users on 9724 movies

In [10]:
max(rating_df.userId), max(rating_df.movieId)

(610, 193609)

  - The movie ids do not follow an order. 
  - 9724 movies have been selected that users with id 1 to 610 have rated so as to have 100k ratings. 

### Movies Information

We are also provided the titles and genres of the movies in a separate file. 

In [11]:
movie_df = pd.read_csv(data_path + 'movies.csv', sep=',', header=0)

In [12]:
movie_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [13]:
movie_df[movie_df['title'].str.contains('Avengers')]

Unnamed: 0,movieId,title,genres
1611,2153,"Avengers, The (1998)",Action|Adventure
6148,44020,Ultimate Avengers (2006),Action|Animation|Children|Sci-Fi
7693,89745,"Avengers, The (2012)",Action|Adventure|Sci-Fi|IMAX
8551,115727,Crippled Avengers (Can que) (Return of the 5 D...,Action|Adventure
8686,122892,Avengers: Age of Ultron (2015),Action|Adventure|Sci-Fi
8693,122912,Avengers: Infinity War - Part I (2018),Action|Adventure|Sci-Fi
9153,147657,Masked Avengers (1981),Action
9488,170297,Ultimate Avengers 2 (2006),Action|Animation|Sci-Fi


## Train/Test data

Split the data into train and test sets.  

In [14]:
train_df = rating_df.sample(frac=0.7,random_state=200) #random state is a seed value
test_df = rating_df.drop(train_df.index)

In [15]:
train_df.shape, test_df.shape

((70585, 4), (30251, 4))

In [16]:
test_users = test_df.userId.unique()
test_movies = test_df.movieId.unique()

In [17]:
test_users.shape, test_movies.shape

((610,), (6137,))

## Data processing

We will be used the implementation of metapath2vec provided with the original paper which is in C++.

Suffix 'a' to every user node id, and 'v' to every movie id. 

In [18]:
user2int = dict()
movie2int = dict()

for x in train_df.values:
    usr = int(x[0])
    movie = int(x[1])
    if usr in user2int:
        pass
    else:
        user2int[usr] = 'a' + str(usr)
    if movie in movie2int:
        pass
    else:
        movie2int[movie] = 'v' + str(movie)

In [19]:
len(user2int), len(movie2int)

(610, 8519)

In [20]:
list(user2int.keys())[:5]

[474, 368, 232, 506, 89]

### Adjacency lists of user and movies

Create the adjacency lists of the movies and users. Unlike the setup of *metapath2vec*, we have a *weighted* graph, so inaddition to the adjacent nodes, we also record the weight of each connection. Since these weights will determine transition probabilies of the random walks, we also normalize the weights of each edge by the sum of the weights of all edges from a given node. 

In [21]:
user_movielist = dict()
for x in train_df.groupby(by=['userId'])[['movieId', 'rating']]:
    usr = user2int[x[0]]
    movieids = x[1][['movieId', 'rating']].values
    user_movielist[usr] = [(movie2int[y[0]], float(y[1])) for y in movieids]
    norm_sum = np.sum([y[1] for y in user_movielist[usr]])
    user_movielist[usr] = [(y[0], float(y[1]) / norm_sum) for y in user_movielist[usr]] 

In [22]:
user_movielist['a1'][:5]

[('v1224', 0.007132667617689016),
 ('v1092', 0.007132667617689016),
 ('v2096', 0.005706134094151213),
 ('v1445', 0.0042796005706134095),
 ('v2997', 0.005706134094151213)]

In [23]:
movie_userlist = dict()
for x in train_df.groupby(by=['movieId'])[['userId', 'rating']]:
    mve = movie2int[int(x[0])]
    userids = x[1][['userId', 'rating']].values
    movie_userlist[mve] = [(user2int[y[0]], float(y[1])) for y in userids]
    norm_sum = np.sum([y[1] for y in movie_userlist[mve]])
    movie_userlist[mve] = [(y[0], float(y[1]) / norm_sum) for y in movie_userlist[mve]] 

In [24]:
movie_userlist['v1'][:5]

[('a193', 0.0034158838599487617),
 ('a477', 0.006831767719897523),
 ('a252', 0.007685738684884714),
 ('a606', 0.004269854824935952),
 ('a471', 0.008539709649871904)]

## metapath2vec

metapath2vec operates in following steps:
- Choose a metapath e.g. User-Movie, User-Movie-User etc. 
- Generate random walks starting from each node guided by the chosen metapath.
- Run the embedding model i.e. kip-gram with negative sampling. 

### Random walks for User-Movie-User metapath

In [25]:
def generate_metapaths_umu(user_movielist, movie_userlist, num_walks=10, walk_length=80):
    walks = []
    for usr in user_movielist:
        usr0 = usr
        for i in range(0, num_walks):
            walk = usr0
            for j in range(0, walk_length):
                movielist = user_movielist[usr]
                movieids = [x[0] for x in movielist]
                mve_probs =  [x[1] for x in movielist]
                next_movie = np.random.choice(movieids, 1, p=mve_probs)[0]
                walk += " " + next_movie
                
                usrlist = movie_userlist[next_movie]
                usrids = [x[0] for x in usrlist]
                usr_probs = [x[1] for x in usrlist]
                next_user = np.random.choice(usrids, 1, p=usr_probs)[0]
                walk += " " + next_user
            walks.append(walk)
    return walks
                   

In [27]:
metapath_walks_umu = generate_metapaths_umu(user_movielist, movie_userlist, num_walks=10, walk_length=100)

In [29]:
metapath_walks_umu[0]

'a1 v2143 a605 v2470 a68 v1210 a120 v3176 a202 v1587 a325 v2005 a20 v3253 a111 v553 a353 v2268 a265 v3639 a164 v2617 a312 v2987 a391 v2291 a489 v2046 a288 v2700 a141 v1408 a39 v1445 a414 v2628 a591 v2858 a503 v1009 a414 v3527 a387 v2174 a260 v3052 a282 v6 a325 v1617 a608 v1009 a51 v2291 a122 v260 a434 v552 a395 v2858 a606 v923 a122 v3703 a330 v2094 a68 v2193 a226 v1278 a524 v1291 a475 v2542 a122 v2387 a294 v553 a274 v2141 a522 v3703 a202 v4006 a167 v2991 a288 v1396 a385 v2944 a156 v543 a387 v1291 a524 v216 a1 v661 a274 v2657 a177 v260 a239 v2161 a160 v2406 a452 v2529 a483 v552 a58 v2571 a68 v1030 a234 v1210 a96 v2761 a288 v543 a608 v3744 a182 v1377 a596 v1092 a160 v1073 a140 v1089 a266 v1213 a91 v3578 a10 v590 a223 v2502 a17 v1587 a469 v1049 a57 v2048 a525 v2137 a217 v1049 a19 v590 a565 v2628 a57 v2459 a1 v2991 a448 v480 a514 v2644 a1 v2492 a186 v2193 a414 v1587 a469 v543 a226 v2048 a186 v954 a290 v919 a517 v954 a66 v2141 a387 v1278 a477 v260 a555 v1617 a448 v3062 a202 v1587 a313 v356 

In [31]:
len(metapath_walks_umu)

6100

Write the random walks to a file to be used to *metapath2vec* code. 

In [32]:
outfile = '../data/user_movie_user_metapath_walks_umu.txt'

In [33]:
with open(outfile, 'w') as f:
    for walk in metapath_walks_umu:
        f.write(walk + "\n")

Call metapath2vec with following config

- embedding dimension = 128
- context window size = 7
- negative samples = 5

In [512]:
! ../../deep_learning_graphs/metapath2vec/metapath2vec -train ../data/user_movie_user_metapath_walks_umu.txt -output ../data/metapath_umu_embed -pp 0 -size 128 -window 7 -negative 5 -threads 32


Starting training using file ../data/user_movie_user_metapath_walks_umu.txt
Vocab size: 5835
Words in train file: 1227052
Alpha: 0.000894  Progress: 98.71%  Words/thread/sec: 35.73k  

Load embeddings

In [513]:
def build_embeddings(embed_file):
    embedding_dict = dict()
    for line in open(embed_file):
        if line.startswith('v') or line.startswith('a'):
            record = line.rstrip().split()
            key = record[0]
            embedding_dict[key] = np.array(record[1:], dtype=float)
        else:
            pass
    return embedding_dict

In [514]:
embeddings_umu = build_embeddings('../data/metapath_umu_embed.txt')

In [515]:
len(embeddings_umu)

5834

In [516]:
embeddings_umu['v1'].shape

(128,)

Movie-Movie Similarity 

In [517]:
movie1 = movie2int[260]
movie2 = movie2int[1196]
1.0 - cosine(embeddings_umu[movie1], embeddings_umu[movie2])

0.5740252373388659

In [518]:
movie1, movie2

('v260', 'v1196')

In [519]:
movie3 = movie2int[1210]
1.0 - cosine(embeddings_umu[movie1], embeddings_umu[movie3])

0.5456195353020642

In [520]:
movie4 = movie2int[1]
1.0 - cosine(embeddings_umu[movie1], embeddings_umu[movie4])

0.17959358716862783

In [521]:
metapath_vecs_umu = []
metapath_umu_ids = []
for x in embeddings_umu:
    metapath_vecs_umu.append(embeddings_umu[x])
    metapath_umu_ids.append(x)

In [522]:
metapath_vecs_umu = np.array(metapath_vecs_umu)
metapath_vecs_umu.shape, len(metapath_umu_ids)

((5834, 128), 5834)

In [523]:
reverse_movie2int = {k:v for v,k in movie2int.items()}

Compute k-nearest neighboring movies to a given movie. 

In [524]:
def get_similar_movies_metapath(movieid, embed_arr, embed_keys, top_n=10):
    movie_idx = movie2int[movieid]
    idx = embed_keys.index(movie_idx)
    query = embed_arr[idx].reshape(1,-1)
    ranking = cosine_similarity(query, embed_arr)
    top_ids = np.argsort(-ranking)[0]
    top_indices = [embed_keys[y] for y in top_ids]
    top_movie_ids = [reverse_movie2int[j] for j in top_indices if j in reverse_movie2int][:top_n] 
    sim_movies = [movie_df[movie_df.movieId == id].title.values[0] for id in top_movie_ids]
    return sim_movies

In [525]:
get_similar_movies_metapath(260, metapath_vecs_umu, metapath_umu_ids)

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Cage Dive (2017)',
 'Battlefield Earth (2000)',
 'Indiana Jones and the Last Crusade (1989)',
 'Matrix, The (1999)',
 'Rush (2013)',
 'Geronimo: An American Legend (1993)']

In [526]:
get_similar_movies_metapath(1, metapath_vecs_umu, metapath_umu_ids)[:10]

['Toy Story (1995)',
 'Cemetery Man (Dellamorte Dellamore) (1994)',
 'Adventures of Pinocchio, The (1996)',
 'Great White Hype, The (1996)',
 'Hunchback of Notre Dame, The (1996)',
 'Carpool (1996)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Nutty Professor, The (1996)',
 "Mr. Holland's Opus (1995)",
 'Man of the Year (1995)']

In [527]:
get_similar_movies_metapath(2, metapath_vecs_umu, metapath_umu_ids)[:10]

['Jumanji (1995)',
 'Stargate (1994)',
 'Mask, The (1994)',
 'RoboCop 3 (1993)',
 'Lion King, The (1994)',
 'Amazing Panda Adventure, The (1995)',
 'Swan Princess, The (1994)',
 'Batman Forever (1995)',
 'NeverEnding Story III, The (1994)',
 'Boys of St. Vincent, The (1992)']

### Recommendation

In [528]:
def get_recommended_movies_metapath(userid, embed_arr, embed_keys, top_n=10):
    user_idx = user2int[userid]
    idx = embed_keys.index(user_idx)
    query = embed_arr[idx].reshape(1,-1)
    ranking = cosine_similarity(query, embed_arr)
    top_ids = np.argsort(-ranking)[0]
    top_indices = [embed_keys[y] for y in top_ids]
    top_movie_ids = [reverse_movie2int[j] for j in top_indices if j in reverse_movie2int][:top_n]
    reco_movies = [movie_df[movie_df.movieId == id].title.values[0] for id in top_movie_ids]
    return reco_movies

In [529]:
get_recommended_movies_metapath(1, metapath_vecs_umu, metapath_umu_ids)

["Gulliver's Travels (1939)",
 'Welcome to Woop-Woop (1997)',
 'Newton Boys, The (1998)',
 'Black Cauldron, The (1985)',
 'Iron Eagle (1986)',
 'Canadian Bacon (1995)',
 'Highlander: Endgame (Highlander IV) (2000)',
 "McHale's Navy (1997)",
 'Trial and Error (1997)',
 'Quiet Man, The (1952)']

In [530]:
get_recommended_movies_metapath(2, metapath_vecs_umu, metapath_umu_ids)

['The Drop (2014)',
 'Warrior (2011)',
 'Louis C.K.: Hilarious (2010)',
 'Wolf of Wall Street, The (2013)',
 'Mad Max: Fury Road (2015)',
 'Exit Through the Gift Shop (2010)',
 'Inside Job (2010)',
 'Talladega Nights: The Ballad of Ricky Bobby (2006)',
 'We Could Be King (2014)',
 'Whiplash (2014)']

In [531]:
get_recommended_movies_metapath(3, metapath_vecs_umu, metapath_umu_ids)

['Death Race 2000 (1975)',
 'Alien Contamination (1980)',
 'Clonus Horror, The (1979)',
 'Galaxy of Terror (Quest) (1981)',
 'The Lair of the White Worm (1988)',
 'Saturn 3 (1980)',
 'Hangar 18 (1980)',
 'Looker (1981)',
 'Android (1982)',
 'Piranha (1978)']

### Movie-User-Movie metapath

In [98]:
def generate_metapaths_mum(user_movielist, movie_userlist, num_walks=10, walk_length=80):
    walks = []
    for mve in movie_userlist:
        mve0 = mve
        for i in range(0, num_walks):
            walk = mve0
            for j in range(0, walk_length):
                usrlist = movie_userlist[mve]
                usrids = [x[0] for x in usrlist]
                usr_probs = [x[1] for x in usrlist]
                next_user = np.random.choice(usrids, 1, p=usr_probs)[0]
                walk += " " + next_user

                movielist = user_movielist[next_user]
                movieids = [x[0] for x in movielist]
                mve_probs =  [x[1] for x in movielist]
                next_movie = np.random.choice(movieids, 1, p=mve_probs)[0]
                walk += " " + next_movie
            walks.append(walk)
    return walks
                   

In [119]:
metapath_walks_mum = generate_metapaths_mum(user_movielist, movie_userlist, num_walks=10, walk_length=100)

In [120]:
metapath_walks_mum[0]

'v1 a573 v6586 a522 v1210 a477 v55442 a434 v6502 a562 v2420 a533 v8874 a606 v6591 a119 v6539 a96 v367 a469 v1037 a178 v4226 a359 v2324 a21 v4896 a50 v27513 a378 v63082 a483 v8644 a43 v339 a608 v4896 a411 v235 a57 v1370 a274 v4223 a239 v367 a411 v159 a5 v364 a206 v141 a280 v7451 a608 v3499 a214 v628 a381 v3247 a380 v3156 a201 v2683 a18 v104879 a359 v44555 a90 v1183 a229 v150 a276 v104 a471 v4973 a500 v2987 a597 v1673 a550 v122904 a54 v639 a451 v17 a411 v587 a573 v2394 a71 v1101 a91 v260 a601 v72378 a57 v1580 a178 v296 a63 v7254 a353 v457 a191 v308 a282 v2028 a339 v1265 a414 v150548 a130 v225 a334 v61024 a264 v3510 a579 v2396 a63 v1278 a332 v1224 a159 v597 a328 v2717 a167 v5103 a229 v353 a456 v1 a247 v106782 a469 v3022 a130 v150 a277 v849 a82 v34 a389 v17 a193 v55442 a269 v63 a357 v7153 a334 v1265 a477 v2657 a378 v7153 a46 v282 a579 v356 a328 v51255 a134 v173 a323 v105 a357 v33794 a380 v5323 a572 v150 a229 v586 a46 v10 a276 v370 a91 v2804 a601 v1584 a436 v235 a381 v7147 a605 v2617 a412 v

In [121]:
len(metapath_walks_mum)

85190

In [122]:
outfile_mum = '../data/user_movie_user_metapath_walks_mum.txt'

In [123]:
with open(outfile_mum, 'w') as f:
    for walk in metapath_walks_mum:
        f.write(walk + "\n")

In [532]:
! ../../deep_learning_graphs/metapath2vec/metapath2vec -train ../data/user_movie_user_metapath_walks_mum.txt -output ../data/metapath_mum_embed -pp 0 -size 128 -window 7 -negative 5 -threads 32

Starting training using file ../data/user_movie_user_metapath_walks_mum.txt
Vocab size: 9130
Words in train file: 17208380
Alpha: 0.000037  Progress: 99.86%  Words/thread/sec: 36.25k  

In [533]:
embeddings_mum = build_embeddings('../data/metapath_mum_embed.txt')

In [534]:
embeddings_mum['v1'].shape

(128,)

In [535]:
len(embeddings_mum)

9129

In [536]:
1.0 - cosine(embeddings_mum[movie1], embeddings_mum[movie2])

0.7035925852157052

In [537]:
1.0 - cosine(embeddings_mum[movie1], embeddings_mum[movie3])

0.686735952361414

In [538]:
1.0 - cosine(embeddings_mum[movie1], embeddings_mum[movie4])

0.6477228064390673

In [539]:
metapath_vecs_mum = []
metapath_mum_ids = []
for x in embeddings_mum:
    metapath_vecs_mum.append(embeddings_mum[x])
    metapath_mum_ids.append(x)

In [540]:
metapath_vecs_mum = np.array(metapath_vecs_mum)
metapath_vecs_mum.shape, len(metapath_mum_ids)

((9129, 128), 9129)

In [541]:
get_similar_movies_metapath(260, metapath_vecs_mum, metapath_mum_ids)

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Twister (1996)',
 'Birdcage, The (1996)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Toy Story (1995)',
 'Mighty Aphrodite (1995)',
 'Shawshank Redemption, The (1994)']

In [542]:
get_similar_movies_metapath(1, metapath_vecs_mum, metapath_mum_ids)

['Toy Story (1995)',
 'Eraser (1996)',
 'Sense and Sensibility (1995)',
 'Father of the Bride Part II (1995)',
 'Star Trek: First Contact (1996)',
 'Nutty Professor, The (1996)',
 'Nixon (1995)',
 'Mission: Impossible (1996)',
 'Mighty Aphrodite (1995)',
 'Star Wars: Episode IV - A New Hope (1977)']

In [543]:
get_similar_movies_metapath(2, metapath_vecs_mum, metapath_mum_ids)

['Jumanji (1995)',
 'Lion King, The (1994)',
 'Santa Clause, The (1994)',
 'Pretty Woman (1990)',
 'Dances with Wolves (1990)',
 'Beauty and the Beast (1991)',
 'Terminator 2: Judgment Day (1991)',
 'Mrs. Doubtfire (1993)',
 'Star Trek: Generations (1994)',
 'Stargate (1994)']

In [544]:
get_recommended_movies_metapath(1, metapath_vecs_mum, metapath_mum_ids)

["Gulliver's Travels (1939)",
 'Welcome to Woop-Woop (1997)',
 'Newton Boys, The (1998)',
 "McHale's Navy (1997)",
 'Dracula (1931)',
 'Red Dawn (1984)',
 'Ghost and the Darkness, The (1996)',
 'Very Bad Things (1998)',
 'Ghost and Mrs. Muir, The (1947)',
 'Winnie the Pooh and the Blustery Day (1968)']

In [545]:
get_recommended_movies_metapath(2, metapath_vecs_mum, metapath_mum_ids)

['Inside Job (2010)',
 'Warrior (2011)',
 'Louis C.K.: Hilarious (2010)',
 'Town, The (2010)',
 'Shutter Island (2010)',
 'Inception (2010)',
 'Django Unchained (2012)',
 "In My Father's Den (2004)",
 'The Drop (2014)',
 'Girl with the Dragon Tattoo, The (2011)']

In [546]:
get_recommended_movies_metapath(3, metapath_vecs_mum, metapath_mum_ids)

['Galaxy of Terror (Quest) (1981)',
 'Alien Contamination (1980)',
 'Looker (1981)',
 'Saturn 3 (1980)',
 'Hangar 18 (1980)',
 'Clonus Horror, The (1979)',
 'Android (1982)',
 'Death Race 2000 (1975)',
 'The Lair of the White Worm (1988)',
 'Piranha (1978)']

### UMU and MUM meta-paths

In [547]:
metapath_vecs_mum_umu = []
metapath_vecs_mum_umu.extend(metapath_walks_umu)
metapath_vecs_mum_umu.extend(metapath_walks_mum)

In [548]:
outfile_mum_umu = '../data/user_movie_user_metapath_walks_mum_umu.txt'

In [549]:
with open(outfile_mum_umu, 'w') as f:
    for walk in metapath_vecs_mum_umu:
        f.write(walk + "\n")

In [550]:
! ../../deep_learning_graphs/metapath2vec/metapath2vec -train ../data/user_movie_user_metapath_walks_mum_umu.txt -output ../data/metapath_mum_umu_embed -pp 0 -size 128 -window 7 -negative 5 -threads 32

Starting training using file ../data/user_movie_user_metapath_walks_mum_umu.txt
Vocab size: 9130
Words in train file: 18440580
Alpha: 0.000037  Progress: 99.86%  Words/thread/sec: 36.77k  

In [551]:
embeddings_mum_umu = build_embeddings('../data/metapath_mum_umu_embed.txt')

In [552]:
embeddings_mum_umu['v1'].shape

(128,)

In [553]:
len(embeddings_mum_umu)

9129

In [554]:
1.0 - cosine(embeddings_mum_umu[movie1], embeddings_mum_umu[movie2])

0.7090602132423149

In [555]:
1.0 - cosine(embeddings_mum_umu[movie1], embeddings_mum_umu[movie3])

0.7114531489752923

In [556]:
1.0 - cosine(embeddings_mum_umu[movie1], embeddings_mum_umu[movie4])

0.6760874734261341

In [557]:
metapath_vecs_mum_umu = []
metapath_mum_umu_ids = []
for x in embeddings_mum_umu:
    metapath_vecs_mum_umu.append(embeddings_mum_umu[x])
    metapath_mum_umu_ids.append(x)

In [558]:
metapath_vecs_mum_umu = np.array(metapath_vecs_mum_umu)
metapath_vecs_mum_umu.shape, len(metapath_mum_umu_ids)

((9129, 128), 9129)

In [559]:
get_similar_movies_metapath(260, metapath_vecs_mum_umu, metapath_mum_umu_ids)

['Star Wars: Episode IV - A New Hope (1977)',
 'Star Wars: Episode VI - Return of the Jedi (1983)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Toy Story (1995)',
 'Fargo (1996)',
 'Rock, The (1996)',
 'Independence Day (a.k.a. ID4) (1996)',
 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)',
 'Birdcage, The (1996)']

In [560]:
get_similar_movies_metapath(1, metapath_vecs_mum_umu, metapath_mum_umu_ids)

['Toy Story (1995)',
 'Nutty Professor, The (1996)',
 'Fargo (1996)',
 'Mission: Impossible (1996)',
 'Willy Wonka & the Chocolate Factory (1971)',
 'Sense and Sensibility (1995)',
 'Star Trek: First Contact (1996)',
 'Father of the Bride Part II (1995)',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)',
 'Birdcage, The (1996)']

In [561]:
get_similar_movies_metapath(2, metapath_vecs_mum_umu, metapath_mum_umu_ids)

['Jumanji (1995)',
 'Lion King, The (1994)',
 'Star Trek: Generations (1994)',
 'Sleepless in Seattle (1993)',
 'Pretty Woman (1990)',
 'Santa Clause, The (1994)',
 'Dances with Wolves (1990)',
 'Mrs. Doubtfire (1993)',
 'Apollo 13 (1995)',
 'Maverick (1994)']

In [562]:
get_recommended_movies_metapath(1, metapath_vecs_mum_umu, metapath_mum_umu_ids)

["Gulliver's Travels (1939)",
 'Newton Boys, The (1998)',
 'Black Cauldron, The (1985)',
 'Ghost and the Darkness, The (1996)',
 'Welcome to Woop-Woop (1997)',
 'Transformers: The Movie (1986)',
 'Three Musketeers, The (1993)',
 "McHale's Navy (1997)",
 'Honey, I Shrunk the Kids (1989)',
 'Star Wars: Episode V - The Empire Strikes Back (1980)']

In [563]:
get_recommended_movies_metapath(2, metapath_vecs_mum_umu, metapath_mum_umu_ids)

['Warrior (2011)',
 'Louis C.K.: Hilarious (2010)',
 'Inception (2010)',
 'Town, The (2010)',
 'The Drop (2014)',
 'Wolf of Wall Street, The (2013)',
 'Inside Job (2010)',
 'Django Unchained (2012)',
 'Shutter Island (2010)',
 'Girl with the Dragon Tattoo, The (2011)']

In [564]:
get_recommended_movies_metapath(3, metapath_vecs_mum_umu, metapath_mum_umu_ids)

['Galaxy of Terror (Quest) (1981)',
 'Alien Contamination (1980)',
 'Hangar 18 (1980)',
 'Looker (1981)',
 'Saturn 3 (1980)',
 'Clonus Horror, The (1979)',
 'The Lair of the White Worm (1988)',
 'Death Race 2000 (1975)',
 'Android (1982)',
 'Piranha (1978)']

## Evaluation on test data

In [565]:
def rank_movies_metapath(userid, movie_list, embed_dict, top_n=10):
    usr_idx = user2int[userid]
    usr_vec = embed_dict[usr_idx]
    sim_dict = dict()
    for mve in movie_list:
        if (mve in movie2int):
            mve_idx = movie2int[mve]
            if (mve_idx in embed_dict):
                mve_vec = embed_dict[mve_idx]
                sim_dict[mve] = cosine(usr_vec, mve_vec)
        else:
            pass
    ranked_movies = sorted(sim_dict.items(), key=operator.itemgetter(1))
    return ranked_movies[:top_n]

In [579]:
test_df.shape

(30251, 4)

In [580]:
test_df[test_df['rating'] >= 3.5].shape

(18478, 4)

In [593]:
test_pos_examples = dict((user, set(movies)) for user, movies in test_df[test_df['rating'] >= 3.5].groupby('userId')['movieId'])

In [594]:
len(test_pos_examples)

605

In [595]:
list(test_pos_examples.keys())[:10]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [596]:
def precision_at_k(userid, true_pos_dict, embeddings, k):
    true_pos = true_pos_dict[userid]
    recos = [x for x,y in rank_movies_metapath(userid, test_movies, embeddings)]
    return (len(set(recos).intersection(true_pos)) * 1.0) / k

In [607]:
precision_at_k(103, test_pos_examples, embeddings_mum, k=10)

0.0

In [608]:
def average_precision_at_k(userid, true_pos_dict, embeddings, k=10):
    ap_at_k = 0
    for i in range(1,k):
        ap_at_k += precision_at_k(userid, true_pos_dict, embeddings, i)
    return ap_at_k / k

In [609]:
average_precision_at_k(2, test_pos_examples, embeddings_mum, k=10)

0.28289682539682537

In [610]:
def mAP(true_pos_dict, embeddings, k=10):
    score = 0
    for usr in true_pos_dict:
        score += average_precision_at_k(usr, true_pos_dict, embeddings, k)
    return score / len(true_pos_dict)

In [320]:
true_pos_dict = {1:test_pos_examples[1], 2:test_pos_examples[2], 3:test_pos_examples[3]}

In [612]:
mAP(test_pos_examples, embeddings_mum, k=10)

0.03787544273907912

In [613]:
mAP(test_pos_examples, embeddings_umu, k=10)

0.012157549521185888

In [614]:
mAP(test_pos_examples, embeddings_mum_umu, k=10)

0.0752832874196511

##  Node Classification

In [624]:
movie_genre_edgelist = movie_df[['movieId', 'genres']]
movie_genre_edgelist.head()

Unnamed: 0,movieId,genres
0,1,Adventure|Animation|Children|Comedy|Fantasy
1,2,Adventure|Children|Fantasy
2,3,Comedy|Romance
3,4,Comedy|Drama|Romance
4,5,Comedy


In [616]:
movie2genre = dict()
genre2movie = defaultdict(list)
for x in movie_genre_edgelist.values:
    genrelist = x[1].split('|')
    movie2genre[x[0]] = genrelist
    for g in genrelist:
        genre2movie[g].append(x[0])

In [747]:
genre_dict = {'Adventure': 1, 'Children': 1, 'Fantasy': 1 , 'Crime': 5, 'Thriller' : 5, 'Mystery' : 5, 'Sci-Fi' : 5, 'Comedy': 3, 
             'Drama': 2, 'Action': 4}

In [675]:
movie_embed_mum = []
movie_genres = []
for mve in movie2int:
    mve_id = movie2int[mve]
    genre = movie2genre[mve][0]
    if genre in genre_dict:
        if mve_id in embeddings_mum:
            embed = embeddings_mum[mve_id]
            movie_embed_mum.append(embed)
            movie_genres.append(genre_dict[genre] - 1)
        else:
            pass

In [676]:
len(movie_embed_mum), len(movie_genres)

(7441, 7441)

In [677]:
movie_embed_mum = np.array(movie_embed_mum)
movie_embed_mum.shape

(7441, 128)

In [678]:
movie_genres[:5]

[4, 2, 3, 2, 3]

In [646]:
from sklearn.model_selection import train_test_split

In [680]:
X_train, X_test, y_train, y_test = train_test_split(movie_embed_mum, np.array(movie_genres), test_size=0.33, random_state=42)

In [681]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4985, 128), (2456, 128), (4985,), (2456,))

In [687]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

In [741]:
clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [742]:
y_pred = clf.predict(X_test)

In [743]:
y_pred[:5]

array([2, 3, 2, 3, 0])

In [744]:
confusion_matrix(y_test, y_pred)

array([[ 57,  50,  72,  54,   3],
       [ 12, 335, 169,  99,  14],
       [ 24, 186, 505,  99,  11],
       [ 13,  79, 120, 345,   7],
       [  5,  87,  54,  44,  12]])

In [745]:
f1_score(y_test, y_pred, average='macro')

0.4133619267082745

In [746]:
f1_score(y_test, y_pred, average='micro')

0.510586319218241

## Parameter Senstivity

In [760]:
metapath = 'mum'
train_file = '../data/user_movie_user_metapath_walks_' + metapath + '.txt'
train_file

'../data/user_movie_user_metapath_walks_mum.txt'

In [761]:
embed_out_file = '../data/metapath_mum_umu_embed_dim_' + str(embed_dim)
embed_out_file

'../data/metapath_mum_umu_embed_dim_10'

In [764]:
embed_dim = 50
embed_dim

50

In [756]:
! ../../deep_learning_graphs/metapath2vec/metapath2vec -train $train_file -output $embed_out_file -pp 0 -size $embed_dim -window 7 -negative 5 -threads 32

Starting training using file ../data/user_movie_user_metapath_walks_mum_umu.txt
Vocab size: 9130
Words in train file: 18440580
Alpha: 0.000037  Progress: 99.86%  Words/thread/sec: 72.75k  

In [766]:
embedding_dict = build_embeddings(embed_out_file + '.txt')

In [765]:
mAP(test_pos_examples, embedding_dict, k=10)

In [777]:
def build_data_for_classification(embeddings):
    movie_embed = []
    movie_genres = []
    for mve in movie2int:
        mve_id = movie2int[mve]
        genre = movie2genre[mve][0]
        if genre in genre_dict:
            if mve_id in embeddings:
                embed = embeddings[mve_id]
                movie_embed.append(embed)
                movie_genres.append(genre_dict[genre] - 1)
            else:
                pass
    X_train, X_test, y_train, y_test = train_test_split(np.array(movie_embed), np.array(movie_genres), test_size=0.33, random_state=42)
    return X_train, X_test, y_train, y_test

In [778]:
X_train, X_test, y_train, y_test = build_data_for_classification(embeddings_mum_umu)

In [779]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((4985, 128), (2456, 128), (4985,), (2456,))

In [780]:
def evaluate_node_classification(embeddings):
    X_train, X_test, y_train, y_test = build_data_for_classification(embeddings)
    clf = LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=1000)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {'macro': f1_score(y_test, y_pred, average='macro'), 'micro': f1_score(y_test, y_pred, average='micro')}

In [781]:
evaluate_node_classification(embeddings_mum)

{'macro': 0.4133619267082745, 'micro': 0.510586319218241}

In [782]:
evaluate_node_classification(embeddings_umu)

{'macro': 0.43443120450968564, 'micro': 0.5097150259067358}

In [783]:
evaluate_node_classification(embeddings_mum_umu)

{'macro': 0.41798399934738273, 'micro': 0.5175081433224755}