# Building a Recommender System

In this notebook we will build a recommender system by training embeddings on a subset of the data we extracted in the other notebook. The subset covers 10k movies and is stored in `data/wp_movies_10k.ndjson`.

## Before starting...

We need to import the libraries we'll use.

In [17]:
import inspect
from helpers import *

## Training movie embeddings

We want to use the link data between entities to recommend content.

We'll achieve this by training embeddings on the outgoing links of each movie's article. Why? Because two movies that link to the same pages might share a director, crew, or actors, or have been released close to each other. In general, a link from a movie article to another page signals a relationship we want to exploit.

Let's start by counting the outgoing links as a quick way to see if what we have is reasonable:

In [8]:
with open('data/wp_movies_10k.ndjson') as f:
    movies = [json.loads(line) for line in f]

link_counts = Counter()

for movie in movies:
    movie_links = movie[2]
    link_counts.update(movie_links)

print(link_counts.most_common(10))

[('Rotten Tomatoes', 9393), ('Category:English-language films', 5882), ('Category:American films', 5867), ('Variety (magazine)', 5450), ('Metacritic', 5112), ('Box Office Mojo', 4186), ('The New York Times', 3818), ('The Hollywood Reporter', 3553), ('Roger Ebert', 2707), ('Los Angeles Times', 2454)]
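
As a small illustration of the counting step, the sketch below runs the same `Counter.update` pattern over three made-up records. The titles, links, and the empty dictionary at index 1 are hypothetical; only the positions of the title (index 0) and the links (index 2) mirror how the cell above reads each movie.

```python
from collections import Counter

# Hypothetical records: title at index 0, outgoing links at index 2.
toy_movies = [
    ("Movie A", {}, ["Rotten Tomatoes", "Director X"]),
    ("Movie B", {}, ["Rotten Tomatoes", "Director X", "Actor Y"]),
    ("Movie C", {}, ["Rotten Tomatoes"]),
]

toy_link_counts = Counter()
for toy_movie in toy_movies:
    # update() with a list adds one count per element of the list.
    toy_link_counts.update(toy_movie[2])

print(toy_link_counts.most_common(2))
# [('Rotten Tomatoes', 3), ('Director X', 2)]
```

Note that a link repeated within a single movie's list would be counted once per occurrence, so these are occurrence counts rather than counts of distinct movies.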


Now, our task is to determine whether a certain link can be found on the Wikipedia page of a movie. Hence, we need to feed the model a proper dataset of matches and non-matches. For this, we'll keep only links that occur at least three times.

We will also build a list of valid (link, movie) pairs, plus a set of the same pairs that'll speed up membership lookups later.

In [9]:
top_links = [link for link, count in link_counts.items() if count >= 3]
link_to_index = {link: index for index, link in enumerate(top_links)}
movie_to_index = {movie[0]: index for index, movie in enumerate(movies)}

pairs = []
for movie in movies:
    movie_title = movie[0]
    movie_links = movie[2]
    pairs.extend((link_to_index[link], movie_to_index[movie_title])
                 for link in movie_links
                 if link in link_to_index)

pairs_set = set(pairs)
print(f'Number of pairs: {len(pairs)}')
print(f'Number of links: {len(top_links)}')
print(f'Number of movies: {len(movie_to_index)}')

Number of pairs: 949544
Number of links: 66913
Number of movies: 10000
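
To make the indexing step concrete, here is a minimal sketch on toy data. The titles, links, and the lower threshold of two are assumptions chosen so the example stays tiny; the structure (filter links by count, map names to integer ids, collect id pairs, keep a set for lookups) follows the cell above.

```python
from collections import Counter

# Hypothetical toy data: (title, outgoing links).
toy = [
    ("Movie A", ["Rotten Tomatoes", "Director X"]),
    ("Movie B", ["Rotten Tomatoes", "Actor Y"]),
    ("Movie C", ["Rotten Tomatoes", "Director X"]),
]

counts = Counter(link for _, links in toy for link in links)

# Keep links seen at least twice (the notebook uses three on the full data).
min_count = 2
kept_links = [link for link, count in counts.items() if count >= min_count]
link_index = {link: i for i, link in enumerate(kept_links)}
movie_index = {title: i for i, (title, _) in enumerate(toy)}

# One (link_id, movie_id) pair per surviving link on each movie.
toy_pairs = [(link_index[link], movie_index[title])
             for title, links in toy
             for link in links
             if link in link_index]
toy_pairs_set = set(toy_pairs)

print(toy_pairs)
# [(0, 0), (1, 0), (0, 1), (0, 2), (1, 2)]
print((link_index["Director X"], movie_index["Movie B"]) in toy_pairs_set)
# False
```

"Actor Y" appears only once, so it is dropped and never receives an id. The set makes the final membership test a constant-time lookup instead of a scan through the list.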


Good. We can now move on to building our embeddings. We'll use Keras for this purpose.

The model takes a `link_id` and a `movie_id` and feeds each to its own embedding layer, which holds one vector of length `embedding_size` for every possible input. The output of the model is the normalized dot product (cosine similarity) of the two vectors. During training, the model learns weights such that this value is as close to the actual label as possible.

These weights then project movies and links into a multidimensional space where similar movies end up close to each other.

In [10]:
print(inspect.getsource(get_movie_embedding_model))

model = get_movie_embedding_model(top_links, movie_to_index)
model.summary()

def get_movie_embedding_model(top_links, movie_to_index, embedding_size=50):
    link = Input(name='link', shape=(1,))
    movie = Input(name='movie', shape=(1,))

    link_embedding = Embedding(name='link_embedding',
                               input_dim=len(top_links),
                               output_dim=embedding_size)(link)
    movie_embedding = Embedding(name='movie_embedding',
                                input_dim=len(movie_to_index),
                                output_dim=embedding_size)(movie)
    dot = Dot(name='dot_product',
              normalize=True,
              axes=2)([link_embedding, movie_embedding])

    merged = Reshape(target_shape=(1,))(dot)

    model = Model(inputs=[link, movie], outputs=[merged])
    model.compile(optimizer='nadam', loss='mse')

    return model

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to      
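
The `Dot` layer above is created with `normalize=True`, which L2-normalizes both vectors before taking the dot product, so the model's output is a cosine similarity in the range [-1, 1]. A minimal NumPy sketch of that computation follows; the vectors are random stand-ins rather than trained embeddings, and `cosine_similarity` is a name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size = 50

# Stand-ins for one row of each embedding matrix (random, untrained).
link_vector = rng.normal(size=embedding_size)
movie_vector = rng.normal(size=embedding_size)


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


score = cosine_similarity(link_vector, movie_vector)
print(score)                                             # some value in [-1, 1]
print(cosine_similarity(link_vector, link_vector))       # approximately 1
print(cosine_similarity(link_vector, -link_vector))      # approximately -1
```

Because the output is bounded this way, targets of 1 for matches and -1 for non-matches sit at the two ends of the range the model can actually produce.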

To train the model, we will use a generator that yields batches of data made up of positive and negative examples.

The positive instances are sampled from the `pairs` list and labeled 1. The batch is then filled up with negative examples labeled -1, which are picked at random (we double-check they're not in `pairs_set`).

In [11]:
print(inspect.getsource(batchifier))

def batchifier(pairs, pairs_set, top_links, movie_to_index, positive_samples=50, negative_ratio=10):
    batch_size = positive_samples * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))

    while True:
        for index, (link_id, movie_id) in enumerate(random.sample(pairs, positive_samples)):
            batch[index, :] = (link_id, movie_id, 1)

        index = positive_samples

        while index < batch_size:
            movie_id = random.randrange(len(movie_to_index))
            link_id = random.randrange(len(top_links))

            if (link_id, movie_id) not in pairs_set:
                batch[index, :] = (link_id, movie_id, -1)
                index += 1

        np.random.shuffle(batch)

        yield {'link': batch[:, 0],
               'movie': batch[:, 1]}, batch[:, 2]
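
The negative-sampling loop in `batchifier` is a form of rejection sampling: draw a random pair, and keep it only if it is not a known positive. The standalone sketch below isolates that idea using only the standard library. `sample_negatives` and the toy positives are hypothetical and not part of `helpers`.

```python
import random


def sample_negatives(positive_pairs, num_links, num_movies, count, rng):
    """Draw `count` random (link_id, movie_id) pairs that are not known positives."""
    negatives = []
    while len(negatives) < count:
        candidate = (rng.randrange(num_links), rng.randrange(num_movies))
        if candidate not in positive_pairs:   # reject known positives
            negatives.append(candidate)
    return negatives


rng = random.Random(5)
known_positives = {(0, 0), (1, 0), (0, 1)}    # hypothetical toy positives
negatives = sample_negatives(known_positives, num_links=4, num_movies=3,
                             count=6, rng=rng)
print(negatives)
```

Rejections are rare on the real data: at most 949,544 of the 66,913 × 10,000 possible (link, movie) combinations are positives, which is about 0.14%, so almost every random draw is accepted.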



Let's train the model:

In [12]:
random.seed(5)

positive_samples_per_batch = 512
data_generator = batchifier(pairs, pairs_set, top_links, movie_to_index, positive_samples_per_batch)
steps_per_epoch = len(pairs) // positive_samples_per_batch

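# Note: `fit_generator` is deprecated in recent Keras releases, where
# `model.fit` accepts the generator directly.
model.fit_generator(data_generator,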
                    epochs=15,
                    steps_per_epoch=steps_per_epoch)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f64869fbbe0>
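
For reference, the batch and epoch sizes implied by these settings can be worked out directly. The numbers below come from the output printed earlier and from `batchifier`'s default `negative_ratio`; the variable names are only for this sketch.

```python
num_pairs = 949544           # "Number of pairs" printed earlier
positive_samples = 512       # positive_samples_per_batch
negative_ratio = 10          # batchifier's default

batch_size = positive_samples * (1 + negative_ratio)
steps_per_epoch = num_pairs // positive_samples
positives_per_epoch = steps_per_epoch * positive_samples

print(batch_size)            # 5632
print(steps_per_epoch)       # 1854
print(positives_per_epoch)   # 949248
```

Since `batchifier` calls `random.sample` afresh for every batch, an epoch draws roughly as many positives as there are pairs but does not visit each pair exactly once.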

We can extract the movie and link embeddings from the model by accessing the corresponding layers by name:

In [14]:
movie_embedding = model.get_layer('movie_embedding')
movie_weights = movie_embedding.get_weights()[0]
movie_lengths = np.linalg.norm(movie_weights, axis=1)
normalized_movies = (movie_weights.T / movie_lengths).T
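
The double transpose above divides every row by its own length. A small sketch, using a random stand-in matrix rather than the trained weights, shows that it matches the more common broadcasting form and that each row ends up with unit norm:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))     # hypothetical: 4 movies, 3 dimensions

lengths = np.linalg.norm(weights, axis=1)
via_transpose = (weights.T / lengths).T            # form used in the notebook
via_broadcast = weights / lengths[:, np.newaxis]   # equivalent broadcasting form

print(np.allclose(via_transpose, via_broadcast))   # True
print(np.linalg.norm(via_broadcast, axis=1))       # every entry is close to 1
```

With unit-length rows, a plain dot product between two rows is already their cosine similarity, which is what the lookup helpers rely on.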

Let's check the embeddings make sense:

In [15]:
print(inspect.getsource(similar_movies))

similar_movies('Rogue One', movies, normalized_movies, movie_to_index)

def similar_movies(movie, movies, normalized_movies, movie_to_index, top_n=10):
    distances = np.dot(normalized_movies, normalized_movies[movie_to_index[movie]])
    closest = np.argsort(distances)[-top_n:]

    for c in reversed(closest):
        movie_title = movies[c][0]
        distance = distances[c]
        print(c, movie_title, distance)

29 Rogue One 0.99999994
19 Interstellar (film) 0.9814422
3349 Star Wars: The Force Awakens 0.9689913
25 Star Wars sequel trilogy 0.9672627
659 Rise of the Planet of the Apes 0.96509415
245 Gravity (film) 0.9645235
86 Tomorrowland (film) 0.9591576
372 The Amazing Spider-Man (2012 film) 0.9586833
181 Pacific Rim (film) 0.9578317
37 Avatar (2009 film) 0.9577962
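
The third column printed above is a cosine similarity (higher means closer), even though the helper stores it in a variable called `distances`; the query movie itself always ranks first with a value of about 1. The toy sketch below reproduces the lookup on four hand-picked 2-D vectors. The titles, vectors, and the `top_similar` name are assumptions made for this example.

```python
import numpy as np

titles = ["A", "B", "C", "D"]         # hypothetical titles
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
# Scale every row to unit length so dot products are cosine similarities.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)


def top_similar(query_index, unit_vectors, top_n=3):
    """Return (index, cosine similarity) pairs, most similar first."""
    similarities = unit_vectors @ unit_vectors[query_index]
    order = np.argsort(similarities)[::-1][:top_n]
    return [(int(i), float(similarities[i])) for i in order]


for index, similarity in top_similar(0, unit):
    print(titles[index], round(similarity, 3))
```

Here "A" comes first (similarity 1), followed by the nearly parallel "B" and then the orthogonal "C"; the opposite-pointing "D" would come last with a similarity of -1.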


Same deal with links:

In [18]:
print(inspect.getsource(similar_links))
    
link_embedding = model.get_layer('link_embedding')
link_weights = link_embedding.get_weights()[0]
link_lengths = np.linalg.norm(link_weights, axis=1)
normalized_links = (link_weights.T / link_lengths).T

similar_links('George Lucas', top_links, normalized_links, link_to_index)

def similar_links(link, top_links, normalized_links, link_to_index, top_n=10):
    distances = np.dot(normalized_links, normalized_links[link_to_index[link]])
    closest = np.argsort(distances)[-top_n:]
    
    for l in reversed(closest):
        distance = distances[l]
        print(l, top_links[l], distance)

127 George Lucas 0.99999994
2707 Star Wars 0.93741417
4830 widescreen 0.93099356
3176 Star Wars (film) 0.9273331
976 Hugo Award for Best Dramatic Presentation 0.91403615
2931 LaserDisc 0.89742184
2829 storyboard 0.8899722
2860 Steven Spielberg 0.8812173
4051 novelization 0.88026863
1732 Academy Award for Best Visual Effects 0.8759304


## Building a Movie Recommender

With our embeddings trained and working, we can use them to train a simple classifier, such as an SVM, to separate positively rated items from negatively rated ones.

Given we don't have any users, we cannot use user data to train the classifier, so we need to fake it:

In [19]:
best = ['Star Wars: The Force Awakens', 'The Martian (film)', 'Tangerine (film)', 'Straight Outta Compton (film)',
        'Brooklyn (film)', 'Carol (film)', 'Spotlight (film)']
worst = ['American Ultra', 'The Cobbler (2014 film)', 'Entourage (film)', 'Fantastic Four (2015 film)',
         'Get Hard', 'Hot Pursuit (2015 film)', 'Mortdecai (film)', 'Serena (2014 film)', 'Vacation (2015 film)']

all_data = best + worst
y = np.asarray(([1] * len(best)) + ([0] * len(worst)))
X = np.asarray([normalized_movies[movie_to_index[movie]]
                for movie in all_data])

print(X.shape)

(16, 50)


Training an SVM on this data is so easy that it feels like cheating:

In [20]:
classifier = svm.SVC(kernel='linear')
classifier.fit(X, y)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
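
With a linear kernel, the classifier's `decision_function` is simply `w . x + b`, and its sign gives the predicted class. The sketch below checks this on hypothetical 2-D points standing in for embeddings; the data and the `toy_` names are made up for the example.

```python
import numpy as np
from sklearn import svm

# Hypothetical 2-D "embeddings": liked items on the right, disliked on the left.
X_toy = np.array([[1.0, 0.2], [0.8, -0.1], [0.9, 0.4],
                  [-1.0, 0.1], [-0.7, -0.3], [-0.9, 0.2]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_classifier = svm.SVC(kernel='linear')
toy_classifier.fit(X_toy, y_toy)

# Library score versus the same score computed by hand as w . x + b.
scores = toy_classifier.decision_function(X_toy)
manual = X_toy @ toy_classifier.coef_.ravel() + toy_classifier.intercept_[0]

print(np.allclose(scores, manual))   # True
print(scores > 0)                    # True for the liked items, False otherwise
```

Because the score is a continuous value rather than a 0/1 label, it can be used to rank every movie, not only to classify it.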

Let's run the classifier over all of our movies and pick the top 5 best and top 5 worst:

In [21]:
estimated_movie_ratings = classifier.decision_function(normalized_movies)
    
best = np.argsort(estimated_movie_ratings)
print('Best: ')
for movie_index in reversed(best[-5:]):
    movie_title = movies[movie_index][0]
    movie_rating = estimated_movie_ratings[movie_index]
    print(movie_index, movie_title, movie_rating)

print('Worst: ')
for movie_index in best[:5]:
    movie_title = movies[movie_index][0]
    movie_rating = estimated_movie_ratings[movie_index]
    print(movie_index, movie_title, movie_rating)

Best: 
66 Skyfall 1.3088888053043541
481 The Devil Wears Prada (film) 1.3019425459507856
458 Hugo (film) 1.1653304719497952
307 Les Misérables (2012 film) 1.1385154200244971
3 Spectre (2015 film) 1.056846426209078
Worst: 
5097 Ready to Rumble -1.5441925266385934
9595 Speed Zone -1.527474667206325
1878 The Little Rascals (film) -1.4972133841701225
8559 Air Buddies -1.4825302881861921
7593 Trojan War (film) -1.4619956658429054


## Predict Simple Movie Properties

We can also use our embeddings to predict simple properties of movies, like Rotten Tomatoes ratings. Let's do that!

For this task we'll resort to linear regression.

The Rotten Tomatoes score of a movie is in `movie[-2]`:

In [22]:
# Rotten tomatoes score.
rotten_y = np.asarray([float(movie[-2][:-1]) / 100 for movie in movies if movie[-2]])
# Vectors representing movie titles.
rotten_X = np.asarray([normalized_movies[movie_to_index[movie[0]]] for movie in movies if movie[-2]])
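
The expression `float(movie[-2][:-1]) / 100` drops the last character of the score and rescales the rest to the 0-1 range. The sketch below assumes the raw value is a string such as `'93%'` (the format is inferred from the code, not shown in the data), and `parse_percentage` is a hypothetical helper that is not in `helpers`.

```python
def parse_percentage(raw):
    """Convert a string such as '93%' to 0.93; return None for missing values."""
    if not raw:
        return None
    return float(raw[:-1]) / 100   # drop the trailing character, then rescale


print(parse_percentage("93%"))   # 0.93
print(parse_percentage(""))      # None
print(parse_percentage(None))    # None
```

The `if movie[-2]` filter in the cell above plays the same role as the early return here: movies without a score are skipped in both `rotten_X` and `rotten_y`, so the two arrays stay aligned.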

In [23]:
TRAINING_SIZE = 0.8
SPLIT_POINT = int(len(rotten_X) * TRAINING_SIZE)
rotten_X_train = rotten_X[:SPLIT_POINT]
rotten_y_train = rotten_y[:SPLIT_POINT]

rotten_X_test = rotten_X[SPLIT_POINT:]
rotten_y_test = rotten_y[SPLIT_POINT:]

regressor = LinearRegression()
regressor.fit(rotten_X_train, rotten_y_train)

error = regressor.predict(rotten_X_test) - rotten_y_test
print(f'Mean Squared Error using linear regression: {np.mean(error ** 2)}')

training_rotten_tomatoes_mean_score = np.mean(rotten_y_train)
error = training_rotten_tomatoes_mean_score - rotten_y_test
print(f'Mean Squared Error using just the mean: {np.mean(error ** 2)}')

Mean Squared Error using linear regression: 0.06131464247800219
Mean Squared Error using just the mean: 0.08957773784832902
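
To put the two numbers side by side, this short calculation derives the relative improvement over the mean baseline and the corresponding root mean squared errors. The values are copied from the output above; nothing is retrained.

```python
mse_regression = 0.06131464247800219    # from the output above
mse_mean_baseline = 0.08957773784832902

# Fraction of the baseline error removed by the regressor.
relative_improvement = 1 - mse_regression / mse_mean_baseline

# Root mean squared error, in the same 0-1 units as the scores.
rmse_regression = mse_regression ** 0.5
rmse_baseline = mse_mean_baseline ** 0.5

print(f'{relative_improvement:.1%}')
print(f'{rmse_regression:.3f} vs {rmse_baseline:.3f}')
```

The regressor's MSE is roughly 32% below the baseline, which corresponds to a typical error of about 0.25 instead of 0.30 on the 0-1 scale.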


Great! It seems our little regressor does well: its mean squared error is roughly 32% lower than that of always predicting the mean. But there's a catch: given we used only the top 10k movies, their Rotten Tomatoes scores are fairly similar and tend to be good. Hence, just predicting the mean doesn't give bad results either.