# Recommender Systems

**Recommender systems** suggest items to a user based on data about that user's past activity and/or the behavior and choices of similar users.

In [1]:
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

## Recommending What's Popular

In [2]:
from collections import Counter

popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests)

In [3]:
popular_interests

Counter({'Hadoop': 2,
         'Big Data': 3,
         'HBase': 3,
         'Java': 3,
         'Spark': 1,
         'Storm': 1,
         'Cassandra': 2,
         'NoSQL': 1,
         'MongoDB': 2,
         'Postgres': 2,
         'Python': 4,
         'scikit-learn': 2,
         'scipy': 1,
         'numpy': 1,
         'statsmodels': 2,
         'pandas': 2,
         'R': 4,
         'statistics': 3,
         'regression': 3,
         'probability': 3,
         'machine learning': 2,
         'decision trees': 1,
         'libsvm': 2,
         'C++': 2,
         'Haskell': 1,
         'programming languages': 1,
         'mathematics': 1,
         'theory': 1,
         'Mahout': 1,
         'neural networks': 2,
         'deep learning': 2,
         'artificial intelligence': 2,
         'MapReduce': 1,
         'databases': 1,
         'MySQL': 1,
         'support vector machines': 1})

In [4]:
# Suggest to a user the most popular interests that she is not already interested in
from typing import List, Tuple

def most_popular_new_interests(
        user_interests: List[str],
        max_results: int = 5) -> List[Tuple[str, int]]:
    suggestions = [(interest, frequency)
                   for interest, frequency in popular_interests.most_common()
                   if interest not in user_interests]
    return suggestions[:max_results]

In [8]:
user0_interests = users_interests[0]
print(f"User interests: {user0_interests}")
print(f"Recommended interests: {most_popular_new_interests(user0_interests, 5)}")

User interests: ['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']
Recommended interests: [('Python', 4), ('R', 4), ('statistics', 3), ('regression', 3), ('probability', 3)]


Of course, “lots of people are interested in Python, so maybe you should be too” is not the most compelling sales pitch. If someone is brand new to our site and we don’t know anything about them, that’s possibly the best we can do.

## User-Based Collaborative Filtering

One way of taking a user’s interests into account is to look for users who are somehow `similar` to her, and then suggest the things that those users are interested in. 

To do that, `we’ll need a way to measure how similar two users are`. Here we’ll use `cosine similarity`, the same measure that is used to compare how similar two word vectors are.

We’ll apply this to vectors of 0s and 1s, each vector `v` representing one user’s interests. `v[i]` will be 1 if the user specified the i-th interest, and 0 otherwise. Accordingly, `“similar users” will mean “users whose interest vectors most nearly point in the same direction.”` Users with identical interests will have similarity 1. Users with no interests in common will have similarity 0. Otherwise, the similarity will fall in between, with numbers closer to 1 indicating “very similar” and numbers closer to 0 indicating “not very similar.”
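
The notebook imports `cosine_similarity` from `scratch.nlp` without showing it, so the following is only a minimal stand-in sketch of the usual definition, applied to three hypothetical users (`alice`, `bob`, `carol`) over a made-up five-interest vocabulary. It illustrates the three cases described above: identical interests, partial overlap, and no overlap.

```python
import math
from typing import List

def dot(v: List[float], w: List[float]) -> float:
    """Sum of componentwise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v: List[float], w: List[float]) -> float:
    """Cosine of the angle between v and w (assumes neither is all zeros)."""
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

# Hypothetical toy vocabulary, only for illustration
interests = ["Cassandra", "HBase", "Java", "MongoDB", "Python"]
alice = [1, 1, 1, 0, 0]   # Cassandra, HBase, Java
bob   = [1, 1, 0, 1, 0]   # Cassandra, HBase, MongoDB
carol = [0, 0, 0, 0, 1]   # Python only

sim_alice_alice = cosine_similarity(alice, alice)   # identical interests
sim_alice_bob = cosine_similarity(alice, bob)       # two of three shared
sim_alice_carol = cosine_similarity(alice, carol)   # nothing shared
```

Here `alice` and `bob` share two interests out of three each, so their similarity is 2 / sqrt(3 * 3) = 2/3, strictly between the two extremes.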

In [9]:
# Use set comprehension to find the unique interests
{interest for user_interests in users_interests for interest in user_interests}

{'Big Data',
 'C++',
 'Cassandra',
 'HBase',
 'Hadoop',
 'Haskell',
 'Java',
 'Mahout',
 'MapReduce',
 'MongoDB',
 'MySQL',
 'NoSQL',
 'Postgres',
 'Python',
 'R',
 'Spark',
 'Storm',
 'artificial intelligence',
 'databases',
 'decision trees',
 'deep learning',
 'libsvm',
 'machine learning',
 'mathematics',
 'neural networks',
 'numpy',
 'pandas',
 'probability',
 'programming languages',
 'regression',
 'scikit-learn',
 'scipy',
 'statistics',
 'statsmodels',
 'support vector machines',
 'theory'}

In [10]:
unique_interests = sorted({interest
                           for user_interests in users_interests
                           for interest in user_interests})

In [11]:
unique_interests[:5]

['Big Data', 'C++', 'Cassandra', 'HBase', 'Hadoop']

In [12]:
# Produce an "interest" vector of 0s and 1s for each user
def make_user_interest_vector(user_interests: List[str]) -> List[int]:
    """
    Given a list of interests, produce a vector whose ith element is 1
    if unique_interests[i] is in the list, 0 otherwise
    """
    return [1 if interest in user_interests else 0
            for interest in unique_interests]

In [14]:
print(f"{make_user_interest_vector(user0_interests)}")

[1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [15]:
user_interest_vectors = [make_user_interest_vector(user_interests)
                         for user_interests in users_interests]

In [17]:
# Compute the pairwise similarities using cosine similarity
from scratch.nlp import cosine_similarity

#for interest_vector_i in user_interest_vectors:
    #print(f"interest_vector_i -> {interest_vector_i}")

user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j)
                      for interest_vector_j in user_interest_vectors]
                     for interest_vector_i in user_interest_vectors]


In [21]:
# Similarity between users 0 and 1
user_similarities[0][1]

0.3380617018914066

In [26]:
# Find most similar users
def most_similar_users_to(user_id: int) -> List[Tuple[int, float]]:
    pairs = [(other_user_id, similarity)                      # Find other
             for other_user_id, similarity in                 # users with
                enumerate(user_similarities[user_id])         # nonzero
             if user_id != other_user_id and similarity > 0]  # similarity.

    return sorted(pairs,                                      # Sort them
                  key=lambda pair: pair[-1],                  # most similar
                  reverse=True)                               # first.

In [27]:
most_similar_users_to(0)

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

In [48]:
# For each interest, add up the similarities of the similar users who have it
# (optionally excluding the user's own interests) to see what else the user might like

from collections import defaultdict
from typing import Dict  # used in the annotation of `suggestions` below

def user_based_suggestions(user_id: int,
                           include_current_interests: bool = False):
    # Sum up the similarities.
    suggestions: Dict[str, float] = defaultdict(float)
    for other_user_id, similarity in most_similar_users_to(user_id):
        print(f"other_user_id -> {other_user_id}, similarity -> {similarity}")
        for interest in users_interests[other_user_id]:
            print(f"\tinterest -> {interest}")
            print(f"\t\tbefore: suggestions[{interest}] = {suggestions[interest]}")
            suggestions[interest] += similarity
            print(f"\t\tafter: suggestions[{interest}] = {suggestions[interest]}")

    # Convert them to a sorted list.
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[-1],  # weight
                         reverse=True)
    
    #print(f"suggestions = {suggestions}")

    # And (maybe) exclude interests the user already has
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

In [49]:
user_based_suggestions(0)

other_user_id -> 9, similarity -> 0.5669467095138409
	interest -> Hadoop
		before: suggestions[Hadoop] = 0.0
		after: suggestions[Hadoop] = 0.5669467095138409
	interest -> Java
		before: suggestions[Java] = 0.0
		after: suggestions[Java] = 0.5669467095138409
	interest -> MapReduce
		before: suggestions[MapReduce] = 0.0
		after: suggestions[MapReduce] = 0.5669467095138409
	interest -> Big Data
		before: suggestions[Big Data] = 0.0
		after: suggestions[Big Data] = 0.5669467095138409
other_user_id -> 1, similarity -> 0.3380617018914066
	interest -> NoSQL
		before: suggestions[NoSQL] = 0.0
		after: suggestions[NoSQL] = 0.3380617018914066
	interest -> MongoDB
		before: suggestions[MongoDB] = 0.0
		after: suggestions[MongoDB] = 0.3380617018914066
	interest -> Cassandra
		before: suggestions[Cassandra] = 0.0
		after: suggestions[Cassandra] = 0.3380617018914066
	interest -> HBase
		before: suggestions[HBase] = 0.0
		after: suggestions[HBase] = 0.3380617018914066
	interest -> Postgres
		before:

[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('databases', 0.1690308509457033),
 ('MySQL', 0.1690308509457033),
 ('Python', 0.1543033499620919),
 ('R', 0.1543033499620919),
 ('C++', 0.1543033499620919),
 ('Haskell', 0.1543033499620919),
 ('programming languages', 0.1543033499620919)]

`This approach doesn’t work as well when the number of items gets very large`. Recall **the curse of dimensionality**: in high-dimensional vector spaces most vectors are very far apart (and also point in very different directions). That is, when there is a large number of interests, the “most similar users” to a given user might not be similar at all.

Imagine a site like Amazon.com, from which I’ve bought thousands of items over the last couple of decades. You could attempt to identify similar users to me based on buying patterns, but most likely in all the world there’s no one whose purchase history looks even remotely like mine.
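
A rough simulation can make this effect visible. The sketch below is not part of the original analysis: it assumes hypothetical users who each pick `items_per_user` items uniformly at random from a catalog of `num_items` items (real purchase behavior is far from uniform), and it measures how similar each user's best match is as the catalog grows. The helper names `binary_cosine` and `average_best_match` are made up for this illustration.

```python
import math
import random
from typing import List, Set

def binary_cosine(a: Set[int], b: Set[int]) -> float:
    """Cosine similarity of two 0/1 vectors given as sets of nonzero indices."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def average_best_match(num_items: int, num_users: int = 50,
                       items_per_user: int = 10, seed: int = 0) -> float:
    """Average, over users, of the similarity to their most similar other user."""
    rng = random.Random(seed)
    # Assumption: every user picks items uniformly at random from the catalog
    users: List[Set[int]] = [set(rng.sample(range(num_items), items_per_user))
                             for _ in range(num_users)]
    best = [max(binary_cosine(u, v) for j, v in enumerate(users) if j != i)
            for i, u in enumerate(users)]
    return sum(best) / len(best)

small_catalog = average_best_match(num_items=30)       # few possible items
large_catalog = average_best_match(num_items=30_000)   # many possible items
print(f"best-match similarity, 30 items:     {small_catalog:.3f}")
print(f"best-match similarity, 30,000 items: {large_catalog:.3f}")
```

With the same number of users and the same number of items per user, the best available match is much weaker in the large catalog, because two users rarely overlap at all.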

## Item-Based Collaborative Filtering

An alternative approach is to `compute similarities between interests directly`. We can then generate suggestions for each user by aggregating interests that are similar to her current interests.

In [50]:
# Transpose user-interest matrix so that rows correspond to interests
# and columns to users
interest_user_matrix = [[user_interest_vector[j]
                         for user_interest_vector in user_interest_vectors]
                        for j, _ in enumerate(unique_interests)]
interest_user_matrix

[[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,

In [52]:
# users having Big Data interest -> 0, 8, 9
interest_user_matrix[0]

[1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

We can now use `cosine similarity` again. If precisely the same users are interested in two topics, their similarity will be 1. If no user is interested in both topics, their similarity will be 0:

In [53]:
interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                          for user_vector_j in interest_user_matrix]
                         for user_vector_i in interest_user_matrix]

In [54]:
def most_similar_interests_to(interest_id: int):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs,
                  key=lambda pair: pair[-1],
                  reverse=True)

In [55]:
# Find interests most similar to Big Data
most_similar_interests_to(0)

[('Hadoop', 0.8164965809277261),
 ('Java', 0.6666666666666666),
 ('MapReduce', 0.5773502691896258),
 ('Spark', 0.5773502691896258),
 ('Storm', 0.5773502691896258),
 ('Cassandra', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('HBase', 0.3333333333333333)]
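
As a sanity check, these values can be reproduced by hand. For 0/1 vectors the dot product counts the users two interests share, and each squared norm counts the users of one interest, so cosine similarity reduces to the expression below. The set notation $A$ and $B$ (the sets of users who listed each interest) is introduced here only for illustration.

$$
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{|A \cap B|}{\sqrt{|A| \, |B|}}
$$

Big Data is listed by users $\{0, 8, 9\}$, Hadoop by $\{0, 9\}$, Java by $\{0, 5, 9\}$ and HBase by $\{0, 1, 13\}$, which gives:

$$
\text{Hadoop: } \frac{2}{\sqrt{3 \cdot 2}} \approx 0.8165, \qquad
\text{Java: } \frac{2}{\sqrt{3 \cdot 3}} \approx 0.6667, \qquad
\text{HBase: } \frac{1}{\sqrt{3 \cdot 3}} \approx 0.3333
$$

These agree with the output above.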

In [62]:
# Create recommendations for a user by summing up the similarities
# of the interests similar to the user's current interests:

def item_based_suggestions(user_id: int,
                           include_current_interests: bool = False):
    # Add up the similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_vectors[user_id]
    print(f"user_interest_vector -> {user_interest_vector}")
    for interest_id, is_interested in enumerate(user_interest_vector):
        print(f"\tinterest_id -> {interest_id}, is_interested -> {is_interested}")
        if is_interested == 1:
            similar_interests = most_similar_interests_to(interest_id)
            print(f"\t\tsimilar_interests -> {similar_interests}")
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity
                print(f"\t\t\tinterest -> {interest}, similarity -> {similarity}")
                print(f"\t\t\tsuggestions[{interest}] = {suggestions[interest]}")

    # Sort them by weight
    suggestions = sorted(suggestions.items(),
                         key=lambda pair: pair[-1],
                         reverse=True)

    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
                for suggestion, weight in suggestions
                if suggestion not in users_interests[user_id]]

In [63]:
item_based_suggestions(0)

user_interest_vector -> [1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
	interest_id -> 0, is_interested -> 1
		similar_interests -> [('Hadoop', 0.8164965809277261), ('Java', 0.6666666666666666), ('MapReduce', 0.5773502691896258), ('Spark', 0.5773502691896258), ('Storm', 0.5773502691896258), ('Cassandra', 0.4082482904638631), ('artificial intelligence', 0.4082482904638631), ('deep learning', 0.4082482904638631), ('neural networks', 0.4082482904638631), ('HBase', 0.3333333333333333)]
			interest -> Hadoop, similarity -> 0.8164965809277261
			suggestions[Hadoop] = 0.8164965809277261
			interest -> Java, similarity -> 0.6666666666666666
			suggestions[Java] = 0.6666666666666666
			interest -> MapReduce, similarity -> 0.5773502691896258
			suggestions[MapReduce] = 0.5773502691896258
			interest -> Spark, similarity -> 0.5773502691896258
			suggestions[Spark] = 0.5773502691896258
			interest -> Storm, similarity -> 0.5773502691896

[('MapReduce', 1.861807319565799),
 ('MongoDB', 1.3164965809277263),
 ('Postgres', 1.3164965809277263),
 ('NoSQL', 1.2844570503761732),
 ('MySQL', 0.5773502691896258),
 ('databases', 0.5773502691896258),
 ('Haskell', 0.5773502691896258),
 ('programming languages', 0.5773502691896258),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('C++', 0.4082482904638631),
 ('Python', 0.2886751345948129),
 ('R', 0.2886751345948129)]

## Matrix Factorization

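The data so far has been binary: a user either listed an interest or did not. Sometimes we instead have numeric ratings that users gave to items. In this section we’ll assume we have such ratings data and try to `learn a model that can predict the rating for a given user and item`.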

One way of approaching the problem is to assume that every user has some `latent “type,”` which can be represented as a vector of numbers, and that each item similarly has some latent “type.”

If the user types are represented as a `[num_users, dim]` matrix, and the transpose of the item types is represented as a `[dim, num_items]` matrix, their product is a `[num_users, num_items]` matrix. Accordingly, one way of building such a model is by **“factoring” the preferences matrix** into the product of a user matrix and an item matrix.
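
To make the shapes concrete, here is a small pure-Python sketch with hypothetical latent types (3 users, 4 items, `dim = 2`). The numbers and the helpers `matmul` and `transpose` are made up for illustration and are not part of the notebook's `scratch` library.

```python
from typing import List

Matrix = List[List[float]]

def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python product of an [n, k] matrix and a [k, m] matrix."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col))
             for col in zip(*b)]
            for row in a]

# Hypothetical latent types with dim = 2
user_types = [[1.0, 0.0],    # user 0
              [0.5, 2.0],    # user 1
              [0.0, 1.0]]    # user 2  -> shape [num_users=3, dim=2]
item_types = [[4.0, 1.0],    # item 0
              [2.0, 2.0],    # item 1
              [0.0, 3.0],    # item 2
              [1.0, 1.0]]    # item 3  -> shape [num_items=4, dim=2]

# [3, 2] times [2, 4] gives a [3, 4] matrix of predicted preferences
predicted = matmul(user_types, transpose(item_types))

# Entry (u, i) is the dot product of user u's and item i's latent vectors
user1_item2 = sum(u * m for u, m in zip(user_types[1], item_types[2]))
```

Factoring goes in the opposite direction: given (part of) the preferences matrix, find user and item matrices whose product approximates it.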

In [64]:
# Dataset from: https://grouplens.org/datasets/movielens/ml-100k.zip
MOVIES = "data/ml-100k/u.item"   # pipe-delimited: movie_id|title|...
RATINGS = "data/ml-100k/u.data"  # tab-delimited: user_id, movie_id, rating, timestamp

In [65]:
from typing import NamedTuple
    
class Rating(NamedTuple):
    user_id: str
    movie_id: str
    rating: float

import csv
# We specify this encoding to avoid a UnicodeDecodeError.
# see: https://stackoverflow.com/a/53136168/1076346
with open(MOVIES, encoding="iso-8859-1") as f:
    reader = csv.reader(f, delimiter="|")
    movies = {movie_id: title for movie_id, title, *_ in reader}

# Create a list of [Rating]
with open(RATINGS, encoding="iso-8859-1") as f:
    reader = csv.reader(f, delimiter="\t")
    ratings = [Rating(user_id, movie_id, float(rating))
               for user_id, movie_id, rating, _ in reader]

# 1682 movies rated by 943 users
assert len(movies) == 1682
assert len(list({rating.user_id for rating in ratings})) == 943

In [67]:
# Example EDA: average ratings for the Star Wars movies
import re
    
# Data structure for accumulating ratings by movie_id
star_wars_ratings = {movie_id: []
                     for movie_id, title in movies.items()
                     if re.search("Star Wars|Empire Strikes|Jedi", title)}

# Iterate over ratings, accumulating the Star Wars ones
for rating in ratings:
    if rating.movie_id in star_wars_ratings:
        star_wars_ratings[rating.movie_id].append(rating.rating)

# Compute the average rating for each movie
avg_ratings = [(sum(title_ratings) / len(title_ratings), movie_id)
               for movie_id, title_ratings in star_wars_ratings.items()]

# And then print them in order
for avg_rating, movie_id in sorted(avg_ratings, reverse=True):
    print(f"{avg_rating:.2f} {movies[movie_id]}")

4.36 Star Wars (1977)
4.20 Empire Strikes Back, The (1980)
4.01 Return of the Jedi (1983)


In [68]:
# Build a model to predict these ratings

# Train, Validation, Test split
import random
random.seed(0)
random.shuffle(ratings)

split1 = int(len(ratings) * 0.7)
split2 = int(len(ratings) * 0.85)

train = ratings[:split1]              # 70% of the data
validation = ratings[split1:split2]   # 15% of the data
test = ratings[split2:]               # 15% of the data

# Simple baseline model to make sure ours does better than that
avg_rating = sum(rating.rating for rating in train) / len(train)
baseline_error = sum((rating.rating - avg_rating) ** 2
                     for rating in test) / len(test)

# This is what we hope to do better than
assert 1.26 < baseline_error < 1.27

We’ll represent each user’s and each movie’s latent type as an embedding vector. The predicted ratings are then given by the matrix product of the user embeddings and the movie embeddings. For a given user and movie, that value is just the `dot product` of the corresponding embeddings.

In [69]:
# Embedding vectors for matrix factorization model
    
from scratch.deep_learning import random_tensor

EMBEDDING_DIM = 2

# Find unique ids
user_ids = {rating.user_id for rating in ratings}
movie_ids = {rating.movie_id for rating in ratings}

# Then create a random vector per id
user_vectors = {user_id: random_tensor(EMBEDDING_DIM)
                for user_id in user_ids}
movie_vectors = {movie_id: random_tensor(EMBEDDING_DIM)
                 for movie_id in movie_ids}


# Training loop for matrix factorization model

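from typing import List, Optional
import tqdm
from scratch.linear_algebra import dot

def loop(dataset: List[Rating],
         learning_rate: Optional[float] = None) -> None:
    """Run one pass over dataset; update the embeddings only if learning_rate is given."""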
    with tqdm.tqdm(dataset) as t:
        loss = 0.0
        for i, rating in enumerate(t):
            movie_vector = movie_vectors[rating.movie_id]
            user_vector = user_vectors[rating.user_id]
            predicted = dot(user_vector, movie_vector)
            error = predicted - rating.rating
            loss += error ** 2

            if learning_rate is not None:
                #     predicted = m_0 * u_0 + ... + m_k * u_k
                # So each u_j enters output with coefficient m_j
                # and each m_j enters output with coefficient u_j
                user_gradient = [error * m_j for m_j in movie_vector]
                movie_gradient = [error * u_j for u_j in user_vector]

                # Take gradient steps
                for j in range(EMBEDDING_DIM):
                    user_vector[j] -= learning_rate * user_gradient[j]
                    movie_vector[j] -= learning_rate * movie_gradient[j]

            t.set_description(f"avg loss: {loss / (i + 1)}")
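
The gradient step in `loop` can be read off from the squared-error loss for a single rating. The derivation below is a sketch in notation introduced here, not in the notebook: $u$ is the user vector, $m$ the movie vector, $r$ the observed rating and $\hat{r}$ the prediction.

$$
\hat{r} = \sum_{j} u_j m_j, \qquad L = (\hat{r} - r)^2
$$

$$
\frac{\partial L}{\partial u_j} = 2\,(\hat{r} - r)\, m_j, \qquad
\frac{\partial L}{\partial m_j} = 2\,(\hat{r} - r)\, u_j
$$

The code uses `error * m_j` and `error * u_j` with `error = predicted - rating.rating`, that is, the same expressions without the factor of 2. Dropping a constant factor only rescales the step, so it can be thought of as absorbed into `learning_rate`.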

In [None]:
# Train the model
learning_rate = 0.05
for epoch in range(20):
    learning_rate *= 0.9
    print(epoch, learning_rate)
    loop(train, learning_rate=learning_rate)
    loop(validation)
loop(test)

avg loss: 16.189668670986727:   0%|          | 94/70000 [00:00<01:14, 934.80it/s]

0 0.045000000000000005


avg loss: 14.912816622819324:   4%|▍         | 2956/70000 [00:02<01:02, 1080.67it/s]
avg loss: 14.741475940144364:  10%|▉         | 6730/70000 [00:06<00:56, 1128.27it/s]
avg loss: 14.547117742356662:  15%|█▌        | 10712/70000 [00:09<00:56, 1053.45it/s]

In [None]:
# Inspect the learned vectors
from scratch.working_with_data import pca, transform

original_vectors = [vector for vector in movie_vectors.values()]
components = pca(original_vectors, 2)

In [None]:
# Transform our vectors to represent the PCA and join in the movie IDs and average ratings
ratings_by_movie = defaultdict(list)
for rating in ratings:
    ratings_by_movie[rating.movie_id].append(rating.rating)

vectors = [
    (movie_id,
     sum(ratings_by_movie[movie_id]) / len(ratings_by_movie[movie_id]),
     movies[movie_id],
     vector)
    for movie_id, vector in zip(movie_vectors.keys(),
                                transform(original_vectors, components))
]

# Print top 25 and bottom 25 by first principal component
print(sorted(vectors, key=lambda v: v[-1][0])[:25])
print(sorted(vectors, key=lambda v: v[-1][0])[-25:])

## Resources

- [Surprise](http://surpriselib.com/) - a Python scikit for building and analyzing recommender systems.