<h2>Recommender Systems</h2>

A fairly common data problem is producing <i>recommendations</i> of some sort. Netflix recommends movies you might want to watch, Amazon recommends products you may want to buy, and Twitter recommends users you might want to follow. Here we'll explore several ways to use data to make recommendations.

In [1]:
# Let's take some users' interests and recommend new interests based on current ones
users_interests = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

In [2]:
# An easy approach is to just recommend what is most common
from collections import Counter

# Use a distinct loop variable so it doesn't shadow the users_interests list
popular_interests = Counter(interest
                            for user_interests in users_interests
                            for interest in user_interests).most_common()
popular_interests[:5]

[('Python', 4), ('R', 4), ('Big Data', 3), ('HBase', 3), ('Java', 3)]

In [3]:
# Suggest a popular new interest as long as the user doesn't already have it
def most_popular_new_interests(user_interests, max_results=5):
    suggested_interests = [(interest, freq)
                          for interest, freq in popular_interests if interest not in user_interests]
    return suggested_interests[:max_results]

# Results for the user at index 1 of the users_interests list
print('User Interests:', users_interests[1])
print('Suggested New Interests:', most_popular_new_interests(users_interests[1]))

User Interests: ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres']
Suggested New Interests: [('Python', 4), ('R', 4), ('Big Data', 3), ('Java', 3), ('statistics', 3)]


In [4]:
# Let's see what it returns for user 3 given that they are interested in many of the previous suggestions
print('User Interests:', users_interests[3])
print('Suggested New Interests:', most_popular_new_interests(users_interests[3]))

User Interests: ['R', 'Python', 'statistics', 'regression', 'probability']
Suggested New Interests: [('Big Data', 3), ('HBase', 3), ('Java', 3), ('Hadoop', 2), ('Cassandra', 2)]


<h2>User-Based Collaborative Filtering</h2>

Recommending the most popular options doesn't take into account anything unique to a user. One way to take into account a user's specific interests is to look for users that are somehow similar to one another, and then make suggestions based on what those similar users are interested in.

To do this, we'll define a `cosine_similarity` function that measures the cosine of the angle between two vectors, v and w. If v and w point in the same direction, the numerator and denominator are equal, so their cosine similarity is 1. If v and w point in opposite directions, their cosine similarity is -1. Finally, if v is 0 whenever w is not, then $dot(v, w)$ is 0 and their cosine similarity is 0.
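
Written out, the quantity described above is the standard cosine of the angle $\theta$ between v and w. The norm notation $\lVert v \rVert = \sqrt{dot(v, v)}$ is introduced here only for illustration and is not used elsewhere in this notebook:

$$\cos\theta = \frac{dot(v, w)}{\lVert v \rVert \, \lVert w \rVert} = \frac{dot(v, w)}{\sqrt{dot(v, v) \cdot dot(w, w)}}$$

When w is a positive multiple of v, the numerator and denominator coincide and the ratio is 1. When w is a negative multiple of v, the ratio is -1.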

We'll apply this to vectors of 0s and 1s, each vector v representing one user's interests. v[i] will be 1 if the user specified the ith interest, 0 otherwise. Accordingly, 'similar users' will mean 'users whose interest vectors most nearly point in the same direction.' Users with identical interests will have similarity 1. Users with no interests in common will have similarity 0. Otherwise the similarity will fall in between, with numbers closer to 1 indicating 'very similar' and numbers closer to 0 indicating 'not very similar.'
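
For 0/1 vectors the formula has a simple reading: the dot product counts the interests two users share, and each vector's dot product with itself counts that user's interests. The sketch below computes the same number directly from two interest lists. The helper name `set_cosine_similarity` and the handling of empty lists are assumptions made for this illustration, not part of the notebook's code.

```python
import math

def set_cosine_similarity(interests_a, interests_b):
    """Cosine similarity of the 0/1 vectors implied by two interest lists."""
    a, b = set(interests_a), set(interests_b)
    if not a or not b:
        # Assumption: a user with no interests is treated as similar to nobody
        return 0.0
    # shared interests / sqrt(number of a's interests * number of b's interests)
    return len(a & b) / math.sqrt(len(a) * len(b))

user_0 = ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"]
user_9 = ["Hadoop", "Java", "MapReduce", "Big Data"]

# Three shared interests, seven and four interests respectively: 3 / sqrt(7 * 4)
similarity = set_cosine_similarity(user_0, user_9)
```

This is only a way to sanity-check individual values. The vector form used in the rest of the notebook is what generalizes to non-binary weights.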

In [5]:
import numpy as np

def cosine_similarity(v, w):
    return np.dot(v, w) / np.sqrt(np.dot(v, v) * np.dot(w, w))

In [6]:
# Sorted list of every distinct interest; its order defines the vector positions
unique_interests = sorted({interest for interests in users_interests for interest in interests})

In [7]:
# Generate a 0/1 interest vector for a single user
def generate_user_interest_vector(user_interests):
    """given a list of interests, produce a vector whose ith element is 1 
    if unique_interests[i] is in the list, 0 otherwise"""
    return [1 if interest in user_interests else 0
           for interest in unique_interests]

In [8]:
# user_interest_matrix[i][j] equals 1 if user i specified interest j, 0 otherwise
user_interest_matrix = list(map(generate_user_interest_vector, users_interests))

In [9]:
user_similarities = [[cosine_similarity(interest_vector_i, interest_vector_j) 
                      for interest_vector_j in user_interest_matrix]
                     for interest_vector_i in user_interest_matrix]

In [10]:
# These two users share three interests -- Big Data, Hadoop, and Java
print(f'User 0 Interests: {sorted(users_interests[0])}')
print(f'User 9 Interests: {sorted(users_interests[9])}')
user_similarities[0][9]

User 0 Interests: ['Big Data', 'Cassandra', 'HBase', 'Hadoop', 'Java', 'Spark', 'Storm']
User 9 Interests: ['Big Data', 'Hadoop', 'Java', 'MapReduce']


0.5669467095138409

In [11]:
# These two users only share a single interest -- Big Data
print(f'User 0 Interests: {sorted(users_interests[0])}')
print(f'User 8 Interests: {sorted(users_interests[8])}')
user_similarities[0][8]

User 0 Interests: ['Big Data', 'Cassandra', 'HBase', 'Hadoop', 'Java', 'Spark', 'Storm']
User 8 Interests: ['Big Data', 'artificial intelligence', 'deep learning', 'neural networks']


0.1889822365046136

In [12]:
# Create a way to rank the most similar users to a given user
def get_similar_users(user_id):
    pairs = [(other_user_id, similarity)
            for other_user_id, similarity in enumerate(user_similarities[user_id])
            if user_id != other_user_id and similarity > 0]
    return sorted(pairs, key=lambda x : x[1], reverse=True)

In [13]:
# For user 0, expect that user 9 (similarity of roughly 0.57) is returned first
get_similar_users(0)

[(9, 0.5669467095138409),
 (1, 0.3380617018914066),
 (8, 0.1889822365046136),
 (13, 0.1690308509457033),
 (5, 0.1543033499620919)]

In [14]:
# Now use this to suggest new interests to a user -- for each interest, just add up 
# user-similarities of the other users interested in it
from collections import defaultdict

def user_based_suggestions(user_id, include_current_interests=False):
    suggestions = defaultdict(float)
    for other_user_id, similarity in get_similar_users(user_id):
        for interest in users_interests[other_user_id]:
            suggestions[interest] += similarity
            
    # convert suggestions to sorted descending list
    suggestions = sorted(suggestions.items(), key=lambda x : x[1], reverse=True)
    
    # exclude already included interests depending on flag
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight) for suggestion, weight in suggestions
               if suggestion not in users_interests[user_id]]

In [15]:
print('User 0 interests:')
users_interests[0]

User 0 interests:


['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']

In [16]:
# These seem like good suggestions for someone interested in 'big data' and database-related things
print('User based suggestions:')
user_based_suggestions(0)

User based suggestions:


[('MapReduce', 0.5669467095138409),
 ('MongoDB', 0.50709255283711),
 ('Postgres', 0.50709255283711),
 ('NoSQL', 0.3380617018914066),
 ('neural networks', 0.1889822365046136),
 ('deep learning', 0.1889822365046136),
 ('artificial intelligence', 0.1889822365046136),
 ('databases', 0.1690308509457033),
 ('MySQL', 0.1690308509457033),
 ('Python', 0.1543033499620919),
 ('R', 0.1543033499620919),
 ('C++', 0.1543033499620919),
 ('Haskell', 0.1543033499620919),
 ('programming languages', 0.1543033499620919)]

<h2>Dimensionality</h2>
Note that this approach doesn't work as well when the number of items gets very large. This is the 'curse of dimensionality': in high-dimensional vector spaces, most vectors are very far apart (and therefore point in very different directions). When there are a large number of interests, the 'most similar users' to a given user might not be similar at all. A site like Amazon could try to identify users similar to a given shopper based on buying patterns, but most likely nobody in the world has a purchase history that looks even remotely like that shopper's. Whoever their 'most similar' shopper is probably isn't similar to them at all, and that shopper's purchases would likely make for lousy recommendations.
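
A rough way to see this effect is to simulate it. The sketch below is an illustration on synthetic data, not part of the original analysis: it gives every simulated user the same number of randomly chosen items, then averages each user's similarity to their single most similar other user. The function name `average_best_similarity`, the parameter values, and the uniform-random choice of items are all assumptions.

```python
import random

def average_best_similarity(num_users, num_items, items_per_user, seed=0):
    """Average, over all users, of the cosine similarity to their nearest neighbor."""
    rng = random.Random(seed)
    users = [set(rng.sample(range(num_items), items_per_user))
             for _ in range(num_users)]
    best_similarities = []
    for i, user in enumerate(users):
        # Every user has exactly items_per_user items, so the cosine similarity
        # of two 0/1 vectors reduces to (shared items) / items_per_user
        best = max(len(user & other) / items_per_user
                   for j, other in enumerate(users) if i != j)
        best_similarities.append(best)
    return sum(best_similarities) / len(best_similarities)

# Same number of users and items per user; only the size of the catalog changes
low_dim = average_best_similarity(num_users=50, num_items=20, items_per_user=5)
high_dim = average_best_similarity(num_users=50, num_items=2000, items_per_user=5)
print(f'20 items:   {low_dim:.2f}')
print(f'2000 items: {high_dim:.2f}')
```

With a small catalog, nearly every user has a neighbor who shares several items. With a large one, the 'nearest' neighbor typically shares one item or none. Real purchase data is far from uniform, so treat this as a picture of the tendency rather than a measurement.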

<h2>Item-Based Collaborative Filtering</h2>
An alternative is to compute similarities between interests directly. We can then generate suggestions for each user by aggregating interests that are similar to their current interests. 

In [17]:
# Start by transposing user-interest matrix so that rows correspond 
# to interests and columns correspond to users
interest_user_matrix = [[user_interest_vector[j]
                        for user_interest_vector in user_interest_matrix]
                       for j, _ in enumerate(unique_interests)]

We can now use cosine similarity again. If precisely the same users are interested in two topics, their similarity will be 1. If no user is interested in both topics, their similarity will be 0.
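
Both boundary cases can be checked on a toy matrix. The sketch below uses hypothetical data (three users, three interests labelled A, B, and C) and a vectorized numpy computation in place of the nested list comprehensions used in this notebook. The `toy_` names are introduced only for this illustration.

```python
import numpy as np

# Hypothetical user-interest matrix: rows are users, columns are interests A, B, C.
# Users 0 and 1 like A and B; user 2 likes only C.
toy_user_interest = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# Transpose so that each row is one interest's vector over users
toy_interest_user = toy_user_interest.T

# Scale each interest vector to unit length; the pairwise dot products of
# unit vectors are then exactly the pairwise cosine similarities
norms = np.linalg.norm(toy_interest_user, axis=1, keepdims=True)
unit_vectors = toy_interest_user / norms
toy_interest_similarities = unit_vectors @ unit_vectors.T

print(np.round(toy_interest_similarities, 3))
```

A and B are liked by exactly the same users, so their entry is 1 up to floating-point rounding. C shares no users with either, so its entries against A and B are 0. This form assumes every interest has at least one user, since otherwise a norm would be zero.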

In [18]:
interest_similarities = [[cosine_similarity(user_vector_i, user_vector_j)
                         for user_vector_j in interest_user_matrix]
                        for user_vector_i in interest_user_matrix]

In [19]:
def get_similar_interests(interest_id):
    similarities = interest_similarities[interest_id]
    pairs = [(unique_interests[other_interest_id], similarity)
             for other_interest_id, similarity in enumerate(similarities)
             if interest_id != other_interest_id and similarity > 0]
    return sorted(pairs, key=lambda x : x[1], reverse=True)

In [20]:
# Big Data is interest 0, let's see which interests are similar:
print(unique_interests[0])
get_similar_interests(0)

Big Data


[('Hadoop', 0.8164965809277261),
 ('Java', 0.6666666666666666),
 ('MapReduce', 0.5773502691896258),
 ('Spark', 0.5773502691896258),
 ('Storm', 0.5773502691896258),
 ('Cassandra', 0.4082482904638631),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('HBase', 0.3333333333333333)]

In [21]:
# Create recommendations for a user by summing up similarities of interests similar to theirs
def item_based_suggestions(user_id, include_current_interests=False):
    # add up similar interests
    suggestions = defaultdict(float)
    user_interest_vector = user_interest_matrix[user_id]
    for interest_id, is_interested in enumerate(user_interest_vector):
        if is_interested == 1:
            similar_interests = get_similar_interests(interest_id)
            for interest, similarity in similar_interests:
                suggestions[interest] += similarity

    suggestions = sorted(suggestions.items(), key=lambda x : x[1], reverse=True)
    
    if include_current_interests:
        return suggestions
    else:
        return [(suggestion, weight)
               for suggestion, weight in suggestions
               if suggestion not in users_interests[user_id]]

In [22]:
# Seemingly reasonable suggestions for user 0
print(users_interests[0])
item_based_suggestions(0)

['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']


[('MapReduce', 1.861807319565799),
 ('MongoDB', 1.3164965809277263),
 ('Postgres', 1.3164965809277263),
 ('NoSQL', 1.2844570503761732),
 ('MySQL', 0.5773502691896258),
 ('databases', 0.5773502691896258),
 ('Haskell', 0.5773502691896258),
 ('programming languages', 0.5773502691896258),
 ('artificial intelligence', 0.4082482904638631),
 ('deep learning', 0.4082482904638631),
 ('neural networks', 0.4082482904638631),
 ('C++', 0.4082482904638631),
 ('Python', 0.2886751345948129),
 ('R', 0.2886751345948129)]