# Content Recommendations With MXNet

In [1]:
import pandas as pd
import numpy as np
import mxnet as mx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

## Import Sample Data

In [2]:
df = pd.read_csv("../data/sample-data.csv")

For this content recommendation system, you'll use the TF-IDF vectorizer.

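TF-IDF assigns a weight to each term in a document: the weight grows with how often the term occurs in that document and shrinks with how many documents in the collection contain it. Together these weights form one vector per document, and those vectors can be compared with cosine similarity (among other things).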
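To make the weighting concrete, here is a minimal sketch on a made-up three-document corpus. The corpus and the `toy_*` names are hypothetical and are not part of the sample data; the point is only to show that each document becomes one row of the matrix and that, with the vectorizer's default `norm="l2"` setting, every row has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_docs = [
    "lightweight rain jacket",
    "insulated winter jacket",
    "trail running shorts",
]

toy_tf = TfidfVectorizer(analyzer="word", stop_words="english")
toy_matrix = toy_tf.fit_transform(toy_docs)  # sparse matrix, one row per document

# Each row is a document vector; each column is one term in the vocabulary.
n_docs, n_terms = toy_matrix.shape

# With the default norm="l2", every row has Euclidean length 1.
squared = toy_matrix.multiply(toy_matrix).sum(axis=1)
row_norms = np.sqrt(np.asarray(squared)).ravel()

# "jacket" occurs in two documents and "rain" in one, so "rain" gets the larger weight.
vocab = toy_tf.vocabulary_
jacket_weight = toy_matrix[0, vocab["jacket"]]
rain_weight = toy_matrix[0, vocab["rain"]]
print(n_docs, n_terms, row_norms, jacket_weight < rain_weight)
```

Because the rows are already unit length, a plain dot product between two rows can serve as their cosine similarity.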

In [3]:
tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 3),
                     min_df=1,  # keep every term; recent scikit-learn versions reject min_df=0
                     stop_words='english')

In [4]:
tfidf_matrix = tf.fit_transform(df['description'])

## Compute Cosine Similarities (CPU) 

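Cosine similarity is the cosine of the angle between two vectors in a Euclidean space, so it depends only on their direction, not on their length or on how the weights were calculated. `TfidfVectorizer` L2-normalizes each row by default, so the dot product of two rows is already their cosine similarity.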
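As a small worked example of that definition, the NumPy sketch below computes the cosine similarity of a few hand-picked vectors. The helper name `cosine_similarity_1d` and the vectors are made up for illustration; the last lines show that for unit-length vectors a plain dot product gives the same value.

```python
import numpy as np

def cosine_similarity_1d(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

sim_ab = cosine_similarity_1d(a, b)  # parallel vectors: 1.0 up to rounding
sim_ac = cosine_similarity_1d(a, c)  # orthogonal vectors: 0.0

# For unit-length vectors the denominator is 1, so the dot product is enough.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
sim_unit = float(np.dot(a_unit, b_unit))
```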

First, you can build the similarity matrix with scikit-learn's `linear_kernel` function, which runs on the CPU.

In [6]:
%time cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

CPU times: user 54.4 ms, sys: 2.16 ms, total: 56.6 ms
Wall time: 54.3 ms


Running on my system, this takes about 54 ms. That isn't an issue for a small dataset like this one, but it becomes a pain point when dealing with millions of records.

## Compute Cosine Similarity With MXNet (GPU)

Using MXNet, we'll be able to build a content recommender the same way, but even faster than before.

First, we'll convert the TF-IDF matrix to an MXNet NDArray. We'll also pass the `mx.gpu()` context so that the matrix is stored on the GPU.

In [8]:
mx_tfidf = mx.nd.array(tfidf_matrix, ctx=mx.gpu())

As a sanity check, you can look at the `mx_tfidf` context. This confirms that the data lives on the GPU.

In [9]:
mx_tfidf.context

gpu(0)

### Compute Cosine Similarities (GPU)

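Now we can compute the cosine similarity of the TF-IDF matrix on the GPU. We'll use the `mx_cosine_distance` function below; despite its name, it returns similarities, because the dot product of L2-normalized rows is their cosine similarity.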

In [13]:
def mx_cosine_distance(arr):
    return mx.nd.dot(arr, arr.T)

In [14]:
with mx.Context(mx.gpu()):
    %time mx_cosine_sim = mx_cosine_distance(mx_tfidf)

CPU times: user 708 µs, sys: 488 µs, total: 1.2 ms
Wall time: 693 µs


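The wall time on my computer is 693 microseconds! Keep in mind that MXNet executes NDArray operations asynchronously, so this figure may cover only the time needed to queue the operation; calling `mx_cosine_sim.wait_to_read()` inside the timed statement would include the computation itself.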
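When timing a framework that dispatches work asynchronously, as MXNet does for NDArray operations, it helps to block until the result is ready before stopping the clock. The helper below is a rough sketch of that idea: `timed` and its `sync` hook are hypothetical names, and the demonstration uses NumPy on the CPU so that it runs without a GPU.

```python
import time
import numpy as np

def timed(fn, sync=None):
    """Return (result, elapsed_seconds) for fn().

    `sync` is an optional callable that blocks until queued work has finished;
    it is called before the clock stops.
    """
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for any asynchronously dispatched work
    elapsed = time.perf_counter() - start
    return result, elapsed

# CPU stand-in: NumPy is synchronous, so no sync hook is needed here.
x = np.random.rand(200, 50)
gram, seconds = timed(lambda: x @ x.T)
print(gram.shape, seconds)
```

With MXNet, a call along the lines of `timed(lambda: mx_cosine_distance(mx_tfidf), sync=mx.nd.waitall)` should give a wall time that includes the GPU computation, which makes for a fairer comparison with the CPU timing.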

For comparison, you can evaluate the speedup by taking the wall time of the CPU model against the GPU model.

In [15]:
print("GPU is ~{} times faster than CPU".format(int(round((54.3 * 1000) / 693))))

GPU is ~78 times faster than CPU


For 500 rows, these timings put the GPU recommender at roughly _78 times_ faster than the CPU version.

## Sanity Checks 

To make sure the cosine similarity matrices are identical, we can review the output of both. 

Let's print the first 10 values of the first row from the `linear_kernel` implementation.

In [18]:
cosine_similarities[0, 0:10]

array([1.        , 0.10110642, 0.06487353, 0.05420526, 0.04566789,
       0.04303635, 0.03836477, 0.03348336, 0.06532573, 0.02368301])

Now print the same 10 values from the MXNet implementation. Notice that the output reports `gpu(0)` as the context, so the object is still on the GPU!

In [19]:
mx_cosine_sim[0, 0:10]


[1.         0.10110642 0.06487353 0.05420526 0.04566789 0.04303635
 0.03836477 0.03348336 0.06532573 0.02368301]
<NDArray 10 @gpu(0)>
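Eyeballing ten values is a reasonable spot check, but the whole matrices can also be compared numerically. The sketch below uses a hypothetical helper, `matrices_match`, and stand-in data: a float64 matrix and a float32 copy of it. MXNet arrays default to float32, so a small tolerance is used rather than exact equality.

```python
import numpy as np

def matrices_match(a, b, atol=1e-6):
    """True when two similarity matrices share a shape and agree within atol."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))

# Stand-in data: unit-length rows, so the Gram matrix holds cosine similarities.
rng = np.random.default_rng(0)
vectors = rng.random((5, 4))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
sim64 = vectors @ vectors.T
sim32 = sim64.astype(np.float32)  # mimics a lower-precision copy

same = matrices_match(sim64, sim32)
different = matrices_match(sim64, sim64 + 0.01)
print(same, different)
```

For the matrices in this notebook, the call would presumably be `matrices_match(cosine_similarities, mx_cosine_sim.asnumpy())`, since `asnumpy()` copies the NDArray back to host memory as a NumPy array.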

## Get Recommendations

In [63]:
def get_recommendations(df, item_id, cosine_sim):
    # Take an item ID as input and return the 10 most similar items

    indices = pd.Series(df.index, index=df['id']).drop_duplicates()

    # Get the index of the item that matches the id
    idx = indices[item_id]

    # Get the pairwise similarity scores between this item and all items
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar items, skipping the first entry (the item itself)
    sim_scores = sim_scores[1:11]

    # Get the item indices
    item_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar items
    return df.iloc[item_indices]
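The ranking step in the function above can also be written with NumPy's `argsort` instead of sorting a Python list of tuples. The sketch below is an alternative under assumptions, not the function used in this notebook: `top_k_similar` is a hypothetical name, and it excludes the query item explicitly rather than relying on it being first after sorting, which may matter when the data contains exact duplicates.

```python
import numpy as np

def top_k_similar(sim_row, idx, k=10):
    """Return positions of the k highest scores in sim_row, excluding position idx."""
    scores = np.asarray(sim_row, dtype=float).copy()
    scores[idx] = -np.inf  # never recommend the query item itself
    k = min(k, scores.size - 1)
    order = np.argsort(-scores, kind="stable")  # descending; ties keep original order
    return order[:k]

# Made-up similarity row for item 0 (its similarity to itself is 1.0).
row = np.array([1.0, 0.2, 0.9, 0.05, 0.4])
top = top_k_similar(row, idx=0, k=3)
print(top)  # positions 2, 4, 1
```

The returned positions could then be passed to `df.iloc[...]` as in `get_recommendations`. For a GPU matrix, the row would first need to be copied to the host, for example with `asnumpy()`.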

Get the recommendations using the `cosine_similarities` matrix.

In [78]:
get_recommendations(df, 5, cosine_similarities)

Unnamed: 0,id,description
307,308,"Alpine wind jkt - On high ridges, steep ice an..."
95,96,Nine trails jkt - Somewhere between the Bridge...
280,281,Nine trails jkt - The Nine Trails Jacket is fo...
292,293,"Houdini full-zip jkt - Now you see it, now you..."
209,210,Nine trails vest - Simplicity in action - this...
363,364,Nine trails vest - Simplicity in action - thi...
96,97,Nine trails shorts - For those who view trails...
215,216,"Simple guide jkt - Skin-in by headlamp, summit..."
206,207,"Multi use shorts - Streamlined, technical and ..."
119,120,"Simple guide jkt - Skin-in by headlamp, summit..."


Get the recommendations using the `mx_cosine_sim` GPU matrix.

In [103]:
get_recommendations(df, 5, mx_cosine_sim)

Unnamed: 0,id,description
307,308,"Alpine wind jkt - On high ridges, steep ice an..."
95,96,Nine trails jkt - Somewhere between the Bridge...
280,281,Nine trails jkt - The Nine Trails Jacket is fo...
292,293,"Houdini full-zip jkt - Now you see it, now you..."
209,210,Nine trails vest - Simplicity in action - this...
363,364,Nine trails vest - Simplicity in action - thi...
96,97,Nine trails shorts - For those who view trails...
215,216,"Simple guide jkt - Skin-in by headlamp, summit..."
206,207,"Multi use shorts - Streamlined, technical and ..."
119,120,"Simple guide jkt - Skin-in by headlamp, summit..."
