# Content Recommendations With MXNet

In [1]:
import pandas as pd
import numpy as np
import mxnet as mx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel



## Import Sample Data

In [2]:
df = pd.read_csv("../data/sample-data.csv")
print(df.head())
print(df.shape)

   id                                        description
0   1  Active classic boxers - There's a reason why o...
1   2  Active sport boxer briefs - Skinning up Glory ...
2   3  Active sport briefs - These superbreathable no...
3   4  Alpine guide pants - Skin in, climb ice, switc...
4   5  Alpine wind jkt - On high ridges, steep ice an...
(500, 2)


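For this content recommendation system, you'll use scikit-learn's TF-IDF vectorizer.

TF-IDF assigns a weight to each term in a document, relative to a given collection: the weight grows with how often the term appears in that document and shrinks with how many documents in the collection contain it. These weights form the components of a vector that can be used for cosine similarity (among other things).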
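
To make the weighting concrete, here is a minimal from-scratch sketch of TF-IDF on a toy corpus. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the `TfidfVectorizer` defaults. The `toy_docs` corpus and the `tfidf_vectors` helper are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Hypothetical three-document corpus
toy_docs = ["alpine wind jacket", "alpine guide pants", "sport boxer briefs"]
vocab, vectors = tfidf_vectors(toy_docs)

# "alpine" occurs in two documents, so in the first document
# it is weighted lower than "wind", which occurs in only one
print(vectors[0][vocab.index("alpine")], vectors[0][vocab.index("wind")])
```

Each resulting vector has unit length, which matters for the similarity computation later on.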

In [3]:
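tf = TfidfVectorizer(analyzer='word',
                     ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
                     min_df=1,            # keeps every term, like min_df=0, which newer scikit-learn versions may reject
                     stop_words='english')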

In [4]:
tfidf_matrix = tf.fit_transform(df['description'])

## Compute Cosine Similarities (CPU) 

Cosine similarity measures the angle between two different vectors in a Euclidean space, independently of how the weights have been calculated.
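
As a small illustration of that definition, the sketch below computes the cosine similarity of two hand-picked vectors and shows that, once both are scaled to unit length, a plain dot product gives the same number. The helper names and the example vectors are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def unit(u):
    # Scale a vector to length 1
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

cos_uv = cosine_similarity(u, v)
dot_of_units = dot(unit(u), unit(v))
# The two values agree up to floating-point rounding
print(cos_uv, dot_of_units)
```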

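First, you can build the model using scikit-learn's `linear_kernel` function, which runs on the CPU. It computes plain dot products, and because `TfidfVectorizer` L2-normalizes each row by default, those dot products are the cosine similarities.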

In [5]:
%%timeit
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

28.7 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


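On my system this takes about 29 ms per loop. This isn't an issue for small datasets like this one, but it becomes a pain point when dealing with millions of records. Note that variables assigned inside a `%%timeit` cell are not kept afterwards, so run the `linear_kernel` assignment once more without `%%timeit` before using `cosine_similarities` in the sanity checks.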

## Compute Cosine Similarity With MXNet (GPU)

Using MXNet, we'll be able to build a content recommender the same way, but even faster than before.

First, we'll convert the TF-IDF matrix to an MXNet sparse NDArray. Passing the `mx.gpu()` context places the matrix in GPU memory.

In [9]:
mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.gpu())

As a sanity check, you can inspect the `context` attribute of `mx_tfidf` to confirm that the data lives on the GPU.

In [10]:
mx_tfidf.context

gpu(0)

### Compute Cosine Similarities (GPU)

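Now we can compute the cosine similarity of the TF-IDF matrix on the GPU. The `mx_cosine_distance` function below wraps the computation; despite its name, it returns similarities (dot products of the L2-normalized rows), not distances.

In [8]:
def mx_cosine_distance(arr):
    # Pairwise dot products between all rows of arr
    return mx.nd.dot(arr, arr.T)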

In [11]:
%%timeit
mx.nd.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()

29.7 ms ± 160 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


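In [None]:
# Run once without %%timeit so that mx_cosine_sim stays defined for later cells
mx_cosine_sim = mx.nd.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()  # block until the asynchronous GPU computation has finished

The wall time I measured on my computer was 693 microseconds, although the `%%timeit` output recorded above shows 29.7 ms per loop, so expect timings to vary between runs and hardware.

For comparison, you can evaluate the speedup by dividing the wall time of the CPU model by the wall time of the GPU model.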
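
The speedup is simply the ratio of the two wall times. A minimal sketch, using the 28.7 ms CPU timing and the 693 microsecond GPU figure quoted in this notebook (the `speedup` helper is a hypothetical name, and your own timings will differ):

```python
def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds

cpu_time = 28.7e-3  # 28.7 ms per loop, from the linear_kernel timing
gpu_time = 693e-6   # 693 microseconds, the GPU wall time quoted in the text
print(f"GPU speedup: {speedup(cpu_time, gpu_time):.1f}x")
```

A ratio above 1 means the GPU run was faster; a ratio near 1 means the two runs took about the same time.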

## Sanity Checks 

To check that the two cosine similarity matrices agree, we can compare their output.

Let's print the first 10 values of the first row from the `linear_kernel` implementation.

In [21]:
cosine_similarities[0, 0:10]

array([1.        , 0.10110642, 0.06487353, 0.05420526, 0.04566789,
       0.04303635, 0.03836477, 0.03348336, 0.06532573, 0.02368301])

Now print the first 10 values of the first row from the MXNet implementation. Notice that the context in the output shows the array is still on the GPU!

In [22]:
mx_cosine_sim[0, 0:10]


[1.         0.10110642 0.06487353 0.05420526 0.04566789 0.04303635
 0.03836477 0.03348336 0.06532573 0.02368301]
<NDArray 10 @gpu(0)>

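Comparing the printed values by eye works for ten numbers, but an element-wise tolerance check scales better. The sketch below uses the two printed rows as stand-in data and Python's `math.isclose`; the `rows_match` helper and the `1e-6` tolerance are assumptions chosen for illustration. With the full matrices, one option would be to copy the GPU result to host memory with `mx_cosine_sim.asnumpy()` and compare it to `cosine_similarities` using `np.allclose`.

```python
import math

def rows_match(a, b, abs_tol=1e-6):
    """True if both rows have equal length and agree element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)
    )

# First 10 values printed by each implementation above
cpu_row = [1.0, 0.10110642, 0.06487353, 0.05420526, 0.04566789,
           0.04303635, 0.03836477, 0.03348336, 0.06532573, 0.02368301]
gpu_row = [1.0, 0.10110642, 0.06487353, 0.05420526, 0.04566789,
           0.04303635, 0.03836477, 0.03348336, 0.06532573, 0.02368301]

print(rows_match(cpu_row, gpu_row))
```
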
## Get Recommendations

In [23]:
def get_recommendations(df, item_id, cosine_sim):
    # Takes an item ID as input and returns the 10 most similar items

    # Map each item id to its row position in df
    indices = pd.Series(df.index, index=df['id']).drop_duplicates()

    # Get the index of the item that matches the id
    idx = indices[item_id]

    # Get the pairwise similarity scores between this item and all items;
    # copy an MXNet NDArray row to host memory so sorting uses plain floats
    row = cosine_sim[idx]
    if hasattr(row, 'asnumpy'):
        row = row.asnumpy()
    sim_scores = list(enumerate(row))

    # Sort the items based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 most similar items, skipping the first entry (the item itself)
    sim_scores = sim_scores[1:11]

    # Get the item indices
    item_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar items
    return df.iloc[item_indices]
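
The function above sorts every score even though only the top 10 are needed. As an alternative sketch (not the notebook's method), `heapq.nlargest` from the standard library can select the top `k` entries directly. The `top_k_similar` helper and the toy scores are hypothetical. The sketch assumes the scores are plain Python floats, so a NumPy row would need converting first, for example with `.tolist()`.

```python
import heapq

def top_k_similar(scores, idx, k=10):
    """Return the k (position, score) pairs with the highest score, skipping idx itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != idx)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])

# Hypothetical similarity row for the item at position 1
toy_scores = [0.10, 1.00, 0.65, 0.05, 0.40]
print(top_k_similar(toy_scores, idx=1, k=2))
```

Unlike the slice `sim_scores[1:11]`, this version excludes the query item by position rather than by assuming it sorts first.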

Review an item of interest; the ID will be used in the recommendations.

In [24]:
df[df.id == 5]

   id                                        description
4   5  Alpine wind jkt - On high ridges, steep ice an...


Get the recommendations using the `mx_cosine_sim` GPU matrix.

In [25]:
get_recommendations(df, 5, mx_cosine_sim)

      id                                        description
307  308  Alpine wind jkt - On high ridges, steep ice an...
95    96  Nine trails jkt - Somewhere between the Bridge...
280  281  Nine trails jkt - The Nine Trails Jacket is fo...
292  293  Houdini full-zip jkt - Now you see it, now you...
209  210  Nine trails vest - Simplicity in action - this...
363  364   Nine trails vest - Simplicity in action - thi...
96    97  Nine trails shorts - For those who view trails...
215  216  Simple guide jkt - Skin-in by headlamp, summit...
206  207  Multi use shorts - Streamlined, technical and ...
119  120  Simple guide jkt - Skin-in by headlamp, summit...
