# Content-Based Recommendation Systems with Apache MXNet

Recommendation systems help users find relevant items in many industries. This notebook walks through building a content-based recommendation system with Scikit-Learn and Apache MXNet.

Given a news article, the system returns the top N most similar articles, ranked by the similarity of their content.

In [1]:
import glob
from pprint import pprint

import mxnet as mx
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS



In [2]:
class PreprocessText:

    def __init__(self):
        self.additional_stop_words = {"-PRON-"}
        self.stop_words = set(STOP_WORDS.union(self.additional_stop_words))

    def lemmatization(self, texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
        """
        Tokenize and lemmatize all documents. The following criteria are used to evaluate each word.
            Is the token a stop word?
            Is the token comprised of letters?
            Is the token longer than 1 letter?
            Is the token an allowed POS tag?
            Is the lemmatized token a stop word?

        """
        print("Lemmatizing Text")

        # Initialize spaCy
        nlp = spacy.load('en_core_web_md', disable=["parser", "ner"])

        texts_out = []

        for text in texts:
            doc = nlp(text)
            texts_out.append([token.lemma_ for token in doc
                              if not token.is_stop
                              and token.lemma_ not in self.stop_words
                              and token.is_alpha
                              and len(token) > 1
                              and token.pos_ in allowed_postags])

            if len(texts_out) % 1000 == 0:
                print("Lemmatized {0} of {1} documents".format(
                    len(texts_out), len(texts)))

        return texts_out


def extract_features(f):
    return pd.read_csv(f, usecols=["id", "title", "publication", "content"])


def get_recommendations(df_articles, article_idx, mx_mat, n_recs=10):
    """
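    Request top N article recommendations.

    INPUT
        df_articles: Pandas DataFrame containing all articles.
        article_idx: Positional index of the article to find matches for.
        mx_mat: MXNet cosine similarity matrix.
        n_recs: Number of recommendations to request.
    OUTPUT
        Pandas DataFrame of the top-ranked articles with a "similarity"
        column. The query article itself is usually the first row.
    """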

    # Similarity and recommendations
    article_sims = mx_mat[article_idx].asnumpy()
    article_recs = np.argsort(-article_sims).tolist()[:n_recs + 1]

    # Top recommendations; copy the slice so that adding a column
    # does not trigger pandas' SettingWithCopyWarning
    df_recs = df_articles.iloc[article_recs].copy()
    df_recs["similarity"] = article_sims[article_recs]

    return df_recs
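
The ranking step inside `get_recommendations` can be illustrated without MXNet or pandas. The sketch below is a simplified stand-in, not the notebook's implementation: `top_n_similar` and the toy `sims` list are hypothetical names. It sorts indices by descending similarity and, unlike the function above, drops the query article itself, which otherwise ranks at or near the top because an article is maximally similar to itself.

```python
def top_n_similar(sims, query_idx, n_recs=10):
    """Return (index, similarity) pairs for the n_recs most similar items.

    sims: similarities between the query item and every item.
    query_idx: position of the query item, excluded from the result.
    """
    # Sort indices by descending similarity (ties keep their original order).
    ranked = sorted(range(len(sims)), key=lambda i: -sims[i])
    # Drop the query itself, then keep the first n_recs entries.
    ranked = [i for i in ranked if i != query_idx][:n_recs]
    return [(i, sims[i]) for i in ranked]


# Toy similarities for five articles; article 0 is the query.
sims = [1.0, 0.2, 0.7, 0.0, 0.5]
recs = top_n_similar(sims, query_idx=0, n_recs=3)
print(recs)  # [(2, 0.7), (4, 0.5), (1, 0.2)]
```

Whether to keep or drop the query article is a design choice; the notebook's function keeps it and requests `n_recs + 1` rows instead.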

## Import Article Files 

Create the file path to the article files.

In [3]:
file_path = "../data/"
all_files = glob.glob(file_path + "*.csv")

Import all articles with Pandas, using the `extract_features` function above.

In [4]:
# Concatenate features across all files into a single data frame.
articles = pd.concat((extract_features(f) for f in all_files))

After importing the files, subset the first 1000 articles.

In [5]:
articles = articles.head(1000)

## Create the TF-IDF Matrix

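TF-IDF (term frequency-inverse document frequency) weights each term by how often it occurs in a document, discounted by how many documents in the corpus contain it, so that distinctive terms count for more than common ones. When the TF-IDF vectorizer is fit, it builds a vocabulary from the entire set of news articles, also referred to as "documents".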
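
As a rough illustration of how TF-IDF scores terms, here is a minimal pure-Python sketch on a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default, and it skips stop words, n-grams, and `min_df`. The `tfidf` helper and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        row = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in row.values()))
        rows.append({t: w / norm for t, w in row.items()})
    return rows


docs = ["stocks fall as markets react",
        "markets rally as stocks rebound",
        "new film review"]
weights = tfidf(docs)

# "fall" appears in one document, "stocks" in two, so "fall" scores higher.
print(weights[0]["fall"] > weights[0]["stocks"])  # True
```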

After importing the documents, define the `TfidfVectorizer` from Scikit-Learn and fit it on the content of all the articles.

In [6]:
tf = TfidfVectorizer(analyzer="word",
                     ngram_range=(1, 3),
                     min_df=0.2,  # ignore terms that appear in fewer than 20% of documents
                     stop_words="english")

In [7]:
tfidf_matrix = tf.fit_transform(articles["content"])

Convert `tfidf_matrix` to an MXNet NDArray so that the dot product can be computed with MXNet. Like the matrix returned by `TfidfVectorizer`, the array created by `mx.nd.sparse.array` is sparse: it stores only the non-zero elements, which suits TF-IDF data, where the majority of the elements are zero.

The `ctx` parameter specifies the device context where the data resides: `mx.cpu()` for main memory, used by the CPU, or `mx.gpu()` for GPU memory.

In [8]:
mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.cpu())
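
Sparse formats store only the non-zero entries. The sketch below shows the compressed sparse row (CSR) layout, the format of the SciPy matrix that `TfidfVectorizer` returns, in plain Python. It illustrates the idea rather than MXNet's internals, and `to_csr` is a hypothetical helper.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)      # non-zero value
                indices.append(col)     # its column position
        indptr.append(len(data))        # where the next row starts in data
    return data, indices, indptr


dense = [[0.0, 0.6, 0.0, 0.8],
         [0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0]]
data, indices, indptr = to_csr(dense)
print(data)     # [0.6, 0.8, 1.0]
print(indices)  # [1, 3, 0]
print(indptr)   # [0, 2, 2, 3]
```

Twelve dense entries shrink to three stored values plus their positions; the saving grows with the share of zeros in the matrix.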

## Dot Product Timing: NumPy vs MXNet 

Time the dot product with NumPy on the SciPy sparse matrix returned by Scikit-Learn.

In [9]:
%timeit np.dot(tfidf_matrix, tfidf_matrix.T)

34.2 ms ± 7.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


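Time the dot product of the MXNet sparse matrix. MXNet executes operations asynchronously, so `mx.nd.waitall()` blocks until the computation has finished and the timing covers the actual work.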

In [10]:
%%timeit
mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()

6.7 ms ± 537 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Optional: Time the dot product of the MXNet sparse matrix on a GPU.

In [11]:
mx_tfidf = mx.nd.sparse.array(tfidf_matrix, ctx=mx.gpu())

In [12]:
%%timeit
mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
mx.nd.waitall()

2.46 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Calculate Speed of NumPy vs MXNet

In [17]:
numpy_time = 34.2
mxnet_time_cpu = 6.7
mxnet_time_gpu = 2.46

# Percentage reduction in run time relative to NumPy
mxnet_reduction_cpu = (1 - (mxnet_time_cpu / numpy_time)) * 100
mxnet_reduction_gpu = (1 - (mxnet_time_gpu / numpy_time)) * 100


print("MXNet on CPU takes {}% less time than NumPy.".format(round(mxnet_reduction_cpu)))
print("MXNet on GPU takes {}% less time than NumPy.".format(round(mxnet_reduction_gpu)))

MXNet on CPU takes 80% less time than NumPy.
MXNet on GPU takes 93% less time than NumPy.
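
The percentages above describe the run time saved relative to NumPy. The same measurements can also be expressed as speed-up factors, which are often easier to compare. A small worked example using the timings recorded in this notebook:

```python
numpy_time = 34.2      # ms, from the NumPy timing above
mxnet_time_cpu = 6.7   # ms, MXNet on CPU
mxnet_time_gpu = 2.46  # ms, MXNet on GPU

# Speed-up factor: how many times faster MXNet finished.
speedup_cpu = numpy_time / mxnet_time_cpu
speedup_gpu = numpy_time / mxnet_time_gpu

print("CPU: {:.1f}x".format(speedup_cpu))  # CPU: 5.1x
print("GPU: {:.1f}x".format(speedup_gpu))  # GPU: 13.9x
```

These figures come from a single run on 1,000 articles and will vary with hardware and matrix size.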


## Create Cosine Similarity Matrix

In [14]:
mx_recsys = mx.nd.sparse.dot(mx_tfidf, mx_tfidf.T)
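
The dot product of the TF-IDF matrix with its transpose yields cosine similarities because `TfidfVectorizer` scales each row to unit length by default (`norm="l2"`). A minimal pure-Python check of that identity, using made-up vectors and helper names that are not part of the notebook:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 2.0]

# After L2 normalization, the plain dot product equals the cosine similarity.
same = math.isclose(dot(l2_normalize(u), l2_normalize(v)), cosine(u, v))
print(same)  # True
```

If the vectorizer were configured with `norm=None`, the rows would need to be normalized before the dot product could be read as a cosine similarity.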

## Get Recommendations 

Get the top 10 recommendations for the article at index 3. Feel free to select any index number.

In [15]:
df_recs = get_recommendations(df_articles = articles,
    article_idx = 3, mx_mat = mx_recsys, n_recs=10)



Show the recommendations in the DataFrame.

In [16]:
df_recs[["title", "similarity"]]

| | title | similarity |
|---|---|---|
| 3 | I feared my life lacked meaning. Cancer pushed... | 1.0 |
| 167 | Chuck (aka The Bleeder) review - Liev Schreibe... | 0.544347 |
| 726 | Thom Yorke’s ex-partner Rachel Owen dies at 48 | 0.504341 |
| 373 | Mr Robot returns and The Girlfriend Experience... | 0.50074 |
| 764 | My nieces don’t know they were conceived by do... | 0.482433 |
| 563 | Bridget Jones: how to turn a female character ... | 0.482205 |
| 216 | Robert Rauschenberg and the subversive languag... | 0.476691 |
| 96 | Zsa Zsa Gabor dies aged 99 | 0.476189 |
| 678 | The rise of K2: the drug is legal, dangerous –... | 0.469531 |
| 765 | Facebook is chipping away at privacy – and my ... | 0.464698 |
