# Purpose

The purpose of this notebook is to analyze the item similarities learned by training a factorization machine model. This consists of the following steps:
1. Load the MovieLens data.
2. Preprocess the data and train a model.
3. Extract the item embedding vectors and compute cosine similarities.
4. Display the posters of similar movies to confirm that the results make sense.

In [1]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
cd ../

/Users/scottcronin/gh/recommender_deployed


In [3]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import copy
import pandas as pd
import pickle
import numpy as np
import os
from IPython.display import HTML
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import scipy.sparse as scs
from sklearn.base import TransformerMixin
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.metrics.pairwise import cosine_similarity
from lightfm import LightFM, cross_validation, evaluation

sns.set_context('notebook', font_scale=1.4)



# Load Data

In [4]:
interactions = pd.read_csv('data/ratings.dat',
                           sep='::', engine='python',
                           header=None,
                           names=['uid', 'iid', 'rating', 'timestamp'],
                           usecols=['uid', 'iid', 'rating'],
                          )
display(interactions.sample(5))
print('Shape: {:>9,} x {}'.format(*interactions.shape))

| index | uid | iid | rating |
|---|---|---|---|
| 3172023 | 22908 | 7153 | 3.5 |
| 1491751 | 10961 | 2324 | 5.0 |
| 5364696 | 38328 | 628 | 4.5 |
| 9581748 | 68670 | 1339 | 3.0 |
| 6510704 | 46568 | 3735 | 4.0 |


Shape: 10,000,054 x 3


# Preprocess data

In [5]:
from app.preprocess import Preprocessor
pp = Preprocessor(min_rating=4.0)
csr = pp.fit_transform(interactions)
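
The `Preprocessor` class comes from the `app.preprocess` module and is not shown in this notebook. As a rough sketch of what a step like this might do, the hypothetical `to_implicit_csr` below assumes that `min_rating=4.0` means ratings of 4.0 or higher are kept as positive interactions, and that raw user and item ids are mapped to contiguous row and column indices. The function name and the toy data are illustrative only, not the project's actual implementation.

```python
import numpy as np
import pandas as pd
import scipy.sparse as scs


def to_implicit_csr(df, min_rating=4.0):
    """Hypothetical stand-in for Preprocessor.fit_transform (an assumption)."""
    # Keep only the interactions treated as positive feedback.
    pos = df[df['rating'] >= min_rating]
    # pd.factorize maps raw ids to contiguous 0-based indices
    # and returns the unique ids in order of first appearance.
    u_idx, uids = pd.factorize(pos['uid'])
    i_idx, iids = pd.factorize(pos['iid'])
    # Binary matrix: 1.0 wherever a user rated an item highly enough.
    data = np.ones(len(pos), dtype=np.float32)
    csr = scs.csr_matrix((data, (u_idx, i_idx)),
                         shape=(len(uids), len(iids)))
    return csr, uids, iids


# Toy data: the 3.0 rating falls below the threshold and is dropped.
toy = pd.DataFrame({'uid': [10, 10, 20, 30],
                    'iid': [7, 9, 7, 9],
                    'rating': [5.0, 3.0, 4.0, 4.5]})
csr_toy, uids, iids = to_implicit_csr(toy)
print(csr_toy.toarray())
```

Keeping the index-to-id lookups (here `uids` and `iids`) is what later lets an embedding row index be translated back to a movie id, which is the role `fm.idx_to_iid` plays further down.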

# Build a model

In [6]:
from app.models import FM
lfm = LightFM(no_components=30, loss='warp', learning_rate=0.05)
fm = FM(fm_model=lfm, preprocessor=pp)
fm.fit(csr, epochs=3)

# Calculate cosine similarities on item embedding vectors

Let's begin by computing the pairwise cosine similarities between the item embeddings, then sorting each row's item indices from most similar to least similar.

In [10]:
cs = cosine_similarity(fm.model.item_embeddings)
sims = np.argsort(-cs)
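
To make the two lines above concrete, here is a minimal NumPy-only sketch on a made-up 4 x 2 embedding matrix (the values are illustrative, not taken from the trained model). It computes cosine similarity by hand, which for rows with non-zero norm should match what `cosine_similarity` returns, and then applies the same `np.argsort(-cs)` trick to rank each row in descending order.

```python
import numpy as np

# Toy item embeddings (assumed values, one row per item).
emb = np.array([[1.0, 0.0],
                [2.0, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])

# Cosine similarity: normalize each row to unit length, then take dot products.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cs_toy = unit @ unit.T

# argsort is ascending, so negating the matrix yields most-similar-first.
sims_toy = np.argsort(-cs_toy)

# Row i lists item indices ordered by similarity to item i.
print(sims_toy[0])
```

Each item is its own nearest neighbor, so column 0 of the sorted index matrix is simply the row index. Items 0 and 1 point in nearly the same direction and rank next to each other, while item 3 points the opposite way from item 0 and ranks last for it.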

Let's collect the indices of a few popular movies and check whether their most similar movies make sense.

In [46]:
pop_idxs = fm.pop_model[:20]
POSTERS = joblib.load('app/objects/posters.pkl.gz')
BASE_URL = 'https://image.tmdb.org/t/p/w200'
print(pop_idxs)

[622  80 528   7  23  75  22 116 141  19 133  81 118 770  25  14 120  48
 285  83]


Let's take item index 23 and find its most similar movies. The first result will be index 23 itself, since every item has a cosine similarity of 1 with itself.

In [47]:
idxs = sims[23, :][:10]
urls = [BASE_URL + POSTERS[fm.idx_to_iid[idx]] for idx in idxs]
for url in urls:
    display(HTML('<img src="{}">'.format(url)))

So the popular movie we selected is Star Wars. The next most similar movies are other Star Wars movies and Indiana Jones movies. Thus, to first order, our item similarity methodology appears to make sense.
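
Because the top result is always the query item itself, a small helper that skips it can be handy when repeating this check for other movies. The sketch below is one possible approach, not part of the project's code: `top_k_similar` is a hypothetical name, it assumes at least two items with non-zero embeddings, and it scores a single item against all others rather than building the full similarity matrix.

```python
import numpy as np


def top_k_similar(embeddings, idx, k=5):
    """Return the k indices most cosine-similar to item `idx`, excluding `idx`."""
    norms = np.linalg.norm(embeddings, axis=1)
    # Cosine similarity of every item against the query item only.
    scores = embeddings @ embeddings[idx] / (norms * norms[idx])
    # Exclude the query item so it never appears in its own results.
    scores[idx] = -np.inf
    k = min(k, len(scores) - 1)
    # argpartition finds the top k without fully sorting all scores ...
    top = np.argpartition(-scores, k - 1)[:k]
    # ... and only those k candidates are then sorted by descending score.
    return top[np.argsort(-scores[top])]


# Toy embeddings (assumed values) to exercise the helper.
emb = np.array([[1.0, 0.0],
                [2.0, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
neighbors = top_k_similar(emb, 0, k=2)
print(neighbors)
```

With the trained model, a call along the lines of `top_k_similar(fm.model.item_embeddings, 23, k=9)` would be expected to return the same neighbors as `sims[23, 1:10]`, up to ties.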