# LightFM quick start with the MovieLens 100k dataset

### Comment

Two unexpected features of this example:
- The MovieLens train/test split is rather uneven, which leads to variation in the measured performance.
- The sample recommendation for a single user lists only 3 of that user's known positives from the training set, which limits how much can be compared against the recommendations.

In [1]:
# On macOS or Windows, importing LightFM may print a warning
import numpy as np
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k, recall_at_k, auc_score

In [2]:
# Load the MovieLens 100k dataset. Only
# five-star ratings are treated as positive.
data = fetch_movielens(min_rating=5.0)

In [37]:
print(repr(data['train']))
print(repr(data['test']))

<943x1682 sparse matrix of type '<class 'numpy.float32'>'
	with 19048 stored elements in COOrdinate format>
<943x1682 sparse matrix of type '<class 'numpy.int32'>'
	with 2153 stored elements in COOrdinate format>


### WARP (Weighted Approximate-Rank Pairwise) model

The model is trained with stochastic gradient descent (SGD).
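To give a feel for what "approximate rank" means, here is a minimal NumPy sketch of the sampling step commonly described for WARP: for one positive item, negatives are drawn at random until one scores within a margin of the positive, and the number of draws needed is used to estimate the positive's rank. The function name `warp_sample`, the `margin` and `max_trials` parameters, and the `log` weighting are assumptions chosen for illustration; this is not LightFM's internal implementation.

```python
import numpy as np

def warp_sample(scores, positive_item, rng, margin=1.0, max_trials=None):
    """Sample one violating negative for a positive item (illustrative sketch).

    Returns (negative_item, weight), or (None, 0.0) if no violation was found.
    """
    n_items = scores.shape[0]
    if max_trials is None:
        max_trials = n_items - 1
    # Negatives are all items except the positive one
    candidates = np.delete(np.arange(n_items), positive_item)
    for trial in range(1, max_trials + 1):
        negative_item = int(rng.choice(candidates))
        # Violation: the negative scores within `margin` of the positive
        if scores[negative_item] > scores[positive_item] - margin:
            # Few draws needed => the positive is probably ranked low
            # => a larger update weight (assumed log weighting)
            estimated_rank = (n_items - 1) // trial
            return negative_item, float(np.log(max(estimated_rank, 1)))
    return None, 0.0

rng = np.random.default_rng(0)
scores = np.array([0.2, 1.5, 0.1, 0.9])   # hypothetical item scores for one user
negative_item, weight = warp_sample(scores, positive_item=3, rng=rng)
print(negative_item, weight)
```

In a full training loop the returned weight would scale the SGD update for the (positive, negative) pair, so badly ranked positives receive larger corrections.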

In [50]:
# Instantiate and train the model with the WARP loss
model = LightFM(loss='warp')
# epochs is the number of passes over the training data
%time model.fit(data['train'], epochs=60, num_threads=4)

CPU times: user 1.21 s, sys: 0 ns, total: 1.21 s
Wall time: 307 ms


<lightfm.lightfm.LightFM at 0x7f20e7812588>

In [51]:
# Evaluate the trained model.
# For the test metrics, the training matrix is passed as train_interactions
# so that known training positives are excluded from the ranking.
print("Train precision: %.2f" % precision_at_k(model, data['train'], k=10).mean())
print("Test precision: %.2f" % precision_at_k(model, data['test'], train_interactions=data['train'], k=10).mean())
print()
# recall_at_k
print("Train recall: %.2f" % recall_at_k(model, data['train'], k=10).mean())
print("Test recall: %.2f" % recall_at_k(model, data['test'], train_interactions=data['train'], k=10).mean())
print()
# auc_score
print("Train auc_score: %.2f" % auc_score(model, data['train']).mean())
print("Test auc_score: %.2f" % auc_score(model, data['test'], train_interactions=data['train']).mean())

Train precision: 0.37
Test precision: 0.06

Train recall: 0.36
Test recall: 0.22

Train auc_score: 0.98
Test auc_score: 0.92
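The gap between test precision and test recall is easier to read with the definitions in hand. The sketch below computes precision@k and recall@k for a single user with plain NumPy; the helper name `precision_recall_at_k` and the toy scores are made up for illustration, and the code is meant to mirror the usual definitions rather than LightFM's exact implementation. Note that the test set holds 2153 interactions for 943 users, roughly two to three positives per user, so precision@10 is capped well below 1 even for a perfect ranking.

```python
import numpy as np

def precision_recall_at_k(scores, positives, k=10):
    """Precision@k and recall@k for a single user.

    scores:    1-D array with one predicted score per item.
    positives: item indices the user actually liked.
    """
    top_k = np.argsort(-scores)[:k]                  # k best-scored items
    hits = len(set(top_k.tolist()) & set(positives))
    return hits / k, hits / len(positives)

# Toy example: 6 items, the user liked items 0 and 3
scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.3])
precision, recall = precision_recall_at_k(scores, positives=[0, 3], k=3)
print(precision, recall)
```

Here both liked items appear in the top 3, so recall is perfect while precision is limited by the third slot being filled with an item the user did not rate positively.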


In [52]:
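def sample_recommendation(model, data, user_ids):
    """Print a few known positives and the top-10 recommendations per user."""
    n_users, n_items = data['train'].shape
    # Convert once to CSR so each user's row can be sliced efficiently
    train_csr = data['train'].tocsr()

    for user_id in user_ids:
        # Titles of the items this user rated positively in the training set
        known_positives = data['item_labels'][train_csr[user_id].indices]

        # Score every item for this user and sort from best to worst
        scores = model.predict(user_id, np.arange(n_items))
        top_items = data['item_labels'][np.argsort(-scores)]

        print("User %s" % user_id)
        print("     Known positives:")

        # Only the first 3 known positives are shown
        for x in known_positives[:3]:
            print("        %s" % x)

        print("     Recommended:")

        for x in top_items[:10]:
            print("        %s" % x)

        # Fraction of the top-10 recommendations that are known training positives
        print("     %s" % (len([t for t in top_items[:10] if t in known_positives]) / 10))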

sample_recommendation(model, data, [0])

User 0
     Known positives:
        Toy Story (1995)
        Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
        Dead Man Walking (1995)
     Recommended:
        Close Shave, A (1995)
        Twelve Monkeys (1995)
        Usual Suspects, The (1995)
        Fargo (1996)
        Chasing Amy (1997)
        Bound (1996)
        Wrong Trousers, The (1993)
        Shawshank Redemption, The (1994)
        City of Lost Children, The (1995)
        Clerks (1994)
     0.7
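The final number printed above (0.7 for user 0) is the fraction of the top-10 list that the user already rated positively in the training set. If the goal is to surface new items, one option is to mask known items before ranking. The following is a minimal NumPy sketch of that idea; `top_k_unseen` is a hypothetical helper, not part of LightFM, and it assumes there are at least `k` unseen items.

```python
import numpy as np

def top_k_unseen(scores, known_item_ids, k=10):
    """Return indices of the k highest-scoring items not in known_item_ids."""
    masked = np.array(scores, dtype=float)   # copy, so the input is untouched
    masked[known_item_ids] = -np.inf         # known items can never rank first
    return np.argsort(-masked)[:k]

# Toy example: items 0 and 2 are already known to the user
scores = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.3])
print(top_k_unseen(scores, known_item_ids=[0, 2], k=3))
```

With the notebook's objects, `known_item_ids` would correspond to `data['train'].tocsr()[user_id].indices` and `scores` to the output of `model.predict`.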


In [53]:
data['train']

<943x1682 sparse matrix of type '<class 'numpy.float32'>'
	with 19048 stored elements in COOrdinate format>

In [54]:
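import numpy as np
# Bytes used by the stored float32 values alone (4 bytes each);
# the COO row and column index arrays are not included
19048*np.float32(5).nbytes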

76192
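The figure above counts only the stored values. A COO matrix also keeps one row index and one column index per stored element (exposed in SciPy as `.row` and `.col` next to `.data`), so the real footprint is larger. The sketch below redoes the arithmetic with plain NumPy arrays, assuming 32-bit indices (the actual index dtype can differ), and compares it with a dense float32 array of the same 943x1682 shape.

```python
import numpy as np

nnz = 19048                                # stored elements in data['train']
values = np.zeros(nnz, dtype=np.float32)   # ratings
rows = np.zeros(nnz, dtype=np.int32)       # assumed index dtype
cols = np.zeros(nnz, dtype=np.int32)       # assumed index dtype

values_bytes = values.nbytes
total_bytes = values.nbytes + rows.nbytes + cols.nbytes
dense_bytes = 943 * 1682 * np.dtype(np.float32).itemsize

print(values_bytes, total_bytes, dense_bytes)
```

Even with the index arrays included, the sparse representation stays far smaller than the dense one under these assumptions.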

In [55]:
type(data['train'])

scipy.sparse.coo.coo_matrix