# Item cold start: StackExchange dataset

### Comment:

- What about learning curves: bias and variance against dataset size and regularization parameter, and loss against learning rate?
- There's a problem with the documentation code for `auc_score` on the test set for the hybrid model.
- Create item embedding representations and find similar items.

In [3]:
import numpy as np

from lightfm.datasets import fetch_stackexchange

data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)

train = data['train']
test = data['test']

In [5]:
# The train/test split is chronological, so many questions in the
# test set have no interactions in train.
print('Dataset has %s users and %s items, '
      'with %s interactions in train and %s interactions in test.'
      % (train.shape[0], train.shape[1], train.getnnz(), test.getnnz()))

Dataset has 3221 users and 72360 items, with 57830 interactions in train and 4307 interactions in test.


### Pure Collaborative Filtering Model

In [14]:
from lightfm import LightFM

NUM_THREADS = 4
NUM_COMPONENTS = 30
NUM_EPOCHS = 6
ITEM_ALPHA = 1e-6

# Model trained with the WARP loss
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

%time model.fit(train, epochs = NUM_EPOCHS, num_threads = NUM_THREADS)

CPU times: user 1.09 s, sys: 11.8 ms, total: 1.1 s
Wall time: 318 ms


<lightfm.lightfm.LightFM at 0x7f9a4160a0b8>

In [15]:
from lightfm.evaluation import auc_score
train_auc = auc_score(model, train, num_threads = NUM_THREADS).mean()
print('Collaborative Filtering train auc: %s' % train_auc)

Collaborative Filtering train auc: 0.97278005


In [16]:
test_auc = auc_score(model, test, num_threads = NUM_THREADS).mean()
print('Collaborative Filtering test auc: %s' % test_auc)

Collaborative Filtering test auc: 0.35952914


In [18]:
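# Test items are mostly unseen in train. Zero the per-item biases and
# re-evaluate to check whether they explain the below-chance test AUC.
model.item_biases *= 0
test_auc = auc_score(model, test, num_threads=NUM_THREADS).mean()
print('Collaborative Filtering test auc: %s' % test_auc)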

Collaborative Filtering test auc: 0.51151264
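
A test AUC of 0.36 that rises to about 0.5 once the item biases are zeroed is easier to read with the definition of AUC in mind: for one user, it is the probability that a randomly chosen positive item is scored above a randomly chosen negative one. The sketch below is a dependency-free illustration, not LightFM's implementation. `pairwise_auc` and the toy scores are hypothetical, and counting ties as half a win is an assumption of this sketch.

```python
def pairwise_auc(scores, positives):
    """AUC for one user: share of (positive, negative) pairs ranked correctly.

    scores    -- list of predicted scores, one per item
    positives -- set of item indices the user interacted with
    Ties count as half a win (an assumption of this sketch).
    """
    pos = [s for i, s in enumerate(scores) if i in positives]
    neg = [s for i, s in enumerate(scores) if i not in positives]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


toy_scores = [0.9, 0.1, 0.4, 0.7]
print(pairwise_auc(toy_scores, {0}))            # 1.0: the positive is ranked first
print(pairwise_auc(toy_scores, {1}))            # 0.0: the positive is ranked last
print(pairwise_auc([0.0, 0.0, 0.0, 0.0], {2}))  # 0.5: uninformative scores
```

Read this way, an AUC below 0.5 means the test positives are systematically ranked under the other items, and 0.5 means the scores carry no information about them.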


### A Hybrid model

In [21]:
item_features = data['item_features']
tag_labels = data['item_feature_labels']

print('There are %s distinct tags, with values like %s'
      % (item_features.shape[1], tag_labels[:3].tolist()))

There are 1246 distinct tags, with values like ['bayesian', 'prior', 'elicitation']


In [36]:
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

model = model.fit(train, item_features=item_features,
                  epochs=NUM_EPOCHS, num_threads=NUM_THREADS)

In [50]:
# The model was fit with item_features, so the same matrix
# has to be passed again at evaluation time.
train_auc = auc_score(model, train,
                      item_features = item_features,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training auc: %s' % train_auc)

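# train_interactions removes known training positives from the ranking;
# check_intersections=False skips the train/test overlap check.
test_auc = auc_score(model,
                     test,
                     train_interactions=train,
                     item_features=item_features,
                     num_threads=NUM_THREADS, check_intersections=False).mean()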

print('Hybrid test set AUC: %s' % test_auc)

Hybrid training auc: 0.8997756
Hybrid test set AUC: 0.71544963


In [47]:
from lightfm.evaluation import precision_at_k

# Pass item_features again: the model was fit with them.
train_pk = precision_at_k(model, train,
                          item_features=item_features, k=10,
                          num_threads=NUM_THREADS).mean()
print('Hybrid training precision_at_k: %s' % train_pk)

test_pk = precision_at_k(model,
                         test,
                         train_interactions=train,
                         item_features=item_features, k=10,
                         num_threads=NUM_THREADS, check_intersections=False).mean()

print('Hybrid test set precision_at_k: %s' % test_pk)

Hybrid training precision_at_k: 0.009003416
Hybrid test set precision_at_k: 0.0031716418


In [48]:
from lightfm.evaluation import recall_at_k

# Pass item_features again: the model was fit with them.
train_rk = recall_at_k(model, train,
                       item_features=item_features, k=10,
                       num_threads=NUM_THREADS).mean()
print('Hybrid training recall_at_k: %s' % train_rk)

test_rk = recall_at_k(model,
                      test,
                      train_interactions=train,
                      item_features=item_features, k=10,
                      num_threads=NUM_THREADS, check_intersections=False).mean()

print('Hybrid test set recall_at_k: %s' % test_rk)

Hybrid training recall_at_k: 0.00962684662991416
Hybrid test set recall_at_k: 0.0037591968035629287
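
Precision at k and recall at k both count how many of a user's positives land in the top k of the ranking; they differ only in the denominator (k for precision, the number of positives for recall). The sketch below shows the per-user computation on toy data. `precision_recall_at_k` and the item ids are hypothetical, and this is a plain-Python illustration rather than LightFM's code.

```python
def precision_recall_at_k(ranked_items, relevant, k):
    """Per-user precision@k and recall@k.

    ranked_items -- item ids sorted by predicted score, best first
    relevant     -- set of item ids the user actually interacted with
    """
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    # Users without positives get recall 0.0 here (a choice made for this sketch).
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranked = [5, 2, 9, 1, 7]   # toy ranking, best first
relevant = {2, 7, 8}       # toy positives; item 8 never reaches the top 5
precision, recall = precision_recall_at_k(ranked, relevant, k=5)
print(precision, recall)   # 2 hits: precision = 2/5, recall = 2/3
```

With k=10 and tens of thousands of candidate items, both values stay small unless the positives are ranked very near the top, which is worth keeping in mind when reading the figures above.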


### Tag embeddings

In [57]:
def get_similar_tags(model, tag_id):
    # Define similarity as the cosine of the angle
    # between tag latent vectors.

    # Normalize the vectors to unit length
    tag_embeddings = (model.item_embeddings.T
                      / np.linalg.norm(model.item_embeddings, axis=1)).T
    query_embedding = tag_embeddings[tag_id]
    similarity = np.dot(tag_embeddings, query_embedding)
    # Position 0 of the ranking is the query tag itself, so skip it
    most_similar = np.argsort(-similarity)[1:4]
    return most_similar

for tag in (u'bayesian', u'regression', u'survival'):
    tag_id = tag_labels.tolist().index(tag)
    print('Most similar tags for %s: %s'
          % (tag_labels[tag_id], tag_labels[get_similar_tags(model, tag_id)]))

Most similar tags for bayesian: ['prior' 'mcmc' 'metropolis-hastings']
Most similar tags for regression: ['down-sample' 'segmented-regression' 'regression-coefficients']
Most similar tags for survival: ['cox-model' 'hazard' 'epidemiology']
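
The same idea could extend from tags to questions, which is the open item in the comment list at the top: represent each item by combining the embeddings of its tags, then compare items by cosine similarity. The sketch below uses toy two-dimensional tag vectors and assumes an unweighted sum over an item's tags. `item_embedding`, `cosine`, `most_similar`, and all of the toy data are hypothetical, and none of this calls LightFM.

```python
import math


def item_embedding(tag_ids, tag_embeddings):
    """Represent an item as the sum of the embeddings of its tags."""
    dim = len(tag_embeddings[0])
    return [sum(tag_embeddings[t][d] for t in tag_ids) for d in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def most_similar(query, vectors):
    """Name of the item whose vector is closest to the query item's vector."""
    candidates = [(cosine(vectors[query], vec), name)
                  for name, vec in vectors.items() if name != query]
    return max(candidates)[1]


# Toy data: three tag vectors and three questions tagged with them.
toy_tag_embeddings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
toy_items = {'q1': [0, 1], 'q2': [1], 'q3': [2]}

vectors = {name: item_embedding(tags, toy_tag_embeddings)
           for name, tags in toy_items.items()}
print(most_similar('q1', vectors))  # q2, which shares tag 1 with q1
```

Because cosine similarity ignores vector length, a question with many tags and one with few can still come out as close neighbours if their tags point in the same direction.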
