Custom dataset, negative sample and metrics #55

graytowne opened this issue Sep 5, 2017 · 7 comments
graytowne (Contributor) commented Sep 5, 2017

Hi @maciejkula,

Thanks for your awesome work!
I am now using it for my own research and found a few small issues:

  1. I think the project is currently not very friendly to custom datasets: user ids do not always start from 0, and sometimes they are unique strings like "AMEVO2LY6VEJA". I would recommend using a dictionary to map each unique user identifier to an int; it would be nice if the project had a built-in data converter (see the sketch after this list).

  2. The negative sampling here cannot guarantee that the sampled item is truly "negative". In other words, it can actually be a positive, though the probability is relatively low.

  3. It would be good if spotlight had more ranking metrics, such as precision, recall, AUC and MAP.
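
For the converter in point 1, something like the following is what I have in mind - just a rough sketch with made-up names, not actual spotlight code:

```python
import numpy as np


class IdIndexer:
    """Map arbitrary user/item identifiers (strings, non-contiguous ints, ...)
    to contiguous integer indices starting at 0."""

    def __init__(self):
        self._mapping = {}

    def fit_transform(self, raw_ids):
        indices = np.empty(len(raw_ids), dtype=np.int64)
        for pos, raw_id in enumerate(raw_ids):
            if raw_id not in self._mapping:
                self._mapping[raw_id] = len(self._mapping)
            indices[pos] = self._mapping[raw_id]
        return indices


# IdIndexer().fit_transform(["AMEVO2LY6VEJA", "B000EXAMPLE", "AMEVO2LY6VEJA"])
# -> array([0, 1, 0])
```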

maciejkula (Owner) commented Sep 6, 2017

Glad you find it useful!

  1. As in any other machine learning application, you are required to translate your identifiers to contiguous numerical indices. I haven't included this as it's fairly trivial, but as you suggest it's a sufficiently common need to be worth thinking about.
  2. That's correct. Assuming sufficiently large item set sizes, I don't think this would negatively impact model quality.
  3. Yes! I've added an issue to the project board. Help would be welcome!

graytowne (Contributor, Author) commented

I am trying to add precision@k and recall@k to spotlight; I hope I can finish it soon.
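
Roughly along these lines - a simplified per-user sketch, not the final implementation:

```python
def precision_recall_at_k(ranked_items, relevant_items, k=10):
    """Simplified precision@k / recall@k for a single user.

    ranked_items: item ids ordered by predicted score, best first.
    relevant_items: set of held-out positive item ids for this user.
    """
    top_k = set(ranked_items[:k])
    num_hits = len(top_k & set(relevant_items))
    precision = num_hits / k
    recall = num_hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall
```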

ttrine commented Apr 13, 2018

@maciejkula RE your second bullet point, I was having issues reliably evaluating my models, for two reasons:

  1. 'Negative' samples were not necessarily negative (as discussed above), and
  2. The train and test sets were not disjoint, because the negative training samples are sampled uniformly at random from the whole interaction matrix instead of from a designated training subset.

To resolve this, I changed the API of the 'fit' function: I now pass in train_pos, test_pos, and test_neg. train_pos is the list of positive training examples (previously just called interactions), while test_pos and test_neg together constitute the test set, which the user constructs ahead of time. I fix issue 1 by checking during negative sample generation that the samples are present in neither train_pos nor test_pos, and I fix issue 2 by checking that they are not present in test_neg.
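
In code, the check during negative sample generation looks roughly like this (a simplified sketch of the idea; the function name and structure are just illustrative, and the real thing should be vectorized):

```python
import numpy as np


def sample_verified_negatives(user_ids, num_items, train_pos, test_pos, test_neg, seed=42):
    """Draw one training negative per user, rejecting (a) anything that is a
    positive in train_pos or test_pos and (b) anything already used as a
    negative in test_neg, so that the train and test sets stay disjoint.

    train_pos / test_pos / test_neg are sets of (user_id, item_id) pairs.
    """
    rng = np.random.default_rng(seed)
    negatives = np.empty(len(user_ids), dtype=np.int64)
    for pos, user_id in enumerate(user_ids):
        while True:
            item_id = int(rng.integers(num_items))
            pair = (user_id, item_id)
            if pair not in train_pos and pair not in test_pos and pair not in test_neg:
                break
        negatives[pos] = item_id
    return negatives
```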
Of course, this also changes the evaluation API, because you must now explicitly pass in negative test examples. I have some code that computes various metrics in this manner (ROC curves, PR curves, AUC, AP, pairwise accuracy) - nothing productionized, it's just sitting in a Jupyter notebook.
I'm mentioning these changes in case they're of interest - I can polish them up and make a PR if so. But while they suit my needs well, your perspective may be broader so I'm curious to hear your thoughts.

@benfred
Copy link

benfred commented May 20, 2018

For the negative samples, I ran a couple of experiments using BPR on a different project, testing the same model both with and without excluding negative samples that had actually been liked by the user:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items | Negative rejection rate |
| --- | --- | --- | --- | --- |
| Movielens 100k | 0.116 | 0.228 | 1683 | 18.45% |
| Movielens 1M | 0.179 | 0.243 | 3953 | 16.82% |
| Movielens 10M | 0.151 | 0.185 | 65134 | 13.55% |
| Movielens 20M | 0.131 | 0.228 | 131263 | 12.79% |
| LastFM 360K | 0.098 | 0.108 | 292385 | 1.75% |

It seems that there is a decent improvement from removing positives from the negative samples - even when almost all of the randomly selected negative items are in fact negative. As an example, with the last.fm dataset only 1.75% of the sampled negative items were in fact positive - but by excluding these items P@10 still rose by 10%.

Anyway - just thought I'd pipe in here. I'm not sure of the best way of implementing this in torch; I'm guessing it will require a native extension in order to do it efficiently =(

maciejkula (Owner) commented

Hey @benfred, thanks for the heads-up! I already do this in LightFM, so it makes sense to do this here as well (though, as you say, a native extension may be necessary).

I am surprised the differences are so large; I would have imagined that not doing this would simply act as a mild regularizer on the model. Did you see this with SGD/backprop training? I remember implicit uses coordinate gradient descent (?) for ALS?

ttrine commented May 20, 2018 via email

benfred commented May 22, 2018

@maciejkula: I was also surprised the results were so large - I wouldn't have thought that this would have made that much of a difference either.

I actually started testing this out when I noticed that the BPR model in LightFM was substantially outperforming the prediction accuracy of the BPR model I added in implicit. When looking into this, I found that most of the accuracy difference was because LightFM was removing true positives from the negative samples and I wasn't originally. I am using SGD for training the BPR model.

To verify this isn't just a problem with my code, I quickly ran some tests with LightFM and found a similar drop in P@K without the check (I hacked up the 'in_positives' function to always return 0 to test out not verifying the negatives):

Using a LightFM model like:

```python
from lightfm import LightFM

model = LightFM(learning_rate=0.05, loss='bpr', no_components=16)
model.fit(train, epochs=20)
```

I got results like:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items |
| --- | --- | --- | --- |
| Movielens 1M | 0.149 | 0.218 | 3953 |
| Movielens 10M | 0.137 | 0.199 | 65134 |

It's worth pointing out that I'm calculating P@K differently than you do in LightFM: I don't include the positive items from the training set in the ranking (that is P@K is evaluated by ranking only negative/missing items from the training set and comparing those ranks against the withheld positive items in the test set). Removing train set positives when testing avoids some potential problems in evaluating models. I can go into much more depth on why I think this is necessary if you're interested =).
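
In rough code, that evaluation is something like the following for a single user (the names here are just illustrative, not from either library):

```python
import numpy as np


def p_at_k_excluding_train(scores, train_items, test_items, k=10):
    """P@K where the user's training-set positives are removed from the
    candidate ranking before taking the top K.

    scores: predicted score for every item (1-d array of length num_items).
    train_items / test_items: sets of item ids the user liked in train / test.
    """
    all_items = np.arange(len(scores))
    candidates = np.setdiff1d(all_items, np.fromiter(train_items, dtype=np.int64))
    top_k = candidates[np.argsort(-scores[candidates])[:k]]
    return len(set(top_k) & set(test_items)) / k
```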

Calculating P@K including the training set positives in the ranking leads to these results for LightFM - which is probably more in line with what you've seen in your experiments:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items |
| --- | --- | --- | --- |
| Movielens 1M | 0.0892 | 0.111 | 3953 |
| Movielens 10M | 0.092 | 0.110 | 65134 |

The difference isn't quite as pronounced when calculating P@K like this - but it's still pretty large.

Even with the fix, the BPR model in LightFM still seems to be doing a better job than the one I added in implicit. I haven't done any parameter tuning for either library in these experiments, which might be part of the difference, but I'm still looking into this to make sure I didn't do anything else wrong =). Edit: there was another issue with the BPR model in implicit, which was fixed in benfred/implicit#105. I've updated the numbers above to reflect this fix.

Finally, the ALS model in implicit uses a conjugate gradient optimizer, which works well because of the least-squares loss. It also doesn't sample negative items, so it doesn't have this problem: instead, ALS uses all of the unliked data, but does so efficiently by only iterating over the positive items (I talked about this model here: https://www.benfrederickson.com/matrix-factorization/ and https://www.benfrederickson.com/fast-implicit-matrix-factorization/).
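
For anyone curious, here's a rough numpy sketch of the exact per-user solve that the conjugate gradient code approximates (the standard implicit-feedback ALS formulation; the function and variable names are just illustrative):

```python
import numpy as np


def als_user_factors(Y, liked_items, confidences, reg=0.01):
    """Solve for one user's factors in implicit-feedback ALS.

    Every item is treated as a weight-1 negative and liked items get extra
    confidence, but thanks to the Y^T (Cu - I) Y decomposition we only loop
    over the items the user actually interacted with.

    Y: (num_items, num_factors) item-factor matrix.
    liked_items / confidences: the user's liked item ids and their confidences.
    """
    num_factors = Y.shape[1]
    A = Y.T @ Y + reg * np.eye(num_factors)  # Y^T Y is shared across users in practice
    b = np.zeros(num_factors)
    for item, conf in zip(liked_items, confidences):
        A = A + (conf - 1.0) * np.outer(Y[item], Y[item])  # extra confidence for liked items
        b = b + conf * Y[item]                             # preference is 1 only for liked items
    return np.linalg.solve(A, b)
```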
