Custom dataset, negative sample and metrics #55

graytowne opened this issue Sep 5, 2017 · 7 comments
graytowne (Contributor) commented Sep 5, 2017

Hi @maciejkula,

Thanks for your awesome work!
I am now using it for my own research and found a few small issues:

  1. I think the project is currently not very friendly to custom datasets: user ids do not always start from 0, and sometimes they are unique strings like "AMEVO2LY6VEJA". I would recommend using a dictionary to map each unique user identifier to an int; it would be nice if the project had a built-in data converter (see the sketch after this list).

  2. The negative sampling here cannot guarantee that the sampled item is truly "negative". In other words, it can actually be a positive, though the probability is relatively low.

  3. It would be good if spotlight had more ranking metrics, such as precision, recall, AUC and MAP.
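
For the converter in point 1, something like the following is what I have in mind - just a rough sketch with made-up names, not actual spotlight code:

```python
import numpy as np


class IdIndexer:
    """Map arbitrary user/item identifiers (strings, non-contiguous ints, ...)
    to contiguous integer indices starting at 0."""

    def __init__(self):
        self._mapping = {}

    def fit_transform(self, raw_ids):
        indices = np.empty(len(raw_ids), dtype=np.int64)
        for pos, raw_id in enumerate(raw_ids):
            if raw_id not in self._mapping:
                self._mapping[raw_id] = len(self._mapping)
            indices[pos] = self._mapping[raw_id]
        return indices


# IdIndexer().fit_transform(["AMEVO2LY6VEJA", "B000EXAMPLE", "AMEVO2LY6VEJA"])
# -> array([0, 1, 0])
```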

maciejkula (Owner) commented Sep 6, 2017

Glad you find it useful!

  1. As in any other machine learning application, you are required to translate your identifiers to contiguous numerical indices. I haven't included this as it's fairly trivial, but as you suggest it's a sufficiently common need to be worth thinking about.
  2. That's correct. Assuming sufficiently large item set sizes, I don't think this would negatively impact model quality.
  3. Yes! I've added an issue to the project board. Help would be welcome!

graytowne (Contributor, Author) commented

I am trying to add precision@k and recall@k to spotlight; I hope I can finish it soon.
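
Roughly along these lines - a simplified per-user sketch, not the final implementation:

```python
def precision_recall_at_k(ranked_items, relevant_items, k=10):
    """Simplified precision@k / recall@k for a single user.

    ranked_items: item ids ordered by predicted score, best first.
    relevant_items: set of held-out positive item ids for this user.
    """
    top_k = set(ranked_items[:k])
    num_hits = len(top_k & set(relevant_items))
    precision = num_hits / k
    recall = num_hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall
```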

ttrine commented Apr 13, 2018

@maciejkula RE your second bullet point, I was having issues reliably evaluating my models, for two reasons:

  1. 'Negative' samples were not necessarily negative (as discussed above), and
  2. The train and test sets were not disjoint, because the negative training samples are sampled uniformly at random from the whole interaction matrix instead of from a designated training subset.

To resolve this, I changed the API of the 'fit' function: I now pass in train_pos, test_pos, and test_neg. train_pos is the list of positive training examples (previously just called interactions), while test_pos and test_neg together constitute the test set, which the user constructs ahead of time. I fix issue 1 by checking during negative sample generation that the samples are present in neither train_pos nor test_pos, and I fix issue 2 by checking that they are not present in test_neg.
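
In code, the check during negative sample generation looks roughly like this (a simplified sketch of the idea; the function name and structure are just illustrative, and the real thing should be vectorized):

```python
import numpy as np


def sample_verified_negatives(user_ids, num_items, train_pos, test_pos, test_neg, seed=42):
    """Draw one training negative per user, rejecting (a) anything that is a
    positive in train_pos or test_pos and (b) anything already used as a
    negative in test_neg, so that the train and test sets stay disjoint.

    train_pos / test_pos / test_neg are sets of (user_id, item_id) pairs.
    """
    rng = np.random.default_rng(seed)
    negatives = np.empty(len(user_ids), dtype=np.int64)
    for pos, user_id in enumerate(user_ids):
        while True:
            item_id = int(rng.integers(num_items))
            pair = (user_id, item_id)
            if pair not in train_pos and pair not in test_pos and pair not in test_neg:
                break
        negatives[pos] = item_id
    return negatives
```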
Of course, this also changes the evaluation API, because you must now explicitly pass in negative test examples. I have some code that computes various metrics in this manner (ROC curves, PR curves, AUC, AP, pairwise accuracy) - nothing productionized, it's just sitting in a Jupyter notebook.
I'm mentioning these changes in case they're of interest - I can polish them up and make a PR if so. But while they suit my needs well, your perspective may be broader so I'm curious to hear your thoughts.

@benfred
Copy link

benfred commented May 20, 2018

For the negative samples, I ran a couple of experiments using BPR on a different project, testing the same model both with and without excluding negative samples that had actually been liked by the user:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items | Negative rejection rate |
| --- | --- | --- | --- | --- |
| Movielens 100k | 0.116 | 0.228 | 1683 | 18.45% |
| Movielens 1M | 0.179 | 0.243 | 3953 | 16.82% |
| Movielens 10M | 0.151 | 0.185 | 65134 | 13.55% |
| Movielens 20M | 0.131 | 0.228 | 131263 | 12.79% |
| LastFM 360K | 0.098 | 0.108 | 292385 | 1.75% |

It seems that there is a decent improvement from removing positives from the negative samples - even when almost all of the randomly selected negative items are in fact negative. As an example, with the last.fm dataset only 1.75% of the sampled negative items were in fact positive - but by excluding these items P@10 still rose by 10%.

Anyway - just thought I'd pipe in here. I'm not sure of the best way of implementing this in torch; I'm guessing it will require a native extension in order to do it efficiently =(

maciejkula (Owner) commented

Hey @benfred, thanks for the heads-up! I already do this in LightFM, so it makes sense to do this here as well (though, as you say, a native extension may be necessary).

I am surprised the differences are so large; I would have imagined that not doing this would simply act as a mild regularizer on the model. Did you see this with SGD/backprop training? I remember implicit uses coordinate gradient descent (?) for ALS?

ttrine commented May 20, 2018 via email

benfred commented May 22, 2018

@maciejkula: I was also surprised the results were so large - I wouldn't have thought that this would have made that much of a difference either.

I actually started testing this out when I noticed that the BPR model in LightFM was substantially outperforming the prediction accuracy of the BPR model I added in implicit. When looking into this, I found that most of the accuracy difference was because LightFM was removing true positives from the negative samples and I wasn't originally. I am using SGD for training the BPR model.

To verify this isn't just a problem with my code, I quickly ran some tests with LightFM and found a similar drop in P@K without the check (I hacked up the 'in_positives' function to always return 0 to test out not verifying the negatives):

Using a LightFM model like:

```python
from lightfm import LightFM

model = LightFM(learning_rate=0.05, loss='bpr', no_components=16)
model.fit(train, epochs=20)
```

I got results like:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items |
| --- | --- | --- | --- |
| Movielens 1M | 0.149 | 0.218 | 3953 |
| Movielens 10M | 0.137 | 0.199 | 65134 |

It's worth pointing out that I'm calculating P@K differently than you do in LightFM: I don't include the positive items from the training set in the ranking (that is P@K is evaluated by ranking only negative/missing items from the training set and comparing those ranks against the withheld positive items in the test set). Removing train set positives when testing avoids some potential problems in evaluating models. I can go into much more depth on why I think this is necessary if you're interested =).
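
In rough code, that evaluation is something like the following for a single user (the names here are just illustrative, not from either library):

```python
import numpy as np


def p_at_k_excluding_train(scores, train_items, test_items, k=10):
    """P@K where the user's training-set positives are removed from the
    candidate ranking before taking the top K.

    scores: predicted score for every item (1-d array of length num_items).
    train_items / test_items: sets of item ids the user liked in train / test.
    """
    all_items = np.arange(len(scores))
    candidates = np.setdiff1d(all_items, np.fromiter(train_items, dtype=np.int64))
    top_k = candidates[np.argsort(-scores[candidates])[:k]]
    return len(set(top_k) & set(test_items)) / k
```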

Calculating P@K including the training set positives in the ranking leads to these results for LightFM - which is probably more in line with what you've seen in your experiments:

| Dataset | P@10: not verifying negatives | P@10: verifying negatives | Items |
| --- | --- | --- | --- |
| Movielens 1M | 0.0892 | 0.111 | 3953 |
| Movielens 10M | 0.092 | 0.110 | 65134 |

The difference isn't quite as pronounced when calculating P@K like this - but it's still pretty large.

Even with the fix, the BPR model in LightFM still seems to be doing a better job than the one I added in implicit. I haven't done any parameter tuning for either library in these experiments, which might be part of the difference, but I'm still looking into this to make sure I didn't do anything else wrong =). Edit: there was another issue with the BPR model in implicit, which was fixed in benfred/implicit#105. I've updated the numbers above to reflect this fix.

Finally, the ALS model in implicit uses a conjugate gradient optimizer, which works well because of the least-squares loss. It also doesn't sample negative items, so it doesn't have this problem: instead, ALS uses all of the unliked data, but does so efficiently by only iterating over the positive items (I talked about this model here: https://www.benfrederickson.com/matrix-factorization/ and https://www.benfrederickson.com/fast-implicit-matrix-factorization/).
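
For anyone curious, here's a rough numpy sketch of the exact per-user solve that the conjugate gradient code approximates (the standard implicit-feedback ALS formulation; the function and variable names are just illustrative):

```python
import numpy as np


def als_user_factors(Y, liked_items, confidences, reg=0.01):
    """Solve for one user's factors in implicit-feedback ALS.

    Every item is treated as a weight-1 negative and liked items get extra
    confidence, but thanks to the Y^T (Cu - I) Y decomposition we only loop
    over the items the user actually interacted with.

    Y: (num_items, num_factors) item-factor matrix.
    liked_items / confidences: the user's liked item ids and their confidences.
    """
    num_factors = Y.shape[1]
    A = Y.T @ Y + reg * np.eye(num_factors)  # Y^T Y is shared across users in practice
    b = np.zeros(num_factors)
    for item, conf in zip(liked_items, confidences):
        A = A + (conf - 1.0) * np.outer(Y[item], Y[item])  # extra confidence for liked items
        b = b + conf * Y[item]                             # preference is 1 only for liked items
    return np.linalg.solve(A, b)
```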
