
Unintuitive behaviour with cross_validate.random_train_test_split #328

Closed

RothNRK opened this issue Jul 9, 2018 · 15 comments

Comments

@RothNRK
Contributor

RothNRK commented Jul 9, 2018

The random_train_test_split function makes it easy to split the interactions matrix into train and test datasets, but if you have data with weights you have to apply random_train_test_split twice with the same random_state parameter. My concern is that it would seem intuitive to do something like:

import numpy as np

from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split

users = np.random.choice([0., 1., 2.], (10, 1))
items = np.random.choice([0., 1., 2.], (10, 1))
weight = np.random.rand(10, 1)
data = np.concatenate((users, items, weight), axis=1)

dataset = Dataset()
dataset.fit(users=np.unique(data[:, 0]), items=np.unique(data[:, 1]))
interactions, weight = dataset.build_interactions((i[0], i[1], i[2]) for i in data)

test_percentage = 0.2
random_state = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> False
>>> False
>>> False
>>> False

This results in an incorrect split because the internal state of random_state changes after the first call to random_state.shuffle. For the above example to work as intended, you need to create two separate but identically seeded RandomStates:

random_state_interaction = np.random.RandomState(seed=1)
random_state_weight = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state_interaction
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state_weight
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> True
>>> True
>>> True
>>> True

It works but I think it's a little awkward. Two possible solutions/suggestions:

  1. Only require a seed parameter and create a RandomState inside cross_validation._shuffle. This has the added benefit of matching larger libraries that only take a seed rather than a RandomState generator. I also don't see any additional flexibility gained by passing in a generator instead of a simple integer.
  2. Make a copy of random_state before applying shuffle in cross_validation._shuffle, as in the sketch below.
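
For concreteness, a minimal sketch of suggestion 2, assuming _shuffle receives the split COO arrays and a RandomState (illustrative, not an actual patch):

import copy

import numpy as np

def _shuffle(uids, iids, data, random_state):
    # Shuffle using a deep copy of the generator so the caller's RandomState
    # is left untouched; two calls seeded identically then yield identical splits.
    random_state = copy.deepcopy(random_state)
    shuffle_indices = np.arange(len(uids))
    random_state.shuffle(shuffle_indices)
    return (uids[shuffle_indices], iids[shuffle_indices], data[shuffle_indices])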

Thoughts?

@maciejkula
Collaborator

I think it's a good idea to transparently instantiate a RandomState. I'd be happy to accept a PR for this (I think there are a couple of places where this would be an improvement).
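
A sketch of what transparently instantiating a RandomState could look like, modeled on sklearn's check_random_state helper (the helper name here is hypothetical, not LightFM's actual code):

import numpy as np

def _check_random_state(random_state):
    # Accept None, an integer seed, or an existing RandomState, and always
    # return a RandomState instance.
    if random_state is None:
        return np.random.RandomState()
    if isinstance(random_state, (int, np.integer)):
        return np.random.RandomState(random_state)
    return random_state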

@RothNRK
Contributor Author

RothNRK commented Jul 9, 2018

I can't say I completely agree with the transparency argument, but I'm happy to make a PR for the second suggestion of making a copy of the RandomState.

@maciejkula
Collaborator

maciejkula commented Jul 9, 2018 via email

@tianpunchh

Maybe I am missing the point. I think interactions and weights are twins from build_interactions(): interactions just contains a 1 wherever the weights matrix has a nonzero value (e.g. a 0-to-5 rating). When you train the model, only one of them, either interactions or weights, not both, gets fed in, so why do you need to split both of them at the same time?

@maciejkula
Collaborator

@tianpunchh weights are not always one and zero: if you supply weights during dataset building, they will be whatever you supply. Additionally, when fitting a model, you always need to supply the interactions, and sometimes also weights (but you can never use just weights).

@tianpunchh

@maciejkula
I guess I did not make my point clearly, or maybe I am indeed missing something important. But thank you so much for your quick reply.
Dataset.build_interactions() contains code like this:

interactions.append(user_idx, item_idx, 1)
weights.append(user_idx, item_idx, weight) 

So my understanding is that the interactions matrix is basically a 0/1 binary version of the weights matrix (of course, only if you provide a weight; otherwise the code assigns a default of 1). The weight, I think, could be an explicit rating, such as a 0-to-5 star movie rating. For example, if a user has rated movie2 (out of movie1, movie2, movie3) as 5 stars and left the other two movies unscored, the interactions could be denoted as [0, 1, 0] and the weights as [0, 5, 0].

My point is that when I look at the LightFM.fit() function, the fitting target is either the interaction matrix or the weight matrix: the former is a kind of binary interaction indicator while the latter provides explicit ratings. Reading your comment "Additionally, when fitting a model, you always need to supply the interactions, and also sometimes weights (but you can never just use weights)", I am a little confused: do you mean we can use both interactions and weights at the same time?

@maciejkula
Collaborator

I think reasoning in terms of explicit ratings is a red herring here, and may be confusing rather than aiding understanding.

The weights are not ratings: they are indicators about how important a given interaction is. Interactions with higher weights will have a larger effect on the model. This is exactly equivalent to the sample_weight argument in most sklearn models.

Yes, if you are providing weights you must always provide interactions too. You can never just give weights.

@tianpunchh

@maciejkula
I am halfway to being clear.

I guess I had previously missed the sample_weight argument; basically, sample_weight should receive the weights matrix that Dataset has built?
LightFM.fit(interactions, user_features=None, item_features=None, sample_weight=None)

BTW, suppose the data has a rating (say 0 to 5, or no rating), a number of visits, or some other score that provides more than a binary yes/no interaction, and I want to take advantage of that information. Should I put it into the weight matrix (directly, or maybe after taking a square root or something)? What is unclear to me is: for an implicit recommender, should we forget about the explicit score?

@tianpunchh

@maciejkula
I think I finally get the point. I guess if I want to retain the user rating, for example, I should make that happen in the interactions rather than the weights. I think it would be wise to keep all interactions equally important for most studies; in that case I can just let fit() take None for sample_weight.

@maciejkula
Collaborator

Yes, sample_weight gets the weight matrix. Let's say your user can both click on a product and purchase it: you may want to give a higher weight to the purchase interaction than to the click interaction. You could use explicit scores similarly: a score of 5 should get a higher weight than a score of 3. (A score of 1, on the other hand, might indicate that the user did not like the item, and you should not include the interaction at all in your training.) Keeping everything equally weighted is, as you suggest, a reasonable default.
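
As a hedged illustration of this (the event names and weight values below are made up for the example, not part of the issue):

import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Weight purchases more heavily than clicks.
events = [
    ("user_1", "item_a", 1.0),  # click
    ("user_1", "item_b", 5.0),  # purchase
    ("user_2", "item_a", 5.0),  # purchase
]

dataset = Dataset()
dataset.fit(
    users=(u for u, _, _ in events),
    items=(i for _, i, _ in events),
)
interactions, weights = dataset.build_interactions(events)

# Interactions are always required; the weights enter via sample_weight.
model = LightFM(loss="warp")
model.fit(interactions, sample_weight=weights, epochs=10)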

@tianpunchh

tianpunchh commented Jul 13, 2018

@maciejkula
One last question. So you mean any explicit scoring (for example an explicit 0-to-5 score) should go into the weight matrix? What if, instead, I leave the weight matrix at its default, and make the interaction matrix hold the explicit 0-to-5 scores instead of binary 0/1 values? Is there any disadvantage to swapping the two matrices in this particular example, given that the element-wise product of the two matrices is obviously unchanged? If the weight matrix only serves as a sample weight, I do not see any difference.

So essentially: does the weight have some special role in the algorithm, such that the interaction matrix should stay a binary indicator while the weight gets further processing? Sorry, I am not very familiar with implicit recommender systems.

@maciejkula
Collaborator

  1. Interactions for the WARP and BPR losses are binary only. The values have no effect.
  2. You cannot swap them.

In the implementation, the presence or absence of an entry in the interaction matrix determines whether a gradient descent step is taken to update the model to encode a preference. The weight determines the magnitude of that step.
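
As a toy illustration of that distinction (simplified pedagogical code, not LightFM's actual implementation):

import numpy as np

def sgd_step(user_vec, item_vec, weight, learning_rate=0.05):
    # A step happens at all only because this (user, item) entry is present
    # in the interaction matrix; the weight scales the step's magnitude.
    prediction = 1.0 / (1.0 + np.exp(-(user_vec @ item_vec)))  # logistic score
    gradient = (1.0 - prediction) * weight                     # weighted error
    new_user_vec = user_vec + learning_rate * gradient * item_vec
    new_item_vec = item_vec + learning_rate * gradient * user_vec
    return new_user_vec, new_item_vec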

@maciejkula
Collaborator

Resolved. Thanks!

@igorkf

igorkf commented Nov 18, 2020

After splitting into train and test, how can I know which users are in train or test?
I split into train/test, and now I want to evaluate the model with my own metric (NDCG), but only on the users that are in the test matrix.
How can I pick the users from the test matrix?
After that I would like to map these users back to my original data.

@ekaterinakuzmina

ekaterinakuzmina commented Dec 29, 2021

Answering the above question: scipy.sparse.find can be helpful in finding which users' interactions with which items fell into the train and test sets.
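
A minimal sketch, assuming train and test come from random_train_test_split and dataset is the Dataset built earlier:

import scipy.sparse

# Internal (user, item) indices present in each split.
train_uids, train_iids, _ = scipy.sparse.find(train)
test_uids, test_iids, _ = scipy.sparse.find(test)

# Map internal indices back to the original user ids.
user_id_map, _, item_id_map, _ = dataset.mapping()
inverse_user_map = {v: k for k, v in user_id_map.items()}
test_users = {inverse_user_map[uid] for uid in test_uids}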
