
Unintuitive behaviour with cross_validate.random_train_test_split #328

Closed

RothNRK opened this issue Jul 9, 2018 · 15 comments

Comments

@RothNRK
Contributor

RothNRK commented Jul 9, 2018

The random_train_test_split function makes it easy to split the interactions matrix into train and test datasets, but if you have data with weights you have to apply random_train_test_split twice with the same random_state parameter. My concern is that it would seem intuitive to do something like:

import numpy as np

from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split

users = np.random.choice([0., 1., 2.], (10, 1))
items = np.random.choice([0., 1., 2.], (10, 1))
weight = np.random.rand(10, 1)
data = np.concatenate((users, items, weight), axis=1)

dataset = Dataset()
dataset.fit(users=np.unique(data[:, 0]), items=np.unique(data[:, 1]))
interactions, weight = dataset.build_interactions((i[0], i[1], i[2]) for i in data)

test_percentage = 0.2
random_state = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> False
>>> False
>>> False
>>> False

This results in an incorrect split because the internal state of random_state changes after the first call to random_state.shuffle. For the above example to work as intended, you need to create two separate but identically seeded RandomStates:

random_state_interaction = np.random.RandomState(seed=1)
random_state_weight = np.random.RandomState(seed=1)

train, test = random_train_test_split(
    interactions=interactions,
    test_percentage=test_percentage,
    random_state=random_state_interaction
)
train_weight, test_weight = random_train_test_split(
    interactions=weight,
    test_percentage=test_percentage,
    random_state=random_state_weight
)

np.array_equal(train.row, train_weight.row)
np.array_equal(train.col, train_weight.col)
np.array_equal(test.row, test_weight.row)
np.array_equal(test.col, test_weight.col)

>>> True
>>> True
>>> True
>>> True

It works but I think it's a little awkward. Two possible solutions/suggestions:

  1. Only require a seed parameter and create a RandomState inside cross_validation._shuffle. This has the added benefit of matching larger libraries that only take a seed rather than a RandomState generator. I also don't see any additional flexibility gained by passing in a generator instead of a simple integer.
  2. Make a copy of random_state before applying shuffle in cross_validation._shuffle, as in the sketch below.
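
For concreteness, a minimal sketch of suggestion 2, assuming _shuffle receives the split COO arrays and a RandomState (illustrative, not an actual patch):

import copy

import numpy as np

def _shuffle(uids, iids, data, random_state):
    # Shuffle using a deep copy of the generator so the caller's RandomState
    # is left untouched; two calls seeded identically then yield identical splits.
    random_state = copy.deepcopy(random_state)
    shuffle_indices = np.arange(len(uids))
    random_state.shuffle(shuffle_indices)
    return (uids[shuffle_indices], iids[shuffle_indices], data[shuffle_indices])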

Thoughts?

@maciejkula
Collaborator

I think it's a good idea to transparently instantiate a RandomState. I'd be happy to accept a PR for this (I think there are a couple of places where this would be an improvement).
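
A sketch of what transparently instantiating a RandomState could look like, modeled on sklearn's check_random_state helper (the helper name here is hypothetical, not LightFM's actual code):

import numpy as np

def _check_random_state(random_state):
    # Accept None, an integer seed, or an existing RandomState, and always
    # return a RandomState instance.
    if random_state is None:
        return np.random.RandomState()
    if isinstance(random_state, (int, np.integer)):
        return np.random.RandomState(random_state)
    return random_state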

@RothNRK
Contributor Author

RothNRK commented Jul 9, 2018

I can't say I completely agree with the transparency argument, but I'm happy to make a PR for the second suggestion of making a copy of the RandomState.

@maciejkula
Collaborator

maciejkula commented Jul 9, 2018 via email

@tianpunchh

Maybe I am missing the point. I think interactions and weights are twins from build_interactions(): interactions just contains a 1 wherever the weights matrix has a nonzero value (e.g. a 0-to-5 rating). When you train the model, only one of them, either interactions or weights, not both, gets fed in, so why do you need to split both of them at the same time?

@maciejkula
Collaborator

@tianpunchh weights are not always one and zero: if you supply weights during dataset building, they will be whatever you supply. Additionally, when fitting a model, you always need to supply the interactions, and sometimes also weights (but you can never use just weights).

@tianpunchh

@maciejkula
I guess I did not make my point clearly, or maybe I am indeed missing something important. But thank you so much for your quick reply.
Dataset.build_interactions() contains code like this:

interactions.append(user_idx, item_idx, 1)
weights.append(user_idx, item_idx, weight) 

So my understanding is that the interactions matrix is basically a 0/1 binary version of the weights matrix (of course, only if you provide a weight; otherwise the code assigns a default of 1). The weight, I think, could be an explicit rating, such as a 0-to-5 star movie rating. For example, if a user has rated movie2 (out of movie1, movie2, movie3) as 5 stars and left the other two movies unscored, the interactions could be denoted as [0, 1, 0] and the weights as [0, 5, 0].

My point is that when I look at the LightFM.fit() function, the fitting target is either the interaction matrix or the weight matrix: the former is a kind of binary interaction indicator while the latter provides explicit ratings. Reading your comment "Additionally, when fitting a model, you always need to supply the interactions, and also sometimes weights (but you can never just use weights)", I am a little confused: do you mean we can use both interactions and weights at the same time?

@maciejkula
Collaborator

I think reasoning in terms of explicit ratings is a red herring here, and may be confusing rather than aiding understanding.

The weights are not ratings: they are indicators about how important a given interaction is. Interactions with higher weights will have a larger effect on the model. This is exactly equivalent to the sample_weight argument in most sklearn models.

Yes, if you are providing weights you must always provide interactions too. You can never just give weights.

@tianpunchh

@maciejkula
I am halfway to being clear.

I guess I had previously missed the sample_weight argument; basically, sample_weight should receive the weights matrix that Dataset has built?
LightFM.fit(interactions, user_features=None, item_features=None, sample_weight=None)

BTW, suppose the data has a rating (say 0 to 5, or no rating), a number of visits, or some other score that provides more than a binary yes/no interaction, and I want to take advantage of that information. Should I put it into the weight matrix (directly, or maybe after taking a square root or something)? What is unclear to me is: for an implicit recommender, should we forget about the explicit score?

@tianpunchh

@maciejkula
I think I finally get the point. I guess if I want to retain the user rating, for example, I should make that happen in the interactions rather than the weights. I think it would be wise to keep all interactions equally important for most studies; in that case I can just let fit() take None for sample_weight.

@maciejkula
Collaborator

Yes, sample_weight gets the weight matrix. Let's say your user can both click on a product and purchase it: you may want to give a higher weight to the purchase interaction than to the click interaction. You could use explicit scores similarly: a score of 5 should get a higher weight than a score of 3. (A score of 1, on the other hand, might indicate that the user did not like the item, and you should not include the interaction at all in your training.) Keeping everything equally weighted is, as you suggest, a reasonable default.
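
As a hedged illustration of this (the event names and weight values below are made up for the example, not part of the issue):

import numpy as np
from lightfm import LightFM
from lightfm.data import Dataset

# Weight purchases more heavily than clicks.
events = [
    ("user_1", "item_a", 1.0),  # click
    ("user_1", "item_b", 5.0),  # purchase
    ("user_2", "item_a", 5.0),  # purchase
]

dataset = Dataset()
dataset.fit(
    users=(u for u, _, _ in events),
    items=(i for _, i, _ in events),
)
interactions, weights = dataset.build_interactions(events)

# Interactions are always required; the weights enter via sample_weight.
model = LightFM(loss="warp")
model.fit(interactions, sample_weight=weights, epochs=10)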

@tianpunchh

tianpunchh commented Jul 13, 2018

@maciejkula
One last question. So you mean any explicit scoring (for example an explicit 0-to-5 score) should go into the weight matrix? What if, instead, I leave the weight matrix at its default, and make the interaction matrix hold the explicit 0-to-5 scores instead of binary 0/1 values? Is there any disadvantage to swapping the two matrices in this particular example, given that the element-wise product of the two matrices is obviously unchanged? If the weight matrix only serves as a sample weight, I do not see any difference.

So essentially: does the weight have some special role in the algorithm, such that the interaction matrix should stay a binary indicator while the weight gets further processing? Sorry, I am not very familiar with implicit recommender systems.

@maciejkula
Collaborator

  1. Interactions for the WARP and BPR losses are binary only. The values have no effect.
  2. You cannot swap them.

In the implementation, the presence or absence of an entry in the interaction matrix determines whether a gradient descent step is taken to update the model to encode a preference. The weight determines the magnitude of that step.
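
As a toy illustration of that distinction (simplified pedagogical code, not LightFM's actual implementation):

import numpy as np

def sgd_step(user_vec, item_vec, weight, learning_rate=0.05):
    # A step happens at all only because this (user, item) entry is present
    # in the interaction matrix; the weight scales the step's magnitude.
    prediction = 1.0 / (1.0 + np.exp(-(user_vec @ item_vec)))  # logistic score
    gradient = (1.0 - prediction) * weight                     # weighted error
    new_user_vec = user_vec + learning_rate * gradient * item_vec
    new_item_vec = item_vec + learning_rate * gradient * user_vec
    return new_user_vec, new_item_vec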

@maciejkula
Collaborator

Resolved. Thanks!

@igorkf

igorkf commented Nov 18, 2020

After splitting into train and test, how can I know which users are in train or test?
I split into train/test, and now I want to evaluate the model with my own metric (NDCG), but only on the users that are in the test matrix.
How can I pick the users from the test matrix?
After that I would like to map these users back to my original data.

@ekaterinakuzmina

ekaterinakuzmina commented Dec 29, 2021

Answering the above question: scipy.sparse.find can be helpful in finding which users' interactions with which items fell into the train and test sets.
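
A minimal sketch, assuming train and test come from random_train_test_split and dataset is the Dataset built earlier:

import scipy.sparse

# Internal (user, item) indices present in each split.
train_uids, train_iids, _ = scipy.sparse.find(train)
test_uids, test_iids, _ = scipy.sparse.find(test)

# Map internal indices back to the original user ids.
user_id_map, _, item_id_map, _ = dataset.mapping()
inverse_user_map = {v: k for k, v in user_id_map.items()}
test_users = {inverse_user_map[uid] for uid in test_uids}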
