Unintuitive behaviour with cross_validate.random_train_test_split #328
Comments
I think it's a good idea to transparently instantiate a `RandomState`.
I can't say I completely agree with the transparency argument, but I'm happy to make a PR for the second suggestion of making a copy of the `RandomState`.
To be clear, I meant your suggestion (1).
Maybe I do not get the point. I think "interactions" and "weights" are twins from `build_interactions()`: interactions just contains a 1 wherever the weights matrix has a nonzero value (e.g. a 0 to 5 rating). When you train the model, only one of them, either interactions or weights, not both, gets fed in, so why do you need to split both of them at the same time?
@tianpunchh weights are not always one and zero: if you supply weights during dataset building, they will be whatever you supply. Additionally, when fitting a model, you always need to supply the interactions, and also sometimes weights (but you can never use just weights).
@maciejkula
So my understanding is that the "interactions matrix" is basically a 0/1 binary version of the "weights matrix" (of course, only if you provide the weights; otherwise the code assigns a default of 1). The weight, I think, could be an explicit rating, like 0 to 5 stars in the movie case. So for example, if a user has rated movie2 (from the set movie1, movie2, movie3) as 5 stars and left the other two movies unscored, the interactions could be denoted as [0, 1, 0] and the weights as [0, 5, 0].

My point is that when I look at the LightFM.fit() function, the fitting target is either the "interaction matrix" or the "weight matrix": the former is a kind of binary interaction indicator while the latter provides explicit ratings. Reading your comment "Additionally, when fitting a model, you always need to supply the interactions, and also sometimes weights (but you can never just use weights)", I am a little confused: do you mean we can use both interactions and weights at the same time?
I think thinking in terms of explicit ratings is a red herring here, and may be confusing rather than aiding understanding. The weights are not ratings: they are indicators of how important a given interaction is. Interactions with higher weights will have a larger effect on the model. This is exactly equivalent to the `sample_weight` argument you can pass when fitting.

Yes, if you are providing weights you must always provide interactions too. You can never give just weights.
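A minimal sketch of how the two matrices come out of `Dataset.build_interactions` (the user and item ids here are made up purely for illustration):

```python
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=["u1"], items=["movie1", "movie2", "movie3"])

# (user, item, weight) triples: only movie2 has been interacted with.
(interactions, weights) = dataset.build_interactions([("u1", "movie2", 5.0)])

# interactions has a 1 wherever a (user, item) pair appears in the data;
# weights carries the supplied value (here 5.0) at the same position.
print(interactions.toarray())  # [[0. 1. 0.]]
print(weights.toarray())       # [[0. 5. 0.]]
```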
@maciejkula I guess I missed the "sample_weight" argument previously; basically, sample_weight should get the weights matrix that Dataset has built? BTW, suppose for some data the rating (say 0 to 5, or no rating), or number_of_visits, or whatever score provides more than a binary yes/no interaction, and I want to take advantage of this information. Should I put it into the weight matrix (directly, or maybe after taking a square root or something)? What is unclear to me is: for an implicit recommendation, should we forget about the explicit score?
@maciejkula
Yes, `sample_weight` should be the weights matrix that `Dataset` has built.
@maciejkula So essentially for this question, does …
In the implementation, the presence or absence of an entry in the interaction matrix determines whether a gradient descent step is taken to update the model to encode a preference. The weight determines the magnitude of that step.
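A minimal sketch of that flow, passing the weights matrix from `Dataset.build_interactions` as `sample_weight` when fitting (the ids and epoch count are made up for illustration):

```python
from lightfm import LightFM
from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=["u1", "u2"], items=["movie1", "movie2", "movie3"])

# (user, item, weight) triples; the third value ends up in the weights matrix.
(interactions, weights) = dataset.build_interactions(
    [("u1", "movie2", 5.0), ("u2", "movie1", 1.0)]
)

# interactions decides which (user, item) pairs receive gradient updates;
# sample_weight scales the size of each of those updates.
model = LightFM()
model.fit(interactions, sample_weight=weights, epochs=10)
```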
Resolved. Thanks!
After splitting into train and test, how can I know which users are in train or test?
Answering the above question: scipy.sparse.find can be helpful for finding which users' interactions with which items fell into the train and test sets.
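A minimal sketch of that approach, assuming `train` and `test` are the sparse matrices returned by `random_train_test_split`:

```python
import scipy.sparse as sp

# find() returns the row (user) indices, column (item) indices and values
# of the non-zero entries in each split.
train_users, train_items, _ = sp.find(train)
test_users, test_items, _ = sp.find(test)

print(set(train_users))  # internal user indices present in the train split
print(set(test_users))   # internal user indices present in the test split
```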
The `random_train_test_split` makes it easy to split the `interactions` matrix into train and test datasets, but if you have data with weights you will have to apply `random_train_test_split` twice with the same `random_state` parameter. My concern is that it would be intuitive to do something like:
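A minimal sketch of the pattern being described (not the original snippet), assuming `interactions` and `weights` come from `Dataset.build_interactions`:

```python
import numpy as np
from lightfm.cross_validation import random_train_test_split

random_state = np.random.RandomState(42)

# The same RandomState instance is reused, so its internal state advances
# after the first call and the two splits no longer line up.
train, test = random_train_test_split(interactions, random_state=random_state)
train_w, test_w = random_train_test_split(weights, random_state=random_state)
```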
This will result in an incorrect split because the state of the `random_state` changes after the first call to `random_state.shuffle`. For the above example to work as intended you need to make separate but identical `RandomState`s:
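A sketch of that workaround under the same assumptions, with two separately constructed but identically seeded `RandomState` objects:

```python
import numpy as np
from lightfm.cross_validation import random_train_test_split

# Each call gets its own RandomState seeded the same way, so both calls
# shuffle the underlying entries in the same order.
train, test = random_train_test_split(
    interactions, random_state=np.random.RandomState(42)
)
train_w, test_w = random_train_test_split(
    weights, random_state=np.random.RandomState(42)
)
```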
It works but I think it's a little awkward. Two possible solutions/suggestions:

1. Only require a `seed` and transparently instantiate the `RandomState` inside the `cross_validate._shuffle` method. This has the added benefit of fitting in with the larger libraries that only require a `seed` and not a `RandomState` generator. I also don't see any additional flexibility gained by passing in a generator instead of a simple integer.
2. Make a copy of the `random_state` before applying shuffle in `cross_validate._shuffle`.

Thoughts?
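A hedged sketch of what suggestion (2) could look like; the `_shuffle` helper's real signature is assumed here (coordinate arrays plus a `RandomState`), so treat this as illustrative only:

```python
import copy

import numpy as np


def _shuffle(uids, iids, data, random_state):
    # Suggestion (2): work on a copy so the caller's RandomState is left
    # untouched and can be reused for a second, identical split.
    random_state = copy.deepcopy(random_state)

    shuffle_indices = np.arange(len(uids))
    random_state.shuffle(shuffle_indices)

    # uids, iids and data are assumed to be aligned numpy arrays.
    return uids[shuffle_indices], iids[shuffle_indices], data[shuffle_indices]
```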