Rating ratio implementation #68
Comments
I understand. We have
I am talking about getRatioByUser(). Because the training / test split is random by item, this method does not guarantee a fixed number of test items for each user and that number might be zero, as in my example with a small test data set. See code fragment here:
Randoms.uniform() could be less than ratio for all of a user's items, and then there is no test data for that user.
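To put a number on that risk: a rough calculation, assuming each draw is independent and uniform on [0, 1), and that a rating goes to training whenever the draw is below the ratio r. For a user with n ratings, the chance that every rating lands in training is:

```latex
P(\text{no test ratings for a user}) = r^{n},
\qquad
r = 0.8,\ n = 5 \;\Rightarrow\; 0.8^{5} = 0.32768
```

Under those assumptions, roughly one user in three with a five-item profile would end up with no test data at all.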
I see. Can you submit a pull request to fix this issue? Fixes are always valuable.
This is supported as of the 2.0.0-RC version: the method is getFixedRatioByUser and the configuration is data.splitter.ratio=userfixed. Please try it.
I am writing small-scale data sets for students to use as test cases, and this exercise has brought to light what I think is a bug in how ratings are divided for the rating ratio split.
I have five items rated by five users. My expectation is that an 80% split by users would mean that exactly 1 (randomly chosen) rating for each user is omitted. That is not what happens: 2 users have 2 ratings omitted and the other users none.
I understand why this happens: there is a Bernoulli process and each rating is included or excluded on that basis, so in a small data set, a user might get 5 low random draws in a row. But I don't think this is what people conventionally mean by an 80% split in evaluation.
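For anyone who wants to see the effect in isolation, here is a minimal sketch of a per-rating Bernoulli split. It is not the library's code: `BernoulliSplitSketch` and `countTestRatings` are hypothetical names, and `java.util.Random` stands in for `Randoms.uniform()`.

```java
import java.util.Random;

class BernoulliSplitSketch {

    /** Counts how many of one user's ratings fall into the test set. */
    static int countTestRatings(int profileSize, double ratio, Random rng) {
        int test = 0;
        for (int i = 0; i < profileSize; i++) {
            // Each rating is decided independently of the others:
            // a draw below ratio means training, otherwise test.
            if (rng.nextDouble() >= ratio) {
                test++;
            }
        }
        return test;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Five users with five ratings each, as in the small data set above.
        for (int user = 0; user < 5; user++) {
            System.out.println("user " + user + ": "
                    + countTestRatings(5, 0.8, rng) + " test ratings");
        }
    }
}
```

Because nothing ties the draws for one user together, the count per user can be anything from 0 to 5 on a given run.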
I would prefer an implementation that draws exactly k ratings for the training data at random, where k = user profile size * ratio. I've written a simple implementation of this idea using Randoms.randInts(), which I can submit. I don't know if you want to have this as a separate configuration option ("userratiofixed", maybe) so that the current (more efficient, but less precise) behavior is still available.
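A rough sketch of the fixed-count idea, for illustration only. It is not the implementation mentioned above and it does not use `Randoms.randInts()`: it shuffles positions with `java.util.Collections.shuffle` instead. The names `FixedRatioSplitSketch`, `Split`, and `splitUser` are hypothetical, and rounding k to the nearest integer is an assumption, since the rounding rule is not specified here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class FixedRatioSplitSketch {

    /** Positions of one user's ratings assigned to each side of the split. */
    static final class Split {
        final List<Integer> train;
        final List<Integer> test;

        Split(List<Integer> train, List<Integer> test) {
            this.train = train;
            this.test = test;
        }
    }

    /**
     * Chooses exactly k = round(profileSize * ratio) positions for training,
     * uniformly at random; the remaining positions go to the test set.
     */
    static Split splitUser(int profileSize, double ratio, Random rng) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < profileSize; i++) {
            positions.add(i);
        }
        Collections.shuffle(positions, rng);

        // Assumption: round to the nearest integer.
        int k = (int) Math.round(profileSize * ratio);
        List<Integer> train = new ArrayList<>(positions.subList(0, k));
        List<Integer> test = new ArrayList<>(positions.subList(k, profileSize));
        return new Split(train, test);
    }

    public static void main(String[] args) {
        Split s = splitUser(5, 0.8, new Random());
        System.out.println("train positions: " + s.train);
        System.out.println("test positions:  " + s.test);
    }
}
```

With a profile of five ratings and a ratio of 0.8, every user gets four training ratings and one test rating; only which rating is held out varies between runs.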