Rating ratio implementation #68

Closed
rburke2233 opened this issue Dec 19, 2016 · 4 comments

Comments

@rburke2233
Contributor

I am writing small-scale data sets for students to use as test cases, and this exercise has brought to light what I think is a bug in how ratings are divided for the rating ratio split.

I have five items rated by five users. My expectation is that an 80% split by users would mean that exactly 1 (randomly chosen) rating for each user is omitted. That is not what happens: 2 users have 2 ratings omitted and the other users have none.

I understand why this happens: there is a Bernoulli process and each rating is included or excluded on that basis, so in a small data set, a user might get 5 low random draws in a row. But I don't think this is what people conventionally mean by an 80% split in evaluation.

I would prefer an implementation that draws exactly k ratings for the training data at random, where k = user profile size * ratio. I've written a simple implementation of this idea using Randoms.randInts(), which I can submit. I don't know if you want to have this as a separate configuration option ("userratiofixed", maybe) so that the current (more efficient, but less precise) behavior is still available.
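
A minimal sketch of the fixed-count idea, for illustration only and not the implementation mentioned above. It shuffles a user's items with java.util.Collections.shuffle instead of Randoms.randInts(), whose signature is not shown in this thread. The class name FixedRatioSplit, the method split, and the choice to round k to the nearest integer are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the library discussed in this issue.
final class FixedRatioSplit {

    /**
     * Splits one user's rated items so that exactly
     * k = round(items.size() * ratio) items go to training
     * and the remaining items go to test.
     * Returns a two-element list: index 0 is train, index 1 is test.
     */
    static List<List<Integer>> split(List<Integer> items, double ratio, Random rng) {
        // Copy first so the caller's list is not reordered.
        List<Integer> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, rng);

        // Assumption: round to the nearest integer; the issue only says
        // k = user profile size * ratio.
        int k = (int) Math.round(shuffled.size() * ratio);

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, k)));               // train
        parts.add(new ArrayList<>(shuffled.subList(k, shuffled.size()))); // test
        return parts;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        List<Integer> items = Arrays.asList(0, 1, 2, 3, 4);
        // Five users, five items each, ratio 0.8: every user gets 4 train, 1 test.
        for (int u = 0; u < 5; u++) {
            List<List<Integer>> parts = split(items, 0.8, rng);
            System.out.println("user " + u + ": train=" + parts.get(0).size()
                    + " test=" + parts.get(1).size());
        }
    }
}
```

With five items and a ratio of 0.8, k is 4 for every user, so each user contributes exactly one test rating regardless of the random draws.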

@guoguibing
Owner

I understand. We have getRatioByUser(ratio) in the RatioDataSplitter for this purpose.

@rburke2233
Contributor Author

I am talking about getRatioByUser(). Because the training / test split is random by item, this method does not guarantee a fixed number of test items for each user and that number might be zero, as in my example with a small test data set. See code fragment here:

			for (int j : items) {
				if (Randoms.uniform() < ratio) {
					testMatrix.set(u, j, 0.0);
				} else {
					trainMatrix.set(u, j, 0.0);
				}
			}

Randoms.uniform() could be less than ratio for all of a user's items, and then there is no test data for that user.
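
To put a rough number on this, assume Randoms.uniform() returns independent draws that are uniform on [0, 1) (an assumption; the thread does not show its definition). A user with n rated items then ends up with no test items with probability

$$P(\text{no test items}) = \text{ratio}^{\,n}, \qquad 0.8^{5} = 0.32768$$

so with five users rating five items each, about 5 × 0.32768 ≈ 1.64 users would be expected to have no test data at all.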

@guoguibing
Owner

I see. Can you submit a pull request to fix this issue? Fixes like this are always valuable.

@wangyufengkevin
Collaborator

This is supported in the 2.0.0-RC version: the method is getFixedRatioByUser and the configuration is data.splitter.ratio=userfixed. Please try it.


3 participants