Rating ratio implementation #68

rburke2233 · 2016-12-19T05:29:29Z

I am writing small-scale data sets for students to use as test cases, and this exercise has brought to light what I think is a bug in how ratings are divided for the rating ratio split.

I have five items rated by five users. My expectation is that an 80% split by users would mean that exactly 1 (randomly chosen) rating for each user is omitted. That is not what happens: 2 users have 2 ratings omitted and the other users have none.

I understand why this happens: there is a Bernoulli process and each rating is included or excluded on that basis, so in a small data set, a user might get 5 low random draws in a row. But I don't think this is what people conventionally mean by an 80% split in evaluation.

I would prefer an implementation that draws exactly k ratings for the training data at random, where k = user profile size * ratio. I've written a simple implementation of this idea using Randoms.randInts(), which I can submit. I don't know if you want to have this as a separate configuration option ("userratiofixed", maybe) so that the current (more efficient, but less precise) behavior is still available.
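The exact-k idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation mentioned in this comment: the class name FixedRatioSplitSketch and the method split are hypothetical, java.util.Collections.shuffle stands in for Randoms.randInts(), and rounding k to the nearest integer is an assumption, since the comment does not say how a fractional k should be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FixedRatioSplitSketch {

    /**
     * Splits one user's rated items into train and test lists so that the
     * training list has exactly k = round(items.size() * ratio) entries.
     * Returns a two-element list: index 0 is train, index 1 is test.
     */
    public static List<List<Integer>> split(List<Integer> items, double ratio, Random rng) {
        List<Integer> shuffled = new ArrayList<>(items);
        Collections.shuffle(shuffled, rng); // uniform random permutation of the profile

        // Assumption: round to the nearest integer when size * ratio is fractional.
        int k = (int) Math.round(shuffled.size() * ratio);

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, k)));               // train
        parts.add(new ArrayList<>(shuffled.subList(k, shuffled.size()))); // test
        return parts;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Integer> items = Arrays.asList(0, 1, 2, 3, 4);
        // Five users who each rated the same five items, split at 80%.
        for (int u = 0; u < 5; u++) {
            List<List<Integer>> parts = split(items, 0.8, rng);
            System.out.println("user " + u + " train=" + parts.get(0) + " test=" + parts.get(1));
        }
    }
}
```

With five items per user and a ratio of 0.8, every user ends up with four training ratings and one test rating, whichever items the shuffle happens to select.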

guoguibing · 2016-12-19T07:17:29Z

I understand. We have getRatioByUser(ratio) in the RatioDataSplitter for this purpose.

rburke2233 · 2016-12-19T13:40:10Z

I am talking about getRatioByUser(). Because the training / test split is random by item, this method does not guarantee a fixed number of test items for each user and that number might be zero, as in my example with a small test data set. See code fragment here:

			for (int j : items) {
				if (Randoms.uniform() < ratio) {
					testMatrix.set(u, j, 0.0);
				} else {
					trainMatrix.set(u, j, 0.0);
				}
			}

Randoms.uniform() could be less than ratio for all of a user's items, and then there is no test data for that user.
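To put a number on this for the five-item example, here is a short worked calculation. It assumes the draws are independent and that ratio is the probability of a rating staying in the training data, as the fragment above suggests. With n rated items and ratio r, the number of test ratings T for one user is binomial:

```latex
T \sim \mathrm{Binomial}(n,\ 1 - r), \qquad n = 5,\ r = 0.8

P(T = 0) = r^{n} = 0.8^{5} \approx 0.328

P(T = 1) = n\,(1 - r)\,r^{\,n-1} = 5 \cdot 0.2 \cdot 0.8^{4} \approx 0.410
```

Under these assumptions, roughly a third of users would be expected to have no test ratings at all, and only about 41% would have exactly one rating held out.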

guoguibing · 2016-12-20T02:16:09Z

I see. Can you submit a pull request to fix this issue? Fixes like this are always valuable.

wangyufengkevin · 2017-01-05T07:56:59Z

This is now supported in the 2.0.0-RC version: the method is getFixedRatioByUser and the configuration is data.splitter.ratio=userfixed. Please try it.

rburke2233 mentioned this issue Dec 20, 2016

User ratio splitter #69

Closed

wangyufengkevin closed this as completed Jan 5, 2017
