
[Feature Request] More accurate default parameters using Anki user's data and help from Dae #493

Closed · Expertium opened this issue on Oct 14, 2023 · 16 comments
Labels: enhancement (New feature or request)

@Expertium (Collaborator)

Which module is related to your feature request?
Scheduler, Optimizer

Is your feature request related to a problem? Please describe.
I can't find the exact comment by @dae, but I'm sure I saw a comment saying that, due to the way Anki's licensing works, it's possible to use review data from far more users than just those who submitted their collections for research via the Google Form. So it would be possible to run the optimizer on hundreds or even thousands of collections, which could help find the best default parameters. Of course, it's hard to say whether it's practically worth it, given the diminishing returns as the number of collections increases.

Expertium added the enhancement label on Oct 14, 2023
@user1823 (Collaborator)

Even if the data isn't very useful for obtaining better default parameters, it can be useful for spaced repetition research.

By the way, here is the link to Dae's comment: open-spaced-repetition/fsrs-rs#95 (comment)

@Expertium (Collaborator, Author)

@L-M-Sherlock once you and Dae aren't so busy, I suggest working on this. Aside from finding more accurate default parameters, this can also help to benchmark FSRS and other algorithms more accurately. Currently, the benchmark repo has around 70 collections. If that number increased to 1000, that would be amazing.

@user1823 (Collaborator) commented on Nov 6, 2023

I wanted to mention the sample size we need to achieve statistically significant results:

Assuming 10 million Anki users, with a 95% confidence level and 5% margin of error, we need 385 collections.

With 3% margin of error, we need 1067 collections.

At the current sample size (70) and a 95% confidence level, the margin of error is 11.71%.

If you want to play with the values, you can use this online calculator: http://www.raosoft.com/samplesize.html
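For anyone who wants to check the arithmetic, here is a small sketch of the standard calculation behind those numbers (Cochran's formula with a finite-population correction, using p = 0.5 as the most conservative assumption; the calculator above remains the source of truth):

```python
import math

def sample_size(population: int, margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Required sample size: Cochran's formula plus finite-population correction."""
    n0 = z**2 * p * (1 - p) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def margin_of_error(population: int, n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a sample of size n, with finite-population correction."""
    return z * math.sqrt(p * (1 - p) / n) * math.sqrt((population - n) / (population - 1))

print(sample_size(10_000_000, 0.05))             # 385
print(sample_size(10_000_000, 0.03))             # 1067
print(f"{margin_of_error(10_000_000, 70):.2%}")  # 11.71%
```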

@dae commented on Nov 22, 2023

I have prepared a sample set of 20k collections. You can extract it with `tar xaf ...`. It is a random sample of collections with 5000+ revlog entries, so it should contain a mix of older (still active) users and newer users. Entries are pre-sorted in (cid, id) order. Please download a copy, as I'd like to remove it from the current location in a few weeks. You are welcome to re-host it elsewhere if you wish, but please preserve the LICENSE file if you do so.

https://apps.ankiweb.net/downloads/revlogs.tar.zst

@Expertium (Collaborator, Author)

That's great, thank you!
@L-M-Sherlock

@L-M-Sherlock (Member)

Great! I will update the benchmark tomorrow.

@L-M-Sherlock (Member)

I downloaded and unzipped it; it is 56.6 GB. The main problem is that I don't know the timezone and next_day_start_at of these revlogs. Without that info, I can't convert the revlogs into a dataset that FSRS can process.

@dae commented on Nov 23, 2023

Dang. I was too focused on ensuring privacy, and forgot about that part. I will need to rebuild the archive.

@dae commented on Nov 23, 2023

Ok, I've replaced the archive with a new version. example.py has been updated, and you can now access next_day_at, which can be used to derive the cutoff hour (see RevlogEntry::days_elapsed).

@L-M-Sherlock (Member)

What about the timezone?

@dae commented on Nov 23, 2023

next_day_at can be used to determine the day a review log falls on without ever considering timezone or rollover hour. If the Python optimizer requires a timezone + rollover hour, I presume you could feed it UTC, and then determine the rollover hour in UTC based on next_day_at.
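A minimal sketch of that idea (assuming, as in Anki's schema, that revlog ids are epoch milliseconds and next_day_at is epoch seconds; the real logic lives in RevlogEntry::days_elapsed):

```python
SECS_PER_DAY = 86_400

def days_elapsed(review_id_ms: int, next_day_at: int) -> int:
    """Whole days between a review and the collection's next day rollover.

    Reviews on the same Anki day share a value, so the interval in days
    between two reviews is just the difference of their days_elapsed values,
    with no timezone or rollover hour involved.
    """
    return (next_day_at - review_id_ms // 1000) // SECS_PER_DAY

def utc_rollover_hour(next_day_at: int) -> int:
    """If a tool insists on a timezone + rollover hour, feed it UTC and use
    the hour at which next_day_at falls within a UTC day."""
    return (next_day_at % SECS_PER_DAY) // 3600
```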

@L-M-Sherlock (Member)

I'm working on the preprocessing of the dataset.

The file size of the data has been reduced from 57.5 GB to 13.7 GB. The next step is refactoring the benchmark program.
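For context, a hedged sketch of the kind of preprocessing described here (the column names, the manual-entry code, and the output format are assumptions for illustration, not the benchmark repo's actual code):

```python
import pandas as pd

REVLOG_MANUAL = 4  # assumed: manual scheduling entries carry no memory information

def preprocess(revlogs: pd.DataFrame, next_day_at: int) -> pd.DataFrame:
    """Filter uninformative entries and bucket reviews into Anki days."""
    df = revlogs[revlogs["review_kind"] != REVLOG_MANUAL].copy()
    # Anki-day bucketing via next_day_at (see the days_elapsed sketch above).
    df["day"] = (next_day_at - df["id"] // 1000) // 86_400
    # Keeping only the columns the optimizer needs, sorted in (cid, id) order,
    # and writing a compressed columnar format shrinks the data substantially.
    return df.sort_values(["card_id", "id"])[["card_id", "day", "button_chosen"]]

# e.g. preprocess(pd.read_csv("revlog.csv"), next_day_at).to_parquet("out.parquet")
```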

@Expertium (Collaborator, Author)

Maybe rewrite all algorithms (and the benchmarking code) in Rust? Of course, the Rust version of FSRS will be slightly different, and the Rust version of the LSTM can be different too, but I think with a dataset this big, speed is more important.

@user1823 (Collaborator)

> The file size of the data has been reduced from 57.5 GB to 13.7 GB.

@L-M-Sherlock, to reduce the size of the data, I assume you filtered out many revlog entries, such as manual entries, entries before a Forget, outliers, etc.

There is no doubt that this was important for the benchmark experiment. However, I think we should preserve an unfiltered copy of the dataset for future research.

@L-M-Sherlock (Member)

> However, I think we should preserve an unfiltered copy of the dataset for future research.

But my Google Drive doesn't have enough storage space to preserve a copy.

@Expertium (Collaborator, Author)

@L-M-Sherlock since Dae is now working on Anki 23.12 beta and since you have finished benchmarking FSRS v4, please give Dae the new default parameters that are based on 700+ million reviews.
