
[Feature Request] More accurate default parameters using Anki user's data and help from Dae #493

Closed · Expertium opened this issue on Oct 14, 2023 · 16 comments
Labels: enhancement (New feature or request)

@Expertium (Collaborator)

Which module is related to your feature request?
Scheduler, Optimizer

Is your feature request related to a problem? Please describe.
I can't find the exact comment by @dae, but I'm sure I saw a comment saying that, due to the way Anki's licensing works, it's possible to use review data from far more users than just those who submitted their collections for research via the Google Form. So it would be possible to run the optimizer on hundreds or even thousands of collections, which could help find the best default parameters. Of course, it's hard to say whether it's practically worth it, given the diminishing returns as the number of collections increases.

Expertium added the enhancement label on Oct 14, 2023
@user1823 (Collaborator)

Even if the data isn't very useful for obtaining better default parameters, it can be useful for spaced repetition research.

By the way, here is the link to Dae's comment: open-spaced-repetition/fsrs-rs#95 (comment)

@Expertium (Collaborator, Author)

@L-M-Sherlock once you and Dae aren't so busy, I suggest working on this. Aside from finding more accurate default parameters, this can also help to benchmark FSRS and other algorithms more accurately. Currently, the benchmark repo has around 70 collections. If that number increased to 1000, that would be amazing.

@user1823 (Collaborator) commented on Nov 6, 2023

I wanted to mention the sample size we need to achieve statistically significant results:

Assuming 10 million Anki users, with a 95% confidence level and 5% margin of error, we need 385 collections.

With 3% margin of error, we need 1067 collections.

At the current sample size (70) and a 95% confidence level, the margin of error is 11.71%.

If you want to play with the values, you can use this online calculator: http://www.raosoft.com/samplesize.html
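For anyone who wants to check the arithmetic, here is a small sketch of the standard calculation behind those numbers (Cochran's formula with a finite-population correction, using p = 0.5 as the most conservative assumption; the calculator above remains the source of truth):

```python
import math

def sample_size(population: int, margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Required sample size: Cochran's formula plus finite-population correction."""
    n0 = z**2 * p * (1 - p) / margin**2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

def margin_of_error(population: int, n: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a sample of size n, with finite-population correction."""
    return z * math.sqrt(p * (1 - p) / n) * math.sqrt((population - n) / (population - 1))

print(sample_size(10_000_000, 0.05))             # 385
print(sample_size(10_000_000, 0.03))             # 1067
print(f"{margin_of_error(10_000_000, 70):.2%}")  # 11.71%
```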

@dae commented on Nov 22, 2023

I have prepared a sample set of 20k collections. You can extract it with `tar xaf ...`. It is a random sample of collections with 5000+ revlog entries, so it should contain a mix of older (still active) users and newer users. Entries are pre-sorted in (cid, id) order. Please download a copy, as I'd like to remove it from the current location in a few weeks. You are welcome to re-host it elsewhere if you wish, but please preserve the LICENSE file if you do so.

https://apps.ankiweb.net/downloads/revlogs.tar.zst

@Expertium (Collaborator, Author)

That's great, thank you!
@L-M-Sherlock

@L-M-Sherlock (Member)

Great! I will update the benchmark tomorrow.

@L-M-Sherlock (Member)

I downloaded and unzipped it; it is 56.6 GB. The main problem is that I don't know the timezone and next_day_start_at of these revlogs. Without that info, I can't convert the revlogs into a dataset that FSRS can process.

@dae commented on Nov 23, 2023

Dang. I was too focused on ensuring privacy, and forgot about that part. I will need to rebuild the archive.

@dae commented on Nov 23, 2023

Ok, I've replaced the archive with a new version. example.py has been updated, and you can now access next_day_at, which can be used to derive the cutoff hour (see RevlogEntry::days_elapsed).

@L-M-Sherlock (Member)

What about the timezone?

@dae commented on Nov 23, 2023

next_day_at can be used to determine the day a review log falls on without ever considering timezone or rollover hour. If the Python optimizer requires a timezone + rollover hour, I presume you could feed it UTC, and then determine the rollover hour in UTC based on next_day_at.
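A minimal sketch of that idea (assuming, as in Anki's schema, that revlog ids are epoch milliseconds and next_day_at is epoch seconds; the real logic lives in RevlogEntry::days_elapsed):

```python
SECS_PER_DAY = 86_400

def days_elapsed(review_id_ms: int, next_day_at: int) -> int:
    """Whole days between a review and the collection's next day rollover.

    Reviews on the same Anki day share a value, so the interval in days
    between two reviews is just the difference of their days_elapsed values,
    with no timezone or rollover hour involved.
    """
    return (next_day_at - review_id_ms // 1000) // SECS_PER_DAY

def utc_rollover_hour(next_day_at: int) -> int:
    """If a tool insists on a timezone + rollover hour, feed it UTC and use
    the hour at which next_day_at falls within a UTC day."""
    return (next_day_at % SECS_PER_DAY) // 3600
```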

@L-M-Sherlock (Member)

I'm working on the preprocessing of the dataset.

The file size of the data has been reduced from 57.5 GB to 13.7 GB. The next step is refactoring the benchmark program.
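For context, a hedged sketch of the kind of preprocessing described here (the column names, the manual-entry code, and the output format are assumptions for illustration, not the benchmark repo's actual code):

```python
import pandas as pd

REVLOG_MANUAL = 4  # assumed: manual scheduling entries carry no memory information

def preprocess(revlogs: pd.DataFrame, next_day_at: int) -> pd.DataFrame:
    """Filter uninformative entries and bucket reviews into Anki days."""
    df = revlogs[revlogs["review_kind"] != REVLOG_MANUAL].copy()
    # Anki-day bucketing via next_day_at (see the days_elapsed sketch above).
    df["day"] = (next_day_at - df["id"] // 1000) // 86_400
    # Keeping only the columns the optimizer needs, sorted in (cid, id) order,
    # and writing a compressed columnar format shrinks the data substantially.
    return df.sort_values(["card_id", "id"])[["card_id", "day", "button_chosen"]]

# e.g. preprocess(pd.read_csv("revlog.csv"), next_day_at).to_parquet("out.parquet")
```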

@Expertium (Collaborator, Author)

Maybe rewrite all algorithms (and the benchmarking code) in Rust? Of course, the Rust version of FSRS will be slightly different, and the Rust version of the LSTM can be different too, but I think with a dataset this big, speed is more important.

@user1823 (Collaborator)

> The file size of the data has been reduced from 57.5 GB to 13.7 GB.

@L-M-Sherlock, to reduce the size of the data, I assume you filtered out many revlog entries, such as manual entries, entries before a Forget, outliers, etc.

There is no doubt that this was important for the benchmark experiment. However, I think we should preserve an unfiltered copy of the dataset for future research.

@L-M-Sherlock (Member)

> However, I think we should preserve an unfiltered copy of the dataset for future research.

But my Google Drive doesn't have enough storage space to preserve a copy.

@Expertium (Collaborator, Author)

@L-M-Sherlock since Dae is now working on Anki 23.12 beta and since you have finished benchmarking FSRS v4, please give Dae the new default parameters that are based on 700+ million reviews.
