A better outlier filter for "Compute minimum recommended retention" #112

Closed
Expertium opened this issue May 17, 2024 · 2 comments

Comments

@Expertium
Contributor

Expertium commented May 17, 2024

Currently, we filter out reviews where the answer time is 0 or at least 20 minutes. However, if the user left "Maximum answer seconds" at 60 (the default), this filter does nothing, because every recorded time is already capped at 60 seconds. So I have an idea:

  1. Select all review times (after filtering out t=0 and t>=20 minutes)
  2. Find their maximum, max(t)
  3. Remove all values that are equal to the maximum

Here's the key idea: we don't have access to the user's "Maximum answer seconds" setting, so we don't know what value they chose. But we can guess it from the maximum of all t. For example, if the maximum is 60 seconds, it's reasonable to assume that "Maximum answer seconds" is 60. Then we can remove all reviews whose time equals that value.

So if a user has times like this:
7, 8, 9, 10, 12, 15, 20, 60, 60, 60.

After the filter is applied, they will become this:
7, 8, 9, 10, 12, 15, 20
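The three steps above can be sketched in Python as follows. This is only an illustration of the proposal, not code from the project: the function name `drop_capped_times` and the `max_valid_s` parameter are hypothetical, and times are assumed to be in seconds.

```python
def drop_capped_times(times_s, max_valid_s=20 * 60):
    """Sketch of the proposed filter (hypothetical helper, times in seconds)."""
    # Step 1: keep only times that pass the existing filter (t > 0 and t < 20 minutes).
    valid = [t for t in times_s if 0 < t < max_valid_s]
    if not valid:
        return []
    # Step 2: treat the largest remaining time as a guess at "Maximum answer seconds".
    cap_guess = max(valid)
    # Step 3: drop every review whose time equals that guessed cap.
    return [t for t in valid if t != cap_guess]


times = [7, 8, 9, 10, 12, 15, 20, 60, 60, 60]
filtered = drop_capped_times(times)
print(filtered)
```

Note that this sketch removes the maximum even when only a single review has that value, which is exactly what the steps describe.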

@user1823 I want to know your opinion as well

@user1823
Contributor

user1823 commented May 17, 2024

Here is what I wrote about this before:

During calculation of the median, the exact values of the lowest and highest values don't matter. So, I don't think that we need to remove the entries equal to the maximum limit.

Rather, removing those entries would cause the median to become unexpectedly small.

Originally posted by @user1823 in #107 (comment)

For example, let's say that a user had answer times like this:
40, 50, 60, 60, 60, 60, 60, 60, 60, 60, 60

In this case, I would believe that the typical answer time was 60 seconds (or even more), but the default setting capped most of the answer times at 60 seconds. If you filter out these values, the median becomes unreasonably small.
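To make the effect concrete, here is a small Python check of the median on the example above, with and without the entries equal to the maximum. It is a sketch for illustration only; the variable names are made up and the times are assumed to be in seconds.

```python
from statistics import median

times = [40, 50, 60, 60, 60, 60, 60, 60, 60, 60, 60]

# Median of the data as recorded (most values sit at the 60-second cap).
median_before = median(times)

# Median after dropping every value equal to the maximum, as the proposal suggests.
cap_guess = max(times)
remaining = [t for t in times if t != cap_guess]
median_after = median(remaining)

print(median_before, median_after)
```

Only 40 and 50 survive the filter, so the median drops from 60 to 45 even though most reviews took at least 60 seconds.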

By the way, if you think that the calculation of the average times still needs improvement, I suggest asking Durasba1 for help. Based on their responses in https://forums.ankiweb.net/t/clarify-what-optimal-retention-means/42803/, they seem to be knowledgeable in this field.

@L-M-Sherlock
Member

I agree with user1823. The median is not sensitive to outliers, so I think the current outlier filter is good enough.
