
Use the median instead of the mean for recall costs and learn cost #107

Closed
Expertium opened this issue Apr 20, 2024 · 8 comments · Fixed by #109

Comments

@Expertium
Contributor

Expertium commented Apr 20, 2024

            recall_costs = recall_card_revlog.groupby(by="review_rating")[
                "review_duration"
            ].mean()

I suggest replacing the mean with the median here, as the median is not sensitive to outliers. Relevant problem: https://forums.ankiweb.net/t/clarify-what-optimal-retention-means/42803/50?u=expertium
This could help mitigate problems where, for example, the user went away to make dinner and, as a result, the review time ended up being orders of magnitude greater than usual, skewing the mean. And don't forget to modify how the learn cost is calculated as well; the median should be used for all time costs.
Additionally, to make the estimate even more robust (granted, the median is already robust), we can remove all times >20 minutes before calculating the median, since obviously nobody spends that much time per card.
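For illustration, a minimal sketch of what that could look like, assuming (as in the snippet above) a `recall_card_revlog` dataframe with `review_rating` and `review_duration` columns and that `review_duration` is stored in milliseconds; the cutoff constant and surrounding names are placeholders, not the actual implementation:

    # Hypothetical sketch: drop implausibly long reviews, then take the
    # per-rating median instead of the mean.
    MAX_DURATION_MS = 20 * 60 * 1000  # 20 minutes, assuming durations are in ms

    plausible = recall_card_revlog[
        recall_card_revlog["review_duration"] <= MAX_DURATION_MS
    ]
    recall_costs = plausible.groupby(by="review_rating")[
        "review_duration"
    ].median()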

@user1823
Contributor

user1823 commented Apr 21, 2024

Additionally, we should remove all times that are exactly equal to 0 before calculating the median.

The reason is the same as the one given in the comment linked above.

I am requesting this again because I think that skipping these entries during the median calculation needs to be handled differently than it was for the mean calculation.
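A sketch of how that exclusion might look, reusing the names from the snippet above; the point is that the zeros have to be filtered out explicitly before the median is taken:

    # Hypothetical sketch: remove zero durations before computing the median;
    # they can't be handled the same way they were in the mean-based calculation.
    nonzero = recall_card_revlog[recall_card_revlog["review_duration"] > 0]
    recall_costs = nonzero.groupby(by="review_rating")["review_duration"].median()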

@L-M-Sherlock
Member

Should we also consider this?

[screenshot of the "Maximum answer seconds" deck option]

@user1823
Contributor

> Should we also consider this?

If I understand this function correctly, the recorded answer times would never be greater than this setting. So, do we need to consider it?

@L-M-Sherlock
Member

L-M-Sherlock commented Apr 21, 2024

> If I understand this function correctly, the recorded answer times would never be greater than this setting. So, do we need to consider it?

I mean, if the values reach this limit, should we include them?

@user1823
Contributor

user1823 commented Apr 21, 2024

When calculating the median, the exact magnitudes of the lowest and highest values don't matter. So, I don't think we need to remove the entries that are equal to the maximum limit.

Rather, removing those entries would cause the median to become unexpectedly small.

You might think that contradicts Expertium's suggestion of excluding answer times > 20 min. But that suggestion deals with a specific situation where the user has set the maximum answer time to an unreasonably high value.
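A quick numeric illustration of this point, with made-up answer times and a hypothetical 60-second cap:

    import statistics

    # Two reviews hit the (made-up) 60-second "Maximum answer seconds" cap.
    capped = [3, 5, 8, 60, 60]
    uncapped = [3, 5, 8, 240, 900]  # what they might have been without a cap

    # The median is identical either way, so keeping capped entries is harmless...
    assert statistics.median(capped) == statistics.median(uncapped) == 8

    # ...whereas dropping the capped entries shifts the median downward.
    print(statistics.median([3, 5, 8]))  # 5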

@Expertium
Contributor Author

Expertium commented Apr 21, 2024

I agree that all times equal to zero should be removed. As for "Maximum answer seconds", I think it's fine to keep capped values, but only if they don't exceed 20 minutes.

@Expertium
Contributor Author

@L-M-Sherlock just a reminder

@L-M-Sherlock
Member

L-M-Sherlock commented Apr 22, 2024

I have checked the code that needs to be updated, and found that it's a little harder than I initially envisioned. The hardest part is the forget_cost.

    if Relearning in state_count and Relearning in state_block:
        forget_cost = round(
            state_duration[Relearning] / state_block[Relearning] / 1000
            + recall_cost,
            1,
        )

I will re-design this part when I have more available time.
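Purely as an illustration of one possible direction (not necessarily what was implemented in #109): if the relearning time of each individual lapse were collected as its own entry, the median of those per-lapse totals could replace the current summed-duration-divided-by-block-count average. The list name below is made up for the sketch.

    import numpy as np

    # Hypothetical sketch: `per_lapse_relearning_ms` is an assumed list with one
    # entry per lapse, holding the total relearning time (in ms) spent on that
    # lapse; `recall_cost` is the median recall time in seconds, as above.
    if len(per_lapse_relearning_ms) > 0:
        forget_cost = round(
            np.median(per_lapse_relearning_ms) / 1000 + recall_cost,
            1,
        )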
