
Calibration between actual retention and predicted retention is not great #215

Closed
user1823 opened this issue Apr 15, 2023 · 76 comments
Labels: enhancement (New feature or request)

Comments

@user1823
Collaborator

Btw, calibration doesn't look great on my collection. I've tried it with separate decks and got similar results.
Calibration graph (Entire collection)

Originally posted by @Expertium in #151 (comment)

@user1823
Collaborator Author

This is also confirmed by the following stats:

[image]

@Expertium
Collaborator

Wait, I don't get it. How is it confirmed by these stats?

@user1823
Collaborator Author

user1823 commented Apr 15, 2023

FSRS predicts that the average retention for the cards should be 96.85%.

But, over the past month, the measured retention was just 94.2%.

@Expertium
Collaborator

Expertium commented Apr 15, 2023

Nah, that's a pretty small discrepancy. What FSRS predicts and what True Retention shows you are somewhat different things (I wouldn't ask Sherlock to implement it if I could just use True Retention instead; they're not identical), so a small discrepancy like this is fine. If it were like 70% vs 95%, that would be worrying. I would say that a difference of less than 5% is fine.

@L-M-Sherlock
Member

FSRS predicts that the average retention for the cards should be 96.85%.

But, over the past month, the measured retention was just 94.2%.

They are not the same thing. The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.
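As a rough sketch (using the exponential forgetting curve R = 0.9^(t/S) discussed later in this thread; this is not the actual stats code), here is why undue cards pull the predicted average above the requested retention:

```python
# Sketch only: predicted retention under R = 0.9 ** (t / S).
# With requested retention 0.9, a card becomes due roughly when elapsed days == stability.
def predicted_retention(elapsed_days: float, stability: float) -> float:
    return 0.9 ** (elapsed_days / stability)

print(predicted_retention(10, 10))  # due today   -> 0.900
print(predicted_retention(3, 10))   # not yet due -> ~0.969, above the requested 90%
```

Averaging the predicted retention over all cards, due and undue, therefore gives a number above the requested retention, which is consistent with the 96.85% figure quoted above.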

@user1823
Collaborator Author

They are not the same thing. The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.

Oh! I failed to consider that.

So, should I close this issue?

@user1823
Collaborator Author

The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.

But I think that you should include this in the explanation of the average retention stats when you release the feature.

@L-M-Sherlock
Member

OK, I will improve the explanation of the FSRS stats.

@user1823
Collaborator Author

In my recent research (unpublished), the error of power functions is lower than that of exponential functions in fitting the forgetting curve. I will test it in some users' collections.

Originally posted by @L-M-Sherlock in #151 (comment)

Interesting. Woz says that the forgetting curve is definitely exponential, but when you have both difficult and easy material in the same collection, the resulting curve (a superposition of different exponential curves) is best approximated with a power function.
https://supermemo.guru/wiki/Exponential_nature_of_forgetting#Power_law_emerges_in_superposition_of_exponential_forgetting_curves

Originally posted by @Expertium in #151 (comment)

@L-M-Sherlock, you may want to test the power function in my collection. I believe that my collection includes significant amounts of both easy and difficult materials.

  • My deck: Default.zip (Change extension to .apkg)
  • w: [1.0696, 1.6707, 4.9898, -1.2855, -1.1631, 0.0, 1.7095, -0.0887, 1.0651, 1.71, -0.4937, 0.8296, 0.4661]
  • requestRetention: 0.94
  • maximumInterval: 36500
  • easyBonus: 2.5
  • hardInterval: 1.2
  • timezone = 'Asia/Calcutta'
  • next_day_starts_at = 4
  • revlog_start_date = "2006-10-05"

@L-M-Sherlock
Member

Thanks for the data. I will do some research on it. Currently I am working on filtering out the outliers in Expertium's data.

@user1823
Collaborator Author

@L-M-Sherlock, please let me know the results after you test the power function in my collection. I think that it might solve this issue also (at least partially).

@L-M-Sherlock
Member

Unfortunately, even after increasing the number of parameters to 20 today, the accuracy didn't increase significantly. I need to do more experiments here.

@Expertium
Collaborator

Expertium commented Apr 25, 2023

I'm curious what parameters you added. Do you mind giving a detailed description of this new model, with all the formulas?

EDIT: or even better, make a beta-version of the new optimizer and the new scheduler.

@L-M-Sherlock
Member

I will share some details about it tomorrow.

@L-M-Sherlock
Member

Added the power forgetting curve and power difficulty:

[image]

I think there is no significant difference.

[image]

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Previously you said that you have increased the number of parameters to 20. Perhaps you could release a beta-version of the optimizer and a beta-version of the scheduler with those parameters so other people can experiment with them (and do side-by-side comparisons) as well?

Also, it's kinda hard to understand formulas in code form (well, for me at least), so I would really appreciate it if you wrote the formulas using LaTeX or something like that and posted them here (I assume you won't be making a dedicated wiki entry).

@L-M-Sherlock
Member

Previously you said that you have increased the number of parameters to 20. Perhaps you could release a beta-version of the optimizer and a beta-version of the scheduler with those parameters so other people can experiment with them (and do side-by-side comparisons) as well?

OK, I will publish the beta-version of the optimizer in another branch. But I don't want to develop the scheduler until the optimizer is significantly better than before and stable.

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Also, a suggestion - add root-mean-square error too, not just R^2.
[image]

  1. Find the difference between "Average actual retention" and "Optimal average actual retention", and square it.
  2. Take the sum of all squared differences, divide it by their count (which is just the number of bins, like 40), and take the square root.

The interpretation is as follows: it tells us how much, on average, the calibration is off from theoretical perfection. For example, RMSE=0.05 means that on average FSRS is 5% off from theoretically perfect calibration. This is easier to interpret than R^2, though I think we should keep both.
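A minimal sketch of that computation (the per-bin values below are made up for illustration; this is not the optimizer's actual code):

```python
import numpy as np

# Hypothetical per-bin values read off the calibration graph.
actual_retention  = np.array([0.88, 0.90, 0.92, 0.94])  # "Average actual retention"
optimal_retention = np.array([0.90, 0.92, 0.94, 0.96])  # "Optimal average actual retention"

rmse = np.sqrt(np.mean((actual_retention - optimal_retention) ** 2))
print(f"RMSE = {rmse:.4f}")  # 0.0200, i.e. calibration is off by about 2% on average
```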

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Also, maybe I'm just dumb, but I have no idea what is going on here.
[image]
Is this correct? If so, why is there a 9 and no new parameters?
[image]
I thought the power would be a new parameter, like this.
[image]

@user1823
Collaborator Author

user1823 commented Apr 26, 2023

I thought the power would be a new parameter, like this.
[image]

In the following image (from SuperMemo), the power function is of the form:
Retrievability = $A~t^B$

So, I think that the power in the equation shared by @L-M-Sherlock above is not what we are targeting here.

I think that A and B in the equation I shared here should be the new parameters.

@L-M-Sherlock
Member

[image]

I have to ensure that stability has the same meaning as before. You can compare f(t) and g(t) in the graph: when t = a, f(a) = g(a) = 0.9.

@user1823
Collaborator Author

user1823 commented Apr 26, 2023

I am not sure that I understand the algorithm correctly. But,

If the original function is
$$R = A~e^{-kt}$$

shouldn't the formula for the retention in terms of stability be
$$R = 0.9~e^{-k(t-S)}$$

How did it become the following?
$$R = 0.9^{t/S}$$

And now, if the new function is
$$R = A~t^B$$

shouldn't the formula for the retention in terms of stability be
$$R = 0.9~(t/S)^B$$

How did it become the following?
$$R = (1 + \frac{t}{9S})^{-1}$$

For the above calculations, I have taken the stability to be the time at which the retention is 90%.

@L-M-Sherlock
Member

There is only one parameter for the forgetting curve function. The exponential forgetting curve function is $R(t)=e^{\ln(0.9)t/S}$, which is equivalent to $R(t)=0.9^{t/S}$.
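For completeness, the equivalence is just a change of base:

$$R(t) = e^{\ln(0.9)\,t/S} = \left(e^{\ln 0.9}\right)^{t/S} = 0.9^{t/S}, \qquad \text{so } R(S) = 0.9.$$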

@user1823
Collaborator Author

So, is this some other formula?

The two formulas I mentioned are from the image posted above. The source of the image is https://supermemo.guru/wiki/Exponential_nature_of_forgetting#Power_law_emerges_in_superposition_of_exponential_forgetting_curves.

@L-M-Sherlock
Member

In Algorithm SM-17, retrievability R corresponds with the probability of recall and represents the exponential forgetting curve. Retrievability is derived from stability and the interval:

R[n] := exp(-k*t/S[n-1])

where:

R[n] - retrievability at the n-th repetition
k - decay constant
t - time (interval)
S[n-1] - stability after the (n-1)th repetition

@Expertium
Collaborator

Expertium commented Apr 26, 2023

So the new function has a somewhat different shape, but otherwise it's not more flexible than the old one? If that's so, then I'm not surprised that the results haven't changed much. I'm sure that if instead of (1 + t / (9 * S)) ** -1 it was (1 + t / (9 * S)) ** -w, where w can be optimized, the results would be more impressive. Although I suppose that would make it more difficult to implement such a function in the scheduler.

Are there any other changes to how S or D are calculated?

Also, I still don't understand where 9 comes from in that formula.
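For illustration, here is a sketch of the two shapes being discussed (the function and variable names are made up; this is not the optimizer's actual code). The 9 is just a rescaling of t so that the power curve passes through R = 0.9 at t = S when the exponent is 1:

```python
import torch

def exp_forgetting_curve(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # R(t) = 0.9 ** (t / S), so R(S) = 0.9 by construction.
    return 0.9 ** (t / s)

def power_forgetting_curve(t: torch.Tensor, s: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # R(t) = (1 + t / (9 S)) ** -w. With w = 1, R(S) = (1 + 1/9) ** -1 = 0.9,
    # which is where the 9 comes from. With w != 1, R(S) is no longer 0.9.
    return (1 + t / (9 * s)) ** -w

s = torch.tensor(10.0)
print(exp_forgetting_curve(s, s))                       # tensor(0.9000)
print(power_forgetting_curve(s, s, torch.tensor(1.0)))  # tensor(0.9000)
```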

@user1823
Collaborator Author

The new formula is similar to the one below.

Wixted and Carpenter (2007) summarize the forgetting curve as:
P(recall) = $m(1 + ht)^{-f}$
Where m is the degree of initial learning (i.e. the probability at time 0), h is a scaling factor on time, and f is the exponential memory decay factor.

Source: https://notes.andymatuschak.org/zHdKY3GwoUW9xG6wQtKFqjz9jcrxdM3mxram

So I think that, as @Expertium says, if the power were -w instead of -1, the results would be more impressive.

@L-M-Sherlock
Member

OK, I will test it tomorrow.

@L-M-Sherlock
Member

For each deck both loss and RMSE were lower when using the new version of the optimizer. On average, log-loss was around 4% lower, and RMSE was around 16% lower.

Pretty good! And the calibration graph shows that the blue line aligns with the orange line better than before.

@L-M-Sherlock
Member

Also, (while it's a bit late to ask) is pre-training in 2.2 really necessary? It seems like a very non-intuitive thing to me.

Because I freeze the parameters of initial stability in the training stage. It makes the parameters more stable and accurate during training.

@L-M-Sherlock
Member

Also, for R^2 I think you should bring back the implementation you used in some of the previous versions; there is no way the values given by sklearn.metrics.r2_score are correct, something weird is going on with it.

The previous implementation is incorrect. It is not R-squared, because R-squared is not r^2. I believe the implementation in sklearn (the most popular machine learning package) is correct.

@Expertium
Collaborator

Expertium commented Apr 27, 2023

because R-squared is not r^2

https://en.wikipedia.org/wiki/Coefficient_of_determination

[image]
[image]

Idk, maybe there are different definitions and they are not equivalent.

@galantra
Contributor

What's interesting is that when I ran the new optimizer on the full collection, RMSE was lower than when I ran it on individual decks, which suggests that maybe having different parameters for different decks is not actually advantageous.

I noticed the same thing some time ago, IIRC. I imagine it is because the data set is bigger, and a bigger data set has a bigger impact relative to the initial Maimemo(?) data than a smaller data set would. The Optimizer becomes more confident, so to speak.

Is this correct?

@L-M-Sherlock
Member

I suppose the next step is to apply the same idea to difficulty and replace (10 * torch.pow(new_d, -1)) with something like (10 * torch.pow(new_d, -f_2)) where f_2 is another parameter.

Nice idea! I will add it tomorrow.

@Expertium
Collaborator

Despite the fact that the specific optimizer you are using is Adam (which has a lot of heuristics to make it more adaptive), the learning rate still affects the results. I changed the learning rate to lr = 5e-3 and got even better results with the new forgetting curve.
lr = 5e-4
Loss: 0.4589
RMSE: 0.0466

lr = 5e-3
Loss: 0.4478
RMSE: 0.0367

[image]

@user1823
Collaborator Author

Also, increasing the n_epoch from 1 to 3 produces better results.

n_epoch = 1: [image]

n_epoch = 3: [image]

@Expertium
Collaborator

I tried changing the learning rate to 5e-2 and got this error:
[image]

@L-M-Sherlock
Member

I'm stuck on a problem. Stability has a different meaning for different users when I introduce f into the power forgetting curve:

[image]

@Expertium
Collaborator

Expertium commented Apr 28, 2023

I guess that's inevitable if we're introducing parameters to the forgetting curve. If we introduce new parameters to the calculation of S while keeping the original forgetting curve (which only depends on S and t, no parameters), this problem can be circumvented. But then we will most likely end up with a higher loss and RMSE due to the forgetting curve not being very flexible, unless we improve the calculation of S so much that it will compensate for that. Basically, we could decrease the loss/RMSE either by changing how S is calculated or by adding parameters to R=f(S). Or both.

I guess it depends on what philosophy you want to adopt for FSRS.

  1. "Only decreasing the loss matters, regardless of how ad hoc and non-rigorous our model is, and interpretability can go to hell."
  2. "Every value and every parameter must have a precise meaning, sacrificing the rigor and interpretability of the model is unacceptable."

EDIT: if you don't want to continue working on the power forgetting curve with a new parameter due to problems with interpretability, you can bring back the exponential forgetting curve with no parameters and instead change how difficulty is calculated (make it a power function and add a new parameter, like I mentioned in one of the messages above), which will hopefully make S more accurate without sacrificing interpretability.

@L-M-Sherlock
Member

If we finally decide to use the power forgetting curve, I prefer a fixed f that is an average of the training results collected from many users.

@Expertium
Collaborator

That's a good idea, though we will need enormous amounts of data, not just from 3-5 users.
Ok, remove the power forgetting curve for now, bring back the exponential curve, and change the difficulty formula to (10 * torch.pow(new_d, -f)), where f can be optimized.
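A sketch of that change in isolation (the tensor values are illustrative and the clamp range assumes FSRS's usual [1, 10] difficulty bounds; this is not the actual optimizer code):

```python
import torch

new_d = torch.tensor([1.0, 3.0, 5.0, 10.0]).clamp(1, 10)  # difficulty after the update step
f = torch.tensor(1.0, requires_grad=True)                 # new trainable exponent, initialized at 1

old_factor = 10 * torch.pow(new_d, -1)   # current formula: inversely proportional to D
new_factor = 10 * torch.pow(new_d, -f)   # proposed: the exponent itself is learned
```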

@user1823
Collaborator Author

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

@L-M-Sherlock
Member

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

The branch: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/power-function-for-difficulty

@user1823
Collaborator Author

user1823 commented Apr 28, 2023

Stability has a different meaning for different users when I introduce f into the power forgetting curve.

By the way, this problem can be solved by replacing the constant 9 in
$$R = \left(1 + \frac{t}{9S} \right)^{f}$$
by
$$\frac{10^{1/f}}{9^{1/f} - 10^{1/f}}$$

However, I am not sure how practical it is.

Edit: A further simplified version can be
$$R = \left(1 + \frac{(0.9^{1/f} - 1)~t}{S} \right)^{f}$$
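A quick numeric check of that property (sketch only; f follows the convention used above, where the fitted values come out negative):

```python
# With R = (1 + (0.9 ** (1 / f) - 1) * t / S) ** f, setting t = S gives
# (0.9 ** (1 / f)) ** f = 0.9 for every f, so the meaning of S is preserved.
def retention(t: float, s: float, f: float) -> float:
    return (1 + (0.9 ** (1 / f) - 1) * t / s) ** f

for f in (-0.17, -0.55, -1.0):
    print(f, round(retention(10, 10, f), 4))  # prints 0.9 for every f
```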

@Expertium
Collaborator

Expertium commented Apr 28, 2023

Edit: A further simplified version can be $R = \left(1 + \frac{(0.9^{1/f} - 1)~t}{S} \right)^{f}$

Then we lose flexibility; in other words, changing f will have barely any effect on the shape of the forgetting curve. I tried it in Desmos, and it seems that changing f only slightly changes the shape of the curve, defeating the whole point of introducing a new parameter.

@Expertium
Collaborator

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

The branch: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/power-function-for-difficulty

@L-M-Sherlock A minor thing, but I don't see .clamp() for w[13].
[image]

@Expertium
Collaborator

Expertium commented Apr 28, 2023

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

I've only tried it on 3 decks so far, but the results aren't looking good. Loss is around 0.5% higher than before, RMSE is around 2% higher. If anything, this new version is worse.

EDIT: by "before" I mean the version without power functions and new parameters.

@user1823
Collaborator Author

According to my testing, the performance of this version is in between that of the original optimizer and the power forgetting curve one. The results are summarized below:

| | Log loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |

@Expertium
Collaborator

Expertium commented Apr 28, 2023

I tested it on 7 decks + the entire collection, same as before. On average, log-loss is 0.6% worse than with the old (no power functions, no new parameters) algorithm, and RMSE is 2% worse. I ran a Wilcoxon signed-rank test on this data as well, but got high (>0.05) p-values, indicating that there is no statistically significant difference between how the old algorithm and the new one perform.
Interestingly, while it performed worse on each of the decks I tested, it actually performed mildly better on the entire collection - log-loss is 1% lower, RMSE is 3% lower.
Old (no power functions, no new parameters): log-loss=0.4831, RMSE=0.0917.
New (power function for difficulty): log-loss=0.4787, RMSE=0.0891.
Overall, it's clear that this isn't nearly as good as using a power function for the forgetting curve itself.

@Expertium
Collaborator

Also, unlike with the power forgetting curve, this time changing the learning rate didn't help.

@user1823
Collaborator Author

Since the power difficulty approach didn't produce the desired results, we might need to either

  • use the power forgetting curve with a fixed value of f; or
  • continue using the exponential forgetting curve and think of other ways to improve the calculation of the stability.

If we finally decide to use the power forgetting curve, I prefer a fixed f that is an average of the training results collected from many users.

To examine the feasibility of this approach, I guess that the first step would be to compare the optimized f values for some users' collections.

The optimized f value for my collection is -0.5501.

@Expertium
Collaborator

Mine is -0.1671, but this isn't the proper way of doing this. We need to collect the number of reviews as well (not just the value of f), so we can take a weighted average. But more importantly, we need at least a few dozen people to submit their data, ideally more than a hundred.

@Expertium
Collaborator

Expertium commented Apr 29, 2023

I just thought about something. If we use R=(1+t/(9S))^-1 as the formula, then R=0.9 when t=S, so S has a meaning that is easy to explain: it's how many days it takes for your retention to drop from 100% to 90% (btw, I finally understood the meaning of 9 in that formula). However, if we change the power to any value other than 1, the meaning of S changes, regardless of what that value is and regardless of whether it stays constant or not.
For example, if f=0.5 and S=1, then the meaning of S is "the number of days it takes your retention to drop from 100% to 94.87%".
What I'm trying to say is that it doesn't matter whether f is constant or can be changed. So the "let's take values of f from different users and average them" approach will also mess up the meaning of S. It's slightly better than making f optimizable for each user since at least the meaning of S will be the same across all users, but it still deviates from the original meaning that @L-M-Sherlock intended.
TLDR: if the formula is R=(1+t/(9S))^-f, then f cannot be anything other than 1, otherwise the meaning of S (the number of days it takes for the user's retention to drop from 100% to 90%) will change, regardless of whether it's a constant or an optimizable parameter.
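Working that out with the unmodified formula:

$$R(S) = \left(1 + \frac{S}{9S}\right)^{-f} = \left(\frac{10}{9}\right)^{-f}, \qquad f = 1 \Rightarrow R(S) = 0.9, \qquad f = 0.5 \Rightarrow R(S) \approx 0.9487.$$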

EDIT: I will make a new issue for submitting new formulas, aka my ideas on how to improve the algorithm. We have already reached 74 comments here, it's becoming kinda cluttered, so it's probably better to make a new issue.

@L-M-Sherlock
Member

Upon recent reflection, I have resolved not to adopt any ideas that involve adding parameters to the forgetting curve. My reasoning is as follows:

The optimal method for fitting the forgetting curve is, given the same review history, to review at different intervals, calculate the retention rate, and then plot the following graph:

[image]

However, in the vast majority of individually used spaced repetition software, the algorithm produces only minor variations in intervals for the same review history, which, coupled with the scarcity of data, renders the estimation of the retention rate imprecise. Incorporating parameters into the forgetting curve might lead the algorithm to learn the characteristics of a specific retention rate, which would extrapolate poorly.

@Expertium
Collaborator

I recommend closing this issue since the new version is being released and this issue hasn't been used for a long time.
