
Calibration between actual retention and predicted retention is not great #215

Closed
user1823 opened this issue Apr 15, 2023 · 76 comments
Labels: enhancement (New feature or request)

Comments

@user1823
Collaborator

Btw, calibration doesn't look great on my collection. I've tried it with separate decks and got similar results.
Calibration graph (Entire collection)

Originally posted by @Expertium in #151 (comment)

@user1823
Collaborator Author

This is also confirmed by the following stats:

[image]

@Expertium
Collaborator

Wait, I don't get it. How is it confirmed by these stats?

@user1823
Collaborator Author

user1823 commented Apr 15, 2023

FSRS predicts that the average retention for the cards should be 96.85%.

But, over the past month, the measured retention was just 94.2%.

@Expertium
Collaborator

Expertium commented Apr 15, 2023

Nah, that's a pretty small discrepancy. What FSRS predicts and what True Retention shows you are somewhat different things (I wouldn't ask Sherlock to implement it if I could just use True Retention instead; they're not identical), so a small discrepancy like this is fine. If it were like 70% vs 95%, that would be worrying. I would say that a difference of less than 5% is fine.

@L-M-Sherlock
Member

FSRS predicts that the average retention for the cards should be 96.85%.

But, over the past month, the measured retention was just 94.2%.

They are not the same thing. The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.
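As a rough sketch (using the exponential forgetting curve R = 0.9^(t/S) discussed later in this thread; this is not the actual stats code), here is why undue cards pull the predicted average above the requested retention:

```python
# Sketch only: predicted retention under R = 0.9 ** (t / S).
# With requested retention 0.9, a card becomes due roughly when elapsed days == stability.
def predicted_retention(elapsed_days: float, stability: float) -> float:
    return 0.9 ** (elapsed_days / stability)

print(predicted_retention(10, 10))  # due today   -> 0.900
print(predicted_retention(3, 10))   # not yet due -> ~0.969, above the requested 90%
```

Averaging the predicted retention over all cards, due and undue, therefore gives a number above the requested retention, which is consistent with the 96.85% figure quoted above.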

@user1823
Collaborator Author

They are not the same thing. The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.

Oh! I failed to consider that.

So, should I close this issue?

@user1823
Collaborator Author

The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.

But I think that you should include this in the explanation of the average retention stats when you release the feature.

@L-M-Sherlock
Member

OK, I will improve the explanation of the FSRS stats.

@user1823
Collaborator Author

In my recent research (unpublished), the error of power functions is lower than that of exponential functions in fitting the forgetting curve. I will test it in some users' collections.

Originally posted by @L-M-Sherlock in #151 (comment)

Interesting. Woz says that the forgetting curve is definitely exponential, but when you have both difficult and easy material in the same collection, the resulting curve (a superposition of different exponential curves) is best approximated with a power function.
https://supermemo.guru/wiki/Exponential_nature_of_forgetting#Power_law_emerges_in_superposition_of_exponential_forgetting_curves

Originally posted by @Expertium in #151 (comment)

@L-M-Sherlock, you may want to test the power function in my collection. I believe that my collection includes significant amounts of both easy and difficult materials.

  • My deck: Default.zip (Change extension to .apkg)
  • w: [1.0696, 1.6707, 4.9898, -1.2855, -1.1631, 0.0, 1.7095, -0.0887, 1.0651, 1.71, -0.4937, 0.8296, 0.4661]
  • requestRetention: 0.94
  • maximumInterval: 36500
  • easyBonus: 2.5
  • hardInterval: 1.2
  • timezone = 'Asia/Calcutta'
  • next_day_starts_at = 4
  • revlog_start_date = "2006-10-05"

@L-M-Sherlock
Member

Thanks for the data. I will do some research on it. Currently I am working on filtering out the outliers in Expertium's data.

@user1823
Collaborator Author

@L-M-Sherlock, please let me know the results after you test the power function in my collection. I think that it might solve this issue also (at least partially).

@L-M-Sherlock
Member

Unfortunately, even after increasing the number of parameters to 20 today, the accuracy didn't increase significantly. I need to do more experiments here.

@Expertium
Collaborator

Expertium commented Apr 25, 2023

I'm curious what parameters you added. Do you mind giving a detailed description of this new model, with all the formulas?

EDIT: or even better, make a beta-version of the new optimizer and the new scheduler.

@L-M-Sherlock
Member

I will share some details about it tomorrow.

@L-M-Sherlock
Member

Added the power forgetting curve and power difficulty:

[image]

I think there is no significant difference.

[image]

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Previously you said that you have increased the number of parameters to 20. Perhaps you could release a beta-version of the optimizer and a beta-version of the scheduler with those parameters so other people can experiment with them (and do side-by-side comparisons) as well?

Also, it's kinda hard to understand formulas in code form (well, for me at least), so I would really appreciate it if you wrote the formulas using LaTeX or something like that and posted them here (I assume you won't be making a dedicated wiki entry).

@L-M-Sherlock
Member

Previously you said that you have increased the number of parameters to 20. Perhaps you could release a beta-version of the optimizer and a beta-version of the scheduler with those parameters so other people can experiment with them (and do side-by-side comparisons) as well?

OK, I will publish the beta-version of the optimizer in another branch. But I don't want to develop the scheduler until the optimizer is significantly better than before and stable.

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Also, a suggestion - add root-mean-square error too, not just R^2.
[image]

  1. Find the difference between "Average actual retention" and "Optimal average actual retention", and square it.
  2. Take the sum of all squared differences, divide it by their count (which is just the number of bins, like 40), and take the square root.

The interpretation is as follows: it tells us how much, on average, the calibration is off from theoretical perfection. For example, RMSE=0.05 means that on average FSRS is 5% off from theoretically perfect calibration. This is easier to interpret than R^2, though I think we should keep both.
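A minimal sketch of that computation (the per-bin values below are made up for illustration; this is not the optimizer's actual code):

```python
import numpy as np

# Hypothetical per-bin values read off the calibration graph.
actual_retention  = np.array([0.88, 0.90, 0.92, 0.94])  # "Average actual retention"
optimal_retention = np.array([0.90, 0.92, 0.94, 0.96])  # "Optimal average actual retention"

rmse = np.sqrt(np.mean((actual_retention - optimal_retention) ** 2))
print(f"RMSE = {rmse:.4f}")  # 0.0200, i.e. calibration is off by about 2% on average
```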

@Expertium
Collaborator

Expertium commented Apr 26, 2023

Also, maybe I'm just dumb, but I have no idea what is going on here.
[image]
Is this correct? If so, why is there a 9 and no new parameters?
[image]
I thought the power would be a new parameter, like this.
[image]

@user1823
Collaborator Author

user1823 commented Apr 26, 2023

I thought the power would be a new parameter, like this.
[image]

In the following image (from SuperMemo), the power function is of the form:
Retrievability = $A~t^B$

So, I think that the power in the equation shared by @L-M-Sherlock above is not what we are targeting here.

I think that A and B in the equation I shared here should be the new parameters.

@L-M-Sherlock
Member

[image]

I have to ensure that stability has the same meaning as before. You can compare f(t) and g(t) in the graph: when t = a, f(a) = g(a) = 0.9.

@user1823
Collaborator Author

user1823 commented Apr 26, 2023

I am not sure that I understand the algorithm correctly. But,

If the original function is
$$R = A~e^{-kt}$$

shouldn't the formula for the retention in terms of stability be
$$R = 0.9~e^{-k(t-S)}$$

How did it become the following?
$$R = 0.9^{t/S}$$

And now, if the new function is
$$R = A~t^B$$

shouldn't the formula for the retention in terms of stability be
$$R = 0.9~(t/S)^B$$

How did it become the following?
$$R = (1 + \frac{t}{9S})^{-1}$$

For the above calculations, I have taken the stability to be the time at which the retention is 90%.

@L-M-Sherlock
Member

There is only one parameter for the forgetting curve function. The exponential forgetting curve function is $R(t)=e^{\ln(0.9)t/S}$, which is equivalent to $R(t)=0.9^{t/S}$.
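For completeness, the equivalence is just a change of base:

$$R(t) = e^{\ln(0.9)\,t/S} = \left(e^{\ln 0.9}\right)^{t/S} = 0.9^{t/S}, \qquad \text{so } R(S) = 0.9.$$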

@user1823
Collaborator Author

So, is this some other formula?

The two formulas I mentioned are from the image posted above. The source of the image is https://supermemo.guru/wiki/Exponential_nature_of_forgetting#Power_law_emerges_in_superposition_of_exponential_forgetting_curves.

@L-M-Sherlock
Member

In Algorithm SM-17, retrievability R corresponds with the probability of recall and represents the exponential forgetting curve. Retrievability is derived from stability and the interval:

R[n] := exp(-k*t/S[n-1])

where:

R[n] - retrievability at the n-th repetition
k - decay constant
t - time (interval)
S[n-1] - stability after the (n-1)th repetition

@Expertium
Collaborator

Expertium commented Apr 26, 2023

So the new function has a somewhat different shape, but otherwise it's not more flexible than the old one? If that's so, then I'm not surprised that the results haven't changed much. I'm sure that if instead of (1 + t / (9 * S)) ** -1 it was (1 + t / (9 * S)) ** -w, where w can be optimized, the results would be more impressive. Although I suppose that would make it more difficult to implement such a function in the scheduler.

Are there any other changes to how S or D are calculated?

Also, I still don't understand where 9 comes from in that formula.
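For illustration, here is a sketch of the two shapes being discussed (the function and variable names are made up; this is not the optimizer's actual code). The 9 is just a rescaling of t so that the power curve passes through R = 0.9 at t = S when the exponent is 1:

```python
import torch

def exp_forgetting_curve(t: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    # R(t) = 0.9 ** (t / S), so R(S) = 0.9 by construction.
    return 0.9 ** (t / s)

def power_forgetting_curve(t: torch.Tensor, s: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # R(t) = (1 + t / (9 S)) ** -w. With w = 1, R(S) = (1 + 1/9) ** -1 = 0.9,
    # which is where the 9 comes from. With w != 1, R(S) is no longer 0.9.
    return (1 + t / (9 * s)) ** -w

s = torch.tensor(10.0)
print(exp_forgetting_curve(s, s))                       # tensor(0.9000)
print(power_forgetting_curve(s, s, torch.tensor(1.0)))  # tensor(0.9000)
```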

@user1823
Collaborator Author

The new formula is similar to the one below.

Wixted and Carpenter (2007) summarize the forgetting curve as:
P(recall) = $m(1 + ht)^{-f}$
Where m is the degree of initial learning (i.e. the probability at time 0), h is a scaling factor on time, and f is the exponential memory decay factor.

Source: https://notes.andymatuschak.org/zHdKY3GwoUW9xG6wQtKFqjz9jcrxdM3mxram

So I think that, as @Expertium says, if the power were -w instead of -1, the results would be more impressive.

@L-M-Sherlock
Member

OK, I will test it tomorrow.

@L-M-Sherlock
Member

For each deck both loss and RMSE were lower when using the new version of the optimizer. On average, log-loss was around 4% lower, and RMSE was around 16% lower.

Pretty good! And the calibration graph shows that the blue line aligns with the orange line better than before.

@L-M-Sherlock
Member

Also, (while it's a bit late to ask) is pre-training in 2.2 really necessary? It seems like a very non-intuitive thing to me.

Because I freeze the parameters of initial stability in the training stage. It makes the parameters more stable and accurate during training.

@L-M-Sherlock
Member

Also, for R^2 I think you should bring back the implementation you used in some of the previous versions; there is no way the values given by sklearn.metrics.r2_score are correct, something weird is going on with it.

The previous implementation is incorrect. It is not R-squared, because R-squared is not r^2. I believe the implementation in sklearn (the most popular machine learning package) is correct.

@Expertium
Collaborator

Expertium commented Apr 27, 2023

because R-squared is not r^2

https://en.wikipedia.org/wiki/Coefficient_of_determination

[image]
[image]

Idk, maybe there are different definitions and they are not equivalent.

@galantra
Contributor

What's interesting is that when I ran the new optimizer on the full collection, RMSE was lower than when I ran it on individual decks, which suggests that maybe having different parameters for different decks is not actually advantageous.

I noticed the same thing some time ago, IIRC. I imagine it is because the data set is bigger, and a bigger data set has a bigger impact relative to the initial Maimemo(?) data than a smaller data set would. The Optimizer becomes more confident, so to speak.

Is this correct?

@L-M-Sherlock
Member

I suppose the next step is to apply the same idea to difficulty and replace (10 * torch.pow(new_d, -1)) with something like (10 * torch.pow(new_d, -f_2)) where f_2 is another parameter.

Nice idea! I will add it tomorrow.

@Expertium
Collaborator

Despite the fact that the specific optimizer you are using is Adam (which has a lot of heuristics to make it more adaptive), the learning rate still affects the results. I changed the learning rate to lr = 5e-3 and got even better results with the new forgetting curve.
lr = 5e-4
Loss: 0.4589
RMSE: 0.0466

lr = 5e-3
Loss: 0.4478
RMSE: 0.0367

[image]

@user1823
Collaborator Author

Also, increasing the n_epoch from 1 to 3 produces better results.

n_epoch = 1: [image]

n_epoch = 3: [image]

@Expertium
Collaborator

I tried changing the learning rate to 5e-2 and got this error:
[image]

@L-M-Sherlock
Member

I'm stuck on a problem. Stability has a different meaning for different users when I introduce f into the power forgetting curve:

[image]

@Expertium
Collaborator

Expertium commented Apr 28, 2023

I guess that's inevitable if we're introducing parameters to the forgetting curve. If we introduce new parameters to the calculation of S while keeping the original forgetting curve (which only depends on S and t, no parameters), this problem can be circumvented. But then we will most likely end up with a higher loss and RMSE due to the forgetting curve not being very flexible, unless we improve the calculation of S so much that it will compensate for that. Basically, we could decrease the loss/RMSE either by changing how S is calculated or by adding parameters to R=f(S). Or both.

I guess it depends on what philosophy you want to adopt for FSRS.

  1. "Only decreasing the loss matters, regardless of how ad hoc and non-rigorous our model is, and interpretability can go to hell."
  2. "Every value and every parameter must have a precise meaning, sacrificing the rigor and interpretability of the model is unacceptable."

EDIT: if you don't want to continue working on the power forgetting curve with a new parameter due to problems with interpretability, you can bring back the exponential forgetting curve with no parameters and instead change how difficulty is calculated (make it a power function and add a new parameter, like I mentioned in one of the messages above), which will hopefully make S more accurate without sacrificing interpretability.

@L-M-Sherlock
Member

If we finally decide to use the power forgetting curve, I prefer a fixed f that is an average of the training results collected from many users.

@Expertium
Collaborator

That's a good idea, though we will need enormous amounts of data, not just from 3-5 users.
Ok, remove the power forgetting curve for now, bring back the exponential curve, and change the difficulty formula to (10 * torch.pow(new_d, -f)), where f can be optimized.
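A sketch of that change in isolation (the tensor values are illustrative and the clamp range assumes FSRS's usual [1, 10] difficulty bounds; this is not the actual optimizer code):

```python
import torch

new_d = torch.tensor([1.0, 3.0, 5.0, 10.0]).clamp(1, 10)  # difficulty after the update step
f = torch.tensor(1.0, requires_grad=True)                 # new trainable exponent, initialized at 1

old_factor = 10 * torch.pow(new_d, -1)   # current formula: inversely proportional to D
new_factor = 10 * torch.pow(new_d, -f)   # proposed: the exponent itself is learned
```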

@user1823
Collaborator Author

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

@L-M-Sherlock
Member

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

The branch: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/power-function-for-difficulty

@user1823
Collaborator Author

user1823 commented Apr 28, 2023

Stability has a different meaning for different users when I introduce f into the power forgetting curve.

By the way, this problem can be solved by replacing the constant 9 in
$$R = \left(1 + \frac{t}{9S} \right)^{f}$$
by
$$\frac{10^{1/f}}{9^{1/f} - 10^{1/f}}$$

However, I am not sure how practical it is.

Edit: A further simplified version can be
$$R = \left(1 + \frac{(0.9^{1/f} - 1)~t}{S} \right)^{f}$$
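A quick numeric check of that property (sketch only; f follows the convention used above, where the fitted values come out negative):

```python
# With R = (1 + (0.9 ** (1 / f) - 1) * t / S) ** f, setting t = S gives
# (0.9 ** (1 / f)) ** f = 0.9 for every f, so the meaning of S is preserved.
def retention(t: float, s: float, f: float) -> float:
    return (1 + (0.9 ** (1 / f) - 1) * t / s) ** f

for f in (-0.17, -0.55, -1.0):
    print(f, round(retention(10, 10, f), 4))  # prints 0.9 for every f
```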

@Expertium
Collaborator

Expertium commented Apr 28, 2023

Edit: A further simplified version can be $R = \left(1 + \frac{(0.9^{1/f} - 1)~t}{S} \right)^{f}$

Then we lose flexibility; in other words, changing f will have barely any effect on the shape of the forgetting curve. I tried it in Desmos, and it seems that changing f only slightly changes the shape of the curve, defeating the whole point of introducing a new parameter.

@Expertium
Collaborator

I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

The branch: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/power-function-for-difficulty

@L-M-Sherlock A minor thing, but I don't see .clamp() for w[13].
[image]

@Expertium
Collaborator

Expertium commented Apr 28, 2023

Here is the notebook with exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb

I've only tried it on 3 decks so far, but the results aren't looking good. Loss is around 0.5% higher than before, RMSE is around 2% higher. If anything, this new version is worse.

EDIT: by "before" I mean the version without power functions and new parameters.

@user1823
Collaborator Author

According to my testing, the performance of this version is in between that of the original optimizer and the power forgetting curve one. The results are summarized below:

| | Log loss | RMSE | R-squared |
| --- | --- | --- | --- |
| Original Optimizer | 0.2241 | 0.0191 | 0.8728 |
| Power forgetting curve | 0.2231 | 0.0147 | 0.9234 |
| Power difficulty | 0.2234 | 0.0170 | 0.9026 |

@Expertium
Collaborator

Expertium commented Apr 28, 2023

I tested it on 7 decks + the entire collection, same as before. On average, log-loss is 0.6% worse than with the old (no power functions, no new parameters) algorithm, and RMSE is 2% worse. I ran a Wilcoxon signed-rank test on this data as well, but got high (>0.05) p-values, indicating that there is no statistically significant difference between how the old algorithm and the new one perform.
Interestingly, while it performed worse on each of the decks I tested, it actually performed mildly better on the entire collection - log-loss is 1% lower, RMSE is 3% lower.
Old (no power functions, no new parameters): log-loss=0.4831, RMSE=0.0917.
New (power function for difficulty): log-loss=0.4787, RMSE=0.0891.
Overall, it's clear that this isn't nearly as good as using a power function for the forgetting curve itself.

@Expertium
Collaborator

Also, unlike with the power forgetting curve, this time changing the learning rate didn't help.

@user1823
Collaborator Author

Since the power difficulty approach didn't produce the desired results, we might need to either

  • use the power forgetting curve with a fixed value of f; or
  • continue using the exponential forgetting curve and think of other ways to improve the calculation of the stability.

If we finally decide to use the power forgetting curve, I prefer a fixed f that is an average of the training results collected from many users.

To examine the feasibility of this approach, I guess that the first step would be to compare the optimized f values for some users' collections.

The optimized f value for my collection is -0.5501.

@Expertium
Collaborator

Mine is -0.1671, but this isn't the proper way of doing this. We need to collect the number of reviews as well (not just the value of f), so we can take a weighted average. But more importantly, we need at least a few dozen people to submit their data, ideally more than a hundred.

@Expertium
Collaborator

Expertium commented Apr 29, 2023

I just thought about something. If we use R=(1+t/(9S))^-1 as the formula, then R=0.9 when t=S, so S has a meaning that is easy to explain: it's how many days it takes for your retention to drop from 100% to 90% (btw, I finally understood the meaning of 9 in that formula). However, if we change the power to any value other than 1, the meaning of S changes, regardless of what that value is and regardless of whether it stays constant or not.
For example, if f=0.5 and S=1, then the meaning of S is "the number of days it takes your retention to drop from 100% to 94.87%".
What I'm trying to say is that it doesn't matter whether f is constant or can be changed. So the "let's take values of f from different users and average them" approach will also mess up the meaning of S. It's slightly better than making f optimizable for each user since at least the meaning of S will be the same across all users, but it still deviates from the original meaning that @L-M-Sherlock intended.
TLDR: if the formula is R=(1+t/(9S))^-f, then f cannot be anything other than 1, otherwise the meaning of S (the number of days it takes for the user's retention to drop from 100% to 90%) will change, regardless of whether it's a constant or an optimizable parameter.
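Working that out with the unmodified formula:

$$R(S) = \left(1 + \frac{S}{9S}\right)^{-f} = \left(\frac{10}{9}\right)^{-f}, \qquad f = 1 \Rightarrow R(S) = 0.9, \qquad f = 0.5 \Rightarrow R(S) \approx 0.9487.$$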

EDIT: I will make a new issue for submitting new formulas, aka my ideas on how to improve the algorithm. We have already reached 74 comments here, it's becoming kinda cluttered, so it's probably better to make a new issue.

@L-M-Sherlock
Member

Upon recent reflection, I have resolved not to adopt any ideas that involve adding parameters to the forgetting curve. My reasoning is as follows:

The optimal method for fitting the forgetting curve is, given the same review history, to review at different intervals, calculate the retention rate, and then plot the following graph:

[image]

However, in the vast majority of individually used spaced repetition software, the algorithm produces only minor variations in intervals for the same review history, which, coupled with the scarcity of data, renders the estimation of the retention rate imprecise. Incorporating parameters into the forgetting curve might lead the algorithm to learn the characteristics of a specific retention rate, which would extrapolate poorly.

@Expertium
Collaborator

I recommend closing this issue since the new version is being released and this issue hasn't been used for a long time.
