Calibration between actual retention and predicted retention is not great #215
Wait, I don't get it. How is it confirmed by these stats?
FSRS predicts that the average retention for the cards should be 96.85%. But, over the past month, the measured retention was just 94.2%.
Nah, that's a pretty small discrepancy. What FSRS predicts and what True Retention shows you are somewhat different things (I wouldn't ask Sherlock to implement it if I could just use True Retention instead; they're not identical), so a small discrepancy like this is fine. If it were something like 70% vs 95%, that would be worrying. I would say that a difference of less than 5% is fine.
They are not the same thing. The retention in the FSRS stats is calculated over all cards, including undue cards. The retention of undue cards is higher than your requested retention.
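To illustrate the point above: the FSRS stats average the predicted retention over every card, and undue cards (seen again before their interval has elapsed) sit higher on the forgetting curve than the requested retention. A minimal Python sketch with made-up cards, using the power forgetting curve discussed later in this thread:

```python
def retrievability(elapsed_days, stability):
    # Power forgetting curve discussed later in the thread; R = 0.9 at t = S.
    return (1 + elapsed_days / (9 * stability)) ** -1

# Made-up (elapsed_days, stability) pairs; the undue cards have t < S.
cards = [(1, 10), (3, 10), (10, 10), (2, 30), (30, 30)]
predicted = [retrievability(t, s) for t, s in cards]
print(sum(predicted) / len(predicted))  # ~0.95, above the 0.9 requested retention
```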
Oh! I failed to consider that. So, should I close this issue?
But I think that you should include this in the explanation of the average retention stats when you release the feature.
OK, I will improve the explanation of the FSRS stats.
Originally posted by @L-M-Sherlock in #151 (comment)
Originally posted by @Expertium in #151 (comment) @L-M-Sherlock, you may want to test the power function in my collection. I believe that my collection includes significant amounts of both easy and difficult materials.
Thanks for the data. I will do some research here. Currently, I am working on filtering out the outliers in Expertium's data.
@L-M-Sherlock, please let me know the results after you test the power function in my collection. I think that it might also solve this issue (at least partially).
Unfortunately, I even increased the number of parameters to 20 today, but the accuracy doesn't increase significantly. I need to do more experiments here.
I'm curious what parameters you added. Do you mind giving a detailed description of this new model, with all the formulas? EDIT: or even better, make a beta version of the new optimizer and the new scheduler.
I will share some details about it tomorrow.
Previously you said that you had increased the number of parameters to 20. Perhaps you could release a beta version of the optimizer and a beta version of the scheduler with those parameters so other people can experiment with them (and do side-by-side comparisons) as well? Also, it's kinda hard to understand formulas in code form (well, for me at least), so I would really appreciate it if you wrote the formulas using LaTeX or something like that and posted them here (I assume you won't be making a dedicated entry on the wiki).
OK, I will publish the beta version of the optimizer in another branch. But I don't want to develop the scheduler until the optimizer is significantly better than before and stable.
In the following image (from SuperMemo), the power function is of the form shown there, with parameters A and B. So, I think that the power in the equation shared by @L-M-Sherlock above is not what we are targeting here. I think that the A and B in the equation I shared here should be the new parameters.
I am not sure that I understand the algorithm correctly. But, if the original function is [formula not shown], shouldn't the formula for the retention in terms of stability be [formula not shown]? How did it become the following [formula not shown]? And now, if the new function is [formula not shown], shouldn't the formula for the retention in terms of stability be [formula not shown]? How did it become the following [formula not shown]? For the above calculations, I have taken the stability to be the time at which the retention is 90%.
There is only one parameter for the forgetting curve function. The exponential forgetting curve function is R = 0.9^(t/S).
So, is this some other formula? The two formulas I mention are from the image posted above. The source of the image is https://supermemo.guru/wiki/Exponential_nature_of_forgetting#Power_law_emerges_in_superposition_of_exponential_forgetting_curves.
In Algorithm SM-17, retrievability R corresponds with the probability of recall and represents the exponential forgetting curve. Retrievability is derived from stability and the interval: R[n] := exp(-k*t/S[n-1]), where: R[n] - retrievability at the n-th repetition
So the new function has a somewhat different shape, but otherwise it's not more flexible than the old one? If that's so, then I'm not surprised that the results haven't changed much. I'm sure that if instead of (1 + t / (9 * S)) ** -1 it were (1 + t / (9 * S)) ** -w, where w can be optimized, the results would be more impressive. Although I suppose that would make it more difficult to implement such a function in the scheduler. Are there any other changes to how S or D are calculated? Also, I still don't understand where 9 comes from in that formula.
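For readers following along, here is a rough numpy sketch of the two shapes being compared: the exponential curve used so far (assumed here to be R = 0.9^(t/S)) versus the proposed power curve with an optimizable exponent w. The w = 0.5 value is just an example:

```python
import numpy as np

def exponential_curve(t, S):
    # Exponential forgetting curve: R = 0.9 ** (t / S), so R = 0.9 at t = S.
    return 0.9 ** (t / S)

def power_curve(t, S, w=1.0):
    # Proposed power forgetting curve: R = (1 + t / (9 * S)) ** -w.
    # With w = 1 it also passes through R = 0.9 at t = S, but it has a
    # heavier tail than the exponential curve.
    return (1 + t / (9 * S)) ** -w

t = np.linspace(0, 100, 6)
print(exponential_curve(t, S=10))
print(power_curve(t, S=10, w=1.0))
print(power_curve(t, S=10, w=0.5))
```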
The new formula is similar to the one below.
Source: https://notes.andymatuschak.org/zHdKY3GwoUW9xG6wQtKFqjz9jcrxdM3mxram So, I think that, as @Expertium says, if the power is -w instead of -1, the results would be more impressive.
OK, I will test it tomorrow.
@Expertium, you can try out the current version of the new optimizer here: https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Feat%2Fpow_difficulty/fsrs4anki_optimizer.ipynb
Pretty well! And the calibration graph shows that the blue line aligns with the orange line better than before.
Because I freeze the parameters of initial stability during the training stage. It makes the parameters more stable and accurate during training.
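A hedged PyTorch sketch of what freezing a subset of parameters can look like in general (the names and the number of frozen entries are illustrative, not the optimizer's actual code):

```python
import torch

# Illustrative parameter vector; suppose the first 4 entries are the initial
# stabilities that should stay fixed during training.
w = torch.nn.Parameter(torch.randn(13))
optimizer = torch.optim.Adam([w], lr=5e-3)

loss = (w ** 2).sum()   # stand-in for the real loss
loss.backward()
w.grad[:4] = 0.0        # zero the gradients of the frozen entries
optimizer.step()        # the frozen entries are left unchanged
```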
The previous implementation is incorrect. It is not R-squared, because R-squared is not r^2. I believe the implementation in sklearn (the most popular machine learning package) is correct.
https://en.wikipedia.org/wiki/Coefficient_of_determination Idk, maybe there are different definitions and they are not equivalent. |
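The difference being discussed: the coefficient of determination is 1 - SS_res/SS_tot, which only coincides with the squared Pearson correlation for an ordinary least-squares linear fit. A quick sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.70, 0.80, 0.85, 0.90, 0.95])
y_pred = np.array([0.60, 0.75, 0.80, 0.88, 0.99])  # made-up predictions

# Coefficient of determination: 1 - SS_res / SS_tot.
print(r2_score(y_true, y_pred))

# Squared Pearson correlation: only equals R^2 for an OLS linear fit.
print(np.corrcoef(y_true, y_pred)[0, 1] ** 2)
```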
I noticed the same some time ago, IIRC. I imagine it is because the data set is bigger, and a bigger data set will have a bigger impact vs. the initial Maimemo(?) data than a smaller data set would have. The optimizer becomes more confident, so to say. Is this correct?
Nice idea! I will add it tomorrow.
Despite the fact that the specific optimizer you are using is Adam (which has a lot of heuristics to make it more adaptive), the learning rate still affects the results. I changed the learning rate to lr = 5e-3 and got even better results with the new forgetting curve.
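In case anyone wants to reproduce this, the change boils down to passing a different lr when constructing the optimizer (a minimal sketch; the variable names are illustrative):

```python
import torch

params = [torch.nn.Parameter(torch.randn(13))]   # stand-in for the FSRS weights

# Adam adapts per-parameter step sizes, but the base learning rate still
# scales every update, so it changes the final fit.
optimizer = torch.optim.Adam(params, lr=5e-3)
```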
I guess that's inevitable if we're introducing parameters to the forgetting curve. If we introduce new parameters to the calculation of S while keeping the original forgetting curve (which only depends on S and t, no parameters), this problem can be circumvented. But then we will most likely end up with a higher loss and RMSE due to the forgetting curve not being very flexible, unless we improve the calculation of S so much that it will compensate for that. Basically, we could decrease the loss/RMSE either by changing how S is calculated or by adding parameters to R=f(S). Or both. I guess it depends on what philosophy you want to adopt for FSRS.
EDIT: if you don't want to continue working on the power forgetting curve with a new parameter due to problems with interpretability, you can bring back the exponential forgetting curve with no parameters and instead change how difficulty is calculated (make it a power function and add a new parameter, like I mentioned in one of the messages above), which will hopefully make S more accurate without sacrificing interpretability.
If we finally decide to use the power forgetting curve, I prefer a fixed [formula not shown].
That's a good idea, though we will need enormous amounts of data, not just from 3-5 users.
I suggest trying out the new approach in a new branch. Leave the current branch (with the power forgetting curve and optimized power) as it is.
Here is the notebook with the exponential curve and 10 * torch.pow(new_d, -f): https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/power-function-for-difficulty/fsrs4anki_optimizer.ipynb The branch: https://github.com/open-spaced-repetition/fsrs4anki/tree/Expt/power-function-for-difficulty
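For context, a rough sketch of how the experimental difficulty factor compares with a linear one. This assumes the linear factor being replaced is of the 11 - D form used elsewhere in FSRS, and f = 0.5 is just an example value; the exact placement in the stability formula is in the branch code:

```python
import torch

d = torch.arange(1, 11, dtype=torch.float32)   # difficulty grid 1..10
f = 0.5                                        # example value of the new parameter

linear_factor = 11 - d                         # linear difficulty factor
power_factor = 10 * torch.pow(d, -f)           # factor from the experimental branch

print(linear_factor)
print(power_factor)   # both equal 10 at d = 1, but they fall off differently
```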
By the way, this problem can be solved by replacing the constant 9 in [formula not shown]. However, I am not sure how practical it is. Edit: a further simplified version can be [formula not shown].
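One way to read this suggestion (a sketch only, since the formulas in the comment above are not shown): replace the constant 9 with a factor that depends on the power, chosen so that R = 0.9 at t = S for any power. Then S keeps its "days until retention drops to 90%" meaning:

```python
def power_curve_fixed_meaning(t, S, w):
    # Choose the factor so that R(t = S) = 0.9 for any power w,
    # instead of hard-coding 9 (which only works for w = 1).
    factor = 0.9 ** (-1.0 / w) - 1.0   # equals 1/9 when w = 1
    return (1 + factor * t / S) ** -w

print(power_curve_fixed_meaning(10, 10, w=1.0))   # 0.9
print(power_curve_fixed_meaning(10, 10, w=0.5))   # also 0.9
print(power_curve_fixed_meaning(30, 10, w=0.5))   # the tail shape still depends on w
```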
Then we lose flexibility; in other words, changing f will have barely any effect on the shape of the forgetting curve. I tried it in Desmos, and it seems that changing f only slightly changes the shape of the curve, defeating the whole point of introducing a new parameter.
@L-M-Sherlock A minor thing, but I don't see .clamp() for w[13].
I've only tried it on 3 decks so far, but the results aren't looking good. Loss is around 0.5% higher than before, RMSE is around 2% higher. If anything, this new version is worse. EDIT: by "before" I mean the version without power functions and new parameters.
According to my testing, the performance of this version is in between that of the original optimizer and the power forgetting curve one. The results are summarized below:
I tested it on 7 decks + the entire collection, same as before. On average, log-loss is 0.6% worse than with the old (no power functions, no new parameters) algorithm, and RMSE is 2% worse. I ran the Wilcoxon signed-rank test on this data as well, but got high (>0.05) p-values, indicating that there is no statistically significant difference between how the old algorithm and the new one perform.
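For reference, this kind of paired comparison can be run with scipy (a minimal sketch with made-up per-deck losses, not the actual numbers):

```python
from scipy.stats import wilcoxon

# Made-up per-deck log-losses for the old and new variants (paired samples).
old_losses = [0.35, 0.42, 0.31, 0.50, 0.28, 0.45, 0.38, 0.40]
new_losses = [0.36, 0.41, 0.32, 0.51, 0.29, 0.46, 0.39, 0.41]

result = wilcoxon(old_losses, new_losses)
print(result.pvalue)  # p > 0.05 would indicate no statistically significant difference
```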
Also, unlike with the power forgetting curve, this time changing the learning rate didn't help.
Since the power difficulty approach didn't produce the desired results, we might need to either [options not shown].
To examine the feasibility of this approach, I guess that the first step would be to compare the optimized [value not shown] across collections. The optimized [value not shown].
Mine is -0.1671, but this isn't the proper way of doing this. We need to collect the number of reviews as well (not just the value of the parameter).
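A hedged sketch of what "collect the number of reviews as well" could mean in practice: weight each collection's optimized value by its review count so that small collections don't dominate (all numbers except -0.1671 are hypothetical):

```python
import numpy as np

# Optimized values reported by different collections, with their review counts.
optimized_values = np.array([-0.1671, -0.21, -0.09, -0.15])
review_counts = np.array([12000, 3000, 50000, 8000])

weighted_mean = np.average(optimized_values, weights=review_counts)
print(weighted_mean)
```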
I just thought about something. If we use R = (1 + t/(9S))^(-1) as a formula, then R = 0.9 when t = S, so S has a meaning that is easy to explain: it's how many days it takes for your retention to drop from 100% to 90% (btw, I finally understood the meaning of 9 in that formula). However, if we change the power to any value other than 1, the meaning of S changes, regardless of what this value is and regardless of whether it stays constant or not. EDIT: I will make a new issue for submitting new formulas, aka my ideas on how to improve the algorithm. We have already reached 74 comments here, it's becoming kinda cluttered, so it's probably better to make a new issue.
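The arithmetic behind that interpretation: plugging t = S into the power curve gives (1 + S/(9S))^(-1) = (10/9)^(-1) = 0.9 for any S, which is exactly where the 9 comes from. A one-line check:

```python
S = 7.0                          # any stability value
print((1 + S / (9 * S)) ** -1)   # 0.9 for every S (up to floating-point rounding)
```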
Upon recent reflection, I have resolved not to adopt any ideas that involve adding parameters to the forgetting curve. My reasoning is as follows: the optimal method for fitting the forgetting curve is, given the same review history, to review at different intervals, calculate the retention rate, and subsequently plot the following graph: [graph not shown]. However, in the vast majority of spaced repetition software used individually, given the same review history, the algorithm presents only minor variations in intervals, and the data are scarce, rendering the estimation of the retention rate imprecise. Incorporating parameters into the forgetting curve might lead the algorithm to learn the characteristics of a specific retention rate, which would perform inadequately in extrapolation.
I'd recommend closing this issue, since the new version is being released and this issue hasn't been active for a long time.
Btw, calibration doesn't look great on my collection. I've tried it with separate decks and got similar results.
![Calibration graph (Entire collection)](https://user-images.githubusercontent.com/83031600/231520566-8039c37c-599d-458d-b933-5c8f5907a4ca.png)
Originally posted by @Expertium in #151 (comment)