[Feature Request] Sharing ideas for further improvement of the algorithm #239
I second this. It would allow us to account for the fact that different users use the grades differently.
Edit: If we implement this, we can also do away with the |
Unfortunately, this is not always the case. I tried using 5e-3 with the power forgetting curve (with optimizable power) and I got the following results:
Also, the intervals rose too quickly with 5e-3, as shown below: |
Interesting, in my case it actually helped with the power forgetting curve. |
@L-M-Sherlock here's my spaghetti code implementation of optimizable values for Again-Hard-Good-Easy. I tested it, and it worked well (see the table above in my previous comment). It would be great if you implemented this (and there's probably a better way to implement it, I'll leave that to you) so we could all test it.
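(The code itself was not preserved in this export. For readers following along, here is a minimal sketch of what optimizable Again-Hard-Good-Easy values could look like in a PyTorch-style optimizer; the class and parameter names are assumptions, not the actual implementation.)

```python
import torch
import torch.nn as nn

class OptimizableGrades(nn.Module):
    """Sketch: learn numeric values for Again/Hard/Good/Easy
    instead of hard-coding them to 1, 2, 3, 4."""
    def __init__(self):
        super().__init__()
        # initialize at the classic fixed values
        self.grade_values = nn.Parameter(torch.tensor([1.0, 2.0, 3.0, 4.0]))

    def forward(self, rating: torch.Tensor) -> torch.Tensor:
        # rating is 1..4; look up the learned value for each grade,
        # clamped to [0, 10] as suggested later in this thread
        return self.grade_values.clamp(0.0, 10.0)[rating.long() - 1]
```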
|
I tested this and got very impressive results:
As you can see, the optimized grades yielded the best results till now. |
Based on all the experiments we have conducted so far, I can conclude that it is best to replace as many fixed values with optimized values as possible. However, while doing this, we will have to ensure that all the parameters remain interpretable. |
@Expertium, Here is my implementation:
Edit: Corrected the three errors found in the code. |
This seems a bit excessive, but ok, I'll test it. |
I tested it, but the results were worse than before. Did I make some mistake in the code? My results:
|
I'm also getting worse results. I think you made a mistake here: `elif x_stab[1] == 3:` should be `elif x_new[1] == 3:` |
Oh! You are right. |
I'm still getting worse results, no idea why |
Even after making the change, the result is the same. In retrospect, this makes sense, because the initial values of x_new and x_stab are the same, so the above error wouldn't have impacted the results. I think that there is another undiscovered mistake in my code; otherwise, I wouldn't expect the results to be worse than before. Edit: I found the error.
Here, […]. Edit 2: Now, I think that this line should be replaced with […]. |
I ran the optimizer again after making the correction. I got the following results:
Note: I haven't yet made the last correction (the line replacement from Edit 2 above). Edit: I made this change, but the results were the same. |
Good catch, I'll correct it in my original code (a few comments above) as well. |
@Expertium, why do you think that this should produce better results? The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability. |
I don't see that in the formulas on wiki, or in the code. Can you point to a specific formula/line of code? |
According to the following code, the difficulty increases with every Again and Hard rating (more so with Again). And according to the next piece of code, the greater the difficulty, the smaller the increase in stability. |
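(The code blocks referenced here were not preserved in this export. For context, the relevant FSRS formulas have roughly this shape, with placeholder weight names since the exact indices differ between versions:)

$$D' = D - w_d\,(G - 3), \qquad S' = S\left(1 + e^{w_s}\,(11 - D)\,S^{w_k}\left(e^{w_r(1 - R)} - 1\right)\right)$$

With $w_d > 0$, pressing Again ($G = 1$) raises difficulty the most, and the $(11 - D)$ factor then shrinks the stability increase as $D$ grows.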
Ah, that's what you mean. This isn't the same as adding up the total number of lapses and using the resulting number. Yes, difficulty depends on how many times you press "Again", but in a much less explicit way. In any case, in order to find out whether my idea works or not, we need Sherlock to implement it so we can test it. |
@Expertium, would you like to do the Wilcoxon signed-rank test to compare the results of the separated adaptive grades and the unified adaptive grades? The result of that test will make it easier to choose the one we would want to use. The latest version of the code for separated adaptive grades is here: #239 (comment) |
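(For reference, a minimal sketch of such a test with SciPy; the arrays here are hypothetical placeholders for the per-collection metrics of the two variants.)

```python
from scipy.stats import wilcoxon

# hypothetical per-collection RMSE values for the two variants
rmse_separated = [0.042, 0.038, 0.051, 0.047, 0.035, 0.049]
rmse_unified   = [0.044, 0.037, 0.053, 0.049, 0.036, 0.050]

# paired, non-parametric test of whether the variants differ systematically
statistic, p_value = wilcoxon(rmse_separated, rmse_unified)
print(f"W = {statistic}, p = {p_value:.3f}")
```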
@L-M-Sherlock, can you implement the following so that we can test it?
|
It requires extracting the |
Well, then implement adaptive grades as described here (or in a better way, if you know how):
|
Making 2 different optimizable grades for S and D improved performance when compared to the old algorithm, but not when compared to just having optimizable grades at all. The p-value for that comparison is not shown in the table (in the table I only compare changes against the baseline), but it was much greater than 0.01, suggesting that having different grades for S and D doesn't improve performance. |
Here, I have tried to extract from the above comments what needs to be done now so that we don't miss anything.
|
@L-M-Sherlock Honestly, I think this is just confusing, so how about this instead:
|
You can read this for details: |
I've read it, but I'm still confused |
@L-M-Sherlock There is an important matter that I want to discuss.
The recall matrix is a pretty sophisticated thing, but it also promises great improvements in accuracy since it allows FSRS to correct its own bad predictions using real data. But while it will help with the underestimation of R, it won't actually solve the underlying problem. Recall matrix treats symptoms instead of treating the cause of the disease itself, if that makes sense. |
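(The recall-matrix proposal itself is described elsewhere (#271), so the following is only my reading of the core idea, with all names hypothetical: bin reviews by the model's predicted R, record the retention actually measured in each bin, and let the measurements override the model where there is enough data.)

```python
import numpy as np

class RecallMatrix:
    """Hypothetical sketch: correct predicted R with measured retention per bin."""
    def __init__(self, n_bins: int = 11):
        self.hits = np.zeros(n_bins)   # successful recalls per bin
        self.total = np.zeros(n_bins)  # total reviews per bin

    def _bin(self, predicted_r: float) -> int:
        return min(int(round(predicted_r * (len(self.total) - 1))), len(self.total) - 1)

    def record(self, predicted_r: float, recalled: bool) -> None:
        b = self._bin(predicted_r)
        self.total[b] += 1
        self.hits[b] += recalled

    def corrected_r(self, predicted_r: float) -> float:
        b = self._bin(predicted_r)
        if self.total[b] < 30:  # too little data: trust the model's prediction
            return predicted_r
        return self.hits[b] / self.total[b]  # otherwise trust the measurements
```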
I will implement it later.
I don't want to include too many parameters in FSRS. Just keep it simple. Recently, I have found that even if the loss stays the same, the scheduling could still be very different, because a single user hardly has enough data in all retention ranges. Two sets of parameters could both fit the current data well but make very different predictions outside those ranges, as I mentioned in another issue. I plan to clamp w[5] (difficulty reversion) to alleviate the ease hell. I also clamp S decay in the latest optimizer. |
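(A minimal illustration of what such clamping looks like in a PyTorch-style optimizer; the bounds below are made-up placeholders, not the real ones.)

```python
import torch

w = torch.nn.Parameter(torch.randn(14))

with torch.no_grad():
    w[5].clamp_(0.0, 1.0)    # difficulty reversion: placeholder bounds
    w[9].clamp_(-1.0, 0.0)   # S-decay-related weight: placeholder bounds
```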
If that's your concern, rest assured, the recall matrix only adds two more optimizable parameters. |
```python
import math
from sklearn.metrics import mean_squared_error

# RMSE of Anki's predictions minus RMSE of FSRS's predictions (positive favors FSRS)
R_Metric = (
    math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['anki_p']))
    - math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['p']))
)
print("R_Metric: ", R_Metric)
```

Do you think this is what you need? In my view, the values in the matrix are also parameters. |
But they're not. Please, read the full document in #271 |
I have read it.
The matrix is used to predict R, right? That's why I think the values in the matrix are parameters. |
Well, they're not parameters in the same sense as these: |
Let me try to explain it.
So, if a model is perfect, whenever another algorithm bins the data, the B-W metric will always be zero in all bins. The cross-comparison is helpful when the cheating algorithm puts all data into one bin and predicts the average retention in the entire collection. |
OK. But what's the x-axis in the new calibration? The predicted R by Anki or FSRS? |
Well, that's something I didn't consider and I have no good answer to this. If you or Expertium don't have a good idea either, we can drop this idea. |
I don't have a good idea either |
So, @L-M-Sherlock, just to be perfectly clear, are you saying that you will implement the recall matrix at some point in the future, after dealing with other issues, or are you saying that you will never implement it? |
It is a pragmatic idea, and I have replied in your issue. |
We haven't done a proper comparison between the exponential function and the power function with the new optimizer.
I want to do a fair comparison. |
I know that "remove it because I don't understand it" is pretty dumb, but I think that my version (see code 2 comments above) is much easier to interpret and we should keep it. |
In the case of FSRS bins for Anki B-W, FSRS will classify all data into 11 bins, from R = 0 to R = 1. Then we calculate the average prediction of Anki, the real retention, and the B-W metric for Anki. If the B-W metric is > 0, it means Anki overestimated R, and vice versa. Why did Woz introduce cross-comparison? I think it is used to avoid cheating.
As we know, the repetitions are unlikely to be uniform. But a cheating algorithm could classify all repetitions into one uniform population and guess the average recall. In cross-comparison, we use another algorithm to split the population, which defeats this cheating method. For example, assume there are three populations in the review data, with true R values of 0.7, 0.8, and 0.9 respectively. A cheating algorithm could achieve the best score, a B-W metric of 0, by predicting 0.8. We wouldn't be able to detect this with the original calibration chart, since we are using the same algorithm's predictions to divide the population and then calculating the B-W metric on each subpopulation. |
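(A sketch of this cross-comparison in pandas. Column names follow the snippet quoted earlier: y is the actual outcome, p is FSRS's predicted R, anki_p is Anki's predicted R.)

```python
import pandas as pd

def cross_comparison_bw(df: pd.DataFrame, bin_by: str, evaluate: str) -> pd.DataFrame:
    """Bin reviews by one algorithm's predicted R (11 bins, R = 0.0 .. 1.0),
    then compute the B-W metric of the other algorithm in each bin."""
    out = df.copy()
    out["bin"] = (out[bin_by] * 10).round() / 10  # 11 bins: 0.0, 0.1, ..., 1.0
    grouped = out.groupby("bin").agg(
        predicted=(evaluate, "mean"),  # evaluated algorithm's average prediction
        actual=("y", "mean"),          # real retention in the bin
        n=("y", "size"),
    )
    grouped["bw_metric"] = grouped["predicted"] - grouped["actual"]  # > 0: overestimation
    return grouped

# e.g. FSRS bins for Anki B-W:
# cross_comparison_bw(cross_comparison, bin_by="p", evaluate="anki_p")
```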
Continue in #282 |
Which module is related to your feature request?
Scheduler/Optimizer
Is your feature request related to a problem? Please describe.
Here I will post all my ideas (new formulas) to improve the algorithm - decrease log-loss and RMSE.
Describe the solution you'd like
EDIT: the range should be [0, 10]; according to my testing, grades can go below 1, and that improves the loss.
Formulas that involve grades will have to be slightly changed - replace (G-1) and (G-3) with (G-g_again) and (G-g_good), where g_again and g_good are the numerical values assigned to "Again" and "Good", respectively.
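(Schematically, with generic weight names since the exact indices depend on the algorithm version:)

$$w_a\,(G - 1) \;\longrightarrow\; w_a\,(G - g_{\text{again}}), \qquad w_b\,(G - 3) \;\longrightarrow\; w_b\,(G - g_{\text{good}})$$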
Basically, as a final step in computing S, we raise it to some power w_14. Remember how well using a power function with a trainable parameter worked out for the forgetting curve? I believe this way we can mimic that effect. This introduces 1 new parameter.
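(That is, schematically, $S_{\text{new}} = S^{\,w_{14}}$ applied as the last step, analogous to how the power forgetting curve adds a trainable exponent.)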
Change the way S is calculated after a lapse ("Again") to include the number of lapses, L: multiply the post-lapse stability by (L + 1)^w_13. The +1 is needed to avoid multiplication by 0, which would result in S = 0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.

[Feature Request] Sharing ideas for further improvement of the algorithm #239 (comment)

Forget about it, anything that requires R_requested won't work. R_requested can be different if the user decided to change it at some point, and more importantly, it's just not a thing in native Anki: there is no such thing as setting a desired probability of recall.

Also, I suggest replacing w_9 with e^w_9. It probably won't matter, but it's just more consistent, since in the other formula you use e^w_6 rather than just w_6.
None of these changes affect the forgetting curve (you can use either 0.9^(t/S) or (1 + t/(9S))^(-1)), so the meaning of S - the number of days it takes for the user's retention to drop from 100% to 90% - will remain unchanged.
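(Both curves are normalized so that R(S) = 0.9, which is exactly what keeps the meaning of S intact:)

$$R(t) = 0.9^{\,t/S} \qquad\text{or}\qquad R(t) = \left(1 + \frac{t}{9S}\right)^{-1}, \qquad\text{and in both cases } R(S) = 0.9.$$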
EDIT: purely for the sake of clarity and keeping things easy to understand, I think all formulas should be rewritten in such a way that parameters are always positive. For example, if you have `D^w` (`w` is a parameter) and `w` is always positive - good, don't change anything. If `w` is always negative - rewrite the formula as `D^-w` and keep `w` positive. This should make it easier to read the wiki.