
[Feature Request] Sharing ideas for further improvement of the algorithm #239

Closed
Expertium opened this issue Apr 29, 2023 · 450 comments
Labels
enhancement New feature or request

Comments

@Expertium
Collaborator

Expertium commented Apr 29, 2023

Which module is related to your feature request?
Scheduler/Optimizer

Is your feature request related to a problem? Please describe.
Here I will post all my ideas (new formulas) to improve the algorithm - decrease log-loss and RMSE.

Describe the solution you'd like

  1. Make grade values optimizable parameters. Currently Again=1, Hard=2, Good=3, Easy=4. I propose that we make them optimizable as well, and clamp them (although I'm not sure about the range, should it be [1, 10] or something different?). This introduces 4 new parameters.
    EDIT: the range should be [0, 10], according to my testing grades can go below 1 and it will improve loss.
    Formulas that involve grades will have to be slightly changed - replace (G-1) and (G-3) with (G-g_again) and (G-g_good), where g_again and g_good are the numerical values assigned to "Again" and "Good", respectively.
  2. Change this:
    S' = S * (1 + e^(w_6) * (11 - D) * S^(w_7) * (e^((1 - R) * w_8) - 1))
    To this:
    S' = [S * (1 + e^(w_6) * (11 - D) * S^(w_7) * (e^((1 - R) * w_8) - 1))]^(w_14)
    Basically, as a final step in computing S, we raise it to some power w_14. Remember how well using a power function with a trainable parameter for the forgetting curve worked out? I believe this way we can mimic that effect. This introduces 1 new parameter.
  3. Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this:
    S' = w_9 * D^(w_10) * S^(w_11) * e^((1 - R) * w_12)
    To this:
    S' = w_9 * D^(w_10) * S^(w_11) * e^((1 - R) * w_12) * (L + 1)^(w_13)
    Where L is the number of lapses. The +1 is needed to avoid multiplication by 0, which would result in S = 0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.
  4. [Feature Request] Sharing ideas for further improvement of the algorithm #239 (comment) Forget about it, anything that requires R_requested won't work. R_requested can be different if the user decided to change it at some point, and more importantly, it's just not a thing in native Anki, there is no such thing as setting a desired probability of recall.
  5. If at all possible, replace learning steps and re-learning steps with functions/parameters to make them adaptive and ensure that users themselves don't have to worry about setting steps.
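Idea 3 can be sketched in plain Python (a rough sketch: the base formula follows the current S_forget code, and all parameter values here are illustrative, not optimized):

```python
import math

def stability_after_lapse(d, s, r, lapses, w9=2.0, w10=-0.2, w11=0.2, w12=1.0, w13=-0.1):
    """Post-lapse stability with a lapse-count penalty (idea 3).

    The (lapses + 1) ** w13 factor is the proposed addition; w13 should be
    negative so that more lapses yield lower stability. The other terms
    mirror the existing S_forget formula. Parameter values are illustrative.
    """
    base = w9 * d ** w10 * s ** w11 * math.exp((1 - r) * w12)
    return base * (lapses + 1) ** w13

# More lapses -> lower post-lapse stability (since w13 < 0):
s0 = stability_after_lapse(5.0, 10.0, 0.9, lapses=0)
s5 = stability_after_lapse(5.0, 10.0, 0.9, lapses=5)
```

Note that with lapses = 0 the new factor is 1, so the formula reduces exactly to the current one.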

Also, I suggest replacing w_9 with e^w_9. It probably won't matter, but it's just more consistent since in the other formula you use e^w_6 rather than just w_6.

None of these changes affect the forgetting curve (you can use either 0.9^(t/S) or (1 + t/(9S))^(-1)), so the meaning of S - the number of days it takes for the user's retention to drop from 100% to 90% - will remain unchanged.
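A quick numeric check that both curves preserve this meaning of S, i.e. R = 0.9 exactly when t = S (values here are arbitrary):

```python
def r_exponential(t, s):
    # R = 0.9 ** (t / S)
    return 0.9 ** (t / s)

def r_power(t, s):
    # R = (1 + t / (9 * S)) ** -1
    return (1 + t / (9 * s)) ** -1

s = 7.0
assert abs(r_exponential(s, s) - 0.9) < 1e-9
assert abs(r_power(s, s) - 0.9) < 1e-9
```

Both curves cross 0.9 at t = S; they only differ in how quickly retention decays far beyond that point (the power curve has a fatter tail).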

EDIT: purely for the sake of clarity and keeping things easy to understand, I think all formulas should be rewritten in such a way that parameters are always positive. For example, if you have D^w (w is a parameter) and w is always positive - good, don't change anything. If w is always negative - rewrite the formula as D^-w and keep w positive. This should make it easier to read the wiki.
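For illustration, idea 1 (optimizable grade values) could look like the following sketch; the learned grade values and parameter values here are hypothetical placeholders, not optimizer output:

```python
# Hypothetical learned grade values (in the real model these would be
# trainable parameters fitted by the optimizer, clamped to [0, 10]).
grade_values = {1: 0.8, 2: 1.7, 3: 3.0, 4: 4.9}  # Again, Hard, Good, Easy

def remap_grade(rating):
    """Replace the fixed rating 1-4 with its learned numerical value."""
    return grade_values[rating]

def initial_stability(rating, w0=1.0, w1=1.0):
    # (G - g_again) replaces (G - 1) from the original formula.
    g = remap_grade(rating)
    return w0 + w1 * (g - grade_values[1])

def initial_difficulty(rating, w2=5.0, w3=-0.5):
    # (G - g_good) replaces (G - 3) from the original formula.
    g = remap_grade(rating)
    d = w2 + w3 * (g - grade_values[3])
    return min(max(d, 1.0), 10.0)  # clamp difficulty to [1, 10]
```

Pressing "Again" still yields the baseline initial stability w0, because g - g_again is then zero, just as (G - 1) was zero in the fixed-grade formula.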

Expertium added the enhancement label on Apr 29, 2023
@user1823
Collaborator

user1823 commented Apr 30, 2023

Make grade values optimizable parameters. Currently Again=1, Hard=2, Good=3, Easy=4. I propose that we make them optimizable as well, and clamp them (although I'm not sure about the range, should it be [1, 4] or something different?). This introduces 4 new parameters.

I second this. It would allow us to account for the fact that different users use the grades differently.

  • Some users only use the Good and Again buttons.
  • Some users mostly use the Good and Again buttons but also use the Hard and the Easy buttons intermittently.
  • Some users use all the 4 buttons.

Edit: If we implement this, we can also do away with the easy bonus setting because the optimizer would automatically calculate how much the stability/difficulty/interval should be changed with the press of the Easy button.

@Expertium
Collaborator Author

Expertium commented Apr 30, 2023

Since Sherlock seems to be absent today, I decided to use my meager Python knowledge to implement number 2 myself. Thankfully, it's way easier than number 1 or number 3; I don't think I will be able to do those on my own, since both would require major changes.
[screenshot of the modified code]

Unfortunately, it doesn't seem like this worked. Both loss and RMSE are about the same, and the new parameter (power for S) itself barely changes and is very close to 1 (usually around 0.95-1.05), which means that any large deviation from 1 is not optimal. Although increasing the learning rate seems to produce better results, it's possible that increasing the learning rate would help even without the new parameter.
I did the usual procedure, 7 decks + collection, then Wilcoxon signed-rank test, with lr=5e-4 (same as the one I used when testing without any new changes, to ensure a fair comparison): log-loss is 0.1% better (which is practically 0) on average, RMSE is 0.2% better, and the p-values are much greater than 0.05, so there is no statistically significant difference in performance between the old and new algorithms.

But there is some good news: lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3 for better results.
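The "usual procedure" above relies on the Wilcoxon signed-rank test on paired per-deck losses. A rough pure-Python sketch of that test (two-sided, normal approximation; in practice scipy.stats.wilcoxon is the right tool):

```python
from statistics import NormalDist

def wilcoxon_signed_rank(xs, ys):
    """Paired Wilcoxon signed-rank test, two-sided p-value.

    xs, ys: per-deck losses from two versions of the algorithm.
    This is a sketch using the normal approximation, which is what the
    standard implementations use for larger samples.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks for ties.
    abs_sorted = sorted(abs(d) for d in diffs)
    def avg_rank(v):
        idxs = [i + 1 for i, a in enumerate(abs_sorted) if a == v]
        return sum(idxs) / len(idxs)
    w_plus = sum(avg_rank(abs(d)) for d in diffs if d > 0)
    mean = n * (n + 1) / 4
    sd = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    z = (w_plus - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

With eight paired losses (7 decks + the whole collection), a p-value above 0.05 means the difference between the two versions is not statistically significant.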

@user1823
Collaborator

lr=5e-3 generally produces better results than lr=5e-4 for many different versions of the algorithm, so the default value should be 5e-3 for better results.

Unfortunately, this is not always the case. I tried using 5e-3 with the power forgetting curve (with optimizable power) and I got the following results:

Learning Rate | Log Loss | RMSE | R-squared
5e-4 | 0.2231 | 0.0147 | 0.9234
5e-3 | 0.2225 | 0.0162 | 0.9129

Also, the intervals rose too quickly with the 5e-3 as shown below:

lr = 5e-4: [interval growth chart]

lr = 5e-3: [interval growth chart]

@Expertium
Collaborator Author

Expertium commented Apr 30, 2023

Interesting, in my case it actually helped with the power forgetting curve.
We could run 2 optimizations with different lr and then just choose whichever set of parameters minimizes the log-loss better, but that would almost double the total time needed.

@Expertium
Collaborator Author

Expertium commented Apr 30, 2023

After some trial and error, I figured out how to implement adaptive grades. Hopefully, I didn't mess anything up.
[screenshot of the modified code]

I did the standard procedure (lr=5e-4, again, for the sake of consistency with previous tests, even though 5e-3 is better), and the results are decent. Log-loss went down by 2% on average, RMSE went down by around 8% on average, and the difference in performance between this and the old algorithm is statistically significant with p<0.01. While this isn't quite as impressive as the adaptive power forgetting curve, it's progress.

It's probably too early to worry, but I'm concerned about how little log-loss changes. Even the adaptive power forgetting curve only improved log-loss by 4% on average, this one improved log-loss by 2%. I feel like we won't be able to significantly decrease it unless we do something extremely clever and extremely complicated. Interestingly, RMSE improved more than log-loss in both cases (adaptive power forgetting curve and adaptive grades).

Also, I have an idea for a non-linear transformation of difficulty, I'll test it and see if it's better than power function for difficulty.
EDIT: my transformation is weird. For some reason the optimizer barely changes the new parameters that I added. Whatever initial values I set, the "optimal" values are very close to them, and the loss isn't improving. I guess this idea didn't work out.

Also, at some point I should make a table comparing different changes.

@Expertium
Collaborator Author

Expertium commented Apr 30, 2023

[table comparing modifications against the baseline]
Here's a table with different modifications. The old FSRS algorithm is the baseline. A value less than 100% means that this modification performs better than the baseline, a value greater than 100% means that this modification performs worse than the baseline. p-value < 0.01 means that it's unlikely to be a coincidence and there really is a difference between a particular improved version and the baseline.
So far, the only modification worth implementing is adaptive (aka optimizable) grades. Power forgetting curve changes the meaning of S, so we're not going to do that, I suppose.

@Expertium
Collaborator Author

Expertium commented May 1, 2023

@L-M-Sherlock here's my spaghetti code implementation of optimizable values for Again-Hard-Good-Easy. I tested it, and it worked well (see the table above in my previous comment). It would be great if you implemented this (and there's probably a better way to implement it, I'll leave it to you) so we could all test it.

2.1 Define the model
FSRS is a time-series model for predicting memory states.

import numpy as np
import torch
from torch import nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_failure
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: again_grade_value
w[14]: hard_grade_value
w[15]: good_grade_value
w[16]: easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

        
    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_new = x.clone()
        if x_new[1] == 1:
            x_new[1] = self.w[13]
        elif x_new[1] == 2:
            x_new[1] = self.w[14]
        elif x_new[1] == 3:
            x_new[1] = self.w[15]
        elif x_new[1] == 4:
            x_new[1] = self.w[16]

   
        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_new[1] - self.w[15])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x_new[0] / s)
            new_d = d + self.w[4] * (x_new[1] - self.w[15])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current


class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    responses = line[1].split(',')
    tensor = torch.zeros(len(responses), 2)
    for li, response in enumerate(responses):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

@user1823
Collaborator

user1823 commented May 1, 2023

I tested this and got very impressive results:

Version | Log Loss | RMSE | R-squared
Original Optimizer | 0.2241 | 0.0191 | 0.8728
Power forgetting curve | 0.2231 | 0.0147 | 0.9234
Power difficulty | 0.2234 | 0.0170 | 0.9026
Optimized grades | 0.2220 | 0.0105 | 0.9632

As you can see, the optimized grades yielded the best results till now.

@user1823
Collaborator

user1823 commented May 1, 2023

Based on all the experiments we have conducted so far, I can conclude that it is best to replace as many fixed values with optimized values as possible.

Though while doing this, we will have to ensure that all the parameters remain interpretable.

@Expertium
Collaborator Author

Expertium commented May 1, 2023

There aren't any fixed values left, except for e in exponents. Everything else is either a variable or a parameter. Replacing e^(w_1·x) with w_2^(w_1·x) isn't going to do anything, you can still achieve the same result with just w_1.
So aside from making grades optimizable there's not much to do, we have to think of something completely new, like my idea to add lapses to S_forget.
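That equivalence is easy to verify numerically: w_2^(w_1·x) = e^((w_1·ln w_2)·x), so the extra base parameter can always be absorbed into w_1 (the values below are arbitrary):

```python
import math

w1, w2, x = 0.7, 2.5, 3.0
# The two-parameter form w2 ** (w1 * x) ...
two_param = w2 ** (w1 * x)
# ... collapses to the one-parameter form e ** (w1' * x) with w1' = w1 * ln(w2).
w1_prime = w1 * math.log(w2)
one_param = math.exp(w1_prime * x)
assert abs(two_param - one_param) < 1e-9
```

So the optimizer would just trade off w_2 against w_1, and the extra parameter adds nothing.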

@user1823
Collaborator

user1823 commented May 1, 2023

@Expertium,
In your implementation of adaptive grades, the calculation of the stability and the difficulty is linked because both use the same value of the grade. Though the step per rating is different for both, I think that the results can be further improved if we optimize the grades for the stability and the difficulty separately.

Here is my implementation:

import numpy as np
import torch
from torch import nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_failure
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: stability_again_grade_value
w[14]: stability_hard_grade_value
w[15]: stability_good_grade_value
w[16]: stability_easy_grade_value
w[17]: difficulty_again_grade_value
w[18]: difficulty_hard_grade_value
w[19]: difficulty_good_grade_value
w[20]: difficulty_easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

        
    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_stab = x.clone()
        x_diff = x.clone()
        if x[1] == 1:
            x_stab[1] = self.w[13]
            x_diff[1] = self.w[17]
        elif x[1] == 2:
            x_stab[1] = self.w[14]
            x_diff[1] = self.w[18]
        elif x[1] == 3:
            x_stab[1] = self.w[15]
            x_diff[1] = self.w[19]
        elif x[1] == 4:
            x_stab[1] = self.w[16]
            x_diff[1] = self.w[20]
   
        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_stab[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_diff[1] - self.w[19])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x[0] / s)
            new_d = d + self.w[4] * (x_diff[1] - self.w[19])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current

class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            w[17] = w[17].clamp(1, 10)
            w[18] = w[18].clamp(1, 10)
            w[19] = w[19].clamp(1, 10)
            w[20] = w[20].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    responses = line[1].split(',')
    tensor = torch.zeros(len(responses), 2)
    for li, response in enumerate(responses):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

Edit: Corrected the three errors found in the code.

@Expertium
Collaborator Author

This seems a bit excessive, but ok, I'll test it.

@user1823
Collaborator

user1823 commented May 1, 2023

I tested it, but the results were worse than before. Did I make a mistake in the code?

My results:

Version | Log Loss | RMSE | R-squared
Original Optimizer | 0.2241 | 0.0191 | 0.8728
Power forgetting curve | 0.2231 | 0.0147 | 0.9234
Power difficulty | 0.2234 | 0.0170 | 0.9026
Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632
Optimized grades (separated) | 0.2222 | 0.0159 | 0.9182

@Expertium
Collaborator Author

Expertium commented May 1, 2023

I'm also getting worse results. I think you made a mistake here

        x_new = x.clone()
        x_stab = x.clone()
        x_diff = x.clone()
        if x_new[1] == 1:
          x_stab[1] = self.w[13]
          x_diff[1] = self.w[17]
        elif x_new[1] == 2:
            x_stab[1] = self.w[14]
            x_diff[1] = self.w[18]
        elif x_stab[1] == 3:
            x_stab[1] = self.w[15]
            x_diff[1] = self.w[19]
        elif x_new[1] == 4:
            x_stab[1] = self.w[16]
            x_diff[1] = self.w[20]

elif x_stab[1] == 3: should be elif x_new[1] == 3:

@user1823
Collaborator

user1823 commented May 1, 2023

I'm also getting worse results. I think you made a mistake here
elif x_stab[1] == 3: should be elif x_new[1] == 3:

Oh! You are right.

@Expertium
Collaborator Author

I'm still getting worse results, no idea why

@user1823
Collaborator

user1823 commented May 1, 2023

elif x_stab[1] == 3: should be elif x_new[1] == 3:

Even after making the change, the result is the same. In retrospect, this makes sense because the initial value of x_new and x_stab is the same. So, the above error wouldn't have impacted the results.

I think that there is another undiscovered mistake in my code. Otherwise, I wouldn't expect the results to be worse than before.

Edit: I found the error.

            # recall
            if x_new[1] != self.w[13]:
                new_s = s * (1 + torch.exp(self.w[6]) *

Here, x_new[1] should be replaced with x_stab[1].

Edit 2: Now, I think that this line should be replaced with if x[1] > 1: (the original code). This way, we can exclude the manual rating also.

@user1823
Collaborator

user1823 commented May 1, 2023

I ran the optimizer again after making the correction. I got the following results:

Version | Log Loss | RMSE | R-squared
Original Optimizer | 0.2241 | 0.0191 | 0.8728
Power forgetting curve | 0.2231 | 0.0147 | 0.9234
Power difficulty | 0.2234 | 0.0170 | 0.9026
Optimized grades (unified) | 0.2220 | 0.0105 | 0.9632
Optimized grades (separated) | 0.2214 | 0.0114 | 0.9572

Note: I haven't yet made the last correction (replacing if x_new[1] != self.w[13]: with if x[1] > 1:).

Edit: I made this change, but the results were the same.

@Expertium
Collaborator Author

Note: I haven't yet made the last correction (replacing if x_new[1] != self.w[13]: with if x[1] > 1:).

Good catch, I'll correct it in my original code (a few comments above) as well.

@user1823
Collaborator

user1823 commented May 1, 2023

Change the way S is calculated after a lapse ("Again") to include the number of lapses.

@Expertium, why do you think that this should produce better results? The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.

@Expertium
Collaborator Author

The lapses are already considered during the calculation of difficulty, which affects the calculation of the stability.

I don't see that in the formulas on wiki, or in the code. Can you point to a specific formula/line of code?

@user1823
Collaborator

user1823 commented May 1, 2023

According to the following code, the difficulty increases with every again and hard rating (increases more with again rating).

https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#L49-L50

According to the following code, the greater the difficulty, the smaller is the increase in the stability.

https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#L56-L57
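Putting those two pieces together, the (11 - D) factor in the S_recall formula from the optimizer code means a higher difficulty yields a smaller stability increase (parameter values below are the illustrative defaults from init_w):

```python
import math

def stability_after_recall(d, s, r, w6=1.4, w7=-0.12, w8=0.8):
    """S_recall from the current optimizer code; illustrative parameter values."""
    return s * (1 + math.exp(w6) * (11 - d) * s ** w7 * (math.exp((1 - r) * w8) - 1))

# Same stability and retrievability, different difficulty:
easy_card = stability_after_recall(d=3.0, s=10.0, r=0.9)
hard_card = stability_after_recall(d=9.0, s=10.0, r=0.9)
# Higher difficulty -> smaller stability increase, hence shorter intervals.
```

So cards that accumulate lapses get a higher D and therefore grow their stability more slowly, which is the indirect lapse effect being discussed.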

@Expertium
Collaborator Author

According to the following code, the difficulty increases with every again and hard rating (increases more with again rating).
https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/1b4782d0db126235941d353739467fa67721df6c/reschedule.py#LL49-L50

Ah, that's what you mean. This isn't the same as adding up the total number of lapses and using the resulting number. Yes, difficulty depends on how many times you press "Again", but in a much less explicit way. In any case, in order to find out whether my idea works or not, we need Sherlock to implement it so we can test it.

@user1823
Collaborator

user1823 commented May 2, 2023

@Expertium, would you like to do the Wilcoxon signed-rank test to compare the results of the separated adaptive grades and the unified adaptive grades? The result of that test will make it easier to choose the one we would want to use.

The latest version of the code for separated adaptive grades is here: #239 (comment)

@user1823
Collaborator

user1823 commented May 2, 2023

@L-M-Sherlock, can you implement the following so that we can test it?

  1. Change the way S is calculated after a lapse ("Again") to include the number of lapses. So from this:
    S' = w_9 * D^(w_10) * S^(w_11) * e^((1 - R) * w_12)
    To this:
    S' = w_9 * D^(w_10) * S^(w_11) * e^((1 - R) * w_12) * (L + 1)^(w_13)
    Where L is the number of lapses. The +1 is needed to avoid multiplication by 0, which would result in S = 0. w_13 should be negative, otherwise more lapses will lead to higher S, which is nonsense. This introduces 1 new parameter.

@L-M-Sherlock
Member

It requires extracting the lapse feature from the revlogs. And in my view, it could cause severe Ease Hell.

@Expertium
Collaborator Author

Expertium commented May 2, 2023

Well, then implement adaptive grades as described here (or in a better way, if you know how):

2.1 Define the model
FSRS is a time-series model for predicting memory states.

import numpy as np
import torch
from torch import nn

init_w = [1, 1, 5, -0.5, -0.5, 0.2, 1.4, -0.12, 0.8, 2, -0.2, 0.2, 1, 1, 2, 3, 4]
'''
w[0]: initial_stability_for_again_answer
w[1]: initial_stability_step_per_rating
w[2]: initial_difficulty_for_good_answer
w[3]: initial_difficulty_step_per_rating
w[4]: next_difficulty_step_per_rating
w[5]: next_difficulty_reversion_to_mean_speed (used to avoid ease hell)
w[6]: next_stability_factor_after_success
w[7]: next_stability_stabilization_decay_after_success
w[8]: next_stability_retrievability_gain_after_success
w[9]: next_stability_factor_after_failure
w[10]: next_stability_difficulty_decay_after_failure
w[11]: next_stability_stability_gain_after_failure
w[12]: next_stability_retrievability_gain_after_failure
w[13]: again_grade_value
w[14]: hard_grade_value
w[15]: good_grade_value
w[16]: easy_grade_value
For more details about the parameters, please see: 
https://github.com/open-spaced-repetition/fsrs4anki/wiki/Free-Spaced-Repetition-Scheduler
'''

class FSRS(nn.Module):
    def __init__(self, w):
        super(FSRS, self).__init__()
        self.w = nn.Parameter(torch.FloatTensor(w))
        self.zero = torch.FloatTensor([0.0])

        
    def forward(self, x, s, d):
        '''
        :param x: [review interval, review response]
        :param s: stability
        :param d: difficulty
        :return:
        '''

        x_new = x.clone()
        if x_new[1] == 1:
            x_new[1] = self.w[13]
        elif x_new[1] == 2:
            x_new[1] = self.w[14]
        elif x_new[1] == 3:
            x_new[1] = self.w[15]
        elif x_new[1] == 4:
            x_new[1] = self.w[16]

   
        if torch.equal(s, self.zero):
            # first learn, init memory states
            new_s = self.w[0] + self.w[1] * (x_new[1] - self.w[13])
            new_d = self.w[2] + self.w[3] * (x_new[1] - self.w[15])
            new_d = new_d.clamp(1, 10)
        else:
            r = torch.exp(np.log(0.9) * x_new[0] / s)
            new_d = d + self.w[4] * (x_new[1] - self.w[15])
            new_d = self.mean_reversion(self.w[2], new_d)
            new_d = new_d.clamp(1, 10)

            # recall
            if x[1] > 1:
                new_s = s * (1 + torch.exp(self.w[6]) *
                             (11 - new_d) *
                             torch.pow(s, self.w[7]) *
                             (torch.exp((1 - r) * self.w[8]) - 1))
            # forget
            else:
                new_s = self.w[9] * torch.pow(new_d, self.w[10]) * torch.pow(
                    s, self.w[11]) * torch.exp((1 - r) * self.w[12])
        return new_s, new_d

    def loss(self, s, t, r):
        return - (r * np.log(0.9) * t / s + (1 - r) * torch.log(1 - torch.exp(np.log(0.9) * t / s)))

    def mean_reversion(self, init, current):
        return self.w[5] * init + (1-self.w[5]) * current


class WeightClipper(object):
    def __init__(self, frequency=1):
        self.frequency = frequency

    def __call__(self, module):
        if hasattr(module, 'w'):
            w = module.w.data
            w[0] = w[0].clamp(0.1, 10)
            w[1] = w[1].clamp(0.1, 5)
            w[2] = w[2].clamp(1, 10)
            w[3] = w[3].clamp(-5, -0.1)
            w[4] = w[4].clamp(-5, -0.1)
            w[5] = w[5].clamp(0, 0.5)
            w[6] = w[6].clamp(0, 2)
            w[7] = w[7].clamp(-0.2, -0.01)
            w[8] = w[8].clamp(0.01, 1.5)
            w[9] = w[9].clamp(0.5, 5)
            w[10] = w[10].clamp(-2, -0.01)
            w[11] = w[11].clamp(0.01, 0.9)
            w[12] = w[12].clamp(0.01, 2)
            w[13] = w[13].clamp(1, 10)
            w[14] = w[14].clamp(1, 10)
            w[15] = w[15].clamp(1, 10)
            w[16] = w[16].clamp(1, 10)
            module.w.data = w

def lineToTensor(line):
    ivl = line[0].split(',')
    responses = line[1].split(',')
    tensor = torch.zeros(len(responses), 2)
    for li, response in enumerate(responses):
        tensor[li][0] = int(ivl[li])
        tensor[li][1] = int(response)
    return tensor

@Expertium
Collaborator Author

Here's the updated table:
[updated comparison table]

Making 2 different sets of optimizable grades for S and D improved performance when compared to the old algorithm, but not when compared to just having a single set of optimizable grades. The p-value for that comparison is not shown in the table (in the table I only compare changes to the baseline), but it was much greater than 0.01, suggesting that separate grades for S and D don't improve performance.

@user1823
Collaborator

user1823 commented May 2, 2023

Here, I have tried to extract from the above comments what needs to be done now so that we don't miss anything.

  • Make the grade values optimizable parameters. Currently, Again = 1, Hard = 2, Good = 3, Easy = 4. It is proposed that we make them optimizable as well, and clamp them. The current progress on implementing this is shared by @Expertium here: [Feature Request] Sharing ideas for further improvement of the algorithm #239 (comment)
  • After the adaptive grades are implemented, we should do away with the EasyBonus setting because the optimizer would automatically calculate how much the stability/difficulty/interval should be changed with the press of the Easy button.
  • For the sake of consistency, it is suggested to either replace $w_9$ in the stability_after_lapse formula with $e^{w_9}$ or replace $e^{w_6}$ in the stability_after_recall formula with $w_6$.
  • For the sake of clarity, it is suggested that all formulas should be rewritten in such a way that parameters are always positive. For example, if you have D^w (w is a parameter) and w is always positive - good, don't change anything. If w is always negative - rewrite the formula as D^-w and keep w positive. This should make it easier to read the wiki.

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

@L-M-Sherlock Honestly, I think this is just confusing, so how about this instead:

  1. For each review, map its grade to recall, like this:
    map(lambda x: {1: 0, 2: 1, 3: 1, 4: 1}[x], grades)
  2. For each review, calculate the predicted R by FSRS and by Anki
  3. For each review, calculate the square difference between true recall (1 or 0) and predicted R. You will have 2 lists - differences for Anki and differences for FSRS
  4. Sum the differences, divide them by the number of elements in the list (aka the number of reviews), and take the square root. This is basically RMSE from the calibration graph, but for each review, no bins
  5. Output RMSE(Anki) - RMSE(FSRS)
    A positive value would indicate that Anki has a larger RMSE; a negative value would indicate that Anki has a lower RMSE
    This is what Woz does: https://supermemo.guru/wiki/R-Metric
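The five steps above could be sketched like this (the review tuples are made up for illustration, and `mean_squared_error` from the real optimizer is replaced by plain arithmetic):

```python
import math

# Hypothetical data: (grade, anki_p, fsrs_p) per review
reviews = [(1, 0.80, 0.55), (3, 0.90, 0.85), (4, 0.95, 0.92), (2, 0.70, 0.75)]

# Step 1: map grades to recall (Again -> 0, everything else -> 1)
recall = list(map(lambda x: {1: 0, 2: 1, 3: 1, 4: 1}[x], (g for g, _, _ in reviews)))

# Steps 2-3: squared differences between true recall and each prediction
sq_anki = [(y - p) ** 2 for y, (_, p, _) in zip(recall, reviews)]
sq_fsrs = [(y - p) ** 2 for y, (_, _, p) in zip(recall, reviews)]

# Step 4: RMSE over all reviews, no binning
rmse_anki = math.sqrt(sum(sq_anki) / len(sq_anki))
rmse_fsrs = math.sqrt(sum(sq_fsrs) / len(sq_fsrs))

# Step 5: positive -> Anki's RMSE is larger
print(rmse_anki - rmse_fsrs)
```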

@L-M-Sherlock
Member

I'm still not sure how to interpret this: #239 (comment)
@L-M-Sherlock, would you explain step-by-step how this is calculated, and how to read this chart?

You can read this for details:
https://supermemo.guru/wiki/Universal_metric_for_cross-comparison_of_spaced_repetition_algorithms#Algorithmic_contest:_SuperMemo_2_vs._SuperMemo_17

@Expertium
Collaborator Author

I've read it, but I'm still confused
So what do you think about implementing this: #239 (comment)?

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

@L-M-Sherlock There is an important matter that I want to discuss.
Right now, there are two main ways we could continue working on the algorithm:

  1. Implement the recall matrix: [Feature Request] Recall Matrix like in SuperMemo algorithms #271
  2. Try to solve this: [Case Analysis] Detect the outliers in data and the weakness in algorithm #275

The recall matrix is a pretty sophisticated thing, but it also promises great improvements in accuracy since it allows FSRS to correct its own bad predictions using real data. But while it will help with the underestimation of R, it won't actually solve the underlying problem. Recall matrix treats symptoms instead of treating the cause of the disease itself, if that makes sense.
Ideally, we should solve that problem first. But that could stall the progress for many months since nobody here has any good ideas (my idea to remove re-learning steps only led to a minor 10% improvement in RMSE). The recall matrix is complicated, sure, but it won't take months to implement, and I have a pretty clear vision of what the end result should look like.
So which way do you prefer? Implementing the recall matrix will be relatively fast (probably a week or less), but it won't improve theoretical formulas or data filtering. Trying to solve the underestimation of R first will improve the theoretical part of FSRS, but it will take many times longer, and we don't even know what to do.

@L-M-Sherlock
Member

I've read it, but I'm still confused
So what do you think about implementing this: #239 (comment)?

I will implement it later.

So which way do you prefer? Implementing the recall matrix will be relatively fast (probably a week or less), but it won't improve theoretical formulas or data filtering. Trying to solve the underestimation of R first will improve the theoretical part of FSRS, but it will take many times longer, and we don't even know what to do.

I don't want to include too many parameters in FSRS. Just keep it simple. Recently, I have found that even if the loss stays the same, the scheduling could still be very different, because a single user hardly has enough data in all retention ranges. Two sets of parameters could both fit the current data well but make very different predictions outside those ranges.

As I mentioned in another issue, I plan to clamp w[5] (difficulty reversion) to alleviate ease hell. I have also clamped the S decay in the latest optimizer.

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

I don't want to include too many parameters in FSRS.

If that's your concern, rest assured, the recall matrix only adds two more optimizable parameters.

@L-M-Sherlock
Member

L-M-Sherlock commented Jun 5, 2023

5. Output RMSE(Anki) - RMSE(FSRS)
A positive value would indicate that Anki has a larger RMSE, negative value would indicate that Anki has a lower RMSE
This is what Woz does: https://supermemo.guru/wiki/R-Metric

image
R_Metric = math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['anki_p'])) - math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['p']))
print("R_Metric: ", R_Metric)

Is this what you need?

If that's your concern, rest assured, the recall matrix only adds two more optimizable parameters.

In my view, the values in the matrix are also parameters.

@Expertium
Collaborator Author

In my view, the values in the matrix are also parameters.

But they're not. Please, read the full document in #271

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

image

R_Metric = math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['anki_p'])) - math.sqrt(mean_squared_error(cross_comparison['y'], cross_comparison['p']))
print("R_Metric: ", R_Metric)

Is this what you need?

I think so, the formula looks right. But the difference is very small, only around 1.4%, it seems. That's strange.

@L-M-Sherlock
Member

But they're not. Please, read the full document in #271

I have read it.

We need to store a matrix with 12 000 entries

The matrix is used to predict the R, right? That's why I think it means the values in the matrix are parameters.

@L-M-Sherlock
Member

So, my suggestion would be to leave the older calibration graph as it is and add FSRS calibration curve to the graph you just created (Anki calibration curve).

Do you mean this?

image

@Expertium
Collaborator Author

But they're not. Please, read the full document in #271

I have read it.

We need to store a matrix with 12 000 entries

The matrix is used to predict the R, right? That's why I think it means the values in the matrix are parameters.

Well, they're not parameters in the same sense as these: init_w = [1, 1, 5, 0.5, 0.5, 0.2, 1.4, 0.2, 0.8, 2, 0.2, 0.2, 1]. Matrix entries are not modified by the optimizer.
Putting semantics aside, this is what Woz uses (albeit the exact implementation is different). If it's good enough for Woz, it's good enough for us. I understand if you want to address other issues first and implement the recall matrix later, but if you don't want to implement the recall matrix at all, ever, then I don't know what to say. To me it just seems like a bad decision.

@user1823
Collaborator

user1823 commented Jun 5, 2023

So, my suggestion would be to leave the older calibration graph as it is and add FSRS calibration curve to the graph you just created (Anki calibration curve).

Do you mean this?

image

I wanted you to superimpose the two calibration curves on each other. Just use different colors for each and remove the Weighted Least Squares Regression (to prevent the graph from getting cluttered).

@L-M-Sherlock
Member

L-M-Sherlock commented Jun 5, 2023

I've read it, but I'm still confused

Let me try to explain it.

  1. For each review, calculate the predicted R by FSRS and by Anki
  2. Use the predicted R by FSRS to bin the data and calculate the B-W metric in each bin for predicted R by Anki
  3. Use the predicted R by Anki to bin the data and calculate the B-W metric in each bin for predicted R by FSRS
  4. Draw the B-W metric in the graph

So, if a model is perfect, whenever another algorithm bins the data, the B-W metric will always be zero in all bins. The cross-comparison is helpful when the cheating algorithm puts all data into one bin and predicts the average retention in the entire collection.
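A minimal sketch of those four steps, assuming the B-W ("bet minus win") metric in a bin is mean predicted R minus mean observed recall; the data is made up, but the column names (`y`, `p`, `anki_p`) follow the optimizer's convention:

```python
import pandas as pd

# Illustrative data: observed recall plus each algorithm's predicted R
df = pd.DataFrame({
    'y':      [1, 0, 1, 1, 0, 1],
    'p':      [0.9, 0.6, 0.8, 0.9, 0.5, 0.7],   # FSRS
    'anki_p': [0.8, 0.8, 0.8, 0.9, 0.6, 0.8],   # Anki
})

# Bin by FSRS's prediction, score Anki's predictions in each bin
df['fsrs_bin'] = df['p'].round(1)
by_fsrs = df.groupby('fsrs_bin')
anki_bw = by_fsrs['anki_p'].mean() - by_fsrs['y'].mean()

# Bin by Anki's prediction, score FSRS's predictions in each bin
df['anki_bin'] = df['anki_p'].round(1)
by_anki = df.groupby('anki_bin')
fsrs_bw = by_anki['p'].mean() - by_anki['y'].mean()

print(anki_bw)  # B-W > 0 in a bin means Anki overestimated R there
print(fsrs_bw)
```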

@L-M-Sherlock
Member

I wanted you to superimpose the two calibration curves on each other. Just use different colors for each and remove the Weighted Least Squares Regression (to prevent the graph from getting cluttered).

OK. But what's the x-axis in the new calibration? The predicted R by Anki or FSRS?

@user1823
Collaborator

user1823 commented Jun 5, 2023

I wanted you to superimpose the two calibration curves on each other. Just use different colors for each and remove the Weighted Least Squares Regression (to prevent the graph from getting cluttered).

OK. But what's the x-axis in the new calibration? The predicted R by Anki or FSRS?

Well, that's something I didn't consider and I have no good answer to this. If you or Expertium don't have a good idea either, we can drop this idea.

@Expertium
Collaborator Author

I don't have a good idea either

@Expertium
Collaborator Author

So, @L-M-Sherlock, just to be perfectly clear, are you saying that you will implement the recall matrix at some point in the future, after dealing with other issues, or are you saying that you will never implement it?

@L-M-Sherlock
Member

L-M-Sherlock commented Jun 5, 2023

So, @L-M-Sherlock, just to be perfectly clear, are you saying that you will implement the recall matrix at some point in the future, after dealing with other issues, or are you saying that you will never implement it?

It is a pragmatic idea, and I have replied in your issue.

@Expertium
Collaborator Author

We haven't done a proper comparison between the exponential function and the power function with the new optimizer.
@L-M-Sherlock, it would be great if you made a copy of fsrs4anki_optimizer_alpha.ipynb, but replaced the power function with the exponential function.
Replace this:

def pow_forgetting_curve(t, s):
    return (1 + t / (9 * s)) ** -1

def next_interval(stability, retention):
    return max(1,round(9 * stability * (1/retention - 1)))
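For reference, the exponential counterparts could look something like this; this is a sketch, assuming the same convention as the power curve above (R = 90% when t = S), not necessarily the exact code from earlier FSRS versions:

```python
import math

def exp_forgetting_curve(t, s):
    # R = 0.9^(t/S): retention decays exponentially, hitting 90% at t = S
    return 0.9 ** (t / s)

def next_interval(stability, retention):
    # Invert the curve: solve 0.9^(t/S) = retention for t
    return max(1, round(stability * math.log(retention) / math.log(0.9)))
```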

I want to do a fair comparison.

@Expertium
Collaborator Author

image
You have a typo: it's B-W, not B-M. I believe the typo is in the code as well.

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

I'm still trying to figure out how to interpret the cross-comparison.
So I tried it with the parametrized curve because it gives me the lowest RMSE, and I wanted to see what the best-performing version vs. Anki looks like:
image

It still makes no sense to me. Then I decided to do "Anki by Anki" and "FSRS by FSRS", like this (I also added a little extra code to make the chart look better):

import matplotlib.pyplot as plt

plt.figure()
ax = plt.gca()
ax.set_xlim([0, 1])

from matplotlib.ticker import MultipleLocator
ax.xaxis.set_major_locator(MultipleLocator(0.1))
ax.yaxis.set_major_locator(MultipleLocator(0.1))

cross_comparison['Anki_bin'] = cross_comparison['anki_p'].map(lambda x: round(2 * x, 1) / 2)
cross_comparison['FSRS_bin'] = cross_comparison['p'].map(lambda x: round(2 * x, 1) / 2)

cross_comparison_group = cross_comparison.groupby(by='Anki_bin').agg('mean')
plt.plot(cross_comparison_group['Anki_B-W'], label='Anki')

cross_comparison_group = cross_comparison.groupby(by='FSRS_bin').agg('mean')
plt.plot(cross_comparison_group['FSRS_B-W'], label='FSRS')

plt.axhline(y = 0.0, color = 'black', linestyle = '-')
plt.legend(loc='upper center')
plt.grid(linestyle='--')
plt.title("Anki vs. FSRS")
plt.xlabel('Predicted R')
plt.ylabel('B-W Metric')
plt.show()

Here is the result:
image

This finally has an interpretation that I can understand: FSRS has a B-W metric close to zero for its own predictions of R between 65% and 100%, and then it starts underestimating R (B-W < 0, meaning that the user is "winning" more than the algorithm is "betting"). Meanwhile, Anki has B-W > 0 for its own predictions of R between 85% and 100%, meaning it overestimates R, and then Anki starts having B-W below zero for smaller values of R, meaning that it starts underestimating R.
Ok, this part is clear. But then what does it mean when we switch bins? What does it mean when instead of using Anki bins for Anki B-W Metric and FSRS bins for FSRS B-W, we switch them?
Without switching, the interpretation is pretty straightforward: the distribution of errors for different levels of predicted R; below the zero line higher = better, lower = worse (so FSRS curve being above the Anki curve is good), above the zero line higher = worse, lower = better (so FSRS curve being below the Anki curve is, once again, good). With switching, I have no idea how to interpret it.

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

image

This is without switching (I'm still using the parametrized curve) - Anki bins for Anki B-W, FSRS bins for FSRS B-W

@Expertium
Collaborator Author

I know that "remove it because I don't understand it" is pretty dumb, but I think that my version (see code 2 comments above) is much easier to interpret and we should keep it.

@L-M-Sherlock
Member

Ok, this part is clear. But then what does it mean when we switch bins? What does it mean when instead of using Anki bins for Anki B-W Metric and FSRS bins for FSRS B-W, we switch them?
Without switching, the interpretation is pretty straightforward: the distribution of errors for different levels of predicted R; below the zero line higher = better, lower = worse (so FSRS curve being above the Anki curve is good), above the zero line higher = worse, lower = better (so FSRS curve being below the Anki curve is, once again, good). With switching, I have no idea how to interpret it.

In the case of FSRS bins for Anki B-W, FSRS will classify all data into 11 bins, from R = 0 to R = 1. Then, in each bin, we calculate the average prediction of Anki, the real retention, and the B-W metric for Anki. If the B-W metric > 0, it means Anki overestimated R, and vice versa.

Why did Woz introduce cross-comparison? I think it is used to avoid cheating.

For a population of uniform repetitions, any algorithm can cheat the perfect metric by guessing the correct average recall and betting on that recall at each prediction. If we cannot discriminate between populations of repetitions, we cannot discriminate between good and bad algorithms.
From https://supermemo.guru/wiki/Universal_metric_for_cross-comparison_of_spaced_repetition_algorithms#Cheating_the_perfect_metric

As we know, the repetitions are unlikely to be uniform. But a cheating algorithm could classify all repetitions into one uniform population and guess the average recall. In cross-comparison, we use another algorithm to split the population, which defeats this cheating method.

For example, assume there are three populations in the review data, with true R values of 0.7, 0.8, and 0.9 respectively. A cheating algorithm could achieve the best score, a B-W metric of 0, by always predicting 0.8. We wouldn't be able to detect this with the original calibration chart, since we use the same algorithm's predictions to divide the population and then calculate the B-W metric on each subpopulation.
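That toy scenario can be checked numerically. This simulation is made up for illustration (the "honest" algorithm is simply assumed to predict the true R exactly), not code from the optimizer:

```python
import random

random.seed(42)

# Three populations with true R = 0.7, 0.8, 0.9; the cheater always bets 0.8
data = []  # (honest_p, cheat_p, recalled)
for true_r in (0.7, 0.8, 0.9):
    for _ in range(10000):
        data.append((true_r, 0.8, 1 if random.random() < true_r else 0))

# Cheater binned by its own prediction: a single bin, B-W ~ 0, looks perfect
ys = [y for _, _, y in data]
print(0.8 - sum(ys) / len(ys))  # close to 0

# Cheater binned by the honest algorithm's predictions: the errors show up
for r in (0.7, 0.8, 0.9):
    bin_ys = [y for p, _, y in data if p == r]
    print(r, 0.8 - sum(bin_ys) / len(bin_ys))  # roughly +0.1, 0, -0.1
```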

@L-M-Sherlock
Member

This issue has become too long. Every time I want to check the message history, I need to click the load-more button and wait for it to load. Worse, the loading goes from top to bottom. I think we should summarize the current issue and open a new issue to continue.

image

@L-M-Sherlock
Member

Continue in #282
