
[Feature Request] Ideas to further improve the accuracy of the algorithm #564

Closed
Expertium opened this issue Dec 18, 2023 · 9 comments
Labels: enhancement (New feature or request)

@Expertium (Collaborator) commented Dec 18, 2023

A continuation of #461.
I have changed the formula that maps R to grades again:

    import numpy as np
    from scipy.optimize import curve_fit

    def func(x, a, b, c, d):
        t = (x - 1) / 3
        f = 1 - np.power(1 - np.power(t, a), b)
        return d * f + c * (1 - f)

    params, covs = curve_fit(
        func,
        grade_means,
        r_means,
        bounds=((0.15, 0.15, 0, 0.75), (6, 6, 0.5, 1)),
        sigma=1 / np.sqrt(count),
        p0=(1, 2.5, 0, 1),
    )
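As a standalone sanity check of this mapping (func repeated here, with illustrative rather than fitted parameter values), the endpoints collapse to the bounds: grade 1 maps to exactly c and grade 4 maps to exactly d.

```python
# Standalone check of the grade -> R mapping. Parameter values below are
# illustrative (within the stated bounds), not fitted values.
import numpy as np

def func(x, a, b, c, d):
    t = (x - 1) / 3
    f = 1 - np.power(1 - np.power(t, a), b)
    return d * f + c * (1 - f)

a, b, c, d = 1.0, 2.5, 0.2, 0.95
# At grade 1, t = 0 so f = 0 and the result is c; at grade 4, t = 1 so f = 1
# and the result is d. Intermediate grades interpolate between the two.
print([round(float(func(g, a, b, c, d)), 3) for g in (1, 2, 3, 4)])
# → [0.2, 0.678, 0.902, 0.95]
```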

Now the mapped R values for all four grades can lie strictly between 0 and 1. In the initial version (the one here), R(Again)=0 and R(Easy)=1. Later I changed it to allow R(Again)>0, and now I have changed it to allow R(Easy)<1. I also changed the surprise function:

    # condition: whether the review was actually recalled (True/False)
    binary_r = torch.where(condition, 1, 0)
    # surprise measured against the binary outcome vs. against the grade-derived R
    wozloss_binary = -torch.log(1 - torch.abs(r - binary_r))
    wozloss_r_g = -torch.log(1 - torch.abs(r - grade_derived_r))
    surprise = self.w[17] * wozloss_r_g + (1 - self.w[17]) * wozloss_binary
    surprise = surprise.clamp(0, 10)
    new_d = state[:, 1] - self.w[6] * (X[:, 1] - 2.5) * surprise

Now the loss is a mix of two terms: one measuring the deviation of the theoretical R from the binary recall outcome, and one measuring its deviation from the grade-derived R.
While it did technically improve RMSE, the improvement was negligible.
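For concreteness, here is a minimal numpy rendering of that blended loss (the original code above is PyTorch; all values below are made up for illustration):

```python
# Numpy sketch of the blended "surprise" loss. Values are invented:
# two reviews, the first recalled, the second forgotten.
import numpy as np

r = np.array([0.9, 0.7])                  # theoretical R at review time
recalled = np.array([1.0, 0.0])           # actual binary outcome
grade_derived_r = np.array([0.95, 0.3])   # R implied by the grade mapping
w17 = 0.5                                 # mixing weight (a learned parameter)

loss_binary = -np.log(1 - np.abs(r - recalled))
loss_grade = -np.log(1 - np.abs(r - grade_derived_r))
surprise = np.clip(w17 * loss_grade + (1 - w17) * loss_binary, 0, 10)
print(np.round(surprise, 3))
# → [0.078 0.857]
```

The forgotten review produces a much larger surprise, so it moves difficulty more, which is the intended behavior of the D update.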

@user1823 I suggest closing this issue: #352. At this point it's pretty clear that neither I, nor you, nor Sherlock will be able to improve the formula for D.

Expertium added the enhancement label on Dec 18, 2023
@Expertium (Collaborator, Author)

Actually, on second thought, I should've just kept the old issue. I'm out of ideas (except for one more), so I will probably close this issue later today, once I test my final idea.
If that idea works, then we can release FSRS v5, since it will require one more parameter. FSRS v5 would have a flatter curve, new pretrain, and that one more thing.
If it doesn't work, then there will be no new parameters to add, and instead the new version should be called v4.5, not v5.

@user1823 (Collaborator)

> @user1823 I suggest closing this issue: #352. At this point it's pretty clear that neither I, nor you, nor Sherlock will be able to improve the formula for D.

I don't think that the issue should be closed yet. Jarrett can actually improve the formula if he gets enough time to do proper research (like he did when he first developed the algorithm). In addition, the issue also attracts the attention of other people who might be able to help. I am not saying that any of this will happen soon, but it can happen eventually.

@Expertium (Collaborator, Author) commented Dec 18, 2023

I think our best bet at improving D is finding a computationally efficient way to implement best-fit D, since that would give us a "true" value of D, in the sense that it is the value that minimizes the deviation between predicted R and the actual outcomes of reviews. Any other method would just be an approximation of that.
Unfortunately, Sherlock couldn't find an efficient way to implement it.
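To make the idea of "best-fit D" concrete, here is a toy, deliberately oversimplified sketch (not the actual FSRS optimizer). It assumes the FSRS-4 forgetting curve R(t, S) = (1 + t/(9S))^-1 and a made-up stability update that merely echoes the (11 - D) factor from the real SInc formula, then searches for the D in [1, 10] that minimizes the log-loss over one card's review history:

```python
# Toy illustration of "best-fit D" (hypothetical; not the actual FSRS code).
# Assumptions: the FSRS-4 forgetting curve R(t, S) = (1 + t / (9 * S)) ** -1,
# and a deliberately simplified stability update S *= 1 + 0.1 * (11 - D),
# which only echoes the (11 - D) factor in the real SInc formula.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(d, reviews):
    """Log-loss of predicted R against actual outcomes for one card."""
    s = 1.0  # initial stability (days), arbitrary for this sketch
    nll = 0.0
    for elapsed, was_recalled in reviews:
        r = (1 + elapsed / (9 * s)) ** -1
        r = min(max(r, 1e-6), 1 - 1e-6)  # keep log() finite
        nll -= np.log(r) if was_recalled else np.log(1 - r)
        s *= 1 + 0.1 * (11 - d)  # toy update: lower D -> faster S growth
    return nll

def best_fit_d(reviews):
    """The D in [1, 10] that minimizes the log-loss for this history."""
    res = minimize_scalar(neg_log_likelihood, bounds=(1, 10),
                          method="bounded", args=(reviews,))
    return res.x

# Example: a card recalled at ever-growing intervals is consistent with
# low difficulty, so the fit is driven toward the bottom of the range.
history = [(1, True), (3, True), (7, True), (15, True), (30, True)]
print(round(best_fit_d(history), 2))
```

The computational problem hinted at above is that doing this per card, jointly with fitting all the other parameters, is far more expensive than the closed-form D update.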

@Expertium (Collaborator, Author) commented Dec 18, 2023

I tested another idea.
In an ideal world, users would never misuse Hard and would always use it as a passing grade. In the real world, users often use Hard as a failing grade, and FSRS has to adapt to that somehow. At the same time, we cannot allow SInc to decrease too much, since that could create a new type of Ease Hell:

  1. User presses Hard
  2. SInc<1, so stability (and the next interval) decreases
  3. Spacing effect becomes weaker
  4. Memory doesn’t become more stable
  5. User doesn’t remember this material well
  6. User presses Hard

And so the vicious cycle continues. One way to prevent that is to ensure that SInc >= 1. But since we know that users misuse Hard, if we want to improve accuracy, we have to allow SInc for Hard to be less than 1.
So how do we do that without creating this new Ease Hell?

    def stability_after_success(self, state: Tensor, r: Tensor, rating: Tensor) -> Tensor:
        hard_penalty = torch.where(rating == 2, self.w[15], 1)
        easy_bonus = torch.where(rating == 4, self.w[16], 1)
        # Dynamic lower bound on SInc: close to 1 for small S, smaller for large S.
        # This avoids a loop where users press "Hard", intervals shrink, the
        # spacing effect weakens, memories weaken, and users press "Hard" again,
        # while still allowing SInc < 1 overall, to improve accuracy and account
        # for users who use Hard as a failing grade.
        lower_bound = torch.pow(1 + state[:, 0], -self.w[17])
        SInc = hard_penalty * (
            1
            + torch.exp(self.w[8])
            * (11 - state[:, 1])
            * torch.pow(state[:, 0], -self.w[9])
            * (torch.exp((1 - r) * self.w[10]) - 1)
            * easy_bonus
        )
        SInc = torch.maximum(SInc, lower_bound)
        return state[:, 0] * SInc

Elsewhere, during optimization, the new parameter is clamped:

    w[17] = w[17].clamp(0.001, 0.065)

My answer: instead of choosing a constant lower bound for SInc, we choose a dynamic lower bound that depends on S itself. For low values of S, this lower bound will be only marginally less than 1, for example, 0.95. For high values of S it can be as low as 0.5.
This way we get the best of both worlds: cards with low stabilities won’t make the user drown in an endless sea of reviews while cards with high stabilities can experience a decrease in stability when the user presses Hard, thus improving the accuracy of the algorithm. Note that this doesn’t mean that SInc for Hard is always <1 in the new version. It can be >1 too.
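Plugging in numbers (with w[17] at its upper clamp of 0.065) shows exactly this behavior:

```python
# Numeric check of the dynamic lower bound: lower_bound = (1 + S) ** -w17.
# With w17 at its upper clamp of 0.065, low-stability cards keep SInc near 1,
# while very high-stability cards may drop to roughly 0.5.
w17 = 0.065
for s in (1, 10, 100, 10_000):
    print(s, round((1 + s) ** -w17, 3))
```

So a card with S = 1 day can lose at most about 4% of its stability from a Hard press, while a card with S = 10,000 days can lose up to about 45%.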
It didn't improve RMSE.

@Expertium (Collaborator, Author)

I do not plan to work on improving the accuracy any further because I’m out of ideas, even the fringe ones. So unless Sherlock himself wants to try to improve accuracy, this will be the final version for at least a year, possibly longer.

@Expertium (Collaborator, Author) commented Dec 18, 2023

@L-M-Sherlock I recommend implementing FSRS v4.5 with two modifications: new curve and new pretrain for S0, but without same-day reviews. They just aren't worth it.

  1. Implement v4.5
  2. Benchmark it. Note that it should be a new entry, not just v4
  3. Give the new default parameters to Dae

@simias commented Jan 7, 2024

First of all I want to thank all contributors for this new SRS algorithm, it massively improved my Anki experience so far.

Regarding this particular issue, I wonder if the root of the issue is not the algorithm itself, but the UI. What I mean by that is that the notion of what "Hard", "Good" and "Easy" mean is rather subjective and fluid. I expect a lot of variation from person to person, and from personal experience even from deck to deck and from day to day.

Depending on the general deck difficulty I may be a lot more liberal with my choice of using "Hard" instead of "Again" for instance. I also noticed that if I had a streak of "again" reviews I'm more likely to answer "Hard" for a card instead of "Again" if it was a "near miss" just because I need a win, so to speak. This was especially true when I used SM2 because of how punishing "Again" was there (full progress reset + ease penalty by default). FSRS fortunately improves massively on this.

I also struggle to decide when to use Good and when to use Easy. Surely if it's not hard then it's Easy? In the end I often end up "meta gaming" and using the presented intervals to decide, if the interval is like 7d/15d I'm more likely to go "easy" because the next review will be soon enough anyway, if on the other hand it's like 1y/2y I almost never go "Easy" because those intervals feel really huge and I don't even know if I'll still be using this deck two years from now.

So in other words, I wonder if finding an algorithmic trick to make Hard and Easy work "correctly" may be impossible, because the dataset itself is just bad: everybody has their own recipe for what Hard and Easy truly mean, and those recipes were probably created in part as a way to work around SM-2 issues (ease hell, etc.). I really find myself using Hard/Again differently with FSRS, and I rarely use Easy at all because the default intervals just make more sense to me.

@Expertium (Collaborator, Author)

FSRS can adapt to almost any habit, except for using Hard as a failing grade. FSRS is hard-coded to treat Again as a memory lapse and to treat Hard/Good/Easy as success. So if somebody has a habit of pressing Hard when they forgot a card, FSRS cannot adapt to that. Theoretically, it's possible to make another version of FSRS where Hard is treated as a memory lapse, optimize both versions, and then choose whichever provides a better fit. In pseudocode:

if root_mean_square_error(FSRS) > root_mean_square_error(FSRS_hard_lapse):
    algorithm = FSRS_hard_lapse
else:
    algorithm = FSRS

But this would be very complicated in practice.
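As a runnable illustration of that selection step (all predictions below are made up; in practice both variants would first be fully optimized on the user's review log):

```python
# Hypothetical model-selection step: pick whichever FSRS variant fits a
# user's history better by RMSE. The predicted-R sequences are invented.
import math

def rmse(predicted, actual):
    """Root-mean-square error between predicted R and binary outcomes."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

actual = [1, 0, 1, 1]                      # binary recall outcomes
pred_fsrs = [0.9, 0.6, 0.8, 0.7]           # R predicted by standard FSRS
pred_hard_lapse = [0.95, 0.2, 0.85, 0.9]   # R predicted by the Hard-as-lapse variant

algorithm = ("FSRS_hard_lapse"
             if rmse(pred_fsrs, actual) > rmse(pred_hard_lapse, actual)
             else "FSRS")
print(algorithm)
# → FSRS_hard_lapse
```

The complication in practice is that this doubles the optimization work and requires shipping, benchmarking, and maintaining two variants of the scheduler.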

@L-M-Sherlock (Member)

In addition, Hard is also counted as a correct review in Anki when calculating the percentage of correct answers in the stats.
