
[Feature Request] Recall Matrix like in SuperMemo algorithms #271

Closed

Expertium opened this issue May 28, 2023 · 88 comments

Labels
enhancement New feature or request

@Expertium
Collaborator

Expertium commented May 28, 2023

Which module is related to your feature request?
Scheduler, Optimizer

Is your feature request related to a problem? Please describe.
It's a way to correct our theoretical function using real data, to obtain more accurate predictions in cases where the theoretical function makes poor predictions.

Describe the solution you'd like
Recall Matrix.docx
It's a pretty long description, but that's because it's a very sophisticated feature. However, it could potentially have a greater impact on accuracy than any other change so far.

Additional context
While I tried to make the description as detailed and clear as possible, the final implementation may still end up different. Still, this is about as good a description as I can make. It's fine if this isn't your top priority right now, but I highly recommend implementing it in the foreseeable future.

@L-M-Sherlock
Member

Let's discuss this issue here.

The recall matrix should be a 3-dimensional matrix, where the dimensions are S, D, and theoretical R. Similar to the B-W heatmap, but with theoretical R as a new dimension.

Why use the theoretical R as an extra dimension? Why not just save the measured R in the matrix entry like this: R_M = RecallMatrix(D_rounded, S_rounded)

@Expertium
Collaborator Author

Because the same S can correspond to different values of R depending on t. If, for two predictions, S is the same but t is different, R will be different. Grouping them together would reduce accuracy.
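A minimal sketch of the point, assuming the exponential forgetting curve R = 0.9^(t/S) purely for illustration (any curve that depends on t behaves the same way):

# Same stability S, different elapsed time t -> different retrievability R.
def retrievability(t: float, s: float) -> float:
    return 0.9 ** (t / s)  # illustrative curve: R = 0.9 when t == S

s = 10.0
print(retrievability(5.0, s))   # ~0.949, reviewed early
print(retrievability(20.0, s))  # ~0.810, reviewed late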

@L-M-Sherlock
Member

OK. I will calculate the R-Matrix first.

@L-M-Sherlock
Member

L-M-Sherlock commented Jun 5, 2023

# Bin the predicted R (bin width 0.025), then compute the measured recall and
# the review count for every (S, D, R) cell.
B_W_Metric_raw['r_bin'] = B_W_Metric_raw['p'].map(lambda x: round(x * 4, 1) / 4)
R_Matrix = B_W_Metric_raw.groupby(by=['s_bin', 'd_bin', 'r_bin']).agg({'y': ['mean', 'count']}).reset_index()
R_Matrix.columns = ['S_rounded', 'D_rounded', 'R_t_rounded', 'R_measured', 'count']
R_Matrix[R_Matrix['count'] > 300]  # keep only cells backed by enough reviews

Paste it after the B-W-Metric cell. Is this what you need?

@L-M-Sherlock
Member

DiffMax = max(abs(R_Matrix['R_measured'] - R_Matrix['R_t_rounded']))  # largest gap between measured and predicted R

@Expertium
Collaborator Author

Expertium commented Jun 5, 2023

It doesn't show anything, other than an indicator that the code has been executed successfully.

@L-M-Sherlock
Member

[screenshot]

@Expertium
Collaborator Author

Yes, looks good.

@Expertium
Collaborator Author

I know it's too early to think about it, but I thought about whether we should

  1. Allow users to have different sets of parameters for different decks and different recall matrices for each set
  2. Allow users to have different sets of parameters for different decks but use the same recall matrix
  3. Allow users to only have 1 set of parameters and 1 recall matrix, and remove the ability to specify different sets of parameters for different decks

I think 2 is the best option. If we assign a different recall matrix to each different set of parameters, those matrices will be very sparse. For example, suppose that the entire collection has 100 000 reviews, but the user decides to add a new set of parameters for a deck that only has 3 000 reviews. The recall matrix for the entire collection will have a lot of information and a lot of entries, whereas the recall matrix for that one deck will have very few entries with low counts, making it less useful. The more data, the better. There is no upside to using a sparse recall matrix based on little info.
At the same time, not allowing users to have different parameters for different decks would reduce the flexibility of FSRS. So yeah, 2 seems like the best option.

@Expertium
Collaborator Author

@L-M-Sherlock seems like it's finally time to start working on this? I believe all other important issues have been resolved.

@L-M-Sherlock
Member

There are still some bugs in the Helper add-on. I need to deal with them first.

@Expertium
Collaborator Author

Expertium commented Jun 19, 2023

Some more thoughts about the R-matrix:

  1. While originally I proposed this weighting system
    [image: original weighting formula]
    Now I think that this will do better
    [image: new weighting formula]
    The problem with the original idea is that DiffMax is always going to be either 1 or close to 1. So instead, let's use w_1 to decide the maximum weight that can be assigned to a matrix entry (see the sketch after this list). We can test both, but I'm willing to bet that the idea of weighting both by the number of entries and by the difference between R_t and R_M will perform worse. It's mostly a gut feeling.
  2. Keep in mind that the R-Matrix that will be stored on the user's PC has to be modified after each review; it's not read-only.
  3. Here's the tricky thing: R-Matrix has to be fully recalculated when the user changes any set of parameters. Example:
    Suppose the user has 2 sets of parameters (for different decks): A and B. He recalculated B and is now using B'. Now the correspondence between theoretical and measured R has changed, and the matrix has to be recalculated because of that. And all entries have to be recalculated since we don't keep track of which review from which deck has contributed to which entry. If an entry has x reviews, we don't know how many of them came from a deck that uses set A and how many came from a deck that uses set B (or B').
  4. During the optimization we will probably have to clear the R-matrix after each epoch to avoid "double counting" reviews. Otherwise each review will be counted n times, where n is the number of epochs.
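The formulas above were posted as images; a plausible reconstruction of the second scheme, based on the surrounding discussion (w1 caps the weight an entry can get, clamped below 0.99; w2 is a hypothetical parameter controlling how fast trust grows with the review count):

def weighted_r(r_theory: float, r_matrix_entry: float, count: int,
               w1: float, w2: float) -> float:
    # The matrix entry earns more trust as its review count grows,
    # but its weight never exceeds w1.
    weight = min(w1, 0.99) * count / (count + w2)
    return weight * r_matrix_entry + (1 - weight) * r_theory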

@L-M-Sherlock
Member

Sorry for the late reply. I'm busy attending an academic conference in a few days, and the feature requests are overflowing. Bug fixing is the higher priority. I will focus on the development of the R-matrix after July 3.

@L-M-Sherlock
Member

I have a question: how to optimize $w_1$ and $w_2$?

@Expertium
Collaborator Author

Expertium commented Jun 29, 2023

Oh yeah, about that. I believe we should completely change our loss function. I've seen this problem multiple times: changing a formula decreased log-loss by only around 1%, but decreased RMSE by 5%/10%/20%.
I think we should minimize the difference between theoretical R and its corresponding R-matrix entry. In other words, the new loss will be WozLoss(Rt, Rm), or -ln(1-abs(Rt-Rm))
Although implementing that in the optimizer will be tricky. You will also have to add weighting by the number of reviews that the matrix entry is based on.

import torch
from torch import nn

class WozLoss(nn.Module):
    # Loss = -ln(1 - |Rt - Rm|), clamped so it stays finite as the gap approaches 1.

    def __init__(self):
        super(WozLoss, self).__init__()

    def forward(self, predictions, labels):
        loss = -torch.log(1 - torch.abs(predictions - labels))
        loss = loss.clamp(0, 100)
        return loss.sum()

# inside the trainer's __init__:
self.loss_fn = WozLoss()

@Expertium
Collaborator Author

Expertium commented Jun 29, 2023

Wait, I just realized: we can't optimize w1 and w2 that way, since they are used in weighted R, and the loss above doesn't use weighted R.
OK, forget what I said above, then. Let's not change the loss function for now. Just make the algorithm output Rw instead of Rt whenever it predicts R.
By the way, once the R-matrix is done, every single calculation involving R (in the optimizer, scheduler and helper) will have to use weighted R. That will involve changing a lot of the code, especially for the Helper add-on.
So to answer your question - w1 and w2 will be optimized the same way as all the other parameters are optimized right now.

@L-M-Sherlock
Member

Before we can use the R-Matrix, we have to train the parameters of FSRS first. How can we optimize w1 and w2 in the same way as the other parameters at the same time?

@Expertium
Collaborator Author

I don't see why we can't optimize them simultaneously. For each epoch, we build the R-matrix, calculate weighted R and use it in the loss function, then clear the R-matrix and repeat for the next epoch.
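A self-contained toy sketch of that loop; the model, data, and binning here are all stand-ins, not the optimizer's real code:

import torch

torch.manual_seed(0)
s = torch.rand(1000) * 10 + 0.1          # toy stabilities
t = torch.rand(1000) * 10                # toy elapsed times
y = torch.bernoulli(0.9 ** (t / s))      # toy recall outcomes

w = torch.tensor([1.0, 0.5, 2.0], requires_grad=True)  # [model param, w1, w2]
optimizer = torch.optim.Adam([w], lr=0.01)

for epoch in range(5):
    # 1. Build the R-matrix from the previous epoch's parameters (no gradients).
    with torch.no_grad():
        r_t = 0.9 ** (t / (s * w[0]))
        bins = (r_t * 10).long().clamp(0, 9)
        counts = torch.zeros(10)
        r_m = torch.zeros(10)
        for b in range(10):
            mask = bins == b
            counts[b] = mask.sum()
            r_m[b] = y[mask].mean() if mask.any() else 0.0
    # 2. Train on weighted R; the matrix is cleared and rebuilt next epoch.
    r_t = 0.9 ** (t / (s * w[0]))
    weight = w[1].clamp(0, 0.99) * counts[bins] / (counts[bins] + w[2].abs())
    r_w = weight * r_m[bins] + (1 - weight) * r_t
    loss = torch.nn.functional.binary_cross_entropy(r_w.clamp(1e-4, 1 - 1e-4), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()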

@L-M-Sherlock
Member

I don't see why we can't optimize them simultaneously. For each epoch, we build the R-matrix, calculate weighted R and use it in the loss function, then clear the R-matrix and repeat for the next epoch.

Which parameters should we use to generate the R-matrix? The parameters at the start of the epoch?

@Expertium
Collaborator Author

That's a good question; I hadn't thought about that. I think using the parameters from the previous epoch (init_w if it's the first epoch) is reasonable.

@user1823
Collaborator

@Expertium, what if the weight comes out to be very close to 1?

I think that the optimizer would try to do this because when weight is equal to 1, the loss becomes very small.

@Expertium
Collaborator Author

Expertium commented Jun 30, 2023

w1 determines the maximum weight, and we will let the optimizer choose it. We'll see if that leads to such issues; for now it's hard to predict how it will behave. And even if it sets w1 to 0.99 (that's how I plan to clamp w1), I don't think that's bad, since the weight will be 0.99 only for entries based on a lot of reviews; in other words, it will be 0.99 for R that is estimated very accurately, which is exactly what we want.
If the optimizer sets both w1 and w2 to their maximum values, yes, that will be a problem. We'll see how it goes; for now we don't know how it will behave.

@user1823
Collaborator

I thought about it more thoroughly and realized that the model would still need to accurately classify cards by their stability and difficulty in order to reap the benefits of the R-matrix. So there shouldn't be any major problem. But let's see how it goes.

@Expertium
Collaborator Author

Expertium commented Jul 4, 2023

  1. I would advise you to add this code to 4.0 alpha, since that's what I'm using as the baseline, and I want a fair comparison (same goes for the hybrid S0 curve-fit/pre-train method in the other issue).
  2. In my .docx document I provided bins for S. I think we should use those. The reason to use pre-determined bins is to avoid a circular dependency between S and the R-matrix, where the value of S depends on the R-matrix entry, which depends on the value of S, which depends on the R-matrix entry, which depends... you get the idea. (See the binning sketch at the end of this comment.)
    Here, I'll provide them in a more convenient form:

bins_S = [0.1, 0.3, 0.9, 1.3, 1.8, 2.5, 3.5, 4.9, 6.9, 9.8, 13.9, 19.7, 27.9, 39.5, 55.9, 79.1, 111.9, 158.3, 223.9,
          316.6, 447.7, 633.1, 895.3, 1266.1, 1790.5, 2532.1, 3580.9, 5064.2, 7161.9, 10128.5]

  3. We might need a few more epochs for convergence, since the R-matrix is always "lagging behind" a little, being calculated with the previous epoch's parameters.
  4. While you have implemented R_w, you haven't used it to its full potential. As described in my .docx document, R_w should be used to calculate a new estimate of the current S, and also in the formula for new_s. In other words, R_w should be calculated before this:

new_s = torch.where(condition, self.stability_after_success(state, new_d, r), self.stability_after_failure(state, new_d, r))
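A minimal sketch of mapping a raw S to those pre-determined bins (snapping to the nearest bin value is my assumption about the intended use):

import numpy as np

bins_S = np.array([0.1, 0.3, 0.9, 1.3, 1.8, 2.5, 3.5, 4.9, 6.9, 9.8, 13.9, 19.7,
                   27.9, 39.5, 55.9, 79.1, 111.9, 158.3, 223.9, 316.6, 447.7,
                   633.1, 895.3, 1266.1, 1790.5, 2532.1, 3580.9, 5064.2, 7161.9, 10128.5])

def round_s(s: float) -> float:
    # Snap a stability value to the nearest pre-determined bin.
    return float(bins_S[np.abs(bins_S - s).argmin()])

print(round_s(15.0))  # 13.9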

@L-M-Sherlock
Member

4. While you have implemented R_w, you haven't used it to its full potential. As described in my .docx document, R_w should be used to calculate a new estimate of the current S, and also in the formula for new_s.

The new estimate of the current S is not differentiable, because we calculate it by indexing the R-Matrix. Gradient descent is blocked by it.
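A minimal standalone demonstration of that problem (not the optimizer's actual code):

import torch

s = torch.tensor([2.5], requires_grad=True)
r_matrix = torch.rand(10)   # stand-in lookup table
idx = s.round().long()      # rounding + integer cast detaches from the autograd graph
r_m = r_matrix[idx]
print(r_m.requires_grad)    # False: no gradient can flow back to s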

@L-M-Sherlock
Member

  1. I would advise you to add this code to 4.0 alpha, since that's what I'm using as the baseline, and I want a fair comparison (same goes for the hybrid S0 curve-fit/pre-train method in the other issue).

The weight given to the R-Matrix has reached 0.99, which means the model structure of FSRS has become unimportant: the output of FSRS is essentially ignored. So it would give the same result no matter which baseline I implement it in.

@L-M-Sherlock
Member

2. In my .docx document I provided bins for S. I think we should use those. The reason to use pre-determined bins is to avoid a circular dependency between S and the R-matrix, where the value of S depends on the R-matrix entry, which depends on the value of S, which depends on the R-matrix entry, which depends... you get the idea.

There is no circular dependency in my current implementation, because the relation between S and the R-matrix is unidirectional.

@Expertium
Collaborator Author

Expertium commented Jul 9, 2023

[screenshot: optimization results]
Here are the results on my collection, using https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/Expt/new-baseline/candidate/R-Matrix.ipynb (the version I copied and edited myself produced different results for some reason).
The intervals are a little weird, I agree. Although yours are much weirder.
[screenshot: intervals]

@Expertium
Collaborator Author

Expertium commented Jul 9, 2023

So, Sherlock, do you have any ideas how to speed up optimization? Also, I recommend re-opening this issue.

@user1823
Collaborator

user1823 commented Jul 9, 2023

The intervals are a little weird,

Do you think that this is only a "little" weird? The intervals not increasing beyond 1.7 months is a matter of significant concern.

Also, I recommend re-opening this issue.

Reopening the issue is worthwhile only if we can think of a solution for these weird intervals.

I don't want to sound rude. I just wanted to assert that the R-matrix is not performing well, and I think the time has come to junk this idea.

@Expertium
Collaborator Author

Do you think that this is only a "little" weird? The intervals not increasing beyond 1.7 months is a matter of significant concern.

Fair. I'm just more surprised by Sherlock's results: he gets the shortest maximum interval when "Good" is the first rating. That is really puzzling.
[screenshot]

@L-M-Sherlock
Member

So, Sherlock, do you have any ideas how to speed up optimization?

The main bottleneck is querying the R-Matrix when FSRS predicts the memory states from the entire review history. If a sequence has 10 repetitions, it requires 10 queries. I have no idea how to speed it up.

@Expertium
Collaborator Author

@L-M-Sherlock let's give this another try, but in a different way.
Instead of grouping reviews by D, S, and R, we will group them by delta_t (interval length), n (number of reviews), and average grade (a simple arithmetic mean of the grades). The advantage is that such a matrix can be precomputed from the raw data; we don't need any model of memory to compute it. It will be less accurate that way, but hopefully it will have fewer issues. The weighting system can remain the same.
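A minimal pandas sketch of that precomputation; the toy data and the column names (delta_t, n, avg_grade, y) are assumptions:

import pandas as pd

# Toy review log: one row per review, y = 1 if the card was recalled.
df = pd.DataFrame({
    'delta_t': [1, 1, 3, 3, 10, 10],
    'n': [1, 2, 2, 3, 3, 4],
    'avg_grade': [3.0, 2.5, 3.0, 2.0, 3.0, 3.0],
    'y': [1, 0, 1, 1, 0, 1],
})
df['g_bin'] = (df['avg_grade'] * 2).round() / 2  # half-grade buckets, for illustration

r_matrix = (df.groupby(['delta_t', 'n', 'g_bin'])
              .agg(R_measured=('y', 'mean'), count=('y', 'size'))
              .reset_index())
print(r_matrix)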

@L-M-Sherlock
Member

Instead of grouping reviews by D, S, and R, we will group them by delta_t (interval length), n (number of reviews), and average grade (a simple arithmetic mean of the grades).

How do we index this matrix during training? delta_t and n exist for a single review, but where does the average grade come from?

@Expertium
Collaborator Author

Expertium commented Jul 12, 2023

You can take the sum of all grades, pass it along as state[:,2], and then divide the sum by the number of reviews whenever you need to calculate the average.

sum_grades = X[:,1] + state[:,2]  # accumulate the running grade sum in the extra state slot
torch.stack([new_s, new_d, sum_grades], dim=1)

Or you can use the number of lapses, if that makes things easier. If you decide to use the number of lapses, choose some maximum, like 6 or 8, such that any card with more lapses than that gets put into the same final group. Like this:
Group 1: 1 lapse
Group 2: 2 lapses
Group 3: 3-4 lapses
Group 4: 5-6 lapses
Group 5: 7 lapses
Group 6: 8 or more lapses

Grouping by n should be similar. If the number of groups were the same as the number of unique values of n, some of the groups would be very sparse, with a very small number of reviews in them.
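A tiny sketch of that lapse grouping (the boundaries follow the list above; the function name is mine):

def lapse_group(lapses: int) -> int:
    # Map a raw lapse count to a coarse group so no group is too sparse.
    if lapses <= 2:
        return lapses      # 0, 1 and 2 lapses keep their own groups
    if lapses <= 4:
        return 3           # 3-4 lapses
    if lapses <= 6:
        return 4           # 5-6 lapses
    if lapses == 7:
        return 5           # 7 lapses
    return 6               # 8 or more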

@Expertium
Collaborator Author

I have an idea for improving accuracy even further by introducing another matrix, the S-matrix, but we need to test this new R-matrix first, to see if it works. At the very least it should be faster, since it only needs to be filled once, unlike the previous version, which had to be recalculated during training.

L-M-Sherlock reopened this Jul 13, 2023
@L-M-Sherlock
Member

https://github.com/open-spaced-repetition/fsrs4anki/blob/Expt/r-matrix/archive/candidate/r-matrix-t-n-g.ipynb

I implemented a simple version based on 4.0.0. Could you help me test it?

@Expertium
Collaborator Author

[screenshot of the error]
I can't open it in Colab.

@Expertium
Collaborator Author

[screenshot of an error]

@L-M-Sherlock
Member

I'm updating the code. You can replace

df['lapse'] = df['r_history'].str.count('1')

with

df['lapse'] = df['r_history'].astype(str).str.count('1')

to solve it.

@Expertium
Collaborator Author

[screenshot of the error]
Now I'm getting the same error as in the beta version.

@L-M-Sherlock
Member

Maybe you can switch to another deck?

@Expertium
Collaborator Author

I mean, yes, I can, but it's still a problem that has to be addressed. Ok, I will test it on other decks.

@Expertium
Collaborator Author

Expertium commented Jul 13, 2023

[screenshot]
The optimization on my entire collection isn't done yet, but I can already tell that it behaves differently from the previous implementation: the parameter that determines the maximum weight that can be allocated to an R-matrix entry is nowhere near 0.99. It actually seems fairly balanced.

@Expertium
Collaborator Author

@user1823 you should try the new version too: #271 (comment)
I'm curious to see if you will get any kind of strange or unexpected results.

@Expertium
Collaborator Author

Expertium commented Jul 13, 2023

For my collection, RMSE went down by only about 13% compared to 4.0.0 beta, from 0.0589 to 0.0515; nowhere near as much as I was expecting.
EDIT: obviously this grouping isn't as good as grouping by D, S and R, but still.

@L-M-Sherlock
Member

L-M-Sherlock commented Jul 13, 2023

I mean, yes, I can, but it's still a problem that has to be addressed. Ok, I will test it on other decks.

I fixed this problem in the main branch and r-matrix-t-n-g.ipynb.

Just remove this line of code:

df.drop(df[df['create_date'].dt.year < 2006].index, inplace=True)

@Expertium
Collaborator Author

Expertium commented Jul 13, 2023

I think grouping should be improved.

  1. Use the same (or a slightly modified) formula that you used to round S to round delta_t. If one card has delta_t=353 and another has delta_t=352, it doesn't make much sense to put them into different groups. A sketch of what such rounding could look like follows this list.
    [image: rounding formula]

  2. Also group lapses differently. For example, lapses=7 and lapses=8 can safely be put into the same group.
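The rounding formula itself was posted as an image; a plausible stand-in that behaves the way the comment describes (coarser buckets for longer intervals) could look like this:

import math

def round_delta_t(t: int) -> int:
    # Keep roughly two significant figures, so 352 and 353 land in the
    # same bucket while short intervals stay distinct. Illustrative only.
    if t <= 0:
        return t
    scale = 10 ** max(0, int(math.log10(t)) - 1)
    return (t // scale) * scale

print(round_delta_t(353), round_delta_t(352))  # 350 350
print(round_delta_t(7), round_delta_t(8))      # 7 8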

@Expertium
Collaborator Author

I tested this version: it performs better than v4 by around 9-10%, and the difference is statistically significant. I bet it can be improved with better grouping (see my comment above).

I could describe how to implement the S-matrix, my new idea, but considering that Sherlock has already made up his mind about releasing v4 as is, I don't see the point of working on the S-matrix right now. I think we should do some final tests of this version of the R-matrix, with better grouping, and leave it until Sherlock decides to start working on v5.

Of course, if Sherlock decides to postpone the release of v4 to implement matrices, I would have nothing against it; quite the opposite: if I were in his shoes, I wouldn't release v4 before exhaustively trying everything related to matrices. But that would also require very major changes to the code of the Helper add-on. And if it's true that FSRS will soon be supported on mobile, this would increase the amount of headache tenfold, or even make using matrices outright impossible, meaning mobile would need its own "FSRS Mobile" version with no matrices.

@L-M-Sherlock
Member

According to my recent experiments, FSRS has reached the level of SM-15:

https://github.com/open-spaced-repetition/fsrs-vs-sm15/blob/main/evaluate.ipynb

And as we know, SM-15 employs many matrices to fit the user's memory, while FSRS v4 only uses 17 parameters for that. I think the benefit of matrices couldn't compensate for the cost of implementing them.

If you no longer insist on implementing matrices in FSRS, I will close this issue.

@Expertium
Collaborator Author

We still haven't properly implemented them, though. For example, the last attempt had a major flaw in how reviews were grouped.

That being said, if FSRS is already more accurate than the SM algorithms, we can put off the matrices and work on them in the future.

@Expertium
Collaborator Author

While I would like to try a proper implementation of the R-matrix*, even if it performs well, I still don't know how to solve the problem of non-monotonicity of the averaged R (prediction + R-matrix entry), and even if we somehow solve that, the matrix will be difficult to implement in Anki. So I suppose we can close this issue.

*with improved grouping. Last time, cards with delta_t differing by one day, like delta_t=364 and delta_t=365, were being grouped into distinct groups for no good reason.
