
Better outlier filter for trainset #121

Closed
user1823 opened this issue Nov 21, 2023 · 25 comments

@user1823
Contributor

As I mentioned in #88 (comment), the current outlier filter is too aggressive (it removes too many reviews). This is fine for the pretrain function (because the pretrain method is fragile). However, the train function should have access to more reviews.

I agree that some sort of outlier filter is required for the trainset as well. But I think that the same outlier filter should not be used for both.

Originally posted by @user1823 in #119 (comment)

Taking the example of my collection (data: stability_for_pretrain.zip)

  • For first_rating = 1, I think that it is reasonable to keep delta_t = 1 and delta_t = 2 (in my collection) in the trainset.
  • For first_rating = 2, the data in my collection is insufficient. So, I will not discuss it.
  • For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.
  • For first_rating = 4, there is no major problem in how the outlier filter is working with my data.

Originally posted by @user1823 in #119 (comment)

If a card is reviewed too early or too late, the first response will be unreliable.

So, what if we create another condition based on the ratio (or something similar) of delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions? (By both, I mean the ratio condition and the pretrain outlier condition.)

Note: This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.

Originally posted by @user1823 in #119 (comment)

@user1823
Contributor Author

user1823 commented Nov 29, 2023

@L-M-Sherlock, what is your opinion on the suggestion?

@Expertium, you may also be interested.

@Expertium
Contributor

So, what if we create another condition based on the ratio (or something similar) of delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions? (By both, I mean the ratio condition and the pretrain outlier condition.)

I like this idea.

@Expertium
Contributor

@user1823 can you elaborate how exactly the new condition should work?

@user1823
Contributor Author

user1823 commented Dec 2, 2023

I was thinking of using stability × 3 and stability / 3 as the maximum and minimum values of delta_t.

Other suggestions are welcome.

@Expertium
Contributor

But the value of stability depends on what data is available. Do you want a two-step procedure?

  1. Estimate S0 while removing outliers based on the current condition.
  2. Re-estimate S0 while removing outliers using two conditions, with the first estimate being used in the ratio.

Like that?

@user1823
Contributor Author

user1823 commented Dec 2, 2023

Looks like you didn't notice this:

Note: This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.

I am suggesting that we calculate S0 using the current filter and then use the two conditions to filter the reviews when training the other parameters.
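
To make it concrete, here's a minimal sketch (a hypothetical helper, not existing fsrs-rs code; `s0` is the initial stability that pretrain has already estimated, and the 3× bounds are the ratio I suggested above):

    /// Hypothetical trainset filter: a card's reviews are dropped only if its
    /// first delta_t fails BOTH the pretrain outlier condition AND the ratio
    /// condition (delta_t outside [s0 / 3.0, s0 * 3.0]).
    fn keep_in_trainset(delta_t: f32, s0: f32, removed_by_pretrain_filter: bool) -> bool {
        let ratio_outlier = delta_t < s0 / 3.0 || delta_t > s0 * 3.0;
        !(removed_by_pretrain_filter && ratio_outlier)
    }

    fn main() {
        // With S0 ≈ 14 (first_rating = 3), a first interval of 10 days stays in
        // the trainset even if the pretrain filter flagged it: 10 is within [14/3, 42].
        assert!(keep_in_trainset(10.0, 14.0, true));
        // A first interval of 1000 days fails the ratio condition too (1000 > 42),
        // so it is dropped when the pretrain filter also flags it.
        assert!(!keep_in_trainset(1000.0, 14.0, true));
    }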

@user1823
Contributor Author

user1823 commented Dec 2, 2023

If a card is reviewed too early or too late, the response will be unreliable.

By the way, this statement holds true for any review, not just the first review. But because filtering outliers in every review is very difficult, we filter outliers at the first review only.

(First review = second rating)

@Expertium
Contributor

I am suggesting that we calculate S0 using the current filter and then use the two conditions to filter the reviews when training the other parameters.

Ah, ok, I misunderstood.

@Expertium
Contributor

@L-M-Sherlock I suggest trying user1823's idea: #121 (comment)

@user1823
Contributor Author

user1823 commented Dec 3, 2023

I suggest trying user1823's idea

In case you guys have forgotten, I want to remind you of the following:

In general, any change that makes the data more heterogeneous (such as the above proposed change) will tend to increase the RMSE. So, we shouldn't rely on RMSE to determine the effectiveness of this change.

It is possible that RMSE can decrease after this change, but it will happen only if the idea is VERY good. This happened in open-spaced-repetition/fsrs-optimizer#16

In my opinion, the condition for implementing this idea should be that the change should not significantly increase the RMSE.

@user1823
Contributor Author

user1823 commented Dec 6, 2023

@L-M-Sherlock, what is your opinion on this suggestion?

If you are too busy to implement this now, it is fine. But you can at least give your opinion.

@L-M-Sherlock
Member

But you can at least give your opinion.

I'm open to that. But the real world is more complicated than I expected:

[attached image]

@user1823
Contributor Author

user1823 commented Dec 7, 2023

Maybe also filter reviews of cards with initial delta_t > 7 × S irrespective of whether they are filtered by the pretrain filter or not?

(delta_t = 7S implies R = 56.3%)
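
(Derivation, assuming the FSRS v4 power forgetting curve R(t, S) = (1 + t/(9S))^(−1): at t = 7S, R = (1 + 7/9)^(−1) = 9/16 ≈ 56.3%.)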

@user1823
Contributor Author

I'm open for that. But the real world is more complicate than my expectation:

@L-M-Sherlock, the issue that you highlighted is not caused by my suggestion; it exists in the current version too. So, it should not be a reason not to apply my suggestion.

Basically, the reason behind my suggestion is that the more data available, the better the trained weights will be (provided the data is not complete rubbish).

So, just because there are few cards for a particular delta_t, we should not discard all the ratings for those cards. The criteria to discard them should be stricter (as proposed above).

Similarly, just because a card falls in a delta_t with R = 100% (#135), we shouldn't discard all the ratings for those cards.

For sure, the above two cases are not suitable for calculating S0, but their subsequent ratings can be useful.

@Expertium
Contributor

Expertium commented Dec 16, 2023

For sure, the above two cases are not suitable for calculating S0

The second case is suitable though. We use additive smoothing, so R will never be exactly 0% or 100% in the pretrain data. I haven't looked at the Rust code (it's hard to understand), but I'm assuming additive smoothing is implemented. If not, it definitely should be.
EDIT: it's implemented in the pretrain file:

        let recall = {
            // Laplace smoothing
            // (real_recall * n + average_recall * 1) / (n + 1)
            // https://github.com/open-spaced-repetition/fsrs4anki/pull/358/files#diff-35b13c8e3466e8bd1231a51c71524fc31a945a8f332290726214d3a6fa7f442aR491
            let real_recall = Array1::from_iter(data.iter().map(|d| d.recall));
            let n = data.iter().map(|d| d.count).sum::<f32>();
            (real_recall * n + average_recall) / (n + 1.0)
        };
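
As a quick sanity check with made-up numbers: a bin with n = 5 reviews, all recalled (real_recall = 1.0), and average_recall = 0.9 smooths to (1.0 × 5 + 0.9) / (5 + 1) ≈ 0.983, so R never reaches exactly 100% in the pretrain data.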

Wait, Sherlock, then what's the point of removing cards with R=100%?

@L-M-Sherlock
Member

L-M-Sherlock commented Dec 17, 2023

Wait, Sherlock, then what's the point of removing cards with R=100%?

It's really weird that the retention is 100% when the interval >= 1 day. Removing these data could increase the pretraining's robustness. In this case, the number of removed cards is very small, so it's acceptable.

@Expertium
Contributor

Is this issue still necessary?

@user1823
Contributor Author

user1823 commented Jan 8, 2024

Yes, the previous few changes were related to the pretrainset. This issue is related to the trainset.

@L-M-Sherlock
Member

I want to keep the current outlier filter at the card level. All reviews of a card are filtered out if its first review is filtered out in the pretrainset.

@user1823
Contributor Author

user1823 commented Mar 6, 2024

All reviews of a card are filtered out if its first review is filtered out in the pretrainset.

But why? The pretrain and train methods are completely different. It is not necessary to use the same outlier filter for both.

Also, how would you explain the following?

For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.

Here, the optimizer is saying that the initial stability is 14 days, but it completely ignores the cards whose first interval is more than 5 days, just because there are thousands of reviews with a first interval of 1 day (because of past Anki use).

@L-M-Sherlock
Member

L-M-Sherlock commented Mar 7, 2024

The pretrain and train methods are completely different.

Their goals are the same: minimizing the loss.

Here, the optimizer is saying that the initial stability is 14 days but the optimizer completely ignores the cards whose first interval is more than 5 days, just because there are 1000s of reviews with first interval of 1 day (because of past Anki use).

The current outlier filter only filters out 5% of cards for each first rating. That's 1000 * 0.05 = 50 cards. When the number of cards whose first interval is more than 5 days exceeds 50, they will be taken into account.

For more details, please see the code and comments:

fsrs-rs/src/dataset.rs

Lines 183 to 201 in 3d0dd3a

    let total = sub_groups.iter().map(|(_, vec)| vec.len()).sum::<usize>();
    let mut has_been_removed = 0;
    for (delta_t, sub_group) in sub_groups.iter().rev() {
        // remove 5% items (20 at least) of each group
        if has_been_removed + sub_group.len() >= 20.max(total / 20) {
            // keep the sub_group if it includes at least six items
            // and the delta_t is less than 100 days if rating is not 4
            // or less than 365 days if rating is 4
            if sub_group.len() >= 6 && *delta_t <= if rating != 4 { 100 } else { 365 } {
                filtered_items.extend_from_slice(sub_group);
            } else {
                removed_pairs[rating as usize].insert(*delta_t);
            }
        } else {
            has_been_removed += sub_group.len();
            removed_pairs[rating as usize].insert(*delta_t);
        }
    }
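
To illustrate the thresholds with made-up numbers: for total = 4000 cards of a given first rating, the budget 20.max(total / 20) is max(20, 200) = 200, so sub-groups are removed from the largest delta_t downward until removing the next one would reach 200 cards (5%); past that point, a sub-group is kept only if it has at least 6 items and its delta_t is within the 100-day cap (365 days for rating 4). For total below 400, the floor of 20 cards dominates.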

@user1823
Contributor Author

user1823 commented Mar 7, 2024

Their goals are the same: minimizing the loss.

Yes, their goals are the same. But the method of achieving the goal is very different: pretrain uses binning (based on delta_t), while train doesn't.

When the number of cards whose first interval is more than 5 days exceeds 50, they will be taken into account.

In training, each card is treated independently of the others (because there is no binning). So, we don't need to look at the number of cards with a particular delta_t (except for some weird cases, like a first review 1000 days after the card graduated, which would be filtered out by the condition I suggested).

I agree that not a very large number of cards are affected. But, I can't think of any good reason for not including these cards in training.

@Expertium
Contributor

I don't have a strong opinion on this matter, but if I had to choose, I would choose to implement user1823's idea.

@L-M-Sherlock
Member

L-M-Sherlock commented Mar 7, 2024

How much benefit does including them bring? If the number of reviews with the same delta_t is at least 6 in the pretrainset, the current outlier filter keeps them as well. So it only removes a few cards whose first review is weird. In addition, filtering the trainset based on the initial stability is complex. In my opinion, the cost exceeds the benefit.

@user1823
Contributor Author

user1823 commented Mar 7, 2024

If the number of reviews with the same delta_t is at least 6 in the pretrainset, the current outlier filter keeps them as well.

No.

If the number is less than 6, the reviews are filtered out. But if the number is ≥ 6, then whether they are kept depends on whether they fall in the bottom 5% of the total number of cards with that first rating.

This 5% can be quite large. For example, if I have 4000 cards with a first rating of Good, the optimizer ignores 200 cards for no good reason. This is not to say that the 5% limit should be adjusted; it is required for pretrain. But when training, the filter should be slightly looser.

This is my last comment on this issue. If you still don't want to implement it, then we can forget about it.
