
Better outlier filter for trainset #121

Closed
user1823 opened this issue Nov 21, 2023 · 25 comments

@user1823
Contributor

As I mentioned in #88 (comment), the current outlier filter is too aggressive (it removes too many reviews). This is fine for the pretrain function (because the pretrain method is fragile). However, the train function should have access to more reviews.

I agree that some sort of outlier filter is required for the trainset as well. But I think that the same outlier filter should not be used for both.

Originally posted by @user1823 in #119 (comment)

Taking the example of my collection (data: stability_for_pretrain.zip)

  • For first_rating = 1, I think that it is reasonable to keep delta_t = 1 and delta_t = 2 (in my collection) in the trainset.
  • For first_rating = 2, the data in my collection is insufficient. So, I will not discuss it.
  • For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.
  • For first_rating = 4, there is no major problem in how the outlier filter is working with my data.

Originally posted by @user1823 in #119 (comment)

If a card is reviewed too early or too late, the first response will be unreliable.

So, what if we create another condition based on the ratio (or something similar) of delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions? (By both, I mean the ratio condition and the pretrain outlier condition.)

Note: This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.

Originally posted by @user1823 in #119 (comment)

@user1823
Contributor Author

user1823 commented Nov 29, 2023

@L-M-Sherlock, what is your opinion on the suggestion?

@Expertium, you may also be interested.

@Expertium
Contributor

So, what if we create another condition based on the ratio (or something similar) of delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions? (By both, I mean the ratio condition and the pretrain outlier condition.)

I like this idea.

@Expertium
Contributor

@user1823 can you elaborate how exactly the new condition should work?

@user1823
Contributor Author

user1823 commented Dec 2, 2023

I was thinking of using stability × 3 and stability / 3 as the maximum and minimum values of delta_t.

Other suggestions are welcome.

@Expertium
Contributor

But the value of stability depends on what data is available. Do you want a two-step procedure?

  1. Estimate S0 while removing outliers based on the current condition.
  2. Re-estimate S0 while removing outliers using two conditions, with the first estimate being used in the ratio.

Like that?

@user1823
Contributor Author

user1823 commented Dec 2, 2023

Looks like you didn't notice this:

Note: This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.

I am suggesting that we calculate S0 using the current filter and then use the two conditions to filter the reviews when training the other parameters.
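
To make it concrete, here's a minimal sketch (a hypothetical helper, not existing fsrs-rs code; `s0` is the initial stability that pretrain has already estimated, and the 3× bounds are the ratio I suggested above):

    /// Hypothetical trainset filter: a card's reviews are dropped only if its
    /// first delta_t fails BOTH the pretrain outlier condition AND the ratio
    /// condition (delta_t outside [s0 / 3.0, s0 * 3.0]).
    fn keep_in_trainset(delta_t: f32, s0: f32, removed_by_pretrain_filter: bool) -> bool {
        let ratio_outlier = delta_t < s0 / 3.0 || delta_t > s0 * 3.0;
        !(removed_by_pretrain_filter && ratio_outlier)
    }

    fn main() {
        // With S0 ≈ 14 (first_rating = 3), a first interval of 10 days stays in
        // the trainset even if the pretrain filter flagged it: 10 is within [14/3, 42].
        assert!(keep_in_trainset(10.0, 14.0, true));
        // A first interval of 1000 days fails the ratio condition too (1000 > 42),
        // so it is dropped when the pretrain filter also flags it.
        assert!(!keep_in_trainset(1000.0, 14.0, true));
    }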

@user1823
Contributor Author

user1823 commented Dec 2, 2023

If a card is reviewed too early or too late, the response will be unreliable.

By the way, this statement holds true for any review, not just the first review. But because filtering outliers in every review is very difficult, we filter outliers at the first review only.

(First review = second rating)

@Expertium
Contributor

I am suggesting that we calculate S0 using the current filter and then use the two conditions to filter the reviews when training the other parameters.

Ah, ok, I misunderstood.

@Expertium
Contributor

@L-M-Sherlock I suggest trying user1823's idea: #121 (comment)

@user1823
Contributor Author

user1823 commented Dec 3, 2023

I suggest trying user1823's idea

In case you guys have forgotten, I want to remind you of the following:

In general, any change that makes the data more heterogeneous (such as the above proposed change) will tend to increase the RMSE. So, we shouldn't rely on RMSE to determine the effectiveness of this change.

It is possible that RMSE can decrease after this change, but it will happen only if the idea is VERY good. This happened in open-spaced-repetition/fsrs-optimizer#16

In my opinion, the condition for implementing this idea should be that the change should not significantly increase the RMSE.

@user1823
Contributor Author

user1823 commented Dec 6, 2023

@L-M-Sherlock, what is your opinion on this suggestion?

If you are too busy to implement this now, it is fine. But you can at least give your opinion.

@L-M-Sherlock
Member

But you can at least give your opinion.

I'm open to that. But the real world is more complicated than I expected:

[attached image]

@user1823
Contributor Author

user1823 commented Dec 7, 2023

Maybe also filter reviews of cards with initial delta_t > 7 × S irrespective of whether they are filtered by the pretrain filter or not?

(delta_t = 7S implies R = 56.3%)
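
(Derivation, assuming the FSRS v4 power forgetting curve R(t, S) = (1 + t/(9S))^(−1): at t = 7S, R = (1 + 7/9)^(−1) = 9/16 ≈ 56.3%.)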

@user1823
Contributor Author

I'm open for that. But the real world is more complicate than my expectation:

@L-M-Sherlock, the issue that you highlighted is not caused by my suggestion; it exists in the current version too. So, it should not be a reason not to apply my suggestion.

Basically, the reason behind my suggestion is that the more data available, the better the trained weights will be (provided the data is not complete rubbish).

So, just because there are few cards for a particular delta_t, we should not discard all the ratings for those cards. The criteria to discard them should be stricter (as proposed above).

Similarly, just because a card falls in a delta_t with R = 100% (#135), we shouldn't discard all the ratings for those cards.

For sure, the above two cases are not suitable for calculating S0, but their subsequent ratings can be useful.

@Expertium
Contributor

Expertium commented Dec 16, 2023

For sure, the above two cases are not suitable for calculating S0

The second case is suitable though. We use additive smoothing, so R will never be exactly 0% or 100% in the pretrain data. I haven't looked at the Rust code (it's hard to understand), but I'm assuming additive smoothing is implemented. If not, it definitely should be.
EDIT: it's implemented in the pretrain file:

        let recall = {
            // Laplace smoothing
            // (real_recall * n + average_recall * 1) / (n + 1)
            // https://github.com/open-spaced-repetition/fsrs4anki/pull/358/files#diff-35b13c8e3466e8bd1231a51c71524fc31a945a8f332290726214d3a6fa7f442aR491
            let real_recall = Array1::from_iter(data.iter().map(|d| d.recall));
            let n = data.iter().map(|d| d.count).sum::<f32>();
            (real_recall * n + average_recall) / (n + 1.0)
        };
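
As a quick sanity check with made-up numbers: a bin with n = 5 reviews, all recalled (real_recall = 1.0), and average_recall = 0.9 smooths to (1.0 × 5 + 0.9) / (5 + 1) ≈ 0.983, so R never reaches exactly 100% in the pretrain data.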

Wait, Sherlock, then what's the point of removing cards with R=100%?

@L-M-Sherlock
Member

L-M-Sherlock commented Dec 17, 2023

Wait, Sherlock, then what's the point of removing cards with R=100%?

It's really weird that the retention is 100% when the interval >= 1 day. Removing these data could increase the pretraining's robustness. In this case, the number of removed cards is very small, so it's acceptable.

@Expertium
Contributor

Is this issue still necessary?

@user1823
Contributor Author

user1823 commented Jan 8, 2024

Yes, the previous few changes were related to the pretrainset. This issue is related to the trainset.

@L-M-Sherlock
Member

I want to keep the current outlier filter at the card level. All reviews of a card are filtered out if its first review is filtered out in the pretrainset.

@user1823
Contributor Author

user1823 commented Mar 6, 2024

All reviews of a card are filtered out if its first review is filtered out in the pretrainset.

But why? The pretrain and train methods are completely different. It is not necessary to use the same outlier filter for both.

Also, how would you explain the following?

For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.

Here, the optimizer is saying that the initial stability is 14 days, but it completely ignores the cards whose first interval is more than 5 days, just because there are thousands of reviews with a first interval of 1 day (because of past Anki use).

@L-M-Sherlock
Member

L-M-Sherlock commented Mar 7, 2024

The pretrain and train methods are completely different.

Their goals are the same: minimizing the loss.

Here, the optimizer is saying that the initial stability is 14 days but the optimizer completely ignores the cards whose first interval is more than 5 days, just because there are 1000s of reviews with first interval of 1 day (because of past Anki use).

The current outlier filter only filters out 5% of cards for each first rating. That's 1000 * 0.05 = 50 cards. When the number of cards whose first interval is more than 5 days exceeds 50, they will be taken into account.

For more details, please see the code and comments:

fsrs-rs/src/dataset.rs

Lines 183 to 201 in 3d0dd3a

    let total = sub_groups.iter().map(|(_, vec)| vec.len()).sum::<usize>();
    let mut has_been_removed = 0;
    for (delta_t, sub_group) in sub_groups.iter().rev() {
        // remove 5% items (20 at least) of each group
        if has_been_removed + sub_group.len() >= 20.max(total / 20) {
            // keep the sub_group if it includes at least six items
            // and the delta_t is less than 100 days if rating is not 4
            // or less than 365 days if rating is 4
            if sub_group.len() >= 6 && *delta_t <= if rating != 4 { 100 } else { 365 } {
                filtered_items.extend_from_slice(sub_group);
            } else {
                removed_pairs[rating as usize].insert(*delta_t);
            }
        } else {
            has_been_removed += sub_group.len();
            removed_pairs[rating as usize].insert(*delta_t);
        }
    }
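
To illustrate the thresholds with made-up numbers: for total = 4000 cards of a given first rating, the budget 20.max(total / 20) is max(20, 200) = 200, so sub-groups are removed from the largest delta_t downward until removing the next one would reach 200 cards (5%); past that point, a sub-group is kept only if it has at least 6 items and its delta_t is within the 100-day cap (365 days for rating 4). For total below 400, the floor of 20 cards dominates.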

@user1823
Contributor Author

user1823 commented Mar 7, 2024

Their goals are the same: minimizing the loss.

Yes, their goals are the same. But the method of achieving the goal is very different: pretrain uses binning (based on delta_t), while train doesn't.

When the number of cards whose first interval is more than 5 days exceeds 50, they will be taken into account.

In training, each card is treated independently of the others (because there is no binning). So, we don't need to look at the number of cards with a particular delta_t (except for some weird cases, like a first review 1000 days after the card graduated, which would be filtered out by the condition I suggested).

I agree that not a very large number of cards are affected. But, I can't think of any good reason for not including these cards in training.

@Expertium
Contributor

I don't have a strong opinion on this matter, but if I had to choose, I would choose to implement user1823's idea.

@L-M-Sherlock
Member

L-M-Sherlock commented Mar 7, 2024

How much benefit does including them bring? If the number of reviews with the same delta_t is at least 6 in the pretrainset, the current outlier filter keeps them as well. So it only removes a few cards whose first review is weird. In addition, filtering the trainset based on the initial stability is complex. In my opinion, the cost exceeds the benefit.

@user1823
Contributor Author

user1823 commented Mar 7, 2024

If the number of reviews with the same delta_t is at least 6 in the pretrainset, the current outlier filter keeps them as well.

No.

If the number is less than 6, the reviews are filtered out. But if the number is ≥ 6, then whether they are kept depends on whether they fall in the bottom 5% of the total number of cards with that first rating.

This 5% can be quite large. For example, if I have 4000 cards with a first rating of Good, the optimizer ignores 200 cards for no good reason. This is not to say that the 5% limit should be adjusted; it is required for pretrain. But when training, the filter should be slightly looser.

This is my last comment on this issue. If you still don't want to implement it, then we can forget about it.
