Feat/filter outlier in trainset #119

L-M-Sherlock · 2023-11-20T16:12:58Z

No description provided.

user1823 · 2023-11-20T17:11:18Z

Can you state in plain English what this outlier filter does?
I am unable to understand it from the code (because I am not a dev) and the added comment also doesn't seem to answer it.

What does rating = rating and delta_t = delta_t actually mean?

src/dataset.rs

L-M-Sherlock · 2023-11-21T02:10:06Z

Can you state in plain English what this outlier filter does?

I rewrote the comments in the latest commit.

src/dataset.rs

open-spaced-repetition/fsrs-rs#119

user1823 · 2023-11-21T12:22:10Z

keep the items in trainset if they are not removed from filtered_items

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

L-M-Sherlock · 2023-11-21T12:33:59Z

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

Yes. It's consistent with the Python optimizer.
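For readers who don't follow the Rust code, the behavior confirmed above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/dataset.rs`: the `Review` struct, the `filter_trainset` function, and the `kept_cards` set are hypothetical names, and the real item type in fsrs-rs is richer.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified review record (an assumption for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Review {
    card_id: u32,
    delta_t: u32,
    rating: u32,
}

/// Keep only the reviews whose card survived the pretrain outlier filter.
/// If a card is missing from `kept_cards`, every one of its reviews is dropped.
fn filter_trainset(trainset: Vec<Review>, kept_cards: &HashSet<u32>) -> Vec<Review> {
    trainset
        .into_iter()
        .filter(|r| kept_cards.contains(&r.card_id))
        .collect()
}
```

With card 1 removed by the pretrain filter and card 2 kept, both of card 1's reviews disappear from the trainset, including any later reviews of that card.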

user1823 · 2023-11-21T12:44:09Z

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

Yes. It's consistent with the Python optimizer.

As I mentioned in #88 (comment), this outlier filter is too aggressive (it removes too many reviews). This is fine for the pretrain function (because the pretrain method is fragile). However, the train function should have access to more reviews.

Let me explain with an example:

Let's say that for first_rating = 3 and delta_t = 9 days, there are 8 cards and for delta_t = 10 days, there are 100 cards.
Since the number of cards with delta_t = 9 is too low, the average R will be unreliable. So, it is desirable to filter this case in pre-train.
However, the further review history of these cards is not expected to be very different from those with initial delta_t = 10 days. So, it is not desirable to filter the further review history of these cards.

I agree that some sort of outlier filter is required for the trainset also. But, I think that the same outlier filter should not be used for both.
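To make the example concrete, here is a rough sketch of a count-based group filter for the pretrain data. It is a simplified stand-in, not the rule implemented in this PR: the `FirstReview` struct, the `drop_small_groups` function, and the `min_count` threshold are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical first-review record, one per card (an assumption for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
struct FirstReview {
    card_id: u32,
    first_rating: u32,
    delta_t: u32,
}

/// Drop every (first_rating, delta_t) group that has fewer than `min_count` cards.
/// `min_count` is an illustrative parameter, not a value taken from the PR.
fn drop_small_groups(items: &[FirstReview], min_count: usize) -> Vec<FirstReview> {
    // Count how many cards fall into each (first_rating, delta_t) group.
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for it in items {
        *counts.entry((it.first_rating, it.delta_t)).or_insert(0) += 1;
    }
    // Keep only the cards whose group is large enough.
    items
        .iter()
        .filter(|it| counts[&(it.first_rating, it.delta_t)] >= min_count)
        .copied()
        .collect()
}
```

With 8 cards at delta_t = 9 and 100 cards at delta_t = 10, any `min_count` between 9 and 100 removes the first group and keeps the second. The concern raised above is that those 8 cards then also lose their later reviews in the trainset.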

L-M-Sherlock · 2023-11-21T13:47:29Z

However, the further review history of these cards is not expected to be very different from those with initial delta_t = 10 days.

I don't think so. If a card is reviewed too early or too late, the first response will be unreliable. If we use the same model to calculate the initial memory state from the unreliable response, the subsequent training will be polluted.

user1823 · 2023-11-21T14:08:58Z

If a card is reviewed too early or too late, the first response will be unreliable.

I agree, but 9 days can't be called "too early" when compared to 10 days.

This is the reason I am suggesting a less aggressive outlier filter for the trainset.

user1823 · 2023-11-21T14:45:47Z

Taking the example of my collection (see first review data at the bottom of the comment),

For first_rating = 1, I think that it is reasonable to keep delta_t = 1 and delta_t = 2 (in my collection) in the trainset.
For first_rating = 2, the data in my collection is insufficient. So, I will not discuss it.
For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.
For first_rating = 4, there is no major problem in how the outlier filter is working with my data.

If a card is reviewed too early or too late, the first response will be unreliable.

So, what about using an outlier filter based on the ratio between delta_t and stability for the trainset?

First review data: stability_for_pretrain.zip
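One possible shape for the ratio-based check suggested above, as a sketch only. The function name `is_ratio_outlier` and the `max_ratio` parameter are hypothetical, and the symmetric bound (flagging both very early and very late reviews) is an assumption; the thread does not specify a threshold.

```rust
/// Returns true when the first interval is far from the estimated initial stability.
/// `max_ratio` is a hypothetical tuning parameter and must be greater than 1.0;
/// `stability` is assumed to be positive.
fn is_ratio_outlier(delta_t: f32, stability: f32, max_ratio: f32) -> bool {
    let ratio = delta_t / stability;
    // Flag reviews that came much later or much earlier than the stability suggests.
    ratio > max_ratio || ratio < 1.0 / max_ratio
}
```

With a stability of about 14 days and `max_ratio = 4.0`, a first interval of 9 days would be kept, while intervals of 1 day or 60 days would be flagged.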

L-M-Sherlock · 2023-11-21T15:31:35Z

The current outlier filter removes only 5% of cards. Is that aggressive?

user1823 · 2023-11-21T15:36:08Z

Yes (for trainset) / No (for pretrain)

See my above comment; it includes specific examples.

L-M-Sherlock · 2023-11-21T15:56:35Z

It's really hard to design a perfect solution for that. And I think 11 0.97 33 could also be included in the pretrainset.

user1823 · 2023-11-21T16:08:24Z

It's really hard to design a perfect solution for that.

Yes, but we can try to design a solution better than the current one.

As I said in #119 (comment), what if we create another condition based on the ratio (or something similar) of the delta_t and stability, and then remove the reviews of cards that fulfill BOTH conditions?
(By both, I mean the ratio condition and the pretrain outlier condition.)

This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.
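The combined rule proposed here could look roughly like the sketch below. `CardFlags`, `remove_from_trainset`, and `max_ratio` are hypothetical names and parameters, and the ratio test itself is an assumed form of the "ratio condition".

```rust
/// Hypothetical per-card summary (an assumption for illustration).
struct CardFlags {
    /// True if the pretrain outlier filter removed this card.
    pretrain_outlier: bool,
    /// First interval in days.
    delta_t: f32,
    /// Estimated initial stability in days, assumed positive.
    stability: f32,
}

/// A card's reviews leave the trainset only if BOTH conditions hold:
/// the pretrain filter flagged it AND its delta_t/stability ratio is extreme.
fn remove_from_trainset(card: &CardFlags, max_ratio: f32) -> bool {
    let ratio = card.delta_t / card.stability;
    let ratio_outlier = ratio > max_ratio || ratio < 1.0 / max_ratio;
    card.pretrain_outlier && ratio_outlier
}
```

Under this rule, a card dropped from pretrain only because its group was small (for example delta_t = 9 against a stability of about 14) would stay in the trainset.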

And I think 11 0.97 33 could also be included in the pretrainset.

Based on these values, this group had only 1 lapse. Any group with only 1 lapse is too small to be used for calculating the stability.
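The arithmetic behind that statement, assuming the three values read as delta_t = 11 days, average retention R = 0.97, and 33 cards in the group (this reading is an assumption; the thread does not label the columns):

$$
\text{lapses} \approx 33 \times (1 - 0.97) = 0.99 \approx 1
$$

Equivalently, 32 successful recalls out of 33 gives 32/33 ≈ 0.97.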

L-M-Sherlock · 2023-11-21T16:15:41Z

As I said in #119 (comment), what if we create another condition based on the ratio (or something similar) of the delta_t and stability, and then remove the reviews of cards that fulfill BOTH conditions?
(By both, I mean the ratio condition and the pretrain outlier condition.)

I recommend opening a new issue to discuss it.

L-M-Sherlock added 2 commits November 21, 2023 00:08

Feat/filter outlier in trainset

88d0794

assert length in test_filter_outlier

dc9b2fe

L-M-Sherlock added the enhancement label Nov 20, 2023

L-M-Sherlock linked an issue Nov 20, 2023 that may be closed by this pull request

Achieve parity with the Python optimizer #88

Closed

L-M-Sherlock requested review from dae and asukaminato0721 November 20, 2023 16:25

asukaminato0721 approved these changes Nov 20, 2023

src/dataset.rs (outdated, resolved)

rewrite comments

e8aa320

dae reviewed Nov 21, 2023

src/dataset.rs (outdated, resolved)

src/dataset.rs (outdated, resolved)

asukaminato0721 added 2 commits November 21, 2023 14:12

iter => into_iter

925bf98

remove retain in loop

62f5b77

dae reviewed Nov 21, 2023

src/dataset.rs (outdated, resolved)

asukaminato0721 added 3 commits November 21, 2023 15:01

use bool arr

c0af31a

move retain out the loop

c9756fd

hash u32 instead of pair

551a337

L-M-Sherlock requested a review from dae November 21, 2023 06:59

dae approved these changes Nov 21, 2023


L-M-Sherlock merged commit a9cc36a into main Nov 21, 2023
3 checks passed

L-M-Sherlock deleted the Feat/filter-outlier-in-trainset branch November 21, 2023 08:39

L-M-Sherlock added a commit to open-spaced-repetition/srs-benchmark that referenced this pull request Nov 21, 2023

update metrics of FSRS-rs

e5d6eae

open-spaced-repetition/fsrs-rs#119

user1823 mentioned this pull request Nov 21, 2023

Better outlier filter for trainset #121

Closed
