Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/filter outlier in trainset #119

Merged
merged 8 commits into from
Nov 21, 2023
Merged

Conversation

L-M-Sherlock
Copy link
Member

No description provided.

@L-M-Sherlock L-M-Sherlock added the enhancement New feature or request label Nov 20, 2023
@L-M-Sherlock L-M-Sherlock linked an issue Nov 20, 2023 that may be closed by this pull request
@user1823
Copy link
Contributor

user1823 commented Nov 20, 2023

Can you state in plain English what this outlier filter does?
I am unable to understand it from the code (because I am not a dev) and the added comment also doesn't seem to answer it.

What does rating = rating and delta_t = delta_t actually mean?

src/dataset.rs Outdated Show resolved Hide resolved
@L-M-Sherlock
Copy link
Member Author

Can you state in plain English what this outlier filter does?

I rewrite the comments in the last commit.

src/dataset.rs Outdated Show resolved Hide resolved
src/dataset.rs Outdated Show resolved Hide resolved
src/dataset.rs Outdated Show resolved Hide resolved
@L-M-Sherlock L-M-Sherlock merged commit a9cc36a into main Nov 21, 2023
3 checks passed
@L-M-Sherlock L-M-Sherlock deleted the Feat/filter-outlier-in-trainset branch November 21, 2023 08:39
L-M-Sherlock added a commit to open-spaced-repetition/srs-benchmark that referenced this pull request Nov 21, 2023
@user1823
Copy link
Contributor

keep the items in trainset if they are not removed from filtered_items

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

@L-M-Sherlock
Copy link
Member Author

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

Yes. It's consistent with the Python optimizer.

@user1823
Copy link
Contributor

Does this mean that if a card is removed from the pretrain dataset, then all its reviews will be removed from the trainset?

Yes. It's consistent with the Python optimizer.

As I mentioned in #88 (comment), this outlier filter is too aggressive (it removes too many reviews). This is fine for the pretrain function (because the pretrain method is fragile). However, the train function should have access to more reviews.

Let me explain with an example:

Let's say that for first_rating = 3 and delta_t = 9 days, there are 8 cards and for delta_t = 10 days, there are 100 cards.
Since the cards with delta_t = 9 is too low, the average R will be unreliable. So, it is desirable to filter this case in pre-train.
However, the further review history of these cards is not expected to be very different from those with initial delta_t = 10 days. So, it is not desirable to filter the further review history of these cards.

I agree that some sort of outlier filter is required for the trainset also. But, I think that the same outlier filter should not be used for both.

@L-M-Sherlock
Copy link
Member Author

However, the further review history of these cards is not expected to be very different from those with initial delta_t = 10 days.

I don't think so. If a card is reviewed too early or too late, the first response will be unreliable. If we use the same model the calculate the initial memory state with the unreliable response, the subsequent training will be polluted.

@user1823
Copy link
Contributor

If a card is reviewed too early or too late, the first response will be unreliable.

I agree, but 9 days can't be called "too early" when compared to 10 days.

This is the reason I am suggesting to use a less aggressive outlier filter for the trainset.

@user1823
Copy link
Contributor

Taking the example of my collection (see first review data at the bottom of the comment),

  • For first_rating = 1, I think that it is reasonable to keep delta_t = 1 and delta_t = 2 (in my collection) in the trainset.
  • For first_rating = 2, the data in my collection is insufficient. So, I will not discuss about it.
  • For first_rating = 3, the initial stability is calculated as about 14. So, it doesn't make sense to ONLY keep the data with delta_t < 5 in the trainset.
  • For first_rating = 4, there is no major problem in how the outlier filter is working with my data.

If a card is reviewed too early or too late, the first response will be unreliable.

So, what about using an outlier filter based on the ratio between delta_t and stability for the trainset?

First review data: stability_for_pretrain.zip

@L-M-Sherlock
Copy link
Member Author

The current outlier filter only removes 5% cards. Is it aggressive?

@user1823
Copy link
Contributor

Yes (for trainset) / No (for pretrain)

See my above comment; it includes specific examples.

@L-M-Sherlock
Copy link
Member Author

It's really hard to design a perfect solution for that. And I think 11 0.97 33 could be also included in pretrainset.

@user1823
Copy link
Contributor

user1823 commented Nov 21, 2023

It's really hard to design a perfect solution for that.

Yes, but we can try to design a solution better than the current one.

As I said in #119 (comment), what if create another condition based on the ratio (or something similar) of the delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions.
(By both, I mean the ratio condition and the pretrain outlier condition.)

This suggestion is for trainset. I recommend keeping the pretrain filter unchanged.

And I think 11 0.97 33 could be also included in pretrainset.

Based on these values, this group had only 1 lapse. Any group with only 1 lapse is too small to be used for calculating the stability.

@L-M-Sherlock
Copy link
Member Author

As I said in #119 (comment), what if create another condition based on the ratio (or something similar) of the delta_t and stability and then remove the reviews of cards that fulfill BOTH the conditions.
(By both, I mean the ratio condition and the pretrain outlier condition.)

I recommend opening a new issue to discuss about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Achieve parity with the Python optimizer
4 participants