
[TODO] feature: build dataset from Anki's sqlite database #3

Closed
L-M-Sherlock opened this issue Aug 18, 2023 · 17 comments
Labels
enhancement New feature or request

Comments

@L-M-Sherlock
Member

Python implementation:
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/95694b787bb71ac9883db1201af09e334ee5ee0b/src/fsrs_optimizer/fsrs_optimizer.py#L319-L449

@L-M-Sherlock L-M-Sherlock added enhancement New feature or request help wanted Extra attention is needed labels Aug 18, 2023
@L-M-Sherlock
Member Author

DataFrame library for Rust: https://docs.rs/polars/latest/polars/

@dae
Collaborator

dae commented Aug 18, 2023

I wonder if this could perhaps be accomplished using standard Rust iterators/structures and SQL? E.g. fetching the data from the revlog table in the desired order, and collecting it into a Vec<(CardId, Vec)> or similar. The upsides would be avoiding another non-trivial dependency, and using typed structs would make the code somewhat more maintainable than looking up columns by string keys. I'm not sure how easily all of those dataframe calls could be translated, though, so I don't know whether it's practical or not.
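The grouping described above could be sketched with std-only Rust. This is illustrative, not the crate's actual code: `group_by_card` is a hypothetical helper, and `RevlogEntry` is trimmed down to a few fields.

```rust
/// Hypothetical, cut-down revlog row; the real table has more columns.
#[derive(Debug, Clone, PartialEq)]
struct RevlogEntry {
    id: i64,          // review timestamp in ms (also the primary key)
    cid: i64,         // card id
    button_chosen: u8,
}

type CardId = i64;

/// Group rows (assumed already sorted by cid, then id — e.g. via
/// `ORDER BY cid, id` in the SQL query) into (card, reviews) pairs.
fn group_by_card(rows: Vec<RevlogEntry>) -> Vec<(CardId, Vec<RevlogEntry>)> {
    let mut grouped: Vec<(CardId, Vec<RevlogEntry>)> = Vec::new();
    for row in rows {
        match grouped.last_mut() {
            // same card as the previous row: extend the current group
            Some((cid, entries)) if *cid == row.cid => entries.push(row),
            // new card: start a new group
            _ => grouped.push((row.cid, vec![row])),
        }
    }
    grouped
}
```

Because the rows arrive pre-sorted from SQL, a single linear pass suffices; no HashMap or extra sort is needed.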

@L-M-Sherlock
Member Author

OK, I will try. It could probably be implemented with two main loops. I use pandas in Python just for efficiency and convenience.

@L-M-Sherlock
Member Author

I have tried querying the revlog from the Anki db in 5a051e9.

RevlogEntry {
    id: 1528956429777,
    cid: 1528900957309,
    usn: 1282,
    button_chosen: 4,
    interval: 3,
    last_interval: -60,
    ease_factor: 2500,
    taken_millis: 3918,
    review_kind: 0,
}

The next step is to rewrite this code:

df = pd.DataFrame(revlog)
df.columns = ['id', 'cid', 'usn', 'button_chosen', 'interval', 'last_interval', 'ease_factor', 'taken_millis', 'review_kind']

df = df[(df['cid'] <= time.time() * 1000) &
        (df['id'] <= time.time() * 1000)].copy() # remove revlog with incorrect timestamp

df_set_due_date = df[(df['review_kind'] == 4) & (df['interval'] > 0)]
df.drop(df_set_due_date.index, inplace=True) # remove revlog generated by `set due date`

df.sort_values(by=['cid', 'id'], inplace=True, ignore_index=True) # sort revlog by card_id, review_time

df['is_learn_start'] = (df['review_kind'] == 0) & (df['review_kind'].shift() != 0)  # find out the first review for each card
df['sequence_group'] = df['is_learn_start'].cumsum() # if the user never used `forget` on a card, the revlogs from the same card should have the same `sequence_group`
last_learn_start = df[df['is_learn_start']].groupby('cid')['sequence_group'].last() # get the `sequence_group` for each card's `last_learn_start`
df['last_learn_start'] = df['cid'].map(last_learn_start).fillna(0).astype(int)
df['mask'] = df['sequence_group'] >= df['last_learn_start'] # remove the revlogs which happen before the `last_learn_start`
df = df[df['mask'] == True].copy()

df = df[(df['review_kind'] != 4)].copy() # remove revlog generated by `reschedule_cards_as_new`
df = df[(df['review_kind'] != 3) | (df['ease_factor'] != 0)].copy() # remove revlog in filtered decks with rescheduling disabled
df['review_kind'] = df['review_kind'] + 1
df.loc[df['is_learn_start'], 'review_kind'] = New # New = 0
df.drop(columns=['is_learn_start', 'sequence_group', 'last_learn_start', 'mask', 'usn', 'interval', 'last_interval', 'ease_factor'], inplace=True)
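The pandas pipeline above could translate to per-card Rust roughly as follows. This is a hedged sketch, not the crate's actual implementation: `Rev` and `clean_card_revlogs` are hypothetical names, `now_ms` is passed in for the timestamp check, and the last-learning-start logic is applied per card rather than via a global `shift()` over the whole frame.

```rust
/// Hypothetical, cut-down revlog row for one card.
#[derive(Debug, Clone, PartialEq)]
struct Rev {
    id: i64,          // review timestamp in ms
    review_kind: i32, // 0 learn, 1 review, 2 relearn, 3 filtered, 4 manual
    interval: i32,
    ease_factor: i32,
}

/// Apply the same filters as the pandas code, for a single card's revlogs.
fn clean_card_revlogs(mut revs: Vec<Rev>, now_ms: i64) -> Vec<Rev> {
    // sort revlog by review time (the pandas code sorts by cid, id)
    revs.sort_by_key(|r| r.id);
    // remove revlog with incorrect (future) timestamps
    revs.retain(|r| r.id <= now_ms);
    // remove revlog generated by `set due date`
    revs.retain(|r| !(r.review_kind == 4 && r.interval > 0));
    // keep only revlogs from the card's last learning start onward
    let last_learn_start = (0..revs.len())
        .filter(|&i| revs[i].review_kind == 0 && (i == 0 || revs[i - 1].review_kind != 0))
        .last()
        .unwrap_or(0);
    let mut revs = revs.split_off(last_learn_start);
    // remove remaining manual entries, and filtered-deck reviews
    // with rescheduling disabled (ease_factor == 0)
    revs.retain(|r| r.review_kind != 4 && !(r.review_kind == 3 && r.ease_factor == 0));
    revs
}
```

Working per card sidesteps the `sequence_group`/`cumsum` bookkeeping entirely: within one card, the last learning start is just the last index where a `review_kind == 0` run begins.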

@L-M-Sherlock
Member Author

Done in 19dc456.

@dae, I think the next step is to determine the API?

You can test the training here:

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/training.rs#L132-L146

The file `collection.anki21` should be placed in the directory `tests/data/`.

@dae
Collaborator

dae commented Aug 20, 2023

You work fast! :-)

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly: at the 8-minute mark it was still early in epoch 2. How does performance compare to training on CPU/GPU in PyTorch? If you weren't already aware, `cargo test --release` will make sure it's compiled in release mode, which should be faster.

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results. We could call it on a background thread, so users could continue to use Anki for most of the training. Users usually won't have access to a console window, so ideally we'd find some way to poll for the current progress/completion ratio, so the UI can show it. If temporary files need to be written out, it would be nice if the desired folder could be passed into the API call. Any config options that might make sense for the user to adjust should also be passed in to the call, so the frontend can expose a UI to adjust them in the future.

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current pytorch approach, at least if we can solve the clipping problem in #4?

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a reviews: Vec<Review>, where Review is a struct in the form { rating: i32, delta_time: i32 }? Then when it comes time to create a tensor, it could be done like

            Data::new(
                items.iter().map(|i| i.rating).collect(),
                Shape::new([items.len()]),
            )

@dae
Collaborator

dae commented Aug 20, 2023

Oh, and I checked the resulting size of a release build, and it's about 11-14MB (including sqlite). Much more viable than tch-rs!

I also played around with the wgpu backend, but it was much slower than ndarray on the Linux and M2 systems I tested with.

@dae
Collaborator

dae commented Aug 20, 2023

so ideally we'd find some way to poll for the current progress/completion ratio

If you can provide a synchronous/blocking code example that prints some text for each iteration/epoch, I can convert it so that it writes the progress to a shared data structure protected by a mutex, which we could pass in to the API call.

@L-M-Sherlock
Member Author

L-M-Sherlock commented Aug 20, 2023

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly - at the 8 minute mark it was still early on epoch 2.

Your observation is consistent with my inference. The performance doesn't scale linearly. The reason is:

  1. For a card with 10 revlogs, the convertor will generate ~10 FSRSItems.
  2. For these FSRSItems, the lengths of t_history/r_history range from 1 to ~10.
  3. In the training stage, the time complexity is linear in the seq_len.

So, for a card with n revlogs, the total work is 1 + 2 + ... + n = n(n+1)/2, i.e. O(n^2).

Consider two extreme cases: a collection with n revlogs where each card has only one revlog, and one where a single card holds all n revlogs. The computational cost would be n and n(n+1)/2, respectively.
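The arithmetic above can be checked with a one-liner; `total_seq_len` is a hypothetical helper, not part of the crate, summing the per-card sequence lengths that the training loop must process.

```rust
/// Total sequence length processed during training, given the number of
/// revlogs per card: each card with n revlogs contributes 1 + 2 + ... + n
/// = n * (n + 1) / 2, so the cost per card grows quadratically.
fn total_seq_len(revlogs_per_card: &[usize]) -> usize {
    revlogs_per_card.iter().map(|&n| n * (n + 1) / 2).sum()
}
```

For example, 6 revlogs spread over 6 cards cost 6, while 6 revlogs on one card cost 21.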

@L-M-Sherlock
Member Author

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results.

I have implemented a method to convert revlogs to FSRSItems:

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/convertor.rs#L210-L221

But it needs to convert the revlogs card by card. If a card's revlogs are incomplete, its data is invalid and will be removed.

@L-M-Sherlock
Member Author

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current pytorch approach, at least if we can solve the clipping problem in #4?

I'm pretty satisfied with burn, except for the over-encapsulation of the training process: I couldn't implement my own training loop. Fortunately, they plan to support it (tracel-ai/burn#662 (comment)).

If it is supported, continuous learning will be feasible. For old collections, the full training only needs to run once; afterwards, the FSRS model could be updated with each review (or a batch of reviews) during subsequent learning.

@L-M-Sherlock
Member Author

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a reviews: Vec<Review>, where Review is a struct in the form { rating: i32; delta_time: i32; };? Then when it comes time to create a tensor, it could be done like

Yeah, `len(t_history) == len(r_history)` is always true. But if we store it as `reviews: Vec<Review>`, then we will get a tensor with shape `[seq_len, batch_size, 2]`, i.e. a 3D tensor. I don't know how difficult it is to deal with a 3D tensor in `forward`. It's easy in PyTorch, but hard in Burn (partly due to my limited knowledge of Burn and Rust).

https://github.com/open-spaced-repetition/fsrs-optimizer-burn/blob/19dc456680d950191f1c929c89eeed46a00e7314/src/model.rs#L100-L119

@dae
Collaborator

dae commented Aug 20, 2023

I may be missing something, but couldn't this perhaps be handled in the mapping of FSRSItem to Tensors? Eg in fn batch(&self, items: Vec<FSRSItem>) -> FSRSBatch<B>, the separate t_historys/r_historys could be created from the single reviews vector, like in #3 (comment)?

@L-M-Sherlock
Member Author

Oh, I forgot that. Thanks for the reminder. Do you mean to rewrite FSRSItem as the following code?

pub struct FSRSItem {
    pub reviews: Vec<Review>,
    pub delta_t: f32,
    pub label: f32,
}

struct Review {
    pub rating: i32,
    pub delta_time: i32,
}
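With a single `reviews` vector on the item, the separate t_history/r_history sequences can still be recovered at batch time, which is the point of the suggestion above. A sketch under that assumption (`split_histories` is a hypothetical helper, not part of the crate):

```rust
#[derive(Debug, Clone)]
pub struct Review {
    pub rating: i32,
    pub delta_time: i32,
}

#[derive(Debug, Clone)]
pub struct FSRSItem {
    pub reviews: Vec<Review>,
    pub delta_t: f32,
    pub label: f32,
}

impl FSRSItem {
    /// Split the interleaved reviews back into (t_history, r_history),
    /// e.g. so `fn batch` can build two separate 2D tensors instead of
    /// one 3D tensor of shape [seq_len, batch_size, 2].
    pub fn split_histories(&self) -> (Vec<i32>, Vec<i32>) {
        self.reviews
            .iter()
            .map(|r| (r.delta_time, r.rating))
            .unzip()
    }
}
```

This keeps the storage compact while leaving the model's `forward` working on 2D tensors, as in the existing code.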

@dae
Collaborator

dae commented Aug 20, 2023

Yep, exactly. You could call it FSRSReview if you found it clearer.

One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4?

@L-M-Sherlock
Member Author

One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4?

I have made a draft for the schema of review logs:

https://github.com/open-spaced-repetition/fsrs-optimizer#review-logs-schema

If other SRS programs also use FSRS, handling their rating values is not a problem. If they don't, we can map their rating values to 1-4. That's what I did in the comparison between FSRS and SM-15:

https://github.com/open-spaced-repetition/fsrs-vs-sm15#fsrs-vs-sm-15
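As an illustration only, such a mapping could look like the following. This is one plausible way to collapse a 0-5 SuperMemo-style grade into FSRS's four ratings; it is not necessarily the mapping used in the fsrs-vs-sm15 comparison.

```rust
/// Hypothetical mapping from a 0-5 grade to FSRS's 1-4 ratings
/// (1 = Again, 2 = Hard, 3 = Good, 4 = Easy).
fn map_grade_to_rating(grade: u8) -> u8 {
    match grade {
        0..=2 => 1, // failed recall -> Again
        3 => 2,     // barely recalled -> Hard
        4 => 3,     // recalled with some effort -> Good
        _ => 4,     // perfect recall -> Easy
    }
}
```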

@asukaminato0721 asukaminato0721 pinned this issue Aug 21, 2023
@L-M-Sherlock L-M-Sherlock removed the help wanted Extra attention is needed label Aug 22, 2023
@asukaminato0721 asukaminato0721 unpinned this issue Aug 25, 2023
@dae
Collaborator

dae commented Aug 28, 2023

How about we mark this one as done and continue tracking things in #27?
