[TODO] feature: build dataset from Anki's sqlite database #3
DataFrame library for Rust: https://docs.rs/polars/latest/polars/ |
I wonder if this could perhaps be accomplished using standard Rust iterators/structures and SQL? Eg fetching the data from the revlog table in the desired order, and collecting it into a Vec<(CardId, Vec)> or similar. Upsides would be avoiding the addition of another non-trivial dependency, and the use of typed structs would make the code somewhat more maintainable than looking up columns by string keys. I'm not sure how easily all of those dataframe calls could be translated though, so don't know if it's practical or not. |
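As a rough sketch of that iterator-based approach (the struct and field names here are illustrative, not the crate's actual API), rows already sorted by `(cid, id)` can be grouped into per-card vectors with plain std code, no dataframe library needed:

```rust
// Hypothetical minimal revlog row; the real table has more columns.
#[derive(Debug, Clone)]
struct RevlogEntry {
    id: i64,  // review timestamp in milliseconds (primary key)
    cid: i64, // card id
}

/// Group rows (already sorted by cid, then id) into Vec<(card_id, reviews)>.
fn group_by_card(rows: Vec<RevlogEntry>) -> Vec<(i64, Vec<RevlogEntry>)> {
    let mut groups: Vec<(i64, Vec<RevlogEntry>)> = Vec::new();
    for row in rows {
        let same_card = groups.last().map_or(false, |(cid, _)| *cid == row.cid);
        if same_card {
            groups.last_mut().unwrap().1.push(row);
        } else {
            groups.push((row.cid, vec![row]));
        }
    }
    groups
}

fn main() {
    let rows = vec![
        RevlogEntry { id: 1, cid: 10 },
        RevlogEntry { id: 2, cid: 10 },
        RevlogEntry { id: 3, cid: 20 },
    ];
    let grouped = group_by_card(rows);
    assert_eq!(grouped.len(), 2);
    assert_eq!(grouped[0].1.len(), 2);
}
```

The SQL side would just be an `ORDER BY cid, id` query collected into the `Vec` before grouping. |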
OK, I will try. It can probably be implemented with two main loops. I use pandas in Python just for efficiency and convenience. |
I have tried to query revlog from Anki db in 5a051e9:

```rust
RevlogEntry {
    id: 1528956429777,
    cid: 1528900957309,
    usn: 1282,
    button_chosen: 4,
    interval: 3,
    last_interval: -60,
    ease_factor: 2500,
    taken_millis: 3918,
    review_kind: 0,
}
```

The next step is to rewrite this code:

```python
df = pd.DataFrame(revlog)
df.columns = ['id', 'cid', 'usn', 'button_chosen', 'interval', 'last_interval', 'ease_factor', 'taken_millis', 'review_kind']
df = df[(df['cid'] <= time.time() * 1000) &
        (df['id'] <= time.time() * 1000)].copy()  # remove revlog with incorrect timestamps
df_set_due_date = df[(df['review_kind'] == 4) & (df['interval'] > 0)]
df.drop(df_set_due_date.index, inplace=True)  # remove revlog generated by `set due date`
df.sort_values(by=['cid', 'id'], inplace=True, ignore_index=True)  # sort revlog by card_id, review_time
df['is_learn_start'] = (df['review_kind'] == 0) & (df['review_kind'].shift() != 0)  # find the first review for each card
df['sequence_group'] = df['is_learn_start'].cumsum()  # if the user never used `forget` on a card, the revlogs from the same card should have the same `sequence_group`
last_learn_start = df[df['is_learn_start']].groupby('cid')['sequence_group'].last()  # get the `sequence_group` of each card's last learning start
df['last_learn_start'] = df['cid'].map(last_learn_start).fillna(0).astype(int)
df['mask'] = df['sequence_group'] >= df['last_learn_start']  # remove the revlogs which happen before the last learning start
df = df[df['mask'] == True].copy()
df = df[(df['review_kind'] != 4)].copy()  # remove revlog generated by `reschedule_cards_as_new`
df = df[(df['review_kind'] != 3) | (df['ease_factor'] != 0)].copy()  # remove revlog in filtered decks with rescheduling disabled
df['review_kind'] = df['review_kind'] + 1
df.loc[df['is_learn_start'], 'review_kind'] = New  # New = 0
df.drop(columns=['is_learn_start', 'sequence_group', 'last_learn_start', 'mask', 'usn', 'interval', 'last_interval', 'ease_factor'], inplace=True)
``` |
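The first pandas steps (the timestamp filter, the `set due date` removal, and the sort) could look roughly like this in plain Rust. This is only a sketch: the struct is abbreviated and the field names are illustrative; timestamps are Unix milliseconds as in Anki's revlog.

```rust
// Abbreviated revlog row; the real table has more columns.
#[derive(Debug, Clone)]
struct RevlogEntry {
    id: i64,  // review timestamp in milliseconds
    cid: i64, // card id (also a millisecond timestamp at creation)
    interval: i32,
    review_kind: i32,
}

/// Mirror the first pandas steps: drop rows whose timestamps lie in the
/// future, drop `set due date` entries (review_kind == 4 with a positive
/// interval), then sort by (cid, id).
fn clean_and_sort(mut rows: Vec<RevlogEntry>, now_ms: i64) -> Vec<RevlogEntry> {
    rows.retain(|r| r.id <= now_ms && r.cid <= now_ms);
    rows.retain(|r| !(r.review_kind == 4 && r.interval > 0));
    rows.sort_by_key(|r| (r.cid, r.id));
    rows
}

fn main() {
    let now_ms = 1_700_000_000_000;
    let rows = vec![
        RevlogEntry { id: 2, cid: 1, interval: 1, review_kind: 1 },
        RevlogEntry { id: 1, cid: 1, interval: 1, review_kind: 0 },
        RevlogEntry { id: 3, cid: 1, interval: 5, review_kind: 4 }, // set due date
        RevlogEntry { id: now_ms + 1, cid: 1, interval: 1, review_kind: 1 }, // future
    ];
    let cleaned = clean_and_sort(rows, now_ms);
    assert_eq!(cleaned.len(), 2);
    assert_eq!(cleaned[0].id, 1); // sorted by (cid, id)
}
```

The grouped steps (`is_learn_start`, `sequence_group`, and the last-learn-start mask) would then run per card after grouping by `cid`. |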
You work fast! :-)

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly - at the 8 minute mark it was still early on epoch 2. How have you found performance to compare to training with CPU/GPU in PyTorch? If you weren't already aware, using `cargo test --release` will make sure it's compiled in release mode, which should be faster.

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results. We could call it on a background thread, so users could continue to use Anki for most of the training. Users usually won't have access to a console window, so ideally we'd find some way to poll for the current progress/completion ratio, so the UI can show it. If temporary files need to be written out, it would be nice if the desired folder could be passed into the API call. Any config options that might make sense for the user to adjust should also be passed in to the call, so the frontend can expose a UI to adjust them in the future.

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current PyTorch approach, at least if we can solve the clipping problem in #4?

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a single structure, building the tensor with something like:

```rust
Data::new(
    items.iter().map(|i| i.rating).collect(),
    Shape::new([items.len()]),
)
``` |
Oh, and I checked the resulting size of a release build, and it's about 11-14MB (including sqlite). Much more viable than tch-rs! I also played around with the wgpu backend, but it was much slower than ndarray on the Linux and M2 systems I tested with. |
If you can provide a synchronous/blocking code example that prints some text for each iteration/epoch, I can convert it so that it writes the progress to a shared data structure protected by a mutex, which we could pass in to the API call. |
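A minimal sketch of that mutex-protected progress idea (the `Progress` struct and its fields are hypothetical, not an agreed API): the training thread updates shared state after each epoch, and the UI thread can lock and read it whenever it wants to redraw.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical progress state shared between the trainer and the UI.
#[derive(Clone, Copy)]
struct Progress {
    epoch: usize,
    total_epochs: usize,
}

fn main() {
    let progress = Arc::new(Mutex::new(Progress { epoch: 0, total_epochs: 5 }));
    let writer = Arc::clone(&progress);

    // The training loop updates the shared state after each epoch...
    let handle = thread::spawn(move || {
        for epoch in 1..=5 {
            // (train one epoch here)
            writer.lock().unwrap().epoch = epoch;
        }
    });

    handle.join().unwrap();
    // ...and any other thread polls it to compute a completion ratio.
    let p = progress.lock().unwrap();
    assert_eq!(p.epoch, 5);
    assert_eq!(p.total_epochs, 5);
}
```

In the real API call, the `Arc<Mutex<Progress>>` (or a clone of it) would be one of the parameters, so the frontend keeps a handle to poll. |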
Your observation is consistent with my inference: performance doesn't scale linearly. The reason is that each review generates a training item containing the card's entire prior history, so for a card with n revlogs the work is 1+2+3+...+n = n(n+1)/2, i.e. O(n^2). Consider two extreme cases: a collection with n revlogs where each card has only one revlog, and a collection where a single card holds all n revlogs. The cost would be O(n) and O(n^2), respectively. |
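The arithmetic behind the two extreme cases can be checked with a tiny helper:

```rust
/// Number of training items produced by cards with the given revlog counts,
/// assuming each review yields one item containing the full prior history.
fn total_items(revlogs_per_card: &[u64]) -> u64 {
    revlogs_per_card.iter().map(|&n| n * (n + 1) / 2).sum()
}

fn main() {
    // 1000 cards with one revlog each: work grows linearly.
    assert_eq!(total_items(&[1; 1000]), 1000);
    // One card holding all 1000 revlogs: quadratic blow-up.
    assert_eq!(total_items(&[1000]), 500_500);
}
```
 |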
I have implemented a method to convert revlogs to FSRSItems. But it needs to process the revlogs card by card; if a card's revlogs are incomplete, its data is invalid and removed. |
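The card-by-card conversion can be pictured like this (a sketch with illustrative types, not the crate's actual implementation): each review of a card yields one item whose history is the full prefix of reviews up to that point, which is also where the n(n+1)/2 item count comes from.

```rust
// Illustrative types; the crate's real FSRSItem differs in detail.
#[derive(Debug, Clone, PartialEq)]
struct Review {
    rating: u32,  // button chosen, 1-4
    delta_t: u32, // days since the previous review
}

#[derive(Debug, Clone, PartialEq)]
struct Item {
    history: Vec<Review>,
}

/// For one card's chronologically sorted reviews, emit one item per review,
/// each containing the entire history up to and including that review.
fn card_to_items(reviews: &[Review]) -> Vec<Item> {
    (1..=reviews.len())
        .map(|len| Item { history: reviews[..len].to_vec() })
        .collect()
}

fn main() {
    let reviews = vec![
        Review { rating: 3, delta_t: 0 },
        Review { rating: 4, delta_t: 1 },
        Review { rating: 3, delta_t: 3 },
    ];
    let items = card_to_items(&reviews);
    assert_eq!(items.len(), 3);
    assert_eq!(items[2].history.len(), 3);
}
```
 |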
I'm pretty satisfied with burn, except for the over-encapsulation of the training process: I couldn't implement my own training loop. Fortunately, they plan to support it (tracel-ai/burn#662 (comment)). If it is supported, continuous learning will be feasible. For old collections, the full training only needs to be applied once; then the FSRS model could be trained on each review (or a batch of reviews) in the subsequent learning. |
Yeah. |
I may be missing something, but couldn't this perhaps be handled in the mapping of FSRSItem to Tensors? Eg in |
Oh, I forgot that. Thanks for the reminder. Do you mean rewriting FSRSItem as the following code?
|
Yep, exactly. You could call it FSRSReview if you found it clearer. One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4? |
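One way the split could look (a sketch, not necessarily the final definition): a per-review struct holds the rating and elapsed time together, so the two histories can never disagree in length.

```rust
/// A single review of a card (sketch).
struct FSRSReview {
    rating: u32,  // button chosen, 1-4
    delta_t: u32, // days elapsed since the previous review
}

/// The training input for one prediction: a card's review history (sketch).
struct FSRSItem {
    reviews: Vec<FSRSReview>,
}

impl FSRSItem {
    // The separate t_history/r_history vectors fall out as projections.
    fn ratings(&self) -> Vec<u32> {
        self.reviews.iter().map(|r| r.rating).collect()
    }
}

fn main() {
    let item = FSRSItem {
        reviews: vec![
            FSRSReview { rating: 3, delta_t: 0 },
            FSRSReview { rating: 4, delta_t: 2 },
        ],
    };
    assert_eq!(item.ratings(), vec![3, 4]);
}
```
 |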
I have made a draft for the schema of review logs: https://github.com/open-spaced-repetition/fsrs-optimizer#review-logs-schema If other SRS programs also use FSRS, treating different rating values is not a problem. If they don't, we can map their rating values to 1-4. That's what I did in the comparison between FSRS and SM-15: https://github.com/open-spaced-repetition/fsrs-vs-sm15#fsrs-vs-sm-15 |
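For illustration only, such a mapping could be a small function like the one below. The 0-5 input scale and the chosen cut-offs are hypothetical; the actual mapping is a policy choice per source program.

```rust
/// Hypothetical example: collapse another program's 0-5 grade scale onto
/// FSRS's 1-4 (Again/Hard/Good/Easy). Cut-offs here are illustrative.
fn map_grade_to_fsrs(grade: u8) -> u8 {
    match grade {
        0..=2 => 1, // failed recall -> Again
        3 => 2,     // barely correct -> Hard
        4 => 3,     // correct -> Good
        _ => 4,     // perfect -> Easy
    }
}

fn main() {
    assert_eq!(map_grade_to_fsrs(0), 1);
    assert_eq!(map_grade_to_fsrs(3), 2);
    assert_eq!(map_grade_to_fsrs(5), 4);
}
```
 |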
How about we mark this one as done and continue tracking things in #27? |
Python implementation:
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/95694b787bb71ac9883db1201af09e334ee5ee0b/src/fsrs_optimizer/fsrs_optimizer.py#L319-L449