[TODO] feature: build dataset from Anki's sqlite database #3
DataFrame library for Rust: https://docs.rs/polars/latest/polars/ |
I wonder if this could perhaps be accomplished using standard Rust iterators/structures and SQL? Eg fetching the data from the revlog table in the desired order, and collecting it into a Vec<(CardId, Vec)> or similar. Upsides would be avoiding the addition of another non-trivial dependency, and the use of typed structs would make the code somewhat more maintainable than looking up columns by string keys. I'm not sure how easily all of those dataframe calls could be translated though, so don't know if it's practical or not. |
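As a rough sketch of that iterator-based approach (the struct and field names here are illustrative, not the crate's actual API), rows already sorted by `(cid, id)` can be grouped into per-card vectors with plain std code, no dataframe library needed:

```rust
// Hypothetical minimal revlog row; the real table has more columns.
#[derive(Debug, Clone)]
struct RevlogEntry {
    id: i64,  // review timestamp in milliseconds (primary key)
    cid: i64, // card id
}

/// Group rows (already sorted by cid, then id) into Vec<(card_id, reviews)>.
fn group_by_card(rows: Vec<RevlogEntry>) -> Vec<(i64, Vec<RevlogEntry>)> {
    let mut groups: Vec<(i64, Vec<RevlogEntry>)> = Vec::new();
    for row in rows {
        let same_card = groups.last().map_or(false, |(cid, _)| *cid == row.cid);
        if same_card {
            groups.last_mut().unwrap().1.push(row);
        } else {
            groups.push((row.cid, vec![row]));
        }
    }
    groups
}

fn main() {
    let rows = vec![
        RevlogEntry { id: 1, cid: 10 },
        RevlogEntry { id: 2, cid: 10 },
        RevlogEntry { id: 3, cid: 20 },
    ];
    let grouped = group_by_card(rows);
    assert_eq!(grouped.len(), 2);
    assert_eq!(grouped[0].1.len(), 2);
}
```

The SQL side would just be an `ORDER BY cid, id` query collected into the `Vec` before grouping. |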
OK, I will try. It can probably be implemented with two main loops. I use pandas in Python just for efficiency and convenience. |
I have tried to query revlog from Anki db in 5a051e9:

```rust
RevlogEntry {
    id: 1528956429777,
    cid: 1528900957309,
    usn: 1282,
    button_chosen: 4,
    interval: 3,
    last_interval: -60,
    ease_factor: 2500,
    taken_millis: 3918,
    review_kind: 0,
}
```

The next step is to rewrite this code:

```python
df = pd.DataFrame(revlog)
df.columns = ['id', 'cid', 'usn', 'button_chosen', 'interval', 'last_interval', 'ease_factor', 'taken_millis', 'review_kind']
df = df[(df['cid'] <= time.time() * 1000) &
        (df['id'] <= time.time() * 1000)].copy()  # remove revlog with incorrect timestamps
df_set_due_date = df[(df['review_kind'] == 4) & (df['interval'] > 0)]
df.drop(df_set_due_date.index, inplace=True)  # remove revlog generated by `set due date`
df.sort_values(by=['cid', 'id'], inplace=True, ignore_index=True)  # sort revlog by card_id, review_time
df['is_learn_start'] = (df['review_kind'] == 0) & (df['review_kind'].shift() != 0)  # find the first review for each card
df['sequence_group'] = df['is_learn_start'].cumsum()  # if the user never used `forget` on a card, the revlogs from the same card should have the same `sequence_group`
last_learn_start = df[df['is_learn_start']].groupby('cid')['sequence_group'].last()  # get the `sequence_group` of each card's last learning start
df['last_learn_start'] = df['cid'].map(last_learn_start).fillna(0).astype(int)
df['mask'] = df['sequence_group'] >= df['last_learn_start']  # remove the revlogs which happen before the last learning start
df = df[df['mask'] == True].copy()
df = df[(df['review_kind'] != 4)].copy()  # remove revlog generated by `reschedule_cards_as_new`
df = df[(df['review_kind'] != 3) | (df['ease_factor'] != 0)].copy()  # remove revlog in filtered decks with rescheduling disabled
df['review_kind'] = df['review_kind'] + 1
df.loc[df['is_learn_start'], 'review_kind'] = New  # New = 0
df.drop(columns=['is_learn_start', 'sequence_group', 'last_learn_start', 'mask', 'usn', 'interval', 'last_interval', 'ease_factor'], inplace=True)
``` |
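The first pandas steps (the timestamp filter, the `set due date` removal, and the sort) could look roughly like this in plain Rust. This is only a sketch: the struct is abbreviated and the field names are illustrative; timestamps are Unix milliseconds as in Anki's revlog.

```rust
// Abbreviated revlog row; the real table has more columns.
#[derive(Debug, Clone)]
struct RevlogEntry {
    id: i64,  // review timestamp in milliseconds
    cid: i64, // card id (also a millisecond timestamp at creation)
    interval: i32,
    review_kind: i32,
}

/// Mirror the first pandas steps: drop rows whose timestamps lie in the
/// future, drop `set due date` entries (review_kind == 4 with a positive
/// interval), then sort by (cid, id).
fn clean_and_sort(mut rows: Vec<RevlogEntry>, now_ms: i64) -> Vec<RevlogEntry> {
    rows.retain(|r| r.id <= now_ms && r.cid <= now_ms);
    rows.retain(|r| !(r.review_kind == 4 && r.interval > 0));
    rows.sort_by_key(|r| (r.cid, r.id));
    rows
}

fn main() {
    let now_ms = 1_700_000_000_000;
    let rows = vec![
        RevlogEntry { id: 2, cid: 1, interval: 1, review_kind: 1 },
        RevlogEntry { id: 1, cid: 1, interval: 1, review_kind: 0 },
        RevlogEntry { id: 3, cid: 1, interval: 5, review_kind: 4 }, // set due date
        RevlogEntry { id: now_ms + 1, cid: 1, interval: 1, review_kind: 1 }, // future
    ];
    let cleaned = clean_and_sort(rows, now_ms);
    assert_eq!(cleaned.len(), 2);
    assert_eq!(cleaned[0].id, 1); // sorted by (cid, id)
}
```

The grouped steps (`is_learn_start`, `sequence_group`, and the last-learn-start mask) would then run per card after grouping by `cid`. |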
You work fast! :-)

I gave this a quick test on a collection with about 200k revlog entries, and it completed in around 90s. I tried a different collection with about 1M entries, and performance did not scale linearly - at the 8 minute mark it was still early on epoch 2. How have you found performance to compare to training with CPU/GPU in PyTorch? If you weren't already aware, using `cargo test --release` will make sure it's compiled in release mode, which should be faster.

In terms of integration into Anki, I imagine we could have a collection method inside the anki crate that dumps the revlog into a vec of FSRSItems (1). We could then call a method exported by this crate that takes a Vec of those items, and it would process it and return the results. We could call it on a background thread, so users could continue to use Anki for most of the training. Users usually won't have access to a console window, so ideally we'd find some way to poll for the current progress/completion ratio, so the UI can show it. If temporary files need to be written out, it would be nice if the desired folder could be passed into the API call. Any config options that might make sense for the user to adjust should also be passed in to the call, so the frontend can expose a UI to adjust them in the future.

How are you feeling about the state of this at this point? Does it look like using burn might be a viable alternative to the current PyTorch approach, at least if we can solve the clipping problem in #4?

(1) Re FSRSItem, is the length of t_history and r_history always the same? If so, perhaps it would make things clearer to store it as a single structure, building the tensor with something like:

```rust
Data::new(
    items.iter().map(|i| i.rating).collect(),
    Shape::new([items.len()]),
)
``` |
Oh, and I checked the resulting size of a release build, and it's about 11-14MB (including sqlite). Much more viable than tch-rs! I also played around with the wgpu backend, but it was much slower than ndarray on the Linux and M2 systems I tested with. |
If you can provide a synchronous/blocking code example that prints some text for each iteration/epoch, I can convert it so that it writes the progress to a shared data structure protected by a mutex, which we could pass in to the API call. |
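A minimal sketch of that mutex-protected progress idea (the `Progress` struct and its fields are hypothetical, not an agreed API): the training thread updates shared state after each epoch, and the UI thread can lock and read it whenever it wants to redraw.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical progress state shared between the trainer and the UI.
#[derive(Clone, Copy)]
struct Progress {
    epoch: usize,
    total_epochs: usize,
}

fn main() {
    let progress = Arc::new(Mutex::new(Progress { epoch: 0, total_epochs: 5 }));
    let writer = Arc::clone(&progress);

    // The training loop updates the shared state after each epoch...
    let handle = thread::spawn(move || {
        for epoch in 1..=5 {
            // (train one epoch here)
            writer.lock().unwrap().epoch = epoch;
        }
    });

    handle.join().unwrap();
    // ...and any other thread polls it to compute a completion ratio.
    let p = progress.lock().unwrap();
    assert_eq!(p.epoch, 5);
    assert_eq!(p.total_epochs, 5);
}
```

In the real API call, the `Arc<Mutex<Progress>>` (or a clone of it) would be one of the parameters, so the frontend keeps a handle to poll. |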
Your observation is consistent with my inference: performance doesn't scale linearly. The reason is that each review generates a training item containing the card's entire prior history, so for a card with n revlogs the work is 1+2+3+...+n = n(n+1)/2, i.e. O(n^2). Consider two extreme cases: a collection with n revlogs where each card has only one revlog, and a collection where a single card holds all n revlogs. The cost would be O(n) and O(n^2), respectively. |
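The arithmetic behind the two extreme cases can be checked with a tiny helper:

```rust
/// Number of training items produced by cards with the given revlog counts,
/// assuming each review yields one item containing the full prior history.
fn total_items(revlogs_per_card: &[u64]) -> u64 {
    revlogs_per_card.iter().map(|&n| n * (n + 1) / 2).sum()
}

fn main() {
    // 1000 cards with one revlog each: work grows linearly.
    assert_eq!(total_items(&[1; 1000]), 1000);
    // One card holding all 1000 revlogs: quadratic blow-up.
    assert_eq!(total_items(&[1000]), 500_500);
}
```
 |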
I have implemented a method to convert revlogs to FSRSItems. But it needs to process the revlogs card by card; if a card's revlogs are incomplete, its data is invalid and removed. |
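The card-by-card conversion can be pictured like this (a sketch with illustrative types, not the crate's actual implementation): each review of a card yields one item whose history is the full prefix of reviews up to that point, which is also where the n(n+1)/2 item count comes from.

```rust
// Illustrative types; the crate's real FSRSItem differs in detail.
#[derive(Debug, Clone, PartialEq)]
struct Review {
    rating: u32,  // button chosen, 1-4
    delta_t: u32, // days since the previous review
}

#[derive(Debug, Clone, PartialEq)]
struct Item {
    history: Vec<Review>,
}

/// For one card's chronologically sorted reviews, emit one item per review,
/// each containing the entire history up to and including that review.
fn card_to_items(reviews: &[Review]) -> Vec<Item> {
    (1..=reviews.len())
        .map(|len| Item { history: reviews[..len].to_vec() })
        .collect()
}

fn main() {
    let reviews = vec![
        Review { rating: 3, delta_t: 0 },
        Review { rating: 4, delta_t: 1 },
        Review { rating: 3, delta_t: 3 },
    ];
    let items = card_to_items(&reviews);
    assert_eq!(items.len(), 3);
    assert_eq!(items[2].history.len(), 3);
}
```
 |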
I'm pretty satisfied with burn, except for the over-encapsulation of the training process: I couldn't implement my own training loop. Fortunately, they plan to support it (tracel-ai/burn#662 (comment)). If it is supported, continuous learning will be feasible. For old collections, the full training only needs to be applied once; then the FSRS model could be trained on each review (or a batch of reviews) in the subsequent learning. |
Yeah. |
I may be missing something, but couldn't this perhaps be handled in the mapping of FSRSItem to Tensors? Eg in |
Oh, I forgot that. Thanks for the reminder. Do you mean rewriting FSRSItem as the following code?
|
Yep, exactly. You could call it FSRSReview if you found it clearer. One question is how FSRS treats reviews in different SRS programs. Does it compare the different rating values, or does it just treat them as true/false? Does it assume the rating will be 1-4? |
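One way the split could look (a sketch, not necessarily the final definition): a per-review struct holds the rating and elapsed time together, so the two histories can never disagree in length.

```rust
/// A single review of a card (sketch).
struct FSRSReview {
    rating: u32,  // button chosen, 1-4
    delta_t: u32, // days elapsed since the previous review
}

/// The training input for one prediction: a card's review history (sketch).
struct FSRSItem {
    reviews: Vec<FSRSReview>,
}

impl FSRSItem {
    // The separate t_history/r_history vectors fall out as projections.
    fn ratings(&self) -> Vec<u32> {
        self.reviews.iter().map(|r| r.rating).collect()
    }
}

fn main() {
    let item = FSRSItem {
        reviews: vec![
            FSRSReview { rating: 3, delta_t: 0 },
            FSRSReview { rating: 4, delta_t: 2 },
        ],
    };
    assert_eq!(item.ratings(), vec![3, 4]);
}
```
 |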
I have made a draft for the schema of review logs: https://github.com/open-spaced-repetition/fsrs-optimizer#review-logs-schema If other SRS programs also use FSRS, treating different rating values is not a problem. If they don't, we can map their rating values to 1-4. That's what I did in the comparison between FSRS and SM-15: https://github.com/open-spaced-repetition/fsrs-vs-sm15#fsrs-vs-sm-15 |
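For illustration only, such a mapping could be a small function like the one below. The 0-5 input scale and the chosen cut-offs are hypothetical; the actual mapping is a policy choice per source program.

```rust
/// Hypothetical example: collapse another program's 0-5 grade scale onto
/// FSRS's 1-4 (Again/Hard/Good/Easy). Cut-offs here are illustrative.
fn map_grade_to_fsrs(grade: u8) -> u8 {
    match grade {
        0..=2 => 1, // failed recall -> Again
        3 => 2,     // barely correct -> Hard
        4 => 3,     // correct -> Good
        _ => 4,     // perfect -> Easy
    }
}

fn main() {
    assert_eq!(map_grade_to_fsrs(0), 1);
    assert_eq!(map_grade_to_fsrs(3), 2);
    assert_eq!(map_grade_to_fsrs(5), 4);
}
```
 |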
How about we mark this one as done and continue tracking things in #27? |
Python implementation:
https://github.com/open-spaced-repetition/fsrs-optimizer/blob/95694b787bb71ac9883db1201af09e334ee5ee0b/src/fsrs_optimizer/fsrs_optimizer.py#L319-L449