Revlogs parsing #40

imrryr · 2024-01-27T14:02:54Z

revlogs2dataset.zip
Here are the stats_pb2.py and revlogs2dataset.py
Also, here are the 10 revlogs.
10.zip
For file 1 I expected this result:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3
2,5,-1,1
2,7,0,1
2,10,0,3
2,17,0,3
2,101,2,4
2,158,4,3
2,243,1,2
2,352,7,4
2,384,4,2
from revlog 1, but got this result:
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
2,4999,-1,1
2,5001,0,1
2,5004,0,3
2,5011,0,3
2,5266,1,4
2,5841,4,3
2,6111,2,2
2,6383,7,4
2,6800,4,2

imrryr · 2024-01-27T14:04:24Z

@L-M-Sherlock

L-M-Sherlock · 2024-01-27T14:36:43Z

I get the correct result from your code. It seems to be an environment problem.

L-M-Sherlock · 2024-01-27T14:40:02Z

I find that only review_th differs from my result. That doesn't matter, since the order is correct.

imrryr · 2024-01-27T15:30:06Z

Unfortunately, it does matter for my analysis. I need review_th to be right in order to sort the entire file, and the delta_t column was also different. I really need to figure it out in my environment.

L-M-Sherlock · 2024-01-27T15:45:36Z

The review_th is calculated here:

https://github.com/open-spaced-repetition/fsrs-benchmark/blob/ea493cf91900d9c8fd3bd05c42518373875c799f/revlogs2dataset.py#L54

I recommend checking the pandas documentation for this function.

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html
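To make the role of the rank step concrete, here is a minimal sketch of a dense rank over a timestamp column. It assumes review_th is derived from review_time roughly this way; the toy values are made up and this is not the exact line from revlogs2dataset.py.

```python
import pandas as pd

# Toy review log; the column names mirror the thread, the values are made up.
df = pd.DataFrame(
    {
        "card_id": [0, 1, 0, 1],
        "review_time": [1_000, 1_500, 1_500, 9_000],
    }
)

# Dense ranking: equal timestamps share a rank and the ranks have no gaps.
# Assumption: review_th is computed from review_time in roughly this way.
df["review_th"] = df["review_time"].rank(method="dense").astype("int64")

print(df)
```

Because the rank is taken over every row in the file, a change in any review_time value can shift review_th for many other rows, even when the per-card order stays the same.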

I can't be of much help here because I can't reproduce the bug.

For debugging, it helps to store the intermediate results during the conversion. You can save the df to a CSV file after each step and then locate where the values first go wrong.
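A sketch of that snapshot approach follows. The helper name `snapshot`, the step labels, and the toy pipeline are all hypothetical; the real script has different steps, so treat this only as a pattern.

```python
import tempfile
from pathlib import Path

import pandas as pd


def snapshot(df: pd.DataFrame, step: str, out_dir: Path) -> pd.DataFrame:
    """Write df to <out_dir>/<step>.csv and return it unchanged."""
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / f"{step}.csv", index=False)
    return df


# Hypothetical output folder; a temp dir keeps the example self-contained.
out_dir = Path(tempfile.mkdtemp()) / "debug_steps"

# Toy stand-in for the parsed revlog.
df = pd.DataFrame({"card_id": [0, 0, 1], "review_time": [300, 100, 200]})
df = snapshot(df, "01_raw", out_dir)

df = df.sort_values(["card_id", "review_time"]).reset_index(drop=True)
df = snapshot(df, "02_sorted", out_dir)

df["review_th"] = df["review_time"].rank(method="dense").astype("int64")
df = snapshot(df, "03_ranked", out_dir)

# One CSV per step; compare them across machines to find the first mismatch.
print(sorted(p.name for p in out_dir.iterdir()))
```

Running the same script in two environments and diffing the CSV files step by step shows where the outputs first diverge.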

imrryr · 2024-01-27T16:08:28Z

OK, that makes sense. It appears the densification is working, but the original times coming from the revlog are different. Did you test with the revlog I provided, not your original?

imrryr · 2024-01-27T16:11:10Z

Is there any chance the stats_pb2 file is the wrong version? I got it from @dae's package, and it was needed to run your file. @L-M-Sherlock

L-M-Sherlock · 2024-01-27T16:18:00Z

I also got that from dae. Could you show some examples where the review time differs?

imrryr · 2024-01-27T16:22:31Z

Here is an output before dropping the rows:
review_time card_id rating review_state is_learn_start sequence_group last_learn_start mask relative_day delta_t i review_th
0 97218963 0 3 0 True 1 1 True -19683 -1 1 4863
1 97224667 0 3 0 False 1 1 True -19683 0 2 4864
2 440742459 0 3 1 False 1 1 True -19679 4 3 4997
3 933416194 0 4 1 False 1 1 True -19674 5 4 5846
4 1046892324 0 2 3 False 1 1 True -19672 2 5 6105
... ... ... ... ... ... ... ... ... ... ... .. ...
7070 -1726999624 645 3 0 False 620 620 True -19705 0 2 1367
7071 -1726339624 645 3 3 False 620 620 True -19705 0 3 1380
7072 -1697912624 645 3 1 False 620 620 True -19704 1 4 1639
7073 -1659497624 645 3 3 False 620 620 True -19704 0 5 1959
7074 -1637230624 645 3 3 False 620 620 True -19704 0 6 2077

[6966 rows x 12 columns]
card_id review_th delta_t rating
0 0 4863 -1 3
1 0 4864 0 3
2 0 4997 4 3
3 0 5846 5 4
4 0 6105 2 2
... ... ... ... ...
7070 645 1367 0 3
7071 645 1380 0 3
7072 645 1639 1 3
7073 645 1959 0 3
7074 645 2077 0 3

L-M-Sherlock · 2024-01-27T16:29:02Z

It's weird that the review_time is negative.

https://github.com/open-spaced-repetition/fsrs-benchmark/blob/ea493cf91900d9c8fd3bd05c42518373875c799f/revlogs2dataset.py#L31

Could you check whether they are correct after this line?

imrryr · 2024-01-27T18:35:06Z

Ha! My environment demoted the int64 to int32 here, which corrupted the values. Problem solved. @L-M-Sherlock
The line
df["review_time"] = df["review_time"].astype(int)
is fixed with
df["review_time"] = df["review_time"].astype("int64")
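The corruption can be reproduced in isolation. The sketch below casts a millisecond timestamp to a 32-bit integer with NumPy and compares it with the same wraparound written out by hand. The timestamp is an illustrative value, not one taken from the revlogs. One plausible cause, which the thread does not confirm, is that on NumPy versions before 2.0 `astype(int)` maps to the platform C long, which is 32 bits wide on Windows.

```python
import numpy as np

# A millisecond Unix timestamp from late January 2024 (illustrative value);
# it is far too large to fit in 32 bits.
ts_ms = 1_706_364_174_000

as_int64 = np.array([ts_ms], dtype=np.int64)
as_int32 = as_int64.astype(np.int32)  # wraps silently, no error is raised

# The same wraparound written out by hand (two's complement, modulo 2**32).
wrapped = (ts_ms + 2**31) % 2**32 - 2**31

print(as_int64[0], as_int32[0], wrapped)
print(np.iinfo(np.int32).max)  # largest value an int32 can hold
```

Naming the width explicitly, as in `astype("int64")`, removes the dependence on the platform default.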

L-M-Sherlock · 2024-01-28T07:14:47Z

Thanks for this report. I will fix it soon.

L-M-Sherlock linked a pull request Jan 28, 2024 that will close this issue

Fix/replace int with int64 in astype #42

Merged

L-M-Sherlock closed this as completed in #42 Jan 28, 2024
