
Revlogs parsing #40

Closed
imrryr opened this issue Jan 27, 2024 · 12 comments · Fixed by #42

Comments

@imrryr

imrryr commented Jan 27, 2024

revlogs2dataset.zip
Here are stats_pb2.py and revlogs2dataset.py.
Also, here are the 10 revlogs:
10.zip
For file 1, I expected this result:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3
2,5,-1,1
2,7,0,1
2,10,0,3
2,17,0,3
2,101,2,4
2,158,4,3
2,243,1,2
2,352,7,4
2,384,4,2

That is what I expected from revlog 1, but I got this result:
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
2,4999,-1,1
2,5001,0,1
2,5004,0,3
2,5011,0,3
2,5266,1,4
2,5841,4,3
2,6111,2,2
2,6383,7,4
2,6800,4,2

@imrryr
Author

imrryr commented Jan 27, 2024

@L-M-Sherlock

@L-M-Sherlock
Member

image

I get the correct result from your code. It seems to be an environment problem.

@L-M-Sherlock
Member

I find that only review_th is inconsistent with my result. It doesn't matter. The order is correct.

@imrryr
Author

imrryr commented Jan 27, 2024

Unfortunately, it does matter for my analysis. I need review_th to be right in order to sort the entire file. The delta_t column was also different. I really need to figure out what is going on in my environment.

@L-M-Sherlock
Member

L-M-Sherlock commented Jan 27, 2024

The review_th is calculated here:

https://github.com/open-spaced-repetition/fsrs-benchmark/blob/ea493cf91900d9c8fd3bd05c42518373875c799f/revlogs2dataset.py#L54

I recommend checking the pandas documentation for this function.

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html

I can't be of much help here because I can't reproduce the bug.

For debugging, it helps to store the intermediate results during the conversion. You can save the df to CSV after each step; then you may be able to locate the bug.
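
A minimal sketch of that debugging approach, for anyone following along. The `checkpoint` helper, the step names, and the sample rows are hypothetical, and the dense rank used for `review_th` is an assumption about the linked line rather than a copy of it. Printing the dtypes next to each CSV makes an unexpected integer width visible early.

```python
import tempfile
from pathlib import Path

import pandas as pd


def checkpoint(df, step, out_dir):
    """Write df to <out_dir>/<step>.csv, print its dtypes, and return it unchanged."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_dir / f"{step}.csv", index=False)
    print(step, df.dtypes.astype(str).to_dict())
    return df


out_dir = Path(tempfile.mkdtemp())  # hypothetical scratch directory

# Hypothetical rows; real data would come from the parsed revlog.
df = pd.DataFrame({"card_id": [0, 0, 1], "review_time": [1000, 5000, 2000]})
df = checkpoint(df, "01_loaded", out_dir)

df["review_time"] = df["review_time"].astype("int64")
df = checkpoint(df, "02_cast", out_dir)

# Assumed form of the review_th step: a dense rank over review_time.
df["review_th"] = df["review_time"].rank(method="dense").astype("int64")
df = checkpoint(df, "03_ranked", out_dir)
```

Comparing the CSV files from two environments step by step should show the first step at which the values diverge.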

@imrryr
Author

imrryr commented Jan 27, 2024

OK, that makes sense. It appears the densification is working but the original times coming from the revlog are different. Did you test with the revlog I provided? Not your original?

@imrryr
Author

imrryr commented Jan 27, 2024

Is there any chance the stats_pb2 file is the wrong version? I got it from @dae's package, and it was needed to run your file. @L-M-Sherlock

@L-M-Sherlock
Member

I also got that from dae. Could you show some examples of the differing review times?

@imrryr
Author

imrryr commented Jan 27, 2024

Here is the output before dropping the rows:

review_time card_id rating review_state is_learn_start sequence_group last_learn_start mask relative_day delta_t i review_th
0 97218963 0 3 0 True 1 1 True -19683 -1 1 4863
1 97224667 0 3 0 False 1 1 True -19683 0 2 4864
2 440742459 0 3 1 False 1 1 True -19679 4 3 4997
3 933416194 0 4 1 False 1 1 True -19674 5 4 5846
4 1046892324 0 2 3 False 1 1 True -19672 2 5 6105
... ... ... ... ... ... ... ... ... ... ... .. ...
7070 -1726999624 645 3 0 False 620 620 True -19705 0 2 1367
7071 -1726339624 645 3 3 False 620 620 True -19705 0 3 1380
7072 -1697912624 645 3 1 False 620 620 True -19704 1 4 1639
7073 -1659497624 645 3 3 False 620 620 True -19704 0 5 1959
7074 -1637230624 645 3 3 False 620 620 True -19704 0 6 2077

[6966 rows x 12 columns]

card_id review_th delta_t rating
0 0 4863 -1 3
1 0 4864 0 3
2 0 4997 4 3
3 0 5846 5 4
4 0 6105 2 2
... ... ... ... ...
7070 645 1367 0 3
7071 645 1380 0 3
7072 645 1639 1 3
7073 645 1959 0 3
7074 645 2077 0 3

@L-M-Sherlock
Member

It's weird that the review_time is negative.

https://github.com/open-spaced-repetition/fsrs-benchmark/blob/ea493cf91900d9c8fd3bd05c42518373875c799f/revlogs2dataset.py#L31

Could you check whether the values are still correct after the line linked above?
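
One way to run that check is sketched below. The `check_review_time` helper is hypothetical and not part of revlogs2dataset.py. The two 32-bit sample values are taken from the output posted earlier in this thread; the 64-bit sample is an illustrative millisecond timestamp.

```python
import pandas as pd


def check_review_time(df):
    """Return a list of problems found in df['review_time'] (empty if none)."""
    problems = []
    col = df["review_time"]
    if str(col.dtype) != "int64":
        problems.append(f"dtype is {col.dtype}, expected int64")
    negatives = int((col < 0).sum())
    if negatives:
        problems.append(f"{negatives} negative value(s)")
    return problems


# Illustrative 64-bit millisecond timestamp: no problems expected.
good = pd.DataFrame({"review_time": pd.Series([1_706_313_600_000], dtype="int64")})

# Two values from the output above, stored as 32-bit integers.
bad = pd.DataFrame({"review_time": pd.Series([97218963, -1726999624], dtype="int32")})

print(check_review_time(good))
print(check_review_time(bad))
```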

@imrryr
Author

imrryr commented Jan 27, 2024

Ha! My environment demoted the int64 to int32 here, which corrupted the values. Problem solved. @L-M-Sherlock
The original line:
df["review_time"] = df["review_time"].astype(int)
fixed with:
df["review_time"] = df["review_time"].astype("int64")
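
For readers who hit the same symptom, here is a small sketch of why a 32-bit cast corrupts millisecond timestamps. The timestamps are illustrative, not taken from the attached revlogs, and `wrap_to_int32` is a hypothetical helper used only to show the arithmetic. One known way to end up with a 32-bit `int` is NumPy before 2.0 on Windows, where the default integer is 32 bits wide; the thread does not say which platform was involved, so treat that as a possible cause only.

```python
import numpy as np

# Illustrative millisecond timestamps, far larger than 2**31 - 1.
review_time_ms = np.array([1_706_313_600_000, 1_706_400_000_000], dtype=np.int64)

# Explicit 32-bit cast: out-of-range values wrap around without an error.
as_int32 = review_time_ms.astype(np.int32)

# Explicit 64-bit cast: values are preserved regardless of platform defaults.
as_int64 = review_time_ms.astype("int64")


def wrap_to_int32(value: int) -> int:
    """Two's-complement wraparound of a Python int to 32 bits."""
    return (value + 2**31) % 2**32 - 2**31


print(as_int32.tolist())
print([wrap_to_int32(v) for v in review_time_ms.tolist()])
print(as_int64.tolist())
```

Naming the width explicitly, as in the fix above, avoids depending on what `int` maps to in a given environment.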

@L-M-Sherlock
Member

Thanks for this report. I will fix it soon.

@L-M-Sherlock L-M-Sherlock linked a pull request Jan 28, 2024 that will close this issue