Revlogs parsing #40
Comments
I find that only review_th is inconsistent with my result. It doesn't matter. The order is correct.
Unfortunately, it does matter for my analysis. I need review_th to be right in order to sort the entire file. The delta_t column was also different. I really need to figure this out in my environment.
I recommend checking the pandas documentation for this function: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html I can't be of much help here because I can't reproduce the bug. For debugging, it helps to store the intermediate results during the conversion. You can save the df to CSV after each step. Then you may be able to locate the bug.
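A minimal sketch of that debugging approach, assuming a hypothetical `checkpoint` helper and made-up sample data (neither comes from revlogs2dataset.py). The `method="dense"` argument is also an assumption; use whatever rank method the script actually passes.

```python
import os
import tempfile

import pandas as pd


def checkpoint(df, step_name, out_dir):
    """Write the current DataFrame to <out_dir>/<step_name>.csv and return the path."""
    path = os.path.join(out_dir, f"{step_name}.csv")
    df.to_csv(path, index=False)
    return path


# Made-up sample data, only to illustrate the workflow.
df = pd.DataFrame({"card_id": [0, 0, 1], "review_time": [30, 10, 20]})

out_dir = tempfile.mkdtemp()
checkpoint(df, "01_raw", out_dir)

# rank() returns floats, so cast explicitly to int64 (assumed rank method: dense).
df["review_th"] = df["review_time"].rank(method="dense").astype("int64")
checkpoint(df, "02_ranked", out_dir)
```

Comparing the CSV files from two environments step by step should show the first step at which the values start to differ.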
OK, that makes sense. It appears the densification is working, but the original times coming from the revlog are different. Did you test with the revlog I provided, not your original?
Is there any chance the stats_pb2 file is the wrong version? I got it from @dae's package, and it was needed to run your file. @L-M-Sherlock
I also got it from dae. Could you show some cases where the review time differs?
Here is the output before dropping the rows:
review_time card_id rating review_state is_learn_start sequence_group last_learn_start mask relative_day delta_t i review_th
[6966 rows x 12 columns]
It's weird that the review_time is negative. Could you check whether the values are still correct after the line below?
Ha! My environment demoted the int64 to int32 here, which corrupted the values. Problem solved. @L-M-Sherlock
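A small sketch of how this kind of corruption can arise, using made-up millisecond timestamps (not values from the attached revlog) and a hypothetical `check_fits_int32` guard. One known way to hit it is a platform where NumPy's default integer is 32-bit, such as Windows with NumPy older than 2.0. Whether that was the cause here is an assumption.

```python
import numpy as np

# Made-up millisecond epoch timestamps; real revlog values are of similar size.
review_time_ms = np.array([1_600_000_000_000, 1_600_000_060_000], dtype=np.int64)

# Casting an int64 array to int32 silently keeps only the low 32 bits,
# so large timestamps wrap around and can become negative.
wrapped = review_time_ms.astype(np.int32)


def check_fits_int32(values):
    """Raise OverflowError if any value would not survive a cast to int32."""
    info = np.iinfo(np.int32)
    if values.min() < info.min or values.max() > info.max:
        raise OverflowError("values do not fit in int32; keep them as int64")


# Safer: request the 64-bit type explicitly instead of relying on the default.
safe = review_time_ms.astype(np.int64)
```

Naming `np.int64` (or `"int64"` in pandas) explicitly in every cast avoids depending on the platform default.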
Thanks for this report. I will fix it soon.
revlogs2dataset.zip
Here are the stats_pb2.py and revlogs2dataset.py
Also, here are the 10 revlogs.
10.zip
For file 1 I expected this result:
card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
0,237,1,2
0,380,11,4
1,4,-1,3
1,14,0,1
1,16,0,1
1,21,0,3
1,30,0,3
1,111,2,3
1,160,4,4
1,340,8,3
2,5,-1,1
2,7,0,1
2,10,0,3
2,17,0,3
2,101,2,4
2,158,4,3
2,243,1,2
2,352,7,4
2,384,4,2
from revlog 1, but got this result:
card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
0,6105,2,2
0,6745,10,4
1,4998,-1,3
1,5008,0,1
1,5010,0,1
1,5015,0,3
1,5024,0,3
1,5276,1,3
1,5843,4,4
1,6371,9,3
2,4999,-1,1
2,5001,0,1
2,5004,0,3
2,5011,0,3
2,5266,1,4
2,5841,4,3
2,6111,2,2
2,6383,7,4
2,6800,4,2
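To pin down which rows differ, the two outputs can be compared column by column. This is only a sketch: it uses the first four rows of each result above and assumes both files list the reviews in the same order.

```python
import io

import pandas as pd

# First four rows of the expected and actual results quoted above.
expected_csv = """card_id,review_th,delta_t,rating
0,1,-1,3
0,2,0,3
0,3,4,3
0,163,6,4
"""
actual_csv = """card_id,review_th,delta_t,rating
0,4863,-1,3
0,4864,0,3
0,4997,4,3
0,5846,5,4
"""

expected = pd.read_csv(io.StringIO(expected_csv))
actual = pd.read_csv(io.StringIO(actual_csv))

# Row positions where each column disagrees between the two outputs.
mismatches = {
    col: expected.index[expected[col] != actual[col]].tolist()
    for col in expected.columns
}
print(mismatches)
```

In this sample, review_th differs on every row, while delta_t differs only on the fourth row (6 expected, 5 obtained).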