
more information about the dataset #27

Closed
chaohcc opened this issue Jul 2, 2021 · 9 comments

@chaohcc

chaohcc commented Jul 2, 2021

Thank you!
I have run the RMI on wiki_ts_200M and books_200M successfully. I found that there are duplicate values in the "wiki_ts" dataset, is that correct? I would also like to know more about the datasets; where can I find their descriptions?
The datasets contain only the primary keys, but I find that when the length of each record is extended, the performance of the learned index is greatly affected, since the mapping from keys to positions becomes more complicated. So I would like to try using the true record length to see how the keys map to positions.
Best wishes!
Thank you for your help!
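
For anyone who wants to check the duplicates themselves, here is a minimal C++ sketch (not part of the benchmark). It assumes the SOSD binary layout of a uint64 element count followed by that many sorted uint64 keys; the file name in the usage string is only an example.

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <dataset file, e.g. wiki_ts_200M_uint64>\n";
    return 1;
  }
  std::ifstream in(argv[1], std::ios::binary);
  // Assumed layout: uint64_t key count, followed by the keys themselves.
  uint64_t n = 0;
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  std::vector<uint64_t> keys(n);
  in.read(reinterpret_cast<char*>(keys.data()),
          static_cast<std::streamsize>(n * sizeof(uint64_t)));

  // The keys are stored sorted, so duplicates are adjacent.
  uint64_t duplicates = 0;
  for (uint64_t i = 1; i < n; ++i) {
    if (keys[i] == keys[i - 1]) ++duplicates;
  }
  std::cout << n << " keys, " << duplicates << " duplicate entries\n";
  return 0;
}
```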

@andreaskipf
Contributor

Thanks for reaching out. Yes, wiki contains duplicates. @alexandervanrenen can you provide some information about the dataset?

I don't understand what you mean by record length. We extend each 64-bit unsigned integer key with a 64-bit value (payload). The payloads don't affect the mapping of keys to positions.

However, it is true that with larger payloads the last-mile search (e.g., binary search) would become more expensive due to extra cache misses.
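
As an illustration of this point (the names below are hypothetical, not from the benchmark), a fixed-size record pairs the 64-bit key with a payload; the payload type affects only how far apart the probed records sit in memory during the last-mile search, not which position a key maps to:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size record: a 64-bit key extended with a payload.
// The payload never enters the key -> position mapping.
template <typename Payload>
struct Record {
  uint64_t key;
  Payload payload;
};

// Last-mile search within the position range [lo, hi) predicted by the model.
// With a larger sizeof(Record<Payload>), the probed records are spread over
// more cache lines, which is what makes this step more expensive.
template <typename Payload>
std::size_t LastMileSearch(const std::vector<Record<Payload>>& records,
                           std::size_t lo, std::size_t hi, uint64_t key) {
  auto it = std::lower_bound(
      records.begin() + lo, records.begin() + hi, key,
      [](const Record<Payload>& r, uint64_t k) { return r.key < k; });
  return static_cast<std::size_t>(it - records.begin());
}
```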

@alexandervanrenen
Contributor

Sure! The wiki dataset is based on Wikipedia. We gathered the timestamps of all edits to pages. I think they have second or millisecond granularity, which explains the collisions. You can find the data and descriptions here. If I remember correctly, we only used a subset of the English Wikipedia, as that already contained more than enough timestamps.

@chaohcc
Author

chaohcc commented Jul 2, 2021

> Thanks for reaching out. Yes, wiki contains duplicates. @alexandervanrenen can you provide some information about the dataset?
>
> I don't understand what you mean by record length. We extend each 64-bit unsigned integer key with a 64-bit value (payload). The payloads don't affect the mapping of keys to positions.
>
> However, it is true that with larger payloads the last-mile search (e.g., binary search) would become more expensive due to extra cache misses.
Thank you for your help!
I mean that each key corresponds to an index in the array. If the record length is 1, then with the keys sorted the positions are 0, 1, 2, 3, ... but if the record length is not 1, say 20, then the positions are 0, 20, 40, 60, .... The primary keys are the same, but the mapping between keys and positions changes, and the performance of the learned index is quite different.
Best wishes!

@chaohcc
Author

chaohcc commented Jul 2, 2021

> Sure! The wiki dataset is based on Wikipedia. We gathered the timestamps of all edits to pages. I think they have second or millisecond granularity, which explains the collisions. You can find the data and descriptions here. If I remember correctly, we only used a subset of the English Wikipedia, as that already contained more than enough timestamps.

Thank you for your help, I found the description.

@andreaskipf
Contributor

> I mean that each key corresponds to an index in the array. If the record length is 1, then with the keys sorted the positions are 0, 1, 2, 3, ... but if the record length is not 1, say 20, then the positions are 0, 20, 40, 60, .... The primary keys are the same, but the mapping between keys and positions changes, and the performance of the learned index is quite different.
> Best wishes!

Hi @chaohcc. No, the record length doesn't affect the positions since each record is fixed size. You can just multiply the model's prediction by the record size to get the byte offset of the record. The position of the 5th record will always be 4 (0-indexed). Its byte offset will be 4 * record_size. So the model only needs to know that the 5th record is stored at position 4.

With variable-sized records that's a different story. But even then you probably wouldn't train on the raw byte offsets but instead use another structure for the indirection, such as storing an array of pointers and indexing into it with the model.

I hope that helps.
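
To make the distinction concrete, here is a minimal sketch (hypothetical names, not the benchmark's API): the model works purely in positions, and the record size only enters when a position is converted to a byte offset at the very end.

```cpp
#include <cstddef>

// The model predicts a position; only this final step depends on the record
// size. Sketch only, not code from the benchmark.
constexpr std::size_t ByteOffset(std::size_t position, std::size_t record_size) {
  return position * record_size;
}

// The 5th record is at position 4 regardless of how wide a record is;
// only its byte offset changes with the record size.
static_assert(ByteOffset(4, 16) == 64, "16-byte records");
static_assert(ByteOffset(4, 128) == 512, "128-byte records");
```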

@chaohcc
Author

chaohcc commented Jul 3, 2021

> I mean that each key corresponds to an index in the array. If the record length is 1, then with the keys sorted the positions are 0, 1, 2, 3, ... but if the record length is not 1, say 20, then the positions are 0, 20, 40, 60, .... The primary keys are the same, but the mapping between keys and positions changes, and the performance of the learned index is quite different.
> Best wishes!
>
> Hi @chaohcc. No, the record length doesn't affect the positions since each record is fixed size. You can just multiply the model's prediction by the record size to get the byte offset of the record. The position of the 5th record will always be 4 (0-indexed). Its byte offset will be 4 * record_size. So the model only needs to know that the 5th record is stored at position 4.
>
> With variable-sized records that's a different story. But even then you probably wouldn't train on the raw byte offsets but instead use another structure for the indirection, such as storing an array of pointers and indexing into it with the model.
>
> I hope that helps.

Thank you, that's very helpful. I am also trying variable-size records, building a mapping table with just the keys in an array instead of the raw byte offsets.
Best wishes!
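
For reference, the indirection described above could be structured roughly as follows; the struct and field names are illustrative, not from the benchmark or the issue author's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative layout for variable-sized records: the model is trained on
// the dense positions 0..n-1 of the sorted keys, and a parallel offsets
// array maps a position to the start of the record in a byte heap.
struct VariableSizeTable {
  std::vector<uint64_t> keys;        // sorted keys; the model indexes into this
  std::vector<std::size_t> offsets;  // position i -> byte offset of record i
  std::vector<std::uint8_t> heap;    // concatenated variable-sized records

  // After the model plus last-mile search has located position i for a key,
  // resolve it to the record's bytes through the offsets array.
  const std::uint8_t* RecordAt(std::size_t i) const {
    return heap.data() + offsets[i];
  }
};
```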

@chaohcc
Author

chaohcc commented Aug 31, 2021

> Sure! The wiki dataset is based on Wikipedia. We gathered the timestamps of all edits to pages. I think they have second or millisecond granularity, which explains the collisions. You can find the data and descriptions here. If I remember correctly, we only used a subset of the English Wikipedia, as that already contained more than enough timestamps.

Hi! I have another question about the datasets: I would like more information about 'books_200M'. Are the keys timestamps or some other kind of ID? Thank you for your help!

Best wishes!

@alexandervanrenen
Contributor

@RyanMarcus If I remember correctly, the books dataset has the popularity of each book, i.e., the number of times it was accessed or bought on Amazon. No timestamps.

@chaohcc
Author

chaohcc commented Sep 1, 2021

> @RyanMarcus If I remember correctly, the books dataset has the popularity of each book, i.e., the number of times it was accessed or bought on Amazon. No timestamps.

Thank you so much!

@chaohcc chaohcc closed this as completed Mar 21, 2022