How to deal with id #1023

Open
980202006 opened this issue Jun 9, 2024 · 3 comments

Comments

@980202006

I have some ID values and I want to train BPE on them. The following is an example of the ID values.

26865, 5412, 26865, 26865, 26865, 26865, 5412, 5412, 25283, 26865, 3395, 26865, 3395, 19440, 25283, 3395, 24032, 1175, 3395, 3395, 3395, 26865, 1175, 26865, 15807, 15807, 27062, 27062, 26865, 4759, 26865, 26865, 27062, 1175, 1175, 1175, 382, 382, 382, 382, 27474, 23834, 29768, 11946, 11946, 27474, 17279

I want to extract repeated groups such as [26865, 26865] as vocabulary entries.
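To make the goal concrete, here is a minimal sketch of the counting step behind a BPE merge when every ID is treated as one indivisible symbol. It is plain Python and does not use SentencePiece; the helper name `count_adjacent_pairs` is hypothetical, and the input is only the first eight IDs from the example above.

```python
from collections import Counter


def count_adjacent_pairs(ids):
    """Count every adjacent (left, right) pair, treating each ID as one symbol."""
    # zip(ids, ids[1:]) yields overlapping neighbours: (ids[0], ids[1]), (ids[1], ids[2]), ...
    return Counter(zip(ids, ids[1:]))


# First eight IDs from the example sequence (shortened for illustration).
ids = [26865, 5412, 26865, 26865, 26865, 26865, 5412, 5412]

pair_counts = count_adjacent_pairs(ids)
best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)
```

On this shortened input the most frequent pair is (26865, 26865), which is the kind of entry I would like to end up in the vocabulary. Note that this counts overlapping pairs inside a run, so a real BPE trainer may report slightly different counts.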

@980202006
Author

If I use BPE, `split_by_number` cuts the ID values apart regardless of whether `split_by_whitespace` is enabled or not. For example:
print(sp.id_to_piece(111))  # prints: 65, 26
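The following sketch shows why that piece is a problem. It assumes the training text is the comma-separated ID string and that the piece is literally `65, 26`; the helper `ids_split_by_piece` is hypothetical and only uses the standard library, not SentencePiece.

```python
import re


def ids_split_by_piece(text: str, piece: str) -> list[str]:
    """Return the IDs that the first occurrence of `piece` covers only partially."""
    start = text.find(piece)
    if start == -1:
        return []
    end = start + len(piece)
    partial = []
    for match in re.finditer(r"\d+", text):
        overlaps = match.start() < end and match.end() > start
        fully_inside = start <= match.start() and match.end() <= end
        # An ID is "split" when the piece touches it without containing all of it.
        if overlaps and not fully_inside:
            partial.append(match.group())
    return partial


text = "26865, 5412, 26865, 26865, 26865"
print(ids_split_by_piece(text, "65, 26"))
```

Here the piece takes the tail of one 26865 and the head of the next, so neither ID survives as a whole unit.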

@980202006
Author

@azimjonn Could you share your detailed configuration? The URL you gave points to the default configuration.
