How to deal with long code data? #16
Our model only supports a max length of 512 tokens. One strategy is to split a code into multiple segments that each fit within the 512-token limit.
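The splitting strategy suggested above can be sketched roughly as follows. This is an illustrative assumption, not the maintainer's exact method: the window size, stride, and the idea of mean-pooling per-segment embeddings are all choices made here for the example.

```python
def chunk_tokens(token_ids, max_len=512, stride=256):
    """Yield windows of at most max_len tokens from a long token sequence.

    Adjacent windows overlap by (max_len - stride) tokens so that no code
    spanning a window boundary is lost entirely. max_len=512 matches
    CodeBERT's limit; stride=256 is an arbitrary illustrative choice.
    """
    if len(token_ids) <= max_len:
        yield token_ids
        return
    for start in range(0, len(token_ids), stride):
        yield token_ids[start:start + max_len]
        if start + max_len >= len(token_ids):
            break
```

Each window could then be encoded separately and the per-window embeddings averaged into a single vector for the whole file, e.g. `mean(model(w) for w in chunk_tokens(ids))` (pseudocode; the aggregation scheme is an assumption, not something the model mandates).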
Thank you! It did solve my problem!
I am using CodeBERT embeddings in my module; are the resulting embeddings affected by the token size too? @guoday
I am using CodeBERT to classify malicious code written in PHP. Some code samples in the dataset are far longer than a typical sentence MAX_LEN such as 256, and setting MAX_LEN to a large value quickly exhausts GPU memory. Are there good strategies for dealing with this?