
How to deal with long code data? #16

Closed
jackfromeast opened this issue Dec 9, 2020 · 3 comments

Comments

@jackfromeast
I am using CodeBERT to classify malicious code written in PHP. Some samples in the dataset are very long, far beyond a typical sentence MAX_LEN such as 256, and raising MAX_LEN quickly exhausts GPU memory. Are there good strategies for dealing with this?

@guoday
Contributor

guoday commented Dec 10, 2020

Our model only supports a max length of 512. One strategy is to split a code snippet into K segments and feed the K segments into CodeBERT separately. Finally, you average (or run an RNN over) the K representations from CodeBERT to obtain a global representation.
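A minimal sketch of this segment-and-pool strategy, assuming the microsoft/codebert-base checkpoint via Hugging Face transformers; the 510-token segment length and mean pooling over the [CLS] vectors are illustrative choices, and an RNN over the segment vectors would be a drop-in replacement for the final average.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed_long_code(code: str, seg_len: int = 510) -> torch.Tensor:
    """Split the token ids into segments of seg_len, encode each segment
    with CodeBERT, and mean-pool the per-segment [CLS] vectors into one
    global representation. seg_len = 510 leaves room for the two special
    tokens within the 512-token limit."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    segments = [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]
    reps = []
    with torch.no_grad():
        for seg in segments:
            # Re-add <s> ... </s> so each segment looks like a full input.
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + seg + [tokenizer.sep_token_id]]
            )
            out = model(input_ids)
            reps.append(out.last_hidden_state[:, 0, :])  # [CLS] vector
    # Average over the K segment representations (shape: hidden_size).
    return torch.cat(reps, dim=0).mean(dim=0)
```

The resulting vector can then be fed to a classifier head; for a trainable pooler, replace the mean with a GRU/LSTM over the stacked segment vectors.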

@jackfromeast
Author

Thank you! That solved my problem!

@Silverhorse7
I am using CodeBERT embeddings in my module. Are the resulting embeddings affected by the token length too? @guoday
