
How to deal with long code data? #16

Closed
jackfromeast opened this issue Dec 9, 2020 · 3 comments

Comments

@jackfromeast
I am using CodeBERT to classify malicious code written in PHP. Some samples in the dataset are very long, far beyond a typical sentence MAX_LEN such as 256, and raising MAX_LEN quickly exhausts GPU memory. Are there good strategies for dealing with this?

@guoday
Contributor

guoday commented Dec 10, 2020

Our model only supports a max length of 512. One strategy is to split a code snippet into K segments and feed the K segments into CodeBERT separately. Finally, you average (or run an RNN over) the K representations from CodeBERT to obtain a global representation.
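A minimal sketch of this segment-and-pool strategy, assuming the microsoft/codebert-base checkpoint via Hugging Face transformers; the 510-token segment length and mean pooling over the [CLS] vectors are illustrative choices, and an RNN over the segment vectors would be a drop-in replacement for the final average.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed_long_code(code: str, seg_len: int = 510) -> torch.Tensor:
    """Split the token ids into segments of seg_len, encode each segment
    with CodeBERT, and mean-pool the per-segment [CLS] vectors into one
    global representation. seg_len = 510 leaves room for the two special
    tokens within the 512-token limit."""
    ids = tokenizer(code, add_special_tokens=False)["input_ids"]
    segments = [ids[i:i + seg_len] for i in range(0, len(ids), seg_len)]
    reps = []
    with torch.no_grad():
        for seg in segments:
            # Re-add <s> ... </s> so each segment looks like a full input.
            input_ids = torch.tensor(
                [[tokenizer.cls_token_id] + seg + [tokenizer.sep_token_id]]
            )
            out = model(input_ids)
            reps.append(out.last_hidden_state[:, 0, :])  # [CLS] vector
    # Average over the K segment representations (shape: hidden_size).
    return torch.cat(reps, dim=0).mean(dim=0)
```

The resulting vector can then be fed to a classifier head; for a trainable pooler, replace the mean with a GRU/LSTM over the stacked segment vectors.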

@jackfromeast
Author

Thank you! That solved my problem!

@Silverhorse7
I am using CodeBERT embeddings in my module. Are the resulting embeddings affected by the token length too? @guoday
