How to finetune CodeBERT to do a 4 class classification task. #53
Comments
Thanks for the .zip. In the clonedetection folder there is just one .jsonl file, whose dictionaries have two keys: "func" (a function body) and "idx". The train/test/valid.txt files then have lines containing two idxs (one per function) plus the binary label. In the files you sent, the .jsonl has different keys, "code" and "label", and there are no .txt files. After inspecting "code" I noticed it holds only one function per line. Does this mean that, if I want to compare two different functions and classify the pair into 4 classes, I should just concatenate them and put them into the "code" key? Thanks a lot
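The clone-detection layout described above can be sketched as a small loader. This is a hedged illustration, not the repo's actual code: the file names and the "idx1 idx2 label" line format are taken from the description in this thread.

```python
import json

# Hypothetical sketch of the clone-detection data layout described above:
# data.jsonl has one {"func": ..., "idx": ...} object per line, and the
# train/valid/test.txt files have lines of the form "idx1 idx2 label".
def load_pairs(jsonl_path, pairs_path):
    idx_to_func = {}
    with open(jsonl_path) as f:
        for line in f:
            obj = json.loads(line)
            idx_to_func[obj["idx"]] = obj["func"]
    examples = []
    with open(pairs_path) as f:
        for line in f:
            idx1, idx2, label = line.split()
            examples.append((idx_to_func[idx1], idx_to_func[idx2], int(label)))
    return examples
```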
Concatenating two functions and putting them into the "code" key is a direct way to solve your problem. However, it's unfair to the second function, since we truncate the input to a fixed length. I suggest you add two keys instead: one "code1", and the other "code2", and then change the data loading accordingly.
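The fairness concern above can be sketched as follows. This is an illustrative stand-in, not the repo's actual loading code: `encode_pair`, the tokenizer callable, and the RoBERTa-style special tokens are all assumptions.

```python
# Hedged sketch: split the length budget between the two functions instead of
# truncating one long concatenated string (which mostly cuts the second one).
# "tokenizer" is any callable that turns source code into a token list; the
# <s>/</s> markers mirror a typical RoBERTa-style input layout.
def encode_pair(tokenizer, code1, code2, block_size=512):
    half = (block_size - 3) // 2          # leave room for <s>, </s>, </s>
    t1 = tokenizer(code1)[:half]          # each function gets half the budget
    t2 = tokenizer(code2)[:half]
    return ["<s>"] + t1 + ["</s>"] + t2 + ["</s>"]
```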
Thanks a lot for the prompt reply. I will try this out and let you know the result.
Hum... The model is still just predicting '0' and '1', and not outputting labels '2' and '3', despite changing config.num_labels. Any idea what might be going on?
Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model should do multi-class classification. You can print the predictions to check.
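The role of num_labels can be illustrated with a minimal sketch, assuming a RoBERTa-base hidden size of 768; the encoder itself is omitted and the embeddings below are random stand-ins for the <s> token outputs.

```python
import torch
import torch.nn as nn

# Minimal sketch of the fix discussed above: the classification head must have
# num_labels output units, otherwise the model can only ever emit labels 0 and 1.
num_labels = 4
head = nn.Linear(768, num_labels)

cls_embeddings = torch.randn(2, 768)   # stand-in for the <s> token embeddings
logits = head(cls_embeddings)          # shape: (batch, num_labels)
probs = torch.softmax(logits, dim=-1)
preds = probs.argmax(dim=-1)           # predictions can now take values 0..3
```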
I will try that out. I will also create a very simple dataset (code1: "a", code2: "a" -> 0) to verify that the problem is not in my dataset (my dataset is unbalanced).
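The sanity check described above can be sketched as a script that emits a trivially separable, perfectly balanced dataset; the file name `sanity.jsonl` and the repetition count are arbitrary choices for illustration.

```python
import json

# Sketch of the sanity-check dataset described above: a few trivial pairs
# covering all four labels, to rule out data problems (e.g. class imbalance)
# before debugging the model itself.
pairs = [
    ("a", "a", 0),
    ("b", "b", 1),
    ("c", "c", 2),
    ("d", "d", 3),
]
with open("sanity.jsonl", "w") as f:
    for code1, code2, label in pairs * 50:   # 50 examples per class
        f.write(json.dumps({"code1": code1, "code2": code2, "label": label}) + "\n")
```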
My suggestion is:
Sorry, I have found the bug - -!
Thank you! I will try that.
It was that! It is now predicting the 4 classes. Thanks @guoday
Hi! Thanks for your reply. I have a question: is the data in your .zip part of CodeSearchNet? Thank you!
No. This is only an example.
Thanks for your prompt reply. I'm trying to use the CodeBERT model for source-code classification tasks. Currently I'm using the POJ-104 dataset. Can you give me some suggestions for other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.
Maybe you can look at CodeXGLUE: https://github.com/microsoft/CodeXGLUE
Thank you!
Hi Guoday, I want to use CodeBERT for source-code classification (malicious vs. non-malicious) and also for multi-class classification. It would really help me if you could reply to my query. Thank you.
Hi Guoday, could you please suggest any preprocessing for source code so I can use CodeBERT on it? I mean, what approach should I take to preprocess my C source code?
You don't need to preprocess C source code. Treat it just like any other programming language and use the original C source code as the input.
The reason I am asking is that CodeBERT's input length is only 512 tokens, but my input data is longer. In this case, how can I reduce the input dimension or preprocess the code so that I can use CodeBERT for classification? Should I split my source code into chunks of length 510, concatenate the outputs from the last hidden layer, and feed that to the classifier?
Please refer to #16.
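The sliding-window idea from the question above can be sketched as a simple chunking helper; the window size of 510 leaves room for the two special tokens, and the pooling of per-window encoder outputs (e.g. averaging) would happen downstream.

```python
# Hedged sketch of the chunking approach discussed above: split a long token
# sequence into windows of at most 510 tokens, so each window (plus its two
# special tokens) fits CodeBERT's 512-token limit. Each window would then be
# encoded separately and the per-window outputs pooled for classification.
def make_windows(token_ids, window=510):
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]
```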
Thank you, Guoday.
Is it possible to get prediction scores too?
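Per-class prediction scores are just a softmax over the model's logits; a minimal sketch, with made-up logits standing in for a real model's output:

```python
import math

# Sketch: turn raw logits into per-class prediction scores with a softmax.
# The logits below are invented for illustration.
def softmax(logits):
    m = max(logits)                            # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = softmax([2.0, 0.5, -1.0, 0.1])        # one score per class, summing to 1
```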
Hi,
Recently I have been experimenting with the clone-detection variant of CodeBERT to perform a 4-class classification problem, but the model only predicts 2 classes, even though the task I am training on has 4 classes in data.jsonl, train.txt, valid.txt, etc. Is it possible to use the provided examples for a multi-class classification problem with CodeBERT, or is it currently only able, out of the box, to solve a binary classification problem (using the clonedetection folder)?
Thanks a lot