Convert clonedetection example to multitask/multilabel #56
You can directly take a natural language utterance and one code piece (A or B) as the input to do binary classification in #53. If you want to use GraphCodeBERT, you just need to change
I guess that the input of a natural language utterance and two code pieces (A and B) is too long, so the results were not good. You can try to set
Hmm, right now I would like to try a multi-task approach instead of having to call the model twice (each time performing inference for a different piece of code). Do you have any idea how to get started with this? Thanks a lot
You can change here to get the embeddings of the two pieces of code:
outputs = self.encoder.roberta(inputs_embeds=inputs_embeddings, attention_mask=attn_mask, position_ids=position_idx)[1].reshape(-1,2,768)
And then add a classifier on the embeddings.
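A minimal sketch of that suggestion in PyTorch (the class name, layer sizes, and random input here are illustrative assumptions, not the repo's actual model.py; 768 is the hidden size of roberta-base):

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Illustrative classifier head over the reshaped (-1, 2, 768)
    embeddings of the two code pieces; not the repo's actual code."""
    def __init__(self, hidden=768, num_labels=2):
        super().__init__()
        self.dense = nn.Linear(hidden * 2, hidden)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, outputs):                    # outputs: (batch, 2, hidden)
        x = outputs.reshape(outputs.size(0), -1)   # concatenate the two code embeddings
        x = torch.tanh(self.dense(x))
        return self.out(x)                         # logits: (batch, num_labels)

# Stand-in for the reshaped encoder output: a batch of 3 code pairs.
logits = PairClassifier()(torch.randn(3, 2, 768))
```

The head simply concatenates the two pooled embeddings and maps them to one logit per label.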
Sorry, I was not clear about what I was trying to accomplish (I confused multi-task with multi-label). Basically, forget the multi-task part; I want the model to perform multi-label classification, which means instead of predicting, for example, [1,0,0,0], it can predict one of the 4 cases:
I basically just changed
The model is training; however, I am not sure this is the right procedure to convert the model to perform multi-label evaluation? Could you confirm? When I perform inference it returns a tensor like tensor([[0.7289, 0.2711]], device='cuda:0'). I interpret the first index as the probability that piece of code A is right and piece of code B is wrong, so it seems as expected; however, as I mentioned before, I am not sure. Thanks a lot
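One thing worth checking: two probabilities that sum to 1 suggest a softmax over two classes. In a multi-label setup, each logit is usually squashed independently with a sigmoid and trained with BCEWithLogitsLoss, so the per-label probabilities need not sum to 1. A hedged sketch (the numbers are made up, not taken from the model):

```python
import torch
import torch.nn as nn

# Multi-label setup: one independent sigmoid per label, trained with BCE.
# A target like [1, 0] means "A is right, B is wrong".
logits = torch.tensor([[0.99, -1.0]])            # raw scores from the classifier head
probs = torch.sigmoid(logits)                    # per-label probabilities, independent
target = torch.tensor([[1.0, 0.0]])              # multi-label target [A, B]
loss = nn.BCEWithLogitsLoss()(logits, target)    # training criterion
preds = (probs > 0.5).long()                     # threshold each label on its own
```

With a sigmoid head, both labels can be 1 at the same time, which a 2-way softmax cannot express.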
Which script do you use, GraphCodeBERT (https://github.com/microsoft/CodeBERT/tree/71f6a3bb7e94042be034af8ba21500f7845b3fb5/GraphCodeBERT/clonedetection) or CodeBERT (#53)?
If you use GraphCodeBERT, you can replace model.py as:
Hi, thanks A LOT for posting the code. I confirm I am using GraphCodeBERT and I will try out what you posted above, but could you please tell me the input format for the data (namely the label), just to be sure? Is it going to be a list (e.g. [1,1]) instead of 1?
A list, as shown above.
So you use the CodeBERT script instead of the GraphCodeBERT script~~
Do you mind sharing your dataset with me? Only part of the data (10+ cases) is enough. I will modify the code for you.
Thanks, here are some samples: {"code1": "Write one plus one | def func(): print(\"one plus one\")", "code2": "Write one plus one | def func(): print(\"1+1\")", "label": [1, 1]}
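For reference, such a sample parses with the standard json module once the inner quotes are escaped (the variable names below are illustrative; each dataset line would be one such JSON object):

```python
import json

# One sample: NL utterance + code in "code1"/"code2", and a
# two-element multi-label target, e.g. [1, 1] = both pieces correct.
line = ('{"code1": "Write one plus one | def func(): print(\\"one plus one\\")", '
        '"code2": "Write one plus one | def func(): print(\\"1+1\\")", '
        '"label": [1, 1]}')
sample = json.loads(line)
label = sample["label"]
```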
Please find the zip. CodeBERT-classification-2.zip |
Thanks a lot @guoday ! I will try it out and give feedback in a bit |
Sorry for not answering earlier, I tried it out, and it solved my issue, thanks @guoday . |
Hi, right now the GraphCodeBERT clone detection performs binary classification to decide whether 2 pieces of code are semantically equivalent or not.
The problem I am trying to solve is: Given a natural language utterance and two code pieces (A and B) as input to my model, determine whether:
I tried solving this problem as a 4-class classification task in #53, but the results were not very good, so right now I am trying to transform it into a multi-label/multi-task classification problem, classifying each input twice:
[0,1] -> Whether A is right or wrong.
[0,1] -> Whether B is right or wrong.
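Put differently, the two binary targets jointly encode the four possible outcomes (a plain-Python illustration of the label scheme, not repo code):

```python
# Each pair of binary labels (A, B) names one of the four outcomes,
# where 1 = right and 0 = wrong.
CLASSES = {
    (1, 1): "A right, B right",
    (1, 0): "A right, B wrong",
    (0, 1): "A wrong, B right",
    (0, 0): "A wrong, B wrong",
}

label = [1, 0]                     # example multi-label target
outcome = CLASSES[tuple(label)]
```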
Does anyone have any idea how to accomplish this?
Thanks a lot