
How to finetune CodeBERT to do a 4 class classification task. #53

Closed
PedroEstevesPT opened this issue Jul 6, 2021 · 25 comments

Comments

@PedroEstevesPT

Hi,

Recently I have been looking at and experimenting with the clone-detection variant of CodeBERT to solve a 4-class classification problem. However, the model only ever predicts 2 classes, even though the task I am training it on has 4 classes in data.jsonl, train.txt, valid.txt, etc. Is it possible to use the provided examples for multi-class classification with CodeBERT, or is it currently only able to solve a binary classification problem (using the clonedetection folder) out of the box?

Thanks a lot

@guoday
Contributor

guoday commented Jul 7, 2021

CodeBERT-classification.zip

Change config.num_labels=100 in run.py to config.num_labels=4

@PedroEstevesPT
Author

PedroEstevesPT commented Jul 7, 2021

Thanks for the .zip
Just one question about the input format.

In the clonedetection folder there is just one .jsonl, whose dictionaries have two keys: "func" (the function itself) and "idx". The train/test/valid.txt files then contain lines with two idxs (one per function) and the binary label.

In the files you sent, the .jsonl has different keys, "code" and "label", and there are no .txt files. However, after inspecting "code" I noticed it contains only one function per line. Does this mean that, if I want to compare two different functions and classify the pair into 4 classes, I should just concatenate them and plug them into the "code" key?

Thanks a lot

@guoday
Contributor

guoday commented Jul 7, 2021

Concatenating the two functions and putting them into the "code" key is a direct way to solve your problem. However, it is unfair to the second function, since we truncate the input to block_size (i.e. 256 in this case) tokens.

I suggest adding two keys instead: one "code1" and the other "code2". Then change this:

def convert_examples_to_features(js, tokenizer, args):
    # source: a single function under the "code" key
    code = ' '.join(js['code'].split())
    code_tokens = tokenizer.tokenize(code)[:args.block_size - 2]
    source_tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])

to

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim the longer sequence first so both share the token budget.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

def convert_examples_to_features(js, tokenizer, args):
    # source: two functions, joined as [CLS] code1 [SEP] code2 [SEP]
    code1 = ' '.join(js['code1'].split())
    code2 = ' '.join(js['code2'].split())
    code1_tokens = tokenizer.tokenize(code1)
    code2_tokens = tokenizer.tokenize(code2)
    _truncate_seq_pair(code1_tokens, code2_tokens, args.block_size - 3)
    source_tokens = [tokenizer.cls_token] + code1_tokens + [tokenizer.sep_token] + code2_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
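To see what the pairwise truncation does, the helper can be run on its own with toy token lists: it pops tokens from whichever sequence is currently longer, so both functions keep a comparable share of the budget (the token lists here are placeholders, not real tokenizer output):

```python
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Pop from whichever sequence is currently longer until the pair fits.
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = ["tok"] * 10
b = ["tok"] * 4
_truncate_seq_pair(a, b, 8)
print(len(a), len(b))  # → 4 4
```

Naive head truncation to 8 tokens would instead have kept 8 tokens of the first function and none of the second.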

@PedroEstevesPT
Author

Thanks a lot for the prompt reply. I will try this out and let you know the result

@PedroEstevesPT
Author

Hmm... The model is still only predicting '0' and '1' and not outputting labels '2' and '3', despite the change to config.num_labels. Any idea what might be going on?

@guoday
Contributor

guoday commented Jul 7, 2021

Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model will do multi-class classification. You can print prob.shape in model.py; you should see a (bs, 4) shape at prediction time.
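As a rough illustration of that shape check (a NumPy stand-in for the model's classification head, not the actual model.py code):

```python
import numpy as np

# Stand-in for the classification head: with config.num_labels = 4,
# the logits and softmax probabilities have shape (batch_size, 4).
batch_size, hidden_size, num_labels = 2, 768, 4
cls_vectors = np.random.randn(batch_size, hidden_size)   # fake [CLS] states
classifier = np.random.randn(hidden_size, num_labels)    # fake head weights
logits = cls_vectors @ classifier
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
prob = exp / exp.sum(axis=1, keepdims=True)
print(prob.shape)  # → (2, 4)
```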

@PedroEstevesPT
Author

I will try that out. I will also create a very simple dataset:

code1: "a" code2: "a" -> 0
code1: "b" code2: "b" -> 1
code1: "c" code2: "c" -> 2
code1: "d" code2: "d" -> 3

to verify that the problem is not in my dataset (my dataset is imbalanced).
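For reference, such a toy set can be written in the two-key jsonl format like this (file name is hypothetical):

```python
import json

# Four trivially separable pairs, one per class, in the code1/code2 format.
toy = [("a", "a", 0), ("b", "b", 1), ("c", "c", 2), ("d", "d", 3)]

with open("debug.jsonl", "w") as f:
    for c1, c2, label in toy:
        f.write(json.dumps({"code1": c1, "code2": c2, "label": label}) + "\n")

labels = [json.loads(line)["label"] for line in open("debug.jsonl")]
print(labels)  # → [0, 1, 2, 3]
```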

@PedroEstevesPT
Author

The same problem remains...

However, a model.bin is indeed being created and saved in saved_models/checkpoint-best-acc. The shape also matches the 4 labels:

[screenshot: prob shape output]

It also seems I am getting a warning saying that the model is not fine-tuned.

[screenshot: warning message]

Right now I am training the model for just 1 epoch, so I can debug faster. Maybe I need to increase the number of epochs?

@guoday
Contributor

guoday commented Jul 7, 2021

My suggestion is:

  1. Fine-tune the model on your dataset, then load the checkpoint and print prob. If the probabilities for the 3rd and 4th labels look normal, your dataset may be imbalanced.
  2. Or print the labels in model.py to check whether your input actually contains data with the 3rd and 4th labels.

@guoday
Contributor

guoday commented Jul 7, 2021

Sorry, I have found the bug!
Please change

with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        if pred:
            f.write('1\n')
        else:
            f.write('0\n')

to

with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        f.write(str(pred) + '\n')

@PedroEstevesPT
Author

Thank you! I will try that

@PedroEstevesPT
Author

It was that! It is now predicting the 4 classes. Thanks @guoday

@QiushiSun

> CodeBERT-classification.zip
>
> Change config.num_labels=100 in run.py to config.num_labels=4

Hi! Thanks for your reply. I have a question: is the data in your zip part of code_search_net? Thank you!

@guoday
Contributor

guoday commented Nov 15, 2021

> Hi! Thanks for your reply, I have a question that whether the data in your *zip is part of code_search_net, thank you!

No. This is only an example.

@QiushiSun

> No. This is only an example.

Thanks for your prompt reply. I'm trying to use the CodeBERT model for source-code classification tasks; currently I'm using the POJ-104 dataset. Can you suggest other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.

@guoday
Contributor

guoday commented Nov 15, 2021

Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE

@QiushiSun

> Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE

Thank you!

@patelpooja363

Hi Guoday,

I want to use CodeBERT for source code classification (malicious vs. non-malicious) and also for multi-class classification.
I have 10 software packages and their source code, keyed by hash values.
Here are my queries:
1. One source file contains many functions and its length runs to seven digits (like 1867949), but in your dataset the "code" key contains only one function, and the trained model is restricted to 256 tokens. What should I do in this case?
Note: We don't know which individual function in the source code is malicious. We only know whether the whole source code is malicious.
2. Is CodeBERT also suitable for C source code?

It would really help me if you could reply to my query.

Thank you.
Pooja K

@guoday
Contributor

guoday commented Feb 14, 2022

> 1.I have many functions in one source code of length in 7 digit number(like 1867949), but in your dataset keys: Code is containing one function only, restricted to 256 as trained model, so what should I do in this case? [...] 2.Is CodeBERT is suitable for C source code also?

  1. The maximum input length of CodeBERT is only 512 tokens. It's hard to handle source code whose length runs to seven digits, even for other neural networks.
  2. CodeBERT is also suitable for C source code.

@patelpooja363

Hi Guoday,
Good morning.
Thanks a lot for the reply.

Could you please suggest any preprocessing for the source code so that I could use CodeBERT on it? I mean, what could a preprocessing approach for my C source code look like?

@guoday
Contributor

guoday commented Feb 14, 2022

You don't need to preprocess C source code. Treat it like any other programming language and use the original C source code as the input.

@patelpooja363

I am asking because CodeBERT's length limit is only 512, but my input data is longer. In this case, how can I reduce the input dimension or preprocess the code so that I can still use CodeBERT for classification? Should I split my source code into chunks of length 510, concatenate all the outputs from the last hidden layer, and then feed them to a classifier?

@guoday
Contributor

guoday commented Feb 14, 2022

> Should I split my source code of length of 510, and concatenate all output from last hidden layer then feed for classification?

Please refer to #16.
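As a rough sketch of the chunk-and-pool idea described above (the window size of 510 leaves room for [CLS] and [SEP]; the helper name is mine, not from the repo):

```python
def chunk_tokens(tokens, window=510):
    # A 512-token model keeps 510 slots for content after [CLS] and [SEP],
    # so a long token sequence is split into windows of that size.
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # → [510, 510, 180]
```

Each chunk would then be encoded separately, and the per-chunk [CLS] vectors mean-pooled (or max-pooled) before the final classifier.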

@patelpooja363

Thank you, Guoday.
That really cleared up my doubt.

@ap-la

ap-la commented Dec 6, 2023

Is it possible to get prediction scores too?
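The thread doesn't answer this, but since the model computes softmax probabilities anyway, the score of the predicted class could be written next to each label. A sketch with stand-in probabilities (the real values would come from the model's output in model.py):

```python
import numpy as np

# Stand-in for the (num_examples, num_labels) softmax output of the model.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10]])
preds = probs.argmax(axis=1)

with open("predictions.txt", "w") as f:
    for pred, p in zip(preds, probs):
        f.write(f"{pred}\t{p[pred]:.2f}\n")  # predicted label and its score
# predictions.txt now holds one "label<TAB>score" line per example.
```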
