
How to finetune CodeBERT to do a 4 class classification task. #53

Closed
PedroEstevesPT opened this issue Jul 6, 2021 · 25 comments

Comments

@PedroEstevesPT

Hi,

Recently I have been looking at and experimenting with the clone-detection variant of CodeBERT to solve a 4-class classification problem. However, the model only ever predicts 2 classes, even though the task I am training it on has 4 classes in data.jsonl, train.txt, valid.txt, etc. Is it possible to use the provided examples for multi-class classification with CodeBERT, or is it currently only able to solve a binary classification problem (using the clonedetection folder) out of the box?

Thanks a lot

@guoday
Contributor

guoday commented Jul 7, 2021

CodeBERT-classification.zip

Change config.num_labels=100 in run.py to config.num_labels=4

@PedroEstevesPT
Author

PedroEstevesPT commented Jul 7, 2021

Thanks for the .zip
Just one question about the input format.

In the clonedetection folder there is just one .jsonl, whose dictionaries have two keys: "func" (the function itself) and "idx". The train/test/valid.txt files then contain lines with two idxs (one per function) and the binary label.

In the files you sent, the .jsonl has different keys, "code" and "label", and there are no .txt files. However, after inspecting "code" I noticed it contains only one function per line. Does this mean that, if I want to compare two different functions and classify the pair into 4 classes, I should just concatenate them and plug them into the "code" key?

Thanks a lot

@guoday
Contributor

guoday commented Jul 7, 2021

Concatenating the two functions and putting them into the "code" key is a direct way to solve your problem. However, it is unfair to the second function, since we truncate the input to block_size (i.e. 256 in this case) tokens.

I suggest adding two keys instead: one "code1" and the other "code2". Then change this:

def convert_examples_to_features(js, tokenizer, args):
    # source: a single function under the "code" key
    code = ' '.join(js['code'].split())
    code_tokens = tokenizer.tokenize(code)[:args.block_size - 2]
    source_tokens = [tokenizer.cls_token] + code_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])

to

def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Trim the longer sequence first so both share the token budget.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

def convert_examples_to_features(js, tokenizer, args):
    # source: two functions, joined as [CLS] code1 [SEP] code2 [SEP]
    code1 = ' '.join(js['code1'].split())
    code2 = ' '.join(js['code2'].split())
    code1_tokens = tokenizer.tokenize(code1)
    code2_tokens = tokenizer.tokenize(code2)
    _truncate_seq_pair(code1_tokens, code2_tokens, args.block_size - 3)
    source_tokens = [tokenizer.cls_token] + code1_tokens + [tokenizer.sep_token] + code2_tokens + [tokenizer.sep_token]
    source_ids = tokenizer.convert_tokens_to_ids(source_tokens)
    padding_length = args.block_size - len(source_ids)
    source_ids += [tokenizer.pad_token_id] * padding_length
    return InputFeatures(source_tokens, source_ids, js['label'])
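To see what the pairwise truncation does, the helper can be run on its own with toy token lists: it pops tokens from whichever sequence is currently longer, so both functions keep a comparable share of the budget (the token lists here are placeholders, not real tokenizer output):

```python
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    # Pop from whichever sequence is currently longer until the pair fits.
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

a = ["tok"] * 10
b = ["tok"] * 4
_truncate_seq_pair(a, b, 8)
print(len(a), len(b))  # → 4 4
```

Naive head truncation to 8 tokens would instead have kept 8 tokens of the first function and none of the second.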

@PedroEstevesPT
Author

Thanks a lot for the prompt reply. I will try this out and let you know the result

@PedroEstevesPT
Author

Hmm... The model is still only predicting '0' and '1' and not outputting labels '2' and '3', despite the change to config.num_labels. Any idea what might be going on?

@guoday
Contributor

guoday commented Jul 7, 2021

Have you fine-tuned the model on your dataset? When you set config.num_labels = 4, the model will do multi-class classification. You can print prob.shape in model.py; you should see a (bs, 4) shape at prediction time.
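As a rough illustration of that shape check (a NumPy stand-in for the model's classification head, not the actual model.py code):

```python
import numpy as np

# Stand-in for the classification head: with config.num_labels = 4,
# the logits and softmax probabilities have shape (batch_size, 4).
batch_size, hidden_size, num_labels = 2, 768, 4
cls_vectors = np.random.randn(batch_size, hidden_size)   # fake [CLS] states
classifier = np.random.randn(hidden_size, num_labels)    # fake head weights
logits = cls_vectors @ classifier
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
prob = exp / exp.sum(axis=1, keepdims=True)
print(prob.shape)  # → (2, 4)
```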

@PedroEstevesPT
Author

I will try that out. I will also create a very simple dataset:

code1: "a" code2: "a" -> 0
code1: "b" code2: "b" -> 1
code1: "c" code2: "c" -> 2
code1: "d" code2: "d" -> 3

to verify that the problem is not in my dataset (my dataset is imbalanced).
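For reference, such a toy set can be written in the two-key jsonl format like this (file name is hypothetical):

```python
import json

# Four trivially separable pairs, one per class, in the code1/code2 format.
toy = [("a", "a", 0), ("b", "b", 1), ("c", "c", 2), ("d", "d", 3)]

with open("debug.jsonl", "w") as f:
    for c1, c2, label in toy:
        f.write(json.dumps({"code1": c1, "code2": c2, "label": label}) + "\n")

labels = [json.loads(line)["label"] for line in open("debug.jsonl")]
print(labels)  # → [0, 1, 2, 3]
```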

@PedroEstevesPT
Author

The same problem remains...

However, a model.bin is indeed being created and saved in saved_models/checkpoint-best-acc. The shape also matches the 4 labels:

[screenshot: prob shape output]

It also seems I am getting a warning saying that the model is not fine-tuned.

[screenshot: warning message]

Right now I am training the model for just 1 epoch, so I can debug faster. Maybe I need to increase the number of epochs?

@guoday
Contributor

guoday commented Jul 7, 2021

My suggestion is:

  1. Fine-tune the model on your dataset, then load the checkpoint and print prob. If the probabilities for the 3rd and 4th labels look normal, your dataset may be imbalanced.
  2. Or print the labels in model.py to check whether your input actually contains data with the 3rd and 4th labels.

@guoday
Contributor

guoday commented Jul 7, 2021

Sorry, I have found the bug!
Please change

with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        if pred:
            f.write('1\n')
        else:
            f.write('0\n')

to

with open(os.path.join(args.output_dir, "predictions.txt"), 'w') as f:
    for example, pred in zip(eval_dataset.examples, preds):
        f.write(str(pred) + '\n')

@PedroEstevesPT
Author

Thank you! I will try that

@PedroEstevesPT
Author

It was that! It is now predicting the 4 classes. Thanks @guoday

@QiushiSun

> CodeBERT-classification.zip
>
> Change config.num_labels=100 in run.py to config.num_labels=4

Hi! Thanks for your reply. I have a question: is the data in your zip part of code_search_net? Thank you!

@guoday
Contributor

guoday commented Nov 15, 2021

> Hi! Thanks for your reply, I have a question that whether the data in your *zip is part of code_search_net, thank you!

No. This is only an example.

@QiushiSun

> No. This is only an example.

Thanks for your prompt reply. I'm trying to use the CodeBERT model for source-code classification tasks; currently I'm using the POJ-104 dataset. Can you suggest other datasets (labeled or unlabeled) that could be used for code classification? Thank you so much.

@guoday
Contributor

guoday commented Nov 15, 2021

Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE

@QiushiSun

> Maybe you can look at CodeXGLUE https://github.com/microsoft/CodeXGLUE

Thank you!

@patelpooja363

Hi Guoday,

I want to use CodeBERT for source code classification (malicious vs. non-malicious) and also for multi-class classification.
I have 10 software packages and their source code, keyed by hash values.
Here are my queries:
1. One source file contains many functions and its length runs to seven digits (like 1867949), but in your dataset the "code" key contains only one function, and the trained model is restricted to 256 tokens. What should I do in this case?
Note: We don't know which individual function in the source code is malicious. We only know whether the whole source code is malicious.
2. Is CodeBERT also suitable for C source code?

It would really help me if you could reply to my query.

Thank you.
Pooja K

@guoday
Contributor

guoday commented Feb 14, 2022

> 1.I have many functions in one source code of length in 7 digit number(like 1867949), but in your dataset keys: Code is containing one function only, restricted to 256 as trained model, so what should I do in this case? [...] 2.Is CodeBERT is suitable for C source code also?

  1. The maximum input length of CodeBERT is only 512 tokens. It's hard to handle source code whose length runs to seven digits, even for other neural networks.
  2. CodeBERT is also suitable for C source code.

@patelpooja363

Hi Guoday,
Good morning.
Thanks a lot for the reply.

Could you please suggest any preprocessing for the source code so that I could use CodeBERT on it? I mean, what could a preprocessing approach for my C source code look like?

@guoday
Contributor

guoday commented Feb 14, 2022

You don't need to preprocess C source code. Treat it like any other programming language and use the original C source code as the input.

@patelpooja363

I am asking because CodeBERT's length limit is only 512, but my input data is longer. In this case, how can I reduce the input dimension or preprocess the code so that I can still use CodeBERT for classification? Should I split my source code into chunks of length 510, concatenate all the outputs from the last hidden layer, and then feed them to a classifier?

@guoday
Contributor

guoday commented Feb 14, 2022

> Should I split my source code of length of 510, and concatenate all output from last hidden layer then feed for classification?

Please refer to #16.
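As a rough sketch of the chunk-and-pool idea described above (the window size of 510 leaves room for [CLS] and [SEP]; the helper name is mine, not from the repo):

```python
def chunk_tokens(tokens, window=510):
    # A 512-token model keeps 510 slots for content after [CLS] and [SEP],
    # so a long token sequence is split into windows of that size.
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # → [510, 510, 180]
```

Each chunk would then be encoded separately, and the per-chunk [CLS] vectors mean-pooled (or max-pooled) before the final classifier.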

@patelpooja363

Thank you, Guoday.
That really cleared up my doubt.

@ap-la

ap-la commented Dec 6, 2023

Is it possible to get prediction scores too?
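The thread doesn't answer this, but since the model computes softmax probabilities anyway, the score of the predicted class could be written next to each label. A sketch with stand-in probabilities (the real values would come from the model's output in model.py):

```python
import numpy as np

# Stand-in for the (num_examples, num_labels) softmax output of the model.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.50, 0.20, 0.10]])
preds = probs.argmax(axis=1)

with open("predictions.txt", "w") as f:
    for pred, p in zip(preds, probs):
        f.write(f"{pred}\t{p[pred]:.2f}\n")  # predicted label and its score
# predictions.txt now holds one "label<TAB>score" line per example.
```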
