Inability to reproduce CodeBERT codesearch results #215
Hello,
I am currently trying to reproduce the results you report for CodeBERT code search. However, running the instructions found in the respective readme, I am unable to replicate any of the scores you found. Is there any additional setup required?
Comments
Do you mean that you encounter errors during fine-tuning, or that the reproduced numbers are incorrect?
I can run fine-tuning just fine, but the reproduced numbers I get do not match. This happens even when using the exact hyperparameters specified in the readme.
For code search, we have two settings. One is reported in the CodeBERT paper. The other is reported in other papers such as GraphCodeBERT and UniXcoder. If you are referring to the setting reported in the CodeBERT paper, the reproduced numbers should be almost the same as in the paper. Can you give some of the reproduced numbers you get?
I am referring to the first one. I have run it for JavaScript and Go. The scores I've received are:
These do not line up with the scores that are reported for the individual languages in the paper.
It seems that you didn't fine-tune CodeBERT. You can check the training loss and whether it is normal. Besides, you need to check whether you have reloaded the model after fine-tuning.
I don't quite understand: are you saying that you have to run fine-tuning and inference in one single call of run_classifier.py?
You need to fine-tune the model following this command, and then check whether the training loss is normal. From your reproduced numbers, it seems that you did not fine-tune the model. Finally, you can run inference using this command. You need to reload the fine-tuned model, rather than the pretrained one, via --pred_model_dir. Your reproduced results look as if the predictions were randomly guessed.
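For reference, a minimal sketch of what such an inference call can look like. The essential point, as noted above, is that --pred_model_dir points at the fine-tuned checkpoint rather than at microsoft/codebert-base; the other flag values and paths below are assumptions based on the readme being discussed, not the exact command used in this thread.
lang=java #assumption: same language variable as in the fine-tuning command
idx=0 #assumption: index of the test batch file
python run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_predict \
--max_seq_length 200 \
--per_gpu_eval_batch_size 32 \
--data_dir ../data/codesearch/test/$lang \
--output_dir ./models/$lang \
--model_name_or_path microsoft/codebert-base \
--test_file batch_${idx}.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt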
I've checked the training scores and they seem to be in order. So the fine-tuned model goes in --pred_model_dir? Oh, but this is already specified in the inference command in the readme. Below are the exact parameters I've used.
For fine-tuning:
For inference:
where $pretrained_model is defined beforehand as microsoft/codebert-base and $lang as one of the available languages. The data directories have also been adjusted to my local paths, but I've double-checked that they are in fact stored correctly.
Yes, the parameters are correct. I also don't know what the problem is. I suggest you check whether the training loss and prediction scores are normal, because these results look worse than random guessing.
Yes, these are normal. This suggests that the problem lies with the inference.
Hello, I tried another test where I ran the inference step together with training. To do this I moved the test.txt file into the same folder as the training and validation sets and then ran run_classifier.py with both steps rolled into one. Here is the command I used:
lang=java #fine-tuning a language-specific model for each programming language
python run_classifier.py
However, the scores are still not improving, suggesting there is an issue with how the model is loaded prior to the inference taking place. The scores are:
I've also encountered this problem. In the train and valid stages, the accuracy and F1 score look good. However, in the test stage, the F1 score is 0.00!
Testing: 1000000it [00:13, 71604.13it/s]
Do you have any solution now? WAITING ONLINE NERVOUSLY.
Hi @Mr-Loevan @frede791,
Hello @fengzhangyin,
Then call mrr.py to calculate the MRR score:
The results are not as expected.
I have done experiments on Ruby and got normal results. I will repeat the experiment on Java.
I repeated the experiment on Java and got the following results:
I execute the following training script on two GPUs:
lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model
These are the evaluation results during the training phase:
I execute the following inference script on a single GPU:
Then I call mrr.py and get a result of 0.7265 for test batches 2 & 3 of Java.
Thank you very much!!! But I don't know why this dramatically impacted the MRR. And what does the score in mrr.py mean? I know the notion of MRR, but how is the rank calculated from the score?
And why are there two scores in the results?
Excuse my ignorance, I am a beginner. Thank you for your kindness and patience.
In the test data, the correct answer for the i-th batch is at the i-th position, and all the rest are wrong. NL-PL matching is formalized as a binary classification problem. The first score corresponds to category 0 (no match) and the second score corresponds to category 1 (match).
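To make the ranking step concrete (a small illustrative example of the idea, not the exact mrr.py code): for each query, the candidates in its test batch are sorted in descending order of the second score (category 1, "match"); the rank is the position of the known-correct candidate in that ordering, and the final metric is MRR = (1/N) * sum over the N queries of 1/rank. For instance, a query whose correct candidate has the third-highest match score in its batch contributes 1/3 to the average.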
The comments from both of you were enlightening. I want to know: when fine-tuning code search with GraphCodeBERT, is it similar to CodeBERT?
Yes. Just change microsoft/codebert-base to microsoft/graphcodebert-base.
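Concretely, in the fine-tuning script shown above that amounts to changing just one variable (a sketch, assuming the rest of the script stays identical):
pretrained_model=microsoft/graphcodebert-base #instead of microsoft/codebert-base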
Hello again, I've tried to replicate the results using the scripts as executed by fengzhangyin; however, I am now getting a different error relating to non-existent paths:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):