Inability to reproduce CodeBERT codesearch results #215
Hello,
I am currently trying to reproduce the results you report for CodeBERT code search. However, running the instructions found in the respective readme, I am unable to replicate any of the scores you found. Is there any additional setup required?
Comments
Do you mean that you encounter errors during fine-tuning, or that the reproduced numbers are incorrect?
I can run fine-tuning just fine, but the reproduced numbers I get do not match. This happens even when using the exact hyperparameters specified in the readme.
For code search, we have two settings. One is reported in the CodeBERT paper. The other is reported in other papers such as GraphCodeBERT and UniXcoder. If you are referring to the setting reported in the CodeBERT paper, the reproduced numbers should be almost the same as in the paper. Can you give some of the reproduced numbers you get?
I am referring to the first one. I have run it for JavaScript and Go. The scores I've received are:
These do not line up with the scores that are reported for the individual languages in the paper.
It seems that you didn't fine-tune CodeBERT. You can check the training loss and whether it is normal. Besides, you need to check whether you have reloaded the model after fine-tuning.
I don't quite understand: are you saying that you have to run fine-tuning and inference in one single call of run_classifier.py?
You need to fine-tune the model following this command, and then check whether the training loss is normal. From your reproduced numbers, it seems that you did not fine-tune the model. Finally, you can run inference using this command. You need to reload the fine-tuned model, rather than the pretrained one, via --pred_model_dir. Your reproduced results look as if the predictions were randomly guessed.
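For reference, a minimal sketch of what such an inference call can look like. The essential point, as noted above, is that --pred_model_dir points at the fine-tuned checkpoint rather than at microsoft/codebert-base; the other flag values and paths below are assumptions based on the readme being discussed, not the exact command used in this thread.
lang=java #assumption: same language variable as in the fine-tuning command
idx=0 #assumption: index of the test batch file
python run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_predict \
--max_seq_length 200 \
--per_gpu_eval_batch_size 32 \
--data_dir ../data/codesearch/test/$lang \
--output_dir ./models/$lang \
--model_name_or_path microsoft/codebert-base \
--test_file batch_${idx}.txt \
--pred_model_dir ./models/$lang/checkpoint-best/ \
--test_result_dir ./results/$lang/${idx}_batch_result.txt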
I've checked the training scores and they seem to be in order. So the fine-tuned model goes in --pred_model_dir? Oh, but this is already specified in the inference command in the readme. Below are the exact parameters I've used.
For fine-tuning:
For inference:
where $pretrained_model is defined beforehand as microsoft/codebert-base and $lang as one of the available languages. The data directories have also been adjusted to my local paths, but I've double-checked that they are in fact stored correctly.
Yes, the parameters are correct. I also don't know what the problem is. I suggest you check whether the training loss and prediction scores are normal, because these results look worse than random guessing.
Yes, these are normal. This suggests that the problem lies with the inference.
Hello, I tried another test where I ran the inference step together with training. To do this I moved the test.txt file into the same folder as the training and validation sets and then ran run_classifier.py with both steps rolled into one. Here is the command I used:
lang=java #fine-tuning a language-specific model for each programming language
python run_classifier.py
However, the scores are still not improving, suggesting there is an issue with how the model is loaded prior to the inference taking place. The scores are:
I've also encountered this problem. In the train and valid stages, the accuracy and F1 score look good. However, in the test stage, the F1 score is 0.00!
Testing: 1000000it [00:13, 71604.13it/s]
Do you have any solution now? WAITING ONLINE NERVOUSLY.
Hi @Mr-Loevan @frede791,
Hello @fengzhangyin,
Then call mrr.py to calculate the MRR score:
The results are not as expected.
I have done experiments on Ruby and got normal results. I will repeat the experiment on Java.
I repeated the experiment on Java and got the following results:
I execute the following training script on two GPUs:
lang=java #fine-tuning a language-specific model for each programming language
pretrained_model=microsoft/codebert-base #Roberta: roberta-base
python3 run_classifier.py \
--model_type roberta \
--task_name codesearch \
--do_train \
--do_eval \
--eval_all_checkpoints \
--train_file train.txt \
--dev_file valid.txt \
--max_seq_length 200 \
--per_gpu_train_batch_size 32 \
--per_gpu_eval_batch_size 32 \
--learning_rate 1e-5 \
--num_train_epochs 8 \
--gradient_accumulation_steps 1 \
--overwrite_output_dir \
--data_dir ../data/codesearch/train_valid/$lang \
--output_dir ./models/$lang \
--model_name_or_path $pretrained_model
These are the evaluation results during the training phase:
I execute the following inference script on a single GPU:
Then I call mrr.py and get a result of 0.7265 for test batches 2 & 3 of Java.
Thank you very much!!! But I don't know why this dramatically impacted the MRR. And what does the score in mrr.py mean? I know the notion of MRR, but how is the rank calculated from the score?
And why are there two scores in the results?
Excuse my ignorance, I am a beginner. Thank you for your kindness and patience.
In the test data, the correct answer for the i-th batch is at the i-th position, and all the rest are wrong. NL-PL matching is formalized as a binary classification problem. The first score corresponds to category 0 (no match) and the second score corresponds to category 1 (match).
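To make the ranking step concrete (a small illustrative example of the idea, not the exact mrr.py code): for each query, the candidates in its test batch are sorted in descending order of the second score (category 1, "match"); the rank is the position of the known-correct candidate in that ordering, and the final metric is MRR = (1/N) * sum over the N queries of 1/rank. For instance, a query whose correct candidate has the third-highest match score in its batch contributes 1/3 to the average.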
The comments from both of you were enlightening. I want to know: when fine-tuning code search with GraphCodeBERT, is it similar to CodeBERT?
Yes. Just change microsoft/codebert-base to microsoft/graphcodebert-base.
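Concretely, in the fine-tuning script shown above that amounts to changing just one variable (a sketch, assuming the rest of the script stays identical):
pretrained_model=microsoft/graphcodebert-base #instead of microsoft/codebert-base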
Hello again, I've tried to replicate the results using the scripts as executed by fengzhangyin; however, I am now getting a different error relating to non-existent paths:
Traceback (most recent call last):
During handling of the above exception, another exception occurred:
Traceback (most recent call last):