How to use the pre-trained model of codebert? #58
Comments
Thanks for your answer. So if I want to do code search, I should fine-tune the pre-trained model?
I created the directory with mkdir, but still get an error.
I'm a little confused. You said in the comments that the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?
Yes, you can use it directly for the Python code search task. But the definition of the model in
Thank you very much! Your answer helped me a lot. I see you said in the comments that you trained a model for six languages. Why do you emphasize Python code search? Can the model in the demo folder also be used directly for Java code search?
Because that model is only trained on the Python corpus. CodeBERT itself is indeed pre-trained on six languages, but for a downstream task like code search we fine-tuned six separate models, one per language.
I have finished fine-tuning according to your suggestion, but when I try to evaluate I still get an error.
I tried it several times and it worked. Thank you. |
I've run the demo.py of the Siamese-model. The log is:

```
Query: set a variable as hello world
Code: print('hello world')
Score: 2.4148944177682097e-08
Code: s = 'hello world'
Score: 0.999518632888794
Code: hello world
Score: 0.00048138442798517644

Query: Download an image and save the content in output_dir
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)
Score: 0.9694535732269287
Code:
def f(image, output_dir):
    with open(output_dir, 'wb') as f:
        f.write(image)
Score: 9.678478818386793e-05
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    return r.content
Score: 0.03044973686337471
```
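As an aside, the per-query scores in the log above sum to roughly 1, which suggests the Siamese model softmax-normalizes query–code similarity over the candidate set. A minimal sketch of that scoring scheme, using toy stand-in vectors rather than real CodeBERT embeddings (all names and values here are hypothetical):

```python
import math

def dot(u, v):
    """Raw similarity between a query embedding and a code embedding."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Normalize raw similarities into probabilities over the candidates."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings standing in for CodeBERT outputs.
query = [0.2, 0.9, 0.1]
candidates = [
    [0.1, 0.2, 0.9],  # unrelated snippet
    [0.2, 0.8, 0.1],  # close match
    [0.5, 0.5, 0.5],  # partial match
]
probs = softmax([dot(query, c) for c in candidates])
```

Here the closest candidate receives the highest probability, and the probabilities over all candidates sum to 1, matching the shape of the demo log.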
But when I run the evaluation code:
```
python run.py \
    --output_dir=./saved_models/python \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/python/train.jsonl \
    --eval_data_file=dataset/python/valid.jsonl \
    --test_data_file=dataset/python/test.jsonl \
    --codebase_file=dataset/python/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/python/test.log
```
an error occurs:

```
saved_models/python/test.log: No such file or directory.
```
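A guess at the cause (not confirmed in the thread): `tee` cannot create missing parent directories, so if `saved_models/python/` does not exist yet, the `tee saved_models/python/test.log` at the end of the pipeline fails with exactly this message. Creating the directory before running the command may resolve it:

```shell
# tee cannot create missing parent directories itself;
# create the log directory before piping output into it.
mkdir -p saved_models/python
```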
I have a few questions:
1. Is the line `tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")` in demo.py used to load the tokenizer? But the parameter `microsoft/codebert-base` does not exist in the Siamese-model files. Do I need to create it myself? What files do I need to store after I create it?
2. The run command for CodeBERT also has the `microsoft/codebert-base` parameter, for example:

```
python run_classifier.py \
    --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --task_name codesearch \
    --do_predict \
    --output_dir ./models/$lang \
    --data_dir ../data/codesearch/test/$lang \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --test_file batch_${idx}.txt \
    --pred_model_dir ./models/$lang/checkpoint-best/ \
    --test_result_dir ./results/$lang/${idx}_batch_result.txt
```
And the code in the Quick Tour:

```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
```
Do I need to create it myself? What files do I need to store after I create them?
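For context on this question: `microsoft/codebert-base` is a Hugging Face Hub repository id rather than a local path, and `from_pretrained()` downloads and caches the model files automatically, so nothing needs to be created by hand. A hypothetical helper (not part of the repo or of `transformers`) illustrating how such an id maps to Hub file URLs:

```python
def hub_file_url(repo_id, filename, revision="main"):
    """Map a Hub repo id like 'microsoft/codebert-base' to a file URL.

    Illustrative only: from_pretrained() performs this resolution
    (plus caching) internally, so users never build URLs themselves.
    """
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

url = hub_file_url("microsoft/codebert-base", "config.json")
```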
3. I tried to evaluate the pre-trained model, but got the error `saved_models/python/test.log: No such file or directory.` Do I need to retrain the model myself?