
How to use the pre-trained model of codebert? #58

Closed
skye95git opened this issue Jul 21, 2021 · 11 comments

Comments

@skye95git

I've run the demo.py of the Siamese-model. The log is:

```
Query: set a variable as hello world
Code: print('hello world')
Score: 2.4148944177682097e-08
Code: s = 'hello world'
Score: 0.999518632888794
Code: hello world
Score: 0.00048138442798517644

Query: Download an image and save the content in output_dir
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)

Score: 0.9694535732269287
Code:
def f(image, output_dir):
    with open(output_dir, 'wb') as f:
        f.write(image)

Score: 9.678478818386793e-05
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    return r.content

Score: 0.03044973686337471
```
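Note that the three scores under each query sum to roughly 1, which is consistent with a softmax over the candidates' similarity logits: the score is a relative ranking over the given candidates, not an absolute quality measure. A minimal sketch of that normalization (the logits below are invented for illustration, not the model's actual values):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical similarity logits for three code candidates of one query
logits = [2.0, 19.5, 11.9]
scores = softmax(logits)
print(scores)          # one dominant score, two tiny ones
print(sum(scores))     # sums to 1 by construction
```

Because of the normalization, one candidate near 1.0 forces the others toward 0, exactly the pattern in the demo log above.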

But when I run the evaluation command:

```
python run.py \
    --output_dir=./saved_models/python \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/python/train.jsonl \
    --eval_data_file=dataset/python/valid.jsonl \
    --test_data_file=dataset/python/test.jsonl \
    --codebase_file=dataset/python/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/python/test.log
```

an error occurs: saved_models/python/test.log: No such file or directory.

I have a few questions:

1. Is tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") in demo.py used to load the tokenizer? The path microsoft/codebert-base does not exist in the Siamese-model folder. Do I need to create it myself? If so, which files do I need to put there?

2. The run command for CodeBERT also has the microsoft/codebert-base parameter, for example:

```
python run_classifier.py \
    --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --task_name codesearch \
    --do_predict \
    --output_dir ./models/$lang \
    --data_dir ../data/codesearch/test/$lang \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --test_file batch_${idx}.txt \
    --pred_model_dir ./models/$lang/checkpoint-best/ \
    --test_result_dir ./results/$lang/${idx}_batch_result.txt
```

And the code in the Quick Tour:

```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
```

Do I need to create it myself? If so, which files do I need to put there?

3. I tried to evaluate the pre-trained model but got the error saved_models/python/test.log: No such file or directory. Do I need to retrain the model myself?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

Thanks for your answer. So if I want to do code search, I should fine-tune the pre-trained model pytorch_model.bin in the Siamese-model, right? But demo.py worked even though I didn't fine-tune the pre-trained model. Do I only need to fine-tune before evaluating?
I have another question: demo.py prints results like Score: 0.9694535732269287. What does the score stand for? Should it be as close to one as possible?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.

  2. Yes, among the provided code candidates the best match should score as close to one as possible.

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

@skye95git
Author

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

try

```
lang=java
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/train.log
```
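The mkdir -p line is the key part of this fix: tee can create a log file, but it cannot create missing parent directories. The same behavior is easy to reproduce from Python (a small sketch using a throwaway temp directory, not the repo's actual paths):

```python
import os
import shutil
import tempfile

base = os.path.join(tempfile.gettempdir(), "tee_demo")
shutil.rmtree(base, ignore_errors=True)  # start from a clean, nonexistent dir

log_path = os.path.join(base, "train.log")
try:
    # like `tee saved_models/java/train.log` without mkdir: the parent dir is missing
    open(log_path, "w")
    result = "unexpected success"
except FileNotFoundError:
    result = "failed: parent directory missing"
print(result)

# the `mkdir -p` equivalent: create parent directories first, then opening works
os.makedirs(base, exist_ok=True)
with open(log_path, "w") as f:
    f.write("hello\n")
print("log written:", os.path.exists(log_path))
```

This is why the original one-off mkdir saved_models/java sometimes appears to work and sometimes not: the log can only be created if the exact parent directory already exists when the command runs.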

@guoday
Contributor

guoday commented Jul 22, 2021

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

Yes, you can use it directly for the Python code search task. But the model definitions in demo.py and model.py are different, so you can't directly load the saved model demo/python_model in run.py. That's why I said you need to "change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py", which lets the model load successfully.
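The load failure being described here is the usual state-dict key mismatch: when two wrapper classes hold the encoder under different attribute names, the parameter names saved by one don't line up with the names the other expects. A toy illustration with plain dicts (the key names below are hypothetical, not the repo's actual ones):

```python
# Parameter names as saved by one hypothetical wrapper class...
saved_state = {
    "encoder.embeddings.weight": [0.1],
    "encoder.pooler.weight": [0.2],
}
# ...versus the names a differently defined wrapper would look for.
expected_keys = {"roberta.embeddings.weight", "roberta.pooler.weight"}

missing = sorted(k for k in expected_keys if k not in saved_state)
unexpected = sorted(k for k in saved_state if k not in expected_keys)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```

A strict load fails on exactly these two lists, which is why the suggested workaround edits run.py so that its model definition matches the one the demo checkpoint was saved with.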

@skye95git
Author

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

Yes, you can use it directly for the Python code search task. But the model definitions in demo.py and model.py are different, so you can't directly load the saved model demo/python_model in run.py. That's why I said you need to "change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py", which lets the model load successfully.

Thank you very much! Your answer helped me a lot. I saw you say in the comments that you trained a model that works in six languages. Why do you emphasize Python code search? Can the model in the demo folder also be used directly for Java code search?

@guoday
Contributor

guoday commented Jul 22, 2021

Because that model is fine-tuned only on the Python corpus. CodeBERT itself is indeed pre-trained on six languages, but for a downstream task like code search we fine-tuned six separate models, one per language.
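Since each language gets its own fine-tuned model, the training recipe earlier in the thread is simply repeated per language. A sketch of the outer loop generating one run per language (the six CodeSearchNet languages are assumed here, and the run.py invocation is abbreviated with "..." as a placeholder):

```python
# One output directory and one fine-tuning command per language.
languages = ["ruby", "javascript", "go", "python", "java", "php"]

commands = []
for lang in languages:
    # mirror the `mkdir -p` + run.py + tee pattern from the fix above
    commands.append(
        f"mkdir -p ./saved_models/{lang} && "
        f"python run.py --output_dir=./saved_models/{lang} --do_train ... "
        f"2>&1 | tee saved_models/{lang}/train.log"
    )

for cmd in commands:
    print(cmd)
```

Evaluating Java with a Python-tuned checkpoint would therefore mix up two different fine-tuned models, which is the point being made here.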

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

try

```
lang=java
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/train.log
```

I have finished fine-tuning following your suggestion. But when I try to evaluate:

```
lang=java
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/test.log
```

I still get -bash: tee saved_models/java/test.log: No such file or directory. A similar command worked during fine-tuning; I don't know why it fails here.

@skye95git
Author

I tried it several times and it worked. Thank you.

@guody5 guody5 closed this as completed Jul 25, 2021