
How to use the pre-trained model of codebert? #58

Closed
skye95git opened this issue Jul 21, 2021 · 11 comments

Comments

@skye95git

I've run the demo.py of the Siamese-model. The log is:

```
Query: set a variable as hello world
Code: print('hello world')
Score: 2.4148944177682097e-08
Code: s = 'hello world'
Score: 0.999518632888794
Code: hello world
Score: 0.00048138442798517644

Query: Download an image and save the content in output_dir
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    with open(output_dir, 'wb') as f:
        f.write(r.content)

Score: 0.9694535732269287
Code:
def f(image, output_dir):
    with open(output_dir, 'wb') as f:
        f.write(image)

Score: 9.678478818386793e-05
Code:
def f(image_url, output_dir):
    import requests
    r = requests.get(image_url)
    return r.content

Score: 0.03044973686337471
```
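Note that the three scores under each query sum to roughly 1, which is consistent with a softmax over the candidates' similarity logits: the score is a relative ranking over the given candidates, not an absolute quality measure. A minimal sketch of that normalization (the logits below are invented for illustration, not the model's actual values):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical similarity logits for three code candidates of one query
logits = [2.0, 19.5, 11.9]
scores = softmax(logits)
print(scores)          # one dominant score, two tiny ones
print(sum(scores))     # sums to 1 by construction
```

Because of the normalization, one candidate near 1.0 forces the others toward 0, exactly the pattern in the demo log above.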

But when I run the evaluation command:

```
python run.py \
    --output_dir=./saved_models/python \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/python/train.jsonl \
    --eval_data_file=dataset/python/valid.jsonl \
    --test_data_file=dataset/python/test.jsonl \
    --codebase_file=dataset/python/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/python/test.log
```

an error occurs: saved_models/python/test.log: No such file or directory.

I have a few questions:

1. Is tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") in demo.py used to load the tokenizer? The path microsoft/codebert-base does not exist in the Siamese-model folder. Do I need to create it myself? If so, which files do I need to put there?

2. The run command for CodeBERT also has the microsoft/codebert-base parameter, for example:

```
python run_classifier.py \
    --model_type roberta \
    --model_name_or_path microsoft/codebert-base \
    --task_name codesearch \
    --do_predict \
    --output_dir ./models/$lang \
    --data_dir ../data/codesearch/test/$lang \
    --max_seq_length 200 \
    --per_gpu_train_batch_size 32 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 1e-5 \
    --num_train_epochs 8 \
    --test_file batch_${idx}.txt \
    --pred_model_dir ./models/$lang/checkpoint-best/ \
    --test_result_dir ./results/$lang/${idx}_batch_result.txt
```

And the code in the Quick Tour:

```
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")
```

Do I need to create it myself? If so, which files do I need to put there?

3. I tried to evaluate the pre-trained model but got the error saved_models/python/test.log: No such file or directory. Do I need to retrain the model myself?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

Thanks for your answer. So if I want to do code search, I should fine-tune the pre-trained model pytorch_model.bin in the Siamese-model, right? But demo.py worked even though I didn't fine-tune the pre-trained model. Do I only need to fine-tune before evaluating?
I have another question: demo.py prints results like Score: 0.9694535732269287. What does the score stand for? Should it be as close to one as possible?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.

  2. Yes, among the provided code candidates the best match should score as close to one as possible.

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

@skye95git
Author

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

@guoday
Contributor

guoday commented Jul 22, 2021

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

try

```
lang=java
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/train.log
```
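The mkdir -p line is the key part of this fix: tee can create a log file, but it cannot create missing parent directories. The same behavior is easy to reproduce from Python (a small sketch using a throwaway temp directory, not the repo's actual paths):

```python
import os
import shutil
import tempfile

base = os.path.join(tempfile.gettempdir(), "tee_demo")
shutil.rmtree(base, ignore_errors=True)  # start from a clean, nonexistent dir

log_path = os.path.join(base, "train.log")
try:
    # like `tee saved_models/java/train.log` without mkdir: the parent dir is missing
    open(log_path, "w")
    result = "unexpected success"
except FileNotFoundError:
    result = "failed: parent directory missing"
print(result)

# the `mkdir -p` equivalent: create parent directories first, then opening works
os.makedirs(base, exist_ok=True)
with open(log_path, "w") as f:
    f.write("hello\n")
print("log written:", os.path.exists(log_path))
```

This is why the original one-off mkdir saved_models/java sometimes appears to work and sometimes not: the log can only be created if the exact parent directory already exists when the command runs.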

@guoday
Contributor

guoday commented Jul 22, 2021

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

Yes, you can use it directly for the Python code search task. But the model definitions in demo.py and model.py are different, so you can't directly load the saved model demo/python_model in run.py. That's why I said you need to "change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py", which lets the model load successfully.
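The load failure being described here is the usual state-dict key mismatch: when two wrapper classes hold the encoder under different attribute names, the parameter names saved by one don't line up with the names the other expects. A toy illustration with plain dicts (the key names below are hypothetical, not the repo's actual ones):

```python
# Parameter names as saved by one hypothetical wrapper class...
saved_state = {
    "encoder.embeddings.weight": [0.1],
    "encoder.pooler.weight": [0.2],
}
# ...versus the names a differently defined wrapper would look for.
expected_keys = {"roberta.embeddings.weight", "roberta.pooler.weight"}

missing = sorted(k for k in expected_keys if k not in saved_state)
unexpected = sorted(k for k in saved_state if k not in expected_keys)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```

A strict load fails on exactly these two lists, which is why the suggested workaround edits run.py so that its model definition matches the one the demo checkpoint was saved with.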

@skye95git
Author

  1. If you want to do code search, you should fine-tune the CodeBERT model to get model.bin, since I have deleted saved_models/python/checkpoint-best-mrr/model.bin. If you want to evaluate with the model in the demo folder, you can change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py.
  2. Yes, among the provided code candidates the best match should score as close to one as possible.

I'm a little confused. From your comments, the model in the demo folder is the one you fine-tuned for code search. Can't it be used directly for code search tasks?

Yes, you can use it directly for the Python code search task. But the model definitions in demo.py and model.py are different, so you can't directly load the saved model demo/python_model in run.py. That's why I said you need to "change microsoft/codebert-base to demo/python_model and then remove lines 343-345 and 353-355 in run.py", which lets the model load successfully.

Thank you very much! Your answer helped me a lot. I saw you say in the comments that you trained a model that works in six languages. Why do you emphasize Python code search? Can the model in the demo folder also be used directly for Java code search?

@guoday
Contributor

guoday commented Jul 22, 2021

Because that model is fine-tuned only on the Python corpus. CodeBERT itself is indeed pre-trained on six languages, but for a downstream task like code search we fine-tuned six separate models, one per language.
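Since each language gets its own fine-tuned model, the training recipe earlier in the thread is simply repeated per language. A sketch of the outer loop generating one run per language (the six CodeSearchNet languages are assumed here, and the run.py invocation is abbreviated with "..." as a placeholder):

```python
# One output directory and one fine-tuning command per language.
languages = ["ruby", "javascript", "go", "python", "java", "php"]

commands = []
for lang in languages:
    # mirror the `mkdir -p` + run.py + tee pattern from the fix above
    commands.append(
        f"mkdir -p ./saved_models/{lang} && "
        f"python run.py --output_dir=./saved_models/{lang} --do_train ... "
        f"2>&1 | tee saved_models/{lang}/train.log"
    )

for cmd in commands:
    print(cmd)
```

Evaluating Java with a Python-tuned checkpoint would therefore mix up two different fine-tuned models, which is the point being made here.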

@skye95git
Author

  1. For the error saved_models/python/test.log: No such file or directory: please follow the README to fine-tune CodeBERT. We don't provide a fine-tuned model for you.
  2. tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base") loads the tokenizer (the "word splitter").
  3. You don't need to create microsoft/codebert-base yourself. It will be downloaded automatically from Hugging Face.
  4. You need to fine-tune the model as described in README.md.

I ran mkdir saved_models/java and then the fine-tune command:

```
python run.py \
    --output_dir=./saved_models/java \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/java/train.jsonl \
    --eval_data_file=dataset/java/valid.jsonl \
    --test_data_file=dataset/java/test.jsonl \
    --codebase_file=dataset/java/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/java/train.log
```

Still an error: -bash: tee saved_models/java/train.log: No such file or directory.

try

```
lang=java
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/train.log
```

I have finished fine-tuning following your suggestion. But when I try to evaluate:

```
lang=java
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_eval \
    --do_test \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1 | tee saved_models/$lang/test.log
```

I still get -bash: tee saved_models/java/test.log: No such file or directory. A similar command worked during fine-tuning; I don't know why it fails here.

@skye95git
Author

I tried it several times and it worked. Thank you.

@guody5 guody5 closed this as completed Jul 25, 2021