Results on natural language code retrieval #91

Closed
skye95git opened this issue Dec 10, 2021 · 15 comments

@skye95git
[screenshot: code search results table from the CodeBERT paper]

Why is the MA-AVG of CODEBERT (MLM, INIT=R) about 3% higher than that of PT W/ CODE ONLY (INIT=R)?

Is it because their network structure is different?

But as described in the paper, they use the same network architecture and the same objective function (MLM):

We develop CodeBERT by using exactly the same model architecture as RoBERTa-base. The total number of model parameters is 125M.

Is it because they use different pre-training data?
As described in the paper:

In the MLM objective, only bimodal data (i.e. datapoints of NL-PL pairs) is used for training.
RoBERTa which is continuously trained with masked language modeling on codes only.

@guoday
Contributor

guoday commented Dec 11, 2021

PT W/ CODE ONLY (INIT=R) only uses code to pre-train, but CODEBERT (MLM, INIT=R) uses both code and NL to pre-train the model.

@skye95git
Author

PT W/ CODE ONLY (INIT=R) only uses code to pre-train, but CODEBERT (MLM, INIT=R) uses both code and NL to pre-train the model.

So it's because of the different data used for pre-training, right?
CodeBERT still uses exactly the same model architecture as RoBERTa, right?

@guoday
Contributor

guoday commented Dec 13, 2021

Both of them use the same model architecture as RoBERTa and the same pre-training data, but PT W/ CODE ONLY (INIT=R) only uses the code (i.e., the natural language comments are removed) for pre-training.
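To make the data difference concrete, here is a minimal sketch (not the authors' preprocessing code; the function names are made up) of how the MLM pre-training inputs differ between the two settings, using the [CLS]/[SEP]/[EOS] segment layout described in the paper:

# Hypothetical sketch: MLM pre-training inputs for the two settings.
# CODEBERT (MLM, INIT=R) sees the NL docstring together with the code,
# while PT W/ CODE ONLY (INIT=R) drops the natural language segment.

def bimodal_input(nl_tokens, code_tokens):
    # [CLS] w1 ... wn [SEP] c1 ... cm [EOS], as described in the CodeBERT paper
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + code_tokens + ["[EOS]"]

def code_only_input(code_tokens):
    # same architecture and MLM objective, but only the code segment is kept
    return ["[CLS]"] + code_tokens + ["[EOS]"]

Random positions in either layout are then masked and predicted with the usual MLM objective.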

@skye95git
Author

Both of them use the same model architecture as RoBERTa and the same pre-training data, but PT W/ CODE ONLY (INIT=R) only uses the code (i.e., the natural language comments are removed) for pre-training.

Thanks for your reply! I have another question. As described in the paper:

  • The second objective is replaced token detection (RTD), which further uses a large amount of unimodal data, such as codes without paired natural language texts.
  • there are two data generators here, an NL generator $p^{G_{w}}$ and a PL generator $p^{G_{c}}$. The PL training data is the unimodal codes as shown in Table 1, and the NL training data comes from the documentations from bimodal data.

I wonder which dataset RTD is trained on.
Doesn't the paper say that it is trained on unimodal data? Why are both NL and PL used here?
Are the NL and PL generators trained separately with their corresponding unimodal data?
If so, what is the input to RTD? How is it trained on two unimodal corpora?
Both the diagram and the formula seem to indicate that NL and PL are input simultaneously, which looks like NL-PL pairs.

[screenshots: the RTD diagram and objective formula from the paper]

@fengzhangyin
Collaborator

We first learn two generators separately with corresponding unimodal data to generate plausible alternatives for the set of randomly masked positions. Specifically, we implement two n-gram language models with bidirectional contexts.
Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

The NL-Code discriminator is the targeted pre-trained model; both the NL and code generators are thrown out in the fine-tuning step.
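As a rough illustration of this two-stage procedure (my own sketch, not the released pre-training code; `nl_gen`, `code_gen`, and their `.sample()` interface are hypothetical stand-ins for the frozen n-gram generators), one RTD step on a bimodal NL-PL example could look like:

import torch
import torch.nn.functional as F

# Hypothetical sketch of one RTD training step for the NL-Code discriminator.
# nl_gen / code_gen stand for the two n-gram language models trained beforehand
# on unimodal NL and code data; here they are frozen and only used for sampling.
def rtd_step(discriminator, nl_gen, code_gen, nl_ids, code_ids, masked_positions):
    tokens = list(nl_ids) + list(code_ids)          # bimodal NL-PL input
    labels = torch.zeros(len(tokens))               # 0 = original token
    for pos in masked_positions:
        gen = nl_gen if pos < len(nl_ids) else code_gen
        tokens[pos] = gen.sample(tokens, pos)       # plausible alternative token
        labels[pos] = 1.0                           # 1 = replaced token
    logits = discriminator(torch.tensor(tokens).unsqueeze(0)).squeeze(0)  # one logit per position
    # per-position binary classification: original vs. replaced
    return F.binary_cross_entropy_with_logits(logits, labels)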

@skye95git
Author

skye95git commented Dec 16, 2021

We first learn two generators separately with corresponding unimodal data to generate plausible alternatives for the set of randomly masked positions. Specifically, we implement two n-gram language models with bidirectional contexts. Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

The NL-Code discriminator is the targeted pre-trained model; both the NL and code generators are thrown out in the fine-tuning step.

Thanks for your reply! I understand now that the discriminator is the targeted pre-trained model and that both the NL and code generators are thrown out in the fine-tuning step. But I still don't understand the RTD pre-training process.

  1. We first learn two generators separately with corresponding unimodal data. Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

So, are the generators and the discriminator trained separately? First train the two generators, and then fix the two trained generators to train the discriminator?

  2. NL-Code discriminator is used for producing general-purpose representations in the fine-tuning step.

What is the role of the discriminator during pre-training? What is the relationship between the general-purpose representation obtained from the discriminator in the fine-tuning phase and the representation obtained from the MLM output layer? And how does it relate to the [CLS] representation used when fine-tuning for code search?

  3. What is the network structure of the discriminator? What is its relationship to the multi-layer Transformer used in MLM pre-training? Is the output of one network the input of the other?

@skye95git
Author

What is the difference between the pre-training data and the fine-tuning data?
In other words, these are the statistics of the dataset used for training CodeBERT in the paper:
[screenshot: dataset statistics table from the CodeBERT paper]

These are the commands in the README to download the preprocessed training and validation datasets:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm  codesearch_data.zip
cd ../../codesearch
python process_data.py
cd ..

What is the difference between them? Do pre-training and fine-tuning use the same data?

@fengzhangyin
Collaborator

So, are the generators and the discriminator trained separately? First train the two generators, and then fix the two trained generators to train the discriminator?

Yes, you are right.

What is the role of the discriminator during pre-training? What is the relationship between the general-purpose representation obtained from the discriminator in the fine-tuning phase and the representation obtained from the MLM output layer? And how does it relate to the [CLS] representation used when fine-tuning for code search?

What is the network structure of the discriminator? What is its relationship to the multi-layer Transformer used in MLM pre-training? Is the output of one network the input of the other?

We use the multi-layer Transformer as the model architecture of CodeBERT. RTD and MLM are two objectives used for training CodeBERT.
We train three different versions of the model. CodeBERT(RTD) is trained only with the RTD objective. CodeBERT(MLM) is trained only with the MLM objective. CodeBERT(MLM+RTD) is first trained with MLM until convergence, and then trained with RTD.
In downstream tasks, different models are used in the same way.
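Since all three variants share the RoBERTa-base architecture, they are loaded and fine-tuned in the same way; only the checkpoint changes. As an illustration with the Hugging Face transformers library (the two hub checkpoint names below are, as far as I know, the released MLM+RTD and MLM-only models; the local RTD path is hypothetical):

from transformers import RobertaModel, RobertaTokenizer

# Same loading code for every variant; only the checkpoint name differs.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")          # MLM+RTD checkpoint
# model = RobertaModel.from_pretrained("microsoft/codebert-base-mlm")    # MLM-only checkpoint
# model = RobertaModel.from_pretrained("./my_rtd_checkpoint")            # hypothetical local RTD model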

@fengzhangyin
Collaborator

Are pre-training and fine-tuning using the same data?

We only use the training data of the fine-tuning stage for pre-training.

@skye95git
Author

skye95git commented Dec 16, 2021

We train three different versions of the model. CodeBERT(RTD) is trained only with the RTD objective. CodeBERT(MLM) is trained only with the MLM objective. CodeBERT(MLM+RTD) is first trained with MLM until convergence, and then trained with RTD.

Thanks for your quick reply!

  1. So the network structure of the discriminator is also a multi-layer Transformer, right?

  2. If I want to pre-train CodeBERT (RTD), should I just put the two trained generators (n-gram language models) in front of RoBERTa-base?

  3. So the red line is the CodeBERT (RTD) input, and the green line is the CodeBERT structure, right?

[annotated screenshot of the RTD figure, with the red and green markings referred to above]

@fengzhangyin
Collaborator

Yes, you are right.
It is worth noting that the final classification layer is a binary classification layer.
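In code, the shape being described is roughly a RoBERTa-style encoder with a per-token binary head on top. A minimal sketch assuming the Hugging Face transformers API (not the released implementation):

import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

# Sketch of the RTD discriminator discussed above: the same multi-layer
# Transformer as RoBERTa-base, plus a final binary (original vs. replaced)
# classification layer applied at every token position.
class RTDDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        config = RobertaConfig.from_pretrained("roberta-base")
        self.encoder = RobertaModel(config)                  # INIT=R would load pre-trained RoBERTa weights instead
        self.classifier = nn.Linear(config.hidden_size, 1)   # binary: original vs. replaced

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden).squeeze(-1)           # one logit per token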

@skye95git
Author

skye95git commented Dec 21, 2021

Yes, you are right. It is worth noting that the final classification layer is a binary classification layer.

Hi, I have pre-trained RoBERTa from scratch with code only (PT W/ CODE ONLY (INIT=R)). When I fine-tune it using the script in `Siamese-model/README.md`:
lang=python
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1| tee saved_models/$lang/train.log

[screenshot of the fine-tuning training log]

  1. The model has a hard time converging. Why?

  2. As described in the CodeBERT paper:

Both training and validation datasets are created in a way that positive and negative samples are balanced. Negative samples consist of balanced number of instances with randomly replaced NL and PL.

Is CodeBERT's pre-training data all NL-PL bimodal data? Do you also use unimodal data? (A sketch of the balanced construction quoted above follows this comment.)

  3. GraphCodeBERT downloads the data directly from CodeSearchNet: [screenshot of the GraphCodeBERT data download commands]. It seems that GraphCodeBERT doesn't use balanced positive and negative samples. Does GraphCodeBERT use both NL-PL bimodal data and unimodal data during pre-training?
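For reference, here is a minimal sketch of the balanced positive/negative construction quoted from the paper above. It is my own illustration, not necessarily what the repository's process_data.py actually does, and the function name is made up:

import random

# Hypothetical sketch of the balanced train/valid construction quoted from the
# paper: every positive (NL, PL) pair is kept with label 1, and an equal number
# of negatives is built by randomly replacing either the NL or the PL side.
def build_balanced_examples(pairs, seed=123456):
    rng = random.Random(seed)
    examples = [(nl, pl, 1) for nl, pl in pairs]      # positives
    for nl, pl in pairs:
        other_nl, other_pl = rng.choice(pairs)        # random partner (a real script would skip the pair itself)
        if rng.random() < 0.5:
            examples.append((other_nl, pl, 0))        # negative with replaced NL
        else:
            examples.append((nl, other_pl, 0))        # negative with replaced PL
    rng.shuffle(examples)
    return examples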

@guoday
Contributor

guoday commented Dec 21, 2021

  1. I don't know the reason. Maybe you need to check your pre-trained model. You can try CodeBERT first: if CodeBERT works, your pre-trained model probably has some problems. You can also try adjusting some hyper-parameters such as the learning rate. If that doesn't help, you may need to check whether your pre-training ran correctly.
  2. @fengzhangyin please answer this question.
  3. Please read the GraphCodeBERT paper carefully. GraphCodeBERT uses a different setting from CodeBERT for the code search task. As described in the GraphCodeBERT paper, GraphCodeBERT is pre-trained on 2.3M NL-PL pairs.

@fengzhangyin
Collaborator

Is Codebert's pre-training data all NL-PL bimodal data? Do you use unimodal data?

We pre-train CodeBERT with both bimodal data and unimodal data. We only use NL-PL bimodal data to fine-tune and evaluate the model for code search.

@skye95git
Author

I have completed the pre-training of PT W/ CODE ONLY (INIT=S) and CODEBERT (MLM, INIT=S) using the same model architecture as RoBERTa. My code search results differ by about 3 percentage points from those in the paper:

[screenshot of the reproduced code search results]

Are there any other tricks during pre-training?

guody5 closed this as completed on Apr 4, 2022