Results on natural language code retrieval #91

Closed
skye95git opened this issue Dec 10, 2021 · 15 comments

@skye95git
[screenshot: code search results table from the CodeBERT paper]

Why is the MA-AVG of CODEBERT (MLM, INIT=R) about 3% higher than that of PT W/ CODE ONLY (INIT=R)?

Is it because their network structure is different?

But as described in the paper, they use the same network architecture and the same objective function (MLM):

We develop CodeBERT by using exactly the same model architecture as RoBERTa-base. The total number of model parameters is 125M.

Is it because they use different pre-training data?
As described in the paper:

In the MLM objective, only bimodal data (i.e. datapoints of NL-PL pairs) is used for training.
RoBERTa which is continuously trained with masked language modeling on codes only.

@guoday
Contributor

guoday commented Dec 11, 2021

PT W/ CODE ONLY (INIT=R) only uses code to pre-train, but CODEBERT (MLM, INIT=R) uses both code and NL to pre-train the model.

@skye95git
Author

PT W/ CODE ONLY (INIT=R) only uses code to pre-train, but CODEBERT (MLM, INIT=R) uses both code and NL to pre-train the model.

So it's because of the different data used for pre-training, right?
CodeBERT still uses exactly the same model architecture as RoBERTa, right?

@guoday
Contributor

guoday commented Dec 13, 2021

Both of them use the same model architecture as RoBERTa and the same pre-training data, but PT W/ CODE ONLY (INIT=R) only uses the code (i.e., the natural language comments are removed) for pre-training.
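To make the data difference concrete, here is a minimal sketch (not the authors' preprocessing code; the function names are made up) of how the MLM pre-training inputs differ between the two settings, using the [CLS]/[SEP]/[EOS] segment layout described in the paper:

# Hypothetical sketch: MLM pre-training inputs for the two settings.
# CODEBERT (MLM, INIT=R) sees the NL docstring together with the code,
# while PT W/ CODE ONLY (INIT=R) drops the natural language segment.

def bimodal_input(nl_tokens, code_tokens):
    # [CLS] w1 ... wn [SEP] c1 ... cm [EOS], as described in the CodeBERT paper
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + code_tokens + ["[EOS]"]

def code_only_input(code_tokens):
    # same architecture and MLM objective, but only the code segment is kept
    return ["[CLS]"] + code_tokens + ["[EOS]"]

Random positions in either layout are then masked and predicted with the usual MLM objective.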

@skye95git
Author

Both of them use the same model architecture as RoBERTa and the same pre-training data, but PT W/ CODE ONLY (INIT=R) only uses the code (i.e., the natural language comments are removed) for pre-training.

Thanks for your reply! I have another question. As described in the paper:

  • The second objective is replaced token detection (RTD), which further uses a large amount of unimodal data, such as codes without paired natural language texts.
  • there are two data generators here, an NL generator $p^{G_{w}}$ and a PL generator $p^{G_{c}}$. The PL training data is the unimodal codes as shown in Table 1, and the NL training data comes from the documentations from bimodal data.

I wonder which dataset RTD is trained on.
Doesn't the paper say that it is trained on unimodal data? Why are both NL and PL used here?
Are the NL and PL generators trained separately with their corresponding unimodal data?
If so, what is the input to RTD? How is it trained on two unimodal corpora?
Both the diagram and the formula seem to indicate that NL and PL are input simultaneously, which looks like NL-PL pairs.

[screenshots: the RTD diagram and objective formula from the paper]

@fengzhangyin
Collaborator

We first learn two generators separately with corresponding unimodal data to generate plausible alternatives for the set of randomly masked positions. Specifically, we implement two n-gram language models with bidirectional contexts.
Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

The NL-Code discriminator is the targeted pre-trained model; both the NL and code generators are thrown out in the fine-tuning step.
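As a rough illustration of this two-stage procedure (my own sketch, not the released pre-training code; `nl_gen`, `code_gen`, and their `.sample()` interface are hypothetical stand-ins for the frozen n-gram generators), one RTD step on a bimodal NL-PL example could look like:

import torch
import torch.nn.functional as F

# Hypothetical sketch of one RTD training step for the NL-Code discriminator.
# nl_gen / code_gen stand for the two n-gram language models trained beforehand
# on unimodal NL and code data; here they are frozen and only used for sampling.
def rtd_step(discriminator, nl_gen, code_gen, nl_ids, code_ids, masked_positions):
    tokens = list(nl_ids) + list(code_ids)          # bimodal NL-PL input
    labels = torch.zeros(len(tokens))               # 0 = original token
    for pos in masked_positions:
        gen = nl_gen if pos < len(nl_ids) else code_gen
        tokens[pos] = gen.sample(tokens, pos)       # plausible alternative token
        labels[pos] = 1.0                           # 1 = replaced token
    logits = discriminator(torch.tensor(tokens).unsqueeze(0)).squeeze(0)  # one logit per position
    # per-position binary classification: original vs. replaced
    return F.binary_cross_entropy_with_logits(logits, labels)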

@skye95git
Author

skye95git commented Dec 16, 2021

We first learn two generators separately with corresponding unimodal data to generate plausible alternatives for the set of randomly masked positions. Specifically, we implement two n-gram language models with bidirectional contexts. Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

The NL-Code discriminator is the targeted pre-trained model; both the NL and code generators are thrown out in the fine-tuning step.

Thanks for your reply! I understand now that the discriminator is the targeted pre-trained model and that both the NL and code generators are thrown out in the fine-tuning step. But I still don't understand the RTD pre-training process.

  1. We first learn two generators separately with corresponding unimodal data. Then we train the NL-PL discriminator with NL-PL bimodal data to determine whether a word is the original one or not.

So, are the generators and the discriminator trained separately? First train the two generators, and then fix the two trained generators to train the discriminator?

  2. NL-Code discriminator is used for producing general-purpose representations in the fine-tuning step.

What is the role of the discriminator during pre-training? What is the relationship between the general-purpose representation obtained from the discriminator in the fine-tuning phase and the representation obtained from the MLM output layer? And how does it relate to the [CLS] representation used when fine-tuning for code search?

  3. What is the network structure of the discriminator? What is its relationship to the multi-layer Transformer used in MLM pre-training? Is the output of one network the input of the other?

@skye95git
Author

What is the difference between the pre-training data and the fine-tuning data?
In other words, these are the statistics of the dataset used for training CodeBERT in the paper:
[screenshot: dataset statistics table from the CodeBERT paper]

These are the commands in the README to download the preprocessed training and validation datasets:

mkdir data data/codesearch
cd data/codesearch
gdown https://drive.google.com/uc?id=1xgSR34XO8xXZg4cZScDYj2eGerBE9iGo  
unzip codesearch_data.zip
rm  codesearch_data.zip
cd ../../codesearch
python process_data.py
cd ..

What is the difference between them? Do pre-training and fine-tuning use the same data?

@fengzhangyin
Collaborator

So, are the generators and the discriminator trained separately? First train the two generators, and then fix the two trained generators to train the discriminator?

Yes, you are right.

What is the role of the discriminator during pre-training? What is the relationship between the general-purpose representation obtained from the discriminator in the fine-tuning phase and the representation obtained from the MLM output layer? And how does it relate to the [CLS] representation used when fine-tuning for code search?

What is the network structure of the discriminator? What is its relationship to the multi-layer Transformer used in MLM pre-training? Is the output of one network the input of the other?

We use the multi-layer Transformer as the model architecture of CodeBERT. RTD and MLM are two objectives used for training CodeBERT.
We train three different versions of the model. CodeBERT(RTD) is trained only with the RTD objective. CodeBERT(MLM) is trained only with the MLM objective. CodeBERT(MLM+RTD) is first trained with MLM until convergence, and then trained with RTD.
In downstream tasks, different models are used in the same way.
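Since all three variants share the RoBERTa-base architecture, they are loaded and fine-tuned in the same way; only the checkpoint changes. As an illustration with the Hugging Face transformers library (the two hub checkpoint names below are, as far as I know, the released MLM+RTD and MLM-only models; the local RTD path is hypothetical):

from transformers import RobertaModel, RobertaTokenizer

# Same loading code for every variant; only the checkpoint name differs.
tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")          # MLM+RTD checkpoint
# model = RobertaModel.from_pretrained("microsoft/codebert-base-mlm")    # MLM-only checkpoint
# model = RobertaModel.from_pretrained("./my_rtd_checkpoint")            # hypothetical local RTD model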

@fengzhangyin
Collaborator

Are pre-training and fine-tuning using the same data?

We only use the training data of the fine-tuning stage for pre-training.

@skye95git
Author

skye95git commented Dec 16, 2021

We train three different versions of the model. CodeBERT(RTD) is trained only with the RTD objective. CodeBERT(MLM) is trained only with the MLM objective. CodeBERT(MLM+RTD) is first trained with MLM until convergence, and then trained with RTD.

Thanks for your quick reply!

  1. So the network structure of the discriminator is also a multi-layer Transformer, right?

  2. If I want to pre-train CodeBERT (RTD), should I just put the two trained generators (n-gram language models) in front of RoBERTa-base?

  3. So the red line is the CodeBERT (RTD) input, and the green line is the CodeBERT structure, right?

[annotated screenshot of the RTD figure, with the red and green markings referred to above]

@fengzhangyin
Collaborator

Yes, you are right.
It is worth noting that the final classification layer is a binary classification layer.
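In code, the shape being described is roughly a RoBERTa-style encoder with a per-token binary head on top. A minimal sketch assuming the Hugging Face transformers API (not the released implementation):

import torch.nn as nn
from transformers import RobertaConfig, RobertaModel

# Sketch of the RTD discriminator discussed above: the same multi-layer
# Transformer as RoBERTa-base, plus a final binary (original vs. replaced)
# classification layer applied at every token position.
class RTDDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        config = RobertaConfig.from_pretrained("roberta-base")
        self.encoder = RobertaModel(config)                  # INIT=R would load pre-trained RoBERTa weights instead
        self.classifier = nn.Linear(config.hidden_size, 1)   # binary: original vs. replaced

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden).squeeze(-1)           # one logit per token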

@skye95git
Author

skye95git commented Dec 21, 2021

Yes, you are right. It is worth noting that the final classification layer is a binary classification layer.

Hi, I have pre-trained RoBERTa from scratch with code only (PT W/ CODE ONLY (INIT=R)). When I fine-tune it using the script in `Siamese-model/README.md`:
lang=python
mkdir -p ./saved_models/$lang
python run.py \
    --output_dir=./saved_models/$lang \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --train_data_file=dataset/$lang/train.jsonl \
    --eval_data_file=dataset/$lang/valid.jsonl \
    --test_data_file=dataset/$lang/test.jsonl \
    --codebase_file=dataset/$lang/codebase.jsonl \
    --num_train_epochs 10 \
    --code_length 256 \
    --nl_length 128 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --seed 123456 2>&1| tee saved_models/$lang/train.log

[screenshot of the fine-tuning training log]

  1. The model has a hard time converging. Why?

  2. As described in the CodeBERT paper:

Both training and validation datasets are created in a way that positive and negative samples are balanced. Negative samples consist of balanced number of instances with randomly replaced NL and PL.

Is CodeBERT's pre-training data all NL-PL bimodal data? Do you also use unimodal data? (A sketch of the balanced construction quoted above follows this comment.)

  3. GraphCodeBERT downloads the data directly from CodeSearchNet: [screenshot of the GraphCodeBERT data download commands]. It seems that GraphCodeBERT doesn't use balanced positive and negative samples. Does GraphCodeBERT use both NL-PL bimodal data and unimodal data during pre-training?
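For reference, here is a minimal sketch of the balanced positive/negative construction quoted from the paper above. It is my own illustration, not necessarily what the repository's process_data.py actually does, and the function name is made up:

import random

# Hypothetical sketch of the balanced train/valid construction quoted from the
# paper: every positive (NL, PL) pair is kept with label 1, and an equal number
# of negatives is built by randomly replacing either the NL or the PL side.
def build_balanced_examples(pairs, seed=123456):
    rng = random.Random(seed)
    examples = [(nl, pl, 1) for nl, pl in pairs]      # positives
    for nl, pl in pairs:
        other_nl, other_pl = rng.choice(pairs)        # random partner (a real script would skip the pair itself)
        if rng.random() < 0.5:
            examples.append((other_nl, pl, 0))        # negative with replaced NL
        else:
            examples.append((nl, other_pl, 0))        # negative with replaced PL
    rng.shuffle(examples)
    return examples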

@guoday
Contributor

guoday commented Dec 21, 2021

  1. I don't know the reason. Maybe you need to check your pre-trained model. You can try CodeBERT first: if CodeBERT works, your pre-trained model probably has some problems. You can also try adjusting some hyper-parameters such as the learning rate. If that doesn't help, you may need to check whether your pre-training ran correctly.
  2. @fengzhangyin please answer this question.
  3. Please read the GraphCodeBERT paper carefully. GraphCodeBERT uses a different setting from CodeBERT for the code search task. As described in the GraphCodeBERT paper, GraphCodeBERT is pre-trained on 2.3M NL-PL pairs.

@fengzhangyin
Collaborator

Is Codebert's pre-training data all NL-PL bimodal data? Do you use unimodal data?

We pre-train CodeBERT with both bimodal data and unimodal data. We only use NL-PL bimodal data to fine-tune and evaluate the model for code search.

@skye95git
Author

I have completed the pre-training of PT W/ CODE ONLY (INIT=S) and CODEBERT (MLM, INIT=S) using the same model architecture as RoBERTa. My code search results differ by about 3 percentage points from those in the paper:

[screenshot of the reproduced code search results]

Are there any other tricks during pre-training?

guody5 closed this as completed on Apr 4, 2022