
CodeBERT pre-training data #203

Closed

aalkaswan opened this issue Dec 23, 2022 · 1 comment

Comments

@aalkaswan

Dear authors,

I had a question regarding the pre-training of CodeBERT. How is the pre-training data structured exactly?

In section 3.2 of the paper, it is stated that the pre-training data is structured as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. In section 3.3 and in the CodeSearchNet data, the natural language is inserted after the function definition. Which of these was used to pre-train CodeBERT?
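For concreteness, here is a minimal sketch of the section 3.2 layout as I understand it (the NL and PL segment contents below are made up for illustration, and the special-token strings are placeholders rather than the actual vocabulary entries):

```python
# Illustrative only: one bimodal sample laid out per section 3.2,
# i.e. NL segment first, then PL segment.
nl_tokens = ["return", "the", "maximum", "of", "two", "numbers"]      # docstring
pl_tokens = ["def", "max2", "(", "a", ",", "b", ")", ":",
             "return", "a", "if", "a", ">", "b", "else", "b"]         # code

sample = ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]
print(sample)
```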

Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,

Ali
@guoday
Contributor

guoday commented Jan 9, 2023

You can use the script to extract the pre-training data. In the CodeSearchNet data, there is a field called "docstring" for the NL and a field called "function_tokens" for the code without comments.
data.zip
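As a rough sketch of that extraction (not the attached script itself; it assumes the "docstring" and "function_tokens" field names mentioned above, which may differ between CodeSearchNet releases, and a gzipped JSON-lines file layout):

```python
import gzip
import json

def load_pairs(path):
    """Yield (NL tokens, PL tokens) pairs from a CodeSearchNet .jsonl.gz file.

    Assumes each JSON line has a "docstring" field (natural language) and a
    "function_tokens" field (code tokens without comments).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            nl_tokens = record["docstring"].split()
            pl_tokens = record["function_tokens"]
            yield nl_tokens, pl_tokens

def make_sample(nl_tokens, pl_tokens):
    # Lay the pair out as [CLS] + NL + [SEP] + PL + [EOS] (section 3.2 format).
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]

# Example usage (the file name is hypothetical):
# for nl, pl in load_pairs("python_train_0.jsonl.gz"):
#     print(make_sample(nl, pl)[:20])
```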

@celbree closed this as completed Feb 20, 2023