Dear authors,

I had a question regarding the pre-training of CodeBERT: how exactly is the pre-training data structured?
In Section 3.2 of the paper, the pre-training input is described as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. However, in Section 3.3 and in the CodeSearchNet data, the natural language appears after the function definition. Which of these layouts was used to pre-train CodeBERT?
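For concreteness, this is how I currently build the Section 3.2 layout; a minimal sketch assuming the Hugging Face RobertaTokenizer with the microsoft/codebert-base checkpoint (the helper name is mine, not from the paper):

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

def build_bimodal_input(nl: str, code: str):
    # Section 3.2 layout: [CLS] + NL tokens + [SEP] + PL tokens + [EOS].
    # For RoBERTa-style tokenizers, <s> serves as [CLS] and </s> as [SEP]/[EOS].
    nl_tokens = tokenizer.tokenize(nl)
    pl_tokens = tokenizer.tokenize(code)
    tokens = ([tokenizer.cls_token] + nl_tokens + [tokenizer.sep_token]
              + pl_tokens + [tokenizer.eos_token])
    return tokenizer.convert_tokens_to_ids(tokens)

ids = build_bimodal_input("Return the maximum of two numbers.",
                          "def max2(a, b): return a if a > b else b")
```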
Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?
Thanks in advance,
Ali
You can use the script to extract the pre-training data. In the CodeSearchNet data, there is a field called "docstring" for the NL and a field called "function_tokens" for the code without comments. data.zip
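As a rough sketch of that extraction, assuming the CodeSearchNet jsonl.gz format and the two fields named above (the file path is illustrative):

```python
import gzip
import json

# Illustrative path; CodeSearchNet ships one jsonl.gz file per split chunk.
path = "python_train_0.jsonl.gz"

pairs = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        nl = record["docstring"]        # natural-language description
        pl = record["function_tokens"]  # code tokens with comments stripped
        pairs.append((nl, pl))
```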