
CodeBERT pre-training data #203

Closed

aalkaswan opened this issue Dec 23, 2022 · 1 comment

Comments

@aalkaswan

Dear authors,

I had a question regarding the pre-training of CodeBERT. How is the pre-training data structured exactly?

In section 3.2 of the paper, it is stated that the pre-training data is structured as [CLS] + [NL_tokens] + [SEP] + [PL_tokens] + [EOS]. In section 3.3 and in the CodeSearchNet data, the natural language is inserted after the function definition. Which of these was used to pre-train CodeBERT?
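For concreteness, here is a minimal sketch of the section 3.2 layout as I understand it (the NL and PL segment contents below are made up for illustration, and the special-token strings are placeholders rather than the actual vocabulary entries):

```python
# Illustrative only: one bimodal sample laid out per section 3.2,
# i.e. NL segment first, then PL segment.
nl_tokens = ["return", "the", "maximum", "of", "two", "numbers"]      # docstring
pl_tokens = ["def", "max2", "(", "a", ",", "b", ")", ":",
             "return", "a", "if", "a", ">", "b", "else", "b"]         # code

sample = ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]
print(sample)
```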

Would it be possible to share (some of) the pre-training samples with the exact pre-processing applied?

Thanks in advance,

Ali
@guoday
Contributor

guoday commented Jan 9, 2023

You can use the script to extract the pre-training data. In the CodeSearchNet data, there is a field called "docstring" for the NL and a field called "function_tokens" for the code without comments.
data.zip
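As a rough sketch of that extraction (not the attached script itself; it assumes the "docstring" and "function_tokens" field names mentioned above, which may differ between CodeSearchNet releases, and a gzipped JSON-lines file layout):

```python
import gzip
import json

def load_pairs(path):
    """Yield (NL tokens, PL tokens) pairs from a CodeSearchNet .jsonl.gz file.

    Assumes each JSON line has a "docstring" field (natural language) and a
    "function_tokens" field (code tokens without comments).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            nl_tokens = record["docstring"].split()
            pl_tokens = record["function_tokens"]
            yield nl_tokens, pl_tokens

def make_sample(nl_tokens, pl_tokens):
    # Lay the pair out as [CLS] + NL + [SEP] + PL + [EOS] (section 3.2 format).
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]

# Example usage (the file name is hypothetical):
# for nl, pl in load_pairs("python_train_0.jsonl.gz"):
#     print(make_sample(nl, pl)[:20])
```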

@celbree closed this as completed Feb 20, 2023