code-code CodeCompletion input #9
Thanks for your suggestion.
I see, thanks for the reply. I can understand that such a token-level format is easier for evaluation.
Another question: is CodeGPT pre-trained and fine-tuned on the same code corpus but in different formats? (Raw code for pre-training, token format for fine-tuning?)
I'll answer your second question first.
The pre-training and fine-tuning datasets are different. CodeGPT is pre-trained on the Python and Java corpora from CodeSearchNet in raw code format, and fine-tuned on PY150 and the GitHub Java Corpus in token format. We'll add more details about CodeGPT to our repo.
As for the first question: yes, the fine-tuned model is more likely to predict code sequences in the token-level format. If you would like to use it in a real code completion scenario, you could use our released pre-trained CodeGPT model, which is suitable for predicting raw code. Alternatively, you can fine-tune CodeGPT on more data (e.g., you could download PY150 and the GitHub Java Corpus to obtain the raw code of each file without running our preprocessing script). According to our time-cost analysis, this won't take much time or computing resources.
@celbree Thanks for these details~
Yes, indeed. Our preprocessing discards all indentation and line breaks. CodeGPT is a pre-trained model for CodeXGLUE participants to experiment with. It's trained on a relatively small corpus and can be viewed as a baseline for the CodeXGLUE benchmark.
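A minimal sketch of what such preprocessing could look like, using Python's standard `tokenize` module. This is not the repo's actual preprocessing script; the `<EOL>` marker and the exact set of dropped tokens are assumptions for illustration:

```python
import io
import tokenize

def to_token_format(source: str) -> str:
    """Flatten raw Python source into a token-level sequence:
    indentation is dropped and line breaks become <EOL> markers.
    (Illustrative only; the actual CodeXGLUE script may differ.)"""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            tokens.append("<EOL>")  # line break becomes a marker token
        elif tok.type in (tokenize.INDENT, tokenize.DEDENT,
                          tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # indentation and comments are discarded
        else:
            tokens.append(tok.string)
    return " ".join(tokens)

raw = "def add(a, b):\n    return a + b\n"
print(to_token_format(raw))
# def add ( a , b ) : <EOL> return a + b <EOL>
```

Running the fine-tuned model on raw code without a step like this would give it inputs unlike anything it saw during training, which is why the token format matters for evaluation.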
@celbree Thanks~
Totally agree. Thanks for your suggestion.
Hello~ Thanks for this project.
Is it possible to use the initial raw code as input to test the completion model?
Currently, the input format is not suitable for all models, and it is difficult to restore the input format to the raw code.
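To illustrate why restoring the raw code is difficult: once indentation has been discarded during preprocessing, a naive detokenizer cannot recover valid Python. A small sketch (the token line and the restoration logic are illustrative assumptions, not code from this repo):

```python
# Naive attempt to restore raw code from a token-level line:
# split on <EOL> markers and rejoin tokens with spaces.
token_line = "def add ( a , b ) : <EOL> return a + b <EOL>"
restored = "\n".join(
    " ".join(seg.split()) for seg in token_line.split("<EOL>")
).strip() + "\n"
print(restored)

# The function body's indentation was destroyed by preprocessing,
# so the restored text is no longer valid Python.
try:
    compile(restored, "<restored>", "exec")
except IndentationError as e:
    print("IndentationError:", e)
```

Spacing around operators and string literals is similarly lossy, which is why the issue author notes that the token format cannot simply be inverted back to the original source.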