
code-code CodeCompletion input #9

Closed
carter54 opened this issue Oct 23, 2020 · 8 comments

Comments

@carter54

Hello! Thanks for this project.
Is it possible to use the initial raw code as input when testing the completion model?
Currently, the input format is not suitable for all models, and it is difficult to restore the raw code from the input format.

@celbree
Member

celbree commented Oct 24, 2020

Thanks for your suggestion.
As in most related work on code completion, many researchers focus on token-level code completion and tokenize raw code into token sequences. We keep this setting in CodeXGLUE. To ensure consistency, we also use this input format for line-level code completion.
If a model works on the initial raw code, it is expected to perform well on the token format after fine-tuning. In fact, our CodeGPT is pre-trained on raw code and fine-tuned on the token format.
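For concreteness, here is a minimal sketch of what turning raw code into a token-level sequence can look like, built on Python's standard tokenize module. This is an illustration, not CodeXGLUE's actual preprocessing script, and the marker names <EOL>, <INDENT>, <DEDENT> are assumptions made for the example.

```python
# Hedged sketch (not CodeXGLUE's actual preprocessing): turn raw Python
# source into a flat token sequence, replacing layout with marker tokens.
# The marker names <EOL>, <INDENT>, <DEDENT> are assumptions here.
import io
import tokenize

def to_token_sequence(source: str) -> list:
    """Tokenize raw code, replacing layout with explicit marker tokens."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.NEWLINE, tokenize.NL):
            tokens.append("<EOL>")        # line break becomes a marker token
        elif tok.type == tokenize.INDENT:
            tokens.append("<INDENT>")     # indentation level increases
        elif tok.type == tokenize.DEDENT:
            tokens.append("<DEDENT>")     # indentation level decreases
        elif tok.type == tokenize.ENDMARKER:
            continue                      # drop the end-of-input sentinel
        else:
            tokens.append(tok.string)
    return tokens

raw = "if a > b:\n    print(a)\n"
print(to_token_sequence(raw))
# → ['if', 'a', '>', 'b', ':', '<EOL>', '<INDENT>', 'print', '(', 'a', ')', '<EOL>', '<DEDENT>']
```

Note that once the code is flattened this way, the model only ever sees the markers, which is why restoring the original layout afterwards is nontrivial.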

@carter54
Author

I see... Thanks for the reply. I can understand that such a token-level format is easier for evaluation.
But the performance of the fine-tuned model != the pre-trained model, right?
If I want to use the fine-tuned model in a production environment, I have to preprocess the input code before sending it to the model, and postprocess the output before presenting it in the IDE...
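The postprocessing step described above (mapping a token-level prediction back to displayable source) could be sketched roughly like this. The marker names are the same assumptions as before, and the spacing restoration is deliberately naive: a real detokenizer would need language-aware rules for spacing around operators and punctuation.

```python
# Hedged sketch of the postprocessing carter54 describes: rebuild source
# text from a token-level prediction. <EOL>/<INDENT>/<DEDENT> are assumed
# marker names; " ".join gives only approximate spacing.
def detokenize(tokens, indent_unit="    "):
    depth = 0            # current indentation level
    lines, current = [], []
    for tok in tokens:
        if tok == "<EOL>":
            lines.append(indent_unit * depth + " ".join(current))
            current = []
        elif tok == "<INDENT>":
            depth += 1
        elif tok == "<DEDENT>":
            depth = max(0, depth - 1)
        else:
            current.append(tok)
    if current:          # flush a trailing partial line
        lines.append(indent_unit * depth + " ".join(current))
    return "\n".join(lines)

pred = ["if", "a", ">", "b", ":", "<EOL>", "<INDENT>", "print", "(", "a", ")", "<EOL>"]
print(detokenize(pred))
# → if a > b :
#        print ( a )
```

Even this toy version shows why the round trip is lossy: the original spacing and blank lines are gone for good once the marker representation is the only thing the model sees.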

@carter54
Author

Another question: is CodeGPT pre-trained and fine-tuned on the same code corpus, but in different formats (raw code for pre-training, token format for fine-tuning)?
Thanks in advance.

@celbree
Member

celbree commented Oct 27, 2020

I'll answer your second question first.

Another question: is CodeGPT pre-trained and fine-tuned on the same code corpus, but in different formats (raw code for pre-training, token format for fine-tuning)?

The pre-training and fine-tuning datasets are different. CodeGPT is pre-trained on the Python and Java corpora from CodeSearchNet in raw code format, and fine-tuned on PY150 and the GitHub Java Corpus in token format. We'll add more details about CodeGPT to our repo.

For the first question.

But the performance of the fine-tuned model != the pre-trained model, right?
If I want to use the fine-tuned model in a production environment, I have to preprocess the input code before sending it to the model, and postprocess the output before presenting it in the IDE...

Yes. The fine-tuned model is more likely to predict code sequences in the token-level format. If you would like to use it in a real code completion scenario, you could use our released pre-trained CodeGPT model, which is suitable for predicting raw code. Alternatively, you can try fine-tuning CodeGPT on more data (e.g., you could download PY150 and the GitHub Java Corpus to obtain the raw code of each file without running our preprocessing script). According to the time-cost analysis, it won't take much time or computing resources.

@carter54
Author

carter54 commented Oct 28, 2020

@celbree Thanks for these details!
I have tried the pre-trained CodeGPT model and found that all indentation and line breaks are missing. Did I make a mistake when deploying the model?
Or did you preprocess the raw code by removing indentation and line breaks before pre-training? If so, what was the reason for this preprocessing?

@celbree
Member

celbree commented Oct 29, 2020

Yes, indeed. We preprocess the data to remove all indentation and line breaks. CodeGPT is a pre-trained model for CodeXGLUE participants to experiment with. It is trained on a relatively small corpus and can be viewed as a baseline for the CodeXGLUE benchmark.

@carter54
Author

@celbree Thanks!
In my view, it would be better to keep indentation and line breaks in the input, since indentation is important to the semantics of code.
A simple example:

a = 1
b = 2
if a > b:
    print("I don't agree")
    print('Neither do I')

and

a = 1
b = 2
if a > b:
    print("I don't agree")
print('Neither do I')

are different.
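The difference is easy to verify by running both snippets: with a = 1 and b = 2 the condition is false, so the indented version prints nothing, while the de-indented version runs the last print unconditionally. A small check (the run helper and the embedded snippet strings are framing added for this example):

```python
# Demonstrate that the two snippets above behave differently: the
# de-indented print executes unconditionally in the second version.
import contextlib
import io

snippet_indented = (
    "a = 1\n"
    "b = 2\n"
    "if a > b:\n"
    "    print(\"I don't agree\")\n"
    "    print('Neither do I')\n"
)
snippet_dedented = (
    "a = 1\n"
    "b = 2\n"
    "if a > b:\n"
    "    print(\"I don't agree\")\n"
    "print('Neither do I')\n"
)

def run(src: str) -> str:
    """Execute a code string and capture whatever it prints."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})
    return buf.getvalue()

print(repr(run(snippet_indented)))  # → '' (a > b is False, both prints guarded)
print(repr(run(snippet_dedented)))  # → 'Neither do I\n' (last print runs unconditionally)
```

So a model whose input format discards indentation cannot distinguish these two programs.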

@celbree
Member

celbree commented Oct 29, 2020

Totally agree. Thanks for your suggestion.

@celbree celbree closed this as completed Oct 30, 2020