Add CodeParrot 🦜 codebase #14536
Conversation
Thanks a lot for adding this to the examples! Overall it looks good, but I think it's lacking some details that could aid with reproducibility:
- For each step (preprocessing, training, evaluation, etc.), I think it would be useful to have a CLI with some default values. Currently, there are some hard-coded variables / configs that require the end user to read the source code to understand what's going on.
- Add docstrings to all the classes and functions.
A nice template to look at is the DistilBERT project.
PS I'm not sure what the conventions are for new research projects, so feel free to ignore these comments if the current PR is OK for @LysandreJik
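To make the CLI suggestion concrete, a minimal `argparse` setup with defaults could look like the sketch below. All flag names and default values here are illustrative only, not the ones the scripts actually use:

```python
import argparse

def parse_args(argv=None):
    """Build a small CLI for one pipeline step, with sensible defaults.

    Flag names and defaults are hypothetical, for illustration only.
    """
    parser = argparse.ArgumentParser(description="Preprocess the CodeParrot dataset.")
    parser.add_argument("--dataset_name", default="transformersbook/codeparrot",
                        help="Name of the dataset to preprocess.")
    parser.add_argument("--output_dir", default="codeparrot-clean",
                        help="Directory for the deduplicated output.")
    parser.add_argument("--line_max", type=int, default=1000,
                        help="Drop files whose longest line exceeds this length.")
    return parser.parse_args(argv)

# With no arguments the defaults apply; users can override any flag.
args = parse_args([])
print(args.dataset_name, args.line_max)
```

Centralizing arguments this way also doubles as lightweight documentation: `--help` lists every configurable value without the user reading the source.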
Two resolved (outdated) review threads on examples/research_projects/codeparrot/scripts/codeparrot_training.py
The reason I went without a CLI is that most scripts are quite lean, and a CLI would decrease readability. E.g. the initialisation script would be significantly longer with a CLI. Given the required compute scale, I also expect this to be run by "advanced" users. This reminded me that the training script has to be executed via the accelerate CLI, so I'll add a remark about this in any case. What do you think @LysandreJik?
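For context, the no-CLI style under discussion looks roughly like this (all values here are hypothetical, not the actual script's settings): every hyperparameter lives in one dict at the top of the script, so the script stays short but the user must edit source to change anything.

```python
# Hypothetical sketch of the hard-coded config style discussed above:
# all hyperparameters live in a plain dict at the top of the training script.
config = {
    "train_batch_size": 2,
    "valid_batch_size": 2,
    "learning_rate": 5e-4,
    "max_train_steps": 50_000,
    "gradient_accumulation_steps": 16,
}

# The script then reads values directly, e.g. config["learning_rate"],
# instead of parsing them from the command line.
print(config["learning_rate"])
```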
This looks great @lvwerra! Thanks for contributing your code, this is cool!
I agree with @lewtun that having the code as a CLI would be better for reusability, and I'd argue it would also be the best for readability: all arguments are centralized, making it easy to see what can be configured and what cannot.
I think people are also used to CLIs and it doesn't deter them from reading the code, so I personally see no drawbacks.
I see there is an image file in your PR. Could you please host it in a dataset on the Hub, either on your profile or in the hf-internal-testing organization?
The issue with adding image files to the repository is that they'll weigh it down forever as they'll be part of its history.
Resolved (outdated) review thread on examples/research_projects/codeparrot/scripts/codeparrot_training.py
This looks great to me! Thanks for working on that, @lvwerra!
```diff
@@ -1,6 +1,6 @@
 # CodeParrot 🦜
 <p align="center">
-    <img src="images/code-highlighting-streamlit.png" alt="drawing" width="350"/>
+    <img src="https://huggingface.co/datasets/lvwerra/repo-images/raw/main/code-highlighting-streamlit.png" alt="drawing" width="350"/>
```
Thank you!
```python
args = parser.parse_args()

# Base tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer)
```
Should this really be `AutoTokenizer` if you specify that it's BPE-only?
(Not a big problem and I understand if you want to keep it that way)
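A sketch of the suggested change (assuming `transformers` is installed; actually loading the checkpoint requires network access): pin the concrete GPT-2 BPE tokenizer class instead of dispatching through `AutoTokenizer`. The helper name and default are hypothetical.

```python
# Sketch: since the pipeline assumes a BPE tokenizer, load GPT-2's BPE
# tokenizer class directly rather than dispatching through AutoTokenizer.
from transformers import GPT2TokenizerFast

def load_base_tokenizer(name: str = "gpt2"):
    """Load the BPE base tokenizer (defaults to the GPT-2 vocabulary)."""
    return GPT2TokenizerFast.from_pretrained(name)
```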
Thanks for making the changes! I left a few minor nits, but otherwise LGTM!
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* add readme skeleton
* update readme
* add initialization script
* add deduplication script
* add codeparrot training script
* add code generation evaluation
* add validation loss script
* add requirements
* update readme
* tweak readme
* make style
* add highlights to readme
* add CLIs to scripts
* add tokenizer training script
* add docstring to constant length dataset
* fix defaults in arguments
* update readme with cli
* move image to hub
* tweaks of readme
* fix cli commands
* add author
* explain env variables
* fix formatting
* Update examples/research_projects/codeparrot/README.md (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* Apply suggestions from code review (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* replace generic with gpt2 tokenizer
Add CodeParrot 🦜 codebase
This PR adds the CodeParrot 🦜 codebase to `examples/research_projects/`. The folder `scripts/` includes files for each step of the pipeline (preprocessing, tokenizer training, model initialization, training, and evaluation). In addition, the README gives an overview and highlights the results, and the requirements file pins the dependencies.
cc @LysandreJik @thomwolf