-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CodeParrot 🦜 codebase #14536
Add CodeParrot 🦜 codebase #14536
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for adding this to the examples! Overall it looks good, but I think it's lacking some details that could aid with reproducibility:
- For each step (preprocessing, training, evaluation etc), I think it would be useful to have a CLI with some default values. Currently, there are some hard-coded variables / configs that require the end-user to read the source code to understand what's going on.
- Add docstrings to all the classes and functions.
A nice template to look at is the DistilBERT project.
PS I'm not sure what the conventions are for new research projects, so feel free to ignore these comments if the current PR is OK for @LysandreJik
examples/research_projects/codeparrot/scripts/codeparrot_training.py
Outdated
Show resolved
Hide resolved
examples/research_projects/codeparrot/scripts/codeparrot_training.py
Outdated
Show resolved
Hide resolved
The reason I went without a CLI is that most scripts are quite lean and a CLI would decrease readability. E.g. for initialisation would significantly longer with a CLI. Due to the required compute scale I also expect this to be run by "advanced" users. This reminded me that the training script has to be executed from the accelerate CLI so I'll add a remark about this in any case. What do you think @LysandreJik? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @lvwerra! Thanks for contributing your code, this is cool!
I agree with @lewtun that having the code as a CLI would be better for reusability, and I'd argue it would also be the best for readability: all arguments are centralized, making it easy to see what can be configured and what cannot.
I think people are also used to CLIs and it doesn't deter them from reading the code, so I personally see no drawbacks.
I see there is an image file in your PR. Could you please host it in a dataset on the hub, either on your profile or on the hf-internal-testing
organization?
The issue with adding image files to the repository is that they'll weigh it down forever as they'll be part of its history.
examples/research_projects/codeparrot/scripts/codeparrot_training.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me! Thanks for working on that, @lvwerra!
@@ -1,6 +1,6 @@ | |||
# CodeParrot 🦜 | |||
<p align="center"> | |||
<img src="images/code-highlighting-streamlit.png" alt="drawing" width="350"/> | |||
<img src="https://huggingface.co/datasets/lvwerra/repo-images/raw/main/code-highlighting-streamlit.png" alt="drawing" width="350"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
args = parser.parse_args() | ||
|
||
# Base tokenizer | ||
tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this really be AutoTokenizer
if you specify that it's BPE-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not a big problem and I understand if you want to keep it that way)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes! I left a few minor nits, but otherwise LGTM!
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Add CodeParrot 🦜 codebase
This PR adds the CodeParrot 🦜 codebase to
examples/research_projects/
. The folderscripts/
includes files for the following steps:In addition the README gives an overview and highlights the results. The requirements file fixes the dependencies.
cc @LysandreJik @thomwolf