Add CodeParrot 🦜 codebase #14536

Merged
merged 26 commits into from Dec 2, 2021

Conversation

lvwerra
Member

@lvwerra lvwerra commented Nov 26, 2021

Add CodeParrot 🦜 codebase

This PR adds the CodeParrot 🦜 codebase to examples/research_projects/. The folder scripts/ includes files for the following steps:

  • data preprocessing
  • model initialization
  • training with accelerate
  • validation loss
  • evaluate on HumanEval benchmark

In addition, the README gives an overview and highlights the results. The requirements file pins the dependencies.

cc @LysandreJik @thomwolf

Member

@lewtun lewtun left a comment

Thanks a lot for adding this to the examples! Overall it looks good, but I think it's lacking some details that could aid with reproducibility:

  • For each step (preprocessing, training, evaluation etc), I think it would be useful to have a CLI with some default values. Currently, there are some hard-coded variables / configs that require the end-user to read the source code to understand what's going on.
  • Add docstrings to all the classes and functions.
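
As an illustration of the docstring suggestion, here is a minimal sketch of a documented helper that packs tokenized examples into fixed-length chunks. It is not the code from this PR: the function name `constant_length_chunks` and its parameters are hypothetical, and it works on plain lists of token ids rather than a tokenizer and a dataset.

```python
from typing import Iterable, Iterator, List


def constant_length_chunks(
    token_sequences: Iterable[List[int]],
    seq_length: int,
    separator_id: int,
) -> Iterator[List[int]]:
    """Yield fixed-length chunks from a stream of tokenized examples.

    Examples are concatenated, with ``separator_id`` appended after each one,
    and the resulting stream is cut into chunks of exactly ``seq_length`` ids.

    Args:
        token_sequences: Iterable of token-id lists, one list per example.
        seq_length: Number of token ids in every yielded chunk.
        separator_id: Token id appended after each example (hypothetical
            stand-in for an end-of-text token).

    Yields:
        Lists of exactly ``seq_length`` token ids. Leftover ids that do not
        fill a complete chunk at the end of the stream are dropped.

    Raises:
        ValueError: If ``seq_length`` is not positive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")

    buffer: List[int] = []
    for tokens in token_sequences:
        buffer.extend(tokens)
        buffer.append(separator_id)
        # Emit as many full chunks as the buffer currently holds.
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
```

With a docstring like this, a reader can tell from `help(constant_length_chunks)` that trailing tokens are discarded, without reading the loop.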

A nice template to look at is the DistilBERT project.

PS I'm not sure what the conventions are for new research projects, so feel free to ignore these comments if the current PR is OK for @LysandreJik

@lvwerra
Member Author

lvwerra commented Nov 26, 2021

  • For each step (preprocessing, training, evaluation etc), I think it would be useful to have a CLI with some default values. Currently, there are some hard-coded variables / configs that require the end-user to read the source code to understand what's going on.

The reason I went without a CLI is that most scripts are quite lean and a CLI would decrease readability. E.g. the initialisation script would be significantly longer with a CLI. Due to the required compute scale I also expect this to be run by "advanced" users. This reminded me that the training script has to be executed from the accelerate CLI, so I'll add a remark about this in any case.

What do you think @LysandreJik?

Member

@LysandreJik LysandreJik left a comment

This looks great @lvwerra! Thanks for contributing your code, this is cool!

I agree with @lewtun that having the code as a CLI would be better for reusability, and I'd argue it would also be the best for readability: all arguments are centralized, making it easy to see what can be configured and what cannot.

I think people are also used to CLIs and it doesn't deter them from reading the code, so I personally see no drawbacks.
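
To make the "all arguments are centralized" point concrete, here is a minimal sketch that keeps every configurable value of a script in one dataclass and derives the command-line flags from it. It uses only the standard library; the class name, field names, and default values are hypothetical placeholders and are not taken from this PR.

```python
import argparse
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class InitializationArguments:
    """Hypothetical settings for a model-initialization script."""

    config_name: str = "gpt2"  # placeholder default, for illustration only
    tokenizer_name: str = "my-tokenizer"  # placeholder default
    model_name: str = "my-model"  # placeholder default
    push_to_hub: bool = False


def parse_args(argv: Optional[List[str]] = None) -> InitializationArguments:
    """Build one CLI flag per dataclass field and parse ``argv``."""
    parser = argparse.ArgumentParser(description="Initialize a model (sketch).")
    for field in fields(InitializationArguments):
        flag = f"--{field.name}"
        if isinstance(field.default, bool):
            # Boolean fields become switches that default to False.
            parser.add_argument(flag, action="store_true")
        else:
            # Other fields take a value of the same type as their default.
            parser.add_argument(flag, type=type(field.default), default=field.default)
    namespace = parser.parse_args(argv)
    return InitializationArguments(**vars(namespace))
```

Calling `parse_args(["--model_name", "demo", "--push_to_hub"])` returns an `InitializationArguments` instance with those two fields overridden and the rest left at their defaults, so the dataclass doubles as the list of everything that can be configured.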


I see there is an image file in your PR. Could you please host it in a dataset on the hub, either on your profile or on the hf-internal-testing organization?

The issue with adding image files to the repository is that they'll weigh it down forever as they'll be part of its history.

Member

@LysandreJik LysandreJik left a comment

This looks great to me! Thanks for working on that, @lvwerra!

@@ -1,6 +1,6 @@
 # CodeParrot 🦜
 <p align="center">
-<img src="images/code-highlighting-streamlit.png" alt="drawing" width="350"/>
+<img src="https://huggingface.co/datasets/lvwerra/repo-images/raw/main/code-highlighting-streamlit.png" alt="drawing" width="350"/>
Member

Thank you!

args = parser.parse_args()

# Base tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer)
Member

Should this really be AutoTokenizer if you specify that it's BPE-only?

Member

(Not a big problem and I understand if you want to keep it that way)

Member

@lewtun lewtun left a comment

Thanks for making the changes! I left a few minor nits, but otherwise LGTM!

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
lvwerra and others added 2 commits December 2, 2021 10:12
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
@lvwerra lvwerra merged commit 43f953c into master Dec 2, 2021
@lvwerra lvwerra deleted the codeparrot branch December 2, 2021 09:41
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
* add readme skeleton

* update readme

* add initialization script

* add deduplication script

* add codeparrot training script

* add code generation evaluation

* add validation loss script

* add requirements

* update readme

* tweak readme

* make style

* add highlights to readme

* add CLIs to scripts

* add tokenizer training script

* add docstring to constant length dataset

* fix defaults in arguments

* update readme with cli

* move image to hub

* tweaks of readme

* fix cli commands

* add author

* explain env variables

* fix formatting

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* Apply suggestions from code review

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* replace generic with gpt2 tokenizer

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

4 participants