Add CodeParrot 🦜 codebase #14536
Conversation
Thanks a lot for adding this to the examples! Overall it looks good, but I think it's lacking some details that could aid with reproducibility:
- For each step (preprocessing, training, evaluation, etc.), I think it would be useful to have a CLI with some default values. Currently, there are some hard-coded variables / configs that require the end user to read the source code to understand what's going on.
- Add docstrings to all the classes and functions.
A nice template to look at is the DistilBERT project.
PS I'm not sure what the conventions are for new research projects, so feel free to ignore these comments if the current PR is OK for @LysandreJik
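To make the CLI suggestion concrete, a minimal `argparse` setup with defaults could look like the sketch below. All flag names and default values here are illustrative only, not the ones the scripts actually use:

```python
import argparse

def parse_args(argv=None):
    """Build a small CLI for one pipeline step, with sensible defaults.

    Flag names and defaults are hypothetical, for illustration only.
    """
    parser = argparse.ArgumentParser(description="Preprocess the CodeParrot dataset.")
    parser.add_argument("--dataset_name", default="transformersbook/codeparrot",
                        help="Name of the dataset to preprocess.")
    parser.add_argument("--output_dir", default="codeparrot-clean",
                        help="Directory for the deduplicated output.")
    parser.add_argument("--line_max", type=int, default=1000,
                        help="Drop files whose longest line exceeds this length.")
    return parser.parse_args(argv)

# With no arguments the defaults apply; users can override any flag.
args = parse_args([])
print(args.dataset_name, args.line_max)
```

Centralizing arguments this way also doubles as lightweight documentation: `--help` lists every configurable value without the user reading the source.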
Two resolved (outdated) review threads on examples/research_projects/codeparrot/scripts/codeparrot_training.py
The reason I went without a CLI is that most scripts are quite lean, and a CLI would decrease readability. E.g. the initialisation script would be significantly longer with a CLI. Given the required compute scale, I also expect this to be run by "advanced" users. This reminded me that the training script has to be executed via the accelerate CLI, so I'll add a remark about this in any case. What do you think @LysandreJik?
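For context, the no-CLI style under discussion looks roughly like this (all values here are hypothetical, not the actual script's settings): every hyperparameter lives in one dict at the top of the script, so the script stays short but the user must edit source to change anything.

```python
# Hypothetical sketch of the hard-coded config style discussed above:
# all hyperparameters live in a plain dict at the top of the training script.
config = {
    "train_batch_size": 2,
    "valid_batch_size": 2,
    "learning_rate": 5e-4,
    "max_train_steps": 50_000,
    "gradient_accumulation_steps": 16,
}

# The script then reads values directly, e.g. config["learning_rate"],
# instead of parsing them from the command line.
print(config["learning_rate"])
```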
This looks great @lvwerra! Thanks for contributing your code, this is cool!
I agree with @lewtun that having the code as a CLI would be better for reusability, and I'd argue it would also be the best for readability: all arguments are centralized, making it easy to see what can be configured and what cannot.
I think people are also used to CLIs and it doesn't deter them from reading the code, so I personally see no drawbacks.
I see there is an image file in your PR. Could you please host it in a dataset on the Hub, either on your profile or in the hf-internal-testing organization?
The issue with adding image files to the repository is that they'll weigh it down forever as they'll be part of its history.
Resolved (outdated) review thread on examples/research_projects/codeparrot/scripts/codeparrot_training.py
This looks great to me! Thanks for working on that, @lvwerra!
```diff
@@ -1,6 +1,6 @@
 # CodeParrot 🦜
 <p align="center">
-    <img src="images/code-highlighting-streamlit.png" alt="drawing" width="350"/>
+    <img src="https://huggingface.co/datasets/lvwerra/repo-images/raw/main/code-highlighting-streamlit.png" alt="drawing" width="350"/>
```
Thank you!
```python
args = parser.parse_args()

# Base tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer)
```
Should this really be `AutoTokenizer` if you specify that it's BPE-only?
(Not a big problem and I understand if you want to keep it that way)
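A sketch of the suggested change (assuming `transformers` is installed; actually loading the checkpoint requires network access): pin the concrete GPT-2 BPE tokenizer class instead of dispatching through `AutoTokenizer`. The helper name and default are hypothetical.

```python
# Sketch: since the pipeline assumes a BPE tokenizer, load GPT-2's BPE
# tokenizer class directly rather than dispatching through AutoTokenizer.
from transformers import GPT2TokenizerFast

def load_base_tokenizer(name: str = "gpt2"):
    """Load the BPE base tokenizer (defaults to the GPT-2 vocabulary)."""
    return GPT2TokenizerFast.from_pretrained(name)
```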
Thanks for making the changes! I left a few minor nits, but otherwise LGTM!
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
* add readme skeleton
* update readme
* add initialization script
* add deduplication script
* add codeparrot training script
* add code generation evaluation
* add validation loss script
* add requirements
* update readme
* tweak readme
* make style
* add highlights to readme
* add CLIs to scripts
* add tokenizer training script
* add docstring to constant length dataset
* fix defaults in arguments
* update readme with cli
* move image to hub
* tweaks of readme
* fix cli commands
* add author
* explain env variables
* fix formatting
* Update examples/research_projects/codeparrot/README.md (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* Apply suggestions from code review (Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>)
* replace generic with gpt2 tokenizer
Add CodeParrot 🦜 codebase
This PR adds the CodeParrot 🦜 codebase to `examples/research_projects/`. The folder `scripts/` includes files for each step of the pipeline (preprocessing, tokenizer training, model initialization, training, and evaluation). In addition, the README gives an overview and highlights the results, and the requirements file pins the dependencies.
cc @LysandreJik @thomwolf