# Add CodeParrot 🦜 codebase #14536 (Merged)

## Commits (26)

- `9019dd0` add readme skeleton (lvwerra)
- `8247ff7` update readme (lvwerra)
- `d90ddcc` add initialization script (lvwerra)
- `557c56a` add deduplication script (lvwerra)
- `e19d401` add codeparrot training script (lvwerra)
- `c79ffea` add code generation evaluation (lvwerra)
- `e2505bb` add validation loss script (lvwerra)
- `82eeb32` add requirements (lvwerra)
- `8ea280b` update readme (lvwerra)
- `da6afb4` tweak readme (lvwerra)
- `a984c28` make style (lvwerra)
- `2354a5e` add highlights to readme (lvwerra)
- `5f53262` add CLIs to scripts (lvwerra)
- `3874cb3` add tokenizer training script (lvwerra)
- `e53af31` add docstring to constant length dataset (lvwerra)
- `9568702` fix defaults in arguments (lvwerra)
- `ce12833` update readme with cli (lvwerra)
- `04e5afd` move image to hub (lvwerra)
- `68cb2dc` tweaks of readme (lvwerra)
- `07b02a6` fix cli commands (lvwerra)
- `7ba87f9` add author (lvwerra)
- `9c7b7a1` explain env variables (lvwerra)
- `27a16df` fix formatting (lvwerra)
- `51e2456` Update examples/research_projects/codeparrot/README.md (lvwerra)
- `b2bc6bb` Apply suggestions from code review (lvwerra)
- `efc26cd` replace generic with gpt2 tokenizer (lvwerra)

## examples/research_projects/codeparrot/README.md (154 additions)

# CodeParrot 🦜

<p align="center">
    <img src="https://huggingface.co/datasets/lvwerra/repo-images/raw/main/code-highlighting-streamlit.png" alt="drawing" width="350"/>
</p>

## What is this about?

This is an open-source effort to train and evaluate code generation models. CodeParrot 🦜 is a GPT-2 model trained from scratch on Python code. The highlights of this repo are:

- initialize and train a GPT-2 language model from scratch for code generation
- clean and deduplicate a large (>100GB) dataset with `datasets`
- train with `accelerate` on multiple GPUs using data parallelism and mixed precision
- continuously push checkpoints to the hub with `huggingface_hub`
- stream the dataset with `datasets` during training to avoid disk bottlenecks
- apply the `code_eval` metric in `datasets` to evaluate on OpenAI's HumanEval benchmark
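
As an example of the streaming highlight above, `datasets` can iterate over the dataset directly from the Hub without downloading it to disk first. A minimal sketch, assuming the raw file text lives in the dataset's `content` column:

```python
from itertools import islice

from datasets import load_dataset

# Stream the cleaned training split: examples are fetched on the fly
# instead of downloading the full dataset up front.
dataset = load_dataset("lvwerra/codeparrot-clean-train", split="train", streaming=True)

# Peek at the first two files.
for example in islice(iter(dataset), 2):
    print(example["content"][:200])
```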

## Installation

To install the dependencies, simply run the following command:

```bash
pip install -r requirements.txt
```

To reproduce the results you can follow the scripts in the following sections. Note that we don't show all possible arguments to the scripts. To get the full list of arguments with descriptions you can run the following command on any script:

```bash
python scripts/some_script.py --help
```

Before you run any of the scripts, make sure you are logged in and can push to the hub:

```bash
huggingface-cli login
```

## Dataset

The source of the dataset is the GitHub dump available on Google's [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). The database was queried for all Python files less than 1MB in size, resulting in a 180GB dataset with over 20M files. The dataset is available on the Hugging Face Hub [here](https://huggingface.co/datasets/transformersbook/codeparrot).

### Preprocessing

The raw dataset contains many duplicates, so it was deduplicated and filtered using the heuristics proposed in the Codex [paper](https://arxiv.org/abs/2107.03374):

- exact deduplication using each file's hash
- filtering files with a maximum line length > 1000
- filtering files with a mean line length > 100
- filtering files with a fraction of alphanumeric characters < 0.25
- filtering files containing the word "auto-generated" or similar in the first 5 lines
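
A minimal sketch of these heuristics, assuming the file text is stored in a `content` field (function names are illustrative; the actual implementation lives in `scripts/preprocessing.py`):

```python
import hashlib


def get_hash(example):
    # Key for exact deduplication: hash of the raw file content.
    return hashlib.md5(example["content"].encode("utf-8")).hexdigest()


def passes_filters(example):
    content = example["content"]
    lines = content.splitlines()
    line_lengths = [len(line) for line in lines] or [0]
    if max(line_lengths) > 1000:  # max line length filter
        return False
    if sum(line_lengths) / len(line_lengths) > 100:  # mean line length filter
        return False
    if sum(c.isalnum() for c in content) / max(len(content), 1) < 0.25:
        return False  # too few alphanumeric characters
    if "auto-generated" in "\n".join(lines[:5]).lower():
        return False  # likely an auto-generated file
    return True
```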

The script to process the full dataset can be found in `scripts/preprocessing.py`. Executing the script on 16 vCPUs takes roughly 3h and removes 70% of the original dataset. The cleaned [train](https://huggingface.co/datasets/lvwerra/codeparrot-clean-train) and [validation](https://huggingface.co/datasets/lvwerra/codeparrot-clean-valid) splits are also available on the Hub if you want to skip this step or use the data for another project.

To execute the preprocessing, run the following command:

```bash
python scripts/preprocessing.py \
--dataset_name lvwerra/codeparrot \
--output_dir codeparrot-clean
```

During preprocessing, the dataset is downloaded and stored locally along with caches of the intermediate computations, so make sure you have enough free disk space before executing it.

## Tokenizer

Before training a new model for code, we create a new tokenizer that tokenizes code efficiently. To train the tokenizer you can run the following command:

```bash
python scripts/bpe_training.py \
--base_tokenizer gpt2 \
--dataset_name lvwerra/codeparrot-clean-train
```

_Note:_ We originally trained the tokenizer on the unprocessed train split of the dataset `transformersbook/codeparrot-train`.

## Training

The models are randomly initialized and trained from scratch. To initialize a new model you can run:

```bash
python scripts/initialize_model.py \
--config_name gpt2-large \
--tokenizer_name lvwerra/codeparrot \
--model_name codeparrot \
--push_to_hub True
```

This will initialize a new model with the architecture and configuration of `gpt2-large` and use the tokenizer to appropriately size the input embeddings. Finally, the initialized model is pushed to the hub.
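
Under the hood, this boils down to loading the `gpt2-large` configuration with the vocabulary size of the new tokenizer and instantiating an untrained model from it. A rough sketch of the idea (the script itself may differ in details):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lvwerra/codeparrot")

# gpt2-large architecture, with input embeddings sized to the new vocabulary.
config = AutoConfig.from_pretrained("gpt2-large", vocab_size=len(tokenizer))
model = AutoModelForCausalLM.from_config(config)  # random weights, no pretraining
```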

Now that the dataset, tokenizer, and model are ready, we can start training. The main training script is built with `accelerate` to scale across a wide range of platforms and infrastructure scales. We train two models with [110M](https://huggingface.co/lvwerra/codeparrot-small/) and [1.5B](https://huggingface.co/lvwerra/codeparrot/) parameters for 25-30B tokens on a 16xA100 (40GB) machine, which takes 1 day and 1 week, respectively.
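
To keep the GPUs busy, the training data is streamed and packed into constant-length sequences (see the "add docstring to constant length dataset" commit). A rough sketch of the packing idea, with hypothetical names; the real class lives in the training script:

```python
import torch
from torch.utils.data import IterableDataset


class PackedSequenceDataset(IterableDataset):
    """Packs streamed, tokenized files into fixed-size chunks so that
    no compute is wasted on padding (illustrative sketch only)."""

    def __init__(self, tokenizer, dataset, seq_length=1024):
        self.tokenizer = tokenizer
        self.dataset = dataset
        self.seq_length = seq_length

    def __iter__(self):
        buffer = []
        for example in self.dataset:
            # Concatenate files, separated by the end-of-sequence token.
            buffer.extend(self.tokenizer(example["content"])["input_ids"])
            buffer.append(self.tokenizer.eos_token_id)
            # Emit full seq_length chunks as soon as the buffer allows.
            while len(buffer) >= self.seq_length:
                yield torch.tensor(buffer[: self.seq_length])
                buffer = buffer[self.seq_length :]
```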

First you need to configure `accelerate` and log in to Weights & Biases:

```bash
accelerate config
wandb login
```

Then to train the large model you can run:

```bash
python scripts/codeparrot_training.py
```

If you want to train the small model instead, you need to make some modifications:

```bash
accelerate launch scripts/codeparrot_training.py \
--model_ckpt lvwerra/codeparrot-small \
--train_batch_size 12 \
--valid_batch_size 12 \
--learning_rate 5e-4 \
--num_warmup_steps 2000 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing False \
--max_train_steps 150000 \
--save_checkpoint_steps 15000
```

As a reminder, you can see the full set of possible options with descriptions (for all scripts) by running:

```bash
python scripts/codeparrot_training.py --help
```

## Evaluation

To evaluate the language modeling loss on the validation set, or on any other dataset, you can use the following command:

```bash
python scripts/validation_loss.py \
--model_ckpt lvwerra/codeparrot \
--dataset_name lvwerra/codeparrot-clean-valid
```
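
Under the hood, the script averages the causal language-modeling loss over batches of the validation stream. A minimal sketch of the core loop (the model computes the shifted-label loss internally when `labels` are passed):

```python
import torch


@torch.no_grad()
def evaluate(model, dataloader, device="cuda"):
    model.eval()
    losses = []
    for batch in dataloader:
        batch = batch.to(device)
        # For a causal LM, the labels are the inputs themselves.
        outputs = model(batch, labels=batch)
        losses.append(outputs.loss.item())
    loss = sum(losses) / len(losses)
    return loss, torch.exp(torch.tensor(loss))  # loss and perplexity
```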

In addition, we evaluate the model on OpenAI's _HumanEval_ benchmark. You can run the evaluation with the following command:

```bash
python scripts/human_eval.py --model_ckpt lvwerra/codeparrot \
--do_sample True \
--temperature 0.2 \
--top_p 0.95 \
--n_samples=200
```
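
The HumanEval score is computed with the `code_eval` metric from `datasets`, which executes each generated completion against the problem's unit tests. Because this runs model-generated code on your machine, it must be enabled explicitly through the `HF_ALLOW_CODE_EVAL` environment variable (see `scripts/arguments.py`). A toy example:

```python
import os

from datasets import load_metric

# Opt in: code_eval executes untrusted, model-generated code.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load_metric("code_eval")

# One problem with two candidate completions; the reference is the test code.
candidates = [["def add(a, b):\n    return a + b", "def add(a, b):\n    return a - b"]]
references = ["assert add(2, 3) == 5"]

pass_at_k, results = code_eval.compute(predictions=candidates, references=references, k=[1, 2])
print(pass_at_k)  # {'pass@1': 0.5, 'pass@2': 1.0}
```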

The results as well as reference values are shown in the following table:

| Model | pass@1 | pass@10 | pass@100 |
|-------|--------|---------|----------|
| CodeParrot 🦜 (110M) | 3.80% | 6.57% | 12.78% |
| CodeParrot 🦜 (1.5B) | 3.58% | 8.03% | 14.96% |
|||||
| Codex (25M) | 3.21% | 7.1% | 12.89% |
| Codex (85M) | 8.22% | 12.81% | 22.40% |
| Codex (300M) | 13.17% | 20.37% | 36.27% |
| Codex (12B) | 28.81% | 46.81% | 72.31% |
|||||
| GPT-neo (125M) | 0.75% | 1.88% | 2.97% |
| GPT-neo (1.5B) | 4.79% | 7.47% | 16.30% |
| GPT-neo (2.7B) | 6.41% | 11.27% | 21.37% |
| GPT-J (6B) | 11.62% | 15.74% | 27.74% |

The numbers were obtained by sampling with `T = [0.2, 0.6, 0.8]` and picking the best value for each metric. Both CodeParrot 🦜 models are still underfitted, and longer training would likely improve their performance.
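
For reference, pass@k is estimated with the unbiased estimator from the Codex paper: generate `n` samples per problem, count the number `c` that pass the tests, and compute `1 - C(n-c, k) / C(n, k)`, averaged over all problems. A numerically stable implementation of the per-problem estimate:

```python
import numpy as np


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for a single problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```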

## Demo

Give the model a shot yourself! There are two demos to interact with the model:

- [Code generation](https://huggingface.co/spaces/lvwerra/codeparrot-generation)
- [Code highlighting](https://huggingface.co/spaces/lvwerra/codeparrot-highlighting)

## Further Resources

A detailed description of the project can be found in the chapter "Training Transformers from Scratch" in the upcoming O'Reilly book [Natural Language Processing with Transformers](https://learning.oreilly.com/library/view/natural-language-processing/9781098103231/).

## examples/research_projects/codeparrot/requirements.txt (7 additions)

transformers==4.12.2
datasets==1.16.0
accelerate==0.5.1
wandb==0.12.0
tensorboard==2.6.0
torch==1.9.0
huggingface-hub==0.0.19

## examples/research_projects/codeparrot/scripts/arguments.py (175 additions)

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrainingArguments:
    """
    Configuration for training model.
    """

    model_ckpt: Optional[str] = field(
        default="lvwerra/codeparrot",
        metadata={"help": "Model name or path of model to be trained."},
    )
    save_dir: Optional[str] = field(
        default="./",
        metadata={"help": "Save dir where the model repo is cloned and model updates are saved to."},
    )
    dataset_name_train: Optional[str] = field(
        default="lvwerra/codeparrot-clean-train", metadata={"help": "Name or path of training dataset."}
    )
    dataset_name_valid: Optional[str] = field(
        default="lvwerra/codeparrot-clean-valid", metadata={"help": "Name or path of validation dataset."}
    )
    train_batch_size: Optional[int] = field(default=2, metadata={"help": "Batch size for training."})
    valid_batch_size: Optional[int] = field(default=2, metadata={"help": "Batch size for evaluation."})
    weight_decay: Optional[float] = field(default=0.1, metadata={"help": "Value of weight decay."})
    shuffle_buffer: Optional[int] = field(
        default=1000, metadata={"help": "Size of buffer used to shuffle streaming dataset."}
    )
    learning_rate: Optional[float] = field(default=2e-4, metadata={"help": "Learning rate for training."})
    lr_scheduler_type: Optional[str] = field(default="cosine", metadata={"help": "Learning rate scheduler type."})
    num_warmup_steps: Optional[int] = field(
        default=750, metadata={"help": "Number of warmup steps in the learning rate schedule."}
    )
    gradient_accumulation_steps: Optional[int] = field(
        default=16, metadata={"help": "Number of gradient accumulation steps."}
    )
    gradient_checkpointing: Optional[bool] = field(
        default=True, metadata={"help": "Use gradient checkpointing to reduce memory footprint."}
    )
    max_train_steps: Optional[int] = field(default=50_000, metadata={"help": "Maximum number of training steps."})
    max_eval_steps: Optional[int] = field(
        default=-1, metadata={"help": "Maximum number of evaluation steps. If -1 the full dataset is evaluated."}
    )
    seq_length: Optional[int] = field(default=1024, metadata={"help": "Sequence lengths used for training."})
    seed: Optional[int] = field(default=1, metadata={"help": "Training seed."})
    save_checkpoint_steps: Optional[int] = field(
        default=1024,
        metadata={"help": "Interval to save checkpoints. Measured as number of forward passes not training steps."},
    )


@dataclass
class EvaluationArguments:
    """
    Configuration for evaluating model.
    """

    model_ckpt: Optional[str] = field(
        default="lvwerra/codeparrot",
        metadata={"help": "Model name or path of model to be evaluated."},
    )
    dataset_name: Optional[str] = field(
        default="lvwerra/codeparrot-clean-valid", metadata={"help": "Name or path of validation dataset."}
    )
    batch_size: Optional[int] = field(default=2, metadata={"help": "Batch size used for evaluation."})
    max_eval_steps: Optional[int] = field(
        default=-1, metadata={"help": "Maximum number of evaluation steps. If -1 the full dataset is evaluated."}
    )
    seq_length: Optional[int] = field(default=1024, metadata={"help": "Length of sequences to be evaluated."})
    seed: Optional[int] = field(default=1, metadata={"help": "Random seed used for evaluation."})


@dataclass
class HumanEvalArguments:
    """
    Configuration for running evaluation on HumanEval dataset.
    """

    model_ckpt: Optional[str] = field(
        default="lvwerra/codeparrot",
        metadata={"help": "Model name or path of model to be evaluated."},
    )
    num_workers: Optional[int] = field(default=None, metadata={"help": "Number of workers used for code evaluation."})
    do_sample: Optional[bool] = field(
        default=True, metadata={"help": "Sample from the language model's output distribution."}
    )
    temperature: Optional[float] = field(default=0.2, metadata={"help": "Sampling temperature used for generation."})
    max_new_tokens: Optional[int] = field(default=256, metadata={"help": "Maximum number of newly generated tokens."})
    top_k: Optional[int] = field(default=0, metadata={"help": "Top-k parameter used for generation."})
    top_p: Optional[float] = field(default=0.95, metadata={"help": "Top-p parameter used for nucleus sampling."})
    batch_size: Optional[int] = field(default=10, metadata={"help": "Number of generations to run in parallel."})
    n_samples: Optional[int] = field(
        default=200, metadata={"help": "Number of completions to generate for each sample."}
    )
    seed: Optional[int] = field(default=1, metadata={"help": "Random seed used for evaluation."})
    output_file: Optional[str] = field(
        default="eval_results.json", metadata={"help": "File to save the evaluation results to."}
    )
    HF_ALLOW_CODE_EVAL: Optional[str] = field(
        default="0", metadata={"help": "Allow `code_eval` to execute Python code on machine"}
    )


@dataclass
class PreprocessingArguments:
    """
    Configuration for preprocessing data.
    """

    num_workers: Optional[int] = field(
        default=None,
        metadata={
            "help": "The number of CPU cores to use for parallel preprocessing. Default uses the maximum available."
        },
    )
    dataset_name: Optional[str] = field(
        default="codeparrot", metadata={"help": "Folder or name of dataset to process."}
    )
    output_dir: Optional[str] = field(
        default="codeparrot-clean", metadata={"help": "Folder to save processed dataset."}
    )
    samples_per_file: Optional[int] = field(
        default=100_000, metadata={"help": "Number of files to save per JSON output file."}
    )
    text_column: Optional[str] = field(default="content", metadata={"help": "Column containing text data to process."})
    line_max: Optional[float] = field(
        default=1000, metadata={"help": "Maximum line length in file, otherwise file is filtered."}
    )
    line_mean: Optional[float] = field(
        default=100, metadata={"help": "Maximum mean line length in file, otherwise file is filtered."}
    )
    alpha_frac: Optional[float] = field(
        default=0.25, metadata={"help": "Minimum fraction of alphanumeric characters in file, otherwise file is filtered."}
    )


@dataclass
class TokenizerTrainingArguments:
    """
    Configuration for tokenizer training.
    """

    base_tokenizer: Optional[str] = field(
        default="gpt2",
        metadata={"help": "Base tokenizer to build new tokenizer from."},
    )
    dataset_name: Optional[str] = field(
        default="transformersbook/codeparrot-train", metadata={"help": "Dataset to train tokenizer on."}
    )
    text_column: Optional[str] = field(default="content", metadata={"help": "Column containing text data to process."})
    vocab_size: Optional[int] = field(default=200000, metadata={"help": "Vocabulary size of the new tokenizer."})
    n_examples: Optional[int] = field(
        default=32768, metadata={"help": "Number of examples to train the tokenizer on."}
    )
    tokenizer_name: Optional[str] = field(default="codeparrot", metadata={"help": "Name of new tokenizer."})
    push_to_hub: Optional[bool] = field(default=True, metadata={"help": "Push saved tokenizer to the hub."})


@dataclass
class InitializationArguments:
    """
    Configuration for initializing new model.
    """

    config_name: Optional[str] = field(
        default="gpt2-large",
        metadata={"help": "Configuration to use for model initialization."},
    )
    tokenizer_name: Optional[str] = field(
        default="lvwerra/codeparrot", metadata={"help": "Tokenizer attached to model."}
    )
    model_name: Optional[str] = field(default="codeparrot", metadata={"help": "Name of the created model."})
    push_to_hub: Optional[bool] = field(default=True, metadata={"help": "Push saved model to the hub."})
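
Each script exposes one of these dataclasses on the command line through `HfArgumentParser`, so every field above becomes a `--flag` with its default value and help text (see `bpe_training.py` below for the pattern). A minimal usage sketch:

```python
from arguments import TrainingArguments
from transformers import HfArgumentParser

parser = HfArgumentParser(TrainingArguments)
args = parser.parse_args()  # e.g. python some_script.py --train_batch_size 12
print(args.model_ckpt, args.train_batch_size)
```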

## examples/research_projects/codeparrot/scripts/bpe_training.py (32 additions)

@@ -0,0 +1,32 @@ | ||
from datasets import load_dataset | ||
from tqdm import tqdm | ||
|
||
from arguments import TokenizerTrainingArguments | ||
from transformers import AutoTokenizer, HfArgumentParser | ||
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode | ||
|
||
|
||
# Iterator for Training | ||
def batch_iterator(batch_size=10): | ||
for _ in tqdm(range(0, args.n_examples, batch_size)): | ||
yield [next(iter_dataset)[args.text_column] for _ in range(batch_size)] | ||
|
||
|
||
# Configuration | ||
parser = HfArgumentParser(TokenizerTrainingArguments) | ||
args = parser.parse_args() | ||
|
||
# Base tokenizer | ||
tokenizer = AutoTokenizer.from_pretrained(args.base_tokenizer) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should this really be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Not a big problem and I understand if you want to keep it that way) |
||
base_vocab = list(bytes_to_unicode().values()) | ||
|
||
# Load dataset | ||
dataset = load_dataset(args.dataset_name, split="train", streaming=True) | ||
iter_dataset = iter(dataset) | ||
|
||
|
||
# Training and saving | ||
new_tokenizer = tokenizer.train_new_from_iterator( | ||
batch_iterator(), vocab_size=args.vocab_size, initial_alphabet=base_vocab | ||
) | ||
new_tokenizer.save_pretrained(args.tokenizer_name, push_to_hub=args.push_to_hub) |